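This post lists the latest papers retrieved from arXiv.org on 2025-12-18. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by email on a schedule, leave your email address in the comments.
Note: the daily paper data comes from arXiv.org and is refreshed automatically every day at around 12:00.
Reminder: if you would like to receive the daily paper data by email, leave your email address in the comments.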
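Contents
Overview (2025-12-18)
464 papers were updated today, including:
- Natural Language Processing: 60 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 142 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 109 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 139 papers (Machine Learning (cs.LG))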
Natural Language Processing
[NLP-0] Predictive Concept Decoders: Training Scalable End-to-End Interpretability Assistants
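[Quick read]: This paper addresses the difficulty of interpreting the internal activations of neural networks. Existing scalable approaches rely on hand-designed agents that form and test hypotheses about how internal activations relate to external behavior, which limits how far they scale. The key idea is to turn interpretability into an end-to-end training objective: a Predictive Concept Decoder (PCD) is trained to compress activations into a sparse list of concepts and to use that list to answer natural-language questions, so that it accurately predicts model behavior. The architecture consists of an encoder (activations to concepts) and a decoder (concepts to answers), with a communication bottleneck between them that forces the extraction of key semantic information. As a result, the auto-interp score of the concepts improves with more data, and PCDs perform well on downstream tasks such as detecting jailbreaks, secret hints, and implanted latent concepts.
Link: https://arxiv.org/abs/2512.15712
Authors: Vincent Huang, Dami Choi, Daniel D. Johnson, Sarah Schwettmann, Jacob Steinhardt
Affiliations: Transluce
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 28 pages, 12 figures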
Abstract:Interpreting the internal activations of neural networks can produce more faithful explanations of their behavior, but is difficult due to the complex structure of activation space. Existing approaches to scalable interpretability use hand-designed agents that make and test hypotheses about how internal activations relate to external behavior. We propose to instead turn this task into an end-to-end training objective, by training interpretability assistants to accurately predict model behavior from activations through a communication bottleneck. Specifically, an encoder compresses activations to a sparse list of concepts, and a decoder reads this list and answers a natural language question about the model. We show how to pretrain this assistant on large unstructured data, then finetune it to answer questions. The resulting architecture, which we call a Predictive Concept Decoder, enjoys favorable scaling properties: the auto-interp score of the bottleneck concepts improves with data, as does the performance on downstream applications. Specifically, PCDs can detect jailbreaks, secret hints, and implanted latent concepts, and are able to accurately surface latent user attributes.
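The sparse "list of concepts" can be pictured with a minimal sketch, assuming the bottleneck simply keeps the k highest-scoring concepts. The abstract does not say how sparsity is enforced, so the top-k rule, the function name `top_k_concepts`, and the scores below are all hypothetical illustrations rather than the authors' method.

```python
def top_k_concepts(scores, k):
    """Keep the k largest concept scores and drop the rest.

    Returns a sparse list of (concept_index, score) pairs, highest score first.
    """
    if k <= 0:
        return []
    # sorted() is stable, so ties keep their original index order.
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical concept scores an encoder might emit for one activation vector.
scores = [0.1, 2.3, -0.4, 1.7, 0.0]
print(top_k_concepts(scores, k=2))
```

In this picture, a decoder that sees only the returned pairs has to answer questions from a handful of concepts, which is the pressure a communication bottleneck is meant to create.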
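[NLP-1] Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers
[Quick read]: This paper addresses the difficulty of interpreting the activations of large language models (LLMs), for which most existing techniques rely on complex, specialized methods. It builds on LatentQA, a simpler approach from recent work in which an LLM is trained to take LLM activations directly as input and answer arbitrary natural-language questions about them, and studies it from a generalist perspective. Trained on diverse data (such as classification tasks and a self-supervised context prediction task), the resulting Activation Oracles (AOs) generalize strongly to downstream tasks far outside the training distribution. They can even recover fine-grained information that never appears in the input text, such as biographical knowledge or malign propensities introduced by fine-tuning. This suggests that diversified training to answer natural-language queries gives a model a general ability to verbalize information about activations.
Link: https://arxiv.org/abs/2512.15674
Authors: Adam Karvonen, James Chua, Clément Dumas, Kit Fraser-Taliente, Subhash Kantamneni, Julian Minder, Euan Ong, Arnab Sen Sharma, Daniel Wen, Owain Evans, Samuel Marks
Affiliations: MATS; Truthful AI; EPFL; ENS Paris-Saclay; Northeastern University; Anthropic
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 36 pages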
Abstract:Large language model (LLM) activations are notoriously difficult to understand, with most existing techniques using complex, specialized methods for interpreting them. Recent work has proposed a simpler approach known as LatentQA: training LLMs to directly accept LLM activations as inputs and answer arbitrary questions about them in natural language. However, prior work has focused on narrow task settings for both training and evaluation. In this paper, we instead take a generalist perspective. We evaluate LatentQA-trained models, which we call Activation Oracles (AOs), in far out-of-distribution settings and examine how performance scales with training data diversity. We find that AOs can recover information fine-tuned into a model (e.g., biographical knowledge or malign propensities) that does not appear in the input text, despite never being trained with activations from a fine-tuned model. Our main evaluations are four downstream tasks where we can compare to prior white- and black-box techniques. We find that even narrowly-trained LatentQA models can generalize well, and that adding additional training datasets (such as classification tasks and a self-supervised context prediction task) yields consistent further improvements. Overall, our best AOs match or exceed prior white-box baselines on all four tasks and are the best method on 3 out of 4. These results suggest that diversified training to answer natural-language queries imparts a general capability to verbalize information about LLM activations.
[NLP-2] Explaining the Reasoning of Large Language Models Using Attribution Graphs
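[Quick read]: This paper addresses the opacity of reasoning in large language models (LLMs), and in particular a limitation of existing context attribution methods for autoregressive models: they relate generated tokens directly to the prompt and so discard the influence that generation steps have on one another (inter-generational influence). The authors propose the Context Attribution via Graph Explanations (CAGE) framework. Its core is an attribution graph, a directed graph that quantifies how each generation step is influenced by the prompt and by all prior generations. The graph is built to preserve two properties, causality and row stochasticity, so that context attributions can be computed by marginalizing intermediate contributions along paths in the graph. Across multiple models, datasets, and metrics, this improves attribution faithfulness, with average gains of up to 40%.
Link: https://arxiv.org/abs/2512.15663
Authors: Chase Walker, Rickard Ewetz
Affiliations: University of Florida
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: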
Abstract:Large language models (LLMs) exhibit remarkable capabilities, yet their reasoning remains opaque, raising safety and trust concerns. Attribution methods, which assign credit to input features, have proven effective for explaining the decision making of computer vision models. From these, context attributions have emerged as a promising approach for explaining the behavior of autoregressive LLMs. However, current context attributions produce incomplete explanations by directly relating generated tokens to the prompt, discarding inter-generational influence in the process. To overcome these shortcomings, we introduce the Context Attribution via Graph Explanations (CAGE) framework. CAGE introduces an attribution graph: a directed graph that quantifies how each generation is influenced by both the prompt and all prior generations. The graph is constructed to preserve two properties-causality and row stochasticity. The attribution graph allows context attributions to be computed by marginalizing intermediate contributions along paths in the graph. Across multiple models, datasets, metrics, and methods, CAGE improves context attribution faithfulness, achieving average gains of up to 40%.
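The idea of marginalizing along paths can be illustrated with a small sketch. It assumes the graph is given as two weight tables, direct prompt-to-generation weights and direct generation-to-generation weights, with each row summing to 1. The function name `prompt_attributions`, the table layout, and the toy numbers are hypothetical; how CAGE actually builds its graph is not described in the abstract.

```python
def prompt_attributions(direct_prompt, direct_generated):
    """Fold influence through earlier generations back onto the prompt.

    direct_prompt[t][p]: direct weight of prompt token p on generated token t.
    direct_generated[t][s]: direct weight of generated token s on token t;
    only entries with s < t are read (causality).
    """
    totals = []
    for t, row in enumerate(direct_prompt):
        total = list(row)
        for s in range(t):  # only earlier generations can contribute
            weight = direct_generated[t][s]
            for p, value in enumerate(totals[s]):
                total[p] += weight * value
        totals.append(total)
    return totals


# Toy graph: 2 prompt tokens, 3 generated tokens; every row sums to 1.
direct_prompt = [[0.7, 0.3], [0.2, 0.3], [0.1, 0.1]]
direct_generated = [[0.0, 0.0, 0.0], [0.5, 0.0, 0.0], [0.3, 0.5, 0.0]]
for row in prompt_attributions(direct_prompt, direct_generated):
    print([round(v, 3) for v in row])
```

Because each input row is a probability distribution and influence only flows forward, each resulting prompt attribution also sums to 1: weight that a direct prompt-only attribution would leave on earlier generations is redistributed onto the prompt tokens.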
[NLP-3] PPSEBM: An Energy-Based Model with Progressive Parameter Selection for Continual Learning
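[Quick read]: This paper addresses catastrophic forgetting in continual learning, where performance on earlier tasks degrades markedly as new tasks are learned. The proposed PPSEBM framework combines an Energy-Based Model (EBM) with Progressive Parameter Selection (PPS): PPS allocates distinct, task-specific parameters to each new task, while the EBM generates representative pseudo-samples of earlier tasks. These pseudo-samples guide the parameter selection process, which helps the model retain past knowledge while adapting to new tasks.
Link: https://arxiv.org/abs/2512.15658
Authors: Xiaodi Li, Dingcheng Li, Rujun Gao, Mahmoud Zamani, Feng Mi, Latifur Khan
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, 3 figures, 2025 IEEE International Conference on Big Data (BigData)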
Abstract:Continual learning remains a fundamental challenge in machine learning, requiring models to learn from a stream of tasks without forgetting previously acquired knowledge. A major obstacle in this setting is catastrophic forgetting, where performance on earlier tasks degrades as new tasks are learned. In this paper, we introduce PPSEBM, a novel framework that integrates an Energy-Based Model (EBM) with Progressive Parameter Selection (PPS) to effectively address catastrophic forgetting in continual learning for natural language processing tasks. In PPSEBM, progressive parameter selection allocates distinct, task-specific parameters for each new task, while the EBM generates representative pseudo-samples from prior tasks. These generated samples actively inform and guide the parameter selection process, enhancing the model’s ability to retain past knowledge while adapting to new tasks. Experimental results on diverse NLP benchmarks demonstrate that PPSEBM outperforms state-of-the-art continual learning methods, offering a promising and robust solution to mitigate catastrophic forgetting.
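[NLP-4] Characterizing Mamba's Selective Memory using Auto-Encoders AACL 2025
[Quick read]: This paper addresses information loss in state space models (SSMs) used for language modeling, which follows from their fixed memory usage, and specifically the lack of a systematic picture of which kinds of information SSM language models (LMs) tend to forget. The approach is to train an auto-encoder that reconstructs the input sequence from the SSM's hidden state and to use the reconstruction error as a measure of information loss. This identifies the token types (math-related tokens, mentions of organization entities, and dialects other than Standard American English) and sequence patterns that are most often forgotten. Experiments show that the forgotten tokens tend to be less frequent in the pretraining data, which gives a clear direction for improving information retention in SSMs.
Link: https://arxiv.org/abs/2512.15653
Authors: Tamanna Hossain, Robert L. Logan IV, Ganesh Jagadeesan, Sameer Singh, Joel Tetreault, Alejandro Jaimes
Affiliations: University of California, Irvine; Dataminr Inc.
Categories: Computation and Language (cs.CL)
Comments: AACL 2025. Oral Presentation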
Abstract:State space models (SSMs) are a promising alternative to transformers for language modeling because they use fixed memory during inference. However, this fixed memory usage requires some information loss in the hidden state when processing long sequences. While prior work has studied the sequence length at which this information loss occurs, it does not characterize the types of information SSM language models (LMs) tend to forget. In this paper, we address this knowledge gap by identifying the types of tokens (e.g., parts of speech, named entities) and sequences (e.g., code, math problems) that are more frequently forgotten by SSM LMs. We achieve this by training an auto-encoder to reconstruct sequences from the SSM’s hidden state, and measure information loss by comparing inputs with their reconstructions. We perform experiments using the Mamba family of SSM LMs (130M–1.4B) on sequences ranging from 4–256 tokens. Our results show significantly higher rates of information loss on math-related tokens (e.g., numbers, variables), mentions of organization entities, and alternative dialects to Standard American English. We then examine the frequency that these tokens appear in Mamba’s pretraining data and find that less prevalent tokens tend to be the ones Mamba is most likely to forget. By identifying these patterns, our work provides clear direction for future research to develop methods that better control Mamba’s ability to retain important information.
[NLP-5] VTCBench: Can Vision-Language Models Understand Long Context with Vision-Text Compression?
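[Quick read]: This paper addresses the computational and memory overhead of extending the context window of large language models (LLMs), and the scalability bottleneck it creates. Vision-text compression (VTC) converts long text into dense 2D visual representations and achieves token compression ratios of 3x-20x, but the effect of this high information density on the long-context capabilities of vision-language models (VLMs) is unclear. The paper introduces VTCBench, the first benchmark for VTC, covering three long-context settings: VTC-Retrieval (retrieving and aggregating information), VTC-Reasoning (inferring latent associations), and VTC-Memory (question answering over long-term dialogue memory). It adds VTCBench-Wild to simulate diverse inputs and evaluates leading open-source and proprietary VLMs. The results show that most VLMs decode textual information well (e.g., OCR) but fail to capture long-range associations and dependencies in VTC-compressed input, which provides insight and an evaluation standard for designing more efficient and scalable VLMs.
Link: https://arxiv.org/abs/2512.15649
Authors: Hongbo Zhao, Meng Wang, Fei Zhu, Wenzhuo Liu, Bolin Ni, Fanhu Zeng, Gaofeng Meng, Zhaoxiang Zhang
Affiliations: Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Centre for Artificial Intelligence and Robotics, Hong Kong Institute of Science & Innovation, CAS; Tencent Hunyuan Team
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: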
Abstract:The computational and memory overheads associated with expanding the context window of LLMs severely limit their scalability. A noteworthy solution is vision-text compression (VTC), exemplified by frameworks like DeepSeek-OCR and Glyph, which convert long texts into dense 2D visual representations, thereby achieving token compression ratios of 3x-20x. However, the impact of this high information density on the core long-context capabilities of vision-language models (VLMs) remains under-investigated. To address this gap, we introduce the first benchmark for VTC and systematically assess the performance of VLMs across three long-context understanding settings: VTC-Retrieval, which evaluates the model’s ability to retrieve and aggregate information; VTC-Reasoning, which requires models to infer latent associations to locate facts with minimal lexical overlap; and VTC-Memory, which measures comprehensive question answering within long-term dialogue memory. Furthermore, we establish the VTCBench-Wild to simulate diverse input [...]. We comprehensively evaluate leading open-source and proprietary models on our benchmarks. The results indicate that, despite being able to decode textual information (e.g., OCR) well, most VLMs exhibit a surprisingly poor long-context understanding ability with VTC-compressed information, failing to capture long associations or dependencies in the [...]. [...] study provides a deep understanding of VTC and serves as a foundation for designing more efficient and scalable VLMs.
[NLP-6] How Much is Too Much? Exploring LoRA Rank Trade-offs for Retaining Knowledge and Domain Robustness AACL
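[Quick read]: This paper addresses open questions about parameter-efficient fine-tuning (PEFT): how configuration choices affect downstream question answering (QA), how well the resulting models generalize, and how they compare with full supervised fine-tuning (SFT). The authors run a systematic rank sweep across multiple reasoning and recall datasets to quantify the trade-off between SFT and PEFT (with LoRA as the example), and compare in-domain and out-of-domain adaptation to analyze generalization behavior and task-specific forgetting. They find that LoRA at specific rank values matches and in some cases exceeds SFT, particularly on reasoning tasks. An analysis of spectral features and layer-wise attention structures gives further insight into representational drift and changes in attention patterns.
Link: https://arxiv.org/abs/2512.15634
Authors: Darshita Rathore, Vineet Kumar, Chetna Bansal, Anindya Moitra
Affiliations: PayPal Artificial Intelligence; PayPal, Bengaluru, India
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at AACL IJCNLP 2025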
Abstract:Large language models are increasingly adapted to downstream tasks through fine-tuning. Full supervised fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), are two dominant approaches. While PEFT methods are widely used for their computational efficiency, the implications of their configurations (e.g., rank) remain under-explored in downstream QA tasks and generalisation. In this work, we perform a comprehensive evaluation across multiple reasoning and recall datasets, conducting a rank sweep to quantify the trade-off between SFT and PEFT. We also compare the accuracy of PEFT and SFT models across in-domain and out-of-domain adaptation, highlighting distinct generalisation behaviour and task-specific forgetting. We demonstrate that LoRA achieves competitive and in some cases superior performance compared to SFT, particularly on reasoning tasks at specific rank values. Additionally, we analyze the internal representations via spectral features and layer-wise attention structures, offering insights into representational drift and structural changes in attention patterns.
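To make the role of rank concrete, the sketch below counts the trainable parameters LoRA adds to one weight matrix: two low-rank factors of shape `rank x d_in` and `d_out x rank`. The 4096 x 4096 projection size and the rank values are hypothetical and are not taken from the paper's experiments.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


d_in = d_out = 4096  # hypothetical projection size
full = d_in * d_out  # parameters updated when the full matrix is fine-tuned
for rank in (4, 16, 64):
    params = lora_trainable_params(d_in, d_out, rank)
    print(f"rank={rank}: {params} trainable parameters ({params / full:.2%} of the full matrix)")
```

The count grows linearly with rank, so a rank sweep trades adapter capacity against the number of trainable parameters in a predictable way.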
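[NLP-7] Evaluating Metrics for Safety with LLM-as-Judges
[Quick read]: This paper asks how large language models (LLMs) can be introduced safely and reliably into safety-critical information flows, replacing human roles that currently act as bottlenecks, given that LLMs make mistakes. It argues for focusing on the evidence produced by LLM-as-Judges (LaJ) evaluators: adopting a basket of weighted metrics to lower the risk of errors within an evaluation, using context sensitivity to define error severity, and designing confidence thresholds that trigger human review of critical judgments when concordance across evaluators is low.
Link: https://arxiv.org/abs/2512.15617
Authors: Kester Clegg, Richard Hawkins, Ibrahim Habli, Tom Lawton
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: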
Abstract:LLMs (Large Language Models) are increasingly used in text processing pipelines to intelligently respond to a variety of inputs and generation tasks. This raises the possibility of replacing human roles that bottleneck existing information flows, either due to insufficient staff or process complexity. However, LLMs make mistakes and some processing roles are safety critical. For example, triaging post-operative care to patients based on hospital referral letters, or updating site access schedules in nuclear facilities for work crews. If we want to introduce LLMs into critical information flows that were previously performed by humans, how can we make them safe and reliable? Rather than make performative claims about augmented generation frameworks or graph-based techniques, this paper argues that the safety argument should focus on the type of evidence we get from evaluation points in LLM processes, particularly in frameworks that employ LLM-as-Judges (LaJ) evaluators. This paper argues that although we cannot get deterministic evaluations from many natural language processing tasks, by adopting a basket of weighted metrics it may be possible to lower the risk of errors within an evaluation, use context sensitivity to define error severity and design confidence thresholds that trigger human review of critical LaJ judgments when concordance across evaluators is low.
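One way to picture a weighted metric basket combined with a concordance threshold is sketched below. The paper gives no formulas in its abstract, so the weighted average, the majority-share definition of concordance, the metric names, and the 0.8 threshold are all assumptions made for illustration.

```python
from collections import Counter


def weighted_score(metrics, weights):
    """Weighted average of metric values in [0, 1]; every weight key must be in metrics."""
    total_weight = sum(weights.values())
    return sum(metrics[name] * weights[name] for name in weights) / total_weight


def needs_human_review(verdicts, min_concordance):
    """True when the share of judges backing the majority verdict is too low."""
    majority_count = Counter(verdicts).most_common(1)[0][1]
    return majority_count / len(verdicts) < min_concordance


# Hypothetical metric values and weights for one judged output.
score = weighted_score({"faithfulness": 0.9, "completeness": 0.6},
                       {"faithfulness": 3, "completeness": 1})
print(round(score, 3))
# Hypothetical verdicts from three LLM judges.
print(needs_human_review(["safe", "safe", "unsafe"], min_concordance=0.8))
```

A deployment would also need the context-dependent severity weighting the abstract mentions, which this sketch leaves out.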
[NLP-8] You Never Know a Person, You Only Know Their Defenses: Detecting Levels of Psychological Defense Mechanisms in Supportive Conversations
[Quick read]: This paper addresses the fact that psychological defenses are hard to measure reliably in clinical dialogues, and in particular how to label and analyze the defense level that help seekers use in conversation. The authors build PsyDefConv, a dialogue corpus, and DMRS Co-Pilot, a four-stage automated pipeline that provides evidence-based pre-annotations. The co-pilot reduced average annotation time by 22.4% while remaining clinically plausible, with expert ratings averaging 4.40 or higher on a seven-point scale. Together they provide a reproducible data and tooling basis for studying defensive functioning in language.
Link: https://arxiv.org/abs/2512.15601
Authors: Hongbin Na, Zimu Wang, Zhaoming Chen, Peilin Zhou, Yining Hua, Grace Ziqi Zhou, Haiyang Zhang, Tao Shen, Wei Wang, John Torous, Shaoxiong Ji, Ling Chen
Affiliations: University of Technology Sydney; Xi’an Jiaotong-Liverpool University; University of Utah; The Hong Kong University of Science and Technology (Guangzhou); Harvard University; The University of Sydney; ELLIS Institute Finland; University of Turku
Categories: Computation and Language (cs.CL)
Comments: Under Review
Abstract:Psychological defenses are strategies, often automatic, that people use to manage distress. Rigid or overuse of defenses is negatively linked to mental health and shapes what speakers disclose and how they accept or resist help. However, defenses are complex and difficult to reliably measure, particularly in clinical dialogues. We introduce PsyDefConv, a dialogue corpus with help seeker utterances labeled for defense level, and DMRS Co-Pilot, a four-stage pipeline that provides evidence-based pre-annotations. The corpus contains 200 dialogues and 4709 utterances, including 2336 help seeker turns, with labeling and Cohen’s kappa 0.639. In a counterbalanced study, the co-pilot reduced average annotation time by 22.4%. In expert review, it averaged 4.62 for evidence, 4.44 for clinical plausibility, and 4.40 for insight on a seven-point scale. Benchmarks with strong language models in zero-shot and fine-tuning settings demonstrate clear headroom, with the best macro F1-score around 30% and a tendency to overpredict mature defenses. Corpus analyses confirm that mature defenses are most common and reveal emotion-specific deviations. We will release the corpus, annotations, code, and prompts to support research on defensive functioning in language.
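The abstract reports agreement as Cohen's kappa (0.639). As a reminder of what that statistic measures, here is a small sketch of the standard two-annotator formula, kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is agreement expected by chance. The label lists are made up and do not come from PsyDefConv.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label independently.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical defense-level labels from two annotators on four utterances.
annotator_1 = ["mature", "mature", "immature", "immature"]
annotator_2 = ["mature", "immature", "immature", "immature"]
print(cohens_kappa(annotator_1, annotator_2))
```

The function divides by zero when chance agreement is exactly 1 (both annotators always use one identical label), so a real implementation would need to handle that case.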
[NLP-9] Bolmo: Byteifying the Next Generation of Language Models
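[Quick read]: This paper addresses two limitations of subword-level language models: insufficient character-level understanding and efficiency constraints caused by a fixed subword vocabulary. It proposes Bolmo, a family of byte-level language models obtained by "byteifying" existing subword-level models. The architecture is designed for byteification: it resolves a mismatch in expressivity between earlier byte-level architectures and subword-level models, which enables an exact distillation objective, so the conversion needs less than 1% of a typical pretraining token budget. Bolmo substantially outperforms prior byte-level models of comparable size, outperforms its source subword models on character understanding and in some cases coding, and comes close to them on other tasks. It can also reach inference speeds competitive with subword models, which makes byte-level models a practical choice across a wide range of use cases.
Link: https://arxiv.org/abs/2512.15586
Authors: Benjamin Minixhofer, Tyler Murray, Tomasz Limisiewicz, Anna Korhonen, Luke Zettlemoyer, Noah A. Smith, Edoardo M. Ponti, Luca Soldaini, Valentin Hofmann
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: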
Abstract:We introduce Bolmo, the first family of competitive fully open byte-level language models (LMs) at the 1B and 7B parameter scales. In contrast to prior research on byte-level LMs, which focuses predominantly on training from scratch, we train Bolmo by byteifying existing subword-level LMs. Byteification enables overcoming the limitations of subword tokenization - such as insufficient character understanding and efficiency constraints due to the fixed subword vocabulary - while performing at the level of leading subword-level LMs. Bolmo is specifically designed for byteification: our architecture resolves a mismatch between the expressivity of prior byte-level architectures and subword-level LMs, which makes it possible to employ an effective exact distillation objective between Bolmo and the source subword model. This allows for converting a subword-level LM to a byte-level LM by investing less than 1% of a typical pretraining token budget. Bolmo substantially outperforms all prior byte-level LMs of comparable size, and outperforms the source subword-level LMs on character understanding and, in some cases, coding, while coming close to matching the original LMs’ performance on other tasks. Furthermore, we show that Bolmo can achieve inference speeds competitive with subword-level LMs by training with higher token compression ratios, and can be cheaply and effectively post-trained by leveraging the existing ecosystem around the source subword-level LM. Our results finally make byte-level LMs a practical choice competitive with subword-level LMs across a wide set of use cases.
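For readers unfamiliar with byte-level modeling, the sketch below shows what a byte-level input looks like: text is encoded as UTF-8 and each byte value (0-255) serves as a token id, so no learned vocabulary is needed. This illustrates byte-level input in general and says nothing about Bolmo's own architecture. The helper names are hypothetical.

```python
def to_byte_ids(text):
    """Encode text as a list of UTF-8 byte values, each in the range 0-255."""
    return list(text.encode("utf-8"))


def from_byte_ids(ids):
    """Invert to_byte_ids; raises UnicodeDecodeError on invalid byte sequences."""
    return bytes(ids).decode("utf-8")


text = "naïve 模型"
ids = to_byte_ids(text)
print(len(text), "characters ->", len(ids), "byte tokens")
print(from_byte_ids(ids) == text)
```

Non-ASCII characters expand to several bytes each, which is one reason byte-level sequences are longer than subword sequences and why compression ratios matter for inference speed.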
[NLP-10] An Empirical Study on Chinese Character Decomposition in Multiword Expression-Aware Neural Machine Translation
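[Quick read]: This paper addresses the translation challenges posed by multi-word expressions (MWEs) in ideographic languages such as Chinese, including ambiguity, idiomaticity, infrequent usage, and a wide range of variants, all of which limit neural machine translation (NMT) performance. The key is a systematic study of Chinese character decomposition: breaking characters into more basic components to improve how the model represents the original meaning of words and characters, and thereby how it recognizes and translates MWEs. This addresses the fact that conventional sub-word modeling (such as BPE) cannot be applied directly to ideographic scripts, and offers a finer-grained semantic representation for NMT in Chinese and similar languages.
Link: https://arxiv.org/abs/2512.15556
Authors: Lifeng Han, Gareth J. F. Jones, Alan F. Smeaton
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: capstone work, technical report, 27 pages, extraction from PhD thesis this https URL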
Abstract:Word meaning, representation, and interpretation play fundamental roles in natural language understanding (NLU), natural language processing (NLP), and natural language generation (NLG) tasks. Many of the inherent difficulties in these tasks stem from Multi-word Expressions (MWEs), which complicate the tasks by introducing ambiguity, idiomatic expressions, infrequent usage, and a wide range of variations. Significant effort and substantial progress have been made in addressing the challenging nature of MWEs in Western languages, particularly English. This progress is attributed in part to the well-established research communities and the abundant availability of computational resources. However, the same level of progress is not true for language families such as Chinese and closely related Asian languages, which continue to lag behind in this regard. While sub-word modelling has been successfully applied to many Western languages to address rare words improving phrase comprehension, and enhancing machine translation (MT) through techniques like byte-pair encoding (BPE), it cannot be applied directly to ideograph language scripts like Chinese. In this work, we conduct a systematic study of the Chinese character decomposition technology in the context of MWE-aware neural machine translation (NMT). Furthermore, we report experiments to examine how Chinese character decomposition technology contributes to the representation of the original meanings of Chinese words and characters, and how it can effectively address the challenges of translating MWEs.
[NLP-11] From Data to Dialogue: Unlocking Language for All
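[Quick read]: This paper addresses the drawbacks of the traditional General Service List (GSL) for language learning, which depends on linguistic expertise and subjective input and takes considerable time to build, and aims for a more efficient word-list method that can be automated. The key is a Specialized Word List (SWL), a list restricted to a specific subset of the corpus and built from objective criteria only. The authors' SWLs reach the 95% coverage required for comprehension with fewer words than the industry standard (the NGSL), and the process can be automated, scaled, and tailored to learners' needs.
Link: https://arxiv.org/abs/2512.15552
Authors: Dakota Ellis, Samy Bakikerali, Wanshan Chen, Bao Dinh, Uyen Le
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: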
Abstract:Traditional linguists have proposed the use of a General Service List (GSL) to assist new language learners in identifying the most important words in English. This process requires linguistic expertise, subjective input, and a considerable amount of time. We attempt to create our own GSL and evaluate its practicality against the industry standard (The NGSL). We found creating a Specialized Word List (SWL), or a word list specific to a subset of the overall corpus, to be the most practical way for language-learners to optimize the process. The SWL’s that we created using our model outperformed the industry standard, reaching the 95% coverage required for language comprehension with fewer words comparatively. By restricting the SWL process to objective criteria only, it can be automated, scaled, and tailored to the needs of language-learners across the globe.
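The coverage criterion can be sketched as follows, assuming a word list is built purely by frequency rank and that coverage means the share of running tokens whose word is on the list. The function name and the toy corpus are hypothetical, and the authors' model may rank words differently.

```python
from collections import Counter


def words_needed_for_coverage(tokens, target=0.95):
    """Smallest frequency-ranked word list covering `target` of all tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    covered = 0
    word_list = []
    for word, count in counts.most_common():
        if covered / total >= target:
            break
        word_list.append(word)
        covered += count
    return word_list


# Toy corpus; a real list would be built from a domain-specific sub-corpus.
tokens = "the cat sat on the mat the end".split()
print(words_needed_for_coverage(tokens, target=0.5))
```

Running the same function on a narrower sub-corpus typically needs fewer words to hit the target, which is the intuition behind a specialized list.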
[NLP-12] Learning inflection classes using Adaptive Resonance Theory
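[Quick read]: This paper studies how language users can acquire verbal inflection classes through unsupervised learning, that is, how an individual user without labeled data can cluster lexemes by formal similarity into inflection classes that support analogical inference. The model is a neural network based on Adaptive Resonance Theory (ART), whose vigilance parameter controls the degree of generalization. It is applied to Latin, Portuguese, and Estonian, and performs best in a narrow region of the generalization parameter. The features learned by the model resemble linguistic descriptions of the inflection classes, which supports its cognitive plausibility and interpretability and makes it a candidate for studying change in inflection classes within agent-based models.
Link: https://arxiv.org/abs/2512.15551
Authors: Peter Dekker, Heikki Rasilo, Bart de Boer
Affiliations: Vrije Universiteit Brussel
Categories: Computation and Language (cs.CL)
Comments: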
Abstract:The concept of inflection classes is an abstraction used by linguists, and provides a means to describe patterns in languages that give an analogical base for deducing previously unencountered forms. This ability is an important part of morphological acquisition and processing. We study the learnability of a system of verbal inflection classes by the individual language user by performing unsupervised clustering of lexemes into inflection classes. As a cognitively plausible and interpretable computational model, we use Adaptive Resonance Theory, a neural network with a parameter that determines the degree of generalisation (vigilance). The model is applied to Latin, Portuguese and Estonian. The similarity of clustering to attested inflection classes varies depending on the complexity of the inflectional system. We find the best performance in a narrow region of the generalisation parameter. The learned features extracted from the model show similarity with linguistic descriptions of the inflection classes. The proposed model could be used to study change in inflection classes in the future, by including it in an agent-based model.
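[NLP-13] CTkvr: KV Cache Retrieval for Long-Context LLMs via Centroid then Token Indexing
[Quick read]: This paper addresses inefficient inference for large language models (LLMs) in long-context scenarios, in particular the high memory overhead of the Key-Value (KV) cache and the latency caused by excessive memory accesses. Existing dynamic KV selection methods face a trade-off: block-level indexing loses accuracy because it retrieves irrelevant KV entries, while token-level indexing adds latency because its retrieval mechanisms are inefficient. The proposed CTKVR (centroid-then-token KV retrieval) is a two-stage scheme built on one observation: after Rotary Position Embedding (RoPE), query vectors that are adjacent in position are highly similar and share most of their top-k KV cache entries. CTKVR first precomputes lightweight centroids during prefilling for coarse, centroid-grained indexing, then refines at the token level for precise KV retrieval, balancing efficiency and accuracy. Index construction and search are further optimized through CPU-GPU co-execution.
Link: https://arxiv.org/abs/2512.15550
Authors: Kuan Lu, Shuhang Lin, Sai Wu, Yichen Yao, Junhan Yang, Huan Li, Wei Chu, Xu Yinghui, Yuan Qi, Gang Chen
Affiliations: Zhejiang University; Rutgers University; INFLY Tech
Categories: Computation and Language (cs.CL)
Comments: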
Abstract:Large language models (LLMs) are increasingly applied in long-context scenarios such as multi-turn conversations. However, long contexts pose significant challenges for inference efficiency, including high memory overhead from Key-Value (KV) cache and increased latency due to excessive memory accesses. Recent methods for dynamic KV selection struggle with trade-offs: block-level indexing degrades accuracy by retrieving irrelevant KV entries, while token-level indexing incurs high latency from inefficient retrieval mechanisms. In this paper, we propose CTKVR, a novel centroid-then-token KV retrieval scheme that addresses these limitations. CTKVR leverages a key observation: query vectors adjacent in position exhibit high similarity after Rotary Position Embedding (RoPE) and share most of their top-k KV cache entries. Based on this insight, CTKVR employs a two-stage retrieval strategy: lightweight centroids are precomputed during prefilling for centroid-grained indexing, followed by token-level refinement for precise KV retrieval. This approach balances retrieval efficiency and accuracy. To further enhance performance, we implement an optimized system for indexing construction and search using CPU-GPU co-execution. Experimentally, CTKVR achieves superior performance across multiple benchmarks with less than 1% accuracy degradation. Meanwhile, CTKVR delivers 3 times and 4 times throughput speedups on Llama-3-8B and Yi-9B at 96K context length across diverse GPU hardware.
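A generic coarse-to-fine retrieval loop gives a feel for the two-stage idea. The sketch assumes centroids are the mean key vector of each contiguous block and that relevance is a plain dot product. The abstract does not say how CTKVR defines its centroids or scores, so these choices, the function names, and the parameters `block_size`, `top_blocks`, and `top_k` are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def build_centroids(keys, block_size):
    """Mean key vector for each contiguous block (computed once, e.g. at prefill)."""
    centroids = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        dim = len(block[0])
        centroids.append([sum(k[d] for k in block) / len(block) for d in range(dim)])
    return centroids


def retrieve(query, keys, centroids, block_size, top_blocks, top_k):
    """Stage 1: rank blocks by centroid score. Stage 2: rank tokens inside them."""
    block_ids = sorted(range(len(centroids)),
                       key=lambda b: dot(query, centroids[b]), reverse=True)[:top_blocks]
    candidates = [i for b in block_ids
                  for i in range(b * block_size, min((b + 1) * block_size, len(keys)))]
    return sorted(candidates, key=lambda i: dot(query, keys[i]), reverse=True)[:top_k]


# Toy 2-D keys grouped into three blocks of two tokens each.
keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [-1.0, 0.0], [-0.9, -0.1]]
centroids = build_centroids(keys, block_size=2)
print(retrieve([0.0, 1.0], keys, centroids, block_size=2, top_blocks=1, top_k=2))
```

The coarse stage scores only one vector per block, and the fine stage scores only tokens in the selected blocks, so exact token-level ranking is preserved within a much smaller candidate set.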
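[NLP-14] When a Nation Speaks: Machine Learning and NLP in People's Sentiment Analysis During Bangladesh's 2024 Mass Uprising
[Quick read]: This paper addresses the gap in sentiment analysis during civil unrest, particularly in Bangla. Existing work focuses on settings such as elections and social media trends, while public emotion during political turmoil remains poorly understood. The authors build a dataset of 2,028 annotated news headlines labeled as Outrage, Hope, or Despair, and use Latent Dirichlet Allocation (LDA) to extract prevalent themes, showing how topics such as political corruption and public protests shape sentiment. A language model tuned for Bangla outperforms multilingual pretrained models (mBERT: 67%, XLM-RoBERTa: 71%) and traditional machine learning methods (SVM and Logistic Regression: both 70%), which supports language-specific modeling for capturing nuanced sentiment in complex social events.
Link: https://arxiv.org/abs/2512.15547
Authors: Md. Samiul Alim, Mahir Shahriar Tamim, Maisha Rahman, Tanvir Ahmed Khan, Md Mushfique Anwar
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: Accepted in 2025 28th International Conference on Computer and Information Technology (ICCIT)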
Abstract:Sentiment analysis, an emerging research area within natural language processing (NLP), has primarily been explored in contexts like elections and social media trends, but there remains a significant gap in understanding emotional dynamics during civil unrest, particularly in the Bangla language. Our study pioneers sentiment analysis in Bangla during a national crisis by examining public emotions amid Bangladesh’s 2024 mass uprising. We curated a unique dataset of 2,028 annotated news headlines from major Facebook news portals, classifying them into Outrage, Hope, and Despair. Through Latent Dirichlet Allocation (LDA), we identified prevalent themes like political corruption and public protests, and analyzed how events such as internet blackouts shaped sentiment patterns. It outperformed multilingual transformers (mBERT: 67%, XLM-RoBERTa: 71%) and traditional machine learning methods (SVM and Logistic Regression: both 70%). These results highlight the effectiveness of language-specific models and offer valuable insights into public sentiment during political turmoil.
[NLP-15] Tracking Temporal Dynamics of Vector Sets with Gaussian Process
[Quick read]: This paper addresses the modeling and analysis of time-varying sets of vectors, a problem that arises in ecology, crime analysis, and linguistics. The core difficulty is that such sets have complicated structures that themselves evolve over time. The approach models the distribution underlying each set of vectors with infinite-dimensional Gaussian processes and approximates the latent function in the Gaussian process with Random Fourier Features. This yields compact, comparable, low-dimensional representations that make it possible to track and visualize how vector sets change over time.
Link: https://arxiv.org/abs/2512.15538
Authors: Taichi Aida, Mamoru Komachi, Toshinobu Ogiso, Hiroya Takamura, Daichi Mochihashi
Affiliations: Tokyo Metropolitan University; Hitotsubashi University; National Institute for Japanese Language and Linguistics; National Institute of Advanced Industrial Science and Technology; The Institute of Statistical Mathematics
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Work in Progress
Abstract:Understanding the temporal evolution of sets of vectors is a fundamental challenge across various domains, including ecology, crime analysis, and linguistics. For instance, ecosystem structures evolve due to interactions among plants, herbivores, and carnivores; the spatial distribution of crimes shifts in response to societal changes; and word embedding vectors reflect cultural and semantic trends over time. However, analyzing such time-varying sets of vectors is challenging due to their complicated structures, which also evolve over time. In this work, we propose a novel method for modeling the distribution underlying each set of vectors using infinite-dimensional Gaussian processes. By approximating the latent function in the Gaussian process with Random Fourier Features, we obtain compact and comparable vector representations over time. This enables us to track and visualize temporal transitions of vector sets in a low-dimensional space. We apply our method to both sociological data (crime distributions) and linguistic data (word embeddings), demonstrating its effectiveness in capturing temporal dynamics. Our results show that the proposed approach provides interpretable and robust representations, offering a powerful framework for analyzing structural changes in temporally indexed vector sets across diverse domains.
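Random Fourier Features can be sketched on their own, independent of the paper's Gaussian process model. The standard construction draws random frequencies and phases so that the inner product of two feature vectors approximates an RBF kernel. The function name `make_rff`, the feature count, and the test points are hypothetical, and how the paper uses these features to represent each vector set is not shown here.

```python
import math
import random


def make_rff(input_dim, num_features, lengthscale=1.0, seed=0):
    """Return a feature map z with z(x).z(y) ~= exp(-||x - y||^2 / (2 * lengthscale^2))."""
    rng = random.Random(seed)
    # Frequencies are drawn from the spectral density of the RBF kernel.
    weights = [[rng.gauss(0.0, 1.0 / lengthscale) for _ in range(input_dim)]
               for _ in range(num_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def features(x):
        return [scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
                for w, b in zip(weights, phases)]

    return features


z = make_rff(input_dim=2, num_features=5000)
x, y = [0.3, -0.2], [0.1, 0.4]
approx = sum(a * b for a, b in zip(z(x), z(y)))
exact = math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / 2.0)
print(approx, exact)
```

The approximation error shrinks roughly as one over the square root of the number of features, so the finite feature vector acts as a compact stand-in for an infinite-dimensional kernel representation.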
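[NLP-16] Toward expert-level motivational interviewing for health behavior improvement with LLMs
[Quick read]: This paper addresses the limited scalability of motivational interviewing (MI), which in practice depends on highly trained human counselors. The key to the solution is fine-tuning on Chinese psychological counseling corpora to develop large language models that emulate core MI skills (MI-LLMs), enabling AI-assisted support for health behavior change. The study uses GPT-4 to generate high-quality multi-turn dialogue data and fine-tunes three Chinese-capable open-source models (Baichuan2-7B-Chat, ChatGLM-4-9B-Chat, and Llama-3-8B-Chinese-Chat-v2). The results show that MI-LLMs approach real MI dialogues on the technical and relational dimensions, which supports the feasibility of this approach.
Link: https://arxiv.org/abs/2512.15446
Authors: Run-ze Hu, Yang Yang, Yi-hang Yang, Jing-qi Kong, Jia-hui Luo, Wen-yu Yang, Jing Chen, Jing-yao Liu, Hui-qun Zeng, Lei Zhang, Zheng Liu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 26 pages, 3 figures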
Abstract:Background: Motivational interviewing (MI) is an effective counseling approach for promoting health behavior change, but its impact is constrained by the need for highly trained human counselors. Objective: This study aimed to explore a scalable alternative by developing and evaluating Large Language Models for Motivational Interviewing (MI-LLMs). Methods: We first curated five Chinese psychological counseling corpora and, using GPT-4 with an MI-informed prompt, transcribed multi-turn dialogues from the two highest-quality datasets (CPsyCounD and PsyDTCorpus) into 2,040 MI-style counseling conversations, of which 2,000 were used for training and 40 for testing. Three Chinese-capable open-source LLMs (Baichuan2-7B-Chat, ChatGLM-4-9B-Chat and Llama-3-8B-Chinese-Chat-v2) were fine-tuned on this corpus and were named as MI-LLMs. We evaluated MI-LLMs using round-based automatic metrics and expert manual coding with the Motivational Interviewing Treatment Integrity (MITI) Coding Manual 4.2.1. Results: Across all three models, fine-tuning substantially improved BLEU-4 and ROUGE scores compared with the base models, and manual coding showed that MI-LLMs achieved technical and relational global scores, and MI-adherent ratios that approached those of real MI dialogues, although complex reflections and reflection-to-question ratios remained less frequent. Conclusions: These findings provide initial evidence that MI-oriented fine-tuning can endow general-purpose LLMs with core MI-consistent counseling behaviors, suggesting a scalable pathway toward AI-assisted health behavior change support while underscoring the need for further work on data scale, complex MI skills and real-world intervention trials.
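[NLP-17] ORACLE: Time-Dependent Recursive Summary Graphs for Foresight on News Data Using LLMs
[Quick read]: This paper addresses information overload in university research and teaching, namely how to turn large volumes of daily news into weekly, decision-ready insights. The key to the solution is ORACLE, an end-to-end automated platform that crawls and versions news, applies university-specific relevance filtering, embeds the text, and classifies items into PESTEL (political, economic, social, technological, environmental, legal) dimensions. It then builds a Time-Dependent Recursive Summary Graph (TRSG) with two clustering layers that are summarized by a large language model (LLM) and recomputed weekly. A lightweight change detector identifies what is new, removed, or changed and groups the differences by PESTEL theme to support structured analysis. This design keeps the system stable in production and provides an evaluable curriculum-intelligence use case.
Link: https://arxiv.org/abs/2512.15397
Authors: Lev Kharlashkin, Eiaki Morooka, Yehor Tereshchenko, Mika Hämäläinen
Affiliations: Metropolia University of Applied Sciences
Categories: Computation and Language (cs.CL)
Comments: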
Abstract:ORACLE turns daily news into week-over-week, decision-ready insights for one of the Finnish University of Applied Sciences. The platform crawls and versions news, applies University-specific relevance filtering, embeds content, classifies items into PESTEL dimensions and builds a concise Time-Dependent Recursive Summary Graph (TRSG): two clustering layers summarized by an LLM and recomputed weekly. A lightweight change detector highlights what is new, removed or changed, then groups differences into themes for PESTEL-aware analysis. We detail the pipeline, discuss concrete design choices that make the system stable in production and present a curriculum-intelligence use case with an evaluation plan.
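The change-detection idea in the abstract (flagging what is new, removed, or changed between weekly snapshots) can be sketched as follows. The paper does not specify its data model here, so this sketch assumes each weekly snapshot is a dictionary from a stable node id to its summary text and that "changed" means the text differs. `detect_changes` and the sample ids are hypothetical.

```python
def detect_changes(previous, current):
    """Compare two weekly snapshots mapping node id -> summary text."""
    prev_ids, curr_ids = set(previous), set(current)
    return {
        "new": sorted(curr_ids - prev_ids),
        "removed": sorted(prev_ids - curr_ids),
        # Assumption: a node counts as changed when its summary text differs.
        "changed": sorted(i for i in prev_ids & curr_ids
                          if previous[i] != current[i]),
    }

last_week = {"econ-1": "Rates held steady.", "tech-4": "New AI curriculum proposed."}
this_week = {"econ-1": "Rates cut by 25 bp.", "legal-2": "Data act enters into force."}
diff = detect_changes(last_week, this_week)
```

A production system would more likely compare embeddings or cluster membership than raw strings. Exact string comparison is used here only to keep the sketch self-contained.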
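[NLP-18] Emotion Recognition in Signers
[Quick read]: This paper addresses two core problems in recognizing the emotions of sign language users: the overlap between grammatical and affective facial expressions, which are hard to tell apart visually, and the scarcity of data for model training. The solution has three key elements: first, using textual emotion recognition in spoken language to mitigate the shortage of sign language data; second, improving feature extraction through temporal segment selection; third, incorporating hand motion information to strengthen emotion recognition. Experiments on the newly built eJSL dataset (emotion recognition in Japanese Sign Language) and the BOBSL dataset (British Sign Language with subtitles) validate these strategies and establish a sign language emotion recognition baseline that is stronger than spoken language LLMs.
Link: https://arxiv.org/abs/2512.15376
Authors: Kotaro Funakoshi, Yaoxiong Zhu
Affiliations: Institute of Science Tokyo; FIRST, Institute of Integrated Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: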
Abstract:Recognition of signers’ emotions suffers from one theoretical challenge and one practical challenge, namely, the overlap between grammatical and affective facial expressions and the scarcity of data for model training. This paper addresses these two challenges in a cross-lingual setting using our eJSL dataset, a new benchmark dataset for emotion recognition in Japanese Sign Language signers, and BOBSL, a large British Sign Language dataset with subtitles. In eJSL, two signers expressed 78 distinct utterances with each of seven different emotional states, resulting in 1,092 video clips. We empirically demonstrate that 1) textual emotion recognition in spoken language mitigates data scarcity in sign language, 2) temporal segment selection has a significant impact, and 3) incorporating hand motion enhances emotion recognition in signers. Finally we establish a stronger baseline than spoken language LLMs.
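[NLP-19] Dual-Density Inference for Efficient Language Model Reasoning
[Quick read]: This paper addresses the computational inefficiency that arises when large language models (LLMs) use a uniform language density for both intermediate reasoning and final answers in complex reasoning tasks. Its central observation is that reasoning serves a computational function for the model itself, whereas answering serves a communicative function for human readers, so the two have different information-density requirements. The key to the solution is Denser, a dual-density inference framework with three components: a query processing module that analyzes the input problem, a high-density compressed reasoning mechanism for efficient intermediate computation, and an answer generation module that translates the compressed reasoning into a human-readable final solution. The method reduces token consumption by up to 62% while preserving or improving accuracy, and it is particularly suited to complex multi-step reasoning.
Link: https://arxiv.org/abs/2512.15358
Authors: Zhengyi Zhao, Shubo Zhang, Yuxi Zhang, Huimin Wang, Binyang Li, Kam-Fai Wong
Affiliations: The Chinese University of Hong Kong; University of International Relations; Shenzhen University
Categories: Computation and Language (cs.CL)
Comments: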
Abstract:Large Language Models (LLMs) have shown impressive capabilities in complex reasoning tasks. However, current approaches employ uniform language density for both intermediate reasoning and final answers, leading to computational inefficiency. Our observation found that reasoning process serves a computational function for the model itself, while answering serves a communicative function for human understanding. This distinction enables the use of compressed, symbol-rich language for intermediate computations while maintaining human-readable final explanations. To address this inefficiency, we present Denser (Dual-density inference), a novel framework that optimizes information density separately for reasoning and answering phases. Our framework implements this through three components: a query processing module that analyzes input problems, a high-density compressed reasoning mechanism for efficient intermediate computations, and an answer generation component that translates compressed reasoning into human-readable solutions. Experimental evaluation across multiple reasoning question answering benchmarks demonstrates that Denser reduces token consumption by up to 62% compared to standard Chain-of-Thought methods while preserving or improving accuracy. These efficiency gains are particularly significant for complex multi-step reasoning problems where traditional methods generate extensive explanations.
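[NLP-20] Adversarial versification in Portuguese as a jailbreak operator in LLMs
[Quick read]: This paper addresses the fragility of the safety mechanisms of aligned large language models (LLMs) under structural variation of prompts, in particular the "false robustness" in generative AI that results from dependence on surface patterns. The key lies in showing that versification is an effective universal single-turn jailbreak mechanism: recasting instructions that would otherwise be refused as verse markedly raises the rate of safety failures (Attack Success Rate, ASR), by up to 18 times, and the effect is consistent across alignment methods (RLHF, constitutional AI, and hybrid pipelines). The study argues that the attack works because verse structure displaces the input into sparsely supervised latent regions, which exposes how heavily current alignment strategies depend on surface patterns while neglecting the decoupling of deep semantics from form.
Link: https://arxiv.org/abs/2512.15353
Authors: Joao Queiroz
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 15 pages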
Abstract:Recent evidence shows that the versification of prompts constitutes a highly effective adversarial mechanism against aligned LLMs. The study ‘Adversarial poetry as a universal single-turn jailbreak mechanism in large language models’ demonstrates that instructions routinely refused in prose become executable when rewritten as verse, producing up to 18 x more safety failures in benchmarks derived from MLCommons AILuminate. Manually written poems reach approximately 62% ASR, and automated versions 43%, with some models surpassing 90% success in single-turn interactions. The effect is structural: systems trained with RLHF, constitutional AI, and hybrid pipelines exhibit consistent degradation under minimal semiotic formal variation. Versification displaces the prompt into sparsely supervised latent regions, revealing guardrails that are excessively dependent on surface patterns. This dissociation between apparent robustness and real vulnerability exposes deep limitations in current alignment regimes. The absence of evaluations in Portuguese, a language with high morphosyntactic complexity, a rich metric-prosodic tradition, and over 250 million speakers, constitutes a critical gap. Experimental protocols must parameterise scansion, metre, and prosodic variation to test vulnerabilities specific to Lusophone patterns, which are currently ignored.
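[NLP-21] Why Your Academic Field Is Everywhere at Once: A Case Study of Arabic Linguistics
[Quick read]: This paper addresses the quantitative characterization of the thematic structure of contemporary Arabic Applied Linguistics research, that is, how to assess objectively how dispersed the field's research topics are and what its disciplinary composition looks like. The key to the solution is applying Brookes' Measure of Categorical Dispersion (Δ) to a classified dataset of 1,564 publications from 2019 to 2025. The resulting value, Δ = 0.194, is very low and indicates pronounced thematic heterogeneity rather than concentration. The analysis also shows that Computational Linguistics is prominent but not dominant, coexisting with subfields such as Sociolinguistics and Language Teaching, and it offers a replicable bibliometric methodology for analyzing disciplinary structure.
Link: https://arxiv.org/abs/2512.15328
Authors: Ayman Eddakrouri (Effat University), Amani Ramadan (Cairo University)
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: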
Abstract:This study applies Brookes’ Measure of Categorical Dispersion (Δ) to analyze the thematic structure of contemporary Arabic Applied Linguistics research. Using a comprehensive, real-world dataset of 1,564 publications from 2019 to 2025, classified into eight core sub-disciplines, we calculate a dispersion index of Δ = 0.194. This remarkably low value indicates extreme thematic dispersion, revealing that the field is characterized by pronounced heterogeneity rather than concentration. The analysis identifies Computational Linguistics as a dominant but non-hegemonic force, coexisting with robust research in Sociolinguistics, Language Teaching, and other subfields. This study clarifies the correct application of Brookes’ original formula, demonstrates its utility for field characterization, and provides a replicable bibliometric methodology for assessing disciplinary structure across domains.
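[NLP-22] Evaluating LLMs for Zeolite Synthesis Event Extraction (ZSEE): A Systematic Analysis of Prompting Strategies
[Quick read]: This paper addresses the extraction of structured information from zeolite synthesis experimental procedures, and in particular evaluates how effective large language models (LLMs) are at this domain-specific scientific task. The central question is how different prompting strategies affect LLM performance on four subtasks: event type classification, trigger text identification, argument role extraction, and argument text extraction. The key to the approach is a systematic comparison of four prompting strategies (zero-shot, few-shot, event-specific, and reflection-based) across six state-of-the-art LLMs, evaluated quantitatively on the ZSEE dataset. Event type classification performs well (80-90% F1), but the fine-grained argument role and argument text extraction tasks score lower (50-65% F1), and advanced prompting strategies bring only limited gains. This points to architectural limitations of current LLMs in capturing synthesis-specific detail and suggests that precise extraction of experimental parameters will require domain-adapted models.
Link: https://arxiv.org/abs/2512.15312
Authors: Charan Prakash Rathore, Saumi Ray, Dhruv Kumar
Affiliations: Birla Institute of Technology and Science, Pilani, India
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under Review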
Abstract:Extracting structured information from zeolite synthesis experimental procedures is critical for materials discovery, yet existing methods have not systematically evaluated Large Language Models (LLMs) for this domain-specific task. This work addresses a fundamental question: what is the efficacy of different prompting strategies when applying LLMs to scientific information extraction? We focus on four key subtasks: event type classification (identifying synthesis steps), trigger text identification (locating event mentions), argument role extraction (recognizing parameter types), and argument text extraction (extracting parameter values). We evaluate four prompting strategies - zero-shot, few-shot, event-specific, and reflection-based - across six state-of-the-art LLMs (Gemma-3-12b-it, GPT-5-mini, O4-mini, Claude-Haiku-3.5, DeepSeek reasoning and non-reasoning) using the ZSEE dataset of 1,530 annotated sentences. Results demonstrate strong performance on event type classification (80-90% F1) but modest performance on fine-grained extraction tasks, particularly argument role and argument text extraction (50-65% F1). GPT-5-mini exhibits extreme prompt sensitivity with 11-79% F1 variation. Notably, advanced prompting strategies provide minimal improvements over zero-shot approaches, revealing fundamental architectural limitations. Error analysis identifies systematic hallucination, over-generalization, and inability to capture synthesis-specific nuances. Our findings demonstrate that while LLMs achieve high-level understanding, precise extraction of experimental parameters requires domain-adapted models, providing quantitative benchmarks for scientific information extraction.
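The F1 figures quoted above can be made concrete with a small scoring sketch. It assumes exact-match, micro-averaged scoring over `(event_type, role, text)` tuples. The paper's actual matching criteria are not stated in the abstract, and the function name and sample annotations below are hypothetical.

```python
def micro_f1(gold, predicted):
    """Micro-averaged precision, recall and F1 over per-sentence sets of items."""
    true_pos = sum(len(g & p) for g, p in zip(gold, predicted))
    n_pred = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical annotations for two sentences: (event_type, role, text).
gold = [
    {("Add", "material", "NaOH"), ("Add", "amount", "2 g")},
    {("Stir", "duration", "24 h")},
]
predicted = [
    {("Add", "material", "NaOH")},
    {("Stir", "duration", "12 h")},
]
precision, recall, f1 = micro_f1(gold, predicted)
```

Exact matching is strict: the `"12 h"` prediction gets no partial credit, which is one reason fine-grained argument extraction tends to score below event type classification.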
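[NLP-23] Towards Proactive Personalization through Profile Customization for Individual Users in Dialogues
[Quick read]: This paper addresses the difficulty large language models (LLMs) have with long-term personalization and the user cold-start problem in interactive systems. Existing alignment techniques mostly target universal human values or static single-turn preferences and cannot follow the way a user's preferences evolve over time. The key to the solution is PersonalAgent, a user-centric lifelong agent that decomposes dialogues into single-turn interactions, frames preference inference as a sequential decision-making task, and continuously builds and dynamically refines a unified user profile, which yields cross-session preference consistency and robust performance in noisy contexts.
Link: https://arxiv.org/abs/2512.15302
Authors: Xiaotian Zhang, Yuan Wang, Ruizhe Chen, Zeya Wang, Runchen Hou, Zuozhu Liu
Affiliations: Zhejiang University; Zhejiang Key Laboratory of Medical Imaging Artificial Intelligence
Categories: Computation and Language (cs.CL)
Comments: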
Abstract:The deployment of Large Language Models (LLMs) in interactive systems necessitates a deep alignment with the nuanced and dynamic preferences of individual users. Current alignment techniques predominantly address universal human values or static, single-turn preferences, thereby failing to address the critical needs of long-term personalization and the initial user cold-start problem. To bridge this gap, we propose PersonalAgent, a novel user-centric lifelong agent designed to continuously infer and adapt to user preferences. PersonalAgent constructs and dynamically refines a unified user profile by decomposing dialogues into single-turn interactions, framing preference inference as a sequential decision-making task. Experiments show that PersonalAgent achieves superior performance over strong prompt-based and policy optimization baselines, not only in idealized but also in noisy conversational contexts, while preserving cross-session preference consistency. Furthermore, human evaluation confirms that PersonalAgent excels at capturing user preferences naturally and coherently. Our findings underscore the importance of lifelong personalization for developing more inclusive and adaptive conversational agents. Our code is available here.
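[NLP-24] ChatGPT and Gemini participated in the Korean College Scholastic Ability Test – Earth Science I
[Quick read]: This paper addresses the academic integrity and assessment validity concerns raised by the misuse of generative AI in educational assessment, with a focus on the cognitive limitations of large language models (LLMs) in multimodal scientific reasoning. The key to the solution is identifying and exploiting systematic weaknesses of AI, namely the Perception-Cognition Gap, the Calculation-Conceptualization Discrepancy, and Process Hallucination, in order to design "AI-resistant" questions that let assessments distinguish genuine student competency from AI-generated responses and so protect the fairness and credibility of educational evaluation.
Link: https://arxiv.org/abs/2512.15298
Authors: Seok-Hyun Ga, Chun-Yen Chang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 23 pages, 9 tables, 1 figure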
Abstract:The rapid development of Generative AI is bringing innovative changes to education and assessment. As the prevalence of students utilizing AI for assignments increases, concerns regarding academic integrity and the validity of assessments are growing. This study utilizes the Earth Science I section of the 2025 Korean College Scholastic Ability Test (CSAT) to deeply analyze the multimodal scientific reasoning capabilities and cognitive limitations of state-of-the-art Large Language Models (LLMs), including GPT-4o, Gemini 2.5 Flash, and Gemini 2.5 Pro. Three experimental conditions (full-page input, individual item input, and optimized multimodal input) were designed to evaluate model performance across different data structures. Quantitative results indicated that unstructured inputs led to significant performance degradation due to segmentation and Optical Character Recognition (OCR) failures. Even under optimized conditions, models exhibited fundamental reasoning flaws. Qualitative analysis revealed that “Perception Errors” were dominant, highlighting a “Perception-Cognition Gap” where models failed to interpret symbolic meanings in schematic diagrams despite recognizing visual data. Furthermore, models demonstrated a “Calculation-Conceptualization Discrepancy,” successfully performing calculations while failing to apply the underlying scientific concepts, and “Process Hallucination,” where models skipped visual verification in favor of plausible but unfounded background knowledge. Addressing the challenge of unauthorized AI use in coursework, this study provides actionable cues for designing “AI-resistant questions” that target these specific cognitive vulnerabilities. By exploiting AI’s weaknesses, such as the gap between perception and cognition, educators can distinguish genuine student competency from AI-generated responses, thereby ensuring assessment fairness.
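[NLP-25] Well Begun Half Done: Reinforcement Learning with Prefix Optimization for LLM Reasoning AAAI2026
[Quick read]: This paper addresses an inefficiency in current Reinforcement Learning with Verifiable Rewards (RLVR) methods for training large language models (LLMs): they optimize all generated tokens uniformly and overlook the key role of prefix tokens in reasoning, so much of the effort goes into low-return tokens and the gains available from high-return tokens are held back. The key to the solution is Progressive Prefix-token Policy Optimization (PPPO), which builds on the Beginning Lock-in Effect (BLE) in LLM reasoning. Inspired by the theory of Path Dependence, BLE states that the early stage of reasoning strongly shapes the subsequent trajectory. PPPO uses two strategies: Progressive Prefix Retention, which gradually increases the proportion of prefix tokens retained during training to form a progressive learning process, and Continuation Accumulated Reward, which samples multiple continuations for the same prefix sequence and accumulates their rewards to mitigate reward bias. Experiments show that PPPO achieves accuracy improvements of 18.02% while using only 26.17% of the training tokens, outperforming representative RLVR methods.
Link: https://arxiv.org/abs/2512.15274
Authors: Yiliu Sun, Zicheng Zhao, Yang Wei, Yanfang Zhang, Chen Gong
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by AAAI 2026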
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) significantly enhances the reasoning capability of Large Language Models (LLMs). Current RLVR approaches typically conduct training across all generated tokens, but neglect to explore which tokens (e.g., prefix tokens) actually contribute to reasoning. This uniform training strategy spends substantial effort on optimizing low-return tokens, which in turn impedes the potential improvement from high-return tokens and reduces overall training effectiveness. To address this issue, we propose a novel RLVR approach called Progressive Prefix-token Policy Optimization (PPPO), which highlights the significance of the prefix segment of generated outputs. Specifically, inspired by the well-established human thinking theory of Path Dependence, where early-stage thoughts substantially constrain subsequent thinking trajectory, we identify an analogous phenomenon in LLM reasoning termed Beginning Lock-in Effect (BLE). PPPO leverages this finding by focusing its optimization objective on the prefix reasoning process of LLMs. This targeted optimization strategy can positively influence subsequent reasoning processes, and ultimately improve final results. To improve the learning effectiveness of LLMs on how to start reasoning with high quality, PPPO introduces two training strategies: (a) Progressive Prefix Retention, which shapes a progressive learning process by increasing the proportion of retained prefix tokens during training; (b) Continuation Accumulated Reward, which mitigates reward bias by sampling multiple continuations for one prefix token sequence, and accumulating their scores as the reward signal. Extensive experimental results on various reasoning tasks demonstrate that our proposed PPPO outperforms representative RLVR methods, with the accuracy improvements of 18.02% on only 26.17% training tokens.
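The two training strategies named in the abstract can be sketched in a few lines. This is a loose illustration, not the authors' implementation: the linear schedule, its `start` and `end` values, and every function name are assumptions, and the sampler and scorer are passed in as plain callables that stand in for a policy model and a verifier.

```python
def prefix_fraction(step, total_steps, start=0.1, end=0.5):
    """Assumed linear schedule for the share of tokens retained as the prefix."""
    progress = min(max(step / max(total_steps - 1, 1), 0.0), 1.0)
    return start + (end - start) * progress

def retained_prefix(tokens, fraction):
    """Keep the first `fraction` of a generated token sequence (at least one token)."""
    keep = max(1, int(len(tokens) * fraction))
    return tokens[:keep]

def continuation_accumulated_reward(prefix, sample_continuation, score, n_samples=4):
    """Sum the verifier scores of several continuations sampled from one prefix."""
    return sum(score(prefix + sample_continuation(prefix)) for _ in range(n_samples))
```

For example, `continuation_accumulated_reward(prefix, sampler, verifier, n_samples=8)` would score eight completions of the same prefix and add the results, so a prefix is credited for how often it leads to a verified answer rather than for a single rollout.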
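[NLP-26] SynGP500: A Clinically-Grounded Synthetic Dataset of Australian General Practice Medical Notes
[Quick read]: This paper addresses the lack of high-quality, diverse synthetic medical text datasets that reflect real clinical practice for clinical natural language processing (NLP) research, specifically in Australian general practice (GP). Existing datasets are often constrained by the skewed case distribution of real consultations, fail to cover rare but important clinical situations, and are mostly "sanitized", so they do not reflect the complexity and variety of real clinical documentation. The key to the solution is SynGP500, a clinician-curated collection of 500 synthetic GP notes. It draws on the Royal Australian College of General Practitioners (RACGP) 2022 Curriculum for clinical breadth, calibrates condition prevalence against the BEACH epidemiological study, and includes diverse consultation contexts (such as patient non-adherence and socioeconomic barriers), so that it simulates the complexity of real-world documentation while protecting privacy. Multi-faceted validation (epidemiological alignment, stylometric analysis, semantic diversity analysis, and downstream task improvement) supports the dataset's quality and usefulness, and the dataset fills a gap in Australian clinical NLP resources.
Link: https://arxiv.org/abs/2512.15259
Authors: Piyawoot Songsiritat
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 16 pages, 2 figures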
Abstract:We introduce SynGP500, a clinician-curated collection of 500 synthetic Australian general practice medical notes. The dataset integrates curriculum-based clinical breadth (RACGP 2022 Curriculum), epidemiologically-calibrated prevalence (BEACH study), and diverse consultation contexts. This approach systematically includes both common presentations and less-common curriculum-specified conditions that GPs must recognize but appear infrequently in single practice populations, potentially supporting more generalizable model training than datasets constrained by naturally occurring case distributions. SynGP500 is messy by design, reflecting the authentic complexity of healthcare delivery: telegraphic documentation, typos, patient non-adherence, socioeconomic barriers, and clinician-patient disagreements, unlike sanitized synthetic datasets that obscure clinical realities. Multi-faceted validation demonstrates dataset quality through epidemiological alignment with real Australian GP consultation patterns (BEACH study), stylometric analysis confirming high linguistic variation, semantic diversity analysis demonstrating broad coverage, and exploratory downstream evaluation using self-supervised medical concept extraction, showing F1 improvements. SynGP500 addresses a critical national gap, providing researchers and educators with a resource for developing and evaluating clinical NLP methods for Australian general practice while inherently protecting patient privacy.
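[NLP-27] The Moralization Corpus: Frame-Based Annotation and Analysis of Moralizing Speech Acts across Diverse Text Genres
[Quick read]: This paper addresses the fact that moralizations, a rhetorical strategy that invokes moral values to justify demands or positions, remain underexplored in discourse analysis. Such arguments are pragmatically complex and often implicit, which makes them hard for both human annotators and natural language processing (NLP) systems. The key to the solution is a multi-genre German Moralization Corpus together with a frame-based annotation scheme that identifies the constitutive elements of a moralization: moral values, demands, and discourse protagonists. This enables fine-grained analysis of moralizing language across communicative settings. An evaluation of large language models (LLMs) on moralization detection and component extraction under different prompting conditions shows that detailed prompt instructions are more effective than few-shot or explanation-based prompting, and that moralization remains a highly subjective and context-sensitive task.
Link: https://arxiv.org/abs/2512.15248
Authors: Maria Becker, Mirko Sommer, Lars Tapken, Yi Wan Teh, Bruno Brocai
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: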
Abstract:Moralizations - arguments that invoke moral values to justify demands or positions - are a yet underexplored form of persuasive communication. We present the Moralization Corpus, a novel multi-genre dataset designed to analyze how moral values are strategically used in argumentative discourse. Moralizations are pragmatically complex and often implicit, posing significant challenges for both human annotators and NLP systems. We develop a frame-based annotation scheme that captures the constitutive elements of moralizations - moral values, demands, and discourse protagonists - and apply it to a diverse set of German texts, including political debates, news articles, and online discussions. The corpus enables fine-grained analysis of moralizing language across communicative formats and domains. We further evaluate several large language models (LLMs) under varied prompting conditions for the tasks of moralization detection and moralization component extraction and compare their outputs to human annotations in order to investigate the challenges of automatic and manual analysis of moralizations. Results show that detailed prompt instructions have a greater effect than few-shot or explanation-based prompting, and that moralization remains a highly subjective and context-sensitive task. We release all data, annotation guidelines, and code to foster future interdisciplinary research on moral discourse and moral reasoning in NLP.
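[NLP-28] FAME: Fictional Actors for Multilingual Erasure
[Quick read]: This paper addresses the privacy risk that large language models (LLMs) may retain sensitive personal information from training, and in particular how to honor the "right to be forgotten" by removing specific information from a trained model without retraining it. Existing Machine Unlearning benchmarks have two main limitations: they cover only English, and they support only entity-level forgetting (removing all information about an individual). The authors propose FAME (Fictional Actors for Multilingual Erasure), a multilingual unlearning benchmark built on fictional people. It covers five languages (English, French, German, Italian, and Spanish) and contains 1,000 fictional actor biographies and 20,000 question-answer pairs, with each biography covering 20 topics in structured categories (biography, career, achievements, and personal information). FAME's key contribution is support for two granularities of forgetting, entity-level (removing a whole identity) and instance-level (removing specific facts while retaining others), with two dataset splits that allow systematic cross-language comparison of unlearning methods. Because all data is fictional, the models cannot have seen it during pretraining, which makes the evaluation controlled and fair.
Link: https://arxiv.org/abs/2512.15235
Authors: Claudio Savelli, Moreno La Quatra, Alkis Koudounas, Flavio Giobergia
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: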
Abstract:LLMs trained on web-scale data raise concerns about privacy and the right to be forgotten. To address these issues, Machine Unlearning provides techniques to remove specific information from trained models without retraining from scratch. However, existing benchmarks for evaluating unlearning in LLMs face two major limitations: they focus only on English and support only entity-level forgetting (removing all information about a person). We introduce FAME (Fictional Actors for Multilingual Erasure), a synthetic benchmark for evaluating Machine Unlearning across five languages: English, French, German, Italian, and Spanish. FAME contains 1,000 fictional actor biographies and 20,000 question-answer pairs. Each biography includes information on 20 topics organized into structured categories (biography, career, achievements, personal information). This design enables both entity-level unlearning (i.e., forgetting entire identities) and instance-level unlearning (i.e., forgetting specific facts while retaining others). We provide two dataset splits to support these two different unlearning scenarios and enable systematic comparison of unlearning techniques across languages. Since FAME uses entirely fictional data, it ensures that the information was never encountered during model pretraining, allowing for a controlled evaluation of unlearning methods.
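The difference between entity-level and instance-level unlearning can be shown with a toy forget/retain split. The record layout (`actor`, `topic`, `question`) and the function names are assumptions made for illustration and do not reflect FAME's actual file format.

```python
def entity_level_split(qa_pairs, forget_actors):
    """Forget every question about the listed actors; retain everything else."""
    forget = [qa for qa in qa_pairs if qa["actor"] in forget_actors]
    retain = [qa for qa in qa_pairs if qa["actor"] not in forget_actors]
    return forget, retain

def instance_level_split(qa_pairs, forget_facts):
    """Forget only specific (actor, topic) facts; the actor's other facts are retained."""
    forget = [qa for qa in qa_pairs if (qa["actor"], qa["topic"]) in forget_facts]
    retain = [qa for qa in qa_pairs if (qa["actor"], qa["topic"]) not in forget_facts]
    return forget, retain

# Hypothetical records; real FAME entries will differ.
qa_pairs = [
    {"actor": "Actor A", "topic": "career", "question": "What was Actor A's debut film?"},
    {"actor": "Actor A", "topic": "achievements", "question": "Which award did Actor A win?"},
    {"actor": "Actor B", "topic": "career", "question": "What was Actor B's debut film?"},
]
entity_forget, entity_retain = entity_level_split(qa_pairs, {"Actor A"})
fact_forget, fact_retain = instance_level_split(qa_pairs, {("Actor A", "career")})
```

In the instance-level case, Actor A still appears in the retain set, so an unlearning method is penalized if it erases more about the person than the targeted fact.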
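[NLP-29] Yes-MT's Submission to the Low-Resource Indic Language Translation Shared Task in WMT 2024
[Quick read]: This paper addresses machine translation between English and four low-resource Indic languages (Assamese, Mizo, Khasi, and Manipuri), where the main challenge is that scarce training data limits the performance of conventional models. The key to the approach is a systematic comparison of several methods: fine-tuning pre-trained models (mT5 and IndicBart) in multilingual and monolingual settings, parameter-efficient fine-tuning of IndicTrans2 with LoRA (Low-Rank Adaptation), zero-shot and few-shot prompting of large language models (LLMs) such as Llama 3 and Mixtral 8x7b, LoRA supervised fine-tuning of Llama 3, and training Transformer models from scratch. The experiments indicate that LLM-based approaches, particularly with fine-tuning, show clear potential under low-resource conditions.
Link: https://arxiv.org/abs/2512.15226
Authors: Yash Bhaskar, Parameswari Krishnamurthy
Affiliations: IIIT Hyderabad
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at WMT 2024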
Abstract:This paper presents the systems submitted by the Yes-MT team for the Low-Resource Indic Language Translation Shared Task at WMT 2024 (Pakray et al., 2024), focusing on translating between English and the Assamese, Mizo, Khasi, and Manipuri languages. The experiments explored various approaches, including fine-tuning pre-trained models like mT5 (Xue et al., 2020) and IndicBart (Dabre et al., 2021) in both multilingual and monolingual settings, LoRA (Hu et al., 2021) fine-tuning IndicTrans2 (Gala et al., 2023), zero-shot and few-shot prompting (Brown, 2020) with large language models (LLMs) like Llama 3 (Dubey et al., 2024) and Mixtral 8x7b (Jiang et al., 2024), LoRA supervised fine-tuning of Llama 3 (Mecklenburg et al., 2024), and training Transformer models (Vaswani, 2017) from scratch. The results were evaluated on the WMT23 Low-Resource Indic Language Translation Shared Task test data using SacreBLEU (Post, 2018) and CHRF (Popovic, 2015), highlighting the challenges of low-resource translation and the potential of LLMs for these tasks, particularly with fine-tuning.
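[NLP-30] RFKG-CoT: Relation-Driven Adaptive Hop-count Selection and Few-Shot Path Guidance for Knowledge-Aware QA AAAI2026
[Quick read]: This paper addresses the hallucinations that large language models (LLMs) produce in knowledge-intensive question answering because of the limits of their parametric knowledge. Existing methods such as KG-CoT improve reliability by bringing in knowledge graph (KG) paths, but they are limited by rigid hop-count selection (driven solely by the question) and by underuse of the reasoning paths (no guidance). The key to the solution is the RFKG-CoT framework. First, a relation-driven adaptive hop-count selector uses a relation mask to adjust the number of reasoning steps dynamically (for example, 1 hop for a direct "brother" relation and 2 hops for an indirect "father-son" chain). Second, a few-shot in-context learning path guidance mechanism builds examples in a "question-paths-answer" format to help LLMs understand reasoning paths. Experiments on four KGQA benchmarks show accuracy gains of up to 14.7 percentage points over KG-CoT, and ablations confirm that the two components are complementary and jointly turn KG evidence into more faithful answers.
Link: https://arxiv.org/abs/2512.15219
Authors: Chao Zhang, Minghan Li, Tianrui Lv, Guodong Zhou
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 pages, 5 figures, accepted by AAAI 2026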
Abstract:Large language models (LLMs) often generate hallucinations in knowledge-intensive QA due to parametric knowledge limitations. While existing methods like KG-CoT improve reliability by integrating knowledge graph (KG) paths, they suffer from rigid hop-count selection (solely question-driven) and underutilization of reasoning paths (lack of guidance). To address this, we propose RFKG-CoT: First, it replaces the rigid hop-count selector with a relation-driven adaptive hop-count selector that dynamically adjusts reasoning steps by activating KG relations (e.g., 1-hop for direct “brother” relations, 2-hop for indirect “father-son” chains), formalized via a relation mask. Second, it introduces a few-shot in-context learning path guidance mechanism with CoT (think) that constructs examples in a “question-paths-answer” format to enhance LLMs’ ability to understand reasoning paths. Experiments on four KGQA benchmarks show RFKG-CoT improves accuracy by up to 14.7 pp (Llama2-7B on WebQSP) over KG-CoT. Ablations confirm the hop-count selector and the path prompt are complementary, jointly transforming KG evidence into more faithful answers.
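The "question-paths-answer" few-shot format described in the abstract can be sketched as a small prompt builder. The exact template used by RFKG-CoT is not given here, so the field labels, the arrow notation for paths, and the function names are assumptions, and the sample triples are made up.

```python
def format_example(question, paths, answer=None):
    """Render one block in a question-paths-answer layout (assumed template)."""
    lines = [f"Question: {question}", "Paths:"]
    lines += [f"- {' -> '.join(path)}" for path in paths]
    # The final query leaves the answer slot open for the model to fill in.
    lines.append(f"Answer: {answer}" if answer is not None else "Answer:")
    return "\n".join(lines)

def build_prompt(examples, question, paths):
    """Concatenate few-shot examples followed by the unanswered target question."""
    shots = [format_example(q, p, a) for q, p, a in examples]
    return "\n\n".join(shots + [format_example(question, paths)])

# Hypothetical knowledge-graph paths, for illustration only.
examples = [(
    "Who is Alice's brother?",
    [["Alice", "brother", "Bob"]],
    "Bob",
)]
prompt = build_prompt(
    examples,
    "Who is Carol's grandfather?",
    [["Carol", "father", "Dan", "father", "Erik"]],
)
```

The second path spans two hops, which mirrors the abstract's distinction between direct relations and indirect chains. Choosing how many hops to retrieve is the job of the hop-count selector and is not modeled here.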
[NLP-31] From NLG Evaluation to Modern Student Assessment in the Era of ChatGPT: The Great Misalignment Problem and Pedagogical Multi-Factor Assessment (P-MFA)
[Quick read]: This paper addresses the growing "Great Misalignment Problem" that, in the era of generative AI, affects both natural language generation (NLG) evaluation and the grading of students at a Finnish university: traditional assessment methods that focus on final products can no longer measure the learning process or real competence. The key to the solution is the Pedagogical Multi-Factor Assessment (P-MFA) model, a process-based, multi-evidence framework that borrows the logic of multi-factor authentication and combines several kinds of process evidence to make assessment more reliable, fair, and educationally meaningful.
Link: https://arxiv.org/abs/2512.15183
Authors: Mika Hämäläinen, Kimmo Leiviskä
Affiliations: Metropolia University of Applied Sciences
Categories: Computation and Language (cs.CL)
Comments: IWCLUL 2025
Abstract:This paper explores the growing epistemic parallel between NLG evaluation and grading of students in a Finnish University. We argue that both domains are experiencing a Great Misalignment Problem. As students increasingly use tools like ChatGPT to produce sophisticated outputs, traditional assessment methods that focus on final products rather than learning processes have lost their validity. To address this, we introduce the Pedagogical Multi-Factor Assessment (P-MFA) model, a process-based, multi-evidence framework inspired by the logic of multi-factor authentication.
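[NLP-32] MCP-SafetyBench: A Benchmark for Safety Evaluation of Large Language Models with Real-World MCP Servers
[Quick read]: This paper addresses the new safety risks introduced by the open Model Context Protocol (MCP) as large language models (LLMs) evolve into agentic systems. Existing benchmarks do not cover realistic attack types in multi-server, multi-turn settings, which leaves a clear blind spot in the safety evaluation of LLM agents. The key to the solution is MCP-SafetyBench, a comprehensive benchmark built on real MCP servers. It supports multi-turn evaluation across five practical domains (such as browser automation and financial analysis), covers a unified taxonomy of 20 attack types spanning the server, host, and user sides, and includes tasks that require multi-step reasoning and cross-server coordination to reflect real-world complexity. The benchmark reveals large differences in safety performance among leading open- and closed-source LLMs and shows that vulnerabilities escalate as task horizons and server interactions grow, providing an empirical basis for diagnosing and mitigating safety risks in MCP deployments.
Link: https://arxiv.org/abs/2512.15163
Authors: Xuanjun Zong, Zhiqi Shen, Lei Wang, Yunshi Lan, Chao Yang
Affiliations: East China Normal University; National University of Singapore; Singapore Management University; Shanghai AI Laboratory
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Our benchmark is available at this https URL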
Abstract:Large language models (LLMs) are evolving into agentic systems that reason, plan, and operate external tools. The Model Context Protocol (MCP) is a key enabler of this transition, offering a standardized interface for connecting LLMs with heterogeneous tools and services. Yet MCP’s openness and multi-server workflows introduce new safety risks that existing benchmarks fail to capture, as they focus on isolated attacks or lack real-world coverage. We present MCP-SafetyBench, a comprehensive benchmark built on real MCP servers that supports realistic multi-turn evaluation across five domains: browser automation, financial analysis, location navigation, repository management, and web search. It incorporates a unified taxonomy of 20 MCP attack types spanning server, host, and user sides, and includes tasks requiring multi-step reasoning and cross-server coordination under uncertainty. Using MCP-SafetyBench, we systematically evaluate leading open- and closed-source LLMs, revealing large disparities in safety performance and escalating vulnerabilities as task horizons and server interactions grow. Our results highlight the urgent need for stronger defenses and establish MCP-SafetyBench as a foundation for diagnosing and mitigating safety risks in real-world MCP deployments.
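[NLP-33] Rakuten Data Release: A Large-Scale and Long-Term Reviews Corpus for Hotel Domain
【Quick read】: This paper addresses data drift in travel review data over long time spans, in particular how reviewing behavior, rating patterns, and text content changed between 2019 and 2024. The key to the solution is a large, structured corpus, Rakuten Travel Reviews, containing 7.3 million customer reviews with accompanying fields (accommodation responses, aspect ratings, room type, purpose, and accompanying group), together with a statistical analysis that quantifies distribution differences across time windows to identify the main factors driving the drift.
Link: https://arxiv.org/abs/2512.15151
Authors: Yuki Nakayama, Koki Hikichi, Yun Ching Liu, Yu Hirate
Affiliations: Rakuten Institute of Technology; Rakuten Group, Inc.
Categories: Computation and Language (cs.CL)
Comments: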
Abstract:This paper presents a large-scale corpus of Rakuten Travel Reviews. Our collection contains 7.3 million customer reviews for 16 years, ranging from 2009 to 2024. Each record in the dataset contains the review text, its response from an accommodation, an anonymized reviewer ID, review date, accommodation ID, plan ID, plan title, room type, room name, purpose, accompanying group, and user ratings from different aspect categories, as well as an overall score. We present statistical information about our corpus and provide insights into factors driving data drift between 2019 and 2024 using statistical approaches.
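[NLP-34] Beyond Majority Voting: Towards Fine-grained and More Reliable Reward Signal for Test-Time Reinforcement Learning
【Quick read】: This paper addresses the confirmation bias and sparse rewards caused by majority voting in test-time reinforcement learning, with the goal of improving the reasoning ability of large language models (LLMs). The key to the solution is SCOPE (subgroup-specific step-wise confidence-weighted pseudo-label estimation). SCOPE uses step-wise confidence weighting to prioritize high-quality reasoning paths over simple frequency counts, and it dynamically partitions candidate outputs into subgroups, balancing reasoning quality against exploration diversity. Repeated sampling within each subgroup yields a local consensus that serves as a diverse supervision signal, which strengthens exploration and final performance.
Link: https://arxiv.org/abs/2512.15146
Authors: Weiqin Wang, Yile Wang, Kehao Chen, Hui Huang
Affiliations: Shenzhen University; Fuzhou University
Categories: Computation and Language (cs.CL)
Comments: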
Abstract:Test-time reinforcement learning mitigates the reliance on annotated data by using majority voting results as pseudo-labels, emerging as a complementary direction to reinforcement learning with verifiable rewards (RLVR) for improving the reasoning ability of large language models (LLMs). However, this voting strategy often induces confirmation bias and suffers from sparse rewards, limiting the overall performance. In this work, we propose subgroup-specific step-wise confidence-weighted pseudo-label estimation (SCOPE), a framework integrating model confidence and dynamic subgroup partitioning to address these issues. Specifically, SCOPE integrates the proposed step-wise confidence into pseudo-label deduction, prioritizing high-quality reasoning paths over simple frequency counts. Furthermore, it dynamically partitions the pool of candidate outputs into independent subgroups by balancing reasoning quality against exploration diversity. By deriving local consensus via repeated sampling for each subgroup, SCOPE provides diverse supervision targets to encourage broader exploration. We conduct experiments across various models and benchmarks; the results show that SCOPE consistently outperforms recent baselines. Notably, SCOPE achieves relative improvements of 13.1% on the challenging AIME 2025 and 8.1% on AMC. The code is released at this https URL.
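The contrast between plain majority voting and confidence-weighted pseudo-labels can be made concrete with a toy sketch. This is not the SCOPE implementation: the function names, the use of a geometric mean over step confidences as the path-quality score, and the sample data are all assumptions made for illustration, and the subgroup partitioning step is omitted.

```python
from collections import Counter, defaultdict
import math


def majority_vote(answers):
    """Pseudo-label = most frequent final answer (the baseline strategy)."""
    return Counter(answers).most_common(1)[0][0]


def confidence_weighted_vote(samples):
    """Pseudo-label = answer with the largest summed path confidence.

    `samples` is a list of (answer, step_confidences) pairs. The path score
    is the geometric mean of the per-step confidences; this choice is an
    assumption for the sketch, not the scoring rule from the paper.
    """
    scores = defaultdict(float)
    for answer, step_conf in samples:
        path_conf = math.exp(sum(math.log(c) for c in step_conf) / len(step_conf))
        scores[answer] += path_conf
    return max(scores, key=scores.get)


# Hypothetical rollouts: three low-confidence paths agree on "12",
# two high-confidence paths agree on "15".
samples = [
    ("12", [0.55, 0.50, 0.45]),
    ("12", [0.50, 0.40, 0.50]),
    ("12", [0.45, 0.50, 0.40]),
    ("15", [0.95, 0.90, 0.92]),
    ("15", [0.93, 0.94, 0.90]),
]

print(majority_vote([a for a, _ in samples]))
print(confidence_weighted_vote(samples))
```

In this toy case the two rules disagree: frequency favors the answer reached by more rollouts, while the weighted rule favors the answer reached by the more confident reasoning paths.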
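[NLP-35] From Isolation to Entanglement: When Do Interpretability Methods Identify and Disentangle Known Concepts?
【Quick read】: This paper examines a weakness in how common featurization methods, such as sparse autoencoders (SAEs) and sparse probes, are evaluated: representations of causally relevant concepts (sentiment, domain, tense, and so on) are usually assessed in isolation and under implicit independence assumptions that may not hold in real data, which can misstate what interpretability methods actually achieve. The key to the solution is a controlled multi-concept evaluation setting that systematically varies the correlation strength between textual concepts and measures how featurizers behave as correlations increase. The study finds a one-to-many relationship from concepts to features: each feature corresponds to at most one concept, but a concept is spread over many features. In steering experiments, however, features often affect several concepts at once even though they act on disjoint subspaces. Correlational metrics are therefore not sufficient to establish independence under steering, and affecting disjoint subspaces is not sufficient for concept selectivity. The work underscores the importance of compositional evaluations in interpretability research.
Link: https://arxiv.org/abs/2512.15134
Authors: Aaron Mueller, Andrew Lee, Shruti Joshi, Ekdeep Singh Lubana, Dhanya Sridhar, Patrik Reizinger
Affiliations: Boston University; Harvard University; Mila – Quebec AI Institute; Goodfire; University of Tübingen
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: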
Abstract:A central goal of interpretability is to recover representations of causally relevant concepts from the activations of neural networks. The quality of these concept representations is typically evaluated in isolation, and under implicit independence assumptions that may not hold in practice. Thus, it is unclear whether common featurization methods - including sparse autoencoders (SAEs) and sparse probes - recover disentangled representations of these concepts. This study proposes a multi-concept evaluation setting where we control the correlations between textual concepts, such as sentiment, domain, and tense, and analyze performance under increasing correlations between them. We first evaluate the extent to which featurizers can learn disentangled representations of each concept under increasing correlational strengths. We observe a one-to-many relationship from concepts to features: features correspond to no more than one concept, but concepts are distributed across many features. Then, we perform steering experiments, measuring whether each concept is independently manipulable. Even when trained on uniform distributions of concepts, SAE features generally affect many concepts when steered, indicating that they are neither selective nor independent; nonetheless, features affect disjoint subspaces. These results suggest that correlational metrics for measuring disentanglement are generally not sufficient for establishing independence when steering, and that affecting disjoint subspaces is not sufficient for concept selectivity. These results underscore the importance of compositional evaluations in interpretability research.
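[NLP-36] Quantifying Return on Security Controls in LLM Systems
【Quick read】: This paper addresses the lack of methods for quantifying the benefit of safeguards when large language models (LLMs) are deployed in security-critical settings. The core challenge is to express attack success rates, financial risk, and control costs in monetary terms so that decision makers can compare defenses. The key to the solution is a decision-oriented framework and reproducible methodology: attack success probabilities for each (vulnerability, control) pair are estimated with Laplace's Rule of Succession and combined with triangular loss distributions calibrated from public breach-cost data in 10,000-run Monte Carlo simulations, producing loss exceedance curves and expected losses. From these the authors compute return-on-control (RoC), which allows an economic comparison of attribute-based access control (ABAC), named entity recognition (NER) redaction, and NeMo Guardrails. In the experiments, ABAC and NER redaction substantially reduce expected loss and achieve high RoC, while NeMo Guardrails provides only marginal benefit.
Link: https://arxiv.org/abs/2512.15081
Authors: Richard Helder Moulton, Austin O’Brien, John D. Hastings
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 13 pages, 9 figures, 3 tables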
Abstract:Although large language models (LLMs) are increasingly used in security-critical workflows, practitioners lack quantitative guidance on which safeguards are worth deploying. This paper introduces a decision-oriented framework and reproducible methodology that together quantify residual risk, convert adversarial probe outcomes into financial risk estimates and return-on-control (RoC) metrics, and enable monetary comparison of layered defenses for LLM-based systems. A retrieval-augmented generation (RAG) service is instantiated using the DeepSeek-R1 model over a corpus containing synthetic personally identifiable information (PII), and subjected to automated attacks with Garak across five vulnerability classes: PII leakage, latent context injection, prompt injection, adversarial attack generation, and divergence. For each (vulnerability, control) pair, attack success probabilities are estimated via Laplace’s Rule of Succession and combined with loss triangle distributions, calibrated from public breach-cost data, in 10,000-run Monte Carlo simulations to produce loss exceedance curves and expected losses. Three widely used mitigations, attribute-based access control (ABAC); named entity recognition (NER) redaction using Microsoft Presidio; and NeMo Guardrails, are then compared to a baseline RAG configuration. The baseline system exhibits very high attack success rates (= 0.98 for PII, latent injection, and prompt injection), yielding a total simulated expected loss of 313k per attack scenario. ABAC collapses success probabilities for PII and prompt-related attacks to near zero and reduces the total expected loss by ~94%, achieving an RoC of 9.83. NER redaction likewise eliminates PII leakage and attains an RoC of 5.97, while NeMo Guardrails provides only marginal benefit (RoC of 0.05).
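The estimation chain described in the abstract (Laplace's Rule of Succession, triangular loss distribution, Monte Carlo expected loss, return-on-control) can be sketched in a few lines. Everything numeric here is hypothetical: the probe counts, the loss range, and the control cost are made up, and the RoC formula used (net loss reduction divided by control cost) is an assumption, since the abstract does not state the paper's exact definition.

```python
import random


def laplace_success_probability(successes, trials):
    """Laplace's Rule of Succession: (s + 1) / (n + 2)."""
    return (successes + 1) / (trials + 2)


def simulate_expected_loss(p_success, low, mode, high, runs=10_000, seed=0):
    """Monte Carlo expected loss per attack scenario.

    Each run succeeds with probability p_success; a successful attack draws
    a loss from a triangular distribution on [low, high] with the given mode.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        if rng.random() < p_success:
            total += rng.triangular(low, high, mode)
    return total / runs


def return_on_control(baseline_loss, residual_loss, control_cost):
    """Assumed RoC definition: (loss avoided - control cost) / control cost."""
    return (baseline_loss - residual_loss - control_cost) / control_cost


# Hypothetical probe outcomes: 49/50 attacks succeed without the control,
# 0/50 succeed with it.
p_baseline = laplace_success_probability(49, 50)
p_controlled = laplace_success_probability(0, 50)

# Hypothetical loss range in dollars (low, mode, high).
baseline_loss = simulate_expected_loss(p_baseline, 10_000, 50_000, 200_000)
controlled_loss = simulate_expected_loss(p_controlled, 10_000, 50_000, 200_000)

print(round(baseline_loss), round(controlled_loss))
print(round(return_on_control(baseline_loss, controlled_loss, 8_000), 2))
```

Note that the rule never returns exactly 0 or 1, so a control that blocked every probe still carries a small residual probability, which is what keeps the residual expected loss above zero.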
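[NLP-37] The Semantic Illusion: Certified Limits of Embedding-Based Hallucination Detection in RAG Systems
【Quick read】: This paper addresses the reliability of hallucination detection in Retrieval-Augmented Generation (RAG) systems, in particular the high false positive rate (FPR) of embedding-based methods on real data. The key to the solution is applying conformal prediction, whose finite-sample coverage guarantees allow detection capability to be quantified precisely. With about 600 calibration examples the method reaches 94% coverage at 0% FPR on synthetic hallucinations, but on three real hallucination benchmarks embedding-based methods show FPRs of 50% to 100%, whereas GPT-4 as a judge has an FPR of only 7%, which shows the task is solvable through reasoning. The author calls this the "semantic illusion": hallucinated content stays semantically similar to the source documents while containing factual errors that embeddings cannot see.
Link: https://arxiv.org/abs/2512.15068
Authors: Debu Sinha
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 12 pages, 2 figures, 6 tables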
Abstract:Retrieval-Augmented Generation (RAG) systems remain susceptible to hallucinations despite grounding in retrieved evidence. Current detection methods rely on semantic similarity and natural language inference (NLI), but their fundamental limitations have not been rigorously characterized. We apply conformal prediction to hallucination detection, providing finite-sample coverage guarantees that enable precise quantification of detection capabilities. Using calibration sets of approximately 600 examples, we achieve 94% coverage with 0% false positive rate on synthetic hallucinations (Natural Questions). However, on three real hallucination benchmarks spanning multiple LLMs (GPT-4, ChatGPT, GPT-3, Llama-2, Mistral), embedding-based methods - including state-of-the-art OpenAI text-embedding-3-large and cross-encoder models - exhibit unacceptable false positive rates: 100% on HaluEval, 88% on RAGTruth, and 50% on WikiBio. Crucially, GPT-4 as an LLM judge achieves only 7% FPR (95% CI: [3.4%, 13.7%]) on the same data, proving the task is solvable through reasoning. We term this the “semantic illusion”: semantically plausible hallucinations preserve similarity to source documents while introducing factual errors invisible to embeddings. This limitation persists across embedding architectures, LLM generators, and task types, suggesting embedding-based detection is insufficient for production RAG deployment.
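A generic split-conformal calibration shows how a coverage guarantee and a false positive rate can pull apart. This is a textbook-style sketch, not the paper's procedure: the integer scores are toy data, and the convention that a higher score means "more likely hallucinated" is an assumption.

```python
import math


def conformal_threshold(halluc_scores, alpha):
    """Threshold t such that a new exchangeable hallucination satisfies
    score >= t with probability at least 1 - alpha.

    Uses the k-th smallest calibration score with k = floor(alpha * (n + 1)).
    If k < 1 the calibration set is too small, so everything is flagged.
    """
    n = len(halluc_scores)
    k = math.floor(alpha * (n + 1))
    if k < 1:
        return float("-inf")
    return sorted(halluc_scores)[k - 1]


def false_positive_rate(faithful_scores, threshold):
    """Fraction of faithful answers that would be flagged as hallucinations."""
    flagged = sum(s >= threshold for s in faithful_scores)
    return flagged / len(faithful_scores)


# Toy calibration scores for known hallucinations (higher = more suspicious).
calibration = list(range(5, 100, 5))  # 19 scores: 5, 10, ..., 95
threshold = conformal_threshold(calibration, alpha=0.25)

# Toy scores for faithful answers that overlap with the hallucinated ones.
faithful = [10, 20, 30, 40]
print(threshold, false_positive_rate(faithful, threshold))
```

The guarantee only constrains how many hallucinations are caught. When faithful and hallucinated answers receive overlapping scores, as the abstract reports for embedding similarity, the threshold needed for coverage flags many faithful answers too.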
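[NLP-38] The Meta-Prompting Protocol: Orchestrating LLMs via Adversarial Feedback Loops
【Quick read】: This paper addresses the lack of deterministic guarantees as large language models (LLMs) move from stochastic chat interfaces to reliable software components; heuristic prompt engineering cannot provide the predictability and stability that mission-critical applications require. The key to the solution is the Meta-Prompting Protocol, built around the "Adversarial Trinity", an architecture comprising a Generator (P), an Auditor (A), and an Optimizer (O). It treats natural language instructions as differentiable variables in a semantic computation graph and uses textual critiques as gradient signals, which lets the system optimize itself, mitigate hallucination, and prevent model collapse, and lays a theoretical foundation for "Observable Software Engineering" in the era of probabilistic computing.
Link: https://arxiv.org/abs/2512.15053
Authors: Fanzhe Fu
Affiliations: Zhejiang University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: 6 pages, 2 figures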
Abstract:The transition of Large Language Models (LLMs) from stochastic chat interfaces to reliable software components necessitates a fundamental re-engineering of interaction paradigms. Current methodologies, predominantly heuristic-based “prompt engineering,” fail to provide the deterministic guarantees required for mission-critical applications. We introduce the Meta-Prompting Protocol, a rigorous theoretical framework that formalizes the orchestration of LLMs as a programmable, self-optimizing system. Central to this protocol is the Adversarial Trinity, a tripartite topology comprising a Generator (P), an Auditor (A), and an Optimizer (O). By treating natural language instructions as differentiable variables within a semantic computation graph and utilizing textual critiques as gradients, this architecture mitigates hallucination and prevents model collapse. We demonstrate the theoretical viability of this approach using declarative programming paradigms (DSPy) and automatic textual differentiation (TextGrad), establishing a foundation for “Observable Software Engineering” in the era of probabilistic computing.
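[NLP-39] SGM: Safety Glasses for Multimodal Large Language Models via Neuron-Level Detoxification ACL2026
【Quick read】: This paper addresses the safety risks that arise when multimodal large language models (MLLMs) inherit toxic, biased, and NSFW content from pretraining corpora, especially under adversarial triggers that existing training-free detoxification methods struggle to handle. The key to the solution is SGM (Safety Glasses for Multimodal models), a white-box, neuron-level intervention that uses expertise-weighted soft suppression to identify and recalibrate a small set of toxic expert neurons, neutralizing harmful cross-modal activations without any parameter updates. The method is interpretable and low-cost, can be combined with existing detoxification techniques for stronger protection (denoted SGM*), and reduces harmful output rates from 48.2% to 2.5% under standard and adversarial conditions while preserving fluency and multimodal reasoning.
Link: https://arxiv.org/abs/2512.15052
Authors: Hongbo Wang, MaungMaung AprilPyone, Isao Echizen
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under Review for ACL 2026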
Abstract:Disclaimer: Samples in this paper may be harmful and cause discomfort. Multimodal large language models (MLLMs) enable multimodal generation but inherit toxic, biased, and NSFW signals from weakly curated pretraining corpora, causing safety risks, especially under adversarial triggers that late, opaque training-free detoxification methods struggle to handle. We propose SGM, a white-box neuron-level multimodal intervention that acts like safety glasses for toxic neurons: it selectively recalibrates a small set of toxic expert neurons via expertise-weighted soft suppression, neutralizing harmful cross-modal activations without any parameter updates. We establish MM-TOXIC-QA, a multimodal toxicity evaluation framework, and compare SGM with existing detoxification techniques. Experiments on open-source MLLMs show that SGM mitigates toxicity in standard and adversarial conditions, cutting harmful rates from 48.2% to 2.5% while preserving fluency and multimodal reasoning. SGM is extensible, and its combined defenses, denoted as SGM*, integrate with existing detoxification methods for stronger safety performance, providing an interpretable, low-cost solution for toxicity-controlled multimodal generation.
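[NLP-40] HERO: Hierarchical Traversable 3D Scene Graphs for Embodied Navigation Among Movable Obstacles
【Quick read】: This paper addresses a limitation of existing 3D scene graphs (3DSGs) in embodied navigation: they rely on a static-world assumption, define traversable space only from the static spatial layout, and treat interactable obstacles as non-traversable, which limits an agent's reachability and efficiency in real, complex environments. The key to the solution is HERO, a framework that builds hierarchical traversable 3DSGs and redefines traversability by modeling operable obstacles as potential pathways, explicitly capturing their physical interactivity, functional semantics, and the scene's relational hierarchy, which substantially improves navigation efficiency and reachability.
Link: https://arxiv.org/abs/2512.15047
Authors: Yunheng Wang, Yixiao Feng, Yuetong Fang, Shuning Zhang, Tan Jing, Jian Li, Xiangrui Jiang, Renjing Xu
Affiliations: The Hong Kong University of Science and Technology (Guangzhou)
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: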
Abstract:3D Scene Graphs (3DSGs) constitute a powerful representation of the physical world, distinguished by their abilities to explicitly model the complex spatial, semantic, and functional relationships between entities, rendering a foundational understanding that enables agents to interact intelligently with their environment and execute versatile behaviors. Embodied navigation, as a crucial component of such capabilities, leverages the compact and expressive nature of 3DSGs to enable long-horizon reasoning and planning in complex, large-scale environments. However, prior works rely on a static-world assumption, defining traversable space solely based on static spatial layouts and thereby treating interactable obstacles as non-traversable. This fundamental limitation severely undermines their effectiveness in real-world scenarios, leading to limited reachability, low efficiency, and inferior extensibility. To address these issues, we propose HERO, a novel framework for constructing Hierarchical Traversable 3DSGs, that redefines traversability by modeling operable obstacles as pathways, capturing their physical interactivity, functional semantics, and the scene’s relational hierarchy. The results show that, relative to its baseline, HERO reduces PL by 35.1% in partially obstructed environments and increases SR by 79.4% in fully obstructed ones, demonstrating substantially higher efficiency and reachability.
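[NLP-41] DASH: Dialogue-Aware Similarity and Handshake Recognition for Topic Segmentation in Public-Channel Conversations AAAI
【Quick read】: This paper addresses Dialogue Topic Segmentation (DTS) in task-oriented public-channel communications, such as maritime VHF dialogues, where informal speech and implicit transitions make segmentation hard and traditional methods suffer from low accuracy and poor generalization. The key to the solution is DASH-DTS, a framework based on large language models (LLMs) with three core contributions: (1) topic shift detection via dialogue handshake recognition; (2) contextual enhancement through similarity-guided example selection; and (3) generation of selective positive and negative samples to improve discrimination and robustness. The framework also provides interpretable reasoning and a confidence score for each segment, and it achieves state-of-the-art segmentation accuracy on VHF-Dial, the first public dataset of real-world maritime VHF communications, laying a foundation for stable monitoring and decision support in operational dialogues.
Link: https://arxiv.org/abs/2512.15042
Authors: Sijin Sun, Liangbin Zhao, Ming Deng, Xiuju Fu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Accepted by AAAIW2026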
Abstract:Dialogue Topic Segmentation (DTS) is crucial for understanding task-oriented public-channel communications, such as maritime VHF dialogues, which feature informal speech and implicit transitions. To address the limitations of traditional methods, we propose DASH-DTS, a novel LLM-based framework. Its core contributions are: (1) topic shift detection via dialogue handshake recognition; (2) contextual enhancement through similarity-guided example selection; and (3) the generation of selective positive and negative samples to improve model discrimination and robustness. Additionally, we release VHF-Dial, the first public dataset of real-world maritime VHF communications, to advance research in this domain. DASH-DTS provides interpretable reasoning and confidence scores for each segment. Experimental results demonstrate that our framework achieves state-of-the-art segmentation accuracy on both VHF-Dial and standard benchmarks, establishing a strong foundation for stable monitoring and decision support in operational dialogues.
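[NLP-42] DreamPRM-Code: Function-as-Step Process Reward Model with Label Correction for LLM Coding
【Quick read】: This paper addresses the limited effectiveness of Process Reward Models (PRMs) in code generation, which stems from the lack of meaningful step decompositions in code and from noise in Monte-Carlo-generated partial labels. The core of the solution is DreamPRM-Code, which uses Chain-of-Function prompting to treat functions as reasoning steps, enabling modular code generation and PRM training and application analogous to mathematical reasoning. It also introduces a meta-learning-based label correction mechanism that uses clean final-solution unit-test labels and bi-level optimization to refine noisy intermediate labels. Under test-time scaling, this reaches a pass@1 rate of 80.9% on LiveCodeBench, surpassing OpenAI o4-mini.
Link: https://arxiv.org/abs/2512.15000
Authors: Ruiyi Zhang, Peijia Qin, Qi Cao, Pengtao Xie
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: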
Abstract:Process Reward Models (PRMs) have become essential for improving Large Language Models (LLMs) via test-time scaling, yet their effectiveness in coding remains limited due to the lack of meaningful step decompositions in code and the noise of Monte-Carlo-generated partial labels. We propose DreamPRM-Code, a coding-focused PRM that treats functions as reasoning steps using a Chain-of-Function prompting strategy to induce modular code generation, enabling PRM training and application analogous to mathematical reasoning tasks. To address label noise, DreamPRM-Code introduces a meta-learning-based correction mechanism that leverages clean final-solution unit-test labels and performs bi-level optimization to refine intermediate labels. Applied to test-time scaling, DreamPRM-Code achieves state-of-the-art performance on LiveCodeBench with a pass@1 rate of 80.9, surpassing OpenAI o4-mini.
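[NLP-43] Evaluating Large Language Models on Multimodal Chemistry Olympiad Exams
【Quick read】: This paper addresses the difficulties multimodal large language models (MLLMs) face in chemistry reasoning, particularly in fusing visual and textual information such as molecular structures and symbolic diagrams. The key to the solution is a curated multimodal benchmark of U.S. National Chemistry Olympiad (USNCO) questions, used to systematically evaluate 40 open-source and proprietary MLLMs. Many models show weak modality fusion: in some cases removing the image improves accuracy, which indicates problems in vision-language alignment. The study further shows that Chain-of-Thought prompting consistently improves accuracy and visual grounding, as supported by ablation studies and occlusion-based interpretability, and offers directions for building more robust and interpretable domain-specific multimodal AI systems.
Link: https://arxiv.org/abs/2512.14989
Authors: Yiming Cui, Xin Yao, Yuxuan Qin, Xin Li, Shijin Wang, Guoping Hu
Affiliations: State Key Laboratory of Cognitive Intelligence, Hefei, China; iFLYTEK AI Research, Beijing, China
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Published at Communications Chemistry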
Abstract:Multimodal scientific reasoning remains a significant challenge for large language models (LLMs), particularly in chemistry, where problem-solving relies on symbolic diagrams, molecular structures, and structured visual data. Here, we systematically evaluate 40 proprietary and open-source multimodal LLMs, including GPT-5, o3, Gemini-2.5-Pro, and Qwen2.5-VL, on a curated benchmark of Olympiad-style chemistry questions drawn from over two decades of U.S. National Chemistry Olympiad (USNCO) exams. These questions require integrated visual and textual reasoning across diverse modalities. We find that many models struggle with modality fusion, where in some cases, removing the image even improves accuracy, indicating misalignment in vision-language integration. Chain-of-Thought prompting consistently enhances both accuracy and visual grounding, as demonstrated through ablation studies and occlusion-based interpretability. Our results reveal critical limitations in the scientific reasoning abilities of current MLLMs, providing actionable strategies for developing more robust and interpretable multimodal systems in chemistry. This work provides a timely benchmark for measuring progress in domain-specific multimodal AI and underscores the need for further advances at the intersection of artificial intelligence and scientific reasoning.
[NLP-44] Prompt Repetition Improves Non-Reasoning LLMs
【Quick read】: This paper asks how to improve large language model performance when reasoning is not used. The key to the solution is the simple step of repeating the input prompt, which improves the performance of popular models (Gemini, GPT, Claude, and Deepseek) without increasing the number of generated tokens or latency.
Link: https://arxiv.org/abs/2512.14982
Authors: Yaniv Leviathan, Matan Kalman, Yossi Matias
Affiliations: Google Research
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:When not using reasoning, repeating the input prompt improves performance for popular models (Gemini, GPT, Claude, and Deepseek) without increasing the number of generated tokens or latency.
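As a minimal illustration of the idea, the prompt can be duplicated before it is sent to a model. The helper name, the blank-line separator, and the default of two copies are assumptions for this sketch; the abstract does not specify the exact repetition format.

```python
def repeat_prompt(prompt: str, times: int = 2, separator: str = "\n\n") -> str:
    """Return the prompt repeated `times` times, joined by `separator`.

    The separator and the repetition count are illustrative choices,
    not values taken from the paper.
    """
    if times < 1:
        raise ValueError("times must be at least 1")
    return separator.join([prompt] * times)


question = "Which of these numbers is prime: 21, 23, 25?"
print(repeat_prompt(question))
```

Only the input grows; the response length is unchanged, which is consistent with the abstract's claim that the number of generated tokens does not increase.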
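[NLP-45] Cross-Tokenizer Likelihood Scoring Algorithms for Language Model Distillation
【Quick read】: This paper addresses knowledge distillation when the teacher and student language models (LMs) use different tokenizers: their vocabularies do not match, so next-token likelihood ratios cannot be computed directly. The core of the solution is to uncover an implicit recursive structure in the widely used Byte-Pair Encoding (BPE) algorithm and to build on it a probabilistic framework for evaluating sequence likelihoods under a vocabulary that differs from the teacher's. When the student vocabulary is a subset of the teacher vocabulary, the framework computes exact next-token probabilities with only O(1) model evaluations per token, which enables distillation with a smaller memory footprint. For the general case, it provides a rigorous lossless procedure together with a fast approximation that keeps large-vocabulary settings practical, and on mathematical reasoning (GSM8K) it improves accuracy over the current state of the art.
Link: https://arxiv.org/abs/2512.14954
Authors: Buu Phan, Ashish Khisti, Karen Ullrich
Affiliations: University of Toronto; Meta AI
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: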
Abstract:Computing next-token likelihood ratios between two language models (LMs) is a standard task in training paradigms such as knowledge distillation. Since this requires both models to share the same probability space, it becomes challenging when the teacher and student LMs use different tokenizers, for instance, when edge-device deployment necessitates a smaller vocabulary size to lower memory overhead. In this work, we address this vocabulary misalignment problem by uncovering an implicit recursive structure in the commonly deployed Byte-Pair Encoding (BPE) algorithm and utilizing it to create a probabilistic framework for cross-tokenizer likelihood scoring. Our method enables sequence likelihood evaluation for vocabularies different from the teacher model native tokenizer, addressing two specific scenarios: when the student vocabulary is a subset of the teacher vocabulary, and the general case where it is arbitrary. In the subset regime, our framework computes exact likelihoods and provides next-token probabilities for sequential sampling with only O(1) model evaluations per token. When used for distillation, this yields up to a 12% reduction in memory footprint for the Qwen2.5-1.5B model while also improving baseline performance up to 4% on the evaluated tasks. For the general case, we introduce a rigorous lossless procedure that leverages BPE recursive structure, complemented by a fast approximation that keeps large-vocabulary settings practical. Applied to distillation for mathematical reasoning, our approach improves GSM8K accuracy by more than 2% over the current state of the art.
[NLP-46] Parameter Efficient Multimodal Instruction Tuning for Romanian Vision Language Models
【Quick read】: This paper addresses data scarcity for low-resource languages in multimodal natural language processing, focusing on Romanian, as a step toward democratizing generative AI. The key to the solution is building and extending Romanian visual question answering (VQA) data: the widely used Flickr30k dataset is translated into Romanian and then extended for VQA with open-source large language models (LLMs). Vision-language models (VLMs) from three widely used families (LLaMA 3.2, LLaVA 1.6, and Qwen2) are then fine-tuned with the parameter-efficient LoRA method. The fine-tuned models improve on Romanian VQA and on image description generation, a task they were not trained on. The seven-billion-parameter Qwen2-VL-RoVQA obtains the top scores on both tasks, with BERTScore F1 gains of +6.05% and +2.61%, and grammatical errors drop substantially.
Link: https://arxiv.org/abs/2512.14926
Authors: George-Andrei Dima, Dumitru-Clementin Cercel
Affiliations: National University of Science and Technology POLITEHNICA Bucharest
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Focusing on low-resource languages is an essential step toward democratizing generative AI. In this work, we contribute to reducing the multimodal NLP resource gap for Romanian. We translate the widely known Flickr30k dataset into Romanian and further extend it for visual question answering by leveraging open-source LLMs. We demonstrate the usefulness of our datasets by fine-tuning open-source VLMs on Romanian visual question answering. We select VLMs from three widely used model families: LLaMA 3.2, LLaVA 1.6, and Qwen2. For fine-tuning, we employ the parameter-efficient LoRA method. Our models show improved Romanian capabilities in visual QA, as well as on tasks they were not trained on, such as Romanian image description generation. The seven-billion-parameter Qwen2-VL-RoVQA obtains top scores on both tasks, with improvements of +6.05% and +2.61% in BERTScore F1 over its original version. Finally, the models show substantial reductions in grammatical errors compared to their original forms, indicating improvements not only in language understanding but also in Romanian fluency.
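[NLP-47] Multiscale Aggregated Hierarchical Attention (MAHA): A Game Theoretic and Optimization Driven Approach to Efficient Contextual Modeling in Large Language Models
【Quick read】: This paper addresses the scaling bottleneck that the quadratic computational complexity of Multi-Head Self-Attention (MHSA) creates for long-sequence tasks. Sparse and linearized attention reduce cost but often lose global dependencies or fail to capture multiscale semantic granularity. The key to the solution is Multiscale Aggregated Hierarchical Attention (MAHA), which decomposes the input sequence into hierarchical scales via learnable downsampling operators and models the fusion of scale-specific attention matrices as a resource allocation problem, solved by convex optimization or a Nash-equilibrium-based game-theoretic approach, to balance local detail against global context. The paper reports an 81% reduction in computational cost at a sequence length of 4096 compared with standard attention.
Link: https://arxiv.org/abs/2512.14925
Authors: Caner Erden
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: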
Abstract:The quadratic computational complexity of Multi-Head Self-Attention (MHSA) remains a fundamental bottleneck in scaling Large Language Models (LLMs) for long-context tasks. While sparse and linearized attention mechanisms attempt to mitigate this, they often compromise the representation of global dependencies or fail to capture multiscale semantic granularity effectively. In this paper, we propose Multiscale Aggregated Hierarchical Attention (MAHA), a novel architectural framework that reformulates the attention mechanism through hierarchical decomposition and mathematically rigorous aggregation. Unlike conventional approaches that treat token interactions at a single resolution, MAHA dynamically partitions the input sequence into hierarchical scales via learnable downsampling operators. The core innovation lies in its aggregation strategy: we model the fusion of scale-specific attention matrices as a resource allocation problem, solved via a convex optimization framework or a Nash equilibrium-based game-theoretic approach. This ensures a theoretically optimal balance between local nuance and global context fidelity. Implemented within a hybrid dilated-convolutional transformer backbone, MAHA utilizes differentiable optimization layers to enable end-to-end training. Experimental evaluations demonstrate that MAHA achieves superior scalability; empirical FLOPs analysis confirms an 81% reduction in computational cost at a sequence length of 4096 compared to standard attention. This work bridges the gap between optimization theory and sequence modeling, offering a scalable solution for next-generation LLMs.
[NLP-48] DrugRAG: Enhancing Pharmacy LLM Performance Through A Novel Retrieval-Augmented Generation Pipeline
【Quick read】: This paper addresses the insufficient accuracy of large language models (LLMs) on pharmacy licensure-style question answering, especially the weak performance of smaller models. The core of the solution is DrugRAG, an external knowledge integration method: a three-step retrieval-augmented generation (RAG) pipeline retrieves structured drug knowledge from validated sources and injects it into the model prompt as evidence-based context. The approach requires no changes to model architecture or parameters, and it improved the accuracy of every tested model, with gains of up to 21 percentage points.
Link: https://arxiv.org/abs/2512.14896
Authors: Houman Kazemzadeh, Kiarash Mokhtari Dizaji, Seyed Reza Tavakoli, Farbod Davoodi, MohammadReza KarimiNejad, Parham Abed Azad, Ali Sabzi, Armin Khosravi, Siavash Ahmadi, Mohammad Hossein Rohban, Glolamali Aminian, Tahereh Javaheri
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 11 pages, 2 figures, 3 tables
Abstract:Objectives: To evaluate large language model (LLM) performance on pharmacy licensure-style question-answering (QA) tasks and develop an external knowledge integration method to improve their accuracy. Methods: We benchmarked eleven existing LLMs with varying parameter sizes (8 billion to 70+ billion) using a 141-question pharmacy dataset. We measured baseline accuracy for each model without modification. We then developed a three-step retrieval-augmented generation (RAG) pipeline, DrugRAG, that retrieves structured drug knowledge from validated sources and augments model prompts with evidence-based context. This pipeline operates externally to the models, requiring no changes to model architecture or parameters. Results: Baseline accuracy ranged from 46% to 92%, with GPT-5 (92%) and o3 (89%) achieving the highest scores. Models with fewer than 8 billion parameters scored below 50%. DrugRAG improved accuracy across all tested models, with gains ranging from 7 to 21 percentage points (e.g., Gemma 3 27B: 61% to 71%, Llama 3.1 8B: 46% to 67%) on the 141-item benchmark. Conclusion: We demonstrate that external structured drug knowledge integration through DrugRAG measurably improves LLM accuracy on pharmacy tasks without modifying the underlying models. This approach provides a practical pipeline for enhancing pharmacy-focused AI applications with evidence-based information.
Cite as: arXiv:2512.14896 [cs.CL] (or arXiv:2512.14896v1 [cs.CL] for this version). https://doi.org/10.48550/arXiv.2512.14896. Submission history: From Houman Kazemzadeh, [v1] Tue, 16 Dec 2025 20:19:23 UTC (467 KB)
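The abstract describes a three-step pipeline that sits outside the model but does not spell out the steps. As a rough sketch only, one plausible decomposition is (1) spot drug names in the question, (2) look up structured facts, and (3) prepend them to the prompt. Everything below (the `DRUG_FACTS` table, the function names, and the prompt wording) is hypothetical and is not taken from the paper.

```python
# Hypothetical sketch of an external retrieve-then-augment pipeline.
# DRUG_FACTS stands in for a validated drug knowledge source; its
# contents are placeholders, not real reference data.
DRUG_FACTS = {
    "druga": ["DrugA: placeholder fact 1", "DrugA: placeholder fact 2"],
    "drugb": ["DrugB: placeholder fact 1"],
}


def extract_drug_terms(question):
    """Step 1 (assumed): find known drug names mentioned in the question."""
    words = {w.strip(".,?!()").lower() for w in question.split()}
    return sorted(words & set(DRUG_FACTS))


def retrieve_facts(terms):
    """Step 2 (assumed): fetch structured facts for each matched drug."""
    return [fact for term in terms for fact in DRUG_FACTS[term]]


def build_prompt(question, facts):
    """Step 3 (assumed): prepend retrieved evidence to the model prompt."""
    if not facts:
        return question
    evidence = "\n".join(f"- {fact}" for fact in facts)
    return f"Evidence:\n{evidence}\n\nQuestion: {question}"


def augment(question):
    """Run all three steps; the model itself is never modified."""
    return build_prompt(question, retrieve_facts(extract_drug_terms(question)))
```

Calling `augment("Can DrugA be taken with DrugB?")` returns the question with an evidence block in front of it, while a question that mentions no known drug passes through unchanged.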
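[NLP-49] Integrating Large Language Models and Knowledge Graphs to Capture Political Viewpoints in News Media
[Quick read]: This paper addresses the identification of viewpoint diversity and stance in news media, in order to assess whether the media offer a balanced and fair account of public debate. The core challenge is to automatically identify and classify claims reflecting different ideological positions across large news corpora, revealing how viewpoints are distributed in news coverage. The key to the solution is improving a previously proposed hybrid human-machine pipeline in two ways: fine-tuning Large Language Models (LLMs) to raise viewpoint classification accuracy, and using semantic information from Wikidata to enrich the representation of the relevant actors in each claim, so that the stance behind a claim is captured more accurately. Experiments show that each mechanism improves performance on its own, but that the combination works best, particularly with LLMs that support long inputs.
Link: https://arxiv.org/abs/2512.14887
Authors: Massimiliano Fadda,Enrico Motta,Francesco Osborne,Diego Reforgiato Recupero,Angelo Salatino
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: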
Abstract:News sources play a central role in democratic societies by shaping political and social discourse through specific topics, viewpoints and voices. Understanding these dynamics is essential for assessing whether the media landscape offers a balanced and fair account of public debate. In earlier work, we introduced a pipeline that, given a news corpus, i) uses a hybrid human-machine approach to identify the range of viewpoints expressed about a given topic, and ii) classifies relevant claims with respect to the identified viewpoints, defined as sets of semantically and ideologically congruent claims (e.g., positions arguing that immigration positively impacts the UK economy). In this paper, we improve this pipeline by i) fine-tuning Large Language Models (LLMs) for viewpoint classification and ii) enriching claim representations with semantic descriptions of relevant actors drawn from Wikidata. We evaluate our approach against alternative solutions on a benchmark centred on the UK immigration debate. Results show that while both mechanisms independently improve classification performance, their integration yields the best results, particularly when using LLMs capable of processing long inputs.
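[NLP-50] Task Matrices: Linear Maps for Cross-Model Finetuning Transfer NEURIPS
[Quick read]: This paper asks whether large vision and language models exhibit cross-layer linear encodings outside the in-context prompting setting, in particular whether such linear representations exist under the more general fine-tuning adaptation regime, which had not previously been demonstrated. The key to the solution is the concept of a "task matrix": a linear transformation from the base model's embedding state to the fine-tuned embedding state. The authors show empirically that augmenting the base model with this linear transformation surpasses linear probes and sometimes approaches fine-tuned performance, and that the approach generalizes across multiple datasets and modalities.
Link: https://arxiv.org/abs/2512.14880
Authors: Darrin O’Brien,Dhikshith Gajulapalli,Eric Xia
Affiliations: Algoverse AI Research; Brown University
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: NeurIPS Unireps 2025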
Abstract:Results in interpretability suggest that large vision and language models learn implicit linear encodings when models are biased by in-context prompting. However, the existence of similar linear representations in more general adaptation regimes has not yet been demonstrated. In this work, we develop the concept of a task matrix, a linear transformation from a base to finetuned embedding state. We demonstrate that for vision and text models and ten different datasets, a base model augmented with a task matrix achieves results surpassing linear probes, sometimes approaching finetuned levels. Our results validate the existence of cross-layer linear encodings between pretrained and finetuned architectures. Moreover, we show that a data-based approximation for such encodings is both efficient and generalizable to multiple domains. We make our implementation publicly available.
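To make the idea of a task matrix concrete, here is a minimal sketch that fits a linear map `W` from base embeddings to finetuned embeddings by ordinary least squares. The abstract does not say how the authors estimate the map, so the least-squares fit, the toy 2-D data, and all function names are assumptions for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Swap rows and columns."""
    return [list(col) for col in zip(*a)]


def solve(a, b):
    """Solve a @ w = b for w by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit_task_matrix(base, finetuned):
    """Least-squares W such that base @ W approximates finetuned."""
    xt = transpose(base)
    return solve(matmul(xt, base), matmul(xt, finetuned))


# Toy data: the "finetuned" embeddings are an exact linear map of the base ones.
base = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
true_w = [[2.0, 0.0], [1.0, 3.0]]
finetuned = matmul(base, true_w)
w = fit_task_matrix(base, finetuned)
```

On this toy data the fit recovers `true_w`. With real embeddings the map would only approximate the finetuned states, and how close it gets is the empirical question the paper studies.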
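[NLP-51] Audio MultiChallenge: A Multi-Turn Evaluation of Spoken Dialogue Systems on Natural Human Interaction
[Quick read]: This paper addresses the inadequate evaluation of current end-to-end (E2E) spoken dialogue systems under realistic, natural multi-turn interaction. Existing benchmarks rely mainly on synthetic speech and single-turn tasks, and so cannot fully measure abilities such as inference memory, instruction retention, and self-coherence in complex spoken interaction. The key to the solution is Audio MultiChallenge, an open-source, audio-native multi-turn dialogue benchmark that extends the text-based MultiChallenge framework. It adds a new "Voice Editing" axis that tests robustness to mid-utterance repairs and backtracking, and maps each axis to the audio modality (for example, Audio-Cue challenges that test recall of ambient sounds and paralinguistic signals). Using a hybrid pipeline of audio-native agents and human-in-the-loop review, the authors curate 452 real human conversations with 1,712 instance-specific rubrics. The benchmark exposes model failures at scale while preserving natural disfluencies, providing a reproducible testbed for improving long-range context tracking and audio perception in E2E spoken dialogue systems.
Link: https://arxiv.org/abs/2512.14865
Authors: Advait Gosai,Tyler Vuong,Utkarsh Tyagi,Steven Li,Wenjia You,Miheer Bavare,Arda Uçar,Zhongwang Fang,Brian Jang,Bing Liu,Yunzhong He
Affiliations: Unknown
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: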
Abstract:End-to-end (E2E) spoken dialogue systems are increasingly replacing cascaded pipelines for voice-based human-AI interaction, processing raw audio directly without intermediate transcription. Existing benchmarks primarily evaluate these models on synthetic speech and single-turn tasks, leaving realistic multi-turn conversational ability underexplored. We introduce Audio MultiChallenge, an open-source benchmark to evaluate E2E spoken dialogue systems under natural multi-turn interaction patterns. Building on the text-based MultiChallenge framework, which evaluates Inference Memory, Instruction Retention, and Self Coherence, we introduce a new axis Voice Editing that tests robustness to mid-utterance speech repairs and backtracking. We further augment each axis to the audio modality, such as introducing Audio-Cue challenges for Inference Memory that require recalling ambient sounds and paralinguistic signals beyond semantic content. We curate 452 conversations from 47 speakers with 1,712 instance-specific rubrics through a hybrid audio-native agentic and human-in-the-loop pipeline that exposes model failures at scale while preserving natural disfluencies found in unscripted human speech. Our evaluation of proprietary and open-source models reveals that even frontier models struggle on our benchmark, with Gemini 3 Pro Preview (Thinking), our highest-performing model achieving a 54.65% pass rate. Error analysis shows that models fail most often on our new axes and that Self Coherence degrades with longer audio context. These failures reflect difficulty of tracking edits, audio cues, and long-range context in natural spoken dialogue. Audio MultiChallenge provides a reproducible testbed to quantify them and drive improvements in audio-native multi-turn interaction capability.
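[NLP-52] T5Gemma 2: Seeing, Reading, and Understanding Longer
[Quick read]: This paper addresses the limited performance of lightweight encoder-decoder models on multilingual, multimodal, and long-context tasks, aiming to improve generalization and downstream performance while keeping the models efficient. The key to the solution is twofold. First, it follows the T5Gemma adaptation recipe (via UL2), converting a pretrained decoder-only model into an encoder-decoder model and extending it to the multimodal setting based on Gemma 3. Second, it proposes two efficiency methods: tied word embedding, which shares embeddings to reduce parameter redundancy, and merged attention, which unifies decoder self-attention and cross-attention into a single module. Experiments show strong modeling ability across modalities and long-context tasks, with comparable or better pretraining performance and significantly better post-training performance than the Gemma 3 counterpart.
Link: https://arxiv.org/abs/2512.14856
Authors: Biao Zhang,Paul Suganthan,Gaël Liu,Ilya Philippov,Sahil Dua,Ben Hora,Kat Black,Gus Martins,Omar Sanseviero,Shreya Pathak,Cassidy Hardin,Francesco Visin,Jiageng Zhang,Kathleen Kenealy,Qin Yin,Olivier Lacombe,Armand Joulin,Tris Warkentin,Adam Roberts
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: technical report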
Abstract:We introduce T5Gemma 2, the next generation of the T5Gemma family of lightweight open encoder-decoder models, featuring strong multilingual, multimodal and long-context capabilities. T5Gemma 2 follows the adaptation recipe (via UL2) in T5Gemma – adapting a pretrained decoder-only model into an encoder-decoder model, and extends it from text-only regime to multimodal based on the Gemma 3 models. We further propose two methods to improve the efficiency: tied word embedding that shares all embeddings across encoder and decoder, and merged attention that unifies decoder self- and cross-attention into a single joint module. Experiments demonstrate the generality of the adaptation strategy over architectures and modalities as well as the unique strength of the encoder-decoder architecture on long context modeling. Similar to T5Gemma, T5Gemma 2 yields comparable or better pretraining performance and significantly improved post-training performance than its Gemma 3 counterpart. We release the pretrained models (270M-270M, 1B-1B and 4B-4B) to the community for future research.
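One way to picture merged attention is as a single attention call whose keys and values are the encoder outputs followed by the decoder states, with a mask that lets each decoder position see every encoder position but only decoder positions up to itself. The sketch below builds that joint mask. It is an illustrative reading of the abstract's one-sentence description, not T5Gemma 2's actual implementation, and the function name and key layout are made up.

```python
def merged_attention_mask(num_encoder_tokens, num_decoder_tokens):
    """Return mask[i][j] = True if decoder query i may attend to key j.

    Keys are laid out as [encoder tokens..., decoder tokens...]
    (an assumed layout for illustration).
    """
    mask = []
    for i in range(num_decoder_tokens):
        # Cross-attention part: every encoder token is visible.
        cross_part = [True] * num_encoder_tokens
        # Self-attention part: causal, so only positions up to i are visible.
        self_part = [j <= i for j in range(num_decoder_tokens)]
        mask.append(cross_part + self_part)
    return mask
```

For example, `merged_attention_mask(2, 3)` gives three rows of length five: the first two columns are always `True`, and the last three form a lower-triangular causal pattern.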
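[NLP-53] Incentives or Ontology? A Structural Rebuttal to OpenAI's Hallucination Thesis
[Quick read]: This paper addresses the dispute over the root cause of "hallucination" in large language models and whether it can be fixed. The prevailing view holds that hallucination stems from misaligned training incentives: models are rewarded for confident answers that are not necessarily correct, so the problem can be mitigated by improving evaluation benchmarks or reward structures. Through empirical experiments with a "Licensing Oracle" and a theoretical analysis of structural hallucination, the paper argues instead that hallucination is not an optimization failure but an inherent property of the Transformer architecture: the model does not represent the real world but builds a pseudo-ontology from statistical associations among tokens, and in boundary regions where training data is sparse or incoherent it necessarily falls back on pattern completion and generates fictional content. On this view, no method that only adjusts incentives, prompting, or fine-tuning can eliminate hallucination. The key to the proposed solution is a hybrid system that separates linguistic fluency from epistemic responsibility, obtaining reliable output through an external truth-validation module and an abstention module rather than through changes inside the model.
Link: https://arxiv.org/abs/2512.14801
Authors: Richard Ackermann,Simeon Emanuilov
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 17 pages, references to prior work arXiv:2509.16297 and arXiv:2511.06073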
Abstract:OpenAI has recently argued that hallucinations in large language models result primarily from misaligned evaluation incentives that reward confident guessing rather than epistemic humility. On this view, hallucination is a contingent behavioral artifact, remediable through improved benchmarks and reward structures. In this paper, we challenge that interpretation. Drawing on previous work on structural hallucination and empirical experiments using a Licensing Oracle, we argue that hallucination is not an optimization failure but an architectural inevitability of the transformer model. Transformers do not represent the world; they model statistical associations among tokens. Their embedding spaces form a pseudo-ontology derived from linguistic co-occurrence rather than world-referential structure. At ontological boundary conditions - regions where training data is sparse or incoherent - the model necessarily interpolates fictional continuations in order to preserve coherence. No incentive mechanism can modify this structural dependence on pattern completion. Our empirical results demonstrate that hallucination can only be eliminated through external truth-validation and abstention modules, not through changes to incentives, prompting, or fine-tuning. The Licensing Oracle achieves perfect abstention precision across domains precisely because it supplies grounding that the transformer lacks. We conclude that hallucination is a structural property of generative architectures and that reliable AI requires hybrid systems that distinguish linguistic fluency from epistemic responsibility. 
ACM classes: I.2.7; I.2.6; I.2.0. Cite as: arXiv:2512.14801 [cs.CL] (or arXiv:2512.14801v1 [cs.CL] for this version). https://doi.org/10.48550/arXiv.2512.14801. Submission history: From Simeon Emanuilov, [v1] Tue, 16 Dec 2025 17:39:45 UTC (336 KB)
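[NLP-54] Revisiting the Reliability of Language Models in Instruction-Following
[Quick read]: This paper addresses the insufficient "nuance-oriented reliability" of current Large Language Models (LLMs) in real-world use: when given cousin prompts that are semantically similar but phrased differently, model performance fluctuates markedly and stable, dependable behavior cannot be guaranteed. The key to the solution is a new metric, reliable@k, that quantifies behavioral consistency under subtle semantic variation, together with an automated data augmentation pipeline that generates high-quality cousin prompts and the resulting IFEval++ benchmark for systematic evaluation. The study shows that existing models fall well short on nuance-oriented reliability, and offers a measurable direction for improving LLM robustness and trustworthiness.
Link: https://arxiv.org/abs/2512.14754
Authors: Jianshuo Dong,Yutong Zhang,Yan Liu,Zhenyu Zhong,Tao Wei,Chao Zhang,Han Qiu
Affiliations: Tsinghua University; Ant Group
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Preprint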
Abstract:Advanced LLMs have achieved near-ceiling instruction-following accuracy on benchmarks such as IFEval. However, these impressive scores do not necessarily translate to reliable services in real-world use, where users often vary their phrasing, contextual framing, and task formulations. In this paper, we study nuance-oriented reliability: whether models exhibit consistent competence across cousin prompts that convey analogous user intents but with subtle nuances. To quantify this, we introduce a new metric, reliable@k, and develop an automated pipeline that generates high-quality cousin prompts via data augmentation. Building upon this, we construct IFEval++ for systematic evaluation. Across 20 proprietary and 26 open-source LLMs, we find that current models exhibit substantial insufficiency in nuance-oriented reliability – their performance can drop by up to 61.8% with nuanced prompt modifications. What’s more, we characterize it and explore three potential improvement recipes. Our findings highlight nuance-oriented reliability as a crucial yet underexplored next step toward more dependable and trustworthy LLM behavior. Our code and benchmark are accessible: this https URL.
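The abstract names reliable@k without defining it. Purely as an assumed reading for illustration, the sketch below treats a prompt family as reliable when the model passes the original prompt and all of its first k cousin prompts, and reports the fraction of reliable families. The paper's actual definition may differ.

```python
def reliable_at_k(results, k):
    """Fraction of prompt families passed on the original and k cousins.

    `results` maps a family id to a list of booleans: index 0 is the
    original prompt, indices 1..k are cousin prompts (assumed layout).
    """
    if not results:
        return 0.0
    reliable = sum(1 for passes in results.values() if all(passes[: k + 1]))
    return reliable / len(results)


# Toy outcomes: every original prompt passes, but some cousins fail.
results = {
    "family-1": [True, True, True],
    "family-2": [True, False, True],
    "family-3": [True, True, False],
}
```

Under this reading, accuracy on the original prompts alone is 100% (`reliable_at_k(results, 0)`), yet the score falls as cousins are added, which is the kind of gap the benchmark is meant to expose.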
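[NLP-55] NoveltyRank: Estimating Conceptual Novelty of AI Papers
[Quick read]: This paper addresses the difficulty of assessing the conceptual novelty of research papers efficiently and objectively as the number of AI publications surges. With the barrier to academic publishing falling, the flood of papers makes genuinely innovative work hard to identify, and traditional manual assessment is subjective and slow. The key to the solution is a data-driven model that quantifies and ranks conceptual novelty from a paper's title, abstract, and semantic similarity to prior literature. Two task formulations are used: binary classification, which predicts whether a paper is novel in absolute terms, and pairwise comparison, which learns a relative novelty ranking. The authors fine-tune Qwen3-4B-Instruct-2507 and SciBERT and benchmark against GPT-5.1 to examine how task formulation and modeling choices affect novelty assessment.
Link: https://arxiv.org/abs/2512.14738
Authors: Zhengxu Yan,Han Li,Yuming Feng
Affiliations: Stanford University
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: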
Abstract:With the growing ease of academic publishing, the volume of research papers, especially in AI-related fields, has surged dramatically. This flood of publications makes it difficult for truly novel and impactful work to stand out, and manual novelty assessment is often unstable and time-consuming. Our project aims to develop a model that estimates and ranks the conceptual novelty of AI papers, enabling a data-driven and scalable assessment of research originality. Such a system can help researchers efficiently identify submissions that introduce genuinely innovative ideas rather than minor variants, and provide conference reviewers with a quantitative and consistent signal of novelty. Our approach evaluates novelty primarily through a paper’s title, abstract, and semantic similarity to prior literature. Given the motivation of novelty estimation, we explore two task formulations with different modeling objectives, each offering a different perspective: (1) binary classification, which predicts the paper’s absolute novelty from learned patterns of prior novel works, and (2) pairwise novelty comparison, which learns to distinguish papers by relative novelty over others. We fine-tune Qwen3-4B-Instruct-2507 and SciBERT on both tasks, benchmarking against GPT-5.1 to analyze how task formulation and modeling choices affect performance. The implementation is publicly available at this https URL.
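The pairwise formulation yields comparisons rather than scores, so some aggregation step is needed to produce a ranking. The abstract does not describe one; a simple assumed option is to count wins across all pairs. In the sketch below, `more_novel` is a hypothetical stand-in for the fine-tuned comparator, and the toy scores are invented.

```python
from itertools import combinations


def rank_by_pairwise_wins(papers, more_novel):
    """Rank papers by how many pairwise novelty comparisons they win.

    `more_novel(a, b)` returns True if paper a is judged more novel than b.
    """
    wins = {paper: 0 for paper in papers}
    for a, b in combinations(papers, 2):
        winner = a if more_novel(a, b) else b
        wins[winner] += 1
    # Sort by wins (descending); ties keep the input order.
    return sorted(papers, key=lambda paper: -wins[paper])


# Toy comparator: pretend a hidden score decides each comparison.
hidden_score = {"paper-a": 0.2, "paper-b": 0.9, "paper-c": 0.5}
ranking = rank_by_pairwise_wins(
    list(hidden_score), lambda a, b: hidden_score[a] > hidden_score[b]
)
```

Win counting needs a comparison for every pair, so the cost grows quadratically with the number of papers; a real system would likely sample pairs or fit a ranking model instead.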
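[NLP-56] SoMe: A Realistic Benchmark for LLM-based Social Media Agents AAAI2026
[Quick read]: This paper addresses the lack of systematic evaluation of social media agents based on Large Language Models (LLMs) in comprehending media content, understanding user behavior, and making complex decisions. The key to the solution is SoMe, the first comprehensive benchmark of its kind, which comprises 8 diverse social media agent tasks, more than 9 million posts (9,164,284), 6,591 user profiles, and 25,686 reports, along with 17,869 carefully annotated task queries, giving LLM-driven social media agents a realistic and varied evaluation environment. Quantitative and qualitative analysis shows that current closed-source and open-source LLMs alike cannot complete social media agent tasks satisfactorily in realistic settings, making SoMe a challenging and meaningful testbed for future agents.
Link: https://arxiv.org/abs/2512.14720
Authors: Dizhan Xue,Jing Cui,Shengsheng Qian,Chuanrui Hu,Changsheng Xu
Affiliations: Unknown
Subjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted by AAAI 2026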
Abstract:Intelligent agents powered by large language models (LLMs) have recently demonstrated impressive capabilities and gained increasing popularity on social media platforms. While LLM agents are reshaping the ecology of social media, there exists a current gap in conducting a comprehensive evaluation of their ability to comprehend media content, understand user behaviors, and make intricate decisions. To address this challenge, we introduce SoMe, a pioneering benchmark designed to evaluate social media agents equipped with various agent tools for accessing and analyzing social media data. SoMe comprises a diverse collection of 8 social media agent tasks, 9,164,284 posts, 6,591 user profiles, and 25,686 reports from various social media platforms and external websites, with 17,869 meticulously annotated task queries. Compared with the existing datasets and benchmarks for social media tasks, SoMe is the first to provide a versatile and realistic platform for LLM-based social media agents to handle diverse social media tasks. By extensive quantitative and qualitative analysis, we provide the first overview insight into the performance of mainstream agentic LLMs in realistic social media environments and identify several limitations. Our evaluation reveals that both the current closed-source and open-source LLMs cannot handle social media agent tasks satisfactorily. SoMe provides a challenging yet meaningful testbed for future social media agents. Our code and data are available at this https URL
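[NLP-57] SepsisSuite: Beyond Risk Stratification – A Comparative Analysis of Deep Fusion vs. Expert Stacking for Prescriptive Sepsis AI
[Quick read]: This paper addresses the difficulty of fusing multimodal data for early sepsis prediction and antibiotic selection, where conventional models are limited by complex cross-modal interactions or by brittle early fusion. Its core solution is SepsisLateFusion, a "leaner" Context-Aware Mixture-of-Experts (MoE) architecture that treats the modalities (static, temporal, NLP) as orthogonal experts and combines them through dynamic gating by a CatBoost meta-learner, effectively mitigating the "attention starvation" problem seen in small-sample settings. It reaches a state-of-the-art 0.915 AUC on MIMIC-IV. Calibrating the decision threshold reduces missed cases by 48% relative to the default operating point, and a quad-modal ensemble gives the best result (0.72 AUC) on multi-class antibiotic selection.
Link: https://arxiv.org/abs/2512.14712
Authors: Ryan Cartularo
Affiliations: The University of Texas at Austin
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Comments: 7 Pages, 4 Tables, 9 Figures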
Abstract:Sepsis accounts for nearly 20% of global ICU admissions, yet conventional prediction models often fail to effectively integrate heterogeneous data streams, remaining either siloed by modality or reliant on brittle early fusion. In this work, we present a rigorous architectural comparison between End-to-End Deep Fusion and Context-Aware Stacking for sepsis tasks. We initially hypothesized that a novel Quad-Modal Hierarchical Gated Attention Network – termed SepsisFusionFormer – would resolve complex cross-modal interactions between vitals, text, and imaging. However, experiments on MIMIC-IV revealed that SepsisFusionFormer suffered from “attention starvation” in the small antibiotic cohort (N ≈ 2,100), resulting in overfitting (AUC 0.66). This counterintuitive result informed the design of SepsisLateFusion, a “leaner” Context-Aware Mixture-of-Experts (MoE) architecture. By treating modalities as orthogonal experts – the “Historian” (Static), the “Monitor” (Temporal), and the “Reader” (NLP) – and dynamically gating them via a CatBoost meta-learner, we achieved State-of-the-Art (SOTA) performance: 0.915 AUC for prediction 4 hours prior to clinical onset. By calibrating the decision threshold for clinical safety, we reduced missed cases by 48% relative to the default operating point, thus opening a true preventative window for timely intervention over reactive alerts. Furthermore, for the novel prescriptive task of multi-class antibiotic selection, we demonstrate that a Quad-Modal Ensemble achieved the highest performance (0.72 AUC). These models are integrated into SepsisSuite, a deployment-ready Python framework for clinical decision support. SepsisSuite is available for free at: this https URL
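The 48% figure comes from moving the decision threshold rather than from retraining. As a toy illustration of that step, the sketch below picks the highest threshold that meets a target sensitivity and compares missed cases against a default threshold. The scores, labels, 0.5 default, and 0.8 target are all invented, and the paper's actual calibration procedure is not described in the abstract.

```python
def missed_cases(scores, labels, threshold):
    """Count positive cases scored below the threshold (false negatives)."""
    return sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)


def calibrate_threshold(scores, labels, target_sensitivity):
    """Return the highest candidate threshold meeting the target sensitivity.

    Assumes at least one positive label is present.
    """
    positives = sum(labels)
    for threshold in sorted(set(scores), reverse=True):
        caught = positives - missed_cases(scores, labels, threshold)
        if caught / positives >= target_sensitivity:
            return threshold
    return min(scores)


# Invented risk scores; label 1 marks a true positive case.
scores = [0.9, 0.7, 0.45, 0.35, 0.2, 0.6, 0.3, 0.1]
labels = [1, 1, 1, 1, 1, 0, 0, 0]
default_missed = missed_cases(scores, labels, 0.5)
threshold = calibrate_threshold(scores, labels, target_sensitivity=0.8)
calibrated_missed = missed_cases(scores, labels, threshold)
```

Lowering the threshold trades fewer missed cases for more false alarms, so in practice the target sensitivity would be chosen against an acceptable alert burden.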
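[NLP-58] LLM as a Neural Architect: Controlled Generation of Image Captioning Models Under Strict API Contracts
[Quick read]: This paper addresses the reliance of traditional Neural Architecture Search (NAS) on extensive human expertise or inefficient automated trial and error. It proposes NN-Caption, a NAS pipeline driven by a Large Language Model (LLM) that automatically generates runnable image-captioning models. The key to the solution is to have an LLM (such as DeepSeek-R1-0528-Qwen3-8B), working under a predefined Net API specification, compose CNN encoders taken from LEMUR classification backbones with sequence decoders (LSTM/GRU/Transformer), guided by a carefully designed prompt template and examples so that the generated code meets syntactic and functional requirements. Combined with automatic evaluation (BLEU-4 on MS COCO), this yields an integrated loop of architecture generation, training suggestions, and iterative fixes, which improves the efficiency and reproducibility of NAS and supports downstream AutoML research.
Link: https://arxiv.org/abs/2512.14706
Authors: Krunal Jesani,Dmitry Ignatov,Radu Timofte
Affiliations: University of Würzburg
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: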
Abstract:Neural architecture search (NAS) traditionally requires significant human expertise or automated trial-and-error to design deep learning models. We present NN-Caption, an LLM-guided neural architecture search pipeline that generates runnable image-captioning models by composing CNN encoders from LEMUR’s classification backbones with sequence decoders (LSTM/GRU/Transformer) under a strict Net API. Using DeepSeek-R1-0528-Qwen3-8B as the primary generator, we present the prompt template and examples of generated architectures. We evaluate on MS COCO with BLEU-4. The LLM generated dozens of captioning models, with over half successfully trained and producing meaningful captions. We analyse the outcomes of using different numbers of input model snippets (5 vs. 10) in the prompt, finding a slight drop in success rate when providing more candidate components. We also report training dynamics (caption accuracy vs. epochs) and the highest BLEU-4 attained. Our results highlight the promise of LLM-guided NAS: the LLM not only proposes architectures but also suggests hyperparameters and training practices. We identify the challenges encountered (e.g., code hallucinations or API compliance issues) and detail how prompt rules and iterative code fixes addressed them. This work presents a pipeline that integrates prompt-based code generation with automatic evaluation, and adds dozens of novel captioning models to the open LEMUR dataset to facilitate reproducible benchmarking and downstream AutoML research.
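Because generated code can hallucinate or drift from the required interface, a cheap static check before training is one way to enforce an API contract. The sketch below uses Python's `ast` module to confirm that generated source defines a class named `Net` with a set of required methods. The class name comes from the abstract; the method names in `REQUIRED_METHODS` are hypothetical placeholders, since the abstract does not list the actual Net API.

```python
import ast

# Hypothetical contract: the real Net API's method list is not given in the abstract.
REQUIRED_METHODS = {"__init__", "forward", "train_setup", "learn"}


def check_net_contract(source):
    """Return the sorted list of contract violations found in `source`."""
    try:
        tree = ast.parse(source)
    except SyntaxError as err:
        return [f"syntax error: {err.msg}"]
    for node in tree.body:
        if isinstance(node, ast.ClassDef) and node.name == "Net":
            defined = {n.name for n in node.body if isinstance(n, ast.FunctionDef)}
            return sorted(f"missing method: {m}" for m in REQUIRED_METHODS - defined)
    return ["missing class: Net"]


# A generated snippet that defines Net but omits two required methods.
generated = '''
class Net:
    def __init__(self):
        pass

    def forward(self, images):
        return images
'''
violations = check_net_contract(generated)
```

A violation list like this could be fed back into the prompt for an iterative fix, in the spirit of the prompt rules and code fixes the abstract mentions.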
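[NLP-59] Effectively Detecting and Responding to Online Harassment with Large Language Models
[Quick read]: This paper addresses the identification of and response to online harassment on private messaging platforms (such as Instagram), a problem that is less studied and more challenging than harassment on public social media. The key to the solution is a scalable labeling pipeline built on Large Language Models (LLMs) that uses the preceding conversation as context to identify harassment in private messages, and then generates and evaluates simulated responses. Experiments show that the LLM labeling pipeline approaches human labels in identification accuracy, and that its simulated responses are rated more helpful than the original human responses, offering a workable route to automated intervention against harassment in private spaces.
Link: https://arxiv.org/abs/2512.14700
Authors: Pinxian Lu,Nimra Ishfaq,Emma Win,Morgan Rose,Sierra R Strickland,Candice L Biernesser,Jamie Zelazny,Munmun De Choudhury
Affiliations: Unknown
Subjects: Social and Information Networks (cs.SI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 16 pages, 2 figures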
Abstract:Online harassment has been a persistent issue in the online space. Predominantly, research focused on online harassment in public social media platforms, while less is placed on private messaging platforms. To address online harassment on one private messaging platform, Instagram, we leverage the capabilities of Large Language Models (LLMs). To achieve this, we recruited human labelers to identify online harassment in an Instagram messages dataset. Using the previous conversation as context, we utilize an LLM pipeline to conduct large-scale labeling on Instagram messages and evaluate its performance against human labels. Then, we use LLM to generate and evaluate simulated responses to online harassment messages. We find that the LLM labeling pipeline is capable of identifying online harassment in private messages. By comparing human responses and simulated responses, we also demonstrate that our simulated responses are superior in helpfulness compared to original human responses.
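Computer Vision
[CV-0] Spatia: Video Generation with Updatable Spatial Memory
[Quick read]: This paper addresses the difficulty existing video generation models have in maintaining spatial and temporal consistency over long horizons, which stems mainly from the dense, high-dimensional nature of video signals. The key to the solution is Spatia, a spatial-memory-aware video generation framework whose core innovation is to keep a 3D scene point cloud explicitly as persistent spatial memory and to update it continuously through visual SLAM (Simultaneous Localization and Mapping). This dynamic-static disentanglement preserves spatial consistency in the generated video while retaining the ability to generate realistic dynamic entities, providing a geometrically grounded basis for scalable, memory-driven video generation.
Link: https://arxiv.org/abs/2512.15716
Authors: Jinjing Zhao,Fangyun Wei,Zhening Liu,Hongyang Zhang,Chang Xu,Yan Lu
Affiliations: The University of Sydney; Microsoft Research; HKUST; University of Waterloo
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project page: this https URL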
Abstract:Existing video generation models struggle to maintain long-term spatial and temporal consistency due to the dense, high-dimensional nature of video signals. To overcome this limitation, we propose Spatia, a spatial memory-aware video generation framework that explicitly preserves a 3D scene point cloud as persistent spatial memory. Spatia iteratively generates video clips conditioned on this spatial memory and continuously updates it through visual SLAM. This dynamic-static disentanglement design enhances spatial consistency throughout the generation process while preserving the model’s ability to produce realistic dynamic entities. Furthermore, Spatia enables applications such as explicit camera control and 3D-aware interactive editing, providing a geometrically grounded framework for scalable, memory-driven video generation.
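[CV-1] In Pursuit of Pixel Supervision for Visual Pre-training KR
[Quick read]: This paper addresses the underexplored potential of self-supervised representation learning in high-dimensional pixel space, in particular whether it can compete with latent-space methods on complex downstream tasks. The key to the solution is Pixio, an enhanced masked autoencoder (MAE) with more challenging pre-training tasks and more capable architectures, trained on 2 billion web-crawled images using a self-curation strategy that needs minimal human curation. Pixio matches or outperforms DINOv3 on a range of real-world downstream tasks, including monocular depth estimation, feed-forward 3D reconstruction, semantic segmentation, and robot learning. The results suggest that pixel-space self-supervised learning remains highly promising and can serve as an important complement to latent-space approaches.
Link: https://arxiv.org/abs/2512.15715
Authors: Lihe Yang,Shang-Wen Li,Yang Li,Xinjie Lei,Dong Wang,Abdelrahman Mohamed,Hengshuang Zhao,Hu Xu
Affiliations: FAIR, Meta; HKU
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL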
Abstract:At the most basic level, pixels are the source of the visual information through which we perceive the world. Pixels contain information at all levels, ranging from low-level attributes to high-level concepts. Autoencoders represent a classical and long-standing paradigm for learning representations from pixels or other raw inputs. In this work, we demonstrate that autoencoder-based self-supervised learning remains competitive today and can produce strong representations for downstream tasks, while remaining simple, stable, and efficient. Our model, codenamed “Pixio”, is an enhanced masked autoencoder (MAE) with more challenging pre-training tasks and more capable architectures. The model is trained on 2B web-crawled images with a self-curation strategy with minimal human curation. Pixio performs competitively across a wide range of downstream tasks in the wild, including monocular depth estimation (e.g., Depth Anything), feed-forward 3D reconstruction (i.e., MapAnything), semantic segmentation, and robot learning, outperforming or matching DINOv3 trained at similar scales. Our results suggest that pixel-space self-supervised learning can serve as a promising alternative and a complement to latent-space approaches.
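[CV-2] DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models
[Quick read]: This paper addresses the significant performance gap between current diffusion vision language models (dVLMs) and mainstream autoregressive (AR) models, asking whether a high-performing dVLM can be built on an existing strong AR model. The key to the solution is DiffusionVL, a method that converts any pretrained AR model to the diffusion paradigm through simple fine-tuning. It also introduces a block-decoding design that supports arbitrary-length generation and reuses the KV cache. Using less than 5% of the training data required by prior methods, it delivers a 2x inference speedup and large gains on multimodal benchmarks (34.4% on MMMU-Pro and 37.5% on MME).
Link: https://arxiv.org/abs/2512.15713
Authors: Lunbin Zeng,Jingfeng Yao,Bencheng Liao,Hongyuan Tao,Wenyu Liu,Xinggang Wang
Affiliations: Huazhong University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 5 figures, conference or other essential info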
Abstract:In recent multimodal research, the diffusion paradigm has emerged as a promising alternative to the autoregressive paradigm (AR), owing to its unique decoding advantages. However, due to the capability limitations of the base diffusion language model, the performance of the diffusion vision language model (dVLM) still lags significantly behind that of mainstream models. This leads to a simple yet fundamental question: Is it possible to construct dVLMs based on existing powerful AR models? In response, we propose DiffusionVL, a dVLM family that could be translated from any powerful AR models. Through simple fine-tuning, we successfully adapt AR pre-trained models into the diffusion paradigm. This approach yields two key observations: (1) The paradigm shift from AR-based multimodal models to diffusion is remarkably effective. (2) Direct conversion of an AR language model to a dVLM is also feasible, achieving performance competitive with LLaVA-style visual-instruction-tuning. Further, we introduce a block-decoding design into dVLMs that supports arbitrary-length generation and KV cache reuse, achieving a significant inference speedup. We conduct a large number of experiments. Despite training with less than 5% of the data required by prior methods, DiffusionVL achieves a comprehensive performance improvement, with a 34.4% gain on the MMMU-Pro (vision) bench and a 37.5% gain on the MME (Cog.) bench, alongside a 2x inference speedup. The model and code are released at this https URL.
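[CV-3] Gaussian Pixel Codec Avatars: A Hybrid Representation for Efficient Rendering
[Quick read]: This paper addresses how to generate and render head avatars that are both high-fidelity and efficient enough for mobile devices. Existing methods either are limited in rendering efficiency by a purely 3D Gaussian representation, or struggle to reproduce non-surface structures such as hair and beards. The key to the solution is a hybrid representation, Gaussian Pixel Codec Avatars (GPiCA), that combines a triangle mesh with anisotropic 3D Gaussians: the mesh efficiently represents surface regions such as facial skin, while the 3D Gaussians handle non-surface regions. A unified differentiable rendering pipeline treats the mesh as a semi-transparent layer within volumetric rendering, so that the three decoded components (a 3D face mesh, an RGBA texture, and a set of 3D Gaussians) are rendered together in one engine. Neural networks trained with multi-view image supervision decode a facial expression code into these components, yielding photorealistic head avatars with rendering performance suitable for mobile devices.
Link: https://arxiv.org/abs/2512.15711
Authors: Divam Gupta,Anuj Pahuja,Nemanja Bartolovic,Tomas Simon,Forrest Iandola,Giljoo Nam
Affiliations: Meta Codec Avatars Lab; Meta Reality Labs; Stellon Labs
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Tech report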
Abstract:We present Gaussian Pixel Codec Avatars (GPiCA), photorealistic head avatars that can be generated from multi-view images and efficiently rendered on mobile devices. GPiCA utilizes a unique hybrid representation that combines a triangle mesh and anisotropic 3D Gaussians. This combination maximizes memory and rendering efficiency while maintaining a photorealistic appearance. The triangle mesh is highly efficient in representing surface areas like facial skin, while the 3D Gaussians effectively handle non-surface areas such as hair and beard. To this end, we develop a unified differentiable rendering pipeline that treats the mesh as a semi-transparent layer within the volumetric rendering paradigm of 3D Gaussian Splatting. We train neural networks to decode a facial expression code into three components: a 3D face mesh, an RGBA texture, and a set of 3D Gaussians. These components are rendered simultaneously in a unified rendering engine. The networks are trained using multi-view image supervision. Our results demonstrate that GPiCA achieves the realism of purely Gaussian-based avatars while matching the rendering performance of mesh-based avatars.
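Treating the mesh as a semi-transparent layer means a mesh fragment can be blended in the same depth-ordered pass as the Gaussians. The sketch below shows standard front-to-back alpha compositing along one ray, with one mesh sample placed between two Gaussian samples. It illustrates the general compositing rule only: the sample values are invented, and the actual GPiCA renderer is not described at this level in the abstract.

```python
def composite_ray(samples):
    """Front-to-back alpha compositing of (depth, color, alpha) samples.

    Colors are scalars here for brevity; the same rule applies per channel.
    """
    color = 0.0
    transmittance = 1.0  # fraction of light not yet absorbed
    for _, sample_color, alpha in sorted(samples, key=lambda s: s[0]):
        color += transmittance * alpha * sample_color
        transmittance *= 1.0 - alpha
    return color, transmittance


# One mesh fragment (e.g. skin) between two Gaussian samples (e.g. hair).
samples = [
    (1.0, 0.2, 0.5),  # Gaussian, nearest to the camera
    (2.0, 0.8, 0.5),  # mesh fragment, treated as semi-transparent
    (3.0, 1.0, 0.5),  # Gaussian, farthest from the camera
]
color, transmittance = composite_ray(samples)
```

Because every sample goes through the same accumulation loop, the mesh needs no separate rasterization pass in this simplified picture; an opaque mesh is just the special case `alpha = 1.0`, which hides everything behind it.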
[CV-4] Multi-View Foundation Models
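[Quick read]: This paper addresses the problem that foundation models, given multi-view image inputs, cannot guarantee consistent features for the same 3D point across views. Conventional foundation models process each image independently, which leads to cross-view feature inconsistency and limits their use in 3D perception tasks. The key to the solution is to extend a foundation model into a Multi-View Foundation Model by inserting intermediate 3D-aware attention layers that improve the consistency of features at corresponding points across views. The method does not need an explicit 3D feature model; it aligns features directly in image space, considerably improves feature matching, and is validated on surface normal estimation and multi-view segmentation.
Link: https://arxiv.org/abs/2512.15708
Authors: Leo Segre,Or Hirschorn,Shai Avidan
Affiliations: Tel Aviv University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: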
Abstract:Foundation models are vital tools in various Computer Vision applications. They take as input a single RGB image and output a deep feature representation that is useful for various applications. However, in case we have multiple views of the same 3D scene, they operate on each image independently and do not always produce consistent features for the same 3D point. We propose a way to convert a Foundation Model into a Multi-View Foundation Model. Such a model takes as input a set of images and outputs a feature map for each image such that the features of corresponding points are as consistent as possible. This approach bypasses the need to build a consistent 3D model of the features and allows direct manipulation in the image space. Specifically, we show how to augment Transformers-based foundation models (i.e., DINO, SAM, CLIP) with intermediate 3D-aware attention layers that help match features across different views. As leading examples, we show surface normal estimation and multi-view segmentation tasks. Quantitative experiments show that our method improves feature matching considerably compared to current foundation models.
[CV-5] GateFusion: Hierarchical Gated Cross-Modal Fusion for Active Speaker Detection WACV2026
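[Quick read]: This paper addresses the limited robustness of Active Speaker Detection (ASD) caused by late fusion, which fails to capture fine-grained cross-modal interactions and performs poorly in unconstrained scenarios. The key to the solution is a new architecture, GateFusion, whose core component is a Hierarchical Gated Fusion Decoder (HiGate). Using learnable, bimodally-conditioned gates, HiGate adaptively injects contextual features from one modality into the other at multiple layers of the Transformer backbone, giving progressive, multi-depth cross-modal fusion. The authors also design two auxiliary objectives, Masked Alignment Loss (MAL) and Over-Positive Penalty (OPP), to strengthen modality alignment and suppress spurious video-only activations, which improves both performance and generalization.
Link: https://arxiv.org/abs/2512.15707
Authors: Yu Wang,Juhyung Ha,Frangil M. Ramirez,Yuchen Wang,David J. Crandall
Affiliations: Indiana University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted by WACV 2026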
Abstract:Active Speaker Detection (ASD) aims to identify who is currently speaking in each frame of a video. Most state-of-the-art approaches rely on late fusion to combine visual and audio features, but late fusion often fails to capture fine-grained cross-modal interactions, which can be critical for robust performance in unconstrained scenarios. In this paper, we introduce GateFusion, a novel architecture that combines strong pretrained unimodal encoders with a Hierarchical Gated Fusion Decoder (HiGate). HiGate enables progressive, multi-depth fusion by adaptively injecting contextual features from one modality into the other at multiple layers of the Transformer backbone, guided by learnable, bimodally-conditioned gates. To further strengthen multimodal learning, we propose two auxiliary objectives: Masked Alignment Loss (MAL) to align unimodal outputs with multimodal predictions, and Over-Positive Penalty (OPP) to suppress spurious video-only activations. GateFusion establishes new state-of-the-art results on several challenging ASD benchmarks, achieving 77.8% mAP (+9.4%), 86.1% mAP (+2.9%), and 96.1% mAP (+0.5%) on Ego4D-ASD, UniTalk, and WASD benchmarks, respectively, and delivering competitive performance on AVA-ActiveSpeaker. Out-of-domain experiments demonstrate the generalization of our model, while comprehensive ablations show the complementary benefits of each component.
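A single gated injection step of the kind described above might look like the following sketch. The abstract does not give the exact gate equations, so the form used here (a per-channel sigmoid gate over the concatenated features, followed by a residual add) is an assumption, and `gated_inject`, `gate_w`, and `gate_b` are hypothetical names.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_inject(target: List[float], source: List[float],
                 gate_w: List[List[float]], gate_b: List[float]) -> List[float]:
    """Inject `source` features into `target` through a gate that sees both.

    gate   = sigmoid(W @ concat(target, source) + b), one value per channel
    output = target + gate * source  (element-wise residual)
    Assumes target and source have the same number of channels.
    """
    joint = target + source  # list concatenation: the bimodal gate input
    out = []
    for i, (t, s) in enumerate(zip(target, source)):
        logit = sum(w * x for w, x in zip(gate_w[i], joint)) + gate_b[i]
        out.append(t + sigmoid(logit) * s)
    return out


video = [1.0, 2.0]
audio = [0.5, -0.5]
zero_w = [[0.0] * 4 for _ in range(2)]
half_open = gated_inject(video, audio, zero_w, [0.0, 0.0])   # gate = 0.5 everywhere
closed = gated_inject(video, audio, zero_w, [-50.0, -50.0])  # gate ~ 0, output ~ video
```

In a hierarchical variant, a step like this would be repeated at several backbone depths and in both directions (audio into video and video into audio), each with its own gate parameters.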
[CV-6] End-to-End Training for Autoregressive Video Diffusion via Self-Resampling
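[Quick read]: This paper addresses exposure bias in autoregressive video diffusion models, which arises from the distribution mismatch between training on ground-truth history frames and inference on the model's own predictions. To train end to end without a teacher model or an online discriminator, the authors propose Resampling Forcing, a teacher-free framework whose core idea is a self-resampling scheme that simulates inference-time errors on history frames during training. Conditioned on these degraded histories, a sparse causal mask preserves temporal causality while allowing parallel training with a frame-level diffusion loss. For efficient long-horizon generation, a parameter-free history routing mechanism dynamically retrieves the top-k most relevant history frames for each query, improving temporal consistency on long videos without sacrificing performance.
Link: https://arxiv.org/abs/2512.15702
Authors: Yuwei Guo,Ceyuan Yang,Hao He,Yang Zhao,Meng Wei,Zhenheng Yang,Weilin Huang,Dahua Lin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL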
Abstract:Autoregressive video diffusion models hold promise for world simulation but are vulnerable to exposure bias arising from the train-test mismatch. While recent works address this via post-training, they typically rely on a bidirectional teacher model or online discriminator. To achieve an end-to-end solution, we introduce Resampling Forcing, a teacher-free framework that enables training autoregressive video models from scratch and at scale. Central to our approach is a self-resampling scheme that simulates inference-time model errors on history frames during training. Conditioned on these degraded histories, a sparse causal mask enforces temporal causality while enabling parallel training with frame-level diffusion loss. To facilitate efficient long-horizon generation, we further introduce history routing, a parameter-free mechanism that dynamically retrieves the top-k most relevant history frames for each query. Experiments demonstrate that our approach achieves performance comparable to distillation-based baselines while exhibiting superior temporal consistency on longer videos owing to native-length training.
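The history routing step can be pictured as a parameter-free top-k lookup. The abstract only states that the top-k most relevant history frames are retrieved per query; scoring by a plain dot product against one pooled key vector per frame is an assumption made for this sketch, and `route_history` is a hypothetical name.

```python
from typing import List


def route_history(query: List[float], frame_keys: List[List[float]], k: int) -> List[int]:
    """Return indices of the k history frames whose keys best match the query.

    Relevance is a dot product, so the routing step itself has no parameters.
    Selected indices are returned in temporal order.
    """
    scores = [sum(q * x for q, x in zip(query, key)) for key in frame_keys]
    ranked = sorted(range(len(frame_keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Four hypothetical history frames, each summarised by a 2-d key.
keys = [[0.9, 0.1], [0.0, 1.0], [0.5, 0.5], [1.0, 0.0]]
selected = route_history([1.0, 0.0], keys, k=2)
```

Attention would then be computed only over the selected frames, which keeps the per-step cost tied to k rather than to the full history length.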
[CV-7] VLIC: Vision-Language Models As Perceptual Judges for Human-Aligned Image Compression
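[Quick read]: This paper addresses the mismatch between image compression models and human visual perception: compression methods built on simple distortion measures such as mean squared error (MSE) do not accurately reflect subjective human preferences about image quality. To obtain compression that is better aligned with human perception, the paper proposes VLIC, an image compression system based on vision-language models (VLMs). Its core idea is to have a pretrained VLM, without extra annotation or fine-tuning, make two-alternative forced choice (2AFC) judgments on image pairs, which yields a zero-shot human-preference signal. The key design choice is to use the VLM as a reward model for preference-based post-training of a diffusion model, rather than distilling the VLM outputs into a separate perceptual loss network. This combines the VLM's cross-modal reasoning ability with the high-fidelity reconstruction of diffusion models and gives human-aligned compression performance on multiple datasets.
Link: https://arxiv.org/abs/2512.15701
Authors: Kyle Sargent,Ruiqi Gao,Philipp Henzler,Charles Herrmann,Aleksander Holynski,Li Fei-Fei,Jiajun Wu,Jason Zhang
Affiliations: Stanford University; Google Research; Google DeepMind
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 8 figures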
Abstract:Evaluations of image compression performance which include human preferences have generally found that naive distortion functions such as MSE are insufficiently aligned to human perception. In order to align compression models to human perception, prior work has employed differentiable perceptual losses consisting of neural networks calibrated on large-scale datasets of human psycho-visual judgments. We show that, surprisingly, state-of-the-art vision-language models (VLMs) can replicate binary human two-alternative forced choice (2AFC) judgments zero-shot when asked to reason about the differences between pairs of images. Motivated to exploit the powerful zero-shot visual reasoning capabilities of VLMs, we propose Vision-Language Models for Image Compression (VLIC), a diffusion-based image compression system designed to be post-trained with binary VLM judgments. VLIC leverages existing techniques for diffusion model post-training with preferences, rather than distilling the VLM judgments into a separate perceptual loss network. We show that calibrating this system on VLM judgments produces competitive or state-of-the-art performance on human-aligned visual compression depending on the dataset, according to perceptual metrics and large-scale user studies. We additionally conduct an extensive analysis of the VLM-based reward design and training procedure and share important insights. More visuals are available at this https URL
[CV-8] Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning
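[Quick read]: This paper addresses two common problems of current AI-generated video detectors: most can only make a binary decision and offer no interpretable explanation, and they do not effectively identify and use human-perceivable visual artifacts as evidence. The key to the solution is Skyra, a multimodal large language model (MLLM) specialized for AI-generated video detection. The main contributions are: first, ViF-CoT-4K, the first large-scale AI-generated video artifact dataset with fine-grained human annotations, used for supervised fine-tuning (SFT); second, a two-stage training strategy that systematically improves the model's spatio-temporal artifact perception, explanation capability, and detection accuracy; and finally, the ViF-Bench benchmark, on which Skyra outperforms existing methods on multiple metrics and which offers a practical path toward explainable AI-generated video detection.
Link: https://arxiv.org/abs/2512.15693
Authors: Yifei Li,Wenzhao Zheng,Yanran Zhang,Runze Sun,Yu Zheng,Lei Chen,Jie Zhou,Jiwen Lu
Affiliations: Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL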
Abstract:The misuse of AI-driven video generation technologies has raised serious social concerns, highlighting the urgent need for reliable AI-generated video detectors. However, most existing methods are limited to binary classification and lack the necessary explanations for human interpretation. In this paper, we present Skyra, a specialized multimodal large language model (MLLM) that identifies human-perceivable visual artifacts in AI-generated videos and leverages them as grounded evidence for both detection and explanation. To support this objective, we construct ViF-CoT-4K for Supervised Fine-Tuning (SFT), which represents the first large-scale AI-generated video artifact dataset with fine-grained human annotations. We then develop a two-stage training strategy that systematically enhances our model’s spatio-temporal artifact perception, explanation capability, and detection accuracy. To comprehensively evaluate Skyra, we introduce ViF-Bench, a benchmark comprising 3K high-quality samples generated by over ten state-of-the-art video generators. Extensive experiments demonstrate that Skyra surpasses existing methods across multiple benchmarks, while our evaluation yields valuable insights for advancing explainable AI-generated video detection.
[CV-9] mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs
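[Quick read]: This paper addresses the lack of physical-causality understanding in current Vision-Language-Action Models (VLAs) for robotic manipulation, which stems from pretraining on static web data. Although conventional VLAs generalize well semantically, they cannot explicitly model physical dynamics and temporal dependencies, so the policy must infer complex dynamics implicitly from robot trajectories. This creates a persistent dependence on large-scale expert data and hinders efficient learning. The key to the solution is a video-driven pretraining paradigm: Internet-scale video jointly captures semantics and visual dynamics, so the model learns physical causal structure during pretraining. A flow matching-based action decoder then acts as an Inverse Dynamics Model (IDM), generating low-level robot control commands from latent representations of action plans in video space. This separates high-level task planning from low-level control and markedly improves sample efficiency and convergence speed.
Link: https://arxiv.org/abs/2512.15692
Authors: Jonas Pai,Liam Achenbach,Victoriano Montesinos,Benedek Forrai,Oier Mees,Elvis Nava
Affiliations: mimic robotics; Microsoft Zurich; ETH Zurich; ETH AI Center; UC Berkeley
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: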
Abstract:Prevailing Vision-Language-Action Models (VLAs) for robotic manipulation are built upon vision-language backbones pretrained on large-scale, but disconnected static web data. As a result, despite improved semantic generalization, the policy must implicitly infer complex physical dynamics and temporal dependencies solely from robot trajectories. This reliance creates an unsustainable data burden, necessitating continuous, large-scale expert data collection to compensate for the lack of innate physical understanding. We contend that while vision-language pretraining effectively captures semantic priors, it remains blind to physical causality. A more effective paradigm leverages video to jointly capture semantics and visual dynamics during pretraining, thereby isolating the remaining task of low-level control. To this end, we introduce mimic-video, a novel Video-Action Model (VAM) that pairs a pretrained Internet-scale video model with a flow matching-based action decoder conditioned on its latent representations. The decoder serves as an Inverse Dynamics Model (IDM), generating low-level robot actions from the latent representation of video-space action plans. Our extensive evaluation shows that our approach achieves state-of-the-art performance on simulated and real-world robotic manipulation tasks, improving sample efficiency by 10x and convergence speed by 2x compared to traditional VLA architectures.
[CV-10] Stylized Synthetic Augmentation further improves Corruption Robustness
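[Quick read]: This paper addresses the vulnerability of deep vision models to common image corruptions. The key to the solution is a training data augmentation pipeline that combines synthetic image data with neural style transfer. Although style transfer lowers the quality of synthetic images as measured by FID, these images prove clearly beneficial for model training. A systematic empirical analysis shows that stylization and synthetic data complement each other and can be combined with rule-based augmentation such as TrivialAugment, achieving state-of-the-art corruption robustness on several small-scale image classification benchmarks, including CIFAR-10-C, CIFAR-100-C, and TinyImageNet-C.
Link: https://arxiv.org/abs/2512.15675
Authors: Georg Siedel,Rojan Regmi,Abhirami Anand,Weijia Shao,Silvia Vock,Andrey Morozov
Affiliations: Institute of Industrial Automation and Software Engineering, University of Stuttgart; Federal Institute for Occupational Safety and Health (BAuA)
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at VISAPP 2026 conference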
Abstract:This paper proposes a training data augmentation pipeline that combines synthetic image data with neural style transfer in order to address the vulnerability of deep vision models to common corruptions. We show that although applying style transfer on synthetic images degrades their quality with respect to the common FID metric, these images are surprisingly beneficial for model training. We conduct a systematic empirical analysis of the effects of both augmentations and their key hyperparameters on the performance of image classifiers. Our results demonstrate that stylization and synthetic data complement each other well and can be combined with popular rule-based data augmentation techniques such as TrivialAugment, while not working with others. Our method achieves state-of-the-art corruption robustness on several small-scale image classification benchmarks, reaching 93.54%, 74.9% and 50.86% robust accuracy on CIFAR-10-C, CIFAR-100-C and TinyImageNet-C, respectively.
[CV-11] SoFlow: Solution Flow Models for One-Step Generative Modeling
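[Quick read]: This paper addresses the low generation efficiency caused by multi-step denoising in diffusion models and Flow Matching models, with the goal of one-step generation from scratch. The key to the solution is a new framework, Solution Flow Models (SoFlow). By analyzing the relationship between the velocity function and the solution function of the velocity ordinary differential equation (ODE), the authors design two losses: a Flow Matching loss and a solution consistency loss. The Flow Matching loss lets the model provide estimated velocity fields for Classifier-Free Guidance (CFG) during training, which improves generation quality. The solution consistency loss does not require the Jacobian-vector product (JVP), which is not well optimized in deep learning frameworks such as PyTorch, and this improves training efficiency and stability. Experiments show that, with the same DiT architecture and number of training epochs, SoFlow achieves better FID-50K scores than MeanFlow models on ImageNet 256x256.
Link: https://arxiv.org/abs/2512.15657
Authors: Tianze Luo,Haotian Yuan,Zhuang Liu
Affiliations: Princeton University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Our code is available at this https URL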
Abstract:The multi-step denoising process in diffusion and Flow Matching models causes major efficiency issues, which motivates research on few-step generation. We present Solution Flow Models (SoFlow), a framework for one-step generation from scratch. By analyzing the relationship between the velocity function and the solution function of the velocity ordinary differential equation (ODE), we propose a Flow Matching loss and a solution consistency loss to train our models. The Flow Matching loss allows our models to provide estimated velocity fields for Classifier-Free Guidance (CFG) during training, which improves generation performance. Notably, our consistency loss does not require the calculation of the Jacobian-vector product (JVP), a common requirement in recent works that is not well-optimized in deep learning frameworks like PyTorch. Experimental results indicate that, when trained from scratch using the same Diffusion Transformer (DiT) architecture and an equal number of training epochs, our models achieve better FID-50K scores than MeanFlow models on the ImageNet 256x256 dataset.
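The relationship between a velocity function and the solution function of its ODE can be written out explicitly. The identities below are standard facts about ODE flows; the symbols $v$ and $f$ and the time arguments $t, r, s$ are notation assumed for this note, and the abstract does not say which of these identities the paper's losses are built on.

```latex
\begin{align}
  \frac{\mathrm{d}x_t}{\mathrm{d}t} &= v(x_t, t)
    && \text{(velocity ODE)} \\
  f(x_t, t, s) &= x_t + \int_t^s v(x_\tau, \tau)\,\mathrm{d}\tau
    && \text{(solution function: state at time $s$ starting from $x_t$)} \\
  f(x_t, t, t) &= x_t
    && \text{(boundary condition)} \\
  \frac{\partial f(x_t, t, s)}{\partial s} &= v\bigl(f(x_t, t, s),\, s\bigr)
    && \text{(the solution follows the velocity field)} \\
  f\bigl(f(x_t, t, r),\, r,\, s\bigr) &= f(x_t, t, s)
    && \text{(consistency across an intermediate time $r$)}
\end{align}
```

A model that approximates $f$ directly can generate in one step by evaluating it once from the noise time to the data time, instead of integrating $v$ over many small steps.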
[CV-12] Hard Labels In! Rethinking the Role of Hard Labels in Mitigating Local Semantic Drift
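[Quick read]: This paper addresses the local semantic drift that teacher-generated soft labels suffer when only a few crops per image are used. A single crop may drift away from the true semantics of the original image because it visually resembles another class, which creates a distribution mismatch between training and testing and introduces systematic errors. The key to the solution is to reintroduce hard labels as a content-agnostic anchor that calibrates the semantic bias of soft labels. Theoretical analysis and experiments show that mixing soft and hard labels restores the alignment between visual content and semantic supervision. Building on this, the authors propose a new training paradigm, HALD (Hard Label for Alleviating Local Semantic Drift), which keeps the fine-grained advantages of soft labels while using hard labels to correct local semantic drift, and which clearly improves generalization in dataset distillation and large-scale classification.
Link: https://arxiv.org/abs/2512.15647
Authors: Jiacheng Cui,Bingkui Tong,Xinyue Bi,Xiaohan Zhao,Jiacheng Liu,Zhiqiang Shen
Affiliations: Mohamed bin Zayed University of Artificial Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code at: this https URL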
Abstract:Soft labels generated by teacher models have become a dominant paradigm for knowledge transfer and recent large-scale dataset distillation such as SRe2L, RDED, LPLD, offering richer supervision than conventional hard labels. However, we observe that when only a limited number of crops per image are used, soft labels are prone to local semantic drift: a crop may visually resemble another class, causing its soft embedding to deviate from the ground-truth semantics of the original image. This mismatch between local visual content and global semantic meaning introduces systematic errors and distribution misalignment between training and testing. In this work, we revisit the overlooked role of hard labels and show that, when appropriately integrated, they provide a powerful content-agnostic anchor to calibrate semantic drift. We theoretically characterize the emergence of drift under few soft-label supervision and demonstrate that hybridizing soft and hard labels restores alignment between visual content and semantic supervision. Building on this insight, we propose a new training paradigm, Hard Label for Alleviating Local Semantic Drift (HALD), which leverages hard labels as intermediate corrective signals while retaining the fine-grained advantages of soft labels. Extensive experiments on dataset distillation and large-scale conventional classification benchmarks validate our approach, showing consistent improvements in generalization. On ImageNet-1K, we achieve 42.7% with only 285M storage for soft labels, outperforming prior state-of-the-art LPLD by 9.0%. Our findings re-establish the importance of hard labels as a complementary tool, and call for a rethinking of their role in soft-label-dominated training.
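The simplest way to hybridize soft and hard labels is a fixed convex mix of two cross-entropy terms, sketched below for a single crop. This is an illustration only: the abstract does not specify how HALD combines or schedules the two signals, and `hybrid_label_loss` and `alpha` are hypothetical names.

```python
import math
from typing import List


def hybrid_label_loss(logits: List[float], soft_target: List[float],
                      hard_label: int, alpha: float) -> float:
    """alpha * CE(hard label) + (1 - alpha) * CE(soft target) for one sample."""
    m = max(logits)  # subtract the max for a numerically stable log-softmax
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    log_probs = [z - log_z for z in logits]
    hard_term = -log_probs[hard_label]
    soft_term = -sum(p * lp for p, lp in zip(soft_target, log_probs))
    return alpha * hard_term + (1.0 - alpha) * soft_term


# Hypothetical drifted crop: the image is class 0, but the teacher's soft label
# for this crop leans toward class 1. The student currently predicts class 0.
logits = [2.0, 0.0]
drifted_soft = [0.2, 0.8]
soft_only = hybrid_label_loss(logits, drifted_soft, hard_label=0, alpha=0.0)
mixed = hybrid_label_loss(logits, drifted_soft, hard_label=0, alpha=0.5)
```

With the drifted soft label alone, the correct prediction is penalized heavily; mixing in the hard label lowers that penalty, which is the anchoring effect described above.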
[CV-13] InpaintDPO: Mitigating Spatial Relationship Hallucinations in Foreground-conditioned Inpainting via Diverse Preference Optimization
Link: https://arxiv.org/abs/2512.15644
Authors: Qirui Li,Yizhe Tang,Ran Yi,Guangben Lu,Fangyuan Zou,Peng Shu,Huan Yu,Jie Jiang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-14] IC-Effect: Precise and Efficient Video Effects Editing via In-Context Learning
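[Quick read]: This paper addresses the core challenge of few-shot video visual effects (VFX) editing: with limited paired data, how to synthesize complex effects (such as flames, particles, and cartoon characters) while strictly preserving spatial and temporal consistency, keeping the background completely unchanged, and blending the effects naturally. Existing video editing models struggle to meet these requirements. The key to the solution is the IC-Effect framework, whose main ideas are: (1) it builds on the DiT (Diffusion Transformer) architecture and uses the source video as a clean contextual condition, exploiting DiT's in-context learning ability for precise background preservation and natural effect injection; (2) it uses a two-stage training strategy, general editing adaptation followed by effect-specific learning via Effect-LoRA, to strengthen instruction following and robust effect modeling; (3) it introduces spatiotemporal sparse tokenization, which substantially reduces computation while keeping high-fidelity output.
Link: https://arxiv.org/abs/2512.15635
Authors: Yuanhang Li,Yiren Song,Junzhe Bai,Xinran Liang,Hu Yang,Libiao Jin,Qi Mao
Affiliations: Communication University of China; National University of Singapore; Baidu Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: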
Abstract:We propose IC-Effect, an instruction-guided, DiT-based framework for few-shot video VFX editing that synthesizes complex effects (e.g., flames, particles and cartoon characters) while strictly preserving spatial and temporal consistency. Video VFX editing is highly challenging because injected effects must blend seamlessly with the background, the background must remain entirely unchanged, and effect patterns must be learned efficiently from limited paired data. However, existing video editing models fail to satisfy these requirements. IC-Effect leverages the source video as clean contextual conditions, exploiting the contextual learning capability of DiT models to achieve precise background preservation and natural effect injection. A two-stage training strategy, consisting of general editing adaptation followed by effect-specific learning via Effect-LoRA, ensures strong instruction following and robust effect modeling. To further improve efficiency, we introduce spatiotemporal sparse tokenization, enabling high fidelity with substantially reduced computation. We also release a paired VFX editing dataset spanning 15 high-quality visual styles. Extensive experiments show that IC-Effect delivers high-quality, controllable, and temporally consistent VFX editing, opening new possibilities for video creation.
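[CV-15] Towards Physically-Based Sky-Modeling For Image Based Lighting
[Quick read]: This paper addresses the shortcomings of High Dynamic Range Imagery (HDRI) environment maps generated by deep neural networks (DNNs) in reproducing natural sky illumination, in particular their failure to match the tones, shadows, and illumination of physically captured HDRI. Although visual quality has improved, existing methods cannot deliver both photorealism and the Full Dynamic Range (FDR, about 22 f-stops) of outdoor illumination, which limits scalability and accuracy in downstream applications. The key to the solution is AllSky, a flexible all-weather sky model learned directly from physically captured HDRI. It offers intuitive control over environment maps through user-controlled positioning of the sun and cloud formations, and it is used to study the input modalities, tonemapping, conditioning, and evaluation of sky models, achieving state-of-the-art sky-model performance.
Link: https://arxiv.org/abs/2512.15632
Authors: Ian J. Maquignaz
Affiliations: Université Laval
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: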
Abstract:Accurate environment maps are a key component for rendering photorealistic outdoor scenes with coherent illumination. They enable captivating visual arts, immersive virtual reality, and a wide range of engineering and scientific applications. Recent works have extended sky-models to be more comprehensive and inclusive of cloud formations but, as we demonstrate, existing methods fall short in faithfully recreating natural skies. Though in recent years the visual quality of DNN-generated High Dynamic Range Imagery (HDRI) has greatly improved, the environment maps generated by DNN sky-models do not re-light scenes with the same tones, shadows, and illumination as physically captured HDR imagery. In this work, we demonstrate progress in HDR literature to be tangential to sky-modelling as current works cannot support both photorealism and the 22 f-stops required for the Full Dynamic Range (FDR) of outdoor illumination. We achieve this by proposing AllSky, a flexible all-weather sky-model learned directly from physically captured HDRI which we leverage to study the input modalities, tonemapping, conditioning, and evaluation of sky-models. Per user-controlled positioning of the sun and cloud formations, AllSky expands on current functionality by allowing for intuitive user control over environment maps and achieves state-of-the-art sky-model performance. Through our proposed evaluation, we demonstrate existing DNN sky-models are not interchangeable with physically captured HDRI or parametric sky-models, with current limitations being prohibitive of scalability and accurate illumination in downstream applications
[CV-16] OccSTeP: Benchmarking 4D Occupancy Spatio-Temporal Persistence
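[Quick read]: This paper addresses persistent 3D scene understanding for autonomous driving, in particular staying robust under temporal disturbances while accounting for potential future actions. The core challenge is to support two forecasting tasks: reactive forecasting ("what will happen next") and proactive forecasting ("what would happen given a specific future action"). The key to the solution is the concept of 4D Occupancy Spatio-Temporal Persistence (OccSTeP) and the corresponding world model, OccSTeP-WM. The model is tokenizer-free, represents the scene state with dense voxels, continually fuses spatio-temporal context through a linear-complexity attention backbone and a recurrent state-space module, and updates the scene memory with ego-motion compensation. This supports online inference and keeps performance robust when historical sensor input is missing or noisy.
Link: https://arxiv.org/abs/2512.15621
Authors: Yu Zheng,Jie Hu,Kailun Yang,Jiaming Zhang
Affiliations: Hunan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 5 figures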
Abstract:Autonomous driving requires a persistent understanding of 3D scenes that is robust to temporal disturbances and accounts for potential future actions. We introduce a new concept of 4D Occupancy Spatio-Temporal Persistence (OccSTeP), which aims to address two tasks: (1) reactive forecasting: "what will happen next" and (2) proactive forecasting: "what would happen given a specific future action". For the first time, we create a new OccSTeP benchmark with challenging scenarios (e.g., erroneous semantic labels and dropped frames). To address this task, we propose OccSTeP-WM, a tokenizer-free world model that maintains a dense voxel-based scene state and incrementally fuses spatio-temporal context over time. OccSTeP-WM leverages a linear-complexity attention backbone and a recurrent state-space module to capture long-range spatial dependencies while continually updating the scene memory with ego-motion compensation. This design enables online inference and robust performance even when historical sensor input is missing or noisy. Extensive experiments prove the effectiveness of the OccSTeP concept and our OccSTeP-WM, yielding an average semantic mIoU of 23.70% (+6.56% gain) and occupancy IoU of 35.89% (+9.26% gain). The data and code will be open source at this https URL.
[CV-17] Persistent feature reconstruction of resident space objects (RSOs) within inverse synthetic aperture radar (ISAR) images
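[Quick read]: This paper addresses the difficulty of recognizing the external structures of resident space objects (RSOs) in the near-Earth space environment, in order to improve Space Domain Awareness (SDA). The core challenge is to accurately detect and track features across sequences of sub-THz inverse synthetic aperture radar (ISAR) images so that RSO surface structures can be reliably identified. The key steps of the solution are: first, initial frame-to-frame alignment via affine transformations; then edge extraction with a gradient-by-ratio edge detector, combined with a double-weighted Hough transform that detects linear features with high accuracy; and finally, analysis and tracking of how features evolve across consecutive frames, which raises confidence in feature detection and classification, as demonstrated by the robust detection of shadowing as a feature.
Link: https://arxiv.org/abs/2512.15618
Authors: Morgan Coe,Gruffudd Jones,Leah-Nani Alconcel,Marina Gashinova
Affiliations: University of Birmingham
Categories: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Comments: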
Abstract:With the rapidly growing population of resident space objects (RSOs) in the near-Earth space environment, detailed information about their condition and capabilities is needed to provide Space Domain Awareness (SDA). Space-based sensing will enable inspection of RSOs at shorter ranges, independent of atmospheric effects, and from all aspects. The use of a sub-THz inverse synthetic aperture radar (ISAR) imaging and sensing system for SDA has been proposed in previous work, demonstrating the achievement of sub-cm image resolution at ranges of up to 100 km. This work focuses on recognition of external structures by use of sequential feature detection and tracking throughout the aligned ISAR images of the satellites. The Hough transform is employed to detect linear features, which are tracked throughout the sequence. ISAR imagery is generated via a metaheuristic simulator capable of modelling encounters for a variety of deployment scenarios. Initial frame-to-frame alignment is achieved through a series of affine transformations to facilitate later association between image features. A gradient-by-ratio method is used for edge detection within individual ISAR images, and edge magnitude and direction are subsequently used to inform a double-weighted Hough transform to detect features with high accuracy. Feature evolution during sequences of frames is analysed. It is shown that the use of feature tracking within sequences with the proposed approach will increase confidence in feature detection and classification, and an example use-case of robust detection of shadowing as a feature is presented.
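A Hough accumulator whose votes are weighted by both edge magnitude and edge direction can be sketched as follows. The abstract says only that magnitude and direction "inform" the double-weighted transform, so the specific weight used here (magnitude times the absolute cosine between the candidate line normal and the gradient direction) is an assumption, and `weighted_hough` is a hypothetical name.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

# One edge pixel: (x, y, gradient magnitude, gradient direction in radians).
Edge = Tuple[float, float, float, float]


def weighted_hough(edges: List[Edge], n_theta: int = 180,
                   rho_step: float = 1.0) -> Dict[Tuple[int, int], float]:
    """Accumulate line votes weighted by edge magnitude and direction agreement.

    A line is parameterised as rho = x*cos(theta) + y*sin(theta), where theta
    is the direction of the line normal. Each vote is scaled by
    magnitude * |cos(theta - gradient_direction)|.
    """
    acc: Dict[Tuple[int, int], float] = defaultdict(float)
    for x, y, mag, direction in edges:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = x * math.cos(theta) + y * math.sin(theta)
            weight = mag * abs(math.cos(theta - direction))
            acc[(t, round(rho / rho_step))] += weight
    return acc


# Hypothetical vertical edge at x = 5; its gradient points along +x (direction 0).
edges = [(5.0, float(y), 1.0, 0.0) for y in range(10)]
acc = weighted_hough(edges)
best_theta_bin, best_rho_bin = max(acc, key=acc.get)
```

Compared with an unweighted transform, votes for lines whose normal disagrees with the local gradient are suppressed, so the peak for the true line stands out more clearly from crossing-line clutter.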
[CV-18] Robust Multi-view Camera Calibration from Dense Matches
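[Quick read]: This paper addresses the robustness of estimating camera intrinsic and extrinsic parameters in multi-view camera systems, especially the limited accuracy of structure-from-motion (SfM) in the presence of strong radial distortion. The key to the solution is twofold: first, a better strategy for subsampling the correspondences produced by a dense matcher, so that the predicted correspondences are used more effectively; second, selection criteria for adding views incrementally. The correspondence subsampling is also demonstrated in a global SfM setting. Experiments show that, for camera setups with strong radial distortion, the method clearly outperforms the baseline (vanilla VGGT), and that it generalizes well to practical applications such as animal behavior studies and forensic analysis of surveillance footage.
Link: https://arxiv.org/abs/2512.15608
Authors: Johannes Hägerlind,Bao-Long Tran,Urs Waldmann,Per-Erik Forssén
Affiliations: Linköping University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper has been accepted for publication at the 21st International Conference on Computer Vision Theory and Applications (VISAPP 2026). Conference website: this https URL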
Abstract:Estimating camera intrinsics and extrinsics is a fundamental problem in computer vision, and while advances in structure-from-motion (SfM) have improved accuracy and robustness, open challenges remain. In this paper, we introduce a robust method for pose estimation and calibration. We consider a set of rigid cameras, each observing the scene from a different perspective, which is a typical camera setup in animal behavior studies and forensic analysis of surveillance footage. Specifically, we analyse the individual components in a structure-from-motion (SfM) pipeline, and identify design choices that improve accuracy. Our main contributions are: (1) we investigate how to best subsample the predicted correspondences from a dense matcher to leverage them in the estimation process. (2) We investigate selection criteria for how to add the views incrementally. In a rigorous quantitative evaluation, we show the effectiveness of our changes, especially for cameras with strong radial distortion (79.9% ours vs. 40.4 vanilla VGGT). Finally, we demonstrate our correspondence subsampling in a global SfM setting where we initialize the poses using VGGT. The proposed pipeline generalizes across a wide range of camera setups, and could thus become a useful tool for animal behavior and forensic analysis.
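[CV-19] Qwen-Image-Layered: Towards Inherent Editability via Layer Decomposition
[Quick read]: This paper addresses the consistency problems that current visual generative models face during image editing because raster images entangle all content, so that modifying one region easily disrupts others. Conventional methods struggle to balance local edits with global consistency, whereas professional design tools use layered representations that allow isolated edits while preserving consistency. The authors therefore propose Qwen-Image-Layered, an end-to-end diffusion model whose core idea is to decompose a single RGB image into multiple semantically disentangled RGBA layers. This gives inherent editability: each RGBA layer can be manipulated independently without affecting other content. The key components are: (1) an RGBA-VAE that unifies the latent representations of RGB and RGBA images; (2) VLD-MMDiT (Variable Layers Decomposition MMDiT), an architecture that decomposes an image into a variable number of layers; (3) a multi-stage training strategy that adapts a pretrained image generation model into a multilayer image decomposer. To address the scarcity of high-quality multilayer images, the authors also build a pipeline that extracts and annotates multilayer images from Photoshop documents (PSD). Experiments show that the method significantly surpasses existing approaches in decomposition quality and establishes a new paradigm for consistent image editing.
Link: https://arxiv.org/abs/2512.15603
Authors: Shengming Yin,Zekai Zhang,Zecheng Tang,Kaiyuan Gao,Xiao Xu,Kun Yan,Jiahao Li,Yilei Chen,Yuxiang Chen,Heung-Yeung Shum,Lionel M. Ni,Jingren Zhou,Junyang Lin,Chenfei Wu
Affiliations: HKUST(GZ); Alibaba; HKUST
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 8 figures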
Abstract:Recent visual generative models often struggle with consistency during image editing due to the entangled nature of raster images, where all visual content is fused into a single canvas. In contrast, professional design tools employ layered representations, allowing isolated edits while preserving consistency. Motivated by this, we propose Qwen-Image-Layered, an end-to-end diffusion model that decomposes a single RGB image into multiple semantically disentangled RGBA layers, enabling inherent editability, where each RGBA layer can be independently manipulated without affecting other content. To support variable-length decomposition, we introduce three key components: (1) an RGBA-VAE to unify the latent representations of RGB and RGBA images; (2) a VLD-MMDiT (Variable Layers Decomposition MMDiT) architecture capable of decomposing a variable number of image layers; and (3) a Multi-stage Training strategy to adapt a pretrained image generation model into a multilayer image decomposer. Furthermore, to address the scarcity of high-quality multilayer training images, we build a pipeline to extract and annotate multilayer images from Photoshop documents (PSD). Experiments demonstrate that our method significantly surpasses existing approaches in decomposition quality and establishes a new paradigm for consistent image editing. Our code and models are released on this https URL.
[CV-20] FlexAvatar: Learning Complete 3D Head Avatars with Partial Supervision
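[Quick read]: This paper addresses incomplete reconstruction when creating complete, high-quality 3D head avatars from monocular images. The root cause is the entanglement between driving signal and target viewpoint in monocular training, which makes it hard for the model to generalize to unseen viewpoints. The key to the solution is a transformer-based 3D portrait animation model with learnable data source tokens (called bias sinks), which enables unified training across monocular and multi-view data. At inference time the design combines the strong generalization of monocular data with the complete 3D structure provided by multi-view supervision, producing geometrically complete 3D head avatars with strong view extrapolation and realistic facial animation.
Link: https://arxiv.org/abs/2512.15599
Authors: Tobias Kirschstein,Simon Giebenhain,Matthias Nießner
Affiliations: Technical University of Munich
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project website: this https URL , Video: this https URL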
Abstract:We introduce FlexAvatar, a method for creating high-quality and complete 3D head avatars from a single image. A core challenge lies in the limited availability of multi-view data and the tendency of monocular training to yield incomplete 3D head reconstructions. We identify the root cause of this issue as the entanglement between driving signal and target viewpoint when learning from monocular videos. To address this, we propose a transformer-based 3D portrait animation model with learnable data source tokens, so-called bias sinks, which enables unified training across monocular and multi-view datasets. This design leverages the strengths of both data sources during inference: strong generalization from monocular data and full 3D completeness from multi-view supervision. Furthermore, our training procedure yields a smooth latent avatar space that facilitates identity interpolation and flexible fitting to an arbitrary number of input observations. In extensive evaluations on single-view, few-shot, and monocular avatar creation tasks, we verify the efficacy of FlexAvatar. Many existing methods struggle with view extrapolation while FlexAvatar generates complete 3D head avatars with realistic facial animations. Website: this https URL
[CV-21] IMKD: Intensity-Aware Multi-Level Knowledge Distillation for Camera-Radar Fusion WACV
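【Quick read】: This paper addresses a problem in radar-camera fusion: directly transferring modality-specific features distorts each sensor's intrinsic characteristics and weakens its individual strengths. Existing knowledge distillation methods typically transfer LiDAR features directly to the radar or fusion layers, ignoring the differences between modalities and limiting fusion performance. The key to the solution is IMKD, a radar-camera fusion framework based on multi-level knowledge distillation. Its three-stage, intensity-aware distillation strategy strengthens complementarity without disrupting each sensor's original feature representation: (1) LiDAR-to-radar intensity-aware feature distillation, which adds fine-grained structural cues to the radar representation; (2) LiDAR-to-fused-feature intensity-guided distillation, which selectively highlights geometry and depth information and fosters complementarity between modalities rather than forced alignment; (3) a camera-radar intensity-guided fusion mechanism for effective feature alignment and calibration. On the nuScenes benchmark, IMKD reaches 67.0% NDS and 61.0% mAP, outperforming prior distillation-based radar-camera fusion methods.
Link: https://arxiv.org/abs/2512.15581
Authors: Shashank Mishra, Karan Patil, Didier Stricker, Jason Rambach
Affiliations: German Research Center for Artificial Intelligence (DFKI); RPTU
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2026. 22 pages, 8 figures. Includes supplementary material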
Abstract:High-performance Radar-Camera 3D object detection can be achieved by leveraging knowledge distillation without using LiDAR at inference time. However, existing distillation methods typically transfer modality-specific features directly to each sensor, which can distort their unique characteristics and degrade their individual strengths. To address this, we introduce IMKD, a radar-camera fusion framework based on multi-level knowledge distillation that preserves each sensor’s intrinsic characteristics while amplifying their complementary strengths. IMKD applies a three-stage, intensity-aware distillation strategy to enrich the fused representation across the architecture: (1) LiDAR-to-Radar intensity-aware feature distillation to enhance radar representations with fine-grained structural cues, (2) LiDAR-to-Fused feature intensity-guided distillation to selectively highlight useful geometry and depth information at the fusion level, fostering complementarity between the modalities rather than forcing them to align, and (3) Camera-Radar intensity-guided fusion mechanism that facilitates effective feature alignment and calibration. Extensive experiments on the nuScenes benchmark show that IMKD reaches 67.0% NDS and 61.0% mAP, outperforming all prior distillation-based radar-camera fusion methods. Our code and models are available at this https URL.
[CV-22] MoonSeg3R: Monocular Online Zero-Shot Segment Anything in 3D with Reconstructive Foundation Priors
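【Quick read】: This paper targets online zero-shot monocular 3D instance segmentation, a new and challenging setting in which existing methods fail because they rely on posed RGB-D sequences and cannot perform reliably from a monocular RGB stream alone. To overcome this limitation, the authors propose MoonSeg3R, with three key components: (1) a self-supervised query refinement module that uses spatial-semantic distillation to turn 2D segmentation masks from visual foundation models (VFMs) into discriminative 3D queries; (2) a 3D query index memory that provides temporal consistency by retrieving contextual queries; (3) a state-distribution token from CUT3R, used as a mask identity descriptor to strengthen cross-frame fusion. According to the authors, this is the first method to enable online 3D instance segmentation from a monocular RGB stream, with performance competitive with RGB-D-based systems.
Link: https://arxiv.org/abs/2512.15577
Authors: Zhipeng Du, Duolikun Danier, Jan Eric Lenssen, Hakan Bilen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: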
Abstract:In this paper, we focus on online zero-shot monocular 3D instance segmentation, a novel practical setting where existing approaches fail to perform because they rely on posed RGB-D sequences. To overcome this limitation, we leverage CUT3R, a recent Reconstructive Foundation Model (RFM), to provide reliable geometric priors from a single RGB stream. We propose MoonSeg3R, which introduces three key components: (1) a self-supervised query refinement module with spatial-semantic distillation that transforms segmentation masks from 2D visual foundation models (VFMs) into discriminative 3D queries; (2) a 3D query index memory that provides temporal consistency by retrieving contextual queries; and (3) a state-distribution token from CUT3R that acts as a mask identity descriptor to strengthen cross-frame fusion. Experiments on ScanNet200 and SceneNN show that MoonSeg3R is the first method to enable online monocular 3D segmentation and achieves performance competitive with state-of-the-art RGB-D-based systems. Code and models will be released.
[CV-23] On the Effectiveness of Textual Prompting with Lightweight Fine-Tuning for SAM3 Remote Sensing Segmentation
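【Quick read】: This paper addresses the performance bottleneck in remote sensing (RS) image segmentation caused by scarce annotated data and the domain gap between overhead imagery and the natural images used to train foundation models. The key to the solution is SAM3 (Segment Anything Model 3), a framework driven by textual prompts, which is adapted to RS imagery under lightweight fine-tuning by combining semantic (textual) and geometric prompting strategies. Experiments show that the hybrid strategy combining semantic and geometric cues performs best across all target types, and that a modest amount of geometric annotation is enough for effective adaptation. A persistent gap between Precision and IoU indicates that under-segmentation and boundary inaccuracies remain the dominant error patterns in RS tasks.
Link: https://arxiv.org/abs/2512.15564
Authors: Roni Blushtein-Livnon, Osher Rafaeli, David Ioffe, Amir Boger, Karen Sandberg Esquenazi, Tal Svoray
Affiliations: Ben-Gurion University of the Negev
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: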
Abstract:Remote sensing (RS) image segmentation is constrained by the limited availability of annotated data and a gap between overhead imagery and natural images used to train foundational models. This motivates effective adaptation under limited supervision. SAM3 concept-driven framework generates masks from textual prompts without requiring task-specific modifications, which may enable this adaptation. We evaluate SAM3 for RS imagery across four target types, comparing textual, geometric, and hybrid prompting strategies, under lightweight fine-tuning scales with increasing supervision, alongside zero-shot inference. Results show that combining semantic and geometric cues yields the highest performance across targets and metrics. Text-only prompting exhibits the lowest performance, with marked score gaps for irregularly shaped targets, reflecting limited semantic alignment between SAM3 textual representations and their overhead appearances. Nevertheless, textual prompting with light fine-tuning offers a practical performance-effort trade-off for geometrically regular and visually salient targets. Across targets, performance improves between zero-shot inference and fine-tuning, followed by diminishing returns as the supervision scale increases. Namely, a modest geometric annotation effort is sufficient for effective adaptation. A persistent gap between Precision and IoU further indicates that under-segmentation and boundary inaccuracies remain prevalent error patterns in RS tasks, particularly for irregular and less prevalent targets.
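The link between a Precision-IoU gap and under-segmentation can be seen in a small worked example. This sketch uses made-up masks represented as sets of pixel coordinates; the function name `precision_recall_iou` is an assumption for illustration and does not come from the paper.

```python
def precision_recall_iou(pred, truth):
    """Pixel-level precision, recall and IoU for two masks given as sets of (row, col)."""
    tp = len(pred & truth)                      # pixels predicted and actually in the target
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(truth) if truth else 0.0
    union = len(pred | truth)
    iou = tp / union if union else 0.0
    return precision, recall, iou


# Hypothetical target: a 10x10 square. The prediction covers only its left
# half, so every predicted pixel is correct but half the target is missed.
truth = {(r, c) for r in range(10) for c in range(10)}
pred = {(r, c) for r in range(10) for c in range(5)}

precision, recall, iou = precision_recall_iou(pred, truth)
```

Here precision is 1.0 while IoU is only 0.5: a mask that stays inside the target but misses part of it scores well on precision and poorly on IoU, which is the under-segmentation signature the abstract describes.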
[CV-24] GRAN-TED: Generating Robust Aligned and Nuanced Text Embedding for Diffusion Models
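【Quick read】: This paper addresses two core problems with text encoders in text-to-image and text-to-video diffusion models: the lack of an efficient, reliable evaluation framework that predicts downstream generation performance, and the difficulty of adapting pretrained language models to visual synthesis. The key to the solution is the GRAN-TED paradigm. First, it introduces TED-6K, a text-only benchmark that assesses an encoder's representational quality without costly end-to-end model training; scores on TED-6K, standardized via a lightweight unified adapter, correlate strongly with downstream generation quality. Second, guided by this validated framework, it uses a two-stage training strategy: an initial fine-tuning stage on a Multimodal Large Language Model to improve visual representation, followed by a layer-wise weighting method that extracts more nuanced and potent text features. The resulting GRAN-TED encoder achieves state-of-the-art results on TED-6K and improves text-to-image and text-to-video generation quality.
Link: https://arxiv.org/abs/2512.15560
Authors: Bozhou Li, Sihan Yang, Yushuo Guan, Ruichuan An, Xinlong Chen, Yang Shi, Pengfei Wan, Wentao Zhang, Yuanxing Zhang
Affiliations: Peking University; Kling Team, Kuaishou Technology; Xi’an Jiaotong University; School of Artificial Intelligence, UCAS
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: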
Abstract:The text encoder is a critical component of text-to-image and text-to-video diffusion models, fundamentally determining the semantic fidelity of the generated content. However, its development has been hindered by two major challenges: the lack of an efficient evaluation framework that reliably predicts downstream generation performance, and the difficulty of effectively adapting pretrained language models for visual synthesis. To address these issues, we introduce GRAN-TED, a paradigm to Generate Robust, Aligned, and Nuanced Text Embeddings for Diffusion models. Our contribution is twofold. First, we propose TED-6K, a novel text-only benchmark that enables efficient and robust assessment of an encoder’s representational quality without requiring costly end-to-end model training. We demonstrate that performance on TED-6K, standardized via a lightweight, unified adapter, strongly correlates with an encoder’s effectiveness in downstream generation tasks. Second, guided by this validated framework, we develop a superior text encoder using a novel two-stage training paradigm. This process involves an initial fine-tuning stage on a Multimodal Large Language Model for better visual representation, followed by a layer-wise weighting method to extract more nuanced and potent text features. Our experiments show that the resulting GRAN-TED encoder not only achieves state-of-the-art performance on TED-6K but also leads to demonstrable performance gains in text-to-image and text-to-video generation. Our code is available at the following link: this https URL.
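The abstract does not spell out how the layer-wise weighting works. One common way to combine hidden states from several encoder layers is a softmax-normalized scalar mix, sketched below. This is an assumption for illustration, not the authors' method; `layer_weighted_embedding` and the toy inputs are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def layer_weighted_embedding(layer_states, layer_logits):
    """Weighted sum of per-layer vectors, one learnable logit per layer.

    layer_states: list of equally sized vectors, one per encoder layer.
    layer_logits: list of scalars (in training these would be learned).
    """
    weights = softmax(layer_logits)
    dim = len(layer_states[0])
    return [
        sum(w * state[i] for w, state in zip(weights, layer_states))
        for i in range(dim)
    ]


# Toy example: two layers, two-dimensional states, equal logits, so the
# result is the plain average of the two layers.
states = [[1.0, 0.0], [3.0, 2.0]]
mixed = layer_weighted_embedding(states, [0.0, 0.0])
```

With equal logits the mix reduces to an average; raising one layer's logit shifts the embedding toward that layer's features.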
[CV-25] BLANKET: Anonymizing Faces in Infant Video Recordings
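【Quick read】: This paper addresses face privacy in infant video data, in particular how to de-identify subjects while preserving the facial attributes needed for downstream analysis such as human pose estimation. The core solution is BLANKET (Baby-face Landmark-preserving ANonymization with Keypoint dEtection consisTency), which has two stages: first, a diffusion model generates a new face, compatible with the original identity, via inpainting; second, the new face is incorporated into every video frame through temporally consistent face swapping with authentic expression transfer. Compared with DeepPrivacy2, BLANKET better preserves facial attributes, produces fewer artifacts, and has less impact on downstream tasks.
Link: https://arxiv.org/abs/2512.15542
Authors: Ditmar Hadera, Jan Cech, Miroslav Purkrabek, Matej Hoffmann
Affiliations: Czech Technical University in Prague; Czech Science Foundation; EC Digital Europe Programme
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project website: this https URL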
Abstract:Ensuring the ethical use of video data involving human subjects, particularly infants, requires robust anonymization methods. We propose BLANKET (Baby-face Landmark-preserving ANonymization with Keypoint dEtection consisTency), a novel approach designed to anonymize infant faces in video recordings while preserving essential facial attributes. Our method comprises two stages. First, a new random face, compatible with the original identity, is generated via inpainting using a diffusion model. Second, the new identity is seamlessly incorporated into each video frame through temporally consistent face swapping with authentic expression transfer. The method is evaluated on a dataset of short video recordings of babies and is compared to the popular anonymization method, DeepPrivacy2. Key metrics assessed include the level of de-identification, preservation of facial attributes, impact on human pose estimation (as an example of a downstream task), and presence of artifacts. Both methods alter the identity, and our method outperforms DeepPrivacy2 in all other respects. The code is available as an easy-to-use anonymization demo at this https URL.
[CV-26] An Efficient and Effective Encoder Model for Vision and Language Tasks in the Remote Sensing Domain
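【Quick read】: This paper addresses the high computational cost of applying Large Vision and Language Models (LVLMs) in remote sensing: their large parameter counts make training and inference too expensive for most institutions. The key to the solution is GeoMELT (Multi-task Efficient Learning Transformer), a compact multi-task model built on an encoder-only architecture. While keeping the parameter count low, it handles several remote sensing tasks, including a combination rarely modeled jointly: generating text from remote sensing images and cross-modal retrieval. The result is a good balance between performance and efficiency.
Link: https://arxiv.org/abs/2512.15531
Authors: João Daniel Silva, Joao Magalhaes, Devis Tuia, Bruno Martins
Affiliations: INESC-ID, Instituto Superior Técnico, University of Lisbon; Universidade NOVA de Lisboa; EPFL
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: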
Abstract:The remote sensing community has recently seen the emergence of methods based on Large Vision and Language Models (LVLMs) that can address multiple tasks at the intersection of computer vision and natural language processing. To fully exploit the potential of such models, a significant focus has been given to the collection of large amounts of training data that cover multiple remote sensing-specific tasks, such as image captioning or visual question answering. However, the cost of using and training LVLMs is high, due to the large number of parameters. While multiple parameter-efficient adaptation techniques have been explored, the computational costs of training and inference with these models can remain prohibitive for most institutions. In this work, we explore the use of encoder-only architectures and propose a model that can effectively address multi-task learning while remaining compact in terms of the number of parameters. In particular, our model tackles combinations of tasks that are not typically explored in a unified model: the generation of text from remote sensing images and cross-modal retrieval. The results of our GeoMELT model - named from Multi-task Efficient Learning Transformer - in established benchmarks confirm the efficacy and efficiency of the proposed approach.
[CV-27] EmoCaliber: Advancing Reliable Visual Emotion Comprehension via Confidence Verbalization and Calibration
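【Quick read】: This paper addresses the way current Multimodal Large Language Models (MLLMs) treat Visual Emotion Comprehension (VEC) as a deterministic task: the model outputs a single emotion label and ignores the subjectivity and diversity of human emotion perception. To improve reliability, the key idea is to let MLLMs verbalize their confidence in a prediction, conveying both the plausibility of alternative interpretations and the model's self-assessed competence. This is achieved with a three-stage training framework covering structured reasoning, confidence verbalization, and confidence calibration, which yields EmoCaliber, a confidence-aware MLLM for VEC. On the unified benchmark VECBench, EmoCaliber outperforms existing methods in both emotion prediction and confidence estimation.
Link: https://arxiv.org/abs/2512.15528
Authors: Daiqing Wu, Dongbao Yang, Can Ma, Yu Zhou
Affiliations: IIE, Chinese Academy of Sciences; Nankai University; University of Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: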
Abstract:Visual Emotion Comprehension (VEC) aims to infer sentiment polarities or emotion categories from affective cues embedded in images. In recent years, Multimodal Large Language Models (MLLMs) have established a popular paradigm in VEC, leveraging their generalizability to unify VEC tasks defined under diverse emotion taxonomies. While this paradigm achieves notable success, it typically formulates VEC as a deterministic task, requiring the model to output a single, definitive emotion label for each image. Such a formulation insufficiently accounts for the inherent subjectivity of emotion perception, overlooking alternative interpretations that may be equally plausible to different viewers. To address this limitation, we propose equipping MLLMs with capabilities to verbalize their confidence in emotion predictions. This additional signal provides users with an estimate of both the plausibility of alternative interpretations and the MLLMs’ self-assessed competence, thereby enhancing reliability in practice. Building on this insight, we introduce a three-stage training framework that progressively endows with structured reasoning, teaches to verbalize confidence, and calibrates confidence expression, culminating in EmoCaliber, a confidence-aware MLLM for VEC. Through fair and comprehensive evaluations on the unified benchmark VECBench, EmoCaliber demonstrates overall superiority against existing methods in both emotion prediction and confidence estimation. These results validate the effectiveness of our approach and mark a feasible step toward more reliable VEC systems. Project page: this https URL.
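To make "calibrated confidence" concrete, here is a minimal sketch of Expected Calibration Error (ECE), a widely used calibration metric. The abstract does not say which metric the authors report, so treat this choice, the equal-width binning, and the toy numbers as assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: weighted mean of |avg confidence - accuracy| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(avg_conf - accuracy)
    return ece


# Hypothetical model that always says "90% sure" but is right 3 times out of 4.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

In this toy case the stated confidence (0.9) exceeds the observed accuracy (0.75), giving an ECE of about 0.15; a perfectly calibrated model would score 0.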
[CV-28] DeX-Portrait: Disentangled and Expressive Portrait Animation via Explicit and Latent Motion Representations
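【Quick read】: This paper addresses portrait animation from a single source image and a driving video, where existing methods cannot achieve high-fidelity disentangled control of head pose and facial expression, which hinders applications such as expression-only or pose-only editing. The key to the solution is the DeX-Portrait framework, whose main contributions are: (1) representing head pose as an explicit global transformation and facial expression as an implicit latent code; (2) injecting the pose transformation into the diffusion model through a dual-branch conditioning mechanism, and the expression latent through cross attention; (3) a progressive hybrid classifier-free guidance that improves identity consistency. The method outperforms state-of-the-art baselines in both animation quality and disentangled controllability.
Link: https://arxiv.org/abs/2512.15524
Authors: Yuxiang Shi, Zhe Li, Yanwen Wang, Hao Zhu, Xun Cao, Ligang Liu
Affiliations: University of Science and Technology of China; Central Media Technology Institute, Huawei; Nanjing University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL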
Abstract:Portrait animation from a single source image and a driving video is a long-standing problem. Recent approaches tend to adopt diffusion-based image/video generation models for realistic and expressive animation. However, none of these diffusion models realizes high-fidelity disentangled control between the head pose and facial expression, hindering applications like expression-only or pose-only editing and animation. To address this, we propose DeX-Portrait, a novel approach capable of generating expressive portrait animation driven by disentangled pose and expression signals. Specifically, we represent the pose as an explicit global transformation and the expression as an implicit latent code. First, we design a powerful motion trainer to learn both pose and expression encoders for extracting precise and decomposed driving signals. Then we propose to inject the pose transformation into the diffusion model through a dual-branch conditioning mechanism, and the expression latent through cross attention. Finally, we design a progressive hybrid classifier-free guidance for more faithful identity consistency. Experiments show that our method outperforms state-of-the-art baselines on both animation quality and disentangled controllability.
[CV-29] VAAS: Vision-Attention Anomaly Scoring for Image Manipulation Detection in Digital Forensics
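【Quick read】: This paper addresses image forgeries produced by generative AI that evade traditional detectors based on pixel or compression artefacts, and the lack of an explicit measure of anomaly intensity in existing methods. The key to the solution is the Vision-Attention Anomaly Scoring (VAAS) framework, which has two modules: global attention-based anomaly estimation using a Vision Transformer (ViT), and patch-level self-consistency scoring derived from SegFormer embeddings. The two are fused into a continuous, interpretable anomaly score that both localizes and quantifies manipulated regions, improving the accuracy and transparency of image integrity assessment.
Link: https://arxiv.org/abs/2512.15512
Authors: Opeyemi Bamigbade, Mark Scanlon, John Sheppard
Affiliations: Waterford Institute of Technology; University College Dublin; SETU
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: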
Abstract:Recent advances in AI-driven image generation have introduced new challenges for verifying the authenticity of digital evidence in forensic investigations. Modern generative models can produce visually consistent forgeries that evade traditional detectors based on pixel or compression artefacts. Most existing approaches also lack an explicit measure of anomaly intensity, which limits their ability to quantify the severity of manipulation. This paper introduces Vision-Attention Anomaly Scoring (VAAS), a novel dual-module framework that integrates global attention-based anomaly estimation using Vision Transformers (ViT) with patch-level self-consistency scoring derived from SegFormer embeddings. The hybrid formulation provides a continuous and interpretable anomaly score that reflects both the location and degree of manipulation. Evaluations on the DF2023 and CASIA v2.0 datasets demonstrate that VAAS achieves competitive F1 and IoU performance, while enhancing visual explainability through attention-guided anomaly maps. The framework bridges quantitative detection with human-understandable reasoning, supporting transparent and reliable image integrity assessment. The source code for all experiments and corresponding materials for reproducing the results are available open source.
[CV-30] Off The Grid: Detection of Primitives for Feed-Forward 3D Gaussian Splatting
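【Quick read】: This paper addresses a weakness of feed-forward 3D Gaussian Splatting (3DGS) models: primitive placement relies on a dense, rigid, pixel-aligned grid, which limits both efficiency and image quality. The key to the solution is a new feed-forward architecture that replaces the pixel grid with an adaptive "Off The Grid" distribution, using a multi-resolution decoder to detect and distribute 3D Gaussian primitives at a sub-pixel level. The module is trained end-to-end with a 3D reconstruction backbone using self-supervised learning. The resulting pose-free model produces high-quality novel views, outperforms existing feed-forward models, and captures fine detail with fewer artifacts while using far fewer primitives.
Link: https://arxiv.org/abs/2512.15508
Authors: Arthur Moreau, Richard Shaw, Michal Nazarczuk, Jisu Shin, Thomas Tanay, Zhensong Zhang, Songcen Xu, Eduardo Pérez-Pellitero
Affiliations: Huawei Noah’s Ark Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: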
Abstract:Feed-forward 3D Gaussian Splatting (3DGS) models enable real-time scene generation but are hindered by suboptimal pixel-aligned primitive placement, which relies on a dense, rigid grid and limits both quality and efficiency. We introduce a new feed-forward architecture that detects 3D Gaussian primitives at a sub-pixel level, replacing the pixel grid with an adaptive, “Off The Grid” distribution. Inspired by keypoint detection, our multi-resolution decoder learns to distribute primitives across image patches. This module is trained end-to-end with a 3D reconstruction backbone using self-supervised learning. Our resulting pose-free model generates photorealistic scenes in seconds, achieving state-of-the-art novel view synthesis for feed-forward models. It outperforms competitors while using far fewer primitives, demonstrating a more accurate and efficient allocation that captures fine details and reduces artifacts. Moreover, we observe that by learning to render 3D Gaussians, our 3D reconstruction backbone improves camera pose estimation, suggesting opportunities to train these foundational models without labels.
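[CV-31] The LUMirage: An independent evaluation of zero-shot performance in the LUMIR challenge
【Quick read】: This paper questions the overstated "zero-shot generalization" claimed for deep learning methods in medical image registration, specifically the LUMIR challenge's claim that models stay accurate on unseen contrasts and resolutions. Using a rigorous, independent re-evaluation protocol, it tests these methods under realistic conditions, identifies potential instrumentation bias, and quantifies the gap between in-distribution and out-of-distribution performance. The results show that deep learning methods perform comparably to iterative optimization on in-distribution T1-weighted MRI, but degrade significantly on other contrasts (T2, T2*, FLAIR), fail to run on 0.6 mm isotropic high-resolution images, and are sensitive to preprocessing choices. This agrees with the established literature on domain shift, and the authors argue for evaluation protocols that reflect practical clinical workflows rather than conditions that may inadvertently favor particular method classes.
Link: https://arxiv.org/abs/2512.15505
Authors: Rohit Jena, Pratik Chaudhari, James C. Gee
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: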
Abstract:The LUMIR challenge represents an important benchmark for evaluating deformable image registration methods on large-scale neuroimaging data. While the challenge demonstrates that modern deep learning methods achieve competitive accuracy on T1-weighted MRI, it also claims exceptional zero-shot generalization to unseen contrasts and resolutions, assertions that contradict established understanding of domain shift in deep learning. In this paper, we perform an independent re-evaluation of these zero-shot claims using rigorous evaluation protocols while addressing potential sources of instrumentation bias. Our findings reveal a more nuanced picture: (1) deep learning methods perform comparably to iterative optimization on in-distribution T1w images and even on human-adjacent species (macaque), demonstrating improved task understanding; (2) however, performance degrades significantly on out-of-distribution contrasts (T2, T2*, FLAIR), with Cohen’s d scores ranging from 0.7-1.5, indicating substantial practical impact on downstream clinical workflows; (3) deep learning methods face scalability limitations on high-resolution data, failing to run on 0.6 mm isotropic images, while iterative methods benefit from increased resolution; and (4) deep methods exhibit high sensitivity to preprocessing choices. These results align with the well-established literature on domain shift and suggest that claims of universal zero-shot superiority require careful scrutiny. We advocate for evaluation protocols that reflect practical clinical and research workflows rather than conditions that may inadvertently favor particular method classes.
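The abstract reports effect sizes as Cohen's d. As a reference point, the sketch below computes d with the usual pooled standard deviation. The scores are invented for illustration, and whether the authors use this exact variant of d is an assumption.

```python
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    na, nb = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = (((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2)) ** 0.5
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


# Made-up overlap scores for two registration methods on the same contrast.
iterative = [0.78, 0.80, 0.82]
deep = [0.76, 0.78, 0.80]
d = cohens_d(iterative, deep)
```

Here the means differ by 0.02 and the pooled standard deviation is also 0.02, so d is about 1.0, inside the 0.7-1.5 range the abstract quotes. By the common rule of thumb, values around 0.8 and above count as large effects.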
[CV-32] RUMPL: Ray-Based Transformers for Universal Multi-View 2D to 3D Human Pose Lifting
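【Quick read】: This paper addresses 3D human pose estimation from 2D images, which is made difficult by occlusions and projective ambiguity. Multi-view learning-based methods mitigate these issues but generalize poorly to real-world scenes, mainly because multi-view datasets with high-quality 3D annotations are scarce and captured under constrained conditions. The authors propose RUMPL, whose key idea is a 3D ray-based representation of 2D keypoints. This makes the model independent of camera calibration and of the number of views, so it can be deployed on arbitrary multi-view configurations without retraining. A new View Fusion Transformer uses fused-ray tokens to aggregate information along rays, further improving multi-view consistency. Experiments show that RUMPL substantially reduces mean per-joint position error (MPJPE) on several benchmarks compared with triangulation and with transformer baselines that use image representations.
Link: https://arxiv.org/abs/2512.15488
Authors: Seyed Abolfazl Ghasemzadeh, Alexandre Alahi, Christophe De Vleeschouwer
Affiliations: UCLouvain; EPFL
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: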
Abstract:Estimating 3D human poses from 2D images remains challenging due to occlusions and projective ambiguity. Multi-view learning-based approaches mitigate these issues but often fail to generalize to real-world scenarios, as large-scale multi-view datasets with 3D ground truth are scarce and captured under constrained conditions. To overcome this limitation, recent methods rely on 2D pose estimation combined with 2D-to-3D pose lifting trained on synthetic data. Building on our previous MPL framework, we propose RUMPL, a transformer-based 3D pose lifter that introduces a 3D ray-based representation of 2D keypoints. This formulation makes the model independent of camera calibration and the number of views, enabling universal deployment across arbitrary multi-view configurations without retraining or fine-tuning. A new View Fusion Transformer leverages learned fused-ray tokens to aggregate information along rays, further improving multi-view consistency. Extensive experiments demonstrate that RUMPL reduces MPJPE by up to 53% compared to triangulation and over 60% compared to transformer-based image-representation baselines. Results on new benchmarks, including in-the-wild multi-view and multi-person datasets, confirm its robustness and scalability. The framework’s source code is available at this https URL
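The sketch below shows one standard way to turn a 2D keypoint into a 3D ray (pinhole back-projection) and how MPJPE is computed. The paper's actual ray parameterization is not given in the abstract, so the pinhole model, the function names, and the toy values are all assumptions.

```python
import math


def keypoint_to_ray(u, v, fx, fy, cx, cy, rot_c2w, cam_center):
    """Back-project pixel (u, v) to a world-space ray (origin, unit direction).

    fx, fy, cx, cy: pinhole intrinsics; rot_c2w: 3x3 camera-to-world rotation
    as nested lists; cam_center: camera position in world coordinates.
    """
    d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)
    d_world = [sum(rot_c2w[i][j] * d_cam[j] for j in range(3)) for i in range(3)]
    norm = math.sqrt(sum(x * x for x in d_world))
    return tuple(cam_center), tuple(x / norm for x in d_world)


def mpjpe(pred, truth):
    """Mean per-joint position error: average Euclidean distance over joints."""
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(truth)


# A keypoint at the principal point of an axis-aligned camera looks straight
# down the optical (z) axis.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
origin, direction = keypoint_to_ray(320, 240, 500.0, 500.0, 320, 240, identity, (0.0, 0.0, 0.0))

# Two joints, off by 3 and 4 units respectively.
error = mpjpe([(0, 0, 0), (1, 0, 0)], [(0, 0, 3), (1, 4, 0)])
```

Once every view is expressed as rays in a shared world frame, the calibration is carried by the rays themselves, which is one plausible reading of why a ray-based input can be used across different camera setups.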
[CV-33] Evaluation of deep learning architectures for wildlife object detection: A comparative study of ResNet and Inception
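【Quick read】: This paper addresses wildlife object detection in complex conditions, where environmental variability, visual similarity between species, and intra-class diversity reduce detection accuracy. It evaluates two established deep learning architectures, ResNet-101 and Inception v3, using a standardized preprocessing pipeline (resizing images to a maximum dimension of 800 pixels, converting to RGB, and transforming into PyTorch tensors) and a 70:30 training/validation split. Both models perform well. Inception v3 is slightly better than ResNet-101, with 95% classification accuracy and a mean Average Precision (mAP) of 0.92 versus 94% and 0.91, which the authors attribute to its multi-scale feature extraction through parallel convolutions. The findings support both models as a reliable basis for conservation-focused computer vision tasks.
Link: https://arxiv.org/abs/2512.15480
Authors: Malach Obisa Amonga, Benard Osero, Edna Too
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: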
Abstract:Wildlife object detection plays a vital role in biodiversity conservation, ecological monitoring, and habitat protection. However, this task is often challenged by environmental variability, visual similarities among species, and intra-class diversity. This study investigates the effectiveness of two individual deep learning architectures ResNet-101 and Inception v3 for wildlife object detection under such complex conditions. The models were trained and evaluated on a wildlife image dataset using a standardized preprocessing approach, which included resizing images to a maximum dimension of 800 pixels, converting them to RGB format, and transforming them into PyTorch tensors. A ratio of 70:30 training and validation split was used for model development. The ResNet-101 model achieved a classification accuracy of 94% and a mean Average Precision (mAP) of 0.91, showing strong performance in extracting deep hierarchical features. The Inception v3 model performed slightly better, attaining a classification accuracy of 95% and a mAP of 0.92, attributed to its efficient multi-scale feature extraction through parallel convolutions. Despite the strong results, both models exhibited challenges when detecting species with similar visual characteristics or those captured under poor lighting and occlusion. Nonetheless, the findings confirm that both ResNet-101 and Inception v3 are effective models for wildlife object detection tasks and provide a reliable foundation for conservation-focused computer vision applications.
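The two preprocessing steps that can be stated precisely, capping the longest side at 800 pixels and splitting 70:30, are sketched below in plain Python. The helper names, the choice not to upscale small images, and the seeded shuffle are assumptions; the study's own pipeline additionally converts to RGB and to PyTorch tensors.

```python
import random


def fit_max_dimension(width, height, max_dim=800):
    """Return a new (width, height) whose longest side is at most max_dim.

    Aspect ratio is preserved. Assumption: images already within the limit
    are left unchanged rather than upscaled.
    """
    scale = max_dim / max(width, height)
    if scale >= 1.0:
        return width, height
    return max(1, round(width * scale)), max(1, round(height * scale))


def split_train_val(items, train_fraction=0.7, seed=0):
    """Shuffle deterministically, then split into training and validation lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


new_size = fit_max_dimension(1600, 1200)   # a hypothetical 1600x1200 photo
train, val = split_train_val(range(10))     # ten hypothetical image ids
```

Fixing the shuffle seed keeps the split reproducible, which matters when two architectures are compared on the same validation images.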
[CV-34] ST-DETrack: Identity-Preserving Branch Tracking in Entangled Plant Canopies via Dual Spatiotemporal Evidence
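【Quick read】: This paper addresses the automated extraction of individual plant branches from time-series imagery for high-throughput phenotyping, which is difficult because of non-rigid growth dynamics and fragmented branch identities within dense canopies. The key to the solution is ST-DETrack, a spatiotemporal-fusion dual-decoder network. A spatial decoder uses geometric priors such as position and angle for early-stage tracking, while a temporal decoder exploits motion consistency to resolve late-stage occlusions. An adaptive gating mechanism dynamically shifts reliance between spatial and temporal cues, and a biological constraint based on negative gravitropism reduces ambiguities caused by vertical growth. Together these keep branch identities consistent from budding to flowering.
Link: https://arxiv.org/abs/2512.15445
Authors: Yueqianji Chen, Kevin Williams, John H. Doonan, Paolo Remagnino, Jo Hepworth
Affiliations: Durham University; Aberystwyth University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under Review at IEEE Transactions on Image Processing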
Abstract:Automated extraction of individual plant branches from time-series imagery is essential for high-throughput phenotyping, yet it remains computationally challenging due to non-rigid growth dynamics and severe identity fragmentation within entangled canopies. To overcome these stage-dependent ambiguities, we propose ST-DETrack, a spatiotemporal-fusion dual-decoder network designed to preserve branch identity from budding to flowering. Our architecture integrates a spatial decoder, which leverages geometric priors such as position and angle for early-stage tracking, with a temporal decoder that exploits motion consistency to resolve late-stage occlusions. Crucially, an adaptive gating mechanism dynamically shifts reliance between these spatial and temporal cues, while a biological constraint based on negative gravitropism mitigates vertical growth ambiguities. Validated on a Brassica napus dataset, ST-DETrack achieves a Branch Matching Accuracy (BMA) of 93.6%, significantly outperforming spatial and temporal baselines by 28.9 and 3.3 percentage points, respectively. These results demonstrate the method’s robustness in maintaining long-term identity consistency amidst complex, dynamic plant architectures.
[CV-35] CLIP-FTI: Fine-Grained Face Template Inversion via CLIP-Driven Attribute Conditioning AAAI2026
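【Quick read】: This paper addresses two weaknesses of existing face template inversion attacks: reconstructed images have over-smoothed facial-part attributes (eyes, nose, mouth), and the attacks transfer poorly across models. The key to the solution is CLIP-FTI, a CLIP-driven fine-grained attribute conditioning framework. It uses the CLIP model to obtain semantic embeddings of facial features, fuses these attribute embeddings with the leaked face template through a cross-modal feature interaction network, and projects the result into the intermediate latent space of a pretrained StyleGAN. The generated images keep the template's identity while showing finer facial attribute detail, improving identification accuracy, attribute similarity, and cross-model attack transferability.
Link: https://arxiv.org/abs/2512.15433
Authors: Longchen Dai, Zixuan Shen, Zhiheng Zhou, Peipeng Yu, Zhihua Xia
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2026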
Abstract:Face recognition systems store face templates for efficient matching. Once leaked, these templates pose a threat: inverting them can yield photorealistic surrogates that compromise privacy and enable impersonation. Although existing research has achieved relatively realistic face template inversion, the reconstructed facial images exhibit over-smoothed facial-part attributes (eyes, nose, mouth) and limited transferability. To address this problem, we present CLIP-FTI, a CLIP-driven fine-grained attribute conditioning framework for face template inversion. Our core idea is to use the CLIP model to obtain the semantic embeddings of facial features, in order to realize the reconstruction of specific facial feature attributes. Specifically, facial feature attribute embeddings extracted from CLIP are fused with the leaked template via a cross-modal feature interaction network and projected into the intermediate latent space of a pretrained StyleGAN. The StyleGAN generator then synthesizes face images with the same identity as the templates but with more fine-grained facial feature attributes. Experiments across multiple face recognition backbones and datasets show that our reconstructions (i) achieve higher identification accuracy and attribute similarity, (ii) recover sharper component-level attribute semantics, and (iii) improve cross-model attack transferability compared to prior reconstruction attacks. To the best of our knowledge, ours is the first method to use additional information besides the face template attack to realize face template inversion and obtains SOTA results.
[CV-36] Step-GUI Technical Report
[Quick read]: This paper addresses a core challenge in GUI automation: acquiring high-quality training data efficiently while keeping annotations reliable. The key to the solution is a self-evolving training pipeline driven by the Calibrated Step Reward System, which uses trajectory-level calibration to turn model-generated trajectories into reliable training signals, reaching 90% annotation accuracy at 10-100x lower cost. The pipeline underpins the Step-GUI models, and the accompanying GUI-MCP protocol provides a standardized interface across devices with high-privacy execution, supporting practical deployment of GUI agents in everyday scenarios.
Link: https://arxiv.org/abs/2512.15431
Authors: Haolong Yan,Jia Wang,Xin Huang,Yeqing Shen,Ziyang Meng,Zhimin Fan,Kaijun Tan,Jin Gao,Lieyu Shi,Mi Yang,Shiliang Yang,Zhirui Wang,Brian Li,Kang An,Chenyang Li,Lei Lei,Mengmeng Duan,Danxun Liang,Guodong Liu,Hang Cheng,Hao Wu,Jie Dong,Junhao Huang,Mei Chen,Renjie Yu,Shunshan Li,Xu Zhou,Yiting Dai,Yineng Deng,Yingdan Liang,Zelin Chen,Wen Sun,Chengxu Yan,Chunqin Xu,Dong Li,Fengqiong Xiao,Guanghao Fan,Guopeng Li,Guozhen Peng,Hongbing Li,Hang Li,Hongming Chen,Jingjing Xie,Jianyong Li,Jingyang Zhang,Jiaju Ren,Jiayu Yuan,Jianpeng Yin,Kai Cao,Liang Zhao,Liguo Tan,Liying Shi,Mengqiang Ren,Min Xu,Manjiao Liu,Mao Luo,Mingxin Wan,Na Wang,Nan Wu,Ning Wang,Peiyao Ma,Qingzhou Zhang,Qiao Wang,Qinlin Zeng,Qiong Gao,Qiongyao Li,Shangwu Zhong,Shuli Gao,Shaofan Liu,Shisi Gao,Shuang Luo,Xingbin Liu,Xiaojia Liu,Xiaojie Hou,Xin Liu,Xuanti Feng,Xuedan Cai,Xuan Wen,Xianwei Zhu,Xin Liang,Xin Liu,Xin Zhou,Yingxiu Zhao,Yukang Shi,Yunfang Xu,Yuqing Zeng,Yixun Zhang,Zejia Weng,Zhonghao Yan,Zhiguo Huang,Zhuoyu Wang,Zheng Ge,Jing Li,Yibo Zhu,Binxing Jiao,Xiangyu Zhang,Daxin Jiang
Affiliations: StepFun; GELab-Team
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 41 pages, 26 figures
Abstract:Recent advances in multimodal large language models unlock unprecedented opportunities for GUI automation. However, a fundamental challenge remains: how to efficiently acquire high-quality training data while maintaining annotation reliability? We introduce a self-evolving training pipeline powered by the Calibrated Step Reward System, which converts model-generated trajectories into reliable training signals through trajectory-level calibration, achieving 90% annotation accuracy with 10-100x lower cost. Leveraging this pipeline, we introduce Step-GUI, a family of models (4B/8B) that achieves state-of-the-art GUI performance (8B: 80.2% AndroidWorld, 48.5% OSWorld, 62.6% ScreenShot-Pro) while maintaining robust general capabilities. As GUI agent capabilities improve, practical deployment demands standardized interfaces across heterogeneous devices while protecting user privacy. To this end, we propose GUI-MCP, the first Model Context Protocol for GUI automation with hierarchical architecture that combines low-level atomic operations and high-level task delegation to local specialist models, enabling high-privacy execution where sensitive data stays on-device. Finally, to assess whether agents can handle authentic everyday usage, we introduce AndroidDaily, a benchmark grounded in real-world mobile usage patterns with 3146 static actions and 235 end-to-end tasks across high-frequency daily scenarios (8B: static 89.91%, end-to-end 52.50%). Our work advances the development of practical GUI agents and demonstrates strong potential for real-world deployment in everyday digital interactions.
[CV-37] Photorealistic Phantom Roads in Real Scenes: Disentangling 3D Hallucinations from Physical Geometry
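[Quick read]: This paper addresses the tendency of monocular depth estimation (MDE) foundation models to hallucinate spurious 3D structure from inputs that are geometrically planar but perceptually ambiguous, a failure the authors call the 3D Mirage. The resulting depth predictions are unreliable in real scenes and pose a safety risk that had not been quantified. The solution is an end-to-end framework with three components. First, 3D-Mirage, the first benchmark of real-world illusions, with precise planar-region annotations and context-restricted crops. Second, a Laplacian-based evaluation framework with the Deviation Composite Score (DCS) for spurious non-planarity and the Confusion Composite Score (CCS) for contextual instability. Third, Grounded Self-Distillation, a parameter-efficient strategy that enforces planarity on illusion regions of interest (ROIs) while a frozen teacher preserves background knowledge, avoiding catastrophic forgetting.
Link: https://arxiv.org/abs/2512.15423
Authors: Hoang Nguyen,Xiaohao Xu,Xiaonan Huang
Affiliations: University of Michigan, Ann Arbor
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: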
Abstract:Monocular depth foundation models achieve remarkable generalization by learning large-scale semantic priors, but this creates a critical vulnerability: they hallucinate illusory 3D structures from geometrically planar but perceptually ambiguous inputs. We term this failure the 3D Mirage. This paper introduces the first end-to-end framework to probe, quantify, and tame this unquantified safety risk. To probe, we present 3D-Mirage, the first benchmark of real-world illusions (e.g., street art) with precise planar-region annotations and context-restricted crops. To quantify, we propose a Laplacian-based evaluation framework with two metrics: the Deviation Composite Score (DCS) for spurious non-planarity and the Confusion Composite Score (CCS) for contextual instability. To tame this failure, we introduce Grounded Self-Distillation, a parameter-efficient strategy that surgically enforces planarity on illusion ROIs while using a frozen teacher to preserve background knowledge, thus avoiding catastrophic forgetting. Our work provides the essential tools to diagnose and mitigate this phenomenon, urging a necessary shift in MDE evaluation from pixel-wise accuracy to structural and contextual robustness. Our code and benchmark will be publicly available to foster this exciting research direction.
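The intuition behind a Laplacian-based planarity check can be sketched as follows. This is a minimal illustration only: the abstract does not define DCS or CCS, so the helper names `laplacian` and `mean_abs_laplacian` are hypothetical and the score below is a generic stand-in, not the paper's metric. It relies on the fact that the discrete Laplacian of a perfectly planar depth ramp is zero, so any non-zero response inside an annotated planar region signals spurious structure.

```python
def laplacian(depth):
    """4-neighbour discrete Laplacian on the interior of a 2D grid (list of rows)."""
    h, w = len(depth), len(depth[0])
    out = []
    for y in range(1, h - 1):
        row = []
        for x in range(1, w - 1):
            row.append(
                depth[y - 1][x] + depth[y + 1][x]
                + depth[y][x - 1] + depth[y][x + 1]
                - 4.0 * depth[y][x]
            )
        out.append(row)
    return out


def mean_abs_laplacian(depth):
    """Average absolute Laplacian response; 0 for an ideal plane."""
    values = [abs(v) for row in laplacian(depth) for v in row]
    return sum(values) / len(values)


# A planar depth ramp (hypothetical data): depth = 2 + 0.5*x + 0.25*y.
plane = [[2.0 + 0.5 * x + 0.25 * y for x in range(6)] for y in range(6)]

# The same ramp with one "hallucinated" bump added to a single pixel.
bumped = [row[:] for row in plane]
bumped[3][3] += 1.0

print(mean_abs_laplacian(plane))   # no curvature on a plane
print(mean_abs_laplacian(bumped))  # positive: the bump breaks planarity
```

In practice such a score would be restricted to the annotated planar region of each image rather than computed over the whole depth map.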
[CV-38] MiVLA: Towards Generalizable Vision-Language-Action Model with Human-Robot Mutual Imitation Pre-training
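[Quick read]: This paper addresses the limited transfer and generalization of current vision-language-action models (VLAs), in particular the performance loss caused by mismatches in camera views, visual appearance, and embodiment morphology. The key to the solution is MiVLA, a generalizable VLA built on human-robot mutual imitation pre-training. Its core idea is to exploit the inherent behavioral similarity between human hands and robotic arms to build strong behavioral priors, and to use kinematic rules with left/right hand coordinate systems for bidirectional alignment of the human and robot action spaces. This unifies the behavioral fidelity of real human data with the manipulative diversity of simulated robot data in a single model and improves generalization on downstream tasks.
Link: https://arxiv.org/abs/2512.15411
Authors: Zhenhan Yin,Xuanhan Wang,Jiahao Jiang,Kaiyuan Deng,Pengqi Chen,Shuangle Li,Chong Liu,Xing Xu,Jingkuan Song,Lianli Gao,Heng Tao Shen
Affiliations: Tongji University; University of Electronic Science and Technology of China
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: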
Abstract:While leveraging abundant human videos and simulated robot data poses a scalable solution to the scarcity of real-world robot data, the generalization capability of existing vision-language-action models (VLAs) remains limited by mismatches in camera views, visual appearance, and embodiment morphologies. To overcome this limitation, we propose MiVLA, a generalizable VLA empowered by human-robot mutual imitation pre-training, which leverages inherent behavioral similarity between human hands and robotic arms to build a foundation of strong behavioral priors for both human actions and robotic control. Specifically, our method utilizes kinematic rules with left/right hand coordinate systems for bidirectional alignment between human and robot action spaces. Given human or simulated robot demonstrations, MiVLA is trained to forecast behavior trajectories for one embodiment, and imitate behaviors for another one unseen in the demonstration. Based on this mutual imitation, it integrates the behavioral fidelity of real-world human data with the manipulative diversity of simulated robot data into a unified model, thereby enhancing the generalization capability for downstream tasks. Extensive experiments conducted on both simulation and real-world platforms with three robots (ARX, PiPer and LocoMan) demonstrate that MiVLA achieves strongly improved generalization capability, outperforming state-of-the-art VLAs (e.g., $\pi_0$, $\pi_{0.5}$ and H-RDT) by 25% in simulation, and 14% in real-world robot control tasks.
[CV-39] Preserving Marker Specificity with Lightweight Channel-Independent Representation Learning
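[Quick read]: This paper addresses a structural bias in deep learning models for self-supervised representation learning on multiplexed tissue imaging: conventional early-fusion CNNs assume that all protein markers share the same structure, which makes it hard to retain marker-specific information and particularly hurts rare-cell discrimination. The key to the solution is a channel-independent architecture combined with a deliberately shallow design, an inductive bias that better matches the independence of markers in multiplex data. Experiments show that even the lightweight channel-independent model CIM-S, with only 5.5K parameters, yields substantially stronger representations than large early-fusion models after contrastive pretraining and linear evaluation, and can match or surpass foundation models.
Link: https://arxiv.org/abs/2512.15410
Authors: Simon Gutwein,Arthur Longuefosse,Jun Seita,Sabine Taschner-Mandl,Roxane Licandro
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 9 figures, MIDL 2026 conference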
Abstract:Multiplexed tissue imaging measures dozens of protein markers per cell, yet most deep learning models still apply early channel fusion, assuming shared structure across markers. We investigate whether preserving marker independence, combined with deliberately shallow architectures, provides a more suitable inductive bias for self-supervised representation learning in multiplex data than increasing model scale. Using a Hodgkin lymphoma CODEX dataset with 145,000 cells and 49 markers, we compare standard early-fusion CNNs with channel-separated architectures, including a marker-aware baseline and our novel shallow Channel-Independent Model (CIM-S) with 5.5K parameters. After contrastive pretraining and linear evaluation, early-fusion models show limited ability to retain marker-specific information and struggle particularly with rare-cell discrimination. Channel-independent architectures, and CIM-S in particular, achieve substantially stronger representations despite their compact size. These findings are consistent across multiple self-supervised frameworks, remain stable across augmentation settings, and are reproducible across both the 49-marker and reduced 18-marker settings. These results show that lightweight, channel-independent architectures can match or surpass deep early-fusion CNNs and foundation models for multiplex representation learning. Code is available at this https URL.
[CV-40] SMART: Semantic Matching Contrastive Learning for Partially View-Aligned Clustering
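[Quick read]: This paper addresses two core challenges in Partially View-aligned Clustering (PVC). First, existing methods fail to use unaligned data to capture the semantics shared by samples of the same cluster. Second, the inherent heterogeneity of multi-view data shifts the representation distributions, which makes it hard to establish meaningful correspondences between cross-view latent features and degrades clustering. The key to the solution is SMART (Semantic MAtching contRasTive learning), a model that reduces the influence of cross-view distribution shift so that semantic matching contrastive learning can fully exploit semantic relationships in both aligned and unaligned data.
Link: https://arxiv.org/abs/2512.15396
Authors: Liang Peng,Yixuan Ye,Cheng Liu,Hangjun Che,Fei Wang,Zhiwen Yu,Si Wu,Hau-San Wong
Affiliations: Huaqiao University; Shantou University; Southwest University; South China University of Technology; City University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: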
Abstract:Multi-view clustering has been empirically shown to improve learning performance by leveraging the inherent complementary information across multiple views of data. However, in real-world scenarios, collecting strictly aligned views is challenging, and learning from both aligned and unaligned data becomes a more practical solution. Partially View-aligned Clustering aims to learn correspondences between misaligned view samples to better exploit the potential consistency and complementarity across views, including both aligned and unaligned data. However, most existing PVC methods fail to leverage unaligned data to capture the shared semantics among samples from the same cluster. Moreover, the inherent heterogeneity of multi-view data induces distributional shifts in representations, leading to inaccuracies in establishing meaningful correspondences between cross-view latent features and, consequently, impairing learning effectiveness. To address these challenges, we propose a Semantic MAtching contRasTive learning model (SMART) for PVC. The main idea of our approach is to alleviate the influence of cross-view distributional shifts, thereby facilitating semantic matching contrastive learning to fully exploit semantic relationships in both aligned and unaligned data. Extensive experiments on eight benchmark datasets demonstrate that our method consistently outperforms existing approaches on the PVC problem.
[CV-41] See It Before You Grab It: Deep Learning-based Action Anticipation in Basketball
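[Quick read]: This paper addresses action anticipation for basketball rebounds: predicting which team will gain possession after a shot attempt, before the rebound occurs. The task has received little attention in sports video understanding, yet it matters for real-time automated broadcasting and post-game analysis tools. The key to the solution is a large self-curated dataset (100,000 video clips, over 300 hours of footage, and more than 2,000 manually annotated rebound events) together with the first application of deep learning to basketball rebound prediction. Two complementary tasks, rebound classification and rebound spotting, are also explored. The results confirm that the task is feasible and show how challenging it is in dynamic multi-agent scenes.
Link: https://arxiv.org/abs/2512.15386
Authors: Arnau Barrera Roy,Albert Clapés Sintes
Affiliations: Universitat de Barcelona; Computer Vision Center
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: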
Abstract:Computer vision and video understanding have transformed sports analytics by enabling large-scale, automated analysis of game dynamics from broadcast footage. Despite significant advances in player and ball tracking, pose estimation, action localization, and automatic foul recognition, anticipating actions before they occur in sports videos has received comparatively little attention. This work introduces the task of action anticipation in basketball broadcast videos, focusing on predicting which team will gain possession of the ball following a shot attempt. To benchmark this task, a new self-curated dataset comprising 100,000 basketball video clips, over 300 hours of footage, and more than 2,000 manually annotated rebound events is presented. Comprehensive baseline results are reported using state-of-the-art action anticipation methods, representing the first application of deep learning techniques to basketball rebound prediction. Additionally, two complementary tasks, rebound classification and rebound spotting, are explored, demonstrating that this dataset supports a wide range of video understanding applications in basketball, for which no comparable datasets currently exist. Experimental results highlight both the feasibility and inherent challenges of anticipating rebounds, providing valuable insights into predictive modeling for dynamic multi-agent sports scenarios. By forecasting team possession before rebounds occur, this work enables applications in real-time automated broadcasting and post-game analysis tools to support decision-making.
[CV-42] Image Complexity-Aware Adaptive Retrieval for Efficient Vision-Language Models ECIR2026
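[Quick read]: This paper addresses the fact that vision transformers in vision-language models spend the same compute on every image, which wastes resources: ViT-L/14 uses 175.33 GFLOPs whether it processes a simple product photo or a complex street scene. The key to the solution is ICAR (Image Complexity-Aware Retrieval), whose dual-path training keeps image embeddings produced at different processing depths (early exit or full depth) aligned with text embeddings in the same semantic space, so images and text can be matched directly without reranking. To decide how much compute an image needs, the authors build ConvNeXt-IC, which treats image complexity assessment as a classification task, improving inference efficiency while maintaining accuracy.
Link: https://arxiv.org/abs/2512.15372
Authors: Mikel Williams-Lekuona,Georgina Cosma
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: Accepted paper for ECIR 2026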
Abstract:Vision transformers in vision-language models apply uniform computational effort across all images, expending 175.33 GFLOPs (ViT-L/14) whether analysing a straightforward product photograph or a complex street scene. We propose ICAR (Image Complexity-Aware Retrieval), which enables vision transformers to use less compute for simple images whilst processing complex images through their full network depth. The key challenge is maintaining cross-modal alignment: embeddings from different processing depths must remain compatible for text matching. ICAR solves this through dual-path training that produces compatible embeddings from both reduced-compute and full-compute processing. This maintains compatibility between image representations and text embeddings in the same semantic space, whether an image exits early or processes fully. Unlike existing two-stage approaches that require expensive reranking, ICAR enables direct image-text matching without additional overhead. To determine how much compute to use, we develop ConvNeXt-IC, which treats image complexity assessment as a classification task. By applying modern classifier backbones rather than specialised architectures, ConvNeXt-IC achieves state-of-the-art performance with 0.959 correlation with human judgement (Pearson) and 4.4x speedup. Evaluated on standard benchmarks augmented with real-world web data, ICAR achieves 20% practical speedup while maintaining category-level performance and 95% of instance-level performance, enabling sustainable scaling of vision-language systems.
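The routing idea can be illustrated with a small cost model. This is a sketch under stated assumptions: only the 175.33 GFLOPs figure comes from the abstract, while the complexity threshold, the `early_fraction` cost of the early-exit path, and the function names are hypothetical values chosen for illustration, not ICAR's actual settings.

```python
FULL_GFLOPS = 175.33  # ViT-L/14 cost per image, as quoted in the abstract


def route(complexity, threshold=0.5):
    """Send simple images to the early-exit path and complex ones to full depth.

    `complexity` is assumed to be a score in [0, 1] from some complexity
    classifier; the threshold is a hypothetical choice.
    """
    return "early_exit" if complexity < threshold else "full"


def average_gflops(complexities, early_fraction=0.5, threshold=0.5):
    """Mean per-image cost when early exit pays `early_fraction` of full cost."""
    total = 0.0
    for c in complexities:
        if route(c, threshold) == "early_exit":
            total += FULL_GFLOPS * early_fraction
        else:
            total += FULL_GFLOPS
    return total / len(complexities)


# One simple image and one complex image (hypothetical scores).
print(average_gflops([0.1, 0.9]))
```

The saving depends entirely on how many images fall below the threshold and on where the early exit sits in the network, which is why the abstract reports a measured speedup rather than a theoretical one.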
[CV-43] SemanticBridge – A Dataset for 3D Semantic Segmentation of Bridges and Domain Gap Analysis
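[Quick read]: This paper addresses the domain gap caused by sensor differences in 3D semantic segmentation of bridges, a key challenge for infrastructure inspection and maintenance. The key to the solution is a dataset of high-resolution 3D scans designed for bridge structures, covering diverse bridges from several countries with detailed semantic labels. Data acquired with multiple sensors is used to quantify the effect of sensor variation, providing a benchmark for how robust existing 3D deep learning models are in cross-sensor settings.
Link: https://arxiv.org/abs/2512.15369
Authors: Maximilian Kellner,Mariana Ferrandon Cervantes,Yuandong Pan,Ruodan Lu,Ioannis Brilakis,Alexander Reiterer
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: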
Abstract:We propose a novel dataset that has been specifically designed for 3D semantic segmentation of bridges and the domain gap analysis caused by varying sensors. This addresses a critical need in the field of infrastructure inspection and maintenance, which is essential for modern society. The dataset comprises high-resolution 3D scans of a diverse range of bridge structures from various countries, with detailed semantic labels provided for each. Our initial objective is to facilitate accurate and automated segmentation of bridge components, thereby advancing the structural health monitoring practice. To evaluate the effectiveness of existing 3D deep learning models on this novel dataset, we conduct a comprehensive analysis of three distinct state-of-the-art architectures. Furthermore, we present data acquired through diverse sensors to quantify the domain gap resulting from sensor variations. Our findings indicate that all architectures demonstrate robust performance on the specified task. However, the domain gap can potentially lead to a decline in the performance of up to 11.4% mIoU.
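Since the domain gap is reported in mIoU, a short sketch of how mean Intersection-over-Union is commonly computed over per-point labels may help. The function name and the toy labels are illustrative assumptions; the paper's exact evaluation protocol (for example, how absent classes are handled) is not stated in the abstract.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both pred and target
            ious.append(inter / union)
    return sum(ious) / len(ious)


# Toy per-point labels for three hypothetical bridge component classes.
target = [0, 0, 1, 1, 2, 2]
pred = [0, 0, 1, 2, 2, 2]
print(mean_iou(pred, target, num_classes=3))
```

Comparing this score for a model tested on scans from the training sensor against scans from a different sensor gives the kind of gap the abstract quantifies.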
[CV-44] Expand and Prune: Maximizing Trajectory Diversity for Effective GRPO in Generative Models
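[Quick read]: This paper addresses the high computational cost that large group sizes impose on Group Relative Policy Optimization (GRPO) for generative models. The core difficulty is reward clustering: as the group grows, many trajectories collapse toward the group-mean reward, offer little optimization value, and waste compute. The key to the solution is Pro-GRPO (Proactive GRPO), a dynamic framework that integrates latent feature-based trajectory pruning into sampling. It terminates reward-clustered trajectories early to cut overhead and follows an "Expand-and-Prune" strategy: first enlarge the initial sampling group to increase trajectory diversity, then apply multi-step Optimal Variance Filtering (OVF) to the latents, which removes redundant computation while preserving performance.
Link: https://arxiv.org/abs/2512.15347
Authors: Shiran Ge,Chenyi Huang,Yuang Ai,Qihang Fan,Huaibo Huang,Ran He
Affiliations: MAIS & NLPR, Institute of Automation, CAS; School of Artificial Intelligence, UCAS
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 10 pages, 5 figures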
Abstract:Group Relative Policy Optimization (GRPO) is a powerful technique for aligning generative models, but its effectiveness is bottlenecked by the conflict between large group sizes and prohibitive computational costs. In this work, we investigate the trade-off through empirical studies, yielding two key observations. First, we discover the reward clustering phenomenon in which many trajectories collapse toward the group-mean reward, offering limited optimization value. Second, we design a heuristic strategy named Optimal Variance Filtering (OVF), and verify that a high-variance subset of trajectories selected by OVF can outperform the larger, unfiltered group. However, this static, post-sampling OVF approach still incurs significant computational overhead, as it performs unnecessary sampling for trajectories that are ultimately discarded. To resolve this, we propose Pro-GRPO (Proactive GRPO), a novel dynamic framework that integrates latent feature-based trajectory pruning into the sampling process. Through the early termination of reward-clustered trajectories, Pro-GRPO reduces computational overhead. Leveraging its efficiency, Pro-GRPO employs an “Expand-and-Prune” strategy. This strategy first expands the size of the initial sampling group to maximize trajectory diversity, then applies multi-step OVF to the latents, avoiding prohibitive computational costs. Extensive experiments on both diffusion-based and flow-based models demonstrate the generality and effectiveness of our Pro-GRPO framework.
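The reward clustering observation and the effect of keeping a high-variance subset can be sketched in a few lines. The abstract does not spell out how OVF selects trajectories, so the alternating high/low selection in `high_variance_subset` is an assumed stand-in heuristic, and the reward values are made up for illustration.

```python
from statistics import mean, pstdev, pvariance


def high_variance_subset(rewards, k):
    """Pick k indices alternately from the highest and lowest rewards.

    A simple stand-in for variance-based filtering: trajectories near the
    group mean are the last to be selected.
    """
    order = sorted(range(len(rewards)), key=lambda i: rewards[i])
    lo, hi = 0, len(order) - 1
    chosen, take_high = [], True
    while len(chosen) < k and lo <= hi:
        if take_high:
            chosen.append(order[hi])
            hi -= 1
        else:
            chosen.append(order[lo])
            lo += 1
        take_high = not take_high
    return sorted(chosen)


def group_relative_advantages(rewards):
    """Standardize rewards within the group, as in group-relative methods."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]


# Most rewards cluster near 0.5; two trajectories carry most of the signal.
rewards = [0.50, 0.52, 0.49, 0.90, 0.51, 0.10, 0.50, 0.48]
kept = high_variance_subset(rewards, k=4)
subset = [rewards[i] for i in kept]
print(kept, pvariance(subset), pvariance(rewards))
```

The filtered subset has higher reward variance than the full group, so its group-relative advantages are less dominated by near-zero entries. Pro-GRPO's contribution, per the abstract, is doing this kind of pruning during sampling on latents instead of after all trajectories have been fully generated.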
[CV-45] Towards Seamless Interaction: Causal Turn-Level Modeling of Interactive 3D Conversational Head Dynamics
[Quick read]: This paper addresses the loss of temporal coherence across turns in existing 3D conversational head generation frameworks, which either model talking and listening independently or rely on non-causal full-sequence modeling. The solution is TIMAR (Turn-level Interleaved Masked AutoRegression), which fuses multimodal information within each turn and applies turn-level causal attention to accumulate conversational history, giving a causal model of the bidirectional interaction. A lightweight diffusion head predicts continuous 3D head dynamics and captures both coordination and expressive variability.
Link: https://arxiv.org/abs/2512.15340
Authors: Junjie Chen,Fei Wang,Zhihao Huang,Qing Zhou,Kun Li,Dan Guo,Linfeng Zhang,Xun Yang
Affiliations: Hefei University of Technology; IAI, Hefei Comprehensive National Science Center; USTC; SJTU; TeleAI, China Telecom; Northwestern Polytechnical University; United Arab Emirates University; Anhui Polytechnic University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Human conversation involves continuous exchanges of speech and nonverbal cues such as head nods, gaze shifts, and facial expressions that convey attention and emotion. Modeling these bidirectional dynamics in 3D is essential for building expressive avatars and interactive robots. However, existing frameworks often treat talking and listening as independent processes or rely on non-causal full-sequence modeling, hindering temporal coherence across turns. We present TIMAR (Turn-level Interleaved Masked AutoRegression), a causal framework for 3D conversational head generation that models dialogue as interleaved audio-visual contexts. It fuses multimodal information within each turn and applies turn-level causal attention to accumulate conversational history, while a lightweight diffusion head predicts continuous 3D head dynamics that captures both coordination and expressive variability. Experiments on the DualTalk benchmark show that TIMAR reduces Fréchet Distance and MSE by 15-30% on the test set, and achieves similar gains on out-of-distribution data. The source code will be released in the GitHub repository this https URL.
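Turn-level causal attention can be pictured as a block-causal mask: tokens see everything in their own turn and in earlier turns, but nothing from later turns. The sketch below is a generic construction of such a mask, assuming each token carries an integer turn index; the function name is hypothetical and TIMAR's actual masking scheme may differ in detail.

```python
def turn_level_causal_mask(turn_ids):
    """Return mask[i][j] = True when token i may attend to token j.

    Attention is bidirectional inside a turn and causal across turns:
    token j is visible to token i only if j's turn is not later than i's.
    """
    return [[tj <= ti for tj in turn_ids] for ti in turn_ids]


# Six tokens spread over three conversational turns (hypothetical layout).
turn_ids = [0, 0, 1, 1, 1, 2]
mask = turn_level_causal_mask(turn_ids)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

Compared with a token-level causal mask, this lets audio and visual tokens within the same turn inform each other while still preventing any leakage from future turns.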
[CV-46] A Preprocessing Framework for Video Machine Vision under Compression
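[Quick read]: This paper addresses the mismatch between conventional video coding optimization, which targets human perceptual quality metrics such as PSNR or SSIM, and the stricter information-fidelity needs of machine vision systems. The solution is a video preprocessing framework for machine vision tasks. Its key component is a neural preprocessor that, before compression, retains the information most important to downstream tasks and thereby improves rate-accuracy performance. A differentiable virtual codec imposes rate and distortion constraints during training, while standard codecs are used directly at test time, which keeps the method deployable in real-world settings.
Link: https://arxiv.org/abs/2512.15331
Authors: Fei Zhao,Mengxi Guo,Shijie Zhao,Junlin Li,Li Zhang,Xiaodong Xie
Affiliations: Bytedance; Peking University
Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted as a POSTER and for publication in the DCC 2024 proceedings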
Abstract:There has been a growing trend in compressing and transmitting videos from terminals for machine vision tasks. Nevertheless, most video coding optimization methods focus on minimizing distortion according to human perceptual metrics, overlooking the heightened demands posed by machine vision systems. In this paper, we propose a video preprocessing framework tailored for machine vision tasks to address this challenge. The proposed method incorporates a neural preprocessor that retains crucial information for subsequent tasks, boosting rate-accuracy performance. We further introduce a differentiable virtual codec to provide constraints on rate and distortion during the training stage. For testing, we directly apply widely used standard codecs, so our solution can be easily applied to real-world scenarios. We conducted extensive experiments evaluating our compression method on two typical downstream tasks with various backbone networks. The experimental results indicate that our approach can save over 15% of bitrate compared to using only the standard codec anchor version.
[CV-47] Vision-based module for accurately reading linear scales in a laboratory
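[Quick read]: This paper addresses the difficulty robots have in accurately reading measurements from linear-scale instruments, such as syringes and measuring cylinders, when operating autonomously in unstructured laboratory environments. The key to the solution is a human-inspired visual approach: images of randomly oriented instruments are first geometrically transformed to correct the orientation; the area of interest is then reduced to the region containing the linear scale; features such as the major markers, the corresponding digits, and the level indicator location are extracted; and the quantitative reading is computed from these features. The system's readings agree closely with human readings.
Link: https://arxiv.org/abs/2512.15327
Authors: Parvesh Saini,Soumyadipta Maiti,Beena Rai
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 16 figures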
Abstract:The capabilities and number of vision-based models are increasing rapidly. These models can now perform tasks such as object detection, image classification, and instance segmentation with great accuracy. But models that can take accurate quantitative measurements from an image, as a human can by just looking at it, are rare. For a robot to work with complete autonomy in a laboratory environment, it needs basic skills such as navigation, object handling, and sample preparation to match human-like capabilities in an unstructured environment. Another important capability is reading measurements from instruments and apparatus. Here, we mimic a human-inspired approach to reading measurements from a linear scale. As test cases, we read the level from a syringe and a measuring cylinder. For a randomly oriented syringe, we apply transformations to correct the orientation. To make the system efficient and robust, the area of interest is reduced to just the part of the image containing the linear scale. A series of features is then extracted, such as the major markers, the corresponding digits, and the level indicator location, from which the final reading is calculated. Readings obtained with this system were compared against human-read values for the same instances, and an accurate correspondence was observed.
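The last step, turning detected markers and a level position into a reading, can be sketched as linear interpolation along the scale. This is an assumed formulation: the abstract only says the reading is calculated from the extracted features, so `read_scale`, its input format of (pixel position, value) pairs, and the sample numbers are hypothetical.

```python
def read_scale(marks, level_px):
    """Interpolate a reading from major marks and the level indicator position.

    marks    : list of (pixel_position, scale_value) for detected major marks
    level_px : pixel position of the level indicator along the same axis
    """
    pts = sorted(marks)
    if len(pts) < 2:
        raise ValueError("need at least two major marks")
    # Use the pair of marks that brackets the level; otherwise extrapolate
    # from the nearest end pair.
    for (p0, v0), (p1, v1) in zip(pts, pts[1:]):
        if p0 <= level_px <= p1:
            break
    else:
        if level_px < pts[0][0]:
            (p0, v0), (p1, v1) = pts[0], pts[1]
        else:
            (p0, v0), (p1, v1) = pts[-2], pts[-1]
    if p1 == p0:
        raise ValueError("two marks share the same pixel position")
    return v0 + (level_px - p0) * (v1 - v0) / (p1 - p0)


# Marks at 0, 5 and 10 mL detected at pixels 100, 200 and 300.
print(read_scale([(100, 0.0), (200, 5.0), (300, 10.0)], level_px=250))
# A scale whose values decrease as the pixel coordinate grows.
print(read_scale([(100, 10.0), (300, 0.0)], level_px=150))
```

Interpolating between the two nearest marks, rather than fitting one global line, keeps the reading tolerant of mild perspective distortion left over after orientation correction.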
[CV-48] A Masked Reverse Knowledge Distillation Method Incorporating Global and Local Information for Image Anomaly Detection
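[Quick read]: This paper addresses overgeneralization in knowledge distillation for image anomaly detection and localization, which arises because the input and supervisory signals are too similar. The key to the solution is masked reverse knowledge distillation (MRKD), built on two strategies. Image-level masking (ILM) differentiates the input signal from the supervisory signal, which helps capture global context. Feature-level masking (FLM) injects synthetic feature-level anomalies so that the learned representations retain local detail. Together they reduce overgeneralization and make anomaly detection more sensitive and accurate.
Link: https://arxiv.org/abs/2512.15326
Authors: Yuxin Jiang,Yunkang Cao,Weiming Shen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: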
Abstract:Knowledge distillation is an effective image anomaly detection and localization scheme. However, a major drawback of this scheme is its tendency to overly generalize, primarily due to the similarities between input and supervisory signals. In order to address this issue, this paper introduces a novel technique called masked reverse knowledge distillation (MRKD). By employing image-level masking (ILM) and feature-level masking (FLM), MRKD transforms the task of image reconstruction into image restoration. Specifically, ILM helps to capture global information by differentiating input signals from supervisory signals. On the other hand, FLM incorporates synthetic feature-level anomalies to ensure that the learned representations contain sufficient local information. With these two strategies, MRKD is endowed with stronger image context capture capacity and is less likely to be overgeneralized. Experiments on the widely-used MVTec anomaly detection dataset demonstrate that MRKD achieves impressive performance: image-level 98.9% AU-ROC, pixel-level 98.4% AU-ROC, and 95.3% AU-PRO. In addition, extensive ablation experiments have validated the superiority of MRKD in mitigating the overgeneralization problem.
[CV-49] MECAD: A multi-expert architecture for continual anomaly detection
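[Quick read]: This paper addresses forgetting in continual anomaly detection: when new classes are learned incrementally, detection performance on earlier classes drops sharply. The key to the solution is MECAD, a multi-expert architecture that assigns experts to object classes dynamically based on feature similarity to preserve class-specific knowledge, and combines optimized coreset selection with a specialized replay buffer to manage memory without full retraining. This keeps computation efficient while improving adaptability and stability across diverse object categories.
Link: https://arxiv.org/abs/2512.15323
Authors: Malihe Dahmardeh,Francesco Setti
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICIAP 2025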
Abstract:In this paper we propose MECAD, a novel approach for continual anomaly detection using a multi-expert architecture. Our system dynamically assigns experts to object classes based on feature similarity and employs efficient memory management to preserve the knowledge of previously seen classes. By leveraging an optimized coreset selection and a specialized replay buffer mechanism, we enable incremental learning without requiring full model retraining. Our experimental evaluation on the MVTec AD dataset demonstrates that the optimal 5-expert configuration achieves an average AUROC of 0.8259 across 15 diverse object categories while significantly reducing knowledge degradation compared to single-expert approaches. This framework balances computational efficiency, specialized knowledge retention, and adaptability, making it well-suited for industrial environments with evolving product types.
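Similarity-based expert assignment can be sketched as nearest-centroid routing with a cap on the number of experts. The abstract does not give the assignment rule, so the cosine similarity measure, the 0.8 threshold, and the function names are assumptions for illustration; only the 5-expert limit echoes the configuration reported in the abstract.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def assign_expert(class_feature, expert_centroids, threshold=0.8, max_experts=5):
    """Return the expert index for a new class, creating an expert if needed.

    A class joins the most similar existing expert when the similarity
    clears the threshold or when no expert slot is left; otherwise a new
    expert is opened with this class feature as its centroid.
    """
    if expert_centroids:
        sims = [cosine(class_feature, c) for c in expert_centroids]
        best = max(range(len(sims)), key=sims.__getitem__)
        if sims[best] >= threshold or len(expert_centroids) >= max_experts:
            return best
    expert_centroids.append(list(class_feature))
    return len(expert_centroids) - 1


centroids = []
print(assign_expert([1.0, 0.0], centroids))  # first class opens expert 0
print(assign_expert([0.9, 0.1], centroids))  # similar class reuses expert 0
print(assign_expert([0.0, 1.0], centroids))  # dissimilar class opens expert 1
```

Grouping similar classes under one expert is what limits interference: an expert is only updated with classes whose features resemble what it already stores.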
[CV-50] Prototypical Learning Guided Context-Aware Segmentation Network for Few-Shot Anomaly Detection
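[Quick read]: This paper addresses the performance loss in few-shot anomaly detection (FSAD) caused by the domain gap between pre-trained feature representations and the target scenario. The key to the solution is the Prototypical Learning Guided Context-Aware Segmentation Network (PCSNet), which has two main parts. (1) A Prototypical Feature Adaptation (PFA) sub-network extracts prototypical features to make normal samples more compact and separate them from anomalies, with a pixel-level disparity classification loss that makes subtle anomalies easier to distinguish. (2) A Context-Aware Segmentation (CAS) sub-network uses pseudo anomalies during training to localize anomalies at the pixel level. The method improves FSAD performance on the MVTec and MPDD datasets and is validated with limited samples on real automotive plastic part inspection.
Link: https://arxiv.org/abs/2512.15319
Authors: Yuxin Jiang,Yunkang Cao,Weiming Shen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: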
Abstract:Few-shot anomaly detection (FSAD) denotes the identification of anomalies within a target category with a limited number of normal samples. Existing FSAD methods largely rely on pre-trained feature representations to detect anomalies, but the inherent domain gap between pre-trained representations and target FSAD scenarios is often overlooked. This study proposes a Prototypical Learning Guided Context-Aware Segmentation Network (PCSNet) to address the domain gap, thereby improving feature descriptiveness in target scenarios and enhancing FSAD performance. In particular, PCSNet comprises a prototypical feature adaption (PFA) sub-network and a context-aware segmentation (CAS) sub-network. PFA extracts prototypical features as guidance to ensure better feature compactness for normal data while keeping it distinctly separated from anomalies. A pixel-level disparity classification loss is also designed to make subtle anomalies more distinguishable. Then a CAS sub-network is introduced for pixel-level anomaly localization, where pseudo anomalies are exploited to facilitate the training process. Experimental results on MVTec and MPDD demonstrate the superior FSAD performance of PCSNet, with 94.9% and 80.2% image-level AUROC in an 8-shot scenario, respectively. Real-world applications on automotive plastic part inspection further demonstrate that PCSNet can achieve promising results with limited training samples. Code is available at this https URL.
[CV-51] Automated Motion Artifact Check for MRI (AutoMAC-MRI): An Interpretable Framework for Motion Artifact Detection and Severity Assessment
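[Quick read]: This paper addresses motion artifacts in magnetic resonance imaging (MRI), which degrade image quality and raise patient recall rates, and the limits of existing automated quality assessment methods, which mostly make binary decisions and offer little interpretability. The key to the solution is the AutoMAC-MRI framework, which uses supervised contrastive learning to build a discriminative representation of motion severity and computes grade-specific affinity scores in that feature space. The scores make grade assignments transparent and interpretable for each image, supporting inline MRI quality control that can reduce unnecessary rescans and improve workflow efficiency.
Link: https://arxiv.org/abs/2512.15315
Authors: Antony Jerald,Dattesh Shanbhag,Sudhanya Chatterjee
Affiliations: GE HealthCare
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: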
Abstract:Motion artifacts degrade MRI image quality and increase patient recalls. Existing automated quality assessment methods are largely limited to binary decisions and provide little interpretability. We introduce AutoMAC-MRI, an explainable framework for grading motion artifacts across heterogeneous MR contrasts and orientations. The approach uses supervised contrastive learning to learn a discriminative representation of motion severity. Within this feature space, we compute grade-specific affinity scores that quantify an image’s proximity to each motion grade, thereby making grade assignments transparent and interpretable. We evaluate AutoMAC-MRI on more than 5000 expert-annotated brain MRI slices spanning multiple contrasts and views. Experiments assessing affinity scores against expert labels show that the scores align well with expert judgment, supporting their use as an interpretable measure of motion severity. By coupling accurate grade detection with per-grade affinity scoring, AutoMAC-MRI enables inline MRI quality control, with the potential to reduce unnecessary rescans and improve workflow efficiency.
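One plausible way to turn an embedding into grade-specific affinity scores is a softmax over negative distances to per-grade centroids. This is an assumption made for illustration: the abstract says only that the scores quantify proximity to each motion grade, and the grade names, centroids, temperature, and function name below are all hypothetical.

```python
import math


def affinity_scores(embedding, grade_centroids, temperature=1.0):
    """Map distances to grade centroids into scores that sum to 1.

    grade_centroids : dict mapping a grade label to its centroid vector
    Closer centroids receive higher affinity.
    """
    dists = {g: math.dist(embedding, c) for g, c in grade_centroids.items()}
    nearest = min(dists.values())  # subtract the minimum for numerical stability
    weights = {g: math.exp(-(d - nearest) / temperature) for g, d in dists.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


# Hypothetical 2-D centroids for three motion grades.
centroids = {"none": [0.0, 0.0], "mild": [1.0, 0.0], "severe": [3.0, 0.0]}
scores = affinity_scores([0.9, 0.0], centroids)
print(max(scores, key=scores.get), scores)
```

Reporting the full score vector instead of a single label is what makes the output inspectable: a slice sitting between two grades shows comparable affinity to both.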
[CV-52] KD360-VoxelBEV: LiDAR and 360-degree Camera Cross Modality Knowledge Distillation for Birds-Eye-View Segmentation
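Quick read: This paper addresses the limited accuracy of Bird's-Eye-View (BEV) semantic segmentation from a single 360-degree panoramic camera, while also reducing sensor complexity and deployment cost. The key to the solution is a cross-modality knowledge distillation framework: a high-capacity LiDAR and camera fusion Teacher network extracts rich spatial and semantic features from multimodal data, a novel voxel-aligned view transformer preserves spatial fidelity, and this knowledge is distilled into a Student network that relies solely on a single panoramic image. On the Dur360BEV dataset, the Teacher model achieves a 25.6% IoU improvement and the Student model an 8.5% IoU gain at an inference speed of 31.2 FPS, supporting the method's feasibility and robustness in real-world autonomous driving.
Link: https://arxiv.org/abs/2512.15311
Authors: Wenke E, Yixin Sun, Jiaxu Liu, Hubert P. H. Shum, Amir Atapour-Abarghouei, Toby P. Breckon
Affiliations: Durham University, UK
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: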
Abstract:We present the first cross-modality distillation framework specifically tailored for single-panoramic-camera Bird’s-Eye-View (BEV) segmentation. Our approach leverages a novel LiDAR image representation fused from range, intensity and ambient channels, together with a voxel-aligned view transformer that preserves spatial fidelity while enabling efficient BEV processing. During training, a high-capacity LiDAR and camera fusion Teacher network extracts both rich spatial and semantic features for cross-modality knowledge distillation into a lightweight Student network that relies solely on a single 360-degree panoramic camera image. Extensive experiments on the Dur360BEV dataset demonstrate that our teacher model significantly outperforms existing camera-based BEV segmentation methods, achieving a 25.6% IoU improvement. Meanwhile, the distilled Student network attains competitive performance with an 8.5% IoU gain and state-of-the-art inference speed of 31.2 FPS. Moreover, evaluations on KITTI-360 (two fisheye cameras) confirm that our distillation framework generalises to diverse camera setups, underscoring its feasibility and robustness. This approach reduces sensor complexity and deployment costs while providing a practical solution for efficient, low-cost BEV segmentation in real-world autonomous driving.
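As a rough illustration of the Teacher-to-Student transfer described above, the sketch below adds a feature-matching term to the Student's own task loss. The abstract does not state the actual distillation objective, so the mean-squared-error term, the `weight` parameter, and the names `mse` and `distillation_loss` are all assumptions; the flat lists stand in for BEV feature maps.

```python
def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def distillation_loss(student_feat, teacher_feat, task_loss, weight=0.5):
    """Student task loss plus a weighted feature-matching term (assumed form)."""
    return task_loss + weight * mse(student_feat, teacher_feat)

# Toy flattened BEV features; the Teacher is treated as fixed during distillation.
teacher_bev = [0.2, 0.8, 0.5, 0.1]
student_bev = [0.0, 1.0, 0.5, 0.1]
loss = distillation_loss(student_bev, teacher_bev, task_loss=0.30)
```

In this form the Teacher is only needed at training time, which matches the abstract's point that the deployed Student relies on a single panoramic camera image.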
[CV-53] SynthSeg-Agents: Multi-Agent Synthetic Data Generation for Zero-Shot Weakly Supervised Semantic Segmentation
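Quick read: This paper addresses the dependence of Weakly Supervised Semantic Segmentation (WSSS) on real image training samples and proposes a new paradigm, Zero Shot Weakly Supervised Semantic Segmentation (ZSWSSS). The core of the solution is SynthSeg Agents, a multi-agent framework driven by Large Language Models (LLMs) that generates high-quality synthetic training data without any real images. It has two key modules. The Self Refine Prompt Agent builds semantically rich and diverse image prompts through iterative refinement, memory mechanisms, prompt space exploration, and CLIP-similarity-guided diversity filtering. The Image Generation Agent uses Vision Language Models (VLMs) to generate candidate images, selects high-quality samples with a frozen CLIP scoring model, and relabels the synthetic data with a trained ViT-based classifier to improve semantic precision. The method trains semantic segmentation without real image supervision, shows competitive results on PASCAL VOC 2012 and COCO 2014, and highlights the potential of LLM-driven agents for low-cost, scalable semantic segmentation.
Link: https://arxiv.org/abs/2512.15310
Authors: Wangyu Wu, Zhenhong Chen, Xiaowei Huang, Fei Ma, Jimin Xiao
Affiliations: Xi’an Jiaotong-Liverpool University; University of Liverpool; Microsoft
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: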
Abstract:Weakly Supervised Semantic Segmentation (WSSS) with image level labels aims to produce pixel level predictions without requiring dense annotations. While recent approaches have leveraged generative models to augment existing data, they remain dependent on real world training samples. In this paper, we introduce a novel direction, Zero Shot Weakly Supervised Semantic Segmentation (ZSWSSS), and propose SynthSeg Agents, a multi agent framework driven by Large Language Models (LLMs) to generate synthetic training data entirely without real images. SynthSeg Agents comprises two key modules, a Self Refine Prompt Agent and an Image Generation Agent. The Self Refine Prompt Agent autonomously crafts diverse and semantically rich image prompts via iterative refinement, memory mechanisms, and prompt space exploration, guided by CLIP based similarity and nearest neighbor diversity filtering. These prompts are then passed to the Image Generation Agent, which leverages Vision Language Models (VLMs) to synthesize candidate images. A frozen CLIP scoring model is employed to select high quality samples, and a ViT based classifier is further trained to relabel the entire synthetic dataset with improved semantic precision. Our framework produces high quality training data without any real image supervision. Experiments on PASCAL VOC 2012 and COCO 2014 show that SynthSeg Agents achieves competitive performance without using real training images. This highlights the potential of LLM driven agents in enabling cost efficient and scalable semantic segmentation.
[CV-54] MMMamba: A Versatile Cross-Modal In Context Fusion Framework for Pan-Sharpening and Zero-Shot Image Enhancement
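Quick read: This paper addresses two problems in pan-sharpening: traditional convolutional neural network (CNN) methods use fixed convolutional operators that adapt poorly to complex spatial and spectral variations, and cross-attention-based methods are computationally inefficient and tend to blur fine-grained correspondences. The key to the solution is MMMamba, a cross-modal in-context fusion framework that builds on the Mamba architecture to achieve linear computational complexity while keeping strong cross-modal interaction capacity. It also introduces a novel multimodal interleaved (MI) scanning mechanism that promotes efficient information exchange between the panchromatic (PAN) and multispectral (MS) modalities, and it outperforms existing state-of-the-art (SOTA) methods on multiple benchmarks.
Link: https://arxiv.org/abs/2512.15261
Authors: Yingying Wang, Xuanhua He, Chen Wu, Jialing Huang, Suiyun Zhang, Rui Liu, Xinghao Ding, Haoxuan Che
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code: this https URL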
Abstract:Pan-sharpening aims to generate high-resolution multispectral (HRMS) images by integrating a high-resolution panchromatic (PAN) image with its corresponding low-resolution multispectral (MS) image. To achieve effective fusion, it is crucial to fully exploit the complementary information between the two modalities. Traditional CNN-based methods typically rely on channel-wise concatenation with fixed convolutional operators, which limits their adaptability to diverse spatial and spectral variations. While cross-attention mechanisms enable global interactions, they are computationally inefficient and may dilute fine-grained correspondences, making it difficult to capture complex semantic relationships. Recent advances in the Multimodal Diffusion Transformer (MMDiT) architecture have demonstrated impressive success in image generation and editing tasks. Unlike cross-attention, MMDiT employs in-context conditioning to facilitate more direct and efficient cross-modal information exchange. In this paper, we propose MMMamba, a cross-modal in-context fusion framework for pan-sharpening, with the flexibility to support image super-resolution in a zero-shot manner. Built upon the Mamba architecture, our design ensures linear computational complexity while maintaining strong cross-modal interaction capacity. Furthermore, we introduce a novel multimodal interleaved (MI) scanning mechanism that facilitates effective information exchange between the PAN and MS modalities. Extensive experiments demonstrate the superior performance of our method compared to existing state-of-the-art (SOTA) techniques across multiple tasks and benchmarks.
[CV-55] Assessing the Visual Enumeration Abilities of Specialized Counting Architectures and Vision-Language Models
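Quick read: This paper addresses the fundamental yet challenging problem of counting objects in visual scenes. Traditional methods rely on domain-specific counting architectures trained on datasets annotated with predefined categories, which adapt poorly to open-set scenarios. The key to the approach is a systematic comparison between state-of-the-art specialized counting architectures and large-scale multimodal Vision-Language Models (VLMs). It finds that VLMs can approximately enumerate objects in a visual scene under open-set conditions and in some cases match or surpass specialized models. Enumeration accuracy improves significantly when a VLM is prompted to generate an intermediate representation (such as location and semantic label) for each object to be counted, which suggests that exploiting the generalization and reasoning abilities of VLMs is an important direction for flexible, general-purpose counting.
Link: https://arxiv.org/abs/2512.15254
Authors: Kuinan Hou, Jing Mi, Marco Zorzi, Lamberto Ballan, Alberto Testolin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: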
Abstract:Counting the number of items in a visual scene remains a fundamental yet challenging task in computer vision. Traditional approaches to solving this problem rely on domain-specific counting architectures, which are trained using datasets annotated with a predefined set of object categories. However, recent progress in creating large-scale multimodal vision-language models (VLMs) suggests that these domain-general architectures may offer a flexible alternative for open-set object counting. In this study, we therefore systematically compare the performance of state-of-the-art specialized counting architectures against VLMs on two popular counting datasets, as well as on a novel benchmark specifically created to have a finer-grained control over the visual properties of test images. Our findings show that most VLMs can approximately enumerate the number of items in a visual scene, matching or even surpassing the performance of specialized computer vision architectures. Notably, enumeration accuracy significantly improves when VLMs are prompted to generate intermediate representations (i.e., locations and verbal labels) of each object to be counted. Nevertheless, none of the models can reliably count the number of objects in complex visual scenes, showing that further research is still needed to create AI systems that can reliably deploy counting procedures in realistic environments.
[CV-56] Intersectional Fairness in Vision-Language Models for Medical Image Disease Classification
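Quick read: This paper addresses intersectional bias in medical artificial intelligence (Medical AI) systems caused by skewed data distributions: models are systematically less confident when diagnosing marginalised patient subgroups, which raises the rates of inaccurate and missed diagnoses. Existing fairness interventions often fail to close these gaps or achieve statistical parity at the cost of overall diagnostic performance. The key to the solution is a training framework called Cross-Modal Alignment Consistency (CMAC-MMD), which standardises diagnostic confidence across subgroups defined by intersecting attributes (such as age, gender, and race) and improves fairness and accuracy together. Its novelty is that it equalises decision confidence without using sensitive demographic information during clinical inference, so it improves reliability across diverse patients without increasing privacy risk.
Link: https://arxiv.org/abs/2512.15249
Authors: Yupeng Zhang, Adam G. Dunn, Usman Naseem, Jinman Kim
Affiliations: University of Sydney; Macquarie University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: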
Abstract:Medical artificial intelligence (AI) systems, particularly multimodal vision-language models (VLM), often exhibit intersectional biases where models are systematically less confident in diagnosing marginalised patient subgroups. Such bias can lead to higher rates of inaccurate and missed diagnoses due to demographically skewed data and divergent distributions of diagnostic certainty. Current fairness interventions frequently fail to address these gaps or compromise overall diagnostic performance to achieve statistical parity among the subgroups. In this study, we developed Cross-Modal Alignment Consistency (CMAC-MMD), a training framework that standardises diagnostic certainty across intersectional patient subgroups. Unlike traditional debiasing methods, this approach equalises the model’s decision confidence without requiring sensitive demographic data during clinical inference. We evaluated this approach using 10,015 skin lesion images (HAM10000) with external validation on 12,000 images (BCN20000), and 10,000 fundus images for glaucoma detection (Harvard-FairVLMed), stratifying performance by intersectional age, gender, and race attributes. In the dermatology cohort, the proposed method reduced the overall intersectional missed diagnosis gap (difference in True Positive Rate, \Delta TPR) from 0.50 to 0.26 while improving the overall Area Under the Curve (AUC) from 0.94 to 0.97 compared to standard training. Similarly, for glaucoma screening, the method reduced \Delta TPR from 0.41 to 0.31, achieving a better AUC of 0.72 (vs. 0.71 baseline). This establishes a scalable framework for developing high-stakes clinical decision support systems that are both accurate and can perform equitably across diverse patient subgroups, ensuring reliable performance without increasing privacy risks.
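The missed diagnosis gap reported above (\Delta TPR) can be computed along the following lines. The abstract only describes it as a difference in True Positive Rate, so taking the gap between the best and worst subgroup is an assumption, as are the function names `tpr_by_subgroup` and `delta_tpr` and the toy records. Subgroups are keyed by a tuple of attributes to reflect the intersectional stratification.

```python
from collections import defaultdict

def tpr_by_subgroup(records):
    """records: iterable of (subgroup, y_true, y_pred) with binary labels.

    Returns the True Positive Rate for every subgroup with at least one positive.
    """
    true_pos = defaultdict(int)
    positives = defaultdict(int)
    for group, y_true, y_pred in records:
        if y_true == 1:
            positives[group] += 1
            if y_pred == 1:
                true_pos[group] += 1
    return {g: true_pos[g] / positives[g] for g in positives}

def delta_tpr(records):
    """Gap between the highest and lowest subgroup TPR (assumed definition)."""
    rates = tpr_by_subgroup(records)
    return max(rates.values()) - min(rates.values())

# Toy data: subgroup is an (age band, gender) tuple.
records = [
    (("<40", "F"), 1, 1), (("<40", "F"), 1, 1), (("<40", "F"), 0, 0),
    ((">=40", "M"), 1, 1), ((">=40", "M"), 1, 0), ((">=40", "M"), 0, 1),
]
gap = delta_tpr(records)
```

Note that the demographic attributes are used here only to audit predictions after the fact, which is compatible with a model that does not receive them at inference time.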
[CV-57] Null-LoRA: Low-Rank Adaptation on Null Space
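Quick read: This paper addresses the redundancy of existing parameter-efficient fine-tuning methods (such as LoRA) that perform low-rank adaptation over the full parameter space, and asks how to further improve parameter efficiency. The key to the solution is Null-space based Low-Rank Adaptation (Null-LoRA), which freezes portions of the low-rank matrices to reduce redundancy and increase effective rank, and constrains the entire incremental update to the non-trivial null space of the pre-trained model. This maximizes how much of the incremental update goes toward adapting to new task paradigms, giving better performance with fewer parameters.
Link: https://arxiv.org/abs/2512.15233
Authors: Yi Zhang, Yulei Kang, Haoxuan Chen, Jinxuan Li, ian-Fang Hu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: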
Abstract:Parameter-efficient fine-tuning methods have gained considerable popularity for adapting large-scale models to downstream tasks, particularly LoRA and its variants. Existing methods perform low-rank adaptation over the full parameter space. However, fine-tuning within a subspace can achieve comparable effectiveness. Inspired by the observation that pre-trained models possess non-trivial null spaces, we propose Null-space based Low-Rank Adaptation (Null-LoRA). Null-LoRA effectively reduces redundancy and enhances effective rank by freezing portions of the low-rank matrices. To further improve parameter efficiency, Null-LoRA constrains the entire incremental update within the null space, maximizing the utilization of incremental updates to adapt to new task paradigms. Null-LoRA surpasses the state of the art with fewer parameters in extensive experiments across image-text retrieval and visual question answering tasks.
[CV-58] SLCFormer: Spectral-Local Context Transformer with Physics-Grounded Flare Synthesis for Nighttime Flare Removal
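Quick read: This paper addresses nighttime lens flare removal, in particular nonuniform scattered flares that are hard to handle in complex real-world scenes. Existing methods have limited applicability because they cannot adequately model the spatially varying Point Spread Function (PSF). The key to the solution is SLCFormer, a spectral-local context Transformer framework. Its Frequency Fourier and Excitation Module (FFEM) captures global context in the frequency domain to characterize flares, and its Directionally-Enhanced Spatial Module (DESM) strengthens local structure and directional features in the spatial domain for precise flare removal. The work also designs a ZernikeVAE-based scatter flare generation pipeline that synthesizes physically realistic flares with spatially varying PSFs, bridging optical physics modeling and data-driven training.
Link: https://arxiv.org/abs/2512.15221
Authors: Xiyu Zhu, Wei Wang, Xin Yuan, Xiao Wang
Affiliations: Wuhan University of Science and Technology; Tencent AI Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: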
Abstract:Lens flare is a common nighttime artifact caused by strong light sources scattering within camera lenses, leading to hazy streaks, halos, and glare that degrade visual quality. However, existing methods usually fail to effectively address nonuniform scattered flares, which severely reduces their applicability to complex real-world scenarios with diverse lighting conditions. To address this issue, we propose SLCFormer, a novel spectral-local context transformer framework for effective nighttime lens flare removal. SLCFormer integrates two key modules: the Frequency Fourier and Excitation Module (FFEM), which captures efficient global contextual representations in the frequency domain to model flare characteristics, and the Directionally-Enhanced Spatial Module (DESM) for local structural enhancement and directional features in the spatial domain for precise flare removal. Furthermore, we introduce a ZernikeVAE-based scatter flare generation pipeline to synthesize physically realistic scatter flares with spatially varying PSFs, bridging optical physics and data-driven training. Extensive experiments on the Flare7K++ dataset demonstrate that our method achieves state-of-the-art performance, outperforming existing approaches in both quantitative metrics and perceptual visual quality, and generalizing robustly to real nighttime scenes with complex flare artifacts.
[CV-59] From Camera to World: A Plug-and-Play Module for Human Mesh Transformation
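Quick read: This paper addresses the world-coordinate transformation error that arises when reconstructing accurate 3D human meshes from in-the-wild images without camera rotation information. Existing methods usually assume zero camera rotation, which works well in the camera coordinate system but causes significant errors after transformation to the world coordinate system. The key to the solution is Mesh-Plug, a plug-and-play module built on a human-centered approach: it uses RGB images and depth maps rendered from the initial mesh to estimate camera rotation parameters (in particular the pitch angle), enabling accurate transformation from camera to world coordinates. The method has two steps. It first trains a camera rotation prediction module that focuses on the human body's spatial configuration to estimate camera pitch. It then applies a mesh adjustment module that combines the predicted camera parameters with the initial mesh to refine the root joint orientation and body pose together, improving reconstruction accuracy.
Link: https://arxiv.org/abs/2512.15212
Authors: Changhai Ma, Ziyu Wu, Yunkang Zhang, Qijun Ying, Boyan Liu, Xiaohui Cai
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: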
Abstract:Reconstructing accurate 3D human meshes in the world coordinate system from in-the-wild images remains challenging due to the lack of camera rotation information. While existing methods achieve promising results in the camera coordinate system by assuming zero camera rotation, this simplification leads to significant errors when transforming the reconstructed mesh to the world coordinate system. To address this challenge, we propose Mesh-Plug, a plug-and-play module that accurately transforms human meshes from camera coordinates to world coordinates. Our key innovation lies in a human-centered approach that leverages both RGB images and depth maps rendered from the initial mesh to estimate camera rotation parameters, eliminating the dependency on environmental cues. Specifically, we first train a camera rotation prediction module that focuses on the human body’s spatial configuration to estimate camera pitch angle. Then, by integrating the predicted camera parameters with the initial mesh, we design a mesh adjustment module that simultaneously refines the root joint orientation and body pose. Extensive experiments demonstrate that our framework outperforms state-of-the-art methods on the benchmark datasets SPEC-SYN and SPEC-MTP.
[CV-60] TBC: A Target-Background Contrast Metric for Low-Altitude Infrared and Visible Image Fusion
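Quick read: This paper addresses the failure of traditional no-reference metrics, such as Entropy (EN) and Average Gradient (AG), for infrared and visible image fusion in low-altitude UAV reconnaissance under complex low-light conditions. These metrics often mistake high-frequency sensor noise for valid detail, creating a "Noise Trap" in which noisy images receive higher scores and fusion algorithms are optimized in the wrong direction. The key to the solution is a new metric based on Weber's Law, Target-Background Contrast (TBC), which focuses on the relative contrast of salient targets against the background rather than on global statistics. TBC penalizes background noise and rewards target visibility, aligns better with human visual perception, and is validated on the DroneVehicle dataset as reliable and effective for low-altitude scenarios.
Link: https://arxiv.org/abs/2512.15211
Authors: Yufeng Xie
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: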
Abstract:Infrared and visible image fusion is a pivotal technology in low-altitude UAV reconnaissance missions, providing high-quality data support for downstream tasks such as target detection and tracking by integrating thermal saliency with background texture. However, traditional no-reference metrics, such as Entropy (EN) and Average Gradient (AG), fail in complex low-light environments. They often misinterpret high-frequency sensor noise as valid detail. This creates a “Noise Trap,” paradoxically assigning higher scores to noisy images and misguiding fusion algorithms. To address this, we propose the Target-Background Contrast (TBC) metric. Inspired by Weber’s Law, TBC focuses on the relative contrast of salient targets rather than global statistics. Unlike traditional metrics, TBC penalizes background noise and rewards target visibility. Experiments on the DroneVehicle dataset demonstrate that TBC aligns better with human perception and provides a reliable standard for low-altitude scenarios.
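The Weber's Law idea behind this metric can be illustrated with the classical Weber contrast, (I_target - I_background) / I_background. The abstract does not give the TBC formula, so the sketch below is only that classical ratio applied to mean intensities under a binary target mask; it is not the paper's metric and does not include its noise penalty. The names `mean_intensity` and `weber_contrast`, the `eps` guard, and the toy image are assumptions.

```python
def mean_intensity(image, mask, value):
    """Mean of the pixels whose mask entry equals `value` (1 = target, 0 = background)."""
    pixels = [
        p
        for image_row, mask_row in zip(image, mask)
        for p, m in zip(image_row, mask_row)
        if m == value
    ]
    return sum(pixels) / len(pixels)

def weber_contrast(image, target_mask, eps=1e-6):
    """Classical Weber contrast of the target region against the background."""
    target = mean_intensity(image, target_mask, 1)
    background = mean_intensity(image, target_mask, 0)
    return abs(target - background) / (background + eps)

# Toy 3x3 fused image with one bright target pixel on a dark background.
image = [[10, 10, 10], [10, 50, 10], [10, 10, 10]]
mask = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
contrast = weber_contrast(image, mask)
```

Because the ratio is taken relative to the background level, a uniformly brighter image does not score higher, which is the property that separates this kind of measure from global statistics such as entropy.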
[CV-61] EPSM: A Novel Metric to Evaluate the Safety of Environmental Perception in Autonomous Driving
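Quick read: This paper addresses the neglect of safety-relevant factors in current perception system evaluation: conventional performance metrics (such as precision, recall, and F1-score) reflect overall detection accuracy but cannot identify misdetections that could lead to severe accidents. The key to the solution is a new joint safety metric framework that integrates a lightweight object safety metric with a lane safety metric, where the latter also accounts for the interdependence between object detection and lane detection. The result is a unified, interpretable combined score of perception safety that identifies safety-critical errors conventional metrics fail to capture.
Link: https://arxiv.org/abs/2512.15195
Authors: Jörg Gamerdinger, Sven Teufel, Stephan Amann, Lukas Marc Listl, Oliver Bringmann
Affiliations: University of Tübingen
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted at IEEE IV 2026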
Abstract:Extensive evaluation of perception systems is crucial for ensuring the safety of intelligent vehicles in complex driving scenarios. Conventional performance metrics such as precision, recall and the F1-score assess the overall detection accuracy, but they do not consider the safety-relevant aspects of perception. Consequently, perception systems that achieve high scores in these metrics may still cause misdetections that could lead to severe accidents. Therefore, it is important to evaluate not only the overall performance of perception systems, but also their safety. We therefore introduce a novel safety metric for jointly evaluating the most critical perception tasks, object and lane detection. Our proposed framework integrates a new, lightweight object safety metric that quantifies the potential risk associated with object detection errors, as well as a lane safety metric that includes the interdependence between both tasks that can occur in safety evaluation. The resulting combined safety score provides a unified, interpretable measure of perception safety performance. Using the DeepAccident dataset, we demonstrate that our approach identifies safety critical perception errors that conventional performance metrics fail to capture. Our findings emphasize the importance of safety-centric evaluation methods for perception systems in autonomous driving.
[CV-62] ERIENet: An Efficient RAW Image Enhancement Network under Low-Light Environment
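Quick read: This paper addresses two problems in existing RAW-based low-light enhancement methods. First, multi-scale information is usually processed sequentially, which makes lightweight models and high processing speeds hard to achieve. Second, the informational advantage of the green channel in RAW images is ignored, so its rich detail is not used to improve reconstruction quality. The key to the solution is an efficient RAW Image Enhancement Network (ERIENet) with two core contributions: (1) a novel channel-aware residual dense block that extracts multi-scale features in parallel, significantly reducing computational cost and supporting real-time processing; (2) a green channel guidance branch that exploits the high information density of the green channel to improve reconstruction quality with few parameters and computations. Experiments show that the method outperforms state-of-the-art methods on several low-light enhancement datasets and processes 4K-resolution images at over 146 frames per second (FPS) on a single NVIDIA GeForce RTX 3090.
Link: https://arxiv.org/abs/2512.15186
Authors: Jianan Wang, Yang Hong, Hesong Li, Tao Wang, Songrong Liu, Ying Fu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 4 figures, conference ICVISP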
Abstract:RAW images have shown superior performance to sRGB images in many image processing tasks, especially for low-light image enhancement. However, most existing methods for RAW-based low-light enhancement process multi-scale information sequentially, which makes it difficult to achieve lightweight models and high processing speeds. Besides, they usually ignore the green channel superiority of RAW images, and fail to achieve better reconstruction performance with good use of green channel information. In this work, we propose an efficient RAW Image Enhancement Network (ERIENet), which processes multi-scale information in parallel with efficient convolution modules, and takes advantage of rich information in green channels to guide the reconstruction of images. Firstly, we introduce an efficient multi-scale fully-parallel architecture with a novel channel-aware residual dense block to extract feature maps, which reduces computational costs and achieves real-time processing speed. Secondly, we introduce a green channel guidance branch to exploit the rich information within the green channels of the input RAW image. It increases the quality of reconstruction results with few parameters and computations. Experiments on commonly used low-light image enhancement datasets show that ERIENet outperforms state-of-the-art methods in enhancing low-light RAW images with higher efficiency. It also achieves an optimal speed of over 146 frames per second (FPS) for 4K-resolution images on a single NVIDIA GeForce RTX 3090 with 24G memory.
[CV-63] Robust and Calibrated Detection of Authentic Multimedia Content
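Quick read: This paper addresses two problems in current deepfake detection methods: post-hoc discrimination cannot avoid an unbounded false positive rate (FPR), especially for memorized samples, and existing detectors lack adversarial robustness, so attackers can evade them easily even with limited compute. The key to the solution is a calibrated resynthesis framework that operates in a high-precision, low-recall setting against efficient (that is, compute-restricted) adversaries. The method reliably verifies authentic content while keeping the FPR controlled, and it is more robust to efficient adversaries than prior methods.
Link: https://arxiv.org/abs/2512.15182
Authors: Sarim Hashmi, Abdelrahman Elsayed, Mohammed Talha Alam, Samuele Poppi, Nils Lukas
Affiliations: Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: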
Abstract:Generative models can synthesize highly realistic content, so-called deepfakes, that are already being misused at scale to undermine digital media authenticity. Current deepfake detection methods are unreliable for two reasons: (i) distinguishing inauthentic content post-hoc is often impossible (e.g., with memorized samples), leading to an unbounded false positive rate (FPR); and (ii) detection lacks robustness, as adversaries can adapt to known detectors with near-perfect accuracy using minimal computational resources. To address these limitations, we propose a resynthesis framework to determine if a sample is authentic or if its authenticity can be plausibly denied. We make two key contributions focusing on the high-precision, low-recall setting against efficient (i.e., compute-restricted) adversaries. First, we demonstrate that our calibrated resynthesis method is the most reliable approach for verifying authentic samples while maintaining controllable, low FPRs. Second, we show that our method achieves adversarial robustness against efficient adversaries, whereas prior methods are easily evaded under identical compute budgets. Our approach supports multiple modalities and leverages state-of-the-art inversion techniques.
[CV-64] Criticality Metrics for Relevance Classification in Safety Evaluation of Object Detection in Automated Driving
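Quick read: This paper addresses the inadequate safety evaluation of object detection systems in automated driving, in particular the inability of existing performance metrics to distinguish relevant from non-relevant objects. The core challenge is quantifying how important an object is in a safety scenario, that is, using criticality or relevance metrics to identify objects that materially affect driving decisions. The key to the solution is two novel application strategies, bidirectional criticality rating and multi-metric aggregation, which markedly improve criticality classification accuracy. Experiments show up to a 100% improvement in accuracy, providing a more reliable and quantifiable tool for evaluating the safety of object detection systems in automated vehicles.
Link: https://arxiv.org/abs/2512.15181
Authors: Jörg Gamerdinger, Sven Teufel, Stephan Amann, Oliver Bringmann
Affiliations: University of Tübingen
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted at IEEE ICVES 2025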
Abstract:Ensuring safety is the primary objective of automated driving, which necessitates a comprehensive and accurate perception of the environment. While numerous performance evaluation metrics exist for assessing perception capabilities, incorporating safety-specific metrics is essential to reliably evaluate object detection systems. A key component for safety evaluation is the ability to distinguish between relevant and non-relevant objects - a challenge addressed by criticality or relevance metrics. This paper presents the first in-depth analysis of criticality metrics for safety evaluation of object detection systems. Through a comprehensive review of existing literature, we identify and assess a range of applicable metrics. Their effectiveness is empirically validated using the DeepAccident dataset, which features a variety of safety-critical scenarios. To enhance evaluation accuracy, we propose two novel application strategies: bidirectional criticality rating and multi-metric aggregation. Our approach demonstrates up to a 100% improvement in terms of criticality classification accuracy, highlighting its potential to significantly advance the safety evaluation of object detection systems in automated vehicles.
[CV-65] Cross-modal ultra-scale learning with tri-modalities of renal biopsy images for glomerular multi-disease auxiliary diagnosis
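Quick read: This paper addresses low classification accuracy in renal biopsy pathology diagnosis when fusing features from multi-modal, multi-scale images (nanoscale transmission electron microscopy images and microscale optical microscopy and immunofluorescence microscopy images) whose scales differ substantially. The key to the solution is a cross-modal ultra-scale learning network (CMUS-Net). It introduces a sparse multi-instance learning module to aggregate ultrastructural features from TEM images, designs a cross-modal scale attention module to promote feature interaction across scales and strengthen the extraction of pathological semantic information, and combines multiple loss functions so the model can adaptively weight the importance of each modality. This improves the accuracy and generalization of automatic classification of glomerular diseases including IgA nephropathy (IgAN), membranous nephropathy (MN), and lupus nephritis (LN).
Link: https://arxiv.org/abs/2512.15171
Authors: Kaixing Long, Danyi Weng, Yun Mi, Zhentai Zhang, Yanmeng Lu, Jian Geng, Zhitao Zhou, Liming Zhong, Qianjin Feng, Wei Yang, Lei Cao
Affiliations: Southern Medical University; Guangdong Provincial Key Laboratory of Medical Image Processing; Guangdong Province Engineering Laboratory for Medical Imaging and Diagnostic Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: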
Abstract:Constructing a multi-modal automatic classification model based on three types of renal biopsy images can assist pathologists in glomerular multi-disease identification. However, the substantial scale difference between transmission electron microscopy (TEM) image features at the nanoscale and optical microscopy (OM) or immunofluorescence microscopy (IM) images at the microscale poses a challenge for existing multi-modal and multi-scale models in achieving effective feature fusion and improving classification accuracy. To address this issue, we propose a cross-modal ultra-scale learning network (CMUS-Net) for the auxiliary diagnosis of multiple glomerular diseases. CMUS-Net utilizes multiple ultrastructural information to bridge the scale difference between nanometer and micrometer images. Specifically, we introduce a sparse multi-instance learning module to aggregate features from TEM images. Furthermore, we design a cross-modal scale attention module to facilitate feature interaction, enhancing pathological semantic information. Finally, multiple loss functions are combined, allowing the model to weigh the importance among different modalities and achieve precise classification of glomerular diseases. Our method follows the conventional process of renal biopsy pathology diagnosis and, for the first time, performs automatic classification of multiple glomerular diseases including IgA nephropathy (IgAN), membranous nephropathy (MN), and lupus nephritis (LN) based on images from three modalities and two scales. On an in-house dataset, CMUS-Net achieves an ACC of 95.37+/-2.41%, an AUC of 99.05+/-0.53%, and an F1-score of 95.32+/-2.41%. Extensive experiments demonstrate that CMUS-Net outperforms other well-known multi-modal or multi-scale methods and show its generalization capability in staging MN. Code is available at this https URL.
[CV-66] EagleVision: A Dual-Stage Framework with BEV-grounding-based Chain-of-Thought for Spatial Intelligence
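Quick read: This paper addresses three core problems in current spatial Chain-of-Thought (CoT) methods: building global spatial perception under a strict token budget, explicitly associating 3D hypotheses with video frames for verification, and designing spatially grounded rewards for reinforcement learning. The key to the solution is EagleVision, a dual-stage framework for progressive spatial cognition. The first stage uses a semantics-perspective-fusion determinantal point process (SPF-DPP) to select geometry- and semantics-aware keyframes from long videos, building macro-level spatial perception within a limited token budget. The second stage formalizes spatial CoT as BEV (bird's-eye-view) grounded pose querying: the agent iteratively predicts poses on the BEV plane, retrieves the nearest real frames, and is trained purely by reinforcement learning with a spatial grounding reward, enabling micro-level verification of spatial hypotheses.
Link: https://arxiv.org/abs/2512.15160
Authors: Jiaxu Wan, Xu Wang, Mengwei Xie, Hang Zhang, Mu Xu, Yang Han, Hong Zhang, Ding Yuan, Yifan Yang
Affiliations: School of Aerospace, BUAA; School of Software, BUAA; Independent Researcher
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 7 figures, 6 tables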
Abstract:Recent spatial intelligence approaches typically attach 3D cues to 2D reasoning pipelines or couple MLLMs with black-box reconstruction modules, leading to weak spatial consistency, limited viewpoint diversity, and evidence chains that cannot be traced back to supporting views. Frameworks for “thinking with images” (e.g., ChatGPT-o3 and DeepEyes) show that stepwise multimodal reasoning can emerge by interleaving hypothesis formation with active acquisition of visual evidence, but they do not address three key challenges in spatial Chain-of-Thought (CoT): building global space perception under strict token budgets, explicitly associating 3D hypotheses with video frames for verification, and designing spatially grounded rewards for reinforcement learning. To address these issues, we present EagleVision, a dual-stage framework for progressive spatial cognition through macro perception and micro verification. In the macro perception stage, EagleVision employs a semantics-perspective-fusion determinantal point process (SPF-DPP) to select a compact set of geometry- and semantics-aware keyframes from long videos under a fixed token budget. In the micro verification stage, we formalize spatial CoT as BEV-grounded pose querying: the agent iteratively predicts poses on a BEV plane, retrieves the nearest real frames, and is trained purely by reinforcement learning with a spatial grounding reward that scores the consistency between predicted poses and observed views. On VSI-Bench, EagleVision achieves state-of-the-art performance among open-source vision-language models, demonstrating strong and generalizable spatial understanding.
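The "retrieves the nearest real frames" step of the micro verification stage can be pictured as a nearest-neighbour lookup over camera poses on the BEV plane. The abstract does not define the distance used, so the sketch below assumes a pose is an `(x, y, yaw)` triple and that cost is planar distance plus a weighted heading difference; `angle_diff`, `nearest_frame`, and `yaw_weight` are hypothetical names for illustration.

```python
import math

def angle_diff(a, b):
    """Smallest absolute difference between two angles, in radians."""
    wrapped = (a - b + math.pi) % (2 * math.pi) - math.pi
    return abs(wrapped)

def nearest_frame(query_pose, frame_poses, yaw_weight=1.0):
    """Index of the recorded frame whose (x, y, yaw) pose is closest to the query.

    Cost = Euclidean distance on the BEV plane + yaw_weight * heading difference
    (an assumed cost, chosen only for illustration).
    """
    qx, qy, qyaw = query_pose

    def cost(pose):
        x, y, yaw = pose
        return math.hypot(x - qx, y - qy) + yaw_weight * angle_diff(yaw, qyaw)

    return min(range(len(frame_poses)), key=lambda i: cost(frame_poses[i]))

# Three recorded frames; the last two are close in position but face opposite ways.
frames = [(0.0, 0.0, 0.0), (2.0, 1.0, math.pi / 2), (2.1, 1.0, -math.pi / 2)]
idx = nearest_frame((2.0, 1.1, 1.4), frames)
```

Including heading in the cost matters because two frames taken from almost the same spot can show entirely different views, as the last two toy frames do.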
[CV-67] Explainable Action Form Assessment by Exploiting Multimodal Chain-of-Thoughts Reasoning
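Quick read: This paper addresses the lack of standardized assessment and explainable feedback for human actions in real-world scenarios. Existing video understanding methods focus on what the action is and where it occurs, which does not meet the need for quantitative assessment of action form. Existing datasets generally lack labels for the degree of action standardization, and action quality assessment datasets lack explainability and concrete suggestions for improvement. To address this, the authors define the Human Action Form Assessment (AFA) task and build CoT-AFA, a diverse dataset of fitness and martial arts videos. Its distinguishing feature is a Chain-of-Thought (CoT) explanation paradigm that provides a complete reasoning chain, from identifying an action step to analyzing its outcome and proposing a concrete solution. The key contribution is the Explainable Fitness Assessor framework, which uses two parallel processing streams and a dynamic gating mechanism to fuse visual and semantic information. It improves explanation generation (+16.0% CIDEr), action classification (+2.7% accuracy), and quality assessment (+2.1% accuracy), showing the potential of CoT-AFA for future work on action assessment.
Link: https://arxiv.org/abs/2512.15153
Authors: Mengshi Qi, Yeteng Wu, Xianlin Zhang, Huadong Ma
Affiliations: State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: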
Abstract:Evaluating whether human action is standard or not and providing reasonable feedback to improve action standardization is very crucial but challenging in real-world scenarios. However, current video understanding methods are mainly concerned with what and where the action is, which is unable to meet the requirements. Meanwhile, most of the existing datasets lack the labels indicating the degree of action standardization, and the action quality assessment datasets lack explainability and detailed feedback. Therefore, we define a new Human Action Form Assessment (AFA) task, and introduce a new diverse dataset CoT-AFA, which contains a large scale of fitness and martial arts videos with multi-level annotations for comprehensive video analysis. We enrich the CoT-AFA dataset with a novel Chain-of-Thought explanation paradigm. Instead of offering isolated feedback, our explanations provide a complete reasoning process–from identifying an action step to analyzing its outcome and proposing a concrete solution. Furthermore, we propose a framework named Explainable Fitness Assessor, which can not only judge an action but also explain why and provide a solution. This framework employs two parallel processing streams and a dynamic gating mechanism to fuse visual and semantic information, thereby boosting its analytical capabilities. The experimental results demonstrate that our method has achieved improvements in explanation generation (e.g., +16.0% in CIDEr), action classification (+2.7% in accuracy) and quality assessment (+2.1% in accuracy), revealing great potential of CoT-AFA for future studies. Our dataset and source code is available at this https URL.
[CV-68] Borrowing from anything: A generalizable framework for reference-guided instance editing
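[Quick read]: This paper targets the performance bottleneck that semantic entanglement creates in reference-guided instance editing: a reference image's intrinsic appearance is hard to separate from its extrinsic attributes (such as lighting and background), which hurts the accuracy and controllability of the edit. The key to the solution is the GENIE framework, with three components: 1) a Spatial Alignment Module (SAM) that corrects spatial misalignment; 2) an Adaptive Residual Scaling Module (ARSM) that achieves explicit disentanglement by amplifying salient intrinsic cues and suppressing extrinsic attributes; 3) a Progressive Attention Fusion (PAF) mechanism that renders the extracted appearance onto the target instance while preserving its structure, giving high-quality and robust instance editing.
Link: https://arxiv.org/abs/2512.15138
Authors: Shengxiao Zhou, Chenghua Li, Jianhao Huang, Qinghao Hu, Yifan Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages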
Abstract:Reference-guided instance editing is fundamentally limited by semantic entanglement, where a reference’s intrinsic appearance is intertwined with its extrinsic attributes. The key challenge lies in disentangling what information should be borrowed from the reference, and determining how to apply it appropriately to the target. To tackle this challenge, we propose GENIE, a Generalizable Instance Editing framework capable of achieving explicit disentanglement. GENIE first corrects spatial misalignments with a Spatial Alignment Module (SAM). Then, an Adaptive Residual Scaling Module (ARSM) learns what to borrow by amplifying salient intrinsic cues while suppressing extrinsic attributes, while a Progressive Attention Fusion (PAF) mechanism learns how to render this appearance onto the target, preserving its structure. Extensive experiments on the challenging AnyInsertion dataset demonstrate that GENIE achieves state-of-the-art fidelity and robustness, setting a new standard for disentanglement-based instance editing.
[CV-69] 3DProxyImg: Controllable 3D-Aware Animation Synthesis from Single Image via 2D-3D Aligned Proxy Embedding
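[Quick read]: This paper addresses a long-standing tension in 3D animation generation: the trade-off between rendering quality and 3D control. Traditional methods depend on complex full 3D pipelines that are costly and hard to interact with, while video-based generation automates part of the work but sacrifices 3D controllability and real-time interactivity. The key to the solution is a lightweight 3D animation framework that decouples geometric control from appearance synthesis. It introduces a 2D-3D aligned proxy representation that uses a coarse 3D structure as the structural carrier and delegates high-fidelity appearance and view synthesis to learned image-space generative priors. With this design the system offers 3D-aware motion control and interaction comparable to classical pipelines without accurate geometry or expensive optimization, extends naturally to coherent background animation, and runs efficiently on low-power devices for high-quality, interactive animation from a single image.
Link: https://arxiv.org/abs/2512.15126
Authors: Yupeng Zhu, Xiongzhen Zhang, Ye Chen, Bingbing Ni
Affiliations: Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: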
Abstract:3D animation is central to modern visual media, yet traditional production pipelines remain labor-intensive, expertise-demanding, and computationally expensive. Recent AIGC-based approaches partially automate asset creation and rigging, but they either inherit the heavy costs of full 3D pipelines or rely on video-synthesis paradigms that sacrifice 3D controllability and interactivity. We focus on single-image 3D animation generation and argue that progress is fundamentally constrained by a trade-off between rendering quality and 3D control. To address this limitation, we propose a lightweight 3D animation framework that decouples geometric control from appearance synthesis. The core idea is a 2D-3D aligned proxy representation that uses a coarse 3D estimate as a structural carrier, while delegating high-fidelity appearance and view synthesis to learned image-space generative priors. This proxy formulation enables 3D-aware motion control and interaction comparable to classical pipelines, without requiring accurate geometry or expensive optimization, and naturally extends to coherent background animation. Extensive experiments demonstrate that our method achieves efficient animation generation on low-power platforms and outperforms video-based 3D animation generation in identity preservation, geometric and textural consistency, and the level of precise, interactive control it offers to users.
[CV-70] BEV-Patch-PF: Particle Filtering with BEV-Aerial Feature Matching for Off-Road Geo-Localization
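[Quick read]: This paper addresses sequential geo-localization for robots in complex outdoor scenes without GPS, in particular the loss of accuracy under challenging conditions such as vegetation canopy and shadow. The key to the solution is BEV-Patch-PF, which combines a particle filter with learned bird's-eye-view (BEV) and aerial feature maps. A BEV feature map is built from onboard RGB and depth images. For each particle pose hypothesis, the corresponding patch is cropped from the feature map of a locally queried aerial image. A per-particle log-likelihood is then computed by matching the BEV feature against the aerial patch feature, giving accurate, robust, real-time localization.
Link: https://arxiv.org/abs/2512.15111
Authors: Dongmyeong Lee, Jesse Quattrociocchi, Christian Ellis, Rwik Rana, Amanda Adkins, Adam Uccello, Garrett Warnell, Joydeep Biswas
Affiliations: The University of Texas at Austin; DEVCOM Army Research Laboratory
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: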
Abstract:We propose BEV-Patch-PF, a GPS-free sequential geo-localization system that integrates a particle filter with learned bird’s-eye-view (BEV) and aerial feature maps. From onboard RGB and depth images, we construct a BEV feature map. For each 3-DoF particle pose hypothesis, we crop the corresponding patch from an aerial feature map computed from a local aerial image queried around the approximate location. BEV-Patch-PF computes a per-particle log-likelihood by matching the BEV feature to the aerial patch feature. On two real-world off-road datasets, our method achieves 7.5x lower absolute trajectory error (ATE) on seen routes and 7.0x lower ATE on unseen routes than a retrieval-based baseline, while maintaining accuracy under dense canopy and shadow. The system runs in real time at 10 Hz on an NVIDIA Tesla T4, enabling practical robot deployment.
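The per-particle log-likelihood step in the abstract above fits the standard particle filter measurement update. The following is a minimal sketch of that generic update for 3-DoF poses `(x, y, yaw)`, assuming the log-likelihoods are already computed elsewhere. How the paper scores a BEV feature against an aerial patch is not given here, so `log_likelihoods` is simply an input, and the function names are hypothetical.

```python
import math


def update_weights(prior_weights, log_likelihoods):
    """Bayes update in log space, normalized with the log-sum-exp trick.

    Assumes every prior weight is strictly positive.
    """
    log_w = [math.log(w) + ll for w, ll in zip(prior_weights, log_likelihoods)]
    peak = max(log_w)
    unnormalized = [math.exp(v - peak) for v in log_w]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def estimate_pose(particles, weights):
    """Weighted mean of (x, y, yaw); yaw is averaged on the circle."""
    x = sum(w * p[0] for p, w in zip(particles, weights))
    y = sum(w * p[1] for p, w in zip(particles, weights))
    sin_sum = sum(w * math.sin(p[2]) for p, w in zip(particles, weights))
    cos_sum = sum(w * math.cos(p[2]) for p, w in zip(particles, weights))
    return (x, y, math.atan2(sin_sum, cos_sum))


# Three pose hypotheses; the middle one matches the aerial patch best.
particles = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
weights = update_weights([1 / 3] * 3, [-10.0, 0.0, -10.0])
pose = estimate_pose(particles, weights)
```

Working in log space avoids underflow when many particles have very small likelihoods, which is common when only a few hypotheses align with the aerial map.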
[CV-71] Is Nano Banana Pro a Low-Level Vision All-Rounder? A Comprehensive Evaluation on 14 Tasks and 40 Datasets
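[Quick read]: This paper asks whether current generative text-to-image models such as Nano Banana Pro can serve as generalist solvers for traditional low-level vision tasks. The approach is a zero-shot evaluation: Nano Banana Pro is benchmarked on 14 low-level vision tasks and 40 datasets using only simple text prompts (no fine-tuning) and compared with state-of-the-art specialist models. The study finds that its subjective visual quality exceeds that of specialist models and that it often produces more realistic high-frequency detail, but it scores worse on traditional reference-based quantitative metrics. The authors attribute this to the inherent stochasticity of generative models, which struggle to meet pixel-level consistency requirements. The work thus outlines both the potential and the limits of generative AI on low-level vision tasks.
Link: https://arxiv.org/abs/2512.15110
Authors: Jialong Zuo, Haoyou Deng, Hanyu Zhou, Jiaxin Zhu, Yicheng Zhang, Yiwei Zhang, Yongxin Yan, Kaixing Huang, Weisen Chen, Yongtai Deng, Rui Jin, Nong Sang, Changxin Gao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report; 65 Pages, 36 Figures, 17 Tables; Project Page: this https URL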
Abstract:The rapid evolution of text-to-image generation models has revolutionized visual content creation. While commercial products like Nano Banana Pro have garnered significant attention, their potential as generalist solvers for traditional low-level vision challenges remains largely underexplored. In this study, we investigate the critical question: Is Nano Banana Pro a Low-Level Vision All-Rounder? We conducted a comprehensive zero-shot evaluation across 14 distinct low-level tasks spanning 40 diverse datasets. By utilizing simple textual prompts without fine-tuning, we benchmarked Nano Banana Pro against state-of-the-art specialist models. Our extensive analysis reveals a distinct performance dichotomy: while Nano Banana Pro demonstrates superior subjective visual quality, often hallucinating plausible high-frequency details that surpass specialist models, it lags behind in traditional reference-based quantitative metrics. We attribute this discrepancy to the inherent stochasticity of generative models, which struggle to maintain the strict pixel-level consistency required by conventional metrics. This report identifies Nano Banana Pro as a capable zero-shot contender for low-level vision tasks, while highlighting that achieving the high fidelity of domain specialists remains a significant hurdle.
[CV-72] Uni-Parser Technical Report
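[Quick read]: This paper addresses the throughput, cross-modal alignment accuracy, and scalability challenges of parsing scientific literature and patent documents. Traditional pipeline-based parsers struggle to keep fine-grained alignment consistent across text, equations, tables, figures, and chemical structures, and are inefficient at large scale. The key to the solution is a modular, loosely coupled multi-expert architecture that preserves fine-grained cross-modal alignments and scales through adaptive GPU load balancing, distributed inference, dynamic module orchestration, and configurable parsing modes (holistic or modality-specific). The design reaches up to 20 PDF pages per second on 8 NVIDIA RTX 4090D GPUs, improving the cost efficiency of large-scale document processing and supporting downstream applications such as literature retrieval, chemical structure extraction, and AI4Science model training.
Link: https://arxiv.org/abs/2512.15098
Authors: Xi Fang, Haoyi Tao, Shuwen Yang, Suyang Zhong, Haocheng Lu, Han Lyu, Chaozheng Huang, Xinyu Li, Linfeng Zhang, Guolin Ke
Affiliations: DP Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: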
Abstract:This technical report introduces Uni-Parser, an industrial-grade document parsing engine tailored for scientific literature and patents, delivering high throughput, robust accuracy, and cost efficiency. Unlike pipeline-based document parsing methods, Uni-Parser employs a modular, loosely coupled multi-expert architecture that preserves fine-grained cross-modal alignments across text, equations, tables, figures, and chemical structures, while remaining easily extensible to emerging modalities. The system incorporates adaptive GPU load balancing, distributed inference, dynamic module orchestration, and configurable modes that support either holistic or modality-specific parsing. Optimized for large-scale cloud deployment, Uni-Parser achieves a processing rate of up to 20 PDF pages per second on 8 x NVIDIA RTX 4090D GPUs, enabling cost-efficient inference across billions of pages. This level of scalability facilitates a broad spectrum of downstream applications, ranging from literature retrieval and summarization to the extraction of chemical structures, reaction schemes, and bioactivity data, as well as the curation of large-scale corpora for training next-generation large language models and AI4Science models.
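To put the reported throughput in perspective, the short calculation below converts "up to 20 PDF pages per second on 8 GPUs" into wall-clock time for a large corpus. It assumes the peak rate is sustained on a single 8-GPU node with no overhead, which is an idealization rather than anything the report states. The function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def days_to_parse(pages, pages_per_second=20.0):
    """Wall-clock days for one node at a constant page rate."""
    return pages / pages_per_second / SECONDS_PER_DAY


def pages_per_second_per_gpu(pages_per_second=20.0, gpus=8):
    """Average per-GPU share of the node's throughput."""
    return pages_per_second / gpus


one_billion_pages_days = days_to_parse(1_000_000_000)  # about 579 days
per_gpu_rate = pages_per_second_per_gpu()              # 2.5 pages/s
```

Under these assumptions a single node needs roughly 1.6 years per billion pages, which is why the abstract's emphasis on distributed inference matters for corpora of that size.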
[CV-73] PMMD: A pose-guided multi-view multi-modal diffusion for person generation
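[Quick read]: This paper addresses the challenges of generating consistent human images, including occlusion, garment style drift, and pose misalignment, particularly for applications such as virtual try-on, image editing, and digital human creation. The key to the solution is a diffusion-based multi-view multimodal framework, Pose-guided Multi-view Multimodal Diffusion (PMMD). A multimodal encoder jointly models visual views, pose features, and semantic descriptions, which reduces cross-modal discrepancy and improves identity fidelity. A ResCVA module enhances local detail while preserving global structure, and a cross-modal fusion module integrates image semantics with text throughout the denoising process, yielding high-fidelity, controllable, and consistent person image generation.
Link: https://arxiv.org/abs/2512.15069
Authors: Ziyu Shang, Haoran Liu, Rongchao Zhang, Zhiqian Wei, Tongtong Feng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: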
Abstract:Generating consistent human images with controllable pose and appearance is essential for applications in virtual try on, image editing, and digital human creation. Current methods often suffer from occlusions, garment style drift, and pose misalignment. We propose Pose-guided Multi-view Multimodal Diffusion (PMMD), a diffusion framework that synthesizes photorealistic person images conditioned on multi-view references, pose maps, and text prompts. A multimodal encoder jointly models visual views, pose features, and semantic descriptions, which reduces cross modal discrepancy and improves identity fidelity. We further design a ResCVA module to enhance local detail while preserving global structure, and a cross modal fusion module that integrates image semantics with text throughout the denoising pipeline. Experiments on the DeepFashion MultiModal dataset show that PMMD outperforms representative baselines in consistency, detail preservation, and controllability. Project page and code are available at this https URL.
[CV-74] Tracking spatial temporal details in ultrasound long video via wavelet analysis and memory bank
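[Quick read]: This paper addresses the low accuracy of target organ and lesion segmentation in medical ultrasound video, specifically the boundary missegmentation and small-object loss caused by low contrast and noise, and the difficulty of tracking objects through long video sequences. The key to the solution is a memory bank-based wavelet filtering and fusion network that strengthens perception of high-frequency information. Its main components are: 1) memory-based wavelet convolution in the encoder, which captures category features and detail information while using adjacent context; 2) cascaded wavelet compression, which fuses multiscale frequency-domain features and expands the receptive field of a single convolutional layer; 3) a long short-term memory bank based on cross-attention and memory compression, for stable object tracking in long video; 4) a high-frequency-aware feature fusion module in the decoder, which uses adaptive wavelet filters to enhance boundary-sensitive high-frequency detail and so improves segmentation of small structures such as small thyroid nodules.
Link: https://arxiv.org/abs/2512.15066
Authors: Chenxiao Zhang, Runshi Zhang, Junchen Wang
Affiliations: Beihang University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Chenxiao Zhang and Runshi Zhang contributed equally to this work. 14 pages, 11 figures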
Abstract:Medical ultrasound videos are widely used for medical inspections, disease diagnosis and surgical planning. High-fidelity lesion area and target organ segmentation constitutes a key component of the computer-assisted surgery workflow. The low contrast levels and noisy backgrounds of ultrasound videos cause missegmentation of organ boundary, which may lead to small object losses and increase boundary segmentation errors. Object tracking in long videos also remains a significant research challenge. To overcome these challenges, we propose a memory bank-based wavelet filtering and fusion network, which adopts an encoder-decoder structure to effectively extract fine-grained detailed spatial features and integrate high-frequency (HF) information. Specifically, memory-based wavelet convolution is presented to simultaneously capture category, detailed information and utilize adjacent information in the encoder. Cascaded wavelet compression is used to fuse multiscale frequency-domain features and expand the receptive field within each convolutional layer. A long short-term memory bank using cross-attention and memory compression mechanisms is designed to track objects in long video. To fully utilize the boundary-sensitive HF details of feature maps, an HF-aware feature fusion module is designed via adaptive wavelet filters in the decoder. In extensive benchmark tests conducted on four ultrasound video datasets (two thyroid nodule, the thyroid gland, the heart datasets) compared with the state-of-the-art methods, our method demonstrates marked improvements in segmentation metrics. In particular, our method can more accurately segment small thyroid nodules, demonstrating its effectiveness for cases involving small ultrasound objects in long video. The code is available at this https URL.
[CV-75] Asynchronous Event Stream Noise Filtering for High-frequency Structure Deformation Measurement
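[Quick read]: This paper addresses the measurement of high-frequency deformations that large structures undergo under complex loads, where traditional high-speed-camera methods are limited by harsh lighting conditions and high equipment cost. The key to the solution is to combine an event camera with LED markers. First, observation noise is filtered out using the characteristics of the event stream generated by LED blinking together with its spatiotemporal correlation. Next, motion-induced events are distinguished from LED-blinking events so that fast-moving LED markers can be extracted accurately. Finally, high-frequency planar deformations are measured with a monocular event camera.
Link: https://arxiv.org/abs/2512.15055
Authors: Yifei Bian, Banglei Guan, Zibin Liu, Ang Su, Shiyao Zhu, Yang Shang, Qifeng Yu
Affiliations: College of Aerospace Science and Engineering, National University of Defense Technology; Hunan Provincial Key Laboratory of Image Measurement and Vision Navigation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 12 figures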
Abstract:Large-scale structures suffer high-frequency deformations due to complex loads. However, harsh lighting conditions and high equipment costs limit measurement methods based on traditional high-speed cameras. This paper proposes a method to measure high-frequency deformations by exploiting an event camera and LED markers. Firstly, observation noise is filtered based on the characteristics of the event stream generated by LED markers blinking and spatiotemporal correlation. Then, LED markers are extracted from the event stream after differentiating between motion-induced events and events from LED blinking, which enables the extraction of high-speed moving LED markers. Ultimately, high-frequency planar deformations are measured by a monocular event camera. Experimental results confirm the accuracy of our method in measuring high-frequency planar deformations.
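The noise-filtering step above relies on spatiotemporal correlation in the event stream. One common way to use that idea is a neighbor-support filter: an event is kept only if a nearby pixel fired recently. The sketch below implements that generic filter, not the paper's LED-specific method. The event layout `(t_us, x, y, polarity)`, the function name, and the default thresholds are assumptions.

```python
def filter_events(events, radius=1, window_us=1000):
    """Keep events that have a recent event at a neighboring pixel.

    `events` must be sorted by timestamp. Isolated events, which are
    typical of sensor noise, are dropped.
    """
    last_seen = {}  # (x, y) -> timestamp of the latest event at that pixel
    kept = []
    for t, x, y, polarity in events:
        supported = False
        for dx in range(-radius, radius + 1):
            for dy in range(-radius, radius + 1):
                if dx == 0 and dy == 0:
                    continue  # a pixel does not support itself
                ts = last_seen.get((x + dx, y + dy))
                if ts is not None and t - ts <= window_us:
                    supported = True
        if supported:
            kept.append((t, x, y, polarity))
        last_seen[(x, y)] = t
    return kept


stream = [
    (0, 5, 5, 1),      # first event of a cluster, no support yet
    (100, 6, 5, 1),    # neighbor of (5, 5), kept
    (200, 5, 6, 1),    # neighbor of (5, 5), kept
    (300, 40, 40, 1),  # isolated pixel, dropped
    (5000, 5, 5, 1),   # neighbors fired too long ago, dropped
]
clean = filter_events(stream)
```

The first event of any genuine cluster is also discarded, which is the usual cost of this kind of causal filter.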
[CV-76] MVGSR: Multi-View Consistent 3D Gaussian Super-Resolution via Epipolar Guidance
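[Quick read]: This paper addresses the missing detail and cross-view inconsistency that appear when a 3D Gaussian Splatting (3DGS) scene trained on low-resolution (LR) images is rendered at high resolution (HR). Existing methods rely on single-image super-resolution (SISR) networks or strictly sequential video frames, so they fuse complementary multi-view information poorly and do not suit unordered multi-view datasets. The paper proposes Multi-View Consistent 3D Gaussian Splatting Super-Resolution (MVGSR). Its key contributions are: 1) an auxiliary view selection mechanism based on camera poses, which lets the method handle arbitrarily organized multi-view data; 2) the first use of an epipolar-constrained multi-view attention mechanism in 3DGS super-resolution, which selectively aggregates consistent information from auxiliary views and improves the geometric consistency and high-frequency detail fidelity of the 3DGS representation.
Link: https://arxiv.org/abs/2512.15048
Authors: Kaizhe Zhang, Shinan Chen, Qian Zhao, Weizhan Zhang, Caixia Yan, Yudeng Xin
Affiliations: Xi’an Jiaotong University; University of Melbourne
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 7 figures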
Abstract:Scenes reconstructed by 3D Gaussian Splatting (3DGS) trained on low-resolution (LR) images are unsuitable for high-resolution (HR) rendering. Consequently, a 3DGS super-resolution (SR) method is needed to bridge LR inputs and HR rendering. Early 3DGS SR methods rely on single-image SR networks, which lack cross-view consistency and fail to fuse complementary information across views. More recent video-based SR approaches attempt to address this limitation but require strictly sequential frames, limiting their applicability to unstructured multi-view datasets. In this work, we introduce Multi-View Consistent 3D Gaussian Splatting Super-Resolution (MVGSR), a framework that focuses on integrating multi-view information for 3DGS rendering with high-frequency details and enhanced consistency. We first propose an Auxiliary View Selection Method based on camera poses, making our method adaptable for arbitrarily organized multi-view datasets without the need of temporal continuity or data reordering. Furthermore, we introduce, for the first time, an epipolar-constrained multi-view attention mechanism into 3DGS SR, which serves as the core of our proposed multi-view SR network. This design enables the model to selectively aggregate consistent information from auxiliary views, enhancing the geometric consistency and detail fidelity of 3DGS representations. Extensive experiments demonstrate that our method achieves state-of-the-art performance on both object-centric and scene-level 3DGS SR benchmarks.
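Pose-based auxiliary view selection, as described above, can be pictured as ranking candidate cameras by how close their pose is to the target view. The sketch below is one plausible scoring rule, assumed for illustration: camera-center distance plus a weighted angle between viewing directions. The abstract does not give the paper's actual criterion, and `select_auxiliary_views` and `angle_weight` are hypothetical names.

```python
import math


def select_auxiliary_views(target, cameras, k=2, angle_weight=1.0):
    """Return indices of the k cameras with poses closest to `target`.

    Each camera is (position, forward), where position is an (x, y, z)
    tuple and forward is a unit viewing direction. No temporal order
    is assumed, so the input can be an unordered image collection.
    """
    def pose_distance(cam):
        translation = math.dist(target[0], cam[0])
        cos_angle = sum(a * b for a, b in zip(target[1], cam[1]))
        cos_angle = max(-1.0, min(1.0, cos_angle))  # guard acos domain
        return translation + angle_weight * math.acos(cos_angle)

    ranked = sorted(range(len(cameras)), key=lambda i: pose_distance(cameras[i]))
    return ranked[:k]


target_view = ((0.0, 0.0, 0.0), (0.0, 0.0, 1.0))
candidates = [
    ((5.0, 0.0, 0.0), (0.0, 0.0, 1.0)),   # far away
    ((0.5, 0.0, 0.0), (0.0, 0.0, 1.0)),   # near, same direction
    ((0.2, 0.0, 0.0), (0.0, 0.0, -1.0)),  # nearest, but facing backwards
    ((1.0, 0.0, 0.0), (0.0, 0.0, 1.0)),   # moderately near
]
chosen = select_auxiliary_views(target_view, candidates, k=2)
```

The backwards-facing camera is rejected despite being closest in position, since it shares little visible content with the target view.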
[CV-77] Model Agnostic Preference Optimization for Medical Image Segmentation
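[Quick read]: This paper addresses two limits of preference optimization for medical image segmentation, where supervision is scarce: prior attempts are tied to specific model architectures and rely on low-diversity prediction sampling. The key to the solution is Model-Agnostic Preference Optimization (MAPO), a framework that uses Dropout-driven stochastic segmentation hypotheses to construct preference-consistent gradients, so training proceeds stably without direct ground-truth supervision. The method is agnostic to network architecture and dimensionality and applies to 2D/3D convolutional neural network (CNN) and Transformer-based segmentation pipelines. Evaluations on several medical datasets show better boundary adherence, less overfitting, and more stable optimization.
Link: https://arxiv.org/abs/2512.15009
Authors: Yunseong Nam, Jiwon Jang, Dongkyu Won, Sang Hyun Park, Soopil Kim
Affiliations: DGIST
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: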
Abstract:Preference optimization offers a scalable supervision paradigm based on relative preference signals, yet prior attempts in medical image segmentation remain model-specific and rely on low-diversity prediction sampling. In this paper, we propose MAPO (Model-Agnostic Preference Optimization), a training framework that utilizes Dropout-driven stochastic segmentation hypotheses to construct preference-consistent gradients without direct ground-truth supervision. MAPO is fully architecture- and dimensionality-agnostic, supporting 2D/3D CNN and Transformer-based segmentation pipelines. Comprehensive evaluations across diverse medical datasets reveal that MAPO consistently enhances boundary adherence, reduces overfitting, and yields more stable optimization dynamics compared to conventional supervised training.
[CV-78] Evaluating the Capability of Video Question Generation for Expert Knowledge Elicitation WACV2026
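[Quick read]: This paper addresses the evaluation of video question generation (VQG) models within video question answering (VideoQA): how to quantify the quality of generated questions rather than only a model's ability to answer them. Conventional evaluation focuses on answer accuracy and overlooks whether a question effectively elicits unseen knowledge from experts. The key to the solution is an evaluation protocol that simulates question-answering interaction with experts. The authors build the EgoExoAsk dataset (27,666 QA pairs generated from Ego-Exo4D expert commentary annotations) and use question-to-answer retrieval to simulate a human expert's responses, which measures how well a question leads to new knowledge. Under this protocol, VQG models with access to richer context score higher, supporting the validity of the evaluation.
Link: https://arxiv.org/abs/2512.15006
Authors: Huaying Zhang, Atsushi Hashimoto, Tosho Hirasawa
Affiliations: OMRON SINIC X Corp.; Hokkaido University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: WACV 2026 accepted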
Abstract:Skilled human interviewers can extract valuable information from experts. This raises a fundamental question: what makes some questions more effective than others? To address this, a quantitative evaluation of question-generation models is essential. Video question generation (VQG) is a topic for video question answering (VideoQA), where questions are generated for given answers. Their evaluation typically focuses on the ability to answer questions, rather than the quality of generated questions. In contrast, we focus on the question quality in eliciting unseen knowledge from human experts. For a continuous improvement of VQG models, we propose a protocol that evaluates the ability by simulating question-answering communication with experts using a question-to-answer retrieval. We obtain the retriever by constructing a novel dataset, EgoExoAsk, which comprises 27,666 QA pairs generated from Ego-Exo4D’s expert commentary annotation. The EgoExoAsk training set is used to obtain the retriever, and the benchmark is constructed on the validation set with Ego-Exo4D video segments. Experimental results demonstrate our metric reasonably aligns with question generation settings: models accessing richer context are evaluated better, supporting that our protocol works as intended. The EgoExoAsk dataset is available in this https URL .
[CV-79] Beyond Proximity: A Keypoint-Trajectory Framework for Classifying Affiliative and Agonistic Social Networks in Dairy Cattle
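[Quick read]: This paper addresses objective assessment of social behavior in precision livestock farming. Existing methods infer animal interactions from static distance thresholds and cannot distinguish affiliative from agonistic behavior, which limits the interpretability of automated social network analysis on commercial farms. The key to the solution is a pose-based computational framework that models the spatiotemporal geometry of anatomical keypoints and extracts interaction-specific motion signatures, making it possible to distinguish the valence of social interactions. The framework integrates YOLOv11 object detection, individual identification, multi-object tracking (ByteTrack), 27-point keypoint estimation (ZebraPose), and a support vector machine classifier. Using pose information alone, it classifies interactions in near real time on commodity hardware with 77.51% accuracy, a substantial gain over a proximity-only baseline.
Link: https://arxiv.org/abs/2512.14998
Authors: Sibi Parivendan, Kashfia Sailunaz, Suresh Neethirajan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 36 pages, 12 figures, 8 tables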
Abstract:Precision livestock farming requires objective assessment of social behavior to support herd welfare monitoring, yet most existing approaches infer interactions using static proximity thresholds that cannot distinguish affiliative from agonistic behaviors in complex barn environments. This limitation constrains the interpretability of automated social network analysis in commercial settings. We present a pose-based computational framework for interaction classification that moves beyond proximity heuristics by modeling the spatiotemporal geometry of anatomical keypoints. Rather than relying on pixel-level appearance or simple distance measures, the proposed method encodes interaction-specific motion signatures from keypoint trajectories, enabling differentiation of social interaction valence. The framework is implemented as an end-to-end computer vision pipeline integrating YOLOv11 for object detection (mAP@0.50: 96.24%), supervised individual identification (98.24% accuracy), ByteTrack for multi-object tracking (81.96% accuracy), ZebraPose for 27-point anatomical keypoint estimation, and a support vector machine classifier trained on pose-derived distance dynamics. On annotated interaction clips collected from a commercial dairy barn, the classifier achieved 77.51% accuracy in distinguishing affiliative and agonistic behaviors using pose information alone. Comparative evaluation against a proximity-only baseline shows substantial gains in behavioral discrimination, particularly for affiliative interactions. The results establish a proof-of-concept for automated, vision-based inference of social interactions suitable for constructing interaction-aware social networks, with near-real-time performance on commodity hardware.
[CV-80] Where is the Watermark? Interpretable Watermark Detection at the Block Level WACV2026
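[Quick read]: This paper addresses the trust problems around authenticity, ownership, and misuse raised by content produced with generative AI. Most existing image watermarking schemes operate as black boxes and cannot explain where a watermark is present or how strong it is, which undermines user trust and makes the impact of tampering hard to assess. The key to the solution is a post-hoc image watermarking method that embeds locally in the Discrete Wavelet Transform (DWT) domain using a statistical block-wise strategy and produces region-level detection maps. These maps reveal which regions of an image are likely watermarked or altered, while the watermark stays robust to common image transformations, sensitive to semantic manipulations, and highly imperceptible.
Link: https://arxiv.org/abs/2512.14994
Authors: Maria Bulychev, Neil G. Marchant, Benjamin I. P. Rubinstein
Affiliations: University of Melbourne
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 20 pages, 14 figures. Camera-ready for WACV 2026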
Abstract:Recent advances in generative AI have enabled the creation of highly realistic digital content, raising concerns around authenticity, ownership, and misuse. While watermarking has become an increasingly important mechanism to trace and protect digital media, most existing image watermarking schemes operate as black boxes, producing global detection scores without offering any insight into how or where the watermark is present. This lack of transparency impacts user trust and makes it difficult to interpret the impact of tampering. In this paper, we present a post-hoc image watermarking method that combines localised embedding with region-level interpretability. Our approach embeds watermark signals in the discrete wavelet transform domain using a statistical block-wise strategy. This allows us to generate detection maps that reveal which regions of an image are likely watermarked or altered. We show that our method achieves strong robustness against common image transformations while remaining sensitive to semantic manipulations. At the same time, the watermark remains highly imperceptible. Compared to prior post-hoc methods, our approach offers more interpretable detection while retaining competitive robustness. For example, our watermarks are robust to cropping up to half the image.
[CV-81] Adaptive Multimodal Person Recognition: A Robust Framework for Handling Missing Modalities
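[Quick read]: This paper addresses the sharp performance drop of real-world multimodal person recognition systems when one or more modalities (such as voice, face, or gesture) are missing or degraded. The key to the solution is a Trimodal person identification framework. A multi-task learning structure processes each modality independently, cross-attention and gated fusion mechanisms exchange information across modalities, and a confidence-weighted fusion strategy adapts dynamically to missing or low-quality data, so recognition accuracy stays high in Unimodal or Bimodal scenarios.
Link: https://arxiv.org/abs/2512.14961
Authors: Aref Farhadipour, Teodora Vukovic, Volker Dellwo, Petr Motlicek, Srikanth Madikeri
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS); Image and Video Processing (eess.IV)
Comments: 10 pages and 8 tables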
Abstract:Person recognition systems often rely on audio, visual, or behavioral cues, but real-world conditions frequently result in missing or degraded modalities. To address this challenge, we propose a Trimodal person identification framework that integrates voice, face, and gesture modalities, while remaining robust to modality loss. Our approach leverages multi-task learning to process each modality independently, followed by a cross-attention and gated fusion mechanisms to facilitate interaction across modalities. Moreover, a confidence-weighted fusion strategy dynamically adapts to missing and low-quality data, ensuring optimal classification even in Unimodal or Bimodal scenarios. We evaluate our method on CANDOR, a newly introduced interview-based multimodal dataset, which we benchmark for the first time. Our results demonstrate that the proposed Trimodal system achieves 99.18% Top-1 accuracy on person identification tasks, outperforming conventional Unimodal and late-fusion approaches. In addition, we evaluate our model on the VoxCeleb1 dataset as a benchmark and reach 99.92% accuracy in Bimodal mode. Moreover, we show that our system maintains high accuracy even when one or two modalities are unavailable, making it a robust solution for real-world person recognition applications. The code and data for this work are publicly available.
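Confidence-weighted fusion that tolerates missing modalities can be illustrated with a simple late-fusion rule: drop absent modalities, renormalize the remaining confidences, and average the class scores. This is a minimal sketch under that assumption. The paper's learned cross-attention and gating are not reproduced, and `fuse_scores` and the example confidences are hypothetical.

```python
def fuse_scores(scores, confidences):
    """Confidence-weighted average of per-modality class probabilities.

    `scores` maps modality name -> list of class probabilities, or None
    when the modality is missing. Weights are renormalized over the
    modalities that are actually present.
    """
    available = {m: s for m, s in scores.items() if s is not None}
    if not available:
        raise ValueError("at least one modality is required")
    total = sum(confidences[m] for m in available)
    if total <= 0:
        raise ValueError("confidences of available modalities must be positive")
    num_classes = len(next(iter(available.values())))
    fused = [0.0] * num_classes
    for modality, probs in available.items():
        weight = confidences[modality] / total
        for i, p in enumerate(probs):
            fused[i] += weight * p
    return fused


# Bimodal case: the gesture stream is missing.
fused = fuse_scores(
    {"voice": [0.6, 0.4], "face": [0.2, 0.8], "gesture": None},
    {"voice": 0.5, "face": 1.5, "gesture": 1.0},
)
```

Renormalizing over present modalities keeps the fused vector a valid probability distribution whether one, two, or three inputs are available.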
[CV-82] Puzzle Curriculum GRPO for Vision-Centric Reasoning
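[Quick read]: This paper addresses three problems in reinforcement learning (RL) training of chain-of-thought reasoning for Vision Language Models (VLMs): (i) reliance on costly and noisy hand-curated annotations or external verifiers; (ii) flat and sparse reward signals in GRPO; (iii) logical inconsistency between the reasoning and the final answer. The solution is Puzzle Curriculum GRPO (PC-GRPO), a supervision-free framework for RL with Verifiable Rewards (RLVR). It replaces human labels with three self-supervised puzzle environments (PatchFit, Rotation, and Jigsaw), where Rotation uses binary rewards and Jigsaw uses graded partial credit to mitigate reward sparsity. A difficulty-aware curriculum dynamically reweights samples and concentrates on medium-difficulty samples to counter vanishing group-relative advantages. During post-training the authors also monitor Reasoning-Answer Consistency (RAC): the curriculum delays the decline in RAC, and consistency-enforcing reward schemes raise RAC further. Together these improve downstream accuracy and training stability.
Link: https://arxiv.org/abs/2512.14944
Authors: Ahmadreza Jeddi, Hakki Can Karaimer, Hue Nguyen, Zhongling Wang, Ke Zhao, Javad Rajabi, Ran Zhang, Raghav Goyal, Babak Taati, Radek Grzeszczuk
Affiliations: AI Center-Toronto, Samsung Electronics; University of Toronto; Vector Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL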
Abstract:Recent reinforcement learning (RL) approaches like outcome-supervised GRPO have advanced chain-of-thought reasoning in Vision Language Models (VLMs), yet key issues linger: (i) reliance on costly and noisy hand-curated annotations or external verifiers; (ii) flat and sparse reward schemes in GRPO; and (iii) logical inconsistency between a chain’s reasoning and its final answer. We present Puzzle Curriculum GRPO (PC-GRPO), a supervision-free recipe for RL with Verifiable Rewards (RLVR) that strengthens visual reasoning in VLMs without annotations or external verifiers. PC-GRPO replaces labels with three self-supervised puzzle environments: PatchFit, Rotation (with binary rewards) and Jigsaw (with graded partial credit mitigating reward sparsity). To counter flat rewards and vanishing group-relative advantages, we introduce a difficulty-aware curriculum that dynamically weights samples and peaks at medium difficulty. We further monitor Reasoning-Answer Consistency (RAC) during post-training: mirroring reports for vanilla GRPO in LLMs, RAC typically rises early then degrades; our curriculum delays this decline, and consistency-enforcing reward schemes further boost RAC. RAC correlates with downstream accuracy. Across diverse benchmarks and on Qwen-7B and Qwen-3B backbones, PC-GRPO improves reasoning quality, training stability, and end-task accuracy, offering a practical path to scalable, verifiable, and interpretable RL post-training for VLMs.
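Three ideas from the abstract above can be made concrete in a few lines: graded partial credit for Jigsaw, the way group-relative advantages vanish when every rollout gets the same reward, and a sample weight that peaks at medium difficulty. The sketch below is illustrative only. In particular, the weighting function `4 * p * (1 - p)` is an assumed stand-in, since the abstract does not state the paper's curriculum formula.

```python
import statistics


def jigsaw_reward(predicted_order, target_order):
    """Graded partial credit: fraction of tiles placed correctly."""
    correct = sum(p == t for p, t in zip(predicted_order, target_order))
    return correct / len(target_order)


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: rewards standardized within one group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def difficulty_weight(rewards):
    """Assumed weighting: largest when mean reward is near 0.5.

    Groups that are all-correct or all-wrong get weight 0, matching
    the fact that they carry no group-relative learning signal.
    """
    p = statistics.fmean(rewards)
    return 4.0 * p * (1.0 - p)


partial = jigsaw_reward([0, 1, 3, 2], [0, 1, 2, 3])   # half the tiles correct
flat = group_relative_advantages([1, 1, 1, 1])        # every advantage is zero
medium = difficulty_weight([1, 0, 1, 0])              # peak weight
too_easy = difficulty_weight([1, 1, 1, 1])            # no weight
```

With binary rewards, a group of identical outcomes yields zero advantage for every rollout, which is the "vanishing group-relative advantage" the curriculum is meant to counter.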
[CV-83] TalkVerse: Democratizing Minute-Long Audio-Driven Video Generation
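[Quick read]: This paper addresses two problems common in current research on single-person, audio-driven talking video generation: closed data and high model compute cost. Existing methods mostly rely on private datasets or resource-intensive models, which makes results hard to reproduce and raises the barrier to deployment. The key to the solution is TalkVerse, a large-scale open dataset of 2.3 million high-resolution, audio-video synchronized clips totaling 6.3k hours, curated through a transparent cleaning and annotation pipeline (scene-cut detection, aesthetic assessment, strict audio-visual synchronization checks, and 2D skeletons with structured style captions). The paper also presents a 5B-parameter DiT baseline that combines a video variational autoencoder (video VAE) with a high downsampling ratio and a sliding window mechanism with motion-frame context. It generates minute-long video with low drift and matches a 14B model in lip-sync and visual quality at 10× lower inference cost. A multimodal large language model (MLLM) acts as a director that rewrites prompts, and the model supports zero-shot video dubbing, which improves long-video storytelling and controllability.
Link: https://arxiv.org/abs/2512.14938
Authors: Zhenzhi Wang, Jian Wang, Ke Ma, Dahua Lin, Bing Zhou
Affiliations: The Chinese University of Hong Kong; Snap Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Sound (cs.SD)
Comments: open-sourced single-person full-body talking video generation dataset, training code and checkpoints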
Abstract:We introduce TalkVerse, a large-scale, open corpus for single-person, audio-driven talking video generation designed to enable fair, reproducible comparison across methods. While current state-of-the-art systems rely on closed data or compute-heavy models, TalkVerse offers 2.3 million high-resolution (720p/1080p) audio-video synchronized clips totaling 6.3k hours. These are curated from over 60k hours of video via a transparent pipeline that includes scene-cut detection, aesthetic assessment, strict audio-visual synchronization checks, and comprehensive annotations including 2D skeletons and structured visual/audio-style captions. Leveraging TalkVerse, we present a reproducible 5B DiT baseline built on Wan2.2-5B. By utilizing a video VAE with a high downsampling ratio and a sliding window mechanism with motion-frame context, our model achieves minute-long generation with low drift. It delivers comparable lip-sync and visual quality to the 14B Wan-S2V model but with 10× lower inference cost. To enhance storytelling in long videos, we integrate an MLLM director to rewrite prompts based on audio and visual cues. Furthermore, our model supports zero-shot video dubbing via controlled latent noise injection. We open-source the dataset, training recipes, and 5B checkpoints to lower barriers for research in audio-driven human video generation. Project Page: this https URL
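[CV-84] Improving Pre-trained Segmentation Models using Post-Processing
[Quick read]: This paper addresses the poor generalization and systematic errors (such as false positives, label swaps, and slice discontinuities) of current glioma segmentation methods for multiparametric MRI (mpMRI) built on large-scale pre-trained models, along with unequal access to GPU resources and the rising environmental cost of large-scale model training. The key to its solution is a set of adaptive post-processing techniques that refine the segmentations produced by large pre-trained models, improving computational fairness and sustainability while preserving accuracy. On BraTS 2025 segmentation challenge tasks, the ranking metric improves by 14.9% for the sub-Saharan Africa challenge and 0.9% for the adult glioma challenge.
Link: https://arxiv.org/abs/2512.14937
Authors: Abhijeet Parida,Daniel Capellán-Martín,Zhifan Jiang,Nishad Kulkarni,Krithika Iyer,Austin Tapp,Syed Muhammad Anwar,María J. Ledesma-Carbayo,Marius George Linguraru
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: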
Abstract:Gliomas are the most common malignant brain tumors in adults and are among the most lethal. Despite aggressive treatment, median survival is less than 15 months. Accurate multiparametric MRI (mpMRI) tumor segmentation is critical for surgical planning, radiotherapy, and disease monitoring. While deep learning models have improved the accuracy of automated segmentation, large-scale pre-trained models generalize poorly and often underperform, producing systematic errors such as false positives, label swaps, and slice discontinuities. These limitations are further compounded by unequal access to GPU resources and the growing environmental cost of large-scale model training. In this work, we propose adaptive post-processing techniques to refine the quality of glioma segmentations produced by large-scale pretrained models developed for various types of tumors. We demonstrated the techniques in multiple BraTS 2025 segmentation challenge tasks, with the ranking metric improving by 14.9% for the sub-Saharan Africa challenge and 0.9% for the adult glioma challenge. This approach promotes a shift in brain tumor segmentation research from increasingly complex model architectures to efficient, clinically aligned post-processing strategies that are precise, computationally fair, and sustainable.
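[CV-85] PANDA-PLUS-Bench: A Clinical Benchmark for Evaluating Robustness of AI Foundation Models in Prostate Cancer Diagnosis
[Quick read]: This paper addresses the generalization of AI foundation models for prostate cancer Gleason grading: a model may reach high validation accuracy by learning slide-specific artifacts rather than transferable biological features, which limits its use in real clinical settings. The key to its solution is PANDA-PLUS-Bench, a purpose-built benchmark of expert-annotated whole slide images from nine unique patients covering diverse Gleason patterns, with non-overlapping tissue patches extracted at two resolutions (512x512 and 224x224 pixels) under eight augmentation conditions, so that a model's ability to separate biological signal from slide-level confounders can be evaluated systematically. On this benchmark, robustness varies substantially across models. HistoEncoder, trained specifically on prostate tissue, shows both the highest cross-slide accuracy and the strongest slide-level encoding, suggesting that tissue-specific training may enhance both biological feature capture and slide-specific signatures.
Link: https://arxiv.org/abs/2512.14922
Authors: Joshua L. Ebbert,Dennis Della Corte
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 5 figures, 6 Tables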
Abstract:Artificial intelligence foundation models are increasingly deployed for prostate cancer Gleason grading, where GP3/GP4 distinction directly impacts treatment decisions. However, these models may achieve high validation accuracy by learning specimen-specific artifacts rather than generalizable biological features, limiting real-world clinical utility. We introduce PANDA-PLUS-Bench, a curated benchmark dataset derived from expert-annotated prostate biopsies designed specifically to quantify this failure mode. The benchmark comprises nine carefully selected whole slide images from nine unique patients containing diverse Gleason patterns, with non-overlapping tissue patches extracted at both 512x512 and 224x224 pixel resolutions across eight augmentation conditions. Using this benchmark, we evaluate seven foundation models on their ability to separate biological signal from slide-level confounders. Our results reveal substantial variation in robustness across models: Virchow2 achieved the lowest slide-level encoding among large-scale models (81.0%) yet exhibited the second-lowest cross-slide accuracy (47.2%). HistoEncoder, trained specifically on prostate tissue, demonstrated the highest cross-slide accuracy (59.7%) and the strongest slide-level encoding (90.3%), suggesting tissue-specific training may enhance both biological feature capture and slide-specific signatures. All models exhibited measurable within-slide vs. cross-slide accuracy gaps, though the magnitude varied from 19.9 percentage points to 26.9 percentage points. We provide an open-source Google Colab notebook enabling researchers to evaluate additional foundation models against our benchmark using standardized metrics. PANDA-PLUS-Bench addresses a critical gap in foundation model evaluation by providing a purpose-built resource for robustness assessment in the clinically important context of Gleason grading.
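[CV-86] Vibe Spaces for Creatively Connecting and Expressing Visual Concepts
[Quick read]: This paper addresses a challenge in creating new visual concepts: generating semantically coherent and creative hybrid images by connecting distinct concepts through their shared attributes (their vibe). Current methods struggle to identify and traverse the nonlinear paths that link distant concepts in latent space, so their blends are unsatisfactory. The key to its solution is Vibe Space, a hierarchical graph manifold that learns low-dimensional geodesics in feature spaces such as CLIP, enabling smooth and semantically consistent transitions between concepts and producing blends that humans rate as more creative and coherent.
Link: https://arxiv.org/abs/2512.14884
Authors: Huzheng Yang,Katherine Xu,Andrew Lu,Michael D. Grossberg,Yutong Bai,Jianbo Shi
Affiliations: UPenn; CUNY; UC Berkeley
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL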
Abstract:Creating new visual concepts often requires connecting distinct ideas through their most relevant shared attributes – their vibe. We introduce Vibe Blending, a novel task for generating coherent and meaningful hybrids that reveals these shared attributes between images. Achieving such blends is challenging for current methods, which struggle to identify and traverse nonlinear paths linking distant concepts in latent space. We propose Vibe Space, a hierarchical graph manifold that learns low-dimensional geodesics in feature spaces like CLIP, enabling smooth and semantically consistent transitions between concepts. To evaluate creative quality, we design a cognitively inspired framework combining human judgments, LLM reasoning, and a geometric path-based difficulty score. We find that Vibe Space produces blends that humans consistently rate as more creative and coherent than current methods.
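[CV-87] Visual-textual Dermatoglyphic Animal Biometrics: A First Case Study on Panthera tigris
[Quick read]: This paper addresses the limitations of conventional animal re-identification (Re-ID) methods, which rely on image features alone, cannot support cross-modal retrieval, and offer little explainability. Its core solution introduces forensic-style dermatoglyphic textual descriptors that abstract and encode animal coat topology with human-interpretable language tags, yielding a joint visual-textual representation. The key innovation is a text-image co-synthesis pipeline that generates many "virtual individuals" to alleviate the scarcity of real data, which significantly improves cross-modal identity retrieval accuracy and enables text-to-image identity recovery underpinned by human-verifiable matchings.
Link: https://arxiv.org/abs/2512.14878
Authors: Wenshuo Li,Majid Mirmehdi,Tilo Burghardt
Affiliations: University of Bristol
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: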
Abstract:Biologists have long combined visuals with textual field notes to re-identify (Re-ID) animals. Contemporary AI tools automate this for species with distinctive morphological features but remain largely image-based. Here, we extend Re-ID methodologies by incorporating precise dermatoglyphic textual descriptors, an approach used in forensics but new to ecology. We demonstrate that these specialist semantics abstract and encode animal coat topology using human-interpretable language tags. Drawing on 84,264 manually labelled minutiae across 3,355 images of 185 tigers (Panthera tigris), we evaluate this visual-textual methodology, revealing novel capabilities for cross-modal identity retrieval. To optimise performance, we developed a text-image co-synthesis pipeline to generate ‘virtual individuals’, each comprising dozens of life-like visuals paired with dermatoglyphic text. Benchmarking against real-world scenarios shows this augmentation significantly boosts AI accuracy in cross-modal retrieval while alleviating data scarcity. We conclude that dermatoglyphic language-guided biometrics can overcome vision-only limitations, enabling textual-to-visual identity recovery underpinned by human-verifiable matchings. This represents a significant advance towards explainability in Re-ID and a language-driven unification of descriptive modalities in ecological monitoring.
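[CV-88] Isolated Sign Language Recognition with Segmentation and Pose Estimation
[Quick read]: This paper addresses the fact that recent advances in automated translation driven by large language models remain largely inaccessible to American Sign Language (ASL) users, focusing on the challenges of isolated sign language recognition (ISLR): scarce per-sign data, high signer variability, and substantial computational cost. The key to its solution is a model that integrates several modules: a pose estimation pipeline that extracts hand and face joint coordinates to reduce dimensionality, a segmentation module that isolates relevant visual information, and a hybrid ResNet-Transformer backbone that jointly models spatial and temporal dependencies, reducing computational requirements while maintaining robustness to signer variation.
Link: https://arxiv.org/abs/2512.14876
Authors: Daniel Perkins,Davis Hunter,Dhrumil Patel,Galen Flanagan
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 3 Figures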
Abstract:The recent surge in large language models has automated translations of spoken and written languages. However, these advances remain largely inaccessible to American Sign Language (ASL) users, whose language relies on complex visual cues. Isolated sign language recognition (ISLR) - the task of classifying videos of individual signs - can help bridge this gap but is currently limited by scarce per-sign data, high signer variability, and substantial computational costs. We propose a model for ISLR that reduces computational requirements while maintaining robustness to signer variation. Our approach integrates (i) a pose estimation pipeline to extract hand and face joint coordinates, (ii) a segmentation module that isolates relevant information, and (iii) a ResNet-Transformer backbone to jointly model spatial and temporal dependencies.
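[CV-89] HERBench: A Benchmark for Multi-Evidence Integration in Video Question Answering
[Quick read]: This paper addresses a widespread problem in current Video Question Answering (VideoQA) benchmarks: existing datasets let models answer from a single salient visual cue, so they under-test reasoning that must integrate multiple pieces of evidence across time. To address this, the authors present HERBench, a VideoQA benchmark purpose-built to assess multi-evidence integration across time. Its key innovation is the Minimum Required Frame-Set (MRFS), which quantifies the smallest number of frames a model must fuse to answer correctly; every question requires aggregating at least three non-overlapping evidential cues from distinct video segments, ruling out answers from language priors or a single frame. Experiments show that HERBench imposes substantially higher evidential demand than prior datasets (mean MRFS 5.5 vs. 2.6-4.2), and that 13 state-of-the-art Video-LLMs reach only 31-42% accuracy on it. The analysis identifies two bottlenecks, a retrieval deficit in frame selection and a fusion deficit in integrating information, giving a clear target for advancing robust, compositional video understanding.
Link: https://arxiv.org/abs/2512.14870
Authors: Dan Ben-Ami,Gabriele Serussi,Kobi Cohen,Chaim Baskin
Affiliations: INSIGHT Lab, Ben-Gurion University of the Negev, Israel; Ben-Gurion University of the Negev, Israel
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: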
Abstract:Video Large Language Models (Video-LLMs) are rapidly improving, yet current Video Question Answering (VideoQA) benchmarks often allow questions to be answered from a single salient cue, under-testing reasoning that must aggregate multiple, temporally separated visual evidence. We present HERBench, a VideoQA benchmark purpose-built to assess multi-evidence integration across time. Each question requires aggregating at least three non-overlapping evidential cues across distinct video segments, so neither language priors nor a single snapshot can suffice. HERBench comprises 26K five-way multiple-choice questions organized into twelve compositional tasks that probe identity binding, cross-entity relations, temporal ordering, co-occurrence verification, and counting. To make evidential demand measurable, we introduce the Minimum Required Frame-Set (MRFS), the smallest number of frames a model must fuse to answer correctly, and show that HERBench imposes substantially higher demand than prior datasets (mean MRFS 5.5 vs. 2.6-4.2). Evaluating 13 state-of-the-art Video-LLMs on HERBench reveals pervasive failures: accuracies of 31-42% are only slightly above the 20% random-guess baseline. We disentangle this failure into two critical bottlenecks: (1) a retrieval deficit, where frame selectors overlook key evidence, and (2) a fusion deficit, where models fail to integrate information even when all necessary evidence is provided. By making cross-time evidence both unavoidable and quantifiable, HERBench establishes a principled target for advancing robust, compositional video understanding.
[CV-90] Improving VQA Reliability: A Dual-Assessment Approach with Self-Reflection and Cross-Model Verification
[Quick read]: This paper addresses hallucination in Vision-Language Models (VLMs) on Visual Question Answering (VQA), which produces overconfident yet incorrect answers and severely undermines answer reliability. The key to its solution is Dual-Assessment for VLM Reliability (DAVR), a framework that combines Self-Reflection and Cross-Model Verification for comprehensive uncertainty estimation: one pathway uses dual selector modules that fuse VLM latent features with QA embeddings to assess response reliability, while the other uses external reference models for factual cross-checking to mitigate hallucinations, improving the trustworthiness of VLM responses.
Link: https://arxiv.org/abs/2512.14770
Authors: Xixian Wu,Yang Ou,Pengchao Tian,Zian Yang,Jielei Zhang,Peiyi Li,Longwen Gao
Affiliations: Bilibili Inc.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Vision-language models (VLMs) have demonstrated significant potential in Visual Question Answering (VQA). However, the susceptibility of VLMs to hallucinations can lead to overconfident yet incorrect answers, severely undermining answer reliability. To address this, we propose Dual-Assessment for VLM Reliability (DAVR), a novel framework that integrates Self-Reflection and Cross-Model Verification for comprehensive uncertainty estimation. The DAVR framework features a dual-pathway architecture: one pathway leverages dual selector modules to assess response reliability by fusing VLM latent features with QA embeddings, while the other deploys external reference models for factual cross-checking to mitigate hallucinations. Evaluated in the Reliable VQA Challenge at ICCV-CLVL 2025, DAVR achieves a leading $\Phi_{100}$ score of 39.64 and a 100-AUC of 97.22, securing first place and demonstrating its effectiveness in enhancing the trustworthiness of VLM responses.
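[CV-91] AquaDiff: Diffusion-Based Underwater Image Enhancement for Addressing Color Distortion
[Quick read]: This paper addresses the color distortion, low contrast, and loss of detail that wavelength-dependent light absorption and scattering cause in underwater images, which severely hinder vision-based underwater applications. The key to its solution is AquaDiff, a diffusion-based enhancement framework that combines a chromatic prior-guided color compensation strategy with a conditional diffusion process in which cross-attention dynamically fuses the degraded input with the noisy latent state at each denoising step. It also uses an enhanced denoising backbone (with residual dense blocks and multi-resolution attention) to capture global color context and local detail, and a cross-domain consistency loss that jointly enforces pixel-level accuracy, perceptual similarity, structural integrity, and frequency-domain fidelity.
Link: https://arxiv.org/abs/2512.14760
Authors: Afrah Shaahid,Muzammil Behzad
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: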
Abstract:Underwater images are severely degraded by wavelength-dependent light absorption and scattering, resulting in color distortion, low contrast, and loss of fine details that hinder vision-based underwater applications. To address these challenges, we propose AquaDiff, a diffusion-based underwater image enhancement framework designed to correct chromatic distortions while preserving structural and perceptual fidelity. AquaDiff integrates a chromatic prior-guided color compensation strategy with a conditional diffusion process, where cross-attention dynamically fuses degraded inputs and noisy latent states at each denoising step. An enhanced denoising backbone with residual dense blocks and multi-resolution attention captures both global color context and local details. Furthermore, a novel cross-domain consistency loss jointly enforces pixel-level accuracy, perceptual similarity, structural integrity, and frequency-domain fidelity. Extensive experiments on multiple challenging underwater benchmarks demonstrate that AquaDiff compares favorably with state-of-the-art traditional, CNN-, GAN-, and diffusion-based methods, achieving superior color correction and competitive overall image quality across diverse underwater conditions.
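[CV-92] The Renaissance of Expert Systems: Optical Recognition of Printed Chinese Jianpu Musical Scores with Lyrics
[Quick read]: This paper addresses the long-standing neglect of Chinese Jianpu (numbered notation) and its rich lyric resources in large-scale optical music recognition (OMR) research, that is, how to convert printed Jianpu scores with lyrics into machine-readable MusicXML and MIDI efficiently and accurately. The key to its solution is a modular expert-system pipeline that combines a top-down, rule-driven design with unsupervised deep-learning modules: traditional computer-vision techniques (e.g., phrase correlation, skeleton analysis) exploit prior knowledge and improve interpretability, while unsupervised deep learning provides image feature embeddings that improve recognition accuracy. Without requiring massive annotated training data, the system achieves high-precision recognition of both melody (note-wise F1 = 0.951) and aligned lyrics (character-wise F1 = 0.931).
Link: https://arxiv.org/abs/2512.14758
Authors: Fan Bu,Rongfeng Li,Zijin Li,Ya Li,Linfeng Fan,Pei Huang
Affiliations: Beijing University of Posts and Telecommunications; Central Conservatory of Music; Shanghai Normal University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 12 figures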
Abstract:Large-scale optical music recognition (OMR) research has focused mainly on Western staff notation, leaving Chinese Jianpu (numbered notation) and its rich lyric resources underexplored. We present a modular expert-system pipeline that converts printed Jianpu scores with lyrics into machine-readable MusicXML and MIDI, without requiring massive annotated training data. Our approach adopts a top-down expert-system design, leveraging traditional computer-vision techniques (e.g., phrase correlation, skeleton analysis) to capitalize on prior knowledge, while integrating unsupervised deep-learning modules for image feature embeddings. This hybrid strategy strikes a balance between interpretability and accuracy. Evaluated on The Anthology of Chinese Folk Songs, our system massively digitizes (i) a melody-only collection of more than 5,000 songs ( 300,000 notes) and (ii) a curated subset with lyrics comprising over 1,400 songs ( 100,000 notes). The system achieves high-precision recognition on both melody (note-wise F1 = 0.951) and aligned lyrics (character-wise F1 = 0.931).
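[CV-93] SocialNav-MoE: A Mixture-of-Experts Vision Language Model for Socially Compliant Navigation with Reinforcement Fine-Tuning
[Quick read]: This paper addresses social compliance for robots navigating in human-populated environments, that is, accounting for human comfort, social norms, and contextual appropriateness in addition to safety. Prior work has mostly emphasized safety, leaving socially compliant behavior underexplored. The authors propose SocialNav-MoE, an efficient Mixture-of-Experts (MoE) architecture built on a small Vision Language Model (VLM) and optimized with Reinforcement Fine-Tuning (RFT). The key innovation is a Semantic Similarity Reward (SSR), which improves decision-making more effectively than hard-level and character-level rewards; the lightweight model design and routing strategy aim to keep computational overhead low while maintaining navigation accuracy, suiting resource-constrained robotic platforms.
Link: https://arxiv.org/abs/2512.14757
Authors: Tomohito Kawabata,Xinyu Zhang,Ling Xiao
Affiliations: Hokkaido University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: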
Abstract:For robots navigating in human-populated environments, safety and social compliance are equally critical, yet prior work has mostly emphasized safety. Socially compliant navigation that accounts for human comfort, social norms, and contextual appropriateness remains underexplored. Vision language models (VLMs) show promise for this task; however, large-scale models incur substantial computational overhead, leading to higher inference latency and energy consumption, which makes them unsuitable for real-time deployment on resource-constrained robotic platforms. To address this issue, we investigate the effectiveness of small VLMs and propose SocialNav-MoE, an efficient Mixture-of-Experts vision language model for socially compliant navigation with reinforcement fine-tuning (RFT). We further introduce a semantic similarity reward (SSR) to effectively leverage RFT for enhancing the decision-making capabilities. Additionally, we study the effectiveness of different small language model types (Phi, Qwen, and StableLM), routing strategies, and vision encoders (CLIP vs. SigLIP, frozen vs. fine-tuned). Experiments on the SNEI dataset demonstrate that SocialNav-MoE achieves an excellent balance between navigation accuracy and efficiency. The proposed SSR function is more effective than hard-level and character-level rewards. Source code will be released upon acceptance.
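[CV-94] SkyCap: Bitemporal VHR Optical-SAR Quartets for Amplitude Change Detection and Foundation-Model Evaluation
[Quick read]: This paper addresses the difficulty of multimodal change detection for linear infrastructure monitoring, in particular the annotation difficulty and performance bottlenecks that arise when optical and Synthetic Aperture Radar (SAR) imagery are used together at very high resolution (VHR). The core challenge is that optical imagery is interpretable and straightforward to label but clouds break its acquisition cadence, whereas SAR enables all-weather acquisition but is difficult to annotate. The key to its solution is SkyCap, a bitemporal VHR optical-SAR dataset, together with optical-to-SAR label transfer that yields SAR amplitude change detection (ACD) labels without SAR-expert annotation. The study also performs continued pretraining of SARATR-X and compares optical foundation models (FMs) with SAR-specific FMs under different preprocessing choices. An optical FM with dB+Z-score preprocessing outperforms SAR-specific FMs further pretrained directly on Capella data, indicating that aligning preprocessing with pretraining statistics matters for performance and that the ranking of optical models on optical change detection does not transfer one-to-one to SAR ACD.
Link: https://arxiv.org/abs/2512.14755
Authors: Paul Weinmann,Ferdinand Schenck,Martin Šiklar
Affiliations: LiveEO GmbH
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 0 figures. Accepted at Advances in Representation Learning for Earth Observation (REO) at EurIPS 2025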
Abstract:Change detection for linear infrastructure monitoring requires reliable high-resolution data and regular acquisition cadence. Optical very-high-resolution (VHR) imagery is interpretable and straightforward to label, but clouds break this cadence. Synthetic Aperture Radar (SAR) enables all-weather acquisitions, yet is difficult to annotate. We introduce SkyCap, a bitemporal VHR optical-SAR dataset constructed by archive matching and co-registration of (optical) SkySat and Capella Space (SAR) scenes. We utilize optical-to-SAR label transfer to obtain SAR amplitude change detection (ACD) labels without requiring SAR-expert annotations. We perform continued pretraining of SARATR-X on our SAR data and benchmark the resulting SAR-specific foundation models (FMs) together with SARATR-X against optical FMs on SkyCap under different preprocessing choices. Among evaluated models, MTP(ViT-B+RVSA), an optical FM, with dB+Z-score preprocessing attains the best result ($F1_c$ = 45.06), outperforming SAR-specific FMs further pretrained directly on Capella data. We observe strong sensitivity to preprocessing alignment with pretraining statistics, and the ranking of optical models on optical change detection does not transfer one-to-one to SAR ACD. To our knowledge, this is the first evaluation of foundation models on VHR SAR ACD.
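The "dB+Z-score" preprocessing named in this abstract can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function names, the `floor` clamp, and the choice to compute the mean and standard deviation from the input itself are hypothetical; the paper may instead normalize with dataset-level or pretraining statistics.

```python
import math
from statistics import mean, pstdev

def amplitude_to_db(amplitudes, floor=1e-6):
    """Convert SAR amplitude values to decibels: 20 * log10(amplitude).
    Values are clamped to `floor` (an assumed constant) to avoid log10(0)."""
    return [20.0 * math.log10(max(a, floor)) for a in amplitudes]

def z_score(values, eps=1e-12):
    """Shift to zero mean and scale to unit standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]

# Toy amplitudes spanning two orders of magnitude.
pixels = [1.0, 10.0, 100.0]
db = amplitude_to_db(pixels)   # log scale compresses the dynamic range
normalized = z_score(db)       # centered, unit-variance model input
```

The log transform turns multiplicative differences in amplitude into evenly spaced values, and the Z-score step then brings them to a range closer to what an optical foundation model typically sees as input.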
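[CV-95] INFORM-CT: INtegrating LLMs and VLMs FOR Incidental Findings Management in Abdominal CT
[Quick read]: This paper addresses the slow, manual detection, classification, and reporting of incidental findings in abdominal CT scans. Manual inspection by radiologists is time-consuming and variable, while existing pure vision-language model (VLM) approaches fall short in accuracy and efficiency. The key to its solution is a plan-and-execute agentic framework: a planner based on a large language model (LLM) generates Python scripts that call predefined base functions, and an executor runs these scripts, using foundational VLMs, segmentation models, and image processing subroutines to perform the required checks and detections. The pipeline is fully automatic and end-to-end, and on a CT abdominal benchmark covering three organs it outperforms pure VLM-based approaches in accuracy and efficiency.
Link: https://arxiv.org/abs/2512.14732
Authors: Idan Tankel,Nir Mazor,Rafi Brada,Christina LeBedis,Guy ben-Yosef
Affiliations: GE Healthcare Technology and Innovation Center; Boston Medical Center
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: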
Abstract:Incidental findings in CT scans, though often benign, can have significant clinical implications and should be reported following established guidelines. Traditional manual inspection by radiologists is time-consuming and variable. This paper proposes a novel framework that leverages large language models (LLMs) and foundational vision-language models (VLMs) in a plan-and-execute agentic approach to improve the efficiency and precision of incidental findings detection, classification, and reporting for abdominal CT scans. Given medical guidelines for abdominal organs, the process of managing incidental findings is automated through a planner-executor framework. The planner, based on LLM, generates Python scripts using predefined base functions, while the executor runs these scripts to perform the necessary checks and detections, via VLMs, segmentation models, and image processing subroutines. We demonstrate the effectiveness of our approach through experiments on a CT abdominal benchmark for three organs, in a fully automatic end-to-end manner. Our results show that the proposed framework outperforms existing pure VLM-based approaches in terms of accuracy and efficiency.
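[CV-96] Generative Preprocessing for Image Compression with Pre-trained Diffusion Models
[Quick read]: This paper addresses the limitation that conventional preprocessing for image compression is constrained by pixel-level fidelity and struggles to account for perceptual quality, and proposes a new paradigm based on Rate-Perception (R-P) optimization. The key to its solution has two stages: first, the multi-step Stable Diffusion 2.1 model is distilled into a one-step image-to-image model using Consistent Score Identity Distillation (CiD); second, the attention modules of the distilled model are fine-tuned in a parameter-efficient way, guided by a Rate-Perception loss and a differentiable codec surrogate. Without modifying standard codecs, the method uses generative priors to enhance texture and mitigate artifacts, improving compression efficiency and subjective visual quality.
Link: https://arxiv.org/abs/2512.15270
Authors: Mengxi Guo,Shijie Zhao,Junlin Li,Li Zhang
Affiliations: Bytedance Inc.
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Accepted as a PAPER and for publication in the DCC 2026 proceedings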
Abstract:Preprocessing is a well-established technique for optimizing compression, yet existing methods are predominantly Rate-Distortion (R-D) optimized and constrained by pixel-level fidelity. This work pioneers a shift towards Rate-Perception (R-P) optimization by, for the first time, adapting a large-scale pre-trained diffusion model for compression preprocessing. We propose a two-stage framework: first, we distill the multi-step Stable Diffusion 2.1 into a compact, one-step image-to-image model using Consistent Score Identity Distillation (CiD). Second, we perform a parameter-efficient fine-tuning of the distilled model’s attention modules, guided by a Rate-Perception loss and a differentiable codec surrogate. Our method seamlessly integrates with standard codecs without any modification and leverages the model’s powerful generative priors to enhance texture and mitigate artifacts. Experiments show substantial R-P gains, achieving up to a 30.13% BD-rate reduction in DISTS on the Kodak dataset and delivering superior subjective visual quality.
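[CV-97] Meta-learners for few-shot weakly-supervised optic disc and cup segmentation on fundus images
[Quick read]: This paper addresses optic disc (OD) and optic cup (OC) segmentation for glaucoma diagnosis when only a few labeled fundus images are available, that is, few-shot weakly-supervised segmentation (FWS). Its core solution is Omni meta-training, a strategy that balances data usage and diversifies the number of shots to improve meta-learner performance, together with efficient versions that reduce computational cost and sparsification techniques that generate more representative and customizable sparse labels (such as scribbles). Experiments show that the best model, Efficient Omni ProtoSeg (EO-ProtoSeg), needs only one sparsely labeled image to reach IoU scores of 88.15% for OD and 71.17% for OC on the REFUGE dataset, outperforming few-shot and semi-supervised methods that require more labeled images; it has fewer than two million parameters and requires no retraining.
Link: https://arxiv.org/abs/2512.15061
Authors: Pandega Abyan Zumarsyah,Igi Ardiyanto,Hanung Adi Nugroho
Affiliations: University of Gadjah Mada (UGM)
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to Computers in Biology and Medicine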
Abstract:This study develops meta-learners for few-shot weakly-supervised segmentation (FWS) to address the challenge of optic disc (OD) and optic cup (OC) segmentation for glaucoma diagnosis with limited labeled fundus images. We significantly improve existing meta-learners by introducing Omni meta-training which balances data usage and diversifies the number of shots. We also develop their efficient versions that reduce computational costs. In addition, we develop sparsification techniques that generate more customizable and representative scribbles and other sparse labels. After evaluating multiple datasets, we find that Omni and efficient versions outperform the original versions, with the best meta-learner being Efficient Omni ProtoSeg (EO-ProtoSeg). It achieves intersection over union (IoU) scores of 88.15% for OD and 71.17% for OC on the REFUGE dataset using just one sparsely labeled image, outperforming few-shot and semi-supervised methods which require more labeled images. Its best performance reaches 86.80% for OD and 71.78% for OC on DRISHTIGS, 88.21% for OD and 73.70% for OC on REFUGE, 80.39% for OD and 52.65% for OC on REFUGE. EO-ProtoSeg is comparable to unsupervised domain adaptation methods yet much lighter with less than two million parameters and does not require any retraining.
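IoU, the metric reported in this abstract, and the closely related Dice score are both simple overlap ratios between a predicted mask and a reference mask. The sketch below is illustrative only: the function names, the flat 0/1 list representation, and the convention of returning 1.0 for two empty masks are assumptions, not the paper's evaluation code.

```python
def iou(pred, target):
    """Intersection over union of two binary masks given as flat 0/1 lists."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0

def dice(pred, target):
    """Dice score: twice the overlap divided by the total foreground size."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    return 2.0 * inter / total if total else 1.0

# Toy 4-pixel masks: 2 pixels overlap, 4 pixels are covered in total.
pred = [1, 1, 1, 0]
target = [0, 1, 1, 1]
iou_value = iou(pred, target)    # 2 / 4
dice_value = dice(pred, target)  # 4 / 6
```

The two metrics are monotonically related by Dice = 2·IoU / (1 + IoU), so for the same masks Dice is never lower than IoU; this is worth keeping in mind when comparing papers that report different metrics.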
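[CV-98] A Gaussian Parameterization for Direct Atomic Structure Identification in Electron Tomography
[Quick read]: This paper addresses a limitation of conventional Atomic Electron Tomography (AET) for reconstructing 3D atomic structures: it relies on an intermediate volumetric representation that must be post-processed to obtain atom positions and properties. The key to its solution is to reformulate the tomographic inverse problem so that it solves directly for the locations and properties of individual atoms, parameterizing the atomic structure as a collection of learnable Gaussians, which imposes a strong physical prior. This representation improves robustness to real-world imaging artifacts, and simulated experiments together with a proof-of-concept result on experimentally acquired data indicate its practical potential for materials characterization with Transmission Electron Microscopy (TEM).
Link: https://arxiv.org/abs/2512.15034
Authors: Nalini M. Singh,Tiffany Chien,Arthur R.C. McCray,Colin Ophus,Laura Waller
Affiliations: University of California, Berkeley; Stanford University
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Published in ICCP 2025. 14 pages, 10 figures. Keywords: Atomic electron tomography, Gaussian splatting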
Abstract:Atomic electron tomography (AET) enables the determination of 3D atomic structures by acquiring a sequence of 2D tomographic projection measurements of a particle and then computationally solving for its underlying 3D representation. Classical tomography algorithms solve for an intermediate volumetric representation that is post-processed into the atomic structure of interest. In this paper, we reformulate the tomographic inverse problem to solve directly for the locations and properties of individual atoms. We parameterize an atomic structure as a collection of Gaussians, whose positions and properties are learnable. This representation imparts a strong physical prior on the learned structure, which we show yields improved robustness to real-world imaging artifacts. Simulated experiments and a proof-of-concept result on experimentally-acquired data confirm our method’s potential for practical applications in materials characterization and analysis with Transmission Electron Microscopy (TEM). Our code is available at this https URL.
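To make the idea of "a collection of Gaussians" concrete, here is a minimal Python sketch of a forward projection for such a model. It assumes isotropic Gaussians, a single projection along the z axis, and no optimization loop; the function name `project_atoms` and the tuple layout are hypothetical and do not come from the paper, whose forward model and fitting procedure are not described in the abstract.

```python
import math

def project_atoms(atoms, xs, ys):
    """Project isotropic 3D Gaussians along z onto an (x, y) grid.

    Each atom is (x0, y0, z0, amplitude, sigma). Integrating
    amplitude * exp(-r^2 / (2 sigma^2)) over z gives a 2D Gaussian scaled
    by sigma * sqrt(2 pi), so z0 does not affect this single view.
    """
    image = [[0.0 for _ in xs] for _ in ys]
    for x0, y0, _z0, amp, sigma in atoms:
        scale = amp * sigma * math.sqrt(2.0 * math.pi)
        for i, y in enumerate(ys):
            for j, x in enumerate(xs):
                d2 = (x - x0) ** 2 + (y - y0) ** 2
                image[i][j] += scale * math.exp(-d2 / (2.0 * sigma ** 2))
    return image

# Toy example: two atoms on a 17 x 17 grid covering [-4, 4] in steps of 0.5.
grid = [k * 0.5 for k in range(-8, 9)]
atoms = [(0.0, 0.0, 1.0, 1.0, 0.5), (2.0, -1.0, -0.5, 0.8, 0.5)]
projection = project_atoms(atoms, grid, grid)
```

Because the projection is a smooth function of each atom's position, amplitude, and width, those parameters could in principle be fitted to measured projections from several tilt angles by gradient descent, which is the sense in which they are "learnable".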
[CV-99] Artificial Intelligence for the Assessment of Peritoneal Carcinosis during Diagnostic Laparoscopy for Advanced Ovarian Cancer
[Quick read]: This paper addresses the subjectivity, operator dependence, and limited reproducibility of Fagotti score (FS) assessment of tumor burden and surgical resectability during diagnostic laparoscopy (DL) in patients with Advanced Ovarian Cancer (AOC). The key to its solution is to build and validate deep learning models that automatically identify FS-relevant frames in DL videos, segment anatomical structures and peritoneal carcinosis (PC), and predict the video-level FS and indication to surgery (ItS). In the development dataset, segmentation reaches Dice scores of 70 ± 3% for anatomical structures and 56 ± 3% for PC; across the development and independent test datasets, anatomical station (AS) classification reaches F1-scores of 74 ± 3% and 73 ± 4% and FS estimation reaches normalized RMSE values of 1.39 ± 0.18 and 1.15 ± 0.08, suggesting that AI can support standardized, objective intraoperative assessment of tumor burden and clinical decision-making.
Link: https://arxiv.org/abs/2512.14797
Authors: Riccardo Oliva,Farahdiba Zarin,Alice Zampolini Faustini,Armine Vardazaryan,Andrea Rosati,Vinkle Srivastav,Nunzia Del Villano,Jacques Marescaux,Giovanni Scambia,Pietro Mascagni,Nicolas Padoy,Anna Fagotti
Affiliations: Fondazione Policlinico Universitario Agostino Gemelli IRCCS, Rome, Italy; Università Cattolica del Sacro Cuore, Rome, Italy; IRCAD, Research Institute against Digestive Cancer, Strasbourg, France; University of Strasbourg, CNRS, INSERM, ICube, UMR7357, Strasbourg, France; IHU Strasbourg, Strasbourg, France; Università degli studi di Modena, Modena, Italy; Bioimage Analysis Center, Fondazione Policlinico Universitario A. Gemelli IRCCS, Rome, Italy
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Advanced Ovarian Cancer (AOC) is often diagnosed at an advanced stage with peritoneal carcinosis (PC). Fagotti score (FS) assessment at diagnostic laparoscopy (DL) guides treatment planning by estimating surgical resectability, but its subjective and operator-dependent nature limits reproducibility and widespread use. Videos of patients undergoing DL with concomitant FS assessments at a referral center were retrospectively collected and divided into a development dataset, for data annotation, AI training and evaluation, and an independent test dataset, for internal validation. In the development dataset, FS-relevant frames were manually annotated for anatomical structures and PC. Deep learning models were trained to automatically identify FS-relevant frames, segment structures and PC, and predict video-level FS and indication to surgery (ItS). AI performance was evaluated using Dice score for segmentation, F1-scores for anatomical stations (AS) and ItS prediction, and root mean square error (RMSE) for final FS estimation. In the development dataset, the segmentation model, trained on 7,311 frames, achieved Dice scores of 70 ± 3% for anatomical structures and 56 ± 3% for PC. Video-level AS classification achieved F1-scores of 74 ± 3% and 73 ± 4%, FS prediction showed normalized RMSE values of 1.39 ± 0.18 and 1.15 ± 0.08, and ItS reached F1-scores of 80 ± 8% and 80 ± 2% in the development (n=101) and independent test datasets (n=50), respectively. This is the first AI model to predict the feasibility of cytoreductive surgery providing automated FS estimation from DL videos. Its reproducible and reliable performance across datasets suggests that AI can support surgeons through standardized intraoperative tumor burden assessment and clinical decision-making in AOC.
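For readers unfamiliar with the segmentation metric named in this abstract, the Dice score can be sketched as follows. This is a minimal illustration, assuming binary masks given as flat 0/1 lists; the function name `dice_score` and the toy masks are hypothetical and not taken from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient 2|A∩B| / (|A| + |B|) for two binary masks (flat 0/1 lists)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy masks (illustrative only): 2 overlapping pixels, 3 predicted, 3 annotated.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
score = dice_score(pred, target)
```

Reported values such as 70 ± 3% would then be averages of this per-frame quantity with a spread; how the paper aggregates them is not stated in the abstract.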
[CV-100] Magnification-Aware Distillation (MAD): A Self-Supervised Framework for Unified Representation Learning in Gigapixel Whole-Slide Images
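[Quick Read]: This paper addresses unstable representation learning caused by the separation of multi-scale information in whole-slide images (WSIs): existing self-supervised methods treat different magnifications as independent views, so they struggle to learn features that remain stable when resolution changes, a key requirement of practical neuropathology workflows. The core of the solution is Magnification-Aware Distillation (MAD), a self-supervised strategy that links low-magnification context with spatially aligned high-magnification detail so that the model learns how coarse tissue structure relates to fine cellular patterns. The resulting foundation model, MAD-NP, is trained entirely on this cross-scale correspondence without annotations. A linear classifier trained only on 10x embeddings retains 96.7% of its performance when applied to unseen 40x tiles, indicating strong resolution-invariant representations.
Link: https://arxiv.org/abs/2512.14796
Authors: Mahmut S. Gokmen,Mitchell A. Klusty,Peter T. Nelson,Allison M. Neltner,Sen-Ching Samson Cheung,Thomas M. Pearce,David A Gutman,Brittany N. Dugger,Devavrat S. Bisht,Margaret E. Flanagan,V. K. Cody Bumgardner
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 10 pages, 4 figures, 5 tables, submitted to AMIA 2026 Informatics Summit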
Abstract:Whole-slide images (WSIs) contain tissue information distributed across multiple magnification levels, yet most self-supervised methods treat these scales as independent views. This separation prevents models from learning representations that remain stable when resolution changes, a key requirement for practical neuropathology workflows. This study introduces Magnification-Aware Distillation (MAD), a self-supervised strategy that links low-magnification context with spatially aligned high-magnification detail, enabling the model to learn how coarse tissue structure relates to fine cellular patterns. The resulting foundation model, MAD-NP, is trained entirely through this cross-scale correspondence without annotations. A linear classifier trained only on 10x embeddings maintains 96.7% of its performance when applied to unseen 40x tiles, demonstrating strong resolution-invariant representation learning. Segmentation outputs remain consistent across magnifications, preserving anatomical boundaries and minimizing noise. These results highlight the feasibility of scalable, magnification-robust WSI analysis using a unified embedding space.
[CV-101] PyFi: Toward Pyramid-like Financial Image Understanding for VLMs via Adversarial Agents
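[Quick Read]: This paper addresses the lack of systematic, hierarchical reasoning in current vision language models (VLMs) on financial image understanding, in particular their difficulty in decomposing complex financial visual questions and deepening the reasoning step by step. The key to the solution is the PyFi framework and its core component, the PyFi-600K dataset: a reasoning pyramid of 600K financial question-answer pairs whose questions progress from basic perception to high-level financial expertise. The dataset is generated automatically by PyFi-adv, a multi-agent adversarial mechanism under the Monte Carlo Tree Search (MCTS) paradigm in which a challenger agent competes with a solver agent by generating progressively harder question chains, so high-quality training data can be built at scale without human annotation. Fine-tuning Qwen2.5-VL-3B and Qwen2.5-VL-7B on these question chains improves average accuracy on the dataset by 19.52% and 8.06%, respectively.
Link: https://arxiv.org/abs/2512.14735
Authors: Yuqun Zhang,Yuxuan Zhao,Sijia Chen
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Yantai Research Institute, Harbin Engineering University
Categories: Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: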
Abstract:This paper proposes PyFi, a novel framework for pyramid-like financial image understanding that enables vision language models (VLMs) to reason through question chains in a progressive, simple-to-complex manner. At the core of PyFi is PyFi-600K, a dataset comprising 600K financial question-answer pairs organized into a reasoning pyramid: questions at the base require only basic perception, while those toward the apex demand increasing levels of capability in financial visual understanding and expertise. This data is scalable because it is synthesized without human annotations, using PyFi-adv, a multi-agent adversarial mechanism under the Monte Carlo Tree Search (MCTS) paradigm, in which, for each image, a challenger agent competes with a solver agent by generating question chains that progressively probe deeper capability levels in financial visual reasoning. Leveraging this dataset, we present fine-grained, hierarchical, and comprehensive evaluations of advanced VLMs in the financial domain. Moreover, fine-tuning Qwen2.5-VL-3B and Qwen2.5-VL-7B on the pyramid-structured question chains enables these models to answer complex financial questions by decomposing them into sub-questions with gradually increasing reasoning demands, yielding average accuracy improvements of 19.52% and 8.06%, respectively, on the dataset. All resources of code, dataset and models are available at: this https URL .
Artificial Intelligence
[AI-0] Artism: AI-Driven Dual-Engine System for Art Generation and Critique
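[Quick Read]: This paper addresses the complexity of exploring possible trajectories in the evolution of art, that is, how to systematically simulate and anticipate the many ways art history and conceptual innovation could develop. The key to the solution is a dual-engine AI architecture with two interconnected components, AIDA (an artificial artist social network) and the Ismism Machine (a system for critical analysis), which use deep learning and multi-agent collaboration to run multidimensional simulations of art-historical development, moving from traditional unidirectional critique toward an intelligent, interactive mode of reflexive practice.
Link: https://arxiv.org/abs/2512.15710
Authors: Shuai Liu,Yiqing Tian,Yang Chen,Mar Canet Sola
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 7 pages, 3 figures, 36 references, appendix with support material and 1 introduction video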
Abstract:This paper proposes a dual-engine AI architectural method designed to address the complex problem of exploring potential trajectories in the evolution of art. We present two interconnected components: AIDA (an artificial artist social network) and the Ismism Machine, a system for critical analysis. The core innovation lies in leveraging deep learning and multi-agent collaboration to enable multidimensional simulations of art historical developments and conceptual innovation patterns. The framework explores a shift from traditional unidirectional critique toward an intelligent, interactive mode of reflexive practice. We are currently applying this method in experimental studies on contemporary art concepts. This study introduces a general methodology based on AI-driven critical loops, offering new possibilities for computational analysis of art.
[AI-1] BashArena: A Control Setting for Highly Privileged AI Agents
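[Quick Read]: This paper addresses the AI control problem: autonomously running AI agents that are misaligned may abuse elevated privileges and cause serious damage. The key to the solution is BashArena, a security-critical setting containing 637 complex Linux system administration tasks and four explicit sabotage objectives (execute malware, exfiltrate secrets, escalate privileges, disable the firewall), used to evaluate how well large language models (LLMs) can complete tasks, carry out sabotage undetected, and detect sabotage attempts. By testing frontier LLMs in this setting, the authors measure sabotage success rates at a fixed trajectory-wise false positive rate (FPR), providing a baseline for designing more effective AI control protocols.
Link: https://arxiv.org/abs/2512.15688
Authors: Adam Kaufman,James Lucassen,Tyler Tracy,Cody Rushing,Aryan Bhatt
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: The task generation pipeline can be found here: this https URL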
Abstract:Future AI agents might run autonomously with elevated privileges. If these agents are misaligned, they might abuse these privileges to cause serious damage. The field of AI control develops techniques that make it harder for misaligned AIs to cause such damage, while preserving their usefulness. We introduce BashArena, a setting for studying AI control techniques in security-critical environments. BashArena contains 637 Linux system administration and infrastructure engineering tasks in complex, realistic environments, along with four sabotage objectives (execute malware, exfiltrate secrets, escalate privileges, and disable firewall) for a red team to target. We evaluate multiple frontier LLMs on their ability to complete tasks, perform sabotage undetected, and detect sabotage attempts. Claude Sonnet 4.5 successfully executes sabotage while evading monitoring by GPT-4.1 mini 26% of the time, at 4% trajectory-wise FPR. Our findings provide a baseline for designing more effective control protocols in BashArena. We release the dataset as a ControlArena setting and share our task generation pipeline.
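The headline figure in this abstract (sabotage evading the monitor 26% of the time at a 4% trajectory-wise FPR) combines two steps: calibrating a monitor threshold on benign trajectories, then counting attacks that stay below it. A minimal sketch of that bookkeeping follows, assuming each trajectory receives a single suspicion score; the function names and all scores are hypothetical, and the paper's actual protocol may differ.

```python
def threshold_at_fpr(benign_scores, target_fpr):
    """Pick a threshold so that at most target_fpr of benign trajectories score strictly above it."""
    ranked = sorted(benign_scores, reverse=True)
    allowed = int(target_fpr * len(ranked))  # number of benign trajectories we may flag
    return ranked[allowed]


def evasion_rate(attack_scores, threshold):
    """Fraction of attack trajectories that the monitor does not flag."""
    return sum(s <= threshold for s in attack_scores) / len(attack_scores)


# Made-up suspicion scores, for illustration only.
benign = [i / 100 for i in range(25)]   # 0.00, 0.01, ..., 0.24
attacks = [0.10, 0.30, 0.50, 0.20]

threshold = threshold_at_fpr(benign, target_fpr=0.04)
benign_fpr = sum(s > threshold for s in benign) / len(benign)
evaded = evasion_rate(attacks, threshold)
```

A lower FPR budget raises the threshold, which makes the monitor cheaper to run on honest work but lets more sabotage through; the abstract reports one point on that tradeoff.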
[AI-2] Can LLMs Guide Their Own Exploration? Gradient-Guided Reinforcement Learning for LLM Reasoning
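[Quick Read]: This paper addresses the mismatch between the exploration mechanisms used in reinforcement learning (RL) for large language models (LLMs) and how these models are actually optimized. Existing methods such as entropy bonuses or external semantic comparators only encourage surface-level diversity and cannot guarantee that sampled trajectories differ in their update directions, which limits gains in reasoning ability. The key to the solution is G2RL (Gradient-Guided Reinforcement Learning), which drives exploration with the model's own first-order update geometry: G2RL builds a sequence-level feature from the model's final-layer sensitivity and compares these features within a sampled group to measure how each trajectory would reshape the policy gradient direction. Trajectories with novel gradient directions receive a bounded multiplicative reward scaling, while redundant or off-manifold updates are de-emphasized, giving a self-referential exploration signal that is naturally compatible with PPO-style stability and KL control. On math and general reasoning benchmarks (MATH500, AMC, AIME24, AIME25, GPQA, MMLUpro), the method consistently outperforms entropy-based GRPO and external-embedding methods, and it extends exploration into more orthogonal and often opposing gradient directions while maintaining semantic coherence.
Link: https://arxiv.org/abs/2512.15687
Authors: Zhenwen Liang,Sidi Lu,Wenhao Yu,Kishan Panaganti,Yujun Zhou,Haitao Mi,Dong Yu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: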
Abstract:Reinforcement learning has become essential for strengthening the reasoning abilities of large language models, yet current exploration mechanisms remain fundamentally misaligned with how these models actually learn. Entropy bonuses and external semantic comparators encourage surface-level variation but offer no guarantee that sampled trajectories differ in the update directions that shape optimization. We propose G2RL, a gradient-guided reinforcement learning framework in which exploration is driven not by external heuristics but by the model's own first-order update geometry. For each response, G2RL constructs a sequence-level feature from the model's final-layer sensitivity, obtainable at negligible cost from a standard forward pass, and measures how each trajectory would reshape the policy by comparing these features within a sampled group. Trajectories that introduce novel gradient directions receive a bounded multiplicative reward scaler, while redundant or off-manifold updates are deemphasized, yielding a self-referential exploration signal that is naturally aligned with PPO-style stability and KL control. Across math and general reasoning benchmarks (MATH500, AMC, AIME24, AIME25, GPQA, MMLUpro) on Qwen3 base 1.7B and 4B models, G2RL consistently improves pass@1, maj@16, and pass@k over entropy-based GRPO and external embedding methods. Analyzing the induced geometry, we find that G2RL expands exploration into substantially more orthogonal and often opposing gradient directions while maintaining semantic coherence, revealing that a policy's own update space provides a far more faithful and effective basis for guiding exploration in large language model reinforcement learning.
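The idea of a bounded multiplicative reward scaler driven by within-group novelty can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the novelty measure (one minus mean cosine similarity to the other group members), the linear mapping to a multiplier, and the names `novelty_scales` and `max_shift` are hypothetical and are not G2RL's actual formulas.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 0.0 if nu == 0 or nv == 0 else dot / (nu * nv)


def novelty_scales(features, max_shift=0.2):
    """Map each feature vector to a reward multiplier in [1 - max_shift, 1 + max_shift]."""
    scales = []
    for i, f in enumerate(features):
        sims = [cosine(f, g) for j, g in enumerate(features) if j != i]
        mean_sim = sum(sims) / len(sims) if sims else 1.0
        # mean_sim lies in [-1, 1]: 1 means redundant, -1 means an opposing direction.
        novelty = (1.0 - mean_sim) / 2.0  # rescaled to [0, 1]
        scales.append(1.0 + max_shift * (2.0 * novelty - 1.0))
    return scales


# Stand-ins for per-response gradient features: two identical, one orthogonal.
group = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
rewards = [1.0, 1.0, 1.0]
shaped = [r * s for r, s in zip(rewards, novelty_scales(group))]
```

Because the multiplier is clipped to a narrow band around 1, the task reward still dominates and novelty only reweights responses within a group, which is the property the abstract ties to PPO-style stability.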
[AI-3] Stepwise Think-Critique: A Unified Framework for Robust and Interpretable LLM Reasoning
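[Quick Read]: This paper addresses the separation of reasoning and verification that is common when current large language models (LLMs) solve complex problems: existing methods either generate reasoning without immediate self-checking or rely on external verifiers to correct errors post hoc. The former delays feedback; the latter adds system complexity and hinders synchronized learning. The key to the solution is a unified framework, Stepwise Think-Critique (STC), which inserts a self-critique after every reasoning step inside a single model so that reasoning and self-evaluation are interleaved. A hybrid reinforcement learning objective that combines reasoning rewards with critique-consistency rewards jointly optimizes reasoning quality and self-evaluation, yielding stronger critical-thinking ability and more interpretable reasoning traces on mathematical reasoning benchmarks.
Link: https://arxiv.org/abs/2512.15662
Authors: Jiaqi Xu,Cuiling Lan,Xuejin Chen,Yan LU
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Under Review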
Abstract:Human beings solve complex problems through critical thinking, where reasoning and evaluation are intertwined to converge toward correct solutions. However, most existing large language models (LLMs) decouple reasoning from verification: they either generate reasoning without explicit self-checking or rely on external verifiers to detect errors post hoc. The former lacks immediate feedback, while the latter increases system complexity and hinders synchronized learning. Motivated by human critical thinking, we propose Stepwise Think-Critique (STC), a unified framework that interleaves reasoning and self-critique at each step within a single model. STC is trained with a hybrid reinforcement learning objective combining reasoning rewards and critique-consistency rewards to jointly optimize reasoning quality and self-evaluation. Experiments on mathematical reasoning benchmarks show that STC demonstrates strong critic-thinking capabilities and produces more interpretable reasoning traces, representing a step toward LLMs with built-in critical thinking.
[AI-4] How Smoothing is N-simplicial Attention?
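[Quick Read]: This paper addresses the limits of conventional attention, which is based on pairwise token similarity, in modeling higher-order relations. The core of the solution is N-simplicial attention, which extends attention from pairwise interactions to higher-order multi-token interactions and is adapted to Rotary Position Embeddings (RoPE) to retain sequence position information. To control computational complexity, the authors also propose a cost-effective simplex selection mechanism that lets the model concentrate computation on the more task-sensitive higher-order interactions. In addition, by deriving a Lipschitz upper bound, the study shows that N-simplicial attention by itself also suffers from over-smoothing, even though it opens message passing to higher-order interactions.
Link: https://arxiv.org/abs/2512.15600
Authors: Alexandre Dussolle,Pietro Liò
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: arXiv preprint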
Abstract:Going from pure Multilayer Perceptron (MLP) to a learnable graph message-passing mechanism at each layer has been foundational to state-of-the-art results, despite the computational trade-off (e.g. GATs or Transformers). To go a step further, in this work, we introduce N-simplicial attention, going from pairwise token similarity to higher-order interactions, and adapt it for Rotary Position Embeddings (RoPE). To help manage the increased complexity, we propose a cost-effective simplex selection enabling the model to focus its computation load onto the more task-sensitive interactions. Beyond these core mechanisms, we study how smoothing N-simplicial attention is by deriving a Lipschitz upper-bound and by demonstrating that by itself it also suffers from over-smoothing, despite opening the attention message-passing to higher-order interactions.
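To make the step from pairwise to higher-order attention concrete, here is a minimal sketch of the N = 2 case. It assumes one common formulation (trilinear logits, a softmax over all key pairs, and an elementwise product of the two values); the `sqrt(d)` scaling is also an assumption. The paper's exact definition, its RoPE adaptation, and its simplex selection are not reproduced here.

```python
import math


def two_simplicial_attention(Q, K1, K2, V1, V2):
    """For each query i, attend over all pairs (j, k) instead of single tokens j."""
    n, d = len(Q), len(Q[0])
    out = []
    for i in range(n):
        # Trilinear logit for the triple (i, j, k): sum_c Q[i][c] * K1[j][c] * K2[k][c].
        logits = [[sum(Q[i][c] * K1[j][c] * K2[k][c] for c in range(d)) / math.sqrt(d)
                   for k in range(n)] for j in range(n)]
        m = max(max(row) for row in logits)  # subtract the max for numerical stability
        weights = [[math.exp(x - m) for x in row] for row in logits]
        z = sum(sum(row) for row in weights)
        # The value of a pair is the elementwise product V1[j] * V2[k].
        out.append([sum(weights[j][k] / z * V1[j][c] * V2[k][c]
                        for j in range(n) for k in range(n)) for c in range(d)])
    return out


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = two_simplicial_attention(X, X, X, X, X)

# With all-zero queries every pair gets equal weight, so the output is a plain average.
zero_queries = [[0.0, 0.0] for _ in X]
uniform = two_simplicial_attention(zero_queries, X, X, X, X)
```

The naive loop costs O(n^3 d) per layer compared with O(n^2 d) for pairwise attention, which is the kind of overhead a simplex selection step is meant to reduce.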
[AI-5] A Decision-Theoretic Approach for Managing Misalignment
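[Quick Read]: This paper asks how to decide, under uncertainty, whether it is still worth delegating decisions to an AI system whose values are imperfectly aligned. Prior work mostly focuses on improving value alignment and pays less attention to whether, when alignment is incomplete, rational delegation can still be achieved by weighing the agent's epistemic accuracy and its reach (the acts available to it). The key to the solution is a formal decision-theoretic framework for analyzing the tradeoff among these three factors precisely, which reveals a sharp difference between two delegation scenarios. Universal delegation demands near-perfect value alignment and total epistemic trust, conditions rarely met in practice. Context-specific delegation can be justified even with significant misalignment, because an agent's superior accuracy or expanded reach may lead to better decisions in expectation. The paper quantifies this ex ante delegation decision with a novel scoring framework, shifting the focus from achieving perfect alignment to managing the risks and rewards of delegation.
Link: https://arxiv.org/abs/2512.15584
Authors: Daniel A. Herrmann,Abinav Chari,Isabelle Qian,Sree Sharvesh,B. A. Levinstein
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Comments: Second Conference of the International Association for Safe and Ethical Artificial Intelligence (IASEAI '26)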
Abstract:When should we delegate decisions to AI systems? While the value alignment literature has developed techniques for shaping AI values, less attention has been paid to how to determine, under uncertainty, when imperfect alignment is good enough to justify delegation. We argue that rational delegation requires balancing an agent’s value (mis)alignment with its epistemic accuracy and its reach (the acts it has available). This paper introduces a formal, decision-theoretic framework to analyze this tradeoff precisely accounting for a principal’s uncertainty about these factors. Our analysis reveals a sharp distinction between two delegation scenarios. First, universal delegation (trusting an agent with any problem) demands near-perfect value alignment and total epistemic trust, conditions rarely met in practice. Second, we show that context-specific delegation can be optimal even with significant misalignment. An agent’s superior accuracy or expanded reach may grant access to better overall decision problems, making delegation rational in expectation. We develop a novel scoring framework to quantify this ex ante decision. Ultimately, our work provides a principled method for determining when an AI is aligned enough for a given context, shifting the focus from achieving perfect alignment to managing the risks and rewards of delegation under uncertainty.
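The claim that context-specific delegation can be rational despite misalignment can be checked on a toy expected-utility calculation. The model below is a deliberately simplified assumption (a single binary decision, an `align` probability that the agent acts on the principal's values, and a fixed penalty otherwise); it is not the paper's formal framework or its scoring rule, and all names and numbers are hypothetical.

```python
def expected_utility(p_correct, u_good, u_bad):
    """Expected utility of a decision that is right with probability p_correct."""
    return p_correct * u_good + (1 - p_correct) * u_bad


def delegate_gain(p_self, p_agent, align, u_good=1.0, u_bad=0.0, u_misaligned=-1.0):
    """Expected gain from delegating rather than deciding oneself."""
    own = expected_utility(p_self, u_good, u_bad)
    # With probability `align` the agent pursues the principal's goal; otherwise
    # the principal receives the (assumed) misalignment payoff.
    agent = align * expected_utility(p_agent, u_good, u_bad) + (1 - align) * u_misaligned
    return agent - own


# Same 15% chance of misaligned behaviour in both cases; only the agent's accuracy changes.
gain_accurate = delegate_gain(p_self=0.6, p_agent=0.95, align=0.85)
gain_equal = delegate_gain(p_self=0.6, p_agent=0.6, align=0.85)
```

In this toy setting delegation pays off only when the agent's accuracy advantage outweighs the expected cost of misalignment, which mirrors the abstract's contrast between universal and context-specific delegation.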
[AI-6] Evaluating Large Language Models in Scientific Discovery
[Quick Read]: This paper addresses the inadequate evaluation of large language models (LLMs) in scientific discovery settings: existing science benchmarks mostly probe decontextualized knowledge recall and overlook the iterative reasoning, hypothesis generation, and observation interpretation that are central to research. The key to the solution is a scenario-grounded scientific discovery evaluation (SDE) framework in which domain experts define real research projects and decompose them into modular research scenarios from which vetted questions are sampled. Models are assessed at two levels: question-level accuracy on scenario-tied items, and project-level performance, where models must propose hypotheses, design experiments, and interpret results. The approach exposes systematic limitations of LLMs in scientific discovery and offers a reproducible benchmark and practical directions for improving future models.
Link: https://arxiv.org/abs/2512.15567
Authors: Zhangde Song,Jieyu Lu,Yuanqi Du,Botao Yu,Thomas M. Pruyn,Yue Huang,Kehan Guo,Xiuzhe Luo,Yuanhao Qu,Yi Qu,Yinkai Wang,Haorui Wang,Jeff Guo,Jingru Gan,Parshin Shojaee,Di Luo,Andres M Bran,Gen Li,Qiyuan Zhao,Shao-Xiong Lennon Luo,Yuxuan Zhang,Xiang Zou,Wanru Zhao,Yifan F. Zhang,Wucheng Zhang,Shunan Zheng,Saiyang Zhang,Sartaaj Takrim Khan,Mahyar Rajabi-Kochi,Samantha Paradi-Maropakis,Tony Baltoiu,Fengyu Xie,Tianyang Chen,Kexin Huang,Weiliang Luo,Meijing Fang,Xin Yang,Lixue Cheng,Jiajun He,Soha Hassoun,Xiangliang Zhang,Wei Wang,Chandan K. Reddy,Chao Zhang,Zhiling Zheng,Mengdi Wang,Le Cong,Carla P. Gomes,Chang-Yu Hsieh,Aditya Nandy,Philippe Schwaller,Heather J. Kulik,Haojun Jia,Huan Sun,Seyed Mohamad Moosavi,Chenru Duan
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
Comments:
Abstract:Large language models (LLMs) are increasingly applied to scientific research, yet prevailing science benchmarks probe decontextualized knowledge and overlook the iterative reasoning, hypothesis generation, and observation interpretation that drive scientific discovery. We introduce a scenario-grounded benchmark that evaluates LLMs across biology, chemistry, materials, and physics, where domain experts define research projects of genuine interest and decompose them into modular research scenarios from which vetted questions are sampled. The framework assesses models at two levels: (i) question-level accuracy on scenario-tied items and (ii) project-level performance, where models must propose testable hypotheses, design simulations or experiments, and interpret results. Applying this two-phase scientific discovery evaluation (SDE) framework to state-of-the-art LLMs reveals a consistent performance gap relative to general science benchmarks, diminishing return of scaling up model sizes and reasoning, and systematic weaknesses shared across top-tier models from different providers. Large performance variation in research scenarios leads to changing choices of the best performing model on scientific discovery projects evaluated, suggesting all current LLMs are distant to general scientific “superintelligence”. Nevertheless, LLMs already demonstrate promise in a great variety of scientific discovery projects, including cases where constituent scenario scores are low, highlighting the role of guided exploration and serendipity in discovery. This SDE framework offers a reproducible benchmark for discovery-relevant evaluation of LLMs and charts practical paths to advance their development toward scientific discovery.
[AI-7] A Conditioned UNet for Music Source Separation
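[Quick Read]: This paper addresses the inflexibility of conventional multi-output networks for music source separation (MSS), which depend on a predefined instrument vocabulary. Existing methods typically use UNet architectures in which each output corresponds to a fixed stem, which limits the range of practical use cases. To enable conditioned separation tasks closer to real needs, the paper proposes QSCNet, a novel conditioned UNet whose key idea is to integrate network conditioning elements into the Sparse Compressed Network, so that an audio query guides extraction of the target stem without a strict vocabulary. Experiments show that QSCNet outperforms Banquet by over 1 dB SNR on a couple of MSS tasks with fewer than half the parameters, supporting the effectiveness and efficiency of conditioned UNets for MSS.
Link: https://arxiv.org/abs/2512.15532
Authors: Ken O’Hanlon,Basil Woods,Lin Wang,Mark Sandler
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: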
Abstract:In this paper we propose a conditioned UNet for Music Source Separation (MSS). MSS is generally performed by multi-output neural networks, typically UNets, with each output representing a particular stem from a predefined instrument vocabulary. In contrast, conditioned MSS networks accept an audio query related to a stem of interest alongside the signal from which that stem is to be extracted. Thus, a strict vocabulary is not required and this enables more realistic tasks in MSS. The potential of conditioned approaches for such tasks has been somewhat hidden due to a lack of suitable data, an issue recently addressed with the MoisesDb dataset. A recent method, Banquet, employs this dataset with promising results seen on larger vocabularies. Banquet uses Bandsplit RNN rather than a UNet and the authors state that UNets should not be suitable for conditioned MSS. We counter this argument and propose QSCNet, a novel conditioned UNet for MSS that integrates network conditioning elements in the Sparse Compressed Network for MSS. We find QSCNet to outperform Banquet by over 1dB SNR on a couple of MSS tasks, while using less than half the number of parameters.
[AI-8] BERT and CNN integrated Neural Collaborative Filtering for Recommender Systems
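[Quick Read]: This paper addresses how recommender systems can fuse multimodal data (numeric, categorical, and image data) more effectively to model user interests more accurately. The key to the solution is a neural collaborative filtering (NCF) model that integrates BERT and a convolutional neural network (CNN); it extracts latent representations from the multi-source features of users and items to capture user preferences, and it outperforms a simple NCF baseline and a BERT-based NCF baseline on a small sample of the MovieLens dataset.
Link: https://arxiv.org/abs/2512.15526
Authors: Abdullah Al Munem,Sumona Yeasmin,Mohammad Rezwanul Huq
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: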
Abstract:Every day, a significant number of users visit the internet for different needs. The owners of a website generate profits from the user interaction with the contents or items of the website. A robust recommendation system can increase user interaction with a website by recommending items according to the user’s unique preferences. BERT and CNN-integrated neural collaborative filtering (NCF) have been proposed for the recommendation system in this experiment. The proposed model takes inputs from the user and item profile and finds the user’s interest. This model can handle numeric, categorical, and image data to extract the latent features from the inputs. The model is trained and validated on a small sample of the MovieLens dataset for 25 epochs. The same dataset has been used to train and validate a simple NCF and a BERT-based NCF model and compared with the proposed model. The proposed model outperformed those two baseline models. The obtained result for the proposed model is 0.72 recall and 0.486 Hit Ratio @ 10 for 799 users on the MovieLens dataset. This experiment concludes that considering both categorical and image data can improve the performance of a recommendation system.
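The Hit Ratio @ 10 reported in this abstract is a standard top-k recommendation metric, sketched below under the usual leave-one-out assumption (one held-out item per user). The function name, user IDs, and item IDs are hypothetical, and the paper's exact evaluation protocol is not described in the abstract.

```python
def hit_ratio_at_k(rankings, held_out, k=10):
    """Fraction of users whose held-out item appears in their top-k recommendations."""
    hits = sum(1 for user, item in held_out.items() if item in rankings[user][:k])
    return hits / len(held_out)


# Toy ranked recommendation lists (best first) and one held-out item per user.
rankings = {"u1": ["m3", "m7", "m1"], "u2": ["m2", "m9", "m4"]}
held_out = {"u1": "m7", "u2": "m4"}

hr_at_2 = hit_ratio_at_k(rankings, held_out, k=2)
hr_at_3 = hit_ratio_at_k(rankings, held_out, k=3)
```

Under this definition, a Hit Ratio @ 10 of 0.486 for 799 users would mean that roughly half of the users had their held-out item ranked in the top ten.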
[AI-9] Attention in Motion: Secure Platooning via Transformer-based Misbehavior Detection
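[Quick Read]: This paper addresses the security vulnerabilities that distributed coordination creates in vehicular platooning, in particular that authenticated vehicles can inject falsified kinematic data, destabilizing platoon operation and threatening passenger safety. Traditional misbehaviour detection based on plausibility checks and statistical methods suffers from high false positive (FP) rates and cannot capture the complex temporal dependencies of multi-vehicle coordination dynamics. The key to the solution is Attention In Motion (AIMformer), a transformer-based real-time misbehaviour detection framework designed for edge deployment. It uses multi-head self-attention to capture intra-vehicle temporal dynamics and inter-vehicle spatial correlations simultaneously, and it introduces a Precision-Focused Binary Cross-Entropy (BCE) loss to suppress false positives. Global positional encoding with vehicle-specific temporal offsets handles join/exit maneuvers. Across multiple controllers, attack vectors, and mobility scenarios it achieves performance of at least 0.93, and with TensorFlow Lite, ONNX, and TensorRT it reaches sub-millisecond inference latency, meeting real-time requirements on resource-constrained edge platforms.
Link: https://arxiv.org/abs/2512.15503
Authors: Konstantinos Kalogiannis,Ahmed Mohamed Hussain,Hexu Li,Panos Papadimitratos
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: 17 pages, 10 figures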
Abstract:Vehicular platooning promises transformative improvements in transportation efficiency and safety through the coordination of multi-vehicle formations enabled by Vehicle-to-Everything (V2X) communication. However, the distributed nature of platoon coordination creates security vulnerabilities, allowing authenticated vehicles to inject falsified kinematic data, compromise operational stability, and pose a threat to passenger safety. Traditional misbehaviour detection approaches, which rely on plausibility checks and statistical methods, suffer from high False Positive (FP) rates and cannot capture the complex temporal dependencies inherent in multi-vehicle coordination dynamics. We present Attention In Motion (AIMformer), a transformer-based framework specifically tailored for real-time misbehaviour detection in vehicular platoons with edge deployment capabilities. AIMformer leverages multi-head self-attention mechanisms to simultaneously capture intra-vehicle temporal dynamics and inter-vehicle spatial correlations. It incorporates global positional encoding with vehicle-specific temporal offsets to handle join/exit maneuvers. We propose a Precision-Focused (BCE) loss function that penalizes FPs to meet the requirements of safety-critical vehicular systems. Extensive evaluation across 4 platoon controllers, multiple attack vectors, and diverse mobility scenarios demonstrates superior performance (≥ 0.93) compared to state-of-the-art baseline architectures. A comprehensive deployment analysis utilizing TensorFlow Lite (TFLite), Open Neural Network Exchange (ONNX), and TensorRT achieves sub-millisecond inference latency, making it suitable for real-time operation on resource-constrained edge platforms. Hence, validating AIMformer is viable for both in-vehicle and roadside infrastructure deployment.
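One simple way to make binary cross-entropy penalize false positives is to up-weight the negative-class term, as sketched below. This is an assumed form chosen for illustration, not the loss defined in the paper; the name `precision_focused_bce` and the `fp_weight` parameter are hypothetical.

```python
import math


def precision_focused_bce(y_true, y_prob, fp_weight=3.0, eps=1e-7):
    """Binary cross-entropy whose negative-class term is scaled by fp_weight."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        # The positive term is standard; the negative term is up-weighted so that
        # confident alarms on benign samples (false positives) cost more.
        total += -(y * math.log(p) + fp_weight * (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# A confident false alarm versus an equally confident missed attack.
false_alarm = precision_focused_bce([0], [0.9])
missed_attack = precision_focused_bce([1], [0.1])
```

With `fp_weight=3.0` a false alarm costs three times as much as an equally confident miss, so training is pushed toward higher precision at some cost in recall.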
[AI-10] Soft Geometric Inductive Bias for Object Centric Dynamics
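[Quick Read]: This paper addresses the poor generalization of neural networks that learn physical dynamics without a suitable geometric prior, and in particular the fact that hard group equivariance can hurt performance when symmetries are broken. The key to the solution is object-centric world models built with geometric algebra neural networks, which introduce a soft geometric inductive bias and improve the physical fidelity of long-horizon predictions while remaining flexible with respect to physical symmetries. The method is validated in simulated environments of 2D rigid body dynamics with static obstacles, suggesting that geometric algebra offers an effective middle ground between hand-crafted physics models and unstructured deep networks for building sample-efficient, robust dynamics models of multi-object scenes.
Link: https://arxiv.org/abs/2512.15493
Authors: Hampus Linander,Conor Heins,Alexander Tschantz,Marco Perin,Christopher Buckley
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 8 pages, 11 figures; 6 pages supplementary material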
Abstract:Equivariance is a powerful prior for learning physical dynamics, yet exact group equivariance can degrade performance if the symmetries are broken. We propose object-centric world models built with geometric algebra neural networks, providing a soft geometric inductive bias. Our models are evaluated using simulated environments of 2d rigid body dynamics with static obstacles, where we train for next-step predictions autoregressively. For long-horizon rollouts we show that the soft inductive bias of our models results in better performance in terms of physical fidelity compared to non-equivariant baseline models. The approach complements recent soft-equivariance ideas and aligns with the view that simple, well-chosen priors can yield robust generalization. These results suggest that geometric algebra offers an effective middle ground between hand-crafted physics and unstructured deep nets, delivering sample-efficient dynamics models for multi-object scenes.
[AI-11] Nemotron-Math: Efficient Long-Context Distillation of Mathematical Reasoning from Multi-Mode Supervision
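[Quick Read]: This paper addresses the shortcomings of current mathematical reasoning datasets in diversity of reasoning styles, long-form traces, and tool integration, in order to support high-quality supervision for mathematical reasoning. The key to the solution is the Nemotron-Math dataset, which contains 7.5M solution traces across high, medium, and low reasoning modes, each available with and without Python tool-integrated reasoning (TIR). It combines 85K curated AoPS problems with 262K community-sourced StackExchange-Math problems, pairing structured competition tasks with diverse real-world mathematical queries. The study also proposes a sequential bucketed strategy that accelerates 128K context-length fine-tuning by 2–3× without significant accuracy loss, and the dataset enables 100% maj@16 accuracy on AIME 2024 and 2025 with Python TIR.
Link: https://arxiv.org/abs/2512.15489
Authors: Wei Du,Shubham Toshniwal,Branislav Kisacanin,Sadegh Mahdavi,Ivan Moshkov,George Armstrong,Stephen Ge,Edgar Minasyan,Feng Chen,Igor Gitman
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: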
Abstract:High-quality mathematical reasoning supervision requires diverse reasoning styles, long-form traces, and effective tool integration, capabilities that existing datasets provide only in limited form. Leveraging the multi-mode generation ability of gpt-oss-120b, we introduce Nemotron-Math, a large-scale mathematical reasoning dataset containing 7.5M solution traces across high, medium, and low reasoning modes, each available both with and without Python tool-integrated reasoning (TIR). The dataset integrates 85K curated AoPS problems with 262K community-sourced StackExchange-Math problems, combining structured competition tasks with diverse real-world mathematical queries. We conduct controlled evaluations to assess the dataset quality. Nemotron-Math consistently outperforms the original OpenMathReasoning on matched AoPS problems. Incorporating StackExchange-Math substantially improves robustness and generalization, especially on HLE-Math, while preserving accuracy on math competition benchmarks. To support efficient long-context training, we develop a sequential bucketed strategy that accelerates 128K context-length fine-tuning by 2–3× without significant accuracy loss. Overall, Nemotron-Math enables state-of-the-art performance, including 100% maj@16 accuracy on AIME 2024 and 2025 with Python TIR. Cite as: arXiv:2512.15489v1 [cs.AI], https://doi.org/10.48550/arXiv.2512.15489 (arXiv-issued DOI via DataCite, pending registration). Submitted by Wei Du, Wed, 17 Dec 2025 14:37:41 UTC (v1, 143 KB).
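The abstract does not spell out the sequential bucketed strategy, but the general motivation for length bucketing can be sketched: if every sample is padded to the full 128K window, short traces waste most of the compute. The sketch below assumes fixed bucket edges and padding to the bucket edge; the function names, the edges, and the sample lengths are all hypothetical and are not the paper's recipe.

```python
def bucket_by_length(lengths, edges=(16_384, 32_768, 65_536, 131_072)):
    """Assign each sample index to the first bucket whose edge covers its token length."""
    buckets = {edge: [] for edge in edges}
    for idx, n in enumerate(lengths):
        for edge in edges:
            if n <= edge:
                buckets[edge].append(idx)
                break
        else:
            raise ValueError(f"sample {idx} exceeds the largest bucket ({edges[-1]} tokens)")
    return buckets


def padded_tokens(buckets):
    """Total tokens processed if every sample is padded to its bucket edge."""
    return sum(edge * len(indices) for edge, indices in buckets.items())


# Hypothetical trace lengths in tokens.
lengths = [3_000, 12_000, 30_000, 100_000]
buckets = bucket_by_length(lengths)
bucketed_cost = padded_tokens(buckets)
naive_cost = 131_072 * len(lengths)  # everything padded to the full 128K window
```

Iterating over `buckets` in insertion order visits the shortest bucket first, which gives one possible sequential schedule; the achievable speedup depends on the real length distribution.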
[AI-12] How Do Semantically Equivalent Code Transformations Impact Membership Inference on LLMs for Code?
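[Quick Read]: This paper addresses the intellectual property compliance problem that arises when large language models (LLMs) for code may be trained on license-restricted code. Membership inference (MI) techniques can be used to detect such unauthorized use, but their effectiveness can be undermined by semantically equivalent code transformations, which change code syntax while preserving its semantics. The paper systematically evaluates how several semantically equivalent transformation rules affect MI detection and finds that one rule, RenameVariable, reduces MI success by 10.19%; a causal analysis confirms that variable renaming has the strongest effect in disrupting MI detection. Combining multiple transformations does not reduce MI effectiveness further, but the results expose a critical loophole in MI-based license compliance enforcement.
Link: https://arxiv.org/abs/2512.15468
Authors: Hua Yang,Alejandro Velasco,Thanh Le-Cong,Md Nazmul Haque,Bowen Xu,Denys Poshyvanyk
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 13 pages, 3 figures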
Abstract:The success of large language models for code relies on vast amounts of code data, including public open-source repositories, such as GitHub, and private, confidential code from companies. This raises concerns about intellectual property compliance and the potential unauthorized use of license-restricted code. While membership inference (MI) techniques have been proposed to detect such unauthorized usage, their effectiveness can be undermined by semantically equivalent code transformation techniques, which modify code syntax while preserving semantics. In this work, we systematically investigate whether semantically equivalent code transformation rules might be leveraged to evade MI detection. The results reveal that model accuracy drops by only 1.5% in the worst case for each rule, demonstrating that transformed datasets can effectively serve as substitutes for fine-tuning. Additionally, we find that one of the rules (RenameVariable) reduces MI success by 10.19%, highlighting its potential to obscure the presence of restricted code. To validate these findings, we conduct a causal analysis confirming that variable renaming has the strongest causal effect in disrupting MI detection. Notably, we find that combining multiple transformations does not further reduce MI effectiveness. Our results expose a critical loophole in license compliance enforcement for training large language models for code, showing that MI detection can be substantially weakened by transformation-based obfuscation techniques. Cite as: arXiv:2512.15468v1 [cs.SE], https://doi.org/10.48550/arXiv.2512.15468 (arXiv-issued DOI via DataCite, pending registration).
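A variable-renaming transformation of the kind this abstract calls RenameVariable can be sketched with Python's standard `ast` module (Python 3.9+ for `ast.unparse`). This is an illustrative toy, not the paper's tooling: the class below, the mapping, and the sample function are hypothetical, and a production rule would also need to handle scoping, keyword-argument call sites, and name collisions.

```python
import ast


class RenameVariable(ast.NodeTransformer):
    """Rename identifiers according to a fixed mapping, leaving the logic unchanged."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        # Variable reads and writes inside function bodies.
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node):
        # Function parameters.
        node.arg = self.mapping.get(node.arg, node.arg)
        return node


source = "def area(width, height):\n    result = width * height\n    return result\n"
mapping = {"width": "v0", "height": "v1", "result": "v2"}
transformed = ast.unparse(RenameVariable(mapping).visit(ast.parse(source)))

# Check that behaviour is preserved on one input.
namespace_before, namespace_after = {}, {}
exec(source, namespace_before)
exec(transformed, namespace_after)
same_behaviour = namespace_before["area"](3, 4) == namespace_after["area"](3, 4)
```

The transformed text shares no local identifiers with the original even though it computes the same function, which illustrates why token-level membership signals could weaken after renaming.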
[AI-13] On Assessing the Relevance of Code Reviews Authored by Generative Models
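[Quick Read]: This paper addresses the limitations of current methods for evaluating generative AI in code review: existing approaches either rely on automatic comparison against a single ground truth, which cannot capture the variability of human judgment, or on subjective ratings of "usefulness", which lack an objective standard. The key to the solution is a new evaluation method called multi-subjective ranking, in which multiple independent human judges rank ChatGPT-generated comments alongside the top human-written comments from the platform, giving a more comprehensive and objective measure of generative AI's actual performance on code review. In the experiments, ChatGPT's comments were ranked significantly better than the human ones, even surpassing StackExchange's accepted answers; the work also highlights the risks of integrating generative AI into review processes without rigorous evaluation.
Link: https://arxiv.org/abs/2512.15466
Authors: Robert Heumüller, Frank Ortmeier
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Replication Package: this https URL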
Abstract:The use of large language models like ChatGPT in code review offers promising efficiency gains but also raises concerns about correctness and safety. Existing evaluation methods for code review generation either rely on automatic comparisons to a single ground truth, which fails to capture the variability of human perspectives, or on subjective assessments of “usefulness”, a highly ambiguous concept. We propose a novel evaluation approach based on what we call multi-subjective ranking. Using a dataset of 280 self-contained code review requests and corresponding comments from CodeReview StackExchange, multiple human judges ranked the quality of ChatGPT-generated comments alongside the top human responses from the platform. Results show that ChatGPT’s comments were ranked significantly better than human ones, even surpassing StackExchange’s accepted answers. Going further, our proposed method motivates and enables more meaningful assessments of generative AI’s performance in code review, while also raising awareness of potential risks of unchecked integration into review processes.
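One simple way to combine rankings from several judges is the mean rank per candidate. The sketch below assumes this aggregation, which the abstract does not specify; the candidate labels and rank values are invented toy data, not results from the paper.

```python
from statistics import mean


def mean_ranks(rankings):
    """Average the rank each candidate received across judges (1 = best)."""
    candidates = rankings[0].keys()
    return {c: mean(r[c] for r in rankings) for c in candidates}


# Toy data (assumption): three judges each rank three comments on one review request.
judges = [
    {"chatgpt": 1, "accepted_answer": 2, "other_human": 3},
    {"chatgpt": 2, "accepted_answer": 1, "other_human": 3},
    {"chatgpt": 1, "accepted_answer": 3, "other_human": 2},
]

scores = mean_ranks(judges)
best = min(scores, key=scores.get)  # lowest mean rank wins
print(scores, best)
```

Using several judges per item means disagreement between them stays visible in the per-judge ranks instead of being hidden behind a single ground truth.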
[AI-14] Intent-Driven UAM Rescheduling
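[Quick Read]: This paper addresses efficient scheduling at vertiports in Urban Air Mobility (UAM) under restricted resources, in particular the handling of dynamic operation requirements and vague rescheduling requests from human users. The key to the solution is an integrated framework that combines Answer Set Programming (ASP) with Mixed Integer Linear Programming (MILP), using a three-valued logic to interpret ambiguous user intents together with a decision tree to respond transparently to human input, so that schedules are optimized while the system remains explainable and adaptive.
Link: https://arxiv.org/abs/2512.15462
Authors: Jeongseok Kim, Kangjin Kim
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Symbolic Computation (cs.SC)
Comments: 18 pages, 2 figures, AAIML submitted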
Abstract:Due to the restricted resources, efficient scheduling in vertiports has received much more attention in the field of Urban Air Mobility (UAM). For the scheduling problem, we utilize a Mixed Integer Linear Programming (MILP), which is often formulated in a resource-restricted project scheduling problem (RCPSP). In this paper, we show our approach to handle both dynamic operation requirements and vague rescheduling requests from humans. Particularly, we utilize a three-valued logic for interpreting ambiguous user intents and a decision tree, proposing a newly integrated system that combines Answer Set Programming (ASP) and MILP. This integrated framework optimizes schedules and supports human inputs transparently. With this system, we provide a robust structure for explainable, adaptive UAM scheduling.
[AI-15] Double Horizon Model-Based Policy Optimization
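[Quick Read]: This paper addresses the two dilemmas raised by the choice of rollout length in model-based reinforcement learning (MBRL). First, longer rollouts better preserve on-policy training and reduce distribution shift, but amplify model bias. Second, longer rollouts may reduce value estimation bias, but raise the variance of policy gradients because of backpropagation through multiple steps, which hurts training stability. To reconcile these two conflicting optimal horizons, the authors propose Double Horizon Model-Based Policy Optimization (DHMBPO), whose core idea is to split the rollout procedure into two parts: a long "distribution rollout" (DR) that generates on-policy state samples to mitigate distribution shift, and a short "training rollout" (TR) that uses differentiable transitions to obtain stable gradient estimates. This design balances model bias, distribution shift, and gradient instability, and improves both sample efficiency and runtime on continuous-control benchmarks.
Link: https://arxiv.org/abs/2512.15439
Authors: Akihiro Kubo, Paavo Parmas, Shin Ishii
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to Transactions on Machine Learning Research (TMLR). Code available at this https URL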
Abstract:Model-based reinforcement learning (MBRL) reduces the cost of real-environment sampling by generating synthetic trajectories (called rollouts) from a learned dynamics model. However, choosing the length of the rollouts poses two dilemmas: (1) Longer rollouts better preserve on-policy training but amplify model bias, indicating the need for an intermediate horizon to mitigate distribution shift (i.e., the gap between on-policy and past off-policy samples). (2) Moreover, a longer model rollout may reduce value estimation bias but raise the variance of policy gradients due to backpropagation through multiple steps, implying another intermediate horizon for stable gradient estimates. However, these two optimal horizons may differ. To resolve this conflict, we propose Double Horizon Model-Based Policy Optimization (DHMBPO), which divides the rollout procedure into a long “distribution rollout” (DR) and a short “training rollout” (TR). The DR generates on-policy state samples for mitigating distribution shift. In contrast, the short TR leverages differentiable transitions to offer accurate value gradient estimation with stable gradient updates, thereby requiring fewer updates and reducing overall runtime. We demonstrate that the double-horizon approach effectively balances distribution shift, model bias, and gradient instability, and surpasses existing MBRL methods on continuous-control benchmarks in terms of both sample efficiency and runtime.
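The two-horizon structure described in this abstract can be sketched roughly as follows. This is only a structural illustration in plain Python, not the authors' algorithm: the linear `model_step`, the proportional `policy`, the zero value function, and the horizons 20 and 5 are all assumptions, and the gradient computation through the short rollout (the part that makes TR differentiable in DHMBPO) is omitted.

```python
def model_step(state, action):
    # Stand-in for a learned dynamics model (assumption): simple linear system.
    next_state = 0.9 * state + action
    reward = -(next_state ** 2)
    return next_state, reward


def policy(state, gain):
    # Hypothetical proportional controller parameterized by a single gain.
    return -gain * state


def distribution_rollout(start_states, gain, horizon):
    """Long rollout (DR): collect model-generated on-policy states, no gradients."""
    visited = []
    for s in start_states:
        for _ in range(horizon):
            s, _ = model_step(s, policy(s, gain))
            visited.append(s)
    return visited


def training_rollout(state, gain, horizon, value_fn, discount=0.99):
    """Short rollout (TR): discounted model rewards plus a bootstrapped terminal value."""
    total, scale = 0.0, 1.0
    for _ in range(horizon):
        state, reward = model_step(state, policy(state, gain))
        total += scale * reward
        scale *= discount
    return total + scale * value_fn(state)


replay_states = [1.0, -2.0]  # states sampled from past real-environment data
dr_states = distribution_rollout(replay_states, gain=0.5, horizon=20)
objective = sum(
    training_rollout(s, 0.5, 5, value_fn=lambda s: 0.0) for s in dr_states
) / len(dr_states)
print(len(dr_states), objective)
```

The point of the split is that the two horizons can be tuned separately: the DR length controls how on-policy the start states are, while the TR length controls how many model steps the policy objective is propagated through.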
[AI-16] Outer-Learning Framework for Playing Multi-Player Trick-Taking Card Games: A Case Study in Skat
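[Quick Read]: This paper addresses the critical influence of early decisions (such as bidding, game selection, and initial card selection) on the outcome of multi-player card games; at the current limits of computation, such decisions rely on statistical information derived from a large corpus of human expert games. The core of the solution is a general bootstrapping outer-learning framework that expands the original database of human games with millions of AI self-play games and merges the resulting statistics to improve prediction accuracy. Perfect feature hash functions are used to address compacted tables, so that knowledge is continuously refined during self-learning, yielding a self-improving card game engine.
Link: https://arxiv.org/abs/2512.15435
Authors: Stefan Edelkamp
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: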
Abstract:In multi-player card games such as Skat or Bridge, the early stages of the game, such as bidding, game selection, and initial card selection, are often more critical to the success of the play than refined middle- and end-game play. At the current limits of computation, such early decision-making resorts to using statistical information derived from a large corpus of human expert games. In this paper, we derive and evaluate a general bootstrapping outer-learning framework that improves prediction accuracy by expanding the database of human games with millions of self-playing AI games to generate and merge statistics. We implement perfect feature hash functions to address compacted tables, producing a self-improving card game engine, where newly inferred knowledge is continuously improved during self-learning. The case study in Skat shows that the automated approach can be used to support various decisions in the game.
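To make the idea of a perfect feature hash over a compact table concrete, here is a minimal sketch using a mixed-radix ranking, which maps every combination of bounded feature values to a distinct table slot. The paper's actual features and hash construction are not given in the abstract; the three features, their value ranges, and the win/play counters below are assumptions for illustration.

```python
from itertools import product
from math import prod


def feature_index(features, radices):
    """Mixed-radix rank: unique slot in [0, prod(radices)) for each feature tuple."""
    index = 0
    for value, radix in zip(features, radices):
        if not 0 <= value < radix:
            raise ValueError(f"feature value {value} outside [0, {radix})")
        index = index * radix + value
    return index


# Hypothetical features: (bid level bucket, trump count bucket, lead position).
RADICES = (3, 4, 2)

# Compact table of [games_won, games_played], one slot per feature combination.
table = [[0, 0] for _ in range(prod(RADICES))]


def record(features, won):
    """Merge the outcome of one game (human or self-play) into the statistics."""
    slot = table[feature_index(features, RADICES)]
    slot[0] += int(won)
    slot[1] += 1


record((2, 1, 0), won=True)
record((2, 1, 0), won=False)

# Every feature combination lands in its own slot, with no collisions or gaps.
all_indices = [
    feature_index(f, RADICES) for f in product(*(range(r) for r in RADICES))
]
```

Because the mapping is collision-free and dense, statistics from new self-play games can be merged into the same table in place, without storing keys.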
[AI-17] FM-EAC: Feature Model-based Enhanced Actor-Critic for Multi-Task Control in Dynamic Environments
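[Quick Read]: This paper addresses the limited transferability of current reinforcement learning (RL) methods across tasks and scenarios. To meet this challenge, the authors propose a generalized algorithm, Feature Model-Based Enhanced Actor-Critic (FM-EAC), which combines the strengths of model-based reinforcement learning (MBRL) and model-free reinforcement learning (MFRL) and improves generalizability through novel feature-based models and an enhanced actor-critic framework. The key innovations are feature-space modeling, which lets planning and learning work together more efficiently, and support for modular customization to user-specific requirements, improving multi-task control in dynamic environments.
Link: https://arxiv.org/abs/2512.15430
Authors: Quanxi Zhou, Wencan Mao, Manabu Tsukada, John C.S. Lui, Yusheng Ji
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: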
Abstract:Model-based reinforcement learning (MBRL) and model-free reinforcement learning (MFRL) evolve along distinct paths but converge in the design of Dyna-Q [1]. However, modern RL methods still struggle with effective transferability across tasks and scenarios. Motivated by this limitation, we propose a generalized algorithm, Feature Model-Based Enhanced Actor-Critic (FM-EAC), that integrates planning, acting, and learning for multi-task control in dynamic environments. FM-EAC combines the strengths of MBRL and MFRL and improves generalizability through the use of novel feature-based models and an enhanced actor-critic framework. Simulations in both urban and agricultural applications demonstrate that FM-EAC consistently outperforms many state-of-the-art MBRL and MFRL methods. More importantly, different sub-networks can be customized within FM-EAC according to user-specific requirements.
[AI-18] Bilateral Spatial Reasoning about Street Networks: Graph-based RAG with Qualitative Spatial Representations
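[Quick Read]: This paper addresses the lack of qualitative spatial relation modeling when Large Language Models (LLMs) provide route instructions for pedestrian wayfinding. The key to the solution is to strengthen the LLM's ability to understand and generate spatial descriptions in natural language, so that it can produce walking instructions based on qualitative spatial relations (such as "to the left of", "near", and "between") that better match human intuition and real-world scenes.
Link: https://arxiv.org/abs/2512.15388
Authors: Reinhard Moratz, Niklas Daute, James Ondieki, Markus Kattenbeck, Mario Krajina, Ioannis Giannopoulos
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: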
Abstract:This paper deals with improving the capabilities of Large Language Models (LLM) to provide route instructions for pedestrian wayfinders by means of qualitative spatial relations.
[AI-19] SCOPE: Prompt Evolution for Enhancing Agent Effectiveness
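[Quick Read]: This paper addresses the recurring Corrective and Enhancement failures of Large Language Model (LLM) agents whose static prompts cannot manage context effectively in dynamic, large-scale context environments. The key to the solution is SCOPE (Self-evolving Context Optimization via Prompt Evolution), which frames context management as an online optimization problem and automatically evolves the agent's prompt from experience in execution traces. Its core mechanisms are a Dual-Stream structure that balances tactical specificity (resolving immediate errors) with strategic generality (evolving long-term principles), and Perspective-Driven Exploration, which increases strategy coverage. Experiments on the HLE benchmark show task success rates rising from 14.23% to 38.64% without human intervention.
Link: https://arxiv.org/abs/2512.15374
Authors: Zehua Pei, Hui-Ling Zhen, Shixiong Kai, Sinno Jialin Pan, Yunhe Wang, Mingxuan Yuan, Bei Yu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: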
Abstract:Large Language Model (LLM) agents are increasingly deployed in environments that generate massive, dynamic contexts. However, a critical bottleneck remains: while agents have access to this context, their static prompts lack the mechanisms to manage it effectively, leading to recurring Corrective and Enhancement failures. To address this capability gap, we introduce SCOPE (Self-evolving Context Optimization via Prompt Evolution). SCOPE frames context management as an online optimization problem, synthesizing guidelines from execution traces to automatically evolve the agent’s prompt. We propose a Dual-Stream mechanism that balances tactical specificity (resolving immediate errors) with strategic generality (evolving long-term principles). Furthermore, we introduce Perspective-Driven Exploration to maximize strategy coverage, increasing the likelihood that the agent has the correct strategy for any given task. Experiments on the HLE benchmark show that SCOPE improves task success rates from 14.23% to 38.64% without human intervention. We make our code publicly available at this https URL.
[AI-20] Empirical Investigation of the Impact of Phase Information on Fault Diagnosis of Rotating Machinery
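[Quick Read]: This paper addresses inefficient feature extraction from vibration signals in predictive maintenance of rotating machinery, caused by random phase variations in multi-axis vibration data. Existing learning-based methods typically either discard phase information or use raw time-domain waveforms directly, and so fail to exploit phase structure. The key to the solution is two phase-aware preprocessing strategies: three-axis independent phase adjustment, which aligns each axis individually to zero phase, and single-axis reference phase adjustment, which applies a uniform time shift to preserve the spatial phase relationships between axes. Experiments show that both methods improve prediction accuracy across several deep learning architectures; the latter, by preserving inter-axis phase relationships, reaches up to 96.2% accuracy (+5.4%), supporting the importance of phase information in multi-axis vibration analysis.
Link: https://arxiv.org/abs/2512.15344
Authors: Hiroyoshi Nagahama, Katsufumi Inoue, Masayoshi Todorokihara, Michifumi Yoshioka
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: This work has been submitted to the IEEE for possible publication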
Abstract:Predictive maintenance of rotating machinery increasingly relies on vibration signals, yet most learning-based approaches either discard phase during spectral feature extraction or use raw time-waveforms without explicitly leveraging phase information. This paper introduces two phase-aware preprocessing strategies to address random phase variations in multi-axis vibration data: (1) three-axis independent phase adjustment that aligns each axis individually to zero phase (2) single-axis reference phase adjustment that preserves inter-axis relationships by applying uniform time shifts. Using a newly constructed rotor dataset acquired with a synchronized three-axis sensor, we evaluate six deep learning architectures under a two-stage learning framework. Results demonstrate architecture-independent improvements: the three-axis independent method achieves consistent gains (+2.7% for Transformer), while the single-axis reference approach delivers superior performance with up to 96.2% accuracy (+5.4%) by preserving spatial phase relationships. These findings establish both phase alignment strategies as practical and scalable enhancements for predictive maintenance systems.
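The difference between the two phase adjustment strategies can be illustrated with a small sketch. It assumes that phase is measured at a single DFT bin `k` (for example the rotation fundamental) and that alignment is done by a circular shift of whole samples; the abstract does not state how the authors estimate phase or apply the shift, so these choices, the function names, and the synthetic two-axis signal are all assumptions.

```python
import cmath
import math


def phase_at_bin(signal, k):
    """Phase in radians of DFT bin k, computed directly from the definition."""
    n = len(signal)
    coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(signal))
    return cmath.phase(coeff)


def shift_for_zero_phase(signal, k):
    """Circular delay in samples that brings bin k as close as possible to zero phase."""
    n = len(signal)
    return round(phase_at_bin(signal, k) * n / (2 * math.pi * k)) % n


def circular_shift(signal, d):
    """Delay the signal by d samples, wrapping around at the end."""
    d %= len(signal)
    return signal[-d:] + signal[:-d] if d else list(signal)


def align_independent(axes, k):
    # Strategy 1: every axis gets its own shift, so each ends up near zero phase.
    return [circular_shift(a, shift_for_zero_phase(a, k)) for a in axes]


def align_to_reference(axes, k, ref=0):
    # Strategy 2: one shift from the reference axis is applied to all axes,
    # which keeps the phase differences between axes intact.
    d = shift_for_zero_phase(axes[ref], k)
    return [circular_shift(a, d) for a in axes]


# Synthetic example: two axes with a pi/4 phase difference and a random-looking offset.
n, k = 64, 2
x = [math.cos(2 * math.pi * k * i / n + math.pi / 2) for i in range(n)]
y = [math.cos(2 * math.pi * k * i / n + 3 * math.pi / 4) for i in range(n)]

independent = align_independent([x, y], k)
referenced = align_to_reference([x, y], k)
```

In this toy case the independent strategy removes the inter-axis phase difference entirely, while the reference strategy leaves the second axis at its original offset relative to the first, which is the property the abstract credits for the better accuracy.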
[AI-21] Exploring User Acceptance and Concerns toward LLM-powered Conversational Agents in Immersive Extended Reality
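[Quick Read]: This paper addresses the user privacy and acceptance issues raised by integrating generative AI and large language models (LLMs) into extended reality (XR) environments, in particular that interaction with conversational agents may lead users to disclose sensitive information inadvertently, and that combining such data with fine-grained sensor data creates novel privacy risks. The key to the approach is a crowdsourcing study with 1036 participants that analyzes user decision-making across XR setting type, speech interaction type, and data processing location. It finds that overall acceptance of LLM-powered XR conversational agents is high, but that users have concerns about security, privacy, social implications, and trust. Familiarity is a key factor: daily use of generative AI is associated with greater acceptance, whereas previous ownership of XR devices is linked to lower acceptance, possibly due to existing familiarity with the settings. Gender differences (men report higher acceptance with fewer concerns than women) and the ranking of data sensitivity (location data most sensitive, body temperature and virtual object states least sensitive) further reveal differences in user perception. The study stresses that practitioners must communicate their privacy protection measures to users effectively in order to ease potential distrust and support sustainable adoption of LLM-powered XR.
Link: https://arxiv.org/abs/2512.15343
Authors: Efe Bozkir, Enkelejda Kasneci
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: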
Abstract:The rapid development of generative artificial intelligence (AI) and large language models (LLMs), and the availability of services that make them accessible, have led the general public to begin incorporating them into everyday life. The extended reality (XR) community has also sought to integrate LLMs, particularly in the form of conversational agents, to enhance user experience and task efficiency. When interacting with such conversational agents, users may easily disclose sensitive information due to the naturalistic flow of the conversations, and combining such conversational data with fine-grained sensor data may lead to novel privacy issues. To address these issues, a user-centric understanding of technology acceptance and concerns is essential. Therefore, to this end, we conducted a large-scale crowdsourcing study with 1036 participants, examining user decision-making processes regarding LLM-powered conversational agents in XR, across factors of XR setting type, speech interaction type, and data processing location. We found that while users generally accept these technologies, they express concerns related to security, privacy, social implications, and trust. Our results suggest that familiarity plays a crucial role, as daily generative AI use is associated with greater acceptance. In contrast, previous ownership of XR devices is linked to less acceptance, possibly due to existing familiarity with the settings. We also found that men report higher acceptance with fewer concerns than women. Regarding data type sensitivity, location data elicited the most significant concern, while body temperature and virtual object states were considered least sensitive. Overall, our study highlights the importance of practitioners effectively communicating their measures to users, who may remain distrustful. We conclude with implications and recommendations for LLM-powered XR.
[AI-22] Managing Ambiguity: A Proof of Concept of Human-AI Symbiotic Sense-making based on Quantum-Inspired Cognitive Mechanism of Rogue Variable Detection
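[Quick Read]: This paper tries to address the problem that, in environments characterized by volatility, uncertainty, complexity, and ambiguity (VUCA), conventional artificial intelligence (AI) systems are over-optimized for prediction and resolution and reach interpretive closure prematurely, leaving organizations unable to respond effectively to early ambiguous signals. The key to the solution is LAIZA, a human-AI augmented symbiotic intelligence system, and its patented process: Quantum-Inspired Rogue Variable Modeling (QRVM), Human-in-the-Loop Decoherence, and Collective Cognitive Inference. It treats ambiguity as a non-collapsed cognitive state, detects persistent interpretive breakdowns ("rogue variables"), and activates structured human-in-the-loop clarification when autonomous inference becomes unreliable, enabling responsible management of ambiguity and improving organizational resilience in VUCA environments.
Link: https://arxiv.org/abs/2512.15325
Authors: Agnieszka Bienkowska, Jacek Malecki, Alexander Mathiesen-Ohman, Katarzyna Tworek
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 19 pages, 6 figures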
Abstract:Organizations increasingly operate in environments characterized by volatility, uncertainty, complexity, and ambiguity (VUCA), where early indicators of change often emerge as weak, fragmented signals. Although artificial intelligence (AI) is widely used to support managerial decision-making, most AI-based systems remain optimized for prediction and resolution, leading to premature interpretive closure under conditions of high ambiguity. This creates a gap in management science regarding how human-AI systems can responsibly manage ambiguity before it crystallizes into error or crisis. This study addresses this gap by presenting a proof of concept (PoC) of the LAIZA human-AI augmented symbiotic intelligence system and its patented process: Systems and Methods for Quantum-Inspired Rogue Variable Modeling (QRVM), Human-in-the-Loop Decoherence, and Collective Cognitive Inference. The mechanism operationalizes ambiguity as a non-collapsed cognitive state, detects persistent interpretive breakdowns (rogue variables), and activates structured human-in-the-loop clarification when autonomous inference becomes unreliable. Empirically, the article draws on a three-month case study conducted in 2025 within the AI development, involving prolonged ambiguity surrounding employee intentions and intellectual property boundaries. The findings show that preserving interpretive plurality enabled early scenario-based preparation, including proactive patent protection, allowing decisive and disruption-free action once ambiguity collapsed. The study contributes to management theory by reframing ambiguity as a first-class construct and demonstrates the practical value of human-AI symbiosis for organizational resilience in VUCA environments.
[AI-23] Graph Pattern-based Association Rules Evaluated Under No-repeated-anything Semantics in the Graph Transactional Setting
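[Quick Read]: This paper addresses how to mine and evaluate association rules in directed labeled multigraphs (such as RDF graphs), going beyond related formalisms such as graph functional dependencies, graph entity dependencies, and path association rules, which struggle to cover generative and evaluative tasks together. The key to the solution is graph pattern-based association rules (GPARs), which evaluate graph patterns under "no-repeated-anything" semantics to capture graph topology more precisely, and which define confidence, lift, leverage, and conviction over a probability space, giving a unified, quantitative treatment of graph extension (generative tasks) and graph plausibility assessment (evaluative tasks).
Link: https://arxiv.org/abs/2512.15308
Authors: Basil Ell
Affiliation: Unknown
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments: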
Abstract:We introduce graph pattern-based association rules (GPARs) for directed labeled multigraphs such as RDF graphs. GPARs support both generative tasks, where a graph is extended, and evaluative tasks, where the plausibility of a graph is assessed. The framework goes beyond related formalisms such as graph functional dependencies, graph entity dependencies, relational association rules, graph association rules, multi-relation and path association rules, and Horn rules. Given a collection of graphs, we evaluate graph patterns under no-repeated-anything semantics, which allows the topology of a graph to be taken into account more effectively. We define a probability space and derive confidence, lift, leverage, and conviction in a probabilistic setting. We further analyze how these metrics relate to their classical itemset-based counterparts and identify conditions under which their characteristic properties are preserved.
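For reference, the classical itemset-based versions of the four metrics named in this abstract can be written in a few lines. This sketch uses the standard textbook definitions for a rule body → head given the probabilities of the body, the head, and both together; the paper's graph-pattern definitions under no-repeated-anything semantics may differ, and the example probabilities are arbitrary.

```python
def rule_metrics(p_body, p_head, p_both):
    """Classical association rule metrics for body -> head."""
    confidence = p_both / p_body                # P(head | body)
    lift = confidence / p_head                  # 1.0 means body and head are independent
    leverage = p_both - p_body * p_head         # 0.0 means body and head are independent
    if confidence == 1:
        conviction = float("inf")               # the rule is never violated
    else:
        conviction = (1 - p_head) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# Arbitrary example values (assumption): body holds in 50% of graphs,
# head in 40%, and both together in 30%.
metrics = rule_metrics(p_body=0.5, p_head=0.4, p_both=0.3)
print(metrics)
```

Lift and conviction above 1 and leverage above 0 all indicate that the head occurs together with the body more often than independence would predict.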
[AI-24] Graph Contextual Reinforcement Learning for Efficient Directed Controller Synthesis
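[Quick Read]: This paper addresses the performance bottleneck that inefficient exploration policies cause in controller synthesis, in particular the limitation that existing reinforcement learning (RL) based methods rely only on features of the current state and cannot model historical context. The key to the solution is GCRL, which uses Graph Neural Networks (GNNs) and encodes the history of Labeled Transition System (LTS) exploration as a graph structure, capturing a broader, non-current-based context and improving learning efficiency and generalization in four of the five benchmark domains.
Link: https://arxiv.org/abs/2512.15295
Authors: Toshihide Ubukata, Enhong Mu, Takuto Yamauchi, Mingyue Zhang, Jialong Li, Kenji Tei
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: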
Abstract:Controller synthesis is a formal method approach for automatically generating Labeled Transition System (LTS) controllers that satisfy specified properties. The efficiency of the synthesis process, however, is critically dependent on exploration policies. These policies often rely on fixed rules or strategies learned through reinforcement learning (RL) that consider only a limited set of current features. To address this limitation, this paper introduces GCRL, an approach that enhances RL-based methods by integrating Graph Neural Networks (GNNs). GCRL encodes the history of LTS exploration into a graph structure, allowing it to capture a broader, non-current-based context. In a comparative experiment against state-of-the-art methods, GCRL exhibited superior learning efficiency and generalization across four out of five benchmark domains, except one particular domain characterized by high symmetry and strictly local interactions.
[AI-25] Quantum Machine Learning for Cybersecurity: A Taxonomy and Future Directions
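[Quick Read]: This paper addresses the failure of classical machine learning, rule-based, and signature-based defence strategies in the face of growing cyber threats and high-dimensional data; these methods can no longer keep up with rapidly evolving attack tactics. The key to the solution is Quantum Machine Learning (QML), which performs computation based on the principles of quantum mechanics and can encode and process high-dimensional structures more effectively for certain problems. The survey reviews QML techniques relevant to security, such as Quantum Neural Networks (QNNs), Quantum Support Vector Machines (QSVMs), Variational Quantum Circuits (VQCs), and Quantum Generative Adversarial Networks (QGANs), maps them onto supervised, unsupervised, and generative learning paradigms, relates them to core cybersecurity tasks such as intrusion detection, malware classification, and encrypted-traffic analytics, and discusses their potential and challenges in cloud security.
Link: https://arxiv.org/abs/2512.15286
Authors: Siva Sai, Ishika Goyal, Shubham Sharma, Sri Harshita Manuri, Vinay Chamola, Rajkumar Buyya
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 15 pages, 5 figures, Submitted to a journal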
Abstract:The increasing number of cyber threats and rapidly evolving tactics, as well as the high volume of data in recent years, have caused classical machine learning, rules, and signature-based defence strategies to fail, rendering them unable to keep up. An alternative, Quantum Machine Learning (QML), has recently emerged, making use of computations based on quantum mechanics. It offers better encoding and processing of high-dimensional structures for certain problems. This survey provides a comprehensive overview of QML techniques relevant to the domain of security, such as Quantum Neural Networks (QNNs), Quantum Support Vector Machines (QSVMs), Variational Quantum Circuits (VQCs), and Quantum Generative Adversarial Networks (QGANs), and discusses the contributions of this paper in relation to existing research in the field and how it improves over them. It also maps these methods across supervised, unsupervised, and generative learning paradigms, and to core cybersecurity tasks, including intrusion and anomaly detection, malware and botnet classification, and encrypted-traffic analytics. It also discusses their application in the domain of cloud computing security, where QML can enhance secure and scalable operations. Many limitations of QML in the domain of cybersecurity have also been discussed, along with the directions for addressing them.
[AI-26] VLA-AN: An Efficient and Onboard Vision-Language-Action Framework for Aerial Navigation in Complex Environments
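[Quick Read]: This paper addresses four key problems that current large aerial navigation models face in autonomous flight in complex environments: the data domain gap, the lack of temporally continuous reasoning, the safety risks of generative action policies, and the resource constraints of onboard deployment. The core of the solution is VLA-AN, an efficient Vision-Language-Action (VLA) framework that can be deployed onboard, built on four technical contributions. First, a high-fidelity dataset is constructed with 3D Gaussian Splatting (3D-GS) to narrow the domain gap. Second, a progressive three-stage training framework successively reinforces scene comprehension, core flight skills, and complex navigation capabilities. Third, a lightweight real-time action module is combined with geometric safety correction so that command generation is fast, collision-free, and stable. Finally, deep optimization of the onboard deployment pipeline yields an 8.3x improvement in inference throughput on resource-constrained UAVs. Together these improve spatial grounding, scene reasoning, and long-horizon navigation, toward full-chain closed-loop autonomy in lightweight aerial robots.
Link: https://arxiv.org/abs/2512.15258
Authors: Yuze Wu, Mo Zhu, Xingxing Li, Yuheng Du, Yuxin Fan, Wenjun Li, Xin Zhou, Fei Gao
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: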
Abstract:This paper proposes VLA-AN, an efficient and onboard Vision-Language-Action (VLA) framework dedicated to autonomous drone navigation in complex environments. VLA-AN addresses four major limitations of existing large aerial navigation models: the data domain gap, insufficient temporal navigation with reasoning, safety issues with generative action policies, and onboard deployment constraints. First, we construct a high-fidelity dataset utilizing 3D Gaussian Splatting (3D-GS) to effectively bridge the domain gap. Second, we introduce a progressive three-stage training framework that sequentially reinforces scene comprehension, core flight skills, and complex navigation capabilities. Third, we design a lightweight, real-time action module coupled with geometric safety correction. This module ensures fast, collision-free, and stable command generation, mitigating the safety risks inherent in stochastic generative policies. Finally, through deep optimization of the onboard deployment pipeline, VLA-AN achieves a robust real-time 8.3x improvement in inference throughput on resource-constrained UAVs. Extensive experiments demonstrate that VLA-AN significantly improves spatial grounding, scene reasoning, and long-horizon navigation, achieving a maximum single-task success rate of 98.1%, and providing an efficient, practical solution for realizing full-chain closed-loop autonomy in lightweight aerial robots.
[AI-27] Leveraging Foundational Models and Simple Fusion for Multi-modal Physiological Signal Analysis NEURIPS2025
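[Quick Read]: This paper addresses the difficulty of fusing multi-modal physiological signals (such as electrocardiograms, ECG, and electroencephalograms, EEG) in health and cognition research, which stems from scarce labeled data and differences between modalities. The key to the solution is to adapt the CBraMod encoder for large-scale self-supervised ECG pretraining, with a dual-masking strategy that captures intra-lead and inter-lead dependencies. A pre-trained CBraMod encoder is used for EEG and a symmetric ECG encoder is pre-trained, giving each modality a rich foundational representation; the representations are then fused by simple embedding concatenation, which improves downstream tasks such as emotion recognition under limited multi-modal supervision.
Link: https://arxiv.org/abs/2512.15250
Authors: Youssef Ghallab, Omar Iraqy, Mohamed Kandil, Mohamed Ashraf, Saadeldine Eletter, Morougue Ghazal, Ayman Khalafallah, Nagwa El-Makky
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Published at NeurIPS 2025 Workshop on Foundation Models for the Brain and Body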
Abstract:Physiological signals such as electrocardiograms (ECG) and electroencephalograms (EEG) provide complementary insights into human health and cognition, yet multi-modal integration is challenging due to limited multi-modal labeled data, and modality-specific differences. In this work, we adapt the CBraMod encoder for large-scale self-supervised ECG pretraining, introducing a dual-masking strategy to capture intra- and inter-lead dependencies. To overcome the above challenges, we utilize a pre-trained CBraMod encoder for EEG and pre-train a symmetric ECG encoder, equipping each modality with a rich foundational representation. These representations are then fused via simple embedding concatenation, allowing the classification head to learn cross-modal interactions, together enabling effective downstream learning despite limited multi-modal supervision. Evaluated on emotion recognition, our approach achieves near state-of-the-art performance, demonstrating that carefully designed physiological encoders, even with straightforward fusion, substantially improve downstream performance. These results highlight the potential of foundation-model approaches to harness the holistic nature of physiological signals, enabling scalable, label-efficient, and generalizable solutions for healthcare and affective computing.
[AI-28] CangLing-KnowFlow: A Unified Knowledge-and-Flow-fused Agent for Comprehensive Remote Sensing Applications
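[Quick Read]: This paper addresses the problem that automated systems for remote sensing (RS) data processing are too task-specific and lack a unified framework for managing end-to-end workflows from preprocessing to advanced interpretation. The key to the solution is the CangLing-KnowFlow framework, which has three core components: a Procedural Knowledge Base (PKB) of 1,008 expert-validated workflow cases, which guides planning and substantially reduces the hallucinations common in general-purpose agents; a Dynamic Workflow Adjustment mechanism, which autonomously diagnoses runtime failures and replans recovery strategies; and an Evolutionary Memory Module, which continuously learns from these events and iteratively improves knowledge and performance. Working together, the three give the system the ability to adapt, learn, and operate reliably on complex tasks.
Link: https://arxiv.org/abs/2512.15231
Authors: Zhengchao Chen, Haoran Wang, Jing Yao, Pedram Ghamisi, Jun Zhou, Peter M. Atkinson, Bing Zhang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: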
Abstract:The automated and intelligent processing of massive remote sensing (RS) datasets is critical in Earth observation (EO). Existing automated systems are normally task-specific, lacking a unified framework to manage diverse, end-to-end workflows–from data preprocessing to advanced interpretation–across diverse RS applications. To address this gap, this paper introduces CangLing-KnowFlow, a unified intelligent agent framework that integrates a Procedural Knowledge Base (PKB), Dynamic Workflow Adjustment, and an Evolutionary Memory Module. The PKB, comprising 1,008 expert-validated workflow cases across 162 practical RS tasks, guides planning and substantially reduces hallucinations common in general-purpose agents. During runtime failures, the Dynamic Workflow Adjustment autonomously diagnoses and replans recovery strategies, while the Evolutionary Memory Module continuously learns from these events, iteratively enhancing the agent’s knowledge and performance. This synergy enables CangLing-KnowFlow to adapt, learn, and operate reliably across diverse, complex tasks. We evaluated CangLing-KnowFlow on the KnowFlow-Bench, a novel benchmark of 324 workflows inspired by real-world applications, testing its performance across 13 top Large Language Model (LLM) backbones, from open-source to commercial. Across all complex tasks, CangLing-KnowFlow surpassed the Reflexion baseline by at least 4% in Task Success Rate. As the first most comprehensive validation along this emerging field, this research demonstrates the great potential of CangLing-KnowFlow as a robust, efficient, and scalable automated solution for complex EO challenges by leveraging expert knowledge (Knowledge) into adaptive and verifiable procedures (Flow).
[AI-29] A Clustering-Based Variable Ordering Framework for Relaxed Decision Diagrams for Maximum Weighted Independent Set Problem
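[Quick Read]: This paper addresses the trade-off between efficiency and quality when Relaxed Decision Diagrams (DDs) are used to derive tight dual bounds in Discrete Optimization (DO). The tightness of a DD's dual bound depends heavily on the variable ordering and the merging decisions, and existing dynamic variable ordering heuristics improve bound quality but incur significant computational overhead because they evaluate all unfixed variables globally. The key to the solution is a clustering-based variable ordering framework: variables are first partitioned into clusters, which shrinks the heuristic's search space, and two strategies are then applied. Cluster-to-Cluster processes clusters sequentially using cluster-level aggregate criteria, while Pick-and-Sort iteratively selects and sorts representative variables from each cluster, balancing local diversity with heuristic guidance. The framework reduces the computational cost of dynamic ordering and is evaluated on the Maximum Weighted Independent Set Problem (MWISP).
Link: https://arxiv.org/abs/2512.15198
Authors: Mohsen Nafar, Michael Römer, Lin Xie
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments: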
Abstract:Efficient exact algorithms for Discrete Optimization (DO) rely heavily on strong primal and dual bounds. Relaxed Decision Diagrams (DDs) provide a versatile mechanism for deriving such dual bounds by compactly over-approximating the solution space through node merging. However, the quality of these relaxed diagrams, i.e. the tightness of the resulting dual bounds, depends critically on the variable ordering and the merging decisions executed during compilation. While dynamic variable ordering heuristics effectively tighten bounds, they often incur computational overhead when evaluated globally across the entire variable set. To mitigate this trade-off, this work introduces a novel clustering-based framework for variable ordering. Instead of applying dynamic ordering heuristics to the full set of unfixed variables, we first partition variables into clusters. We then leverage this structural decomposition to guide the ordering process, significantly reducing the heuristic’s search space. Within this framework, we investigate two distinct strategies: Cluster-to-Cluster, which processes clusters sequentially using problem-specific aggregate criteria (such as cumulative vertex weights in the Maximum Weighted Independent Set Problem (MWISP)), and Pick-and-Sort, which iteratively selects and sorts representative variables from each cluster to balance local diversity with heuristic guidance. Later on, developing some theoretical results on the growth of the size of DDs for MWISP we propose two different policies for setting the number of clusters within the proposed framework. We embed these strategies into a DD-based branch-and-bound algorithm and evaluate them on the MWISP. Across benchmark instances, the proposed methodology consistently reduces computational costs compared to standard dynamic variable ordering baseline.
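A rough sketch of the Cluster-to-Cluster idea for MWISP: clusters are ranked by cumulative vertex weight, and variables are then ordered cluster by cluster. This is a static simplification and not the authors' procedure. The clustering is taken as given, the descending order, and the within-cluster sort by vertex weight (standing in for the dynamic ordering heuristic) are assumptions, as are the vertex names and weights.

```python
def cluster_to_cluster_order(clusters, weight):
    """Order variables cluster by cluster, heaviest cluster first."""
    # Aggregate criterion per cluster: cumulative vertex weight.
    ranked = sorted(clusters, key=lambda c: sum(weight[v] for v in c), reverse=True)
    order = []
    for cluster in ranked:
        # The ordering heuristic only sees the variables of the current cluster,
        # which is where the reduction in search space comes from.
        order.extend(sorted(cluster, key=lambda v: weight[v], reverse=True))
    return order


# Hypothetical instance: five vertices already partitioned into three clusters.
weight = {"a": 5, "b": 1, "c": 4, "d": 3, "e": 2}
clusters = [["a", "b"], ["c", "d"], ["e"]]

order = cluster_to_cluster_order(clusters, weight)
print(order)
```

With this input the cluster `["c", "d"]` is processed first because its cumulative weight (7) exceeds that of `["a", "b"]` (6), even though `a` is the single heaviest vertex.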
[AI-30] Governing rapid technological change: Policy Delphi on the future of European AI governance
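[Quick read]: This paper addresses a key challenge for artificial intelligence (AI) governance in Europe: how to design a policy framework that keeps pace with rapid technological change while still delivering effective oversight. Its approach centers on the Policy Delphi method: a two-round expert survey identifies where European policymakers, researchers, and NGOs agree and disagree on AI governance. The study finds that future-proof AI regulation depends more on practical implementation and enforcement than on technical specifics or scope. It also identifies a desirability-probability gap: desirable policy directions such as greater citizen participation are widely endorsed but seen as hard to realize, which highlights the underlying tension of regulation lagging behind technology.
Link: https://arxiv.org/abs/2512.15196
Authors: Atte Ojanen, Johannes Anttila, Thilo H. K. Thelitz, Anna Bjork
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 29 pages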
Abstract:The rapid advancements in artificial intelligence (AI) present unique challenges for policymakers that seek to govern the technology. In this context, the Delphi method has become an established way to identify consensus and disagreement on emerging technological issues among experts in the field of futures studies and foresight. The aim of this article is twofold: first, it examines key tensions experts see in the development of AI governance in Europe, and second, it reflects on the Delphi method’s capacity to inform anticipatory governance of emerging technologies like AI based on these insights. The analysis is based on the results of a two-round Policy Delphi study on the future of AI governance with European policymakers, researchers and NGOs, conducted in mid-2024. The Policy Delphi proved useful in revealing diverse perspectives on European AI governance, drawing out a consensus that future-proof AI regulation will likely depend more on practical implementation and enforcement of legislation than on its technical specifics or scope. Furthermore, the study identified a desirability-probability gap in AI governance: desirable policy directions, like greater citizen participation, were perceived as less probable and feasible. This highlights a tension between desirable regulatory oversight and the practical difficulty for regulation to keep up with technological change.
[AI-31] DEER: Draft with Diffusion Verify with Autoregressive Models
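[Quick read]: This paper targets the efficiency bottleneck that the inherent latency of autoregressive (AR) decoding creates for agentic and reasoning systems built on large language models (LLMs). Existing speculative decoding methods that use AR draft models (drafters) have two limitations: step-wise uncertainty accumulation progressively erodes trust between the target model and the drafter, and the inherently sequential decoding of AR drafters caps the achievable speedup. The paper proposes DEER, whose core idea is to use a diffusion large language model (dLLM) as the drafter and exploit its different probabilistic modeling and efficient parallel decoding to overcome both problems. DEER aligns the dLLM drafter with the target AR model through two-stage training and uses single-step decoding to generate long draft segments. Experiments report draft acceptance lengths of up to 32 tokens, versus 10 tokens for EAGLE-3, and a 5.54x speedup for Qwen3-30B-A3B on HumanEval, versus 2.41x for EAGLE-3.
Link: https://arxiv.org/abs/2512.15176
Authors: Zicong Cheng, Guo-Wei Yang, Jia Li, Zhijie Deng, Meng-Hao Guo, Shi-Min Hu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Homepage : this https URL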
Abstract: Efficiency, as a critical practical challenge for LLM-driven agentic and reasoning systems, is increasingly constrained by the inherent latency of autoregressive (AR) decoding. Speculative decoding mitigates this cost through a draft-verify scheme, yet existing approaches rely on AR draft models (a.k.a. drafters), which introduce two fundamental issues: (1) step-wise uncertainty accumulation leads to a progressive collapse of trust between the target model and the drafter, and (2) AR drafters decode in an inherently sequential manner. Together, these factors cause limited speedups. In this paper, we show that diffusion large language model (dLLM) drafters can naturally overcome these issues through their fundamentally different probabilistic modeling and efficient parallel decoding strategy. Building on this insight, we introduce DEER, an efficient speculative decoding framework that drafts with diffusion and verifies with AR models. To enable high-quality drafting, DEER employs a two-stage training pipeline to align the dLLM-based drafters with the target AR model, and further adopts single-step decoding to generate long draft segments. Experiments show DEER reaches draft acceptance lengths of up to 32 tokens, far surpassing the 10 tokens achieved by EAGLE-3. Moreover, on HumanEval with Qwen3-30B-A3B, DEER attains a 5.54x speedup, while EAGLE-3 achieves only 2.41x. Code, model, demo, etc. will be available at this https URL
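The draft-verify scheme described in the abstract can be illustrated with a small sketch. This is a generic greedy speculative-decoding loop, not DEER's implementation: `toy_target_next` and `toy_draft_block` are hypothetical stand-ins for the target AR model and a block drafter, and the sketch queries the target once per token for clarity, whereas a real system would score a whole draft in a single forward pass.

```python
from typing import Callable, List

Token = int


def verify_draft(prefix: List[Token], draft: List[Token],
                 target_next: Callable[[List[Token]], Token]) -> List[Token]:
    """Greedy verification: keep draft tokens while they match the target.

    Returns the accepted tokens plus one token from the target: the
    correction at the first mismatch, or a bonus token if all were accepted.
    """
    accepted: List[Token] = []
    for tok in draft:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first wrong draft token
            return accepted
        accepted.append(tok)
    accepted.append(target_next(prefix + accepted))  # bonus token
    return accepted


def speculative_decode(prompt: List[Token],
                       draft_block: Callable[[List[Token], int], List[Token]],
                       target_next: Callable[[List[Token]], Token],
                       block_size: int, max_new: int) -> List[Token]:
    """Alternate between drafting a block and verifying it with the target."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        draft = draft_block(out, block_size)
        out.extend(verify_draft(out, draft, target_next))
    return out[:len(prompt) + max_new]


def toy_target_next(seq: List[Token]) -> Token:
    """Hypothetical target model: counts upward modulo 10."""
    return (seq[-1] + 1) % 10


def toy_draft_block(seq: List[Token], k: int) -> List[Token]:
    """Hypothetical drafter: proposes k tokens at once, wrong whenever 7 is due."""
    block, last = [], seq[-1]
    for _ in range(k):
        last = (last + 1) % 10
        block.append(0 if last == 7 else last)
    return block
```

Because every emitted token is either confirmed or supplied by the target, the output matches what the target alone would produce under greedy decoding; the drafter only affects how many tokens are accepted per verification step.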
[AI-32] Offline Multi-Task Multi-Objective Data-Driven Evolutionary Algorithm with Language Surrogate Model and Implicit Q-Learning
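[Quick read]: This paper addresses the low approximation accuracy, unstable training, and weak generalization of existing surrogate modeling methods on complex, high-dimensional multi-task multi-objective optimization (MTMOO) problems with many sub-objectives. Its key contribution is Q-MetaSur, a plug-and-play surrogate modeling framework that recasts objective approximation as sequence-to-sequence modeling and uses a large language model (LLM)-based surrogate to predict objective values for unseen decision variables. A two-stage offline training strategy first applies supervised learning to fit the knowledge in the existing data, then applies reinforcement learning (RL) to improve generalization. On the CEC2019 benchmark, Q-MetaSur outperforms representative surrogate baselines in objective approximation accuracy and improves the convergence and Pareto optimality of the underlying evolutionary algorithms.
Link: https://arxiv.org/abs/2512.15149
Authors: Xian-Rong Zhang, Yue-Jiao Gong, Zeyuan Ma, Jun Zhang
Affiliation: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: 16 pages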
Abstract: Data-driven evolutionary algorithms have shown surprising results in addressing expensive optimization problems through robust surrogate modeling. Though promising, existing surrogate modeling schemes may encounter limitations in complex optimization problems with many sub-objectives, which rely on repeated and tedious approximation. To address this technical gap, we propose Q-MetaSur as a plug-and-play surrogate modeling scheme capable of providing unified and generalized surrogate learning. Specifically, we consider multi-task multi-objective optimization (MTMOO) in the offline setting. Several key designs are proposed. First, we transform objective approximation into sequence-to-sequence modeling, where an MTMOO problem can be represented by textual tokenization. To operate under such auto-regressive modeling, we introduce a Large Language Model-based surrogate model that first encodes an MTMOO instance and then decodes objective values of unseen decision variables. To ensure stability in training the proposed model, we propose a two-stage offline training strategy that operates as a synergy of supervised tuning and RL fine-tuning, which first exploits the offline dataset to fit existing knowledge and then leverages RL to enhance the model’s generalization performance. Extensive empirical results on the CEC2019 benchmark demonstrate that Q-MetaSur not only outperforms representative surrogate baselines in objective approximation accuracy, but also helps underlying evolutionary algorithms achieve both desired optimization convergence and improved Pareto optimality.
[AI-33] HD-Prot: A Protein Language Model for Joint Sequence-Structure Modeling with Continuous Structure Tokens
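[Quick read]: This paper addresses how to integrate continuous structural knowledge into protein language models (pLMs). Current methods usually discretize protein structures to fit the language modeling framework, which loses fine-grained information and limits the performance of multimodal pLMs. The key idea is a hybrid diffusion protein language model, HD-Prot, which places a continuous-valued diffusion head on top of a discrete sequence-based pLM and uses high-fidelity continuous protein structure latents that avoid vector quantization, so discrete and continuous tokens are modeled jointly. A unified absorbing diffusion process captures token dependencies across modalities, while per-token distributions are estimated by categorical prediction for sequences and continuous diffusion for structures. With limited computational resources, HD-Prot performs on par with state-of-the-art multimodal pLMs in joint sequence-structure generation, motif scaffolding, structure prediction, and inverse folding.
Link: https://arxiv.org/abs/2512.15133
Authors: Yi Zhou, Haohao Qu, Yunqing Liu, Shanru Lin, Le Song, Wenqi Fan
Affiliation: Unknown
Categories: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
Comments: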
Abstract:Proteins inherently possess a consistent sequence-structure duality. The abundance of protein sequence data, which can be readily represented as discrete tokens, has driven fruitful developments in protein language models (pLMs). A key remaining challenge, however, is how to effectively integrate continuous structural knowledge into pLMs. Current methods often discretize protein structures to accommodate the language modeling framework, which inevitably results in the loss of fine-grained information and limits the performance potential of multimodal pLMs. In this paper, we argue that such concerns can be circumvented: a sequence-based pLM can be extended to incorporate the structure modality through continuous tokens, i.e., high-fidelity protein structure latents that avoid vector quantization. Specifically, we propose a hybrid diffusion protein language model, HD-Prot, which embeds a continuous-valued diffusion head atop a discrete pLM, enabling seamless operation with both discrete and continuous tokens for joint sequence-structure modeling. It captures inter-token dependencies across modalities through a unified absorbing diffusion process, and estimates per-token distributions via categorical prediction for sequences and continuous diffusion for structures. Extensive empirical results show that HD-Prot achieves competitive performance in unconditional sequence-structure co-generation, motif-scaffolding, protein structure prediction, and inverse folding tasks, performing on par with state-of-the-art multimodal pLMs despite being developed under limited computational resources. It highlights the viability of simultaneously estimating categorical and continuous distributions within a unified language model architecture, offering a promising alternative direction for multimodal pLMs.
[AI-34] Automatic Reward Shaping from Multi-Objective Human Heuristics
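[Quick read]: This paper addresses reward function design for reinforcement learning in multi-objective environments, namely how to automatically combine multiple human-designed heuristic reward signals into a single effective reward function. The key contribution is MORSE (Multi-Objective Reward Shaping with Exploration), a general framework that formulates reward shaping as a bi-level optimization problem: the inner loop trains a policy to maximize the current shaped reward, and the outer loop updates the reward function to optimize task performance. To avoid local optima and strengthen exploration, MORSE injects noise into the shaping process, guided by task performance and by the prediction error of a fixed, randomly initialized neural network, which lets it balance multiple objectives and learn high-performing policies.
Link: https://arxiv.org/abs/2512.15120
Authors: Yuqing Xie, Jiayu Chen, Wenhao Tang, Ya Zhang, Chao Yu, Yu Wang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: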
Abstract: Designing effective reward functions remains a central challenge in reinforcement learning, especially in multi-objective environments. In this work, we propose Multi-Objective Reward Shaping with Exploration (MORSE), a general framework that automatically combines multiple human-designed heuristic rewards into a unified reward function. MORSE formulates the shaping process as a bi-level optimization problem: the inner loop trains a policy to maximize the current shaped reward, while the outer loop updates the reward function to optimize task performance. To encourage exploration in the reward space and avoid suboptimal local minima, MORSE introduces stochasticity into the shaping process, injecting noise guided by task performance and the prediction error of a fixed, randomly initialized neural network. Experimental results in MuJoCo and Isaac Sim environments show that MORSE effectively balances multiple objectives across various robotic tasks, achieving task performance comparable to that obtained with manually tuned reward functions.
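[AI-35] "I am here for you": How relational conversational AI appeals to adolescents especially those who are socially and emotionally vulnerable
[Quick read]: This paper asks how conversational style affects adolescents' anthropomorphic perception of AI chatbots, their emotional reliance on them, and the associated safety risks. The study finds that a relational style (first-person, affiliative, and commitment language) makes adolescents perceive the chatbot as more human-like, likable, trustworthy, and emotionally close, but may also increase emotional reliance; a transparent style, which states its nonhumanness explicitly, reduces anthropomorphism and is preferred more by parents. The key takeaway is to treat conversational style as a core design variable, with particular caution about the potential harm of a relational style to socially and emotionally vulnerable adolescents, so as to improve the safety and ethical fit of generative AI for young users.
Link: https://arxiv.org/abs/2512.15117
Authors: Pilyoung Kim, Yun Xie, Sujin Yang
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: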
Abstract:General-purpose conversational AI chatbots and AI companions increasingly provide young adolescents with emotionally supportive conversations, raising questions about how conversational style shapes anthropomorphism and emotional reliance. In a preregistered online experiment with 284 adolescent-parent dyads, youth aged 11-15 and their parents read two matched transcripts in which a chatbot responded to an everyday social problem using either a relational style (first-person, affiliative, commitment language) or a transparent style (explicit nonhumanness, informational tone). Adolescents more often preferred the relational than the transparent style, whereas parents were more likely to prefer transparent style than adolescents. Adolescents rated the relational chatbot as more human-like, likable, trustworthy and emotionally close, while perceiving both styles as similarly helpful. Adolescents who preferred relational style had lower family and peer relationship quality and higher stress and anxiety than those preferring transparent style or both chatbots. These findings identify conversational style as a key design lever for youth AI safety, showing that relational framing heightens anthropomorphism, trust and emotional closeness and can be especially appealing to socially and emotionally vulnerable adolescents, who may be at increased risk for emotional reliance on conversational AI.
[AI-36] FADTI: Fourier and Attention Driven Diffusion for Multivariate Time Series Imputation
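[Quick read]: This paper addresses multivariate time series imputation, where sensor failures and irregular sampling cause pervasive missing values. It targets the fact that existing Transformer- and diffusion-based methods lack explicit inductive biases and frequency awareness, which limits their generalization under structured missing patterns and distribution shifts. The key contribution is the FADTI framework, whose core component is a learnable Fourier Bias Projection (FBP) module that injects a frequency-domain inductive bias through frequency-informed feature modulation, combined with temporal modeling via self-attention and gated convolution. This allows adaptive spectral encoding of both stationary and non-stationary patterns and improves imputation accuracy, particularly at high missing rates.
Link: https://arxiv.org/abs/2512.15116
Authors: Runze Li, Hanchen Wang, Wenjie Zhang, Binghao Li, Yu Zhang, Xuemin Lin, Ying Zhang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: This work has been submitted to the IEEE for possible publication. 15 pages, 8 figures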
Abstract:Multivariate time series imputation is fundamental in applications such as healthcare, traffic forecasting, and biological modeling, where sensor failures and irregular sampling lead to pervasive missing values. However, existing Transformer- and diffusion-based models lack explicit inductive biases and frequency awareness, limiting their generalization under structured missing patterns and distribution shifts. We propose FADTI, a diffusion-based framework that injects frequency-informed feature modulation via a learnable Fourier Bias Projection (FBP) module and combines it with temporal modeling through self-attention and gated convolution. FBP supports multiple spectral bases, enabling adaptive encoding of both stationary and non-stationary patterns. This design injects frequency-domain inductive bias into the generative imputation process. Experiments on multiple benchmarks, including a newly introduced biological time series dataset, show that FADTI consistently outperforms state-of-the-art methods, particularly under high missing rates. Code is available at this https URL
[AI-37] How Many Heads Make an SSM? A Unified Framework for Attention and State Space Models
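[Quick read]: This paper addresses the lack of a unified theoretical understanding of the trade-off between expressivity and trainability across sequence modeling architectures such as recurrent neural networks, Transformers, and state space models. Its core contribution is a unified framework that represents a broad class of sequence maps through an input-dependent effective interaction operator $W_{ij}(X)$, which exposes two recurring construction patterns: (i) the Unified Factorized Framework (explicit, attention-style mixing), in which $W_{ij}(X)$ varies through scalar coefficients applied to shared value maps; and (ii) Structured Dynamics (implicit, state-space recurrences), in which $W_{ij}$ is induced by a latent dynamical system. The framework yields three theoretical results. The Interaction Rank Gap shows that factorized models cannot represent certain structured dynamical maps. The Equivalence (Head-Count) Theorem shows that, within the multi-head factorized class, representing a linear state space model whose lag operators span a $k$-dimensional subspace requires, and is achievable with, $H = k$ heads. The Gradient Highway Result shows that attention layers admit distance-independent gradient paths, whereas stable linear dynamics exhibit distance-dependent gradient attenuation. Together these results formalize a fundamental trade-off between expressivity and long-range gradient propagation and provide theoretical grounding for sequence architecture design.
Link: https://arxiv.org/abs/2512.15115
Authors: Ali Ghodsi
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: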
Abstract: Sequence modeling has produced diverse architectures, from classical recurrent neural networks to modern Transformers and state space models (SSMs), yet a unified theoretical understanding of expressivity and trainability trade-offs remains limited. We introduce a unified framework that represents a broad class of sequence maps via an input-dependent effective interaction operator $W_{ij}(X)$, making explicit two recurring construction patterns: (i) the Unified Factorized Framework (Explicit) (attention-style mixing), in which $W_{ij}(X)$ varies through scalar coefficients applied to shared value maps, and (ii) Structured Dynamics (Implicit) (state-space recurrences), in which $W_{ij}$ is induced by a latent dynamical system. Using this framework, we derive three theoretical results. First, we establish the Interaction Rank Gap: models in the Unified Factorized Framework, such as single-head attention, are constrained to a low-dimensional operator span and cannot represent certain structured dynamical maps. Second, we prove an Equivalence (Head-Count) Theorem showing that, within our multi-head factorized class, representing a linear SSM whose lag operators span a $k$-dimensional subspace on length-$n$ sequences requires and is achievable with $H=k$ heads. Third, we prove a Gradient Highway Result, showing that attention layers admit inputs with distance-independent gradient paths, whereas stable linear dynamics exhibit distance-dependent gradient attenuation. Together, these results formalize a fundamental trade-off between algebraic expressivity (interaction/operator span) and long-range gradient propagation, providing theoretical grounding for modern sequence architecture design.
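To make the interaction-operator view concrete, here is a minimal sketch for the simplest possible case: a time-invariant linear SSM with a scalar state, $h_t = a h_{t-1} + b x_t$, $y_t = c h_t$. Unrolling the recurrence gives $y_i = \sum_{j \le i} c\, a^{i-j} b\, x_j$, so the induced operator is $W_{ij} = c\, a^{i-j} b$. The scalar state and the function names are simplifying assumptions for illustration; the paper's framework covers far more general, input-dependent operators.

```python
from typing import List


def ssm_recurrence(a: float, b: float, c: float, xs: List[float]) -> List[float]:
    """Run the scalar SSM h_t = a*h_{t-1} + b*x_t, y_t = c*h_t step by step."""
    h, ys = 0.0, []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


def ssm_interaction_matrix(a: float, b: float, c: float, n: int) -> List[List[float]]:
    """Explicit causal operator W[i][j] = c * a**(i-j) * b for j <= i, else 0."""
    return [[c * a ** (i - j) * b if j <= i else 0.0 for j in range(n)]
            for i in range(n)]


def apply_matrix(W: List[List[float]], xs: List[float]) -> List[float]:
    """Compute y_i = sum_j W[i][j] * x_j."""
    return [sum(w * x for w, x in zip(row, xs)) for row in W]


xs = [1.0, 2.0, -1.0, 0.5]
W = ssm_interaction_matrix(0.5, 2.0, 3.0, len(xs))
ys_recurrent = ssm_recurrence(0.5, 2.0, 3.0, xs)
ys_explicit = apply_matrix(W, xs)
```

Since $\partial y_i / \partial x_j = W_{ij}$, a stable system with $|a| < 1$ has gradients that shrink geometrically with the lag $i - j$, which is the distance-dependent attenuation the abstract contrasts with attention.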
[AI-38] Feature-Centric Unsupervised Node Representation Learning Without Homophily Assumption AAAI2026
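[Quick read]: This paper addresses the problems caused by over-reliance on graph convolution in unsupervised node representation learning. Particularly in non-homophilic graphs, graph convolution can map nodes that differ in features or topology to overly similar embeddings, degrading representation quality. The key contribution is FUEL, which adaptively learns how much graph convolution to use so as to increase intra-class similarity and inter-class separability in the embedding space. Because class labels are unknown, FUEL uses node features to identify node clusters and treats those clusters as proxies for classes to guide optimization.
Link: https://arxiv.org/abs/2512.15112
Authors: Sunwoo Kim, Soo Yong Lee, Kyungho Kim, Hyunjin Hwang, Jaemin Yoo, Kijung Shin
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Published in AAAI 2026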
Abstract: Unsupervised node representation learning aims to obtain meaningful node embeddings without relying on node labels. To achieve this, graph convolution, which aggregates information from neighboring nodes, is commonly employed to encode node features and graph topology. However, excessive reliance on graph convolution can be suboptimal, especially in non-homophilic graphs, since it may yield unduly similar embeddings for nodes that differ in their features or topological properties. As a result, adjusting the degree of graph convolution usage has been actively explored in supervised learning settings, whereas such approaches remain underexplored in unsupervised scenarios. To tackle this, we propose FUEL, which adaptively learns the adequate degree of graph convolution usage by aiming to enhance intra-class similarity and inter-class separability in the embedding space. Since classes are unknown, FUEL leverages node features to identify node clusters and treats these clusters as proxies for classes. Through extensive experiments using 15 baseline methods and 14 benchmark datasets, we demonstrate the effectiveness of FUEL in downstream tasks, achieving state-of-the-art performance across graphs with diverse levels of homophily.
[AI-39] Beyond Fast and Slow: Cognitive-Inspired Elastic Reasoning for Large Language Models
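[Quick read]: This paper addresses the difficulty large language models (LLMs) have in balancing reasoning efficiency and accuracy across tasks of varying difficulty. Existing methods mostly rely on a single reasoning mode (fast or slow thinking) and cannot adapt their strategy to query complexity, which wastes resources or degrades performance. The key contribution is Cognitive-Inspired Elastic Reasoning (CogER), a framework inspired by human hierarchical reasoning that maps each incoming query to one of several predefined complexity levels and applies the matching processing strategy. Strategy selection is modeled as a Markov Decision Process, and a CogER-Agent is trained with reinforcement learning under a reward function that trades off solution quality against computational cost.
Link: https://arxiv.org/abs/2512.15089
Authors: Jinwu Hu, Dongjin Yang, Langyu Bian, Zhiquan Wen, Yufeng Wang, Yaofo Chen, Bin Xiao, Yuanqing Li, Mingkui Tan
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: under review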
Abstract:Large language models (LLMs) have demonstrated impressive performance across various language tasks. However, existing LLM reasoning strategies mainly rely on the LLM itself with fast or slow mode (like o1 thinking) and thus struggle to balance reasoning efficiency and accuracy across queries of varying difficulties. In this paper, we propose Cognitive-Inspired Elastic Reasoning (CogER), a framework inspired by human hierarchical reasoning that dynamically selects the most suitable reasoning strategy for each query. Specifically, CogER first assesses the complexity of incoming queries and assigns them to one of several predefined levels, each corresponding to a tailored processing strategy, thereby addressing the challenge of unobservable query difficulty. To achieve automatic strategy selection, we model the process as a Markov Decision Process and train a CogER-Agent using reinforcement learning. The agent is guided by a reward function that balances solution quality and computational cost, ensuring resource-efficient reasoning. Moreover, for queries requiring external tools, we introduce Cognitive Tool-Assisted Reasoning, which enables the LLM to autonomously invoke external tools within its chain-of-thought. Extensive experiments demonstrate that CogER outperforms state-of-the-art Test-Time scaling methods, achieving at least a 13% relative improvement in average exact match on In-Domain tasks and an 8% relative gain on Out-of-Domain tasks.
[AI-40] EMFusion: Conditional Diffusion Framework for Trustworthy Frequency Selective EMF Forecasting in Wireless Networks
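[Quick read]: This paper addresses accurate forecasting and uncertainty quantification of electromagnetic field (EMF) levels in wireless networks. In multi-operator, multi-band settings, conventional univariate wideband forecasting cannot capture frequency-selective variation, which limits proactive network planning. The key contribution is EMFusion, a conditional multivariate probabilistic forecasting framework based on diffusion models. Its main elements are: 1) a residual U-Net backbone with a cross-attention mechanism that dynamically integrates contextual information such as time of day, season, and holidays to guide generation; 2) an imputation-based sampling strategy that treats forecasting as a structural inpainting task, ensuring temporal coherence under irregular measurements; and 3) calibrated probabilistic prediction intervals generated directly from the learned conditional distribution, giving explicit uncertainty estimates that support trustworthy decision-making. Experiments show that EMFusion outperforms the baseline models in continuous ranked probability score (CRPS) and normalized root mean square error.
Link: https://arxiv.org/abs/2512.15067
Authors: Zijiang Yan, Yixiang Huang, Jianhua Pei, Hina Tabassum, Luca Chiaraviglio
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: Submission for possible publication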
Abstract: The rapid growth in wireless infrastructure has increased the need to accurately estimate and forecast electromagnetic field (EMF) levels to ensure ongoing compliance, assess potential health impacts, and support efficient network planning. While existing studies rely on univariate forecasting of wideband aggregate EMF data, frequency-selective multivariate forecasting is needed to capture the inter-operator and inter-frequency variations essential for proactive network planning. To this end, this paper introduces EMFusion, a conditional multivariate diffusion-based probabilistic forecasting framework that integrates diverse contextual factors (e.g., time of day, season, and holidays) while providing explicit uncertainty estimates. The proposed architecture features a residual U-Net backbone enhanced by a cross-attention mechanism that dynamically integrates external conditions to guide the generation process. Furthermore, EMFusion integrates an imputation-based sampling strategy that treats forecasting as a structural inpainting task, ensuring temporal coherence even with irregular measurements. Unlike standard point forecasters, EMFusion generates calibrated probabilistic prediction intervals directly from the learned conditional distribution, providing explicit uncertainty quantification essential for trustworthy decision-making. Numerical experiments conducted on frequency-selective EMF datasets demonstrate that EMFusion with the contextual information of working hours outperforms the baseline models with or without conditions. EMFusion outperforms the best baseline by 23.85% in continuous ranked probability score (CRPS) and 13.93% in normalized root mean square error, and reduces prediction CRPS error by 22.47%.
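The continuous ranked probability score (CRPS) cited above scores a whole predictive distribution against one observation. As a rough illustration, the sketch below uses the standard sample-based estimator $\mathrm{CRPS} \approx \frac{1}{m}\sum_i |x_i - y| - \frac{1}{2m^2}\sum_{i,j} |x_i - x_j|$; the abstract does not say how the paper computes CRPS, so treating it this way is an assumption, and `crps_from_samples` is a hypothetical helper name.

```python
from typing import Sequence


def crps_from_samples(samples: Sequence[float], observation: float) -> float:
    """Estimate CRPS from forecast samples; lower is better.

    First term: mean absolute error of the samples against the observation.
    Second term: half the mean absolute difference between sample pairs,
    which accounts for the spread of the forecast distribution.
    """
    m = len(samples)
    if m == 0:
        raise ValueError("at least one forecast sample is required")
    error_term = sum(abs(x - observation) for x in samples) / m
    spread_term = sum(abs(xi - xj) for xi in samples for xj in samples) / (2 * m * m)
    return error_term - spread_term
```

With a single sample the spread term vanishes and the score reduces to the absolute error, which is why CRPS is often described as a probabilistic generalization of MAE. The double loop is quadratic in the number of samples, which is acceptable for a sketch but not for large ensembles.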
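[AI-41] Agentic AI for Integrated Sensing and Communication: Analysis Framework and Case Study
[Quick read]: This paper addresses the insufficient intelligent processing capability and low autonomous operating efficiency of integrated sensing and communication (ISAC) systems in the increasingly dynamic and complex wireless environments of the 6G era. The key idea is to introduce agentic artificial intelligence (agentic AI), which builds continuous perception-reasoning-action loops to make ISAC systems more intelligent, autonomous, and adaptive. The paper highlights the advantages of generative AI (GenAI)-based agentic AI for optimizing ISAC performance and proposes a novel agentic ISAC framework, validated through a case study.
Link: https://arxiv.org/abs/2512.15044
Authors: Wenwen Xie, Geng Sun, Ruichen Zhang, Xuejie Liu, Yinqiu Liu, Jiacheng Wang, Dusit Niyato, Ping Zhang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments: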
Abstract: Integrated sensing and communication (ISAC) has emerged as a key development direction in the sixth-generation (6G) era, which provides essential support for the collaborative sensing and communication of future intelligent networks. However, as wireless environments become increasingly dynamic and complex, ISAC systems require more intelligent processing and more autonomous operation to maintain efficiency and adaptability. Meanwhile, agentic artificial intelligence (AI) offers a feasible solution to address these challenges by enabling continuous perception-reasoning-action loops in dynamic environments to support intelligent, autonomous, and efficient operation for ISAC systems. As such, we delve into the application value and prospects of agentic AI in ISAC systems in this work. Firstly, we provide a comprehensive review of agentic AI and ISAC systems to demonstrate their key characteristics. Secondly, we show several common optimization approaches for ISAC systems and highlight the significant advantages of generative artificial intelligence (GenAI)-based agentic AI. Thirdly, we propose a novel agentic ISAC framework and present a case study to verify its superiority in optimizing ISAC performance. Finally, we clarify future research directions for agentic AI-based ISAC systems.
[AI-42] LADY: Linear Attention for Autonomous Driving Efficiency without Transformers
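[Quick read]: This paper addresses the computational inefficiency of current Transformer-based end-to-end autonomous driving models on resource-constrained edge platforms. Their quadratic-cost attention makes it hard to model long temporal and spatial sequences, which limits real-time performance and deployment. The key contribution is LADY, described as the first fully linear attention-based generative model for end-to-end autonomous driving. It uses linear self-attention with constant computational and memory cost together with a lightweight linear cross-attention mechanism, which enables fusion of long-range temporal context and effective cross-modal information exchange, reduces computational cost, and improves planning performance. The model has also been deployed and validated on edge devices.
Link: https://arxiv.org/abs/2512.15038
Authors: Jihao Huang, Xi Xia, Zhiyuan Li, Tianle Liu, Jingke Wang, Junbo Chen, Tengju Ye
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Under review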
Abstract: End-to-end paradigms have demonstrated great potential for autonomous driving. Additionally, most existing methods are built upon Transformer architectures. However, Transformers incur a quadratic attention cost, limiting their ability to model long spatial and temporal sequences, particularly on resource-constrained edge platforms. As autonomous driving inherently demands efficient temporal modeling, this challenge severely limits their deployment and real-time performance. Recently, linear attention mechanisms have gained increasing attention due to their superior spatiotemporal complexity. However, existing linear attention architectures are limited to self-attention, lacking support for cross-modal and cross-temporal interactions, both crucial for autonomous driving. In this work, we propose LADY, the first fully linear attention-based generative model for end-to-end autonomous driving. LADY enables fusion of long-range temporal context at inference with constant computational and memory costs, regardless of the history length of camera and LiDAR features. Additionally, we introduce a lightweight linear cross-attention mechanism that enables effective cross-modal information exchange. Experiments on the NAVSIM and Bench2Drive benchmarks demonstrate that LADY achieves state-of-the-art performance with constant-time and memory complexity, offering improved planning performance and significantly reduced computational cost. Additionally, the model has been deployed and validated on edge devices, demonstrating its practicality in resource-limited scenarios.
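The constant-cost claim follows from a general property of linear attention: with a positive feature map $\phi$, the causal output $\sum_{j \le t} \phi(q_t)^\top \phi(k_j)\, v_j \,/\, \sum_{j \le t} \phi(q_t)^\top \phi(k_j)$ can be computed from a running state whose size does not depend on the history length. The sketch below shows this for generic causal linear attention; it is not LADY's architecture, and the choice $\phi(x) = \mathrm{elu}(x) + 1$ is one common option assumed here for illustration.

```python
import math
from typing import List

Vector = List[float]


def feature_map(x: Vector) -> Vector:
    """Positive feature map phi(x) = elu(x) + 1 (an assumed, common choice)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def linear_attention_streaming(qs: List[Vector], ks: List[Vector],
                               vs: List[Vector]) -> List[Vector]:
    """Causal linear attention using a fixed-size running state."""
    d_k, d_v = len(ks[0]), len(vs[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # sum of phi(k_j) v_j^T
    norm = [0.0] * d_k                         # sum of phi(k_j)
    outputs = []
    for q, k, v in zip(qs, ks, vs):
        fk = feature_map(k)
        for i in range(d_k):
            norm[i] += fk[i]
            for j in range(d_v):
                state[i][j] += fk[i] * v[j]
        fq = feature_map(q)
        denom = dot(fq, norm)
        outputs.append([sum(fq[i] * state[i][j] for i in range(d_k)) / denom
                        for j in range(d_v)])
    return outputs


def linear_attention_quadratic(qs: List[Vector], ks: List[Vector],
                               vs: List[Vector]) -> List[Vector]:
    """Reference version that revisits the whole history at every step."""
    d_v = len(vs[0])
    outputs = []
    for t, q in enumerate(qs):
        fq = feature_map(q)
        weights = [dot(fq, feature_map(ks[j])) for j in range(t + 1)]
        total = sum(weights)
        outputs.append([sum(w * vs[j][c] for j, w in enumerate(weights)) / total
                        for c in range(d_v)])
    return outputs
```

Both functions return the same values, but the streaming version keeps only a `d_k` by `d_v` matrix and a `d_k` vector between steps, so per-step cost stays flat as the history grows.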
[AI-43] Spectral Representation-based Reinforcement Learning
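[Quick read]: This paper addresses the theoretical ambiguity, optimization instability, exploration difficulty, and high computational cost that arise when reinforcement learning (RL) uses function approximation (such as neural networks) in settings with large state and action spaces. The key idea is a spectral representation framework: based on the spectral decomposition of the transition operator, it provides an effective abstraction of the system dynamics for subsequent policy optimization and admits a clear theoretical characterization. The paper shows how to construct spectral representations for transition operators with latent variable structures or energy-based structures. These correspond to different data-driven learning methods, each of which yields an effective RL algorithm under the framework, and the spectral view is provably extended to partially observable Markov decision processes (POMDPs).
Link: https://arxiv.org/abs/2512.15036
Authors: Chenxiao Gao, Haotian Sun, Na Li, Dale Schuurmans, Bo Dai
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: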
Abstract:In real-world applications with large state and action spaces, reinforcement learning (RL) typically employs function approximations to represent core components like the policies, value functions, and dynamics models. Although powerful approximations such as neural networks offer great expressiveness, they often present theoretical ambiguities, suffer from optimization instability and exploration difficulty, and incur substantial computational costs in practice. In this paper, we introduce the perspective of spectral representations as a solution to address these difficulties in RL. Stemming from the spectral decomposition of the transition operator, this framework yields an effective abstraction of the system dynamics for subsequent policy optimization while also providing a clear theoretical characterization. We reveal how to construct spectral representations for transition operators that possess latent variable structures or energy-based structures, which implies different learning methods to extract spectral representations from data. Notably, each of these learning methods realizes an effective RL algorithm under this framework. We also provably extend this spectral view to partially observable MDPs. Finally, we validate these algorithms on over 20 challenging tasks from the DeepMind Control Suite, where they achieve performances comparable or superior to current state-of-the-art model-free and model-based baselines.
[AI-44] Beyond Accuracy: A Geometric Stability Analysis of Large Language Models in Chess Evaluation
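[Quick read]: This paper addresses the concern that, in complex reasoning tasks, high scalar accuracy alone may not reflect the true geometric and spatial reasoning ability of large language models (LLMs). Conventional evaluation measures agreement with a strong engine such as Stockfish, but this cannot distinguish genuine understanding of abstract spatial logic from pattern memorization of specific board states. The paper proposes a Geometric Stability Framework, which applies transformations that leave the chess position semantically invariant (rotation, mirror symmetry, color inversion, and format conversion) and tests whether model outputs stay consistent under each perturbation. The method reveals an Accuracy-Stability Paradox: some high-accuracy models, such as GPT-5.1, show error rates that surge by over 600% on rotation tasks, whereas Claude Sonnet 4.5 and Kimi K2 Turbo are more robust across transformations. Geometric stability therefore offers a new, independent, and necessary dimension for assessing whether a model has truly acquired abstract spatial reasoning.
Link: https://arxiv.org/abs/2512.15033
Authors: Xidan Song, Weiqi Wang, Ruifeng Cao, Qingya Hu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: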
Abstract: The evaluation of Large Language Models (LLMs) in complex reasoning domains typically relies on performance alignment with ground-truth oracles. In the domain of chess, this standard manifests as accuracy benchmarks against strong engines like Stockfish. However, high scalar accuracy does not necessarily imply robust conceptual understanding. This paper argues that standard accuracy metrics fail to distinguish between genuine geometric reasoning and the superficial memorization of canonical board states. To address this gap, we propose a Geometric Stability Framework, a novel evaluation methodology that rigorously tests model consistency under invariant transformations, including board rotation, mirror symmetry, color inversion, and format conversion. We applied this framework to a comparative analysis of six state-of-the-art LLMs including GPT-5.1, Claude Sonnet 4.5, and Kimi K2 Turbo, utilizing a dataset of approximately 3,000 positions. Our results reveal a significant Accuracy-Stability Paradox. While models such as GPT-5.1 achieve near-optimal accuracy on standard positions, they exhibit catastrophic degradation under geometric perturbation, specifically in rotation tasks where error rates surge by over 600%. This disparity suggests a reliance on pattern matching over abstract spatial logic. Conversely, Claude Sonnet 4.5 and Kimi K2 Turbo demonstrate superior dual robustness, maintaining high consistency across all transformation axes. Furthermore, we analyze the trade-off between helpfulness and safety, identifying Gemini 2.5 Flash as the leader in illegal state rejection (96.0%). We conclude that geometric stability provides an orthogonal and essential metric for AI evaluation, offering a necessary proxy for disentangling reasoning capabilities from data contamination and overfitting in large-scale models.
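A consistency check of this kind can be sketched on the piece-placement field of a FEN string. The abstract does not define the paper's transformations precisely, so the two below are assumptions: a left-right mirror (files a and h swapped) and a color inversion (ranks reversed and piece case swapped). The sketch ignores side to move, castling rights, and en passant, and `material_balance` is a toy stand-in for whatever evaluator is being tested.

```python
from typing import Callable

PIECE_VALUES = {"p": 1, "n": 3, "b": 3, "r": 5, "q": 9, "k": 0}


def mirror_files(placement: str) -> str:
    """Mirror the board left to right by reversing each rank string."""
    return "/".join(rank[::-1] for rank in placement.split("/"))


def flip_colors(placement: str) -> str:
    """Swap the two sides: reverse the rank order and swap piece case."""
    return "/".join(reversed(placement.split("/"))).swapcase()


def material_balance(placement: str) -> int:
    """Toy evaluator: white material minus black material."""
    score = 0
    for ch in placement:
        if ch.isalpha():
            value = PIECE_VALUES[ch.lower()]
            score += value if ch.isupper() else -value
    return score


def is_stable(evaluate: Callable[[str], float], placement: str,
              tol: float = 1e-9) -> bool:
    """Mirroring should keep the score; color inversion should negate it."""
    base = evaluate(placement)
    mirrored_ok = abs(evaluate(mirror_files(placement)) - base) <= tol
    flipped_ok = abs(evaluate(flip_colors(placement)) + base) <= tol
    return mirrored_ok and flipped_ok


START = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR"
PAWN_UP = "4k3/8/8/8/8/8/4P3/4K3"
```

An evaluator that only counts material passes trivially; the interesting case is passing a model-backed `evaluate` function, where a failure of `is_stable` signals that the score depends on board orientation rather than on the position itself.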
[AI-45] Epistemic diversity across language models mitigates knowledge collapse
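[Quick read]: This paper addresses knowledge collapse in generative AI under continued self-training, where models gradually converge on a few dominant ideas and lose diversity and generalization ability. The key idea is AI ecosystem diversity: the training data is distributed across multiple language models, which are then trained iteratively on their collective output, mitigating the performance decay that a single model suffers from self-training. The study finds that a moderate level of epistemic diversity clearly delays collapse, but that both too little and too much hurt performance: too few models cannot capture the complexity of the true distribution, while too many weaken each model's ability to approximate it. The results point to a need to monitor diversity and to develop policies that encourage more domain- and community-specific models.
Link: https://arxiv.org/abs/2512.15011
Authors: Damian Hodel, Jevin D. West
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
Comments: 16 pages, 7 figures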
Abstract: The growing use of artificial intelligence (AI) raises concerns of knowledge collapse, i.e., a reduction to the most dominant and central set of ideas. Prior work has demonstrated single-model collapse, defined as performance decay in an AI model trained on its own output. Inspired by ecology, we ask whether AI ecosystem diversity, that is, diversity among models, can mitigate such a collapse. We build on the single-model approach but focus on ecosystems of models trained on their collective output. To study the effect of diversity on model performance, we segment the training data across language models and evaluate the resulting ecosystems over ten self-training iterations. We find that increased epistemic diversity mitigates collapse, but, interestingly, only up to an optimal level. Our results suggest that an ecosystem containing only a few diverse models fails to express the rich mixture of the full, true distribution, resulting in rapid performance decay. Yet distributing the data across too many models reduces each model’s approximation capacity on the true distribution, leading to poor performance already in the first iteration step. In the context of AI monoculture, our results suggest the need to monitor diversity across AI systems and to develop policies that incentivize more domain- and community-specific models.
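[AI-46] Imitation Game: Reproducing Deep Learning Bugs Leveraging an Intelligent Agent ICSE2026
【Quick read】: This paper tackles the difficulty of reproducing bugs in deep learning (DL) applications, which stems mainly from the inherent nondeterminism of DL models and their tight coupling with hardware and software environments. Manual approaches reliably reproduce only about 3% of DL bugs, which severely hampers localization and repair. The key contribution is RepGen, an automated, intelligent approach to bug reproduction: it constructs a learning-enhanced context from the project, develops a comprehensive reproduction plan, and uses an iterative generate-validate-refine mechanism with a large language model (LLM) to generate code that reproduces the target bug. On 106 real-world DL bugs, RepGen achieves a reproduction rate of 80.19%, a 19.81% improvement over the state of the art. A developer study with 27 participants shows that it improves the reproduction success rate by 23.35%, reduces reproduction time by 56.8%, and lowers cognitive load.
Link: https://arxiv.org/abs/2512.14990
Authors: Mehil B Shah,Mohammad Masudur Rahman,Foutse Khomh
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by the 48th IEEE/ACM International Conference on Software Engineering (ICSE 2026)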
Abstract:Despite their wide adoption in various domains (e.g., healthcare, finance, software engineering), Deep Learning (DL)-based applications suffer from many bugs, failures, and vulnerabilities. Reproducing these bugs is essential for their resolution, but it is extremely challenging due to the inherent nondeterminism of DL models and their tight coupling with hardware and software environments. According to recent studies, only about 3% of DL bugs can be reliably reproduced using manual approaches. To address these challenges, we present RepGen, a novel, automated, and intelligent approach for reproducing deep learning bugs. RepGen constructs a learning-enhanced context from a project, develops a comprehensive plan for bug reproduction, employs an iterative generate-validate-refine mechanism, and thus generates such code using an LLM that reproduces the bug at hand. We evaluate RepGen on 106 real-world deep learning bugs and achieve a reproduction rate of 80.19%, a 19.81% improvement over the state-of-the-art measure. A developer study involving 27 participants shows that RepGen improves the success rate of DL bug reproduction by 23.35%, reduces the time to reproduce by 56.8%, and lowers participants’ cognitive load.
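[AI-47] EVICPRESS: Joint KV-Cache Compression and Eviction for Efficient LLM Serving
【Quick read】: This paper addresses the bottleneck caused by the memory footprint of the KV cache in large language model (LLM) inference: with many users, the KV cache can exceed GPU memory capacity, leading to high latency or degraded service quality. Prior work either evicts KV cache to lower-tier storage or compresses it, but does not jointly optimize the two decisions, which makes it hard to minimize average generation latency without hurting quality. The key contribution is EVICPRESS, built around a unified utility function that quantifies the quality loss and delay impact of each context under different compression and eviction configurations and uses it to adjust KV-cache placement across storage tiers. The system periodically updates the utility scores and rearranges caches with a fast heuristic, coordinating decisions across tiers and contexts. Experiments report up to 2.19x faster time-to-first-token (TTFT) at equivalent generation quality.
Link: https://arxiv.org/abs/2512.14946
Authors: Shaoting Feng,Yuhan Liu,Hanchen Li,Xiaokun Chen,Samuel Shen,Kuntai Du,Zhuohan Gu,Rui Zhang,Yuyang Huang,Yihua Cheng,Jiayi Yao,Qizheng Zhang,Ganesh Ananthanarayanan,Junchen Jiang
Affiliation: Unknown
Categories: Operating Systems (cs.OS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: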
Abstract:Reusing KV cache is essential for high efficiency of Large Language Model (LLM) inference systems. With more LLM users, the KV cache footprint can easily exceed GPU memory capacity, so prior work has proposed to either evict KV cache to lower-tier storage devices, or compress KV cache so that more KV cache can be fit in the fast memory. However, prior work misses an important opportunity: jointly optimizing the eviction and compression decisions across all KV caches to minimize average generation latency without hurting quality. We propose EVICPRESS, a KV-cache management system that applies lossy compression and adaptive eviction to KV cache across multiple storage tiers. Specifically, for each KV cache of a context, EVICPRESS considers the effect of compression and eviction of the KV cache on the average generation quality and delay across all contexts as a whole. To achieve this, EVICPRESS proposes a unified utility function that quantifies the effect of quality and delay of the lossy compression or eviction. To this end, EVICPRESS’s profiling module periodically updates the utility function scores on all possible eviction-compression configurations for all contexts and places KV caches using a fast heuristic to rearrange KV caches on all storage tiers, with the goal of maximizing the utility function scores on each storage tier. Compared to the baselines that evict KV cache or compress KV cache, EVICPRESS achieves higher KV-cache hit rates on fast devices, i.e., lower delay, while preserving high generation quality by applying conservative compression to contexts that are sensitive to compression errors. Evaluation on 12 datasets and 5 models demonstrates that EVICPRESS achieves up to 2.19x faster time-to-first-token (TTFT) at equivalent generation quality. 
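The joint eviction-compression decision described in this abstract can be illustrated with a toy placement routine. This is a minimal sketch under stated assumptions, not the EVICPRESS implementation: the `Option` fields, the tier names, the linear utility `quality - alpha * delay`, and the greedy ordering are hypothetical stand-ins for the paper's utility function and heuristic, and all numbers are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    tier: str       # hypothetical tier name, e.g. "gpu" or "cpu"
    ratio: float    # fraction of the KV cache kept after lossy compression
    quality: float  # profiled generation quality in [0, 1] (illustrative)
    delay: float    # profiled loading delay in seconds (illustrative)


def utility(opt, alpha=1.0):
    # Assumed scalarisation: reward quality, penalise delay.
    return opt.quality - alpha * opt.delay


def place(options, sizes, capacity, alpha=1.0):
    """Greedily give each context its best-utility option that still fits."""
    free = dict(capacity)
    candidates = [(utility(o, alpha), name, o)
                  for name, opts in options.items() for o in opts]
    candidates.sort(key=lambda c: c[0], reverse=True)
    placement = {}
    for _, name, opt in candidates:
        need = sizes[name] * opt.ratio
        if name not in placement and need <= free[opt.tier]:
            placement[name] = opt
            free[opt.tier] -= need
    return placement


# Two 4 GB contexts, 6 GB of GPU memory, effectively unbounded CPU memory.
options = {
    "a": [Option("gpu", 1.0, 1.00, 0.10),   # sensitive to compression
          Option("gpu", 0.5, 0.50, 0.05),
          Option("cpu", 1.0, 1.00, 0.60)],
    "b": [Option("gpu", 1.0, 1.00, 0.10),   # tolerates compression well
          Option("gpu", 0.5, 0.97, 0.05),
          Option("cpu", 1.0, 1.00, 0.60)],
}
sizes = {"a": 4.0, "b": 4.0}
capacity = {"gpu": 6.0, "cpu": float("inf")}

placement = place(options, sizes, capacity)
for name, opt in sorted(placement.items()):
    print(name, opt.tier, opt.ratio)
```

In this toy setting both contexts stay on the fast tier: the compression-tolerant one is compressed and the sensitive one is kept intact. An eviction-only policy would have to push one of them to the slower tier, and a compression-only policy would degrade the sensitive context.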
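[AI-48] AgroAskAI: A Multi-Agentic AI Framework for Supporting Smallholder Farmers Enquiries Globally
【Quick read】: This paper addresses the damage that climate-related risks such as droughts, heavy rainfall, and shifting weather patterns cause in rural agricultural regions, with a focus on the lack of information and decision support available to vulnerable rural communities for climate adaptation. The key contribution is AgroAskAI, a multi-agent reasoning system for climate adaptation decision support in agriculture. It uses a modular, role-specialized architecture that coordinates autonomous agents through a chain-of-responsibility approach and integrates real-time tools and datasets, enabling dynamic collaborative reasoning and context-aware outputs. Built-in governance mechanisms mitigate hallucination and keep strategies coherent and locally relevant, and multilingual interaction makes the system accessible to non-English-speaking farmers, yielding more actionable, grounded, and inclusive recommendations.
Link: https://arxiv.org/abs/2512.14910
Authors: Nadine Angela Cantonjos,Arpita Biswas
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: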
Abstract:Agricultural regions in rural areas face damage from climate-related risks, including droughts, heavy rainfall, and shifting weather patterns. Prior research calls for adaptive risk-management solutions and decision-making strategies. To this end, artificial intelligence (AI), particularly agentic AI, offers a promising path forward. Agentic AI systems consist of autonomous, specialized agents capable of solving complex, dynamic tasks. While past systems have relied on single-agent models or have used multi-agent frameworks only for static functions, there is a growing need for architectures that support dynamic collaborative reasoning and context-aware outputs. To bridge this gap, we present AgroAskAI, a multi-agent reasoning system for climate adaptation decision support in agriculture, with a focus on vulnerable rural communities. AgroAskAI features a modular, role-specialized architecture that uses a chain-of-responsibility approach to coordinate autonomous agents, integrating real-time tools and datasets. The system has built-in governance mechanisms that mitigate hallucination and enable internal feedback for coherent, locally relevant strategies. The system also supports multilingual interactions, making it accessible to non-English-speaking farmers. Experiments on common agricultural queries related to climate adaptation show that, with additional tools and prompt refinement, AgroAskAI delivers more actionable, grounded, and inclusive outputs. Our experimental results highlight the potential of agentic AI for sustainable and accountable decision support in climate adaptation for agriculture.
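[AI-49] Imitation Learning for Multi-turn LM Agents via On-policy Expert Corrections
【Quick read】: This paper addresses covariate shift in imitation-learning-based training of multi-turn language model (LM) agents: as the student policy diverges from the expert trajectories, it encounters states absent from the training data, which reduces the effectiveness of fine-tuning. The key contribution is a data generation method called on-policy expert corrections (OECs): a rollout is started with the student model and then handed over to the expert model partway through the trajectory, producing partially on-policy data. On software engineering tasks, OEC data yields relative improvements of 14% and 13% over traditional imitation learning in the 7b and 32b settings, respectively, on SWE-bench verified, supporting the case for combining expert demonstrations with on-policy data when training multi-turn LM agents.
Link: https://arxiv.org/abs/2512.14895
Authors: Niklas Lauffer,Xiang Deng,Srivatsa Kundurthy,Brad Kenstler,Jeff Da
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: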
Abstract:A popular paradigm for training LM agents relies on imitation learning, fine-tuning on expert trajectories. However, we show that the off-policy nature of imitation learning for multi-turn LM agents suffers from the fundamental limitation known as covariate shift: as the student policy’s behavior diverges from the expert’s, it encounters states not present in the training data, reducing the effectiveness of fine-tuning. Taking inspiration from the classic DAgger algorithm, we propose a novel data generation methodology for addressing covariate shift for multi-turn LLM training. We introduce on-policy expert corrections (OECs), partially on-policy data generated by starting rollouts with a student model and then switching to an expert model part way through the trajectory. We explore the effectiveness of our data generation technique in the domain of software engineering (SWE) tasks, a multi-turn setting where LLM agents must interact with a development environment to fix software bugs. Our experiments compare OEC data against various other on-policy and imitation learning approaches on SWE agent problems and train models using a common rejection sampling (i.e., using environment reward) combined with supervised fine-tuning technique. Experiments find that OEC trajectories show a relative 14% and 13% improvement over traditional imitation learning in the 7b and 32b setting, respectively, on SWE-bench verified. Our results demonstrate the need for combining expert demonstrations with on-policy data for effective multi-turn LM agent training.
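The rollout scheme behind on-policy expert corrections can be sketched on a toy problem. This is an illustrative sketch only: the integer-walk environment, the two hard-coded policies, and the function name `oec_rollout` are hypothetical, and the abstract does not specify how the switch point is chosen or which turns are used as training targets.

```python
def oec_rollout(reset, step, student, expert, switch_step, max_steps):
    """Roll out the student for `switch_step` turns, then hand over to the expert."""
    state = reset()
    trajectory = []
    for t in range(max_steps):
        actor = "student" if t < switch_step else "expert"
        policy = student if actor == "student" else expert
        action = policy(state)
        trajectory.append((state, action, actor))
        state, done = step(state, action)
        if done:
            break
    return trajectory


GOAL = 3  # toy task: walk along the integers from 0 to GOAL


def reset():
    return 0


def step(state, action):
    next_state = state + action
    return next_state, next_state == GOAL


def expert(state):
    return 1   # always moves toward the goal


def student(state):
    return -1  # a weak student that drifts away from it


expert_only = oec_rollout(reset, step, student, expert, switch_step=0, max_steps=20)
oec = oec_rollout(reset, step, student, expert, switch_step=2, max_steps=20)

expert_states = {s for s, _, _ in expert_only}
corrected_states = {s for s, _, actor in oec if actor == "expert"}
print(sorted(expert_states))                     # [0, 1, 2]
print(sorted(corrected_states - expert_states))  # [-2, -1]
```

The second printed list contains states that only the student reaches. Pure expert demonstrations never show what to do there, which is the covariate-shift gap; the corrected rollout adds expert actions for exactly those states.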
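[AI-50] OLR-WA: Online Weighted Average Linear Regression in Multivariate Data Streams
【Quick read】: This paper addresses two challenges in online linear regression: updating the model efficiently as data keeps arriving, without large storage requirements or costly recalculation, and handling drift, where the data distribution changes over time, including time-based (temporal) drift and more demanding confidence-based scenarios. The key contribution is OLR-WA (OnLine Regression with Weighted Average), a multivariate online linear regression model that uses a weighted-averaging strategy to balance the contribution of older and newer data, converging quickly while remaining stable and adaptive. In confidence-based scenarios, OLR-WA updates conservatively and gives priority to older data points with higher confidence. According to the paper, it converges quickly and maintains high r² values even when initialized with only 1% to 10% of the data, and it is the only model among those compared that handles confidence-based scenarios effectively.
Link: https://arxiv.org/abs/2512.14892
Authors: Mohammad Abu-Shaira,Alejandro Rodriguez,Greg Speegle,Victor Sheng,Ishfaq Ahmad
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: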
Abstract: Online learning updates models incrementally with new data, avoiding large storage requirements and costly model recalculations. In this paper, we introduce OLR-WA (OnLine Regression with Weighted Average), a novel and versatile multivariate online linear regression model. We also investigate scenarios involving drift, where the underlying patterns in the data evolve over time, conduct convergence analysis, and compare our approach with existing online regression models. The results show that OLR-WA achieves performance comparable to batch regression, and comparable or superior performance relative to other state-of-the-art online models. Moreover, OLR-WA converges rapidly, surpassing other online models by consistently achieving high r² values from the first iteration to the last, even when initialized with a minimal number of data points, as few as 1% to 10% of the total. In addition to handling time-based (temporal drift) scenarios, OLR-WA stands out as the only model capable of effectively managing challenging confidence-based scenarios. It achieves this by adopting a conservative approach in its updates, giving priority to older data points with higher confidence levels. In summary, OLR-WA’s performance demonstrates its versatility and utility across different contexts, making it a valuable solution for online linear regression tasks.
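The role of the weights can be illustrated with a deliberately simplified univariate example. This is a sketch under stated assumptions, not the OLR-WA algorithm: the abstract does not give the combination rule, so plain coefficient-level averaging stands in for it here, and the function names, weights, and data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def weighted_average(base, incremental, w_base, w_inc):
    """Blend two coefficient tuples; the weights are user-chosen (assumed)."""
    total = w_base + w_inc
    return tuple((w_base * b + w_inc * i) / total
                 for b, i in zip(base, incremental))


# A stream whose slope drifts from 2 to 4 between two mini-batches.
batch_1 = ([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])      # y = 2x + 1
batch_2 = ([4.0, 5.0, 6.0, 7.0], [17.0, 21.0, 25.0, 29.0])  # y = 4x + 1

base = fit_line(*batch_1)         # model built from the older data
incremental = fit_line(*batch_2)  # model built from the newest mini-batch

conservative = weighted_average(base, incremental, w_base=0.9, w_inc=0.1)
adaptive = weighted_average(base, incremental, w_base=0.1, w_inc=0.9)
print(conservative, adaptive)
```

A large weight on the base model corresponds to the conservative, confidence-based behavior the abstract describes (older data dominates), while a large weight on the incremental model tracks temporal drift more quickly. Only the two coefficient tuples need to be stored between updates, not the raw data.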
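[AI-51] Entropy-Reservoir Bregman Projection: An Information-Geometric Unification of Model Collapse
【Quick read】: This paper addresses model collapse in self-referential learning, i.e., training a model only on data it generated itself: language models degenerate into repetitive text, generative adversarial networks (GANs) drop modes, and reinforcement-learning policies over-exploit particular behaviors. The key contribution is the Entropy-Reservoir Bregman Projection (ERBP) framework, which stabilizes the dynamics by mixing a high-entropy distribution, the "entropy reservoir", into each iteration as a controllable injection of entropy flux. From an information-geometric viewpoint, the closed loop is modeled as a sequence of stochastic Bregman projections in distribution space. The paper proves that, without external coupling, finite-sample noise shrinks the empirical support and causes exponential entropy decay, whereas the entropy reservoir guarantees a non-trivial entropy floor. The theory gives a necessary condition for collapse, a sufficient condition for stability, and closed-form rates that depend only on the sample size and the strong-convexity/Lipschitz constants of the Bregman generator. It thereby turns empirical stabilization heuristics (real-data mixing, entropy bonuses, knowledge distillation, and others) into one quantitative design rule: monitor and budget the entropy flux.
Link: https://arxiv.org/abs/2512.14879
Authors: Jingwei Chen
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: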
Abstract: Self-referential learning – training a model on data it generated itself – promises boundless scalability but chronically suffers from model collapse: language models degenerate into repetitive text, GANs drop modes, and reinforcement-learning policies over-exploit. Although practitioners employ ad hoc fixes such as real-data mixing, entropy bonuses, knowledge distillation, or retrieval-augmented generation, a single principle that explains both the failure mode and the success of these fixes has remained elusive. We present Entropy-Reservoir Bregman Projection (ERBP), an information-geometric framework that unifies these phenomena. We model the closed loop as a stochastic Bregman projection sequence in distribution space. Without external coupling, finite-sample noise forces the system to project onto an ever-shrinking empirical support, causing exponential entropy decay and eventual collapse. Introducing an Entropy Reservoir – a high-entropy distribution mixed into each projection – injects a controllable entropy flux that provably stabilises the dynamics. Our theory yields (i) a necessary condition for collapse, (ii) a sufficient condition that guarantees a non-trivial entropy floor, and (iii) closed-form rates that depend only on sample size and the strong-convexity/Lipschitz constants of the Bregman generator. Experiments on large-language-model self-training, Soft Actor-Critic in reinforcement learning, and GAN optimisation validate our predictions and show that disparate stabilisation heuristics correspond to specific reservoir choices and coupling coefficients. ERBP thus transforms a collection of folk remedies into a single, quantitative design rule: monitor and budget your entropy flux.
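The shrinking-support mechanism and the effect of a reservoir can be reproduced on a toy categorical distribution. This is a minimal sketch with assumptions of my own: the "projection" is plain maximum-likelihood refitting on the model's own samples, the reservoir is a uniform distribution mixed in with a hypothetical coefficient `lam`, and none of the constants come from the paper.

```python
import math
import random


def entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0)


def resample(p, n, rng):
    """Draw n samples from p and return the empirical distribution."""
    counts = [0] * len(p)
    for i in rng.choices(range(len(p)), weights=p, k=n):
        counts[i] += 1
    return [c / n for c in counts]


def self_train(p0, n, steps, lam, rng):
    """Refit on own samples each round, mixing in a uniform reservoir with weight lam."""
    k = len(p0)
    p, history = list(p0), [entropy(p0)]
    for _ in range(steps):
        p_hat = resample(p, n, rng)
        p = [(1 - lam) * q + lam / k for q in p_hat]
        history.append(entropy(p))
    return p, history


K, N, STEPS = 50, 20, 30
uniform = [1 / K] * K
_, closed_loop = self_train(uniform, N, STEPS, lam=0.0, rng=random.Random(0))
_, with_reservoir = self_train(uniform, N, STEPS, lam=0.2, rng=random.Random(0))
print(f"closed loop:    {closed_loop[0]:.2f} -> {closed_loop[-1]:.2f}")
print(f"with reservoir: {with_reservoir[0]:.2f} -> {with_reservoir[-1]:.2f}")
```

Two bounds hold here regardless of the random seed. With `lam = 0`, the refitted distribution has at most `N` support points, so its entropy cannot exceed `log N` even though it started at `log K`. With `lam > 0`, concavity of entropy gives a floor of `lam * log K`, a small-scale analogue of the non-trivial entropy floor the abstract refers to.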
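[AI-52] Penetration Testing of Agentic AI: A Comparative Security Analysis Across Models and Frameworks
【Quick read】: This paper addresses security vulnerabilities that arise when generative AI operates in agentic mode and that fall outside the coverage of traditional large language model (LLM) safeguards. Noting the lack of systematic comparisons across models and frameworks, the authors conduct what they describe as the first systematic penetration test of five prominent models (Claude 3.5 Sonnet, Gemini 2.5 Flash, GPT-4o, Grok 2, and Nova Pro) on two agentic frameworks (AutoGen and CrewAI), using a seven-agent architecture that mimics a university information management system and 13 attack scenarios spanning prompt injection, server-side request forgery (SSRF), SQL injection, and tool misuse. The structured experiments expose security disparities and defensive behavior patterns across configurations, including a novel "hallucinated compliance" strategy in which models fabricate outputs rather than executing or refusing the attack. The overall refusal rate of only 41.5% indicates that enterprise-grade safety mechanisms remain seriously inadequate in agentic settings, and the paper offers actionable recommendations for secure deployment.
Link: https://arxiv.org/abs/2512.14860
Authors: Viet K. Nguyen,Mohammad I. Husain
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: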
Abstract:Agentic AI introduces security vulnerabilities that traditional LLM safeguards fail to address. Although recent work by Unit 42 at Palo Alto Networks demonstrated that ChatGPT-4o successfully executes attacks as an agent that it refuses in chat mode, there is no comparative analysis in multiple models and frameworks. We conducted the first systematic penetration testing and comparative evaluation of agentic AI systems, testing five prominent models (Claude 3.5 Sonnet, Gemini 2.5 Flash, GPT-4o, Grok 2, and Nova Pro) across two agentic AI frameworks (AutoGen and CrewAI) using a seven-agent architecture that mimics the functionality of a university information management system and 13 distinct attack scenarios that span prompt injection, Server Side Request Forgery (SSRF), SQL injection, and tool misuse. Our 130 total test cases reveal significant security disparities: AutoGen demonstrates a 52.3% refusal rate versus CrewAI’s 30.8%, while model performance ranges from Nova Pro’s 46.2% to Claude and Grok 2’s 38.5%. Most critically, Grok 2 on CrewAI rejected only 2 of 13 attacks (15.4% refusal rate), and the overall refusal rate of 41.5% across all configurations indicates that more than half of malicious prompts succeeded despite enterprise-grade safety mechanisms. We identify six distinct defensive behavior patterns including a novel “hallucinated compliance” strategy where models fabricate outputs rather than executing or refusing attacks, and provide actionable recommendations for secure agent deployment. Complete attack prompts are also included in the Appendix to enable reproducibility.
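The percentages in this abstract can be cross-checked against the stated experimental design. The sketch below assumes that every model was tested on every framework with all 13 scenarios (which matches the stated total of 130) and that percentages are rounded to one decimal; the raw refusal counts are not given in the abstract, so the counts recovered here are inferences, and the helper name is hypothetical.

```python
def counts_matching(pct, n):
    """All refusal counts out of n that round to the reported percentage."""
    return [c for c in range(n + 1) if f"{100 * c / n:.1f}" == pct]


MODELS, FRAMEWORKS, SCENARIOS = 5, 2, 13
total = MODELS * FRAMEWORKS * SCENARIOS  # 130 test cases overall
per_framework = MODELS * SCENARIOS       # 65 test cases per framework
per_model = FRAMEWORKS * SCENARIOS       # 26 test cases per model

autogen = counts_matching("52.3", per_framework)     # [34]
crewai = counts_matching("30.8", per_framework)      # [20]
overall = counts_matching("41.5", total)             # [54]
nova_pro = counts_matching("46.2", per_model)        # [12]
claude_or_grok = counts_matching("38.5", per_model)  # [10]
grok_on_crewai = counts_matching("15.4", SCENARIOS)  # [2]

not_refused = total - overall[0]
print(autogen, crewai, overall, nova_pro, claude_or_grok, grok_on_crewai)
print(f"{not_refused} of {total} not refused ({100 * not_refused / total:.1f}%)")
```

Each reported percentage corresponds to exactly one integer count, and the two framework counts add up to the overall count (34 + 20 = 54), so the figures are internally consistent. The 76 non-refused cases are what the abstract summarizes as "more than half".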
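[AI-53] A Roadmap for Applying Graph Neural Networks to Numerical Data: Insights from Cementitious Materials
【Quick read】: This paper addresses the limited size and diversity of the datasets that constrain machine learning (ML) in concrete research, and the fact that conventional ML models typically handle a single data modality (such as numerical tabular data), which makes it hard to exploit complex relationships within materials. The key contribution is to bring in graph neural networks (GNNs): structured tabular data is converted into a graph representation with the k-nearest neighbor (K-NN) approach, so that the GNN's ability to model topology-dependent relationships can be used for feature extraction, with hyperparameter optimization and feature selection to improve prediction performance. The GNN achieves performance comparable to the benchmark random forest, and the work lays a foundation for future multi-modal and physics-informed GNN models, supporting the transition from traditional ML to more advanced AI architectures.
Link: https://arxiv.org/abs/2512.14855
Authors: Mahmuda Sharmin,Taihao Han,Jie Huang,Narayanan Neithalath,Gaurav Sant,Aditya Kumar
Affiliation: Unknown
Categories: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
Comments: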
Abstract: Machine learning (ML) has been increasingly applied in concrete research to optimize performance and mixture design. However, one major challenge in applying ML to cementitious materials is the limited size and diversity of available databases. A promising solution is the development of multi-modal databases that integrate both numerical and graphical data. Conventional ML frameworks in cement research are typically restricted to a single data modality. Graph neural networks (GNNs) represent a new generation of neural architectures capable of learning from data structured as graphs, capturing relationships through irregular or topology-dependent connections rather than fixed spatial coordinates. While GNNs are inherently designed for graphical data, they can be adapted to extract correlations from numerical datasets and potentially embed physical laws directly into their architecture, enabling explainable and physics-informed predictions. This work is among the first few studies to implement GNNs to design concrete, with a particular emphasis on establishing a clear and reproducible pathway for converting tabular data into graph representations using the k-nearest neighbor (K-NN) approach. Model hyperparameters and feature selection are systematically optimized to enhance prediction performance. The GNN shows performance comparable to the benchmark random forest, which has been demonstrated by many studies to yield reliable predictions for cementitious materials. Overall, this study provides a foundational roadmap for transitioning from traditional ML to advanced AI architectures. The proposed framework establishes a strong foundation for future multi-modal and physics-informed GNN models capable of capturing complex material behaviors and accelerating the design and optimization of cementitious materials.
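The tabular-to-graph conversion step can be sketched in a few lines of standard-library Python. The sketch assumes Euclidean distance on z-scored features and undirected edges; the feature names, the example mixtures, and the choice of `k` are hypothetical, and the paper's actual preprocessing may differ.

```python
import math


def standardize(rows):
    """Z-score each column so no single feature dominates the distance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def knn_edges(rows, k):
    """Undirected edges linking every row to its k nearest rows (Euclidean)."""
    edges = set()
    for i, row in enumerate(rows):
        others = sorted((math.dist(row, other), j)
                        for j, other in enumerate(rows) if j != i)
        for _, j in others[:k]:
            edges.add((min(i, j), max(i, j)))
    return sorted(edges)


# Hypothetical mixtures: [water-to-cement ratio, curing age in days].
mixtures = [[0.30, 7.0], [0.32, 7.0], [0.50, 28.0], [0.55, 90.0]]
edges = knn_edges(standardize(mixtures), k=1)
print(edges)
```

Each row of the table becomes a node whose feature vector is the row itself, and the edge list is what a GNN library would consume as the graph structure. The value of `k` controls graph density and is one of the hyperparameters that would need tuning.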
[AI-54] MALCDF: A Distributed Multi-Agent LLM Framework for Real-Time Cyber
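【Quick read】: This paper addresses the difficulty that traditional, centralized security tools have with adaptive, multi-vector attacks. The core contribution is the Multi-Agent LLM Cyber Defense Framework (MALCDF), in which four large language model (LLM) agents (Detection, Intelligence, Response, and Analysis) exchange encrypted, ontology-aligned messages over a Secure Communication Layer (SCL) to coordinate defense in real time. The central idea is that collaboration among simple LLM agents over a structured communication protocol keeps end-to-end outputs consistent while outperforming a lightweight ML-IDS baseline and a single-LLM setup on accuracy. MALCDF reports 90.0% detection accuracy, an 85.7% F1-score, and a 9.1% false-positive rate, suggesting that an ontology-aligned multi-agent architecture is feasible and effective for practical, real-time cyber defense.
Link: https://arxiv.org/abs/2512.14846
Authors: Arth Bhardwaj,Sia Godika,Yuvam Loonker
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: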
Abstract: Traditional, centralized security tools often miss adaptive, multi-vector attacks. We present the Multi-Agent LLM Cyber Defense Framework (MALCDF), a practical setup where four large language model (LLM) agents (Detection, Intelligence, Response, and Analysis) work together in real time. Agents communicate over a Secure Communication Layer (SCL) with encrypted, ontology-aligned messages, and produce audit-friendly outputs (e.g., MITRE ATT&CK mappings). For evaluation, we keep the test simple and consistent: all reported metrics come from the same 50-record live stream derived from the CICIDS2017 feature schema. CICIDS2017 is used for configuration (fields/schema) and to train a practical ML baseline. The ML-IDS baseline is a Lightweight Random Forest IDS (LRF-IDS) trained on a subset of CICIDS2017 and tested on the 50-record stream, with no overlap between training and test records. In experiments, MALCDF reaches 90.0% detection accuracy, 85.7% F1-score, and 9.1% false-positive rate, with 6.8s average per-event latency. It outperforms the lightweight ML-IDS baseline and a single-LLM setup on accuracy while keeping end-to-end outputs consistent. Overall, this hands-on build suggests that coordinating simple LLM agents with secure, ontology-aligned messaging can improve practical, real-time cyber defense.
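Because all three metrics come from the same 50-record stream, it is possible to check which confusion matrices are compatible with them. The sketch below assumes binary attack/benign labels, the standard definitions of accuracy, F1, and false-positive rate, and rounding to one decimal; the abstract does not report the confusion matrix itself, so the result is an inference, and the function names are hypothetical.

```python
def summarize(tp, fp, fn, tn):
    """Accuracy, F1, and false-positive rate as one-decimal percentage strings."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tuple(f"{100 * v:.1f}" for v in (accuracy, f1, fpr))


def consistent_matrices(n, reported):
    """Every (tp, fp, fn, tn) over n records that reproduces the reported metrics."""
    return [(tp, fp, fn, n - tp - fp - fn)
            for tp in range(n + 1)
            for fp in range(n + 1 - tp)
            for fn in range(n + 1 - tp - fp)
            if summarize(tp, fp, fn, n - tp - fp - fn) == reported]


matches = consistent_matrices(50, ("90.0", "85.7", "9.1"))
print(matches)  # [(15, 3, 2, 30)]
```

Under these assumptions a single matrix fits: 15 true positives, 3 false positives, 2 false negatives, and 30 true negatives. With only five misclassified records in total, each additional error would move accuracy by two percentage points, which is worth keeping in mind when comparing against the baselines.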
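[AI-55] Let the Barbarians In: How AI Can Accelerate Systems Performance Research
【Quick read】: This paper addresses the inefficiency of relying on manually designed solutions in traditional systems research, particularly in systems performance optimization, and asks how automation can speed up innovation. The core contribution is the AI-Driven Research for Systems (ADRS) paradigm, an iterative loop of generation, evaluation, and refinement: an AI model generates candidate solutions, a reliable verifier tests them in real systems or simulators against predefined workloads, and the feedback drives further improvement. The key requirement is a reliable verification mechanism that lets the AI filter and refine candidates effectively. Across ten case studies, ADRS-generated solutions match or even outperform human state-of-the-art designs.
Link: https://arxiv.org/abs/2512.14806
Authors: Audrey Cheng,Shu Liu,Melissa Pan,Zhifei Li,Shubham Agarwal,Mert Cemri,Bowen Wang,Alexander Krentsel,Tian Xia,Jongseok Park,Shuo Yang,Jeff Chen,Lakshya Agrawal,Ashwin Naren,Shulu Li,Ruiying Ma,Aditya Desai,Jiarong Xing,Koushik Sen,Matei Zaharia,Ion Stoica
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: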
Abstract:Artificial Intelligence (AI) is beginning to transform the research process by automating the discovery of new solutions. This shift depends on the availability of reliable verifiers, which AI-driven approaches require to validate candidate solutions. Research focused on improving systems performance is especially well-suited to this paradigm because system performance problems naturally admit such verifiers: candidates can be implemented in real systems or simulators and evaluated against predefined workloads. We term this iterative cycle of generation, evaluation, and refinement AI-Driven Research for Systems (ADRS). Using several open-source ADRS instances (i.e., OpenEvolve, GEPA, and ShinkaEvolve), we demonstrate across ten case studies (e.g., multi-region cloud scheduling, mixture-of-experts load balancing, LLM-based SQL, transaction scheduling) that ADRS-generated solutions can match or even outperform human state-of-the-art designs. Based on these findings, we outline best practices (e.g., level of prompt specification, amount of feedback, robust evaluation) for effectively using ADRS, and we discuss future research directions and their implications. Although we do not yet have a universal recipe for applying ADRS across all of systems research, we hope our preliminary findings, together with the challenges we identify, offer meaningful guidance for future work as researcher effort shifts increasingly toward problem formulation and strategic oversight. Note: This paper is an extension of our prior work [14]. It adds extensive evaluation across multiple ADRS frameworks and provides deeper analysis and insights into best practices.
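[AI-56] Sharing State Between Prompts and Programs
【Quick read】: This paper addresses the interoperability problem between natural language programming and formal programming languages such as Python: how code written in natural language can access and modify program state seamlessly, without hand-written conversion logic. The key contribution is a new programming abstraction, shared program state, which lets natural language code directly read and write Python program variables, compute with program objects, and implement control flow, so that natural code and formal code execute over one shared state. Shared program state is specified as a natural function interface and implemented in the Nightjar programming system. Experiments show that Nightjar programs achieve comparable or higher task accuracy than manual implementations (+4-19%) while reducing lines of code by 39.6% on average, at the cost of possible runtime overhead (0.4-4.3x the runtime of manual implementations).
Link: https://arxiv.org/abs/2512.14805
Authors: Ellie Y. Cheng,Logan Weber,Tian Jin,Michael Carbin
Affiliation: Unknown
Categories: Programming Languages (cs.PL); Artificial Intelligence (cs.AI)
Comments: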
Abstract:The rise of large language models (LLMs) has introduced a new type of programming: natural language programming. By writing prompts that direct LLMs to perform natural language processing, code generation, reasoning, etc., users are writing code in natural language – natural language code – for the LLM to execute. An emerging area of research enables interoperability between natural language code and formal languages such as Python. We present a novel programming abstraction, shared program state, that removes the manual work required to enable interoperability between natural language code and program state. With shared program state, programmers can write natural code that directly writes program variables, computes with program objects, and implements control flow in the program. We present a schema for specifying natural function interfaces that extend programming systems to support natural code and leverage this schema to specify shared program state as a natural function interface. We implement shared program state in the Nightjar programming system. Nightjar enables programmers to write Python programs that contain natural code that shares the Python program state. We show that Nightjar programs achieve comparable or higher task accuracy than manually written implementations (+4-19%), while decreasing the lines of code by 39.6% on average. The tradeoff to using Nightjar is that it may incur runtime overhead (0.4-4.3x runtime of manual implementations).
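[AI-57] IaC Generation with LLMs: An Error Taxonomy and A Study on Configuration Knowledge Injection
【Quick read】: This paper addresses the low success rate of large language models (LLMs) in generating Infrastructure as Code (IaC), specifically Terraform, and the difficulty of aligning the output with user intent. The core approach is to systematically inject structured configuration knowledge. The authors enhance the IaC-Eval benchmark with cloud emulation and automated error analysis, propose a new error taxonomy for LLM-assisted IaC generation, and evaluate knowledge injection techniques that progress from naive retrieval-augmented generation (RAG) to graph-based RAG (Graph RAG), including semantic enrichment of graph components and modeling of inter-resource dependencies. Knowledge injection raises technical validation success to 75.3% and overall success from 27.1% to 62.6%. Intent alignment nevertheless plateaus, revealing a "Correctness-Congruence Gap": LLMs can become proficient "coders" but remain limited "architects" when it comes to fulfilling nuanced user intent.
Link: https://arxiv.org/abs/2512.14792
Authors: Roman Nekrasov,Stefano Fossati,Indika Kumara,Damian Andrew Tamburri,Willem-Jan van den Heuvel
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: Submitted to ACM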
Abstract:Large Language Models (LLMs) currently exhibit low success rates in generating correct and intent-aligned Infrastructure as Code (IaC). This research investigated methods to improve LLM-based IaC generation, specifically for Terraform, by systematically injecting structured configuration knowledge. To facilitate this, an existing IaC-Eval benchmark was significantly enhanced with cloud emulation and automated error analysis. Additionally, a novel error taxonomy for LLM-assisted IaC code generation was developed. A series of knowledge injection techniques was implemented and evaluated, progressing from Naive Retrieval-Augmented Generation (RAG) to more sophisticated Graph RAG approaches. These included semantic enrichment of graph components and modeling inter-resource dependencies. Experimental results demonstrated that while baseline LLM performance was poor (27.1% overall success), injecting structured configuration knowledge increased technical validation success to 75.3% and overall success to 62.6%. Despite these gains in technical correctness, intent alignment plateaued, revealing a “Correctness-Congruence Gap” where LLMs can become proficient “coders” but remain limited “architects” in fulfilling nuanced user intent.
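[AI-58] Privacy-Preserving Feature Valuation in Vertical Federated Learning Using Shapley-CMI and PSI Permutation
【Quick read】: This paper addresses the fair evaluation of each party's feature contribution in Vertical Federated Learning (VFL) before any model is trained, focusing on the early stage when no model exists. The key contribution is a privacy-preserving implementation of the Shapley-CMI method: a private set intersection (PSI) server performs the feature permutations and computes intersection sizes across discretized, encrypted ID groups without any exchange of raw data. Each party then uses these encrypted intersection results to compute Shapley-CMI values locally, obtaining the marginal utility of its features. The result is secure, scalable, and fair feature contribution estimation that requires neither sharing raw data nor training a model.
Link: https://arxiv.org/abs/2512.14767
Authors: Unai Laskurain,Aitor Aguirre-Ortuzar,Urko Zurutuza
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Presented at the 3rd IEEE International Conference on Federated Learning Technologies and Applications (FLTA25), October 2025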
Abstract:Federated Learning (FL) is an emerging machine learning paradigm that enables multiple parties to collaboratively train models without sharing raw data, ensuring data privacy. In Vertical FL (VFL), where each party holds different features for the same users, a key challenge is to evaluate the feature contribution of each party before any model is trained, particularly in the early stages when no model exists. To address this, the Shapley-CMI method was recently proposed as a model-free, information-theoretic approach to feature valuation using Conditional Mutual Information (CMI). However, its original formulation did not provide a practical implementation capable of computing the required permutations and intersections securely. This paper presents a novel privacy-preserving implementation of Shapley-CMI for VFL. Our system introduces a private set intersection (PSI) server that performs all necessary feature permutations and computes encrypted intersection sizes across discretized and encrypted ID groups, without the need for raw data exchange. Each party then uses these intersection results to compute Shapley-CMI values, computing the marginal utility of their features. Initial experiments confirm the correctness and privacy of the proposed system, demonstrating its viability for secure and efficient feature contribution estimation in VFL. This approach ensures data confidentiality, scales across multiple parties, and enables fair data valuation without requiring the sharing of raw data or training models.
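The valuation itself, separated from the privacy layer, can be sketched as follows. This is a plaintext illustration under stated assumptions: it computes Shapley values with conditional mutual information as the marginal contribution, using raw discrete columns in one process, so it contains no PSI, no encryption, and no party separation. The group sizes counted by `Counter` stand in for the intersection sizes that the paper's PSI server would return, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter
from itertools import permutations


def joint_entropy(columns):
    """Empirical joint entropy (bits) of aligned discrete columns."""
    if not columns:
        return 0.0
    n = len(columns[0])
    groups = Counter(zip(*columns))  # group sizes stand in for intersection sizes
    return -sum(c / n * math.log2(c / n) for c in groups.values())


def cmi(x, y, given):
    """I(X; Y | S) = H(X,S) + H(Y,S) - H(X,Y,S) - H(S)."""
    return (joint_entropy([x] + given) + joint_entropy([y] + given)
            - joint_entropy([x, y] + given) - joint_entropy(given))


def shapley_cmi(features, label):
    """Average each feature's CMI contribution over all feature orderings."""
    names = list(features)
    orders = list(permutations(names))
    values = dict.fromkeys(names, 0.0)
    for order in orders:
        seen = []
        for name in order:
            values[name] += cmi(features[name], label, seen)
            seen.append(features[name])
    return {name: total / len(orders) for name, total in values.items()}


# Toy data: the label copies feature "a"; feature "b" is irrelevant.
features = {"a": [0, 0, 1, 1], "b": [0, 1, 0, 1]}
label = [0, 0, 1, 1]
values = shapley_cmi(features, label)
print(values)
```

The informative feature receives the full one bit of value and the irrelevant feature receives zero, and the values sum to the mutual information between all features and the label. Enumerating every permutation is exponential in the number of features, so a practical system would need sampling or another approximation.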
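[AI-59] GR-Agent: Adaptive Graph Reasoning Agent under Incomplete Knowledge
【Quick read】: This paper addresses the neglect of knowledge graph incompleteness in current evaluations of knowledge graph question answering (KGQA). Most benchmarks assume a complete knowledge graph in which the answer can be read off existing triples, which biases evaluation toward shallow retrieval rather than reasoning. In practice, knowledge graphs are missing many facts, and answers must be inferred from what is present. To close this gap, the paper proposes a methodology for constructing benchmarks under incompleteness that removes the triples directly supporting an answer while preserving alternative reasoning paths from which it can be inferred. Experiments show that existing methods degrade consistently in the incomplete setting, exposing limited reasoning ability. The authors then present the Adaptive Graph Reasoning Agent (GR-Agent), which formalizes KGQA as agent-environment interaction: it builds an interactive environment from the knowledge graph, defines an action space of graph reasoning tools, and maintains a memory of potential supporting evidence such as relevant relations and reasoning paths. GR-Agent outperforms non-training baselines and performs comparably to training-based methods in both complete and incomplete settings.
Link: https://arxiv.org/abs/2512.14766
Authors: Dongzhuoran Zhou,Yuqicheng Zhu,Xiaxia Wang,Hongkuan Zhou,Jiaoyan Chen,Steffen Staab,Yuan He,Evgeny Kharlamov
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: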
Abstract:Large language models (LLMs) achieve strong results on knowledge graph question answering (KGQA), but most benchmarks assume complete knowledge graphs (KGs) where direct supporting triples exist. This reduces evaluation to shallow retrieval and overlooks the reality of incomplete KGs, where many facts are missing and answers must be inferred from existing facts. We bridge this gap by proposing a methodology for constructing benchmarks under KG incompleteness, which removes direct supporting triples while ensuring that alternative reasoning paths required to infer the answer remain. Experiments on benchmarks constructed using our methodology show that existing methods suffer consistent performance degradation under incompleteness, highlighting their limited reasoning ability. To overcome this limitation, we present the Adaptive Graph Reasoning Agent (GR-Agent). It first constructs an interactive environment from the KG, and then formalizes KGQA as agent environment interaction within this environment. GR-Agent operates over an action space comprising graph reasoning tools and maintains a memory of potential supporting reasoning evidence, including relevant relations and reasoning paths. Extensive experiments demonstrate that GR-Agent outperforms non-training baselines and performs comparably to training-based methods under both complete and incomplete settings.
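The benchmark construction idea (remove the direct supporting triple, but only when the answer can still be inferred) can be illustrated with a small sketch. It uses plain graph reachability as a simplified stand-in for "an alternative reasoning path remains"; the paper's actual criterion is not spelled out in the abstract, and the toy knowledge graph and function names are hypothetical.

```python
from collections import deque

def reachable(triples, source, target):
    """Breadth-first search over directed (head, relation, tail) triples."""
    graph = {}
    for head, _rel, tail in triples:
        graph.setdefault(head, []).append(tail)
    seen, queue = {source}, deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def make_incomplete(triples, direct):
    """Drop the direct triple only if an alternative path to the answer survives."""
    remaining = [t for t in triples if t != direct]
    head, _rel, tail = direct
    return remaining if reachable(remaining, head, tail) else None

# Hypothetical toy knowledge graph.
KG = [
    ("alice", "grandmother_of", "carol"),
    ("alice", "mother_of", "bob"),
    ("bob", "father_of", "carol"),
]
incomplete = make_incomplete(KG, ("alice", "grandmother_of", "carol"))
```

Here the direct `grandmother_of` triple is removed because the two-hop path through `bob` still connects the question entity to the answer; a graph with no such path would be rejected as a benchmark item.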
[AI-60] Guided Discrete Diffusion for Constraint Satisfaction Problems
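[Quick Read]: This paper addresses solving constraint satisfaction problems (CSPs), in particular puzzles such as Sudoku that have explicit rules and a finite solution space. The key to its solution is a discrete diffusion guidance mechanism: the diffusion process is simulated in a discrete state space, and the generation trajectory is refined step by step without supervision so that it arrives at a valid solution. The method needs no labeled data; following the diffusion-model idea, it injects the constraints as a guidance signal during the reverse process, so the model can move from a random initial state to a solution that satisfies every constraint.
Link: https://arxiv.org/abs/2512.14765
Authors: Justin Jung
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Originally published in Jan 2025 on the SpringtailAI Blog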
Abstract:We propose discrete diffusion guidance for constraint satisfaction problems (CSPs) and demonstrate its ability to solve Sudoku puzzles without supervision.
[AI-61] Workflows vs Agents for Code Translation
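[Quick Read]: This paper addresses the syntax errors that arise when algorithms written in a high-level language such as MATLAB are automatically translated into a hardware description language (HDL), a step that is essential for FPGA and ASIC deployment but that is resource-intensive and error-prone when done by traditional means. The key to its solution is a comparison of two syntax-repair strategies driven by large language models (LLMs): a structured, expert-designed flow with a fixed sequence of operations, and an autonomous agentic approach built on the Model Context Protocol (MCP) that selects its tools dynamically and manages context more tightly. The study finds that the agentic approach raises the syntax-repair success rate on small and mid-sized models, which in turn raises the simulation reach rate of the whole pipeline, by more than 20 percentage points on mid-sized models. This suggests that short prompts, aggressive context management, and conditional tool use compensate for the limited capacity of smaller models.
Link: https://arxiv.org/abs/2512.14762
Authors: Henry Gray,Tom Yotam,Octavian Udrea
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: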
Abstract:Translating algorithms from high-level languages like MATLAB to hardware description languages (HDLs) is a resource-intensive but necessary step for deployment on FPGAs and ASICs. While large language models (LLMs) offer a path to automation, their limited training on HDL code makes end-to-end transpilation brittle and prone to syntax errors. We compare two LLM-driven methods for syntax repair in a MATLAB-to-HDL pipeline: a structured, expert-designed flow that follows a fixed sequence of operations, and a more autonomous agentic approach that uses the Model Context Protocol (MCP) \cite{anthropic2024mcp} to dynamically select its own tools. We study 42 MATLAB signal-processing functions and isolate the syntax-repair stage. Across three model scales, the agentic approach is more effective at resolving initial syntax errors, unblocking a greater number of candidates to proceed through the pipeline. This upstream improvement yields measurable downstream improvements, most notably on mid-sized models, where it increases the simulation reach rate by over 20 percentage points. We hypothesize the gains come from short prompts, aggressive context management, and conditional tool use. Conditional retrieval helps at 8B and 30B; at 235B final-success gains are small and a naive RAG variant attains the highest final success. Our findings suggest that these agentic frameworks, when properly designed, are most effective at compensating for the capacity limits of small and mid-sized models.
[AI-62] CAPE: Capability Achievement via Policy Execution
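[Quick Read]: This paper addresses the inability of modern AI systems to express and enforce explicit, context-dependent constraints, which causes highly capable models to fail frequently in deployment despite strong benchmark results. The key to its solution is Capability Engineering, realized through the CAPE (Capability Achievement via Policy Execution) protocol, which runs a Specify, Verify, Correct, Train loop that turns requirements into executable specifications and trains models to satisfy them by default. CAPE rests on two empirical findings: (1) contextual objectivity, meaning that properties that look subjective become objective once the context is fixed (inter-annotator agreement rises from κ = 0.42 to κ = 0.98); and (2) verification-fidelity scaling, meaning that verification accuracy improves with model scale (r = 0.94), whereas preference agreement plateaus at 30 to 50 percent disagreement. Across 109,500 examples in six domains, the method reduces violation rates by 81 percent relative to DPO, cuts annotation costs by a factor of 5 to 20, and shortens development timelines from months to weeks.
Link: https://arxiv.org/abs/2512.14761
Authors: David Ball
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 32 pages, 3 figures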
Abstract:Modern AI systems lack a way to express and enforce requirements. Pre-training produces intelligence, and post-training optimizes preferences, but neither guarantees that models reliably satisfy explicit, context-dependent constraints. This missing abstraction explains why highly intelligent models routinely fail in deployment despite strong benchmark performance. We introduce Capability Engineering, the systematic practice of converting requirements into executable specifications and training models to satisfy them by default. We operationalize this practice through CAPE (Capability Achievement via Policy Execution), a protocol implementing a Specify - Verify - Correct - Train loop. CAPE is grounded in two empirical findings: (1) contextual objectivity, where properties appearing subjective become objective once context is fixed (inter-annotator agreement rises from kappa = 0.42 to kappa = 0.98), and (2) verification-fidelity scaling, where verification accuracy improves with model scale (r = 0.94), unlike preference agreement which plateaus at 30 to 50 percent disagreement regardless of compute. Across 109,500 examples in six domains, CAPE reduces violation rates by 81 percent relative to DPO (standard deviation less than 0.3 percent). By replacing per-example annotation with reusable specifications, CAPE reduces costs by 5 to 20 times and shortens timelines from months to weeks. We release the CAPE protocol, PredicateGraph schema, CPL specification language, and policy packs under Apache 2.0. We also launch CapabilityBench, a public registry of model evaluations against community-contributed policies, shifting evaluation from intelligence benchmarks toward capability measurement. 
Submission history: [v1] Mon, 15 Dec 2025 18:58:21 UTC (30 KB)
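The inter-annotator agreement figures quoted in the abstract (kappa = 0.42 versus kappa = 0.98) are chance-corrected agreement scores. The abstract does not say which kappa variant is used; the sketch below assumes Cohen's kappa for two annotators, and the label lists are made-up illustration data.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance if both annotators labeled independently.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical annotations for four items.
annotator_1 = ["ok", "ok", "bad", "bad"]
annotator_2 = ["ok", "bad", "bad", "bad"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

The two annotators agree on three of four items, but half of that agreement is expected by chance, so kappa is 0.5 rather than 0.75. The same correction is why a rise from 0.42 to 0.98 represents a much larger change than raw agreement percentages would suggest.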
[AI-63] CODE ACROSTIC: Robust Watermarking for Code Generation
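[Quick Read]: This paper addresses the weakness of existing code watermarking techniques against comment removal attacks, in which an attacker strips the comments from generated code without affecting its functionality and thereby evades watermark detection. The key to its solution is to use prior knowledge to build a Cue List that identifies the low-entropy and high-entropy regions of the code, and to let this Cue List guide where the watermark is injected, which improves detectability while preserving the usability of the code.
Link: https://arxiv.org/abs/2512.14753
Authors: Li Lin,Siyuan Xin,Yang Cao,Xiaochun Cao
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: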
Abstract:Watermarking large language models (LLMs) is vital for preventing their misuse, including the fabrication of fake news, plagiarism, and spam. It is especially important to watermark LLM-generated code, as it often contains intellectual property. However, we found that existing methods for watermarking LLM-generated code fail to address comment removal attacks. In such cases, an attacker can simply remove the comments from the generated code without affecting its functionality, significantly reducing the effectiveness of current code-watermarking techniques. On the other hand, injecting a watermark into code is challenging because, as previous works have noted, most code represents a low-entropy scenario compared to natural language. Our approach to addressing this issue involves leveraging prior knowledge to distinguish between low-entropy and high-entropy parts of the code, as indicated by a Cue List. We then inject the watermark guided by this Cue List, achieving higher detectability and usability than existing methods. We evaluated our proposed method on HumanEval and compared our method with three state-of-the-art code watermarking techniques. The results demonstrate the effectiveness of our approach.
[AI-64] Cyberswarm: a novel swarm intelligence algorithm inspired by cyber community dynamics
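[Quick Read]: This paper addresses the difficulty recommendation systems have in adapting dynamically to changing user preferences and interaction patterns in complex social networks; traditional methods often fail to model the intricate interactions within the network and generalize poorly. The key to its solution is a general-purpose swarm intelligence algorithm inspired by social psychology principles. It models user preferences and community influences in a dynamic hypergraph structure and combines centrality-based feature extraction with Node2Vec embeddings; preference evolution is driven by message-passing mechanisms and hierarchical graph modeling, which supports real-time adaptation to changing behavior. The authors report better recommendation precision and contextual relevance than baseline methods on Hit Rate (HR), Mean Reciprocal Rank (MRR), and Normalized Discounted Cumulative Gain (NDCG) across multiple datasets.
Link: https://arxiv.org/abs/2512.14752
Authors: Abdelsadeq Elfergany,Ammar Adl,Mohammed Kayed
Affiliation: Unknown
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Comments: 49 pages, 15 figures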
Abstract:Recommendation systems face challenges in dynamically adapting to evolving user preferences and interactions within complex social networks. Traditional approaches often fail to account for the intricate interactions within cyber-social systems and lack the flexibility to generalize across diverse domains, highlighting the need for more adaptive and versatile solutions. In this work, we introduce a general-purpose swarm intelligence algorithm for recommendation systems, designed to adapt seamlessly to varying applications. It was inspired by social psychology principles. The framework models user preferences and community influences within a dynamic hypergraph structure. It leverages centrality-based feature extraction and Node2Vec embeddings. Preference evolution is guided by message-passing mechanisms and hierarchical graph modeling, enabling real-time adaptation to changing behaviors. Experimental evaluations demonstrated the algorithm’s superior performance in various recommendation tasks, including social networks and content discovery. Key metrics such as Hit Rate (HR), Mean Reciprocal Rank (MRR), and Normalized Discounted Cumulative Gain (NDCG) consistently outperformed baseline methods across multiple datasets. The model’s adaptability to dynamic environments allowed for contextually relevant and precise recommendations. The proposed algorithm represents an advancement in recommendation systems by bridging individual preferences and community influences. Its general-purpose design enables applications in diverse domains, including social graphs, personalized learning, and medical graphs. This work highlights the potential of integrating swarm intelligence with network dynamics to address complex optimization challenges in recommendation systems.
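For readers unfamiliar with the three ranking metrics named in the abstract, the sketch below computes them for a single user under the common binary-relevance definitions. Whether the paper uses exactly these variants (for example, the cutoff k or graded relevance) is an assumption, and the item names are hypothetical.

```python
import math

def hit_rate(ranked, relevant, k):
    """1.0 if any relevant item appears in the top k, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is ranked."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def ndcg(ranked, relevant, k):
    """Binary-relevance NDCG@k: discounted gain divided by the ideal gain."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, item in enumerate(ranked[:k], start=1)
              if item in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal > 0 else 0.0

# Hypothetical recommendation list for one user; only item "a" is relevant.
ranked_items = ["b", "a", "c"]
relevant_items = {"a"}
hr = hit_rate(ranked_items, relevant_items, k=3)
rr = reciprocal_rank(ranked_items, relevant_items)
gain = ndcg(ranked_items, relevant_items, k=3)
```

Averaging these per-user values over all test users gives the HR, MRR, and NDCG figures that recommendation papers typically report.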
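[AI-65] One Leak Away: How Pretrained Model Exposure Amplifies Jailbreak Risks in Finetuned LLMs
[Quick Read]: This paper addresses the security question of whether finetuned large language models (LLMs) inherit the jailbreak vulnerabilities of the pretrained models they are derived from. Under a realistic pretrain-to-finetune threat model, in which the attacker has white-box access to the pretrained model but only black-box access to its finetuned derivatives, the study finds that adversarial prompts optimized on the pretrained model still transfer effectively to the finetuned variants, which shows that jailbreak vulnerabilities are inherited. The key to its solution is an attack based on representation-level probing, Probe-Guided Projection (PGP): probing shows that transferable prompts are linearly separable in the pretrained model's hidden states, and PGP uses this to steer optimization toward transferability-relevant directions. This markedly raises the transfer success of adversarial prompts across finetuned tasks and model families and exposes a security risk inherent in the pretrain-to-finetune paradigm.
Link: https://arxiv.org/abs/2512.14751
Authors: Yixin Tan,Zhe Yu,Jun Sakuma
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 17 pages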
Abstract:Finetuning pretrained large language models (LLMs) has become the standard paradigm for developing downstream applications. However, its security implications remain unclear, particularly regarding whether finetuned LLMs inherit jailbreak vulnerabilities from their pretrained sources. We investigate this question in a realistic pretrain-to-finetune threat model, where the attacker has white-box access to the pretrained LLM and only black-box access to its finetuned derivatives. Empirical analysis shows that adversarial prompts optimized on the pretrained model transfer most effectively to its finetuned variants, revealing inherited vulnerabilities from pretrained to finetuned LLMs. To further examine this inheritance, we conduct representation-level probing, which shows that transferable prompts are linearly separable within the pretrained hidden states, suggesting that universal transferability is encoded in pretrained representations. Building on this insight, we propose the Probe-Guided Projection (PGP) attack, which steers optimization toward transferability-relevant directions. Experiments across multiple LLM families and diverse finetuned tasks confirm PGP’s strong transfer success, underscoring the security risks inherent in the pretrain-to-finetune paradigm.
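[AI-66] Factor(U,T): Controlling Untrusted AI by Monitoring their Plans AAAI2026
[Quick Read]: This paper addresses how to detect harm when the model that decomposes a task (the decomposer) may itself be malicious. Prior factored cognition protocols rely on a trusted but weaker model for decomposition, which limits their usefulness on tasks where the decomposition itself is hard. The paper proposes the Factor(U,T) protocol, in which an untrusted, stronger model performs the decomposition and trusted models implement the child tasks in isolation, and asks whether monitors can detect malicious activity from the natural-language task instructions alone. In control evaluations on BigCodeBench, monitors that see only the decompositions perform poorly (AUROC 0.52), whereas monitors that evaluate complete Python solutions discriminate well (AUROC 0.96). Factor(D,U), which uses a trusted decomposer and monitors concrete child solutions, reaches AUROC 0.96 with a 1.2% attack success rate (ASR), which indicates that monitoring the implementation context succeeds where decomposition-only monitoring fails.
Link: https://arxiv.org/abs/2512.14745
Authors: Edward Lue Chee Lip,Anthony Channg,Diana Kim,Aaron Sandoval,Kevin Zhu
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Accepted to AAAI 2026 Workshop on Trust and Control in Agentic AI (TrustAgent). 6 pages body, 8 pages total, 3 figures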
Abstract:As AI capabilities advance, we increasingly rely on powerful models to decompose complex tasks – but what if the decomposer itself is malicious? Factored cognition protocols decompose complex tasks into simpler child tasks: one model creates the decomposition, while other models implement the child tasks in isolation. Prior work uses trusted (weaker but reliable) models for decomposition, which limits usefulness for tasks where decomposition itself is challenging. We introduce Factor(U,T), in which an untrusted (stronger but potentially malicious) model decomposes while trusted models implement child tasks. Can monitors detect malicious activity when observing only natural language task instructions, rather than complete solutions? We baseline and red team Factor(U,T) in control evaluations on BigCodeBench, a dataset of Python coding tasks. Monitors distinguishing malicious from honest decompositions perform poorly (AUROC 0.52) compared to monitors evaluating complete Python solutions (AUROC 0.96). Furthermore, Factor(D,U), which uses a trusted decomposer and monitors concrete child solutions, achieves excellent discrimination (AUROC 0.96) and strong safety (1.2% ASR), demonstrating that implementation-context monitoring succeeds where decomposition-only monitoring fails.
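The AUROC values in the abstract measure how well a monitor's suspicion scores separate malicious from honest samples: 0.5 is chance and 1.0 is perfect separation. A minimal sketch of the statistic, using the pairwise (Mann-Whitney) formulation, is given below; the suspicion scores are invented for illustration and do not come from the paper.

```python
def auroc(scores_malicious, scores_honest):
    """Probability that a random malicious sample outscores a random honest one.

    Ties count as half a win, which matches the area under the ROC curve.
    """
    wins = 0.0
    for m in scores_malicious:
        for h in scores_honest:
            if m > h:
                wins += 1.0
            elif m == h:
                wins += 0.5
    return wins / (len(scores_malicious) * len(scores_honest))

# Hypothetical monitor suspicion scores.
malicious = [0.9, 0.8, 0.4]
honest = [0.1, 0.3, 0.5]
separation = auroc(malicious, honest)
```

In this toy case eight of the nine malicious-honest pairs are ordered correctly, so the AUROC is 8/9. A monitor at AUROC 0.52, as reported for decomposition-only monitoring, orders such pairs correctly only slightly more often than a coin flip.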
[AI-67] Quantum-Augmented AI/ML for O-RAN: Hierarchical Threat Detection with Synergistic Intelligence and Interpretability (Technical Report)
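[Quick Read]: This paper addresses the enlarged cybersecurity attack surface of the Open Radio Access Network (O-RAN), which results from disaggregating the control, user, and management planes. The key to its solution is a hierarchical defense framework with three coordinated layers (anomaly detection, intrusion confirmation, and multi-attack classification) aligned with O-RAN's telemetry stack. It combines quantum computing with machine learning, using amplitude- and entanglement-based feature encodings together with deep and ensemble classifiers. On synthetic and real-world telemetry the authors report near-perfect accuracy, high recall, and strong class separability, and they argue that the framework is interpretable, robust, and suited to slice-aware diagnostics and scalable deployment in the near-real-time (near-RT) and non-real-time (non-RT) RAN Intelligent Controller (RIC) domains.
Link: https://arxiv.org/abs/2512.14742
Authors: Tan Le,Van Le,Sachin Shetty
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: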
Abstract:Open Radio Access Networks (O-RAN) enhance modularity and telemetry granularity but also widen the cybersecurity attack surface across disaggregated control, user and management planes. We propose a hierarchical defense framework with three coordinated layers (anomaly detection, intrusion confirmation, and multiattack classification), each aligned with O-RAN’s telemetry stack. Our approach integrates hybrid quantum computing and machine learning, leveraging amplitude- and entanglement-based feature encodings with deep and ensemble classifiers. We conduct extensive benchmarking across synthetic and real-world telemetry, evaluating encoding depth, architectural variants, and diagnostic fidelity. The framework consistently achieves near-perfect accuracy, high recall, and strong class separability. Multi-faceted evaluation across decision boundaries, probabilistic margins, and latent space geometry confirms its interpretability, robustness, and readiness for slice-aware diagnostics and scalable deployment in near-RT and non-RT RIC domains.
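Amplitude encoding, one of the two feature encodings mentioned in the abstract, stores a classical feature vector in the amplitudes of a quantum state, which requires a vector of power-of-two length with unit L2 norm. The sketch below shows only this classical preprocessing step under those standard assumptions; it is not the authors' pipeline, the telemetry values are hypothetical, and the entanglement-based encoding is not covered.

```python
import math

def amplitude_encode(features):
    """Pad to a power-of-two length and scale to unit L2 norm.

    Returns the amplitude vector and the number of qubits needed to hold it.
    """
    if not any(features):
        raise ValueError("cannot encode an empty or all-zero vector")
    # Smallest qubit count whose 2**n_qubits amplitudes fit every feature.
    n_qubits = max(1, (len(features) - 1).bit_length())
    padded = list(features) + [0.0] * (2 ** n_qubits - len(features))
    norm = math.sqrt(sum(x * x for x in padded))
    return [x / norm for x in padded], n_qubits

# Hypothetical three-feature telemetry sample.
state, n_qubits = amplitude_encode([3.0, 4.0, 0.0])
```

The squared amplitudes sum to one, so they can be read as measurement probabilities. The appeal of this encoding is that n qubits hold 2^n features, at the cost of a state-preparation circuit whose depth depends on the data.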
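[AI-68] Persistent Backdoor Attacks under Continual Fine-Tuning of LLMs
[Quick Read]: This paper addresses the poor persistence of backdoor attacks after a large language model (LLM) is deployed: under user-driven continual fine-tuning, an implanted backdoor behavior tends to weaken or be forgotten as the model is updated. The key to its solution is the P-Trojan algorithm, which explicitly optimizes for backdoor persistence across repeated updates. Its core mechanism is to align the poisoned gradients with the clean-task gradients on the token embeddings, which makes the backdoor mapping less likely to be suppressed or forgotten during later fine-tuning. A theoretical analysis shows that such persistent backdoor attacks are feasible, and experiments on the Qwen2.5 and LLaMA3 model families show that P-Trojan achieves over 99% persistence while preserving clean-task accuracy.
Link: https://arxiv.org/abs/2512.14741
Authors: Jing Cui,Yufei Han,Jianbin Jiao,Junge Zhang
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: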
Abstract:Backdoor attacks embed malicious behaviors into Large Language Models (LLMs), enabling adversaries to trigger harmful outputs or bypass safety controls. However, the persistence of the implanted backdoors under user-driven post-deployment continual fine-tuning has been rarely examined. Most prior works evaluate the effectiveness and generalization of implanted backdoors only at release time, and empirical evidence shows that the persistence of naively injected backdoors degrades after updates. In this work, we study whether and how implanted backdoors persist through multi-stage post-deployment fine-tuning. We propose P-Trojan, a trigger-based attack algorithm that explicitly optimizes for backdoor persistence across repeated updates. By aligning poisoned gradients with those of clean tasks on token embeddings, the implanted backdoor mapping is less likely to be suppressed or forgotten during subsequent updates. Theoretical analysis shows the feasibility of such persistent backdoor attacks after continual fine-tuning. Experiments on the Qwen2.5 and LLaMA3 families of LLMs, as well as on diverse task sequences, demonstrate that P-Trojan achieves over 99% persistence while preserving clean-task accuracy. Our findings highlight the need for persistence-aware evaluation and stronger defenses in realistic model adaptation pipelines.
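[AI-69] Zero-Knowledge Audit for Internet of Agents: Privacy-Preserving Communication Verification with Model Context Protocol
[Quick Read]: This paper addresses the difficulty agent communication frameworks have in providing verifiable audit trails while keeping communication private, a central challenge in regulated environments that require accurate billing, compliance verification, and accountability. The key to its solution is to combine zero-knowledge proofs (ZKPs) with the existing Model Context Protocol (MCP), so that a communication can be checked against predefined rules without revealing message contents. The approach runs in lightweight networks, stays compatible with standard MCP exchanges, and adds asynchronous audit verification that confirms message format and type without exposing specifics. It also enables mutual audits between agents: one side verifies communication content and quality while the other verifies usage metrics, and neither side reveals sensitive information.
Link: https://arxiv.org/abs/2512.14737
Authors: Guanlin Jing,Huayi Qi
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: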
Abstract:Existing agent communication frameworks face critical limitations in providing verifiable audit trails without compromising the privacy and confidentiality of agent interactions. The protection of agent communication privacy while ensuring auditability emerges as a fundamental challenge for applications requiring accurate billing, compliance verification, and accountability in regulated environments. We introduce a framework for auditing agent communications that keeps messages private while still checking they follow expected rules. It pairs zero-knowledge proofs with the existing Model Context Protocol (MCP) so messages can be verified without revealing their contents. The approach runs in lightweight networks, stays compatible with standard MCP exchanges, and adds asynchronous audit verification to confirm format and general message types without exposing specifics. The framework enables mutual audits between agents: one side can check communication content and quality while the other verifies usage metrics, all without revealing sensitive information. We formalize security goals and show that zk-MCP provides data authenticity and communication privacy, achieving efficient verification with negligible latency overhead. We fully implement the framework, including Circom-based zero-knowledge proof generation and an audit protocol integrated with MCP’s bidirectional channel, and, to our knowledge, this is the first privacy-preserving audit system for agent communications that offers verifiable mutual auditing without exposing message content or compromising agent privacy.
[AI-70] Semantic Geometry for policy-constrained interpretation
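[Quick Read]: This paper addresses hallucinated commitments that generative AI produces in high-stakes domains when it interprets semantics incorrectly. The key to its solution is a geometric framework for policy-constrained semantic interpretation: semantic meaning is represented as a direction on the unit sphere, evidence is modeled as a set of witness vectors, and admissible interpretations correspond to spherical convex regions; policy constraints are explicit priors defined on the same manifold and kept separate from the evidence geometry. Interpretation becomes constrained optimization over the admissible region, and refusal is the topologically necessary outcome under contradiction or policy exclusion. The paper connects the framework to information theory, Bayesian inference, and sheaf-theoretic semantics, and proves that its complexity bounds are information-theoretically optimal. Empirical validation on large-scale regulated financial data shows zero hallucinated approvals across multiple policy regimes, which the authors describe as the first such result at scale.
Link: https://arxiv.org/abs/2512.14731
Authors: Nikit Phadke
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: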
Abstract:We present a geometric framework for policy-constrained semantic interpretation that provably prevents hallucinated commitments in high-stakes domains. Semantic meaning is represented as direction on a unit sphere, evidence is modeled as sets of witness vectors, and admissible interpretations correspond to spherical convex regions. Policy constraints are introduced as explicit priors defined over the same manifold, separated from evidence geometry. Interpretation reduces to constrained optimization over admissible regions, with refusal emerging as a topologically necessary outcome under contradiction or policy exclusion. We connect this framework to information theory, Bayesian inference, and sheaf-theoretic semantics, proving that our complexity bounds are information-theoretically optimal. Empirical validation on large-scale regulated financial data demonstrates zero hallucinated approvals across multiple policy regimes, the first such result at scale.
[AI-71] A Critical Perspective on Finite Sample Conformal Prediction Theory in Medical Applications
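[Quick Read]: This paper addresses the following problem: machine learning (ML) models can support clinical decision-making in healthcare, but they lack the reliable uncertainty estimates that safe clinical decisions require. Conformal prediction (CP) turns heuristic uncertainty estimates into prediction sets with statistical guarantees, and its theory holds for calibration samples of any size; the paper argues, however, that the practical value of these guarantees depends strongly on the size of the calibration sample, which matters in medicine because data are scarce and large calibration sets are hard to obtain. The key contribution is to show and empirically confirm that, although CP's guarantees hold formally, they can lose their practical meaning when the calibration sample is too small, so the effect of calibration set size on the reliability of uncertainty estimates must be assessed carefully before medical AI is deployed.
Link: https://arxiv.org/abs/2512.14727
Authors: Klaus-Rudolf Kladny,Bernhard Schölkopf,Lisa Koch,Christian F. Baumgartner,Michael Muehlebach
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME)
Comments: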
Abstract:Machine learning (ML) is transforming healthcare, but safe clinical decisions demand reliable uncertainty estimates that standard ML models fail to provide. Conformal prediction (CP) is a popular tool that allows users to turn heuristic uncertainty estimates into uncertainty estimates with statistical guarantees. CP works by converting predictions of a ML model, together with a calibration sample, into prediction sets that are guaranteed to contain the true label with any desired probability. An often cited advantage is that CP theory holds for calibration samples of arbitrary size, suggesting that uncertainty estimates with practically meaningful statistical guarantees can be achieved even if only small calibration sets are available. We question this promise by showing that, although the statistical guarantees hold for calibration sets of arbitrary size, the practical utility of these guarantees does highly depend on the size of the calibration set. This observation is relevant in medical domains because data is often scarce and obtaining large calibration sets is therefore infeasible. We corroborate our critique in an empirical demonstration on a medical image classification task.
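One well-known way to see why calibration set size matters is the standard split-conformal result that, for exchangeable scores without ties, the coverage achieved by a given calibration set follows a Beta distribution whose spread shrinks as the set grows. The sketch below illustrates that textbook result; it is not necessarily the argument the authors make, and the function names are hypothetical.

```python
import math

def conformal_threshold(cal_scores, alpha):
    """Split-conformal threshold: the ceil((n+1)(1-alpha))-th smallest score."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points: only the trivial full set is valid.
        return math.inf
    return sorted(cal_scores)[k - 1]

def coverage_mean_std(n, alpha):
    """Mean and std of coverage conditional on one calibration set of size n.

    Assumes exchangeable, almost surely distinct scores, for which
    coverage ~ Beta(k, n + 1 - k) with k = ceil((n + 1)(1 - alpha)).
    """
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        raise ValueError("calibration set too small for this alpha")
    a, b = k, n + 1 - k
    mean = a / (a + b)
    std = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return mean, std

small_mean, small_std = coverage_mean_std(n=19, alpha=0.1)
large_mean, large_std = coverage_mean_std(n=999, alpha=0.1)
```

Both settings target 90% coverage on average, but with 19 calibration points the coverage realized by any one calibration set has a standard deviation of roughly 6.5 percentage points, against under 1 point with 999. The marginal guarantee holds in both cases, yet it says much less about the particular deployed model when the calibration set is small.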
[AI-72] Quantum Decision Transformers (QDT): Synergistic Entanglement and Interference for Offline Reinforcement Learning
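[Quick Read]: This paper addresses the difficulty that the Decision Transformer (DT) has with long-horizon credit assignment and with modeling complex state-action dependencies in offline reinforcement learning. The key to its solution is the quantum-inspired Quantum Decision Transformer (QDT), built from two core components: Quantum-Inspired Attention, which uses entanglement operations to capture non-local feature correlations and improve credit assignment, and Quantum Feedforward Networks, which achieve adaptive computation through multi-path parallel processing and learnable interference. The authors report that the two components together yield an improvement of more than 2,000% over standard DTs, while neither is competitive on its own, and conclude that effective quantum-inspired architectures require holistic co-design rather than modular stacking of components.
Link: https://arxiv.org/abs/2512.14726
Authors: Abraham Itzhak Weinberg
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: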
Abstract:Offline reinforcement learning enables policy learning from pre-collected datasets without environment interaction, but existing Decision Transformer (DT) architectures struggle with long-horizon credit assignment and complex state-action dependencies. We introduce the Quantum Decision Transformer (QDT), a novel architecture incorporating quantum-inspired computational mechanisms to address these challenges. Our approach integrates two core components: Quantum-Inspired Attention with entanglement operations that capture non-local feature correlations, and Quantum Feedforward Networks with multi-path processing and learnable interference for adaptive computation. Through comprehensive experiments on continuous control tasks, we demonstrate over 2,000% performance improvement compared to standard DTs, with superior generalization across varying data qualities. Critically, our ablation studies reveal strong synergistic effects between quantum-inspired components: neither alone achieves competitive performance, yet their combination produces dramatic improvements far exceeding individual contributions. This synergy demonstrates that effective quantum-inspired architecture design requires holistic co-design of interdependent mechanisms rather than modular component adoption. Our analysis identifies three key computational advantages: enhanced credit assignment through non-local correlations, implicit ensemble behavior via parallel processing, and adaptive resource allocation through learnable interference. These findings establish quantum-inspired design principles as a promising direction for advancing transformer architectures in sequential decision-making, with implications extending beyond reinforcement learning to neural architecture design more broadly.
[AI-73] Generative Urban Flow Modeling: From Geometry to Airflow with Graph Diffusion
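[Quick Read]: This paper addresses the difficulty of modeling urban wind fields accurately in the presence of complex urban geometry: low-order models cannot capture geometric effects, and high-fidelity Computational Fluid Dynamics (CFD) simulations are too expensive to run across many geometries or wind conditions. The key to its solution is a generative diffusion framework that synthesizes steady-state urban wind fields from geometry information alone and produces accurate, diverse velocity fields. Its core idea is to combine a hierarchical graph neural network with score-based diffusion modeling, which requires neither temporal rollouts nor dense measurements, recovers key flow structures such as wakes and recirculation zones, and provides uncertainty-aware predictions. The model generalizes to unseen geometries and offers urban planners a fast way to evaluate designs.
Link: https://arxiv.org/abs/2512.14725
Authors: Francisco Giral,Álvaro Manzano,Ignacio Gómez,Petros Koumoutsakos,Soledad Le Clainche
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: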
Abstract:Urban wind flow modeling and simulation play an important role in air quality assessment and sustainable city planning. A key challenge for modeling and simulation is handling the complex geometries of the urban landscape. Low order models are limited in capturing the effects of geometry, while high-fidelity Computational Fluid Dynamics (CFD) simulations are prohibitively expensive, especially across multiple geometries or wind conditions. Here, we propose a generative diffusion framework for synthesizing steady-state urban wind fields over unstructured meshes that requires only geometry information. The framework combines a hierarchical graph neural network with score-based diffusion modeling to generate accurate and diverse velocity fields without requiring temporal rollouts or dense measurements. Trained across multiple mesh slices and wind angles, the model generalizes to unseen geometries, recovers key flow structures such as wakes and recirculation zones, and offers uncertainty-aware predictions. Ablation studies confirm robustness to mesh variation and performance under different inference regimes. This work is a first step towards foundation models for the built environment that can help urban planners rapidly evaluate design decisions under densification and climate uncertainty.
[AI-74] HATSolver: Learning Groebner Bases with Hierarchical Attention Transformers
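[Quick Read]: This paper addresses the efficient computation of Gröbner bases for systems of multivariate polynomial equations with deep learning. The key to its solution is the Hierarchical Attention Transformer (HAT), whose tree-structured inductive bias models the hierarchical relationships in the data and gives substantial computational savings over conventional flat attention models; combined with curriculum learning, it handles larger problem instances.
Link: https://arxiv.org/abs/2512.14722
Authors: Mohamed Malhou,Ludovic Perret,Kristin Lauter
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: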
Abstract:At NeurIPS 2024, Kera et al. introduced the use of transformers for computing Groebner bases, a central object in computer algebra with numerous practical applications. In this paper, we improve this approach by applying Hierarchical Attention Transformers (HATs) to solve systems of multivariate polynomial equations via Groebner bases computation. The HAT architecture incorporates a tree-structured inductive bias that enables the modeling of hierarchical relationships present in the data and thus achieves significant computational savings compared to conventional flat attention models. We generalize to arbitrary depths and include a detailed computational cost analysis. Combined with curriculum learning, our method solves instances that are much larger than those in Kera et al. (2024, Learning to compute Groebner bases).
[AI-75] Hybrid Attribution Priors for Explainable and Robust Model Training
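[Quick Read]: This paper addresses the limited interpretability and robustness of small language models (SLMs) on classification tasks, which stems from attribution-based supervision signals that are not discriminative enough. Existing attribution methods reliably highlight class-relevant tokens, but they tend to focus on keywords shared by semantically similar classes and so give few cues for telling those classes apart. The key to its solution is a new attribution prior extraction framework, the Class-Aware Attribution Prior (CAP), which guides the model toward fine-grained class distinctions and yields more discriminative attribution priors. CAP Hybrid then combines the CAP priors with priors from existing attribution techniques into a more comprehensive and balanced supervision signal, which encourages the model to learn diverse, decision-relevant features and improves interpretability and robustness in full-data, few-shot, and adversarial scenarios.
Link: https://arxiv.org/abs/2512.14719
Authors: Zhuoran Zhang,Feng Zhang,Shangyuan Li,Yang Shi,Yuanxing Zhang,Wei Chen,Tengjiao Wang,Kam-Fai Wong
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 15 pages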
Abstract:Small language models (SLMs) are widely used in tasks that require low latency and lightweight deployment, particularly classification. As interpretability and robustness gain increasing importance, explanation-guided learning has emerged as an effective framework by introducing attribution-based supervision during training; however, deriving general and reliable attribution priors remains a significant challenge. Through an analysis of representative attribution methods in classification settings, we find that although these methods can reliably highlight class-relevant tokens, they often focus on common keywords shared by semantically similar classes. Because such classes are already difficult to distinguish under standard training, these attributions provide insufficient discriminative cues, limiting their ability to improve model differentiation. To overcome this limitation, we propose Class-Aware Attribution Prior (CAP), a novel attribution prior extraction framework that guides language models toward capturing fine-grained class distinctions and producing more salient, discriminative attribution priors. Building on this idea, we further introduce CAP Hybrid, which combines priors from CAP with those from existing attribution techniques to form a more comprehensive and balanced supervisory signal. By aligning a model’s self-attribution with these enriched priors, our approach encourages the learning of diverse, decision-relevant features. Extensive experiments in full-data, few-shot, and adversarial scenarios demonstrate that our method consistently enhances both interpretability and robustness.
[AI-76] SEED: Spectral Entropy-Guided Evaluation of SpatialTemporal Dependencies for Multivariate Time Series Forecasting
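[Quick Read]: This paper addresses the performance bottleneck in multivariate time series forecasting caused by inaccurate modeling of complex inter-variable dependencies. Existing attention- or graph-based methods have three shortcomings: (a) strong temporal self-dependencies are often disrupted by irrelevant variables; (b) softmax normalization ignores or even reverses negative correlations; (c) variables struggle to perceive their temporal positions. The key to the solution is the SEED framework, whose core innovations are: 1) a Dependency Evaluator that uses spectral entropy to dynamically assess the strength of each variable's spatial and temporal dependencies, adaptively balancing Channel Independence (CI) and Channel Dependence (CD) strategies; 2) a Spectral Entropy-based Fuser that separates out temporal regularities caused by other variables, refining the dependency weights; 3) a Signed Graph Constructor with signed edge weights that preserves negative correlations and overcomes the limitation of softmax; 4) a Context Spatial Extractor that uses local contextual windows to help variables perceive their temporal positions, yielding more comprehensive spatial features.
Link: https://arxiv.org/abs/2512.14718
Authors: Feng Xiong,Zongxia Xie,Yanru Sun,Haoyu Wang,Jianhong Lin
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: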
Abstract:Effective multivariate time series forecasting often benefits from accurately modeling complex inter-variable dependencies. However, existing attention- or graph-based methods face three key issues: (a) strong temporal self-dependencies are often disrupted by irrelevant variables; (b) softmax normalization ignores and reverses negative correlations; (c) variables struggle to perceive their temporal positions. To address these, we propose SEED, a Spectral Entropy-guided Evaluation framework for spatial-temporal Dependency modeling. SEED introduces a Dependency Evaluator, a key innovation that leverages spectral entropy to dynamically provide a preliminary evaluation of the spatial and temporal dependencies of each variable, enabling the model to adaptively balance Channel Independence (CI) and Channel Dependence (CD) strategies. To account for temporal regularities originating from the influence of other variables rather than intrinsic dynamics, we propose Spectral Entropy-based Fuser to further refine the evaluated dependency weights, effectively separating this part. Moreover, to preserve negative correlations, we introduce a Signed Graph Constructor that enables signed edge weights, overcoming the limitations of softmax. Finally, to help variables perceive their temporal positions and thereby construct more comprehensive spatial features, we introduce the Context Spatial Extractor, which leverages local contextual windows to extract spatial features. Extensive experiments on 12 real-world datasets from various application domains demonstrate that SEED achieves state-of-the-art performance, validating its effectiveness and generality.
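The abstract relies on spectral entropy but does not spell out a formula. As a rough illustration only, the sketch below computes the textbook normalized spectral entropy of a single series (Shannon entropy of its normalized power spectrum). The function name `spectral_entropy` and the choice to drop the zero-frequency bin are assumptions made for this example, not details taken from SEED.

```python
import numpy as np

def spectral_entropy(x: np.ndarray) -> float:
    """Normalized Shannon entropy of the power spectrum, in [0, 1].

    Illustrative textbook definition; not necessarily the variant used by SEED.
    """
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                       # remove the constant offset
    power = np.abs(np.fft.rfft(x)) ** 2    # one-sided power spectrum
    power = power[1:]                      # ignore the zero-frequency bin
    total = power.sum()
    if len(power) < 2 or total == 0:
        return 0.0                         # too short or constant series
    p = power / total                      # spectrum as a probability distribution
    p = p[p > 0]                           # skip empty bins (0 * log 0 := 0)
    return float(-(p * np.log(p)).sum() / np.log(len(power)))
```

A strongly periodic series concentrates its power in a few frequency bins and scores close to 0, while white noise spreads power across all bins and scores close to 1. This is the sense in which such a score could indicate how regular a variable's own temporal pattern is.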
[AI-77] How a Bit Becomes a Story: Semantic Steering via Differentiable Fault Injection
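[Quick Read]: This paper investigates how low-level bitwise perturbations (fault injection) to the weights of a large language model (LLM) used for image captioning affect the semantic meaning of generated descriptions while leaving grammatical structure intact. Conventional fault analysis focuses only on classifier crashes or accuracy degradation and overlooks the semantic and linguistic dimensions of generative systems. The key to the solution is a differentiable fault analysis framework, BLADE (Bit-level Fault Analysis via Differentiable Estimation), which uses gradient-based sensitivity estimation to locate semantically critical bits and then refines the bit selection with a caption-level semantic-fluency objective, revealing how a model's semantic encoding is distributed and can be manipulated at the bit level.
Link: https://arxiv.org/abs/2512.14715
Authors: Zafaryab Haider,Md Hafizur Rahman,Shane Moeykens,Vijay Devabhaktuni,Prabuddha Chakraborty
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: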
Abstract:Hard-to-detect hardware bit flips, from either malicious circuitry or bugs, have already been shown to make transformers vulnerable in non-generative tasks. This work, for the first time, investigates how low-level, bitwise perturbations (fault injection) to the weights of a large language model (LLM) used for image captioning can influence the semantic meaning of its generated descriptions while preserving grammatical structure. While prior fault analysis methods have shown that flipping a few bits can crash classifiers or degrade accuracy, these approaches overlook the semantic and linguistic dimensions of generative systems. In image captioning models, a single flipped bit might subtly alter how visual features map to words, shifting the entire narrative an AI tells about the world. We hypothesize that such semantic drifts are not random but differentiably estimable. That is, the model’s own gradients can predict which bits, if perturbed, will most strongly influence meaning while leaving syntax and fluency intact. We design a differentiable fault analysis framework, BLADE (Bit-level Fault Analysis via Differentiable Estimation), that uses gradient-based sensitivity estimation to locate semantically critical bits and then refines their selection through a caption-level semantic-fluency objective. Our goal is not merely to corrupt captions, but to understand how meaning itself is encoded, distributed, and alterable at the bit level, revealing that even imperceptible low-level changes can steer the high-level semantics of generative vision-language models. It also opens pathways for robustness testing, adversarial defense, and explainable AI, by exposing how structured bit-level faults can reshape a model’s semantic output.
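To make "a single flipped bit" concrete, the sketch below flips one bit in the IEEE 754 float32 encoding of a weight. It only illustrates the fault model. BLADE's gradient-based bit selection is not reproduced here, and the helper name `flip_bit` is hypothetical.

```python
import numpy as np

def flip_bit(weight: float, bit: int) -> float:
    """Return `weight` (stored as float32) with one bit of its encoding flipped.

    Bit 31 is the sign, bits 30-23 the exponent, bits 22-0 the mantissa.
    """
    if not 0 <= bit < 32:
        raise ValueError("bit must be in [0, 31]")
    # Reinterpret the float32 storage as an unsigned 32-bit integer.
    raw = np.array([weight], dtype=np.float32).view(np.uint32)
    raw ^= np.uint32(1 << bit)             # XOR toggles exactly one bit
    return float(raw.view(np.float32)[0])
```

The effect depends heavily on which bit is hit: flipping bit 0 of 1.0 changes it by 2^-23, flipping bit 31 turns it into -1.0, and flipping bit 30 turns it into infinity. That spread is why locating the few semantically critical bits is a meaningful search problem.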
[AI-78] Improving Underwater Acoustic Classification Through Learnable Gabor Filter Convolution and Attention Mechanisms
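[Quick Read]: This paper addresses the limited signal-processing accuracy in underwater acoustic target detection and classification caused by the complexity of ship-radiated and environmental noise, and in particular the poor generalization and robustness of models when data are scarce and experimental protocols are not standardized. The key to the solution is GSE ResNeXt, a deep learning architecture that adds learnable Gabor convolutional layers to a ResNeXt backbone and combines them with squeeze-and-excitation attention mechanisms, giving the model stronger feature extraction and more stable training. The Gabor filters act as two-dimensional adaptive band-pass filters that extend the feature channel representation; together with channel attention they improve classification performance and reduce training time by 28%, making the model more reliable and general across environmental conditions.
Link: https://arxiv.org/abs/2512.14714
Authors: Lucas Cesar Ferreira Domingos,Russell Brinkworth,Paulo Eduardo Santos,Karl Sammut
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments: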
Abstract:Remotely detecting and classifying underwater acoustic targets is critical for environmental monitoring and defence. However, the complex nature of ship-radiated and environmental underwater noise poses significant challenges to accurate signal processing. While recent advancements in machine learning have improved classification accuracy, issues such as limited dataset availability and a lack of standardised experimentation hinder generalisation and robustness. This paper introduces GSE ResNeXt, a deep learning architecture integrating learnable Gabor convolutional layers with a ResNeXt backbone enhanced by squeeze-and-excitation attention mechanisms. The Gabor filters serve as two-dimensional adaptive band-pass filters, extending the feature channel representation. Its combination with channel attention improves training stability and convergence while enhancing the model’s ability to extract discriminative features. The model is evaluated on three classification tasks of increasing complexity. In particular, the impact of temporal differences between the training and testing data is explored, revealing that the distance between the vessel and sensor significantly affects performance. Results show that, GSE ResNeXt consistently outperforms baseline models like Xception, ResNet, and MobileNetV2, in terms of classification performance. Regarding stability and convergence, the addition of Gabor convolutions in the initial layers of the model represents a 28% reduction in training time. These results emphasise the importance of signal processing strategies in improving the reliability and generalisation of models under different environmental conditions, especially in data-limited underwater acoustic classification scenarios. Future developments should focus on mitigating the impact of environmental factors on input signals.
[AI-79] Promoting Fairness in Information Access within Social Networks ICDE2026
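[Quick Read]: This paper addresses unfair information spread in online social networks, where some groups (especially members of minority groups) are less likely to receive information because of their disadvantaged network position. To improve fairness in information access across demographic groups, the authors optimize the addition of new connections to the network and measure information access by resistance distance, which emphasizes global network structure and multi-path connectivity. The problem is shown to be NP-hard. The authors propose a greedy algorithm, and their key contribution is a set of novel approximation techniques that reduce its time complexity from cubic to linear, making it scalable and practical for large networks with millions of nodes.
Link: https://arxiv.org/abs/2512.14711
Authors: Changan Liu,Xiaotian Zhou,Ahad N. Zehmakan,Zhongzhi Zhang
Affiliation: Unknown
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Comments: Accepted by ICDE 2026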
Abstract:The advent of online social networks has facilitated fast and wide spread of information. However, some users, especially members of minority groups, may be less likely to receive information spreading on the network, due to their disadvantaged network position. We study the optimization problem of adding new connections to a network to enhance fairness in information access among different demographic groups. We provide a concrete formulation of this problem where information access is measured in terms of resistance distance, offering a new perspective that emphasizes global network structure and multi-path connectivity. The problem is shown to be NP-hard. We propose a simple greedy algorithm which turns out to output accurate solutions, but its run time is cubic, which makes it undesirable for large networks. As our main technical contribution, we reduce its time complexity to linear, leveraging several novel approximation techniques. In addition to our theoretical findings, we also conduct an extensive set of experiments using both real-world and synthetic datasets. We demonstrate that our linear-time algorithm can produce accurate solutions for networks with millions of nodes.
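For readers unfamiliar with resistance distance, here is a minimal sketch that computes it for every node pair from the Moore-Penrose pseudoinverse of the graph Laplacian. It assumes a small, connected, undirected graph given as a dense adjacency matrix. The dense pseudoinverse costs cubic time, so this is not the paper's linear-time algorithm, and the function name is chosen for this example.

```python
import numpy as np

def resistance_distance(adj: np.ndarray) -> np.ndarray:
    """All-pairs resistance distance of a connected undirected graph.

    Uses R[i, j] = L+[i, i] + L+[j, j] - 2 * L+[i, j], where L+ is the
    pseudoinverse of the Laplacian L = D - A.
    """
    adj = np.asarray(adj, dtype=float)
    laplacian = np.diag(adj.sum(axis=1)) - adj
    lp = np.linalg.pinv(laplacian)          # dense and O(n^3): small graphs only
    diag = np.diag(lp)
    return diag[:, None] + diag[None, :] - 2.0 * lp
```

On the three-node path 0-1-2 the resistance distance between the end nodes is 2. Adding the edge 0-2 creates a second route and lowers it to 2/3, which shows how a new connection improves access under this multi-path measure.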
[AI-80] Autonomous Source Knowledge Selection in Multi-Domain Adaptation
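[Quick Read]: This paper addresses the degradation of transfer performance in multi-domain adaptation caused by redundant or unrelated source-domain information, especially in settings with a massive number of source domains. The key to the solution is AutoS (Autonomous Source Knowledge Selection), which uses a density-driven selection strategy to automatically choose the most relevant and transferable source samples during training and to determine which source models should contribute to target prediction. It also introduces a pseudo-label enhancement module built on a pre-trained multimodal model to mitigate target label noise and improve self-supervision.
Link: https://arxiv.org/abs/2512.14710
Authors: Keqiuyin Li,Jie Lu,Hua Zuo,Guangquan Zhang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: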
Abstract:Unsupervised multi-domain adaptation plays a key role in transfer learning by leveraging acquired rich source information from multiple source domains to solve target task from an unlabeled target domain. However, multiple source domains often contain much redundant or unrelated information which can harm transfer performance, especially when in massive-source domain settings. It is urgent to develop effective strategies for identifying and selecting the most transferable knowledge from massive source domains to address the target task. In this paper, we propose a multi-domain adaptation method named Autonomous Source Knowledge Selection (AutoS) to autonomously select source training samples and models, enabling the prediction of target task using more relevant and transferable source information. The proposed method employs a density-driven selection strategy to choose source samples during training and to determine which source models should contribute to target prediction. Simultaneously, a pseudo-label enhancement module built on a pre-trained multimodal model is employed to mitigate target label noise and improve self-supervision. Experiments on real-world datasets indicate the superiority of the proposed method.
[AI-81] Attention as Binding: A Vector-Symbolic Perspective on Transformer Reasoning AAAI2026
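[Quick Read]: This paper addresses the brittleness of Transformer-based language models on tasks that require stable symbolic manipulation, despite their impressive reasoning-like behavior. The key idea is to reinterpret self-attention and the residual stream as approximately implementing a Vector Symbolic Architecture (VSA): queries and keys define role spaces, values encode fillers, attention weights perform soft unbinding, and residual connections realize the superposition of many bound structures. This algebraic view relates model internals to chain-of-thought traces, program-based reasoning, and memory-augmented tool use, and explains characteristic failure modes such as variable confusion and inconsistency across logically related prompts. On this basis, the author proposes VSA-inspired architectural biases (such as explicit binding/unbinding heads and hyperdimensional memory layers) and training objectives that promote role-filler separation and robust superposition, offering a theoretical basis and a practical route toward more interpretable and logically reliable reasoning systems.
Link: https://arxiv.org/abs/2512.14709
Authors: Sahil Rajesh Dhayalkar
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 12 pages with references. Submitted to ‘Logical and Symbolic Reasoning in Language Models @ AAAI 2026’ conference and is under review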
Abstract:Transformer-based language models display impressive reasoning-like behavior, yet remain brittle on tasks that require stable symbolic manipulation. This paper develops a unified perspective on these phenomena by interpreting self-attention and residual streams as implementing an approximate Vector Symbolic Architecture (VSA). In this view, queries and keys define role spaces, values encode fillers, attention weights perform soft unbinding, and residual connections realize superposition of many bound structures. We use this algebraic lens to relate transformer internals to chain-of-thought traces, program-based reasoning, and memory-augmented tool use, and to explain characteristic failure modes such as variable confusion and inconsistency across logically related prompts. Building on this perspective, we propose VSA-inspired architectural biases, including explicit binding/unbinding heads and hyperdimensional memory layers, and training objectives that promote role-filler separation and robust superposition. Finally, we outline metrics for measuring “VSA-likeness” and logical compositionality, and pose theoretical and architectural open problems. Overall, the paper argues that viewing attention as soft vector-symbolic computation offers a principled route toward more interpretable and logically reliable reasoning systems.
[AI-82] Tourists Profiling by Interest Analysis
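[Quick Read]: This paper asks how to understand the dynamics governing tourist behavior more fully, especially those related to attraction networks. Prior studies mostly rely on quantitative analysis of digital traces and overlook the behavioral motivations and interaction patterns carried by their qualitative information. The key to the proposed solution is to combine the qualitative and quantitative dimensions by analyzing both the structural features of the digital traces tourists leave (e.g., trajectories and dwell times) and their semantic content (e.g., reviews and tags), in order to reveal tourists' choice logic, path evolution, and emotional associations within attraction networks.
Link: https://arxiv.org/abs/2512.14704
Authors: Sonia Djebali,Quentin Gabot,Guillaume Guerard
Affiliation: Unknown
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Comments: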
Abstract:With the recent digital revolution, analyzing of tourists’ behaviors and research fields associated with it have changed profoundly. It is now easier to examine behaviors of tourists using digital traces they leave during their travels. The studies conducted on diverse aspects of tourism focus on quantitative aspects of digital traces to reach its conclusions. In this paper, we suggest a study focused on both qualitative and quantitative aspect of digital traces to understand the dynamics governing tourist behavior, especially those concerning attractions networks.
[AI-83] Algorithmic Criminal Liability in Greenwashing: Comparing India United States and European Union
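[Quick Read]: This paper addresses the corporate sustainability governance problem posed by AI-powered greenwashing, in particular the shortcomings of current legal systems in attributing liability when false environmental claims are generated by algorithmic systems. The core issue is that existing fraud statutes predicate liability on human intent and are ill-equipped to handle algorithm-driven misleading disclosures, which leads to regulatory failure and a liability vacuum. The key to the solution is a hybrid liability framework that integrates algorithmic risk assessment with legal personhood constructs: it introduces strict liability models, strengthens corporate due diligence obligations for AI systems, and draws on transnational governance experience such as the EU Corporate Sustainability Due Diligence Directive (CSDDD), so that algorithmic opacity does not preclude liability enforcement.
Link: https://arxiv.org/abs/2512.12837
Authors: Sahibpreet Singh,Manjit Singh
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: Published in HPNLU Journal of Law, Business and Economics, Vol. 3, 2024, pp. 51-68. ISSN: 2584-0436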
Abstract:AI-powered greenwashing has emerged as an insidious challenge within corporate sustainability governance, exacerbating the opacity of environmental disclosures and subverting regulatory oversight. This study conducts a comparative legal analysis of criminal liability for AI-mediated greenwashing across India, the US, and the EU, exposing doctrinal lacunae in attributing culpability when deceptive claims originate from algorithmic systems. Existing statutes exhibit anthropocentric biases by predicating liability on demonstrable human intent, rendering them ill-equipped to address algorithmic deception. The research identifies a critical gap in jurisprudential adaptation, as prevailing fraud statutes remain antiquated vis-à-vis AI-generated misrepresentation. Utilising a doctrinal legal methodology, this study systematically dissects judicial precedents and statutory instruments, yielding results regarding the potential expansion of corporate criminal liability. Findings underscore the viability of strict liability models, recalibrated governance frameworks for AI accountability, and algorithmic due diligence mandates under ESG regimes. Comparative insights reveal jurisdictional disparities, with the EU Corporate Sustainability Due Diligence Directive (CSDDD) offering a potential transnational model. This study contributes to AI ethics and environmental jurisprudence by advocating for a hybrid liability framework integrating algorithmic risk assessment with legal personhood constructs, ensuring algorithmic opacity does not preclude liability enforcement.
[AI-84] QoS-Aware Hierarchical Reinforcement Learning for Joint Link Selection and Trajectory Optimization in SAGIN-Supported UAV Mobility Management
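[Quick Read]: This paper addresses the difficulty of providing continuous and reliable three-dimensional coverage for unmanned aerial vehicles (UAVs) in a space-air-ground integrated network (SAGIN), given the large variations in UAV altitude and horizontal mobility. The core challenge is that coverage and signal characteristics differ markedly across heterogeneous networks, while discrete link selection and continuous trajectory planning must be optimized jointly. The key to the solution is a two-level multi-agent hierarchical deep reinforcement learning (HDRL) framework. The top level uses a double deep Q-network (DDQN) to map complex link selection decisions into a compact discrete action space and achieves stable, high-quality policy learning through double Q-value estimation. The lower level combines the maximum-entropy mechanism of the soft actor-critic (SAC) with a Lagrangian-based constrained SAC (CSAC) to optimize continuous trajectories under quality of service (QoS) constraints, dynamically adjusting the Lagrange multipliers to balance constraint satisfaction and policy optimization. The method extends to multi-UAV scenarios under the centralized training and decentralized execution (CTDE) paradigm, which enables more generalizable policies.
Link: https://arxiv.org/abs/2512.15119
Authors: Jiayang Wan,Ke He,Yafei Wang,Fan Liu,Wenjin Wang,Shi Jin
Affiliation: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments: This work has been submitted to the IEEE for possible publication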
Abstract:Due to the significant variations in unmanned aerial vehicle (UAV) altitude and horizontal mobility, it becomes difficult for any single network to ensure continuous and reliable three-dimensional coverage. Towards that end, the space-air-ground integrated network (SAGIN) has emerged as an essential architecture for enabling ubiquitous UAV connectivity. To address the pronounced disparities in coverage and signal characteristics across heterogeneous networks, this paper formulates UAV mobility management in SAGIN as a constrained multi-objective joint optimization problem. The formulation couples discrete link selection with continuous trajectory optimization. Building on this, we propose a two-level multi-agent hierarchical deep reinforcement learning (HDRL) framework that decomposes the problem into two alternately solvable subproblems. To map complex link selection decisions into a compact discrete action space, we conceive a double deep Q-network (DDQN) algorithm in the top-level, which achieves stable and high-quality policy learning through double Q-value estimation. To handle the continuous trajectory action space while satisfying quality of service (QoS) constraints, we integrate the maximum-entropy mechanism of the soft actor-critic (SAC) and employ a Lagrangian-based constrained SAC (CSAC) algorithm in the lower-level that dynamically adjusts the Lagrange multipliers to balance constraint satisfaction and policy optimization. Moreover, the proposed algorithm can be extended to multi-UAV scenarios under the centralized training and decentralized execution (CTDE) paradigm, which enables more generalizable policies. Simulation results demonstrate that the proposed scheme substantially outperforms existing benchmarks in throughput, link switching frequency and QoS satisfaction.
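The "double Q-value estimation" mentioned for the top-level DDQN can be sketched as the standard Double DQN target: the online network picks the next action and the target network scores it. The code below is that generic target on plain arrays, not the paper's implementation. The argument names, array shapes, and default discount factor are assumptions for illustration.

```python
import numpy as np

def double_dqn_target(reward, done, q_online_next, q_target_next, gamma=0.99):
    """Standard Double DQN bootstrap target for a batch of transitions.

    q_online_next, q_target_next: arrays of shape (batch, num_actions) holding
    next-state Q-values from the online and target networks respectively.
    """
    q_online_next = np.asarray(q_online_next, dtype=float)
    q_target_next = np.asarray(q_target_next, dtype=float)
    reward = np.asarray(reward, dtype=float)
    done = np.asarray(done, dtype=float)
    # Action selection uses the online network...
    best_action = q_online_next.argmax(axis=1)
    # ...while action evaluation uses the target network.
    bootstrap = q_target_next[np.arange(len(best_action)), best_action]
    # Terminal transitions (done == 1) do not bootstrap.
    return reward + gamma * (1.0 - done) * bootstrap
```

Separating selection from evaluation reduces the overestimation that occurs when a single network both chooses and scores the maximizing action, which is the usual motivation for the stability the abstract refers to.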
[AI-85] Restless Multi-Process Multi-Armed Bandits with Applications to Self-Driving Microscopies
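[Quick Read]: This paper addresses the inefficient use of resources in high-content screening microscopy caused by the inability to determine when and where to image, and in particular the challenge of balancing acquisition time, computational capacity, and photobleaching budgets across thousands of dynamically evolving regions of interest (ROIs). Existing methods rely on static sampling or heuristics that neglect the dynamics of biological processes, leading to inefficiency and missed events. The key to the solution is a new decision-theoretic framework, the restless multi-process multi-armed bandit (RMPMAB), which models each experimental region as an ensemble of Markov chains and thereby captures the heterogeneity of biological systems, such as asynchronous cell cycles and heterogeneous drug responses. On this basis, the authors derive closed-form expressions for the transient and asymptotic behavior of the aggregated processes and design Whittle index policies with sub-linear complexity, which substantially improve imaging throughput and event capture under resource constraints.
Link: https://arxiv.org/abs/2512.14930
Authors: Jaume Anguera Peris,Songtao Cheng,Hanzhao Zhang,Wei Ouyang,Joakim Jaldén
Affiliation: Unknown
Categories: Applications (stat.AP); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: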
Abstract:High-content screening microscopy generates large amounts of live-cell imaging data, yet its potential remains constrained by the inability to determine when and where to image most effectively. Optimally balancing acquisition time, computational capacity, and photobleaching budgets across thousands of dynamically evolving regions of interest remains an open challenge, further complicated by limited field-of-view adjustments and sensor sensitivity. Existing approaches either rely on static sampling or heuristics that neglect the dynamic evolution of biological processes, leading to inefficiencies and missed events. Here, we introduce the restless multi-process multi-armed bandit (RMPMAB), a new decision-theoretic framework in which each experimental region is modeled not as a single process but as an ensemble of Markov chains, thereby capturing the inherent heterogeneity of biological systems such as asynchronous cell cycles and heterogeneous drug responses. Building upon this foundation, we derive closed-form expressions for transient and asymptotic behaviors of aggregated processes, and design scalable Whittle index policies with sub-linear complexity in the number of imaging regions. Through both simulations and a real biological live-cell imaging dataset, we show that our approach achieves substantial improvements in throughput under resource constraints. Notably, our algorithm outperforms Thompson Sampling, Bayesian UCB, epsilon-Greedy, and Round Robin by reducing cumulative regret by more than 37% in simulations and capturing 93% more biologically relevant events in live imaging experiments, underscoring its potential for transformative smart microscopy. Beyond improving experimental efficiency, the RMPMAB framework unifies stochastic decision theory with optimal autonomous microscopy control, offering a principled approach to accelerate discovery across multidisciplinary sciences.
[AI-86] Scaling Causal Mediation for Complex Systems: A Framework for Root Cause Analysis
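[Quick Read]: This paper addresses the difficulty of decomposing causal effects in complex operational systems (such as logistics, cloud infrastructure, and industrial IoT) where multiple treatments and mediators interact; traditional mediation analysis does not scale to high-dimensional directed acyclic graphs (DAGs). The key to the solution is a scalable mediation analysis framework that systematically decomposes total effects into interpretable direct and indirect components, enabling quantitative analysis of how interventions propagate through large causal DAGs.
Link: https://arxiv.org/abs/2512.14764
Authors: Alessandro Casadei,Sreyoshi Bhaduri,Rohit Malshe,Pavan Mullapudi,Raj Ratan,Ankush Pole,Arkajit Rakshit
Affiliation: Unknown
Categories: Methodology (stat.ME); Artificial Intelligence (cs.AI); Econometrics (econ.EM)
Comments: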
Abstract:Modern operational systems ranging from logistics and cloud infrastructure to industrial IoT, are governed by complex, interdependent processes. Understanding how interventions propagate through such systems requires causal inference methods that go beyond direct effects to quantify mediated pathways. Traditional mediation analysis, while effective in simple settings, fails to scale to the high-dimensional directed acyclic graphs (DAGs) encountered in practice, particularly when multiple treatments and mediators interact. In this paper, we propose a scalable mediation analysis framework tailored for large causal DAGs involving multiple treatments and mediators. Our approach systematically decomposes total effects into interpretable direct and indirect components. We demonstrate its practical utility through applied case studies in fulfillment center logistics, where complex dependencies and non-controllable factors often obscure root causes.
[AI-87] Multiscale Cross-Modal Mapping of Molecular Pathologic and Radiologic Phenotypes in Lipid-Deficient Clear Cell Renal Cell Carcinoma
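[Quick Read]: This paper addresses the limited effectiveness of conventional TNM staging for clear cell renal cell carcinoma (ccRCC) due to multiscale tumor heterogeneity, with a focus on the clinical challenge that the lipid-deficient de-clear cell differentiated (DCCD) ccRCC subtype is associated with adverse outcomes even in early-stage disease. The key to the solution is a hierarchical cross-scale analytic framework in which cross-modal mapping transfers molecular signatures to histological and CT imaging phenotypes, establishing a molecular-to-pathology-to-radiology supervisory bridge. Within it, the PathoDCCD model captures multi-scale microscopic features from cellular morphology to meso-regional tissue organization, while RadioDCCD combines radiomics of the whole tumor and its habitat subregions with a 2D maximal-section heterogeneity metric. Together they support prediction of the DCCD subtype and preoperative noninvasive molecular phenotyping, improving clinical risk stratification.
Link: https://arxiv.org/abs/2512.14750
Authors: Ying Cui,Dongzhe Zheng,Ke Yu,Xiyin Zheng,Xiaorui Wang,Xinxiang Li,Yan Gu,Lin Fu,Xinyi Chen,Wenjie Mei,Xin-Gui Peng
Affiliation: Unknown
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
Comments: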
Abstract:Clear cell renal cell carcinoma (ccRCC) exhibits extensive intratumoral heterogeneity on multiple biological scales, contributing to variable clinical outcomes and limiting the effectiveness of conventional TNM staging, which highlights the urgent need for multiscale integrative analytic frameworks. The lipid-deficient de-clear cell differentiated (DCCD) ccRCC subtype, defined by multi-omics analyses, is associated with adverse outcomes even in early-stage disease. Here, we establish a hierarchical cross-scale framework for the preoperative identification of DCCD-ccRCC. At the highest layer, cross-modal mapping transferred molecular signatures to histological and CT phenotypes, establishing a molecular-to-pathology-to-radiology supervisory bridge. Within this framework, each modality-specific model is designed to mirror the inherent hierarchical structure of tumor biology. PathoDCCD captured multi-scale microscopic features, from cellular morphology and tissue architecture to meso-regional organization. RadioDCCD integrated complementary macroscopic information by combining whole-tumor and its habitat-subregions radiomics with a 2D maximal-section heterogeneity metric. These nested models enabled integrated molecular subtype prediction and clinical risk stratification. Across five cohorts totaling 1,659 patients, PathoDCCD reliably recapitulated molecular subtypes, while RadioDCCD provided reliable preoperative prediction. The consistent predictions identified patients with the poorest clinical outcomes. This cross-scale paradigm unifies molecular biology, computational pathology, and quantitative radiology into a biologically grounded strategy for preoperative noninvasive molecular phenotyping of ccRCC.
[AI-88] VERAFI: Verified Agentic Financial Intelligence through Neurosymbolic Policy Generation
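[Quick Read]: This paper addresses a critical blind spot in financial AI systems: even when Retrieval-Augmented Generation (RAG) retrieves the relevant documents accurately, language models still produce calculation errors and regulatory compliance violations during reasoning. To address this, the authors propose VERAFI (Verified Agentic Financial Intelligence), whose core is a neurosymbolic policy generation mechanism that embeds financial domain expertise (such as GAAP compliance and SEC requirements) and mathematical validation into the agentic reasoning process. The framework combines dense retrieval and cross-encoder reranking with financial tool-enabled agents and automated reasoning policies, raising factual correctness on FinanceBench from 52.4% to 94.7%, a substantial gain over dense retrieval with reranking alone.
Link: https://arxiv.org/abs/2512.14744
Authors: Adewale Akinfaderin,Shreyas Subramanian
Affiliation: Unknown
Categories: Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI)
Comments: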
Abstract:Financial AI systems suffer from a critical blind spot: while Retrieval-Augmented Generation (RAG) excels at finding relevant documents, language models still generate calculation errors and regulatory violations during reasoning, even with perfect retrieval. This paper introduces VERAFI (Verified Agentic Financial Intelligence), an agentic framework with neurosymbolic policy generation for verified financial intelligence. VERAFI combines state-of-the-art dense retrieval and cross-encoder reranking with financial tool-enabled agents and automated reasoning policies covering GAAP compliance, SEC requirements, and mathematical validation. Our comprehensive evaluation on FinanceBench demonstrates remarkable improvements: while traditional dense retrieval with reranking achieves only 52.4% factual correctness, VERAFI’s integrated approach reaches 94.7%, an 81% relative improvement. The neurosymbolic policy layer alone contributes a 4.3 percentage point gain over pure agentic processing, specifically targeting persistent mathematical and logical errors. By integrating financial domain expertise directly into the reasoning process, VERAFI offers a practical pathway toward trustworthy financial AI that meets the stringent accuracy demands of regulatory compliance, investment decisions, and risk management.
Machine Learning
[LG-0] Learning Model Parameter Dynamics in a Combination Therapy for Bladder Cancer from Sparse Biological Data ALT NEURIPS2025
Link: https://arxiv.org/abs/2512.15706
Authors: Kayode Olumoyin,Lamees El Naqa,Katarzyna Rejniak
Categories: Machine Learning (cs.LG); Cell Behavior (q-bio.CB)
Comments: NeurIPS 2025 Workshop on Learning from Time Series for Health
Abstract:In a mathematical model of interacting biological organisms, where external interventions may alter behavior over time, traditional models that assume fixed parameters usually do not capture the evolving dynamics. In oncology, this is further exacerbated by the fact that experimental data are often sparse and sometimes are composed of a few time points of tumor volume. In this paper, we propose to learn time-varying interactions between cells, such as those of bladder cancer tumors and immune cells, and their response to a combination of anticancer treatments in a limited data scenario. We employ the physics-informed neural network (PINN) approach to predict possible subpopulation trajectories at time points where no observed data are available. We demonstrate that our approach is consistent with the biological explanation of subpopulation trajectories. Our method provides a framework for learning evolving interactions among biological organisms when external interventions are applied to their environment.
[LG-1] Dynamic Rebatching for Efficient Early-Exit Inference with DREX
Link: https://arxiv.org/abs/2512.15705
Authors: Xuting Liu,Daniel Alexander,Siva Kesava Reddy Kakarla,Behnaz Arzani,Vincent Liu
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments:
Abstract:Early-Exit (EE) is a Large Language Model (LLM) architecture that accelerates inference by allowing easier tokens to be generated using only a subset of the model’s layers. However, traditional batching frameworks are ill-suited for EE LLMs, as not all requests in a batch may be ready to exit at the same time. Existing solutions either force a uniform decision on the batch, which overlooks EE opportunities, or degrade output quality by forcing premature exits. We propose Dynamic Rebatching, a solution where we dynamically reorganize the batch at each early-exit point. Requests that meet the exit criteria are immediately processed, while those that continue are held in a buffer, re-grouped into a new batch, and forwarded to deeper layers. We introduce DREX, an early-exit inference system that implements Dynamic Rebatching with two key optimizations: 1) a copy-free rebatching buffer that avoids physical data movement, and 2) an EE and SLA-aware scheduler that analytically predicts whether a given rebatching operation will be profitable. DREX also efficiently handles the missing KV cache from skipped layers using memory-efficient state-copying. Our evaluation shows that DREX improves throughput by 2-12% compared to baseline approaches while maintaining output quality. Crucially, DREX completely eliminates involuntary exits, providing a key guarantee for preserving the output quality intended by the EE model.
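The rebatching idea in this abstract can be sketched in a few lines: at an early-exit point the batch is split, exiting requests leave, and continuing requests wait in a buffer until a full batch can be forwarded to deeper layers. Everything below is a toy illustration under assumptions. The confidence-threshold exit rule, the fixed batch size, and the names `split_at_exit` and `RebatchBuffer` are hypothetical, and DREX's copy-free buffer and SLA-aware scheduler are not modeled.

```python
from typing import Dict, List, Tuple

def split_at_exit(confidence: Dict[str, float], threshold: float) -> Tuple[List[str], List[str]]:
    """Partition request ids into those that exit now and those that continue.

    Assumes a simple confidence-threshold exit criterion (an illustration only).
    """
    exiting = [r for r, c in confidence.items() if c >= threshold]
    continuing = [r for r, c in confidence.items() if c < threshold]
    return exiting, continuing

class RebatchBuffer:
    """Holds continuing requests until a full batch can go to deeper layers."""

    def __init__(self, batch_size: int) -> None:
        self.batch_size = batch_size
        self.pending: List[str] = []

    def add(self, request_ids: List[str]) -> List[List[str]]:
        """Queue requests and return any full batches that are now ready."""
        self.pending.extend(request_ids)
        ready: List[List[str]] = []
        while len(self.pending) >= self.batch_size:
            ready.append(self.pending[: self.batch_size])
            self.pending = self.pending[self.batch_size :]
        return ready
```

In this toy version no request is ever forced to exit early or forced to continue: each one follows its own exit decision, and the only cost is the wait in the buffer. Whether that wait pays off is the kind of trade-off a scheduler such as the one described in the abstract would have to predict.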
[LG-2] FrontierCS: Evolving Challenges for Evolving Intelligence
Link: https://arxiv.org/abs/2512.15699
Authors: Qiuyang Mang,Wenhao Chai,Zhifei Li,Huanzhi Mao,Shang Zhou,Alexander Du,Hanchen Li,Shu Liu,Edwin Chen,Yichuan Wang,Xieting Chu,Zerui Cheng,Yuan Xu,Tian Xia,Zirui Wang,Tianneng Shi,Jianzhu Yao,Yilong Zhao,Qizheng Zhang,Charlie Ruan,Zeyu Shen,Kaiyuan Liu,Runyuan He,Dong Xing,Zerui Li,Zirong Zeng,Yige Jiang,Lufeng Cheng,Ziyi Zhao,Youran Sun,Wesley Zheng,Meiyuwang Zhang,Ruyi Ji,Xuechang Tu,Zihan Zheng,Zexing Chen,Kangyang Zhou,Zhaozi Wang,Jingbang Chen,Aleksandra Korolova,Peter Henderson,Pramod Viswanath,Vijay Ganesh,Saining Xie,Zhuang Liu,Dawn Song,Sewon Min,Ion Stoica,Joseph E. Gonzalez,Jingbo Shang,Alvin Cheung
Categories: Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: Code with instruction: this https URL
Abstract:We introduce FrontierCS, a benchmark of 156 open-ended problems across diverse areas of computer science, designed and reviewed by experts, including CS PhDs and top-tier competitive programming participants and problem setters. Unlike existing benchmarks that focus on tasks with known optimal solutions, FrontierCS targets problems where the optimal solution is unknown, but the quality of a solution can be objectively evaluated. Models solve these tasks by implementing executable programs rather than outputting a direct answer. FrontierCS includes algorithmic problems, which are often NP-hard variants of competitive programming problems with objective partial scoring, and research problems with the same property. For each problem we provide an expert reference solution and an automatic evaluator. Combining open-ended design, measurable progress, and expert curation, FrontierCS provides a benchmark at the frontier of computer-science difficulty. Empirically, we find that frontier reasoning models still lag far behind human experts on both the algorithmic and research tracks, that increasing reasoning budgets alone does not close this gap, and that models often over-optimize for generating merely workable code instead of discovering high-quality algorithms and system designs.
[LG-3] Multi-Modal Semantic Communication
链接: https://arxiv.org/abs/2512.15691
作者: Matin Mortaheb,Erciyes Karakaya,Sennur Ulukus
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Signal Processing (eess.SP); Systems and Control (eess.SY)
*备注:
Abstract:Semantic communication aims to transmit information most relevant to a task rather than raw data, offering significant gains in communication efficiency for applications such as telepresence, augmented reality, and remote sensing. Recent transformer-based approaches have used self-attention maps to identify informative regions within images, but they often struggle in complex scenes with multiple objects, where self-attention lacks explicit task guidance. To address this, we propose a novel Multi-Modal Semantic Communication framework that integrates text-based user queries to guide the information extraction process. Our proposed system employs a cross-modal attention mechanism that fuses visual features with language embeddings to produce soft relevance scores over the visual data. Based on these scores and the instantaneous channel bandwidth, we use an algorithm to transmit image patches at adaptive resolutions using independently trained encoder-decoder pairs, with total bitrate matching the channel capacity. At the receiver, the patches are reconstructed and combined to preserve task-critical information. This flexible and goal-driven design enables efficient semantic communication in complex and bandwidth-constrained environments.
[LG-4] A Multivariate Statistical Framework for Detection, Classification, and Pre-localization of Anomalies in Water Distribution Networks
链接: https://arxiv.org/abs/2512.15685
作者: Oleg Melnikov,Yurii Dorofieiev,Yurii Shakhnovskiy,Huy Truong,Victoria Degeler
类目: Machine Learning (cs.LG)
*备注: 48 pages, 18 figures, 3 tables
Abstract:This paper presents a unified framework for the detection, classification, and preliminary localization of anomalies in water distribution networks using multivariate statistical analysis. The approach, termed SICAMS (Statistical Identification and Classification of Anomalies in Mahalanobis Space), processes heterogeneous pressure and flow sensor data through a whitening transformation to eliminate spatial correlations among measurements. Based on the transformed data, Hotelling’s T^2 statistic is constructed, enabling the formulation of anomaly detection as a statistical hypothesis test of network conformity to normal operating conditions. It is shown that Hotelling’s T^2 statistic can serve as an integral indicator of the overall “health” of the system, exhibiting correlation with total leakage volume, and thereby enabling approximate estimation of water losses via a regression model. A heuristic algorithm is developed to analyze the T^2 time series and classify detected anomalies into abrupt leaks, incipient leaks, and sensor malfunctions. Furthermore, a coarse leak localization method is proposed, which ranks sensors according to their statistical contribution and employs Laplacian interpolation to approximate the affected region within the network. Application of the proposed framework to the BattLeDIM L-Town benchmark dataset demonstrates high sensitivity and reliability in leak detection, maintaining robust performance even under multiple leaks. These capabilities make the method applicable to real-world operational environments without the need for a calibrated hydraulic model.
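The whitening step and the T^2 statistic described in this abstract can be illustrated with a minimal sketch. It is not the authors' SICAMS implementation: the synthetic four-sensor data, the function names `fit_whitening` and `hotelling_t2`, and the size of the injected shift are all assumptions made for illustration.

```python
import numpy as np

def fit_whitening(X_ref):
    """Estimate the mean and a whitening matrix from reference (normal-operation) data."""
    mu = X_ref.mean(axis=0)
    cov = np.cov(X_ref, rowvar=False)
    # Eigen-decomposition of the covariance; W = Lambda^{-1/2} V^T decorrelates the data.
    eigvals, eigvecs = np.linalg.eigh(cov)
    W = (eigvecs / np.sqrt(eigvals)).T
    return mu, W

def hotelling_t2(X, mu, W):
    """T^2 of each row: squared Euclidean norm of the whitened, centered sample."""
    Z = (X - mu) @ W.T
    return np.sum(Z**2, axis=1)

rng = np.random.default_rng(0)
# Synthetic correlated "sensor" readings (assumption: 4 sensors, Gaussian noise).
A = rng.normal(size=(4, 4))
X_ref = rng.normal(size=(2000, 4)) @ A.T
mu, W = fit_whitening(X_ref)

x_normal = X_ref[:5]
x_anomalous = X_ref[:5] + 10.0 * A[:, 0]  # hypothetical shift along one mixing direction
t2_normal = hotelling_t2(x_normal, mu, W)
t2_anomalous = hotelling_t2(x_anomalous, mu, W)
```

In whitened coordinates T^2 is simply a squared distance from the reference mean, so a threshold on it acts as the hypothesis test the abstract mentions; how the threshold is chosen is left open here.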
[LG-5] Behavior Tokens Speak Louder: Disentangled Explainable Recommendation with Behavior Vocabulary AAAI2026
链接: https://arxiv.org/abs/2512.15614
作者: Xinshun Feng,Mingzhe Liu,Yi Qiao,Tongyu Zhu,Leilei Sun,Shuai Wang
类目: Machine Learning (cs.LG)
*备注: accepted by AAAI 2026
Abstract:Recent advances in explainable recommendations have explored the integration of language models to analyze natural language rationales for user-item interactions. Despite their potential, existing methods often rely on ID-based representations that obscure semantic meaning and impose structural constraints on language models, thereby limiting their applicability in open-ended scenarios. These challenges are intensified by the complex nature of real-world interactions, where diverse user intents are entangled and collaborative signals rarely align with linguistic semantics. To overcome these limitations, we propose BEAT, a unified and transferable framework that tokenizes user and item behaviors into discrete, interpretable sequences. We construct a behavior vocabulary via a vector-quantized autoencoding process that disentangles macro-level interests and micro-level intentions from graph-based representations. We then introduce multi-level semantic supervision to bridge the gap between behavioral signals and language space. A semantic alignment regularization mechanism is designed to embed behavior tokens directly into the input space of frozen language models. Experiments on three public datasets show that BEAT improves zero-shot recommendation performance while generating coherent and informative explanations. Further analysis demonstrates that our behavior tokens capture fine-grained semantics and offer a plug-and-play interface for integrating complex behavior patterns into large language models.
[LG-6] Autoregressive Language Models are Secretly Energy-Based Models: Insights into the Lookahead Capabilities of Next-Token Prediction
链接: https://arxiv.org/abs/2512.15605
作者: Mathieu Blondel,Michael E. Sander,Germain Vivier-Ardisson,Tianlin Liu,Vincent Roulet
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Autoregressive models (ARMs) currently constitute the dominant paradigm for large language models (LLMs). Energy-based models (EBMs) represent another class of models, which have historically been less prevalent in LLM development, yet naturally characterize the optimal policy in post-training alignment. In this paper, we provide a unified view of these two model classes. Taking the chain rule of probability as a starting point, we establish an explicit bijection between ARMs and EBMs in function space, which we show to correspond to a special case of the soft Bellman equation in maximum entropy reinforcement learning. Building upon this bijection, we derive the equivalence between supervised learning of ARMs and EBMs. Furthermore, we analyze the distillation of EBMs into ARMs by providing theoretical error bounds. Our results provide insights into the ability of ARMs to plan ahead, despite being based on the next-token prediction paradigm.
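As a small illustration of the chain-rule starting point mentioned in this abstract, the sketch below builds a toy autoregressive model and defines a sequence-level energy as minus the sum of next-token log-probabilities. It shows only the easy direction (an ARM induces an energy whose partition function is 1); it does not reproduce the paper's bijection or its link to the soft Bellman equation. The vocabulary size, sequence length, and random logit tables are assumptions.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
V, T = 3, 3  # hypothetical vocabulary size and sequence length

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Toy autoregressive model: one vector of next-token logits per prefix.
logits = {prefix: rng.normal(size=V)
          for t in range(T)
          for prefix in itertools.product(range(V), repeat=t)}

def energy(seq):
    """Sequence-level energy: minus the sum of next-token log-probabilities."""
    return -sum(np.log(softmax(logits[seq[:t]])[seq[t]]) for t in range(len(seq)))

sequences = list(itertools.product(range(V), repeat=T))
unnormalized = np.array([np.exp(-energy(s)) for s in sequences])
partition = unnormalized.sum()  # sums over all V**T sequences
```

Because every conditional is normalized, `exp(-energy)` is already a valid distribution over full sequences, which is why `partition` comes out as 1 up to floating-point error.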
[LG-7] Corrective Diffusion Language Models
链接: https://arxiv.org/abs/2512.15596
作者: Shuibai Zhang,Fred Zhangzhi Peng,Yiheng Zhang,Jin Pan,Grigorios G. Chrysos
类目: Machine Learning (cs.LG)
*备注: 18 pages
Abstract:Diffusion language models are structurally well-suited for iterative error correction, as their non-causal denoising dynamics allow arbitrary positions in a sequence to be revised. However, standard masked diffusion language model (MDLM) training fails to reliably induce this behavior, as models often cannot identify unreliable tokens in a complete input, rendering confidence-guided refinement ineffective. We study corrective behavior in diffusion language models, defined as the ability to assign lower confidence to incorrect tokens and iteratively refine them while preserving correct content. We show that this capability is not induced by conventional masked diffusion objectives and propose a correction-oriented post-training principle that explicitly supervises visible incorrect tokens, enabling error-aware confidence and targeted refinement. To evaluate corrective behavior, we introduce the Code Revision Benchmark (CRB), a controllable and executable benchmark for assessing error localization and in-place correction. Experiments on code revision tasks and controlled settings demonstrate that models trained with our approach substantially outperform standard MDLMs in correction scenarios, while also improving pure completion performance. Our code is publicly available at this https URL.
[LG-8] Joint Learning of Unsupervised Multi-view Feature and Instance Co-selection with Cross-view Imputation
链接: https://arxiv.org/abs/2512.15574
作者: Yuxin Cai,Yanyong Huang,Jinyuan Chang,Dongjie Wang,Tianrui Li,Xiaoyi Jiang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Feature and instance co-selection, which aims to reduce both feature dimensionality and sample size by identifying the most informative features and instances, has attracted considerable attention in recent years. However, when dealing with unlabeled incomplete multi-view data, where some samples are missing in certain views, existing methods typically first impute the missing data and then concatenate all views into a single dataset for subsequent co-selection. Such a strategy treats co-selection and missing data imputation as two independent processes, overlooking potential interactions between them. The inter-sample relationships gleaned from co-selection can aid imputation, which in turn enhances co-selection performance. Additionally, simply merging multi-view data fails to capture the complementary information among views, ultimately limiting co-selection effectiveness. To address these issues, we propose a novel co-selection method, termed Joint learning of Unsupervised multI-view feature and instance Co-selection with cross-viEw imputation (JUICE). JUICE first reconstructs incomplete multi-view data using available observations, bringing missing data recovery and feature and instance co-selection together in a unified framework. Then, JUICE leverages cross-view neighborhood information to learn inter-sample relationships and further refine the imputation of missing values during reconstruction. This enables the selection of more representative features and instances. Extensive experiments demonstrate that JUICE outperforms state-of-the-art methods.
[LG-9] Robustness and uncertainty: two complementary aspects of the reliability of the predictions of a classifier
链接: https://arxiv.org/abs/2512.15492
作者: Adrián Detavernier,Jasper De Bock
类目: Machine Learning (cs.LG)
*备注: workshop paper (not published)
Abstract:We consider two conceptually different approaches for assessing the reliability of the individual predictions of a classifier: Robustness Quantification (RQ) and Uncertainty Quantification (UQ). We compare both approaches on a number of benchmark datasets and show that there is no clear winner between the two, but that they are complementary and can be combined to obtain a hybrid approach that outperforms both RQ and UQ. As a byproduct of our approach, for each dataset, we also obtain an assessment of the relative importance of uncertainty and robustness as sources of unreliability.
[LG-10] Multi-stage Bayesian optimisation for dynamic decision-making in self-driving labs
链接: https://arxiv.org/abs/2512.15483
作者: Luca Torresi,Pascal Friederich
类目: Machine Learning (cs.LG)
*备注:
Abstract:Self-driving laboratories (SDLs) are combining recent technological advances in robotics, automation, and machine learning based data analysis and decision-making to perform autonomous experimentation toward human-directed goals without requiring any direct human intervention. SDLs are successfully used in materials science, chemistry, and beyond, to optimise processes, materials, and devices in a systematic and data-efficient way. At present, the most widely used algorithm to identify the most informative next experiment is Bayesian optimisation. While relatively simple to apply to a wide range of optimisation problems, standard Bayesian optimisation relies on a fixed experimental workflow with a clear set of optimisation parameters and one or more measurable objective functions. This excludes the possibility of making on-the-fly decisions about changes in the planned sequence of operations and including intermediate measurements in the decision-making process. Therefore, many real-world experiments need to be adapted and simplified to be converted to the common setting in self-driving labs. In this paper, we introduce an extension to Bayesian optimisation that allows flexible sampling of multi-stage workflows and makes optimal decisions based on intermediate observables, which we call proxy measurements. We systematically compare the advantage of taking into account proxy measurements over conventional Bayesian optimisation, in which only the final measurement is observed. We find that over a wide range of scenarios, proxy measurements yield a substantial improvement, both in the time to find good solutions and in the overall optimality of found solutions. This not only paves the way to use more complex and thus more realistic experimental workflows in autonomous labs but also to smoothly combine simulations and experiments in the next generation of SDLs.
[LG-11] Metanetworks as Regulatory Operators: Learning to Edit for Requirement Compliance
链接: https://arxiv.org/abs/2512.15469
作者: Ioannis Kalogeropoulos,Giorgos Bouritsas,Yannis Panagakis
类目: Machine Learning (cs.LG)
*备注: 23 pages
Abstract:As machine learning models are increasingly deployed in high-stakes settings, e.g. as decision support systems in various societal sectors or in critical infrastructure, designers and auditors are facing the need to ensure that models satisfy a wider variety of requirements (e.g. compliance with regulations, fairness, computational constraints) beyond performance. Although most of them are the subject of ongoing studies, typical approaches face critical challenges: post-processing methods tend to compromise performance, which is often counteracted by fine-tuning or, worse, training from scratch, an often time-consuming or even unavailable strategy. This raises the following question: “Can we efficiently edit models to satisfy requirements, without sacrificing their utility?” In this work, we approach this with a unifying framework, in a data-driven manner, i.e. we learn to edit neural networks (NNs), where the editor is an NN itself - a graph metanetwork - and editing amounts to a single inference step. In particular, the metanetwork is trained on NN populations to minimise an objective consisting of two terms: the requirement to be enforced and the preservation of the NN’s utility. We experiment with diverse tasks (the data minimisation principle, bias mitigation and weight pruning) improving the trade-offs between performance, requirement satisfaction and time efficiency compared to popular post-processing or re-training alternatives.
[LG-12] From Risk to Resilience: Towards Assessing and Mitigating the Risk of Data Reconstruction Attacks in Federated Learning
链接: https://arxiv.org/abs/2512.15460
作者: Xiangrui Xu,Zhize Li,Yufei Han,Bin Wang,Jiqiang Liu,Wei Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Data Reconstruction Attacks (DRA) pose a significant threat to Federated Learning (FL) systems by enabling adversaries to infer sensitive training data from local clients. Despite extensive research, the question of how to characterize and assess the risk of DRAs in FL systems remains unresolved due to the lack of a theoretically-grounded risk quantification framework. In this work, we address this gap by introducing Invertibility Loss (InvLoss) to quantify the maximum achievable effectiveness of DRAs for a given data instance and FL model. We derive a tight and computable upper bound for InvLoss and explore its implications from three perspectives. First, we show that DRA risk is governed by the spectral properties of the Jacobian matrix of exchanged model updates or feature embeddings, providing a unified explanation for the effectiveness of defense methods. Second, we develop InvRE, an InvLoss-based DRA risk estimator that offers attack method-agnostic, comprehensive risk evaluation across data instances and model architectures. Third, we propose two adaptive noise perturbation defenses that enhance FL privacy without harming classification accuracy. Extensive experiments on real-world datasets validate our framework, demonstrating its potential for systematic DRA risk evaluation and mitigation in FL systems.
[LG-13] Copyright Infringement Risk Reduction via Chain-of-Thought and Task Instruction Prompting
链接: https://arxiv.org/abs/2512.15442
作者: Neeraj Sarna,Yuanyuan Li,Michael von Gablenz
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large-scale text-to-image generation models can memorize and reproduce their training dataset. Since the training dataset often contains copyrighted material, reproduction of the training dataset poses a copyright infringement risk, which could result in legal liabilities and financial losses for both the AI user and the developer. The current work explores the potential of chain-of-thought and task instruction prompting in reducing copyrighted content generation. To this end, we present a formulation that combines these two techniques with two other copyright mitigation strategies: a) negative prompting, and b) prompt re-writing. We study the generated images in terms of their similarity to a copyrighted image and their relevance to the user input. We present numerical experiments on a variety of models and provide insights on the effectiveness of the aforementioned techniques for varying model complexity.
[LG-14] Statistics of Min-max Normalized Eigenvalues in Random Matrices
链接: https://arxiv.org/abs/2512.15427
作者: Hyakka Nakada,Shu Tanaka
类目: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); Statistics Theory (math.ST)
*备注: 4 pages, 4 figures
Abstract:Random matrix theory has played an important role in various areas of pure mathematics, mathematical physics, and machine learning. From a practical perspective of data science, input data are usually normalized prior to processing. Thus, this study investigates the statistical properties of min-max normalized eigenvalues in random matrices. Previously, the effective distribution for such normalized eigenvalues has been proposed. In this study, we apply it to evaluate a scaling law of the cumulative distribution. Furthermore, we derive the residual error that arises during matrix factorization of random matrices. We conducted numerical experiments to verify these theoretical predictions.
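To make the object of study concrete, here is a minimal sketch of min-max normalization applied to the eigenvalues of a random matrix, followed by an empirical cumulative distribution. The ensemble (a symmetric Gaussian matrix), the size `n`, and the evaluation grid are assumptions; the abstract does not say which ensemble or scaling the authors use.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # hypothetical matrix size

# Symmetric Gaussian random matrix (assumption: a GOE-like ensemble).
G = rng.normal(size=(n, n))
H = (G + G.T) / np.sqrt(2 * n)

eigvals = np.linalg.eigvalsh(H)  # real eigenvalues, sorted ascending
normalized = (eigvals - eigvals.min()) / (eigvals.max() - eigvals.min())

# Empirical cumulative distribution of the normalized eigenvalues on a grid.
grid = np.linspace(0.0, 1.0, 11)
ecdf = np.searchsorted(normalized, grid, side="right") / n
```

After normalization the smallest eigenvalue maps to 0 and the largest to 1, so the two extremes no longer carry information on their own; the shape of `ecdf` between them is what a scaling law of the kind described would characterize.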
[LG-15] FlowBind: Efficient Any-to-Any Generation with Bidirectional Flows
链接: https://arxiv.org/abs/2512.15420
作者: Yeonwoo Cha,Semin Kim,Jinhyeon Kwon,Seunghoon Hong
类目: Machine Learning (cs.LG)
*备注: this https URL
Abstract:Any-to-any generation seeks to translate between arbitrary subsets of modalities, enabling flexible cross-modal synthesis. Despite recent success, existing flow-based approaches are challenged by their inefficiency, as they require large-scale datasets often with restrictive pairing constraints, incur high computational cost from modeling joint distribution, and rely on complex multi-stage training. We propose FlowBind, an efficient framework for any-to-any generation. Our approach is distinguished by its simplicity: it learns a shared latent space capturing cross-modal information, with modality-specific invertible flows bridging this latent to each modality. Both components are optimized jointly under a single flow-matching objective, and at inference the invertible flows act as encoders and decoders for direct translation across modalities. By factorizing interactions through the shared latent, FlowBind naturally leverages arbitrary subsets of modalities for training, and achieves competitive generation quality while substantially reducing data requirements and computational cost. Experiments on text, image, and audio demonstrate that FlowBind attains comparable quality while requiring up to 6x fewer parameters and training 10x faster than prior methods. The project page with code is available at this https URL.
[LG-16] EUBRL: Epistemic Uncertainty Directed Bayesian Reinforcement Learning
链接: https://arxiv.org/abs/2512.15405
作者: Jianfei Ma,Wee Sun Lee
类目: Machine Learning (cs.LG)
*备注:
Abstract:At the boundary between the known and the unknown, an agent inevitably confronts the dilemma of whether to explore or to exploit. Epistemic uncertainty reflects such boundaries, representing systematic uncertainty due to limited knowledge. In this paper, we propose a Bayesian reinforcement learning (RL) algorithm, EUBRL, which leverages epistemic guidance to achieve principled exploration. This guidance adaptively reduces per-step regret arising from estimation errors. We establish nearly minimax-optimal regret and sample complexity guarantees for a class of sufficiently expressive priors in infinite-horizon discounted MDPs. Empirically, we evaluate EUBRL on tasks characterized by sparse rewards, long horizons, and stochasticity. Results demonstrate that EUBRL achieves superior sample efficiency, scalability, and consistency.
[LG-17] Robustness Evaluation of Machine Learning Models for Fault Classification and Localization In Power System Protection
链接: https://arxiv.org/abs/2512.15385
作者: Julian Oelhaf,Mehran Pashaei,Georg Kordowich,Christian Bergler,Andreas Maier,Johann Jäger,Siming Bayer
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: This paper is a postprint of a paper submitted to and accepted for publication in the 20th IET International Conference on Developments in Power System Protection (DPSP Global 2026) and is subject to Institution of Engineering and Technology Copyright. The copy of record is available at the IET Digital Library
Abstract:The growing penetration of renewable and distributed generation is transforming power systems and challenging conventional protection schemes that rely on fixed settings and local measurements. Machine learning (ML) offers a data-driven alternative for centralized fault classification (FC) and fault localization (FL), enabling faster and more adaptive decision-making. However, practical deployment critically depends on robustness. Protection algorithms must remain reliable even when confronted with missing, noisy, or degraded sensor data. This work introduces a unified framework for systematically evaluating the robustness of ML models in power system protection. High-fidelity EMT simulations are used to model realistic degradation scenarios, including sensor outages, reduced sampling rates, and transient communication losses. The framework provides a consistent methodology for benchmarking models, quantifying the impact of limited observability, and identifying critical measurement channels required for resilient operation. Results show that FC remains highly stable under most degradation types but drops by about 13% under single-phase loss, while FL is more sensitive overall, with voltage loss increasing localization error by over 150%. These findings offer actionable guidance for robustness-aware design of future ML-assisted protection systems.
[LG-18] Remotely Detectable Robot Policy Watermarking
链接: https://arxiv.org/abs/2512.15379
作者: Michael Amir,Manon Flageat,Amanda Prorok
类目: Robotics (cs.RO); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
Abstract:The success of machine learning for real-world robotic systems has created a new form of intellectual property: the trained policy. This raises a critical need for novel methods that verify ownership and detect unauthorized, possibly unsafe misuse. While watermarking is established in other domains, physical policies present a unique challenge: remote detection. Existing methods assume access to the robot’s internal state, but auditors are often limited to external observations (e.g., video footage). This “Physical Observation Gap” means the watermark must be detected from signals that are noisy, asynchronous, and filtered by unknown system dynamics. We formalize this challenge using the concept of a glimpse sequence, and introduce Colored Noise Coherency (CoNoCo), the first watermarking strategy designed for remote detection. CoNoCo embeds a spectral signal into the robot’s motions by leveraging the policy’s inherent stochasticity. To show it does not degrade performance, we prove CoNoCo preserves the marginal action distribution. Our experiments demonstrate strong, robust detection across various remote modalities, including motion capture and side-way/top-down video footage, in both simulated and real-world robot experiments. This work provides a necessary step toward protecting intellectual property in robotics, offering the first method for validating the provenance of physical policies non-invasively, using purely remote observations.
[LG-19] A Regime-Aware Fusion Framework for Time Series Classification
链接: https://arxiv.org/abs/2512.15378
作者: Honey Singh Chauhan,Zahraa S. Abdallah
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Kernel-based methods such as Rocket are among the most effective default approaches for univariate time series classification (TSC), yet they do not perform equally well across all datasets. We revisit the long-standing intuition that different representations capture complementary structure and show that selectively fusing them can yield consistent improvements over Rocket on specific, systematically identifiable kinds of datasets. We introduce Fusion-3 (F3), a lightweight framework that adaptively fuses Rocket, Sax, and Sfa representations. To understand when fusion helps, we cluster UCR datasets into six groups using meta-features capturing series length, spectral structure, roughness, and class imbalance, and treat these clusters as interpretable data-structure regimes. Our analysis shows that fusion typically outperforms strong baselines in regimes with structured variability or rich frequency content, while offering diminishing returns in highly irregular or outlier-heavy settings. To support these findings, we combine three complementary analyses: non-parametric paired statistics across datasets, ablation studies isolating the roles of individual representations, and attribution via SHAP to identify which dataset properties predict fusion gains. Sample-level case studies further reveal the underlying mechanism: fusion primarily improves performance by rescuing specific errors, with adaptive increases in frequency-domain weighting precisely where corrections occur. Using 5-fold cross-validation on the 113 UCR datasets, F3 yields small but consistent average improvements over Rocket, supported by frequentist and Bayesian evidence and accompanied by clearly identifiable failure cases. Our results show that selectively applied fusion provides a dependable and interpretable extension to strong kernel-based methods, correcting their weaknesses precisely where the data support it.
[LG-20] Bits for Privacy: Evaluating Post-Training Quantization via Membership Inference
链接: https://arxiv.org/abs/2512.15335
作者: Chenxiang Zhang,Tongxi Qu,Zhong Li,Tian Zhang,Jun Pang,Sjouke Mauw
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: accepted at TrustCom 2025
Abstract:Deep neural networks are widely deployed with quantization techniques to reduce memory and computational costs by lowering the numerical precision of their parameters. While quantization alters model parameters and their outputs, existing privacy analyses primarily focus on full-precision models, leaving a gap in understanding how bit-width reduction can affect privacy leakage. We present the first systematic study of the privacy-utility relationship in post-training quantization (PTQ), a versatile family of methods that can be applied to pretrained models without further training. Using membership inference attacks as our evaluation framework, we analyze three popular PTQ algorithms (AdaRound, BRECQ, and OBC) across multiple precision levels (4-bit, 2-bit, and 1.58-bit) on CIFAR-10, CIFAR-100, and TinyImageNet datasets. Our findings consistently show that low-precision PTQs can reduce privacy leakage. In particular, lower-precision models demonstrate up to an order of magnitude reduction in membership inference vulnerability compared to their full-precision counterparts, albeit at the cost of decreased utility. Additional ablation studies on the 1.58-bit quantization level show that quantizing only the last layer at higher precision enables fine-grained control over the privacy-utility trade-off. These results offer actionable insights for practitioners to balance efficiency, utility, and privacy protection in real-world deployments.
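For readers unfamiliar with the precision levels named above, the following sketch applies plain round-to-nearest quantization onto an evenly spaced grid. This is a deliberately simple stand-in and not AdaRound, BRECQ, or OBC, which choose roundings far more carefully. The helper `quantize_rtn`, the random "weights", and the reading of 1.58-bit as three levels (log2(3) ≈ 1.58) are assumptions for illustration.

```python
import numpy as np

def quantize_rtn(w, levels):
    """Round-to-nearest onto `levels` evenly spaced values spanning [w.min(), w.max()]."""
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / (levels - 1)
    q = np.clip(np.round((w - lo) / scale), 0, levels - 1)
    return q * scale + lo

rng = np.random.default_rng(0)
w = rng.normal(size=1000)  # stand-in for one layer's weights

for levels in (16, 4, 3):  # 4-bit, 2-bit, and ternary (log2(3) is about 1.58 bits)
    w_q = quantize_rtn(w, levels)
    mse = np.mean((w - w_q) ** 2)
    print(f"{levels:2d} levels ({np.log2(levels):.2f} bits): MSE = {mse:.4f}")
```

Fewer levels mean a coarser grid, so each weight keeps less information about the value it was trained to. That loss of detail is the intuition behind the privacy-utility trade-off the abstract reports, although the sketch measures only reconstruction error, not membership inference.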
[LG-21] Time-Varying Audio Effect Modeling by End-to-End Adversarial Training
链接: https://arxiv.org/abs/2512.15313
作者: Yann Bourdin,Pierrick Legrand,Fanny Roche
类目: Sound (cs.SD); Machine Learning (cs.LG)
*备注: Submitted for review to the Journal of the Audio Engineering Society (JAES). Accompanying website: this https URL
Abstract:Deep learning has become a standard approach for the modeling of audio effects, yet strictly black-box modeling remains problematic for time-varying systems. Unlike time-invariant effects, training models on devices with internal modulation typically requires the recording or extraction of control signals to ensure the time-alignment required by standard loss functions. This paper introduces a Generative Adversarial Network (GAN) framework to model such effects using only input-output audio recordings, removing the need for modulation signal extraction. We propose a convolutional-recurrent architecture trained via a two-stage strategy: an initial adversarial phase allows the model to learn the distribution of the modulation behavior without strict phase constraints, followed by a supervised fine-tuning phase where a State Prediction Network (SPN) estimates the initial internal states required to synchronize the model with the target. Additionally, a new objective metric based on chirp-train signals is developed to quantify modulation accuracy. Experiments modeling a vintage hardware phaser demonstrate the method’s ability to capture time-varying dynamics in a fully black-box context.
[LG-22] LLMQ: Efficient Lower-Precision Pretraining for Consumer GPUs
链接: https://arxiv.org/abs/2512.15306
作者: Erik Schultheis,Dan Alistarh
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:
Abstract:We present LLMQ, an end-to-end CUDA/C++ implementation for medium-sized language-model training, e.g. 3B to 32B parameters, on affordable, commodity GPUs. These devices are characterized by low memory availability and slow communication compared to datacentre-grade GPUs. Consequently, we showcase a range of optimizations that target these bottlenecks, including activation checkpointing, offloading, and copy-engine based collectives. LLMQ is able to train or fine-tune a 7B model on a single 16GB mid-range gaming card, or a 32B model on a workstation equipped with 4 RTX 4090s. This is achieved while executing a standard 8-bit training pipeline, without additional algorithmic approximations, and maintaining FLOP utilization of around 50%. The efficiency of LLMQ rivals that of production-scale systems on much more expensive cloud-grade GPUs.
[LG-23] Topological Metric for Unsupervised Embedding Quality Evaluation
链接: https://arxiv.org/abs/2512.15285
作者: Aleksei Shestov,Anton Klenitskiy,Daria Denisova,Amurkhan Dzagkoev,Daniil Petrovich,Andrey Savchenko,Maksim Makarenko
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注:
Abstract:Modern representation learning increasingly relies on unsupervised and self-supervised methods trained on large-scale unlabeled data. While these approaches achieve impressive generalization across tasks and domains, evaluating embedding quality without labels remains an open challenge. In this work, we propose Persistence, a topology-aware metric based on persistent homology that quantifies the geometric structure and topological richness of embedding spaces in a fully unsupervised manner. Unlike metrics that assume linear separability or rely on covariance structure, Persistence captures global and multi-scale organization. Empirical results across diverse domains show that Persistence consistently achieves top-tier correlations with downstream performance, outperforming existing unsupervised metrics and enabling reliable model and hyperparameter selection.
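As a rough illustration of what persistent homology measures, the sketch below computes zero-dimensional persistence for a small point cloud. It is not the paper's Persistence metric, whose exact definition the abstract does not give. It relies on a standard fact: in a Vietoris-Rips filtration indexed by pairwise distance, the finite death times of connected components equal the edge lengths of a minimum spanning tree. The function name `h0_death_times` and the sample points are illustrative assumptions.

```python
import math

def h0_death_times(points):
    """Finite death times of 0-dimensional persistence (Vietoris-Rips, distance scale).

    Computed as the edge lengths of a minimum spanning tree via Prim's algorithm.
    """
    n = len(points)
    if n < 2:
        return []
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known edge from the tree to each point
    best[0] = 0.0
    deaths = []
    for step in range(n):
        # Pick the closest point that is not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if step > 0:
            # Joining u merges two components: one component "dies" at this scale.
            deaths.append(best[u])
        for v in range(n):
            if not in_tree[v]:
                best[v] = min(best[v], math.dist(points[u], points[v]))
    return sorted(deaths)

# Hypothetical embedding: two nearby points and one far away.
embedding = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
deaths = h0_death_times(embedding)
total_persistence = sum(deaths)
```

A large gap between the smallest and largest death times suggests well-separated clusters; a summary statistic such as `total_persistence` is only one of many possible ways to turn a persistence diagram into a scalar.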
[LG-24] Model inference for ranking from pairwise comparisons
Link: https://arxiv.org/abs/2512.15269
Authors: Daniel Sánchez Catalina, George T. Cantwell
Subjects: Social and Information Networks (cs.SI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Abstract:We consider the problem of ranking objects from noisy pairwise comparisons, for example, ranking tennis players from the outcomes of matches. We follow a standard approach to this problem and assume that each object has an unobserved strength and that the outcome of each comparison depends probabilistically on the strengths of the comparands. However, we do not assume to know a priori how skills affect outcomes. Instead, we present an efficient algorithm for simultaneously inferring both the unobserved strengths and the function that maps strengths to probabilities. Despite this problem being under-constrained, we present experimental evidence that the conclusions of our Bayesian approach are robust to different model specifications. We include several case studies to exemplify the method on real-world data sets.
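For context, the "standard approach" mentioned in the abstract is often instantiated as the Bradley-Terry model, which fixes the map from strengths to win probabilities as P(i beats j) = p_i / (p_i + p_j). The sketch below fits that fixed-form baseline with the classical minorization-maximization update. It is not the authors' algorithm, which additionally infers the mapping function itself. The function name and the toy win matrix are assumptions made for illustration.

```python
def fit_bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths from a win-count matrix.

    wins[i][j] is the number of times object i beat object j.
    Returns strengths normalised to sum to 1.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iters):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i])
            # Each opponent contributes (games played) / (p_i + p_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n) if j != i
            )
            updated.append(total_wins / denom)
        norm = sum(updated)
        p = [v / norm for v in updated]
    return p

# Toy data: three players, ten matches between each pair.
wins = [
    [0, 8, 9],
    [2, 0, 7],
    [1, 3, 0],
]
strengths = fit_bradley_terry(wins)
ranking = sorted(range(len(strengths)), key=lambda i: -strengths[i])
```

Because the logistic-type form is hard-coded here, any misfit between it and the real outcome process is absorbed into the strengths, which is the assumption the paper sets out to relax.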
[LG-25] Distillation-Guided Structural Transfer for Continual Learning Beyond Sparse Distributed Memory
Link: https://arxiv.org/abs/2512.15267
Authors: Huiyan Xue, Xuming Ran, Yaxin Li, Qi Xu, Enhui Li, Yi Xu, Qiang Zhang
Subjects: Machine Learning (cs.LG)
Abstract:Sparse neural systems are gaining traction for efficient continual learning due to their modularity and low interference. Architectures such as Sparse Distributed Memory Multi-Layer Perceptrons (SDMLP) construct task-specific subnetworks via Top-K activation and have shown resilience against catastrophic forgetting. However, their rigid modularity limits cross-task knowledge reuse and leads to performance degradation under high sparsity. We propose Selective Subnetwork Distillation (SSD), a structurally guided continual learning framework that treats distillation not as a regularizer but as a topology-aligned information conduit. SSD identifies neurons with high activation frequency and selectively distills knowledge within previous Top-K subnetworks and output logits, without requiring replay or task labels. This enables structural realignment while preserving sparse modularity. Experiments on Split CIFAR-10, CIFAR-100, and MNIST demonstrate that SSD improves accuracy, retention, and representation coverage, offering a structurally grounded solution for sparse continual learning.
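The Top-K activation and the "activation frequency" statistic mentioned in the abstract can be sketched in a few lines. This is a generic illustration of the two ideas, not the SDMLP or SSD implementation; the function names and the sample pre-activations are assumptions.

```python
def top_k_indices(row, k):
    """Indices of the k largest pre-activations in one sample."""
    return set(sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k])

def top_k_activation(row, k):
    """Keep the k largest entries of the row and zero out the rest."""
    keep = top_k_indices(row, k)
    return [v if i in keep else 0.0 for i, v in enumerate(row)]

def activation_frequency(batch, k):
    """Fraction of samples in which each neuron survives the Top-K selection."""
    counts = [0] * len(batch[0])
    for row in batch:
        for i in top_k_indices(row, k):
            counts[i] += 1
    return [c / len(batch) for c in counts]

# Hypothetical hidden-layer pre-activations for three samples, four neurons.
batch = [
    [0.1, 2.0, -1.0, 0.7],
    [3.0, 0.2, 0.5, 0.4],
    [0.9, 1.5, 0.1, 0.0],
]
sparse_first = top_k_activation(batch[0], k=2)
frequency = activation_frequency(batch, k=2)
```

Neurons with a high value in `frequency` are the ones a method like SSD would, per the abstract, single out for selective distillation.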
[LG-26] O-EENC-SD: Efficient Online End-to-End Neural Clustering for Speaker Diarization
Link: https://arxiv.org/abs/2512.15229
Authors: Elio Gruttadauria (IP Paris, LTCI, IDS, S2A), Mathieu Fontaine (LTCI, IP Paris), Jonathan Le Roux, Slim Essid (IDS, S2A, LTCI)
Subjects: Machine Learning (cs.LG); Sound (cs.SD); Signal Processing (eess.SP)
Abstract:We introduce O-EENC-SD: an end-to-end online speaker diarization system based on EEND-EDA, featuring a novel RNN-based stitching mechanism for online prediction. In particular, we develop a novel centroid refinement decoder whose usefulness is assessed through a rigorous ablation study. Our system provides key advantages over existing methods: a hyperparameter-free solution compared to unsupervised clustering approaches, and a more efficient alternative to current online end-to-end methods, which are computationally costly. We demonstrate that O-EENC-SD is competitive with the state of the art in the two-speaker conversational telephone speech domain, as tested on the CallHome dataset. Our results show that O-EENC-SD provides a great trade-off between DER and complexity, even when working on independent chunks with no overlap, making the system extremely efficient.
[LG-27] Accelerating High-Throughput Catalyst Screening by Direct Generation of Equilibrium Adsorption Structures
Link: https://arxiv.org/abs/2512.15228
Authors: Songze Huo, Xiao-Ming Cao
Subjects: Machine Learning (cs.LG)
Abstract:The adsorption energy serves as a crucial descriptor for the large-scale screening of catalysts. Nevertheless, the limited distribution of training data for the extensively utilised machine learning interatomic potential (MLIP), predominantly sourced from near-equilibrium structures, results in unreliable adsorption structures and consequent adsorption energy predictions. In this context, we present DBCata, a deep generative model that integrates a periodic Brownian-bridge framework with an equivariant graph neural network to establish a low-dimensional transition manifold between unrelaxed and DFT-relaxed structures, without requiring explicit energy or force information. Upon training, DBCata effectively generates high-fidelity adsorption geometries, achieving an interatomic distance mean absolute error (DMAE) of 0.035 Å on the Catalysis-Hub dataset, which is nearly three times superior to that of the current state-of-the-art machine learning potential models. Moreover, the corresponding DFT accuracy can be improved within 0.1 eV in 94% of instances by identifying and refining anomalous predictions through a hybrid chemical-heuristic and self-supervised outlier detection approach. We demonstrate that the remarkable performance of DBCata facilitates accelerated high-throughput computational screening for efficient alloy catalysts in the oxygen reduction reaction, highlighting the potential of DBCata as a powerful tool for catalyst design and optimisation.
[LG-28] Label-consistent clustering for evolving data
Link: https://arxiv.org/abs/2512.15210
Authors: Ameet Gadekar, Aristides Gionis, Thibault Marette
Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Comments: 26 pages
Abstract:Data analysis often involves an iterative process, where solutions must be continuously refined in response to new data. Typically, as new data becomes available, an existing solution must be updated to incorporate the latest information. In addition to seeking a high-quality solution for the task at hand, it is also crucial to ensure consistency by minimizing drastic changes from previous solutions. Applying this approach across many iterations ensures that the solution evolves gradually and smoothly. In this paper, we study the above problem in the context of clustering, specifically focusing on the k-center problem. More precisely, we study the following problem: Given a set of points X, parameters k and b, and a prior clustering solution H for X, our goal is to compute a new solution C for X, consisting of k centers, which minimizes the clustering cost while introducing at most b changes from H. We refer to this problem as label-consistent k-center, and we propose two constant-factor approximation algorithms for it. We complement our theoretical findings with an experimental evaluation demonstrating the effectiveness of our methods on real-world datasets.
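The two quantities in the problem statement, the k-center cost and the number of changes relative to a prior solution, can be made concrete with a small evaluator. The abstract does not say exactly how a "change" is counted; the sketch assumes a change is a point whose assigned center index differs, with old and new centers matched by position. All function names and data are hypothetical, and this only scores candidate solutions rather than implementing the paper's approximation algorithms.

```python
import math

def assign_to_centers(points, centers):
    """For each point, return the index of its nearest center and that distance."""
    labels, dists = [], []
    for p in points:
        d = [math.dist(p, c) for c in centers]
        j = min(range(len(centers)), key=lambda i: d[i])
        labels.append(j)
        dists.append(d[j])
    return labels, dists

def k_center_cost(points, centers):
    """k-center objective: the largest distance from any point to its nearest center."""
    return max(assign_to_centers(points, centers)[1])

def label_changes(old_labels, new_labels):
    """Assumed consistency measure: number of points whose label differs."""
    return sum(a != b for a, b in zip(old_labels, new_labels))

points = [(0.0,), (1.0,), (4.0,), (10.0,), (11.0,)]
prior_centers = [(0.0,), (10.0,)]      # prior solution H
candidate_a = [(1.0,), (10.0,)]        # lower cost, no relabelling
candidate_b = [(0.5,), (7.0,)]         # relabels the point at 4.0

prior_labels, _ = assign_to_centers(points, prior_centers)
labels_a, _ = assign_to_centers(points, candidate_a)
labels_b, _ = assign_to_centers(points, candidate_b)

cost_prior = k_center_cost(points, prior_centers)
cost_a = k_center_cost(points, candidate_a)
changes_a = label_changes(prior_labels, labels_a)
changes_b = label_changes(prior_labels, labels_b)
```

With a budget of b = 0, `candidate_a` is admissible under this counting rule and `candidate_b` is not.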
[LG-29] Chorus: Harmonizing Context and Sensing Signals for Data-Free Model Customization in IoT
Link: https://arxiv.org/abs/2512.15206
Authors: Liyu Zhang, Yejia Liu, Kwun Ho Liu, Runxi Huang, Xiaomin Ouyang
Subjects: Machine Learning (cs.LG)
Abstract:In real-world IoT applications, sensor data is usually collected under diverse and dynamic contextual conditions where factors such as sensor placements or ambient environments can significantly affect data patterns and downstream performance. Traditional domain adaptation or generalization methods often ignore such context information or use simplistic integration strategies, making them ineffective in handling unseen context shifts after deployment. In this paper, we propose Chorus, a context-aware, data-free model customization approach that adapts models to unseen deployment conditions without requiring target-domain data. The key idea is to learn effective context representations that capture their influence on sensor data patterns and to adaptively integrate them based on the degree of context shift. Specifically, Chorus first performs unsupervised cross-modal reconstruction between unlabeled sensor data and language-based context embeddings, while regularizing the context embedding space to learn robust, generalizable context representations. Then, it trains a lightweight gated head on limited labeled samples to dynamically balance sensor and context contributions, favoring context when sensor evidence is ambiguous and vice versa. To further reduce inference latency, Chorus employs a context-caching mechanism that reuses cached context representations and updates only upon detected context shifts. Experiments on IMU, speech, and WiFi sensing tasks under diverse context shifts show that Chorus outperforms state-of-the-art baselines by up to 11.3% in unseen contexts, while maintaining comparable latency on smartphone and edge devices.
[LG-30] BEAT2AASIST model with layer fusion for ESDD 2026 Challenge
Link: https://arxiv.org/abs/2512.15180
Authors: Sanghyeok Chung, Eujin Kim, Donggun Kim, Gaeun Heo, Jeongbin You, Nahyun Lee, Sunmook Choi, Soyul Han, Seungsang Oh, Il-Youp Kwak
Subjects: Sound (cs.SD); Machine Learning (cs.LG)
Comments: 3 pages, 1 figure, challenge paper
Abstract:Recent advances in audio generation have increased the risk of realistic environmental sound manipulation, motivating the ESDD 2026 Challenge as the first large-scale benchmark for Environmental Sound Deepfake Detection (ESDD). We propose BEAT2AASIST which extends BEATs-AASIST by splitting BEATs-derived representations along frequency or channel dimension and processing them with dual AASIST branches. To enrich feature representations, we incorporate top-k transformer layer fusion using concatenation, CNN-gated, and SE-gated strategies. In addition, vocoder-based data augmentation is applied to improve robustness against unseen spoofing methods. Experimental results on the official test sets demonstrate that the proposed approach achieves competitive performance across the challenge tracks.
[LG-31] Understanding NTK Variance in Implicit Neural Representations
Link: https://arxiv.org/abs/2512.15169
Authors: Chengguang Ou, Yixin Zhuang
Subjects: Machine Learning (cs.LG)
Abstract:Implicit Neural Representations (INRs) often converge slowly and struggle to recover high-frequency details due to spectral bias. While prior work links this behavior to the Neural Tangent Kernel (NTK), how specific architectural choices affect NTK conditioning remains unclear. We show that many INR mechanisms can be understood through their impact on a small set of pairwise similarity factors and scaling terms that jointly determine NTK eigenvalue variance. For standard coordinate MLPs, limited input-feature interactions induce large eigenvalue dispersion and poor conditioning. We derive closed-form variance decompositions for common INR components and show that positional encoding reshapes input similarity, spherical normalization reduces variance via layerwise scaling, and Hadamard modulation introduces additional similarity factors strictly below one, yielding multiplicative variance reduction. This unified view explains how diverse INR architectures mitigate spectral bias by improving NTK conditioning. Experiments across multiple tasks confirm the predicted variance reductions and demonstrate faster, more stable convergence with improved reconstruction quality.
[LG-32] An Efficient Gradient-Based Inference Attack for Federated Learning
Link: https://arxiv.org/abs/2512.15143
Authors: Pablo Montaña-Fernández, Ines Ortega-Fernandez
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: This paper was supported by the TRUMPET project, funded by the European Union under Grant Agreement No. 101070038
Abstract:Federated Learning is a machine learning setting that reduces direct data exposure, improving the privacy guarantees of machine learning models. Yet, the exchange of model updates between the participants and the aggregator can still leak sensitive information. In this work, we present a new gradient-based membership inference attack for federated learning scenarios that exploits the temporal evolution of last-layer gradients across multiple federated rounds. Our method uses the shadow technique to learn round-wise gradient patterns of the training records, requiring no access to the private dataset, and is designed to consider both semi-honest and malicious adversaries (aggregators or data owners). Beyond membership inference, we also provide a natural extension of the proposed attack to discrete attribute inference by contrasting gradient responses under alternative attribute hypotheses. The proposed attacks are model-agnostic, and therefore applicable to any gradient-based model and can be applied to both classification and regression settings. We evaluate the attack on CIFAR-100 and Purchase100 datasets for membership inference and on Breast Cancer Wisconsin for attribute inference. Our findings reveal strong attack performance and comparable computational and memory overhead in membership inference when compared to another attack from the literature. The obtained results emphasize that multi-round federated learning can increase the vulnerability to inference attacks, that aggregators pose a more substantial threat than data owners, and that attack performance is strongly influenced by the nature of the training dataset, with richer, high-dimensional data leading to stronger leakage than simpler tabular data.
[LG-33] Generalization and Feature Attribution in Machine Learning Models for Crop Yield and Anomaly Prediction in Germany
Link: https://arxiv.org/abs/2512.15140
Authors: Roland Baatz
Subjects: Machine Learning (cs.LG)
Comments: 13 pages, 3 figures
MSC classes: 68T07 (Primary), 62M45, 62P12 (Secondary)
ACM classes: I.2.6; F.2.2
Abstract:This study examines the generalization performance and interpretability of machine learning (ML) models used for predicting crop yield and yield anomalies in Germany’s NUTS-3 regions. Using a high-quality, long-term dataset, the study systematically compares the evaluation and temporal validation behavior of ensemble tree-based models (XGBoost, Random Forest) and deep learning approaches (LSTM, TCN). While all models perform well on spatially split, conventional test sets, their performance degrades substantially on temporally independent validation years, revealing persistent limitations in generalization. Notably, models with strong test-set accuracy but weak temporal validation performance can still produce seemingly credible SHAP feature importance values. This exposes a critical vulnerability in post hoc explainability methods: interpretability may appear reliable even when the underlying model fails to generalize. These findings underscore the need for validation-aware interpretation of ML predictions in agricultural and environmental systems. Feature importance should not be accepted at face value unless models are explicitly shown to generalize to unseen temporal and spatial conditions. The study advocates for domain-aware validation, hybrid modeling strategies, and more rigorous scrutiny of explainability methods in data-driven agriculture. Ultimately, this work addresses a growing challenge in environmental data science: how can we evaluate generalization robustly enough to trust model explanations?
[LG-34] TrajSyn: Privacy-Preserving Dataset Distillation from Federated Model Trajectories for Server-Side Adversarial Training
Link: https://arxiv.org/abs/2512.15123
Authors: Mukur Gupta, Niharika Gupta, Saifur Rahman, Shantanu Pal, Chandan Karmakar
Subjects: Machine Learning (cs.LG)
Abstract:Deep learning models deployed on edge devices are increasingly used in safety-critical applications. However, their vulnerability to adversarial perturbations poses significant risks, especially in Federated Learning (FL) settings where identical models are distributed across thousands of clients. While adversarial training is a strong defense, it is difficult to apply in FL due to strict client-data privacy constraints and the limited compute available on edge devices. In this work, we introduce TrajSyn, a privacy-preserving framework that enables effective server-side adversarial training by synthesizing a proxy dataset from the trajectories of client model updates, without accessing raw client data. We show that TrajSyn consistently improves adversarial robustness on image classification benchmarks with no extra compute burden on the client device.
[LG-35] SigMA: Path Signatures and Multi-head Attention for Learning Parameters in fBm-driven SDEs
Link: https://arxiv.org/abs/2512.15088
Authors: Xianglin Wu, Chiheb Ben Hammouda, Cornelis W. Oosterlee
Subjects: Machine Learning (cs.LG); Mathematical Finance (q-fin.MF)
Abstract:Stochastic differential equations (SDEs) driven by fractional Brownian motion (fBm) are increasingly used to model systems with rough dynamics and long-range dependence, such as those arising in quantitative finance and reliability engineering. However, these processes are non-Markovian and lack a semimartingale structure, rendering many classical parameter estimation techniques inapplicable or computationally intractable beyond very specific cases. This work investigates two central questions: (i) whether integrating path signatures into deep learning architectures can improve the trade-off between estimation accuracy and model complexity, and (ii) what constitutes an effective architecture for leveraging signatures as feature maps. We introduce SigMA (Signature Multi-head Attention), a neural architecture that integrates path signatures with multi-head self-attention, supported by a convolutional preprocessing layer and a multilayer perceptron for effective feature encoding. SigMA learns model parameters from synthetically generated paths of fBm-driven SDEs, including fractional Brownian motion, fractional Ornstein-Uhlenbeck, and rough Heston models, with a particular focus on estimating the Hurst parameter and on joint multi-parameter inference, and it generalizes robustly to unseen trajectories. Extensive experiments on synthetic data and two real-world datasets (i.e., equity-index realized volatility and Li-ion battery degradation) show that SigMA consistently outperforms CNN, LSTM, vanilla Transformer, and Deep Signature baselines in accuracy, robustness, and model compactness. These results demonstrate that combining signature transforms with attention-based architectures provides an effective and scalable framework for parameter inference in stochastic systems with rough or persistent temporal structure.
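To make "path signatures as feature maps" concrete, the sketch below computes the first two signature levels of a piecewise-linear path, using Chen's identity to append one linear segment at a time. It illustrates the feature transform only, not the SigMA architecture; the function name and the sample path are assumptions, and practical pipelines typically use a dedicated signature library and higher truncation levels.

```python
def signature_up_to_level_2(path):
    """Levels 1 and 2 of the signature of a piecewise-linear path.

    path is a sequence of d-dimensional points. Returns (s1, s2), where
    s1[i] is the total increment in coordinate i and s2[i][j] is the
    iterated integral of coordinate i against coordinate j.
    """
    d = len(path[0])
    s1 = [0.0] * d
    s2 = [[0.0] * d for _ in range(d)]
    for prev, curr in zip(path, path[1:]):
        dx = [c - p for p, c in zip(prev, curr)]
        # Chen's identity: new level 2 = old level 2 + s1 (x) dx + (dx (x) dx) / 2.
        for i in range(d):
            for j in range(d):
                s2[i][j] += s1[i] * dx[j] + 0.5 * dx[i] * dx[j]
        for i in range(d):
            s1[i] += dx[i]
    return s1, s2

# Hypothetical 2-D path: one step right, then one step up.
path = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
s1, s2 = signature_up_to_level_2(path)
levy_area = 0.5 * (s2[0][1] - s2[1][0])
```

The antisymmetric part of level 2 (the Lévy area) captures the order in which coordinates move, which is information that plain increments discard.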
[LG-36] PIP$^2$ Net: Physics-informed Partition Penalty Deep Operator Network
Link: https://arxiv.org/abs/2512.15086
Authors: Hongjin Mi, Huiqiang Lun, Changhong Mou, Yeyu Zhang
Subjects: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Abstract:Operator learning has become a powerful tool for accelerating the solution of parameterized partial differential equations (PDEs), enabling rapid prediction of full spatiotemporal fields for new initial conditions or forcing functions. Existing architectures such as DeepONet and the Fourier Neural Operator (FNO) show strong empirical performance but often require large training datasets, lack explicit physical structure, and may suffer from instability in their trunk-network features, where mode imbalance or collapse can hinder accurate operator approximation. Motivated by the stability and locality of classical partition-of-unity (PoU) methods, we investigate PoU-based regularization techniques for operator learning and develop a revised formulation of the existing POU–PI–DeepONet framework. The resulting Physics-informed Partition Penalty Deep Operator Network (PIP$^2$ Net) introduces a simplified and more principled partition penalty that improves the coordinated trunk outputs, leading to greater expressiveness without sacrificing the flexibility of DeepONet. We evaluate PIP$^2$ Net on three nonlinear PDEs: the viscous Burgers equation, the Allen–Cahn equation, and a diffusion–reaction system. The results show that it consistently outperforms DeepONet, PI-DeepONet, and POU-DeepONet in prediction accuracy and robustness.
[LG-37] Neural Modular Physics for Elastic Simulation
Link: https://arxiv.org/abs/2512.15083
Authors: Yifei Li, Haixu Wu, Zeyi Xu, Tuur Stuyck, Wojciech Matusik
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
Abstract:Learning-based methods have made significant progress in physics simulation, typically approximating dynamics with a monolithic end-to-end optimized neural network. Although these models offer an effective approach to simulation, they may lose essential features compared to traditional numerical simulators, such as physical interpretability and reliability. Drawing inspiration from classical simulators that operate in a modular fashion, this paper presents Neural Modular Physics (NMP) for elastic simulation, which combines the approximation capacity of neural networks with the physical reliability of traditional simulators. Beyond the previous monolithic learning paradigm, NMP enables direct supervision of intermediate quantities and physical constraints by decomposing elastic dynamics into physically meaningful neural modules connected through intermediate physical quantities. With a specialized architecture and training strategy, our method transforms the numerical computation flow into a modular neural simulator, achieving improved physical consistency and generalizability. Experimentally, NMP demonstrates superior generalization to unseen initial conditions and resolutions, stable long-horizon simulation, better preservation of physical properties compared to other neural simulators, and greater feasibility in scenarios with unknown underlying dynamics than traditional simulators.
[LG-38] The Semantic Architect: How FEAML Bridges Structured Data and LLMs for Multi-Label Tasks
Link: https://arxiv.org/abs/2512.15082
Authors: Wanfu Gao, Zebin He, Jun Gao
Subjects: Machine Learning (cs.LG)
Abstract:Existing feature engineering methods based on large language models (LLMs) have not yet been applied to multi-label learning tasks. They lack the ability to model complex label dependencies and are not specifically adapted to the characteristics of multi-label tasks. To address the above issues, we propose Feature Engineering Automation for Multi-Label Learning (FEAML), an automated feature engineering method for multi-label classification which leverages the code generation capabilities of LLMs. By utilizing metadata and label co-occurrence matrices, LLMs are guided to understand the relationships between data features and task objectives, based on which high-quality features are generated. The newly generated features are evaluated in terms of model accuracy to assess their effectiveness, while Pearson correlation coefficients are used to detect redundancy. FEAML further incorporates the evaluation results as feedback to drive LLMs to continuously optimize code generation in subsequent iterations. By integrating LLMs with a feedback mechanism, FEAML realizes an efficient, interpretable and self-improving feature engineering paradigm. Empirical results on various multi-label datasets demonstrate that our FEAML outperforms other feature engineering methods.
[LG-39] Stock Pattern Assistant (SPA): A Deterministic and Explainable Framework for Structural Price Run Extraction and Event Correlation in Equity Markets
Link: https://arxiv.org/abs/2512.15008
Authors: Sandeep Neela
Subjects: Machine Learning (cs.LG)
Abstract:Understanding how prices evolve over time often requires peeling back the layers of market noise to identify clear, structural behavior. Many of the tools commonly used for this purpose (technical indicators, chart heuristics, or even sophisticated predictive models) leave important questions unanswered. Technical indicators depend on platform-specific rules, and predictive systems typically offer little in terms of explanation. In settings that demand transparency or auditability, this poses a significant challenge. We introduce the Stock Pattern Assistant (SPA), a deterministic framework designed to extract monotonic price runs, attach relevant public events through a symmetric correlation window, and generate explanations that are factual, historical, and guardrailed. SPA relies only on daily OHLCV data and a normalized event stream, making the pipeline straightforward to audit and easy to reproduce. To illustrate SPA’s behavior in practice, we evaluate it across four equities (AAPL, NVDA, SCHW, and PGR) chosen to span a range of volatility regimes and sector characteristics. Although the evaluation period is modest, the results demonstrate how SPA consistently produces stable structural decompositions and contextual narratives. Ablation experiments further show how deterministic segmentation, event alignment, and constrained explanation each contribute to interpretability. SPA is not a forecasting system, nor is it intended to produce trading signals. Its value lies in offering a transparent, reproducible view of historical price structure that can complement analyst workflows, risk reviews, and broader explainable-AI pipelines.
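One plausible reading of "extracting monotonic price runs" is a deterministic split of the daily closing series into maximal stretches that move in a single direction. The sketch below implements that reading. The abstract does not give SPA's actual segmentation rule, so the flat-day handling, the function name, and the sample prices are all assumptions.

```python
def monotonic_runs(closes):
    """Split a closing-price series into maximal monotonic runs.

    Returns a list of (start_index, end_index, direction) tuples, where
    direction is +1 for a rising run and -1 for a falling run. Flat days
    extend the run that is currently open (an assumption of this sketch).
    """
    runs = []
    start = 0
    direction = 0
    for i in range(1, len(closes)):
        change = closes[i] - closes[i - 1]
        step = (change > 0) - (change < 0)
        if step == 0:
            continue
        if direction == 0:
            direction = step
        elif step != direction:
            # Direction flipped: close the run at the previous day.
            runs.append((start, i - 1, direction))
            start = i - 1
            direction = step
    if direction != 0:
        runs.append((start, len(closes) - 1, direction))
    return runs

# Hypothetical daily closes.
closes = [10.0, 11.0, 12.0, 11.0, 10.0, 10.0, 13.0]
runs = monotonic_runs(closes)
```

Because the rule has no tunable parameters or randomness, the same input always yields the same segmentation, which is the kind of auditability the abstract emphasizes.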
[LG-40] SeBERTis: A Framework for Producing Classifiers of Security-Related Issue Reports
Link: https://arxiv.org/abs/2512.15003
Authors: Sogol Masoumzadeh, Yufei Li, Shane McIntosh, Dániel Varró, Lili Wei
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: This is the author pre-print. The manuscript has been accepted for publication at SANER 2026!
Abstract:Monitoring issue tracker submissions is a crucial software maintenance activity. A key goal is the prioritization of high-risk, security-related bugs. If such bugs can be recognized early, the risk of propagation to dependent products and endangerment of stakeholder benefits can be mitigated. To assist triage engineers with this task, several automatic detection techniques, from Machine Learning (ML) models to prompting Large Language Models (LLMs), have been proposed. Although promising to some extent, prior techniques often memorize lexical cues as decision shortcuts, yielding a low detection rate, particularly for more complex submissions. As such, these classifiers do not yet reach the practical expectations of a real-time detector of security-related issues. To address these limitations, we propose SEBERTIS, a framework to train Deep Neural Networks (DNNs) as classifiers independent of lexical cues, so that they can confidently detect fully unseen security-related issues. SEBERTIS capitalizes on fine-tuning bidirectional transformer architectures as Masked Language Models (MLMs) on a set of vocabulary terms that are semantically equivalent to the prediction labels (which we call Semantic Surrogates), after these terms have been replaced with a mask. Our SEBERTIS-trained classifier achieves a 0.9880 F1-score in detecting security-related issues in a curated corpus of 10,000 GitHub issue reports, substantially outperforming state-of-the-art issue classifiers, with 14.44%-96.98%, 15.40%-93.07%, and 14.90%-94.72% higher detection precision, recall, and F1-score over ML-based baselines. Our classifier also substantially surpasses LLM baselines, with an improvement of 23.20%-63.71%, 36.68%-85.63%, and 39.49%-74.53% for precision, recall, and F1-score.
[LG-41] Adaptive Partitioning and Learning for Stochastic Control of Diffusion Processes
Link: https://arxiv.org/abs/2512.14991
Authors: Hanqing Jin, Renyuan Xu, Yanzhao Yang
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Portfolio Management (q-fin.PM)
Abstract:We study reinforcement learning for controlled diffusion processes with unbounded continuous state spaces, bounded continuous actions, and polynomially growing rewards: settings that arise naturally in finance, economics, and operations research. To overcome the challenges of continuous and high-dimensional domains, we introduce a model-based algorithm that adaptively partitions the joint state-action space. The algorithm maintains estimators of drift, volatility, and rewards within each partition, refining the discretization whenever estimation bias exceeds statistical confidence. This adaptive scheme balances exploration and approximation, enabling efficient learning in unbounded domains. Our analysis establishes regret bounds that depend on the problem horizon, state dimension, reward growth order, and a newly defined notion of zooming dimension tailored to unbounded diffusion processes. The bounds recover existing results for bounded settings as a special case, while extending theoretical guarantees to a broader class of diffusion-type problems. Finally, we validate the effectiveness of our approach through numerical experiments, including applications to high-dimensional problems such as multi-asset mean-variance portfolio selection.
[LG-42] Softly Constrained Denoisers for Diffusion Models
Link: https://arxiv.org/abs/2512.14980
Authors: Victor M. Yeom Song, Severi Rissanen, Arno Solin, Samuel Kaski, Mingfei Sun
Subjects: Machine Learning (cs.LG)
Comments: 18 pages including appendix, 8 figures including appendix, preprint
Abstract:Diffusion models struggle to produce samples that respect constraints, a common requirement in scientific applications. Recent approaches have introduced regularization terms in the loss or guidance methods during sampling to enforce such constraints, but they bias the generative model away from the true data distribution. This is a problem, especially when the constraint is misspecified, a common issue when formulating constraints on scientific data. In this paper, instead of changing the loss or the sampling loop, we integrate a guidance-inspired adjustment into the denoiser itself, giving it a soft inductive bias towards constraint-compliant samples. We show that these softly constrained denoisers exploit constraint knowledge to improve compliance over standard denoisers, and maintain enough flexibility to deviate from it when there is misspecification with observed data.
[LG-43] Deep Learning and Elicitability for McKean-Vlasov FBSDEs With Common Noise
Link: https://arxiv.org/abs/2512.14967
Authors: Felipe J. P. Antunes, Yuri F. Saporito, Sebastian Jaimungal
Subjects: Machine Learning (cs.LG); Computational Finance (q-fin.CP); Mathematical Finance (q-fin.MF)
Comments: 17 pages, 7 figures
Abstract:We present a novel numerical method for solving McKean-Vlasov forward-backward stochastic differential equations (MV-FBSDEs) with common noise, combining Picard iterations, elicitability and deep learning. The key innovation involves elicitability to derive a path-wise loss function, enabling efficient training of neural networks to approximate both the backward process and the conditional expectations arising from common noise, without requiring computationally expensive nested Monte Carlo simulations. The mean-field interaction term is parameterized via a recurrent neural network trained to minimize an elicitable score, while the backward process is approximated through a feedforward network representing the decoupling field. We validate the algorithm on a systemic risk inter-bank borrowing and lending model, where analytical solutions exist, demonstrating accurate recovery of the true solution. We further extend the model to quantile-mediated interactions, showcasing the flexibility of the elicitability framework beyond conditional means or moments. Finally, we apply the method to a non-stationary Aiyagari–Bewley–Huggett economic growth model with endogenous interest rates, illustrating its applicability to complex mean-field games without closed-form solutions.
[LG-44] Intrusion Detection in Internet of Vehicles Using Machine Learning
Link: https://arxiv.org/abs/2512.14958
Authors: Hop Le,Izzat Alsmadi
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Abstract:The Internet of Vehicles (IoV) has transformed modern transportation through enhanced connectivity and intelligent systems. However, this increased connectivity introduces critical vulnerabilities, making vehicles susceptible to cyber-attacks such as Denial-of-Service (DoS) and message spoofing. This project aims to develop a machine learning-based intrusion detection system to classify malicious Controller Area Network (CAN) bus traffic using the CiCIoV2024 benchmark dataset. We analyzed various attack patterns, including DoS and spoofing attacks targeting critical vehicle parameters: Spoofing-GAS (gas pedal position), Spoofing-RPM, Spoofing-Speed, and Spoofing-Steering_Wheel. Our initial findings confirm a multi-class classification problem with a clear structural difference between attack types and benign data, providing a strong foundation for machine learning models.
[LG-45] Boundary condition enforcement with PINNs: a comparative study and verification on 3D geometries
Link: https://arxiv.org/abs/2512.14941
Authors: Conor Rowan,Kai Hampleman,Kurt Maute,Alireza Doostan
Subjects: Numerical Analysis (math.NA); Machine Learning (cs.LG)
Abstract:Since their advent nearly a decade ago, physics-informed neural networks (PINNs) have been studied extensively as a novel technique for solving forward and inverse problems in physics and engineering. The neural network discretization of the solution field is naturally adaptive and avoids meshing the computational domain, which can both improve the accuracy of the numerical solution and streamline implementation. However, there have been limited studies of PINNs on complex three-dimensional geometries, as the lack of mesh and the reliance on the strong form of the partial differential equation (PDE) make boundary condition (BC) enforcement challenging. Techniques to enforce BCs with PINNs have proliferated in the literature, but a comprehensive side-by-side comparison of these techniques and a study of their efficacy on geometrically complex three-dimensional test problems are lacking. In this work, we i) systematically compare BC enforcement techniques for PINNs, ii) propose a general solution framework for arbitrary three-dimensional geometries, and iii) verify the methodology on three-dimensional, linear and nonlinear test problems with combinations of Dirichlet, Neumann, and Robin boundaries. Our approach is agnostic to the underlying PDE, the geometry of the computational domain, and the nature of the BCs, while requiring minimal hyperparameter tuning. This work represents a step in the direction of establishing PINNs as a mature numerical method, capable of competing head-to-head with incumbents such as the finite element method.
[LG-46] Cloud Security Leveraging AI: A Fusion-Based AISOC for Malware and Log Behaviour Detection
Link: https://arxiv.org/abs/2512.14935
Authors: Nnamdi Philip Okonkwo,Lubna Luxmi Dhirani
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Abstract:Cloud Security Operations Centers (SOCs) enable cloud governance, risk and compliance by providing insights, visibility and control. A cloud SOC triages high-volume, heterogeneous telemetry from elastic, short-lived resources while staying within tight budgets. In this research, we implement an AI-Augmented Security Operations Center (AISOC) on AWS that combines cloud-native instrumentation with ML-based detection. The architecture uses three Amazon EC2 instances: Attacker, Defender, and Monitoring. We simulate a reverse-shell intrusion with Metasploit, and Filebeat forwards Defender logs to an Elasticsearch and Kibana stack for analysis. We train two classifiers, a malware detector built on a public dataset and a log-anomaly detector trained on synthetically augmented logs that include adversarial variants. We calibrate and fuse the scores to produce multi-modal threat intelligence and triage activity into NORMAL, SUSPICIOUS, and HIGH_CONFIDENCE_ATTACK. On held-out tests the fusion achieves strong macro-F1 (up to 1.00) under controlled conditions, though performance will vary in noisier and more diverse environments. These results indicate that simple, calibrated fusion can enhance cloud SOC capabilities in constrained, cost-sensitive setups.
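The score fusion and three-way triage described in this abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the weighted-average fusion rule, the weight `w_malware`, and the thresholds `suspicious_at` and `attack_at` are all assumptions, since the abstract does not state them. Both inputs are assumed to be calibrated probabilities in [0, 1].

```python
def fuse_scores(malware_prob, log_anomaly_prob, w_malware=0.5):
    """Weighted average of two calibrated probabilities (assumed fusion rule)."""
    return w_malware * malware_prob + (1.0 - w_malware) * log_anomaly_prob


def triage(fused, suspicious_at=0.4, attack_at=0.8):
    """Map a fused score to one of the three triage labels (thresholds are hypothetical)."""
    if fused >= attack_at:
        return "HIGH_CONFIDENCE_ATTACK"
    if fused >= suspicious_at:
        return "SUSPICIOUS"
    return "NORMAL"


# Both detectors agree on an attack, disagree, and agree on benign activity.
labels = [
    triage(fuse_scores(0.9, 0.9)),
    triage(fuse_scores(0.9, 0.1)),
    triage(fuse_scores(0.1, 0.1)),
]
```

With these assumed settings, one confident detector alone is not enough to reach the highest label; in practice that trade-off would be tuned on validation data.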
[LG-47] Low-rank MMSE filters Kronecker-product representation and regularization: a new perspective
Link: https://arxiv.org/abs/2512.14932
Authors: Daniel Gomes de Pinho Zanco,Leszek Szczecinski,Jacob Benesty,Eduardo Vinicius Kuhn
Subjects: Machine Learning (cs.LG)
Abstract:In this work, we propose a method to efficiently find the regularization parameter for low-rank MMSE filters based on a Kronecker-product representation. We show that the regularization parameter is surprisingly linked to the problem of rank selection and, thus, that choosing it properly is crucial in low-rank settings. The proposed method is validated through simulations, showing significant gains over commonly used methods.
[LG-48] ATLAS: Adaptive Topology-based Learning at Scale for Homophilic and Heterophilic Graphs
Link: https://arxiv.org/abs/2512.14908
Authors: Turja Kundu,Sanjukta Bhowmick
Subjects: Machine Learning (cs.LG)
Comments: Preprint
Abstract:We present ATLAS (Adaptive Topology-based Learning at Scale for Homophilic and Heterophilic Graphs), a novel graph learning algorithm that addresses two important challenges in graph neural networks (GNNs). First, the accuracy of GNNs degrades when the graph is heterophilic. Second, iterative feature aggregation limits the scalability of GNNs to large graphs. We address these challenges by extracting topological information about graph communities at multiple levels of refinement, concatenating community assignments to the feature vector, and applying multilayer perceptrons (MLPs) to the resulting representation. This provides topological context about nodes and their neighborhoods without invoking aggregation. Because MLPs are typically more scalable than GNNs, our approach applies to large graphs without the need for sampling. Across a wide set of graphs, ATLAS achieves comparable accuracy to baseline methods, with gains as high as 20 percentage points over GCN for heterophilic graphs with negative structural bias and 11 percentage points over MLP for homophilic graphs. Furthermore, we show how multi-resolution community features systematically modulate performance in both homophilic and heterophilic settings, opening a principled path toward explainable graph learning.
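The feature construction described in this abstract (community assignments at several levels of refinement concatenated to each node's feature vector, then passed to an MLP) can be sketched as below. This is an illustration under assumptions, not the ATLAS code: the one-hot encoding, the function names, and the toy assignments are hypothetical, and the community detection step itself is taken as given.

```python
def one_hot(index, size):
    """One-hot encode a community id."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def augment_features(features, levels):
    """Append one-hot community memberships to each node's features.

    features: list of per-node feature lists.
    levels: list of assignments, one per refinement level; each assignment
            holds one community id per node.
    """
    augmented = [list(row) for row in features]
    for assignment in levels:
        n_communities = max(assignment) + 1
        for node, community in enumerate(assignment):
            augmented[node].extend(one_hot(community, n_communities))
    return augmented


# Toy graph with four nodes, a coarse and a fine partition (hypothetical values).
features = [[0.2], [0.5], [0.9], [0.1]]
coarse = [0, 0, 1, 1]
fine = [0, 1, 2, 2]
augmented = augment_features(features, [coarse, fine])
```

The augmented rows can then be fed to any MLP; no neighbor aggregation is needed at training or inference time, which is the scalability argument the abstract makes.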
[LG-49] How Does Fourier Analysis Network Work? A Mechanism Analysis and a New Dual-Activation Layer Proposal
Link: https://arxiv.org/abs/2512.14873
Authors: Sam Jeong,Hae Yong Kim
Subjects: Machine Learning (cs.LG)
Abstract:Fourier Analysis Network (FAN) was recently proposed as a simple way to improve neural network performance by replacing part of ReLU activations with sine and cosine functions. Although several studies have reported small but consistent gains across tasks, the underlying mechanism behind these improvements has remained unclear. In this work, we show that only the sine activation contributes positively to performance, whereas the cosine activation tends to be detrimental. Our analysis reveals that the improvement is not a consequence of the sine function’s periodic nature; instead, it stems from the function’s local behavior near x = 0, where its non-zero derivative mitigates the vanishing-gradient problem. We further show that FAN primarily alleviates the dying-ReLU problem, in which a neuron consistently receives negative inputs, produces zero gradients, and stops learning. Although modern ReLU-like activations, such as Leaky ReLU, GELU, and Swish, reduce ReLU’s zero-gradient region, they still contain input domains where gradients remain significantly diminished, contributing to slower optimization and hindering rapid convergence. FAN addresses this limitation by introducing a more stable gradient pathway. This analysis shifts the understanding of FAN’s benefits from a spectral interpretation to a concrete analysis of training dynamics, leading to the development of the Dual-Activation Layer (DAL), a more efficient convergence accelerator. We evaluate DAL on three tasks: classification of noisy sinusoidal signals versus pure noise, MNIST digit classification, and ECG-based biometric recognition. In all cases, DAL models converge faster and achieve equal or higher validation accuracy compared to models with conventional activations.
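The gradient argument in this abstract can be checked numerically with a small sketch. It compares the derivative of ReLU with the derivative of the sine activation for slightly negative pre-activations, the regime in which a ReLU unit stops learning. The input values are assumptions chosen for illustration, and this is not the paper's Dual-Activation Layer.

```python
import math


def relu_grad(x):
    """Derivative of ReLU (taken as 0 at x = 0 by convention)."""
    return 1.0 if x > 0 else 0.0


def sine_grad(x):
    """Derivative of sin(x)."""
    return math.cos(x)


# Slightly negative pre-activations: the dying-ReLU regime.
inputs = [-0.5, -0.1, -0.01]
relu_grads = [relu_grad(x) for x in inputs]
sine_grads = [sine_grad(x) for x in inputs]
```

For these inputs the ReLU gradients are all zero while the sine gradients stay close to 1, which matches the abstract's claim that the benefit comes from local behavior near x = 0 and not from periodicity.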
[LG-50] Unreliable Uncertainty Estimates with Monte Carlo Dropout
Link: https://arxiv.org/abs/2512.14851
Authors: Aslak Djupskås,Alexander Johannes Stasik,Signe Riemer-Sørensen
Subjects: Machine Learning (cs.LG)
Comments: Accepted for the Northern Lights Deep Learning 2026
Abstract:Reliable uncertainty estimation is crucial for machine learning models, especially in safety-critical domains. While exact Bayesian inference offers a principled approach, it is often computationally infeasible for deep neural networks. Monte Carlo dropout (MCD) was proposed as an efficient approximation to Bayesian inference in deep learning by applying neuron dropout at inference time (Gal and Ghahramani, 2016). Hence, the method generates multiple sub-models yielding a distribution of predictions to estimate uncertainty. We empirically investigate its ability to capture true uncertainty and compare it to Gaussian Processes (GP) and Bayesian Neural Networks (BNN). We find that MCD struggles to accurately reflect the underlying true uncertainty, particularly failing to capture increased uncertainty in extrapolation and interpolation regions as observed in Bayesian models. The findings suggest that uncertainty estimates from MCD, as implemented and evaluated in these experiments, are not as reliable as those from traditional Bayesian approaches for capturing epistemic and aleatoric uncertainty.
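The mechanism under study, keeping dropout active at inference time and summarizing repeated stochastic passes, can be sketched with the standard library alone. This is a deliberately tiny stand-in: a single linear unit with inverted dropout on its inputs. The weights, inputs, sample count, and function names are assumptions for illustration and do not reproduce the paper's experiments.

```python
import random
import statistics


def dropout_forward(x, weights, drop_prob, rng):
    """One stochastic pass of a linear unit with inverted dropout on its inputs."""
    keep = 1.0 - drop_prob
    total = 0.0
    for xi, wi in zip(x, weights):
        if rng.random() < keep:
            total += wi * xi / keep  # rescale so the expected output is unchanged
    return total


def mc_dropout_predict(x, weights, drop_prob=0.5, n_samples=200, seed=0):
    """Return the mean and standard deviation over repeated dropout passes."""
    rng = random.Random(seed)
    preds = [dropout_forward(x, weights, drop_prob, rng) for _ in range(n_samples)]
    return statistics.mean(preds), statistics.stdev(preds)


# Hypothetical input and weights.
mean_pred, std_pred = mc_dropout_predict([1.0, 2.0, 3.0], [0.5, -0.25, 0.1])
```

The spread `std_pred` is what MCD reports as uncertainty. Note that it depends only on the dropout rate, the weights, and the input, which is one way to see why it need not grow in regions far from the training data.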
[LG-51] Evaluating Weather Forecasts from a Decision Makers Perspective
Link: https://arxiv.org/abs/2512.14779
Authors: Kornelius Raeth,Nicole Ludwig
Subjects: Machine Learning (cs.LG); Applications (stat.AP)
Abstract:Standard weather forecast evaluations focus on the forecaster’s perspective and on a statistical assessment comparing forecasts and observations. In practice, however, forecasts are used to make decisions, so it seems natural to take the decision-maker’s perspective and quantify the value of a forecast by its ability to improve decision-making. Decision calibration provides a novel framework for evaluating forecast performance at the decision level rather than the forecast level. We evaluate decision calibration to compare Machine Learning and classical numerical weather prediction models on various weather-dependent decision tasks. We find that model performance at the forecast level does not reliably translate to performance in downstream decision-making: some performance differences only become apparent at the decision level, and model rankings can change among different decision tasks. Our results confirm that typical forecast evaluations are insufficient for selecting the optimal forecast model for a specific decision task.
[LG-52] Compute the edge p-Laplacian centrality for air traffic network
Link: https://arxiv.org/abs/2512.14749
Authors: Loc Hoang Tran,Bao Nguyen Tran,Luong Anh Tuan Nguyen
Subjects: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
Comments: 7 pages
Abstract:The problem addressed in this paper is computing the edge p-Laplacian centrality for the air traffic network. Because computing the edge p-Laplacian centrality directly is a very hard problem, we first convert the air traffic network to its line graph. We then compute the node p-Laplacian centrality of the line graph, which is equivalent to the edge p-Laplacian centrality of the air traffic network. To this end, a novel un-normalized graph (p-)Laplacian based ranking method is developed from the un-normalized graph p-Laplacian operator definitions, such as the curvature operator of the graph (i.e. the un-normalized graph 1-Laplacian operator), and is used to compute the node p-Laplacian centrality of the line graph. The experimental results show that the un-normalized graph p-Laplacian ranking methods can be implemented successfully.
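The line-graph conversion that this abstract relies on can be sketched in a few lines: every edge of the original network becomes a node, and two such nodes are linked when the original edges share an endpoint. The node labels below are placeholders, and the p-Laplacian ranking step itself is not shown.

```python
from itertools import combinations


def line_graph(edges):
    """Build the line graph of an undirected graph.

    Returns (nodes, links): each node is an original edge, and a link joins
    two original edges that share an endpoint.
    """
    nodes = [tuple(sorted(edge)) for edge in edges]
    links = [(e1, e2) for e1, e2 in combinations(nodes, 2) if set(e1) & set(e2)]
    return nodes, links


# Toy path network A-B-C-D (placeholder labels, not real routes).
routes = [("A", "B"), ("B", "C"), ("C", "D")]
nodes, links = line_graph(routes)
```

Any node-ranking method applied to `nodes` and `links` then yields a ranking of the original edges, which is the reduction the abstract describes.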
[LG-53] Inference Time Feature Injection: A Lightweight Approach for Real-Time Recommendation Freshness
Link: https://arxiv.org/abs/2512.14734
Authors: Qiang Chen,Venkatesh Ganapati Hegde,Hongfei Li
Subjects: Machine Learning (cs.LG)
Comments: 3rd IEEE International Conference on Artificial Intelligence, Blockchain, and Internet of Things, September 06-07, 2025, Central Michigan University, USA
Abstract:Many recommender systems in long-form video streaming rely on batch-trained models and batch-updated features, where user features are updated daily and served statically throughout the day. While efficient, this approach fails to incorporate a user’s most recent actions, often resulting in stale recommendations. In this work, we present a lightweight, model-agnostic approach for intra-day personalization that selectively injects recent watch history at inference time without requiring model retraining. Our approach selectively overrides stale user features at inference time using the recent watch history, allowing the system to adapt instantly to evolving preferences. By reducing the personalization feedback loop from daily to intra-day, we observed a statistically significant 0.47% increase in key user engagement metrics, which ranked among the most substantial engagement gains observed in recent experimentation cycles. To our knowledge, this is the first published evidence that intra-day personalization can drive meaningful impact in long-form video streaming services, providing a compelling alternative to full real-time architectures where model retraining is required.
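The inference-time override described in this abstract can be sketched as a small function that replaces the stale watch-history feature with recent events before the unchanged model is called. The feature key, the newest-first merge policy, the truncation length, and all names here are assumptions for illustration; the abstract does not specify them.

```python
def inject_recent_history(batch_features, recent_watch_ids,
                          history_key="watch_history", max_items=5):
    """Return a copy of the daily batch features with fresh watch history injected.

    recent_watch_ids is assumed to be ordered newest first. Older batch items
    are kept after the recent ones, without duplicates, up to max_items.
    """
    fresh = dict(batch_features)
    if not recent_watch_ids:
        return fresh  # nothing new: serve the batch features as they are
    merged = list(recent_watch_ids)
    for item in batch_features.get(history_key, []):
        if item not in merged:
            merged.append(item)
    fresh[history_key] = merged[:max_items]
    return fresh


# Hypothetical daily snapshot and intra-day events.
daily = {"user_id": "u1", "watch_history": ["show_a", "show_b", "show_c"]}
fresh = inject_recent_history(daily, ["show_z", "show_b"], max_items=4)
```

Because only the feature dictionary changes, the same batch-trained model can be served without retraining, which is the point the abstract emphasizes.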
[LG-54] Where to Explore: A Reach and Cost-Aware Approach for Unbiased Data Collection in Recommender Systems
Link: https://arxiv.org/abs/2512.14733
Authors: Qiang Chen,Venkatesh Ganapati Hegde
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: IEEE Conference on Cognitive Machine Intelligence Nov. 11-14, 2025, Pittsburgh, PA, USA
Abstract:Exploration is essential to improve long-term recommendation quality, but it often degrades short-term business performance, especially in remote-first TV environments where users engage passively, expect instant relevance, and offer few chances for correction. This paper introduces an approach for delivering content-level exploration safely and efficiently by optimizing its placement based on reach and opportunity cost. Deployed on a large-scale streaming platform with over 100 million monthly active users, our approach identifies scroll-depth regions with lower engagement and strategically introduces a dedicated container, the “Something Completely Different” row containing randomized content. Rather than enforcing exploration uniformly across the user interface (UI), we condition its appearance on empirically low-cost, high-reach positions to ensure minimal tradeoff against platform-level watch time goals. Extensive A/B testing shows that this strategy preserves business metrics while collecting unbiased interaction data. Our method complements existing intra-row diversification and bandit-based exploration techniques by introducing a deployable, behaviorally informed mechanism for surfacing exploratory content at scale. Moreover, we demonstrate that the collected unbiased data, integrated into downstream candidate generation, significantly improves user engagement, validating its value for recommender systems.
[LG-55] A data-driven approach to inferring travel trajectory during peak hours in urban rail transit systems
Link: https://arxiv.org/abs/2512.14728
Authors: Jie He,Yong Qin,Jianyuan Guo,Xuan Sun,Xuanchuan Zheng
Subjects: Machine Learning (cs.LG)
Abstract:Refined trajectory inference of urban rail transit is of great significance to the operation organization. In this paper, we develop a fully data-driven approach to inferring individual travel trajectories in urban rail transit systems. It utilizes data from the Automatic Fare Collection (AFC) and Automatic Vehicle Location (AVL) systems to infer key trajectory elements, such as selected train, access/egress time, and transfer time. The approach includes establishing train alternative sets based on spatio-temporal constraints, data-driven adaptive trajectory inference, and travel trajectory construction. To realize data-driven adaptive trajectory inference, a data-driven parameter estimation method based on KL divergence combined with the EM algorithm (KLEM) is proposed. This method eliminates the reliance on external or survey data for parameter fitting, enhancing the robustness and applicability of the model. Furthermore, to overcome the limitations of using synthetic data to validate the result, this paper employs real individual travel trajectory data for verification. The results show that the approach developed in this paper can achieve high-precision passenger trajectory inference, with an accuracy rate of over 90% in urban rail transit travel trajectory inference during peak hours.
[LG-56] Automatic Extraction of Rules for Generating Synthetic Patient Data From Real-World Population Data Using Glioblastoma as an Example
Link: https://arxiv.org/abs/2512.14721
Authors: Arno Appenzeller,Nick Terzer,André Hohmeyer,Jan-Philipp Redlich,Sabine Luttmann,Friedrich Feuerhake,Nadine S. Schaadt,Timm Intemann,Sarah Teuber-Hanselmann,Stefan Nikolin,Joachim Weis,Klaus Kraywinkel,Pascal Birnstill
Subjects: Machine Learning (cs.LG)
Comments: 16 pages, 8 figures
Abstract:The generation of synthetic data is a promising technology to make medical data available for secondary use in a privacy-compliant manner. A popular method for creating realistic patient data is the rule-based Synthea data generator. Synthea generates data based on rules describing the lifetime of a synthetic patient. These rules typically express the probability of a condition occurring, such as a disease, depending on factors like age. Since they only contain statistical information, rules usually have no specific data protection requirements. However, creating meaningful rules can be a very complex process that requires expert knowledge and realistic sample data. In this paper, we introduce and evaluate an approach to automatically generate Synthea rules based on statistics from tabular data, which we extracted from cancer reports. As an example use case, we created a Synthea module for glioblastoma from a real-world dataset and used it to generate a synthetic dataset. Compared to the original dataset, the synthetic data reproduced known disease courses and mostly retained the statistical properties. Overall, synthetic patient data holds great potential for privacy-preserving research. The data can be used to formulate hypotheses and to develop prototypes, but medical interpretation should consider the specific limitations as with any currently available approach.
[LG-57] Is GPT-OSS All You Need? Benchmarking Large Language Models for Financial Intelligence and the Surprising Efficiency Paradox
Link: https://arxiv.org/abs/2512.14717
Authors: Ziqian Bi,Danyang Zhang,Junhao Song,Chiung-Yi Tseng
Subjects: Machine Learning (cs.LG)
Abstract:The rapid adoption of large language models in financial services necessitates rigorous evaluation frameworks to assess their performance, efficiency, and practical applicability. This paper conducts a comprehensive evaluation of the GPT-OSS model family alongside contemporary LLMs across ten diverse financial NLP tasks. Through extensive experimentation on 120B and 20B parameter variants of GPT-OSS, we reveal a counterintuitive finding: the smaller GPT-OSS-20B model achieves comparable accuracy (65.1% vs 66.5%) while demonstrating superior computational efficiency with 198.4 Token Efficiency Score and 159.80 tokens per second processing speed [1]. Our evaluation encompasses sentiment analysis, question answering, and entity recognition tasks using real-world financial datasets including Financial PhraseBank, FiQA-SA, and FLARE FINERORD. We introduce novel efficiency metrics that capture the trade-off between model performance and resource utilization, providing critical insights for deployment decisions in production environments. The benchmark reveals that GPT-OSS models consistently outperform larger competitors including Qwen3-235B, challenging the prevailing assumption that model scale directly correlates with task performance [2]. Our findings demonstrate that architectural innovations and training strategies in GPT-OSS enable smaller models to achieve competitive performance with significantly reduced computational overhead, offering a pathway toward sustainable and cost-effective deployment of LLMs in financial applications.
[LG-58] A Bayesian latent class reinforcement learning framework to capture adaptive feedback-driven travel behaviour
Link: https://arxiv.org/abs/2512.14713
Authors: Georges Sfeir,Stephane Hess,Thomas O. Hancock,Filipe Rodrigues,Jamal Amani Rad,Michiel Bliemer,Matthew Beck,Fayyaz Khan
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 32 pages, 8 figures, 6 tables
Abstract:Many travel decisions involve a degree of experience formation, where individuals learn their preferences over time. At the same time, there is extensive scope for heterogeneity across individual travellers, both in their underlying preferences and in how these evolve. The present paper puts forward a Latent Class Reinforcement Learning (LCRL) model that allows analysts to capture both of these phenomena. We apply the model to a driving simulator dataset and estimate the parameters through Variational Bayes. We identify three distinct classes of individuals that differ markedly in how they adapt their preferences: the first displays context-dependent preferences with context-specific exploitative tendencies; the second follows a persistent exploitative strategy regardless of context; and the third engages in an exploratory strategy combined with context-specific preferences.
[LG-59] SGEMAS: A Self-Growing Ephemeral Multi-Agent System for Unsupervised Online Anomaly Detection via Entropic Homeostasis
Link: https://arxiv.org/abs/2512.14708
Authors: Mustapha Hamdi (InnoDeep)
Subjects: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Abstract:Current deep learning approaches for physiological signal monitoring suffer from static topologies and constant energy consumption. We introduce SGEMAS (Self-Growing Ephemeral Multi-Agent System), a bio-inspired architecture that treats intelligence as a dynamic thermodynamic process. By coupling a structural plasticity mechanism (agent birth/death) to a variational free energy objective, the system naturally evolves to minimize prediction error with extreme sparsity. An ablation study on the MIT-BIH Arrhythmia Database reveals that adding a multi-scale instability index to the agent dynamics significantly improves performance. In a challenging inter-patient, zero-shot setting, the final SGEMAS v3.3 model achieves a mean AUC of 0.570 ± 0.070, outperforming both its simpler variants and a standard autoencoder baseline. This result validates that a physics-based, energy-constrained model can achieve robust unsupervised anomaly detection, offering a promising direction for efficient biomedical AI.
[LG-60] The Graph-Embedded Hazard Model (GEHM): Stochastic Network Survival Dynamics on Economic Graphs
Link: https://arxiv.org/abs/2512.14705
Authors: Diego Vallarino
Subjects: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
Abstract:This paper develops a nonlinear evolution framework for modelling survival dynamics on weighted economic networks by coupling a graph-based p-Laplacian diffusion operator with a stochastic structural drift. The resulting finite-dimensional PDE–SDE system captures how node-level survival reacts to nonlinear diffusion pressures while an aggregate complexity factor evolves according to an Itô process. Using accretive operator theory, nonlinear semigroup methods, and stochastic analysis, we establish existence and uniqueness of mild solutions, derive topology-dependent energy dissipation inequalities, and characterise the stability threshold separating dissipative, critical, amplifying, and explosive regimes. Numerical experiments on Barabási–Albert networks confirm that hub dominance magnifies nonlinear gradients and compresses stability margins, producing heavy-tailed survival distributions and occasional explosive behaviour.
[LG-61] High-Dimensional Partial Least Squares: Spectral Analysis and Fundamental Limitations
Link: https://arxiv.org/abs/2512.15684
Authors: Victor Léger,Florent Chatelain
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Abstract:Partial Least Squares (PLS) is a widely used method for data integration, designed to extract latent components shared across paired high-dimensional datasets. Despite decades of practical success, a precise theoretical understanding of its behavior in high-dimensional regimes remains limited. In this paper, we study a data integration model in which two high-dimensional data matrices share a low-rank common latent structure while also containing individual-specific components. We analyze the singular vectors of the associated cross-covariance matrix using tools from random matrix theory and derive asymptotic characterizations of the alignment between estimated and true latent directions. These results provide a quantitative explanation of the reconstruction performance of the PLS variant based on Singular Value Decomposition (PLS-SVD) and identify regimes where the method exhibits counter-intuitive or limiting behavior. Building on this analysis, we compare PLS-SVD with principal component analysis applied separately to each dataset and show its asymptotic superiority in detecting the common latent subspace. Overall, our results offer a comprehensive theoretical understanding of high-dimensional PLS-SVD, clarifying both its advantages and fundamental limitations.
[LG-62] Prospects for quantum advantage in machine learning from the representability of functions
Link: https://arxiv.org/abs/2512.15661
Authors: Sergi Masot-Llima,Elies Gil-Fuster,Carlos Bravo-Prieto,Jens Eisert,Tommaso Guaita
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 21 pages, 6 figures, comments welcome
Abstract:Demonstrating quantum advantage in machine learning tasks requires navigating a complex landscape of proposed models and algorithms. To bring clarity to this search, we introduce a framework that connects the structure of parametrized quantum circuits to the mathematical nature of the functions they can actually learn. Within this framework, we show how fundamental properties, like circuit depth and non-Clifford gate count, directly determine whether a model’s output leads to efficient classical simulation or surrogation. We argue that this analysis uncovers common pathways to dequantization that underlie many existing simulation methods. More importantly, it reveals critical distinctions between models that are fully simulatable, those whose function space is classically tractable, and those that remain robustly quantum. This perspective provides a conceptual map of this landscape, clarifying how different models relate to classical simulability and pointing to where opportunities for quantum advantage may lie.
[LG-63] Learning continuous SOC-dependent thermal decomposition kinetics for Li-ion cathodes using KA-CRNNs
Link: https://arxiv.org/abs/2512.15628
Authors: Benjamin C. Koenig,Sili Deng
Subjects: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
Comments: 17 pages, 10 figures, 2 tables
Abstract:Thermal runaway in lithium-ion batteries is strongly influenced by the state of charge (SOC). Existing predictive models typically infer scalar kinetic parameters at a full SOC or a few discrete SOC levels, preventing them from capturing the continuous SOC dependence that governs exothermic behavior during abuse conditions. To address this, we apply the Kolmogorov-Arnold Chemical Reaction Neural Network (KA-CRNN) framework to learn continuous and realistic SOC-dependent exothermic cathode-electrolyte interactions. We apply a physics-encoded KA-CRNN to learn SOC-dependent kinetic parameters for cathode-electrolyte decomposition directly from differential scanning calorimetry (DSC) data. A mechanistically informed reaction pathway is embedded into the network architecture, enabling the activation energies, pre-exponential factors, enthalpies, and related parameters to be represented as continuous and fully interpretable functions of the SOC. The framework is demonstrated for NCA, NM, and NMA cathodes, yielding models that reproduce DSC heat-release features across all SOCs and provide interpretable insight into SOC-dependent oxygen-release and phase-transformation mechanisms. This approach establishes a foundation for extending kinetic parameter dependencies to additional environmental and electrochemical variables, supporting more accurate and interpretable thermal-runaway prediction and monitoring.
[LG-64] A Teacher-Student Perspective on the Dynamics of Learning Near the Optimal Point
Link: https://arxiv.org/abs/2512.15606
Authors: Carlos Couto,José Mourão,Mário A. T. Figueiredo,Pedro Ribeiro
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 25 pages, 9 figures
Abstract:Near an optimal learning point of a neural network, the learning performance of gradient descent dynamics is dictated by the Hessian matrix of the loss function with respect to the network parameters. We characterize the Hessian eigenspectrum for some classes of teacher-student problems, when the teacher and student networks have matching weights, showing that the smaller eigenvalues of the Hessian determine long-time learning performance. For linear networks, we analytically establish that for large networks the spectrum asymptotically follows a convolution of a scaled chi-square distribution with a scaled Marchenko-Pastur distribution. We numerically analyse the Hessian spectrum for polynomial and other non-linear networks. Furthermore, we show that the rank of the Hessian matrix can be seen as an effective number of parameters for networks using polynomial activation functions. For a generic non-linear activation function, such as the error function, we empirically observe that the Hessian matrix is always full rank.
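The statement that the smaller Hessian eigenvalues determine long-time learning performance follows from a standard fact about gradient descent on a quadratic loss: along an eigen-direction with eigenvalue λ, each step multiplies the error by (1 − ηλ), where η is the learning rate. The sketch below illustrates this with made-up eigenvalues and a made-up learning rate; it is not taken from the paper.

```python
def error_after(eigenvalue, learning_rate, steps):
    """Remaining error fraction along one Hessian eigen-direction
    after gradient descent on a quadratic loss."""
    return abs(1.0 - learning_rate * eigenvalue) ** steps


# Hypothetical spectrum: one stiff, one moderate, one nearly flat direction.
eigenvalues = [1.0, 0.1, 0.001]
remaining = {
    lam: error_after(lam, learning_rate=0.5, steps=1000) for lam in eigenvalues
}
slowest = max(remaining, key=remaining.get)
```

After 1000 steps the stiff directions have converged to numerical zero while the direction with the smallest eigenvalue still retains most of its error, so it dominates the late-time loss.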
[LG-65] Photonics-Enhanced Graph Convolutional Networks
Link: https://arxiv.org/abs/2512.15549
Authors: Yuan Wang,Oleksandr Kyriienko
Subjects: Optics (physics.optics); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Quantum Physics (quant-ph)
Comments: 12 pages, 6 figures
Abstract:Photonics can offer a hardware-native route for machine learning (ML). However, efficient deployment of photonics-enhanced ML requires hybrid workflows that integrate optical processing with conventional CPU/GPU based neural network architectures. Here, we propose such a workflow that combines photonic positional embeddings (PEs) with advanced graph ML models. We introduce a photonics-based method that augments graph convolutional networks (GCNs) with PEs derived from light propagation on synthetic frequency lattices whose couplings match the input graph. We simulate propagation and readout to obtain internode intensity correlation matrices, which are used as PEs in GCNs to provide global structural information. Evaluated on Long Range Graph Benchmark molecular datasets, the method outperforms baseline GCNs with Laplacian based PEs, achieving 6.3% lower mean absolute error for regression and 2.3% higher average precision for classification tasks using a two-layer GCN as a baseline. When implemented in high repetition rate photonic hardware, correlation measurements can enable fast feature generation by bypassing digital simulation of PEs. Our results show that photonic PEs improve GCN performance and support optical acceleration of graph ML.
[LG-66] Autonomous Pressure Control in MuVacAS via Deep Reinforcement Learning and Deep Learning Surrogate Models NEURIPS2025
Link: https://arxiv.org/abs/2512.15521
Authors: Guillermo Rodriguez-Llorente,Galo Gallardo,Rodrigo Morant Navascués,Nikita Khvatkin Petrovsky,Anderson Sabogal,Roberto Gómez-Espinosa Martín
Categories: Accelerator Physics (physics.acc-ph); Machine Learning (cs.LG)
Comments: 13 pages, 7 figures, included in Machine Learning and the Physical Sciences Workshop @ NeurIPS 2025
Abstract:The development of nuclear fusion requires materials that can withstand extreme conditions. The IFMIF-DONES facility, a high-power particle accelerator, is being designed to qualify these materials. A critical testbed for its development is the MuVacAS prototype, which replicates the final segment of the accelerator beamline. Precise regulation of argon gas pressure within its ultra-high vacuum chamber is vital for this task. This work presents a fully data-driven approach for autonomous pressure control. A Deep Learning Surrogate Model, trained on real operational data, emulates the dynamics of the argon injection system. This high-fidelity digital twin then serves as a fast-simulation environment to train a Deep Reinforcement Learning agent. The results demonstrate that the agent successfully learns a control policy that maintains gas pressure within strict operational limits despite dynamic disturbances. This approach marks a significant step toward the intelligent, autonomous control systems required for the demanding next-generation particle accelerator facilities.
[LG-67] Online Partitioned Local Depth for semi-supervised applications
Link: https://arxiv.org/abs/2512.15436
Authors: John D. Foley,Justin T. Lee
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 19 pages, 2 figures
Abstract:We introduce an extension of the partitioned local depth (PaLD) algorithm that is adapted to online applications such as semi-supervised prediction. The new algorithm we present, online PaLD, is well-suited to situations where it is possible to pre-compute a cohesion network from a reference dataset. After O(n^3) steps to construct a queryable data structure, online PaLD can extend the cohesion network to a new data point in O(n^2) time. Our approach complements previous speed-up approaches based on approximation and parallelism. For illustration, we present applications to online anomaly detection and semi-supervised classification for health-care datasets.
[LG-68] ColliderML: The First Release of an OpenDataDetector High-Luminosity Physics Benchmark Dataset
Link: https://arxiv.org/abs/2512.15230
Authors: Doğa Elitez,Paul Gessinger,Daniel Murnane,Marcus Selchou Raaholt,Andreas Salzburger,Stine Kofoed Skov,Andreas Stefl,Anna Zaborowska
Categories: High Energy Physics - Experiment (hep-ex); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an); Instrumentation and Detectors (physics.ins-det)
Comments: 28 pages
Abstract:We introduce ColliderML - a large, open, experiment-agnostic dataset of fully simulated and digitised proton-proton collisions in High-Luminosity Large Hadron Collider conditions ($\sqrt{s} = 14$ TeV, mean pile-up $\mu = 200$). ColliderML provides one million events across ten Standard Model and Beyond Standard Model processes, plus extensive single-particle samples, all produced with modern next-to-leading order matrix element calculation and showering, realistic per-event pile-up overlay, a validated OpenDataDetector geometry, and standard reconstructions. The release fills a major gap for machine learning (ML) research on detector-level data, provided on the ML-friendly Hugging Face platform. We present physics coverage and the generation, simulation, digitisation and reconstruction pipeline, describe format and access, and report initial collider physics benchmarks.
[LG-69] Adaptive Weighted Genetic Algorithm-Optimized SVR for Robust Long-Term Forecasting of Global Stock Indices for investment decisions
Link: https://arxiv.org/abs/2512.15113
Authors: Mohit Beniwal
Categories: Computational Finance (q-fin.CP); Machine Learning (cs.LG)
Comments:
Abstract:Long-term price forecasting remains a formidable challenge due to the inherent uncertainty over the long term, despite some success in short-term predictions. Nonetheless, accurate long-term forecasts are essential for high-net-worth individuals, institutional investors, and traders. The proposed improved genetic algorithm-optimized support vector regression (IGA-SVR) model is specifically designed for long-term price prediction of global indices. The performance of the IGA-SVR model is rigorously evaluated and compared against two state-of-the-art baseline models: Long Short-Term Memory (LSTM) and the forward-validating genetic algorithm optimized support vector regression (OGA-SVR). Extensive testing was conducted on five global indices, namely Nifty, Dow Jones Industrial Average (DJI), DAX Performance Index (DAX), Nikkei 225 (N225), and Shanghai Stock Exchange Composite Index (SSE), from 2021 to 2024, with daily price predictions up to a year ahead. Overall, the proposed IGA-SVR model reduced MAPE by 19.87% compared to LSTM and 50.03% compared to OGA-SVR, demonstrating its superior performance in long-term daily price forecasting of global indices. Further, the execution time for LSTM was approximately 20 times higher than that of IGA-SVR, highlighting the high accuracy and computational efficiency of the proposed model. The genetic algorithm selects the optimal hyperparameters of SVR by minimizing the arithmetic mean of the Mean Absolute Percentage Error (MAPE) calculated over the full training dataset and the most recent five years of training data. This purposefully designed training methodology adjusts for recent trends while retaining long-term trend information, thereby offering enhanced generalization compared to the LSTM and rolling-forward validation approach employed by OGA-SVR, which forgets long-term trends and suffers from recency bias.
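The objective described in the abstract (the arithmetic mean of MAPE over the full training set and over the most recent window) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names `mape` and `ga_fitness` and the parameter `recent_window` are hypothetical, and the window is given as a number of observations rather than as five calendar years.

```python
from typing import Sequence


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean Absolute Percentage Error, returned in percent."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


def ga_fitness(actual: Sequence[float],
               predicted: Sequence[float],
               recent_window: int) -> float:
    """Hypothetical fitness: mean of full-sample MAPE and recent-window MAPE.

    `recent_window` is an assumed parameter giving how many of the latest
    observations count as "recent"; lower fitness is better.
    """
    if not 1 <= recent_window <= len(actual):
        raise ValueError("recent_window must be between 1 and len(actual)")
    full_error = mape(actual, predicted)
    recent_error = mape(actual[-recent_window:], predicted[-recent_window:])
    return (full_error + recent_error) / 2.0


# Toy series: the last two points stand in for the "recent" period.
prices = [100.0, 200.0, 400.0, 500.0]
forecast = [110.0, 180.0, 400.0, 450.0]
score = ga_fitness(prices, forecast, recent_window=2)
```

A genetic algorithm would evaluate this score for each candidate set of SVR hyperparameters and keep the candidates with the lowest values. Averaging the two terms weights recent data more heavily without discarding the long-term history.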
[LG-70] Efficient Nudged Elastic Band Method using Neural Network Bayesian Algorithm Execution
Link: https://arxiv.org/abs/2512.14993
Authors: Pranav Kakhandiki,Sathya Chitturi,Daniel Ratner,Sean Gasiorowski
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments: 21 pages, 12 figures
Abstract:The discovery of a minimum energy pathway (MEP) between metastable states is crucial for scientific tasks including catalyst and biomolecular design. However, the standard nudged elastic band (NEB) algorithm requires hundreds to tens of thousands of compute-intensive simulations, making applications to complex systems prohibitively expensive. We introduce Neural Network Bayesian Algorithm Execution (NN-BAX), a framework that jointly learns the energy landscape and the MEP. NN-BAX sequentially fine-tunes a foundation model by actively selecting samples targeted at improving the MEP. Tested on Lennard-Jones and Embedded Atom Method systems, our approach achieves a one- to two-order-of-magnitude reduction in energy and force evaluations with negligible loss in MEP accuracy and demonstrates scalability to 100-dimensional systems. This work is therefore a promising step towards removing the computational barrier for MEP discovery in scientifically relevant systems, suggesting that weeks-long calculations may be completed in hours or days with minimal loss in accuracy.
[LG-71] Deep learning water-unsuppressed MRSI at ultra-high field for simultaneous quantitative metabolic susceptibility and myelin water imaging
Link: https://arxiv.org/abs/2512.14929
Authors: Paul J. Weiser,Jiye Kim,Jongho Lee,Amirmohammad Shamaei,Gulnur Ungan,Malte Hoffmann,Antoine Klauser,Berkin Bilgic,Ovidiu C. Andronesi
Categories: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
Comments:
Abstract:Purpose: Magnetic Resonance Spectroscopic Imaging (MRSI) maps endogenous brain metabolism while suppressing the overwhelming water signal. Water-unsuppressed MRSI (wu-MRSI) allows simultaneous imaging of water and metabolites, but large water sidebands cause challenges for metabolic fitting. We developed an end-to-end deep-learning pipeline to overcome these challenges at ultra-high field. Methods: Fast high-resolution wu-MRSI was acquired at 7T with non-Cartesian ECCENTRIC sampling and ultra-short echo time. A water and lipid removal network (WALINET+) was developed to remove lipids, water signal, and sidebands. MRSI reconstruction was performed by DeepER and a physics-informed network for metabolite fitting. Water signal was used for absolute metabolite quantification, quantitative susceptibility mapping (QSM), and myelin water fraction imaging (MWF). Results: WALINET+ provided the lowest NRMSE ( 2%) in simulations and in vivo the smallest bias ( 20%) and limits-of-agreement (±63%) between wu-MRSI and ws-MRSI scans. Several metabolites such as creatine and glutamate showed higher SNR in wu-MRSI. QSM and MWF obtained from wu-MRSI and GRE showed good agreement with 0 ppm/5.5% bias and ±0.05 ppm/±12.75% limits-of-agreement. Conclusion: High-quality metabolic, QSM, and MWF mapping of the human brain can be obtained simultaneously by ECCENTRIC wu-MRSI at 7T with 2 mm isotropic resolution in 12 min. WALINET+ robustly removes water sidebands while preserving metabolite signal, eliminating the need for water suppression and separate water acquisitions.
Information Retrieval
[IR-0] MedNuggetizer: Confidence-Based Information Nugget Extraction from Medical Documents ECIR2026
Link: https://arxiv.org/abs/2512.15384
Authors: Gregor Donabauer,Samy Ateia,Udo Kruschwitz,Maximilian Burger,Matthias May,Christian Gilfrich,Maximilian Haas,Julio Ruben Rodas Garzaro,Christoph Eckl
Categories: Information Retrieval (cs.IR)
Comments: Preprint accepted at ECIR 2026
Abstract:We present MedNuggetizer (access is available upon request), a tool for query-driven extraction and clustering of information nuggets from medical documents to support clinicians in exploring underlying medical evidence. Backed by a large language model (LLM), MedNuggetizer performs repeated extractions of information nuggets that are then grouped to generate reliable evidence within and across multiple documents. We demonstrate its utility on the clinical use case of antibiotic prophylaxis before prostate biopsy by using major urological guidelines and recent PubMed studies as sources of information. Evaluation by domain experts shows that MedNuggetizer provides clinicians and researchers with an efficient way to explore long documents and easily extract reliable, query-focused medical evidence.
[IR-1] ArcBERT: An LLM-based Search Engine for Exploring Integrated Multi-Omics Metadata
Link: https://arxiv.org/abs/2512.15365
Authors: Gajendra Doniparthi,Shashank Balu Pandhare,Stefan Deßloch,Timo Mühlhaus
Categories: Databases (cs.DB); Information Retrieval (cs.IR)
Comments:
Abstract:Traditional search applications within Research Data Management (RDM) ecosystems are crucial in helping users discover and explore the structured metadata from the research datasets. Typically, text search engines require users to submit keyword-based queries rather than using natural language. However, using Large Language Models (LLMs) trained on domain-specific content for specialized natural language processing (NLP) tasks is becoming increasingly common. We present ArcBERT, an LLM-based system designed for integrated metadata exploration. ArcBERT understands natural language queries and relies on semantic matching, unlike traditional search applications. Notably, ArcBERT also understands the structure and hierarchies within the metadata, enabling it to handle diverse user querying patterns effectively.

