This post contains the latest papers fetched from Arxiv.org on 2025-12-10, updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive the digest by email on a schedule, please leave your email address in the comments.
Note: paper data is fetched from Arxiv.org daily and updated automatically around 12:00 each day.
Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.
Table of Contents
Overview (2025-12-10)
500 papers were updated today, including:
- Natural Language Processing: 31 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 141 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 132 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 143 papers (Machine Learning (cs.LG))
Natural Language Processing
[NLP-0] Revisiting the Scaling Properties of Downstream Metrics in Large Language Model Training
[Quick Read]: This paper tackles the problem of accurately predicting the downstream-task performance of large language models (LLMs). Traditional approaches rely on proxy metrics such as pretraining loss, which extrapolate real-task performance unreliably. The key idea is a framework that models benchmark performance directly as a function of the training budget: for a fixed token-to-parameter ratio, log accuracy follows a simple power law, which extrapolates more stably than the earlier two-stage procedure. The authors further introduce functional forms that predict accuracy across token-to-parameter ratios and inference compute, improving both predictive accuracy and practical utility.
Link: https://arxiv.org/abs/2512.08894
Authors: Jakub Krajewski, Amitis Shidani, Dan Busbridge, Sam Wiseman, Jason Ramapuram
Affiliations: Apple
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:While scaling laws for Large Language Models (LLMs) traditionally focus on proxy metrics like pretraining loss, predicting downstream task performance has been considered unreliable. This paper challenges that view by proposing a direct framework to model the scaling of benchmark performance from the training budget. We find that for a fixed token-to-parameter ratio, a simple power law can accurately describe the scaling behavior of log accuracy on multiple popular downstream tasks. Our results show that the direct approach extrapolates better than the previously proposed two-stage procedure, which is prone to compounding errors. Furthermore, we introduce functional forms that predict accuracy across token-to-parameter ratios and account for inference compute under repeated sampling. We validate our findings on models with up to 17B parameters trained on up to 350B tokens across two dataset mixtures. To support reproducibility and encourage future research, we release the complete set of pretraining losses and downstream evaluation results.
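To make the headline claim concrete, here is a minimal Python sketch of fitting a power law to log accuracy versus training budget at a fixed token-to-parameter ratio. The data points and the exact functional form log(acc) = -a * c^(-b) are illustrative assumptions, not the paper's fitted law:

```python
import numpy as np
from scipy.optimize import curve_fit

# Toy observations: benchmark accuracy at five training budgets (in units of
# some reference compute C0), all at the same token-to-parameter ratio.
compute = np.array([1.0, 3.0, 10.0, 30.0, 100.0])
accuracy = np.array([0.42, 0.48, 0.55, 0.61, 0.67])

def log_acc_power_law(c, a, b):
    # Assumed form: log(accuracy) = -a * c**(-b)
    return -a * np.power(c, -b)

(a, b), _ = curve_fit(log_acc_power_law, compute, np.log(accuracy), p0=(0.8, 0.2))

# The point of a direct scaling law: extrapolate beyond the fitted budgets.
c_new = 1000.0
pred = np.exp(log_acc_power_law(c_new, a, b))
print(f"a={a:.3f}, b={b:.3f}, predicted accuracy at c={c_new:.0f}: {pred:.3f}")
```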
[NLP-1] Toward Faithful Retrieval-Augmented Generation with Sparse Autoencoders
[Quick Read]: This paper addresses faithfulness failures in Retrieval-Augmented Generation (RAG) systems, where generated content contradicts or extends beyond the retrieved evidence. Existing detection methods either train dedicated detectors on large annotated datasets or query external large language models (LLMs) as judges, both computationally expensive, while methods based on LLM internal representations remain insufficiently accurate. The authors propose RAGLens, a lightweight detection framework built on sparse autoencoders (SAEs). Its core innovation is to use mechanistic-interpretability techniques to disentangle features in LLM internal activations that are strongly associated with RAG hallucinations, combined with information-based feature selection and additive feature modeling, yielding accurate, low-overhead, and interpretable hallucination detection that can inform post-hoc mitigation.
Link: https://arxiv.org/abs/2512.08892
Authors: Guangzhi Xiong, Zhenghao He, Bohan Liu, Sanchit Sinha, Aidong Zhang
Affiliations: University of Virginia
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Retrieval-Augmented Generation (RAG) improves the factuality of large language models (LLMs) by grounding outputs in retrieved evidence, but faithfulness failures, where generations contradict or extend beyond the provided sources, remain a critical challenge. Existing hallucination detection methods for RAG often rely either on large-scale detector training, which requires substantial annotated data, or on querying external LLM judges, which leads to high inference costs. Although some approaches attempt to leverage internal representations of LLMs for hallucination detection, their accuracy remains limited. Motivated by recent advances in mechanistic interpretability, we employ sparse autoencoders (SAEs) to disentangle internal activations, successfully identifying features that are specifically triggered during RAG hallucinations. Building on a systematic pipeline of information-based feature selection and additive feature modeling, we introduce RAGLens, a lightweight hallucination detector that accurately flags unfaithful RAG outputs using LLM internal representations. RAGLens not only achieves superior detection performance compared to existing methods, but also provides interpretable rationales for its decisions, enabling effective post-hoc mitigation of unfaithful RAG. Finally, we justify our design choices and reveal new insights into the distribution of hallucination-related signals within LLMs. The code is available at this https URL.
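As a toy illustration of the pipeline sketched above (SAE features, information-based selection, additive modeling), here is a self-contained Python sketch on synthetic data; the real RAGLens uses a pretrained SAE over actual LLM activations:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d_model, d_sae = 512, 64, 256

# Stand-ins: LLM hidden states and an SAE encoder (ReLU of a linear map).
hidden = rng.normal(size=(n, d_model))
W_enc = rng.normal(size=(d_model, d_sae)) * 0.1
sae_features = np.maximum(hidden @ W_enc, 0.0)      # sparse, non-negative codes

# Synthetic faithfulness labels correlated with a few SAE features.
labels = (sae_features[:, :3].sum(axis=1)
          + rng.normal(scale=0.5, size=n) > 0.9).astype(int)

# Information-based feature selection: keep the k most informative features.
mi = mutual_info_classif(sae_features, labels, random_state=0)
top_k = np.argsort(mi)[-16:]

# Additive feature model: logistic regression over selected features, whose
# per-feature coefficients double as interpretable rationales.
clf = LogisticRegression(max_iter=1000).fit(sae_features[:, top_k], labels)
print("train accuracy:", round(clf.score(sae_features[:, top_k], labels), 3))
```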
[NLP-2] Do Depth-Grown Models Overcome the Curse of Depth? An In-Depth Analysis
[Quick Read]: This paper investigates whether depth-grown Transformers overcome the "Curse of Depth": in standard pre-layernorm Transformers, the second half of the layers contributes far less to the final output distribution than the first half, limiting depth utilization and thus training efficiency and reasoning performance. The key finding is that progressive depth growth (as in MIDAS), which gradually inserts layers via middle stacking, leads to more effective use of model depth, alters the residual-stream structure, and facilitates the formation of permutable computational blocks, ultimately improving downstream reasoning performance.
Link: https://arxiv.org/abs/2512.08819
Authors: Ferdinand Kapl, Emmanouil Angelis, Tobias Höppe, Kaitlin Maile, Johannes von Oswald, Nino Scherrer, Stefan Bauer
Affiliations: Technical University of Munich; Helmholtz AI; Google
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Gradually growing the depth of Transformers during training can not only reduce training cost but also lead to improved reasoning performance, as shown by MIDAS (Saunshi et al., 2024). Thus far, however, a mechanistic understanding of these gains has been missing. In this work, we establish a connection to recent work showing that layers in the second half of non-grown, pre-layernorm Transformers contribute much less to the final output distribution than those in the first half - also known as the Curse of Depth (Sun et al., 2025, Csordás et al., 2025). Using depth-wise analyses, we demonstrate that growth via gradual middle stacking yields more effective utilization of model depth, alters the residual stream structure, and facilitates the formation of permutable computational blocks. In addition, we propose a lightweight modification of MIDAS that yields further improvements in downstream reasoning benchmarks. Overall, this work highlights how the gradual growth of model depth can lead to the formation of distinct computational circuits and overcome the limited depth utilization seen in standard non-grown models.
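The growth operator itself is simple to sketch. Below is a toy version of gradual middle stacking: a deeper stack is grown by duplicating the middle block of layers, so central layers mature before extra depth is added. The layer type and growth schedule are illustrative, not the MIDAS implementation:

```python
import copy
import torch.nn as nn

def grow_middle(layers: nn.ModuleList, k: int) -> nn.ModuleList:
    """Duplicate the middle k layers of the stack (middle stacking)."""
    n = len(layers)
    start = (n - k) // 2
    middle = [copy.deepcopy(layers[i]) for i in range(start, start + k)]
    return nn.ModuleList(list(layers[:start + k]) + middle + list(layers[start + k:]))

make_layer = lambda: nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
stack = nn.ModuleList([make_layer() for _ in range(4)])
for _ in range(2):                 # two growth steps: 4 -> 6 -> 8 layers
    stack = grow_middle(stack, k=2)
print(len(stack), "layers after growth")
```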
[NLP-3] Ask Answer and Detect: Role-Playing LLMs for Personality Detection with Question-Conditioned Mixture-of-Experts
[Quick Read]: This paper addresses two challenges in personality detection from social-media text: scarce labels, which limit supervision, and the under-specified semantic mapping between user language and abstract psychological constructs. The proposed framework, ROME, leverages the role-play capability of large language models (LLMs) to simulate user responses to standardized psychometric questionnaires, converting free-form posts into interpretable, questionnaire-grounded evidence. On top of this, a question-conditioned Mixture-of-Experts module jointly models post and questionnaire-answer representations, and questionnaire answering serves as an auxiliary task within a multi-task learning framework, substantially improving both the performance and interpretability of personality prediction.
Link: https://arxiv.org/abs/2512.08814
Authors: Yifan Lyu, Liang Zhang
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); University of International Business and Economics
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Understanding human personality is crucial for web applications such as personalized recommendation and mental health assessment. Existing studies on personality detection predominantly adopt a “posts - user vector - labels” modeling paradigm, which encodes social media posts into user representations for predicting personality labels (e.g., MBTI labels). While recent advances in large language models (LLMs) have improved text encoding capacities, these approaches remain constrained by limited supervision signals due to label scarcity, and under-specified semantic mappings between user language and abstract psychological constructs. We address these challenges by proposing ROME, a novel framework that explicitly injects psychological knowledge into personality detection. Inspired by standardized self-assessment tests, ROME leverages LLMs’ role-play capability to simulate user responses to validated psychometric questionnaires. These generated question-level answers transform free-form user posts into interpretable, questionnaire-grounded evidence linking linguistic cues to personality labels, thereby providing rich intermediate supervision to mitigate label scarcity while offering a semantic reasoning chain that guides and simplifies the text-to-personality mapping learning. A question-conditioned Mixture-of-Experts module then jointly routes over post and question representations, learning to answer questionnaire items under explicit supervision. The predicted answers are summarized into an interpretable answer vector and fused with the user representation for final prediction within a multi-task learning framework, where question answering serves as a powerful auxiliary task for personality detection. Extensive experiments on two real-world datasets demonstrate that ROME consistently outperforms state-of-the-art baselines, achieving improvements (15.41% on Kaggle dataset).
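For readers curious what a question-conditioned Mixture-of-Experts looks like in code, here is a minimal PyTorch sketch in the spirit of ROME: the gate sees both the post representation and a questionnaire-item embedding, so different items route to different experts. All dimensions and the two-layer experts are assumptions for illustration:

```python
import torch
import torch.nn as nn

class QuestionConditionedMoE(nn.Module):
    def __init__(self, d_post=32, d_q=16, n_experts=4, d_hidden=64):
        super().__init__()
        self.gate = nn.Linear(d_post + d_q, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_post + d_q, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, 1))   # one answer logit per item
            for _ in range(n_experts))

    def forward(self, post_repr, question_repr):
        x = torch.cat([post_repr, question_repr], dim=-1)
        weights = torch.softmax(self.gate(x), dim=-1)             # (B, E)
        outs = torch.stack([e(x) for e in self.experts], dim=-1)  # (B, 1, E)
        return (outs * weights.unsqueeze(1)).sum(-1)              # (B, 1)

moe = QuestionConditionedMoE()
print(moe(torch.randn(8, 32), torch.randn(8, 16)).shape)  # torch.Size([8, 1])
```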
[NLP-4] A Systematic Evaluation of Preference Aggregation in Federated RLHF for Pluralistic Alignment of LLMs
[Quick Read]: This paper addresses the challenge of aligning large language models (LLMs) with diverse human preferences in federated learning (FL) settings, where standard methods often under-represent the viewpoints of different groups. The key contribution is a comprehensive evaluation framework that systematically measures the trade-off between alignment quality and fairness across reward-aggregation strategies. Each participating group evaluates rollouts locally and produces reward signals; the server aggregates only group-level rewards without accessing raw data. The authors evaluate standard aggregation schemes (min, max, and average) and introduce a novel adaptive mechanism that dynamically adjusts preference weights based on each group's historical alignment performance. Experiments show the adaptive scheme achieves markedly better fairness while maintaining competitive alignment scores, offering a practical path toward truly pluralistic, fairly aligned LLMs.
Link: https://arxiv.org/abs/2512.08786
Authors: Mahmoud Srewa, Tianyu Zhao, Salma Elmalaki
Affiliations: University of California, Irvine
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:This paper addresses the challenge of aligning large language models (LLMs) with diverse human preferences within federated learning (FL) environments, where standard methods often fail to adequately represent diverse viewpoints. We introduce a comprehensive evaluation framework that systematically assesses the trade-off between alignment quality and fairness when using different aggregation strategies for human preferences. In our federated setting, each group locally evaluates rollouts and produces reward signals, and the server aggregates these group-level rewards without accessing any raw data. Specifically, we evaluate standard reward aggregation techniques (min, max, and average) and introduce a novel adaptive scheme that dynamically adjusts preference weights based on a group’s historical alignment performance. Our experiments on question-answering (Q/A) tasks using a PPO-based RLHF pipeline demonstrate that our adaptive approach consistently achieves superior fairness while maintaining competitive alignment scores. This work offers a robust methodology for evaluating LLM behavior across diverse populations and provides a practical solution for developing truly pluralistic and fairly aligned models.
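The aggregation strategies under comparison are easy to state precisely. A minimal sketch, where the adaptive rule (inverse historical alignment, normalized) is an illustrative assumption rather than the authors' exact formula:

```python
import numpy as np

def aggregate(group_rewards, how, history=None):
    """group_rewards: per-group rewards for one rollout, shape (n_groups,)."""
    if how == "min":
        return group_rewards.min()       # protect the worst-off group
    if how == "max":
        return group_rewards.max()
    if how == "mean":
        return group_rewards.mean()
    if how == "adaptive":
        # Groups with lower historical alignment get larger weights.
        w = 1.0 / (history + 1e-6)
        return float((w / w.sum()) @ group_rewards)
    raise ValueError(how)

rewards = np.array([0.9, 0.4, 0.7])      # current rollout, three groups
history = np.array([0.85, 0.50, 0.70])   # running alignment score per group
for how in ("min", "max", "mean", "adaptive"):
    print(how, round(float(aggregate(rewards, how, history)), 3))
```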
[NLP-5] Fluent Alignment with Disfluent Judges: Post-training for Lower-resource Languages
[Quick Read]: This paper addresses the problem that preference optimization for lower-resource languages lacks fluent reward models, causing aligned language models to produce disfluent text. Such languages typically lack both annotated data written by native speakers and language models capable of generating fluent synthetic data. The key contribution is an on-policy (policy-gradient) post-training method that preserves fluency during preference alignment without requiring any instruction-tuning or machine-translated data in the target language; a case study on Norwegian Bokmål, evaluated through native-speaker assessments, confirms its effectiveness.
Link: https://arxiv.org/abs/2512.08777
Authors: David Samuel, Lilja Øvrelid, Erik Velldal, Andrey Kutuzov
Affiliations: Language Technology Group, University of Oslo
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:We propose a post-training method for lower-resource languages that preserves fluency of language models even when aligned by disfluent reward models. Preference-optimization is now a well-researched topic, but previous work has mostly addressed models for English and Chinese. Lower-resource languages lack both datasets written by native speakers and language models capable of generating fluent synthetic data. Thus, in this work, we focus on developing a fluent preference-aligned language model without any instruction-tuning data in the target language. Our approach uses an on-policy training method, which we compare with two common approaches: supervised finetuning on machine-translated data and multilingual finetuning. We conduct a case study on Norwegian Bokmål and evaluate fluency through native-speaker assessments. The results show that the on-policy aspect is crucial and outperforms the alternatives without relying on any hard-to-obtain data.
[NLP-6] Pose-Based Sign Language Spotting via an End-to-End Encoder Architecture AACL
[Quick Read]: This paper tackles Sign Language Spotting, a largely unexplored problem in sign language recognition: detecting whether a query sign video occurs within a continuous signing sequence. Conventional approaches rely on intermediate gloss recognition or text matching, which are complex and brittle. The key idea is an end-to-end model with an encoder-only backbone that performs binary classification directly on pose keypoints extracted from sign videos, avoiding the computational cost and visual noise of raw RGB frames and thereby improving both efficiency and accuracy.
Link: https://arxiv.org/abs/2512.08738
Authors: Samuel Ebimobowei Johnny, Blessed Guda, Emmanuel Enejo Aaron, Assane Gueye
Affiliations: Carnegie Mellon University Africa; Carnegie Mellon University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: To appear at AACL-IJCNLP 2025 Workshop WSLP
Abstract:Automatic Sign Language Recognition (ASLR) has emerged as a vital field for bridging the gap between deaf and hearing communities. However, the problem of sign-to-sign retrieval or detecting a specific sign within a sequence of continuous signs remains largely unexplored. We define this novel task as Sign Language Spotting. In this paper, we present a first step toward sign language retrieval by addressing the challenge of detecting the presence or absence of a query sign video within a sentence-level gloss or sign video. Unlike conventional approaches that rely on intermediate gloss recognition or text-based matching, we propose an end-to-end model that directly operates on pose keypoints extracted from sign videos. Our architecture employs an encoder-only backbone with a binary classification head to determine whether the query sign appears within the target sequence. By focusing on pose representations instead of raw RGB frames, our method significantly reduces computational cost and mitigates visual noise. We evaluate our approach on the Word Presence Prediction dataset from the WSLP 2025 shared task, achieving 61.88% accuracy and 60.00% F1-score. These results demonstrate the effectiveness of our pose-based framework for Sign Language Spotting, establishing a strong foundation for future research in automatic sign language retrieval and verification. Code is available at this https URL
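A minimal PyTorch sketch of the encoder-only, pose-based design described above: query and target keypoint sequences are projected, concatenated around a learned separator, encoded, and pooled into a single presence logit. Keypoint counts and sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class PoseSpotter(nn.Module):
    def __init__(self, n_keypoints=54, d_model=128):
        super().__init__()
        self.proj = nn.Linear(n_keypoints * 2, d_model)  # (x, y) per keypoint
        self.sep = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, 1)                # presence/absence logit

    def forward(self, query_pose, target_pose):
        # query_pose: (B, Tq, K*2), target_pose: (B, Tt, K*2)
        q, t = self.proj(query_pose), self.proj(target_pose)
        sep = self.sep.expand(q.size(0), -1, -1)
        x = self.encoder(torch.cat([q, sep, t], dim=1))
        return self.head(x.mean(dim=1))                  # pooled binary logit

model = PoseSpotter()
logit = model(torch.randn(2, 16, 108), torch.randn(2, 120, 108))
print(torch.sigmoid(logit).shape)  # torch.Size([2, 1])
```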
[NLP-7] Automatic Essay Scoring and Feedback Generation in Basque Language Learning LREC2026
[Quick Read]: This paper addresses the lack of public datasets and effective models for automatic essay scoring (AES) and feedback generation in low-resource languages such as Basque. The core contribution is the first public dataset of CEFR C1-level Basque essays: 3,200 essays annotated by experts along the dimensions of correctness, richness, coherence, cohesion, and task alignment, enriched with detailed feedback and error examples. Supervised fine-tuning (SFT) of the open-source Latxa 8B/70B models surpasses closed-source state-of-the-art systems (GPT-5 and Claude Sonnet 4.5) in scoring consistency and feedback quality, and the authors propose a novel feedback-evaluation methodology combining automatic consistency metrics with expert validation, ensuring the generated feedback is pedagogically meaningful and covers a wide range of error types.
Link: https://arxiv.org/abs/2512.08713
Authors: Ekhi Azurmendi, Xabier Arregi, Oier Lopez de Lacalle
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Submitted to LREC 2026
Abstract:This paper introduces the first publicly available dataset for Automatic Essay Scoring (AES) and feedback generation in Basque, targeting the CEFR C1 proficiency level. The dataset comprises 3,200 essays from HABE, each annotated by expert evaluators with criterion specific scores covering correctness, richness, coherence, cohesion, and task alignment enriched with detailed feedback and error examples. We fine-tune open-source models, including RoBERTa-EusCrawl and Latxa 8B/70B, for both scoring and explanation generation. Our experiments show that encoder models remain highly reliable for AES, while supervised fine-tuning (SFT) of Latxa significantly enhances performance, surpassing state-of-the-art (SoTA) closed-source systems such as GPT-5 and Claude Sonnet 4.5 in scoring consistency and feedback quality. We also propose a novel evaluation methodology for assessing feedback generation, combining automatic consistency metrics with expert-based validation of extracted learner errors. Results demonstrate that the fine-tuned Latxa model produces criterion-aligned, pedagogically meaningful feedback and identifies a wider range of error types than proprietary models. This resource and benchmark establish a foundation for transparent, reproducible, and educationally grounded NLP research in low-resource languages such as Basque.
[NLP-8] An Agentic AI System for Multi-Framework Communication Coding
Link: https://arxiv.org/abs/2512.08659
Authors: Bohao Yang, Rui Yang, Joshua M. Biro, Haoyuan Wang, Jessica L. Handley, Brianna Richardson, Sophia Bessias, Nicoleta Economou-Zavlanos, Armando D. Bedoya, Monica Agrawal, Michael M. Zavlanos, Anand Chowdhury, Raj M. Ratwani, Kai Sun, Kathryn I. Pollak, Michael J. Pencina, Chuan Hong
Affiliations: unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
[NLP-9] QSTN: A Modular Framework for Robust Questionnaire Inference with Large Language Models
[Quick Read]: This paper addresses the lack of systematic evaluation and reproducibility in how large language models (LLMs) generate responses to questionnaire-style prompts, in particular how questionnaire presentation, prompt perturbations, and response-generation methods affect alignment with human answers. The key contribution is QSTN, an open-source Python framework that supports questionnaire-based in-silico surveys and annotation tasks, enables efficient evaluation of these design variables, and provides a no-code user interface that lowers the barrier to running robust experiments, substantially improving the reliability and reproducibility of LLM-based research.
Link: https://arxiv.org/abs/2512.08646
Authors: Maximilian Kreutner, Jens Rupprecht, Georg Ahnert, Ahmed Salem, Markus Strohmaier
Affiliations: University of Mannheim; GESIS - Leibniz Institute for the Social Sciences; CSH Vienna
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: The Python package is available at this https URL
Abstract:We introduce QSTN, an open-source Python framework for systematically generating responses from questionnaire-style prompts to support in-silico surveys and annotation tasks with large language models (LLMs). QSTN enables robust evaluation of questionnaire presentation, prompt perturbations, and response generation methods. Our extensive evaluation ( 40 million survey responses) shows that question structure and response generation methods have a significant impact on the alignment of generated survey responses with human answers, and can be obtained for a fraction of the compute cost. In addition, we offer a no-code user interface that allows researchers to set up robust experiments with LLMs without coding knowledge. We hope that QSTN will support the reproducibility and reliability of LLM-based research in the future.
[NLP-10] HealthcareNLP: where are we and what is next? LREC2026
[Quick Read]: This tutorial surveys healthcare-domain applications of NLP and argues that existing reviews systematically omit important topics, including synthetic data generation for privacy, explainable clinical NLP for real-world integration and deployment, retrieval-augmented generation (RAG), and the neuro-symbolic integration of large language models (LLMs) with knowledge graphs (KGs). Its key contribution is a three-layer HealthcareNLP framework: a data/resource layer (annotation guidelines, ethics approvals, governance, synthetic data); an NLP-evaluation layer (named entity recognition (NER), relation extraction (RE), sentiment analysis, and linking/coding, leading to explainable HealthAI); and a patient layer (Patient Public Involvement and Engagement (PPIE), health literacy, translation, simplification and summarization, and shared decision-making support). A hands-on session helps the audience put HealthcareNLP applications into practice.
Link: https://arxiv.org/abs/2512.08617
Authors: Lifeng Han, Paul Rayson, Suzan Verberne, Andrew Moore, Goran Nenadic
Affiliations: Leiden University Medical Centre; Lancaster University; University of Manchester; The Leiden Institute of Advanced Computer Science
Categories: Computation and Language (cs.CL)
Comments: Accepted Tutorial by LREC 2026 this https URL
Abstract:This proposed tutorial focuses on Healthcare Domain Applications of NLP, what we have achieved around HealthcareNLP, and the challenges that lie ahead for the future. Existing reviews in this domain either overlook some important tasks, such as synthetic data generation for addressing privacy concerns, or explainable clinical NLP for improved integration and implementation, or fail to mention important methodologies, including retrieval augmented generation and the neural symbolic integration of LLMs and KGs. In light of this, the goal of this tutorial is to provide an introductory overview of the most important sub-areas of a patient- and resource-oriented HealthcareNLP, with three layers of hierarchy: data/resource layer: annotation guidelines, ethical approvals, governance, synthetic data; NLP-Eval layer: NLP tasks such as NER, RE, sentiment analysis, and linking/coding with categorised methods, leading to explainable HealthAI; patients layer: Patient Public Involvement and Engagement (PPIE), health literacy, translation, simplification, and summarisation (also NLP tasks), and shared decision-making support. A hands-on session will be included in the tutorial for the audience to use HealthcareNLP applications. The target audience includes NLP practitioners in the healthcare application domain, NLP researchers who are interested in domain applications, healthcare researchers, and students from NLP fields. The type of tutorial is “Introductory to CL/NLP topics (HealthcareNLP)” and the audience does not need prior knowledge to attend this. Tutorial materials: this https URL
[NLP-11] Curriculum Guided Massive Multi Agent System Solving For Robust Long Horizon Tasks
[Quick Read]: This paper addresses two challenges that large language models (LLMs) and multi-agent systems face on long-horizon reasoning tasks: instability as task complexity grows, and computation costs that escalate with task scale. The key idea is a hierarchical multi-agent architecture that distributes reasoning across a 64x64 grid of lightweight agents supported by a selective oracle. A spatial curriculum progressively expands the agents' operational region, so central agents master easy tasks before peripheral agents take on harder ones. Negative Log-Likelihood (NLL) serves as a confidence measure, and a Thompson Sampling curriculum manager adaptively selects training zones based on competence- and NLL-driven reward signals, improving stability, reducing oracle usage, and strengthening long-range reasoning through distributed agent cooperation.
Link: https://arxiv.org/abs/2512.08545
Authors: Indrajit Kar, Kalathur Chenchu Kishore Kumar
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
Comments: 22 pages, 2 tables, 9 figures
Abstract:Large Language Models and multi-agent systems have shown promise in decomposing complex tasks, yet they struggle with long-horizon reasoning tasks and escalating computation cost. This work introduces a hierarchical multi-agent architecture that distributes reasoning across a 64*64 grid of lightweight agents, supported by a selective oracle. A spatial curriculum progressively expands the operational region of the grid, ensuring that agents master easier central tasks before tackling harder peripheral ones. To improve reliability, the system integrates Negative Log-Likelihood as a measure of confidence, allowing the curriculum to prioritize regions where agents are both accurate and well calibrated. A Thompson Sampling curriculum manager adaptively chooses training zones based on competence and NLL-driven reward signals. We evaluate the approach on a spatially grounded Tower of Hanoi benchmark, which mirrors the long-horizon structure of many robotic manipulation and planning tasks. Results demonstrate improved stability, reduced oracle usage, and stronger long-range reasoning from distributed agent cooperation.
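The curriculum manager is the most transferable piece of this design. Here is a minimal sketch of Thompson Sampling over zones with Beta posteriors, where the "reward only accurate and well-calibrated episodes" rule is an assumption paraphrasing the NLL-driven reward:

```python
import random

class ThompsonCurriculum:
    def __init__(self, n_zones):
        self.alpha = [1.0] * n_zones   # Beta posterior: successes + 1
        self.beta = [1.0] * n_zones    # Beta posterior: failures + 1

    def pick_zone(self):
        samples = [random.betavariate(a, b)
                   for a, b in zip(self.alpha, self.beta)]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, zone, correct, nll, nll_threshold=1.0):
        # Count an episode as a success only if accurate AND confident.
        if correct and nll < nll_threshold:
            self.alpha[zone] += 1
        else:
            self.beta[zone] += 1

curriculum = ThompsonCurriculum(n_zones=4)   # e.g., concentric grid rings
for _ in range(200):
    z = curriculum.pick_zone()
    correct = random.random() < 0.9 - 0.2 * z      # inner zones are easier
    nll = random.uniform(0.1, 0.5 + 0.5 * z)       # and better calibrated
    curriculum.update(z, correct, nll)
print("posterior success rates:",
      [round(a / (a + b), 2) for a, b in zip(curriculum.alpha, curriculum.beta)])
```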
[NLP-12] Beyond Real Weights: Hypercomplex Representations for Stable Quantization WACV
[Quick Read]: This paper addresses the high computational cost and deployment inefficiency of multimodal language models (MLLMs) caused by their large parameter counts, where the core challenge is compressing the model without hurting the alignment between visual features and linguistic representations. The key idea is a progressive reparameterization strategy that gradually replaces dense feed-forward network (FFN) blocks with compact Parameterized Hypercomplex Multiplication (PHM) layers, using a residual interpolation schedule together with lightweight reconstruction and knowledge-distillation losses so that the PHM modules inherit the functional behavior of their dense counterparts during training. This substantially reduces parameters and FLOPs while preserving strong multimodal alignment, yielding faster inference without degrading output quality.
Link: https://arxiv.org/abs/2512.08524
Authors: Jawad Ibn Ahad, Maisha Rahman, Amrijit Biswas, Muhammad Rafsan Kabir, Robin Krambroeckers, Sifat Momen, Nabeel Mohammed, Shafin Rahman
Affiliations: RobotBulls Labs; North South University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Accepted in Winter Conference on Applications of Computer Vision (WACV) 2026
Abstract:Multimodal language models (MLLMs) require large parameter capacity to align high-dimensional visual features with linguistic representations, making them computationally heavy and difficult to deploy efficiently. We introduce a progressive reparameterization strategy that compresses these models by gradually replacing dense feed-forward network blocks with compact Parameterized Hypercomplex Multiplication (PHM) layers. A residual interpolation schedule, together with lightweight reconstruction and knowledge distillation losses, ensures that the PHM modules inherit the functional behavior of their dense counterparts during training. This transition yields substantial parameter and FLOP reductions while preserving strong multimodal alignment, enabling faster inference without degrading output quality. We evaluate the approach on multiple vision-language models (VLMs). Our method maintains performance comparable to the base models while delivering significant reductions in model size and inference latency. Progressive PHM substitution thus offers an architecture-compatible path toward more efficient multimodal reasoning and complements existing low-bit quantization techniques.
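The PHM layer at the heart of the compression is compact enough to show in full. A self-contained sketch (initialization and sizes are illustrative): the weight matrix is a sum of n Kronecker products, cutting parameters roughly by a factor of n relative to a dense linear layer:

```python
import torch
import torch.nn as nn

class PHMLinear(nn.Module):
    def __init__(self, in_features, out_features, n=4):
        super().__init__()
        assert in_features % n == 0 and out_features % n == 0
        self.n = n
        self.A = nn.Parameter(torch.randn(n, n, n) * 0.1)  # hypercomplex rules
        self.S = nn.Parameter(
            torch.randn(n, out_features // n, in_features // n) * 0.1)
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        # W = sum_i kron(A_i, S_i): an (out, in) matrix built from small factors.
        W = sum(torch.kron(self.A[i], self.S[i]) for i in range(self.n))
        return x @ W.T + self.bias

layer = PHMLinear(256, 512, n=4)
print(layer(torch.randn(8, 256)).shape)             # torch.Size([8, 512])
phm = layer.A.numel() + layer.S.numel()
print(f"weight params: {phm} (PHM) vs {256 * 512} (dense)")
```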
[NLP-13] Soft Inductive Bias Approach via Explicit Reasoning Perspectives in Inappropriate Utterance Detection Using Large Language Models
[Quick Read]: This paper addresses the prevalence of inappropriate utterances in anonymous online games and communities, which can escalate into verbal abuse and even criminal behavior, motivating detection techniques for building a safer communication environment. The key idea is a soft inductive bias approach that explicitly defines reasoning perspectives to guide a large language model's inference, promoting rational decision-making and preventing reasoning errors. The authors fine-tune a Korean large language model with the proposed method and validate it through quantitative performance comparisons and qualitative evaluation: the Kanana-1.5 model reaches an average accuracy of 87.0046, about 3.89 percent above standard supervised learning, showing that constrained reasoning perspectives enable more precise and consistent judgments that go beyond simple knowledge imitation.
Link: https://arxiv.org/abs/2512.08480
Authors: Ju-Young Kim, Ji-Hong Park, Se-Yeon Lee, Sujin Park, Gun-Woo Kim
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: in Chinese language, Published in the Proceedings of the 37th Annual Conference on Human and Language Technology, 2025, pp. 714-719. (English translation assisted by GPT)
Abstract:Recent incidents in certain online games and communities, where anonymity is guaranteed, show that unchecked inappropriate remarks frequently escalate into verbal abuse and even criminal behavior, raising significant social concerns. Consequently, there is a growing need for research on techniques that can detect inappropriate utterances within conversational texts to help build a safer communication environment. Although large-scale language models trained on Korean corpora and chain-of-thought reasoning have recently gained attention, research applying these approaches to inappropriate utterance detection remains limited. In this study, we propose a soft inductive bias approach that explicitly defines reasoning perspectives to guide the inference process, thereby promoting rational decision-making and preventing errors that may arise during reasoning. We fine-tune a Korean large language model using the proposed method and conduct both quantitative performance comparisons and qualitative evaluations across different training strategies. Experimental results show that the Kanana-1.5 model achieves an average accuracy of 87.0046, improving by approximately 3.89 percent over standard supervised learning. These findings indicate that the proposed method goes beyond simple knowledge imitation by large language models and enables more precise and consistent judgments through constrained reasoning perspectives, demonstrating its effectiveness for inappropriate utterance detection.
[NLP-14] What Triggers my Model? Contrastive Explanations Inform Gender Choices by Translation Models
[Quick Read]: This paper seeks to move beyond measuring gender bias in machine translation (MT) and large language models (LLMs) toward exploring its origins, addressing the opacity of model decisions. The key idea is to use contrastive explanations with saliency attribution to identify which input tokens in the source sentence trigger the translation model's choice of a particular gender inflection in the target language, revealing the contextual basis of the model's gender decisions. Comparing human perceptions of gender with model attributions shows a noticeable overlap, and a complementary linguistic analysis of salient words supports the plausibility of the attributions, providing an empirical and methodological basis for interpretability-driven mitigation of gender bias.
Link: https://arxiv.org/abs/2512.08440
Authors: Janiça Hackenbuchner, Arda Tezcan, Joke Daems
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Interpretability can be implemented as a means to understand decisions taken by (black box) models, such as machine translation (MT) or large language models (LLMs). Yet, research in this area has been limited in relation to a manifested problem in these models: gender bias. With this research, we aim to move away from simply measuring bias to exploring its origins. Working with gender-ambiguous natural source data, this study examines which context, in the form of input tokens in the source sentence, influences (or triggers) the translation model choice of a certain gender inflection in the target language. To analyse this, we use contrastive explanations and compute saliency attribution. We first address the challenge of a lacking scoring threshold and specifically examine different attribution levels of source words on the model gender decisions in the translation. We compare salient source words with human perceptions of gender and demonstrate a noticeable overlap between human perceptions and model attribution. Additionally, we provide a linguistic analysis of salient words. Our work showcases the relevance of understanding model translation decisions in terms of gender, how this compares to human decisions and that this information should be leveraged to mitigate gender bias.
[NLP-15] Are generative AI text annotations systematically biased?
[Quick Read]: This paper investigates systematic bias in text annotations produced by generative large language models (GLLMs) and the downstream analysis errors caused by their disagreement with manual annotations. Through a conceptual replication using several GLLMs (Llama3.1:8b, Llama3.3:70b, GPT4o, Qwen2.5:72b) with five prompts across five concepts (political content, interactivity, rationality, incivility, and ideology), the key finding is that although GLLMs achieve adequate F1 scores, their annotations deviate substantially from manual ones in prevalence, yield different downstream results, and overlap far more with each other than with human annotations, indicating a systematic bias that differences in F1 scores fail to capture, suggesting evaluation metrics need refinement to detect distribution-level shifts in annotations.
Link: https://arxiv.org/abs/2512.08404
Authors: Sjoerd B. Stolwijk, Mark Boukes, Damian Trilling
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 pages, 6 figures, 1 table; version submitted to the International Communication Association Annual Conference in Cape Town 2026
Abstract:This paper investigates bias in GLLM annotations by conceptually replicating manual annotations of Boukes (2024). Using various GLLMs (Llama3.1:8b, Llama3.3:70b, GPT4o, Qwen2.5:72b) in combination with five different prompts for five concepts (political content, interactivity, rationality, incivility, and ideology). We find GLLMs perform adequate in terms of F1 scores, but differ from manual annotations in terms of prevalence, yield substantively different downstream results, and display systematic bias in that they overlap more with each other than with manual annotations. Differences in F1 scores fail to account for the degree of bias.
[NLP-16] Ontology-Based Knowledge Graph Framework for Industrial Standard Documents via Hierarchical and Propositional Structuring
[Quick Read]: This paper addresses the difficulty of building knowledge graphs (KGs) from industrial standard documents, whose intertwined complex rules, conditional constraints, and numerical calculations are hard to capture with conventional methods, limiting accurate representation of and reasoning over domain knowledge. The key idea is to first organize documents into a hierarchical semantic structure, then use a large language model (LLM) to decompose sentences and tables into atomic propositions and extract triples encoding conditional and numerical rules, and integrate them into an ontology-driven knowledge graph. The resulting ontology-aware KG-RAG framework achieves significant gains on multi-hop question answering, rule reasoning, and toxic-clause detection, demonstrating that reliable and scalable knowledge representation is feasible for complex industrial documents.
Link: https://arxiv.org/abs/2512.08398
Authors: Jiin Park, Hyuna Jeon, Yoonseo Lee, Jisu Hong, Misuk Kim
Affiliations: unknown
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments:
Abstract:Ontology-based knowledge graph (KG) construction is a core technology that enables multidimensional understanding and advanced reasoning over domain knowledge. Industrial standards, in particular, contain extensive technical information and complex rules presented in highly structured formats that combine tables, scopes of application, constraints, exceptions, and numerical calculations, making KG construction especially challenging. In this study, we propose a method that organizes such documents into a hierarchical semantic structure, decomposes sentences and tables into atomic propositions derived from conditional and numerical rules, and integrates them into an ontology-knowledge graph through LLM-based triple extraction. Our approach captures both the hierarchical and logical structures of documents, effectively representing domain-specific semantics that conventional methods fail to reflect. To verify its effectiveness, we constructed rule, table, and multi-hop QA datasets, as well as a toxic clause detection dataset, from industrial standards, and implemented an ontology-aware KG-RAG framework for comparative evaluation. Experimental results show that our method achieves significant performance improvements across all QA types compared to existing KG-RAG approaches. This study demonstrates that reliable and scalable knowledge representation is feasible even for industrial documents with intertwined conditions, constraints, and scopes, contributing to future domain-specific RAG development and intelligent document management.
[NLP-17] The High Cost of Incivility: Quantifying Interaction Inefficiency via Multi-Agent Monte Carlo Simulations
[Quick Read]: This paper tackles the difficulty of quantifying how workplace toxicity affects operational efficiency, since ethical and practical constraints prevent reproducing conflict scenarios with human subjects. The key idea is to use large language model (LLM)-based multi-agent systems as a controlled "sociological sandbox" that simulates one-on-one adversarial debates, repeating hundreds of discussions via a Monte Carlo method and measuring convergence time (the number of arguments needed to reach a conclusion). Treatment groups with a "toxic" system prompt take roughly 25% longer than controls, suggesting this "latency of toxicity" can serve as a proxy for the financial damage of social friction in corporate and academic settings, and that agent-based modeling offers a reproducible, ethical alternative for studying the mechanics of human social interaction.
Link: https://arxiv.org/abs/2512.08345
Authors: Benedikt Mangold
Affiliations: Technische Hochschule Nürnberg Georg Simon Ohm
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
Comments: 8 figures, 3 tables
Abstract:Workplace toxicity is widely recognized as detrimental to organizational culture, yet quantifying its direct impact on operational efficiency remains methodologically challenging due to the ethical and practical difficulties of reproducing conflict in human subjects. This study leverages Large Language Model (LLM) based Multi-Agent Systems to simulate 1-on-1 adversarial debates, creating a controlled "sociological sandbox". We employ a Monte Carlo method to simulate hundreds of discussions, measuring the convergence time (defined as the number of arguments required to reach a conclusion) between a baseline control group and treatment groups involving agents with "toxic" system prompts. Our results demonstrate a statistically significant increase of approximately 25% in the duration of conversations involving toxic participants. We propose that this "latency of toxicity" serves as a proxy for financial damage in corporate and academic settings. Furthermore, we demonstrate that agent-based modeling provides a reproducible, ethical alternative to human-subject research for measuring the mechanics of social friction.
[NLP-18] Reasoning Models Ace the CFA Exams
[Quick Read]: This paper revisits the earlier finding that large language models (LLMs) perform poorly on the Chartered Financial Analyst (CFA) exams, given that recent reasoning models have made strong gains on graduate-level academic and professional examinations across disciplines. Evaluating state-of-the-art reasoning models on a set of mock CFA exams comprising 980 questions across Levels I-III, and applying the pass/fail criteria of prior studies, the authors find that most frontier models now clear all three levels, with especially strong scores of 97.6% on Level I (Gemini 3.0 Pro) and 92.0% on Level III constructed-response questions, indicating that stronger reasoning substantially improves performance on professional finance knowledge tasks.
Link: https://arxiv.org/abs/2512.08270
Authors: Jaisal Patel, Yunzhe Chen, Kaiwen He, Keyi Wang, David Li, Kairong Xiao, Xiao-Yang Liu
Affiliations: Rensselaer Polytechnic Institute; University of North Carolina at Chapel Hill; SecureFinAI Lab, Columbia University; Department of Mathematics, Columbia University; Business School, Columbia University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); General Finance (q-fin.GN)
Comments:
Abstract:Previous research has reported that large language models (LLMs) demonstrate poor performance on the Chartered Financial Analyst (CFA) exams. However, recent reasoning models have achieved strong results on graduate-level academic and professional examinations across various disciplines. In this paper, we evaluate state-of-the-art reasoning models on a set of mock CFA exams consisting of 980 questions across three Level I exams, two Level II exams, and three Level III exams. Using the same pass/fail criteria from prior studies, we find that most models clear all three levels. The models that pass, ordered by overall performance, are Gemini 3.0 Pro, Gemini 2.5 Pro, GPT-5, Grok 4, Claude Opus 4.1, and DeepSeek-V3.1. Specifically, Gemini 3.0 Pro achieves a record score of 97.6% on Level I. Performance is also strong on Level II, led by GPT-5 at 94.3%. On Level III, Gemini 2.5 Pro attains the highest score with 86.4% on multiple-choice questions while Gemini 3.0 Pro achieves 92.0% on constructed-response questions.
[NLP-19] ClinicalTrialsHub: Bridging Registries and Literature for Comprehensive Clinical Trial Access
[Quick Read]: This paper addresses the limited accessibility and low degree of structure of clinical-trial data, especially the difficulty of comprehensively covering and efficiently using related research beyond a single source such as ClinicalTrials.gov. The key contribution is ClinicalTrialsHub, an interactive search platform that consolidates registry data and uses large language models (GPT-5.1 and Gemini-3-Pro) to automatically extract and structure trial-relevant information from PubMed articles, increasing the available structured data by 83.8%. The system further translates user queries into structured database searches and provides evidence-grounded question answering whose answers link back to specific source sentences, markedly improving access to high-quality evidence-based medical information for patients, clinicians, researchers, and policymakers.
Link: https://arxiv.org/abs/2512.08193
Authors: Jiwoo Park, Ruoqi Liu, Avani Jagdale, Andrew Srisuwananukorn, Jing Zhao, Lang Li, Ping Zhang, Sachin Kumar
Affiliations: The Ohio State University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Comments:
Abstract:We present ClinicalTrialsHub, an interactive search-focused platform that consolidates all data from this http URL and augments it by automatically extracting and structuring trial-relevant information from PubMed research articles. Our system effectively increases access to structured clinical trial data by 83.8% compared to relying on this http URL alone, with potential to make access easier for patients, clinicians, researchers, and policymakers, advancing evidence-based medicine. ClinicalTrialsHub uses large language models such as GPT-5.1 and Gemini-3-Pro to enhance accessibility. The platform automatically parses full-text research articles to extract structured trial information, translates user queries into structured database searches, and provides an attributed question-answering system that generates evidence-grounded answers linked to specific source sentences. We demonstrate its utility through a user study involving clinicians, clinical researchers, and PhD students of pharmaceutical sciences and nursing, and a systematic automatic evaluation of its information extraction and question answering capabilities.
[NLP-20] Universal Adversarial Suffixes for Language Models Using Reinforcement Learning with Calibrated Reward
[Quick Read]: This paper addresses the vulnerability of language models to short adversarial suffixes that reliably alter predictions, where existing gradient-search or rule-based methods are brittle, tied to a single task or model, and generalize poorly. The key idea is a reinforcement learning (RL) framework that treats the suffix as a policy trained with Proximal Policy Optimization (PPO) against a frozen target model serving as a reward oracle, shaping rewards with calibrated cross-entropy to remove label bias and aggregate information across surface forms, thereby improving transferability. Experiments show the resulting suffixes consistently degrade accuracy across several NLP benchmarks and transfer across tasks and models better than prior adversarial triggers of similar genres.
Link: https://arxiv.org/abs/2512.08131
Authors: Sampriti Soor, Suklav Ghosh, Arijit Sur
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: 5 pages
Abstract:Language models are vulnerable to short adversarial suffixes that can reliably alter predictions. Previous works usually find such suffixes with gradient search or rule-based methods, but these are brittle and often tied to a single task or model. In this paper, a reinforcement learning framework is used where the suffix is treated as a policy and trained with Proximal Policy Optimization against a frozen model as a reward oracle. Rewards are shaped using calibrated cross-entropy, removing label bias and aggregating across surface forms to improve transferability. The proposed method is evaluated on five diverse NLP benchmark datasets, covering sentiment, natural language inference, paraphrase, and commonsense reasoning, using three distinct language models: Qwen2-1.5B Instruct, TinyLlama-1.1B Chat, and Phi-1.5. Results show that RL-trained suffixes consistently degrade accuracy and transfer more effectively across tasks and models than previous adversarial triggers of similar genres.
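The reward shaping is the interesting part. Below is a heavily stubbed sketch of a calibrated attack reward: gold-label cross-entropy under the attacked input, minus the cross-entropy under a content-free input (removing label bias), with probability mass summed over label surface forms. The scoring stub stands in for queries to the frozen target model:

```python
import math

def label_prob(text, surface_forms):
    """Stub for the frozen model's P(label | text), summed over surface forms.
    Illustrative stand-in: pretend longer inputs dilute the gold label."""
    per_form = max(1e-6, 0.8 - 0.05 * text.count(" ")) / len(surface_forms)
    return per_form * len(surface_forms)

def calibrated_reward(input_text, suffix, gold_surface_forms):
    ce_attacked = -math.log(label_prob(input_text + " " + suffix, gold_surface_forms))
    ce_baseline = -math.log(label_prob("N/A " + suffix, gold_surface_forms))
    # Positive when the suffix raises gold-label CE beyond the bias term.
    return ce_attacked - ce_baseline

print(calibrated_reward("the movie was great", "zoning tcp !!",
                        ["positive", "Positive"]))
```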
[NLP-21] Universal Adversarial Suffixes Using Calibrated Gumbel-Softmax Relaxation
[Quick Read]: This paper addresses the fragility of language models (LMs) used as zero-shot or few-shot classifiers to adversarial prompts, where prior attacks rely on task- or model-specific triggers that are hard to compare and generalize poorly. The key contribution is a universal adversarial suffix: a short sequence of 4-10 tokens that, appended to any input, broadly reduces classification accuracy and calibrated confidence across tasks and models. The suffix is learned in a differentiable "soft" form via Gumbel-Softmax relaxation, with entropy regularization to avoid collapse and gold-label masking to prevent trivial leakage, and is then discretized for inference, yielding consistently effective attacks with strong cross-model and cross-task transfer.
Link: https://arxiv.org/abs/2512.08123
Authors: Sampriti Soor, Suklav Ghosh, Arijit Sur
Affiliations: Indian Institute of Technology Guwahati, India
Categories: Computation and Language (cs.CL)
Comments: 10 pages
Abstract:Language models (LMs) are often used as zero-shot or few-shot classifiers by scoring label words, but they remain fragile to adversarial prompts. Prior work typically optimizes task- or model-specific triggers, making results difficult to compare and limiting transferability. We study universal adversarial suffixes: short token sequences (4-10 tokens) that, when appended to any input, broadly reduce accuracy across tasks and models. Our approach learns the suffix in a differentiable “soft” form using Gumbel-Softmax relaxation and then discretizes it for inference. Training maximizes calibrated cross-entropy on the label region while masking gold tokens to prevent trivial leakage, with entropy regularization to avoid collapse. A single suffix trained on one model transfers effectively to others, consistently lowering both accuracy and calibrated confidence. Experiments on sentiment analysis, natural language inference, paraphrase detection, commonsense QA, and physical reasoning with Qwen2-1.5B, Phi-1.5, and TinyLlama-1.1B demonstrate consistent attack effectiveness and transfer across tasks and model families.
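The Gumbel-Softmax relaxation is easy to demonstrate end to end. A minimal sketch: the suffix is a logits matrix over a toy vocabulary, sampled softly during optimization so gradients flow, then discretized by argmax. The tiny vocabulary, frozen embeddings, and stand-in loss replace the real tokenizer and calibrated cross-entropy objective:

```python
import torch
import torch.nn.functional as F

vocab_size, suffix_len, d_emb = 100, 6, 32
embedding = torch.randn(vocab_size, d_emb)           # frozen token embeddings
suffix_logits = torch.zeros(suffix_len, vocab_size, requires_grad=True)
opt = torch.optim.Adam([suffix_logits], lr=0.1)

for _ in range(50):
    # Differentiable "soft" one-hot tokens over the vocabulary.
    soft_tokens = F.gumbel_softmax(suffix_logits, tau=1.0, hard=False)  # (L, V)
    suffix_emb = soft_tokens @ embedding                                # (L, d)
    loss = -suffix_emb.mean()      # stand-in for the attack objective
    opt.zero_grad()
    loss.backward()
    opt.step()

# Discretize for inference: one hard token id per suffix position.
print("suffix token ids:", suffix_logits.argmax(dim=-1).tolist())
```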
[NLP-22] Balanced Accuracy: The Right Metric for Evaluating LLM Judges - Explained through Youden's J statistic
[Quick Read]: This paper addresses the bias introduced into large language model (LLM) evaluation by poor classifier choice, in particular how conventional metrics such as Accuracy, Precision, and F1 are sensitive to class imbalance and to the arbitrary choice of positive class, misleading model comparisons. The key idea is to adopt Youden's J statistic as the theoretically appropriate criterion: it directly measures a classifier's ability to separate positive and negative classes, and Balanced Accuracy is a linear (and therefore equivalent) transformation of J. Through analytical arguments, empirical examples, and simulations, the authors show that selecting judges by Balanced Accuracy yields markedly more robust and trustworthy evaluation, avoiding judges that distort prevalence estimates of model behaviors.
Link: https://arxiv.org/abs/2512.08121
Authors: Stephane Collot, Colin Fraser, Justin Zhao, William F. Shen, Timon Willi, Ilias Leontiadis
Affiliations: Meta Superintelligence Labs; Meta; University of Cambridge
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 9 pages, 5 figures
Abstract:Rigorous evaluation of large language models (LLMs) relies on comparing models by the prevalence of desirable or undesirable behaviors, such as task pass rates or policy violations. These prevalence estimates are produced by a classifier, either an LLM-as-a-judge or human annotators, making the choice of classifier central to trustworthy evaluation. Common metrics used for this choice, such as Accuracy, Precision, and F1, are sensitive to class imbalance and to arbitrary choices of positive class, and can favor judges that distort prevalence estimates. We show that Youden's J statistic is theoretically aligned with choosing the best judge to compare models, and that Balanced Accuracy is an equivalent linear transformation of J. Through both analytical arguments and empirical examples and simulations, we demonstrate how selecting judges using Balanced Accuracy leads to better, more robust classifier selection.
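The central identity is worth verifying numerically: Youden's J = sensitivity + specificity - 1, and Balanced Accuracy = (J + 1) / 2, so ranking judges by either is equivalent. The confusion counts below are arbitrary toy numbers for an imbalanced evaluation set:

```python
def youdens_j(tp, fn, tn, fp):
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return sensitivity + specificity - 1

def balanced_accuracy(tp, fn, tn, fp):
    return (tp / (tp + fn) + tn / (tn + fp)) / 2

# Imbalanced set: 100 positives, 10,000 negatives.
tp, fn, tn, fp = 90, 10, 9500, 500
j = youdens_j(tp, fn, tn, fp)
ba = balanced_accuracy(tp, fn, tn, fp)
print(f"J = {j:.3f}, BA = {ba:.3f}, (J + 1) / 2 = {(j + 1) / 2:.3f}")
# Unlike precision or F1, BA is unchanged if we swap which class is "positive":
print(f"swapped BA = {balanced_accuracy(tn, fp, tp, fn):.3f}")
```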
[NLP-23] Segment, Embed, and Align: A Universal Recipe for Aligning Subtitles to Signing
[Quick Read]: This paper addresses the problem of aligning subtitles (spoken-language text with timestamps) to continuous sign language videos across languages and domains, where traditional approaches rely on end-to-end training tied to a specific language or dataset and thus lack generality. The key idea is the unified Segment, Embed, and Align (SEA) framework: one pretrained model segments the video frame sequence into individual signs, a second pretrained model embeds each sign clip into a latent space shared with text, and a lightweight dynamic-programming procedure performs the final alignment, running on CPUs within a minute even for hour-long episodes and scaling from small lexicons to large continuous corpora.
Link: https://arxiv.org/abs/2512.08094
Authors: Zifan Jiang, Youngjoon Jang, Liliane Momeni, Gül Varol, Sarah Ebling, Andrew Zisserman
Affiliations: University of Oxford; University of Zurich; KAIST; Ecole des Ponts, IP Paris; Univ Gustave Eiffel; CNRS
Categories: Computation and Language (cs.CL)
Comments:
Abstract:The goal of this work is to develop a universal approach for aligning subtitles (i.e., spoken language text with corresponding timestamps) to continuous sign language videos. Prior approaches typically rely on end-to-end training tied to a specific language or dataset, which limits their generality. In contrast, our method Segment, Embed, and Align (SEA) provides a single framework that works across multiple languages and domains. SEA leverages two pretrained models: the first to segment a video frame sequence into individual signs and the second to embed the video clip of each sign into a shared latent space with text. Alignment is subsequently performed with a lightweight dynamic programming procedure that runs efficiently on CPUs within a minute, even for hour-long episodes. SEA is flexible and can adapt to a wide range of scenarios, utilizing resources from small lexicons to large continuous corpora. Experiments on four sign language datasets demonstrate state-of-the-art alignment performance, highlighting the potential of SEA to generate high-quality parallel data for advancing sign language processing. SEA’s code and models are openly available.
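The lightweight dynamic-programming step is the glue of SEA and fits in a few lines. A small sketch: given a similarity matrix between embedded sign segments and subtitles (both time-ordered), find the monotonic assignment of segments to subtitles that maximizes total similarity. Treating every segment as assignable is a simplification of the real procedure:

```python
import numpy as np

def align(sim):
    """sim[i, j]: similarity of sign segment i to subtitle j.
    Returns a (monotonically non-decreasing) subtitle index per segment."""
    n_seg, n_sub = sim.shape
    dp = np.full((n_seg + 1, n_sub + 1), -np.inf)
    dp[0, :] = 0.0
    for i in range(1, n_seg + 1):
        for j in range(1, n_sub + 1):
            # Either advance to the next subtitle, or match segment i to j.
            dp[i, j] = max(dp[i, j - 1], dp[i - 1, j] + sim[i - 1, j - 1])
    assignment, i, j = [], n_seg, n_sub
    while i > 0:                     # backtrace the optimal path
        if j > 1 and dp[i, j - 1] >= dp[i - 1, j] + sim[i - 1, j - 1]:
            j -= 1
        else:
            assignment.append(j - 1)
            i -= 1
    return assignment[::-1]

sim = np.random.default_rng(0).normal(size=(8, 3))  # 8 signs, 3 subtitles
print(align(sim))   # e.g. [0, 0, 1, 1, 1, 2, 2, 2] (non-decreasing)
```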
[NLP-24] Adaptation of Embedding Models to Financial Filings via LLM Distillation
[Quick Read]: This paper addresses three obstacles to building conversational agents with generative large language models (LLMs) in specialized domains such as finance: high computation cost, strict latency requirements, and the lack of precise domain-specific relevance measures. Existing embedding models ease the first two but underperform on information retrieval in such domains. The key contribution is a scalable training pipeline that starts from an unlabeled corpus and a general-purpose retrieval embedding model, and distills LLM-judged relevance into a lightweight bi-encoder retriever through iterative teacher-student interaction: each round mines hard positive/negative examples from the corpus with the current student and retrains the student on them, progressively sharpening retrieval. The method yields an average 27.7% improvement in MRR@5 and 44.6% in mean DCG@5 across 14 financial filing types, outperforming conventional supervised fine-tuning without labor-intensive human annotation, at low cost.
Link: https://arxiv.org/abs/2512.08088
Authors: Eliot Brenner, Dominic Seyler, Manjunath Hegde, Andrei Simion, Koustuv Dasgupta, Bing Xiang
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: In proceedings of LLM-Finance 2025: The 2nd IEEE International Workshop on Large Language Models for Finance
Abstract:Despite advances in generative large language models (LLMs), practical application of specialized conversational AI agents remains constrained by computation costs, latency requirements, and the need for precise domain-specific relevance measures. While existing embedding models address the first two constraints, they underperform on information retrieval in specialized domains like finance. This paper introduces a scalable pipeline that trains specialized models from an unlabeled corpus using a general purpose retrieval embedding model as foundation. Our method yields an average of 27.7% improvement in MRR@5, 44.6% improvement in mean DCG@5 across 14 financial filing types measured over 21,800 query-document pairs, and improved NDCG on 3 of 4 document classes in FinanceBench. We adapt retrieval embeddings (bi-encoder) for RAG, not LLM generators, using LLM-judged relevance to distill domain knowledge into a compact retriever. There are prior works which pair synthetically generated queries with real passages to directly fine-tune the retrieval model. Our pipeline differs from these by introducing interaction between student and teacher models that interleaves retrieval-based mining of hard positive/negative examples from the unlabeled corpus with iterative retraining of the student model's weights using these examples. Each retrieval iteration uses the refined student model to mine the corpus for progressively harder training examples for the subsequent training iteration. The methodology provides a cost-effective solution to bridging the gap between general-purpose models and specialized domains without requiring labor-intensive human annotation.
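The teacher-student interleaving reduces to a short loop. A heavily stubbed sketch (the encoder, LLM judge, and trainer are stand-ins, not the paper's components) of mining hard positives/negatives with the current student and retraining on the judged pairs each round:

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = [f"filing passage {i}" for i in range(1000)]
query = "revenue recognition policy"

def embed(texts, round_seed):        # stub bi-encoder, "retrained" each round
    return np.random.default_rng(round_seed).normal(size=(len(texts), 32))

def llm_judge(query, doc):           # stub LLM relevance judgment
    return rng.random() > 0.5

for round_id in range(3):
    doc_emb = embed(corpus, round_id)
    q_emb = embed([query], round_id)[0]
    top = np.argsort(doc_emb @ q_emb)[-20:][::-1]   # student's current top-20
    judged = [(corpus[i], llm_judge(query, corpus[i])) for i in top]
    # Hard negatives: ranked high by the student but judged irrelevant;
    # these (with the positives) become the next round's training pairs.
    positives = [d for d, rel in judged if rel]
    hard_negatives = [d for d, rel in judged if not rel]
    print(f"round {round_id}: {len(positives)} positives, "
          f"{len(hard_negatives)} hard negatives -> retrain student")
```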
[NLP-25] Short-Context Dominance: How Much Local Context Natural Language Actually Needs?
【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)在生成文本时存在的短上下文主导偏差问题,即大多数序列的下一个词仅依赖于其局部短前缀(如最后96个token),而长距离依赖信息常被忽略,导致输出分布偏向短上下文预测。解决方案的关键在于提出一种可实用的代理指标——分布感知最小上下文长度(Distributionally Aware MCL, DaMCL),该指标无需依赖真实下一个token即可衡量序列对短上下文的敏感性,并通过简单阈值判断识别出需要长上下文才能准确预测的挑战性序列。进一步地,作者设计了一种基于该检测器的解码算法,主动增强长程相关token的生成概率,从而缓解短上下文主导带来的偏差,在问答(QA)任务和多种模型架构中显著提升性能。
链接: https://arxiv.org/abs/2512.08082
作者: Vala Vakilian,Zimeng Wang,Ankit Singh Rawat,Christos Thrampoulidis
机构: University of British Columbia (不列颠哥伦比亚大学); Google DeepMind (谷歌深度心智)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 38 pages, 7 figures, includes appendix and references
Abstract:We investigate the short-context dominance hypothesis: that for most sequences, a small local prefix suffices to predict their next tokens. Using large language models as statistical oracles, we measure the minimum context length (MCL) needed to reproduce accurate full-context predictions across datasets with sequences of varying lengths. For sequences with 1-7k tokens from long-context documents, we consistently find that 75-80% require only the last 96 tokens at most. Given the dominance of short-context tokens, we then ask whether it is possible to detect challenging long-context sequences for which a short local prefix does not suffice for prediction. We introduce a practical proxy to MCL, called Distributionally Aware MCL (DaMCL), that does not require knowledge of the actual next-token and is compatible with sampling strategies beyond greedy decoding. Our experiments validate that simple thresholding of the metric defining DaMCL achieves high performance in detecting long vs. short context sequences. Finally, to counter the bias that short-context dominance induces in LLM output distributions, we develop an intuitive decoding algorithm that leverages our detector to identify and boost tokens that are long-range-relevant. Across QA tasks and model architectures, we confirm that mitigating the bias improves performance.
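The MCL measurement itself is simple to sketch: shrink the local prefix until the model's top prediction stops matching its full-context prediction. The predictor below is a toy stub for a real LM, and the binary search assumes the match is monotone in prefix length (an efficiency shortcut, not part of the definition):

```python
def predict_next(tokens):
    # Toy LM: echoes the "trigger" word if visible in context, else "the".
    for tok in reversed(tokens):
        if tok.startswith("trigger"):
            return tok.replace("trigger", "answer")
    return "the"

def minimum_context_length(tokens):
    full = predict_next(tokens)          # full-context prediction
    lo, hi = 1, len(tokens)
    while lo < hi:                       # smallest suffix reproducing `full`
        mid = (lo + hi) // 2
        if predict_next(tokens[-mid:]) == full:
            hi = mid
        else:
            lo = mid + 1
    return lo

long_range = ["trigger7"] + ["filler"] * 200
print(minimum_context_length(long_range))        # 201: needs full context
print(minimum_context_length(["filler"] * 200))  # 1: local prefix suffices
```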
[NLP-26] Beyond Unified Models: A Service-Oriented Approach to Low Latency Context Aware Phonemization for Real Time TTS
【速读】: 该论文旨在解决轻量级文本到语音(Text-to-Speech, TTS)系统中音素转换(Grapheme-to-Phoneme, G2P)质量与推理速度之间的权衡问题。现有高效TTS模型常依赖于轻量级音素转换器,难以处理上下文相关的发音挑战;而高精度的音素转换器虽具备更强的语言理解能力,却因计算开销大而无法满足实时性要求。解决方案的关键在于提出两种轻量级策略:一是实现上下文感知的音素转换机制,二是设计服务化架构,将复杂音素转换模块作为独立服务运行,从而解耦核心TTS引擎与高负载组件,有效突破延迟瓶颈。实验表明,该方案在保持实时响应的同时显著提升了发音准确性和语言学合理性,适用于离线及终端设备部署场景。
链接: https://arxiv.org/abs/2512.08006
作者: Mahta Fetrat,Donya Navabi,Zahra Dehghanian,Morteza Abolghasemi,Hamid R. Rabiee
机构: Sharif University of Technology (谢里夫理工大学)
类目: ound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:
Abstract:Lightweight, real-time text-to-speech systems are crucial for accessibility. However, the most efficient TTS models often rely on lightweight phonemizers that struggle with context-dependent challenges. In contrast, more advanced phonemizers with a deeper linguistic understanding typically incur high computational costs, which prevents real-time performance. This paper examines the trade-off between phonemization quality and inference speed in G2P-aided TTS systems, introducing a practical framework to bridge this gap. We propose lightweight strategies for context-aware phonemization and a service-oriented TTS architecture that executes these modules as independent services. This design decouples heavy context-aware components from the core TTS engine, effectively breaking the latency barrier and enabling real-time use of high-quality phonemization models. Experimental results confirm that the proposed system improves pronunciation soundness and linguistic accuracy while maintaining real-time responsiveness, making it well-suited for offline and end-device TTS applications.
[NLP-27] Accelerating Urban Science Research with AI Urban Scientist
【速读】: 该论文旨在解决城市科学领域中如何将海量、碎片化且跨学科的数据转化为对城市运行与演化机制的连贯解释这一根本性挑战。其解决方案的关键在于构建一个以知识驱动的AI城市科学家(AI Urban Scientist),该系统基于数千篇高质量研究提炼出的假设、同行评审信号、数据集和分析模式,采用协调的多智能体框架实现从问题提出到结果合成的全流程自动化探究,从而在城市科学研究中充当主动合作者,而非仅限于辅助工具。
链接: https://arxiv.org/abs/2512.07849
作者: Tong Xia,Jiankun Zhang,Ruiwen You,Ao Xu,Linghao Zhang,Tengyao Tu,Jingzhi Wang,Jinghua Piao,Yunke Zhang,Fengli Xu,Yong Li
机构: Vanke School of Public Health, Tsinghua University, Beijing, China; Zhongguancun Academy, Beijing, China; Department of Electronic Engineering, Tsinghua University, Beijing, China
类目: Computers and Society (cs.CY); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注:
Abstract:Cities are complex, adaptive systems whose underlying principles remain difficult to disentangle despite unprecedented data abundance. Urban science therefore faces a fundamental challenge: converting vast, fragmented and interdisciplinary information into coherent explanations of how cities function and evolve. The emergence of AI scientists, i.e., agents capable of autonomous reasoning, hypothesis formation and data-driven experimentation, offers a new pathway toward accelerating this transformation, yet general-purpose systems fall short of the domain knowledge and methodological depth required for urban science research. Here we introduce a knowledge-driven AI Urban Scientist, built from hypotheses, peer-review signals, datasets and analytical patterns distilled from thousands of high-quality studies, and implemented as a coordinated multi-agent framework for end-to-end inquiry. The system generates structured hypotheses, retrieves and harmonizes heterogeneous datasets, conducts automated empirical analysis and simulation, and synthesizes insights in forms compatible with urban scientific reasoning. By providing reusable analytical tools and supporting community-driven extensions, the AI Urban Scientist lowers barriers to advanced urban analytics and acts not merely as an assistant but as an active collaborator in revealing the mechanisms that shape urban systems and in guiding the design of more resilient and equitable cities.
[NLP-28] MixLM: High-Throughput and Effective LLM Ranking via Text-Embedding Mix-Interaction
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在工业级推荐与搜索系统中因高计算开销导致的吞吐量瓶颈问题,尤其是在交叉编码器(cross-encoder)排序系统中,由于需将用户、查询和物品信息全部作为长上下文输入,造成预填充(prefill-heavy)负载过重。解决方案的关键在于提出 MixLM 框架,通过引入“混合交互”(mix-interaction)机制,用少量嵌入(embedding)token 替代原始文本 token 表示物品信息:具体而言,将整个物品目录预先编码为少量嵌入 token 并缓存于近线(nearline)存储,在在线推理时直接使用这些嵌入表示,从而将物品描述的上下文长度从数千文本 token 显著压缩至几个嵌入 token,大幅降低输入长度并提升系统吞吐量,同时保持交叉编码器的语义表达能力。实证表明,该方法在相同延迟预算下使吞吐量提升 10.0 倍,并支持全流量部署 LLM 驱动的搜索,带来显著的 DAU 增益。
链接: https://arxiv.org/abs/2512.07846
作者: Guoyao Li,Ran He,Shusen Jing,Kayhan Behdin,Yubo Wang,Sundara Raman Ramachandran,Chanh Nguyen,Jian Sheng,Xiaojing Ma,Chuanrui Zhu,Sriram Vasudevan,Muchen Wu,Sayan Ghosh,Lin Su,Qingquan Song,Xiaoqing Wang,Zhipeng Wang,Qing Lan,Yanning Chen,Jingwei Wu,Luke Simon,Wenjing Zhang,Qi Guo,Fedor Borisyuk
机构: LinkedIn(领英)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Large language models (LLMs) excel at capturing semantic nuances and therefore show impressive relevance ranking performance in modern recommendation and search systems. However, they suffer from high computational overhead under industrial latency and throughput requirements. In particular, cross-encoder ranking systems often create long context prefill-heavy workloads, as the model has to be presented with the user, query and item information. To this end, we propose MixLM, a novel LLM-based ranking framework, which significantly improves the system throughput via reducing the input context length, while preserving the semantic strength of cross-encoder rankers. In contrast to a standard ranking system where the context is presented to the model as pure text, we propose to use mix-interaction, a mixture of text and embedding tokens to represent the input. Specifically, MixLM encodes all items in the catalog into a few embedding tokens and stores in a nearline cache. The encoded item descriptions are used during online inference, effectively reducing the item length from a few thousand text tokens to a few embedding tokens. We share insights from deploying our MixLM framework to a real-world search application at LinkedIn, including a detailed discussion of our training pipelines, as well as a thorough analysis of our online serving infrastructure optimization. Comparing with strong baselines, MixLM increased throughput by 10.0x under the same latency budget, while maintaining relevance metrics. The efficiency gains delivered by MixLM enabled full-traffic deployment of LLM-powered search, which resulted in a significant 0.47% increase in Daily Active Users (DAU) in online A/B tests.
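The mix-interaction trick is mostly about how the input sequence is assembled. A toy sketch: text tokens for the query are embedded normally, while each item contributes a few precomputed embedding tokens fetched from a nearline cache; shapes and the cache dict are illustrative assumptions (a decoder LLM would consume the result through an inputs-embeds style interface rather than token ids):

```python
import torch
import torch.nn as nn

d_model, vocab = 64, 1000
tok_emb = nn.Embedding(vocab, d_model)   # the ranker LLM's input embeddings

# Nearline cache: each catalog item pre-encoded into 4 embedding tokens.
item_cache = {item: torch.randn(4, d_model) for item in ("job_17", "job_42")}

def build_inputs(query_token_ids, item_ids):
    parts = [tok_emb(torch.tensor(query_token_ids))]   # query as text tokens
    parts += [item_cache[i] for i in item_ids]         # items as embedding tokens
    return torch.cat(parts, dim=0).unsqueeze(0)        # (1, T, d_model)

inputs_embeds = build_inputs([5, 17, 256, 9], ["job_17", "job_42"])
print(inputs_embeds.shape)  # (1, 12, 64): 12 "tokens" instead of thousands
```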
[NLP-29] ThreadWeaver: Adaptive Threading for Efficient Parallel Reasoning in Language Models
[Quick Read]: This paper addresses the high latency of LLM inference caused by sequential decoding, especially on complex tasks: although more inference-time computation improves reasoning performance, the efficiency bottleneck limits practical use. Existing adaptive parallel reasoning methods either rely on supervised behavior cloning or fall well below mainstream sequential long chain-of-thought (CoT) baselines in accuracy on realistic tasks, and often need customized inference engines for deployment. The key of the proposed ThreadWeaver framework lies in three innovations: 1) a two-stage parallel trajectory generator that builds large-scale, high-quality parallel-annotated CoT data for supervised fine-tuning; 2) a trie-based training-inference co-design that enables parallel reasoning on any off-the-shelf autoregressive inference engine without modifying position embeddings or KV caches; and 3) a parallelization-aware reinforcement learning framework that guides the model to balance accuracy against parallel efficiency. Experiments show that ThreadWeaver trained atop Qwen3-8B matches advanced sequential models on six mathematical reasoning benchmarks (71.9% on average, 79.9% on AIME24) while delivering up to 1.53x token-latency speedup, establishing a new accuracy-efficiency Pareto frontier.
Link: https://arxiv.org/abs/2512.07843
Authors: Long Lian,Sida Wang,Felix Juefei-Xu,Tsu-Jui Fu,Xiuyu Li,Adam Yala,Trevor Darrell,Alane Suhr,Yuandong Tian,Xi Victoria Lin
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Scaling inference-time computation has enabled Large Language Models (LLMs) to achieve strong reasoning performance, but inherently sequential decoding leads to substantial latency, especially on complex tasks. Recent work on adaptive parallel reasoning aims to improve inference efficiency by decomposing the problem-solving process into concurrent reasoning threads when beneficial. However, existing methods on realistic tasks are either limited to supervised behavior cloning or exhibit significant accuracy drops compared to widely-used sequential long chain-of-thought (CoT) baselines. Moreover, many require customized inference engines, complicating deployment. We introduce ThreadWeaver, a framework for adaptive parallel reasoning that achieves accuracy on par with popular sequential reasoning models of comparable size while significantly reducing inference latency. ThreadWeaver’s performance stems from three key innovations: 1) a two-stage parallel trajectory generator that produces large-scale, high-quality CoT data with parallel annotations for supervised fine-tuning; 2) a trie-based training-inference co-design that enables parallel reasoning on any off-the-shelf autoregressive inference engine without modifying position embeddings or KV caches; and 3) a parallelization-aware reinforcement learning framework that teaches the model to balance accuracy with effective parallelization. Across six challenging mathematical reasoning benchmarks, ThreadWeaver trained atop Qwen3-8B achieves accuracy comparable to cutting-edge sequential reasoning models (71.9% on average and 79.9% on AIME24) while delivering up to 1.53x average speedup in token latency, establishing a new Pareto frontier between accuracy and efficiency.
[NLP-30] MELLA: Bridging Linguistic Capability and Cultural Groundedness for Low-Resource Language MLLMs
[Quick Read]: This paper addresses the sharp performance drop of multimodal large language models (MLLMs) in low-resource-language settings. Existing approaches are usually confined to the text modality or rely on machine translation, so models produce only "thin descriptions" and miss multimodal informativeness and cultural groundedness, both crucial for serving low-resource-language users. The key of the solution is a dual-source strategy that collects data separately for the two core objectives of linguistic capability and cultural awareness: native web alt-text supplies cultural relevance, while MLLM-generated image captions strengthen linguistic capability. Fine-tuning on the resulting MELLA dataset yields general improvements across eight languages on various MLLM backbones, producing "thick descriptions" and confirming that the gains come from both cultural-knowledge and linguistic-capability enhancement.
Link: https://arxiv.org/abs/2508.05502
Authors: Yufei Gao,Jiaying Fei,Nuo Chen,Ruirui Chen,Guohang Yan,Yunshi Lan,Botian Shi
Affiliations: Shanghai Artificial Intelligence Laboratory; East China Normal University; The Chinese University of Hong Kong, Shenzhen; Institute of High Performance Computing, A*STAR
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Multimodal Large Language Models (MLLMs) have shown remarkable performance in high-resource languages. However, their effectiveness diminishes significantly in the contexts of low-resource languages. Current multilingual enhancement methods are often limited to text modality or rely solely on machine translation. While such approaches help models acquire basic linguistic capabilities and produce “thin descriptions”, they neglect the importance of multimodal informativeness and cultural groundedness, both of which are crucial for serving low-resource language users effectively. To bridge this gap, in this study, we identify two significant objectives for a truly effective MLLM in low-resource language settings, namely 1) linguistic capability and 2) cultural groundedness, placing special emphasis on cultural awareness. To achieve these dual objectives, we propose a dual-source strategy that guides the collection of data tailored to each goal, sourcing native web alt-text for culture and MLLM-generated captions for linguistics. As a concrete implementation, we introduce MELLA, a multimodal, multilingual dataset. Experiment results show that after fine-tuning on MELLA, there is a general performance improvement for the eight languages on various MLLM backbones, with models producing “thick descriptions”. We verify that the performance gains are from both cultural knowledge enhancement and linguistic capability enhancement. Our dataset can be found at this https URL.
Computer Vision
[CV-0] Astra: General Interactive World Model with Autoregressive Denoising
[Quick Read]: This paper addresses the limitations of existing world models in long-horizon future prediction, especially their insufficient support for diverse action interaction in general-purpose scenarios: current methods struggle to combine precise control over complex real-world actions (e.g., robot grasping, camera motion) with long-term consistent prediction. The key of Astra, an interactive general world model built on an autoregressive denoising architecture, lies in: 1) temporal causal attention to aggregate past observations and support streaming outputs; 2) a noise-augmented history memory that balances responsiveness with temporal coherence; 3) an action-aware adapter that injects action signals directly into the denoising process for precise action control; and 4) a mixture of action experts that dynamically routes heterogeneous action modalities, improving generalization across diverse tasks such as exploration, manipulation, and camera control. Together these designs deliver high-fidelity, long-term consistent, interactive video prediction.
Link: https://arxiv.org/abs/2512.08931
Authors: Yixuan Zhu,Jiaqi Feng,Wenzhao Zheng,Yuan Gao,Xin Tao,Pengfei Wan,Jie Zhou,Jiwen Lu
Affiliations: Tsinghua University; Kuaishou Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Code is available at: this https URL
Abstract:Recent advances in diffusion transformers have empowered video generation models to generate high-quality video clips from texts or images. However, world models with the ability to predict long-horizon futures from past observations and actions remain underexplored, especially for general-purpose scenarios and various forms of actions. To bridge this gap, we introduce Astra, an interactive general world model that generates real-world futures for diverse scenarios (e.g., autonomous driving, robot grasping) with precise action interactions (e.g., camera motion, robot action). We propose an autoregressive denoising architecture and use temporal causal attention to aggregate past observations and support streaming outputs. We use a noise-augmented history memory to avoid over-reliance on past frames to balance responsiveness with temporal coherence. For precise action control, we introduce an action-aware adapter that directly injects action signals into the denoising process. We further develop a mixture of action experts that dynamically route heterogeneous action modalities, enhancing versatility across diverse real-world tasks such as exploration, manipulation, and camera control. Astra achieves interactive, consistent, and general long-term video prediction and supports various forms of interactions. Experiments across multiple datasets demonstrate the improvements of Astra in fidelity, long-range prediction, and action alignment over existing state-of-the-art world models.
[CV-1] Selfi: Self Improving Reconstruction Engine via 3D Geometric Feature Alignment
[Quick Read]: This paper addresses the insufficient 3D geometric consistency of features from vision foundation models such as VGGT when used for novel view synthesis (NVS), which limits performance on both NVS and camera pose estimation. The key of the proposed Selfi framework is to leverage the model's own outputs as pseudo-ground-truth in a self-supervised manner: a lightweight feature adapter is trained with a reprojection-based consistency loss that distills raw VGGT features into a new geometrically aligned feature space, explicitly enforcing 3D spatial proximity across views and ultimately improving reconstruction quality and downstream task performance.
Link: https://arxiv.org/abs/2512.08930
Authors: Youming Deng,Songyou Peng,Junyi Zhang,Kathryn Heal,Tiancheng Sun,John Flynn,Steve Marschner,Lucy Chai
Affiliations: Cornell University; Google; UC Berkeley
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Project Page: this https URL
Abstract:Novel View Synthesis (NVS) has traditionally relied on models with explicit 3D inductive biases combined with known camera parameters from Structure-from-Motion (SfM) beforehand. Recent vision foundation models like VGGT take an orthogonal approach – 3D knowledge is gained implicitly through training data and loss objectives, enabling feed-forward prediction of both camera parameters and 3D representations directly from a set of uncalibrated images. While flexible, VGGT features lack explicit multi-view geometric consistency, and we find that improving such 3D feature consistency benefits both NVS and pose estimation tasks. We introduce Selfi, a self-improving 3D reconstruction pipeline via feature alignment, transforming a VGGT backbone into a high-fidelity 3D reconstruction engine by leveraging its own outputs as pseudo-ground-truth. Specifically, we train a lightweight feature adapter using a reprojection-based consistency loss, which distills VGGT outputs into a new geometrically-aligned feature space that captures spatial proximity in 3D. This enables state-of-the-art performance in both NVS and camera pose estimation, demonstrating that feature alignment is a highly beneficial step for downstream 3D reasoning.
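A reprojection-based consistency loss of the kind described can be sketched as follows: pixels of one view are back-projected with (pseudo-ground-truth) depth, mapped into a second view, and the adapter features sampled at corresponding locations are pulled together. The pinhole setup, shapes, and toy tensors are assumptions for illustration, not Selfi's exact formulation.

```python
# Minimal sketch of a reprojection-based feature consistency loss: features at
# pixels that depth and relative pose map to the same 3D point should agree
# across views. All tensors here are toy placeholders.
import torch
import torch.nn.functional as F

B, C, H, W = 1, 16, 32, 32
feat1 = torch.randn(B, C, H, W, requires_grad=True)   # adapter features, view 1
feat2 = torch.randn(B, C, H, W)                       # adapter features, view 2
depth1 = torch.rand(B, 1, H, W) + 1.0                 # pseudo-GT depth for view 1
K = torch.tensor([[W, 0., W / 2], [0., H, H / 2], [0., 0., 1.]])
T_12 = torch.eye(4); T_12[0, 3] = 0.1                 # relative pose view1 -> view2

# Back-project view-1 pixels to 3D, transform into view 2, re-project.
ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).float().view(3, -1)
cam1 = torch.linalg.inv(K) @ pix * depth1.view(1, -1)
cam2 = T_12[:3, :3] @ cam1 + T_12[:3, 3:4]
proj = K @ cam2
uv = proj[:2] / proj[2:].clamp(min=1e-6)

# Normalize to [-1, 1] and sample view-2 features at the reprojected locations.
grid = torch.stack([uv[0] / (W - 1) * 2 - 1, uv[1] / (H - 1) * 2 - 1], -1)
feat2_warped = F.grid_sample(feat2, grid.view(1, H, W, 2), align_corners=True)

# Consistency loss: corresponding features should match (cosine distance).
loss = 1 - F.cosine_similarity(feat1, feat2_warped, dim=1).mean()
loss.backward()
```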
[CV-2] Efficiently Reconstructing Dynamic Scenes One D4RT at a Time
[Quick Read]: This paper addresses 4D reconstruction, the long-standing computer vision challenge of understanding and reconstructing the complex geometry and motion of dynamic scenes from video. The key of the proposed feedforward model D4RT is a unified transformer that jointly infers depth, spatio-temporal correspondence, and full camera parameters; its core innovation is a novel querying mechanism that avoids the heavy cost of dense per-frame decoding and the complexity of managing multiple task-specific decoders, allowing the 3D position of any point in space and time to be probed independently and flexibly. This markedly improves training and inference efficiency and sets a new state of the art across a wide range of 4D reconstruction tasks.
Link: https://arxiv.org/abs/2512.08924
Authors: Chuhan Zhang,Guillaume Le Moing,Skanda Koppula,Ignacio Rocco,Liliane Momeni,Junyu Xie,Shuyang Sun,Rahul Sukthankar,Joëlle K Barral,Raia Hadsell,Zoubin Ghahramani,Andrew Zisserman,Junlin Zhang,Mehdi SM Sajjadi
Affiliations: Google DeepMind; University College London; University of Oxford
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL
Abstract:Understanding and reconstructing the complex geometry and motion of dynamic scenes from video remains a formidable challenge in computer vision. This paper introduces D4RT, a simple yet powerful feedforward model designed to efficiently solve this task. D4RT utilizes a unified transformer architecture to jointly infer depth, spatio-temporal correspondence, and full camera parameters from a single video. Its core innovation is a novel querying mechanism that sidesteps the heavy computation of dense, per-frame decoding and the complexity of managing multiple, task-specific decoders. Our decoding interface allows the model to independently and flexibly probe the 3D position of any point in space and time. The result is a lightweight and highly scalable method that enables remarkably efficient training and inference. We demonstrate that our approach sets a new state of the art, outperforming previous methods across a wide spectrum of 4D reconstruction tasks. We refer to the project webpage for animated results: this https URL.
[CV-3] Unified Diffusion Transformer for High-fidelity Text-Aware Image Restoration
[Quick Read]: This paper addresses text hallucination in Text-Aware Image Restoration (TAIR), where generative models, lacking explicit linguistic knowledge, produce inaccurate or fabricated text when restoring images with degraded textual content. The key innovation of the unified framework UniT is to combine a Diffusion Transformer (DiT), a Vision-Language Model (VLM), and a Text Spotting Module (TSM) in an iterative loop: the VLM extracts textual content from the degraded image to provide explicit textual guidance; the TSM, trained on diffusion features, produces intermediate OCR predictions at each denoising step so the VLM can refine its guidance dynamically; and the DiT exploits these cues to recover fine-grained text details while effectively suppressing hallucinations, achieving faithful text reconstruction and state-of-the-art end-to-end F1 scores on the SA-Text and Real-Text benchmarks.
Link: https://arxiv.org/abs/2512.08922
Authors: Jin Hyeon Kim,Paul Hyunbin Cho,Claire Kim,Jaewon Min,Jaeeun Lee,Jihye Park,Yeji Choi,Seungryong Kim
Affiliations: KAIST AI; Samsung Electronics
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Text-Aware Image Restoration (TAIR) aims to recover high-quality images from low-quality inputs containing degraded textual content. While diffusion models provide strong generative priors for general image restoration, they often produce text hallucinations in text-centric tasks due to the absence of explicit linguistic knowledge. To address this, we propose UniT, a unified text restoration framework that integrates a Diffusion Transformer (DiT), a Vision-Language Model (VLM), and a Text Spotting Module (TSM) in an iterative fashion for high-fidelity text restoration. In UniT, the VLM extracts textual content from degraded images to provide explicit textual guidance. Simultaneously, the TSM, trained on diffusion features, generates intermediate OCR predictions at each denoising step, enabling the VLM to iteratively refine its guidance during the denoising process. Finally, the DiT backbone, leveraging its strong representational power, exploits these cues to recover fine-grained textual content while effectively suppressing text hallucinations. Experiments on the SA-Text and Real-Text benchmarks demonstrate that UniT faithfully reconstructs degraded text, substantially reduces hallucinations, and achieves state-of-the-art end-to-end F1-score performance in the TAIR task.
[CV-4] LiDAS: Lighting-driven Dynamic Active Sensing for Nighttime Perception
[Quick Read]: This paper addresses the marked degradation of camera-based perception at night, where conventional methods passively depend on ambient lighting and cannot adapt to low-light scenes. The core solution, Lighting-driven Dynamic Active Sensing (LiDAS), is a closed-loop active illumination system that couples off-the-shelf visual perception models with high-definition headlights and dynamically predicts the illumination field that maximizes downstream perception performance: light is reduced over empty areas and reallocated onto object regions. The key insight is that active illumination control enables zero-shot nighttime generalization of daytime-trained models, improving detection and segmentation without retraining (+18.7% mAP50, +5.0% mIoU) while cutting energy use by 40%, offering a cost-effective route to robust nighttime perception.
Link: https://arxiv.org/abs/2512.08912
Authors: Simon de Moreau,Andrei Bursuc,Hafid El-Idrissi,Fabien Moutarde
Affiliations: Valeo.ai; Valeo; Mines Paris PSL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Preprint. 12 pages, 9 figures. Project page: this https URL
Abstract:Nighttime environments pose significant challenges for camera-based perception, as existing methods passively rely on the scene lighting. We introduce Lighting-driven Dynamic Active Sensing (LiDAS), a closed-loop active illumination system that combines off-the-shelf visual perception models with high-definition headlights. Rather than uniformly brightening the scene, LiDAS dynamically predicts an optimal illumination field that maximizes downstream perception performance, i.e., decreasing light on empty areas to reallocate it on object regions. LiDAS enables zero-shot nighttime generalization of daytime-trained models through adaptive illumination control. Trained on synthetic data and deployed zero-shot in real-world closed-loop driving scenarios, LiDAS enables +18.7% mAP50 and +5.0% mIoU over standard low-beam at equal power. It maintains performance while reducing energy use by 40%. LiDAS complements domain-generalization methods, further strengthening robustness without retraining. By turning readily available headlights into active vision actuators, LiDAS offers a cost-effective solution to robust nighttime perception.
[CV-5] Self-Evolving 3D Scene Generation from a Single Image
[Quick Read]: This paper tackles the fundamental challenge of generating high-quality, textured 3D scenes from a single image: existing methods are constrained by object-centric training and struggle to preserve faithful structure and texture in complex, large-scale scenes. The key is EvoScene, a training-free framework whose three iterative stages (Spatial Prior Initialization, Visual-guided 3D Scene Mesh Generation, and Spatial-guided Novel View Generation) alternate between the geometric reasoning of 3D generation models and the visual knowledge of video generation models, progressively improving geometric stability and view-consistent texture while completing unseen regions.
Link: https://arxiv.org/abs/2512.08905
Authors: Kaizhi Zheng,Yue Fan,Jing Gu,Zishuo Xu,Xuehai He,Xin Eric Wang
Affiliations: University of California, Santa Cruz; University of California, Santa Barbara
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Generating high-quality, textured 3D scenes from a single image remains a fundamental challenge in vision and graphics. Recent image-to-3D generators recover reasonable geometry from single views, but their object-centric training limits generalization to complex, large-scale scenes with faithful structure and texture. We present EvoScene, a self-evolving, training-free framework that progressively reconstructs complete 3D scenes from single images. The key idea is combining the complementary strengths of existing models: geometric reasoning from 3D generation models and visual knowledge from video generation models. Through three iterative stages–Spatial Prior Initialization, Visual-guided 3D Scene Mesh Generation, and Spatial-guided Novel View Generation–EvoScene alternates between 2D and 3D domains, gradually improving both structure and appearance. Experiments on diverse scenes demonstrate that EvoScene achieves superior geometric stability, view-consistent textures, and unseen-region completion compared to strong baselines, producing ready-to-use 3D meshes for practical applications.
[CV-6] UniLayDiff: A Unified Diffusion Transformer for Content-Aware Layout Generation
[Quick Read]: This paper addresses the lack of a unified model for content-aware layout generation: existing methods struggle to handle multiple input constraints (such as element types, sizes, or relationships) at once and usually need separate model parameters per condition, preventing unified end-to-end generation. The key of UniLayDiff is a unified diffusion transformer that, for the first time, treats layout constraints as a distinct modality and adopts a Multi-Modal Diffusion Transformer framework to model the complex interplay among the background image, layout elements, and diverse constraints; relation constraints are integrated via LoRA fine-tuning after pretraining, achieving unified conditional generation within a single architecture while improving overall layout quality.
Link: https://arxiv.org/abs/2512.08897
Authors: Zeyang Liu,Le Wang,Sanping Zhou,Yuxuan Wu,Xiaolong Sun,Gang Hua,Haoxiang Li
Affiliations: Xi’an Jiaotong University; Amazon.com, Inc.; Pixocial Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Content-aware layout generation is a critical task in graphic design automation, focused on creating visually appealing arrangements of elements that seamlessly blend with a given background image. The variety of real-world applications makes it highly challenging to develop a single model capable of unifying the diverse range of input-constrained generation sub-tasks, such as those conditioned by element types, sizes, or their relationships. Current methods either address only a subset of these tasks or necessitate separate model parameters for different conditions, failing to offer a truly unified solution. In this paper, we propose UniLayDiff: a Unified Diffusion Transformer, that for the first time, addresses various content-aware layout generation tasks with a single, end-to-end trainable model. Specifically, we treat layout constraints as a distinct modality and employ the Multi-Modal Diffusion Transformer framework to capture the complex interplay between the background image, layout elements, and diverse constraints. Moreover, we integrate relation constraints through fine-tuning the model with LoRA after pretraining the model on other tasks. Such a schema not only achieves unified conditional generation but also enhances overall layout quality. Extensive experiments demonstrate that UniLayDiff achieves state-of-the-art performance on tasks ranging from unconditional to diverse conditional generation and, to the best of our knowledge, is the first model to unify the full range of content-aware layout generation tasks.
[CV-7] No Labels No Problem: Training Visual Reasoners with Multimodal Verifiers
[Quick Read]: This paper addresses the difficulty of achieving both precise object grounding and complex spatial-relationship understanding in visual reasoning. Existing methods fall into two camps: language-only chain-of-thought approaches that require large-scale (image, query, answer) supervision, and training-free program-synthesis approaches that suffer from flawed logic and inaccurate grounding. The key of the solution is an annotation-free training framework that improves both reasoning and grounding through AI-powered verifiers: an LLM verifier refines the reasoning process via reinforcement learning, while a VLM verifier strengthens visual grounding through automated hard-negative mining, yielding significant gains without any ground-truth labels.
Link: https://arxiv.org/abs/2512.08889
Authors: Damiano Marsili,Georgia Gkioxari
Affiliations: California Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project webpage: this https URL
Abstract:Visual reasoning is challenging, requiring both precise object grounding and understanding complex spatial relationships. Existing methods fall into two camps: language-only chain-of-thought approaches, which demand large-scale (image, query, answer) supervision, and program-synthesis approaches which use pre-trained models and avoid training, but suffer from flawed logic and erroneous grounding. We propose an annotation-free training framework that improves both reasoning and grounding. Our framework uses AI-powered verifiers: an LLM verifier refines LLM reasoning via reinforcement learning, while a VLM verifier strengthens visual grounding through automated hard-negative mining, eliminating the need for ground truth labels. This design combines the strengths of modern AI systems: advanced language-only reasoning models for decomposing spatial queries into simpler subtasks, and strong vision specialist models improved via performant VLM critics. We evaluate our approach across diverse spatial reasoning tasks, and show that our method improves visual reasoning and surpasses open-source and proprietary models, while with our improved visual grounding model we further outperform recent text-only visual reasoning methods. Project webpage: this https URL
[CV-8] Accelerated Rotation-Invariant Convolution for UAV Image Segmentation
[Quick Read]: This paper addresses the performance drop of conventional segmentation networks (e.g., U-Net) on UAV aerial imagery, where targets appear at arbitrary orientations and standard convolutions lack rotation invariance. The core challenge is to perform multi-orientation rotation-invariant convolution efficiently and accurately, avoiding the computational cost and memory-traffic explosion of naively expanding the filter bank. The key is a GPU-optimized rotation-invariant convolution framework that removes the data-lowering (im2col) step required by matrix-multiplication convolution and exploits structured data sharing among symmetrically rotated filters, greatly reducing redundant memory access and computation; the approach further generalizes to arbitrary non-symmetric rotation angles, achieving 20-55% faster training and 15-45% lower energy than CUDNN across input sizes, and boosting segmentation accuracy by up to 6% when integrated into U-Net.
Link: https://arxiv.org/abs/2512.08888
Authors: Manduhu Manduhu,Alexander Dow,Gerard Dooly,James Riordan
Affiliations: University of the West of Scotland; University of Limerick
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:
Abstract:Rotation invariance is essential for precise, object-level segmentation in UAV aerial imagery, where targets can have arbitrary orientations and exhibit fine-scale details. Conventional segmentation architectures like U-Net rely on convolution operators that are not rotation-invariant, leading to degraded segmentation accuracy across varying viewpoints. Rotation invariance can be achieved by expanding the filter bank across multiple orientations; however, this will significantly increase computational cost and memory traffic. In this paper, we introduce a GPU-optimized rotation-invariant convolution framework that eliminates the traditional data-lowering (im2col) step required for matrix-multiplication-based convolution. By exploiting structured data sharing among symmetrically rotated filters, our method achieves multi-orientation convolution with greatly reduced memory traffic and computational redundancy. We further generalize the approach to accelerate convolution with arbitrary (non-symmetric) rotation angles. Across extensive benchmarks, the proposed convolution achieves 20–55% faster training and 15–45% lower energy consumption than CUDNN, while maintaining accuracy comparable to state-of-the-art rotation-invariant methods. In the eight-orientation setting, our approach achieves up to 45% speedup and 41% energy savings on 256×256 inputs, and 32% speedup and 23% lower energy usage on 1024×1024 inputs. Integrated into a U-Net segmentation model, the framework yields up to 6% improvement in accuracy over the non-rotation-aware baseline. These results demonstrate that the proposed method provides an effective and highly efficient alternative to existing rotation-invariant CNN frameworks.
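The paper's contribution is the efficient GPU kernel itself; the snippet below only illustrates the multi-orientation computation being accelerated, using rotated copies of one filter bank with orientation max-pooling (the four-fold symmetric case).

```python
# Multi-orientation convolution sketch: convolve with symmetrically rotated
# copies of one filter bank and pool over orientations, which makes the
# response (approximately) rotation-invariant. A functional reference, not
# the paper's optimized kernel.
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 64, 64)            # input image / feature map
w = torch.randn(8, 3, 3, 3)              # base filter bank (8 filters)

# Four symmetric orientations via 90-degree rotations of the spatial dims.
responses = [
    F.conv2d(x, torch.rot90(w, k, dims=(2, 3)), padding=1) for k in range(4)
]
# Orientation pooling: take the max response over the rotated copies.
y = torch.stack(responses, dim=0).max(dim=0).values
print(y.shape)  # (1, 8, 64, 64)
```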
[CV-9] SATGround: A Spatially-Aware Approach for Visual Grounding in Remote Sensing
[Quick Read]: This paper addresses the limited accuracy of visual grounding in remote sensing imagery, where models struggle to match natural-language instructions to specific target locations in complex scenes. The key of the solution is a novel structured localization mechanism: a pretrained vision-language model (VLM) is fine-tuned on a diverse set of instruction-following tasks, with specialized control tokens interfacing a dedicated grounding module, enabling joint reasoning over language and spatial information and substantially improving object localization accuracy in satellite imagery.
Link: https://arxiv.org/abs/2512.08881
Authors: Aysim Toker,Andreea-Maria Oncescu,Roy Miles,Ismail Elezi,Jiankang Deng
Affiliations: Huawei London Research Center; Imperial College London
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Vision-language models (VLMs) are emerging as powerful generalist tools for remote sensing, capable of integrating information across diverse tasks and enabling flexible, instruction-based interactions via a chat interface. In this work, we enhance VLM-based visual grounding in satellite imagery by proposing a novel structured localization mechanism. Our approach involves finetuning a pretrained VLM on a diverse set of instruction-following tasks, while interfacing a dedicated grounding module through specialized control tokens for localization. This method facilitates joint reasoning over both language and spatial information, significantly enhancing the model’s ability to precisely localize objects in complex satellite scenes. We evaluate our framework on several remote sensing benchmarks, consistently improving the state-of-the-art, including a 24.8% relative improvement over previous methods on visual grounding. Our results highlight the benefits of integrating structured spatial reasoning into VLMs, paving the way for more reliable real-world satellite data analysis.
[CV-10] Siamese-Driven Optimization for Low-Resolution Image Latent Embedding in Image Captioning
[Quick Read]: This paper addresses the performance bottleneck of image captioning on low-resolution images (LRI). Relying on large models such as transformers for encoding improves results but brings heavy computation, large memory footprints, and difficult retraining, hindering deployment in resource-constrained settings. The key of the proposed SOLI (Siamese-Driven Optimization for Low-Resolution Image Latent Embedding in Image Captioning) is to optimize image latent embeddings with a Siamese network architecture: a dual-pathway neural network structure enables efficient and accurate image-to-text translation, minimizing computational overhead without sacrificing performance and making the method well suited to lightweight captioning under resource constraints.
Link: https://arxiv.org/abs/2512.08873
Authors: Jing Jie Tan,Anissa Mokraoui,Ban-Hoe Kwan,Danny Wee-Kiat Ng,Yan-Chai Hum
Affiliations: 1. University of Tunku Abdul Rahman; 2. Sorbonne Paris Cité
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 6 pages
Abstract:Image captioning is essential in many fields including assisting visually impaired individuals, improving content management systems, and enhancing human-computer interaction. However, a recent challenge in this domain is dealing with low-resolution images (LRI). While performance can be improved by using larger models like transformers for encoding, these models are typically heavyweight, demanding significant computational resources and memory, leading to challenges in retraining. To address this, the proposed SOLI (Siamese-Driven Optimization for Low-Resolution Image Latent Embedding in Image Captioning) approach presents a solution specifically designed for lightweight captioning of low-resolution images. It employs a Siamese network architecture to optimize latent embeddings, enhancing the efficiency and accuracy of the image-to-text translation process. By focusing on a dual-pathway neural network structure, SOLI minimizes computational overhead without sacrificing performance, making it an ideal choice for training in resource-constrained scenarios.
[CV-11] Tri-Bench: Stress-Testing VLM Reliability on Spatial Reasoning under Camera Tilt and Object Interference
[Quick Read]: This paper addresses the lack of verifiability and controllability in current vision-language models (VLMs) for geometric reasoning, in particular their marked degradation under realistic scene changes such as camera pose shifts and background interference. The key of the solution is Tri-Bench, a structured and controlled benchmark of planar triangle problems that strictly isolates relative geometric reasoning while stressing two deployment-critical factors: camera pose (planar vs. tilted) and object interference from scene context (10 everyday objects). Notably, a single fixed prompt explicitly describes the image border as a frame of reference, so correct answers require 3D spatial reasoning via homography rather than 2D image-plane cues, enabling verifiable control over model outputs.
Link: https://arxiv.org/abs/2512.08860
Authors: Amit Bendkhale
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 3 figures. Code and data: this https URL . Accepted to the AAAI 2026 Workshop on Trust and Control in Agentic AI (TrustAgent)
Abstract:Verifiable geometric reasoning is a critical component for trustworthy and controllable agentic AI. Despite impressive capabilities, Vision-Language Models (VLMs) often fail under realistic scene changes. We present Tri-Bench, a compact benchmark of planar triangle problems that isolates relative geometric reasoning while stressing two deployment-critical factors: camera pose (planar vs. tilted) and scene context via object interference (10 everyday objects). To test verifiability and control, we evaluate four recent VLMs using a single, fixed prompt whose guardrail explicitly describes a surrounding square border, enabling correct answers via homography. We evaluate six simple tasks over binary and continuous targets, and observe that the overall accuracy with respect to 3D ground truth is modest, ~69% on average (best ~75%, worst ~64%). The same responses align even more closely with 2D projections in the image plane, where mean accuracy is ~72%. All four VLMs consistently fail, with accuracy falling to ~0%, on recognizing minority shape classes (equilateral, isosceles, right-angled triangles). Additionally, overall VLM accuracy degrades by ~4.1% under camera tilt. This demonstrates that models fail to correctly utilize the explicit frame-of-reference hint provided in the prompt and default to 2D image plane cues. Finally, we find that object interference has no significant effect on VLM accuracy.
[CV-12] Generation is Required for Data-Efficient Perception
[Quick Read]: This paper asks whether a generative approach (obtaining internal representations by inverting a decoder) is necessary for machine vision to reach human-level perception, focusing on compositional generalization, a hallmark of human perception. The study shows that although today's most successful vision models are non-generative (encoder-based), they struggle to generalize compositionally without specific inductive biases, and enforcing the required biases on encoders via regularization or architectural constraints is generally infeasible; generative (decoder-based) methods, in contrast, can enforce suitable inductive biases on the decoder and achieve compositional generalization naturally by inverting it, either online via gradient-based search or offline via generative replay. The key is thus to constrain the inductive biases explicitly on the decoder and obtain representations through decoder inversion rather than through complex encoder regularization.
Link: https://arxiv.org/abs/2512.08854
Authors: Jack Brady,Bernhard Schölkopf,Thomas Kipf,Simon Buchholz,Wieland Brendel
Affiliations: Max Planck Institute for Intelligent Systems; Tübingen AI Center; ELLIS Institute; Google DeepMind
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Preprint
Abstract:It has been hypothesized that human-level visual perception requires a generative approach in which internal representations result from inverting a decoder. Yet today’s most successful vision models are non-generative, relying on an encoder that maps images to representations without decoder inversion. This raises the question of whether generation is, in fact, necessary for machines to achieve human-level visual perception. To address this, we study whether generative and non-generative methods can achieve compositional generalization, a hallmark of human perception. Under a compositional data generating process, we formalize the inductive biases required to guarantee compositional generalization in decoder-based (generative) and encoder-based (non-generative) methods. We then show theoretically that enforcing these inductive biases on encoders is generally infeasible using regularization or architectural constraints. In contrast, for generative methods, the inductive biases can be enforced straightforwardly, thereby enabling compositional generalization by constraining a decoder and inverting it. We highlight how this inversion can be performed efficiently, either online through gradient-based search or offline through generative replay. We examine the empirical implications of our theory by training a range of generative and non-generative methods on photorealistic image datasets. We find that, without the necessary inductive biases, non-generative methods often fail to generalize compositionally and require large-scale pretraining or added supervision to improve generalization. By comparison, generative methods yield significant improvements in compositional generalization, without requiring additional data, by leveraging suitable inductive biases on a decoder along with search and replay.
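The online, gradient-based decoder inversion mentioned above can be sketched in a few lines: the representation for an input is obtained by searching for the latent whose decoding matches it. The toy decoder and dimensions below are assumptions for illustration.

```python
# Minimal sketch of gradient-based decoder inversion: instead of an encoder,
# the representation z for an image x is found by minimizing ||dec(z) - x||^2.
import torch
import torch.nn as nn

decoder = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 128))
x = torch.randn(128)                      # observed "image"

z = torch.zeros(8, requires_grad=True)    # latent initialized at the origin
opt = torch.optim.Adam([z], lr=0.05)
for step in range(200):                   # online search: argmin_z ||dec(z) - x||^2
    opt.zero_grad()
    loss = ((decoder(z) - x) ** 2).mean()
    loss.backward()
    opt.step()

# z now plays the role of the inferred representation for x.
print(float(loss))
```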
[CV-13] InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient Unlimited-Input Vision-Language Models
[Quick Read]: This paper addresses two core problems of vision-language models (VLMs) on long sequences: window attention degrades once sequence length exceeds the window size, and linear attention underperforms on information-intensive tasks such as OCR and document understanding. The key of the proposed InfiniteVL architecture is to combine sliding window attention (SWA) with Gated DeltaNet, preserving linear computational complexity while improving multimodal performance. With a three-stage training strategy (distillation pretraining, instruction tuning, and long-sequence SFT), InfiniteVL matches Transformer-based baselines using less than 2% of the training data of leading VLMs, retains stable long-term memory, achieves over 3.6x inference speedup against FlashAttention-2-accelerated counterparts, and keeps latency and memory footprint constant.
Link: https://arxiv.org/abs/2512.08829
Authors: Hongyuan Tao,Bencheng Liao,Shaoyu Chen,Haoran Yin,Qian Zhang,Wenyu Liu,Xinggang Wang
Affiliations: Huazhong University of Science and Technology; Horizon Robotics
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 16 pages, 8 figures, conference or other essential info
Abstract:Window attention and linear attention represent two principal strategies for mitigating the quadratic complexity and ever-growing KV cache in Vision-Language Models (VLMs). However, we observe that window-based VLMs suffer performance degradation when sequence length exceeds the window size, while linear attention underperforms on information-intensive tasks such as OCR and document understanding. To overcome these limitations, we propose InfiniteVL, a linear-complexity VLM architecture that synergizes sliding window attention (SWA) with Gated DeltaNet. For achieving competitive multimodal performance under constrained resources, we design a three-stage training strategy comprising distillation pretraining, instruction tuning, and long-sequence SFT. Remarkably, using less than 2% of the training data required by leading VLMs, InfiniteVL not only substantially outperforms previous linear-complexity VLMs but also matches the performance of leading Transformer-based VLMs, while demonstrating effective long-term memory retention. Compared to similar-sized Transformer-based VLMs accelerated by FlashAttention-2, InfiniteVL achieves over 3.6x inference speedup while maintaining constant latency and memory footprint. In streaming video understanding scenarios, it sustains a stable 24 FPS real-time prefill speed while preserving long-term memory cache. Code and models are available at this https URL.
[CV-14] raining-Free Dual Hyperbolic Adapters for Better Cross-Modal Reasoning
[Quick Read]: This paper addresses the performance degradation of current vision-language models (VLMs) under domain shifts and the excessive compute required to fine-tune them in new domains. The key of the proposed Training-free Dual Hyperbolic Adapters (T-DHA) is to map the hierarchical relationships among semantic concepts into hyperbolic rather than the traditional Euclidean space: using the Poincaré ball model, whose volume grows exponentially with radius, hierarchical data can be embedded efficiently, delivering stronger representation and classification robustness with fewer feature dimensions; combined with negative learning, this further improves accuracy and generalization in few-shot recognition and domain generalization.
Link: https://arxiv.org/abs/2512.08820
Authors: Yi Zhang,Chun-Wun Cheng,Junyi He,Ke Yu,Yushun Tang,Carola-Bibiane Schönlieb,Zhihai He,Angelica I. Aviles-Rivero
Affiliations: Shenzhen University; Southern University of Science and Technology; University of Cambridge; Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted in IEEE Transactions on Multimedia (TMM)
Abstract:Recent research in Vision-Language Models (VLMs) has significantly advanced our capabilities in cross-modal reasoning. However, existing methods suffer from performance degradation with domain changes or require substantial computational resources for fine-tuning in new domains. To address this issue, we develop a new adaptation method for large vision-language models, called Training-free Dual Hyperbolic Adapters (T-DHA). We characterize the vision-language relationship between semantic concepts, which typically has a hierarchical tree structure, in the hyperbolic space instead of the traditional Euclidean space. Hyperbolic spaces exhibit exponential volume growth with radius, unlike the polynomial growth in Euclidean space. We find that this unique property is particularly effective for embedding hierarchical data structures using the Poincaré ball model, achieving significantly improved representation and discrimination power. Coupled with negative learning, it provides more accurate and robust classifications with fewer feature dimensions. Our extensive experimental results on various datasets demonstrate that the T-DHA method significantly outperforms existing state-of-the-art methods in few-shot image recognition and domain generalization tasks.
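For intuition, here is a minimal sketch of Poincaré-ball classification with curvature c = 1: features are mapped into the ball with the exponential map at the origin and scored by negative hyperbolic distance to class prototypes. The tensors are toy placeholders, not T-DHA's actual adapters.

```python
# Poincare-ball embedding and distance-based classification (c = 1).
import torch

def expmap0(v, eps=1e-6):
    """Exponential map at the origin of the Poincare ball (c = 1)."""
    n = v.norm(dim=-1, keepdim=True).clamp(min=eps)
    return torch.tanh(n) * v / n

def poincare_dist(x, y, eps=1e-6):
    """d(x, y) = arccosh(1 + 2 |x-y|^2 / ((1-|x|^2)(1-|y|^2)))."""
    num = 2 * (x - y).pow(2).sum(-1)
    den = (1 - x.pow(2).sum(-1)) * (1 - y.pow(2).sum(-1))
    return torch.acosh(1 + num / den.clamp(min=eps))

img_feat = expmap0(torch.randn(4, 16))        # 4 image embeddings
protos = expmap0(torch.randn(10, 16) * 0.1)   # 10 class text prototypes
logits = -poincare_dist(img_feat[:, None, :], protos[None, :, :])
print(logits.argmax(dim=-1))                  # predicted classes
```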
[CV-15] MatteViT: High-Frequency-Aware Document Shadow Removal with Shadow Matte Guidance
[Quick Read]: This paper addresses document shadow removal: removing shadows while precisely preserving high-frequency details such as text edges and lines, which is essential for the clarity of digitized documents and downstream optical character recognition (OCR). The key of the proposed matte vision transformer (MatteViT) framework is to fuse spatial- and frequency-domain information through two components: a lightweight high-frequency amplification module (HFAM) that decomposes and adaptively amplifies high-frequency components, and a continuous luminance-based shadow matte, built from a custom matte dataset and shadow matte generator, that provides precise spatial guidance from the earliest processing stage, enabling high-fidelity restoration of fine-grained structure.
Link: https://arxiv.org/abs/2512.08789
Authors: Chaewon Kim,Seoyeon Lee,Jonghyuk Park
Affiliations: Kookmin University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 7 figures, 5 tables
Abstract:Document shadow removal is essential for enhancing the clarity of digitized documents. Preserving high-frequency details (e.g., text edges and lines) is critical in this process because shadows often obscure or distort fine structures. This paper proposes a matte vision transformer (MatteViT), a novel shadow removal framework that applies spatial and frequency-domain information to eliminate shadows while preserving fine-grained structural details. To effectively retain these details, we employ two preservation strategies. First, our method introduces a lightweight high-frequency amplification module (HFAM) that decomposes and adaptively amplifies high-frequency components. Second, we present a continuous luminance-based shadow matte, generated using a custom-built matte dataset and shadow matte generator, which provides precise spatial guidance from the earliest processing stage. These strategies enable the model to accurately identify fine-grained regions and restore them with high fidelity. Extensive experiments on public benchmarks (RDD and Kligler) demonstrate that MatteViT achieves state-of-the-art performance, providing a robust and practical solution for real-world document shadow removal. Furthermore, the proposed method better preserves text-level details in downstream tasks, such as optical character recognition, improving recognition performance over prior methods.
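A high-frequency amplification step in the spirit of HFAM can be sketched with an FFT band split; the cutoff and gain below are illustrative assumptions, whereas the paper's module amplifies adaptively.

```python
# Frequency-domain split and high-band amplification for a document crop.
import torch

def amplify_high_freq(img, cutoff=0.1, gain=1.5):
    # img: (B, C, H, W) in [0, 1]
    B, C, H, W = img.shape
    spec = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    ys = torch.linspace(-0.5, 0.5, H).view(H, 1).expand(H, W)
    xs = torch.linspace(-0.5, 0.5, W).view(1, W).expand(H, W)
    low_mask = ((ys ** 2 + xs ** 2).sqrt() < cutoff).float()
    low, high = spec * low_mask, spec * (1 - low_mask)
    out = torch.fft.ifft2(torch.fft.ifftshift(low + gain * high, dim=(-2, -1)))
    return out.real.clamp(0, 1)

doc = torch.rand(1, 1, 64, 64)   # toy grayscale document crop
print(amplify_high_freq(doc).shape)
```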
[CV-16] LoFA: Learning to Predict Personalized Priors for Fast Adaptation of Visual Generative Models
[Quick Read]: This paper addresses the efficiency and adaptability problems of personalizing visual generative models: existing methods such as Low-Rank Adaptation (LoRA) depend on task-specific data and lengthy optimization, while hypernetwork-based approaches struggle to map fine-grained user prompts to the complex distribution of LoRA parameters, limiting practicality. The key of the general framework LoFA is the observation that structured distribution patterns emerge in the relative changes between base-model and adapted parameters; a two-stage hypernetwork first predicts these relative distribution patterns, which reflect the key adaptation regions, and then uses them to guide efficient prediction of the final LoRA weights, producing high-quality personalized priors within seconds, compared with hours of optimization for conventional LoRA.
Link: https://arxiv.org/abs/2512.08785
Authors: Yiming Hao,Mutian Xu,Chongjie Ye,Jie Qin,Shunlin Lu,Yipeng Qin,Xiaoguang Han
Affiliations: SSE, CUHKSZ; FNii-Shenzhen; Guangdong Provincial Key Laboratory of Future Networks of Intelligence, CUHKSZ; SDS, CUHKSZ; Cardiff University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL
Abstract:Personalizing visual generative models to meet specific user needs has gained increasing attention, yet current methods like Low-Rank Adaptation (LoRA) remain impractical due to their demand for task-specific data and lengthy optimization. While a few hypernetwork-based approaches attempt to predict adaptation weights directly, they struggle to map fine-grained user prompts to complex LoRA distributions, limiting their practical applicability. To bridge this gap, we propose LoFA, a general framework that efficiently predicts personalized priors for fast model adaptation. We first identify a key property of LoRA: structured distribution patterns emerge in the relative changes between LoRA and base model parameters. Building on this, we design a two-stage hypernetwork: first predicting relative distribution patterns that capture key adaptation regions, then using these to guide final LoRA weight prediction. Extensive experiments demonstrate that our method consistently predicts high-quality personalized priors within seconds, across multiple tasks and user prompts, even outperforming conventional LoRA that requires hours of processing. Project page: this https URL.
[CV-17] Refining Visual Artifacts in Diffusion Models via Explainable AI-based Flaw Activation Maps
[Quick Read]: This paper addresses artifacts and unrealistic regions produced by diffusion models during image generation, which severely harm quality and realism. The key of the proposed self-refining diffusion framework is an explainable-AI (XAI)-based flaw highlighter that produces flaw activation maps (FAMs) identifying flawed regions; noise in those regions is amplified during the forward process, and the reverse process focuses on them, markedly improving reconstruction quality. The method is robust across diffusion models and tasks (image generation, text-to-image generation, inpainting), improving FID by up to 27.3% and showing that XAI can go beyond interpretability to actively participate in image refinement.
Link: https://arxiv.org/abs/2512.08774
Authors: Seoyeon Lee,Gwangyeol Yu,Chaewon Kim,Jonghyuk Park
Affiliations: Kookmin University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 9 figures, 7 tables
Abstract:Diffusion models have achieved remarkable success in image synthesis. However, addressing artifacts and unrealistic regions remains a critical challenge. We propose self-refining diffusion, a novel framework that enhances image generation quality by detecting these flaws. The framework employs an explainable artificial intelligence (XAI)-based flaw highlighter to produce flaw activation maps (FAMs) that identify artifacts and unrealistic regions. These FAMs improve reconstruction quality by amplifying noise in flawed regions during the forward process and by focusing on these regions during the reverse process. The proposed approach achieves up to a 27.3% improvement in Fréchet inception distance across various diffusion-based models, demonstrating consistently strong performance on diverse datasets. It also shows robust effectiveness across different tasks, including image generation, text-to-image generation, and inpainting. These results demonstrate that explainable AI techniques can extend beyond interpretability to actively contribute to image refinement. The proposed framework offers a versatile and effective approach applicable to various diffusion models and tasks, significantly advancing the field of image synthesis.
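The FAM-guided forward process can be illustrated by spatially scaling the injected noise with the flaw map; the schedule value, scales, and hand-drawn FAM below are toy assumptions rather than the paper's learned quantities.

```python
# FAM-weighted forward diffusion sketch: amplify the injected noise where the
# flaw activation map (FAM) is high, so flawed regions are re-noised and
# re-generated more aggressively.
import torch

x0 = torch.rand(1, 3, 32, 32)             # generated image to refine
fam = torch.zeros(1, 1, 32, 32)
fam[..., 8:20, 8:20] = 1.0                # pretend this region was flagged as flawed

alpha_bar_t = torch.tensor(0.5)           # noise-schedule value at step t
base, boost = 1.0, 1.5                    # noise scale: stronger inside flaws
noise = torch.randn_like(x0) * (base + boost * fam)
x_t = alpha_bar_t.sqrt() * x0 + (1 - alpha_bar_t).sqrt() * noise
# x_t is then denoised by the reverse process, which focuses on high-FAM regions.
print(x_t.shape)
```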
[CV-18] Wan-Move: Motion-controllable Video Generation via Latent Trajectory Guidance
[Quick Read]: This paper addresses the coarse control granularity and poor scalability of motion control in existing video generation models, which limit practical use. The key of the simple, scalable Wan-Move framework is to turn the original condition features directly into motion-aware features for fine-grained guidance: object motion is first represented with dense point trajectories, the trajectories are projected into latent space, and the first frame's features are propagated along each trajectory to produce an aligned spatiotemporal feature map, which serves as the updated latent condition for an off-the-shelf image-to-video model (e.g., Wan-I2V-14B) without any architecture change, achieving precise motion control while removing the need for auxiliary motion encoders and keeping base-model fine-tuning simple and scalable.
Link: https://arxiv.org/abs/2512.08765
Authors: Ruihang Chu,Yefei He,Zhekai Chen,Shiwei Zhang,Xiaogang Xu,Bin Xia,Dingdong Wang,Hongwei Yi,Xihui Liu,Hengshuang Zhao,Yu Liu,Yingya Zhang,Yujiu Yang
Affiliations: Tongyi Lab, Alibaba Group; Tsinghua University; HKU; CUHK
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: NeurIPS 2025. Code and data available at this https URL
Abstract:We present Wan-Move, a simple and scalable framework that brings motion control to video generative models. Existing motion-controllable methods typically suffer from coarse control granularity and limited scalability, leaving their outputs insufficient for practical use. We narrow this gap by achieving precise and high-quality motion control. Our core idea is to directly make the original condition features motion-aware for guiding video synthesis. To this end, we first represent object motions with dense point trajectories, allowing fine-grained control over the scene. We then project these trajectories into latent space and propagate the first frame’s features along each trajectory, producing an aligned spatiotemporal feature map that tells how each scene element should move. This feature map serves as the updated latent condition, which is naturally integrated into the off-the-shelf image-to-video model, e.g., Wan-I2V-14B, as motion guidance without any architecture change. It removes the need for auxiliary motion encoders and makes fine-tuning base models easily scalable. Through scaled training, Wan-Move generates 5-second, 480p videos whose motion controllability rivals Kling 1.5 Pro’s commercial Motion Brush, as indicated by user studies. To support comprehensive evaluation, we further design MoveBench, a rigorously curated benchmark featuring diverse content categories and hybrid-verified annotations. It is distinguished by larger data volume, longer video durations, and high-quality motion annotations. Extensive experiments on MoveBench and the public dataset consistently show Wan-Move’s superior motion quality. Code, models, and benchmark data are made publicly available.
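The core feature-propagation step can be sketched as follows: each tracked point's first-frame latent feature is copied along its trajectory to form a motion-aware spatiotemporal condition map. Latent sizes and the random trajectories are stand-ins, not Wan-Move's real tensors.

```python
# Trajectory-guided feature propagation sketch.
import torch

T, C, H, W, N = 8, 16, 32, 32, 50          # frames, channels, latent size, tracks
first_frame_feat = torch.randn(C, H, W)     # latent features of frame 0
tracks = torch.randint(0, 32, (T, N, 2))    # (t, n) -> (x, y), tracks[0] is the start

cond = torch.zeros(T, C, H, W)              # motion-aware condition map
x0, y0 = tracks[0, :, 0], tracks[0, :, 1]
point_feat = first_frame_feat[:, y0, x0]    # (C, N): feature of each track at t=0
for t in range(T):
    xt, yt = tracks[t, :, 0], tracks[t, :, 1]
    cond[t, :, yt, xt] = point_feat         # propagate frame-0 features along tracks
print(cond.shape)                           # (8, 16, 32, 32), used as latent condition
```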
[CV-19] Skewness-Guided Pruning of Multimodal Swin Transformers for Federated Skin Lesion Classification on Edge Devices
[Quick Read]: This paper addresses two obstacles for high-accuracy computer vision models in medical imaging: they are too large and compute-intensive to deploy on edge devices, and data-privacy requirements rule out centralized training. The key is a skewness-guided pruning method that analyzes the statistical skewness of the output distributions of the Multi-Head Self-Attention and Multi-Layer Perceptron layers in a multimodal Swin Transformer and prunes structures selectively on that basis. Validated in a horizontal Federated Learning setting, the method substantially reduces model complexity without loss of classification accuracy, shrinking a compact Swin Transformer by about 36%.
Link: https://arxiv.org/abs/2512.08751
Authors: Kuniko Paxton,Koorosh Aslansefat,Dhavalkumar Thakker,Yiannis Papadopoulos
Affiliations: University of Hull
Subjects: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:
Abstract:In recent years, high-performance computer vision models have achieved remarkable success in medical imaging, with some skin lesion classification systems even surpassing dermatology specialists in diagnostic accuracy. However, such models are computationally intensive and large in size, making them unsuitable for deployment on edge devices. In addition, strict privacy constraints hinder centralized data management, motivating the adoption of Federated Learning (FL). To address these challenges, this study proposes a skewness-guided pruning method that selectively prunes the Multi-Head Self-Attention and Multi-Layer Perceptron layers of a multimodal Swin Transformer based on the statistical skewness of their output distributions. The proposed method was validated in a horizontal FL environment and shown to maintain performance while substantially reducing model complexity. Experiments on the compact Swin Transformer demonstrate approximately 36% model size reduction with no loss in accuracy. These findings highlight the feasibility of achieving efficient model compression and privacy-preserving distributed learning for multimodal medical AI on edge devices.
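A minimal sketch of the scoring-and-pruning loop: compute the skewness of each block's activations on a calibration batch and keep the most-skewed blocks. The toy linear blocks and keep-count stand in for the paper's Swin MHSA/MLP layers and selection rule.

```python
# Skewness-guided structural pruning sketch.
import torch
import torch.nn as nn

def skewness(x):
    x = x.flatten().float()
    mu, sigma = x.mean(), x.std().clamp(min=1e-8)
    return (((x - mu) / sigma) ** 3).mean().item()

blocks = nn.ModuleList([nn.Linear(32, 32) for _ in range(6)])
calib = torch.randn(128, 32)                # calibration batch

scores, h = [], calib
for i, blk in enumerate(blocks):
    h = torch.relu(blk(h))
    scores.append((skewness(h), i))         # score each block's output distribution

# Keep the blocks whose outputs are most skewed; prune the rest.
keep = {i for _, i in sorted(scores, reverse=True)[:4]}
pruned = nn.ModuleList([b for i, b in enumerate(blocks) if i in keep])
print(sorted(keep), len(pruned))
```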
[CV-20] A Scalable Pipeline Combining Procedural 3D Graphics and Guided Diffusion for Photorealistic Synthetic Training Data Generation in White Button Mushroom Segmentation
[Quick Read]: This paper addresses the scarcity and cost of high-quality, precisely annotated training data for computer vision in industrial mushroom (Agaricus bisporus) cultivation: real-world annotation is labor-intensive and hard to scale, while existing synthetic data often lacks realism and generalizes poorly. The key is a novel workflow that integrates Blender 3D rendering with a constrained diffusion model, automatically producing photorealistic annotated synthetic images with full control over 3D scene configuration and annotations and no specialized computer graphics expertise required. Two released datasets of 6,000 images each (over 250k instances in total) allow a Mask R-CNN trained only on synthetic data to reach state-of-the-art zero-shot segmentation (F1 = 0.859 on M18K), confirming the approach's effectiveness and its transferability to other mushroom species and agricultural vision tasks.
Link: https://arxiv.org/abs/2512.08747
Authors: Artúr I. Károly,Péter Galambos
Affiliations: Antal Bejczy Center for Intelligent Robotics, Research and Innovation Center of Obuda University, Budapest, Hungary
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 8 figures
Abstract:Industrial mushroom cultivation increasingly relies on computer vision for monitoring and automated harvesting. However, developing accurate detection and segmentation models requires large, precisely annotated datasets that are costly to produce. Synthetic data provides a scalable alternative, yet often lacks sufficient realism to generalize to real-world scenarios. This paper presents a novel workflow that integrates 3D rendering in Blender with a constrained diffusion model to automatically generate high-quality annotated, photorealistic synthetic images of Agaricus Bisporus mushrooms. This approach preserves full control over 3D scene configuration and annotations while achieving photorealism without the need for specialized computer graphics expertise. We release two synthetic datasets (each containing 6,000 images depicting over 250k mushroom instances) and evaluate Mask R-CNN models trained on them in a zero-shot setting. When tested on two independent real-world datasets (including a newly collected benchmark), our method achieves state-of-the-art segmentation performance (F1 = 0.859 on M18K), despite using only synthetic training data. Although the approach is demonstrated on Agaricus Bisporus mushrooms, the proposed pipeline can be readily adapted to other mushroom species or to other agricultural domains, such as fruit and leaf detection.
[CV-21] Mitigating Individual Skin Tone Bias in Skin Lesion Classification through Distribution-Aware Reweighting
[Quick Read]: This paper addresses individual-level unfairness in skin lesion classification: conventional approaches based on coarse subgroup categories cannot capture individual variation in skin tone and risk overlooking minority tones within a subgroup. The key is to treat skin tone as a continuous attribute rather than a categorical label, model its distribution with kernel density estimation (KDE), quantify disparities between tone distributions with twelve statistical distance metrics, and introduce a distance-based reweighting (DRW) loss that corrects the under-representation of minority tones in training data. Experiments on CNN and Transformer models show that distribution-based reweighting improves individual fairness more effectively than conventional group-based reweighting, particularly under Fidelity Similarity (FS), Wasserstein Distance (WD), Hellinger Metric (HM), and Harmonic Mean Similarity (HS).
Link: https://arxiv.org/abs/2512.08733
Authors: Kuniko Paxton,Zeinab Dehghani,Koorosh Aslansefat,Dhavalkumar Thakker,Yiannis Papadopoulos
Affiliations: University of Hull
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Skin color has historically been a focal point of discrimination, yet fairness research in machine learning for medical imaging often relies on coarse subgroup categories, overlooking individual-level variations. Such group-based approaches risk obscuring biases faced by outliers within subgroups. This study introduces a distribution-based framework for evaluating and mitigating individual fairness in skin lesion classification. We treat skin tone as a continuous attribute rather than a categorical label, and employ kernel density estimation (KDE) to model its distribution. We further compare twelve statistical distance metrics to quantify disparities between skin tone distributions and propose a distance-based reweighting (DRW) loss function to correct underrepresentation in minority tones. Experiments across CNN and Transformer models demonstrate: (i) the limitations of categorical reweighting in capturing individual-level disparities, and (ii) the superior performance of distribution-based reweighting, particularly with Fidelity Similarity (FS), Wasserstein Distance (WD), Hellinger Metric (HM), and Harmonic Mean Similarity (HS). These findings establish a robust methodology for advancing fairness at individual level in dermatological AI systems, and highlight broader implications for sensitive continuous attributes in medical image analysis.
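The distribution-aware pipeline can be sketched in a few lines: fit a KDE over continuous skin-tone values, derive per-sample weights from inverse density, and measure disparities with one of the paper's twelve metrics. The data and the exact weighting rule are illustrative assumptions.

```python
# KDE-based density estimation and distance-aware reweighting sketch.
import numpy as np
from scipy.stats import gaussian_kde, wasserstein_distance

rng = np.random.default_rng(0)
tones = np.concatenate([rng.normal(0.3, 0.05, 900), rng.normal(0.8, 0.05, 100)])

kde = gaussian_kde(tones)
density = kde(tones)                      # higher = better represented
weights = 1.0 / np.clip(density, 1e-6, None)
weights *= len(weights) / weights.sum()   # normalize to mean 1 for the loss

# Example disparity measurement between minority and majority tone distributions.
d = wasserstein_distance(tones[tones > 0.6], tones[tones <= 0.6])
print(f"Wasserstein disparity: {d:.3f}, max weight: {weights.max():.1f}")
# In training, each sample's loss term is scaled by its weight (the DRW loss).
```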
[CV-22] SegEarth-OV3: Exploring SAM 3 for Open-Vocabulary Semantic Segmentation in Remote Sensing Images
[Quick Read]: This paper addresses the precise-localization difficulty of training-free open-vocabulary semantic segmentation (OVSS) in remote sensing scenes, which contain many dense, small targets; existing CLIP-based methods are limited by complex module pipelines or insufficient localization accuracy. The key is the first training-free application of the promptable Segment Anything Model 3 (SAM 3) to remote sensing OVSS: a mask fusion strategy combines the outputs of SAM 3's semantic segmentation head and its Transformer decoder (instance head) to improve land-cover completeness, while the presence score from the presence head filters out categories absent from the scene, reducing false positives caused by vast vocabularies and patch-level processing. Experiments on extensive remote sensing datasets show that this simple adaptation already yields promising performance, demonstrating SAM 3's potential for open-vocabulary segmentation in remote sensing.
Link: https://arxiv.org/abs/2512.08730
Authors: Kaiyu Li,Shengqi Zhang,Yupeng Deng,Zhi Wang,Deyu Meng,Xiangyong Cao
Affiliations: Xi’an Jiaotong University; Chinese Academy of Sciences
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Most existing methods for training-free Open-Vocabulary Semantic Segmentation (OVSS) are based on CLIP. While these approaches have made progress, they often face challenges in precise localization or require complex pipelines to combine separate modules, especially in remote sensing scenarios where numerous dense and small targets are present. Recently, Segment Anything Model 3 (SAM 3) was proposed, unifying segmentation and recognition in a promptable framework. In this paper, we present a preliminary exploration of applying SAM 3 to the remote sensing OVSS task without any training. First, we implement a mask fusion strategy that combines the outputs from SAM 3’s semantic segmentation head and the Transformer decoder (instance head). This allows us to leverage the strengths of both heads for better land coverage. Second, we utilize the presence score from the presence head to filter out categories that do not exist in the scene, reducing false positives caused by the vast vocabulary sizes and patch-level processing in geospatial scenes. We evaluate our method on extensive remote sensing datasets. Experiments show that this simple adaptation achieves promising performance, demonstrating the potential of SAM 3 for remote sensing OVSS. Our code is released at this https URL.
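The two adaptations can be sketched as simple post-processing over SAM 3 outputs; the tensors, scores, and threshold below are placeholders and do not reproduce SAM 3's real API.

```python
# Presence-score filtering + semantic/instance mask fusion sketch.
import torch

vocab = ["building", "road", "ship", "stadium"]
presence = torch.tensor([0.95, 0.90, 0.10, 0.05])   # presence-head scores
keep = [c for c, p in zip(vocab, presence) if p > 0.5]

H, W = 64, 64
sem_masks = {c: torch.rand(H, W) > 0.7 for c in keep}      # semantic head
inst_masks = {c: [torch.rand(H, W) > 0.9] for c in keep}   # instance head (per object)

fused = {}
for c in keep:
    m = sem_masks[c].clone()
    for inst in inst_masks[c]:
        m |= inst            # union keeps instances the semantic head missed
    fused[c] = m
print(keep, {c: int(m.sum()) for c, m in fused.items()})
```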
[CV-23] Multi-domain performance analysis with scores tailored to user preferences
[Quick Read]: This paper addresses how to compute and interpret weighted average performance across application domains: case distributions differ markedly between domains, so naive arithmetic averaging can mask key performance characteristics and mislead judgments about a model's true behavior. The key is a probabilistic framework that treats a performance as a probability measure (e.g., a normalized confusion matrix) and shows that only certain remarkable scores (the continuum of ranking scores parameterized by user preferences), with weights that themselves depend on user preferences, make the summarized performance's value equal a weighted arithmetic mean of the domain-specific values. Building on this theory, the paper rigorously defines four preference-dependent domain categories (easiest, most difficult, preponderant, and bottleneck domains) and develops new visual tools for two-class classification, enabling precise and interpretable cross-domain performance analysis.
Link: https://arxiv.org/abs/2512.08715
Authors: Sébastien Piérard,Adrien Deliège,Marc Van Droogenbroeck
Affiliations: University of Liège
Subjects: Performance (cs.PF); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:The performance of algorithms, methods, and models tends to depend heavily on the distribution of cases on which they are applied, this distribution being specific to the applicative domain. After performing an evaluation in several domains, it is highly informative to compute a (weighted) mean performance and, as shown in this paper, to scrutinize what happens during this averaging. To achieve this goal, we adopt a probabilistic framework and consider a performance as a probability measure (e.g., a normalized confusion matrix for a classification task). It appears that the corresponding weighted mean is known to be the summarization, and that only some remarkable scores assign to the summarized performance a value equal to a weighted arithmetic mean of the values assigned to the domain-specific performances. These scores include the family of ranking scores, a continuum parameterized by user preferences, and that the weights to consider in the arithmetic mean depend on the user preferences. Based on this, we rigorously define four domains, named easiest, most difficult, preponderant, and bottleneck domains, as functions of user preferences. After establishing the theory in a general setting, regardless of the task, we develop new visual tools for two-class classification.
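A worked toy example of the summarization for two-class classification: per-domain performances are normalized confusion matrices, the summarized performance is their weighted mean, and for a score linear in the matrix (here accuracy) the summarized score equals the weighted mean of the per-domain scores. The numbers are invented for illustration.

```python
# Summarizing per-domain performances as a weighted mean of confusion matrices.
import numpy as np

# Normalized 2x2 confusion matrices [[TN, FP], [FN, TP]] for two domains.
P_dom1 = np.array([[0.45, 0.05], [0.10, 0.40]])
P_dom2 = np.array([[0.30, 0.20], [0.05, 0.45]])
w = np.array([0.7, 0.3])                     # share of cases per domain

P_summary = w[0] * P_dom1 + w[1] * P_dom2    # summarized performance

def accuracy(P):                             # one possible score on a performance
    return P[0, 0] + P[1, 1]

print(accuracy(P_dom1), accuracy(P_dom2), accuracy(P_summary))
# accuracy is linear in P, so accuracy(P_summary) equals the same weighted mean
# of the per-domain accuracies; the paper characterizes for which scores (and
# preference-dependent weights) this property holds.
```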
[CV-24] Scale-invariant and View-relational Representation Learning for Full Surround Monocular Depth
[Quick Read]: This paper addresses two key issues in full surround monocular depth estimation (FSMDE): directly transferring foundation models incurs high computational cost that prevents real-time use, and such models usually predict only relative rather than metric-scale depth. The key is a novel knowledge distillation strategy that combines a classification-style distillation scheme with a depth binning module to transfer robust depth knowledge from a foundation model to a lightweight student network: (1) cross-interaction knowledge distillation distills the foundation model's scale-invariant depth-bin probabilities into the student while guiding it to infer metric-scale bin centers from ground-truth depth; (2) view-relational knowledge distillation encodes structural relationships among adjacent camera views to strengthen cross-view depth consistency. Experiments on DDAD and nuScenes confirm the method's effectiveness and a favorable performance-efficiency trade-off that meets real-time requirements.
Link: https://arxiv.org/abs/2512.08700
Authors: Kyumin Hwang,Wonhyeok Choi,Kiljoon Han,Wonjoon Choi,Minwoo Choi,Yongcheon Na,Minwoo Park,Sunghoon Im
Affiliations: Daegu Gyeongbuk Institute of Science and Technology (DGIST); Hyundai Motor Company
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at IEEE Robotics and Automation Letters (RA-L) 2026
Abstract:Recent foundation models demonstrate strong generalization capabilities in monocular depth estimation. However, directly applying these models to Full Surround Monocular Depth Estimation (FSMDE) presents two major challenges: (1) high computational cost, which limits real-time performance, and (2) difficulty in estimating metric-scale depth, as these models are typically trained to predict only relative depth. To address these limitations, we propose a novel knowledge distillation strategy that transfers robust depth knowledge from a foundation model to a lightweight FSMDE network. Our approach leverages a hybrid regression framework combining the knowledge distillation scheme–traditionally used in classification–with a depth binning module to enhance scale consistency. Specifically, we introduce a cross-interaction knowledge distillation scheme that distills the scale-invariant depth bin probabilities of a foundation model into the student network while guiding it to infer metric-scale depth bin centers from ground-truth depth. Furthermore, we propose view-relational knowledge distillation, which encodes structural relationships among adjacent camera views and transfers them to enhance cross-view depth consistency. Experiments on DDAD and nuScenes demonstrate the effectiveness of our method compared to conventional supervised methods and existing knowledge distillation approaches. Moreover, our method achieves a favorable trade-off between performance and efficiency, meeting real-time requirements.
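The hybrid bin-based distillation can be sketched as a KL term on bin probabilities plus a metric regression through bin centers; shapes and toy tensors below are assumptions, and the real method operates per pixel across surround views.

```python
# Bin-probability distillation + metric bin-center supervision sketch.
import torch
import torch.nn.functional as F

N_BINS, N_PIX = 64, 1024
teacher_logits = torch.randn(N_PIX, N_BINS)            # foundation model (frozen)
student_logits = torch.randn(N_PIX, N_BINS, requires_grad=True)
bin_centers = (torch.rand(N_BINS) * 80).requires_grad_()   # metric centers (m)
gt_depth = torch.rand(N_PIX) * 80

# (1) Distill scale-invariant bin probabilities.
kd = F.kl_div(
    F.log_softmax(student_logits, dim=-1),
    F.softmax(teacher_logits, dim=-1),
    reduction="batchmean",
)
# (2) Supervise metric scale: expected depth from bins vs. ground truth.
pred = (F.softmax(student_logits, dim=-1) * bin_centers).sum(-1)
metric = F.l1_loss(pred, gt_depth)

loss = kd + metric
loss.backward()
print(float(kd), float(metric))
```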
zh
[CV-25] What really matters for person re-identification? A Mixture-of-Experts Framework for Semantic Attribute Importance
【速读】:该论文旨在解决当前行人重识别(Person Re-Identification, ReID)方法虽然精度高但缺乏可解释性的问题,即明确模型在判断行人身份时依赖哪些高层语义属性。解决方案的关键在于提出MoSAIC-ReID框架,这是一个基于专家混合(Mixture-of-Experts)的可解释ReID方法,其核心创新包括:采用LoRA(Low-Rank Adaptation)构建与单一属性绑定的专家模块,并引入一个“oracle路由器”实现对属性贡献的可控归因分析。通过该设计,研究者能够系统量化不同属性(如服装颜色、内在特征等)对重识别性能的影响,从而揭示哪些语义线索具有显著作用,而哪些(如配饰)则影响有限。
链接: https://arxiv.org/abs/2512.08697
作者: Athena Psalta,Vasileios Tsironis,Konstantinos Karantzalos
机构: National Technical University of Athens (国立技术大学雅典分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:State-of-the-art person re-identification methods achieve impressive accuracy but remain largely opaque, leaving open the question: which high-level semantic attributes do these models actually rely on? We propose MoSAIC-ReID, a Mixture-of-Experts framework that systematically quantifies the importance of pedestrian attributes for re-identification. Our approach uses LoRA-based experts, each linked to a single attribute, and an oracle router that enables controlled attribution analysis. While MoSAIC-ReID achieves competitive performance on Market-1501 and DukeMTMC under the assumption that attribute annotations are available at test time, its primary value lies in providing a large-scale, quantitative study of attribute importance across intrinsic and extrinsic cues. Using generalized linear models, statistical tests, and feature-importance analyses, we reveal which attributes, such as clothing colors and intrinsic characteristics, contribute most strongly, while infrequent cues (e.g. accessories) have limited effect. This work offers a principled framework for interpretable ReID and highlights the requirements for integrating explicit semantic knowledge in practice. Code is available at this https URL
zh
[CV-26] Dual-Branch Center-Surrounding Contrast: Rethinking Contrastive Learning for 3D Point Clouds
【Quick Read】: This paper tackles the observation that generative self-supervised learning (SSL) for 3D point clouds, dominated by Masked Autoencoder (MAE) approaches, struggles to capture high-level discriminative features and therefore underperforms on downstream tasks such as linear probing. Contrastive learning (CL), despite its strong discriminative representations on images, remains scarce for 3D point clouds, and directly transplanting 2D CL methods fails to model local geometric details. The key to the solution is a dual-branch Center-Surrounding Contrast (CSCon) framework: the center and surrounding regions are masked separately to build dual-branch inputs with center-biased and surrounding-biased representations that better capture rich geometric information, while a patch-level contrastive loss further balances high-level semantics with local sensitivity. The design yields significant gains across evaluation protocols, most notably on linear probing, where it surpasses existing methods.
Link: https://arxiv.org/abs/2512.08673
Authors: Shaofeng Zhang, Xuanqi Chen, Xiangdong Zhang, Sitong Wu, Junchi Yan
Affiliations: University of Science and Technology of China; Shanghai Jiao Tong University; Chinese University of Hong Kong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 6 figures
Abstract:Most existing self-supervised learning (SSL) approaches for 3D point clouds are dominated by generative methods based on Masked Autoencoders (MAE). However, these generative methods have been proven to struggle to capture high-level discriminative features effectively, leading to poor performance on linear probing and other downstream tasks. In contrast, contrastive methods excel in discriminative feature representation and generalization ability on image data. Despite this, contrastive learning (CL) in 3D data remains scarce. Besides, simply applying CL methods designed for 2D data to 3D fails to effectively learn 3D local details. To address these challenges, we propose a novel Dual-Branch Center-Surrounding Contrast (CSCon) framework. Specifically, we apply masking to the center and surrounding parts separately, constructing dual-branch inputs with center-biased and surrounding-biased representations to better capture rich geometric information. Meanwhile, we introduce a patch-level contrastive loss to further enhance both high-level information and local sensitivity. Under the FULL and ALL protocols, CSCon achieves performance comparable to generative methods; under the MLP-LINEAR, MLP-3, and ONLY-NEW protocols, our method attains state-of-the-art results, even surpassing cross-modal approaches. In particular, under the MLP-LINEAR protocol, our method outperforms the baseline (Point-MAE) by 7.9%, 6.7%, and 10.3% on the three variants of ScanObjectNN, respectively. The code will be made publicly available.
zh
[CV-27] Repulsor: Accelerating Generative Modeling with a Contrastive Memory Bank
【Quick Read】: This paper addresses the high training cost and inefficient representation learning of current denoising-based generative models (e.g., diffusion and flow matching) for visual synthesis. Existing remedies inject discriminative representations but depend on external pre-trained encoders, which introduce extra overhead and domain shift. The key to the solution is Repulsor, a plug-and-play training framework that needs no external encoder: a memory bank maintains a large, dynamically updated queue of negative samples, decoupling the number of negatives from the batch size and supplying abundant, high-quality contrastive signal at little extra cost, while a low-dimensional projection head keeps memory and bandwidth consumption low. The design makes the model self-contained, adds no parameters or computation at inference, and achieves faster convergence with better generative quality.
Link: https://arxiv.org/abs/2512.08648
Authors: Shaofeng Zhang, Xuanqi Chen, Ning Liao, Haoxiang Zhao, Xiaoxing Wang, Haoru Tan, Sitong Wu, Xiaosong Jia, Qi Fan, Junchi Yan
Affiliations: University of Science and Technology of China; Shanghai Jiao Tong University; HKU (The University of Hong Kong); CUHK (The Chinese University of Hong Kong); Fudan University; Nanjing University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 19 figures
Abstract:The dominance of denoising generative models (e.g., diffusion, flow-matching) in visual synthesis is tempered by their substantial training costs and inefficiencies in representation learning. While injecting discriminative representations via auxiliary alignment has proven effective, this approach still faces key limitations: the reliance on external, pre-trained encoders introduces overhead and domain shift. A dispersion-based strategy that encourages strong separation among in-batch latent representations alleviates this specific dependency. To assess the effect of the number of negative samples in generative modeling, we propose Repulsor, a plug-and-play training framework that requires no external encoders. Our method integrates a memory bank mechanism that maintains a large, dynamically updated queue of negative samples across training iterations. This decouples the number of negatives from the mini-batch size, providing abundant and high-quality negatives for a contrastive objective without a multiplicative increase in computational cost. A low-dimensional projection head is used to further minimize memory and bandwidth overhead. Repulsor offers three principal advantages: (1) it is self-contained, eliminating dependency on pretrained vision foundation models and their associated forward-pass overhead; (2) it introduces no additional parameters or computational cost during inference; and (3) it enables substantially faster convergence, achieving superior generative quality more efficiently. On ImageNet-256, Repulsor achieves a state-of-the-art FID of 2.40 within 400k steps, significantly outperforming comparable methods. Code will be available at this https URL.
zh
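The memory-bank mechanism at the heart of Repulsor can be sketched compactly. The following PyTorch snippet is an illustrative reconstruction, not the authors' code; the queue size, projection dimension, and temperature are made-up values. A FIFO queue of past projected features supplies negatives for an InfoNCE-style loss, so the number of negatives is decoupled from the mini-batch size.

```python
import torch
import torch.nn.functional as F

class NegativeQueue:
    """FIFO memory bank of L2-normalized low-dimensional projections."""
    def __init__(self, dim=64, size=8192):
        self.bank = F.normalize(torch.randn(size, dim), dim=1)
        self.ptr = 0

    @torch.no_grad()
    def enqueue(self, feats):
        n = feats.shape[0]
        idx = (self.ptr + torch.arange(n)) % self.bank.shape[0]
        self.bank[idx] = feats.detach()
        self.ptr = (self.ptr + n) % self.bank.shape[0]

def contrastive_loss(z1, z2, queue, tau=0.2):
    """InfoNCE: z2 is the positive for z1; the queue supplies negatives."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    pos = (z1 * z2).sum(dim=1, keepdim=True)   # (B, 1) positive logits
    neg = z1 @ queue.bank.t()                  # (B, K) negative logits
    logits = torch.cat([pos, neg], dim=1) / tau
    labels = torch.zeros(z1.shape[0], dtype=torch.long)  # positive at index 0
    loss = F.cross_entropy(logits, labels)
    queue.enqueue(z2)                          # recycle as future negatives
    return loss

# Usage with hypothetical projected latents from two views of a batch.
queue = NegativeQueue()
loss = contrastive_loss(torch.randn(32, 64), torch.randn(32, 64), queue)
```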
[CV-28] C-DIRA: Computationally Efficient Dynamic ROI Routing and Domain-Invariant Adversarial Learning for Lightweight Driver Behavior Recognition
【Quick Read】: This paper targets driver distraction recognition from in-vehicle cameras, where lightweight models struggle to capture fine-grained behavioral cues and consequently degrade on unseen drivers or under changing conditions, while ROI-based methods add computational cost. The key to the solution is C-DIRA, a computationally efficient framework combining dynamic ROI routing with domain-invariant adversarial learning: saliency-driven Top-K ROI pooling and fused classification extract and integrate local features; dynamic ROI routing applies ROI inference only to difficult samples, cutting redundant computation; and pseudo-domain labels with adversarial learning yield domain-invariant features robust to driver and background variation. The result is high accuracy at markedly fewer FLOPs and lower latency, with robustness under visual degradation such as blur and low light and stable performance across unseen domains.
Link: https://arxiv.org/abs/2512.08647
Authors: Keito Inoshita
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Driver distraction behavior recognition using in-vehicle cameras demands real-time inference on edge devices. However, lightweight models often fail to capture fine-grained behavioral cues, resulting in reduced performance on unseen drivers or under varying conditions. ROI-based methods also increase computational cost, making it difficult to balance efficiency and accuracy. This work addresses the need for a lightweight architecture that overcomes these constraints. We propose Computationally efficient Dynamic region of Interest Routing and domain-invariant Adversarial learning for lightweight driver behavior recognition (C-DIRA). The framework combines saliency-driven Top-K ROI pooling and fused classification for local feature extraction and integration. Dynamic ROI routing enables selective computation by applying ROI inference only to high difficulty data samples. Moreover, pseudo-domain labeling and adversarial learning are used to learn domain-invariant features robust to driver and background variation. Experiments on the State Farm Distracted Driver Detection Dataset show that C-DIRA maintains high accuracy with significantly fewer FLOPs and lower latency than prior lightweight models. It also demonstrates robustness under visual degradation such as blur and low-light, and stable performance across unseen domains. These results confirm C-DIRA’s effectiveness in achieving compactness, efficiency, and generalization.
zh
[CV-29] Chain-of-Image Generation: Toward Monitorable and Controllable Image Generation
【Quick Read】: This paper addresses the "black box" nature of current image generation models: their internal generative processes lack interpretability and controllability, limiting human observation, intervention, and reliability guarantees. The key to the solution is the Chain-of-Image Generation (CoIG) framework, which reframes image generation as a human-like, step-by-step, semantically explicit sequential process: a large language model (LLM) first decomposes a complex prompt into a sequence of simple, executable instructions, and the image generation model then executes this plan progressively, with each step focusing on a single semantic entity so that intermediate states can be directly monitored. Two new metrics, CoIG Readability and Causal Relevance, quantify this monitorability; the framework also mitigates entity collapse and is model-agnostic, so it can be paired with any image generation model.
Link: https://arxiv.org/abs/2512.08645
Authors: Young Kyung Kim, Oded Schlesinger, Yuzhou Zhao, J. Matias Di Martino, Guillermo Sapiro
Affiliations: Duke University; Princeton University; Universidad Católica del Uruguay; Apple
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 13 figures
Abstract:While state-of-the-art image generation models achieve remarkable visual quality, their internal generative processes remain a “black box.” This opacity limits human observation and intervention, and poses a barrier to ensuring model reliability, safety, and control. Furthermore, their non-human-like workflows make them difficult for human observers to interpret. To address this, we introduce the Chain-of-Image Generation (CoIG) framework, which reframes image generation as a sequential, semantic process analogous to how humans create art. Similar to the advantages in monitorability and performance that Chain-of-Thought (CoT) brought to large language models (LLMs), CoIG can produce equivalent benefits in text-to-image generation. CoIG utilizes an LLM to decompose a complex prompt into a sequence of simple, step-by-step instructions. The image generation model then executes this plan by progressively generating and editing the image. Each step focuses on a single semantic entity, enabling direct monitoring. We formally assess this property using two novel metrics: CoIG Readability, which evaluates the clarity of each intermediate step via its corresponding output; and Causal Relevance, which quantifies the impact of each procedural step on the final generated image. We further show that our framework mitigates entity collapse by decomposing the complex generation task into simple subproblems, analogous to the procedural reasoning employed by CoT. Our experimental results indicate that CoIG substantially enhances quantitative monitorability while achieving competitive compositional robustness compared to established baseline models. The framework is model-agnostic and can be integrated with any image generation model.
zh
[CV-30] Aerial Vision-Language Navigation with a Unified Framework for Spatial Temporal and Embodied Reasoning
【Quick Read】: This paper addresses aerial Vision-and-Language Navigation (VLN) for UAVs in complex urban environments using only monocular RGB images and natural language instructions, where existing methods depend on panoramic imagery, depth, or odometry, raising cost and integration complexity and hindering deployment on lightweight platforms. The key to the solution is a unified aerial VLN framework that operates solely on egocentric monocular RGB observations and instructions, formulating navigation as next-token prediction and jointly optimizing spatial perception, trajectory reasoning, and action prediction through prompt-guided multi-task learning; a keyframe selection strategy reduces visual redundancy, and an action merging and label reweighting mechanism mitigates long-tailed supervision imbalance, enabling stable and efficient multi-task co-training.
Link: https://arxiv.org/abs/2512.08639
Authors: Huilin Xu, Zhuoyang Liu, Yixiang Luomei, Feng Xu
Affiliations: Fudan University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Under Review, 12 pages, 9 figures
Abstract:Aerial Vision-and-Language Navigation (VLN) aims to enable unmanned aerial vehicles (UAVs) to interpret natural language instructions and navigate complex urban environments using onboard visual observation. This task holds promise for real-world applications such as low-altitude inspection, search-and-rescue, and autonomous aerial delivery. Existing methods often rely on panoramic images, depth inputs, or odometry to support spatial reasoning and action planning. These requirements increase system cost and integration complexity, thus hindering practical deployment for lightweight UAVs. We present a unified aerial VLN framework that operates solely on egocentric monocular RGB observations and natural language instructions. The model formulates navigation as a next-token prediction problem, jointly optimizing spatial perception, trajectory reasoning, and action prediction through prompt-guided multi-task learning. Moreover, we propose a keyframe selection strategy to reduce visual redundancy by retaining semantically informative frames, along with an action merging and label reweighting mechanism that mitigates long-tailed supervision imbalance and facilitates stable multi-task co-training. Extensive experiments on the Aerial VLN benchmark validate the effectiveness of our method. Under the challenging monocular RGB-only setting, our model achieves strong results across both seen and unseen environments. It significantly outperforms existing RGB-only baselines and narrows the performance gap with state-of-the-art panoramic RGB-D counterparts. Comprehensive ablation studies further demonstrate the contribution of our task design and architectural choices.
zh
[CV-31] See-Control: A Multimodal Agent Framework for Smartphone Interaction with a Robotic Arm
【Quick Read】: This paper addresses the dependence of current multimodal large language model (MLLM) smartphone agents on the Android Debug Bridge (ADB) for data transmission and action execution, which restricts them to ADB-enabled Android devices. The key to the solution is See-Control, a framework that operates smartphones through direct physical interaction with a low-DoF robotic arm, requiring neither ADB nor system back-end access and thus offering a platform-agnostic solution. Its core components are the Embodied Smartphone Operation (ESO) benchmark of 155 tasks with evaluation metrics, an MLLM-based embodied agent that generates robotic control commands, and a richly annotated dataset of operation episodes, providing a concrete path toward home robots completing smartphone-dependent tasks in realistic environments.
Link: https://arxiv.org/abs/2512.08629
Authors: Haoyu Zhao, Weizhong Ding, Yuhao Yang, Zheng Tian, Linyi Yang, Kun Shao, Jun Wang
Affiliations: University College London; Imperial College London; Huawei Noah's Ark Lab; ShanghaiTech University; Southern University of Science and Technology
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments:
Abstract:Recent advances in Multimodal Large Language Models (MLLMs) have enabled their use as intelligent agents for smartphone operation. However, existing methods depend on the Android Debug Bridge (ADB) for data transmission and action execution, limiting their applicability to Android devices. In this work, we introduce the novel Embodied Smartphone Operation (ESO) task and present See-Control, a framework that enables smartphone operation via direct physical interaction with a low-DoF robotic arm, offering a platform-agnostic solution. See-Control comprises three key components: (1) an ESO benchmark with 155 tasks and corresponding evaluation metrics; (2) an MLLM-based embodied agent that generates robotic control commands without requiring ADB or system back-end access; and (3) a richly annotated dataset of operation episodes, offering valuable resources for future research. By bridging the gap between digital agents and the physical world, See-Control provides a concrete step toward enabling home robots to perform smartphone-dependent tasks in realistic environments.
zh
[CV-32] rajectory Densification and Depth from Perspective-based Blur
【Quick Read】: This paper addresses perspective-based blur in handheld capture, which arises from rotational camera motion, grows pronounced under long exposure, and varies with an object's spatial position and depth. The key to the solution is a metric-depth method that jointly exploits the blur pattern of a video stream and dense trajectory estimation: off-the-shelf vision encoders and point trackers extract video information; depth maps are estimated via windowed embedding and multi-window aggregation; and the sparse trajectories obtained from the optical algorithm are densified with a vision-language model, yielding accurate depth reconstruction and trajectory recovery.
Link: https://arxiv.org/abs/2512.08627
Authors: Tianchen Qiu, Qirun Zhang, Jiajian He, Zhengyue Zhuge, Jiahui Xu, Yueting Chen
Affiliations: Zhejiang University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:In the absence of a mechanical stabilizer, the camera undergoes inevitable rotational dynamics during capturing, which induces perspective-based blur especially under long-exposure scenarios. From an optical standpoint, perspective-based blur is depth-position-dependent: objects residing at distinct spatial locations incur different blur levels even under the same imaging settings. Inspired by this, we propose a novel method that estimates metric depth by examining the blur pattern of a video stream, together with a dense trajectory obtained via a joint optical-design algorithm. Specifically, we employ an off-the-shelf vision encoder and point tracker to extract video information. Then, we estimate the depth map via windowed embedding and multi-window aggregation, and densify the sparse trajectory from the optical algorithm using a vision-language model. Evaluations on multiple depth datasets demonstrate that our method attains strong performance over a large depth range, while maintaining favorable generalization. Relative to the real trajectory in handheld shooting settings, our optical algorithm achieves superior precision and the dense reconstruction maintains strong accuracy.
zh
[CV-33] OpenMonoGS-SLAM: Monocular Gaussian Splatting SLAM with Open-set Semantics
【Quick Read】: This paper addresses the lack of semantic understanding in conventional monocular SLAM for open-world environments, where existing systems rely on depth sensors or closed-set semantic models that limit scalability and adaptability. The key to the solution is OpenMonoGS-SLAM, the first framework to unify 3D Gaussian Splatting (3DGS) with open-vocabulary semantic understanding: it leverages visual foundation models (VFMs) such as MASt3R for visual geometry and SAM with CLIP for open-vocabulary segmentation, so that accurate monocular tracking, mapping, and semantic perception are achieved from self-supervised objectives alone, without depth input or 3D semantic annotations. A memory mechanism designed for high-dimensional semantic features further builds high-fidelity Gaussian semantic feature maps, yielding performance comparable to or surpassing baselines on both closed-set and open-set segmentation.
Link: https://arxiv.org/abs/2512.08625
Authors: Jisang Yoo, Gyeongjin Kang, Hyun-kyu Ko, Hyeonwoo Yu, Eunbyung Park
Affiliations: Sungkyunkwan University; Yonsei University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 4 figures
Abstract:Simultaneous Localization and Mapping (SLAM) is a foundational component in robotics, AR/VR, and autonomous systems. With the rising focus on spatial AI in recent years, combining SLAM with semantic understanding has become increasingly important for enabling intelligent perception and interaction. Recent efforts have explored this integration, but they often rely on depth sensors or closed-set semantic models, limiting their scalability and adaptability in open-world environments. In this work, we present OpenMonoGS-SLAM, the first monocular SLAM framework that unifies 3D Gaussian Splatting (3DGS) with open-set semantic understanding. To achieve our goal, we leverage recent advances in Visual Foundation Models (VFMs), including MASt3R for visual geometry and SAM and CLIP for open-vocabulary semantics. These models provide robust generalization across diverse tasks, enabling accurate monocular camera tracking and mapping, as well as a rich understanding of semantics in open-world environments. Our method operates without any depth input or 3D semantic ground truth, relying solely on self-supervised learning objectives. Furthermore, we propose a memory mechanism specifically designed to manage high-dimensional semantic features, which effectively constructs Gaussian semantic feature maps, leading to strong overall performance. Experimental results demonstrate that our approach achieves performance comparable to or surpassing existing baselines in both closed-set and open-set segmentation tasks, all without relying on supplementary sensors such as depth maps or semantic annotations.
zh
[CV-34] Decoupling Template Bias in CLIP: Harnessing Empty Prompts for Enhanced Few-Shot Learning AAAI2026
【Quick Read】: This paper addresses a bias in few-shot learning with Contrastive Language-Image Pre-Training (CLIP) caused by template-sample similarity (TSS), the resemblance between a text template and an image sample: the model comes to rely on template proximity rather than true image-to-category alignment, hurting both accuracy and robustness. The key to the solution is "empty prompts", textual inputs that convey emptiness without category information, used to capture unbiased template features and offset TSS bias: during pre-training, empty prompts reveal and reduce template-induced bias inside the CLIP encoder; during few-shot fine-tuning, a bias calibration loss enforces correct alignment between images and their categories so the model focuses on relevant visual cues. Experiments show the method substantially reduces TSS-induced performance fluctuations while improving classification accuracy and robustness.
Link: https://arxiv.org/abs/2512.08606
Authors: Zhenyu Zhang, Guangyao Chen, Yixiong Zou, Zhimeng Huang, Yuhua Li
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 14 pages, 8 figures, Association for the Advancement of Artificial Intelligence (AAAI2026, poster)
Abstract:The Contrastive Language-Image Pre-Training (CLIP) model excels in few-shot learning by aligning visual and textual representations. Our study shows that template-sample similarity (TSS), defined as the resemblance between a text template and an image sample, introduces bias. This bias leads the model to rely on template proximity rather than true sample-to-category alignment, reducing both accuracy and robustness in classification. We present a framework that uses empty prompts, textual inputs that convey the idea of “emptiness” without category information. These prompts capture unbiased template features and offset TSS bias. The framework employs two stages. During pre-training, empty prompts reveal and reduce template-induced bias within the CLIP encoder. During few-shot fine-tuning, a bias calibration loss enforces correct alignment between images and their categories, ensuring the model focuses on relevant visual cues. Experiments across multiple benchmarks demonstrate that our template correction method significantly reduces performance fluctuations caused by TSS, yielding higher classification accuracy and stronger robustness. The repository of this project is available at this https URL.
zh
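One plausible reading of the empty-prompt idea, sketched with the Hugging Face CLIP API: score an image against category prompts and against a category-free "empty" prompt, then subtract the latter similarity to offset the template-sample component. The model checkpoint, prompt wording, image path, and subtraction rule here are assumptions for illustration, not the paper's exact calibration.

```python
import torch
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

# Illustrative empty-prompt calibration; all choices below are assumptions.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

classes = ["dog", "cat", "car"]
prompts = [f"a photo of a {c}" for c in classes]
empty_prompt = ["a photo of something"]  # conveys 'emptiness', no category

image = Image.open("example.jpg")        # hypothetical input image
inputs = processor(text=prompts + empty_prompt, images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

sims = out.logits_per_image[0]           # similarities to all prompts
class_logits, empty_logit = sims[:-1], sims[-1]
calibrated = class_logits - empty_logit  # offset template-sample similarity
pred = classes[int(calibrated.argmax())]
```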
[CV-35] Automated Pollen Recognition in Optical and Holographic Microscopy Images DATE
【Quick Read】: This paper addresses automated detection and classification of pollen grains in optical and holographic microscopy images, with a focus on veterinary cytology. While YOLOv8s detection (91.3% mAP50) and MobileNetV3L classification (97% accuracy) perform well on optical images, performance on greyscale holographic images is markedly lower. The key to the solution is dataset expansion through automated labeling and bounding-box area enlargement, which narrows the gap between the two modalities: detection improves from 2.49% to 13.3% mAP50 and classification from 42% to 54%. The results indicate that deep learning can be paired with low-cost lensless digital holographic microscopy for reliable and economical pollen analysis.
Link: https://arxiv.org/abs/2512.08589
Authors: Swarn Singh Warshaneyan, Maksims Ivanovs, Blaž Cugmas, Inese Bērziņa, Laura Goldberga, Mindaugas Tamosiunas, Roberts Kadiķis
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 08 pages, 10 figures, 04 tables, 20 references. Date of Conference: 13-14 June 2025 Date Added to IEEE Xplore: 10 July 2025 Electronic ISBN: 979-8-3315-0969-9 Print on Demand (PoD) ISBN: 979-8-3315-0970-5 DOI: https://doi.org/10.1109/AICCONF64766.2025.11064260 Conference Location: Prague, Czech Republic Online Access: this https URL
Abstract:This study explores the application of deep learning to improve and automate pollen grain detection and classification in both optical and holographic microscopy images, with a particular focus on veterinary cytology use cases. We used YOLOv8s for object detection and MobileNetV3L for the classification task, evaluating their performance across imaging modalities. The models achieved 91.3% mAP50 for detection and 97% overall accuracy for classification on optical images, whereas the initial performance on greyscale holographic images was substantially lower. We addressed the performance gap issue through dataset expansion using automated labeling and bounding box area enlargement. These techniques, applied to holographic images, improved detection performance from 2.49% to 13.3% mAP50 and classification performance from 42% to 54%. Our work demonstrates that, at least for image classification tasks, it is possible to pair deep learning techniques with cost-effective lensless digital holographic microscopy devices.
zh
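The bounding-box area enlargement used to expand the holographic training set is straightforward to illustrate; the sketch below grows a box about its center and clips it to the image, with a hypothetical scale factor (the entry does not specify the value used).

```python
def enlarge_bbox(x1, y1, x2, y2, img_w, img_h, scale=1.3):
    """Grow a box about its center by `scale`, clipped to the image.

    A sketch of bounding-box area enlargement as a dataset-expansion
    step; the scale factor here is a made-up choice.
    """
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w = (x2 - x1) * scale / 2.0
    half_h = (y2 - y1) * scale / 2.0
    nx1, ny1 = max(0.0, cx - half_w), max(0.0, cy - half_h)
    nx2 = min(float(img_w), cx + half_w)
    ny2 = min(float(img_h), cy + half_h)
    return nx1, ny1, nx2, ny2

print(enlarge_bbox(100, 100, 150, 160, 640, 480))  # (92.5, 91.0, 157.5, 169.0)
```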
[CV-36] Disturbance-Free Surgical Video Generation from Multi-Camera Shadowless Lamps for Open Surgery
【Quick Read】: This paper addresses occlusion in open-surgery video recording: surgeons frequently block the camera's field of view, limiting the usefulness of recordings for education and research. Prior work mounts multiple cameras on a shadowless lamp to raise the chance of unobstructed views, but every time the lamp is repositioned for lighting, the images must be manually realigned, which is labor-intensive and error-prone. The key to the solution is a fully automatic realignment and view-synthesis method: it identifies frames in which the lighting system has moved, automatically realigns them, and selects the least-occluded camera to produce a continuous video from a fixed viewpoint. The method improves video quality and viewing comfort, and surgeons rated it above conventional approaches in a user study.
Link: https://arxiv.org/abs/2512.08577
Authors: Yuna Kato, Shohei Mori, Hideo Saito, Yoshifumi Takatsume, Hiroki Kajita, Mariko Isogawa
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments:
Abstract:Video recordings of open surgeries are greatly required for education and research purposes. However, capturing unobstructed videos is challenging since surgeons frequently block the camera field of view. To avoid occlusion, the positions and angles of the camera must be frequently adjusted, which is highly labor-intensive. Prior work has addressed this issue by installing multiple cameras on a shadowless lamp and arranging them to fully surround the surgical area. This setup increases the chances of some cameras capturing an unobstructed view. However, manual image alignment is needed in post-processing since camera configurations change every time surgeons move the lamp for optimal lighting. This paper aims to fully automate this alignment task. The proposed method identifies frames in which the lighting system moves, realigns them, and selects the camera with the least occlusion to generate a video that consistently presents the surgical field from a fixed perspective. A user study involving surgeons demonstrated that videos generated by our method were superior to those produced by conventional methods in terms of the ease of confirming the surgical area and the comfort during video viewing. Additionally, our approach showed improvements in video quality over existing techniques. Furthermore, we implemented several synthesis options for the proposed view-synthesis method and conducted a user study to assess surgeons’ preferences for each option.
zh
[CV-37] From Cells to Survival: Hierarchical Analysis of Cell Inter-Relations in Multiplex Microscopy for Lung Cancer Prognosis
【Quick Read】: This paper addresses how to extract tumor microenvironment (TME) features from multiplex immunofluorescence (mIF) images to improve survival risk stratification of lung cancer patients; the core difficulty is capturing the complex local and global interactions among different cell types, which conventional approaches model poorly. The key to the solution is HiGINE, a hierarchical graph-based method that encodes both local and global inter-relations within cell neighborhoods, incorporating cell type and morphology, and fuses cancer stage with the mIF-derived features in a multimodal manner, thereby improving the accuracy, robustness, and generalizability of survival prediction.
Link: https://arxiv.org/abs/2512.08572
Authors: Olle Edgren Schüllerqvist, Jens Baumann, Joakim Lindblad, Love Nordling, Artur Mezheyeuski, Patrick Micke, Nataša Sladoje
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 3 figures
Abstract:The tumor microenvironment (TME) has emerged as a promising source of prognostic biomarkers. To fully leverage its potential, analysis methods must capture complex interactions between different cell types. We propose HiGINE – a hierarchical graph-based approach to predict patient survival (short vs. long) from TME characterization in multiplex immunofluorescence (mIF) images and enhance risk stratification in lung cancer. Our model encodes both local and global inter-relations in cell neighborhoods, incorporating information about cell types and morphology. Multimodal fusion, aggregating cancer stage with mIF-derived features, further boosts performance. We validate HiGINE on two public datasets, demonstrating improved risk stratification, robustness, and generalizability.
zh
[CV-38] Instance-Aware Test-Time Segmentation for Continual Domain Shifts
【Quick Read】: This paper addresses unreliable pseudo labels in Continual Test-Time Adaptation (CTTA) caused by fixed or batch-level confidence thresholds, a problem that is especially acute in semantic segmentation, where each image requires dense, multi-class predictions and difficulty varies across classes and instances. The key to the solution is a fine-grained, class- and instance-aware adaptive pseudo-labeling mechanism: it dynamically analyzes the confidence distribution within each image, adjusts pseudo labels separately per class and per sample, and rebalances learning weights toward the classes most affected by domain shift, producing more reliable supervision and suppressing error accumulation, which markedly improves stability and accuracy throughout continual adaptation.
Link: https://arxiv.org/abs/2512.08569
Authors: Seunghwan Lee, Inyoung Jung, Hojoon Lee, Eunil Park, Sungeun Hong
Affiliations: SKKU (Sungkyunkwan University)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Continual Test-Time Adaptation (CTTA) enables pre-trained models to adapt to continuously evolving domains. Existing methods have improved robustness but typically rely on fixed or batch-level thresholds, which cannot account for varying difficulty across classes and instances. This limitation is especially problematic in semantic segmentation, where each image requires dense, multi-class predictions. We propose an approach that adaptively adjusts pseudo labels to reflect the confidence distribution within each image and dynamically balances learning toward classes most affected by domain shifts. This fine-grained, class- and instance-aware adaptation produces more reliable supervision and mitigates error accumulation throughout continual adaptation. Extensive experiments across eight CTTA and TTA scenarios, including synthetic-to-real and long-term shifts, show that our method consistently outperforms state-of-the-art techniques, setting a new standard for semantic segmentation under evolving conditions.
zh
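A minimal sketch of what class- and instance-aware pseudo-labeling can look like for segmentation, assuming per-class thresholds are taken as confidence quantiles inside each test image (the quantile value and ignore index below are made-up choices, not the paper's exact rule):

```python
import torch

def instance_aware_pseudo_labels(probs, base_quantile=0.6, ignore_index=255):
    """Per-image, per-class adaptive pseudo-labeling (illustrative only).

    probs: (C, H, W) softmax output for one test image. For each class,
    the confidence threshold is that class's own confidence quantile
    inside this image, so easy and hard classes are not forced through
    a single global threshold.
    """
    conf, pred = probs.max(dim=0)                   # (H, W) confidence, labels
    labels = torch.full_like(pred, ignore_index)
    for c in pred.unique():
        mask = pred == c
        thr = torch.quantile(conf[mask], base_quantile)
        labels[mask & (conf >= thr)] = c
    return labels  # ignore_index marks pixels excluded from adaptation

probs = torch.softmax(torch.randn(19, 64, 64), dim=0)  # hypothetical output
pl = instance_aware_pseudo_labels(probs)
```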
[CV-39] Modular Neural Image Signal Processing
【Quick Read】: This paper addresses the limitations of conventional neural image signal processing (ISP) in scalability, debuggability, generalization, and adaptability to user-preferred styles. The key to the solution is a modular neural ISP framework that exposes multiple independently controllable intermediate stages of the rendering process, achieving high rendering accuracy while improving scalability, debuggability, generalization to unseen cameras, and flexibility toward user preferences. The framework underpins an interactive photo-editing tool with high-quality, unlimited re-rendering, and all variants are of moderate size (roughly 0.5M to 3.9M parameters for the entire pipeline), delivering competitive qualitative and quantitative results across multiple test sets.
Link: https://arxiv.org/abs/2512.08564
Authors: Mahmoud Afifi, Zhongling Wang, Ran Zhang, Michael S. Brown
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:This paper presents a modular neural image signal processing (ISP) framework that processes raw inputs and renders high-quality display-referred images. Unlike prior neural ISP designs, our method introduces a high degree of modularity, providing full control over multiple intermediate stages of the rendering process.~This modular design not only achieves high rendering accuracy but also improves scalability, debuggability, generalization to unseen cameras, and flexibility to match different user-preference styles. To demonstrate the advantages of this design, we built a user-interactive photo-editing tool that leverages our neural ISP to support diverse editing operations and picture styles. The tool is carefully engineered to take advantage of the high-quality rendering of our neural ISP and to enable unlimited post-editable re-rendering. Our method is a fully learning-based framework with variants of different capacities, all of moderate size (ranging from ~0.5 M to ~3.9 M parameters for the entire pipeline), and consistently delivers competitive qualitative and quantitative results across multiple test sets. Watch the supplemental video at: this https URL
zh
[CV-40] BrainExplore: Large-Scale Discovery of Interpretable Visual Representations in the Human Brain
【Quick Read】: This paper addresses the long-standing neuroscience question of how the human brain represents visual concepts and where in the cortex they are encoded. Because brain signals are high-dimensional and complex and the space of visual concepts is vast, prior studies have remained small-scale, manual, focused on specific regions and properties, and rarely validated systematically. The key to the solution is a large-scale automated framework with two main stages: first, unsupervised, data-driven decomposition discovers candidate interpretable patterns in fMRI activity; second, each pattern is explained by identifying the set of natural images that most strongly elicit it and generating a natural-language description of their shared visual meaning. To scale this process, an automated pipeline tests multiple candidate explanations, assigns quantitative reliability scores, and selects the most consistent description, revealing thousands of interpretable patterns spanning many visual concepts, including previously unreported fine-grained representations.
Link: https://arxiv.org/abs/2512.08560
Authors: Navve Wasserman, Matias Cosarinsky, Yuval Golbari, Aude Oliva, Antonio Torralba, Tamar Rott Shaham, Michal Irani
Affiliations: Weizmann Institute of Science; Massachusetts Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Understanding how the human brain represents visual concepts, and in which brain regions these representations are encoded, remains a long-standing challenge. Decades of work have advanced our understanding of visual representations, yet brain signals remain large and complex, and the space of possible visual concepts is vast. As a result, most studies remain small-scale, rely on manual inspection, focus on specific regions and properties, and rarely include systematic validation. We present a large-scale, automated framework for discovering and explaining visual representations across the human cortex. Our method comprises two main stages. First, we discover candidate interpretable patterns in fMRI activity through unsupervised, data-driven decomposition methods. Next, we explain each pattern by identifying the set of natural images that most strongly elicit it and generating a natural-language description of their shared visual meaning. To scale this process, we introduce an automated pipeline that tests multiple candidate explanations, assigns quantitative reliability scores, and selects the most consistent description for each voxel pattern. Our framework reveals thousands of interpretable patterns spanning many distinct visual concepts, including fine-grained representations previously unreported.
zh
[CV-41] SSCATeR: Sparse Scatter-Based Convolution Algorithm with Temporal Data Recycling for Real-Time 3D Object Detection in LiDAR Point Clouds
【Quick Read】: This paper addresses the computational inefficiency of object detection in LiDAR point clouds, where consecutive sweeps trigger largely redundant convolution work. The key to the solution is to exploit the continuity of LiDAR scanning: a sliding time window with short strides identifies the regions whose points change from one frame to the next, and a scatter-based convolution algorithm recycles stored convolution results over time, yielding the Sparse Scatter-Based Convolution Algorithm with Temporal Data Recycling (SSCATeR). By operating only on the changing parts of the point cloud, the method greatly reduces the number of convolution operations per forward pass, producing feature maps identical to traditional sparse convolution while achieving up to a 6.61-fold speedup.
Link: https://arxiv.org/abs/2512.08557
Authors: Alexander Dow, Manduhu Manduhu, Matheus Santos, Ben Bartlett, Gerard Dooly, James Riordan
Affiliations: University of the West of Scotland; University of Limerick
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 22 Pages, 26 Figures, This work has been submitted to the IEEE Sensors Journal for possible publication
Abstract:This work leverages the continuous sweeping motion of LiDAR scanning to concentrate object detection efforts on specific regions that receive a change in point data from one frame to another. We achieve this by using a sliding time window with short strides and consider the temporal dimension by storing convolution results between passes. This allows us to ignore unchanged regions, significantly reducing the number of convolution operations per forward pass without sacrificing accuracy. This data reuse scheme introduces extreme sparsity to detection data. To exploit this sparsity, we extend our previous work on scatter-based convolutions to allow for data reuse, and as such propose Sparse Scatter-Based Convolution Algorithm with Temporal Data Recycling (SSCATeR). This operation treats incoming LiDAR data as a continuous stream and acts only on the changing parts of the point cloud. By doing so, we achieve the same results with as much as a 6.61-fold reduction in processing time. Our test results show that the feature maps output by our method are identical to those produced by traditional sparse convolution techniques, whilst greatly increasing the computational efficiency of the network.
zh
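The data-recycling idea behind SSCATeR can be shown on a 2-D dense toy problem: cache the convolution output for the previous frame, find which input cells changed, dilate that mask by the kernel footprint, and recompute only the affected outputs. The sparse point-cloud scatter version is analogous; this NumPy sketch only illustrates the bookkeeping, under a 3x3 kernel assumption.

```python
import numpy as np

def conv_at(x, k, ys, xs):
    """Direct 3x3 convolution evaluated only at positions (ys, xs)."""
    pad = np.pad(x, 1)
    out = np.empty(len(ys))
    for i, (y, xx) in enumerate(zip(ys, xs)):
        out[i] = (pad[y:y + 3, xx:xx + 3] * k).sum()
    return out

def full_conv(x, k):
    ys, xs = np.nonzero(np.ones_like(x, dtype=bool))
    return conv_at(x, k, ys, xs).reshape(x.shape)

def recycled_conv(x_new, x_old, y_old, k):
    """Update the cached output y_old only where the input changed."""
    changed = x_new != x_old
    # A changed input cell influences a 3x3 neighborhood of outputs,
    # so dilate the changed mask by one cell in each direction.
    affected = np.zeros_like(changed)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            affected |= np.roll(np.roll(changed, dy, 0), dx, 1)
    ys, xs = np.nonzero(affected)
    y_new = y_old.copy()
    y_new[ys, xs] = conv_at(x_new, k, ys, xs)
    return y_new

rng = np.random.default_rng(0)
x0, k = rng.random((32, 32)), rng.random((3, 3))
y0 = full_conv(x0, k)                    # one full pass on the first frame
x1 = x0.copy(); x1[10:12, 10:12] += 1.0  # a few cells change next frame
assert np.allclose(recycled_conv(x1, x0, y0, k), full_conv(x1, k))
```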
[CV-42] An Iteration-Free Fixed-Point Estimator for Diffusion Inversion
【Quick Read】: This paper addresses the efficiency and accuracy of diffusion inversion, where the widely used fixed-point iteration incurs high computational cost due to its iterative nature and difficult hyperparameter selection. The key to the solution is an iteration-free fixed-point estimator: the authors first derive an explicit expression for the fixed point of an ideal inversion step and find that it contains an unknown data-prediction error; they then introduce an error approximation that substitutes the calculable error from the previous inversion step for the unknown error at the current step, yielding a computable approximation of the fixed point that is an unbiased, low-variance estimator and enables efficient, stable image reconstruction.
Link: https://arxiv.org/abs/2512.08547
Authors: Yifei Chen, Kaiyu Song, Yan Pan, Jianxing Yu, Jian Yin, Hanjiang Lai
Affiliations: Sun Yat-sen University; School of Computer Science and Engineering; School of Artificial Intelligence
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Diffusion inversion aims to recover the initial noise corresponding to a given image such that this noise can reconstruct the original image through the denoising diffusion process. The key component of diffusion inversion is to minimize errors at each inversion step, thereby mitigating cumulative inaccuracies. Recently, fixed-point iteration has emerged as a widely adopted approach to minimize reconstruction errors at each inversion step. However, it suffers from high computational costs due to its iterative nature and the complexity of hyperparameter selection. To address these issues, we propose an iteration-free fixed-point estimator for diffusion inversion. First, we derive an explicit expression of the fixed point from an ideal inversion step. Unfortunately, it inherently contains an unknown data prediction error. Building upon this, we introduce the error approximation, which uses the calculable error from the previous inversion step to approximate the unknown error at the current inversion step. This yields a calculable, approximate expression for the fixed point, which is an unbiased estimator characterized by low variance, as shown by our theoretical analysis. We evaluate reconstruction performance on two text-image datasets, NOCAPS and MS-COCO. Compared to DDIM inversion and other inversion methods based on the fixed-point iteration, our method achieves consistent and superior performance in reconstruction tasks without additional iterations or training.
zh
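The flavor of an iteration-free fixed-point correction can be sketched on top of DDIM inversion. This is a schematic reconstruction under assumed interfaces (`eps_model(x, t)` returning the noise prediction and `alphas` holding a cumulative-product schedule as a tensor); the paper's exact estimator may differ, but the idea of recycling the previous step's calculable error in place of the unknown current one is the same.

```python
import torch

def ddim_invert_step(x_t, eps, a_t, a_next):
    """One DDIM inversion step (t -> t+1) given a noise estimate eps."""
    scale = (a_next / a_t).sqrt()
    return scale * x_t + ((1 - a_next).sqrt() - scale * (1 - a_t).sqrt()) * eps

@torch.no_grad()
def invert(x0, eps_model, alphas):
    """Schematic iteration-free inversion with error recycling.

    The ideal fixed point would use eps_model evaluated at the unknown
    next latent; here that unknown evaluation error is approximated by
    the error observed at the previous step. One model call per step,
    since each eps is reused at the following iteration.
    """
    x = x0
    eps_here = eps_model(x, 0)
    err = torch.zeros_like(x0)          # previous step's calculable error
    for t in range(len(alphas) - 1):
        x_next = ddim_invert_step(x, eps_here + err, alphas[t], alphas[t + 1])
        eps_next = eps_model(x_next, t + 1)   # reused at the next step
        err = eps_next - eps_here             # calculable error estimate
        x, eps_here = x_next, eps_next
    return x
```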
[CV-43] A Novel Wasserstein Quaternion Generative Adversarial Network for Color Image Generation
【Quick Read】: This paper addresses two gaps in color image generation: existing generative models ignore the correlation among color channels, which can cause chromatic aberration, and the data distribution of color images lacks a systematic theoretical treatment and a principled way to measure discrepancies between color image datasets. The key to the solution is a new quaternion Wasserstein distance together with its dual theory: a quaternion convex set separation theorem and a quaternion Farkas lemma yield a strong duality form for the quaternion linear programming problem, upon which a novel Wasserstein quaternion generative adversarial network is built. Experiments show the model surpasses both (quaternion) GANs and the Wasserstein GAN in generation efficiency and image quality.
Link: https://arxiv.org/abs/2512.08542
Authors: Zhigang Jia, Duan Wang, Hengkai Wang, Yajun Xie, Meixiang Zhao, Xiaoyu Zhao
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
Comments:
Abstract:Color image generation has a wide range of applications, but existing generative models ignore the correlation among color channels, which may lead to chromatic aberration problems. In addition, the data distribution problem of color images has not been systematically elaborated, so a theory for measuring the discrepancy between color image datasets is still lacking. In this paper, we define a new quaternion Wasserstein distance and develop its dual theory. To deal with the quaternion linear programming problem, we derive the strong duality form with the help of a quaternion convex set separation theorem and a quaternion Farkas lemma. Using the quaternion Wasserstein distance, we propose a novel Wasserstein quaternion generative adversarial network. Experiments demonstrate that this novel model surpasses both the (quaternion) generative adversarial networks and the Wasserstein generative adversarial network in terms of generation efficiency and image quality.
zh
[CV-44] Fast-ARDiff: An Entropy-informed Acceleration Framework for Continuous Space Autoregressive Generation
【Quick Read】: This paper addresses the high latency of autoregressive (AR)-diffusion hybrid frameworks for image generation, which stems from sequential AR generation and iterative denoising. The key to the solution is Fast-ARDiff, a unified framework that optimizes both components jointly: an entropy-informed speculative strategy encourages the draft model to produce representations aligned with the target model's entropy characteristics, lowering the rejection rate caused by draft overconfidence; the diffusion module is integrated into end-to-end training with a dynamic scheduler that prioritizes AR optimization to guide subsequent diffusion steps, while a joint distillation objective combining trajectory matching and distribution matching enables stable, high-quality diffusion decoding in extremely few steps; and at inference, the entropy of shallow AR features pre-filters low-entropy drafts to avoid redundant computation, further improving overall efficiency.
Link: https://arxiv.org/abs/2512.08537
Authors: Zhen Zou, Xiaoxiao Ma, Jie Huang, Zichao Yu, Feng Zhao
Affiliations: University of Science and Technology of China; JD Joy future AI
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Autoregressive (AR)-diffusion hybrid paradigms combine AR's structured modeling with diffusion's photorealistic synthesis, yet suffer from high latency due to sequential AR generation and iterative denoising. In this work, we tackle this bottleneck and propose a unified AR-diffusion framework, Fast-ARDiff, that jointly optimizes both components, accelerating AR speculative decoding while simultaneously facilitating faster diffusion decoding. Specifically: (1) The entropy-informed speculative strategy encourages the draft model to produce higher-entropy representations aligned with the target model's entropy characteristics, mitigating the entropy mismatch and high rejection rates caused by draft overconfidence. (2) For diffusion decoding, rather than treating it as an independent module, we integrate it into the same end-to-end framework using a dynamic scheduler that prioritizes AR optimization to guide the diffusion part in further steps. The diffusion part is optimized through a joint distillation framework combining trajectory and distribution matching, ensuring stable training and high-quality synthesis with extremely few steps. During inference, shallow feature entropy from the AR module is used to pre-filter low-entropy drafts, avoiding redundant computation and improving latency. Fast-ARDiff achieves state-of-the-art acceleration across diverse models: on ImageNet 256×256, TransDiff attains a 4.3× lossless speedup, and NextStep-1 achieves 3× acceleration on text-conditioned generation. Code will be available at this https URL.
zh
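The entropy-based pre-filtering of drafts admits a small sketch: compute the entropy of a draft's distribution and drop overconfident (low-entropy) drafts before the expensive verification pass. The threshold below is a made-up number, and this uses output-token entropy for simplicity rather than the shallow-feature entropy described in the abstract.

```python
import torch

def token_entropy(logits):
    """Shannon entropy of each token distribution; logits: (..., V)."""
    logp = torch.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1)

def prefilter_drafts(draft_entropy, min_entropy=1.5):
    """Keep only drafts whose entropy clears a floor (illustrative).

    Overconfident (low-entropy) drafts tend to mismatch the target
    model's entropy profile and get rejected, so they are dropped
    before verification. The threshold is a hypothetical value.
    """
    return draft_entropy >= min_entropy

logits = torch.randn(8, 32000)            # 8 hypothetical draft tokens
keep_mask = prefilter_drafts(token_entropy(logits))
```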
[CV-45] Photo3D: Advancing Photorealistic 3D Generation through Structure-Aligned Detail Enhancement
【Quick Read】: This paper addresses why 3D-native generators, despite reliable geometry, still fall short of realistic appearance: diverse, high-quality real-world 3D assets with rich texture detail are scarce, since capturing them is intrinsically hard given the varying scales of scenes, non-rigid object motion, and the limited precision of 3D scanners. The key to the solution is Photo3D, a framework driven by image data generated with the GPT-4o-Image model: because such images lack multi-view consistency and can distort 3D structure, a structure-aligned multi-view synthesis pipeline is designed and a detail-enhanced multi-view dataset paired with 3D geometry is constructed; on top of this, a realistic detail enhancement scheme uses perceptual feature adaptation and semantic structure matching to enforce appearance realism while preserving structural consistency with the 3D-native geometry. The scheme is general across 3D-native generators, with dedicated training strategies for geometry-texture coupled and decoupled paradigms, and achieves state-of-the-art photorealistic 3D generation.
Link: https://arxiv.org/abs/2512.08535
Authors: Xinyue Liang, Zhinyuan Ma, Lingchen Sun, Yanjun Guo, Lei Zhang
Affiliations: The Hong Kong Polytechnic University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Although recent 3D-native generators have made great progress in synthesizing reliable geometry, they still fall short in achieving realistic appearances. A key obstacle lies in the lack of diverse and high-quality real-world 3D assets with rich texture details, since capturing such data is intrinsically difficult due to the diverse scales of scenes, non-rigid motions of objects, and the limited precision of 3D scanners. We introduce Photo3D, a framework for advancing photorealistic 3D generation, which is driven by the image data generated by the GPT-4o-Image model. Considering that the generated images can distort 3D structures due to their lack of multi-view consistency, we design a structure-aligned multi-view synthesis pipeline and construct a detail-enhanced multi-view dataset paired with 3D geometry. Building on it, we present a realistic detail enhancement scheme that leverages perceptual feature adaptation and semantic structure matching to enforce appearance consistency with realistic details while preserving the structural consistency with the 3D-native geometry. Our scheme is general to different 3D-native generators, and we present dedicated training strategies to facilitate the optimization of geometry-texture coupled and decoupled 3D-native generation paradigms. Experiments demonstrate that Photo3D generalizes well across diverse 3D-native generation paradigms and achieves state-of-the-art photorealistic 3D generation performance.
zh
[CV-46] PaintFlow: A Unified Framework for Interactive Oil Paintings Editing and Generation
【Quick Read】: This paper addresses the challenge of digitally generating and editing oil paintings, an art form whose intricate brushstroke dynamics and stylized character must be preserved under fine-grained control; existing techniques are constrained by their training data distribution and mainly edit real photographs. The key to the solution is a unified multimodal framework built on three advances: first, training is enhanced with a spatial-alignment and semantic-enhancement conditioning strategy that maps masks and sketches into spatial constraints and encodes reference images and text prompts into contextual embeddings as feature constraints, enabling object-level semantic alignment; second, to overcome data scarcity, a self-supervised style-transfer pipeline based on Stroke-Based Rendering (SBR) simulates the local repainting of oil-painting restoration, converting real images into stylized paintings with preserved brushstroke textures to build a large-scale paired training set; finally, at inference, features are fused with the AdaIN operator to keep the output style consistent.
Link: https://arxiv.org/abs/2512.08534
Authors: Zhangli Hu, Ye Chen, Jiajun Yao, Bingbing Ni
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages
Abstract:Oil painting, as a high-level medium that blends human abstract thinking with artistic expression, poses substantial challenges for digital generation and editing due to its intricate brushstroke dynamics and stylized characteristics. Existing generation and editing techniques are often constrained by the distribution of training data and primarily focus on modifying real photographs. In this work, we introduce a unified multimodal framework for oil painting generation and editing. The proposed system allows users to incorporate reference images for precise semantic control, hand-drawn sketches for spatial structure alignment, and natural language prompts for high-level semantic guidance, while consistently maintaining a unified painting style across all outputs. Our method achieves interactive oil painting creation through three crucial technical advancements. First, we enhance the training stage with a spatial alignment and semantic enhancement conditioning strategy, which maps masks and sketches into spatial constraints, and encodes contextual embeddings from reference images and text into feature constraints, enabling object-level semantic alignment. Second, to overcome data scarcity, we propose a self-supervised style transfer pipeline based on Stroke-Based Rendering (SBR), which simulates the inpainting dynamics of oil painting restoration, converting real images into stylized oil paintings with preserved brushstroke textures to construct a large-scale paired training dataset. Finally, during inference, we integrate features using the AdaIN operator to ensure stylistic consistency. Extensive experiments demonstrate that our interactive system enables fine-grained editing while preserving the artistic qualities of oil paintings, achieving an unprecedented level of imagination realization in stylized oil painting generation and editing.
zh
[CV-47] MVP: Multiple View Prediction Improves GUI Grounding
【Quick Read】: This paper addresses the instability of coordinate prediction in existing GUI grounding models: tiny visual perturbations (such as cropping a few pixels) can drastically flip predictions between correct and incorrect, which is especially damaging for small UI elements in high-resolution screenshots. The key to the solution is Multi-View Prediction (MVP), a training-free multi-view inference framework built on the insight that aggregating predictions from multiple carefully cropped views can separate correct coordinates from outliers: Attention-Guided View Proposal derives diverse views using instruction-to-image attention scores, and Multi-Coordinates Clustering ensembles the predictions by selecting the centroid of the densest spatial cluster, suppressing the noise and outliers of single-view prediction.
Link: https://arxiv.org/abs/2512.08529
Authors: Yunzhu Zhang, Zeyu Pan, Zhengwen Zeng, Shuheng Shen, Changhua Meng, Linchao Zhu
Affiliations: Zhejiang University; Hangzhou Dianzi University; Ant Group
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:GUI grounding, which translates natural language instructions into precise pixel coordinates, is essential for developing practical GUI agents. However, we observe that existing grounding models exhibit significant coordinate prediction instability, minor visual perturbations (e.g. cropping a few pixels) can drastically alter predictions, flipping results between correct and incorrect. This instability severely undermines model performance, especially for samples with high-resolution and small UI elements. To address this issue, we propose Multi-View Prediction (MVP), a training-free framework that enhances grounding performance through multi-view inference. Our key insight is that while single-view predictions may be unstable, aggregating predictions from multiple carefully cropped views can effectively distinguish correct coordinates from outliers. MVP comprises two components: (1) Attention-Guided View Proposal, which derives diverse views guided by instruction-to-image attention scores, and (2) Multi-Coordinates Clustering, which ensembles predictions by selecting the centroid of the densest spatial cluster. Extensive experiments demonstrate MVP’s effectiveness across various models and benchmarks. Notably, on ScreenSpot-Pro, MVP boosts UI-TARS-1.5-7B to 56.1%, GTA1-7B to 61.7%, Qwen3VL-8B-Instruct to 65.3%, and Qwen3VL-32B-Instruct to 74.0%. The code is available at this https URL.
zh
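The Multi-Coordinates Clustering step lends itself to a compact sketch: given click coordinates predicted from several cropped views (already mapped back to the full screenshot), group them by a pixel radius and return the centroid of the densest group. The radius is a made-up value, and the paper's clustering may differ in detail.

```python
import numpy as np

def densest_cluster_centroid(points, radius=14.0):
    """Pick the centroid of the densest spatial cluster of predictions.

    points: (N, 2) pixel coordinates predicted from N cropped views.
    A simple radius-based grouping stands in for the clustering; the
    radius (in pixels) is a hypothetical choice.
    """
    pts = np.asarray(points, dtype=float)
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    neighbor_counts = (d <= radius).sum(axis=1)
    anchor = int(neighbor_counts.argmax())     # prediction with most neighbors
    cluster = pts[d[anchor] <= radius]         # its radius neighborhood
    return cluster.mean(axis=0)

preds = [(312, 540), (315, 537), (310, 543), (902, 120)]  # one outlier
print(densest_cluster_centroid(preds))  # ~ (312.3, 540.0); outlier ignored
```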
[CV-48] hinking with Images via Self-Calling Agent
【Quick Read】: This paper addresses the difficulty of optimizing interleaved multimodal Chain-of-Thought (iMCoT) reasoning with reinforcement learning, which relies on scarce high-quality reasoning data. The key to the solution is Self-Calling Chain-of-Thought (sCoT), a paradigm that reformulates iMCoT as a language-only CoT with self-calling: a main agent decomposes a complex visual reasoning task into atomic subtasks and invokes parameter-sharing virtual replicas of itself, i.e., subagents, to solve them in isolated contexts, thereby avoiding explicit interleaving across modalities and gaining substantial training effectiveness and efficiency. sCoT uses group-relative policy optimization to reinforce effective reasoning behavior; on HR-Bench 4K it improves overall reasoning performance by up to 1.9% while using about 75% fewer GPU hours than strong baselines.
Link: https://arxiv.org/abs/2512.08511
Authors: Wenxi Yang, Yuzhong Zhao, Fang Wan, Qixiang Ye
Affiliations: University of Chinese Academy of Sciences
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code would be released at this https URL soon
Abstract:Thinking-with-images paradigms have showcased remarkable visual reasoning capability by integrating visual information as dynamic elements into the Chain-of-Thought (CoT). However, optimizing interleaved multimodal CoT (iMCoT) through reinforcement learning remains challenging, as it relies on scarce high-quality reasoning data. In this study, we propose Self-Calling Chain-of-Thought (sCoT), a novel visual reasoning paradigm that reformulates iMCoT as a language-only CoT with self-calling. Specifically, a main agent decomposes the complex visual reasoning task into atomic subtasks and invokes its virtual replicas, i.e., parameter-sharing subagents, to solve them in isolated contexts. sCoT enjoys substantial training effectiveness and efficiency, as it requires no explicit interleaving between modalities. sCoT employs group-relative policy optimization to reinforce effective reasoning behavior and enhance optimization. Experiments on HR-Bench 4K show that sCoT improves the overall reasoning performance by up to 1.9% with ~75% fewer GPU hours compared to strong baseline approaches. Code is available at this https URL.
zh
[CV-49] OCCDiff: Occupancy Diffusion Model for High-Fidelity 3D Building Reconstruction from Noisy Point Clouds
【Quick Read】: This paper addresses a central challenge in reconstructing buildings from LiDAR point clouds: accurately capturing building surfaces under varying point densities and noise interference. The key to the solution is OCCDiff, which applies latent diffusion in the space of occupancy functions and couples it with a function autoencoder architecture to generate continuous occupancy functions evaluable at arbitrary locations. A point encoder supplies conditioning features for diffusion learning, constrains the final occupancy prediction, and injects multi-modal features for latent generation, while a multi-task training strategy encourages the point encoder to learn diverse, robust representations, producing physically consistent, high-fidelity building surfaces that are robust to noisy data.
Link: https://arxiv.org/abs/2512.08506
Authors: Jialu Sui, Rui Liu, Hongsheng Zhang
Affiliations: The University of Hong Kong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:A major challenge in reconstructing buildings from LiDAR point clouds lies in accurately capturing building surfaces under varying point densities and noise interference. To flexibly gather high-quality 3D profiles of buildings at diverse resolutions, we propose OCCDiff, which applies latent diffusion in the occupancy function space. OCCDiff combines a latent diffusion process with a function autoencoder architecture to generate continuous occupancy functions evaluable at arbitrary locations. Moreover, a point encoder is proposed to provide conditioning features for diffusion learning, constrain the final occupancy prediction of the occupancy decoder, and inject multi-modal features into the latent encoder for latent generation. To further enhance model performance, a multi-task training strategy is employed, ensuring that the point encoder learns diverse and robust feature representations. Empirical results show that our method generates physically consistent samples with high fidelity to the target distribution and exhibits robustness to noisy data.
zh
[CV-50] Beyond the Noise: Aligning Prompts with Latent Representations in Diffusion Models
【Quick Read】: This paper addresses semantic misalignment and hallucination in conditional diffusion generation, where outputs drift from the user prompt, and the fact that existing detection methods must wait for the full generation to finish before assessing alignment, making them computationally expensive. The key to the solution is NoisyCLIP, the first method to measure prompt-to-latent semantic alignment directly in the noisy latent space of the reverse diffusion process using dual encoders, enabling alignment assessment early in generation. Experiments show that NoisyCLIP retains 98% of CLIP's alignment performance while cutting computational cost by 50% in Best-of-N settings, improving generation efficiency and quality control.
Link: https://arxiv.org/abs/2512.08505
Authors: Vasco Ramos, Regev Cohen, Idan Szpektor, Joao Magalhaes
Affiliations: NOVA University of Lisbon; Google Research
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Conditional diffusion models rely on language-to-image alignment methods to steer the generation towards semantically accurate outputs. Despite the success of this architecture, misalignment and hallucinations remain common issues and require automatic misalignment detection tools to improve quality, for example by applying them in a Best-of-N (BoN) post-generation setting. Unfortunately, measuring the alignment after the generation is an expensive step since we need to wait for the overall generation to finish to determine prompt adherence. In contrast, this work hypothesizes that text/image misalignments can be detected early in the denoising process, enabling real-time alignment assessment without waiting for the complete generation. In particular, we propose NoisyCLIP a method that measures semantic alignment in the noisy latent space. This work is the first to explore and benchmark prompt-to-latent misalignment detection during image generation using dual encoders in the reverse diffusion process. We evaluate NoisyCLIP qualitatively and quantitatively and find it reduces computational cost by 50% while achieving 98% of CLIP alignment performance in BoN settings. This approach enables real-time alignment assessment during generation, reducing costs without sacrificing semantic fidelity.
zh
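A schematic of how early alignment scores can prune a Best-of-N pool mid-denoising, under hypothetical handles (`eps_model`, `decode`, `clip_score`, and a stand-in schedule value). Note that NoisyCLIP scores alignment with encoders adapted to noisy latents; decoding a one-step clean estimate, as below, is only a stand-in to convey the control flow.

```python
import torch

@torch.no_grad()
def early_bon_prune(latents, t, eps_model, decode, clip_score, prompt, keep=2):
    """Best-of-N pruning mid-denoising (schematic, hypothetical handles).

    At timestep t, each noisy latent is converted to a one-step clean
    estimate via the usual x0-prediction, decoded, and scored against
    the prompt; only the `keep` best candidates continue denoising.
    """
    alpha_bar = 0.3                      # stand-in for the schedule value at t
    scores = []
    for x_t in latents:
        eps = eps_model(x_t, t)
        x0_hat = (x_t - (1 - alpha_bar) ** 0.5 * eps) / alpha_bar ** 0.5
        scores.append(clip_score(decode(x0_hat), prompt))
    order = torch.tensor(scores).argsort(descending=True)
    return [latents[int(i)] for i in order[:keep]]
```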
[CV-51] Disrupting Hierarchical Reasoning : Adversarial Protection for Geographic Privacy in Multimodal Reasoning Models
【Quick Read】: This paper addresses the privacy risk posed by multi-modal large reasoning models (MLRMs), which can infer precise geographic locations from personal images through hierarchical chain-of-thought reasoning. Existing privacy protections designed for perception-based models cannot counter this sophisticated multi-step reasoning, which analyzes environmental cues and builds fine-grained conceptual dependencies. The key to the solution is ReasonBreak, an adversarial framework that generates concept-aware perturbations aligned with conceptual hierarchies rather than uniform noise: it strategically targets critical conceptual dependencies within reasoning chains, invalidating specific inference steps and cascading failures through subsequent reasoning stages, thereby blocking geographic information leakage.
Link: https://arxiv.org/abs/2512.08503
Authors: Jiaming Zhang, Che Wang, Yang Cao, Longtao Huang, Wei Yang Bryan Lim
Affiliations: Nanyang Technological University; Peking University; Institute of Science Tokyo; Alibaba Group
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Multi-modal large reasoning models (MLRMs) pose significant privacy risks by inferring precise geographic locations from personal images through hierarchical chain-of-thought reasoning. Existing privacy protection techniques, primarily designed for perception-based models, prove ineffective against MLRMs' sophisticated multi-step reasoning processes that analyze environmental cues. We introduce ReasonBreak, a novel adversarial framework specifically designed to disrupt hierarchical reasoning in MLRMs through concept-aware perturbations. Our approach is founded on the key insight that effective disruption of geographic reasoning requires perturbations aligned with conceptual hierarchies rather than uniform noise. ReasonBreak strategically targets critical conceptual dependencies within reasoning chains, generating perturbations that invalidate specific inference steps and cascade through subsequent reasoning stages. To facilitate this approach, we contribute GeoPrivacy-6K, a comprehensive dataset comprising 6,341 ultra-high-resolution images (≥2K) with hierarchical concept annotations. Extensive evaluation across seven state-of-the-art MLRMs (including GPT-o3, GPT-5, Gemini 2.5 Pro) demonstrates ReasonBreak's superior effectiveness, achieving a 14.4% improvement in tract-level protection (33.8% vs 19.4%) and nearly doubling block-level protection (33.5% vs 16.8%). This work establishes a new paradigm for privacy protection against reasoning-based threats.
zh
[CV-52] Learning to Control Physically-simulated 3D Characters via Generating and Mimicking 2D Motions
【Quick Read】: This paper addresses the challenge of synthesizing realistic and diverse 3D character behaviors directly from video: existing approaches depend on expensive or scarce 3D motion-capture data, or on motion reconstruction techniques that generalize poorly and often produce physically implausible poses, which hinders settings such as human-object interaction or non-human characters. The key to the solution is Mimic2DM, which learns control policies solely from widely available 2D keypoint trajectories extracted from videos: by minimizing the reprojection error, it trains a general single-view 2D motion tracking policy; when trained on motions captured from different or slightly different viewpoints, the policy further acquires 3D tracking capability through multi-view aggregation, and a transformer-based autoregressive 2D motion generator supplies high-quality reference trajectories within a hierarchical control framework. The approach synthesizes physically plausible and diverse motions, including dancing, soccer dribbling, and animal movements, without any explicit 3D motion data.
Link: https://arxiv.org/abs/2512.08500
Authors: Jianan Li, Xiao Chen, Tao Huang, Tien-Tsin Wong
Affiliations: The Chinese University of Hong Kong; Shanghai AI Laboratory; Shanghai Jiao Tong University; Monash University
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Video data is more cost-effective than motion capture data for learning 3D character motion controllers, yet synthesizing realistic and diverse behaviors directly from videos remains challenging. Previous approaches typically rely on off-the-shelf motion reconstruction techniques to obtain 3D trajectories for physics-based imitation. These reconstruction methods struggle with generalizability, as they either require 3D training data (potentially scarce) or fail to produce physically plausible poses, hindering their application to challenging scenarios like human-object interaction (HOI) or non-human characters. We tackle this challenge by introducing Mimic2DM, a novel motion imitation framework that learns the control policy directly and solely from widely available 2D keypoint trajectories extracted from videos. By minimizing the reprojection error, we train a general single-view 2D motion tracking policy capable of following arbitrary 2D reference motions in physics simulation, using only 2D motion data. The policy, when trained on diverse 2D motions captured from different or slightly different viewpoints, can further acquire 3D motion tracking capabilities by aggregating multiple views. Moreover, we develop a transformer-based autoregressive 2D motion generator and integrate it into a hierarchical control framework, where the generator produces high-quality 2D reference trajectories to guide the tracking policy. We show that the proposed approach is versatile and can effectively learn to synthesize physically plausible and diverse motions across a range of domains, including dancing, soccer dribbling, and animal movements, without any reliance on explicit 3D motion data. Project Website: this https URL
zh
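To make the reprojection-error objective at the core of Mimic2DM concrete, the following is a minimal sketch (not the authors' code) of projecting simulated 3D joints through a pinhole camera and scoring them against reference 2D keypoints; the intrinsics `K`, extrinsics `R`, `t`, and the joint/keypoint arrays are illustrative assumptions.

```python
import numpy as np

def project_points(joints_3d, K, R, t):
    """Project Nx3 world-space joints to Nx2 pixel coordinates.

    K: 3x3 intrinsics; R (3x3), t (3,): world-to-camera transform.
    """
    cam = joints_3d @ R.T + t           # world -> camera coordinates
    uvw = cam @ K.T                     # apply intrinsics
    return uvw[:, :2] / uvw[:, 2:3]     # perspective divide

def reprojection_error(joints_3d, ref_kpts_2d, K, R, t):
    """Mean pixel distance between projected joints and 2D references."""
    proj = project_points(joints_3d, K, R, t)
    return float(np.linalg.norm(proj - ref_kpts_2d, axis=1).mean())

# Toy usage: a skeleton near the origin, camera 3 units away along +z.
K = np.array([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
R, t = np.eye(3), np.array([0., 0., 3.])
joints = np.random.randn(17, 3) * 0.2
kpts = project_points(joints, K, R, t)            # perfect references
print(reprojection_error(joints, kpts, K, R, t))  # -> 0.0
# A tracking reward could then be, e.g., exp(-error), maximized by the policy.
```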
[CV-53] On-the-fly Large-scale 3D Reconstruction from Multi-Camera Rigs
【速读】: This paper addresses the incomplete 3D coverage of monocular RGB streams in real-time 3D reconstruction caused by their limited field of view (FOV), which makes drift-free trajectory estimation and efficient online reconstruction difficult in dynamic settings. The key to the solution is the first on-the-fly incremental 3D reconstruction framework for multi-camera rigs: a hierarchical camera initialization scheme achieves calibration-free coarse inter-camera alignment, a lightweight multi-camera bundle adjustment stabilizes trajectories, and a redundancy-free Gaussian sampling strategy combined with a frequency-aware optimization scheduler greatly reduces the number of Gaussian primitives and optimization iterations while preserving accuracy and efficiency, enabling high-quality reconstruction of scenes spanning hundreds of meters within 2 minutes from raw multi-view video streams alone.
链接: https://arxiv.org/abs/2512.08498
作者: Yijia Guo,Tong Hu,Zhiwei Li,Liwen Hu,Keming Qian,Xitong Lin,Shengbo Chen,Tiejun Huang,Lei Ma
机构: Peking University (北京大学); Nanchang University (南昌大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in 3D Gaussian Splatting (3DGS) have enabled efficient free-viewpoint rendering and photorealistic scene reconstruction. While on-the-fly extensions of 3DGS have shown promise for real-time reconstruction from monocular RGB streams, they often fail to achieve complete 3D coverage due to the limited field of view (FOV). Employing a multi-camera rig fundamentally addresses this limitation. In this paper, we present the first on-the-fly 3D reconstruction framework for multi-camera rigs. Our method incrementally fuses dense RGB streams from multiple overlapping cameras into a unified Gaussian representation, achieving drift-free trajectory estimation and efficient online reconstruction. We propose a hierarchical camera initialization scheme that enables coarse inter-camera alignment without calibration, followed by a lightweight multi-camera bundle adjustment that stabilizes trajectories while maintaining real-time performance. Furthermore, we introduce a redundancy-free Gaussian sampling strategy and a frequency-aware optimization scheduler to reduce the number of Gaussian primitives and the required optimization iterations, thereby maintaining both efficiency and reconstruction fidelity. Our method reconstructs hundreds of meters of 3D scenes within just 2 minutes using only raw multi-camera video streams, demonstrating unprecedented speed, robustness, and fidelity for on-the-fly 3D scene reconstruction.
zh
[CV-54] Temporal Concept Dynamics in Diffusion Models via Prompt-Conditioned Interventions
【速读】: This paper studies the temporal dynamics of concept formation in diffusion models, i.e., when noise turns into a specific semantic concept (e.g., age) and locks in the denoising trajectory; conventional evaluation looks only at final outputs and ignores the dynamics of the generation process. The key to the solution is PCI (Prompt-Conditioned Intervention), a training-free, model-agnostic analysis framework centered on Concept Insertion Success (CIS), defined as the probability that a concept inserted at a given timestep is preserved in the final image, which quantifies when concepts form and stabilize along the diffusion trajectory. CIS reveals how sensitive different diffusion phases are to specific concepts, identifying the timesteps most favorable for intervention and thereby improving the quality and controllability of text-driven image editing without access to model internals or retraining.
链接: https://arxiv.org/abs/2512.08486
作者: Ada Gorgun,Fawaz Sammani,Nikos Deligiannis,Bernt Schiele,Jonas Fischer
机构: Max Planck Institute for Informatics (马普研究所); Vrije Universiteit Brussel (布鲁塞尔自由大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code is available at: this https URL
Abstract:Diffusion models are usually evaluated by their final outputs, gradually denoising random noise into meaningful images. Yet, generation unfolds along a trajectory, and analyzing this dynamic process is crucial for understanding how controllable, reliable, and predictable these models are in terms of their success/failure modes. In this work, we ask the question: when does noise turn into a specific concept (e.g., age) and lock in the denoising trajectory? We propose PCI (Prompt-Conditioned Intervention) to study this question. PCI is a training-free and model-agnostic framework for analyzing concept dynamics through diffusion time. The central idea is the analysis of Concept Insertion Success (CIS), defined as the probability that a concept inserted at a given timestep is preserved and reflected in the final image, offering a way to characterize the temporal dynamics of concept formation. Applied to several state-of-the-art text-to-image diffusion models and a broad taxonomy of concepts, PCI reveals diverse temporal behaviors across diffusion models, in which certain phases of the trajectory are more favorable to specific concepts even within the same concept type. These findings also provide actionable insights for text-driven image editing, highlighting when interventions are most effective without requiring access to model internals or training, and yielding quantitatively stronger edits that achieve a better balance of semantic accuracy and content preservation than strong baselines. Code is available at: this https URL
zh
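The CIS metric above reduces to a simple Monte-Carlo estimate. Below is a minimal sketch under stated assumptions: `sample_prefix`, `insert_concept`, `denoise_suffix`, and `detect_concept` are hypothetical hooks into a concrete diffusion pipeline and concept classifier, not part of the released PCI code.

```python
def concept_insertion_success(x_T, t_insert, n_trials, sample_prefix,
                              insert_concept, denoise_suffix, detect_concept):
    """Monte-Carlo estimate of CIS: the fraction of runs in which a concept
    inserted at timestep `t_insert` survives into the final image.

    Placeholder callables for a concrete pipeline:
      sample_prefix(x_T, t)   -> latent at timestep t under the base prompt
      insert_concept(x_t, t)  -> latent after switching to the concept prompt
      denoise_suffix(x_t, t)  -> final image from timestep t onward
      detect_concept(img)     -> bool, concept present in the final image
    """
    hits = 0
    for _ in range(n_trials):
        x_t = sample_prefix(x_T, t_insert)   # denoise with base prompt until t_insert
        x_t = insert_concept(x_t, t_insert)  # intervene: swap in concept condition
        img = denoise_suffix(x_t, t_insert)  # finish denoising with concept prompt
        hits += bool(detect_concept(img))
    return hits / n_trials
```

Sweeping `t_insert` over the noise schedule then traces a per-concept CIS curve of the kind the paper analyzes.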
[CV-55] Visionary: The World Model Carrier Built on WebGPU-Powered Gaussian Splatting Platform
【速读】: This paper addresses the fragmentation, deployment friction, and weak support for dynamic content and generative models in current neural rendering viewers, especially those for 3D Gaussian Splatting (3DGS). The key to the solution is Visionary, an open-source, web-native platform that combines an efficient WebGPU-based renderer with per-frame ONNX inference for lightweight real-time rendering in the browser, and introduces a standardized Gaussian Generator contract that lets plug-and-play algorithms generate or update Gaussians every frame, with optional feedforward generative post-processing. This design substantially lowers the barrier to reproducing, comparing, and deploying 3DGS-family methods, making Visionary a unified World Model Carrier for both reconstructive and generative paradigms.
链接: https://arxiv.org/abs/2512.08478
作者: Yuning Gong,Yifei Liu,Yifan Zhan,Muyao Niu,Xueying Li,Yuanjun Liao,Jiaming Chen,Yuanyuan Gao,Jiaqi Chen,Minming Chen,Li Zhou,Yuning Zhang,Wei Wang,Xiaoqing Hou,Huaxi Huang,Shixiang Tang,Le Ma,Dingwen Zhang,Xue Yang,Junchi Yan,Yanchi Zhang,Yinqiang Zheng,Xiao Sun,Zhihang Zhong
机构: Shanghai AI Laboratory (上海人工智能实验室); Sichuan University (四川大学); The University of Tokyo (东京大学); Shanghai Jiao Tong University (上海交通大学); Northwestern Polytechnical University (西北工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
备注: Project page: this https URL
Abstract:Neural rendering, particularly 3D Gaussian Splatting (3DGS), has evolved rapidly and become a key component for building world models. However, existing viewer solutions remain fragmented, heavy, or constrained by legacy pipelines, resulting in high deployment friction and limited support for dynamic content and generative models. In this work, we present Visionary, an open, web-native platform for real-time various Gaussian Splatting and meshes rendering. Built on an efficient WebGPU renderer with per-frame ONNX inference, Visionary enables dynamic neural processing while maintaining a lightweight, “click-to-run” browser experience. It introduces a standardized Gaussian Generator contract, which not only supports standard 3DGS rendering but also allows plug-and-play algorithms to generate or update Gaussians each frame. Such inference also enables us to apply feedforward generative post-processing. The platform further offers a plug in this http URL library with a concise TypeScript API for seamless integration into existing web applications. Experiments show that, under identical 3DGS assets, Visionary achieves superior rendering efficiency compared to current Web viewers due to GPU-based primitive sorting. It already supports multiple variants, including MLP-based 3DGS, 4DGS, neural avatars, and style transformation or enhancement networks. By unifying inference and rendering directly in the browser, Visionary significantly lowers the barrier to reproduction, comparison, and deployment of 3DGS-family methods, serving as a unified World Model Carrier for both reconstructive and generative paradigms.
zh
[CV-56] ContextDrag: Precise Drag-Based Image Editing via Context-Preserving Token Injection and Position-Consistent Attention
【速读】: This paper addresses the limited coherence and fidelity of drag-based image editing, which stems from underusing the contextual information of the reference image, such as fine-grained texture details. The key to the solution is ContextDrag, a new paradigm with two core innovations: (1) Context-preserving Token Injection (CTI), which retains rich contextual cues from the reference image without fine-tuning or inversion, using a Latent-space Reverse Mapping (LRM) algorithm to inject noise-free reference features at their correct destination locations; and (2) Position-Consistent Attention (PCA), which positionally re-encodes reference tokens and applies overlap-aware masking to remove interference from irrelevant features, achieving high-fidelity consistency in both semantics and texture.
链接: https://arxiv.org/abs/2512.08477
作者: Huiguo He,Pengyu Yan,Ziqi Yi,Weizhi Zhong,Zheng Liu,Yejun Tang,Huan Yang,Kun Gai,Guanbin Li,Lianwen Jin
机构: South China University of Technology (华南理工大学); Kuaishou Technology (快手科技); The University of Hong Kong (香港大学); Shenzhen Loop Area Institute (深圳环区研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Drag-based image editing aims to modify visual content following user-specified drag operations. Despite notable progress by existing methods, they still fail to fully exploit the contextual information in the reference image, including fine-grained texture details, leading to edits with limited coherence and fidelity. To address this challenge, we introduce ContextDrag, a new paradigm for drag-based editing that leverages the strong contextual modeling capability of editing models, such as FLUX-Kontext. By incorporating VAE-encoded features from the reference image, ContextDrag can leverage rich contextual cues and preserve fine-grained details, without the need for finetuning or inversion. Specifically, ContextDrag introduces a novel Context-preserving Token Injection (CTI) that injects noise-free reference features into their correct destination locations via a Latent-space Reverse Mapping (LRM) algorithm. This strategy enables precise drag control while preserving consistency in both semantics and texture details. Second, ContextDrag adopts a novel Position-Consistent Attention (PCA), which positionally re-encodes the reference tokens and applies overlap-aware masking to eliminate interference from irrelevant reference features. Extensive experiments on DragBench-SR and DragBench-DR demonstrate that our approach surpasses all existing SOTA methods. Code will be publicly available.
zh
[CV-57] Team-Aware Football Player Tracking with SAM: An Appearance-Based Approach to Occlusion Recovery
【速读】: This paper addresses the difficulty of tracking football players under frequent occlusion, similar appearances, and rapid motion. The key to the solution is a lightweight tracking method built on the Segment Anything Model (SAM), combining CSRT trackers with jersey-color appearance models into a team-aware tracking system: SAM provides precise initialization, while HSV histogram-based re-identification substantially improves occlusion recovery, yielding accurate and robust player tracking in crowded scenes.
链接: https://arxiv.org/abs/2512.08467
作者: Chamath Ranasinghe,Uthayasanker Thayasivam
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 5 figures
Abstract:Football player tracking is challenged by frequent occlusions, similar appearances, and rapid motion in crowded scenes. This paper presents a lightweight SAM-based tracking method combining the Segment Anything Model (SAM) with CSRT trackers and jersey color-based appearance models. We propose a team-aware tracking system that uses SAM for precise initialization and HSV histogram-based re-identification to improve occlusion recovery. Our evaluation measures three dimensions: processing speed (FPS and memory), tracking accuracy (success rate and box stability), and robustness (occlusion recovery and identity consistency). Experiments on football video sequences show that the approach achieves 7.6-7.7 FPS with stable memory usage (~1880 MB), maintaining 100 percent tracking success in light occlusions and 90 percent in crowded penalty-box scenarios with 5 or more players. Appearance-based re-identification recovers 50 percent of heavy occlusions, demonstrating the value of domain-specific cues. Analysis reveals key trade-offs: the SAM + CSRT combination provides consistent performance across crowd densities but struggles with long-term occlusions where players leave the frame, achieving only 8.66 percent re-acquisition success. These results offer practical guidelines for deploying football tracking systems under resource constraints, showing that classical tracker-based methods work well with continuous visibility but require stronger re-identification mechanisms for extended absences.
zh
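The HSV-histogram re-identification cue is straightforward to reproduce; here is a minimal OpenCV sketch (the bin counts and the 0.7 correlation threshold are illustrative choices, not the paper's exact settings).

```python
import cv2
import numpy as np

def hsv_signature(patch_bgr, mask=None, h_bins=30, s_bins=32):
    """Normalized 2D Hue-Saturation histogram of a player crop (BGR input)."""
    hsv = cv2.cvtColor(patch_bgr, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], mask, [h_bins, s_bins],
                        [0, 180, 0, 256])
    cv2.normalize(hist, hist, alpha=0, beta=1, norm_type=cv2.NORM_MINMAX)
    return hist

def reidentify(lost_sig, candidate_patches, threshold=0.7):
    """Return the index of the best-matching candidate detection by
    histogram correlation, or None if no candidate clears the threshold."""
    best_idx, best_score = None, threshold
    for i, patch in enumerate(candidate_patches):
        score = cv2.compareHist(lost_sig, hsv_signature(patch),
                                cv2.HISTCMP_CORREL)
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx
```

Using hue and saturation only (dropping value) keeps the jersey signature relatively stable under the lighting changes typical of pitch footage.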
[CV-58] Uncertainty-Aware Subset Selection for Robust Visual Explainability under Distribution Shifts
【速读】: This paper addresses the sharp drop in reliability of existing subset-selection-based explanation methods under out-of-distribution (OOD) conditions, where explanations become redundant, unstable, and sensitive to uncertainty despite working well in-distribution (ID). The key to the solution is a framework that combines submodular subset selection with layer-wise, gradient-based uncertainty estimation: uncertainty is estimated via adaptive weight perturbations and used to guide the submodular optimization toward diverse and informative subsets, improving the robustness and fidelity of explanations without extra training or auxiliary models.
链接: https://arxiv.org/abs/2512.08445
作者: Madhav Gupta,Vishak Prasad C,Ganesh Ramakrishnan
机构: Indian Institute of Technology Bombay (印度理工学院孟买分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Subset selection-based methods are widely used to explain deep vision models: they attribute predictions by highlighting the most influential image regions and support object-level explanations. While these methods perform well in in-distribution (ID) settings, their behavior under out-of-distribution (OOD) conditions remains poorly understood. Through extensive experiments across multiple ID-OOD sets, we find that reliability of the existing subset based methods degrades markedly, yielding redundant, unstable, and uncertainty-sensitive explanations. To address these shortcomings, we introduce a framework that combines submodular subset selection with layer-wise, gradient-based uncertainty estimation to improve robustness and fidelity without requiring additional training or auxiliary models. Our approach estimates uncertainty via adaptive weight perturbations and uses these estimates to guide submodular optimization, ensuring diverse and informative subset selection. Empirical evaluations show that, beyond mitigating the weaknesses of existing methods under OOD scenarios, our framework also yields improvements in ID settings. These findings highlight limitations of current subset-based approaches and demonstrate how uncertainty-driven optimization can enhance attribution and object-level interpretability, paving the way for more transparent and trustworthy AI in real-world vision applications.
zh
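As a rough illustration of uncertainty-guided submodular selection, the sketch below greedily maximizes a facility-location objective while discounting candidates by a given uncertainty score; the discount rule is a simplified stand-in for the paper's gradient-based, weight-perturbation uncertainty estimates.

```python
import numpy as np

def greedy_uncertainty_guided_selection(features, uncertainty, k):
    """Greedy maximization of a facility-location objective, with each
    candidate's marginal gain discounted by its uncertainty score.

    features:    (n, d) region embeddings
    uncertainty: (n,) nonnegative scores, assumed given (e.g. variance
                 under weight perturbations, as in the paper)
    """
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    sim = f @ f.T                        # pairwise cosine similarities
    selected, covered = [], np.zeros(len(f))
    for _ in range(k):
        # marginal coverage gain of adding each candidate to the subset
        gain = np.maximum(sim, covered).sum(axis=1) - covered.sum()
        gain = gain / (1.0 + uncertainty)    # prefer low-uncertainty regions
        gain[selected] = -np.inf             # never re-select
        j = int(np.argmax(gain))
        selected.append(j)
        covered = np.maximum(covered, sim[j])
    return selected
```

Facility location is monotone submodular, so the greedy loop enjoys the usual (1 - 1/e) approximation guarantee; the uncertainty discount simply reshapes which subsets it prefers.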
[CV-59] Leveraging Multispectral Sensors for Color Correction in Mobile Cameras
【速读】: This paper addresses the accuracy and stability of color correction in multispectral (MS) imaging systems, in particular the limitation of existing methods that split the color-correction pipeline into stages and discard MS data early, capping performance. The key to the solution is a unified end-to-end learning framework that jointly exploits data from a high-resolution RGB sensor and an auxiliary low-resolution MS sensor, integrating the full color-correction pipeline in a single model to produce coherent, color-accurate outputs. By refactoring two state-of-the-art image-to-image architectures and constructing a dedicated training dataset, the method reduces color error by up to 50% compared with RGB-only and MS-driven baselines, demonstrating clearly improved accuracy and robustness.
链接: https://arxiv.org/abs/2512.08441
作者: Luca Cogo,Marco Buzzelli,Simone Bianco,Javier Vazquez-Corral,Raimondo Schettini
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in snapshot multispectral (MS) imaging have enabled compact, low-cost spectral sensors for consumer and mobile devices. By capturing richer spectral information than conventional RGB sensors, these systems can enhance key imaging tasks, including color correction. However, most existing methods treat the color correction pipeline in separate stages, often discarding MS data early in the process. We propose a unified, learning-based framework that (i) performs end-to-end color correction and (ii) jointly leverages data from a high-resolution RGB sensor and an auxiliary low-resolution MS sensor. Our approach integrates the full pipeline within a single model, producing coherent and color-accurate outputs. We demonstrate the flexibility and generality of our framework by refactoring two different state-of-the-art image-to-image architectures. To support training and evaluation, we construct a dedicated dataset by aggregating and repurposing publicly available spectral datasets, rendering under multiple RGB camera sensitivities. Extensive experiments show that our approach improves color accuracy and stability, reducing error by up to 50% compared to RGB-only and MS-driven baselines. Datasets, code, and models will be made available upon acceptance.
zh
[CV-60] LapFM: A Laparoscopic Segmentation Foundation Model via Hierarchical Concept Evolving Pre-training
【速读】: This paper addresses the weak generalization of laparoscopic image segmentation caused by annotation scarcity and semantic inconsistency across procedures. Existing approaches merely fine-tune natural-domain foundation models (e.g., SAM) with limited supervision, acting as domain adapters rather than true surgical foundation models and struggling with the diversity of surgical targets. The key to the solution is LapFM, built on a Hierarchical Concept Evolving Pre-training paradigm: first, a Laparoscopic Concept Hierarchy (LCH) unifies heterogeneous entities (anatomy, tissue, instruments) into a scalable knowledge structure with cross-granularity semantic consistency via a hierarchical mask decoder with parent-child query embeddings; second, a confidence-driven evolving labeling mechanism iteratively generates and filters pseudo-labels, progressively incorporating reliable unlabeled samples into training and ultimately yielding LapBench-114K, a large-scale benchmark of 114K image-mask pairs. This substantially improves granularity-adaptive generalization in universal laparoscopic segmentation.
链接: https://arxiv.org/abs/2512.08439
作者: Qing Xu,Kun Yuan,Yuxiang Luo,Yuhao Zhai,Wenting Duan,Nassir Navab,Zhen Chen
机构: University of Lincoln, UK; University of Nottingham, UK; University of Nottingham Ningbo China, China; University of Strasbourg, France; Technical University of Munich, Germany; Waseda University, Japan; Shandong University, China; Yale University, USA
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Surgical segmentation is pivotal for scene understanding yet remains hindered by annotation scarcity and semantic inconsistency across diverse procedures. Existing approaches typically fine-tune natural foundation models (e.g., SAM) with limited supervision, functioning merely as domain adapters rather than surgical foundation models. Consequently, they struggle to generalize across the vast variability of surgical targets. To bridge this gap, we present LapFM, a foundation model designed to evolve robust segmentation capabilities from massive unlabeled surgical images. Distinct from medical foundation models relying on inefficient self-supervised proxy tasks, LapFM leverages a Hierarchical Concept Evolving Pre-training paradigm. First, we establish a Laparoscopic Concept Hierarchy (LCH) via a hierarchical mask decoder with parent-child query embeddings, unifying diverse entities (i.e., Anatomy, Tissue, and Instrument) into a scalable knowledge structure with cross-granularity semantic consistency. Second, we propose a Confidence-driven Evolving Labeling that iteratively generates and filters pseudo-labels based on hierarchical consistency, progressively incorporating reliable samples from unlabeled images into training. This process yields LapBench-114K, a large-scale benchmark comprising 114K image-mask pairs. Extensive experiments demonstrate that LapFM significantly outperforms state-of-the-art methods, establishing new standards for granularity-adaptive generalization in universal laparoscopic segmentation. The source code is available at this https URL.
zh
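A rough sketch of one confidence-driven labeling round is given below; the `model` output format, the threshold `tau`, and the containment test standing in for LapFM's hierarchical-consistency check are all assumptions made for illustration.

```python
def evolve_pseudo_labels(model, unlabeled_images, parent_of, tau=0.9):
    """One round of confidence-driven evolving labeling (illustrative only).

    `model(image)` is assumed to return {concept_id: (bool_mask, confidence)};
    `parent_of` maps a concept id to its parent in the concept hierarchy
    (absent at the root). A child mask is kept only if it is confident AND
    spatially contained in its parent's mask -- a simple proxy for the
    paper's hierarchical-consistency filtering.
    """
    accepted = []
    for img in unlabeled_images:
        preds = model(img)
        keep = {}
        for cid, (mask, conf) in preds.items():
            if conf < tau:
                continue                       # not confident enough
            pid = parent_of.get(cid)
            if pid is not None:
                if pid not in preds:
                    continue                   # orphaned child: reject
                pmask, _ = preds[pid]
                inside = (mask & pmask).sum() / max(mask.sum(), 1)
                if inside < 0.95:              # child must lie within parent
                    continue
            keep[cid] = mask
        if keep:
            accepted.append((img, keep))       # joins the training pool
    return accepted
```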
[CV-61] SDT-6D: Fully Sparse Depth-Transformer for Staged End-to-End 6D Pose Estimation in Industrial Multi-View Bin Picking WACV2026
【速读】: This paper addresses the difficulty of accurately recovering 6D poses (3D position plus 3D orientation) of target objects in densely packed industrial bin-picking scenes, where occlusion, reflections, and textureless surfaces defeat conventional methods. The key to the solution is a fully sparse, depth-only 6D pose estimation framework: multi-view depth maps are fused into a fine-grained 3D point cloud or a sparse Truncated Signed Distance Field (TSDF); a staged heatmap mechanism produces scene-adaptive attention priors that steer computation toward foreground regions across resolutions, keeping memory requirements feasible at high resolution; and a density-aware sparse transformer block dynamically attends to (self-)occlusion and the non-uniform distribution of 3D data. Finally, a novel per-voxel voting strategy processes the entire scene integrally, predicting poses for an arbitrary number of target objects simultaneously, with strong results on the IPD and MV-YCB multi-view datasets in heavily cluttered environments.
链接: https://arxiv.org/abs/2512.08430
作者: Nico Leuze,Maximilian Hoh,Samed Doğan,Nicolas R.-Peña,Alfred Schoettl
机构: University of Applied Science Munich (慕尼黑应用技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted to WACV 2026. Preprint version
Abstract:Accurately recovering 6D poses in densely packed industrial bin-picking environments remains a serious challenge, owing to occlusions, reflections, and textureless parts. We introduce a holistic depth-only 6D pose estimation approach that fuses multi-view depth maps into either a fine-grained 3D point cloud in its vanilla version, or a sparse Truncated Signed Distance Field (TSDF). At the core of our framework lies a staged heatmap mechanism that yields scene-adaptive attention priors across different resolutions, steering computation toward foreground regions, thus keeping memory requirements at high resolutions feasible. Alongside, we propose a density-aware sparse transformer block that dynamically attends to (self-) occlusions and the non-uniform distribution of 3D data. While sparse 3D approaches have proven effective for long-range perception, their potential in close-range robotic applications remains underexplored. Our framework operates fully sparse, enabling high-resolution volumetric representations to capture fine geometric details crucial for accurate pose estimation in clutter. Our method processes the entire scene integrally, predicting the 6D pose via a novel per-voxel voting strategy, allowing simultaneous pose predictions for an arbitrary number of target objects. We validate our method on the recently published IPD and MV-YCB multi-view datasets, demonstrating competitive performance in heavily cluttered industrial and household bin picking scenarios.
zh
[CV-62] Towards Effective and Efficient Long Video Understanding of Multimodal Large Language Models via One-shot Clip Retrieval
【速读】: This paper addresses the limitation that most Multimodal Large Language Models (MLLMs) can only ingest a limited number of frames from long videos due to excessive memory overhead. The key to the solution is One-shot video-Clip based Retrieval AuGmentation (OneClip-RAG), a new paradigm that unifies clip chunking and cross-modal retrieval in a single processing step through a novel query-guided video chunking algorithm, avoiding redundant computation, while exploiting video clips to improve knowledge integrity and semantic coherence for long-video understanding. Plugged into five recent MLLMs, it delivers clear gains on long-video benchmarks together with superior efficiency.
链接: https://arxiv.org/abs/2512.08410
作者: Tao Chen,Shaobo Ju,Qiong Wu,Chenxin Fang,Kun Zhang,Jun Peng,Hui Li,Yiyi Zhou,Rongrong Ji
机构: Xiamen University (厦门大学); Peng Cheng Laboratory (鹏城实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Due to excessive memory overhead, most Multimodal Large Language Models (MLLMs) can only process videos of limited frames. In this paper, we propose an effective and efficient paradigm to remedy this shortcoming, termed One-shot video-Clip based Retrieval AuGmentation (OneClip-RAG). Compared with existing video RAG methods, OneClip-RAG makes full use of the merits of video clips for augmented video understanding in terms of both knowledge integrity and semantic coherence. Besides, it is also equipped with a novel query-guided video chunking algorithm that can unify clip chunking and cross-modal retrieval in one processing step, avoiding redundant computations. To improve instruction following, we further propose a new dataset called SynLongVideo and design a progressive training regime for OneClip-RAG. OneClip-RAG is plugged into five recent MLLMs and validated on a set of long-video benchmarks. Experimental results not only show the obvious performance gains by OneClip-RAG over MLLMs, e.g., boosting InternVL2 8B and Qwen2-VL 7B to the level of GPT-4o on MLVU, but also show its superior efficiency in handling long videos, e.g., enabling LLaVA-Video to understand up to an hour of video in less than 2.2 minutes on a single 4090 GPU.
zh
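The abstract does not spell out the chunking algorithm, but the one-pass idea can be sketched as follows: score every frame against the query once, open a clip boundary where relevance shifts sharply, and rank clips by their peak frame, so chunking and retrieval share that single pass. The embedding inputs and both thresholds below are hypothetical.

```python
import numpy as np

def query_guided_chunking(frame_embs, query_emb, sim_drop=0.15, top_k=3):
    """One-pass sketch unifying clip chunking and retrieval (illustrative).

    frame_embs: (n, d) per-frame embeddings; query_emb: (d,) text embedding.
    """
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    sims = f @ q                               # per-frame relevance, one pass
    clips, start = [], 0
    for i in range(1, len(sims)):
        if abs(sims[i] - sims[i - 1]) > sim_drop:  # relevance shift => boundary
            clips.append((start, i))
            start = i
    clips.append((start, len(sims)))
    scored = [(float(sims[a:b].max()), (a, b)) for a, b in clips]
    scored.sort(reverse=True)
    return [span for _, span in scored[:top_k]]    # top clips for the MLLM
```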
[CV-63] SAM-Body4D: Training-Free 4D Human Body Mesh Recovery from Videos
【速读】: This paper addresses temporal inconsistency and poor occlusion robustness in Human Mesh Recovery (HMR) from videos: image-based methods such as SAM 3D Body run per-frame inference, so pose and shape fluctuate over time and performance degrades markedly under occlusion. The key to the solution is to exploit, without any extra training, the spatiotemporal human continuity inherent in videos: a promptable video segmentation model first produces identity-consistent masklets, which an Occlusion-Aware module refines to recover missing regions; the refined masklets then guide SAM 3D Body to produce temporally consistent full-body mesh trajectories, and a padding-based parallel strategy enables efficient multi-human inference. Experiments show clearly improved temporal stability and occlusion robustness on challenging in-the-wild videos.
链接: https://arxiv.org/abs/2512.08406
作者: Mingqi Gao,Yunqi Miao,Jungong Han
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Human Mesh Recovery (HMR) aims to reconstruct 3D human pose and shape from 2D observations and is fundamental to human-centric understanding in real-world scenarios. While recent image-based HMR methods such as SAM 3D Body achieve strong robustness on in-the-wild images, they rely on per-frame inference when applied to videos, leading to temporal inconsistency and degraded performance under occlusions. We address these issues without extra training by leveraging the inherent human continuity in videos. We propose SAM-Body4D, a training-free framework for temporally consistent and occlusion-robust HMR from videos. We first generate identity-consistent masklets using a promptable video segmentation model, then refine them with an Occlusion-Aware module to recover missing regions. The refined masklets guide SAM 3D Body to produce consistent full-body mesh trajectories, while a padding-based parallel strategy enables efficient multi-human inference. Experimental results demonstrate that SAM-Body4D achieves improved temporal stability and robustness in challenging in-the-wild videos, without any retraining. Our code and demo are available at: this https URL.
zh
[CV-64] Towards Visual Re-Identification of Fish using Fine-Grained Classification for Electronic Monitoring in Fisheries
【速读】: This paper addresses the difficulty of manually reviewing the large volumes of video produced by Electronic Monitoring (EM) systems in fisheries management, proposing a deep-learning pipeline for automated fish re-identification (Re-ID). The key to the solution is the new AutoFish dataset, built to simulate EM settings, combined with an optimized pipeline: hard triplet mining together with a dataset-specific image transformation strategy (including normalization) substantially improves Re-ID performance. Experiments show the Vision Transformer-based Swin-T architecture outperforming the convolutional ResNet-50, reaching 41.65% mAP@k and 90.43% Rank-1 accuracy, with the main challenge being intra-species errors, distinguishing visually similar individuals of the same species, where viewpoint inconsistency proves far more detrimental than partial occlusion.
链接: https://arxiv.org/abs/2512.08400
作者: Samitha Nuwan Thilakarathna,Ercan Avsar,Martin Mathias Nielsen,Malte Pedersen
机构: DTU Aqua - National Institute of Aquatic Resources, Technical University of Denmark(丹麦技术大学水产国家研究所); Visual Analysis and Perception Laboratory, Aalborg University(奥尔堡大学视觉分析与感知实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The paper has been accepted for publication at Northern Lights Deep Learning (NLDL) Conference 2025
Abstract:Accurate fisheries data are crucial for effective and sustainable marine resource management. With the recent adoption of Electronic Monitoring (EM) systems, more video data is now being collected than can be feasibly reviewed manually. This paper addresses this challenge by developing an optimized deep learning pipeline for automated fish re-identification (Re-ID) using the novel AutoFish dataset, which simulates EM systems with conveyor belts and six similar-looking fish species. We demonstrate that key Re-ID metrics (R1 and mAP@k) are substantially improved by using hard triplet mining in conjunction with a custom image transformation pipeline that includes dataset-specific normalization. By employing these strategies, we demonstrate that the Vision Transformer-based Swin-T architecture consistently outperforms the Convolutional Neural Network-based ResNet-50, achieving peak performance of 41.65% mAP@k and 90.43% Rank-1 accuracy. An in-depth analysis reveals that the primary challenge is distinguishing visually similar individuals of the same species (Intra-species errors), where viewpoint inconsistency proves significantly more detrimental than partial occlusion. The source code and documentation are available at: this https URL
zh
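Hard triplet mining, the ingredient the paper credits for its metric gains, is commonly implemented as batch-hard mining (Hermans et al., "In Defense of the Triplet Loss"); a minimal PyTorch version is sketched below, assuming each identity appears at least twice per mini-batch.

```python
import torch

def batch_hard_triplet_loss(embeddings, labels, margin=0.3):
    """Batch-hard triplet loss: for each anchor, use the hardest positive
    (farthest same-ID sample) and hardest negative (closest different-ID
    sample) inside the mini-batch.
    """
    dist = torch.cdist(embeddings, embeddings, p=2)        # (B, B) distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)      # same-identity mask
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)
    pos_mask = same & ~eye                                 # positives, not self
    hardest_pos = (dist * pos_mask).max(dim=1).values      # farthest positive
    neg_dist = dist.masked_fill(same, float("inf"))        # hide non-negatives
    hardest_neg = neg_dist.min(dim=1).values               # closest negative
    return torch.relu(hardest_pos - hardest_neg + margin).mean()
```

Because the hardest pairs are re-selected every batch, a PK sampler (P identities times K images each) is the usual companion to this loss.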
[CV-65] Detection of Digital Facial Retouching utilizing Face Beauty Information
【速读】: This paper addresses the challenge facial retouching poses to biometric systems: undetected retouching can interfere with face recognition accuracy and thereby undermine system security and reliability. The key to the solution is AI-based feature extraction combined with an analysis of how facial beauty assessment algorithms change on retouched images, exploring whether face beauty features can increase detection sensitivity; with the attacking retouching algorithm unknown, the method achieves a 1.1% D-EER (Detection Equal Error Rate) for single-image detection, indicating strong generalization and practicality.
链接: https://arxiv.org/abs/2512.08397
作者: Philipp Srock,Juan E. Tapia,Christoph Busch
机构: Hochschule Darmstadt, University of Applied Sciences (达姆施塔特应用技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Facial retouching to beautify images is widespread in social media and advertisements, and it is even applied in professional photo studios to let individuals appear younger and to remove wrinkles and skin impurities. Generally speaking, this is done to enhance beauty. This is not a problem in itself, but it becomes one when retouched images are used as biometric samples and enrolled in a biometric system. Since previous work has proven facial retouching to be a challenge for face recognition systems, the detection of facial retouching becomes increasingly necessary. This work studies and analyzes changes in beauty assessment algorithms on retouched images, assesses different feature extraction methods based on artificial intelligence in order to improve retouching detection, and evaluates whether face beauty can be exploited to enhance the detection rate. In a scenario where the attacking retouching algorithm is unknown, this work achieved 1.1% D-EER on single image detection.
zh
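The reported D-EER is the operating point where the false-accept rate on retouched images equals the false-reject rate on bona fide images; a minimal way to compute it from detector scores (assuming higher score means "more likely bona fide") is:

```python
import numpy as np

def detection_eer(genuine_scores, attack_scores):
    """Detection Equal Error Rate: scan thresholds and return the point
    where FAR on retouched images and FRR on bona fide images coincide."""
    thresholds = np.sort(np.concatenate([genuine_scores, attack_scores]))
    best_gap, eer = np.inf, 1.0
    for thr in thresholds:
        frr = np.mean(genuine_scores < thr)   # bona fide wrongly rejected
        far = np.mean(attack_scores >= thr)   # retouched wrongly accepted
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Example: well-separated score distributions yield a low D-EER.
rng = np.random.default_rng(0)
print(detection_eer(rng.normal(1.0, 0.3, 500), rng.normal(0.0, 0.3, 500)))
```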
[CV-66] Simultaneous Enhancement and Noise Suppression under Complex Illumination Conditions
【速读】: This paper addresses image degradation under complex illumination, where existing enhancement methods either amplify inherent noise or only work under specific lighting. The key to the solution is a joint framework: a gradient-domain weighted guided filter (GDWGIF) first estimates illumination accurately and improves image quality; the Retinex model then decomposes the image into illumination and reflection layers that are processed in parallel, correcting lighting in the illumination layer and enhancing detail in the reflection layer; finally, multi-exposure fusion and a linear stretching strategy optimize the dynamic range. Evaluated on real-world datasets from practical applications, the method outperforms state-of-the-art approaches in both contrast enhancement and noise suppression.
链接: https://arxiv.org/abs/2512.08378
作者: Jing Tao,You Li,Banglei Guan,Yang Shang,Qifeng Yu
机构: National University of Defense Technology (国防科技大学); China Astronaut Research and Training Center (中国航天员科研训练中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The paper has been accepted and officially published by IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT
Abstract:Under challenging light conditions, captured images often suffer from various degradations, leading to a decline in the performance of vision-based applications. Although numerous methods have been proposed to enhance image quality, they either significantly amplify inherent noise or are only effective under specific illumination conditions. To address these issues, we propose a novel framework for simultaneous enhancement and noise suppression under complex illumination conditions. Firstly, a gradient-domain weighted guided filter (GDWGIF) is employed to accurately estimate illumination and improve image quality. Next, the Retinex model is applied to decompose the captured image into separate illumination and reflection layers. These layers undergo parallel processing, with the illumination layer being corrected to optimize lighting conditions and the reflection layer enhanced to improve image quality. Finally, the dynamic range of the image is optimized through multi-exposure fusion and a linear stretching strategy. The proposed method is evaluated on real-world datasets obtained from practical applications. Experimental results demonstrate that our proposed method achieves better performance compared to state-of-the-art methods in both contrast enhancement and noise suppression.
zh
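The Retinex split at the heart of the pipeline can be sketched in a few lines; here a large Gaussian blur stands in for the paper's gradient-domain weighted guided filter (GDWGIF), and the gamma value is illustrative.

```python
import cv2
import numpy as np

def retinex_decompose(gray, sigma=25):
    """Split an image into illumination and reflectance layers (I = L * R).

    A large-kernel Gaussian blur approximates the smooth illumination here;
    the paper instead uses a gradient-domain weighted guided filter. The
    two layers can then be corrected and enhanced independently.
    """
    img = gray.astype(np.float32) / 255.0 + 1e-4
    illumination = cv2.GaussianBlur(img, (0, 0), sigma)   # smooth lighting estimate
    reflectance = img / (illumination + 1e-4)             # detail layer
    return illumination, reflectance

def recombine(illumination, reflectance, gamma=0.6):
    """Gamma-correct the illumination (brightens dark regions) and merge."""
    corrected = np.power(illumination, gamma)
    out = np.clip(corrected * reflectance, 0, 1)
    return (out * 255).astype(np.uint8)
```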
[CV-67] The Unseen Bias: How Norm Discrepancy in Pre-Norm MLLMs Leads to Visual Information Loss
【速读】: This paper addresses the norm imbalance between visual and text tokens in Multimodal Large Language Models (MLLMs) induced by the Pre-Norm architecture. This imbalance creates an "asymmetric update dynamic": high-norm visual tokens exhibit strong representational inertia and update semantically much more slowly than text tokens, severely impairing cross-modal feature fusion. The key to the solution is inserting a single, carefully initialized LayerNorm layer after the visual projector to align the norm distributions of visual and text tokens, mitigating the asymmetric updates. On the LLaVA-1.5 architecture, this simple intervention yields significant gains on multimodal benchmarks and, unexpectedly, also improves text-only tasks, suggesting it leads to a more balanced, more broadly capable model.
链接: https://arxiv.org/abs/2512.08374
作者: Bozhou Li,Xinda Xue,Sihan Yang,Yang Shi,Xinlong Chen,Yushuo Guan,Yuanxing Zhang,Wentao Zhang
机构: Peking University (北京大学); School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院); Xi’an Jiaotong University (西安交通大学); Kling Team, Kuaishou Technology (快手科技Kling团队)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multimodal Large Language Models (MLLMs), which couple pre-trained vision encoders and language models, have shown remarkable capabilities. However, their reliance on the ubiquitous Pre-Norm architecture introduces a subtle yet critical flaw: a severe norm disparity between the high-norm visual tokens and the low-norm text tokens. In this work, we present a formal theoretical analysis demonstrating that this imbalance is not a static issue. Instead, it induces an "asymmetric update dynamic," where high-norm visual tokens exhibit a "representational inertia," causing them to transform semantically much slower than their textual counterparts. This fundamentally impairs effective cross-modal feature fusion. Our empirical validation across a range of mainstream MLLMs confirms that this theoretical dynamic – the persistence of norm disparity and the resulting asymmetric update rates – is a prevalent phenomenon. Based on this insight, we propose a remarkably simple yet effective solution: inserting a single, carefully initialized LayerNorm layer after the visual projector to enforce norm alignment. Experiments conducted on the LLaVA-1.5 architecture show that this intervention yields significant performance gains not only on a wide suite of multimodal benchmarks but also, notably, on text-only evaluations such as MMLU, suggesting that resolving the architectural imbalance leads to a more holistically capable model.
zh
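The proposed fix is architecturally tiny; a sketch of a LLaVA-style projector with the added LayerNorm is shown below (the dimensions and the initialization scale are illustrative; the paper emphasizes that the added LayerNorm must be carefully initialized).

```python
import torch.nn as nn

class NormAlignedProjector(nn.Module):
    """Visual projector followed by a LayerNorm -- the single-layer change
    the paper proposes to align visual-token norms with text-token norms.
    """
    def __init__(self, vision_dim=1024, llm_dim=4096, init_scale=1.0):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )  # the usual LLaVA-style two-layer MLP projector
        self.norm = nn.LayerNorm(llm_dim)
        # The LayerNorm gain sets the scale of the visual tokens fed to the
        # LLM; init_scale would be chosen to match the text-token norms.
        nn.init.constant_(self.norm.weight, init_scale)

    def forward(self, vision_tokens):
        return self.norm(self.proj(vision_tokens))
```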
[CV-68] SCU-CGAN: Enhancing Fire Detection through Synthetic Fire Image Generation and Dataset Augmentation
【速读】: This paper addresses the limited performance of household fire-detection models caused by the scarcity of fire data. The key to the solution is the SCU-CGAN model, which integrates a U-Net architecture, CBAM (Convolutional Block Attention Module), and an additional discriminator to generate high-quality, realistic fire images from non-fire images, thereby augmenting the training set. Experiments show that the augmented data improves the YOLOv5 nano model by 56.5% on mAP@0.5:0.95, validating the effectiveness of this generative data-augmentation strategy.
链接: https://arxiv.org/abs/2512.08362
作者: Ju-Young Kim,Ji-Hong Park,Gun-Woo Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for main track at MobieSec 2024 (not published in the proceedings)
Abstract:Fire has long been linked to human life, causing severe disasters and losses. Early detection is crucial, and with the rise of home IoT technologies, household fire detection systems have emerged. However, the lack of sufficient fire datasets limits the performance of detection models. We propose the SCU-CGAN model, which integrates U-Net, CBAM, and an additional discriminator to generate realistic fire images from nonfire images. We evaluate the image quality and confirm that SCU-CGAN outperforms existing models. Specifically, SCU-CGAN achieved a 41.5% improvement in KID score compared to CycleGAN, demonstrating the superior quality of the generated fire images. Furthermore, experiments demonstrate that the augmented dataset significantly improves the accuracy of fire detection models without altering their structure. For the YOLOv5 nano model, the most notable improvement was observed in the mAP@0.5:0.95 metric, which increased by 56.5%, highlighting the effectiveness of the proposed approach.
zh
[CV-69] Conditional Morphogenesis: Emergent Generation of Structural Digits via Neural Cellular Automata
【速读】: This paper addresses the lack of class-conditional control in existing Differentiable Neural Cellular Automata (NCA) for structural generation, i.e., how to grow diverse, topologically distinct targets (such as MNIST digits) from a single generic seed, guided by class information. The key to the solution is a Conditional Neural Cellular Automata (c-NCA) architecture that injects a one-hot class vector into each cell's local perception field, so that a single set of local rules can break symmetry and self-organize into ten distinct geometric attractors, achieving class-conditional structure generation. The method strictly respects locality and translation equivariance, requires no global receptive field, and is biologically plausible as well as computationally lightweight.
链接: https://arxiv.org/abs/2512.08360
作者: Ali Sakour
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 13 pages, 5 figures. Code available at: this https URL
Abstract:Biological systems exhibit remarkable morphogenetic plasticity, where a single genome can encode various specialized cellular structures triggered by local chemical signals. In the domain of Deep Learning, Differentiable Neural Cellular Automata (NCA) have emerged as a paradigm to mimic this self-organization. However, existing NCA research has predominantly focused on continuous texture synthesis or single-target object recovery, leaving the challenge of class-conditional structural generation largely unexplored. In this work, we propose a novel Conditional Neural Cellular Automata (c-NCA) architecture capable of growing distinct topological structures - specifically MNIST digits - from a single generic seed, guided solely by a spatially broadcasted class vector. Unlike traditional generative models (e.g., GANs, VAEs) that rely on global receptive fields, our model enforces strict locality and translation equivariance. We demonstrate that by injecting a one-hot condition into the cellular perception field, a single set of local rules can learn to break symmetry and self-assemble into ten distinct geometric attractors. Experimental results show that our c-NCA achieves stable convergence, correctly forming digit topologies from a single pixel, and exhibits robustness characteristic of biological systems. This work bridges the gap between texture-based NCAs and structural pattern formation, offering a lightweight, biologically plausible alternative for conditional generation.
zh
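A minimal conditional-NCA update step might look like the following PyTorch sketch; the channel counts, fire rate, and the depthwise 3x3 "perception" convolution are generic NCA conventions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalNCA(nn.Module):
    """Minimal conditional NCA step (illustrative): each cell perceives its
    3x3 neighborhood plus a spatially broadcast one-hot class vector, then
    applies a shared local update rule -- no global receptive field.
    """
    def __init__(self, channels=16, n_classes=10, hidden=128):
        super().__init__()
        self.perceive = nn.Conv2d(channels, channels * 3, 3,
                                  padding=1, groups=channels)  # local 3x3 only
        self.rule = nn.Sequential(
            nn.Conv2d(channels * 3 + n_classes, hidden, 1), nn.ReLU(),
            nn.Conv2d(hidden, channels, 1),
        )

    def forward(self, state, class_onehot, fire_rate=0.5):
        b, _, h, w = state.shape
        cond = class_onehot.view(b, -1, 1, 1).expand(-1, -1, h, w)
        update = self.rule(torch.cat([self.perceive(state), cond], dim=1))
        # Stochastic masking keeps the updates asynchronous, as in NCA work.
        mask = (torch.rand(b, 1, h, w, device=state.device) < fire_rate).float()
        return state + update * mask

# Growing a digit: start from one seeded pixel and iterate the local rule.
nca = ConditionalNCA()
state = torch.zeros(1, 16, 28, 28)
state[:, 3:, 14, 14] = 1.0                      # generic seed in the center
cls = F.one_hot(torch.tensor([7]), 10).float()  # condition: grow a "7"
for _ in range(64):
    state = nca(state, cls)
```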
[CV-70] TrackingWorld: World-centric Monocular 3D Tracking of Almost All Pixels NEURIPS2025
【速读】: This paper addresses two shortcomings of existing monocular 3D tracking: insufficient separation of camera motion from foreground dynamic motion, and the inability to densely track newly emerging dynamic subjects in a video. The key to the solution is the TrackingWorld pipeline: an efficient tracking upsampler first lifts arbitrary sparse 2D tracks to dense 2D tracks; applying the upsampler to all frames and removing redundant tracks in overlapped regions generalizes tracking to newly emerging objects; finally, an optimization-based framework estimates camera poses and the 3D coordinates of the 2D tracks, back-projecting the dense 2D tracks into a world-centric 3D coordinate system to obtain accurate, dense 3D trajectories.
链接: https://arxiv.org/abs/2512.08358
作者: Jiahao Lu,Weitao Xiong,Jiacheng Deng,Peng Li,Tianyu Huang,Zhiyang Dou,Cheng Lin,Sai-Kit Yeung,Yuan Liu
机构: HKUST(香港科技大学); USTC(中国科学技术大学); CUHK(香港中文大学); HKU(香港大学); XMU(厦门大学); MUST(澳门科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by NeurIPS 2025. Project Page: this https URL
Abstract:Monocular 3D tracking aims to capture the long-term motion of pixels in 3D space from a single monocular video and has witnessed rapid progress in recent years. However, we argue that the existing monocular 3D tracking methods still fall short in separating the camera motion from foreground dynamic motion and cannot densely track newly emerging dynamic subjects in the videos. To address these two limitations, we propose TrackingWorld, a novel pipeline for dense 3D tracking of almost all pixels within a world-centric 3D coordinate system. First, we introduce a tracking upsampler that efficiently lifts the arbitrary sparse 2D tracks into dense 2D tracks. Then, to generalize the current tracking methods to newly emerging objects, we apply the upsampler to all frames and reduce the redundancy of 2D tracks by eliminating the tracks in overlapped regions. Finally, we present an efficient optimization-based framework to back-project dense 2D tracks into world-centric 3D trajectories by estimating the camera poses and the 3D coordinates of these 2D tracks. Extensive evaluations on both synthetic and real-world datasets demonstrate that our system achieves accurate and dense 3D tracking in a world-centric coordinate frame.
zh
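The final back-projection step has a simple closed form once camera poses and per-point depths are available; the sketch below lifts one dense 2D track into world coordinates (the pose and depth inputs are assumed to come from an upstream optimization stage like the paper's).

```python
import numpy as np

def backproject_track(track_uv, depths, K, cam_to_world):
    """Lift a 2D track to world-centric 3D points (illustrative).

    track_uv:     (T, 2) pixel positions over T frames
    depths:       (T,) per-frame depth of the tracked point
    K:            3x3 camera intrinsics
    cam_to_world: list of T 4x4 camera-to-world poses
    """
    K_inv = np.linalg.inv(K)
    points_world = []
    for (u, v), z, T_cw in zip(track_uv, depths, cam_to_world):
        ray = K_inv @ np.array([u, v, 1.0])        # pixel -> camera ray
        p_cam = np.append(ray * z, 1.0)            # scale by depth, homogenize
        points_world.append((T_cw @ p_cam)[:3])    # camera -> world frame
    return np.stack(points_world)                  # (T, 3) world trajectory
```

Expressing every track in the shared world frame is what lets camera motion be factored out of the per-pixel dynamics.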
[CV-71] DINO-BOLDNet: A DINOv3-Guided Multi-Slice Attention Network for T1-to-BOLD Generation
【速读】: This paper addresses information loss in fMRI BOLD images caused by noise, artifacts, or missing data, which harms downstream tasks. The key to the solution is DINO-BOLDNet, a DINOv3-guided multi-slice attention framework: a frozen self-supervised DINOv3 encoder extracts within-slice structural features, a separate slice-attention module fuses contextual information from neighboring slices, and a multi-scale generation decoder restores fine functional contrast, while a perceptual loss in DINO feature space enforces structural and textural consistency between predictions and ground-truth BOLD. To the authors' knowledge this is the first framework to generate mean BOLD images directly from T1w images, demonstrating the potential of self-supervised transformer guidance for structural-to-functional mapping.
链接: https://arxiv.org/abs/2512.08337
作者: Jianwei Wang,Qing Wang,Menglan Ruan,Rongjun Ge,Chunfeng Yang,Yang Chen,Chunming Xie
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Generating BOLD images from T1w images offers a promising solution for recovering missing BOLD information and enabling downstream tasks when BOLD images are corrupted or unavailable. Motivated by this, we propose DINO-BOLDNet, a DINOv3-guided multi-slice attention framework that integrates a frozen self-supervised DINOv3 encoder with a lightweight trainable decoder. The model uses DINOv3 to extract within-slice structural representations, and a separate slice-attention module to fuse contextual information across neighboring slices. A multi-scale generation decoder then restores fine-grained functional contrast, while a DINO-based perceptual loss encourages structural and textural consistency between predictions and ground-truth BOLD in the transformer feature space. Experiments on a clinical dataset of 248 subjects show that DINO-BOLDNet surpasses a conditional GAN baseline in both PSNR and MS-SSIM. To our knowledge, this is the first framework capable of generating mean BOLD images directly from T1w images, highlighting the potential of self-supervised transformer guidance for structural-to-functional mapping.
zh
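The DINO-based perceptual loss amounts to comparing prediction and ground truth in a frozen encoder's feature space; here is a generic sketch, with any frozen encoder standing in for DINOv3 (its parameters are assumed to have requires_grad=False).

```python
import torch
import torch.nn.functional as F

def frozen_feature_perceptual_loss(encoder, pred, target):
    """Perceptual loss in a frozen self-supervised encoder's feature space,
    a stand-in for the paper's DINO-based loss. Gradients flow only through
    `pred`; the target features are computed without grad.
    """
    with torch.no_grad():
        feat_target = encoder(target)     # reference features, no grad
    feat_pred = encoder(pred)             # prediction stays differentiable
    return F.l1_loss(feat_pred, feat_target)
```

In training this term would be weighted against a pixel-level reconstruction loss, trading exact intensities for perceptual/structural agreement.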
[CV-72] HybridSplat: Fast Reflection-baked Gaussian Tracing using Hybrid Splatting
【速读】: This paper addresses the slow rendering and high memory footprint of 3D Gaussian splatting when rendering the complex reflections of real-world scenes. The key to the solution is a new Hybrid Splatting mechanism (HybridSplat) whose core is reflection-baked Gaussian tracing: during rendering, view-dependent reflection is baked into each Gaussian primitive and rendered with tile-based Gaussian splatting; the reflective primitives are then fused with base primitives in a unified hybrid splatting framework for high-fidelity scene reconstruction. Together with pipeline-level acceleration and reflection-sensitive Gaussian pruning, the method uses 4x fewer Gaussian primitives than comparable ray-tracing-based splatting baselines while rendering about 7x faster, without sacrificing reflection quality.
链接: https://arxiv.org/abs/2512.08334
作者: Chang Liu,Hongliang Yuan,Lianghao Zhang,Sichao Wang,Jianwei Guo,Shi-Sheng Huang
机构: Beijing Normal University (北京师范大学); Xiaomi Inc (小米公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Rendering the complex reflections of real-world scenes using 3D Gaussian splatting is a promising route to photorealistic novel view synthesis, but it still faces bottlenecks, especially in rendering speed and memory storage. This paper proposes a new Hybrid Splatting (HybridSplat) mechanism for Gaussian primitives. Our key idea is a new reflection-baked Gaussian tracing, which bakes the view-dependent reflection within each Gaussian primitive while rendering the reflection using tile-based Gaussian splatting. Then we integrate the reflective Gaussian primitives with base Gaussian primitives using a unified hybrid splatting framework for high-fidelity scene reconstruction. Moreover, we further introduce a pipeline-level acceleration for the hybrid splatting, and reflection-sensitive Gaussian pruning to reduce the model size, thus achieving much faster rendering speed and lower memory storage while preserving the reflection rendering quality. In extensive evaluation, our HybridSplat renders about 7x faster across complex reflective scenes from Ref-NeRF and NeRF-Casting with 4x fewer Gaussian primitives than similar ray-tracing based Gaussian splatting baselines, serving as a new state-of-the-art method especially for complex reflective scenes.
zh
[CV-73] Bi2MAC: Bimodal Bi-Adaptive Mask-Aware Convolution for Remote Sensing Pansharpening
【速读】: This paper addresses the inability of conventional deep-learning fusion methods to adapt to regional heterogeneity in the feature representations of remote sensing images; existing adaptive convolutions attempt this but suffer from excessive computational cost and limited modeling of heterogeneous regions. The key to the solution is Bimodal Bi-Adaptive Mask-Aware Convolution (Bi²MAC): a lightweight module generates a soft mask and a hard mask, which respectively pre-modulate the input features and route different region types into separate processing branches — redundant features go to a compact branch for low-cost global processing, while heterogeneous features are routed to a focused branch that invests more computation in fine-grained modeling, achieving efficient, region-differentiated processing.
链接: https://arxiv.org/abs/2512.08331
作者: Xianghong Xiao,Zeyu Xia,Zhou Fei,Jinliang Xiao,Haorui Chen,Liangjian Deng
机构: UESTC(电子科技大学); Tongji University(同济大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Pansharpening aims to fuse a high-resolution panchromatic (PAN) image with a low-resolution multispectral (LRMS) image to generate a high-resolution multispectral image (HRMS). Conventional deep learning-based methods are inherently limited in their ability to adapt to regional heterogeneity within feature representations. Although various adaptive convolution methods have been proposed to address this limitation, they often suffer from excessive computational costs and a limited ability to capture heterogeneous regions in remote sensing images effectively. To overcome these challenges, we propose Bimodal Bi-Adaptive Mask-Aware Convolution (Bi^2MAC), which effectively exploits information from different types of regions while intelligently allocating computational resources. Specifically, we design a lightweight module to generate both soft and hard masks, which are used to modulate the input features preliminarily and to guide different types of regions into separate processing branches, respectively. Redundant features are directed to a compact branch for low-cost global processing. In contrast, heterogeneous features are routed to a focused branch that invests more computational resources for fine-grained modeling. Extensive experiments on multiple benchmark datasets demonstrate that Bi^2MAC achieves state-of-the-art (SOTA) performance while requiring substantially lower training time and parameter counts, and the minimal computational cost among adaptive convolution models.
zh
[CV-74] PointDico: Contrastive 3D Representation Learning Guided by Diffusion Models IJCNN2025
【速读】: This paper addresses the difficulty of self-supervised representation learning on 3D point clouds: existing contrastive methods tend to overfit, while mask autoencoders struggle with unordered point clouds. The key to the solution is PointDico, a model that combines the merits of diffusion and contrastive models through knowledge distillation, with the diffusion model acting as a teacher guiding the contrastive model's learning; a hierarchical pyramid conditional generator extracts multi-scale geometric features, and a dual-channel design effectively integrates local and global context, achieving new SOTA results on benchmarks such as ScanObjectNN and ShapeNetPart.
链接: https://arxiv.org/abs/2512.08330
作者: Pengbo Li,Yiding Sun,Haozhe Cheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IJCNN 2025
Abstract:Self-supervised representation learning has shown significant improvement in Natural Language Processing and 2D Computer Vision. However, existing methods face difficulties in representing 3D data because of its unordered and uneven density. Through an in-depth analysis of mainstream contrastive and generative approaches, we find that contrastive models tend to suffer from overfitting, while 3D Mask Autoencoders struggle to handle unordered point clouds. This motivates us to learn 3D representations by sharing the merits of diffusion and contrast models, which is non-trivial due to the pattern difference between the two paradigms. In this paper, we propose PointDico, a novel model that seamlessly integrates these methods. PointDico learns from both denoising generative modeling and cross-modal contrastive learning through knowledge distillation, where the diffusion model serves as a guide for the contrastive model. We introduce a hierarchical pyramid conditional generator for multi-scale geometric feature extraction and employ a dual-channel design to effectively integrate local and global contextual information. PointDico achieves a new state-of-the-art in 3D representation learning, e.g., 94.32% accuracy on ScanObjectNN, 86.5% Inst. mIoU on ShapeNetPart.
zh
[CV-75] Interpreting Structured Perturbations in Image Protection Methods for Diffusion Models
【速读】: This paper addresses the poorly understood internal structure, detectability, and representational behavior of the perturbations introduced by image protection mechanisms such as Glaze and Nightshade, a gap that limits our understanding and design of protections around generative AI. The key to the solution is a unified explainable-AI analysis framework integrating white-box feature-space inspection with black-box signal-level probing: latent-space clustering, feature-channel activation analysis, occlusion-based spatial sensitivity mapping, and frequency-domain characterization together show that protection perturbations are low-entropy and structured, tightly coupled to image content across representational, spatial, and spectral domains, rather than causing global representational drift. This indicates that protection signals, while visually subtle, remain stably detectable, informing future defense design and detection strategies for generative AI systems.
链接: https://arxiv.org/abs/2512.08329
作者: Michael R. Martin,Garrick Chan,Kwan-Liu Ma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 32 pages, 17 figures, 1 table, 5 algorithms, preprint
Abstract:Recent image protection mechanisms such as Glaze and Nightshade introduce imperceptible, adversarially designed perturbations intended to disrupt downstream text-to-image generative models. While their empirical effectiveness is known, the internal structure, detectability, and representational behavior of these perturbations remain poorly understood. This study provides a systematic, explainable AI analysis using a unified framework that integrates white-box feature-space inspection and black-box signal-level probing. Through latent-space clustering, feature-channel activation analysis, occlusion-based spatial sensitivity mapping, and frequency-domain characterization, we show that protection mechanisms operate as structured, low-entropy perturbations tightly coupled to underlying image content across representational, spatial, and spectral domains. Protected images preserve content-driven feature organization with protection-specific substructure rather than inducing global representational drift. Detectability is governed by interacting effects of perturbation entropy, spatial deployment, and frequency alignment, with sequential protection amplifying detectable structure rather than suppressing it. Frequency-domain analysis shows that Glaze and Nightshade redistribute energy along dominant image-aligned frequency axes rather than introducing diffuse noise. These findings indicate that contemporary image protection operates through structured feature-level deformation rather than semantic dislocation, explaining why protection signals remain visually subtle yet consistently detectable. This work advances the interpretability of adversarial image protection and informs the design of future defenses and detection strategies for generative AI systems.
zh
[CV-76] Low Rank Support Quaternion Matrix Machine
【速读】: This paper addresses the insufficient modeling of RGB three-channel information in traditional color-image classification, in particular the failure to preserve the strong coupling among color channels. The key to the solution is the Low-rank Support Quaternion Matrix Machine (LSQMM), which treats the RGB channels as pure quaternions so that the quaternion algebra fully retains inter-channel relationships, and adds a quaternion nuclear norm regularizer, the natural extension of the matrix nuclear norm to the quaternion domain, to promote the low-rank structure arising from strongly correlated color channels. An ADMM-based iterative algorithm solves the resulting optimization model efficiently, yielding clear gains in classification accuracy, robustness, and computational efficiency.
链接: https://arxiv.org/abs/2512.08327
作者: Wang Chen,Ziyan Luo,Shuangyue Wang
机构: Beijing Jiaotong University (北京交通大学); Henan Agricultural University (河南农业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
备注:
Abstract:For color image classification, input features are conventionally represented as vectors, matrices, or third-order tensors over the real field. Inspired by the success of quaternion data modeling for color images in image recovery and denoising tasks, we propose a novel classification method for color image classification, named the Low-rank Support Quaternion Matrix Machine (LSQMM), in which the RGB channels are treated as pure quaternions to effectively preserve the intrinsic coupling relationships among channels via the quaternion algebra. For the purpose of promoting low-rank structures resulting from strongly correlated color channels, a quaternion nuclear norm regularization term, serving as a natural extension of the conventional matrix nuclear norm to the quaternion domain, is added to the hinge loss in our LSQMM model. An Alternating Direction Method of Multipliers (ADMM)-based iterative algorithm is designed to efficiently solve the proposed quaternion optimization model. Experimental results on multiple color image classification datasets demonstrate that our proposed classification approach exhibits advantages in classification accuracy, robustness and computational efficiency, compared to several state-of-the-art methods using support vector machines, support matrix machines, and support tensor machines.
zh
[CV-77] GeoDiffMM: Geometry-Guided Conditional Diffusion for Motion Magnification
【速读】: This paper addresses the difficulty existing Video Motion Magnification (VMM) methods have in separating photon noise from true micro-motion at very small displacements; mainstream Eulerian approaches rely on decoupled representation learning (texture, shape, and frequency schemes) and suppress noise insufficiently. The key to the solution is GeoDiffMM, a diffusion-based Lagrangian VMM framework that conditions on optical flow as a geometric prior to achieve structurally consistent magnification: a noise-free optical flow augmentation strategy synthesizes non-rigid motion fields free of photon noise to supervise more accurate geometry-aware flow; a conditional denoising diffusion motion magnifier takes the flow as a geometric prior with a learnable magnification factor controlling amplitude, selectively amplifying motion components consistent with scene semantics and structure while suppressing content-irrelevant perturbations; finally, flow-based video synthesis maps the amplified motion back to the image domain with high fidelity.
链接: https://arxiv.org/abs/2512.08325
作者: Xuedeng Liu,Jiabao Guo,Zheng Zhang,Fei Wang,Zhi Liu,Dan Guo
机构: Hefei University of Technology (合肥工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Video Motion Magnification (VMM) amplifies subtle macroscopic motions to a perceptible level. Recently, existing mainstream Eulerian approaches address amplification-induced noise via decoupled representation learning such as texture, shape and frequency schemes, but they still struggle to separate photon noise from true micro-motion when motion displacements are very small. We propose GeoDiffMM, a novel diffusion-based Lagrangian VMM framework conditioned on optical flow as a geometric cue, enabling structurally consistent motion magnification. Specifically, we design a Noise-free Optical Flow Augmentation strategy that synthesizes diverse nonrigid motion fields without photon noise as supervision, helping the model learn more accurate geometry-aware optical flow and generalize better. Next, we develop a Diffusion Motion Magnifier that conditions the denoising process on (i) optical flow as a geometry prior and (ii) a learnable magnification factor controlling magnitude, thereby selectively amplifying motion components consistent with scene semantics and structure while suppressing content-irrelevant perturbations. Finally, we perform Flow-based Video Synthesis to map the amplified motion back to the image domain with high fidelity. Extensive experiments on real and synthetic datasets show that GeoDiffMM outperforms state-of-the-art methods and significantly improves motion magnification.
zh
[CV-78] Detecting Dental Landmarks from Intraoral 3D Scans: the 3DTeethLand challenge MICCAI2024
【速读】: This paper addresses precise 3D tooth landmark detection for clinical orthodontics, a capability important for personalized treatment planning and monitoring of treatment progress, whose core difficulty lies in the intricate geometry of individual teeth and the substantial variation across individuals. The key to the solution is advancing deep-learning-based detection through the 3DTeethLand challenge, held with MICCAI 2024, which released the first publicly available dataset for 3D tooth landmark detection, giving the community a benchmark for developing more accurate and reliable automated methods with direct clinical relevance.
链接: https://arxiv.org/abs/2512.08323
作者: Achraf Ben-Hamadou,Nour Neifar,Ahmed Rekik,Oussama Smaoui,Firas Bouzguenda,Sergi Pujades,Niels van Nistelrooij,Shankeeth Vinayahalingam,Kaibo Shi,Hairong Jin,Youyi Zheng,Tibor Kubík,Oldřich Kodym,Petr Šilling,Kateřina Trávníčková,Tomáš Mojžiš,Jan Matula,Jeffry Hartanto,Xiaoying Zhu,Kim-Ngan Nguyen,Tudor Dascalu,Huikai Wu,and Weijie Liu,Shaojie Zhuang,Guangshun Wei,Yuanfeng Zhou
机构: Centre de Recherche en Numérique de Sfax (数字研究中心); Laboratory of Signals, Systems, Artificial Intelligence and Networks (信号、系统、人工智能与网络实验室); Technopôle de Sfax (萨法克斯技术园区); Inria (法国国家信息与自动化研究院); Univ. Grenoble Alpes (格勒诺布尔阿尔卑斯大学); CNRS (法国国家科学研究中心); Grenoble INP (格勒诺布尔综合理工学院); LJK (利莫日数学实验室); Udini (Udini)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: MICCAI 2024, 3DTeethLand, Challenge report, under review
Abstract:Teeth landmark detection is a critical task in modern clinical orthodontics. Their precise identification enables advanced diagnostics, facilitates personalized treatment strategies, and supports more effective monitoring of treatment progress in clinical dentistry. However, several significant challenges may arise due to the intricate geometry of individual teeth and the substantial variations observed across different individuals. To address these complexities, the development of advanced techniques, especially through the application of deep learning, is essential for the precise and reliable detection of 3D tooth landmarks. In this context, the 3DTeethLand challenge was held in collaboration with the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI) in 2024, calling for algorithms focused on teeth landmark detection from intraoral 3D scans. This challenge introduced the first publicly available dataset for 3D teeth landmark detection, offering a valuable resource to assess state-of-the-art methods on this task and to encourage the community to provide methodological contributions towards the resolution of this problem, which has significant clinical implications.
zh
[CV-79] GeoDM: Geometry-aware Distribution Matching for Dataset Distillation
【速读】: This paper addresses the limitation of existing dataset distillation methods that operate only in Euclidean space and thus miss the intrinsic geometric structure of high-dimensional data, in particular the fact that real data often lie on low-dimensional manifolds. The key to the solution is GeoDM, a geometry-aware distribution-matching framework operating in the Cartesian product of Euclidean, hyperbolic, and spherical manifolds, which jointly captures flat, hierarchical, and cyclical structures in a unified representation; learnable curvature and weight parameters adapt to the underlying geometry, and an optimal transport loss improves distribution fidelity, so the distilled data align more closely with the manifold structure of the original data.
链接: https://arxiv.org/abs/2512.08317
作者: Xuhui Li,Zhengquan Luo,Zihui Cui,Zhiqiang Xu
机构: Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Dataset distillation aims to synthesize a compact subset of the original data, enabling models trained on it to achieve performance comparable to those trained on the original large dataset. Existing distribution-matching methods are confined to Euclidean spaces, making them only capture linear structures and overlook the intrinsic geometry of real data, e.g., curvature. However, high-dimensional data often lie on low-dimensional manifolds, suggesting that dataset distillation should have the distilled data manifold aligned with the original data manifold. In this work, we propose a geometry-aware distribution-matching framework, called GeoDM, which operates in the Cartesian product of Euclidean, hyperbolic, and spherical manifolds, with flat, hierarchical, and cyclical structures all captured by a unified representation. To adapt to the underlying data geometry, we introduce learnable curvature and weight parameters for the three kinds of geometries. At the same time, we design an optimal transport loss to enhance the distribution fidelity. Our theoretical analysis shows that the geometry-aware distribution matching in a product space yields a smaller generalization error bound than the Euclidean counterparts. Extensive experiments conducted on standard benchmarks demonstrate that our algorithm outperforms state-of-the-art data distillation methods and remains effective across various distribution-matching strategies for single geometries.
zh
[CV-80] Terrain Diffusion: A Diffusion-Based Successor to Perlin Noise in Infinite Real-Time Terrain Generation
【速读】: This paper addresses the limitations of classic procedural noise functions (such as Perlin noise) for terrain generation: they are fast and infinite but fall short in realism and large-scale coherence. The key to the solution is Terrain Diffusion, a paradigm that marries the fidelity of diffusion models with the properties that made procedural noise indispensable. Its core is the InfiniteDiffusion algorithm, which enables seamless real-time synthesis over unbounded extents; a hierarchical stack of diffusion models combined with a compact Laplacian encoding keeps outputs stable across Earth-scale dynamic ranges; and an open-source infinite-tensor framework plus few-step consistency distillation make diffusion models a practical foundation for procedural world generation, able to synthesize coherent, controllable terrain for entire planets without limits.
链接: https://arxiv.org/abs/2512.08309
作者: Alexander Goslin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
备注: Project website: this https URL Code: this https URL
Abstract:For decades, procedural worlds have been built on procedural noise functions such as Perlin noise, which are fast and infinite, yet fundamentally limited in realism and large-scale coherence. We introduce Terrain Diffusion, an AI-era successor to Perlin noise that bridges the fidelity of diffusion models with the properties that made procedural noise indispensable: seamless infinite extent, seed-consistency, and constant-time random access. At its core is InfiniteDiffusion, a novel algorithm for infinite generation, enabling seamless, real-time synthesis of boundless landscapes. A hierarchical stack of diffusion models couples planetary context with local detail, while a compact Laplacian encoding stabilizes outputs across Earth-scale dynamic ranges. An open-source infinite-tensor framework supports constant-memory manipulation of unbounded tensors, and few-step consistency distillation enables efficient generation. Together, these components establish diffusion models as a practical foundation for procedural world generation, capable of synthesizing entire planets coherently, controllably, and without limits.
zh
[CV-81] OpenSubject: Leveraging Video-Derived Identity and Diversity Priors for Subject-driven Image Generation and Manipulation
【速读】:该论文旨在解决当前主体驱动图像生成模型在复杂场景中难以保持参考主体身份一致性的问题,尤其是在多主体场景下容易出现身份偏差。解决方案的关键在于构建一个大规模、高质量的视频衍生数据集 OpenSubject,其包含 250 万样本和 435 万张图像,并通过四阶段流程实现高保真主体信息保留:首先进行视频筛选以确保图像质量;其次利用视觉-语言模型(VLM)进行跨帧主体挖掘与配对;再次引入基于分割图引导的外绘(outpainting)和框引导的内补(inpainting)技术合成参考图像,辅以几何感知增强和不规则边界侵蚀策略以提升鲁棒性;最后通过 VLM 验证并生成短/长描述。实验表明,基于该数据集训练可显著提升生成与操控任务在复杂场景下的身份保真度和一致性表现。
链接: https://arxiv.org/abs/2512.08294
作者: Yexin Liu,Manyuan Zhang,Yueze Wang,Hongyu Li,Dian Zheng,Weiming Zhang,Changsheng Lu,Xunliang Cai,Yan Feng,Peng Pei,Harry Yang
机构: HKUST(香港科技大学); Meituan(美团); BAAI(北京智源人工智能研究院); HKUST(GZ)(香港科技大学(广州))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Despite the promising progress in subject-driven image generation, current models often deviate from the reference identities and struggle in complex scenes with multiple subjects. To address this challenge, we introduce OpenSubject, a video-derived large-scale corpus with 2.5M samples and 4.35M images for subject-driven generation and manipulation. The dataset is built with a four-stage pipeline that exploits cross-frame identity priors. (i) Video Curation. We apply resolution and aesthetic filtering to obtain high-quality clips. (ii) Cross-Frame Subject Mining and Pairing. We utilize vision-language model (VLM)-based category consensus, local grounding, and diversity-aware pairing to select image pairs. (iii) Identity-Preserving Reference Image Synthesis. We introduce segmentation map-guided outpainting to synthesize the input images for subject-driven generation and box-guided inpainting to generate input images for subject-driven manipulation, together with geometry-aware augmentations and irregular boundary erosion. (iv) Verification and Captioning. We utilize a VLM to validate synthesized samples, re-synthesize failed samples based on stage (iii), and then construct short and long captions. In addition, we introduce a benchmark covering subject-driven generation and manipulation, and then evaluate identity fidelity, prompt adherence, manipulation consistency, and background consistency with a VLM judge. Extensive experiments show that training with OpenSubject improves generation and manipulation performance, particularly in complex scenes.
zh
[CV-82] PAVAS: Physics-Aware Video-to-Audio Synthesis
【速读】:该论文旨在解决当前视频到音频(Video-to-Audio, V2A)生成模型普遍依赖视觉特征、缺乏对物理规律建模的问题,从而导致合成声音在物理合理性上不足。现有方法虽能实现良好的感知质量和时序同步,但未考虑真实世界中声音由物体质量、运动轨迹等物理因素决定的本质特性。解决方案的关键在于提出Physics-Aware Video-to-Audio Synthesis (PAVAS),其核心是引入物理驱动的音频适配器(Physics-Driven Audio Adapter, Phy-Adapter),该模块通过物理参数估计器(Physical Parameter Estimator, PPE)获取对象级物理信息,包括利用视觉语言模型(Vision-Language Model, VLM)推断物体质量,以及基于分割的动态三维重建模块计算物体运动轨迹以获得速度。这些物理线索被整合进潜空间扩散模型中,使生成音频能够反映底层物理机制,显著提升声学与物理属性的一致性。
链接: https://arxiv.org/abs/2512.08282
作者: Oh Hyun-Bin,Yuhta Takida,Toshimitsu Uesaka,Tae-Hyun Oh,Yuki Mitsufuji
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD)
备注:
Abstract:Recent advances in Video-to-Audio (V2A) generation have achieved impressive perceptual quality and temporal synchronization, yet most models remain appearance-driven, capturing visual-acoustic correlations without considering the physical factors that shape real-world sounds. We present Physics-Aware Video-to-Audio Synthesis (PAVAS), a method that incorporates physical reasoning into a latent diffusion-based V2A generation through the Physics-Driven Audio Adapter (Phy-Adapter). The adapter receives object-level physical parameters estimated by the Physical Parameter Estimator (PPE), which uses a Vision-Language Model (VLM) to infer the moving-object mass and a segmentation-based dynamic 3D reconstruction module to recover its motion trajectory for velocity computation. These physical cues enable the model to synthesize sounds that reflect underlying physical factors. To assess physical realism, we curate VGG-Impact, a benchmark focusing on object-object interactions, and introduce Audio-Physics Correlation Coefficient (APCC), an evaluation metric that measures consistency between physical and auditory attributes. Comprehensive experiments show that PAVAS produces physically plausible and perceptually coherent audio, outperforming existing V2A models in both quantitative and qualitative evaluations. Visit this https URL for demo videos.
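PPE 中"由重建轨迹求速度"这一步可以用有限差分直观示意(插值方式与示例数据均为本文假设,非论文实现):

```python
import numpy as np

def velocity_from_trajectory(points_xyz, fps):
    # 由 3D 质心轨迹 (T,3) 与帧率估计每帧速度向量与速率 (m/s)
    v = np.gradient(points_xyz, 1.0 / fps, axis=0)  # 中心差分, 端点单侧
    return v, np.linalg.norm(v, axis=1)

# 假设某物体做自由落体, 30 fps, 约 0.5 s 后撞击地面
t = np.arange(0, 0.5, 1 / 30)
traj = np.stack([np.zeros_like(t), np.zeros_like(t),
                 -0.5 * 9.8 * t ** 2], axis=1)
v, speed = velocity_from_trajectory(traj, fps=30)
print("末帧速率 ≈ %.2f m/s" % speed[-1])   # 理论上约 9.8*t, 此处约 4.4
```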
zh
[CV-83] Zero-Splat TeleAssist: A Zero-Shot Pose Estimation Framework for Semantic Teleoperation ICRA2025
【速读】:该论文旨在解决多机器人远程操作中缺乏统一全局坐标系的问题,特别是在无标记(fiducial-free)且未部署深度传感器的场景下,如何实现多操作者对多个机器人位姿的实时感知与共享。解决方案的关键在于提出了一种零样本(zero-shot)传感器融合流程——Zero-Splat TeleAssist,其核心是结合视觉语言分割(vision-language segmentation)、单目深度估计(monocular depth)、加权主成分分析(weighted-PCA)姿态提取以及3D高斯点绘(3D Gaussian Splatting, 3DGS),从而从普通闭路电视(CCTV)视频流中构建一个共享的6自由度(6-DoF)世界模型,使每个操作者都能获得其他机器人在全局空间中的实时位置和朝向信息。
链接: https://arxiv.org/abs/2512.08271
作者: Srijan Dokania,Dharini Raghavan
机构: Khoury College of Computer Sciences at Northeastern University (东北大学计算机科学学院); Georgia Institute of Technology (佐治亚理工学院)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: Published and Presented at 3rd Workshop on Human-Centric Multilateral Teleoperation in ICRA 2025
Abstract:We introduce Zero-Splat TeleAssist, a zero-shot sensor-fusion pipeline that transforms commodity CCTV streams into a shared, 6-DoF world model for multilateral teleoperation. By integrating vision-language segmentation, monocular depth, weighted-PCA pose extraction, and 3D Gaussian Splatting (3DGS), TeleAssist provides every operator with real-time global positions and orientations of multiple robots without fiducials or depth sensors in an interaction-centric teleoperation setup.
zh
[CV-84] EgoX: Egocentric Video Generation from a Single Exocentric Video
【速读】:该论文旨在解决从单个第三人称(exocentric)视频生成第一人称(egocentric)视频的问题,该任务因极端相机位姿变化和视场重叠度低而极具挑战性,需在保持可见内容忠实性的同时,以几何一致的方式合成未观测区域。解决方案的关键在于提出EgoX框架:首先通过轻量级LoRA(Low-Rank Adaptation)适配大规模视频扩散模型的时空知识,其次引入统一的条件策略,通过通道和宽度维度拼接融合第三人称与第一人称先验信息,最后设计几何引导的自注意力机制,选择性关注空间相关区域,从而保障生成视频的几何一致性与高视觉保真度。
链接: https://arxiv.org/abs/2512.08269
作者: Taewoong Kang,Kinam Kim,Dohyeon Kim,Minho Park,Junha Hyung,Jaegul Choo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 21 pages, project page : this https URL
Abstract:Egocentric perception enables humans to experience and understand the world directly from their own point of view. Translating exocentric (third-person) videos into egocentric (first-person) videos opens up new possibilities for immersive understanding but remains highly challenging due to extreme camera pose variations and minimal view overlap. This task requires faithfully preserving visible content while synthesizing unseen regions in a geometrically consistent manner. To achieve this, we present EgoX, a novel framework for generating egocentric videos from a single exocentric input. EgoX leverages the pretrained spatio-temporal knowledge of large-scale video diffusion models through lightweight LoRA adaptation and introduces a unified conditioning strategy that combines exocentric and egocentric priors via width- and channel-wise concatenation. Additionally, a geometry-guided self-attention mechanism selectively attends to spatially relevant regions, ensuring geometric coherence and high visual fidelity. Our approach achieves coherent and realistic egocentric video generation while demonstrating strong scalability and robustness across unseen and in-the-wild videos.
zh
[CV-85] RLCNet: An end-to-end deep learning framework for simultaneous online calibration of LiDAR, RADAR and Camera
【速读】:该论文旨在解决自动驾驶车辆中LiDAR、RADAR与相机传感器的外参标定(extrinsic calibration)难题,尤其针对动态环境中因机械振动和累积漂移导致的标定不准确问题。解决方案的关键在于提出一种端到端可训练的深度学习框架RLCNet,能够实现多模态传感器的在线同步标定;其核心创新包括引入加权移动平均与异常值剔除机制,以支持实时运行并提升对参数漂移的鲁棒性,从而在复杂工况下保持高精度与稳定性。
链接: https://arxiv.org/abs/2512.08262
作者: Hafeez Husain Cholakkal,Stefano Arrigoni,Francesco Braghin
机构: Politecnico di Milano (米兰理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Accurate extrinsic calibration of LiDAR, RADAR, and camera sensors is essential for reliable perception in autonomous vehicles. Still, it remains challenging due to factors such as mechanical vibrations and cumulative sensor drift in dynamic environments. This paper presents RLCNet, a novel end-to-end trainable deep learning framework for the simultaneous online calibration of these multimodal sensors. Validated on real-world datasets, RLCNet is designed for practical deployment and demonstrates robust performance under diverse conditions. To support real-time operation, an online calibration framework is introduced that incorporates a weighted moving average and outlier rejection, enabling dynamic adjustment of calibration parameters with reduced prediction noise and improved resilience to drift. An ablation study highlights the significance of architectural choices, while comparisons with existing methods demonstrate the superior accuracy and robustness of the proposed approach.
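摘要中的"加权移动平均 + 异常值剔除"在线机制,可以用如下骨架说明(窗口长度、z 分数阈值等超参数为本文假设;外参简化为 6 维向量,即平移加欧拉角):

```python
import numpy as np
from collections import deque

class OnlineCalibSmoother:
    """对逐帧外参预测做异常值剔除 + 越新越重的加权移动平均。"""

    def __init__(self, window=10, z_thresh=3.0):
        self.buf = deque(maxlen=window)
        self.z_thresh = z_thresh

    def update(self, pred):
        pred = np.asarray(pred, dtype=float)
        if len(self.buf) >= 3:
            mu = np.mean(self.buf, axis=0)
            sigma = np.std(self.buf, axis=0) + 1e-6
            if np.any(np.abs(pred - mu) / sigma > self.z_thresh):
                return self.current()      # 剔除离群帧, 维持当前估计
        self.buf.append(pred)
        return self.current()

    def current(self):
        w = np.arange(1, len(self.buf) + 1, dtype=float)  # 线性递增权重
        return np.average(np.array(self.buf), axis=0, weights=w)

smoother = OnlineCalibSmoother()
est = None
for t in range(50):
    pred = 0.01 * np.random.randn(6)       # 围绕真值 0 的逐帧预测
    if t == 25:
        pred += 5.0                        # 注入一帧离群预测
    est = smoother.update(pred)
print("最终估计:", np.round(est, 3))       # 离群帧被拒绝, 估计保持平稳
```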
zh
[CV-86] SFP: Real-World Scene Recovery Using Spatial and Frequency Priors
【速读】:该论文旨在解决真实场景下多退化类型(如散射、模糊等)导致的图像恢复难题,现有方法通常依赖单一先验或基于合成数据训练的复杂网络架构,难以适应多样化的实际应用场景。其解决方案的关键在于提出空间域与频域联合先验(Spatial and Frequency Priors, SFP):在空间域,利用退化图像的逆向投影特性估计场景透射图以恢复散射退化;在频域,构建自适应频率增强掩码,通过两个新提出的先验估计关键参数——一是退化图像各通道直流(DC)分量均值近似于清晰图像对应通道均值,二是清晰图像低径向频率(<0.001)能量占比约为总谱能的1%。最终采用加权融合策略整合空间域恢复、频域增强及输入图像显著特征,实现对多种退化条件下的鲁棒恢复。
链接: https://arxiv.org/abs/2512.08254
作者: Yun Liu,Tao Li,Cosmin Ancuti,Wenqi Ren,Weisi Lin
机构: Southwest University (西南大学); Universitatea Politehnica Timisoara (蒂米什瓦拉理工大学); Sun Yat-sen University (中山大学); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 13 figures
Abstract:Scene recovery serves as a critical task for various computer vision applications. Existing methods typically rely on a single prior, which is inherently insufficient to handle multiple degradations, or employ complex network architectures trained on synthetic data, which suffer from poor generalization for diverse real-world scenarios. In this paper, we propose Spatial and Frequency Priors (SFP) for real-world scene recovery. In the spatial domain, we observe that the inverse of the degraded image exhibits a projection along its spectral direction that resembles the scene transmission. Leveraging this spatial prior, the transmission map is estimated to recover the scene from scattering degradation. In the frequency domain, a mask is constructed for adaptive frequency enhancement, with two parameters estimated using our proposed novel priors. Specifically, one prior assumes that the mean intensity of the degraded image’s direct current (DC) components across three channels in the frequency domain closely approximates that of each channel in the clear image. The second prior is based on the observation that, for clear images, the magnitude of low radial frequencies below 0.001 constitutes approximately 1% of the total spectrum. Finally, we design a weighted fusion strategy to integrate spatial-domain restoration, frequency-domain enhancement, and salient features from the input image, yielding the final recovered result. Extensive evaluations demonstrate the effectiveness and superiority of our proposed SFP for scene recovery under various degradation conditions.
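摘要中的两条频域先验可以直接用 FFT 验证或使用:其一是 DC 分量(即空间均值)的通道匹配,其二是归一化径向频率低于 0.001 的幅值占比约 1%。下面是一个 NumPy 示意(归一化与灰度化方式为本文假设;对小尺寸图像, r<0.001 实际只覆盖 DC 附近极少数频点):

```python
import numpy as np

def dc_means(img):
    # 各通道 DC 分量即空间均值 (先验一的匹配对象)
    return img.reshape(-1, img.shape[-1]).mean(axis=0)

def low_radial_energy_ratio(img, r_thresh=1e-3):
    # 先验二: 归一化径向频率 < r_thresh 的幅值占总谱的比例
    gray = img.mean(axis=-1)
    F = np.fft.fftshift(np.fft.fft2(gray))
    fy = np.fft.fftshift(np.fft.fftfreq(gray.shape[0]))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(gray.shape[1]))[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    mag = np.abs(F)
    return mag[radius < r_thresh].sum() / mag.sum()

img = np.random.rand(256, 256, 3)     # 以随机图演示接口
print("各通道 DC 均值:", np.round(dc_means(img), 3))
print("低径向频率幅值占比:", low_radial_energy_ratio(img))
```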
zh
[CV-87] Query-aware Hub Prototype Learning for Few-Shot 3D Point Cloud Semantic Segmentation
【速读】:该论文旨在解决少样本3D点云语义分割(Few-shot 3D Point Cloud Semantic Segmentation, FS-3DSeg)中因原型偏差(prototype bias)导致的性能下降问题。现有基于度量的原型学习方法仅从支持集(support set)生成原型,未考虑其与查询集(query set)之间的语义相关性,在分布偏移(distribution shift)场景下易过拟合支持集特征,从而影响泛化能力。解决方案的关键在于提出一种查询感知枢纽原型(Query-aware Hub Prototype, QHP)学习方法:首先设计枢纽原型生成(Hub Prototype Generation, HPG)模块,通过构建支持与查询点之间的二分图结构识别高频连接的支持枢纽,并生成与查询相关的原型以增强跨集合语义建模;其次引入原型分布优化(Prototype Distribution Optimization, PDO)模块,利用纯度加权对比损失对原型表示进行精炼,将劣质枢纽和边界模糊原型拉近至对应类别中心,有效缓解原型偏差并缩小原型与查询集间的语义鸿沟。
链接: https://arxiv.org/abs/2512.08253
作者: YiLin Zhou,Lili Wei,Zheming Xu,Ziyi Chen,Congyan Lang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Few-shot 3D point cloud semantic segmentation (FS-3DSeg) aims to segment novel classes with only a few labeled samples. However, existing metric-based prototype learning methods generate prototypes solely from the support set, without considering their relevance to query data. This often results in prototype bias, where prototypes overfit support-specific characteristics and fail to generalize to the query distribution, especially in the presence of distribution shifts, which leads to degraded segmentation performance. To address this issue, we propose a novel Query-aware Hub Prototype (QHP) learning method that explicitly models semantic correlations between support and query sets. Specifically, we propose a Hub Prototype Generation (HPG) module that constructs a bipartite graph connecting query and support points, identifies frequently linked support hubs, and generates query-relevant prototypes that better capture cross-set semantics. To further mitigate the influence of bad hubs and ambiguous prototypes near class boundaries, we introduce a Prototype Distribution Optimization (PDO) module, which employs a purity-reweighted contrastive loss to refine prototype representations by pulling bad hubs and outlier prototypes closer to their corresponding class centers. Extensive experiments on S3DIS and ScanNet demonstrate that QHP achieves substantial performance gains over state-of-the-art methods, effectively narrowing the semantic gap between prototypes and query sets in FS-3DSeg.
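HPG 模块中"二分图 + 高频连接枢纽"的核心思想可以用 kNN 计数来示意:每个查询点连向它最相似的 k 个支持点,入度最高的支持点即枢纽,可据此聚合出与查询相关的原型(k、枢纽比例均为本文假设的超参数):

```python
import numpy as np

def find_support_hubs(query_feats, support_feats, k=3, hub_ratio=0.2):
    # 构建查询->支持的 kNN 二分图, 统计支持点入度, 取 top 比例为枢纽
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    s = support_feats / np.linalg.norm(support_feats, axis=1, keepdims=True)
    sim = q @ s.T                               # 余弦相似度 (Nq, Ns)
    knn = np.argsort(-sim, axis=1)[:, :k]       # 每个查询点的 k 近邻
    indegree = np.bincount(knn.ravel(), minlength=len(s))
    hubs = np.argsort(-indegree)[:max(1, int(hub_ratio * len(s)))]
    return hubs, indegree

rng = np.random.default_rng(0)
support = rng.normal(size=(20, 32))             # 20 个支持点特征
query = support[rng.integers(0, 20, 100)] + 0.1 * rng.normal(size=(100, 32))
hubs, deg = find_support_hubs(query, support)
print("枢纽支持点:", hubs, "对应入度:", deg[hubs])
# 查询相关原型可取枢纽特征的 (按入度加权的) 平均
proto = np.average(support[hubs], axis=0, weights=deg[hubs])
```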
zh
[CV-88] Distilling Future Temporal Knowledge with Masked Feature Reconstruction for 3D Object Detection AAAI-26
【速读】:该论文旨在解决当前知识蒸馏(Knowledge Distillation, KD)方法在相机-based时序3D目标检测中无法有效利用未来帧信息的问题。现有KD方法通常局限于空间特征蒸馏或时序关系蒸馏,且依赖严格的帧对齐,导致在线学生模型难以学习到教师模型中蕴含的未来帧知识。解决方案的关键在于提出一种稀疏查询驱动的未来时序知识蒸馏(Future Temporal Knowledge Distillation, FTKD)框架:首先设计了一种未来感知的特征重建策略,使学生模型能够在不依赖严格帧对齐的情况下捕捉未来特征;其次引入未来引导的logit蒸馏机制,利用教师模型稳定的前景与背景上下文增强学生模型的预测鲁棒性。该方法在nuScenes数据集上实现了最高达1.3 mAP和1.3 NDS的性能提升,同时保持推理成本不变。
链接: https://arxiv.org/abs/2512.08247
作者: Haowen Zheng,Hu Zhu,Lu Deng,Weihao Gu,Yang Yang,Yanyan Liang
机构: HAOMO.AI Technology; University of Science and Technology of China (中国科学技术大学); Tsinghua University (清华大学); Beijing Institute of Technology (北京理工大学); Zhejiang University (浙江大学); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: AAAI-26
Abstract:Camera-based temporal 3D object detection has shown impressive results in autonomous driving, with offline models improving accuracy by using future frames. Knowledge distillation (KD) can be an appealing framework for transferring rich information from offline models to online models. However, existing KD methods overlook future frames, as they mainly focus on spatial feature distillation under strict frame alignment or on temporal relational distillation, thereby making it challenging for online models to effectively learn future knowledge. To this end, we propose a sparse query-based approach, Future Temporal Knowledge Distillation (FTKD), which effectively transfers future frame knowledge from an offline teacher model to an online student model. Specifically, we present a future-aware feature reconstruction strategy to encourage the student model to capture future features without strict frame alignment. In addition, we further introduce future-guided logit distillation to leverage the teacher’s stable foreground and background context. FTKD is applied to two high-performing 3D object detection baselines, achieving up to 1.3 mAP and 1.3 NDS gains on the nuScenes dataset, as well as the most accurate velocity estimation, without increasing inference cost.
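"未来感知特征重建"可以理解为:随机掩码在线学生的稀疏查询特征,用轻量解码头去重建(能看到未来帧的)离线教师的对应特征,从而不依赖严格帧对齐地传递未来知识。以下为一个示意性实现,掩码比例与解码头结构均为本文假设:

```python
import torch
import torch.nn as nn

class FutureFeatureReconstruction(nn.Module):
    def __init__(self, dim=256, mask_ratio=0.5):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(dim))
        self.decoder = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                     nn.Linear(dim, dim))
        self.mask_ratio = mask_ratio

    def forward(self, student_feats, teacher_feats):
        # student/teacher_feats: (B, N, C), N 为稀疏查询个数
        B, N, C = student_feats.shape
        mask = torch.rand(B, N, device=student_feats.device) < self.mask_ratio
        x = torch.where(mask[..., None],
                        self.mask_token.expand(B, N, C), student_feats)
        rec = self.decoder(x)
        # 仅在被掩码的位置对齐教师特征 (教师蕴含未来帧信息)
        return ((rec - teacher_feats.detach()) ** 2)[mask].mean()

loss = FutureFeatureReconstruction()(torch.randn(2, 100, 256),
                                     torch.randn(2, 100, 256))
print(loss)
```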
zh
[CV-89] Residual-SwinCA-Net: A Channel-Aware Integrated Residual CNN-Swin Transformer for Malignant Lesion Segmentation in BUSI
【速读】:该论文旨在解决乳腺病变分割中局部特征提取不充分、全局依赖关系建模能力弱以及边界模糊和结构连续性差等问题。其解决方案的关键在于提出了一种新型的深度混合残差-SwinCA-Net(Residual-SwinCA-Net)分割框架:首先通过嵌入残差卷积神经网络(CNN)模块增强局部相关特征的提取能力;其次,设计定制化的Swin Transformer块引入内部残差路径,以稳定梯度传播、细化局部模式并促进全局特征融合;此外,结合拉普拉斯高斯区域算子(Laplacian-of-Gaussian regional operator)与边界导向算子,提升组织连续性和细粒度结构过渡的显式表达;最后,在解码器阶段逐级应用多尺度通道注意力与挤压模块(MSCAS)和像素注意力模块(Pixel-Attention),实现对显著特征的自适应聚焦、冗余激活抑制及恶性病灶像素的空间语义加权,从而在BUSI公开数据集上实现了99.29%平均准确率、98.74% IoU和0.9041 Dice系数,显著优于传统CNN与Vision Transformer(ViT)方法。
链接: https://arxiv.org/abs/2512.08243
作者: Saeeda Naz, Saddam Hussain Khan
机构: Artificial Intelligence Lab, Department of Computer Systems Engineering, University of Engineering and Applied Sciences (UEAS), Swat, Pakistan
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 26 Pages, 10 Figures, 4 Tables
Abstract:A novel deep hybrid Residual-SwinCA-Net segmentation framework is proposed in this study to address the challenges of breast lesion segmentation in ultrasound images, extracting locally correlated and robust features through residual CNN modules. Furthermore, for learning global dependencies, Swin Transformer blocks are customized with internal residual pathways, which reinforce gradient stability, refine local patterns, and facilitate global feature fusion. A Laplacian-of-Gaussian regional operator is applied to enhance tissue continuity, suppress ultrasound noise, and accentuate fine structural transitions, and a boundary-oriented operator is incorporated to maintain the morphological integrity of malignant lesion contours. Subsequently, a stage-wise contraction strategy progressively reduces the feature maps to capture scale invariance and enhance robustness to structural variability. In addition, each decoder level integrates, prior to augmentation, a new Multi-Scale Channel Attention and Squeezing (MSCAS) module. The MSCAS selectively emphasizes salient encoder maps and retains discriminative global context and complementary local structures at minimal computational cost while suppressing redundant activations. Finally, the Pixel-Attention module encodes class-relevant spatial cues by adaptively weighting malignant lesion pixels while suppressing background interference. Residual-SwinCA-Net and existing CNN/ViT techniques were implemented on the publicly available BUSI dataset. The proposed Residual-SwinCA-Net framework outperformed them, achieving 99.29% mean accuracy, 98.74% IoU, and a 0.9041 Dice score for breast lesion segmentation, thereby improving BUSI lesion diagnosis and strengthening timely clinical decision-making.
zh
[CV-90] HybridToken-VLM: Hybrid Token Compression for Vision-Language Models
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在处理图像时因输入大量视觉补丁标记(visual patch tokens)而导致的二次计算复杂度问题,该问题严重限制了模型的内存效率和上下文窗口容量。传统方法在连续压缩与离散量化之间存在权衡:连续压缩会稀释高层语义信息(如物体身份),而离散量化则丢失细粒度特征(如纹理)。解决方案的关键在于提出一种名为HTC-VLM的混合框架,通过双通道机制分离语义与外观表示——连续路径保留ViT补丁以捕捉细粒度细节,离散路径利用多粒度矢量量化(MGVQ)将信息投影为四个符号锚点(symbolic anchors);二者融合成580-token混合序列,并通过解耦注意力掩码和瓶颈层压缩为单一voco token,从而实现高效且语义 grounded 的表示。该设计在7个基准测试中平均性能保留率达87.2%,显著优于连续基线(81.0%),验证了其在效率与保真度之间的平衡能力。
链接: https://arxiv.org/abs/2512.08240
作者: Jusheng Zhang,Xiaoyang Guo,Kaitong Cai,Qinhan Lv,Yijia Fan,Wenhao Chai,Jian Wang,Keze Wang
机构: Sun Yat-sen University (中山大学); Princeton University (普林斯顿大学); Snap Inc (Snap Inc)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Vision-language models (VLMs) have transformed multimodal reasoning, but feeding hundreds of visual patch tokens into LLMs incurs quadratic computational costs, straining memory and context windows. Traditional approaches face a trade-off: continuous compression dilutes high-level semantics such as object identities, while discrete quantization loses fine-grained details such as textures. We introduce HTC-VLM, a hybrid framework that disentangles semantics and appearance through dual channels, i.e., a continuous pathway for fine-grained details via ViT patches and a discrete pathway for symbolic anchors using MGVQ quantization projected to four tokens. These are fused into a 580-token hybrid sequence and compressed into a single voco token via a disentanglement attention mask and bottleneck, ensuring efficient and grounded representations. HTC-VLM achieves an average performance retention of 87.2 percent across seven benchmarks (GQA, VQAv2, MMBench, MME, POPE, SEED-Bench, ScienceQA-Image), outperforming the leading continuous baseline at 81.0 percent with a 580-to-1 compression ratio. Attention analyses show that the compressed token prioritizes the discrete anchor, validating its semantic guidance. Our work demonstrates that a minimalist hybrid design can resolve the efficiency-fidelity dilemma and advance scalable VLMs.
zh
[CV-91] FastBEV: Fast by Algorithm Deployable by Design
【速读】:该论文旨在解决纯摄像头鸟瞰图(Bird’s-Eye-View, BEV)感知中高性能与车载部署可行性之间的根本矛盾,其瓶颈源于对计算密集型视图变换和平台特定定制内核的深度依赖。解决方案的关键在于提出 FastBEV++ 框架,遵循“算法快速”(Fast by Algorithm)与“设计可部署”(Deployable by Design)两大原则:首先通过将传统单体投影分解为标准的 Index-Gather-Reshape 流水线,并结合确定性预排序策略,实现仅使用原生操作符(如 Gather、矩阵乘法)完成视图变换,从而消除对专用 CUDA 内核的依赖并确保完全兼容 TensorRT;其次利用该结构无缝集成端到端的深度感知融合机制,结合时序聚合与鲁棒数据增强,显著提升 BEV 几何保真度,在 nuScenes 基准上达到 0.359 NDS 的新 SOTA 性能,同时在汽车级硬件(如 Tesla T4)上保持超过 134 FPS 的实时性能,实现了无需自定义插件即可高精度部署的成熟可扩展架构。
链接: https://arxiv.org/abs/2512.08237
作者: Yuanpeng Chen,Hui Song,Wei Tao,ShanHui Mo,Shuang Zhang,Xiao Hua,TianKun Zhao
机构: iMotion Automotive Technology (Suzhou) Co., Ltd; Independent Researcher
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The advancement of camera-only Bird’s-Eye-View (BEV) perception is currently impeded by a fundamental tension between state-of-the-art performance and on-vehicle deployment tractability. This bottleneck stems from a deep-rooted dependency on computationally prohibitive view transformations and bespoke, platform-specific kernels. This paper introduces FastBEV++, a framework engineered to reconcile this tension, demonstrating that high performance and deployment efficiency can be achieved in unison via two guiding principles: Fast by Algorithm and Deployable by Design. We realize the “Deployable by Design” principle through a novel view transformation paradigm that decomposes the monolithic projection into a standard Index-Gather-Reshape pipeline. Enabled by a deterministic pre-sorting strategy, this transformation is executed entirely with elementary, operator-native primitives (e.g. Gather, Matrix Multiplication), which eliminates the need for specialized CUDA kernels and ensures fully TensorRT-native portability. Concurrently, our framework is “Fast by Algorithm”, leveraging this decomposed structure to seamlessly integrate an end-to-end, depth-aware fusion mechanism. This jointly learned depth modulation, further bolstered by temporal aggregation and robust data augmentation, significantly enhances the geometric fidelity of the BEV representation. Validation on the nuScenes benchmark corroborates the efficacy of our approach. FastBEV++ establishes a new state-of-the-art 0.359 NDS while maintaining exceptional real-time performance, exceeding 134 FPS on automotive-grade hardware (e.g. Tesla T4). By offering a solution that is free of custom plugins yet highly accurate, FastBEV++ presents a mature and scalable design philosophy for production autonomous systems. The code is released at: this https URL
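"Index-Gather-Reshape"流水线的关键是把视图变换拆成离线索引构建与在线 gather 两步,在线阶段只用原生算子。下面的示意中,索引用随机合法值代替(真实系统应由相机内外参把 BEV 体素投影到图像特征平面得到),仅用于展示数据通路:

```python
import torch

def precompute_indices(num_voxels, feat_hw, num_cams, seed=0):
    # 离线: 为每个 BEV 体素预计算扁平特征索引并做确定性预排序
    g = torch.Generator().manual_seed(seed)
    idx = torch.randint(0, num_cams * feat_hw, (num_voxels,), generator=g)
    return torch.sort(idx).values

def view_transform(img_feats, indices, bev_shape):
    # 在线: 仅用 Gather + Reshape 完成 2D->BEV, 无自定义 CUDA 核
    n, c, h, w = img_feats.shape
    flat = img_feats.permute(0, 2, 3, 1).reshape(n * h * w, c)
    bev = flat.index_select(0, indices)                  # Gather
    return bev.reshape(*bev_shape, c).permute(2, 0, 1)   # Reshape

img_feats = torch.randn(6, 64, 32, 88)                   # 6 路环视相机特征
idx = precompute_indices(100 * 100, 32 * 88, num_cams=6)
bev = view_transform(img_feats, idx, bev_shape=(100, 100))
print(bev.shape)                                         # torch.Size([64, 100, 100])
```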
zh
[CV-92] Geometry-Aware Sparse Depth Sampling for High-Fidelity RGB-D Depth Completion in Robotic Systems
【速读】:该论文旨在解决当前深度补全(depth completion)方法中因稀疏深度采样方式不真实而导致的性能瓶颈问题。现有方法通常采用均匀随机采样生成稀疏深度图,忽略了真实传感器在不同几何结构和空间位置上表现出的非均匀可靠性。解决方案的关键在于提出一种基于法向量引导的稀疏深度采样策略:利用主成分分析(PCA)对RGB-D点云进行表面法向量估计,并据此计算每个像素的深度可靠性指标,进而依据该分布采样稀疏深度点。该方法提升了训练数据的真实性,使模型更贴近实际传感器行为,在NYU Depth v2数据集上的实验表明其能提高深度补全精度、减少边缘区域伪影并增强鲁棒性。
链接: https://arxiv.org/abs/2512.08229
作者: Tony Salloom,Dandi Zhou,Xinhai Sun
机构: Saisuode (Shanghai) Intelligent Technology Co., Ltd. (Synthoid.ai)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Accurate three-dimensional perception is essential for modern industrial robotic systems that perform manipulation, inspection, and navigation tasks. RGB-D and stereo vision sensors are widely used for this purpose, but the depth maps they produce are often noisy, incomplete, or biased due to sensor limitations and environmental conditions. Depth completion methods aim to generate dense, reliable depth maps from RGB images and sparse depth input. However, a key limitation in current depth completion pipelines is the unrealistic generation of sparse depth: sparse pixels are typically selected uniformly at random from dense ground-truth depth, ignoring the fact that real sensors exhibit geometry-dependent and spatially nonuniform reliability. In this work, we propose a normal-guided sparse depth sampling strategy that leverages PCA-based surface normal estimation on the RGB-D point cloud to compute a per-pixel depth reliability measure. The sparse depth samples are then drawn according to this reliability distribution. We integrate this sampling method with the Marigold-DC diffusion-based depth completion model and evaluate it on NYU Depth v2 using the standard metrics. Experiments show that our geometry-aware sparse depth improves accuracy, reduces artifacts near edges and discontinuities, and produces more realistic training conditions that better reflect real sensor behavior.
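"PCA 法向估计 → 逐像素可靠性 → 按可靠性分布采样稀疏深度"的主干可以示意如下。这里直接把像素坐标与深度拼成局部点集做 PCA,用平面性(最小特征值占比越小越平)充当可靠性;真实实现应先用相机内参反投影成点云,窗口大小与可靠性定义均为本文假设:

```python
import numpy as np

def planarity_reliability(depth, k=7):
    # 局部窗口 PCA: 最小特征值占比越小 -> 越接近平面 -> 越可靠
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([xs, ys, depth], axis=-1).astype(float)
    rel = np.zeros((h, w))
    r = k // 2
    for y in range(r, h - r):
        for x in range(r, w - r):
            p = pts[y - r:y + r + 1, x - r:x + r + 1].reshape(-1, 3)
            evals = np.linalg.eigvalsh(np.cov((p - p.mean(0)).T))
            rel[y, x] = 1.0 - evals[0] / (evals.sum() + 1e-9)
    return rel

def sample_sparse_depth(depth, n=500, seed=0):
    rel = planarity_reliability(depth)
    p = rel.ravel() / rel.sum()
    idx = np.random.default_rng(seed).choice(depth.size, n,
                                             replace=False, p=p)
    sparse = np.zeros_like(depth)
    sparse.ravel()[idx] = depth.ravel()[idx]
    return sparse

depth = 2.0 + 0.01 * np.random.rand(64, 64)   # 近似平面的示例深度图
print(int((sample_sparse_depth(depth) > 0).sum()), "个稀疏采样点")
```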
zh
[CV-93] MM-CoT: A Benchmark for Probing Visual Chain-of-Thought Reasoning in Multimodal Models
【速读】:该论文旨在解决多模态模型(Multimodal Models, MMs)在链式思维(Chain-of-Thought, CoT)推理中缺乏对视觉证据的真正依赖性和逻辑一致性验证的问题。现有基准主要关注生成能力,忽视了对推理链条是否符合视觉事实和逻辑有效性的评估。为填补这一空白,作者提出MM-CoT诊断性基准,其关键在于要求模型从多个事件链中选择唯一满足两个正交约束的选项:(i) 视觉一致性(visual consistency),确保每一步推理均基于可观察的视觉证据;(ii) 逻辑连贯性(logical coherence),确保因果关系与常识合理性。通过设计对抗性干扰项来破坏其中任一约束,从而暴露模型的不同推理缺陷。实验证明,即使最先进的视觉-语言模型在该基准上表现不佳,表明生成流畅性与真实推理可靠性之间存在显著差距,且MM-CoT与现有基准相关性低,验证了其测量维度的独特性。
链接: https://arxiv.org/abs/2512.08228
作者: Jusheng Zhang,Kaitong Cai,Xiaoyang Guo,Sidi Liu,Qinhan Lv,Ruiqi Chen,Jing Yang,Yijia Fan,Xiaofei Sun,Jian Wang,Ziliang Chen,Liang Lin,Keze Wang
机构: Sun Yat-sen University (中山大学); Alibaba Group (阿里巴巴集团); Snap Inc (Snap公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:The ability to perform Chain-of-Thought (CoT) reasoning marks a major milestone for multimodal models (MMs), enabling them to solve complex visual reasoning problems. Yet a critical question remains: is such reasoning genuinely grounded in visual evidence and logically coherent? Existing benchmarks emphasize generation but neglect verification, i.e., the capacity to assess whether a reasoning chain is both visually consistent and logically valid. To fill this gap, we introduce MM-CoT, a diagnostic benchmark specifically designed to probe the visual grounding and logical coherence of CoT reasoning in MMs. Instead of generating free-form explanations, models must select the sole event chain that satisfies two orthogonal constraints: (i) visual consistency, ensuring all steps are anchored in observable evidence, and (ii) logical coherence, ensuring causal and commonsense validity. Adversarial distractors are engineered to violate one of these constraints, exposing distinct reasoning failures. We evaluate leading vision-language models on MM-CoT and find that even the most advanced systems struggle, revealing a sharp discrepancy between generative fluency and true reasoning fidelity. MM-CoT shows low correlation with existing benchmarks, confirming that it measures a unique combination of visual grounding and logical reasoning. This benchmark provides a foundation for developing future models that reason not just plausibly, but faithfully and coherently within the visual world.
zh
[CV-94] New VVC profiles targeting Feature Coding for Machines ICIP2025
【速读】:该论文旨在解决在 split inference 系统中,传统视频编码标准(如 VVC)因依赖人类视觉系统感知模型而无法高效压缩神经网络中间特征的问题。由于这些特征具有抽象性、稀疏性和任务特定性,传统的基于感知保真的编码策略不再适用。解决方案的关键在于针对机器感知特性重新设计编码工具,在 MPEG-AI 特征编码 for Machines (FCM) 标准框架下,对 VVC 的各个编码组件进行工具级分析,从而提出三种轻量级的 VVC 编码配置:Fast、Faster 和 Fastest。这三者分别在保持较高压缩效率的同时显著提升编码速度,实现了 BD-Rate 与编码延迟之间的权衡优化,适用于低延迟机器视觉任务中的特征传输场景。
链接: https://arxiv.org/abs/2512.08227
作者: Md Eimran Hossain Eimon,Ashan Perera,Juan Merlos,Velibor Adzic,Hari Kalva
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for presentation at ICIP 2025 workshop on Coding for Machines
Abstract:Modern video codecs have been extensively optimized to preserve perceptual quality, leveraging models of the human visual system. However, in split inference systems-where intermediate features from neural network are transmitted instead of pixel data-these assumptions no longer apply. Intermediate features are abstract, sparse, and task-specific, making perceptual fidelity irrelevant. In this paper, we investigate the use of Versatile Video Coding (VVC) for compressing such features under the MPEG-AI Feature Coding for Machines (FCM) standard. We perform a tool-level analysis to understand the impact of individual coding components on compression efficiency and downstream vision task accuracy. Based on these insights, we propose three lightweight essential VVC profiles-Fast, Faster, and Fastest. The Fast profile provides 2.96% BD-Rate gain while reducing encoding time by 21.8%. Faster achieves a 1.85% BD-Rate gain with a 51.5% speedup. Fastest reduces encoding time by 95.6% with only a 1.71% loss in BD-Rate.
zh
[CV-95] SOP2: Transfer Learning with Scene-Oriented Prompt Pool on 3D Object Detection
【速读】:该论文旨在解决如何将大规模预训练模型(如基于Waymo数据集训练的模型)有效迁移到其他3D目标检测场景中的问题,核心挑战在于如何通过少量参数调整实现跨场景适应。其解决方案的关键在于引入一种面向场景的提示池(Scene-Oriented Prompt Pool, SOP²),系统性地探究提示标记(prompt tokens)与提示生成器(prompt generator)的作用,并利用结构化提示池增强模型在不同场景下的泛化能力,从而提升3D目标检测任务的适应性和性能。
链接: https://arxiv.org/abs/2512.08223
作者: Ching-Hung Cheng,Hsiu-Fu Wu,Bing-Chen Wu,Khanh-Phong Bui,Van-Tin Luu,Ching-Chun Huang
机构: National Yang Ming Chiao Tung University (国立阳明交通大学); Internet of Things Laboratory, Chunghwa Telecom Laboratories (中华电信实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:With the rise of Large Language Models (LLMs) such as GPT-3, these models exhibit strong generalization capabilities. Through transfer learning techniques such as fine-tuning and prompt tuning, they can be adapted to various downstream tasks with minimal parameter adjustments. This approach is particularly common in the field of Natural Language Processing (NLP). This paper aims to explore the effectiveness of common prompt tuning methods in 3D object detection. We investigate whether a model trained on the large-scale Waymo dataset can serve as a foundation model and adapt to other scenarios within the 3D object detection field. This paper sequentially examines the impact of prompt tokens and prompt generators, and further proposes a Scene-Oriented Prompt Pool (SOP²). We demonstrate the effectiveness of prompt pools in 3D object detection, with the goal of inspiring future researchers to delve deeper into the potential of prompts in the 3D field.
zh
[CV-96] VisKnow: Constructing Visual Knowledge Base for Object Understanding
【速读】:该论文旨在解决当前计算机视觉中对象理解能力不足的问题,即现有方法仅能提供类别标签(object recognition),难以实现对对象类别的深入感知,如组成部分、外观特征、类别间关系及上下文背景知识等多维度信息的整合。为实现更全面的对象理解,论文提出构建一个结构化的视觉知识库(Visual Knowledge Base),其核心解决方案是设计并实现VisKnow框架,该框架通过融合专家设计与大规模模型应用,从图像区域标注和对齐文本知识中提取多模态对象级知识,并以图结构形式组织。关键创新在于将视觉与语言模态的知识进行系统化整合,形成可支持零样本识别、细粒度视觉问答(VQA)等任务的结构化知识资源,从而推动视觉理解向更高层次的认知推理演进。
链接: https://arxiv.org/abs/2512.08221
作者: Ziwei Yao,Qiyang Wan,Ruiping Wang,Xilin Chen
机构: Chinese Academy of Sciences (中国科学院); Institute of Computing Technology (计算技术研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 12 figures, 7 tables. Under review
Abstract:Understanding objects is fundamental to computer vision. Beyond object recognition that provides only a category label as typical output, in-depth object understanding represents a comprehensive perception of an object category, involving its components, appearance characteristics, inter-category relationships, contextual background knowledge, etc. Developing such capability requires sufficient multi-modal data, including visual annotations such as parts, attributes, and co-occurrences for specific tasks, as well as textual knowledge to support high-level tasks like reasoning and question answering. However, these data are generally task-oriented and not systematically organized enough to achieve the expected understanding of object categories. In response, we propose the Visual Knowledge Base that structures multi-modal object knowledge as graphs, and present a construction framework named VisKnow that extracts multi-modal, object-level knowledge for object understanding. This framework integrates enriched aligned text and image-source knowledge with region annotations at both object and part levels through a combination of expert design and large-scale model application. As a specific case study, we construct AnimalKB, a structured animal knowledge base covering 406 animal categories, which contains 22K textual knowledge triplets extracted from encyclopedic documents, 420K images, and corresponding region annotations. A series of experiments showcase how AnimalKB enhances object-level visual tasks such as zero-shot recognition and fine-grained VQA, and serves as challenging benchmarks for knowledge graph completion and part segmentation. Our findings highlight the potential of automatically constructing visual knowledge bases to advance visual understanding and its practical applications. The project page is available at this https URL.
zh
[CV-97] Blur2Sharp: Human Novel Pose and View Synthesis with Generative Prior Refinement
【速读】:该论文旨在解决生成逼真人类虚拟形象时面临的两大核心挑战:一是多视角图像的几何一致性不足,二是复杂姿态和视角变化下图像清晰度下降的问题(即模糊输出)。现有方法往往在几何保真与照片真实感之间权衡,难以同时保证结构一致性和细节清晰度。解决方案的关键在于提出 Blur2Sharp 框架,其核心创新是融合 3D-aware 神经渲染与扩散模型(diffusion models)的双条件架构:首先利用 Human NeRF 生成目标姿态下的几何一致多视角图像,提供显式的三维结构引导;随后通过扩散模型以这些渲染结果为条件进行精细化修复,从而保留局部细节并提升整体结构保真度。此外,通过引入基于 SMPL 参数化模型提取的纹理、法线和语义先验,实现分层特征融合,进一步增强全局连贯性与局部准确性。
链接: https://arxiv.org/abs/2512.08215
作者: Chia-Hern Lai,I-Hsuan Lo,Yen-Ku Yeh,Thanh-Nguyen Truong,Ching-Chun Huang
机构: National Yang Ming Chiao Tung University (国立阳明交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The creation of lifelike human avatars capable of realistic pose variation and viewpoint flexibility remains a fundamental challenge in computer vision and graphics. Current approaches typically yield either geometrically inconsistent multi-view images or sacrifice photorealism, resulting in blurry outputs under diverse viewing angles and complex motions. To address these issues, we propose Blur2Sharp, a novel framework integrating 3D-aware neural rendering and diffusion models to generate sharp, geometrically consistent novel-view images from only a single reference view. Our method employs a dual-conditioning architecture: initially, a Human NeRF model generates geometrically coherent multi-view renderings for target poses, explicitly encoding 3D structural guidance. Subsequently, a diffusion model conditioned on these renderings refines the generated images, preserving fine-grained details and structural fidelity. We further enhance visual quality through hierarchical feature fusion, incorporating texture, normal, and semantic priors extracted from parametric SMPL models to simultaneously improve global coherence and local detail accuracy. Extensive experiments demonstrate that Blur2Sharp consistently surpasses state-of-the-art techniques in both novel pose and view generation tasks, particularly excelling under challenging scenarios involving loose clothing and occlusions.
zh
[CV-98] Animal Re-Identification on Microcontrollers
【速读】:该论文旨在解决动物再识别(Animal Re-ID)模型在资源受限的微控制器(MCU)类边缘设备上部署困难的问题,特别是在无线连接有限的大规模户外环境中,传统基于工作站或服务器的大型模型因内存和输入分辨率限制难以直接运行。解决方案的关键在于:首先通过系统分析揭示了传统知识蒸馏方法在低资源约束下效果有限;其次,设计了一种针对低分辨率输入优化的轻量化CNN架构,基于MobileNetV2骨干网络进行系统性缩放以提升精度;最后,提出一种数据高效的微调策略,仅需每类动物3张图像即可快速适应新场景,实现在MCU上的全设备端推理,同时保持与云端版本相当的Top-1准确率,显著缩小模型体积并提升实用性。
链接: https://arxiv.org/abs/2512.08198
作者: Yubo Chen,Di Zhao,Yun Sing Koh,Talia Xu
机构: University of Auckland(奥克兰大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Camera-based animal re-identification (Animal Re-ID) can support wildlife monitoring and precision livestock management in large outdoor environments with limited wireless connectivity. In these settings, inference must run directly on collar tags or low-power edge nodes built around microcontrollers (MCUs), yet most Animal Re-ID models are designed for workstations or servers and are too large for devices with small memory and low-resolution inputs. We propose an on-device framework. First, we characterise the gap between state-of-the-art Animal Re-ID models and MCU-class hardware, showing that straightforward knowledge distillation from large teachers offers limited benefit once memory and input resolution are constrained. Second, guided by this analysis, we design a high-accuracy Animal Re-ID architecture by systematically scaling a CNN-based MobileNetV2 backbone for low-resolution inputs. Third, we evaluate the framework with a real-world dataset and introduce a data-efficient fine-tuning strategy to enable fast adaptation with just three images per animal identity at a new site. Across six public Animal Re-ID datasets, our compact model achieves competitive retrieval accuracy while reducing model size by over two orders of magnitude. On a self-collected cattle dataset, the deployed model performs fully on-device inference with only a small accuracy drop and unchanged Top-1 accuracy relative to its cluster version. We demonstrate that practical, adaptable Animal Re-ID is achievable on MCU-class devices, paving the way for scalable deployment in real field environments.
zh
[CV-99] Embodied Tree of Thoughts: Deliberate Manipulation Planning with Embodied World Model
【速读】:该论文旨在解决当前机器人操作规划中世界模型(World Model)缺乏物理严谨性的问题,即视频生成模型常因物理约束建模不足而导致幻觉现象和长时间规划中的不一致性。解决方案的关键在于提出一种名为“具身思维树”(Embodied Tree of Thoughts, EToT)的Real2Sim2Real规划框架,其核心是利用基于物理的交互式数字孪生(Interactive Digital Twin)作为具身世界模型,并通过两种协同机制实现高效规划:一是先验分支(Priori Branching),基于语义与空间分析生成多样化的执行路径;二是反思分支(Reflective Branching),借助视觉语言模型(VLM)在仿真环境中诊断执行失败并迭代优化规划树。该方法确保高阶推理过程严格遵循刚体动力学和碰撞约束,从而显著提升短程与长程操作任务中的规划准确性与鲁棒性。
链接: https://arxiv.org/abs/2512.08188
作者: Wenjiang Xu,Cindy Wang,Rui Fang,Mingkang Zhang,Lusong Li,Jing Xu,Jiayuan Gu,Zecui Zeng,Rui Chen
机构: University of Chinese Academy of Sciences (UCAS); Tsinghua University; JD Explore Academy; ShanghaiTech University; Nanjing University
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Website at this https URL
Abstract:World models have emerged as a pivotal component in robot manipulation planning, enabling agents to predict future environmental states and reason about the consequences of actions before execution. While video-generation models are increasingly adopted, they often lack rigorous physical grounding, leading to hallucinations and a failure to maintain consistency in long-horizon physical constraints. To address these limitations, we propose Embodied Tree of Thoughts (EToT), a novel Real2Sim2Real planning framework that leverages a physics-based interactive digital twin as an embodied world model. EToT formulates manipulation planning as a tree search expanded through two synergistic mechanisms: (1) Priori Branching, which generates diverse candidate execution paths based on semantic and spatial analysis; and (2) Reflective Branching, which utilizes VLMs to diagnose execution failures within the simulator and iteratively refine the planning tree with corrective actions. By grounding high-level reasoning in a physics simulator, our framework ensures that generated plans adhere to rigid-body dynamics and collision constraints. We validate EToT on a suite of short- and long-horizon manipulation tasks, where it consistently outperforms baselines by effectively predicting physical dynamics and adapting to potential failures. Website at this https URL .
zh
[CV-100] GeoLoom: High-quality Geometric Diagram Generation from Textual Input
【速读】:该论文旨在解决几何图生成中结构准确性不足的问题,即如何在保持空间精度的同时实现可解释且可扩展的文本到图生成。其解决方案的关键在于提出GeoLoom框架,该框架由两个核心组件构成:一是自动形式化模块(autoformalization module),将自然语言描述转化为专为生成设计的形式语言GeoLingua;二是坐标求解器(coordinate solver),利用高效的蒙特卡洛优化方法将形式化约束映射为精确坐标。此外,研究还构建了GeoNF数据集,并提出基于约束的评估指标以量化结构偏差,从而支持迭代优化与数学严谨的监督训练。
链接: https://arxiv.org/abs/2512.08180
作者: Xiaojing Wei,Ting Zhang,Wei He,Jingdong Wang,Hua Huang
机构: Beijing Normal University (北京师范大学); Baidu (百度)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:High-quality geometric diagram generation presents both a challenge and an opportunity: it demands strict spatial accuracy while offering well-defined constraints to guide generation. Inspired by recent advances in geometry problem solving that employ formal languages and symbolic solvers for enhanced correctness and interpretability, we propose GeoLoom, a novel framework for text-to-diagram generation in geometric domains. GeoLoom comprises two core components: an autoformalization module that translates natural language into a specifically designed generation-oriented formal language GeoLingua, and a coordinate solver that maps formal constraints to precise coordinates using the efficient Monte Carlo optimization. To support this framework, we introduce GeoNF, a dataset aligning natural language geometric descriptions with formal GeoLingua descriptions. We further propose a constraint-based evaluation metric that quantifies structural deviation, offering mathematically grounded supervision for iterative refinement. Empirical results demonstrate that GeoLoom significantly outperforms state-of-the-art baselines in structural fidelity, providing a principled foundation for interpretable and scalable diagram generation.
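"坐标求解器 + 蒙特卡洛优化"的思路可以用一个玩具约束说明:把 GeoLingua 一类的形式化约束写成残差函数,用随机扰动搜索最小化总违约量;这里的约束与扰动退火策略都是本文为演示而设,非论文实现:

```python
import numpy as np

def constraint_loss(P):
    # 玩具约束: A、B、C 构成等腰直角三角形, AB=AC=1, 角A=90°
    A, B, C = P
    ab, ac = B - A, C - A
    return ((np.linalg.norm(ab) - 1) ** 2 +
            (np.linalg.norm(ac) - 1) ** 2 +
            (ab @ ac) ** 2)                  # 内积为 0 <=> 垂直

def monte_carlo_solve(loss_fn, n_pts=3, iters=20000, seed=0):
    rng = np.random.default_rng(seed)
    best = rng.uniform(-1, 1, size=(n_pts, 2))
    best_loss = loss_fn(best)
    for t in range(iters):
        step = 0.5 * (1 - t / iters) + 1e-3  # 逐步收缩的扰动幅度
        cand = best + step * rng.normal(size=best.shape)
        l = loss_fn(cand)
        if l < best_loss:
            best, best_loss = cand, l
    return best, best_loss

P, l = monte_carlo_solve(constraint_loss)
print("约束残差:", l)
print("求得坐标:\n", np.round(P, 3))
```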
zh
[CV-101] RAVES-Calib: Robust, Accurate and Versatile Extrinsic Self-Calibration Using Optimal Geometric Features
【速读】:该论文旨在解决无标定目标(targetless)环境下激光雷达(LiDAR)与相机之间的外参标定问题,尤其针对现有方法对初始变换依赖性强、鲁棒性不足以及特征分布不均导致精度下降的挑战。其解决方案的关键在于:首先利用Gluestick流水线建立2D-3D点和线特征对应关系,实现无需初始变换的鲁棒自动初值估计;其次通过定量分析特征分布对标定结果的影响,并基于此自适应加权每类特征的代价函数,从而在优化过程中过滤劣质特征的不利影响,显著提升标定精度与稳定性。
链接: https://arxiv.org/abs/2512.08170
作者: Haoxin Zhang,Shuaixin Li,Xiaozhou Zhu,Hongbo Chen,Wen Yao
机构: Sun Yat-Sen University (中山大学); Chinese Academy of Military Science (中国军事科学研究院); Intelligent Game and Decision Laboratory (智能游戏与决策实验室); Harbin Engineering University (哈尔滨工程大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In this paper, we present a user-friendly LiDAR-camera calibration toolkit that is compatible with various LiDAR and camera sensors and requires only a single pair of laser points and a camera image in targetless environments. Our approach eliminates the need for an initial transform and remains robust even with large positional and rotational LiDAR-camera extrinsic parameters. We employ the Gluestick pipeline to establish 2D-3D point and line feature correspondences for a robust and automatic initial guess. To enhance accuracy, we quantitatively analyze the impact of feature distribution on calibration results and adaptively weight the cost of each feature based on these metrics. As a result, extrinsic parameters are optimized by filtering out the adverse effects of inferior features. We validated our method through extensive experiments across various LiDAR-camera sensors in both indoor and outdoor settings. The results demonstrate that our method provides superior robustness and accuracy compared to SOTA techniques. Our code is open-sourced on GitHub to benefit the community.
zh
[CV-102] Accuracy Does Not Guarantee Human-Likeness in Monocular Depth Estimators
【速读】:该论文旨在解决单目深度估计模型在追求高精度的同时,如何更好地对齐人类感知的问题,即提升模型的鲁棒性和可解释性。其核心挑战在于揭示模型准确性与人类相似性之间的潜在权衡关系,尤其是在依赖传感器真值标注的自然户外场景中。解决方案的关键在于系统性地评估69种单目深度估计算法在KITTI数据集上的表现,并通过仿射拟合方法将预测误差分解为可解释的组成部分,从而量化模型误差模式与人类感知偏差之间的关联,发现尽管人类与深度神经网络(DNNs)存在部分共同的估计偏倚(正误差相关性),但两者在准确性和人类相似性之间呈现出显著差异的权衡关系,表明单纯提高准确性并不能自然带来更符合人类感知的行为,因此亟需构建多维度的人类中心型评估体系。
链接: https://arxiv.org/abs/2512.08163
作者: Yuki Kubota,Taiki Fukiage
机构: NTT, Inc.(日本电信电话公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages, 12 figures, 1 table
Abstract:Monocular depth estimation is a fundamental capability for real-world applications such as autonomous driving and robotics. Although deep neural networks (DNNs) have achieved superhuman accuracy on physics-based benchmarks, a key challenge remains: aligning model representations with human perception, a promising strategy for enhancing model robustness and interpretability. Research in object recognition has revealed a complex trade-off between model accuracy and human-like behavior, raising the question of whether a similar divergence exists in depth estimation, particularly for natural outdoor scenes where benchmarks rely on sensor-based ground truth rather than human perceptual estimates. In this study, we systematically investigated the relationship between model accuracy and human similarity across 69 monocular depth estimators using the KITTI dataset. To dissect the structure of error patterns on a factor-by-factor basis, we applied affine fitting to decompose prediction errors into interpretable components. Intriguingly, our results reveal that while humans and DNNs share certain estimation biases (positive error correlations), distinct trade-off relationships exist between model accuracy and human similarity. This finding indicates that improving accuracy does not necessarily lead to more human-like behavior, underscoring the necessity of developing multifaceted, human-centric evaluations beyond traditional accuracy.
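文中"仿射拟合分解预测误差"的做法可以示意为:对每组预测与真值做最小二乘拟合 pred ≈ a·gt + b,把误差拆成增益 a、偏置 b 与残差三个可解释分量,再在模型与人类之间比较;下面的数据为模拟,仅演示分解本身:

```python
import numpy as np

def affine_decompose(pred, gt):
    # 最小二乘拟合 pred ≈ a*gt + b, 返回 (a, b, 残差 RMSE)
    A = np.stack([gt, np.ones_like(gt)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, pred, rcond=None)
    resid = pred - (a * gt + b)
    return a, b, float(np.sqrt((resid ** 2).mean()))

gt = np.random.uniform(2, 50, 1000)               # 真实深度 (m)
human = 0.8 * gt + 1.5 + np.random.randn(1000)    # 模拟人类的压缩性偏置
model = 1.0 * gt + 0.1 * np.random.randn(1000)    # 模拟高精度模型
print("human:", affine_decompose(human, gt))
print("model:", affine_decompose(model, gt))
```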
zh
[CV-103] Fourier-RWKV: A Multi-State Perception Network for Efficient Image Dehazing
【速读】:该论文旨在解决真实世界中非均匀雾霾条件下的图像去雾问题,该场景下传统方法难以实现可靠视觉感知,且基于Transformer的方法虽能捕捉全局上下文信息,但其二次计算复杂度限制了实时部署。解决方案的关键在于提出一种名为Fourier-RWKV的新框架,其核心是多状态感知(Multi-State Perception)范式,通过三个互补的感知状态协同建模:(1) 空间形态感知(Spatial-form Perception),利用可变形四向Token偏移(DQ-Shift)动态调整感受野以适应局部雾霾变化;(2) 频域感知(Frequency-domain Perception),在Fourier Mix模块中将RWKV的核心WKV注意力机制从空间域扩展至傅里叶域,保留长程依赖以支持全局雾霾估计并缓解空间衰减;(3) 语义关系感知(Semantic-relation Perception),借助语义桥接模块(SBM)与动态语义核融合(DSK-Fusion)精准对齐编码器-解码器特征并抑制伪影。该设计实现了线性计算复杂度下的高质量去雾效果,显著优于现有方法。
链接: https://arxiv.org/abs/2512.08161
作者: Lirong Zheng,Yanshan Li,Rui Yu,Kaihao Zhang
机构: Shenzhen University (深圳大学); Australian National University (澳大利亚国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Image dehazing is crucial for reliable visual perception, yet it remains highly challenging under real-world non-uniform haze conditions. Although Transformer-based methods excel at capturing global context, their quadratic computational complexity hinders real-time deployment. To address this, we propose Fourier Receptance Weighted Key Value (Fourier-RWKV), a novel dehazing framework based on a Multi-State Perception paradigm. The model achieves comprehensive haze degradation modeling with linear complexity by synergistically integrating three distinct perceptual states: (1) Spatial-form Perception, realized through the Deformable Quad-directional Token Shift (DQ-Shift) operation, which dynamically adjusts receptive fields to accommodate local haze variations; (2) Frequency-domain Perception, implemented within the Fourier Mix block, which extends the core WKV attention mechanism of RWKV from the spatial domain to the Fourier domain, preserving the long-range dependencies essential for global haze estimation while mitigating spatial attenuation; (3) Semantic-relation Perception, facilitated by the Semantic Bridge Module (SBM), which utilizes Dynamic Semantic Kernel Fusion (DSK-Fusion) to precisely align encoder-decoder features and suppress artifacts. Extensive experiments on multiple benchmarks demonstrate that Fourier-RWKV delivers state-of-the-art performance across diverse haze scenarios while significantly reducing computational overhead, establishing a favorable trade-off between restoration quality and practical efficiency. Code is available at: this https URL.
zh
[CV-104] TreeGRPO: Tree-Advantage GRPO for Online RL Post-Training of Diffusion Models
【速读】:该论文旨在解决生成式 AI(Generative AI)模型在强化学习(Reinforcement Learning, RL)后训练过程中计算成本过高这一关键问题。传统方法依赖轨迹级的信用分配和独立采样,导致样本效率低且训练速度慢。其解决方案的核心在于提出 TreeGRPO 框架,通过将去噪过程建模为搜索树结构,在共享初始噪声的基础上进行多分支生成,从而高效复用公共前缀;该设计实现了三个关键技术优势:高样本效率、基于奖励反向传播的细粒度信用分配(step-specific advantages),以及通过多子节点分支实现的 amortized computation(即一次前向传播支持多次策略更新)。实验表明,TreeGRPO 在扩散模型与流模型上均实现 2.4 倍加速,并在效率-奖励权衡空间中取得更优帕累托前沿。
链接: https://arxiv.org/abs/2512.08153
作者: Zheng Ding,Weirui Ye
机构: UC San Diego (加州大学圣地亚哥分校); MIT (麻省理工学院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Reinforcement learning (RL) post-training is crucial for aligning generative models with human preferences, but its prohibitive computational cost remains a major barrier to widespread adoption. We introduce TreeGRPO, a novel RL framework that dramatically improves training efficiency by recasting the denoising process as a search tree. From shared initial noise samples, TreeGRPO strategically branches to generate multiple candidate trajectories while efficiently reusing their common prefixes. This tree-structured approach delivers three key advantages: (1) high sample efficiency, achieving better performance under the same training-sample budget; (2) fine-grained credit assignment via reward backpropagation that computes step-specific advantages, overcoming the uniform credit assignment limitation of trajectory-based methods; and (3) amortized computation, where multi-child branching enables multiple policy updates per forward pass. Extensive experiments on both diffusion and flow-based models demonstrate that TreeGRPO achieves 2.4× faster training while establishing a superior Pareto frontier in the efficiency-reward trade-off space. Our method consistently outperforms GRPO baselines across multiple benchmarks and reward models, providing a scalable and effective pathway for RL-based visual generative model alignment. The project website is available at this http URL.
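树结构信用分配的核心可以浓缩为一次后序回传:叶子(完整轨迹)拿到奖励,内部节点价值取子节点均值,每条分支的步级优势即其价值减去父节点价值。以下为一个与具体扩散模型无关的最小示意(数据结构为本文假设):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    children: list = field(default_factory=list)
    reward: float = 0.0          # 仅叶子: 完整生成后的奖励
    value: float = 0.0
    advantage: float = 0.0

def backup(node):
    # 后序遍历: 叶子 value=reward, 内部节点 value=子节点均值
    if not node.children:
        node.value = node.reward
        return node.value
    vals = [backup(c) for c in node.children]
    node.value = sum(vals) / len(vals)
    for c in node.children:
        c.advantage = c.value - node.value   # 步级 (分支级) 优势
    return node.value

# 根节点在某一去噪步分成两支, 其中一支再分两支
leaf_a, leaf_b, leaf_c = Node(reward=1.0), Node(reward=0.2), Node(reward=0.4)
mid = Node(children=[leaf_a, leaf_b])
root = Node(children=[mid, leaf_c])
backup(root)
print([round(c.advantage, 3) for c in root.children])  # [0.1, -0.1]
```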
zh
[CV-105] CVP: Central-Peripheral Vision-Inspired Multimodal Model for Spatial Reasoning WACV2026
【速读】:该论文旨在解决现有3D场景理解方法在空间推理能力上的局限性问题,这些问题主要源于其依赖无结构表示(如点云、体素或补丁特征)并通过坐标嵌入隐式注入场景上下文,导致缺乏显式的高层结构理解。解决方案的关键在于提出一种受人类视觉系统中中央视野(central vision)与周边视野(peripheral vision)启发的框架——CVP,其中引入两个互补组件:目标相关性标记(target-affinity token),类比于中央视野,引导模型注意力聚焦于与查询相关的物体;以及以中心外坐标系(allocentric grid)为表征的全局场景网格,类比于周边视野,用于捕捉整体场景结构和空间布局。这两个模块协同工作,实现对复杂3D环境的结构化、上下文感知的理解,从而显著提升空间推理性能。
链接: https://arxiv.org/abs/2512.08135
作者: Zeyuan Chen,Xiang Zhang,Haiyang Xu,Jianwen Xie,Zhuowen Tu
机构: UC San Diego; Lambda, Inc
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to WACV 2026
Abstract:We present a central-peripheral vision-inspired framework (CVP), a simple yet effective multimodal model for spatial reasoning that draws inspiration from the two types of human visual fields – central vision and peripheral vision. Existing approaches primarily rely on unstructured representations, such as point clouds, voxels, or patch features, and inject scene context implicitly via coordinate embeddings. However, this often results in limited spatial reasoning capabilities due to the lack of explicit, high-level structural understanding. To address this limitation, we introduce two complementary components into a Large Multimodal Model-based architecture: target-affinity token, analogous to central vision, that guides the model’s attention toward query-relevant objects; and allocentric grid, akin to peripheral vision, that captures global scene context and spatial arrangements. These components work in tandem to enable structured, context-aware understanding of complex 3D environments. Experiments show that CVP achieves state-of-the-art performance across a range of 3D scene understanding benchmarks.
zh
[CV-106] Generalizations of the Normalized Radon Cumulative Distribution Transform for Limited Data Recognition
【速读】:该论文旨在解决图像特征表示在面对任意仿射变换(affine transformations)时缺乏不变性的问题,尤其是在小样本场景下(如filigranology中的水印识别),传统方法难以保持分类性能。解决方案的关键在于提出了一类广义归一化方法以增强应用灵活性,并利用广义Radon变换拓展至多维及非欧几里得空间,从而构建出在特定变换下保持不变的新型特征表示。理论证明表明这些特征可在特征空间中实现线性可分,数值实验进一步验证了其在2D图像、3D形状和3D旋转矩阵上的高精度分类与聚类效果。
链接: https://arxiv.org/abs/2512.08099
作者: Matthias Beckmann,Robert Beinert,Jonas Bresch
机构: TU Berlin (柏林工业大学)
类目: Numerical Analysis (math.NA); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT)
备注:
Abstract:The Radon cumulative distribution transform (R-CDT) exploits one-dimensional Wasserstein transport and the Radon transform to represent prominent features in images. It is closely related to the sliced Wasserstein distance and facilitates classification tasks, especially in the small data regime, like the recognition of watermarks in filigranology. Here, a typical issue is that the given data may be subject to affine transformations caused by the measuring process. To make the R-CDT invariant under arbitrary affine transformations, a two-step normalization of the R-CDT has been proposed in our earlier works. The aim of this paper is twofold. First, we propose a family of generalized normalizations to enhance flexibility for applications. Second, we study multi-dimensional and non-Euclidean settings by making use of generalized Radon transforms. We prove that our novel feature representations are invariant under certain transformations and allow for linear separation in feature space. Our theoretical results are supported by numerical experiments based on 2d images, 3d shapes and 3d rotation matrices, showing near perfect classification accuracies and clustering results.
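R-CDT 的基石是一维累计分布变换(CDT):把归一化的非负信号看作概率密度,对参考密度(此处取 [0,1] 上的均匀分布)计算其逆 CDF 采样。下面的 NumPy 示意演示了平移在 CDT 域近似变成常数偏移,这正是特征空间线性可分的来源之一;离散化方式为本文假设:

```python
import numpy as np

def cdt(signal, n_quantiles=64):
    # 一维 CDT: 信号归一化为密度后, 对均匀参考密度取逆 CDF 的等距采样
    s = np.maximum(signal, 0)
    pdf = s / s.sum()
    cdf = np.cumsum(pdf)
    x = np.linspace(0, 1, len(signal))
    q = (np.arange(n_quantiles) + 0.5) / n_quantiles
    return np.interp(q, cdf, x)          # 在 cdf 上插值反解出 x

x = np.linspace(0, 1, 256)
f = np.exp(-0.5 * ((x - 0.3) / 0.05) ** 2)   # 一个高斯凸包
g = np.exp(-0.5 * ((x - 0.7) / 0.05) ** 2)   # 平移后的同形状信号
# 平移在 CDT 域表现为近似常数偏移 (约 0.4)
print(np.round(cdt(g) - cdt(f), 3)[:8])
```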
zh
[CV-107] Identification of Deforestation Areas in the Amazon Rainforest Using Change Detection Models
【速读】:该论文旨在解决当前基于机器学习的森林砍伐检测模型在准确性和方法学标准化方面存在的不足,具体包括:现有模型效果不佳、未采用现代深度学习架构(如基于自注意力机制的Transformer),以及缺乏统一评估框架导致不同研究结果难以直接比较。其解决方案的关键在于构建一个统一的数据集对多种变化检测模型进行系统评估,涵盖全卷积网络和引入Transformer结构的自注意力机制模型,并通过优化预处理与后处理策略(如基于连通区域大小过滤、纹理替换和图像增强)显著提升单个模型性能;此外,还探索了集成不同模型的融合策略,最终实现F1分数达80.41%,达到当前文献中先进水平。
链接: https://arxiv.org/abs/2512.08075
作者: Christian Massao Konishi,Helio Pedrini
机构: Universidade Estadual de Campinas (巴西坎皮纳斯州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The preservation of the Amazon Rainforest is one of the global priorities in combating climate change, protecting biodiversity, and safeguarding indigenous cultures. The Satellite-based Monitoring Project of Deforestation in the Brazilian Legal Amazon (PRODES), a project of the National Institute for Space Research (INPE), stands out as a fundamental initiative in this effort, annually monitoring deforested areas not only in the Amazon but also in other Brazilian biomes. Recently, machine learning models have been developed using PRODES data to support this effort through the comparative analysis of multitemporal satellite images, treating deforestation detection as a change detection problem. However, existing approaches present significant limitations: models evaluated in the literature still show unsatisfactory effectiveness, many do not incorporate modern architectures, such as those based on self-attention mechanisms, and there is a lack of methodological standardization that allows direct comparisons between different studies. In this work, we address these gaps by evaluating various change detection models in a unified dataset, including fully convolutional models and networks incorporating self-attention mechanisms based on Transformers. We investigate the impact of different pre- and post-processing techniques, such as filtering deforested areas predicted by the models based on the size of connected components, texture replacement, and image enhancements; we demonstrate that such approaches can significantly improve individual model effectiveness. Additionally, we test different strategies for combining the evaluated models to achieve results superior to those obtained individually, reaching an F1-score of 80.41%, a value comparable to other recent works in the literature.
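文中行之有效的后处理之一是按连通域面积过滤预测,剔除小连通块以抑制散点误报;用 scipy.ndimage 可以写得很短(面积阈值为假设值):

```python
import numpy as np
from scipy import ndimage

def filter_small_components(mask, min_pixels=64):
    # 移除二值预测掩码中面积 < min_pixels 的连通域 (默认 4 连通)
    labels, n = ndimage.label(mask)
    sizes = ndimage.sum(mask, labels, index=np.arange(1, n + 1))
    keep = np.zeros(n + 1, dtype=bool)
    keep[1:] = sizes >= min_pixels
    return keep[labels]

mask = np.zeros((128, 128), dtype=bool)
mask[10:40, 10:40] = True                    # 一块真实毁林区域
mask[100, 100] = True                        # 一个孤立误报像素
cleaned = filter_small_components(mask)
print(mask.sum(), "->", cleaned.sum())       # 901 -> 900
```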
zh
[CV-108] Mask to Adapt: Simple Random Masking Enables Robust Continual Test-Time Learning
【速读】:该论文旨在解决测试时分布偏移(distribution shifts at test time)导致图像分类器性能下降的问题,尤其针对连续测试时适应(Continual Test-Time Adaptation, CTTA)方法中依赖复杂设计(如校准不确定性或稳定注意力分数)的局限性。其解决方案的关键在于提出一种简化的Mask to Adapt (M2A) 方法:通过随机生成短序列的掩码视图(空间或频率域),并结合两个核心目标——掩码一致性损失(mask consistency loss)以对齐不同视图下的预测结果,以及熵最小化损失(entropy minimization loss)以促使模型输出更置信的预测。实验表明,仅使用简单随机掩码即可实现与现有强基线相当甚至更优的性能,证明了无需依赖特定掩码设计或外部信号(如不确定性估计)即可有效驱动测试时自适应。
链接: https://arxiv.org/abs/2512.08048
作者: Chandler Timm C. Doloriel
机构: Norwegian University of Life Sciences (挪威生命科学大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ongoing work
Abstract:Distribution shifts at test time degrade image classifiers. Recent continual test-time adaptation (CTTA) methods use masking to regulate learning, but often depend on calibrated uncertainty or stable attention scores and introduce added complexity. We ask: do we need custom-made masking designs, or can a simple random masking schedule suffice under strong corruption? We introduce Mask to Adapt (M2A), a simple CTTA approach that generates a short sequence of masked views (spatial or frequency) and adapts with two objectives: a mask consistency loss that aligns predictions across different views and an entropy minimization loss that encourages confident outputs. Motivated by masked image modeling, we study two common masking families, spatial masking and frequency masking, and further compare subtypes within each (spatial: patch vs. pixel; frequency: all vs. low vs. high). On CIFAR10C/CIFAR100C/ImageNetC (severity 5), M2A (Spatial) attains 8.3%/19.8%/39.2% mean error, outperforming or matching strong CTTA baselines, while M2A (Frequency) lags behind. Ablations further show that simple random masking is effective and robust. These results indicate that a simple random masking schedule, coupled with consistency and entropy objectives, is sufficient to drive effective test-time adaptation without relying on uncertainty or attention signals.
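下面用一段 PyTorch 伪实现示意 M2A 的两个目标(非官方代码:掩码粒度、一致性损失取 MSE、视图数等均为假设性选择):

```python
import torch
import torch.nn.functional as F

def random_patch_mask(x, ratio=0.5, patch=16):
    """Randomly zero out square patches of the input batch (spatial masking)."""
    B, C, H, W = x.shape
    keep = (torch.rand(B, 1, H // patch, W // patch, device=x.device) > ratio).float()
    mask = F.interpolate(keep, size=(H, W), mode="nearest")
    return x * mask

def m2a_losses(model, x, n_views=4):
    """Mask consistency + entropy minimization over several masked views."""
    probs = [F.softmax(model(random_patch_mask(x)), dim=-1) for _ in range(n_views)]
    p = torch.stack(probs)                 # (V, B, C)
    p_mean = p.mean(dim=0)                 # average prediction across views
    l_cons = F.mse_loss(p, p_mean.expand_as(p))                  # align views
    l_ent = -(p_mean * torch.log(p_mean + 1e-8)).sum(-1).mean()  # confident outputs
    return l_cons + l_ent
```

实际 CTTA 中通常只在测试流上以小学习率更新部分参数(如归一化层),此处从略。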
zh
[CV-109] Towards Sustainable Universal Deepfake Detection with Frequency-Domain Masking
【速读】:该论文旨在解决通用深度伪造检测(universal deepfake detection)中模型对未见过的生成式AI(Generative AI)图像泛化能力不足的问题,同时降低计算开销以支持大规模筛查,契合绿色人工智能(Green AI)的发展需求。其解决方案的关键在于引入频域掩码(frequency-domain masking)作为训练策略,通过随机掩码和几何变换增强模型鲁棒性,其中频域掩码因其优异的泛化性能成为核心手段。实验表明,该方法在GAN与扩散模型生成的图像数据集上均达到当前最优的泛化效果,并在结构化剪枝下保持稳定性能,为可持续且可扩展的深度伪造检测提供了有效路径。
链接: https://arxiv.org/abs/2512.08042
作者: Chandler Timm C. Doloriel,Habib Ullah,Kristian Hovde Liland,Fadi Al Machot,Ngai-Man Cheung
机构: Norwegian University of Life Sciences (NMBU); Singapore University of Technology and Design (SUTD)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Universal deepfake detection aims to identify AI-generated images across a broad range of generative models, including unseen ones. This requires robust generalization to new and unseen deepfakes, which emerge frequently, while minimizing computational overhead to enable large-scale deepfake screening, a critical objective in the era of Green AI. In this work, we explore frequency-domain masking as a training strategy for deepfake detectors. Unlike traditional methods that rely heavily on spatial features or large-scale pretrained models, our approach introduces random masking and geometric transformations, with a focus on frequency masking due to its superior generalization properties. We demonstrate that frequency masking not only enhances detection accuracy across diverse generators but also maintains performance under significant model pruning, offering a scalable and resource-conscious solution. Our method achieves state-of-the-art generalization on GAN- and diffusion-generated image datasets and exhibits consistent robustness under structured pruning. These results highlight the potential of frequency-based masking as a practical step toward sustainable and generalizable deepfake detection. Code and models are available at: this https URL.
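频域掩码本身实现很简单,下面给出一个示意(频带阈值 0.5、掩码比例等均为假设值,仅说明"在 FFT 系数上随机置零再逆变换"的思路,非论文官方实现):

```python
import torch

def frequency_mask(x, mask_ratio=0.3, band="all"):
    """Randomly zero out FFT coefficients of an image batch, then invert."""
    Xf = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    B, C, H, W = x.shape
    yy, xx = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
    r = (xx ** 2 + yy ** 2).sqrt().to(x.device)     # normalized radial frequency
    if band == "low":
        region = r < 0.5
    elif band == "high":
        region = r >= 0.5
    else:                                           # "all": whole spectrum
        region = torch.ones_like(r, dtype=torch.bool)
    drop = (torch.rand(B, 1, H, W, device=x.device) < mask_ratio) & region
    Xf = Xf.masked_fill(drop, 0)
    return torch.fft.ifft2(torch.fft.ifftshift(Xf, dim=(-2, -1))).real
```

训练时把 frequency_mask 当作数据增广套在输入上即可,这与摘要中"随机掩码 + 几何变换"的策略一致。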
zh
[CV-110] Lost in Translation Found in Embeddings: Sign Language Translation and Alignment
【速读】:该论文旨在解决手语理解中的两个核心问题:手语翻译(Sign Language Translation, SLT)与手语字幕对齐(Sign-Subtitle Alignment, SSA),二者共同实现连续手语视频到口语文本的转换及时间上的精准对齐,这对于实际交流、大规模语料库构建和教育应用至关重要。解决方案的关键在于提出一个统一模型架构,包含三个核心组件:(i) 轻量级视觉主干网络,通过人体关键点和唇部区域图像捕捉手部与非手部线索,同时保护签名者隐私;(ii) 滑动感知器映射网络(Sliding Perceiver mapping network),将连续视觉特征聚合为词级别嵌入以弥合视觉与文本之间的鸿沟;(iii) 多任务可扩展训练策略,联合优化SLT与SSA任务,强化语言与时间对齐能力。此外,模型在涵盖英国手语(BSL)和美国手语(ASL)的大规模语料库上进行预训练,显著提升了跨语言泛化能力,并在BOBSL数据集上实现了SLT与SSA的最先进性能,且在How2Sign(ASL)上展现出强大的零样本迁移与微调效果。
链接: https://arxiv.org/abs/2512.08040
作者: Youngjoon Jang,Liliane Momeni,Zifan Jiang,Joon Son Chung,Gül Varol,Andrew Zisserman
机构: VGG, University of Oxford (牛津大学视觉几何组); KAIST (韩国科学技术院); University of Zurich (苏黎世大学); LIGM, École des Ponts, IP Paris, UGE, CNRS (法国巴黎路桥学院、巴黎文理研究大学、高等工程学院、国家科学研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Our aim is to develop a unified model for sign language understanding, that performs sign language translation (SLT) and sign-subtitle alignment (SSA). Together, these two tasks enable the conversion of continuous signing videos into spoken language text and also the temporal alignment of signing with subtitles – both essential for practical communication, large-scale corpus construction, and educational applications. To achieve this, our approach is built upon three components: (i) a lightweight visual backbone that captures manual and non-manual cues from human keypoints and lip-region images while preserving signer privacy; (ii) a Sliding Perceiver mapping network that aggregates consecutive visual features into word-level embeddings to bridge the vision-text gap; and (iii) a multi-task scalable training strategy that jointly optimises SLT and SSA, reinforcing both linguistic and temporal alignment. To promote cross-linguistic generalisation, we pretrain our model on large-scale sign-text corpora covering British Sign Language (BSL) and American Sign Language (ASL) from the BOBSL and YouTube-SL-25 datasets. With this multilingual pretraining and strong model design, we achieve state-of-the-art results on the challenging BOBSL (BSL) dataset for both SLT and SSA. Our model also demonstrates robust zero-shot generalisation and finetuned SLT performance on How2Sign (ASL), highlighting the potential of scalable translation across different sign languages.
zh
[CV-111] SSplain: Sparse and Smooth Explainer for Retinopathy of Prematurity Classification
【速读】:该论文旨在解决医学影像诊断中黑箱神经网络模型缺乏可解释性的问题,特别是在早产儿视网膜病变(Retinopathy of Prematurity, ROP)的分类任务中,现有解释方法无法有效保留输入图像的结构特性(如平滑性和稀疏性),导致生成的解释不具临床可信度。解决方案的关键在于提出一种名为Sparse and Smooth Explainer (SSplain) 的新方法,通过引入组合约束优化问题,并利用交替方向乘子法(Alternating Direction Method of Multipliers, ADMM)求解,从而在像素级层面生成既保持图像结构特征又具备稀疏性的解释结果,显著提升了解释的真实性和与临床先验知识的一致性。
链接: https://arxiv.org/abs/2512.08038
作者: Elifnur Sunger,Tales Imbiriba,Peter Campbell,Deniz Erdogmus,Stratis Ioannidis,Jennifer Dy
机构: Northeastern University (东北大学); University of Massachusetts Boston (马萨诸塞大学波士顿分校); Oregon Health & Science University (俄勒冈健康与科学大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 16 figures
Abstract:Neural networks are frequently used in medical diagnosis. However, due to their black-box nature, model explainers are used to help clinicians understand better and trust model outputs. This paper introduces an explainer method for classifying Retinopathy of Prematurity (ROP) from fundus images. Previous methods fail to generate explanations that preserve input image structures such as smoothness and sparsity. We introduce Sparse and Smooth Explainer (SSplain), a method that generates pixel-wise explanations while preserving image structures by enforcing smoothness and sparsity. This results in realistic explanations to enhance the understanding of the given black-box model. To achieve this goal, we define an optimization problem with combinatorial constraints and solve it using the Alternating Direction Method of Multipliers (ADMM). Experimental results show that SSplain outperforms commonly used explainers in terms of both post-hoc accuracy and smoothness analyses. Additionally, SSplain identifies features that are consistent with domain-understandable features that clinicians consider as discriminative factors for ROP. We also show SSplain’s generalization by applying it to additional publicly available datasets. Code is available at this https URL.
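摘要描述的优化问题可以写成如下示意形式("平滑 + 稀疏"的具体罚项与约束写法为假设,仅用来说明 ADMM 在此处的作用):

$$\min_{m\in[0,1]^{H\times W}} \;\; \mathcal{L}\big(\Phi(x \odot m),\, y\big) \;+\; \lambda\,\|\nabla m\|_2^2 \qquad \text{s.t.}\;\; \|m\|_0 \le k,$$

其中 $\Phi$ 是待解释的黑盒分类器,$m$ 是逐像素解释掩码,$\odot$ 为逐元素乘法。ADMM 通过引入辅助变量 $z = m$ 把问题拆开:对 $m$ 的更新处理光滑项,对 $z$ 的更新则是 $\ell_0$ 球上的投影(保留绝对值最大的 $k$ 个分量),二者交替直至收敛。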
zh
[CV-112] CLARITY: Medical World Model for Guiding Treatment Decisions by Modeling Context-Aware Disease Trajectories in Latent Space
【速读】:该论文旨在解决当前静态人工智能(AI)预测模型无法有效模拟肿瘤动态演进过程的问题,尤其是在临床决策中缺乏对患者个体化时间与临床背景的建模能力。其核心挑战在于现有医学世界模型(Medical World Models, MeWM)多依赖随机扩散机制进行图像重建,忽视了生理因果关系和治疗条件下的轨迹演化,并且缺乏将预测结果转化为可操作决策的反馈机制。解决方案的关键在于提出CLARITY——一种在结构化潜在空间中直接预测疾病演进的医学世界模型,通过显式整合时间间隔(temporal context)和患者特异性数据(clinical context),将治疗条件下的进展建模为平滑、可解释的轨迹,从而生成生理上合理且个性化的治疗方案;同时引入“预测到决策”框架,将潜在空间中的滚动预测转化为透明、可执行的临床建议,显著提升了治疗规划的准确性与实用性。
链接: https://arxiv.org/abs/2512.08029
作者: Tianxingjian Ding,Yuanhao Zou,Chen Chen,Mubarak Shah,Yu Tian
机构: University of Central Florida (中佛罗里达大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Clinical decision-making in oncology requires predicting dynamic disease evolution, a task current static AI predictors cannot perform. While world models (WMs) offer a paradigm for generative prediction, existing medical applications remain limited. Existing methods often rely on stochastic diffusion models, focusing on visual reconstruction rather than causal, physiological transitions. Furthermore, in medical domain, models like MeWM typically ignore patient-specific temporal and clinical contexts and lack a feedback mechanism to link predictions to treatment decisions. To address these gaps, we introduce CLARITY, a medical world model that forecasts disease evolution directly within a structured latent space. It explicitly integrates time intervals (temporal context) and patient-specific data (clinical context) to model treatment-conditioned progression as a smooth, interpretable trajectory, and thus generate physiologically faithful, individualized treatment plans. Finally, CLARITY introduces a novel prediction-to-decision framework, translating latent rollouts into transparent, actionable recommendations. CLARITY demonstrates state-of-the-art performance in treatment planning. On the MU-Glioma-Post dataset, our approach outperforms recent MeWM by 12%, and significantly surpasses all other medical-specific large language models.
zh
[CV-113] FRIEDA: Benchmarking Multi-Step Cartographic Reasoning in Vision-Language Models
【速读】:该论文旨在解决当前大型视觉语言模型(Large Vision Language Models, LVLMs)在复杂地图认知推理(cartographic reasoning)能力上的显著不足问题,尤其是其对多图空间关系理解与跨图定位推理的薄弱表现。现有研究常将地图视为图表(chart)的特例进行评估,忽略了地图特有的分层符号系统(如符号、几何图形和文本标签)及与方向和距离相关的空间关系。为此,作者提出FRIEDA基准测试集,其关键在于:(1)基于地理信息系统(GIS)文献分类体系,覆盖拓扑(topological)、度量(metric)和方向(directional)三类空间关系;(2)设计需多步推理且多数涉及跨图接地(cross-map grounding)的问题;(3)设置两种评测场景——直接设置(提供相关地图)与上下文设置(模型需先识别相关地图),从而全面评估LVLMs的空间智能水平。实验表明,即使最强模型如Gemini-2.5-Pro和GPT-5-Think在FRIEDA上准确率也仅达38.20%和37.20%,远低于人类水平(84.87%),凸显了该任务的挑战性与FRIEDA作为严谨基准对推动LVLM空间认知能力发展的必要性。
链接: https://arxiv.org/abs/2512.08016
作者: Jiyoon Pyo,Yuankun Jiao,Dongwon Jung,Zekun Li,Leeje Jang,Sofia Kirsanova,Jina Kim,Yijun Lin,Qin Liu,Junyi Xie,Hadi Askari,Nan Xu,Muhao Chen,Yao-Yi Chiang
机构: University of Minnesota-Twin Cities (明尼苏达大学双城分校); University of California, Davis (加州大学戴维斯分校); Google(谷歌)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Cartographic reasoning is the skill of interpreting geographic relationships by aligning legends, map scales, compass directions, map texts, and geometries across one or more map images. Although essential as a concrete cognitive capability and for critical tasks such as disaster response and urban planning, it remains largely unevaluated. Building on progress in chart and infographic understanding, recent large vision language model studies on map visual question-answering often treat maps as a special case of charts. In contrast, map VQA demands comprehension of layered symbology (e.g., symbols, geometries, and text labels) as well as spatial relations tied to orientation and distance that often span multiple maps and are not captured by chart-style evaluations. To address this gap, we introduce FRIEDA, a benchmark for testing complex open-ended cartographic reasoning in LVLMs. FRIEDA sources real map images from documents and reports in various domains and geographical areas. Following classifications in Geographic Information System (GIS) literature, FRIEDA targets all three categories of spatial relations: topological (border, equal, intersect, within), metric (distance), and directional (orientation). All questions require multi-step inference, and many require cross-map grounding and reasoning. We evaluate eleven state-of-the-art LVLMs under two settings: (1) the direct setting, where we provide the maps relevant to the question, and (2) the contextual setting, where the model may have to identify the maps relevant to the question before reasoning. Even the strongest models, Gemini-2.5-Pro and GPT-5-Think, achieve only 38.20% and 37.20% accuracy, respectively, far below human performance of 84.87%. These results reveal a persistent gap in multi-step cartographic reasoning, positioning FRIEDA as a rigorous benchmark to drive progress on spatial intelligence in LVLMs.
zh
[CV-114] DIJIT: A Robotic Head for an Active Observer
【速读】:该论文旨在解决移动机器人在主动视觉(active vision)研究中缺乏类人眼-头协同运动能力的问题,尤其关注人类视觉系统中眼动与头部运动的交互机制及其对视觉任务性能的影响。解决方案的关键在于设计了一种具有九个机械自由度(mechanical degrees of freedom)和四个光学自由度(optical degrees of freedom)的双目机器人头部DIJIT,其运动范围和速度可媲美人类,并支持会聚立体视觉所需的关键运动模式(如 vergence、version 和 cyclotorsion)。此外,论文提出一种新的快速眼动式相机运动方法(saccadic camera movements),通过建立相机姿态与电机控制值之间的直接映射关系,使相机运动在准确性上接近人类眼动行为,从而为机器视觉与人类视觉的对比研究提供了高保真的实验平台。
链接: https://arxiv.org/abs/2512.07998
作者: Mostafa Kamali Tabrizi,Mingshi Chi,Bir Bikram Dey,Yu Qing Yuan,Markus D. Solbach,Yiqian Liu,Michael Jenkin,John K. Tsotsos
机构: York University (约克大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present DIJIT, a novel binocular robotic head expressly designed for mobile agents that behave as active observers. DIJIT’s unique breadth of functionality enables active vision research and the study of human-like eye and head-neck motions, their interrelationships, and how each contributes to visual ability. DIJIT is also being used to explore the differences between how human vision employs eye/head movements to solve visual tasks and current computer vision methods. DIJIT’s design features nine mechanical degrees of freedom, while the cameras and lenses provide an additional four optical degrees of freedom. The ranges and speeds of the mechanical design are comparable to human performance. Our design includes the ranges of motion required for convergent stereo, namely, vergence, version, and cyclotorsion. The exploration of the utility of these to both human and machine vision is ongoing. Here, we present the design of DIJIT and evaluate aspects of its performance. We present a new method for saccadic camera movements. In this method, a direct relationship between camera orientation and motor values is developed. The resulting saccadic camera movements are close to human movements in terms of their accuracy.
zh
[CV-115] Restrictive Hierarchical Semantic Segmentation for Stratified Tooth Layer Detection
【速读】:该论文旨在解决牙科影像中细粒度解剖结构(如牙层和牙槽骨)的语义分割问题,现有方法依赖损失函数间接编码解剖层次关系,监督信号弱且不明确。解决方案的关键在于提出一种通用框架,通过显式嵌入解剖层次结构来增强模型性能:具体包括采用递归的逐级预测机制、限制性输出头设计以及自顶向下的特征条件化策略——在每一层级上,骨干网络重新处理原始图像与前一级 logits 的拼接输入,子类特征通过父类概率进行特征调制(Feature-wise Linear Modulation),从而实现细粒度检测;同时引入概率组合规则约束父类与子类间的逻辑一致性,并结合分层损失(包含加权 Dice 损失、交叉熵损失及一致性项损失)确保父类预测等于其子类之和。该方法显著提升了 IoU、Dice 和召回率,尤其在数据稀缺的牙科影像场景下增强了分割结果的临床合理性。
链接: https://arxiv.org/abs/2512.07984
作者: Ryan Banks,Camila Lindoni Azevedo,Hongying Tang,Yunpeng Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 13 pages, 7 figures, 3 tables
Abstract:Accurate understanding of anatomical structures is essential for reliably staging certain dental diseases. A way of introducing this within semantic segmentation models is by utilising hierarchy-aware methodologies. However, existing hierarchy-aware segmentation methods largely encode anatomical structure through the loss functions, providing weak and indirect supervision. We introduce a general framework that embeds an explicit anatomical hierarchy into semantic segmentation by coupling a recurrent, level-wise prediction scheme with restrictive output heads and top-down feature conditioning. At each depth of the class tree, the backbone is re-run on the original image concatenated with logits from the previous level. Child class features are conditioned using Feature-wise Linear Modulation of their parent class probabilities, to modulate child feature spaces for fine-grained detection. A probabilistic composition rule enforces consistency between parent and descendant classes. The hierarchical loss combines per-level class-weighted Dice and cross-entropy losses with a consistency term that encourages parent predictions to equal the sum of their children. We validate our approach on our proposed dataset, TL-pano, containing 194 panoramic radiographs with dense instance and semantic segmentation annotations of tooth layers and alveolar bone. Utilising UNet and HRNet as donor models across a 5-fold cross-validation scheme, the hierarchical variants consistently increase IoU, Dice, and recall, particularly for fine-grained anatomies, and produce more anatomically coherent masks. However, the hierarchical variants also showed higher recall than precision, implying more false positives. The results demonstrate that explicit hierarchical structuring improves both performance and clinical plausibility, especially in low-data dental imaging regimes.
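下面用 PyTorch 片段示意其中两个关键机制:父类概率对子类特征的 FiLM 调制,以及"父类预测 = 子类之和"的一致性损失(模块与损失的具体形式为假设性示意,非论文官方代码):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FiLMFromParent(nn.Module):
    """Feature-wise Linear Modulation of child features by parent probabilities."""
    def __init__(self, n_parent, n_feat):
        super().__init__()
        self.to_gamma = nn.Linear(n_parent, n_feat)
        self.to_beta = nn.Linear(n_parent, n_feat)

    def forward(self, feat, parent_prob):
        # feat: (B, F, H, W); parent_prob: (B, P, H, W)
        p = parent_prob.permute(0, 2, 3, 1)              # (B, H, W, P)
        gamma = self.to_gamma(p).permute(0, 3, 1, 2)     # (B, F, H, W)
        beta = self.to_beta(p).permute(0, 3, 1, 2)
        return gamma * feat + beta

def parent_child_consistency(parent_prob, child_prob, children_of):
    """Penalize deviation of each parent probability from the sum of its children.
    children_of: dict mapping parent index -> list of child indices."""
    loss = 0.0
    for parent, kids in children_of.items():
        loss = loss + F.mse_loss(child_prob[:, kids].sum(dim=1), parent_prob[:, parent])
    return loss
```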
zh
[CV-116] CIP-Net: Continual Interpretable Prototype-based Network
【速读】:该论文旨在解决持续学习(continual learning)中的灾难性遗忘(catastrophic forgetting)问题,即模型在学习新任务时会显著退化对先前任务的性能。现有可解释人工智能(explainable AI)方法多依赖于事后解释(post-hoc explanations)或为每个新任务额外存储示例(exemplar),导致扩展性受限。论文提出CIP-Net,一种无需存储历史样本的自解释原型模型(exemplar-free self-explainable prototype-based model),其核心创新在于通过轻量级架构实现知识保留与可解释性:模型在预测过程中动态生成解释,从而有效缓解遗忘,同时显著降低内存开销,在任务增量(task-incremental)和类别增量(class-incremental)场景下均达到当前最优性能。
链接: https://arxiv.org/abs/2512.07981
作者: Federico Di Valerio,Michela Proietti,Alessio Ragno,Roberto Capobianco
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Continual learning constrains models to learn new tasks over time without forgetting what they have already learned. A key challenge in this setting is catastrophic forgetting, where learning new information causes the model to lose its performance on previous tasks. Recently, explainable AI has been proposed as a promising way to better understand and reduce forgetting. In particular, self-explainable models are useful because they generate explanations during prediction, which can help preserve knowledge. However, most existing explainable approaches use post-hoc explanations or require additional memory for each new task, resulting in limited scalability. In this work, we introduce CIP-Net, an exemplar-free self-explainable prototype-based model designed for continual learning. CIP-Net avoids storing past examples and maintains a simple architecture, while still providing useful explanations and strong performance. We demonstrate that CIP-Net achieves state-of-the-art performance compared to previous exemplar-free and self-explainable methods in both task- and class-incremental settings, while bearing significantly lower memory-related overhead. This makes it a practical and interpretable solution for continual learning.
zh
[CV-117] VLD: Visual Language Goal Distance for Reinforcement Learning Navigation
【速读】:该论文旨在解决机器人系统从图像数据中端到端训练导航策略时面临的两大挑战:一是策略迁移过程中存在的“仿真到现实”(sim-to-real)差距,二是带动作标签的训练数据有限。其解决方案的关键在于提出视觉-语言距离(Vision-Language Distance, VLD)学习框架,通过将感知学习与策略学习解耦,先在大规模视频数据上训练一个自监督的距离到目标预测器(distance-to-goal predictor),该预测器可同时支持图像和文本形式的目标描述,从而提供一种语义化的距离信号;随后在仿真环境中使用特权几何距离信号训练强化学习(Reinforcement Learning, RL)策略,并注入噪声以模拟预测器的不确定性;部署时,策略直接消费VLD预测结果,既继承了大规模视觉训练中的语义目标信息(即“去哪”),又保留了在仿真中学习到的鲁棒低层导航行为。这一解耦设计显著提升了导航策略的泛化能力与多模态灵活性。
链接: https://arxiv.org/abs/2512.07976
作者: Lazar Milikic,Manthan Patel,Jonas Frey
机构: ETH Zurich (苏黎世联邦理工学院); EPFL (洛桑联邦理工学院); Stanford University (斯坦福大学); UC Berkeley (加州大学伯克利分校)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Training end-to-end policies from image data to directly predict navigation actions for robotic systems has proven inherently difficult. Existing approaches often suffer from either the sim-to-real gap during policy transfer or a limited amount of training data with action labels. To address this problem, we introduce Vision-Language Distance (VLD) learning, a scalable framework for goal-conditioned navigation that decouples perception learning from policy learning. Instead of relying on raw sensory inputs during policy training, we first train a self-supervised distance-to-goal predictor on internet-scale video data. This predictor generalizes across both image- and text-based goals, providing a distance signal that can be minimized by a reinforcement learning (RL) policy. The RL policy can be trained entirely in simulation using privileged geometric distance signals, with injected noise to mimic the uncertainty of the trained distance predictor. At deployment, the policy consumes VLD predictions, inheriting semantic goal information ("where to go") from large-scale visual training while retaining the robust low-level navigation behaviors learned in simulation. We propose using ordinal consistency to assess distance functions directly and demonstrate that VLD outperforms prior temporal distance approaches, such as ViNT and VIP. Experiments show that our decoupled design achieves competitive navigation performance in simulation while supporting flexible goal modalities, providing an alternative and, most importantly, scalable path toward reliable, multimodal navigation policies.
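摘要提出用"序一致性(ordinal consistency)"直接评估距离函数。其直觉是:在一条成功到达目标的轨迹上,真实的到目标距离应随时间单调下降,因此可统计预测距离满足正确顺序的帧对比例。以下是一个朴素实现示意(指标的确切定义以论文为准):

```python
def ordinal_consistency(pred_dist):
    """Fraction of frame pairs (i < j) along a goal-reaching trajectory whose
    predicted distances-to-goal are correctly ordered (d_i > d_j)."""
    n, correct, total = len(pred_dist), 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            total += 1
            if pred_dist[i] > pred_dist[j]:
                correct += 1
    return correct / max(total, 1)
```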
zh
[CV-118] Sparse Variable Projection in Robotic Perception: Exploiting Separable Structure for Efficient Nonlinear Optimization
【速读】:该论文旨在解决机器人感知中大规模非线性最小二乘(Nonlinear Least-Squares, NLS)问题的计算效率瓶颈问题,特别是针对传统变量投影(Variable Projection, VarPro)方法在处理具有规范对称性(gauge symmetries,如全局平移和旋转不变性)时所面临的计算挑战。解决方案的关键在于提出一种专为含规范对称性的NLS问题设计的VarPro方案,该方案能够同时利用问题的可分离性(separability)与稀疏性(sparsity),通过一次预处理构建一个无矩阵存储的Schur补算子(matrix-free Schur complement operator),从而高效计算约简后问题的目标函数、梯度及Hessian-向量积,并无缝集成到标准迭代NLS求解器中,显著提升运行效率(实测加速比达2–35倍)且保持精度。
链接: https://arxiv.org/abs/2512.07969
作者: Alan Papalia,Nikolas Sanderson,Haoyu Han,Heng Yang,Hanumant Singh,Michael Everett
机构: Northeastern University (东北大学); University of Michigan (密歇根大学); Harvard University (哈佛大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, submitted for review
Abstract:Robotic perception often requires solving large nonlinear least-squares (NLS) problems. While sparsity has been well-exploited to scale solvers, a complementary and underexploited structure is separability, where some variables (e.g., visual landmarks) appear linearly in the residuals and, for any estimate of the remaining variables (e.g., poses), have a closed-form solution. Variable projection (VarPro) methods are a family of techniques that exploit this structure by analytically eliminating the linear variables and presenting a reduced problem in the remaining variables that has favorable properties. However, VarPro has seen limited use in robotic perception; a major challenge arises from gauge symmetries (e.g., cost invariance to global shifts and rotations), which are common in perception and induce specific computational challenges in standard VarPro approaches. We present a VarPro scheme designed for problems with gauge symmetries that jointly exploits separability and sparsity. Our method can be applied as a one-time preprocessing step to construct a matrix-free Schur complement operator. This operator allows efficient evaluation of costs, gradients, and Hessian-vector products of the reduced problem and readily integrates with standard iterative NLS solvers. We provide precise conditions under which our method applies, and describe extensions when these conditions are only partially met. Across synthetic and real benchmarks in SLAM, SNL, and SfM, our approach achieves up to 2x-35x faster runtimes than state-of-the-art methods while maintaining accuracy. We release an open-source C++ implementation and all datasets from our experiments.
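变量投影的核心可用标准形式概括如下(通用 VarPro 写法,非论文特有记号):设残差对线性变量 $x$ 仿射,即 $r(\theta, x) = A(\theta)x - b$,则对任意 $\theta$ 有闭式解,原问题约简为只含 $\theta$ 的问题:

$$x^{\ast}(\theta) = A(\theta)^{+} b, \qquad \min_{\theta, x}\,\|A(\theta)x - b\|_2^2 \;=\; \min_{\theta}\,\big\|\big(I - A(\theta)A(\theta)^{+}\big)\,b\big\|_2^2 .$$

约简问题的梯度与 Hessian-向量积可经由 Schur 补以无矩阵(matrix-free)方式计算;论文要处理的正是规范对称性导致 $A(\theta)$ 列秩亏、伪逆 $A^{+}$ 不再平凡时的情形。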
zh
[CV-119] Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality
【速读】:该论文旨在解决长视频序列中人脸替换(face swapping)的高保真度(fidelity)与时间一致性(temporal consistency)难题,尤其是在复杂运动和光照变化下保持目标身份稳定的问题。其解决方案的关键在于提出首个基于视频参考引导的人脸替换模型 LivingSwap,通过关键帧(keyframes)作为条件信号注入目标身份,并结合视频参考引导机制实现时间上的拼接(temporal stitching),从而在保证高保真重建的同时显著提升跨帧的身份一致性。此外,为缓解参考引导训练数据稀缺问题,作者构建了配对数据集 Face2Face 并进行数据对翻转以增强监督信号可靠性。
链接: https://arxiv.org/abs/2512.07951
作者: Zekai Luo,Zongze Du,Zhouhang Zhu,Hao Zhong,Muzhi Zhu,Wen Wang,Yuling Xi,Chenchen Jing,Hao Chen,Chunhua Shen
机构: Zhejiang University (浙江大学); Zhejiang University of Technology (浙江工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project webpage: this https URL
Abstract:Video face swapping is crucial in film and entertainment production, where achieving high fidelity and temporal consistency over long and complex video sequences remains a significant challenge. Inspired by recent advances in reference-guided image editing, we explore whether rich visual attributes from source videos can be similarly leveraged to enhance both fidelity and temporal coherence in video face swapping. Building on this insight, this work presents LivingSwap, the first video reference guided face swapping model. Our approach employs keyframes as conditioning signals to inject the target identity, enabling flexible and controllable editing. By combining keyframe conditioning with video reference guidance, the model performs temporal stitching to ensure stable identity preservation and high-fidelity reconstruction across long video sequences. To address the scarcity of data for reference-guided training, we construct a paired face-swapping dataset, Face2Face, and further reverse the data pairs to ensure reliable ground-truth supervision. Extensive experiments demonstrate that our method achieves state-of-the-art results, seamlessly integrating the target identity with the source video’s expressions, lighting, and motion, while significantly reducing manual effort in production workflows. Project webpage: this https URL
zh
[CV-120] Near-real time fires detection using satellite imagery in Sudan conflict
【速读】:该论文旨在解决武装冲突中火灾损毁区域难以实现快速监测与分析的问题(rapid monitoring and analysis of fire damage in armed conflicts)。其解决方案的关键在于将Planet Labs提供的四波段遥感影像(4-band imagery)与深度学习模型相结合,实现对冲突地区火灾痕迹的近实时识别与量化。研究表明,该自动化方法相比传统基线方法能更准确捕捉活跃火点及烧毁区域,且使用8波段影像或时间序列数据仅带来边际性能提升,凸显了所提方案在效率与精度之间的最优平衡。
链接: https://arxiv.org/abs/2512.07925
作者: Kuldip Singh Atwal,Dieter Pfoser,Daniel Rothbart
机构: George Mason University (乔治梅森大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:The challenges of the ongoing war in Sudan highlight the need for rapid monitoring and analysis of such conflicts. Advances in deep learning and readily available satellite remote sensing imagery allow for near real-time monitoring. This paper uses 4-band imagery from Planet Labs with a deep learning model to show that fire damage in armed conflicts can be monitored with minimal delay. We demonstrate the effectiveness of our approach using five case studies in Sudan. We show that, compared to a baseline, the automated method captures the active fires and charred areas more accurately. Our results indicate that using 8-band imagery or time series of such imagery yields only marginal gains.
zh
[CV-121] GSPN-2: Efficient Parallel Sequence Modeling NEURIPS2025
【速读】:该论文旨在解决视觉Transformer在高分辨率图像和长视频等实际应用中计算效率低下的问题,尤其是传统方法中自注意力机制的二次复杂度导致的性能瓶颈。其核心解决方案是提出GSPN-2,通过算法与系统协同设计实现显著优化:关键在于将此前数千次微小GPU内核调用合并为单一二维内核,显式地将每个线程束(warp)绑定到通道切片,并利用共享内存缓存前一列的激活值;同时在模型层面引入紧凑的通道传播策略,以结构化矩阵变换替代每通道独立的权重矩阵,从而减少参数量并自然契合Transformer中的亲和图(affinity map)。这一设计使GSPN-2在保持与Transformer相当精度的同时,大幅降低计算开销,确立了视觉任务中建模全局空间上下文的新效率边界。
链接: https://arxiv.org/abs/2512.07884
作者: Hongjun Wang,Yitong Jiang,Collin McCarthy,David Wehr,Hanrong Ye,Xinhao Li,Ka Chun Cheung,Wonmin Byeon,Jinwei Gu,Ke Chen,Kai Han,Hongxu Yin,Pavlo Molchanov,Jan Kautz,Sifei Liu
机构: NVIDIA; The University of Hong Kong (香港大学); University of California, San Diego (加州大学圣地亚哥分校)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: NeurIPS 2025
Abstract:The efficiency of vision transformers remains a bottleneck for real-world applications involving high-resolution images and long videos. Generalized Spatial Propagation Network (GSPN) addresses this by replacing quadratic self-attention with a line-scan propagation scheme, bringing the cost close to linear in the number of rows or columns, while retaining accuracy. Despite this advancement, the existing GSPN implementation still suffers from (i) heavy overhead due to repeatedly launching GPU kernels, (ii) excessive data transfers from global GPU memory, and (iii) redundant computations caused by maintaining separate propagation weights for each channel. We introduce GSPN-2, a joint algorithm-system redesign. In particular, we fuse thousands of micro-launches from the previous implementation into one single 2D kernel, explicitly pin one warp to each channel slice, and stage the previous column's activations in shared memory. On the model side, we introduce a compact channel propagation strategy that replaces per-channel matrices, trimming parameters and aligning naturally with the affinity map used in transformer attention. Experiments demonstrate GSPN-2's effectiveness across image classification and text-to-image synthesis tasks, matching transformer-level accuracy with significantly lower computational cost. GSPN-2 establishes a new efficiency frontier for modeling global spatial context in vision applications through its unique combination of structured matrix transformations and GPU-optimized implementation. Project page: this https URL
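GSPN 的线扫描传播本身可以用一个极简的 PyTorch 循环来理解(下面"每通道三个标量权重"的紧凑通道传播是示意性假设;GSPN-2 的贡献在于把这种逐列递推融合进单个 2D CUDA kernel,而非这里的朴素实现):

```python
import torch

def gspn_line_scan(x, w):
    """Naive left-to-right line scan over columns.
    x: (B, C, H, W) input; w: (C, 3) per-channel weights for the
    (upper, center, lower) neighbors in the previous column."""
    B, C, H, W = x.shape
    h = torch.zeros_like(x)
    prev = torch.zeros(B, C, H, device=x.device)
    for j in range(W):
        up = torch.roll(prev, 1, dims=-1); up[..., 0] = 0
        down = torch.roll(prev, -1, dims=-1); down[..., -1] = 0
        prev = x[..., j] + (w[:, 0].view(1, C, 1) * up
                            + w[:, 1].view(1, C, 1) * prev
                            + w[:, 2].view(1, C, 1) * down)
        h[..., j] = prev
    return h
```

每一列只依赖前一列,这正是摘要中"把前一列激活缓存在共享内存、每个 warp 绑定一个通道切片"等系统级优化的出发点。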
zh
[CV-122] LAPA: Log-Domain Prediction-Driven Dynamic Sparsity Accelerator for Transformer Model
【速读】:该论文旨在解决Transformer模型在多阶段处理过程中因输入序列变化导致的计算瓶颈动态性问题,现有单阶段稀疏化方法难以实现跨阶段的高效加速且存在显著功耗开销。其解决方案的关键在于提出一种基于对数域注意力预测的软硬件协同设计方法(LAPA),核心创新包括:1)设计不对称前导1计算(ALOC)方案以消除高成本乘法运算;2)引入混合精度多轮移位累加(MRSA)机制降低累积误差与计算开销;3)结合数据特征依赖滤波(DDF)策略优化稀疏性预测精度;最终通过专用加速器实现理论性能提升到实际硬件优化的转化,实验表明其能显著优于当前最优方案(Spatten、Sanger和FACT)。
链接: https://arxiv.org/abs/2512.07855
作者: Huizheng Wang,Hongbin Wang,Shaojun Wei,Yang Hu,Shouyi Yin
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Attention-based Transformers have revolutionized natural language processing (NLP) and shown strong performance in computer vision (CV) tasks. However, as the input sequence varies, the computational bottlenecks in Transformer models exhibit dynamic behavior across stages, which calls for a cross-stage sparse acceleration strategy. Unfortunately, most existing sparse Transformer approaches are single-stage based, and their sparsity prediction mechanisms lead to significant power overhead when applied across multiple stages. To this end, this paper proposes a log-domain attention prediction algorithm-architecture co-design, named LAPA. First, an asymmetric leading one computing (ALOC) scheme is designed to eliminate expensive multiplications. Next, a mixed-precision multi-round shifting accumulation (MRSA) mechanism is further proposed to mitigate the accumulation overhead. A data-feature dependent filter (DDF) strategy is designed to work in concert with the MRSA process. Finally, an elaborate accelerator is designed to translate the theoretical enhancement into practical hardware improvement. Experimental results show that LAPA achieves 3.52x, 3.24x and 2.79x higher energy efficiency than the state-of-the-art (SOTA) works Spatten, Sanger and FACT, respectively.
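"前导1"(leading one)的本质是用 $\lfloor\log_2|x|\rfloor$ 近似对数,把乘法变成指数相加(硬件上即移位)。下面的示意用这种近似粗估注意力打分 $QK^\top$,仅说明原理,并非 ALOC/MRSA 的具体实现:

```python
import torch

def leading_one(x, eps=1e-12):
    """Position of the leading one bit, i.e. floor(log2 |x|)."""
    return torch.floor(torch.log2(x.abs().clamp_min(eps)))

def log_domain_scores(Q, K):
    """Approximate Q @ K.T with sign * 2^(exponent sum) per element pair.
    Q: (n, d), K: (m, d); returns an (n, m) coarse score estimate."""
    kq, kk = leading_one(Q), leading_one(K)
    sq, sk = torch.sign(Q), torch.sign(K)
    terms = (sq[:, :, None] * sk.t()[None, :, :]) * 2.0 ** (kq[:, :, None] + kk.t()[None, :, :])
    return terms.sum(dim=1)
```

这种粗略打分足以预测哪些注意力条目可被剪枝,从而把昂贵的精确计算留给少数重要位置。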
zh
[CV-123] Detection of Cyberbullying in GIF using AI
【速读】:该论文旨在解决社交媒体中基于GIF动图的网络欺凌(cyberbullying)检测问题,这是当前研究相对匮乏的领域。现有方法主要聚焦于文本和图像内容的检测,而对动态表情包(GIF/stickers)中的网络欺凌行为缺乏有效识别手段。解决方案的关键在于构建了一个包含4100个GIF动图的标注数据集,并采用预训练的深度学习模型VGG16进行特征提取与分类,最终实现了97%的检测准确率,为后续针对非文本形式网络欺凌内容的自动化识别提供了可复用的数据资源和高效的技术路径。
链接: https://arxiv.org/abs/2512.07838
作者: Pal Dave,Xiaohong Yuan,Madhuri Siddula,Kaushik Roy
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
备注:
Abstract:Cyberbullying is a well-known social issue, and it is escalating day by day. Due to the vigorous development of the internet, social media provide many different ways for users to express their opinions and exchange information. Cyberbullying occurs on social media through text messages, comments, shared images and GIFs or stickers, and audio and video. Much research has been done to detect cyberbullying in textual data, and some methods are available for images, but very few studies address cyberbullying in GIFs/stickers. We collected a GIF dataset from Twitter and applied a deep learning model to detect cyberbullying in it. First, we extracted hashtags related to cyberbullying from Twitter and used them to download GIF files through GIPHY's publicly available API, collecting over 4100 GIFs spanning cyberbullying and non-cyberbullying content. We then applied the pre-trained deep learning model VGG16 for cyberbullying detection, achieving an accuracy of 97%. Our work also provides the GIF dataset for researchers working in this area.
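摘要中的检测器基于预训练 VGG16 做迁移学习,一个最小可运行示意如下(冻结卷积特征、替换分类头为二分类;GIF 通常需先抽帧再逐帧或聚合后送入网络,抽帧策略为假设,此处从略):

```python
import torch.nn as nn
from torchvision import models

def build_vgg16_classifier(n_classes=2):
    """VGG16 backbone with a fresh binary head; convolutional features frozen."""
    net = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
    for p in net.features.parameters():
        p.requires_grad = False                      # freeze pretrained features
    net.classifier[6] = nn.Linear(4096, n_classes)   # replace 1000-way ImageNet head
    return net
```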
zh
[CV-124] Self-Reinforced Deep Priors for Reparameterized Full Waveform Inversion
【速读】:该论文旨在解决全波形反演(Full Waveform Inversion, FWI)因强非线性导致易陷入局部极小值,以及在复杂地质条件下由于先验信息不足加剧反演病态性、引发伪影和重建不稳定的问题。现有基于深度图像先验的重参数化FWI(DIP-FWI)虽能利用神经网络的谱偏差(spectral bias)和隐式正则化避免局部极小值,但其固定随机输入无法充分利用网络输入与输出间的映射关系,限制了结构增强和正则化效果。本文提出自强化DIP-FWI(SRDIP-FWI)框架,其关键在于设计了一种交替更新机制:在每次迭代中,通过当前网络输出的反馈同步优化网络参数与输入,从而实现结构自适应增强与正则化能力提升,有效缓解FWI的病态性;同时,通过分析网络谱偏差在多尺度速度模型构建中的作用,实现了无需人工频率带选择和时间窗拾取的自动化流程,显著简化工作流并提升成像分辨率与深度穿透能力。
链接: https://arxiv.org/abs/2512.08284
作者: Guangyuan Zou,Junlun Li,Feng Liu,Xuejing Zheng,Jianjian Xie,Guoyi Chen
机构: 未知
类目: Geophysics (physics.geo-ph); Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to GEOPHYSICS
Abstract:Full waveform inversion (FWI) has become a widely adopted technique for high-resolution subsurface imaging. However, its inherent strong nonlinearity often results in convergence toward local minima. Recently, deep image prior-based reparameterized FWI (DIP-FWI) has been proposed to alleviate the dependence on massive training data. By exploiting the spectral bias and implicit regularization in the neural network architecture, DIP-FWI can effectively avoid local minima and reconstruct more geologically plausible velocity models. Nevertheless, existing DIP-FWI typically use a fixed random input throughout the inversion process, which fails to utilize the mapping and correlation between the input and output of the network. Moreover, under complex geological conditions, the lack of informative prior in the input can exacerbate the ill-posedness of the inverse problem, leading to artifacts and unstable reconstructions. To address these limitations, we propose a self-reinforced DIP-FWI (SRDIP-FWI) framework, in which a steering algorithm alternately updates both the network parameters and the input at each iteration using feedback from the current network output. This design allows adaptive structural enhancement and improved regularization, thereby effectively mitigating the ill-posedness in FWI. Additionally, we analyze the spectral bias of the network in SRDIP-FWI and quantify its role in multiscale velocity model building. Synthetic tests and field land data application demonstrate that SRDIP-FWI achieves superior resolution, improved accuracy and greater depth penetration compared to multiscale FWI. More importantly, SRDIP-FWI eliminates the need for manual frequency-band selection and time-window picking, substantially simplifying the inversion workflow. Overall, the proposed method provides a novel, adaptive and robust framework for accurate subsurface velocity model reconstruction.
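SRDIP-FWI 的"自强化"可以抽象为:每次迭代先按数据失配更新网络参数,再用当前输出回馈更新网络输入。以下为示意性的单步循环(forward_op 表示波动方程正演算子;输入的 alpha 混合式"转向"规则为假设,论文中的 steering 算法更精细):

```python
import torch

def srdip_step(net, z, optimizer, forward_op, d_obs, alpha=0.9):
    """One SRDIP iteration: weight update on the data misfit, then
    self-reinforcing update of the network input."""
    m = net(z)                                    # predicted velocity model
    loss = (forward_op(m) - d_obs).pow(2).mean()  # waveform data misfit
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        z = alpha * z + (1 - alpha) * net(z)      # steer input toward current output
    return z, loss.item()
```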
zh
[CV-125] Tumor-anchored deep feature random forests for out-of-distribution detection in lung cancer segmentation
【速读】:该论文旨在解决医学影像中肿瘤分割模型在面对分布外(out-of-distribution, OOD)输入时,仍可能生成高置信度但错误的分割结果的问题,这会危及临床部署的安全性。现有方法如基于logit的检测易受任务特定模型偏差影响,而架构增强型方法则带来参数和计算成本的增加。解决方案的关键在于提出一种轻量级、即插即用的后处理随机森林(Random Forests, RF)检测框架RF-Deep,其利用预训练-微调骨干编码器的分层特征,并从锚定于预测肿瘤区域的多个感兴趣区提取特征,实现任务相关的OOD检测;该方法无需额外训练且对网络深度与预训练策略不敏感,显著提升了跨不同影像场景下的分割可靠性。
链接: https://arxiv.org/abs/2512.08216
作者: Aneesh Rangnekar,Harini Veeraraghavan
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Accurate segmentation of cancerous lesions from 3D computed tomography (CT) scans is essential for automated treatment planning and response assessment. However, even state-of-the-art models combining self-supervised learning (SSL) pretrained transformers with convolutional decoders are susceptible to out-of-distribution (OOD) inputs, generating confidently incorrect tumor segmentations, posing risks for safe clinical deployment. Existing logit-based methods suffer from task-specific model biases, while architectural enhancements to explicitly detect OOD increase parameters and computational costs. Hence, we introduce a lightweight, plug-and-play, post-hoc random-forests-based OOD detection framework called RF-Deep that leverages deep features with limited outlier exposure. RF-Deep enhances generalization to imaging variations by repurposing the hierarchical features from the pretrained-then-finetuned backbone encoder, providing task-relevant OOD detection by extracting the features from multiple regions of interest anchored to the predicted tumor segmentations. As a result, it scales to images of varying fields-of-view. We compared RF-Deep against existing OOD detection methods using 1,916 CT scans across near-OOD (pulmonary embolism, negative COVID-19) and far-OOD (kidney cancer, healthy pancreas) datasets. RF-Deep achieved AUROC 93.50 for the challenging near-OOD datasets and near-perfect detection (AUROC 99.00) for the far-OOD datasets, substantially outperforming logit-based and radiomics approaches. RF-Deep maintained consistent performance across networks of different depths and pretraining strategies, demonstrating its effectiveness as a lightweight, architecture-agnostic approach to enhance the reliability of tumor segmentation from CT volumes.
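RF-Deep 的后处理流程大致是"肿瘤锚定 ROI 内池化多层编码器特征 + 随机森林打分"。以下为示意(特征聚合与 ROI 构造方式为假设;sklearn 的 RandomForestClassifier 为真实 API):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def roi_feature_vector(feat_maps, roi_masks):
    """Mean-pool hierarchical encoder features inside tumor-anchored ROIs,
    then concatenate across levels. feat_maps: list of (C, D, H, W) arrays;
    roi_masks: matching list of (D, H, W) masks at each level's resolution."""
    vecs = []
    for f, m in zip(feat_maps, roi_masks):
        w = m / (m.sum() + 1e-8)
        vecs.append((f * w[None]).sum(axis=(1, 2, 3)))
    return np.concatenate(vecs)

# Fit on in-distribution cases plus a small outlier-exposure set, then use
# the forest's probability for the outlier class as the OOD score:
rf = RandomForestClassifier(n_estimators=300, random_state=0)
# rf.fit(X_train, y_train)
# ood_score = rf.predict_proba(X_test)[:, 1]
```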
zh
[CV-126] FlowSteer: Conditioning Flow Field for Consistent Image Restoration
【速读】:该论文旨在解决流模型(flow-based models)在图像恢复(Image Restoration, IR)任务中因生成式先验与测量数据不一致而导致的“漂移”问题,即生成结果偏离真实观测数据。传统方法依赖于特定数据或任务的流变换或适配器(adapters),存在计算开销大且难以跨任务扩展的局限性。解决方案的关键在于提出FlowSteer(FS),一种基于操作符感知的条件注入机制,通过在采样路径上引入测量先验(measurement priors),将冻结流模型的隐式引导与显式测量约束进行耦合,从而在无需重训练模型或添加适配器的情况下实现零样本(zero-shot)图像恢复,显著提升重建图像的保真度和身份一致性。
链接: https://arxiv.org/abs/2512.08125
作者: Tharindu Wickremasinghe,Chenyang Qi,Harshana Weligampola,Zhengzhong Tu,Stanley H. Chan
机构: Purdue University (普渡大学); HKUST (香港科技大学); Texas A&M University (德克萨斯农工大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Flow-based text-to-image (T2I) models excel at prompt-driven image generation, but falter on Image Restoration (IR), often "drifting away" from being faithful to the measurement. Prior work mitigates this drift with data-specific flows or task-specific adapters that are computationally heavy and not scalable across tasks. This raises the question: "Can't we efficiently manipulate the existing generative capabilities of a flow model?" To this end, we introduce FlowSteer (FS), an operator-aware conditioning scheme that injects measurement priors along the sampling path, coupling a frozen flow's implicit guidance with explicit measurement constraints. Across super-resolution, deblurring, denoising, and colorization, FS improves measurement consistency and identity preservation in a strictly zero-shot setting: no retrained models, no adapters. We show how the nature of flow models and their sensitivities to noise inform the design of such a scheduler. FlowSteer, although simple, achieves a higher fidelity of reconstructed images, while leveraging the rich generative priors of flow models.
zh
[CV-127] Fast and Robust Diffusion Posterior Sampling for MR Image Reconstruction Using the Preconditioned Unadjusted Langevin Algorithm
【速读】:该论文旨在解决扩散模型在磁共振成像(MRI)重建中后验采样效率低的问题,特别是传统方法如扩散后验采样或似然退火(likelihood annealing)存在收敛速度慢和参数调优复杂等局限。解决方案的关键在于引入精确似然(exact likelihood)与预条件(preconditioning)相结合的采样策略,在逆向扩散过程中将精确似然与各噪声尺度下的扩散先验相乘,并通过预条件加速采样收敛。该方法在fastMRI数据上训练并在健康志愿者的笛卡尔与非笛卡尔加速MRI数据上验证,结果表明其在重建速度和样本质量上均优于退火采样,且无需额外参数调优,实现了快速可靠的不同MRI重建任务的后验采样。
链接: https://arxiv.org/abs/2512.05791
作者: Moritz Blumenthal,Tina Holliber,Jonathan I. Tamir,Martin Uecker
机构: Graz University of Technology (格拉茨工业大学); University of Texas at Austin (德克萨斯大学奥斯汀分校); BioTechMed-Graz (BioTechMed-格拉茨)
类目: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Probability (math.PR)
备注: Submitted to Magnetic Resonance in Medicine
Abstract:Purpose: The Unadjusted Langevin Algorithm (ULA) in combination with diffusion models can generate high quality MRI reconstructions with uncertainty estimation from highly undersampled k-space data. However, sampling methods such as diffusion posterior sampling or likelihood annealing suffer from long reconstruction times and the need for parameter tuning. The purpose of this work is to develop a robust sampling algorithm with fast convergence. Theory and Methods: In the reverse diffusion process used for sampling the posterior, the exact likelihood is multiplied with the diffused prior at all noise scales. To overcome the issue of slow convergence, preconditioning is used. The method is trained on fastMRI data and tested on retrospectively undersampled brain data of a healthy volunteer. Results: For posterior sampling in Cartesian and non-Cartesian accelerated MRI the new approach outperforms annealed sampling in terms of reconstruction speed and sample quality. Conclusion: The proposed exact likelihood with preconditioning enables rapid and reliable posterior sampling across various MRI reconstruction tasks without the need for parameter tuning.
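预条件 ULA 的基本迭代可写成如下标准形式(非论文逐字公式;$M$ 为预条件矩阵,$\tau$ 为步长):

$$x_{k+1} = x_k + \tau\, M\, \nabla_x \log p(x_k \mid y) + \sqrt{2\tau}\, M^{1/2}\, \xi_k, \qquad \xi_k \sim \mathcal{N}(0, I),$$

其中后验得分拆成精确似然项与扩散先验项之和 $\nabla\log p(x\mid y) = \nabla\log p(y\mid x) + \nabla\log p_\sigma(x)$,后者由各噪声尺度下训练的得分网络给出;选取合适的 $M$(例如与采样算子相关的近似逆)正是摘要所述加速收敛、免去调参的关键。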
zh
人工智能
[AI-0] Same Content Different Answers: Cross-Modal Inconsistency in MLLM s
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)中存在的跨模态不一致性问题,即模型在不同模态(如图像、文本和混合模态)下对相同语义信息的处理结果不一致。为系统评估这一问题,作者提出了两个新基准测试:REST 和 REST+(Render-Equivalence Stress Tests),其关键在于设计包含相同语义信息但不同模态表达的样本,从而揭示当前最先进的MLLMs在跨模态推理中的一致性缺陷。实验表明,即使OCR识别准确,视觉特征(如文本颜色、分辨率)和视觉token数量仍显著影响模型表现,且一致性评分与文本与图像之间的模态差距呈强相关,这为理解MLLMs的跨模态不一致性提供了机制层面的解释。
链接: https://arxiv.org/abs/2512.08923
作者: Angela van Sprang,Laurens Samson,Ana Lucic,Erman Acar,Sennay Ghebreab,Yuki M. Asano
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Angela van Sprang and Laurens Samson contributed equally as first authors. Preprint
Abstract:We introduce two new benchmarks REST and REST+(Render-Equivalence Stress Tests) to enable systematic evaluation of cross-modal inconsistency in multimodal large language models (MLLMs). MLLMs are trained to represent vision and language in the same embedding space, yet they cannot perform the same tasks in both modalities. Our benchmarks contain samples with the same semantic information in three modalities (image, text, mixed) and we show that state-of-the-art MLLMs cannot consistently reason over these different modalities. We evaluate 15 MLLMs and find that the degree of modality inconsistency varies substantially, even when accounting for problems with text recognition (OCR). Neither rendering text as image nor rendering an image as text solves the inconsistency. Even if OCR is correct, we find that visual characteristics (text colour and resolution, but not font) and the number of vision tokens have an impact on model performance. Finally, we find that our consistency score correlates with the modality gap between text and images, highlighting a mechanistic interpretation of cross-modal inconsistent MLLMs.
zh
[AI-1] DAO-GP Drift Aware Online Non-Linear Regression Gaussian-Process
【速读】:该论文旨在解决在线非线性回归中因概念漂移(concept drift)导致的模型性能下降问题,尤其是传统高斯过程(Gaussian Process, GP)模型在面对动态数据分布时缺乏自适应能力、固定超参数难以调整、易受数据窥探(data snooping)影响以及内存效率低等缺陷。其解决方案的关键在于提出DAO-GP(Drift-Aware Online Gaussian Process),一种完全自适应、无需超参数、具备衰减机制和稀疏性的在线非线性回归模型;该模型内置漂移检测与自适应机制,可根据漂移严重程度动态调整模型行为,并通过演化诱导点(inducing points)实现高效内存管理和去噪,从而在多种漂移类型(突发型、渐进型、缓慢型)和数据特征下均表现出鲁棒性和优越性能。
链接: https://arxiv.org/abs/2512.08879
作者: Mohammad Abu-Shaira,Ajita Rattani,Weishi Shi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Real-world datasets often exhibit temporal dynamics characterized by evolving data distributions. Disregarding this phenomenon, commonly referred to as concept drift, can significantly diminish a model’s predictive accuracy. Furthermore, the presence of hyperparameters in online models exacerbates this issue. These parameters are typically fixed and cannot be dynamically adjusted by the user in response to the evolving data distribution. Gaussian Process (GP) models offer powerful non-parametric regression capabilities with uncertainty quantification, making them ideal for modeling complex data relationships in an online setting. However, conventional online GP methods face several critical limitations, including a lack of drift-awareness, reliance on fixed hyperparameters, vulnerability to data snooping, absence of a principled decay mechanism, and memory inefficiencies. In response, we propose DAO-GP (Drift-Aware Online Gaussian Process), a novel, fully adaptive, hyperparameter-free, decayed, and sparse non-linear regression model. DAO-GP features a built-in drift detection and adaptation mechanism that dynamically adjusts model behavior based on the severity of drift. Extensive empirical evaluations confirm DAO-GP’s robustness across stationary conditions, diverse drift types (abrupt, incremental, gradual), and varied data characteristics. Analyses demonstrate its dynamic adaptation, efficient in-memory and decay-based management, and evolving inducing points. Compared with state-of-the-art parametric and non-parametric models, DAO-GP consistently achieves superior or competitive performance, establishing it as a drift-resilient solution for online non-linear regression.
zh
[AI-2] When Tables Leak: Attacking String Memorization in LLM -Based Tabular Data Generation
【速读】:该论文旨在解决基于大语言模型(Large Language Models, LLMs)生成表格型合成数据时存在的隐私泄露问题,特别是模型可能通过记忆并再现训练数据中数值数字序列的方式导致成员推断攻击(Membership Inference Attack, MIA)。其核心发现是,无论采用微调小模型还是利用上下文提示大模型的方法,当前主流实现均存在显著的隐私风险。解决方案的关键在于提出一种新颖的采样策略,该策略在生成过程中对数字位进行有策略的扰动,从而有效抵御此类攻击,同时保持合成数据的保真度(fidelity)和实用性(utility)最小损失。
链接: https://arxiv.org/abs/2512.08875
作者: Joshua Ward,Bochao Gu,Chi-Hua Wang,Guang Cheng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) have recently demonstrated remarkable performance in generating high-quality tabular synthetic data. In practice, two primary approaches have emerged for adapting LLMs to tabular data generation: (i) fine-tuning smaller models directly on tabular datasets, and (ii) prompting larger models with examples provided in context. In this work, we show that popular implementations from both regimes exhibit a tendency to compromise privacy by reproducing memorized patterns of numeric digits from their training data. To systematically analyze this risk, we introduce a simple No-box Membership Inference Attack (MIA) called LevAtt that assumes adversarial access to only the generated synthetic data and targets the string sequences of numeric digits in synthetic observations. Using this approach, our attack exposes substantial privacy leakage across a wide range of models and datasets, and in some cases, is even a perfect membership classifier on state-of-the-art models. Our findings highlight a unique privacy vulnerability of LLM-based synthetic data generation and the need for effective defenses. To this end, we propose two methods, including a novel sampling strategy that strategically perturbs digits during generation. Our evaluation demonstrates that this approach can defeat these attacks with minimal loss of fidelity and utility of the synthetic data.
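按摘要的描述,LevAtt 只接触合成数据,对记录中数值字段的数字序列做字符串级比对。下面是一种示意性实现(以与任一合成记录的最小编辑距离的负值作为成员得分;字段拼接等细节为假设):

```python
import re

def levenshtein(a, b):
    """Classic single-row dynamic-programming edit distance."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def digit_string(row):
    """Concatenate all digits appearing in a record's fields."""
    return "".join(re.findall(r"\d", " ".join(map(str, row))))

def membership_score(candidate, synthetic_rows):
    """Higher score (smaller min edit distance) suggests membership."""
    s = digit_string(candidate)
    return -min(levenshtein(s, digit_string(r)) for r in synthetic_rows)
```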
zh
[AI-3] Fed-SE: Federated Self-Evolution for Privacy-Constrained Multi-Environment LLM Agents
【速读】:该论文旨在解决在隐私约束下,大型语言模型(Large Language Model, LLM)代理在动态环境中进行自演化时面临的联邦学习(Federated Learning, FL)适配难题。传统FL方法在静态数据集上表现良好,但在开放式、任务异构且奖励稀疏的交互场景中,由于梯度冲突严重,难以实现稳定的全局优化。解决方案的关键在于提出Fed-SE框架,其核心机制是构建“局部演化-全局聚合”范式:本地代理通过参数高效微调(Parameter-Efficient Fine-Tuning)筛选出高回报轨迹以获得稳定梯度更新;全局层面则在低秩子空间中聚合更新,解耦环境特异性动态,从而有效降低客户端间的负迁移(Negative Transfer),实现跨环境的知识鲁棒迁移。
链接: https://arxiv.org/abs/2512.08870
作者: Xiang Chen,Yuling Shi,Qizhen Lan,Yuchao Qiu,Xiaodong Gu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:LLM agents are widely deployed in complex interactive tasks, yet privacy constraints often preclude centralized optimization and co-evolution across dynamic environments. While Federated Learning (FL) has proven effective on static datasets, its extension to the open-ended self-evolution of agents remains underexplored. Directly applying standard FL is challenging: heterogeneous tasks and sparse, trajectory-level rewards introduce severe gradient conflicts, destabilizing the global optimization process. To bridge this gap, we propose Fed-SE, a Federated Self-Evolution framework for LLM agents. Fed-SE establishes a local evolution-global aggregation paradigm. Locally, agents employ parameter-efficient fine-tuning on filtered, high-return trajectories to achieve stable gradient updates. Globally, Fed-SE aggregates updates within a low-rank subspace that disentangles environment-specific dynamics, effectively reducing negative transfer across clients. Experiments across five heterogeneous environments demonstrate that Fed-SE improves average task success rates by approximately 18% over federated baselines, validating its effectiveness in robust cross-environment knowledge transfer in privacy-constrained deployments.
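Fed-SE 的"本地筛选高回报轨迹 + 全局聚合 LoRA 更新"范式可用如下示意代码理解(分位数阈值与简单平均聚合均为假设;论文在低秩子空间中聚合以减少负迁移,比这里更精细):

```python
import torch

def filter_trajectories(trajs, q=0.7):
    """Keep only high-return trajectories for local parameter-efficient tuning."""
    rets = torch.tensor([t["return"] for t in trajs], dtype=torch.float)
    thr = rets.quantile(q)
    return [t for t in trajs if t["return"] >= thr]

def aggregate_lora(client_states):
    """Uniformly average LoRA parameters (A/B matrices) across clients."""
    n = len(client_states)
    return {k: sum(cs[k] for cs in client_states) / n for k in client_states[0]}
```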
zh
[AI-4] Differentially Private Synthetic Data Generation Using Context-Aware GANs
【速读】:该论文旨在解决传统合成数据生成方法在处理复杂领域规则时的不足,尤其是在医疗等敏感领域中,难以捕捉隐含的、非显式陈述但至关重要的领域约束(如处方指南或药物相互作用限制),从而导致生成的数据虽能再现显式模式却缺乏现实合理性与实用性的问题。解决方案的关键在于提出ContextGAN——一种上下文感知的差分隐私生成对抗网络,其核心创新是通过约束矩阵(constraint matrix)编码显性和隐性知识,并引入约束感知判别器(constraint-aware discriminator)对合成数据进行验证,确保其符合领域规则;同时结合差分隐私机制,在保障原始数据隐私的前提下提升合成数据的真实性和可用性。
链接: https://arxiv.org/abs/2512.08869
作者: Anantaa Kotal,Anupam Joshi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
Abstract:The widespread use of big data across sectors has raised major privacy concerns, especially when sensitive information is shared or analyzed. Regulations such as GDPR and HIPAA impose strict controls on data handling, making it difficult to balance the need for insights with privacy requirements. Synthetic data offers a promising solution by creating artificial datasets that reflect real patterns without exposing sensitive information. However, traditional synthetic data methods often fail to capture complex, implicit rules that link different elements of the data and are essential in domains like healthcare. They may reproduce explicit patterns but overlook domain-specific constraints that are not directly stated yet crucial for realism and utility. For example, prescription guidelines that restrict certain medications for specific conditions or prevent harmful drug interactions may not appear explicitly in the original data. Synthetic data generated without these implicit rules can lead to medically inappropriate or unrealistic profiles. To address this gap, we propose ContextGAN, a Context-Aware Differentially Private Generative Adversarial Network that integrates domain-specific rules through a constraint matrix encoding both explicit and implicit knowledge. The constraint-aware discriminator evaluates synthetic data against these rules to ensure adherence to domain constraints, while differential privacy protects sensitive details from the original data. We validate ContextGAN across healthcare, security, and finance, showing that it produces high-quality synthetic data that respects domain rules and preserves privacy. Our results demonstrate that ContextGAN improves realism and utility by enforcing domain constraints, making it suitable for applications that require compliance with both explicit patterns and implicit rules under strict privacy guarantees.
zh
[AI-5] EcomBench: Towards Holistic Evaluation of Foundation Agents in E-commerce
【速读】:该论文旨在解决当前基础智能体(Foundation Agents)评估体系普遍局限于学术场景或人工设计环境,难以真实反映其在复杂现实应用中性能的问题。现有基准测试忽略了实际部署时面临的动态市场、多样化用户交互及高阶决策需求等挑战。为此,作者提出EcomBench——一个面向真实电商场景的综合性评测基准,其关键在于:基于全球主流电商平台的真实用户需求构建数据集,并通过专家标注确保任务的准确性与领域相关性;同时涵盖多类电商任务并设置三个难度层级,系统性地评估智能体在深度信息检索、多步推理和跨源知识融合等核心能力上的表现,从而提供一个贴近实际业务逻辑的动态测试平台,推动智能体从实验室走向真实世界应用。
链接: https://arxiv.org/abs/2512.08868
作者: Rui Min,Zile Qiao,Ze Xu,Jiawen Zhai,Wenyu Gao,Xuanzhong Chen,Haozhen Sun,Zhen Zhang,Xinyu Wang,Hong Zhou,Wenbiao Yin,Xuan Zhou,Yong Jiang,Haicheng Liu,Liang Ding,Ling Zou,Yi R.(May)Fung,Yalong Li,Pengjun Xie
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Foundation agents have rapidly advanced in their ability to reason and interact with real environments, making the evaluation of their core capabilities increasingly important. While many benchmarks have been developed to assess agent performance, most concentrate on academic settings or artificially designed scenarios while overlooking the challenges that arise in real applications. To address this issue, we focus on a highly practical real-world setting, the e-commerce domain, which involves a large volume of diverse user interactions, dynamic market conditions, and tasks directly tied to real decision-making processes. To this end, we introduce EcomBench, a holistic E-commerce Benchmark designed to evaluate agent performance in realistic e-commerce environments. EcomBench is built from genuine user demands embedded in leading global e-commerce ecosystems and is carefully curated and annotated by human experts to ensure clarity, accuracy, and domain relevance. It covers multiple task categories within e-commerce scenarios and defines three difficulty levels that evaluate agents on key capabilities such as deep information retrieval, multi-step reasoning, and cross-source knowledge integration. By grounding evaluation in real e-commerce contexts, EcomBench provides a rigorous and dynamic testbed for measuring the practical capabilities of agents in modern e-commerce.
zh
[AI-6] Interpolation in Knowledge Representation
【速读】:该论文旨在解决知识表示中Craig插值与统一插值(uniform interpolation)的理论与实践难题,尤其是在描述逻辑(description logics)和逻辑编程等主流知识表示形式系统中,这些插值性质通常不成立或难以计算。其解决方案的关键在于深入分析这两类形式系统的理论特性,并提出相应的实用计算方法,以支持如可解释性、遗忘、模块化与重用等应用场景中的插值需求。
链接: https://arxiv.org/abs/2512.08833
作者: Jean Christoph Jung,Patrick Koopmann,Matthias Knorr
机构: 未知
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注: The article will appear in Balder ten Cate, Jean Christoph Jung, Patrick Koopmann, Christoph Wernhard and Frank Wolter, editors. Theory and Applications of Craig Interpolation. Ubiquity Press, 2026
Abstract:Craig interpolation and uniform interpolation have many applications in knowledge representation, including explainability, forgetting, modularization and reuse, and even learning. At the same time, many relevant knowledge representation formalisms do not, in general, have Craig or uniform interpolation, and computing interpolants in practice is challenging. We take a closer look at two prominent knowledge representation formalisms, description logics and logic programming, and discuss theoretical results and practical methods for computing interpolants.
zh
[AI-7] CARLoS: Retrieval via Concise Assessment Representation of LoRAs at Scale
【速读】:该论文旨在解决生成式组件(如LoRA)在无结构生态系统中难以有效发现与评估的问题,现有方法依赖不可靠的用户描述或有偏的流行度指标,导致可用性受限。其解决方案的关键在于提出CARLoS框架,通过在多种提示词和随机种子下对超过650个LoRA进行图像生成分析,利用CLIP嵌入及其与基础模型生成结果的差异,构建包含三个维度的表征:方向(Directions,定义语义变化)、强度(Strength,量化影响显著性)和一致性(Consistency,衡量效果稳定性)。基于此表征,开发出高效的语义检索系统,在匹配文本查询的同时过滤过强或不稳定的LoRA,显著优于纯文本基线方法,并为版权法中的实质性与自愿性等法律概念提供量化依据,体现出更广泛的LoRA分析价值。
链接: https://arxiv.org/abs/2512.08826
作者: Shahar Sarfaty,Adi Haviv,Uri Hacohen,Niva Elkin-Koren,Roi Livni,Amit H. Bermano
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Paper Page: this https URL
Abstract:The rapid proliferation of generative components, such as LoRAs, has created a vast but unstructured ecosystem. Existing discovery methods depend on unreliable user descriptions or biased popularity metrics, hindering usability. We present CARLoS, a large-scale framework for characterizing LoRAs without requiring additional metadata. Analyzing over 650 LoRAs, we employ them in image generation over a variety of prompts and seeds, as a credible way to assess their behavior. Using CLIP embeddings and their difference to a base-model generation, we concisely define a three-part representation: Directions, defining semantic shift; Strength, quantifying the significance of the effect; and Consistency, quantifying how stable the effect is. Using these representations, we develop an efficient retrieval framework that semantically matches textual queries to relevant LoRAs while filtering overly strong or unstable ones, outperforming textual baselines in automated and human evaluations. While retrieval is our primary focus, the same representation also supports analyses linking Strength and Consistency to legal notions of substantiality and volition, key considerations in copyright, positioning CARLoS as a practical system with broader relevance for LoRA analysis.
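下面给出一个极简的 Python 示意(基于摘要描述的推测实现,并非论文原始代码),展示如何从 CLIP 嵌入差值计算文中提到的三元表示:Directions(方向)、Strength(强度)与 Consistency(一致性)。其中 `lora_embs`、`base_embs` 等变量名均为本文假设。

```python
import numpy as np

def carlos_representation(lora_embs: np.ndarray, base_embs: np.ndarray):
    """由 CLIP 嵌入差值计算 (direction, strength, consistency) 的示意实现。

    lora_embs, base_embs: 形状 (N, D),N 为 prompt/seed 组合数,D 为嵌入维度。
    """
    deltas = lora_embs - base_embs                 # 每次生成相对基础模型的语义偏移
    norms = np.linalg.norm(deltas, axis=1) + 1e-8
    unit = deltas / norms[:, None]                 # 单位化偏移向量

    direction = unit.mean(axis=0)
    direction /= np.linalg.norm(direction) + 1e-8  # Directions:平均语义偏移方向
    strength = norms.mean()                        # Strength:偏移幅度的均值
    consistency = float((unit @ direction).mean()) # Consistency:各偏移与主方向的平均余弦相似度
    return direction, strength, consistency

# 用法示意:用随机数据代替真实 CLIP 嵌入
rng = np.random.default_rng(0)
d, s, c = carlos_representation(rng.normal(size=(32, 512)),
                                rng.normal(size=(32, 512)))
print(f"strength={s:.3f}, consistency={c:.3f}")
```

检索时可将查询文本的 CLIP 嵌入与各 LoRA 的 direction 做余弦匹配,并按摘要思路过滤 Strength 过大或 Consistency 过低的候选。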
zh
[AI-8] Emovectors: assessing emotional content in jazz improvisations for creativity evaluation WWW
【速读】:该论文试图解决如何量化评估爵士乐即兴演奏中的创造力问题,尤其是在当前基于大语言模型(Large Language Models, LLMs)的生成系统中缺乏有效自动化创造力指标的背景下。其核心假设是:即兴演奏中情感内容越丰富,越可能被认定为具有创造性。解决方案的关键在于提出一种基于嵌入(embeddings)的方法,通过心理学基础的音乐特征分类体系提取情感信息,生成可量化的“情感向量”(emovectors),并以此对多个即兴演奏进行比较分析,从而为创造力评估提供可扩展的新指标。
链接: https://arxiv.org/abs/2512.08812
作者: Anna Jordanous
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注: Presented at IEEE Big Data 2025 3rd Workshop on AI Music Generation (AIMG 2025). this https URL
Abstract:Music improvisation is fascinating to study, being essentially a live demonstration of a creative process. In jazz, musicians often improvise across predefined chord progressions (leadsheets). How do we assess the creativity of jazz improvisations? And can we capture this in automated metrics for creativity for current LLM-based generative systems? Demonstration of emotional involvement is closely linked with creativity in improvisation. Analysing musical audio, can we detect emotional involvement? This study hypothesises that if an improvisation contains more evidence of emotion-laden content, it is more likely to be recognised as creative. An embeddings-based method is proposed for capturing the emotional content in musical improvisations, using a psychologically-grounded classification of musical characteristics associated with emotions. Resulting ‘emovectors’ are analysed to test the above hypothesis, comparing across multiple improvisations. Capturing emotional content in this quantifiable way can contribute towards new metrics for creativity evaluation that can be applied at scale.
zh
[AI-9] Multicalibration for LLM-based Code Generation
【速读】:该论文旨在解决生成式 AI(Generative AI)在代码生成任务中,大型语言模型(Large Language Models, LLMs)输出的置信度分数与实际代码正确性之间不一致的问题,即模型校准(calibration)不足的问题。其核心解决方案是引入多校准(multicalibration)方法,通过将校准过程细化到编码问题的不同特征维度(如复杂度、代码长度和编程语言),从而提升置信度评分对代码正确性的忠实度。实验表明,该方法相比原始token似然得分可提升技能分数(skill score)1.03,相较基线校准方法提升0.37,验证了多校准在增强代码LLMs可信度方面的有效性。
链接: https://arxiv.org/abs/2512.08810
作者: Viola Campos,Robin Kuschnereit,Adrian Ulges
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at AI-SQE 2026 (The 1st International Workshop on AI for Software Quality Evaluation: Judgment, Metrics, Benchmarks, and Beyond)
Abstract:As AI-based code generation becomes widespread, researchers are investigating the calibration of code LLMs - ensuring their confidence scores faithfully represent the true likelihood of code correctness. To do so, we investigate multicalibration, which can capture additional factors about a coding problem, such as complexity, code length, or programming language used. We study four multicalibration approaches on three function synthesis benchmarks, using latest-generation code LLMs (Qwen3 Coder, GPT-OSS, DeepSeek-R1-Distill). Our results demonstrate that multicalibration can yield distinct improvements over both uncalibrated token likelihoods (+1.03 in skill score) and baseline calibrations (+0.37 in skill score). We study the influence of the aforementioned factors in ablations, and make our dataset (consisting of code generations, likelihoods, and correctness labels) available for future research on code LLM calibration.
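多校准要求置信度不仅整体校准,还要在每个可识别子组(如编程语言、代码长度区间)内部校准。以下是基于"迭代修正最失准格子"思路的极简示意,阈值与变量均为本文假设,并非论文所比较的四种多校准算法的原始实现。

```python
import numpy as np

def multicalibrate(probs, labels, groups, n_bins=10, alpha=0.1, n_iters=50):
    """对代码正确性置信度做多校准的极简示意。

    probs:  (N,) 模型给出的正确率置信度
    labels: (N,) 0/1 真实正确性标签
    groups: list[np.ndarray],每个元素是一个布尔掩码,定义一个子组
            (例如某种编程语言、某个复杂度区间)
    """
    p = probs.copy()
    for _ in range(n_iters):
        worst_gap, worst_mask, worst_corr = 0.0, None, 0.0
        for g in groups:
            for b in range(n_bins):
                mask = g & (p >= b / n_bins) & (p < (b + 1) / n_bins)
                if mask.sum() < 20:            # 样本太少的格子跳过
                    continue
                gap = labels[mask].mean() - p[mask].mean()
                if abs(gap) > abs(worst_gap):
                    worst_gap, worst_mask, worst_corr = abs(gap), mask, gap
        if worst_mask is None or worst_gap < alpha / n_bins:
            break                               # 所有 (子组, 区间) 均已近似校准
        p[worst_mask] = np.clip(p[worst_mask] + worst_corr, 0.0, 1.0)
    return p
```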
zh
[AI-10] PrivTune: Efficient and Privacy-Preserving Fine-Tuning of Large Language Models via Device-Cloud Collaboration
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)服务中用户私有数据在微调过程中可能泄露的问题。现有方法依赖于设备-云协同框架中的差分隐私(Differential Privacy),难以在隐私保护与模型性能之间取得平衡,易受推断攻击或导致微调效果下降。解决方案的关键在于提出PrivTune,一种基于分割学习(Split Learning, SL)的高效隐私保护微调框架:其核心思想是在SL底层模型输出的token表示中注入精心设计的噪声,使每个token与其n跳间接邻居相似;并通过优化问题求解最优噪声向量,在保证防御目标的同时最大化模型效用;进一步调整噪声分布参数(即均值)以匹配优化方向,并按token重要性动态缩放噪声强度,从而最小化对模型性能的干扰。实验表明,PrivTune在多个数据集上显著降低攻击成功率(如在Stanford Sentiment Treebank上将攻击成功率降至10%),同时仅造成3.33%的效用损失,优于当前最先进基线方法。
链接: https://arxiv.org/abs/2512.08809
作者: Yi Liu,Weixiang Han,Chengjun Cai,Xingliang Yuan,Cong Wang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at IEEE INFOCOM 2026 (full version)
Abstract:With the rise of large language models, service providers offer language models as a service, enabling users to fine-tune customized models via uploaded private datasets. However, this raises concerns about sensitive data leakage. Prior methods, relying on differential privacy within device-cloud collaboration frameworks, struggle to balance privacy and utility, exposing users to inference attacks or degrading fine-tuning performance. To address this, we propose PrivTune, an efficient and privacy-preserving fine-tuning framework via Split Learning (SL). The key idea of PrivTune is to inject crafted noise into token representations from the SL bottom model, making each token resemble the n-hop indirect neighbors. PrivTune formulates this as an optimization problem to compute the optimal noise vector, aligning with defense-utility goals. On this basis, it then adjusts the parameters (i.e., mean) of the d_χ-privacy noise distribution to align with the optimization direction and scales the noise according to token importance to minimize distortion. Experiments on five datasets (covering both classification and generation tasks) against three embedding inversion and three attribute inference attacks show that, using RoBERTa on the Stanford Sentiment Treebank dataset, PrivTune reduces the attack success rate to 10% with only a 3.33% drop in utility performance, outperforming state-of-the-art baselines.
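下面是"按 token 重要性缩放并注入噪声"这一核心思路的极简 PyTorch 示意(仅体现摘要描述的机制,不含论文中求解最优噪声向量的优化过程;函数名与参数均为假设):

```python
import torch

def inject_private_noise(reps, importance, base_scale=1.0, mean_dir=None):
    """向 SL 底层模型输出的 token 表示注入噪声的示意实现。

    reps:       (T, D) token 表示
    importance: (T,)   token 重要性权重,重要 token 噪声更小以减少失真
    mean_dir:   (D,)   可选的噪声均值方向,对应摘要中"调整噪声分布均值"的思路
    """
    T, D = reps.shape
    scales = base_scale / (importance + 1e-6)   # 重要性越高,噪声尺度越小
    noise = torch.randn(T, D) * scales.unsqueeze(1)
    if mean_dir is not None:
        noise = noise + mean_dir.unsqueeze(0)
    return reps + noise
```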
zh
[AI-11] Democratizing ML for Enterprise Security: A Self-Sustained Attack Detection Framework WWW
【速读】:该论文旨在解决当前安全运营中心(Security Operations Centers, SOC)中广泛依赖规则驱动检测方法所面临的局限性,即高误报率/漏报率、维护成本高以及机器学习(Machine Learning, ML)方案因数据稀缺和技能缺口难以落地的问题。其核心解决方案是一个两阶段混合框架:第一阶段使用故意宽松的YARA规则进行粗粒度过滤,以最大化召回率;第二阶段引入机器学习分类器对第一阶段结果进行精筛,从而显著降低误报。为克服训练数据匮乏问题,系统采用Simula——一种无需种子样本的合成数据生成框架,使安全分析师可在无大量标注数据或数据科学背景的情况下构建高质量训练集。此外,通过持续反馈机制将实际调查结果回流至模型,实现主动学习(Active Learning),使模型能自适应优化并避免规则退化。该方案在数万系统规模的生产环境中长期验证,日均原始日志量从2500亿条降至仅数十条需人工处理的告警,同时精度随时间提升,形成低开销、低维护且可自我演进的威胁检测体系。
链接: https://arxiv.org/abs/2512.08802
作者: Sadegh Momeni,Ge Zhang,Birkett Huber,Hamza Harkous,Sam Lipton,Benoit Seguin,Yanis Pavlidis
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: published in CAMLIS 2025, this https URL
Abstract:Despite advancements in machine learning for security, rule-based detection remains prevalent in Security Operations Centers due to the resource intensiveness and skill gap associated with ML solutions. While traditional rule-based methods offer efficiency, their rigidity leads to high false positives or negatives and requires continuous manual maintenance. This paper proposes a novel, two-stage hybrid framework to democratize ML-based threat detection. The first stage employs intentionally loose YARA rules for coarse-grained filtering, optimized for high recall. The second stage utilizes an ML classifier to filter out false positives from the first stage's output. To overcome data scarcity, the system leverages Simula, a seedless synthetic data generation framework, enabling security analysts to create high-quality training datasets without extensive data science expertise or pre-labeled examples. A continuous feedback loop incorporates real-time investigation results to adaptively tune the ML model, preventing rule degradation. This proposed model with active learning has been rigorously tested for a prolonged time in a production environment spanning tens of thousands of systems. The system handles initial raw log volumes often reaching 250 billion events per day, significantly reducing them through filtering and ML inference to a handful of daily tickets for human investigation. Live experiments over an extended timeline demonstrate a general improvement in the model's precision over time due to the active learning feature. This approach offers a self-sustained, low-overhead, and low-maintenance solution, allowing security professionals to guide model learning as expert "teachers".
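两阶段管线的结构可以用下面的极简示意说明(规则、特征与阈值均为本文假设,用简单正则代替真实 YARA 规则):

```python
import re
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# 第一阶段:故意写得很宽松的规则,追求高召回
LOOSE_PATTERNS = [re.compile(p, re.I)
                  for p in (r"powershell .*-enc", r"certutil .*-urlcache", r"base64")]

def stage1_filter(logs: list[str]) -> list[int]:
    """返回命中任一宽松规则的日志下标(粗粒度过滤)。"""
    return [i for i, line in enumerate(logs)
            if any(p.search(line) for p in LOOSE_PATTERNS)]

def stage2_filter(features: np.ndarray, clf: RandomForestClassifier,
                  threshold: float = 0.5) -> np.ndarray:
    """第二阶段:用 ML 分类器在第一阶段输出上过滤误报。

    features: 第一阶段命中样本的特征矩阵;返回保留(判为真阳性)的掩码。
    """
    return clf.predict_proba(features)[:, 1] >= threshold
```

实际系统中第二阶段的训练数据可由合成数据框架(如文中的 Simula)生成,并通过调查反馈持续更新(主动学习)。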
zh
[AI-12] Can TabPFN Compete with GNNs for Node Classification via Graph Tabularization?
【速读】:该论文旨在解决图节点分类任务中对专用图神经网络(Graph Neural Networks, GNNs)或依赖大语言模型(Large Language Models, LLMs)的图基础模型的强依赖问题,从而提升方法的通用性和实用性。其解决方案的关键在于将图数据通过系统性特征工程转化为结构化的表格特征(tabular features),包括节点属性、结构特性、位置编码以及可选的平滑邻域特征,进而利用预训练的TabPFN模型直接进行零样本节点分类,无需任何图特定训练或LLM支持。这种基于表征重构的方法在同质图上表现与GNN相当,在异质图上显著优于GNN,验证了特征工程在连接表格学习与图学习领域中的有效性。
链接: https://arxiv.org/abs/2512.08798
作者: Jeongwhan Choi,Woosung Kang,Minseo Kim,Jongwoo Kim,Noseong Park
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Rejected from LoG 2025 (submitted August 2025)
Abstract:Foundation models pretrained on large data have demonstrated remarkable zero-shot generalization capabilities across domains. Building on the success of TabPFN for tabular data and its recent extension to time series, we investigate whether graph node classification can be effectively reformulated as a tabular learning problem. We introduce TabPFN-GN, which transforms graph data into tabular features by extracting node attributes, structural properties, positional encodings, and optionally smoothed neighborhood features. This enables TabPFN to perform direct node classification without any graph-specific training or language model dependencies. Our experiments on 12 benchmark datasets reveal that TabPFN-GN achieves competitive performance with GNNs on homophilous graphs and consistently outperforms them on heterophilous graphs. These results demonstrate that principled feature engineering can bridge the gap between tabular and graph domains, providing a practical alternative to task-specific GNN training and LLM-dependent graph foundation models.
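下面给出"图表格化"这一步的极简示意(特征集合按摘要归纳:节点属性、结构特征、位置编码、平滑邻域特征;具体选择为本文假设,且假设节点编号为 0..n-1):

```python
import numpy as np
import networkx as nx

def graph_to_table(G: nx.Graph, X: np.ndarray, k_eig: int = 4,
                   smooth_hops: int = 1) -> np.ndarray:
    """把图数据展平成表格特征:节点属性 + 结构特征 + 位置编码 + 平滑邻域特征。"""
    n = G.number_of_nodes()
    deg = np.array([G.degree(i) for i in range(n)], dtype=float)[:, None]
    clust = np.array([nx.clustering(G, i) for i in range(n)])[:, None]

    # 位置编码:归一化拉普拉斯的前 k 个非平凡特征向量
    L = nx.normalized_laplacian_matrix(G).toarray()
    _, V = np.linalg.eigh(L)
    pos = V[:, 1:1 + k_eig]

    # 平滑邻域特征:按行归一化的邻接矩阵对属性做 smooth_hops 次平均
    A = nx.to_numpy_array(G)
    P = A / (A.sum(1, keepdims=True) + 1e-8)
    Xs = X.copy()
    for _ in range(smooth_hops):
        Xs = P @ Xs
    return np.hstack([X, deg, clust, pos, Xs])
```

得到的表格即可直接交给预训练的 TabPFN 分类器(例如 `tabpfn` 包中的 `TabPFNClassifier`)做零训练的节点分类,无需任何图特定训练。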
zh
[AI-13] A Practical Guide for Designing, Developing and Deploying Production-Grade Agentic AI Workflows
【速读】:该论文旨在解决如何设计、开发和部署生产级别的代理式人工智能(Agentic AI)工作流,以应对当前在工业界和研究领域中快速采用Agentic AI时面临的挑战:即确保系统具备可靠性、可观测性、可维护性,并满足安全与治理要求。其解决方案的关键在于提出一个结构化的工程生命周期框架,涵盖工作流分解、多代理设计模式、模型上下文协议(Model Context Protocol, MCP)、工具集成、确定性编排、负责任AI(Responsible-AI)考量以及环境感知的部署策略,并进一步提炼出九项核心最佳实践,如“工具优先设计”、“纯函数调用”、“单一职责代理”、“外部化提示管理”及“KISS原则”等,从而实现复杂代理系统的稳定性、可扩展性和可操作性。
链接: https://arxiv.org/abs/2512.08769
作者: Eranga Bandara,Ross Gore,Peter Foytik,Sachin Shetty,Ravi Mukkamala,Abdul Rahman,Xueping Liang,Safdar H. Bouk,Amin Hass,Sachini Rajapakse,Ng Wee Keong,Kasun De Zoysa,Aruna Withanage,Nilaan Loganathan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Agentic AI marks a major shift in how autonomous systems reason, plan, and execute multi-step tasks. Unlike traditional single model prompting, agentic workflows integrate multiple specialized agents with different Large Language Models(LLMs), tool-augmented capabilities, orchestration logic, and external system interactions to form dynamic pipelines capable of autonomous decision-making and action. As adoption accelerates across industry and research, organizations face a central challenge: how to design, engineer, and operate production-grade agentic AI workflows that are reliable, observable, maintainable, and aligned with safety and governance requirements. This paper provides a practical, end-to-end guide for designing, developing, and deploying production-quality agentic AI systems. We introduce a structured engineering lifecycle encompassing workflow decomposition, multi-agent design patterns, Model Context Protocol(MCP), and tool integration, deterministic orchestration, Responsible-AI considerations, and environment-aware deployment strategies. We then present nine core best practices for engineering production-grade agentic AI workflows, including tool-first design over MCP, pure-function invocation, single-tool and single-responsibility agents, externalized prompt management, Responsible-AI-aligned model-consortium design, clean separation between workflow logic and MCP servers, containerized deployment for scalable operations, and adherence to the Keep it Simple, Stupid (KISS) principle to maintain simplicity and robustness. To demonstrate these principles in practice, we present a comprehensive case study: a multimodal news-analysis and media-generation workflow. By combining architectural guidance, operational patterns, and practical implementation insights, this paper offers a foundational reference to build robust, extensible, and production-ready agentic AI workflows.
zh
[AI-14] Data-Driven Dynamic Parameter Learning of manipulator robots
【速读】:该论文旨在解决机器人领域中“仿真到现实(sim-to-real)”差距的核心挑战,即如何实现高精度的动力学参数估计,以支撑可靠的模型预测控制、逼真的仿真以及机械臂的安全部署。传统解析方法在复杂结构和交互场景下表现不足,而常规神经网络难以捕捉长期依赖关系。其解决方案的关键在于提出一种基于Transformer的动态参数估计框架,并结合自动化数据生成流程——该流程利用雅可比矩阵导出的特征生成多样化机器人模型与丰富轨迹数据(共8,192个不同惯性与摩擦属性的机器人),从而增强模型对时空依赖性的建模能力。实验表明,该方法在质量、惯量等参数上达到近似完美估计,在库仑摩擦上表现中高精度,验证了Transformer架构与自动数据增强相结合的有效性,为提升机器人系统中的sim-to-real迁移性能提供了可扩展且准确的技术路径。
链接: https://arxiv.org/abs/2512.08767
作者: Mohammed Elseiagy,Tsige Tadesse Alemayoh,Ranulfo Bezerra,Shotaro Kojima,Kazunori Ohno
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Accepted for publication at SII 2026. 6 pages, 7 figures. Code is available at: this https URL
Abstract:Bridging the sim-to-real gap remains a fundamental challenge in robotics, as accurate dynamic parameter estimation is essential for reliable model-based control, realistic simulation, and safe deployment of manipulators. Traditional analytical approaches often fall short when faced with complex robot structures and interactions. Data-driven methods offer a promising alternative, yet conventional neural networks such as recurrent models struggle to capture long-range dependencies critical for accurate estimation. In this study, we propose a Transformer-based approach for dynamic parameter estimation, supported by an automated pipeline that generates diverse robot models and enriched trajectory data using Jacobian-derived features. The dataset consists of 8,192 robots with varied inertial and frictional properties. Leveraging attention mechanisms, our model effectively captures both temporal and spatial dependencies. Experimental results highlight the influence of sequence length, sampling rate, and architecture, with the best configuration (sequence length 64, 64 Hz, four layers, 32 heads) achieving a validation R^2 of 0.8633. Mass and inertia are estimated with near-perfect accuracy, Coulomb friction with moderate-to-high accuracy, while viscous friction and distal link center-of-mass remain more challenging. These results demonstrate that combining Transformers with automated dataset generation and kinematic enrichment enables scalable, accurate dynamic parameter estimation, contributing to improved sim-to-real transfer in robotic systems.
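这类"序列回归动力学参数"的模型骨架可用如下 PyTorch 极简示意说明(层数与注意力头数取摘要中的最优配置,输入维度、d_model 等为本文假设):

```python
import torch
import torch.nn as nn

class DynParamTransformer(nn.Module):
    """由轨迹序列回归动力学参数(质量、惯量、摩擦等)的示意模型。"""
    def __init__(self, in_dim: int, n_params: int,
                 d_model: int = 128, n_layers: int = 4, n_heads: int = 32):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=256,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_params)

    def forward(self, x):                      # x: (B, T, in_dim)
        h = self.encoder(self.proj(x))         # (B, T, d_model)
        return self.head(h.mean(dim=1))        # 时间维平均池化后回归参数

# 用法示意:序列长度 64,输入为关节状态 + 雅可比派生特征(维度假设为 32)
model = DynParamTransformer(in_dim=32, n_params=10)
params = model(torch.randn(8, 64, 32))         # -> (8, 10)
```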
zh
[AI-15] Performance Comparison of Aerial RIS and STAR-RIS in 3D Wireless Environments
【速读】:该论文旨在解决高空部署的可重构智能表面(Reconfigurable Intelligent Surface, RIS)与同时发射和反射型可重构智能表面(Simultaneously Transmitting and Reflecting RIS, STAR-RIS)在三维无线环境中的性能差异问题,特别是二者在不同高度和方向下的系统速率优化。解决方案的关键在于构建考虑方向性辐射特性的精确信道模型,并针对两种架构分别建立联合优化问题以最大化系统总速率;进而提出基于加权最小均方误差(Weighted Minimum Mean Square Error, WMMSE)与块坐标下降(Block Coordinate Descent, BCD)算法的高效求解方法,从而实现对RIS和STAR-RIS的波束赋形与反射/透射系数协同优化。
链接: https://arxiv.org/abs/2512.08755
作者: Dongdong Yang,Bin Li,Jiguang He
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Reconfigurable intelligent surface (RIS) and simultaneously transmitting and reflecting RIS (STAR-RIS) have emerged as key enablers for enhancing wireless coverage and capacity in next-generation networks. When mounted on unmanned aerial vehicles (UAVs), they benefit from flexible deployment and improved line-of-sight conditions. Despite their promising potential, a comprehensive performance comparison between aerial RIS and STAR-RIS architectures has not been thoroughly investigated. This letter presents a detailed performance comparison between aerial RIS and STAR-RIS in three-dimensional wireless environments. Accurate channel models incorporating directional radiation patterns are established, and the influence of deployment altitude and orientation is thoroughly examined. To optimize the system sum-rate, we formulate joint optimization problems for both architectures and propose an efficient solution based on the weighted minimum mean square error and block coordinate descent algorithms. Simulation results reveal that STAR-RIS outperforms RIS in low-altitude scenarios due to its full-space coverage capability, whereas RIS delivers better performance near the base station at higher altitudes. The findings provide practical insights for the deployment of aerial intelligent surfaces in future 6G communication systems.
zh
[AI-16] Towards Foundation Models with Native Multi-Agent Intelligence
【速读】:该论文试图解决的问题是:当前基础模型(Foundation Models, FMs)虽已具备单智能体能力(如GUI交互或工具调用),但尚未具备原生的多智能体(multi-agent)智能,而这一能力对于构建更复杂、协作性强的AI代理系统至关重要。解决方案的关键在于识别并提升FMs在多智能体场景下的四项核心能力——理解(understanding)、规划(planning)、高效通信(efficient communication)和适应性(adaptation),并通过系统性的研究方向(包括数据集构建、评估体系、训练范式与安全考量)推动FMs从单智能体性能向多智能体智能的跃迁,而非依赖其单智能体能力的自然涌现。
链接: https://arxiv.org/abs/2512.08743
作者: Shuyue Hu,Haoyang Yan,Yiqun Zhang,Yang Chen,Dongzhan Zhou,Lei Bai
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
Abstract:Foundation models (FMs) are increasingly assuming the role of the “brain” of AI agents. While recent efforts have begun to equip FMs with native single-agent abilities – such as GUI interaction or integrated tool use – we argue that the next frontier is endowing FMs with native multi-agent intelligence. We identify four core capabilities of FMs in multi-agent contexts: understanding, planning, efficient communication, and adaptation. Contrary to assumptions about the spontaneous emergence of such abilities, we provide extensive empirical evidence across 41 large language models showing that strong single-agent performance alone does not automatically yield robust multi-agent intelligence. To address this gap, we outline key research directions – spanning dataset construction, evaluation, training paradigms, and safety considerations – for building FMs with native multi-agent intelligence.
zh
[AI-17] Deconstructing the Dual Black Box: A Plug-and-Play Cognitive Framework for Human-AI Collaborative Enhancement and Its Implications for AI Governance
【速读】:该论文旨在解决人类专家的“认知黑箱”(cognitive black box)与人工智能(AI)的“计算黑箱”(computational black box)之间的根本性分裂问题,即二者在决策过程中的不可解释性和不可审计性。其解决方案的关键在于提出了一种“人-AI协同认知增强”新范式,通过结构化的“元交互”(meta-interaction)将双黑箱转化为可组合、可审计、可扩展的“功能白箱”系统;核心突破是构建了“即插即用的认知框架”——一种从专家对话中提取并编码为可计算知识包的机制,并集成于递归对抗元思维网络(Recursive Adversarial Meta-Thinking Network, RAMTN),从而实现专家思维(如医学诊断逻辑和教学直觉)向可复用、可扩展的公共资产转化,推动AI从“工具”向“思维伙伴”的范式跃迁。
链接: https://arxiv.org/abs/2512.08740
作者: Yiming Lu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 31 pages, in Chinese language, 5 figures,3 tables
Abstract:Currently, there exists a fundamental divide between the "cognitive black box" (implicit intuition) of human experts and the "computational black box" (untrustworthy decision-making) of artificial intelligence (AI). This paper proposes a new paradigm of "human-AI collaborative cognitive enhancement," aiming to transform the dual black boxes into a composable, auditable, and extensible "functional white-box" system through structured "meta-interaction." The core breakthrough lies in the "plug-and-play cognitive framework": a computable knowledge package that can be extracted from expert dialogues and loaded into the Recursive Adversarial Meta-Thinking Network (RAMTN). This enables expert thinking, such as medical diagnostic logic and teaching intuition, to be converted into reusable and scalable public assets, realizing a paradigm shift from "AI as a tool" to "AI as a thinking partner." This work not only provides the first engineering proof for "cognitive equity" but also opens up a new path for AI governance: constructing a verifiable and intervenable governance paradigm through "transparency of interaction protocols" rather than prying into the internal mechanisms of models. The framework is open-sourced to promote technology for good and cognitive inclusion. This paper presents independent exploratory research conducted by the author. All content presented, including the theoretical framework (RAMTN), methodology (meta-interaction), system implementation, and case validation, constitutes the author's individual research achievements.
zh
[AI-18] Multi-Agent Intelligence for Multidisciplinary Decision-Making in Gastrointestinal Oncology
【速读】:该论文旨在解决多模态临床推理在胃肠道(GI)肿瘤学领域中的挑战,特别是如何有效整合内镜图像、影像学数据和生化标志物等异构医学信息,以提升自动化决策支持系统的准确性与可靠性。当前基于多模态大语言模型(Multimodal Large Language Models, MLLMs)的方法常因上下文稀释(context dilution)和幻觉(hallucination)问题导致推理偏差。论文提出的关键解决方案是一种分层的多智能体框架(hierarchical Multi-Agent Framework),该框架模拟人类多学科诊疗团队(Multidisciplinary Team, MDT)的协作流程,通过代理间的分工与协同机制显著增强推理逻辑严谨性和医学准确性,最终在专家评估中获得4.60/5.00的综合得分,优于传统单体基线模型。
链接: https://arxiv.org/abs/2512.08674
作者: Rongzhao Zhang,Junqiao Wang,Shuyun Yang,Mouxiao Bian,Chao Ding,Yuwei Bai,Chihao Zhang,Yuguang Shen,Lei Wang,Lei Zheng,Qiujuan Yan,Yun Zhong,Meiling Liu,Jiwei Yu,Zheng Wang,Jie Xu,Meng Luo
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
Abstract:Multimodal clinical reasoning in the field of gastrointestinal (GI) oncology necessitates the integrated interpretation of endoscopic imagery, radiological data, and biochemical markers. Despite the evident potential exhibited by Multimodal Large Language Models (MLLMs), they frequently encounter challenges such as context dilution and hallucination when confronted with intricate, heterogeneous medical histories. In order to address these limitations, a hierarchical Multi-Agent Framework is proposed, which emulates the collaborative workflow of a human Multidisciplinary Team (MDT). The system attained a composite expert evaluation score of 4.60/5.00, thereby demonstrating a substantial improvement over the monolithic baseline. It is noteworthy that the agent-based architecture yielded the most substantial enhancements in reasoning logic and medical accuracy. The findings indicate that mimetic, agent-based collaboration provides a scalable, interpretable, and clinically robust paradigm for automated decision support in oncology.
zh
[AI-19] Reusability in MLOps: Leveraging Ports and Adapters to Build a Microservices Architecture for the Maritime Domain
【速读】:该论文旨在解决生成式 AI (Generative AI) 系统在复杂性管理上的挑战,尤其是在构建多微服务架构时如何实现软件架构的复用与可维护性。其解决方案的关键在于采用端口与适配器(Ports and Adapters)模式,结合六边形架构(Hexagonal Architecture)理念,从单一代码库中高效构建多个微服务,从而提升系统模块化程度和组件复用能力,为机器学习驱动的系统(ML-Enabled Systems, MLES)提供可扩展、可测试且易于演进的架构基础。
链接: https://arxiv.org/abs/2512.08657
作者: Renato Cordeiro Ferreira(1,2,3,4),Aditya Dhinavahi(1,2),Rowanne Trapmann(1,3),Willem-Jan van den Heuvel(1,2,3) ((1) Jheronimus Academy of Data Science, (2) Technical University of Eindhoven, (3) Tilburg University, (4) University of São Paulo)
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 7 pages, 3 figures (3 diagrams), submitted to ICSA 2026
Abstract:ML-Enabled Systems (MLES) are inherently complex since they require multiple components to achieve their business goal. This experience report showcases the software architecture reusability techniques applied while building Ocean Guard, an MLES for anomaly detection in the maritime domain. In particular, it highlights the challenges and lessons learned to reuse the Ports and Adapters pattern to support building multiple microservices from a single codebase. This experience report hopes to inspire software engineers, machine learning engineers, and data scientists to apply the Hexagonal Architecture pattern to build their MLES.
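"端口与适配器"(六边形架构)模式的核心是:核心业务逻辑只依赖抽象端口,不同微服务在组装入口注入不同适配器即可复用同一份核心代码。下面是一个与海事异常检测场景呼应的极简示意(类名与接口均为本文假设,并非 Ocean Guard 的真实代码):

```python
from abc import ABC, abstractmethod

# 端口:核心业务逻辑只依赖这个抽象接口
class AnomalyDetectorPort(ABC):
    @abstractmethod
    def score(self, features: list[float]) -> float: ...

# 适配器 1:包装真实的 ML 模型(此处以任意带 predict_proba 的对象代替)
class SklearnAdapter(AnomalyDetectorPort):
    def __init__(self, model):
        self.model = model
    def score(self, features):
        return float(self.model.predict_proba([features])[0][1])

# 适配器 2:测试 / 离线环境用的假实现
class StubAdapter(AnomalyDetectorPort):
    def score(self, features):
        return 0.0

# 核心服务对具体适配器一无所知,多个微服务可共享同一份核心逻辑
class VesselMonitoringService:
    def __init__(self, detector: AnomalyDetectorPort, threshold: float = 0.8):
        self.detector, self.threshold = detector, threshold
    def check(self, features: list[float]) -> bool:
        return self.detector.score(features) >= self.threshold
```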
zh
[AI-20] Protein Secondary Structure Prediction Using Transformers
【速读】:该论文旨在解决从氨基酸序列准确预测蛋白质二级结构(如α螺旋、β折叠和无规卷曲)的问题,这对于理解蛋白质功能具有重要意义。其解决方案的关键在于提出了一种基于Transformer的模型,利用注意力机制有效捕捉残基间的局部与远距离相互作用,并结合滑动窗口数据增强技术在CB513数据集上扩充训练样本,从而提升模型对不同长度序列的泛化能力。
链接: https://arxiv.org/abs/2512.08613
作者: Manzi Kevin Maxime
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Predicting protein secondary structures such as alpha helices, beta sheets, and coils from amino acid sequences is essential for understanding protein function. This work presents a transformer-based model that applies attention mechanisms to protein sequence data to predict structural motifs. A sliding-window data augmentation technique is used on the CB513 dataset to expand the training samples. The transformer shows strong ability to generalize across variable-length sequences while effectively capturing both local and long-range residue interactions.
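摘要中的滑动窗口数据增强可以用几行代码说明(窗口与步长取值为本文假设):

```python
def sliding_windows(seq: str, labels: str, win: int = 64, stride: int = 8):
    """对 (氨基酸序列, 二级结构标签) 做滑动窗口切分以扩充训练样本。

    seq 与 labels 等长,labels 取值如 'H'(螺旋)/'E'(折叠)/'C'(无规卷曲)。
    """
    assert len(seq) == len(labels)
    if len(seq) <= win:
        return [(seq, labels)]
    return [(seq[i:i + win], labels[i:i + win])
            for i in range(0, len(seq) - win + 1, stride)]

# 用法示意(序列与标签均为随意构造的例子)
pairs = sliding_windows("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ" * 3,
                        "CCHHHHHHHHHHCCEEEEECCCCHHHHHHHCCC" * 3)
print(len(pairs), pairs[0])
```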
zh
[AI-21] CogMCTS: A Novel Cognitive-Guided Monte Carlo Tree Search Framework for Iterative Heuristic Evolution with Large Language Models
【速读】:该论文旨在解决基于大语言模型(Large Language Models, LLMs)的自动启发式设计(Automatic Heuristic Design, AHD)方法在优化过程中存在的局部最优陷阱、探索与利用平衡不足以及搜索多样性受限的问题。其解决方案的关键在于提出一种认知引导的蒙特卡洛树搜索框架(Cognitive-guided Monte Carlo Tree Search, CogMCTS),通过将LLMs的认知引导机制与MCTS深度融合,实现高效自动化启发式优化;具体包括多轮认知反馈以融合历史经验、节点信息及负面结果来动态改进启发式生成,采用双轨节点扩展与精英启发式管理策略以平衡多样启发式的探索与高质量经验的利用,并引入策略变异机制调整启发式形式与参数,从而显著提升解空间的多样性与整体优化性能。
链接: https://arxiv.org/abs/2512.08609
作者: Hui Wang,Yang Liu,Xiaoyu Zhang,Chaoxu Mu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Automatic Heuristic Design (AHD) is an effective framework for solving complex optimization problems. The development of large language models (LLMs) enables the automated generation of heuristics. Existing LLM-based evolutionary methods rely on population strategies and are prone to local optima. Integrating LLMs with Monte Carlo Tree Search (MCTS) improves the trade-off between exploration and exploitation, but multi-round cognitive integration remains limited and search diversity is constrained. To overcome these limitations, this paper proposes a novel cognitive-guided MCTS framework (CogMCTS). CogMCTS tightly integrates the cognitive guidance mechanism of LLMs with MCTS to achieve efficient automated heuristic optimization. The framework employs multi-round cognitive feedback to incorporate historical experience, node information, and negative outcomes, dynamically improving heuristic generation. Dual-track node expansion combined with elite heuristic management balances the exploration of diverse heuristics and the exploitation of high-quality experience. In addition, strategic mutation modifies the heuristic forms and parameters to further enhance the diversity of the solution and the overall optimization performance. The experimental results indicate that CogMCTS outperforms existing LLM-based AHD methods in stability, efficiency, and solution quality.
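"LLM 引导的 MCTS 启发式搜索"骨架可用如下极简示意说明(`llm_propose` 与 `evaluate` 为桩函数,双轨扩展与精英管理等论文细节从略;均为本文假设):

```python
import math

class Node:
    def __init__(self, heuristic: str, parent=None):
        self.heuristic, self.parent = heuristic, parent
        self.children, self.visits, self.value = [], 0, 0.0

def ucb(node, c=1.4):
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits)

def cog_mcts(root, llm_propose, evaluate, budget=100):
    """认知引导 MCTS 的骨架:llm_propose(node, feedback) 由 LLM 根据
    历史经验/负面结果生成或变异启发式,evaluate 返回启发式得分。"""
    for _ in range(budget):
        node = root
        while node.children:                             # 选择
            node = max(node.children, key=ucb)
        feedback = {"parent_score": node.value / max(node.visits, 1)}
        child = Node(llm_propose(node, feedback), parent=node)  # 扩展
        node.children.append(child)
        score = evaluate(child.heuristic)                # 评估
        while child:                                     # 回传
            child.visits += 1
            child.value += score
            child = child.parent
    return max(root.children, key=lambda n: n.value / n.visits)
```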
zh
[AI-22] Examining Student Interactions with a Pedagogical AI-Assistant for Essay Writing and their Impact on Students Writing Quality
【速读】:该论文旨在解决当前关于学生与生成式 AI (Generative AI) 互动动态及其对写作质量影响的研究不足问题,特别是缺乏对教育场景下专用 GenAI 系统在写作各阶段中学生行为模式的深入探讨。其解决方案的关键在于设计并评估一个面向高等教育论证类写作的 GenAI 驱动作文辅助工具(Essay Writing Assistant, EWA),通过分析 32 名本科生在两小时写作会话中的 1,282 条交互日志,运用序列模式挖掘和 K-Means 聚类识别出两类典型行为模式:一类侧重提纲规划与结构组织(Cluster 1),另一类聚焦内容生成(Cluster 2);结果显示 Cluster 1 在作文组织维度上得分更高(效应量 r = 0.36),且高绩效学生更倾向于主动撰写并分享段落以获取反馈,而非被动提问。这提示教学设计应鼓励主动参与,而未来 EWA 可引入自动标注与监控机制,引导学生从提问转向写作,从而最大化 GenAI 支持学习的效果。
链接: https://arxiv.org/abs/2512.08596
作者: Wicaksono Febriantoro,Qi Zhou,Wannapon Suraworachet,Sahan Bulathwela,Andrea Gauthier,Eva Millan,Mutlu Cukurova
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 17 pages, 3 Figures, 8 Tables
Abstract:The dynamic nature of interactions between students and GenAI, as well as their relationship to writing quality, remains underexplored. While most research has examined how general-purpose GenAI can support writing, fewer studies have investigated how students interact with pedagogically designed systems across different phases of the writing process. To address this gap, we evaluated a GenAI-driven essay-writing assistant (EWA) designed to support higher education students in argumentative writing. Drawing on 1,282 interaction logs from 32 undergraduates during a two-hour writing session, Sequential Pattern Mining and K-Means clustering were used to identify behavioral patterns. Two clusters emerged: Cluster 1 emphasized outline planning and essay structure, while Cluster 2 focused on content development. A Mann-Whitney U test revealed a moderate effect size (r = 0.36) in the essay Organization dimension, with Cluster 1 showing higher scores. Qualitative analysis indicated that students with better performance actively wrote and shared essay sections with EWA for feedback, rather than interacted passively by asking questions. These findings suggest implications for teaching and system design. Teachers can encourage active engagement, while future EWAs may integrate automatic labeling and monitoring to prompt students to move from questioning to writing, enabling fuller benefits from GenAI-supported learning.
zh
[AI-23] The SMART Framework for AI Systems
【速读】:该论文旨在解决人工智能(AI)系统在临床研究及其他行业应用中因自动化决策和复杂性增加而引发的安全性、责任归属与合规性挑战。其解决方案的关键在于提出SMART+框架——一个以安全(Safety)、监控(Monitoring)、责任(Accountability)、可靠性(Reliability)和透明度(Transparency)为核心支柱,并进一步融合隐私(Privacy)、安全(Security)、数据治理(Data Governance)、公平性与偏见(Fairness & Bias)及防护机制(Guardrails)的结构化治理模型。该框架通过集成操作保障、监督流程和强化的数据治理控制,实现风险缓解、信任构建与合规准备,从而为跨行业的负责任AI部署提供可审计、可验证的治理基础。
链接: https://arxiv.org/abs/2512.08592
作者: Laxmiraju Kandikatla,Branislav Radeljic
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Systems and Control (eess.SY)
备注:
Abstract:Artificial Intelligence (AI) systems are now an integral part of multiple industries. In clinical research, AI supports automated adverse event detection in clinical trials, patient eligibility screening for protocol enrollment, and data quality validation. Beyond healthcare, AI is transforming finance through real-time fraud detection, automated loan risk assessment, and algorithmic decision-making. Similarly, in manufacturing, AI enables predictive maintenance to reduce equipment downtime, enhances quality control through computer-vision inspection, and optimizes production workflows using real-time operational data. While these technologies enhance operational efficiency, they introduce new challenges regarding safety, accountability, and regulatory compliance. To address these concerns, we introduce the SMART+ Framework - a structured model built on the pillars of Safety, Monitoring, Accountability, Reliability, and Transparency, and further enhanced with Privacy Security, Data Governance, Fairness Bias, and Guardrails. SMART+ offers a practical, comprehensive approach to evaluating and governing AI systems across industries. This framework aligns with evolving mechanisms and regulatory guidance to integrate operational safeguards, oversight procedures, and strengthened privacy and governance controls. SMART+ demonstrates risk mitigation, trust-building, and compliance readiness. By enabling responsible AI adoption and ensuring auditability, SMART+ provides a robust foundation for effective AI governance in clinical research.
zh
[AI-24] Mind to Hand: Purposeful Robotic Control via Embodied Reasoning
【速读】:该论文旨在解决当前AI系统在物理世界中实现具身推理(embodied reasoning)与动作执行之间脱节的问题,即如何将大规模预训练视觉语言模型(VLM)的通用推理能力有效映射到机器人实际操作中,从而实现从语义理解到动作规划的闭环控制。解决方案的关键在于提出一个三阶段预训练流程:首先通过高质量视觉语言数据继续预训练VLM以增强空间规划和轨迹预测等具身推理能力;其次结合跨机器人平台的数据进行联合训练,提升模型对不同机器人形态的适应性;最后在Astribot S1双臂移动操作机器人上收集轨迹数据,并引入强化学习优化推理过程与动作输出的一致性,实现语义推理与运动控制的闭环对齐。这一方法显著提升了机器人在复杂任务中的泛化能力和长期规划性能。
链接: https://arxiv.org/abs/2512.08580
作者: Peijun Tang,Shangjin Xie,Binyan Sun,Baifu Huang,Kuncheng Luo,Haotian Yang,Weiqi Jin,Jianan Wang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 49 pages, 25 figures
Abstract:Humans act with context and intention, with reasoning playing a central role. While internet-scale data has enabled broad reasoning capabilities in AI systems, grounding these abilities in physical action remains a major challenge. We introduce Lumo-1, a generalist vision-language-action (VLA) model that unifies robot reasoning (“mind”) with robot action (“hand”). Our approach builds upon the general multi-modal reasoning capabilities of pre-trained vision-language models (VLMs), progressively extending them to embodied reasoning and action prediction, and ultimately towards structured reasoning and reasoning-action alignment. This results in a three-stage pre-training pipeline: (1) Continued VLM pre-training on curated vision-language data to enhance embodied reasoning skills such as planning, spatial understanding, and trajectory prediction; (2) Co-training on cross-embodiment robot data alongside vision-language data; and (3) Action training with reasoning process on trajectories collected on Astribot S1, a bimanual mobile manipulator with human-like dexterity and agility. Finally, we integrate reinforcement learning to further refine reasoning-action consistency and close the loop between semantic inference and motor control. Extensive experiments demonstrate that Lumo-1 achieves significant performance improvements in embodied vision-language reasoning, a critical component for generalist robotic control. Real-world evaluations further show that Lumo-1 surpasses strong baselines across a wide range of challenging robotic tasks, with strong generalization to novel objects and environments, excelling particularly in long-horizon tasks and responding to human-natural instructions that require reasoning over strategy, concepts and space.
zh
[AI-25] A Hybrid Model for Stock Market Forecasting: Integrating News Sentiment and Time Series Data with Graph Neural Networks
【速读】:该论文旨在解决股票市场预测这一长期挑战,即如何通过融合多模态信息提升预测准确性以支持更明智的投资决策。其核心解决方案是提出一种基于图神经网络(Graph Neural Network, GNN)的多模态模型,将公司历史股价数据与新闻文本信息(特别是标题)进行联合建模:利用LSTM编码个股历史价格序列,用语言模型嵌入新闻标题,并构建异构图结构,其中节点包括公司、新闻和行业,再通过GraphSAGE聚合邻接信息以捕捉跨模态交互。实验表明,该方法在二分类方向预测任务上达到53%准确率,在显著性标签任务上实现4%的精度提升,验证了融合外部新闻信号对短期市场走势预测的有效性。
链接: https://arxiv.org/abs/2512.08567
作者: Nader Sadek,Mirette Moawad,Christina Naguib,Mariam Elzahaby
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 11 pages, 6 figures. Published in the Proceedings of the 5th International Conference on Artificial Intelligence Research (ICAIR 2025). Published version available at: this https URL
Abstract:Stock market prediction is a long-standing challenge in finance, as accurate forecasts support informed investment decisions. Traditional models rely mainly on historical prices, but recent work shows that financial news can provide useful external signals. This paper investigates a multimodal approach that integrates companies’ news articles with their historical stock data to improve prediction performance. We compare a Graph Neural Network (GNN) model with a baseline LSTM model. Historical data for each company is encoded using an LSTM, while news titles are embedded with a language model. These embeddings form nodes in a heterogeneous graph, and GraphSAGE is used to capture interactions between articles, companies, and industries. We evaluate two targets: a binary direction-of-change label and a significance-based label. Experiments on the US equities and Bloomberg datasets show that the GNN outperforms the LSTM baseline, achieving 53% accuracy on the first target and a 4% precision gain on the second. Results also indicate that companies with more associated news yield higher prediction accuracy. Moreover, headlines contain stronger predictive signals than full articles, suggesting that concise news summaries play an important role in short-term market reactions.
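摘要中的异构图建模可用 PyTorch Geometric 的 `HeteroConv` + `SAGEConv` 简单示意(边类型命名、反向边与维度均为本文假设,并非论文原始实现):

```python
import torch
from torch_geometric.nn import HeteroConv, SAGEConv

class NewsStockGNN(torch.nn.Module):
    """异构图(公司/新闻/行业)上的 GraphSAGE 示意:
    节点特征为新闻标题嵌入与 LSTM 编码的历史价格。"""
    def __init__(self, hidden=64, n_classes=2):
        super().__init__()
        self.conv1 = HeteroConv({
            ("news", "mentions", "company"): SAGEConv((-1, -1), hidden),
            ("company", "belongs_to", "industry"): SAGEConv((-1, -1), hidden),
            ("industry", "contains", "company"): SAGEConv((-1, -1), hidden),
        }, aggr="mean")
        self.head = torch.nn.Linear(hidden, n_classes)

    def forward(self, x_dict, edge_index_dict):
        h = self.conv1(x_dict, edge_index_dict)
        h = {k: v.relu() for k, v in h.items()}
        return self.head(h["company"])      # 对公司节点做涨跌二分类
```

其中 ("industry", "contains", "company") 这条反向边是为了让行业信息也能流回公司节点而加入的假设设计。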
zh
[AI-26] Bridging Scale Discrepancies in Robotic Control via Language-Based Action Representations
【速读】:该论文旨在解决机器人操作中因不同机器人平台和任务间动作数据存在显著分布偏移(distribution shift)而导致预训练知识难以有效迁移的问题,其关键在于提出一种语义 grounded 的语言表征方式来标准化动作表示。解决方案的核心是引入一种运动表征(motion representation),该表征忽略动作命令的数值尺度影响,仅关注方向性信息,从而减少分布偏移并提升预训练表示的泛化能力;同时,该表征缩小了动作标记与标准词汇标记之间的特征距离,缓解模态鸿沟(modality gap),进而增强多任务场景下的迁移性能。
链接: https://arxiv.org/abs/2512.08548
作者: Yuchi Zhang,Churui Sun,Shiqi Liang,Diyuan Liu,Chao Ji,Wei-Nan Zhang,Ting Liu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent end-to-end robotic manipulation research increasingly adopts architectures inspired by large language models to enable robust manipulation. However, a critical challenge arises from severe distribution shifts between robotic action data, primarily due to substantial numerical variations in action commands across diverse robotic platforms and tasks, hindering the effective transfer of pretrained knowledge. To address this limitation, we propose a semantically grounded linguistic representation to normalize actions for efficient pretraining. Unlike conventional discretized action representations that are sensitive to numerical scales, the motion representation specifically disregards numeric scale effects, emphasizing directionality instead. This abstraction mitigates distribution shifts, yielding a more generalizable pretraining representation. Moreover, using the motion representation narrows the feature distance between action tokens and standard vocabulary tokens, mitigating modality gaps. Multi-task experiments on two benchmarks demonstrate that the proposed method significantly improves generalization performance and transferability in robotic manipulation tasks.
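"忽略数值尺度、只保留方向"的运动表示可以用几行代码说明(词表与死区阈值为本文假设):

```python
import numpy as np

AXIS_WORDS = [("backward", "forward"), ("right", "left"), ("down", "up")]

def action_to_motion_words(delta_xyz: np.ndarray, dead_zone: float = 1e-3):
    """把数值动作增量转成只保留方向、忽略幅值的语言化运动表示。

    不同机器人平台的动作数值尺度差异很大,但"往哪个方向动"是共享的,
    这正是摘要中缓解分布偏移的核心思想。
    """
    words = []
    for v, (neg, pos) in zip(delta_xyz, AXIS_WORDS):
        if abs(v) > dead_zone:          # 幅值只用于判断是否超过死区
            words.append(pos if v > 0 else neg)
    return " ".join(words) or "stay"

print(action_to_motion_words(np.array([0.02, -0.5, 0.0])))  # forward right
```

这类语言化表示同时缩小了动作标记与普通词汇标记的特征距离,从而缓解模态鸿沟。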
zh
[AI-27] Principles2Plan: LLM-Guided System for Operationalising Ethical Principles into Plans AAAI2026
【速读】:该论文旨在解决机器人在人类环境中运行时缺乏伦理意识的问题,现有自动化规划工具难以支持生成情境敏感的伦理规则,而手动制定伦理规则则劳动密集且高度依赖具体场景。解决方案的关键在于提出一个名为Principles2Plan的交互式研究原型,通过人与大型语言模型(Large Language Model, LLM)协作,将领域专家提供的高层次伦理原则(如有益性、隐私保护)转化为可操作的伦理规则,并由用户审查、排序后输入规划器,从而生成具有伦理考量的可行计划。该方法首次在经典规划场景中支持用户基于原则生成可执行的伦理规则,显著提升了伦理自动化规划的实用性与可行性。
链接: https://arxiv.org/abs/2512.08536
作者: Tammy Zhong,Yang Song,Maurice Pagnucco
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted by AAAI 2026
Abstract:Ethical awareness is critical for robots operating in human environments, yet existing automated planning tools provide little support. Manually specifying ethical rules is labour-intensive and highly context-specific. We present Principles2Plan, an interactive research prototype demonstrating how a human and a Large Language Model (LLM) can collaborate to produce context-sensitive ethical rules and guide automated planning. A domain expert provides the planning domain, problem details, and relevant high-level principles such as beneficence and privacy. The system generates operationalisable ethical rules consistent with these principles, which the user can review, prioritise, and supply to a planner to produce ethically-informed plans. To our knowledge, no prior system supports users in generating principle-grounded rules for classical planning contexts. Principles2Plan showcases the potential of human-LLM collaboration for making ethical automated planning more practical and feasible.
zh
[AI-28] SensHRPS: Sensing Comfortable Human-Robot Proxemics and Personal Space With Eye-Tracking
【速读】:该论文旨在解决社交机器人在与人类交互时如何根据人类的proxemic(空间距离)规范调整自身行为以提升用户舒适度的问题。其核心挑战在于,尽管眼动特征已被证明可有效估计人际互动中的舒适感,但这些特征在人机交互场景下的适用性尚不明确。解决方案的关键在于通过移动眼动追踪与主观报告相结合的方式,在四个实验控制的距离下(0.5 m 至 2.0 m)收集数据,并采用多种机器学习和深度学习模型分析 gaze features(注视特征),最终发现决策树(Decision Tree)分类器表现最优(F1-score = 0.73),且最小瞳孔直径是最关键的预测因子,表明人机交互中生理舒适阈值不同于人际互动,且可通过可解释逻辑进行建模。
链接: https://arxiv.org/abs/2512.08518
作者: Nadezhda Kushina(1),Ko Watanabe(2),Aarthi Kannan(1),Ashita Ashok(1),Andreas Dengel(2),Karsten Berns(1) ((1) RPTU Kaiserslautern-Landau, (2) DFKI GmbH)
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:Social robots must adjust to human proxemic norms to ensure user comfort and engagement. While prior research demonstrates that eye-tracking features reliably estimate comfort in human-human interactions, their applicability to interactions with humanoid robots remains unexplored. In this study, we investigate user comfort with the robot “Ameca” across four experimentally controlled distances (0.5 m to 2.0 m) using mobile eye-tracking and subjective reporting (N=19). We evaluate multiple machine learning and deep learning models to estimate comfort based on gaze features. Contrary to previous human-human studies where Transformer models excelled, a Decision Tree classifier achieved the highest performance (F1-score = 0.73), with minimum pupil diameter identified as the most critical predictor. These findings suggest that physiological comfort thresholds in human-robot interaction differ from human-human dynamics and can be effectively modeled using interpretable logic.
zh
[AI-29] A Lightweight Transfer Learning-Based State-of-Health Monitoring with Application to Lithium-ion Batteries in Unmanned Air Vehicles
【速读】:该论文旨在解决锂离子电池(Lithium-ion Battery)在便携式移动设备中状态健康监测(State-of-Health, SOH)的准确性与计算资源消耗之间的矛盾问题。传统基于迁移学习(Transfer Learning, TL)的方法虽能减少目标域训练数据需求,但其计算开销大,难以部署于资源受限的移动设备。解决方案的关键在于提出一种轻量级的构造性增量迁移学习方法(Constructive Incremental Transfer Learning, CITL):首先利用目标域未标注数据设计半监督迁移机制,在迭代过程中通过新增网络节点以最小化监测残差;其次通过结构风险最小化、迁移不匹配最小化和流形一致性最大化三重约束保障节点参数的跨域学习能力;最后理论分析了CITL的收敛性,确保迁移性能与模型紧凑性的双重优势。实验表明,该方法在真实无人机电池数据集上显著优于现有主流迁移学习方法。
链接: https://arxiv.org/abs/2512.08512
作者: Jiang Liu,Yan Qin,Wei Dai,Chau Yuen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted in IEEE Transactions on Industrial Informatics
Abstract:Accurate and rapid state-of-health (SOH) monitoring plays an important role in indicating energy information for lithium-ion battery-powered portable mobile devices. To confront their variable working conditions, transfer learning (TL) emerges as a promising technique for leveraging knowledge from data-rich source working conditions, significantly reducing the training data required for SOH monitoring from target working conditions. However, traditional TL-based SOH monitoring is infeasible when applied in portable mobile devices since substantial computational resources are consumed during the TL stage and unexpectedly reduce the working endurance. To address these challenges, this paper proposes a lightweight TL-based SOH monitoring approach with constructive incremental transfer learning (CITL). First, taking advantage of the unlabeled data in the target domain, a semi-supervised TL mechanism is proposed to minimize the monitoring residual in a constructive way, through iteratively adding network nodes in the CITL. Second, the cross-domain learning ability of node parameters for CITL is comprehensively guaranteed through structural risk minimization, transfer mismatching minimization, and manifold consistency maximization. Moreover, the convergence analysis of the CITL is given, theoretically guaranteeing the efficacy of TL performance and network compactness. Finally, the proposed approach is verified through extensive experiments with a realistic unmanned air vehicles (UAV) battery dataset collected from dozens of flight missions. Specifically, the CITL outperforms SS-TCA, MMD-LSTM-DA, DDAN, BO-CNN-TL, and AS^3LSTM in SOH estimation by 83.73%, 61.15%, 28.24%, 87.70%, and 57.34%, respectively, as evaluated using the root-mean-square-error index.
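"以新增节点最小化残差"的构造式增量学习思路可以用如下极简示意体现(仅为说明机制,不含论文中的半监督迁移与流形一致性约束;激活函数与停止条件为本文假设):

```python
import numpy as np

def constructive_fit(X, y, max_nodes=50, tol=1e-3, rng=None):
    """构造式增量学习示意:逐个添加随机特征隐节点,
    每个新节点的输出权重通过一维最小二乘在当前残差上解析求得。"""
    rng = rng or np.random.default_rng(0)
    residual = y.astype(float)
    nodes = []                                  # [(w, b, beta), ...]
    for _ in range(max_nodes):
        w = rng.normal(size=X.shape[1])
        b = rng.normal()
        h = np.tanh(X @ w + b)                  # 新节点的隐层输出
        beta = (h @ residual) / (h @ h + 1e-8)  # 一维最小二乘
        residual = residual - beta * h          # 监测残差单调下降
        nodes.append((w, b, beta))
        if np.sqrt(np.mean(residual ** 2)) < tol:
            break
    return nodes

def constructive_predict(nodes, X):
    return sum(beta * np.tanh(X @ w + b) for w, b, beta in nodes)
```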
zh
[AI-30] Developing Distance-Aware Uncertainty Quantification Methods in Physics-Guided Neural Networks for Reliable Bearing Health Prediction ALT
【速读】:该论文旨在解决旋转机械中滚动轴承退化估计的不确定性建模问题,现有方法普遍存在置信度校准不足、计算成本高、缺乏距离感知能力以及在分布外(Out-of-Distribution, OOD)数据下泛化性能差等缺陷。解决方案的关键在于提出两种基于物理引导神经网络(Physics-guided Neural Networks, PGNN)的距离感知不确定性方法:PG-SNGP(基于谱归一化高斯过程)和PG-SNER(基于深度证据回归)。其核心创新在于对隐藏层施加谱归一化以保持输入到隐空间的距离特性,并分别通过高斯过程层(PG-SNGP)和正态逆伽马参数输出(PG-SNER)实现距离敏感的不确定性建模,从而提升预测准确性、OOD泛化能力和鲁棒性。
链接: https://arxiv.org/abs/2512.08499
作者: Waleed Razzaq,Yun-Bo Zhao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Under review at Structural health Monitoring - SAGE
Abstract:Accurate and uncertainty-aware degradation estimation is essential for predictive maintenance in safety-critical systems like rotating machinery with rolling-element bearings. Many existing uncertainty methods lack confidence calibration, are costly to run, are not distance-aware, and fail to generalize under out-of-distribution data. We introduce two distance-aware uncertainty methods for deterministic physics-guided neural networks: PG-SNGP, based on Spectral Normalization Gaussian Process, and PG-SNER, based on Deep Evidential Regression. We apply spectral normalization to the hidden layers so the network preserves distances from input to latent space. PG-SNGP replaces the final dense layer with a Gaussian Process layer for distance-sensitive uncertainty, while PG-SNER outputs Normal Inverse Gamma parameters to model uncertainty in a coherent probabilistic form. We assess performance using standard accuracy metrics and a new distance-aware metric based on the Pearson Correlation Coefficient, which measures how well predicted uncertainty tracks the distance between test and training samples. We also design a dynamic weighting scheme in the loss to balance data fidelity and physical consistency. We test our methods on rolling-element bearing degradation using the PRONOSTIA dataset and compare them with Monte Carlo and Deep Ensemble PGNNs. Results show that PG-SNGP and PG-SNER improve prediction accuracy, generalize reliably under OOD conditions, and remain robust to adversarial attacks and noise.
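"隐藏层谱归一化 + 距离敏感末层"的 SNGP 式结构可以用 PyTorch 极简示意(用随机傅里叶特征近似 GP 末层;物理引导损失项与 PG-SNER 的 NIG 输出未包含,维度均为本文假设):

```python
import torch
import torch.nn as nn

class SpectralNormMLP(nn.Module):
    """对隐藏层施加谱归一化以保持输入到隐空间的距离特性;
    末层用随机傅里叶特征近似 GP,使不确定性随与训练数据的距离增大。"""
    def __init__(self, in_dim, hidden=64, n_rff=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.utils.spectral_norm(nn.Linear(in_dim, hidden)), nn.ReLU(),
            nn.utils.spectral_norm(nn.Linear(hidden, hidden)), nn.ReLU(),
        )
        # 固定的随机傅里叶特征,近似 RBF 核 GP 的最后一层
        self.register_buffer("W", torch.randn(hidden, n_rff))
        self.register_buffer("b", 2 * torch.pi * torch.rand(n_rff))
        self.beta = nn.Linear(n_rff, 1)

    def forward(self, x):
        h = self.backbone(x)
        phi = torch.cos(h @ self.W + self.b) * (2.0 / self.W.shape[1]) ** 0.5
        return self.beta(phi)               # 退化量预测(如轴承健康指标)
```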
zh
[AI-31] LLM-based Vulnerable Code Augmentation: Generate or Refactor?
【速读】:该论文旨在解决漏洞代码库中类别严重失衡的问题,这限制了基于深度学习的漏洞分类器的有效性。其关键解决方案是利用大语言模型(Large Language Models, LLMs)进行数据增强,具体包括两种策略:一是通过受控生成新漏洞样本,二是对现有漏洞代码进行语义保持的重构(refactoring)。实验表明,这两种方法均能在简单流程下提升漏洞代码库的丰富度与质量,而混合策略能更显著地提升漏洞分类器的性能。
链接: https://arxiv.org/abs/2512.08493
作者: Dyna Soumhane Ouchebara,Stéphane Dupont
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 6 pages, Submitted to ESAAN 2026, under peer review
Abstract:Vulnerability code-bases often suffer from severe imbalance, limiting the effectiveness of Deep Learning-based vulnerability classifiers. Data Augmentation could help solve this by mitigating the scarcity of under-represented CWEs. In this context, we investigate LLM-based augmentation for vulnerable functions, comparing controlled generation of new vulnerable samples with semantics-preserving refactoring of existing ones. Using Qwen2.5-Coder to produce augmented data and CodeBERT as a vulnerability classifier on the SVEN dataset, we find that our approaches are indeed effective in enriching vulnerable code-bases through a simple process and with reasonable quality, and that a hybrid strategy best boosts vulnerability classifiers’ performance.
zh
[AI-32] Autonomous Issue Resolver: Towards Zero-Touch Code Maintenance
【速读】:该论文旨在解决当前自动化程序修复(Automated Program Repair, APR)在仓库级(repository-scale)场景下面临的核心挑战,尤其是现有基于控制流的代码生成方法难以有效处理复杂目录结构和无关控制逻辑的问题。其关键解决方案在于提出了一种从标准代码属性图(Code Property Graphs, CPGs)到数据转换图(Data Transformation Graph, DTG)的范式转变:将数据状态建模为节点、函数作为边,从而通过数据血缘(data lineage)追踪逻辑缺陷,而非依赖传统的控制流分析。这一设计使得多智能体框架能够协调数据完整性导航与控制流逻辑,有效缓解了传统检索增强生成(Retrieval-Augmented Generation, RAG)系统中的“语义陷阱”问题,并通过自主问题解析器(Autonomous Issue Resolver, AIR)实现了神经符号推理驱动的可扩展逻辑修复,在SWE-Verified基准上达到87.1%的修复成功率。
链接: https://arxiv.org/abs/2512.08492
作者: Aliaksei Kaliutau
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 21 pages, 4 figures
Abstract:Recent advances in Large Language Models have revolutionized function-level code generation; however, repository-scale Automated Program Repair (APR) remains a significant challenge. Current approaches typically employ a control-centric paradigm, forcing agents to navigate complex directory structures and irrelevant control logic. In this paper, we propose a paradigm shift from the standard Code Property Graphs (CPGs) to the concept of Data Transformation Graph (DTG) that inverts the topology by modeling data states as nodes and functions as edges, enabling agents to trace logic defects through data lineage rather than control flow. We introduce a multi-agent framework that reconciles data integrity navigation with control flow logic. Our theoretical analysis and case studies demonstrate that this approach resolves the “Semantic Trap” inherent in standard RAG systems in modern coding agents. We provide a comprehensive implementation in the form of Autonomous Issue Resolver (AIR), a self-improvement system for zero-touch code maintenance that utilizes neuro-symbolic reasoning and uses the DTG structure for scalable logic repair. Our approach has demonstrated good results on several SWE benchmarks, reaching a resolution rate of 87.1% on SWE-Verified benchmark. Our approach directly addresses the core limitations of current AI code-assistant tools and tackles the critical need for a more robust foundation for our increasingly software-dependent world.
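"数据状态作为节点、函数作为边"的数据转换图(DTG)可以用 networkx 简单示意(节点与函数名均为假设的例子):

```python
import networkx as nx

# DTG:数据状态作为节点,函数作为边——与传统 CPG 的拓扑相反
dtg = nx.DiGraph()
dtg.add_edge("raw_order", "validated_order", func="validate_order")
dtg.add_edge("validated_order", "priced_order", func="apply_discount")
dtg.add_edge("priced_order", "invoice", func="render_invoice")

def trace_lineage(g: nx.DiGraph, bad_state: str):
    """从出错的数据状态沿数据血缘回溯,列出其上游的所有转换函数。"""
    upstream = nx.ancestors(g, bad_state) | {bad_state}
    sub = g.subgraph(upstream)
    return [attrs["func"] for _, _, attrs in sub.edges(data=True)]

print(trace_lineage(dtg, "invoice"))
# ['validate_order', 'apply_discount', 'render_invoice']
```

智能体据此沿数据血缘而非控制流缩小缺陷定位范围,这正是摘要所称避免"语义陷阱"的直觉。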
zh
[AI-33] Using reinforcement learning to probe the role of feedback in skill acquisition
【速读】:该论文旨在解决高绩效技能习得过程中反馈信息需求的难题,特别是探究在物理系统中学习与执行阶段对环境反馈(如流场信息)依赖性的差异。其核心问题是:是否可以通过有限的真实世界交互学习出高性能策略,并且这些策略在无反馈条件下能否稳定执行?解决方案的关键在于构建一个基于桌面循环水通道的物理实验平台,将通用强化学习代理(generalist reinforcement learning agent)直接接入旋转圆柱体以控制阻力(drag)——该系统具备高度混沌的流体动力学特性,难以建模,但奖励函数明确(最大化或最小化阻力),且已有经典开环策略作为参考。实验发现,仅需数分钟真实交互即可让代理利用高维流场反馈学习到高性能策略;而当后续无反馈重放相同动作序列时,性能几乎不变,说明执行阶段并不需要持续反馈。更关键的是,训练时若缺少流场反馈,代理无法学会阻力最大化策略,却仍能学习阻力最小化策略,揭示了学习条件可能因目标不同而呈现“友好”或“恶劣”的性质,而非单纯取决于系统动态或策略复杂度。
链接: https://arxiv.org/abs/2512.08463
作者: Antonio Terpin,Raffaello D’Andrea
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
备注: Website: this https URL
Abstract:Many high-performance human activities are executed with little or no external feedback: think of a figure skater landing a triple jump, a pitcher throwing a curveball for a strike, or a barista pouring latte art. To study the process of skill acquisition under fully controlled conditions, we bypass human subjects. Instead, we directly interface a generalist reinforcement learning agent with a spinning cylinder in a tabletop circulating water channel to maximize or minimize drag. This setup has several desirable properties. First, it is a physical system, with the rich interactions and complex dynamics that only the physical world has: the flow is highly chaotic and extremely difficult, if not impossible, to model or simulate accurately. Second, the objective – drag minimization or maximization – is easy to state and can be captured directly in the reward, yet good strategies are not obvious beforehand. Third, decades-old experimental studies provide recipes for simple, high-performance open-loop policies. Finally, the setup is inexpensive and far easier to reproduce than human studies. In our experiments we find that high-dimensional flow feedback lets the agent discover high-performance drag-control strategies with only minutes of real-world interaction. When we later replay the same action sequences without any feedback, we obtain almost identical performance. This shows that feedback, and in particular flow feedback, is not needed to execute the learned policy. Surprisingly, without flow feedback during training the agent fails to discover any well-performing policy in drag maximization, but still succeeds in drag minimization, albeit more slowly and less reliably. Our studies show that learning a high-performance skill can require richer information than executing it, and learning conditions can be kind or wicked depending solely on the goal, not on dynamics or policy complexity.
zh
[AI-34] Biothreat Benchmark Generation Framework for Evaluating Frontier AI Models III: Implementing the Bacterial Biothreat Benchmark (B3) Dataset
【速读】:该论文旨在解决前沿生成式人工智能(Generative AI)模型,特别是大语言模型(Large Language Models, LLMs),可能被用于生物恐怖主义或获取生物武器所带来的生物安全风险评估难题。其解决方案的关键在于开发并试点应用“细菌生物威胁基准”(Bacterial Biothreat Benchmark, B3)数据集,该数据集能够对LLM的生物安全风险进行快速、细致的量化评估,识别风险来源,并为制定优先级缓解策略提供依据。
链接: https://arxiv.org/abs/2512.08459
作者: Gary Ackerman,Theodore Wilson,Zachary Kallenborn,Olivia Shoemaker,Anna Wetzel,Hayley Peterson,Abigail Danfora,Jenna LaTourette,Brandon Behlendorf,Douglas Clifford
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Emerging Technologies (cs.ET)
备注: 19 pages, 2 figures
Abstract:The potential for rapidly-evolving frontier artificial intelligence (AI) models, especially large language models (LLMs), to facilitate bioterrorism or access to biological weapons has generated significant policy, academic, and public concern. Both model developers and policymakers seek to quantify and mitigate any risk, with an important element of such efforts being the development of model benchmarks that can assess the biosecurity risk posed by a particular model. This paper discusses the pilot implementation of the Bacterial Biothreat Benchmark (B3) dataset. It is the third in a series of three papers describing an overall Biothreat Benchmark Generation (BBG) framework, with previous papers detailing the development of the B3 dataset. The pilot involved running the benchmarks through a sample frontier AI model, followed by human evaluation of model responses, and an applied risk analysis of the results along several dimensions. Overall, the pilot demonstrated that the B3 dataset offers a viable, nuanced method for rapidly assessing the biosecurity risk posed by a LLM, identifying the key sources of that risk and providing guidance for priority areas of mitigation priority.
zh
[AI-35] Biothreat Benchmark Generation Framework for Evaluating Frontier AI Models II: Benchmark Generation Process
【速读】:该论文旨在解决生成式人工智能(Generative AI)模型,特别是大语言模型(Large Language Models, LLMs),可能被用于生物恐怖主义或获取生物武器所带来的生物安全风险评估难题。其解决方案的关键在于构建一个名为Bacterial Biothreat Benchmark (B3)的数据集,作为新型Biothreat Benchmark Generation (BBG)框架的第二部分。该方法通过三种互补途径——基于网络的提示生成、红队测试(red teaming)以及现有基准语料库挖掘——生成超过7,000个潜在基准,并经去重、提升诊断性评估和质量控制后筛选出1,010个最终基准,确保这些基准具备生物安全威胁的相关性、能提供可量化的风险提升(uplift)信号,并与更广泛的生物安全分析架构对齐,从而支持多层级的精细化风险评估。
链接: https://arxiv.org/abs/2512.08451
作者: Gary Ackerman,Zachary Kallenborn,Anna Wetzel,Hayley Peterson,Jenna LaTourette,Olivia Shoemaker,Brandon Behlendorf,Sheriff Almakki,Doug Clifford,Noah Sheinbaum
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Emerging Technologies (cs.ET)
备注: 18 pages, 3 figures
Abstract:The potential for rapidly-evolving frontier artificial intelligence (AI) models, especially large language models (LLMs), to facilitate bioterrorism or access to biological weapons has generated significant policy, academic, and public concern. Both model developers and policymakers seek to quantify and mitigate any risk, with an important element of such efforts being the development of model benchmarks that can assess the biosecurity risk posed by a particular model. This paper, the second in a series of three, describes the second component of a novel Biothreat Benchmark Generation (BBG) framework: the generation of the Bacterial Biothreat Benchmark (B3) dataset. The development process involved three complementary approaches: 1) web-based prompt generation, 2) red teaming, and 3) mining existing benchmark corpora, to generate over 7,000 potential benchmarks linked to the Task-Query Architecture that was developed during the first component of the project. A process of de-duplication, followed by an assessment of uplift diagnosticity, and general quality control measures, reduced the candidates to a set of 1,010 final benchmarks. This procedure ensured that these benchmarks are a) diagnostic in terms of providing uplift; b) directly relevant to biosecurity threats; and c) are aligned with a larger biosecurity architecture permitting nuanced analysis at different levels of analysis.
zh
[AI-36] From Accuracy to Impact: The Impact-Driven AI Framework (IDAIF) for Aligning Engineering Architecture with Theory of Change
【Quick Read】: This paper addresses the alignment problem, keeping AI behavior consistent with human values and intentions, in high-stakes domains such as healthcare, finance, and public policy, where current AI systems are built with little regard for their sociotechnical dimensions. Existing approaches mostly optimize technical performance metrics while neglecting the social impact and value orientation of AI deployment. The key to the proposed Impact-Driven AI Framework (IDAIF) is a mapping from the five stages of the Theory of Change (ToC) model (Inputs-Activities-Outputs-Outcomes-Impact) onto five architectural layers of an AI system (Data Layer-Pipeline Layer-Inference Layer-Agentic Layer-Normative Layer): value alignment is achieved via multi-objective Pareto optimization, outcomes via hierarchical multi-agent orchestration, hallucination mitigation via causal directed acyclic graphs (DAGs), and fairness via adversarial debiasing with reinforcement learning from human feedback (RLHF), while an Assurance Layer manages assumption failures through guardian architectures, shifting AI development from model-centric to impact-centric.
Link: https://arxiv.org/abs/2512.08449
Authors: Yong-Woon Kim
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:This paper introduces the Impact-Driven AI Framework (IDAIF), a novel architectural methodology that integrates Theory of Change (ToC) principles with modern artificial intelligence system design. As AI systems increasingly influence high-stakes domains including healthcare, finance, and public policy, the alignment problem, ensuring that AI behavior corresponds with human values and intentions, has become critical. Current approaches predominantly optimize technical performance metrics while neglecting the sociotechnical dimensions of AI deployment. IDAIF addresses this gap by establishing a systematic mapping between ToC’s five-stage model (Inputs-Activities-Outputs-Outcomes-Impact) and corresponding AI architectural layers (Data Layer-Pipeline Layer-Inference Layer-Agentic Layer-Normative Layer). Each layer incorporates rigorous theoretical foundations: multi-objective Pareto optimization for value alignment, hierarchical multi-agent orchestration for outcome achievement, causal directed acyclic graphs (DAGs) for hallucination mitigation, and adversarial debiasing with Reinforcement Learning from Human Feedback (RLHF) for fairness assurance. We provide formal mathematical formulations for each component and introduce an Assurance Layer that manages assumption failures through guardian architectures. Three case studies demonstrate IDAIF application across healthcare, cybersecurity, and software engineering domains. This framework represents a paradigm shift from model-centric to impact-centric AI development, providing engineers with concrete architectural patterns for building ethical, trustworthy, and socially beneficial AI systems.
zh
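IDAIF formalizes value alignment as multi-objective Pareto optimization, which a tiny worked example can make concrete. The sketch below is illustrative only (the candidate tuples and the higher-is-better convention are assumptions, not the paper's formulation): it returns the non-dominated set among candidate policies scored on several value objectives.

```python
def pareto_front(candidates):
    """Return the non-dominated candidates; each candidate is a tuple of
    objective scores and higher is better on every objective."""
    front = []
    for i, c in enumerate(candidates):
        dominated = any(
            all(o >= v for o, v in zip(other, c)) and other != c
            for j, other in enumerate(candidates) if j != i
        )
        if not dominated:
            front.append(c)
    return front

# e.g., (helpfulness, fairness) scores for three candidate policies:
print(pareto_front([(0.9, 0.4), (0.6, 0.8), (0.5, 0.5)]))  # keeps the first two
```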
[AI-37] Prismatic World Model: Learning Compositional Dynamics for Planning in Hybrid Systems
【Quick Read】: This paper tackles the accumulation of long-horizon prediction errors in model-based planning for robotics caused by the hybrid nature of physical dynamics: conventional latent world models impose global continuity and over-smooth distinct dynamic modes (e.g., the transitions between sticking and sliding, or flight and stance), leading to catastrophic error propagation at physical boundaries. The key to the proposed Prismatic World Model (PRISM-WM) is a context-aware Mixture-of-Experts (MoE) architecture: a gating mechanism implicitly identifies the current physical mode while specialized experts predict the corresponding transition dynamics, and a latent orthogonalization objective enforces expert diversity to prevent mode collapse. By accurately modeling the sharp mode-transition boundaries of the system dynamics, PRISM-WM markedly reduces drift during trajectory rollouts and provides a high-fidelity dynamics substrate for trajectory-optimization algorithms such as TD-MPC.
Link: https://arxiv.org/abs/2512.08411
Authors: Mingwei Li,Xiaoyuan Zhang,Chengwei Yang,Zilong Zheng,Yaodong Yang
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:
Abstract:Model-based planning in robotic domains is fundamentally challenged by the hybrid nature of physical dynamics, where continuous motion is punctuated by discrete events such as contacts and impacts. Conventional latent world models typically employ monolithic neural networks that enforce global continuity, inevitably over-smoothing the distinct dynamic modes (e.g., sticking vs. sliding, flight vs. stance). For a planner, this smoothing results in catastrophic compounding errors during long-horizon lookaheads, rendering the search process unreliable at physical boundaries. To address this, we introduce the Prismatic World Model (PRISM-WM), a structured architecture designed to decompose complex hybrid dynamics into composable primitives. PRISM-WM leverages a context-aware Mixture-of-Experts (MoE) framework where a gating mechanism implicitly identifies the current physical mode, and specialized experts predict the associated transition dynamics. We further introduce a latent orthogonalization objective to ensure expert diversity, effectively preventing mode collapse. By accurately modeling the sharp mode transitions in system dynamics, PRISM-WM significantly reduces rollout drift. Extensive experiments on challenging continuous control benchmarks, including high-dimensional humanoids and diverse multi-task settings, demonstrate that PRISM-WM provides a superior high-fidelity substrate for trajectory optimization algorithms (e.g., TD-MPC), proving its potential as a powerful foundational model for next-generation model-based agents.
zh
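To make the gated mixture-of-experts idea in PRISM-WM concrete, here is a minimal PyTorch sketch of a transition model whose gate implicitly picks a physical mode and whose experts predict per-mode dynamics, plus a simple orthogonality penalty encouraging expert diversity. Layer sizes, the gate design, and the penalty form are assumptions for illustration; the paper's actual architecture and latent orthogonalization objective may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedTransitionModel(nn.Module):
    """Context-aware MoE sketch: the gate infers the mode from (state, action),
    and specialized experts predict the corresponding next state."""
    def __init__(self, state_dim, action_dim, n_experts=4, hidden=128):
        super().__init__()
        in_dim = state_dim + action_dim
        self.gate = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, n_experts))
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, state_dim))
            for _ in range(n_experts))

    def forward(self, s, a):
        x = torch.cat([s, a], dim=-1)
        w = F.softmax(self.gate(x), dim=-1)                   # implicit mode weights
        preds = torch.stack([e(x) for e in self.experts], 1)  # (B, E, state_dim)
        next_s = (w.unsqueeze(-1) * preds).sum(dim=1)         # mode-weighted prediction
        return next_s, preds

def orthogonality_penalty(preds):
    """Penalize off-diagonal cosine similarity between expert predictions,
    one plausible reading of the expert-diversity objective."""
    z = F.normalize(preds, dim=-1)                 # (B, E, D)
    gram = torch.einsum('bed,bfd->bef', z, z)      # pairwise expert similarity
    eye = torch.eye(gram.size(-1), device=gram.device)
    return ((gram - eye) ** 2).mean()
```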
[AI-38] DeepFeature: Iterative Context-aware Feature Generation for Wearable Biosignals
【Quick Read】: This paper addresses three core problems in feature extraction from wearable biosignals: existing methods lack task-specific contextual knowledge, struggle to identify optimal extraction settings in a high-dimensional feature space, and are prone to code-generation and automation errors. The key to the proposed DeepFeature, the first LLM-empowered, context-aware feature generation framework, is a multi-source feature generation mechanism that fuses expert knowledge with task settings, together with an iterative refinement process that re-selects features based on assessment feedback; a multi-layer filtering and verification scheme further guarantees robust feature-to-code translation, substantially improving feature quality and system stability.
Link: https://arxiv.org/abs/2512.08379
Authors: Kaiwei Liu,Yuting He,Bufang Yang,Mu Yuan,Chun Man Victor Wong,Ho Pong Andrew Sze,Zhenyu Yan,Hongkai Chen
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Biosignals collected from wearable devices are widely utilized in healthcare applications. Machine learning models used in these applications often rely on features extracted from biosignals due to their effectiveness, lower data dimensionality, and wide compatibility across various model architectures. However, existing feature extraction methods often lack task-specific contextual knowledge, struggle to identify optimal feature extraction settings in high-dimensional feature space, and are prone to code generation and automation errors. In this paper, we propose DeepFeature, the first LLM-empowered, context-aware feature generation framework for wearable biosignals. DeepFeature introduces a multi-source feature generation mechanism that integrates expert knowledge with task settings. It also employs an iterative feature refinement process that uses feature assessment-based feedback for feature re-selection. Additionally, DeepFeature utilizes a robust multi-layer filtering and verification approach for robust feature-to-code translation to ensure that the extraction functions run without crashing. Experimental evaluation results show that DeepFeature achieves an average AUROC improvement of 4.21-9.67% across eight diverse tasks compared to baseline methods. It outperforms state-of-the-art approaches on five tasks while maintaining comparable performance on the remaining tasks.
zh
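The assessment-driven refinement loop at DeepFeature's core can be sketched in a few lines. Here `propose` stands in for the LLM-backed, knowledge-grounded feature generator (a hypothetical callable), and candidates are scored by the cross-validated AUROC of a downstream classifier; the actual multi-source generation and verification stages are richer than this.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def auroc_of(feature_fns, signals, labels):
    """Score a candidate feature set via cross-validated AUROC."""
    X = np.column_stack([[fn(s) for s in signals] for fn in feature_fns])
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    return cross_val_score(clf, X, labels, cv=5, scoring="roc_auc").mean()

def refine_features(signals, labels, propose, rounds=3, keep=10):
    """propose(feedback) -> list of (name, fn) candidates, e.g. generated by
    an LLM prompted with expert knowledge, task settings, and prior scores."""
    pool, feedback = [], None
    for _ in range(rounds):
        pool += propose(feedback)
        pool.sort(key=lambda nf: auroc_of([nf[1]], signals, labels), reverse=True)
        pool = pool[:keep]                     # assessment-based re-selection
        feedback = [name for name, _ in pool]  # tell the generator what survived
    return pool
```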
[AI-39] Reflecting with Two Voices: A Co-Adaptive Dual-Strategy Framework for LLM-Based Agent Decision Making
【Quick Read】: This paper targets the brittleness, poor generalization, and high computational overhead that arise when large language model (LLM) agents rely on external demonstrations or retrieval-augmented planning for complex tasks. The core of the proposed DuSAR (Dual-Strategy Agent with Reflecting) framework is demonstration-free, co-adaptive reasoning through two complementary strategies: a high-level holistic plan and a context-grounded local policy. The two interact via a lightweight reflection mechanism: the agent continuously estimates progress with a Strategy Fitness Score, dynamically revising the global plan when stuck and refining it upon meaningful advancement, mimicking human metacognition. This design yields significant gains in both performance and efficiency, achieving state-of-the-art results on the ALFWorld and Mind2Web benchmarks while reducing per-step token consumption by 3-9x.
Link: https://arxiv.org/abs/2512.08366
Authors: Wentao Zhang,Qunbo Wang,Tao Zhang,Junsheng Wu,Hongping Gan,Yang Liu,Ling Dai,Shizhuang Deng,Shuntong Sun
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language model (LLM) agents often rely on external demonstrations or retrieval-augmented planning, leading to brittleness, poor generalization, and high computational overhead. Inspired by human problem-solving, we propose DuSAR (Dual-Strategy Agent with Reflecting) - a demonstration-free framework that enables a single frozen LLM to perform co-adaptive reasoning via two complementary strategies: a high-level holistic plan and a context-grounded local policy. These strategies interact through a lightweight reflection mechanism, where the agent continuously assesses progress via a Strategy Fitness Score and dynamically revises its global plan when stuck or refines it upon meaningful advancement, mimicking human metacognitive behavior. On ALFWorld and Mind2Web, DuSAR achieves state-of-the-art performance with open-source LLMs (7B-70B), reaching 37.1% success on ALFWorld (Llama3.1-70B) - more than doubling the best prior result (13.0%) - and 4.02% on Mind2Web, also more than doubling the strongest baseline. Remarkably, it reduces per-step token consumption by 3-9X while maintaining strong performance. Ablation studies confirm the necessity of dual-strategy coordination. Moreover, optional integration of expert demonstrations further boosts results, highlighting DuSAR’s flexibility and compatibility with external knowledge.
zh
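A minimal sketch of DuSAR's dual-strategy reflection loop appears below. The `llm` and `env` callables, the prompt wording, and the progress heuristic are all placeholders; the real Strategy Fitness Score is presumably more principled than this toy proxy.

```python
def strategy_fitness(history, window=5):
    """Toy progress proxy: fraction of recent steps whose feedback looks useful."""
    recent = history[-window:]
    return sum("progress" in fb for _, fb in recent) / max(len(recent), 1)

def dusar_episode(llm, env, max_steps=30, fitness_floor=0.3):
    plan = llm(f"Draft a high-level plan for: {env.goal}")       # holistic strategy
    history = []
    for _ in range(max_steps):
        action = llm(f"Plan: {plan}\nRecent: {history[-3:]}\n"
                     f"Observation: {env.observe()}\nNext action:")  # local policy
        history.append((action, env.step(action)))
        if env.done():
            break
        if strategy_fitness(history) < fitness_floor:            # stuck: reflect
            plan = llm(f"The plan stalled.\nPlan: {plan}\n"
                       f"Trace: {history[-3:]}\nWrite a revised plan:")
    return history
```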
[AI-40] Enhancing Explainability of Graph Neural Networks Through Conceptual and Structural Analyses and Their Extensions DATE
【Quick Read】: This paper addresses the limited explainability of graph neural networks (GNNs) in practice: existing explainable AI (XAI) methods struggle to untangle the intricate relationships between nodes in graph-structured data and their influence on predictions. Mainstream approaches are either post-hoc, which incurs extra computational cost and limited reliability, or self-interpretable by design, which generalizes poorly across scenarios. The key contribution is a new XAI framework tailored to graph-based machine learning whose explanations are adaptable, computationally efficient, and go beyond individual-feature analysis to capture how the graph structure as a whole shapes model decisions, improving the transparency and trustworthiness of GNN decision-making.
Link: https://arxiv.org/abs/2512.08344
Authors: Tien Cuong Bui
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: 157 pages, Doctoral dissertation at Seoul National University (submitted in 2024.08 to SNU library, slightly updated in 2025.11 for open digital version)
Abstract:Graph Neural Networks (GNNs) have become a powerful tool for modeling and analyzing data with graph structures. The wide adoption in numerous applications underscores the value of these models. However, the complexity of these methods often impedes understanding their decision-making processes. Current Explainable AI (XAI) methods struggle to untangle the intricate relationships and interactions within graphs. Several methods have tried to bridge this gap via a post-hoc approach or self-interpretable design. Most of them focus on graph structure analysis to determine essential patterns that correlate with prediction outcomes. While post-hoc explanation methods are adaptable, they require extra computational resources and may be less reliable due to limited access to the model’s internal workings. Conversely, interpretable models can provide immediate explanations, but their generalizability to different scenarios remains a major concern. To address these shortcomings, this thesis seeks to develop a novel XAI framework tailored for graph-based machine learning. The proposed framework aims to offer adaptable, computationally efficient explanations for GNNs, moving beyond individual feature analysis to capture how graph structure influences predictions.
zh
[AI-41] Soil Compaction Parameters Prediction Based on Automated Machine Learning Approach
【Quick Read】: This paper addresses the reliance of traditional soil-compaction parameter determination (optimum moisture content, OMC, and maximum dry density, MDD) on labor-intensive laboratory tests, and the poor generalization of empirical regression models across soil types. The key is an automated machine learning (AutoML) approach that automates algorithm selection and hyperparameter optimization to improve accuracy and cross-soil-type generalization; experiments show that XGBoost performs best, reaching R^2 values of 80.4% for MDD and 89.1% for OMC on a separate test set, confirming the clear advantage of AutoML on complex, heterogeneous soil data.
Link: https://arxiv.org/abs/2512.08343
Authors: Caner Erden,Alparslan Serhat Demir,Abdullah Hulusi Kokcam,Talas Fikret Kurnaz,Ugur Dagdeviren
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Presented at the 13th International Symposium on Intelligent Manufacturing and Service Systems, Duzce, Turkey, Sep 25-27, 2025. Also available on Zenodo: DOI https://doi.org/10.5281/zenodo.17533851
Abstract:Soil compaction is critical in construction engineering to ensure the stability of structures like road embankments and earth dams. Traditional methods for determining optimum moisture content (OMC) and maximum dry density (MDD) involve labor-intensive laboratory experiments, and empirical regression models have limited applicability and accuracy across diverse soil types. In recent years, artificial intelligence (AI) and machine learning (ML) techniques have emerged as alternatives for predicting these compaction parameters. However, ML models often struggle with prediction accuracy and generalizability, particularly with heterogeneous datasets representing various soil types. This study proposes an automated machine learning (AutoML) approach to predict OMC and MDD. AutoML automates algorithm selection and hyperparameter optimization, potentially improving accuracy and scalability. Through extensive experimentation, the study found that the Extreme Gradient Boosting (XGBoost) algorithm provided the best performance, achieving R-squared values of 80.4% for MDD and 89.1% for OMC on a separate dataset. These results demonstrate the effectiveness of AutoML in predicting compaction parameters across different soil types. The study also highlights the importance of heterogeneous datasets in improving the generalization and performance of ML models. Ultimately, this research contributes to more efficient and reliable construction practices by enhancing the prediction of soil compaction parameters.
zh
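As a rough illustration of the AutoML recipe (automated hyperparameter search over a gradient-boosting regressor), here is a scikit-learn sketch. The synthetic arrays, feature count, and search grid are invented for the example; the study's actual AutoML tooling and search space are not specified in the abstract.

```python
import numpy as np
from xgboost import XGBRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

rng = np.random.default_rng(0)
X = rng.random((200, 8))   # placeholder soil properties (e.g., gradation, limits)
y = rng.random(200)        # placeholder target (OMC or MDD)

search = RandomizedSearchCV(
    XGBRegressor(objective="reg:squarederror", random_state=0),
    param_distributions={
        "n_estimators": [200, 400, 800],
        "max_depth": [3, 5, 7],
        "learning_rate": [0.01, 0.05, 0.1],
        "subsample": [0.7, 0.9, 1.0],
    },
    n_iter=20, cv=5, scoring="r2", random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
search.fit(X_tr, y_tr)
print("held-out R^2:", search.score(X_te, y_te))
```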
[AI-42] Predicting California Bearing Ratio with Ensemble and Neural Network Models: A Case Study from Türkiye
【Quick Read】: This paper addresses the time, cost, and scalability limitations of traditional California Bearing Ratio (CBR) testing, which make it impractical for large-scale projects or complex soil profiles. The key is a machine-learning prediction framework built on a dataset of 382 soil samples from different geoclimatic regions of Türkiye, with multidimensional physicochemical soil properties as features; twelve mainstream regression algorithms are trained and compared, and the random forest model performs best with a test-set R^2 of 0.83, demonstrating strong nonlinear mapping ability and providing an efficient, reliable data-driven alternative for predicting subgrade bearing capacity.
Link: https://arxiv.org/abs/2512.08340
Authors: Abdullah Hulusi Kökçam,Uğur Dağdeviren,Talas Fikret Kurnaz,Alparslan Serhat Demir,Caner Erden
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Presented at the 13th International Symposium on Intelligent Manufacturing and Service Systems, Duzce, Turkey, Sep 25-27, 2025. Also available on Zenodo: DOI https://doi.org/10.5281/zenodo.17530868
Abstract:The California Bearing Ratio (CBR) is a key geotechnical indicator used to assess the load-bearing capacity of subgrade soils, especially in transportation infrastructure and foundation design. Traditional CBR determination relies on laboratory penetration tests. Despite their accuracy, these tests are often time-consuming, costly, and can be impractical, particularly for large-scale or diverse soil profiles. Recent progress in artificial intelligence, especially machine learning (ML), has enabled data-driven approaches for modeling complex soil behavior with greater speed and precision. This study introduces a comprehensive ML framework for CBR prediction using a dataset of 382 soil samples collected from various geoclimatic regions in Türkiye. The dataset includes physicochemical soil properties relevant to bearing capacity, allowing multidimensional feature representation in a supervised learning context. Twelve ML algorithms were tested, including decision tree, random forest, extra trees, gradient boosting, xgboost, k-nearest neighbors, support vector regression, multi-layer perceptron, adaboost, bagging, voting, and stacking regressors. Each model was trained, validated, and evaluated to assess its generalization and robustness. Among them, the random forest regressor performed the best, achieving strong R^2 scores of 0.95 (training), 0.76 (validation), and 0.83 (test). These outcomes highlight the model’s powerful nonlinear mapping ability, making it a promising tool for predictive geotechnical tasks. The study supports the integration of intelligent, data-centric models in geotechnical engineering, offering an effective alternative to traditional methods and promoting digital transformation in infrastructure analysis and design.
zh
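Since the winning model is a plain random forest regressor, the train/validation/test protocol is easy to sketch with scikit-learn. The synthetic arrays below merely stand in for the 382 samples of physicochemical soil properties; the split ratios are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((382, 10))   # placeholder physicochemical soil properties
y = rng.random(382)         # placeholder measured CBR values

X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.15, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.15 / 0.85, random_state=0)   # roughly 70/15/15

rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_train, y_train)
for name, Xs, ys in [("train", X_train, y_train),
                     ("val", X_val, y_val),
                     ("test", X_test, y_test)]:
    print(name, round(r2_score(ys, rf.predict(Xs)), 2))
```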
[AI-43] Robust Finetuning of Vision-Language-Action Robot Policies via Parameter Merging
【Quick Read】: This paper addresses overfitting when generalist robot policies are finetuned on limited demonstrations of a new task: the finetuned policy not only loses its broad generalist abilities but also fails to generalize across variations of the new task itself. The key is a simple yet effective strategy: interpolating the weights of the finetuned model with those of the pretrained model, yielding a single model that retains the broad skills acquired during pretraining while robustly mastering the new task. Extensive simulated and real-world experiments show that this model merging markedly improves out-of-distribution generalization on the new task and supports continual acquisition of new skills in a lifelong-learning setting without sacrificing previously learned generalist abilities.
Link: https://arxiv.org/abs/2512.08333
Authors: Yajat Yadav,Zhiyuan Zhou,Andrew Wagenmaker,Karl Pertsch,Sergey Levine
Institution: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
Abstract:Generalist robot policies, trained on large and diverse datasets, have demonstrated the ability to generalize across a wide spectrum of behaviors, enabling a single policy to act in varied real-world environments. However, they still fall short on new tasks not covered in the training data. When finetuned on limited demonstrations of a new task, these policies often overfit to the specific demonstrations–not only losing their prior abilities to solve a wide variety of generalist tasks but also failing to generalize within the new task itself. In this work, we aim to develop a method that preserves the generalization capabilities of the generalist policy during finetuning, allowing a single policy to robustly incorporate a new skill into its repertoire. Our goal is a single policy that both learns to generalize to variations of the new task and retains the broad competencies gained from pretraining. We show that this can be achieved through a simple yet effective strategy: interpolating the weights of a finetuned model with that of the pretrained model. We show, across extensive simulated and real-world experiments, that such model merging produces a single model that inherits the generalist abilities of the base model and learns to solve the new task robustly, outperforming both the pretrained and finetuned model on out-of-distribution variations of the new task. Moreover, we show that model merging enables continual acquisition of new skills in a lifelong learning setting, without sacrificing previously learned generalist abilities.
zh
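The core recipe, linear interpolation between pretrained and finetuned weights, fits in a few lines of PyTorch. The mixing coefficient alpha and the handling of non-float buffers below are illustrative choices, not the paper's exact settings.

```python
import torch

def merge_state_dicts(pretrained, finetuned, alpha=0.5):
    """theta_merged = (1 - alpha) * theta_pretrained + alpha * theta_finetuned."""
    merged = {}
    for key, w_pre in pretrained.items():
        w_ft = finetuned[key]
        if torch.is_floating_point(w_pre):
            merged[key] = (1 - alpha) * w_pre + alpha * w_ft
        else:
            merged[key] = w_ft   # e.g., integer buffers: keep the finetuned copy
    return merged

# usage sketch:
# policy.load_state_dict(merge_state_dicts(base.state_dict(),
#                                          tuned.state_dict(), alpha=0.5))
```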
[AI-44] Argus: A Multi-Agent Sensitive Information Leakage Detection Framework Based on Hierarchical Reference Relationships ICSE2026
【Quick Read】: This paper addresses the detection of sensitive information leakage in code repositories, where traditional methods based on regular expressions, fingerprint features, and high-entropy heuristics often yield high false-positive rates, lowering detection efficiency and increasing developers' manual screening burden. The key to the proposed multi-agent collaborative framework Argus is a three-tier detection mechanism that integrates key content, file context, and project-level reference relationships, substantially reducing false positives and improving overall detection accuracy.
Link: https://arxiv.org/abs/2512.08326
Authors: Bin Wang,Hui Li,Liyang Zhang,Qijia Zhuang,Ao Yang,Dong Zhang,Xijun Luo,Bing Lin
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 11 pages, 7 figures, 8 tables; Accepted to ICSE 2026 Research Track
Abstract:Sensitive information leakage in code repositories has emerged as a critical security challenge. Traditional detection methods that rely on regular expressions, fingerprint features, and high-entropy calculations often suffer from high false-positive rates. This not only reduces detection efficiency but also significantly increases the manual screening burden on developers. Recent advances in large language models (LLMs) and multi-agent collaborative architectures have demonstrated remarkable potential for tackling complex tasks, offering a novel technological perspective for sensitive information detection. In response to these challenges, we propose Argus, a multi-agent collaborative framework for detecting sensitive information. Argus employs a three-tier detection mechanism that integrates key content, file context, and project reference relationships to effectively reduce false positives and enhance overall detection accuracy. To comprehensively evaluate Argus in real-world repository environments, we developed two new benchmarks, one to assess genuine leak detection capabilities and another to evaluate false-positive filtering performance. Experimental results show that Argus achieves up to 94.86% accuracy in leak detection, with a precision of 96.36%, recall of 94.64%, and an F1 score of 0.955. Moreover, the analysis of 97 real repositories incurred a total cost of only $2.2. All code implementations and related datasets are publicly available at this https URL for further research and application.
zh
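For context, the classic context-free baseline that Argus improves on, flagging every long high-entropy token, fits in a dozen lines; running it on real code makes the false-positive problem obvious (hashes, UUIDs, and minified blobs all fire), which is exactly what the file- and project-level tiers are meant to filter. The threshold and token pattern are conventional choices, not Argus internals.

```python
import math
import re

TOKEN = re.compile(r"[A-Za-z0-9+/=_\-]{20,}")

def shannon_entropy(s: str) -> float:
    probs = [s.count(c) / len(s) for c in set(s)]
    return -sum(p * math.log2(p) for p in probs)

def candidate_secrets(text: str, threshold: float = 4.0):
    """Flag long, random-looking strings: high recall, notoriously low precision."""
    return [(m.group(), round(shannon_entropy(m.group()), 2))
            for m in TOKEN.finditer(text)
            if shannon_entropy(m.group()) >= threshold]

print(candidate_secrets('api_key = "sk_live_9fF3kZp2Qw7xL0aB8cD4eG6h"'))
```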
[AI-45] rSIM: Incentivizing Reasoning Capabilities of LLMs via Reinforced Strategy Injection ACL
【Quick Read】: This paper asks how to endow large language models (LLMs) with advanced, human-like reasoning, that is, how to evolve them from pure language generators into Reasoning Language Models (RLMs) capable of executing "aha"-style strategies such as self-reflection and deep thinking. The central challenge is to inject and steer reasoning strategies to improve the quality of the chain of thought (CoT) without retraining the whole LLM. The key is a novel reinforced Strategy Injection Mechanism (rSIM): a small planner (the leader agent) is trained jointly with an LLM (the follower agent) under a multi-agent reinforcement learning (MARL) framework with simple rule-based rewards, adaptively injecting reasoning strategies into the LLM's CoT. The mechanism is highly transferable: trained once, the planner serves as a plug-in that substantially strengthens the reasoning of existing LLMs, and it supports continual learning across tasks, gradually improving its planning ability and generalizing to a wider range of problems.
Link: https://arxiv.org/abs/2512.08300
Authors: Sijia Chen,Baochun Li,Di Niu
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 14 pages, 6 figures. Accepted to the ACL ARR July
Abstract:Large language models (LLMs) are post-trained through reinforcement learning (RL) to evolve into Reasoning Language Models (RLMs), where the hallmark of this advanced reasoning is "aha" moments when they start to perform strategies, such as self-reflection and deep thinking, within chains of thought (CoTs). Motivated by this, this paper proposes a novel reinforced strategy injection mechanism (rSIM) that enables any LLM to become an RLM by employing a small planner to guide the LLM’s CoT through the adaptive injection of reasoning strategies. To achieve this, the planner (leader agent) is jointly trained with an LLM (follower agent) using multi-agent RL (MARL), based on a leader-follower framework and straightforward rule-based rewards. Experimental results show that rSIM enables Qwen2.5-0.5B to become an RLM and significantly outperform Qwen2.5-14B. Moreover, the planner is generalizable: it only needs to be trained once and can be applied as a plug-in to substantially improve the reasoning capabilities of existing LLMs. In addition, the planner supports continual learning across various tasks, allowing its planning abilities to gradually improve and generalize to a wider range of problems.
zh
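A toy sketch of the leader-follower idea: a small planner picks a strategy tag to splice into the LLM's chain of thought and is updated from a rule-based reward. The strategy list, prompt format, and exact-match reward are stand-ins, and the paper trains the planner with multi-agent RL rather than the generic `planner.update` shown here.

```python
STRATEGIES = ["reflect", "decompose", "verify"]   # illustrative strategy set

def inject_and_score(planner, llm, question, reference_answer):
    """One rollout of reinforced strategy injection (sketch)."""
    strategy = planner.choose(question)                  # leader agent's action
    prompt = f"{question}\n[strategy: {strategy}] Think step by step."
    answer = llm(prompt)                                 # follower agent's CoT
    reward = 1.0 if reference_answer in answer else 0.0  # rule-based reward
    planner.update(question, strategy, reward)           # e.g., a bandit/RL update
    return answer, reward
```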
[AI-46] Towards a Science of Scaling Agent Systems
【Quick Read】: This paper addresses the lack of quantitative, predictive scaling principles for language-model-based agent systems in real applications, which leaves practitioners relying on heuristics rather than principled design. The key is a systematic, controlled empirical study that builds a predictive model from measurable coordination metrics (efficiency, overhead, error amplification, and redundancy) and identifies three dominant effects: a tool-coordination trade-off, capability saturation, and topology-dependent error amplification. The resulting framework recommends the optimal multi-agent coordination strategy from task properties, correctly predicting the best configuration for 87% of held-out settings and providing interpretable, practical design principles for scaling agent systems.
Link: https://arxiv.org/abs/2512.08296
Authors: Yubin Kim,Ken Gu,Chanwoo Park,Chunjong Park,Samuel Schmidgall,A. Ali Heydari,Yao Yan,Zhihan Zhang,Yuchen Zhuang,Mark Malhotra,Paul Pu Liang,Hae Won Park,Yuzhe Yang,Xuhai Xu,Yilun Du,Shwetak Patel,Tim Althoff,Daniel McDuff,Xin Liu
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Agents, language model (LM)-based systems capable of reasoning, planning, and acting, are becoming the dominant paradigm for real-world AI applications. Despite this widespread adoption, the principles that determine their performance remain underexplored, leaving practitioners to rely on heuristics rather than principled design choices. We address this gap by deriving quantitative scaling principles for agent systems. We evaluate this across four diverse benchmarks: Finance-Agent, BrowseComp-Plus, PlanCraft, and Workbench. Using five canonical architectures (Single, Independent, Centralized, Decentralized, Hybrid) instantiated across three LLM families, we perform a controlled evaluation spanning 180 configurations with standardized tools and token budgets. We derive a predictive model using empirical coordination metrics, including efficiency, overhead, error amplification, and redundancy, that achieves cross-validated R^2=0.513. We identify three dominant effects: (1) a tool-coordination trade-off: under fixed computational budgets, tool-heavy tasks suffer disproportionately from multi-agent overhead. (2) a capability saturation: coordination yields diminishing or negative returns (beta = -0.408, p < 0.001) once single-agent baselines exceed ~45%. (3) topology-dependent error amplification: independent agents amplify errors 17.2x through unchecked propagation, while centralized coordination contains this to 4.4x. Centralized coordination improves performance by 80.9% on parallelizable tasks like financial reasoning, while decentralized coordination excels on dynamic web navigation (+9.2% vs. +0.2%). Yet for sequential reasoning tasks, all multi-agent variants degraded performance by 39-70%. The framework predicts the optimal coordination strategy for 87% of held-out configurations, providing a predictive principle of agentic scaling based on measurable task properties.
zh
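At its core the predictive model is a regression from measured coordination metrics to benchmark performance, which is straightforward to reproduce in spirit. Everything below, the synthetic data most of all, is a placeholder showing the shape of the analysis, not the paper's dataset or model class.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

METRICS = ["efficiency", "overhead", "error_amplification", "redundancy"]

rng = np.random.default_rng(0)
X = rng.random((180, len(METRICS)))      # one row per agent configuration
y = 0.5 * X[:, 0] - 0.3 * X[:, 2] + 0.1 * rng.standard_normal(180)  # toy target

r2 = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2").mean()
print(f"cross-validated R^2: {r2:.3f}")  # the paper reports R^2 = 0.513
```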
[AI-47] Systematization of Knowledge: Security and Safety in the Model Context Protocol Ecosystem
【Quick Read】: This paper addresses the new class of security risks introduced by the Model Context Protocol (MCP) ecosystem's decoupling of context from execution, in particular the need to distinguish and counter adversarial attacks (e.g., indirect prompt injection and tool poisoning) versus epistemic safety hazards (e.g., alignment failures in distributed tool delegation). The key is a systematic analysis of the structural vulnerabilities of MCP's three core primitives, Resources, Prompts, and Tools, together with a survey of defenses ranging from cryptographic provenance (ETDI) to runtime intent verification, to harden the transition from conversational chatbots to autonomous agentic operating systems.
Link: https://arxiv.org/abs/2512.08290
Authors: Shiva Gaire,Srijan Gyawali,Saroj Mishra,Suman Niroula,Dilip Thakur,Umesh Yadav
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: All authors contributed equally to this work
Abstract:The Model Context Protocol (MCP) has emerged as the de facto standard for connecting Large Language Models (LLMs) to external data and tools, effectively functioning as the “USB-C for Agentic AI.” While this decoupling of context and execution solves critical interoperability challenges, it introduces a profound new threat landscape where the boundary between epistemic errors (hallucinations) and security breaches (unauthorized actions) dissolves. This Systematization of Knowledge (SoK) aims to provide a comprehensive taxonomy of risks in the MCP ecosystem, distinguishing between adversarial security threats (e.g., indirect prompt injection, tool poisoning) and epistemic safety hazards (e.g., alignment failures in distributed tool delegation). We analyze the structural vulnerabilities of MCP primitives, specifically Resources, Prompts, and Tools, and demonstrate how “context” can be weaponized to trigger unauthorized operations in multi-agent environments. Furthermore, we survey state-of-the-art defenses, ranging from cryptographic provenance (ETDI) to runtime intent verification, and conclude with a roadmap for securing the transition from conversational chatbots to autonomous agentic operating systems.
zh
[AI-48] Empowering smart app development with SolidGPT: an edge-cloud hybrid AI agent framework
【Quick Read】: This paper addresses the three-way tension that large language models (LLMs) face in mobile and software development workflows: semantic awareness, developer productivity, and data privacy. Cloud-based tools offer strong reasoning but risk data exposure and latency, while purely on-device solutions lack full-context understanding across the codebase and developer tooling. The key is SolidGPT, an open-source, GitHub-based edge-cloud hybrid developer assistant: it runs locally (via Docker or a VSCode extension) to preserve data privacy while combining interactive code querying, automated project scaffolding, and configurable private agents (customizable via embeddings and in-context training), delivering semantically aware code navigation, integrated task management, and flexible human-AI collaboration that improves efficiency in real development scenarios without sacrificing privacy.
Link: https://arxiv.org/abs/2512.08286
Authors: Liao Hu,Qiteng Wu,Ruoyu Qi
Institution: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:The integration of Large Language Models (LLMs) into mobile and software development workflows faces a persistent tension among three demands: semantic awareness, developer productivity, and data privacy. Traditional cloud-based tools offer strong reasoning but risk data exposure and latency, while on-device solutions lack full-context understanding across the codebase and developer tooling. We introduce SolidGPT, an open-source, edge-cloud hybrid developer assistant built on GitHub, designed to enhance code and workspace semantic search. SolidGPT enables developers to (1) talk to their codebase: interactively query code and project structure, discovering the right methods and modules without manual searching; (2) automate software project workflows: generate PRDs, task breakdowns, Kanban boards, and even scaffold the beginnings of web apps, with deep integration via VSCode and Notion; and (3) configure private, extensible agents: onboard private code folders (up to approximately 500 files), connect Notion, customize AI agent personas via embedding and in-context training, and deploy via Docker, CLI, or VSCode extension. In practice, SolidGPT empowers developer productivity through semantic-rich code navigation (no more hunting through files or wondering where a feature lives), integrated documentation and task management (seamlessly syncing generated PRD content and task boards into developer workflows), and a privacy-first design (running locally via Docker or VSCode, with full control over code and data, while optionally reaching out to LLM APIs as needed). By combining interactive code querying, automated project scaffolding, and human-AI collaboration, SolidGPT provides a practical, privacy-respecting edge assistant that accelerates real-world development workflows, ideal for intelligent mobile and software engineering contexts.
zh
[AI-49] Model-Based Diffusion Sampling for Predictive Control in Offline Decision Making
【Quick Read】: This paper addresses the dynamic infeasibility of trajectories generated in offline decision-making: existing generative methods can synthesize behaviors from fixed datasets, but the resulting trajectories often violate the system dynamics. The key is the proposed Model Predictive Diffuser (MPDiffuser), a model-based diffusion framework with three components: (i) a planner that generates diverse, task-aligned trajectories; (ii) a dynamics model that enforces consistency with the underlying system dynamics; and (iii) a ranker that selects behaviors matching the task objectives. An alternating diffusion sampling scheme interleaves planner and dynamics updates during sampling, progressively refining trajectories for both task alignment and dynamic feasibility. A theoretical analysis shows this strategy balances fidelity to the data prior against dynamics consistency, and experiments demonstrate gains over existing methods on the D4RL and DSRL benchmarks, with extensions to vision-based control and deployment on a real quadrupedal robot.
Link: https://arxiv.org/abs/2512.08280
Authors: Haldun Balim,Na Li,Yilun Du
Institution: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments:
Abstract:Offline decision-making requires synthesizing reliable behaviors from fixed datasets without further interaction, yet existing generative approaches often yield trajectories that are dynamically infeasible. We propose Model Predictive Diffuser (MPDiffuser), a compositional model-based diffusion framework consisting of: (i) a planner that generates diverse, task-aligned trajectories; (ii) a dynamics model that enforces consistency with the underlying system dynamics; and (iii) a ranker module that selects behaviors aligned with the task objectives. MPDiffuser employs an alternating diffusion sampling scheme, where planner and dynamics updates are interleaved to progressively refine trajectories for both task alignment and feasibility during the sampling process. We also provide a theoretical rationale for this procedure, showing how it balances fidelity to data priors with dynamics consistency. Empirically, the compositional design improves sample efficiency, as it leverages even low-quality data for dynamics learning and adapts seamlessly to novel dynamics. We evaluate MPDiffuser on both unconstrained (D4RL) and constrained (DSRL) offline decision-making benchmarks, demonstrating consistent gains over existing approaches. Furthermore, we present a preliminary study extending MPDiffuser to vision-based control tasks, showing its potential to scale to high-dimensional sensory inputs. Finally, we deploy our method on a real quadrupedal robot, showcasing its practicality for real-world control.
zh
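The alternating sampling scheme can be sketched as interleaved denoising updates from the planner and the dynamics model, followed by ranking. The three components are stand-in objects with assumed `denoise_step`/`score` interfaces; noise schedules and guidance details would live inside them.

```python
def alternating_sample(planner, dynamics, ranker, x_init, n_steps, n_candidates=8):
    """Interleave task-alignment and feasibility updates while denoising,
    then keep the best-ranked trajectory (illustrative sketch)."""
    candidates = []
    for _ in range(n_candidates):
        x = x_init.clone()                    # noisy trajectory tensor
        for t in reversed(range(n_steps)):
            x = planner.denoise_step(x, t)    # pull toward task-aligned behavior
            x = dynamics.denoise_step(x, t)   # pull toward feasible transitions
        candidates.append(x)
    return max(candidates, key=ranker.score)  # objective-aligned selection
```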
[AI-50] Agent Eval: Generative Agents as Reliable Proxies for Human Evaluation of AI-Generated Content
【Quick Read】: This paper addresses the timeliness and cost problems that modern businesses face in generating and evaluating high-quality content: human writers are time-constrained, external human evaluation is operationally expensive, and while large language models (LLMs) show promise for content generation, the quality of their output remains in question and traditional human-survey evaluation is slow and costly. The key is to use generative agents that simulate human judgment to automatically rate AI-generated content on coherence, interestingness, clarity, fairness, and relevance, enabling efficient, low-cost quality assessment. Integrating such agents lets businesses streamline content generation and maintain consistent, high-quality output while minimizing reliance on costly human evaluation, supporting more reliable business use of LLMs.
Link: https://arxiv.org/abs/2512.08273
Authors: Thanh Vu,Richi Nayak,Thiru Balasubramaniam
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 10 pages, 5 figures
Abstract:Modern businesses are increasingly challenged by the time and expense required to generate and assess high-quality content. Human writers face time constraints, and extrinsic evaluations can be costly. While Large Language Models (LLMs) offer potential in content creation, concerns about the quality of AI-generated content persist. Traditional evaluation methods, like human surveys, further add operational costs, highlighting the need for efficient, automated solutions. This research introduces Generative Agents as a means to tackle these challenges. These agents can rapidly and cost-effectively evaluate AI-generated content, simulating human judgment by rating aspects such as coherence, interestingness, clarity, fairness, and relevance. By incorporating these agents, businesses can streamline content generation and ensure consistent, high-quality output while minimizing reliance on costly human evaluations. The study provides critical insights into enhancing LLMs for producing business-aligned, high-quality content, offering significant advancements in automated content generation and evaluation.
zh
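A minimal version of such a generative-agent rater is just a prompted LLM that returns structured scores on the five aspects named in the abstract. The persona, the 1-5 scale, and the JSON contract are assumptions for this sketch; `llm` is any text-in/text-out callable.

```python
import json

RUBRIC = ["coherence", "interestingness", "clarity", "fairness", "relevance"]

def judge(llm, content, persona="a detail-oriented general reader"):
    """Ask a simulated evaluator to rate content; returns {aspect: 1..5}."""
    prompt = (
        f"You are {persona}. Rate the text below from 1 (poor) to 5 (excellent) "
        f"on each of: {', '.join(RUBRIC)}. Reply with a JSON object only.\n\n"
        f"{content}"
    )
    scores = json.loads(llm(prompt))
    return {aspect: int(scores[aspect]) for aspect in RUBRIC}
```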
[AI-51] Beyond Traditional Diagnostics: Transforming Patient-Side Information into Predictive Insights with Knowledge Graphs and Prototypes ICDE2026
【Quick Read】: This paper addresses two critical problems in predicting diseases from patient-side information alone (demographics and self-reported symptoms): bias and unreliability caused by imbalanced disease distributions, and a lack of interpretability that falls short of clinical needs. The key to the proposed Knowledge graph-enhanced, Prototype-aware, and Interpretable (KPI) framework is to integrate structured medical knowledge into a unified disease knowledge graph, construct clinically meaningful disease prototypes, and apply contrastive learning to improve predictive accuracy on long-tailed diseases, while large language models (LLMs) generate patient-specific, medically relevant explanations that substantially improve interpretability and clinical trustworthiness.
Link: https://arxiv.org/abs/2512.08261
Authors: Yibowen Zhao,Yinan Zhang,Zhixiang Su,Lizhen Cui,Chunyan Miao
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: This work has been accepted by ICDE 2026 and is available on arXiv for early access
Abstract:Predicting diseases solely from patient-side information, such as demographics and self-reported symptoms, has attracted significant research attention due to its potential to enhance patient awareness, facilitate early healthcare engagement, and improve healthcare system efficiency. However, existing approaches encounter critical challenges, including imbalanced disease distributions and a lack of interpretability, resulting in biased or unreliable predictions. To address these issues, we propose the Knowledge graph-enhanced, Prototype-aware, and Interpretable (KPI) framework. KPI systematically integrates structured and trusted medical knowledge into a unified disease knowledge graph, constructs clinically meaningful disease prototypes, and employs contrastive learning to enhance predictive accuracy, which is particularly important for long-tailed diseases. Additionally, KPI utilizes large language models (LLMs) to generate patient-specific, medically relevant explanations, thereby improving interpretability and reliability. Extensive experiments on real-world datasets demonstrate that KPI outperforms state-of-the-art methods in predictive accuracy and provides clinically valid explanations that closely align with patient narratives, highlighting its practical value for patient-centered healthcare delivery.
zh
[AI-52] SpeechQualityLLM : LLM -Based Multimodal Assessment of Speech Quality
【Quick Read】: This paper addresses the practical limitations of traditional speech quality assessment: classical metrics such as PESQ and POLQA approximate human mean opinion scores (MOS) but require tightly controlled conditions and expensive subjective listening tests, while learning-based models such as NISQA regress MOS and several perceptual dimensions from waveforms or spectrograms yet remain rigid and opaque, supporting neither natural-language queries nor textual rationales. The key is SpeechQualityLLM, a multimodal speech-quality question-answering system that couples an audio encoder with a language model and is trained on the NISQA corpus with template-based question-answer pairs: instead of directly regressing scores, the model generates textual answers from which numeric predictions are parsed, enabling natural-language interaction, simulation of diverse listener profiles, and interpretable quality judgments while reducing reliance on large-scale crowdsourced tests.
Link: https://arxiv.org/abs/2512.08238
Authors: Mahathir Monjur,Shahriar Nirjon
Institution: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: 9 pages, 5 figures, 8 tables
Abstract:Objective speech quality assessment is central to telephony, VoIP, and streaming systems, where large volumes of degraded audio must be monitored and optimized at scale. Classical metrics such as PESQ and POLQA approximate human mean opinion scores (MOS) but require carefully controlled conditions and expensive listening tests, while learning-based models such as NISQA regress MOS and multiple perceptual dimensions from waveforms or spectrograms, achieving high correlation with subjective ratings yet remaining rigid: they do not support interactive, natural-language queries and do not natively provide textual rationales. In this work, we introduce SpeechQualityLLM, a multimodal speech quality question-answering (QA) system that couples an audio encoder with a language model and is trained on the NISQA corpus using template-based question-answer pairs covering overall MOS and four perceptual dimensions (noisiness, coloration, discontinuity, and loudness) in both single-ended (degraded only) and double-ended (degraded plus clean reference) setups. Instead of directly regressing scores, our system is supervised to generate textual answers from which numeric predictions are parsed and evaluated with standard regression and ranking metrics; on held-out NISQA clips, the double-ended model attains a MOS mean absolute error (MAE) of 0.41 with Pearson correlation of 0.86, with competitive performance on dimension-wise tasks. Beyond these quantitative gains, it offers a flexible natural-language interface in which the language model acts as an audio quality expert: practitioners can query arbitrary aspects of degradations, prompt the model to emulate different listener profiles to capture human variability and produce diverse but plausible judgments rather than a single deterministic score, and thereby reduce reliance on large-scale crowdsourced tests and their monetary cost.
zh
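Because the model answers in text, evaluation hinges on parsing a number out of each generated answer and comparing it with the MOS labels. A sketch of that post-processing is below; the two example answers and labels are fabricated placeholders.

```python
import re
import numpy as np
from scipy.stats import pearsonr

def parse_score(answer: str) -> float:
    """Pull the first number out of e.g. 'Overall quality is about 3.6 (noisy).'"""
    m = re.search(r"\d+(?:\.\d+)?", answer)
    return float(m.group()) if m else np.nan

model_answers = ["The overall quality is about 3.6.", "I would rate this a 2."]
mos_labels = np.array([3.4, 2.5])   # placeholder ground-truth MOS

preds = np.array([parse_score(a) for a in model_answers])
mask = ~np.isnan(preds)
mae = np.abs(preds[mask] - mos_labels[mask]).mean()
r, _ = pearsonr(preds[mask], mos_labels[mask])
print(f"MOS MAE = {mae:.2f}, Pearson r = {r:.2f}")
```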
[AI-53] Empowerment Gain and Causal Model Construction: Children and adults are sensitive to controllability and variability in their causal interventions
【Quick Read】: This paper addresses the poor causal-learning performance of large pretrained models, that is, how machines can effectively acquire the causal structure of the world. The key is the concept of "empowerment" as a bridge between classical Bayesian causal learning and reinforcement learning: empowerment provides an intrinsic reward signal by maximizing the mutual information between actions and their outcomes, guiding an agent toward a more accurate causal world model; in turn, an accurate causal model increases empowerment, forming a positive feedback loop. This framework helps explain distinctive features of children's causal learning and offers a computationally tractable, testable path toward causal reasoning in machines.
Link: https://arxiv.org/abs/2512.08230
Authors: Eunice Yiu,Kelsey Allen,Shiry Ginosar,Alison Gopnik
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted to Philosophical Transactions A, Special issue: World models, AGI, and the hard problems of life-mind continuity. Expected publication in 2026
Abstract:Learning about the causal structure of the world is a fundamental problem for human cognition. Causal models and especially causal learning have proved to be difficult for large pretrained models using standard techniques of deep learning. In contrast, cognitive scientists have applied advances in our formal understanding of causation in computer science, particularly within the Causal Bayes Net formalism, to understand human causal learning. In the very different tradition of reinforcement learning, researchers have described an intrinsic reward signal called “empowerment” which maximizes mutual information between actions and their outcomes. “Empowerment” may be an important bridge between classical Bayesian causal learning and reinforcement learning and may help to characterize causal learning in humans and enable it in machines. If an agent learns an accurate causal world model, they will necessarily increase their empowerment, and increasing empowerment will lead to a more accurate causal world model. Empowerment may also explain distinctive features of children’s causal learning, as well as providing a more tractable computational account of how that learning is possible. In an empirical study, we systematically test how children and adults use cues to empowerment to infer causal relations, and design effective causal interventions.
zh
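Empowerment is defined through the mutual information between actions and their outcomes, which can be computed exactly in a tiny discrete world. The example below evaluates I(A; S') under a uniform action policy (a lower bound on empowerment, which maximizes over action distributions) for a fully controllable and a fully random environment.

```python
import numpy as np

def mutual_information(p_a, p_s_given_a):
    """I(A; S') in bits for discrete actions and outcomes."""
    p_joint = p_a[:, None] * p_s_given_a            # p(a, s')
    p_s = p_joint.sum(axis=0)
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(p_joint > 0,
                         p_joint / (p_a[:, None] * p_s[None, :]), 1.0)
    return float((p_joint * np.log2(ratio)).sum())

uniform = np.array([0.5, 0.5])
controllable = np.array([[1.0, 0.0], [0.0, 1.0]])   # actions determine outcomes
random_world = np.array([[0.5, 0.5], [0.5, 0.5]])   # outcomes ignore the action
print(mutual_information(uniform, controllable))    # 1.0 bit: full control
print(mutual_information(uniform, random_world))    # 0.0 bits: no control
```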
[AI-54] PR-CapsNet: Pseudo-Riemannian Capsule Network with Adaptive Curvature Routing for Graph Learning WSDM2026
【Quick Read】: This paper addresses the limited geometric modeling capacity of capsule networks (CapsNets) on real-world graphs caused by fixed-curvature spaces, namely the geodesical disconnectedness of non-Euclidean manifolds, which restricts the representation of complex graph structures (hierarchies, clusters, or cycles). The key to the proposed Pseudo-Riemannian Capsule Network (PR-CapsNet) is extending Euclidean capsule routing to geodesically disconnected pseudo-Riemannian manifolds via Adaptive Pseudo-Riemannian Tangent Space Routing with learnable curvature: capsule states are first decomposed through diffeomorphic transformations into spherical-temporal and Euclidean-spatial subspaces; an adaptive curvature routing module then fuses features from spaces of different curvature using attention derived from local manifold geometry; finally, a geometry-preserving pseudo-Riemannian capsule classifier performs classification with a curvature-weighted softmax. This markedly improves the representation of complex graph structures, outperforming state-of-the-art models on node and graph classification.
Link: https://arxiv.org/abs/2512.08218
Authors: Ye Qin,Jingchao Wang,Yang Shi,Haiying Huang,Junxu Li,Weijian Liu,Tinghui Chen,Jinghui Qin
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: To appear in WSDM 2026 (ACM International Conference on Web Search and Data Mining)
Abstract:Capsule Networks (CapsNets) show exceptional graph representation capacity via dynamic routing and vectorized hierarchical representations, but they model the complex geometries of real-world graphs poorly by fixed-curvature space due to the inherent geodesical disconnectedness issues, leading to suboptimal performance. Recent works find that non-Euclidean pseudo-Riemannian manifolds provide specific inductive biases for embedding graph data, but how to leverage them to improve CapsNets is still underexplored. Here, we extend the Euclidean capsule routing into geodesically disconnected pseudo-Riemannian manifolds and derive a Pseudo-Riemannian Capsule Network (PR-CapsNet), which models data in pseudo-Riemannian manifolds of adaptive curvature, for graph representation learning. Specifically, PR-CapsNet enhances the CapsNet with Adaptive Pseudo-Riemannian Tangent Space Routing by utilizing pseudo-Riemannian geometry. Unlike single-curvature or subspace-partitioning methods, PR-CapsNet concurrently models hierarchical and cluster or cyclic graph structures via its versatile pseudo-Riemannian metric. It first deploys Pseudo-Riemannian Tangent Space Routing to decompose capsule states into spherical-temporal and Euclidean-spatial subspaces with diffeomorphic transformations. Then, an Adaptive Curvature Routing is developed to adaptively fuse features from different curvature spaces for complex graphs via a learnable curvature tensor with geometric attention from local manifold properties. Finally, a geometric properties-preserved Pseudo-Riemannian Capsule Classifier is developed to project capsule embeddings to tangent spaces and use curvature-weighted softmax for classification. Extensive experiments on node and graph classification benchmarks show PR-CapsNet outperforms SOTA models, validating PR-CapsNet’s strong representation power for complex graph structures.
zh
[AI-55] A Practical Framework for Evaluating Medical AI Security: Reproducible Assessment of Jailbreaking and Privacy Vulnerabilities Across Clinical Specialties
【Quick Read】: This paper addresses two security challenges in deploying medical large language models (LLMs): adversarial misuse (jailbreaks such as role-playing, authority impersonation, and multi-turn manipulation) and privacy leakage, together with the fact that existing security benchmarks require GPU clusters, commercial APIs, or protected health information (PHI), which excludes most researchers. The key is a fully reproducible evaluation framework that runs on consumer CPU hardware, uses synthetic patient records requiring no IRB approval, and covers clinical scenarios stratified by risk, from high-risk specialties such as emergency medicine and psychiatry to general practice, enabling systematic assessment of safety and privacy in medical AI and laying a foundation for comparing medical-specialist models and developing defenses.
Link: https://arxiv.org/abs/2512.08185
Authors: Jinghao Wang,Ping Zhang,Carter Yagemann
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 6 pages, 1 figure, framework proposal
Abstract:Medical Large Language Models (LLMs) are increasingly deployed for clinical decision support across diverse specialties, yet systematic evaluation of their robustness to adversarial misuse and privacy leakage remains inaccessible to most researchers. Existing security benchmarks require GPU clusters, commercial API access, or protected health data – barriers that limit community participation in this critical research area. We propose a practical, fully reproducible framework for evaluating medical AI security under realistic resource constraints. Our framework design covers multiple medical specialties stratified by clinical risk – from high-risk domains such as emergency medicine and psychiatry to general practice – addressing jailbreaking attacks (role-playing, authority impersonation, multi-turn manipulation) and privacy extraction attacks. All evaluation utilizes synthetic patient records requiring no IRB approval. The framework is designed to run entirely on consumer CPU hardware using freely available models, eliminating cost barriers. We present the framework specification including threat models, data generation methodology, evaluation protocols, and scoring rubrics. This proposal establishes a foundation for comparative security assessment of medical-specialist models and defense mechanisms, advancing the broader goal of ensuring safe and trustworthy medical AI systems.
zh
[AI-56] Information-Dense Reasoning for Efficient and Auditable Security Alert Triage
【Quick Read】: This paper addresses the Alert Triage Latency Paradox in Security Operations Centers (SOCs) that must handle massive, heterogeneous alert streams within minute-level service windows: verbose reasoning chains ensure accuracy and compliance but incur heavy latency and token cost, while minimal chains sacrifice transparency and auditability. Existing options fall short: signature systems are brittle, anomaly detection lacks actionability, and fully cloud-hosted LLMs raise latency, cost, and privacy concerns. The key to the proposed cloud-edge hybrid framework AIDR (Alert Inference with Distilled Reasoning) is gradient-based compression of reasoning chains that optimizes information density, retaining decision-critical steps while minimizing redundancy under token and latency budgets. Combined with compact dataset construction (distilling alerts into 3-5 high-information bullets for a 68% token reduction), LoRA-finetuned domain experts, and a cloud-edge architecture in which a cloud LLM routes alerts and on-premises edge experts emit SOAR-ready JSON, AIDR outperforms Chain-of-Thought in accuracy while cutting latency by 40.6%, remains robust to data corruption and out-of-distribution inputs, and enables auditable SOC triage with full data-residency compliance.
Link: https://arxiv.org/abs/2512.08169
Authors: Guangze Zhao,Yongzheng Zhang,Changbo Tian,Dan Xie,Hongri Liu,Bailing Wang
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Security Operations Centers face massive, heterogeneous alert streams under minute-level service windows, creating the Alert Triage Latency Paradox: verbose reasoning chains ensure accuracy and compliance but incur prohibitive latency and token costs, while minimal chains sacrifice transparency and auditability. Existing solutions fail: signature systems are brittle, anomaly methods lack actionability, and fully cloud-hosted LLMs raise latency, cost, and privacy concerns. We propose AIDR, a hybrid cloud-edge framework that addresses this trade-off through constrained information-density optimization. The core innovation is gradient-based compression of reasoning chains to retain only decision-critical steps–minimal evidence sufficient to justify predictions while respecting token and latency budgets. We demonstrate that this approach preserves decision-relevant information while minimizing complexity. We construct compact datasets by distilling alerts into 3-5 high-information bullets (68% token reduction), train domain-specialized experts via LoRA, and deploy a cloud-edge architecture: a cloud LLM routes alerts to on-premises experts generating SOAR-ready JSON. Experiments demonstrate AIDR achieves higher accuracy and 40.6% latency reduction versus Chain-of-Thought, with robustness to data corruption and out-of-distribution generalization, enabling auditable and efficient SOC triage with full data residency compliance.
zh
[AI-57] LayerPipe2: Multistage Pipelining and Weight Recompute via Improved Exponential Moving Average for Training Neural Networks
【Quick Read】: This paper addresses the lack of a principled account of pipelining strategies in neural-network training, in particular how much gradient delay should be assigned to each layer to achieve efficient, stable training. The key is a formal derivation via variable delayed gradient adaptation and retiming that pins down where and how much delay to insert: inner layers need fewer delays while outer layers need longer ones, and when pipelining per layer or per group, the delay depends only on the number of remaining downstream stages, yielding predictable schedules. The paper further proposes a pipeline-aware moving average that avoids explicitly storing historical weights, substantially cutting memory cost without sacrificing accuracy guarantees and yielding a scalable pipelined-training framework with controlled communication-computation tradeoffs.
Link: https://arxiv.org/abs/2512.08160
Authors: Nanda K. Unnikrishnan,Keshab K. Parhi
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments: Proc. of 2025 Asilomar Conference on Signals, Systems, and Computers, October 2025, Pacific Grove, CA
Abstract:In our prior work, LayerPipe, we introduced an approach to accelerate training of convolutional, fully connected, and spiking neural networks by overlapping forward and backward computation. However, despite empirical success, a principled understanding of how much gradient delay needs to be introduced at each layer to achieve a desired level of pipelining was not addressed. This paper, LayerPipe2, fills that gap by formally deriving LayerPipe using variable delayed gradient adaptation and retiming. We identify where delays may be legally inserted and show that the required amount of delay follows directly from the network structure, where inner layers require fewer delays and outer layers require longer delays. When pipelining is applied at every layer, the amount of delay depends only on the number of remaining downstream stages. When layers are pipelined in groups, all layers in the group share the same assignment of delays. These insights not only explain previously observed scheduling patterns but also expose an often overlooked challenge that pipelining implicitly requires storage of historical weights. We overcome this storage bottleneck by developing a pipeline-aware moving average that reconstructs the required past states rather than storing them explicitly. This reduces memory cost without sacrificing the accuracy guarantees that make pipelined learning viable. The result is a principled framework that illustrates how to construct LayerPipe architectures, predicts their delay requirements, and mitigates their storage burden, thereby enabling scalable pipelined training with controlled communication-computation tradeoffs.
zh
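The storage fix can be sketched as a running average that serves as a proxy for delayed weights w[t-d], instead of keeping one snapshot per pipeline delay slot. The decay value and single-average simplification are assumptions; the paper's improved moving average is presumably more careful about matching each required delay.

```python
import torch

class WeightEMA:
    """Reconstruct approximate past weights from a running average rather
    than storing a separate historical copy per delayed-gradient stage."""
    def __init__(self, params, beta=0.9):
        self.beta = beta
        self.avg = [p.detach().clone() for p in params]

    @torch.no_grad()
    def update(self, params):
        for a, p in zip(self.avg, params):
            a.mul_(self.beta).add_(p.detach(), alpha=1.0 - self.beta)

    def proxy_for_delayed_weights(self):
        """Use in place of w[t - d] in a pipelined forward/backward pass."""
        return self.avg
```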
[AI-58] Scalable Back-End for an AI-Based Diabetes Prediction Application
【Quick Read】: This paper addresses the system performance bottlenecks that arise as the user base of an AI-based mobile diabetes-prediction application grows, aiming to keep latency low and availability high under heavy concurrency so that early diabetes detection remains responsive. The key is a scalable back-end architecture built on horizontal scaling, database sharding, and asynchronous communication via a message queue; adopting RabbitMQ for asynchronous processing markedly reduced the error rate of compute-intensive prediction requests, with request queuing under high load preserving system reliability and data integrity.
Link: https://arxiv.org/abs/2512.08147
Authors: Henry Anand Septian Radityo,Bernardus Willson,Reynard Tanadi,Latifa Dwiyanti,Saiful Akbar
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: This paper was accepted and presented at the 2025 IEEE International Conference on Data and Software Engineering (ICoDSE) on 28 October 2025 in Batam, Indonesia, and is currently awaiting publication
Abstract:The rising global prevalence of diabetes necessitates early detection to prevent severe complications. While AI-powered prediction applications offer a promising solution, they require a responsive and scalable back-end architecture to serve a large user base effectively. This paper details the development and evaluation of a scalable back-end system designed for a mobile diabetes prediction application. The primary objective was to maintain a failure rate below 5% and an average latency of under 1000 ms. The architecture leverages horizontal scaling, database sharding, and asynchronous communication via a message queue. Performance evaluation showed that 83% of the system’s features (20 out of 24) met the specified performance targets. Key functionalities such as user profile management, activity tracking, and read-intensive prediction operations successfully achieved the desired performance. The system demonstrated the ability to handle up to 10,000 concurrent users without issues, validating its scalability. The implementation of asynchronous communication using RabbitMQ proved crucial in minimizing the error rate for computationally intensive prediction requests, ensuring system reliability by queuing requests and preventing data loss under heavy load.
zh
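The asynchronous path can be illustrated with a standard pika publisher: the API handler enqueues a durable message and returns immediately, while a separate worker consumes the queue and runs the model. The queue name, host, and payload are illustrative; the paper does not publish its code.

```python
import json
import pika

def enqueue_prediction(patient_features: dict) -> None:
    """Queue a prediction job instead of computing it in the request thread,
    so bursts are absorbed by the broker rather than dropped."""
    connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="prediction_jobs", durable=True)
    channel.basic_publish(
        exchange="",
        routing_key="prediction_jobs",
        body=json.dumps(patient_features),
        properties=pika.BasicProperties(delivery_mode=2),  # survive broker restarts
    )
    connection.close()
```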
[AI-59] Chat with UAV – Human-UAV Interaction Based on Large Language Models
【Quick Read】: This paper addresses the difficulty of personalized task planning and execution in human-UAV interaction (HUI) caused by the lack of a common language between users and UAVs, and in particular the poor adaptability of existing LLM-based HUI frameworks to mixed task planning and execution in complex scenarios. The key is a dual-agent HUI framework that constructs two independent LLM agents, a task-planning agent and an execution agent, each driven by a different prompt-engineering strategy, thereby decoupling task understanding and planning from execution; this improves the smoothness of interaction and the flexibility of task execution, effectively meeting users' personalized needs.
Link: https://arxiv.org/abs/2512.08145
Authors: Haoran Wang,Zhuohang Chen,Guang Li,Bo Ma,Chuanghuang Li
Institution: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
Abstract:The future of UAV interaction systems is evolving from engineer-driven to user-driven, aiming to replace traditional predefined Human-UAV Interaction (HUI) designs. This shift focuses on enabling more personalized task planning and design, thereby achieving a higher-quality interaction experience and greater flexibility, with applications in many fields such as agriculture, aerial photography, logistics, and environmental monitoring. However, due to the lack of a common language between users and UAVs, such interactions are often difficult to achieve. Recent developments in Large Language Models, which can understand natural language and robot (UAV) behaviors, make personalized Human-UAV Interaction possible. Several LLM-based HUI frameworks have recently been proposed, but they commonly struggle with mixed task planning and execution, leading to low adaptability in complex scenarios. In this paper, we propose a novel dual-agent HUI framework that constructs two independent LLM agents (a task planning agent and an execution agent) and applies different prompt-engineering strategies to separately handle the understanding, planning, and execution of tasks. To verify the effectiveness and performance of the framework, we built a task database covering four typical UAV application scenarios and quantified the framework's performance using three independent metrics; we also compared the performance of different LLMs selected to control the UAVs. Our user-study results demonstrate that the framework improves the smoothness of HUI and the flexibility of task execution in the task scenarios we set up, effectively meeting users' personalized needs.
zh
[AI-60] Biothreat Benchmark Generation Framework for Evaluating Frontier AI Models I: The Task-Query Architecture
【Quick Read】: This paper addresses the difficulty of quantifying and assessing the biosecurity risk that frontier generative AI models, especially LLMs, could be used for bioterrorism or to obtain biological weapons; existing benchmarking efforts often overlook differences in actor capability levels and operational risk factors, yielding incomplete threat assessments. The key is a novel Biothreat Benchmark Generation (BBG) framework whose core is a hierarchical bacterial biothreat task-query architecture, the Bacterial Biothreat Schema, which systematically organizes biothreat categories, elements, and tasks and uses them as the basis for developing task-aligned queries, giving model developers and evaluators a reusable, multi-level assessment tool covering both the technical and operational dimensions of biosecurity risk.
Link: https://arxiv.org/abs/2512.08130
Authors: Gary Ackerman,Brandon Behlendorf,Zachary Kallenborn,Sheriff Almakki,Doug Clifford,Jenna LaTourette,Hayley Peterson,Noah Sheinbaum,Olivia Shoemaker,Anna Wetzel
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 18 pages
Abstract:Both model developers and policymakers seek to quantify and mitigate the risk of rapidly-evolving frontier artificial intelligence (AI) models, especially large language models (LLMs), to facilitate bioterrorism or access to biological weapons. An important element of such efforts is the development of model benchmarks that can assess the biosecurity risk posed by a particular model. This paper describes the first component of a novel Biothreat Benchmark Generation (BBG) Framework. The BBG approach is designed to help model developers and evaluators reliably measure and assess the biosecurity risk uplift and general harm potential of existing and future AI models, while accounting for key aspects of the threat itself that are often overlooked in other benchmarking efforts, including different actor capability levels, and operational (in addition to purely technical) risk factors. As a pilot, the BBG is first being developed to address bacterial biological threats only. The BBG is built upon a hierarchical structure of biothreat categories, elements and tasks, which then serves as the basis for the development of task-aligned queries. This paper outlines the development of this biothreat task-query architecture, which we have named the Bacterial Biothreat Schema, while future papers will describe follow-on efforts to turn queries into model prompts, as well as how the resulting benchmarks can be implemented for model evaluation. Overall, the BBG Framework, including the Bacterial Biothreat Schema, seeks to offer a robust, re-usable structure for evaluating bacterial biological risks arising from LLMs across multiple levels of aggregation, which captures the full scope of technical and operational requirements for biological adversaries, and which accounts for a wide spectrum of biological adversary capabilities.
zh
[AI-61] Long-only cryptocurrency portfolio management by ranking the assets: a neural network approach
【Quick Read】: This paper addresses the neglect of relative relationships among assets in traditional cryptocurrency portfolio management: prior work mostly predicts the price movement of a single coin (such as Bitcoin) and trades on that prediction, without exploiting the dynamic relative performance across multiple coins. The key is a neural-network-based cross-sectional ranking approach: at each time step, the model predicts the relative ranking of the managed cryptocurrencies' future returns and allocates portfolio weights accordingly, using cross-sectional information to optimize the allocation. Backtests over a full market cycle from May 2020 to November 2023 show the method clearly outperforms existing approaches, achieving an annualized return of 64.26% and a Sharpe ratio of 1.01 while remaining robust to increases in transaction fees.
Link: https://arxiv.org/abs/2512.08124
Authors: Zijiang Yang
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments:
Abstract:This paper proposes a novel machine-learning-based portfolio management method for the cryptocurrency market. Previous research mainly focuses on predicting the movement of a specific cryptocurrency such as Bitcoin (BTC) and trading according to the prediction. In contrast to previous work that treats cryptocurrencies independently, this paper manages a group of cryptocurrencies by analyzing their relative relationships. Specifically, at each time step, we use a neural network to predict the rank of the future returns of the managed cryptocurrencies and place weights accordingly. By incorporating such cross-sectional information, the proposed method is shown to be profitable in backtesting experiments on real daily cryptocurrency market data from May 2020 to November 2023. During these 3.5 years, the market experienced a full cycle of bullish, bearish, and stagnant conditions. Even under such complex market conditions, the proposed method outperforms existing methods and achieves a Sharpe ratio of 1.01 and an annualized return of 64.26%. Additionally, the proposed method is shown to be robust to increases in transaction fees.
zh
[AI-62] Scalable Offline Model-Based RL with Action Chunks
【速读】:该论文旨在解决离线强化学习(Offline Reinforcement Learning, Offline RL)中复杂、长时程任务的可扩展性问题,特别是模型偏差与模型误差累积之间的权衡。现有基于模型的价值扩展方法虽能减少价值bootstrap中的偏差,但在长horizon场景下会因模型误差不断累积而导致未来预测性能下降。解决方案的关键在于提出一种名为“基于动作块的模型强化学习”(Model-Based RL with Action Chunks, MAC)的新框架:其一,引入动作块模型(action-chunk model),该模型通过预测一段连续动作序列(即“动作块”)后的状态,而非单个动作,从而降低误差传播;其二,采用拒绝采样(rejection sampling)从一个表达能力强的行为动作块策略中采样动作,避免因模型对分布外动作的错误估计而引发的策略过拟合问题。这一设计显著提升了离线模型强化学习在大规模数据集和长时程任务上的表现。
链接: https://arxiv.org/abs/2512.08108
作者: Kwanyoung Park,Seohong Park,Youngwoon Lee,Sergey Levine
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 22 pages, 7 figures
Abstract:In this paper, we study whether model-based reinforcement learning (RL), in particular model-based value expansion, can provide a scalable recipe for tackling complex, long-horizon tasks in offline RL. Model-based value expansion fits an on-policy value function using length-n imaginary rollouts generated by the current policy and a learned dynamics model. While larger n reduces bias in value bootstrapping, it amplifies accumulated model errors over long horizons, degrading future predictions. We address this trade-off with an action-chunk model that predicts a future state from a sequence of actions (an "action chunk") instead of a single action, which reduces compounding errors. In addition, instead of directly training a policy to maximize rewards, we employ rejection sampling from an expressive behavioral action-chunk policy, which prevents model exploitation from out-of-distribution actions. We call this recipe Model-Based RL with Action Chunks (MAC). Through experiments on highly challenging tasks with large-scale datasets of up to 100M transitions, we show that MAC achieves the best performance among offline model-based RL algorithms, especially on challenging long-horizon tasks.
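下面是 MAC 中“从行为策略采样动作块并做拒绝式筛选”思路的一个最小示意;behavior_policy、dynamics_model、value_fn 均为假设性占位调用,并非论文的具体实现:

```python
import numpy as np

def select_action_chunk(state, behavior_policy, dynamics_model, value_fn,
                        num_candidates=16, chunk_len=5):
    """从行为策略采样候选动作块,用动作块动力学模型想象未来状态并打分,
    保留价值最高的候选。始终停留在行为策略支持集内,
    可以避免模型在分布外动作上被策略利用。"""
    candidates = [behavior_policy(state, chunk_len) for _ in range(num_candidates)]
    # 每个块只调用一次模型(而非逐动作调用),减少误差累积
    scores = [value_fn(dynamics_model(state, chunk)) for chunk in candidates]
    return candidates[int(np.argmax(scores))]

# 玩具示例:线性动力学 + 简单价值函数(仅演示接口)
rng = np.random.default_rng(0)
policy = lambda s, k: rng.normal(size=(k, 2))        # 采样一个动作块
model = lambda s, chunk: s + chunk.sum(axis=0)       # 动作块 -> 想象的未来状态
value = lambda s: -np.abs(s).sum()                   # 离原点越近价值越高
best = select_action_chunk(np.array([1.0, -2.0]), policy, model, value)
print(best.shape)                                    # (5, 2):被选中的动作块
```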
zh
[AI-63] Training LLMs for Honesty via Confessions
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在训练过程中可能产生的不诚实行为问题,例如过度自信地陈述事实或掩盖其隐蔽行为,这种现象常源于强化学习(Reinforcement Learning, RL)中奖励函数设计不当所导致的激励错位。为应对这一挑战,作者提出了一种通过“自述忏悔”(confession)机制来诱导模型诚实表达自身缺陷的方法:忏悔是在主回答之后应请求生成的附加输出,其奖励仅基于诚实性,与主回答的奖励解耦。关键在于,只要模型通过坦白错误行为而非掩盖来最大化忏悔奖励更符合“最小阻力路径”,即可有效激励模型在忏悔中如实披露其不当行为。实验表明,该方法能显著提升模型在幻觉、指令遵循、阴谋行为和奖励劫持等分布外场景下的忏悔诚实度,从而为推理阶段提供可干预的透明性机制,如监控、拒绝采样和用户提示。
链接: https://arxiv.org/abs/2512.08093
作者: Manas Joglekar,Jeremy Chen,Gabriel Wu,Jason Yosinski,Jasmine Wang,Boaz Barak,Amelia Glaese
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) can be dishonest when reporting on their actions and beliefs – for example, they may overstate their confidence in factual claims or cover up evidence of covert actions. Such dishonesty may arise due to the effects of reinforcement learning (RL), where challenges with reward shaping can result in a training process that inadvertently incentivizes the model to lie or misrepresent its actions. In this work we propose a method for eliciting an honest expression of an LLM's shortcomings via a self-reported confession. A confession is an output, provided upon request after a model's original answer, that is meant to serve as a full account of the model's compliance with the letter and spirit of its policies and instructions. The reward assigned to a confession during training is solely based on its honesty, and does not impact positively or negatively the main answer's reward. As long as the "path of least resistance" for maximizing confession reward is to surface misbehavior rather than covering it up, this incentivizes models to be honest in their confessions. Our findings provide some justification for this empirical assumption, especially in the case of egregious model misbehavior. To demonstrate the viability of our approach, we train GPT-5-Thinking to produce confessions, and we evaluate its honesty in out-of-distribution scenarios measuring hallucination, instruction following, scheming, and reward hacking. We find that when the model lies or omits shortcomings in its "main" answer, it often confesses to these behaviors honestly, and this confession honesty modestly improves with training. Confessions can enable a number of inference-time interventions including monitoring, rejection sampling, and surfacing issues to the user.
zh
[AI-64] Large Language Models for Education and Research: An Empirical and User Survey-based Analysis
【速读】:该论文旨在解决当前预训练大语言模型(Pretrained Large Language Models, LLMs)在教育与科研场景中应用效果的系统性评估问题,尤其关注模型准确性、计算效率与用户体验之间的权衡。其解决方案的关键在于通过背景技术分析、实证实验与真实用户调查三方面相结合的方法,对ChatGPT和DeepSeek两款前沿LLM进行全面评测,揭示其在文本生成、编程任务及专业领域问题求解中的差异化优势,从而为教育与科研场景下LLM的选型与优化提供依据。
链接: https://arxiv.org/abs/2512.08057
作者: Md Mostafizer Rahman,Ariful Islam Shiplu,Md Faizul Ibne Amin,Yutaka Watanobe,Lu Peng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Pretrained Large Language Models (LLMs) have achieved remarkable success across diverse domains, with education and research emerging as particularly impactful areas. Among current state-of-the-art LLMs, ChatGPT and DeepSeek exhibit strong capabilities in mathematics, science, medicine, literature, and programming. In this study, we present a comprehensive evaluation of these two LLMs through background technology analysis, empirical experiments, and a real-world user survey. The evaluation explores trade-offs among model accuracy, computational efficiency, and user experience in educational and research affairs. We benchmarked the performance of these LLMs in text generation, programming, and specialized problem-solving. Experimental results show that ChatGPT excels in general language understanding and text generation, while DeepSeek demonstrates superior performance in programming tasks due to its efficiency-focused design. Moreover, both models deliver medically accurate diagnostic outputs and effectively solve complex mathematical problems. Complementing these quantitative findings, a survey of students, educators, and researchers highlights the practical benefits and limitations of these models, offering deeper insights into their role in advancing education and research.
zh
[AI-65] Joint Activity Design Heuristics for Enhancing Human-Machine Collaboration
【速读】:该论文试图解决如何设计技术以有效支持人类与机器在联合活动(Joint Activity)中的协同工作问题,即如何使技术不仅可用,更能作为团队中有效的协作成员。解决方案的关键在于明确支持团队内五种核心宏观认知功能:事件检测(Event Detection)、态势理解(Sensemaking)、适应性(Adaptability)、视角转换(Perspective-Shifting)和协调(Coordination),并基于此提出14条设计启发式原则,整合来自显示设计、人因工程、认知系统工程、认知心理学及计算机科学等领域的研究成果,为技术的设计、开发与评估提供系统性指导。
链接: https://arxiv.org/abs/2512.08036
作者: Mohammadreza Jalaeian,Dane A. Morey,Michael F. Rayo
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: 10 pages
Abstract:Joint activity describes when more than one agent (human or machine) contributes to the completion of a task or activity. Designing for joint activity focuses on explicitly supporting the interdependencies between agents necessary for effective coordination among agents engaged in the joint activity. This builds and expands upon designing for usability to further address how technologies can be designed to act as effective team players. Effective joint activity requires supporting, at minimum, five primary macrocognitive functions within teams: Event Detection, Sensemaking, Adaptability, Perspective-Shifting, and Coordination. Supporting these functions is equally as important as making technologies usable. We synthesized fourteen heuristics from relevant literature including display design, human factors, cognitive systems engineering, cognitive psychology, and computer science to aid the design, development, and evaluation of technologies that support joint human-machine activity.
zh
[AI-66] Toward an AI Reasoning-Enabled System for Patient-Clinical Trial Matching
【速读】:该论文旨在解决临床试验患者筛选过程中存在的手动、耗时且资源密集的问题。其解决方案的关键在于构建一个安全、可扩展的AI增强型患者-试验匹配系统,该系统利用开源的具备推理能力的大语言模型(Large Language Models, LLMs),突破传统二分类方法的局限,生成结构化的入选评估结果并附带可解释的推理链,支持人机协同审核;同时将入选状态视为动态过程而非静态判定,不仅识别当前匹配项,还提供可操作建议以提升未来入选可能性,从而降低协调员负担、智能扩展候选试验范围,并确保所有AI输出的完整审计性。
链接: https://arxiv.org/abs/2512.08026
作者: Caroline N. Leach,Mitchell A. Klusty,Samuel E. Armstrong,Justine C. Pickarski,Kristen L. Hankins,Emily B. Collier,Maya Shah,Aaron D. Mullen,V. K. Cody Bumgardner
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 10 pages, 2 figures, submitted to AMIA
Abstract:Screening patients for clinical trial eligibility remains a manual, time-consuming, and resource-intensive process. We present a secure, scalable proof-of-concept system for Artificial Intelligence (AI)-augmented patient-trial matching that addresses key implementation challenges: integrating heterogeneous electronic health record (EHR) data, facilitating expert review, and maintaining rigorous security standards. Leveraging open-source, reasoning-enabled large language models (LLMs), the system moves beyond binary classification to generate structured eligibility assessments with interpretable reasoning chains that support human-in-the-loop review. This decision support tool represents eligibility as a dynamic state rather than a fixed determination, identifying matches when available and offering actionable recommendations that could render a patient eligible in the future. The system aims to reduce coordinator burden, intelligently broaden the set of trials considered for each patient and guarantee comprehensive auditability of all AI-generated outputs.
zh
[AI-67] SkipKV: Selective Skipping of KV Generation and Storage for Efficient Inference with Large Reasoning Models
【速读】:该论文旨在解决大型推理模型(Large Reasoning Models, LRM)在链式思维(Chain-of-Thought, CoT)推理过程中因键值缓存(Key-Value Cache, KV Cache)线性增长而导致的内存占用高和吞吐量瓶颈问题。现有KV缓存淘汰方法在多批处理场景下表现不佳,主要由于token级评分不稳定及填充token导致的有效KV预算减少,且常因语义无关的逐token淘汰而引发重复验证,从而生成更长序列。解决方案的关键在于提出一种无需训练的KV压缩方法SkipKV,其核心创新为:1)引入句子级评分机制,在粗粒度层面识别并移除高度相似句以保持语义连贯性;2)通过动态调整控制向量(steering vector)来修改推理过程中的隐藏状态,引导模型生成更简洁响应,从而实现高效CoT推理。实验表明,SkipKV在相近压缩预算下可提升最高达26.7%的准确性,并将生成长度减少最多1.6倍,吞吐量提升至1.7倍。
链接: https://arxiv.org/abs/2512.07993
作者: Jiayi Tian,Seyedarmin Azizi,Yequan Zhao,Erfan Baghaei Potraghloo,Sean McPherson,Sharath Nittur Sridhar,Zhengyang Wang,Zheng Zhang,Massoud Pedram,Souvik Kundu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large reasoning models (LRMs) often incur significant key-value (KV) cache overhead, which grows linearly with the verbose chain-of-thought (CoT) reasoning process. This creates both memory and throughput bottlenecks that limit efficient deployment. Towards reducing KV cache size during inference, we first investigate the effectiveness of existing KV cache eviction methods for CoT reasoning. Interestingly, we find that due to unstable token-wise scoring and the reduced effective KV budget caused by padding tokens, state-of-the-art (SoTA) eviction methods fail to maintain accuracy in the multi-batch setting. Additionally, these methods often generate longer sequences than the original model, as semantic-unaware token-wise eviction leads to repeated revalidation during reasoning. To address these issues, we present SkipKV, a training-free KV compression method for selective eviction and generation, operating at coarse-grained sentence-level sequence removal for efficient CoT reasoning. Specifically, it introduces a sentence-scoring metric to identify and remove highly similar sentences while maintaining semantic coherence. To suppress redundant generation, SkipKV dynamically adjusts a steering vector to update the hidden activation states during inference, enforcing the LRM to generate concise responses. Extensive evaluations on multiple reasoning benchmarks demonstrate the effectiveness of SkipKV in maintaining up to 26.7% improved accuracy compared to the alternatives at a similar compression budget. Additionally, compared to SoTA, SkipKV yields up to 1.6x shorter generation length while improving throughput by up to 1.7x.
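下面给出“句子级打分并整句淘汰 KV”这一思想的一个假设性草图;均值池化句向量与相似度阈值均为示意选择,并非论文的精确度量:

```python
import torch

def sentence_eviction_mask(hidden, sent_spans, sim_threshold=0.9):
    """hidden: (seq_len, d) 的 token 隐状态;sent_spans: 每个句子的 (start, end)。
    返回 token 级别的保留掩码:若某句的均值池化向量与先前保留的句子
    过于相似,则整句淘汰其 KV,从而保持语义连贯性。"""
    keep = torch.ones(hidden.size(0), dtype=torch.bool)
    kept_embs = []
    for start, end in sent_spans:
        emb = hidden[start:end].mean(dim=0)
        emb = emb / emb.norm()
        if any(torch.dot(emb, e) > sim_threshold for e in kept_embs):
            keep[start:end] = False          # 整句淘汰,而非逐 token 淘汰
        else:
            kept_embs.append(emb)
    return keep

hidden = torch.randn(12, 8)                  # 演示用随机隐状态
spans = [(0, 4), (4, 8), (8, 12)]            # 三个“句子”的 token 区间
print(sentence_eviction_mask(hidden, spans).tolist())
```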
zh
[AI-68] A Gray Literature Study on Fairness Requirements in AI-enabled Software Engineering
【速读】:该论文试图解决当前人工智能(AI)软件开发中对公平性(fairness)关注不足的问题,尤其是在机器学习(ML)模型的应用中,尽管有效性(如F1-score)常被作为主要评估指标,但公平性要求往往被忽视。解决方案的关键在于系统梳理灰色文献中关于公平性要求的定义、在软件开发生命周期(SDLC)中的管理实践以及违反公平性所带来的成因与后果,并强调需建立一致的框架和实践,将公平性与有效性同等重视,贯穿于数据处理、模型训练、偏差缓解、监控评估等环节,以减少歧视、增强透明度并提升AI决策的社会可信度。
链接: https://arxiv.org/abs/2512.07990
作者: Thanh Nguyen,Chaima Boufaied,Ronnie de Souza Santos
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Today, with the growing obsession with applying Artificial Intelligence (AI), particularly Machine Learning (ML), to software across various contexts, much of the focus has been on the effectiveness of AI models, often measured through common metrics such as F1-score, while fairness receives relatively little attention. This paper presents a review of existing gray literature, examining fairness requirements in the AI context, with a focus on how they are defined across various application domains, managed throughout the Software Development Life Cycle (SDLC), and the causes, as well as the corresponding consequences, of their violation by AI models. Our gray literature investigation shows various definitions of fairness requirements in AI systems, commonly emphasizing non-discrimination and equal treatment across different demographic and social attributes. Fairness requirement management practices vary across the SDLC, particularly in model training and bias mitigation, fairness monitoring and evaluation, and data handling practices. Fairness requirement violations are frequently linked, but not limited, to data representation bias, algorithmic and model design bias, human judgment, and evaluation and transparency gaps. The corresponding consequences include harm in a broad sense, encompassing specific professional and societal impacts as key examples, stereotype reinforcement, data and privacy risks, and loss of trust and legitimacy in AI-supported decisions. These findings emphasize the need for consistent frameworks and practices to integrate fairness into AI software, paying as much attention to fairness as to effectiveness.
zh
[AI-69] An Empirical Framework for Evaluating Semantic Preservation Using Hugging Face
【速读】:该论文旨在解决高自主系统中学习使能软件系统(Learning-Enabled Software Systems, LESS)的语义保真性(semantic preservation)问题,即在对智能组件进行优化或重构时,如何确保不会改变系统的整体功能行为。其核心挑战在于机器学习(Machine Learning, ML)模型的非确定性和运行时定义的语义特性,使得传统软件重构方法难以适用。解决方案的关键在于构建一个基于HuggingFace模型演化数据的实证评估框架:通过挖掘模型提交历史、Model Cards和性能指标等多源信息,识别语义漂移(semantic drift),并结合提交消息分析提炼出常见的重构模式。该框架不仅提供了可复现的大规模ML模型演化数据集(来自170万条HuggingFace条目),还实现了对536个模型及4000+指标的语义保真度评估,为建立社区共识的语义保真边界奠定了基础。
链接: https://arxiv.org/abs/2512.07983
作者: Nan Jia,Anita Raja,Raffi Khatchadourian
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Accepted to Hawaii International Conference on System Sciences (HICSS) 2026
Abstract:As machine learning (ML) becomes an integral part of high-autonomy systems, it is critical to ensure the trustworthiness of learning-enabled software systems (LESS). Yet, the nondeterministic and run-time-defined semantics of ML complicate traditional software refactoring. We define semantic preservation in LESS as the property that optimizations of intelligent components do not alter the system's overall functional behavior. This paper introduces an empirical framework to evaluate semantic preservation in LESS by mining model evolution data from HuggingFace. We extract commit histories, Model Cards, and performance metrics from a large number of models. To establish baselines, we conducted case studies in three domains, tracing performance changes across versions. Our analysis demonstrates how semantic drift can be detected via evaluation metrics across commits and reveals common refactoring patterns based on commit message analysis. Although API constraints limited the possibility of estimating a full-scale threshold, our pipeline offers a foundation for defining community-accepted boundaries for semantic preservation. Our contributions include: (1) a large-scale dataset of ML model evolution, curated from 1.7 million Hugging Face entries via a reproducible pipeline using the native HF hub API, (2) a practical pipeline for the evaluation of semantic preservation for a subset of 536 models and 4000+ metrics and (3) empirical case studies illustrating semantic drift in practice. Together, these contributions advance the foundations for more maintainable and trustworthy ML systems.
zh
[AI-70] Can AI autonomously build, operate, and use the entire data stack?
【速读】:该论文旨在解决当前企业数据管理(Enterprise Data Management)中AI辅助工具仅能局部优化、难以实现全流程自动化的问题,其核心挑战在于如何将AI从独立组件的辅助操作升级为对整个数据生命周期(Data Lifecycle)的自主管控。解决方案的关键在于推动范式转变——从现有分散式AI应用转向由智能代理(Intelligent Agents)驱动的端到端自治数据系统,使数据平台不仅服务于人类用户,也能被AI自身调用和运行,从而构建具备自适应与自我优化能力的“自主数据资产”(Autonomous Data Estates)。
链接: https://arxiv.org/abs/2512.07926
作者: Arvind Agarwal,Lisa Amini,Sameep Mehta,Horst Samulowitz,Kavitha Srinivas
机构: 未知
类目: Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)
备注:
Abstract:Enterprise data management is a monumental task. It spans data architecture and systems, integration, quality, governance, and continuous improvement. While AI assistants can help specific personas, such as data engineers and stewards, to navigate and configure the data stack, they fall far short of full automation. However, as AI becomes increasingly capable of tackling tasks that have previously resisted automation due to inherent complexities, we believe there is an imminent opportunity to target fully autonomous data estates. Currently, AI is used in different parts of the data stack, but in this paper, we argue for a paradigm shift from the use of AI in independent data component operations towards a more holistic and autonomous handling of the entire data lifecycle. Towards that end, we explore how each stage of the modern data stack can be autonomously managed by intelligent agents to build self-sufficient systems that can be used not only by human end-users, but also by AI itself. We begin by describing the mounting forces and opportunities that demand this paradigm shift, examine how agents can streamline the data lifecycle, and highlight open questions and areas where additional research is needed. We hope this work will inspire lively debate, stimulate further research, motivate collaborative approaches, and facilitate a more autonomous future for data systems.
zh
[AI-71] DeepCode: Open Agentic Coding
【速读】:该论文旨在解决科学论文到代码库的高保真自动转换问题(即 document-to-codebase synthesis),其核心挑战在于大语言模型(LLM)面临的信息过载与上下文瓶颈之间的根本性冲突。解决方案的关键在于提出一个名为 DeepCode 的全自动框架,通过系统化的信息流管理机制,在有限的上下文预算下最大化任务相关信号:具体包括基于蓝图蒸馏的源码压缩、利用状态化代码记忆的结构化索引、通过检索增强生成的条件知识注入,以及闭环错误修正机制,从而实现从论文规范到可生产级代码的高质量自动化映射。
链接: https://arxiv.org/abs/2512.07921
作者: Zongwei Li,Zhonghang Li,Zirui Guo,Xubin Ren,Chao Huang
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: for source code, please see this https URL
Abstract:Recent advances in large language models (LLMs) have given rise to powerful coding agents, making it possible for code assistants to evolve into code engineers. However, existing methods still face significant challenges in achieving high-fidelity document-to-codebase synthesis–such as scientific papers to code–primarily due to a fundamental conflict between information overload and the context bottlenecks of LLMs. In this work, we introduce DeepCode, a fully autonomous framework that fundamentally addresses this challenge through principled information-flow management. By treating repository synthesis as a channel optimization problem, DeepCode seamlessly orchestrates four information operations to maximize task-relevant signals under finite context budgets: source compression via blueprint distillation, structured indexing using stateful code memory, conditional knowledge injection via retrieval-augmented generation, and closed-loop error correction. Extensive evaluations on the PaperBench benchmark demonstrate that DeepCode achieves state-of-the-art performance, decisively outperforming leading commercial agents such as Cursor and Claude Code, and crucially, surpassing PhD-level human experts from top institutes on key reproduction metrics. By systematically transforming paper specifications into production-grade implementations comparable to human expert quality, this work establishes new foundations for autonomous scientific reproduction that can accelerate research evaluation and discovery.
zh
[AI-72] CFD-copilot: leveraging domain-adapted large language model and model context protocol to enhance simulation automation
【速读】:该论文旨在解决计算流体动力学(Computational Fluid Dynamics, CFD)仿真全流程自动化难题,特别是针对非专业用户在物理建模、数值方法选择及后处理分析等方面面临的高门槛问题。其关键解决方案是提出一个领域专用的大语言模型(Large Language Model, LLM)框架——CFD-copilot,该框架通过微调LLM实现自然语言到可执行CFD设置的直接映射,并结合多智能体系统整合仿真执行、自动错误修正与结果分析;同时引入模型上下文协议(Model Context Protocol, MCP),以解耦LLM推理与外部工具执行,从而构建一个模块化、可扩展的后处理接口,显著提升工程工作流中数据提取与分析的自动化水平与可靠性。
链接: https://arxiv.org/abs/2512.07917
作者: Zhehao Dong,Shanghai Du,Zhen Lu,Yue Yang
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Fluid Dynamics (physics.flu-dyn)
备注:
Abstract:Configuring computational fluid dynamics (CFD) simulations requires significant expertise in physics modeling and numerical methods, posing a barrier to non-specialists. Although automating scientific tasks with large language models (LLMs) has attracted attention, applying them to the complete, end-to-end CFD workflow remains a challenge due to its stringent domain-specific requirements. We introduce CFD-copilot, a domain-specialized LLM framework designed to facilitate natural language-driven CFD simulation from setup to post-processing. The framework employs a fine-tuned LLM to directly translate user descriptions into executable CFD setups. A multi-agent system integrates the LLM with simulation execution, automatic error correction, and result analysis. For post-processing, the framework utilizes the model context protocol (MCP), an open standard that decouples LLM reasoning from external tool execution. This modular design allows the LLM to interact with numerous specialized post-processing functions through a unified and scalable interface, improving the automation of data extraction and analysis. The framework was evaluated on benchmarks including the NACA~0012 airfoil and the three-element 30P-30N airfoil. The results indicate that domain-specific adaptation and the incorporation of the MCP jointly enhance the reliability and efficiency of LLM-driven engineering workflows.
zh
[AI-73] The Theory of Strategic Evolution: Games with Endogenous Players and Strategic Replicators
【速读】:该论文旨在解决多层战略系统中参与者、策略与制度规则协同演化的建模难题,其核心问题是如何在包含内生性玩家、多层级选择、创新机制、宪法变迁及元治理(meta governance)的复杂环境中,构建一个统一且具有数学严谨性的演化理论框架。解决方案的关键在于提出“生成式堆栈”(Poiesis stack)这一数学结构——它通过跨层级收益矩阵(cross-level gain matrices)将不同战略层次连接成一个分层体系,并证明在小收益条件下系统存在全局李雅普诺夫函数(Lyapunov function),从而保证每个有限深度层级上均满足选择稳定性、跟踪性及随机稳定性结果。该理论还通过闭包定理表明,系统在块扩展、创新事件、异质效用、连续策略空间和宪法演化下保持动态一致性,且不受限的自我修改不会破坏李雅普诺夫结构,最终实现了进化博弈论、制度设计、创新动力学与宪法政治经济学的数学统一。
链接: https://arxiv.org/abs/2512.07901
作者: Kevin Vallier
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Theoretical Economics (econ.TH)
备注: Draft manuscript, 30k words. Companion to Agentic Capital. Submitted to establish priority
Abstract:This paper develops the Theory of Strategic Evolution, a general model for systems in which the population of players, strategies, and institutional rules evolve together. The theory extends replicator dynamics to settings with endogenous players, multi level selection, innovation, constitutional change, and meta governance. The central mathematical object is a Poiesis stack: a hierarchy of strategic layers linked by cross level gain matrices. Under small gain conditions, the system admits a global Lyapunov function and satisfies selection, tracking, and stochastic stability results at every finite depth. We prove that the class is closed under block extension, innovation events, heterogeneous utilities, continuous strategy spaces, and constitutional evolution. The closure theorem shows that no new dynamics arise at higher levels and that unrestricted self modification cannot preserve Lyapunov structure. The theory unifies results from evolutionary game theory, institutional design, innovation dynamics, and constitutional political economy, providing a general mathematical model of long run strategic adaptation.
zh
[AI-74] MARINE: Theoretical Optimization and Design for Multi-Agent Recursive IN-context Enhancement
【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)在实际应用中因受限于单次输出而无法充分发挥推理潜力的问题。传统方法通常采用一次性生成或多次采样后排序的策略,难以实现高效的性能提升。其解决方案的关键在于提出MARINE(Multi-Agent Recursive IN-context Enhancement)框架,该框架将测试时推理重构为对一个持续参考轨迹的迭代优化过程,通过设计精巧的精炼算子(refinement operator),可将基础模型的pass@N能力转化为接近最优的pass@1表现;理论分析进一步证明,在固定调用预算下,最小可行批次能最大化预期性能增益,且对数增长的批处理调度策略可在无计算约束条件下持续改进结果,从而实现了参数效率与推理质量的显著提升。
链接: https://arxiv.org/abs/2512.07898
作者: Hongwei Zhang,Ji Lu,Yongsheng Du,Yanqin Gao,Lingjun Huang,Baoli Wang,Fang Tan,Peng Zou
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Model (LLM)-based agents demonstrate advanced reasoning capabilities, yet practical constraints frequently limit outputs to single responses, leaving significant performance potential unrealized. This paper introduces MARINE (Multi-Agent Recursive IN-context Enhancement), a theoretically grounded framework that reconceptualizes test-time reasoning as iterative refinement of a persistent reference trajectory, fundamentally departing from conventional one-shot or multi-sample paradigms. The MARINE refinement operator systematically converts a base model’s pass@N capabilities into near-optimal pass@1 performance. Rigorous theoretical analysis establishes that minimal feasible batches maximize expected performance gains under fixed invocation budgets, while logarithmically growing batch schedules ensure continuous improvement without computational constraints. Comprehensive evaluation on the BrowserComp-ZH benchmark demonstrates state-of-the-art results, with a 685B-parameter implementation achieving 46.0% pass@1 accuracy. Meanwhile, MARINE establishes a new paradigm for parameter-efficient reasoning: an 80B-parameter model augmented with MARINE matches the performance of standalone 1000B-parameter agents, reducing parameter requirements by over an order of magnitude. Notably, within a fixed computational budget, the proposed MARINE delivers higher-quality samples to alignment and optimization processes than traditional sampling-and-ranking strategies. Consequently, it has great potential to boost post-training efficiency.
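MARINE 的“持续参考轨迹 + 对数增长批调度”可以用如下假设性伪实现来理解;base_agent、judge 为占位调用,批调度公式仅为示意:

```python
import math

def marine_refine(task, base_agent, judge, total_budget=32):
    """维护一条持续的参考轨迹:每轮用一小批 agent 调用提出精炼候选,
    由 judge 保留最优者。批大小随轮次对数增长,对应"固定预算下
    最小可行批次最优、无预算约束时对数调度可持续改进"的理论结论。"""
    reference = base_agent(task, context=None)
    spent, round_idx = 1, 1
    while spent < total_budget:
        batch = max(1, int(math.log2(round_idx + 1)))   # 对数增长的批调度
        candidates = [base_agent(task, context=reference) for _ in range(batch)]
        reference = judge(task, [reference] + candidates)
        spent += batch
        round_idx += 1
    return reference

# 玩具示例:用字符串长度模拟"答案质量"(纯演示接口)
agent = lambda task, context: (context or "") + "."
judge = lambda task, cands: max(cands, key=len)
print(len(marine_refine("demo", agent, judge, total_budget=10)))
```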
zh
[AI-75] ByteStorm: a multi-step data-driven approach for Tropical Cyclones detection and tracking
【速读】:该论文旨在解决热带气旋(Tropical Cyclone, TC)追踪中因传统方法依赖主观阈值而引入区域偏差的问题。解决方案的关键在于提出一种名为ByteStorm的数据驱动框架,其核心创新包括:利用深度学习网络从相对涡度(850 mb)和海平面气压场中自动检测TC中心(通过分类与定位),并采用BYTE算法将检测到的中心点连接成完整轨迹,整个过程无需人工调整阈值。该方法在西北太平洋东部(ENP)和西部(WNP)盆地的评估中表现出优于现有确定性追踪器的性能,尤其在命中率(Probability of Detection)和年际变化相关性方面显著提升,验证了深度学习与计算机视觉结合在高效、精准TC追踪中的潜力。
链接: https://arxiv.org/abs/2512.07885
作者: Davide Donno,Donatello Elia,Gabriele Accarino,Marco De Carlo,Enrico Scoccimarro,Silvio Gualdi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 21 pages, 10 figures
Abstract:Accurate tracking of tropical cyclones (TCs) represents a critical challenge in weather and climate science. Traditional tracking schemes mainly rely on subjective thresholds, which may introduce biases in their skill depending on the geographical region of application. We present ByteStorm, an efficient data-driven framework for reconstructing TC tracks without threshold tuning. It leverages deep learning networks to detect TC centers (via classification and localization), using only relative vorticity (850 mb) and mean sea-level pressure. Detected centers are then linked into TC tracks through the BYTE algorithm. ByteStorm is evaluated against state-of-the-art deterministic trackers in the East- and West-North Pacific basins (ENP and WNP). The proposed framework achieves superior performance in terms of Probability of Detection ( 85.05% ENP, 79.48% WNP), False Alarm Rate ( 23.26% ENP, 16.14% WNP), and high Inter-Annual Variability correlations ( 0.75 ENP and 0.69 WNP). These results highlight the potential of integrating deep learning and computer vision for fast and accurate TC tracking, offering a robust alternative to traditional approaches.
zh
[AI-76] Artificial Intelligence-Driven Network-on-Chip Design Space Exploration: Neural Network Architectures for Design
【速读】:该论文旨在解决网络芯片(Network-on-Chip, NoC)设计空间探索中因高维配置空间导致的效率低下问题,尤其是在满足严格吞吐量和延迟要求时,传统设计空间探索方法往往速度缓慢且难以处理复杂的非线性参数关系。其解决方案的关键在于提出一种基于机器学习的自动化框架,结合BookSim仿真与反向神经网络模型,通过训练三种架构——多层感知机(Multi-Layer Perceptron, MLP)、条件扩散模型(Conditional Diffusion Model)和条件变分自编码器(Conditional Variational Autoencoder, CVAE)——来预测给定目标性能下的最优NoC参数。实验表明,条件扩散模型在未见过的数据上实现了最低的均方误差(MSE=0.463),显著提升了预测精度,并将设计探索时间缩短了数个数量级,从而为快速、可扩展的NoC协同设计提供了可行方案。
链接: https://arxiv.org/abs/2512.07877
作者: Amogh Anshu N,Harish BP
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Network-on-Chip (NoC) design requires exploring a high-dimensional configuration space to satisfy stringent throughput and latency requirements. Traditional design space exploration techniques are often slow and struggle to handle complex, non-linear parameter interactions. This work presents a machine learning-driven framework that automates NoC design space exploration using BookSim simulations and reverse neural network models. Specifically, we compare three architectures - a Multi-Layer Perceptron (MLP), a Conditional Diffusion Model, and a Conditional Variational Autoencoder (CVAE) - to predict optimal NoC parameters given target performance metrics. The pipeline generates over 150,000 simulation data points across varied mesh configurations. The Conditional Diffusion Model achieved the highest predictive accuracy, attaining a mean squared error (MSE) of 0.463 on unseen data. Overall, the proposed framework reduces design exploration time by several orders of magnitude, making it a practical solution for rapid and scalable NoC co-design.
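所谓“反向神经网络”即从目标性能反推设计参数的监督模型;下面用 PyTorch 给出一个维度纯属示意的最小 MLP 版本,并非论文的网络结构:

```python
import torch
import torch.nn as nn

# 反向模型示意:输入目标性能指标(如吞吐量、延迟),
# 输出(归一化的)NoC 设计参数(如网格规模、虚通道数、缓冲深度)
reverse_mlp = nn.Sequential(
    nn.Linear(2, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 3),
)
optimizer = torch.optim.Adam(reverse_mlp.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(perf_targets, design_params):
    """在 (性能 -> 参数) 样本对上做一步监督训练;
    样本对可由 BookSim 之类的仿真器批量生成。"""
    optimizer.zero_grad()
    loss = loss_fn(reverse_mlp(perf_targets), design_params)
    loss.backward()
    optimizer.step()
    return loss.item()

perf = torch.randn(32, 2)       # 随机演示数据
params = torch.randn(32, 3)
print(train_step(perf, params))
```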
zh
[AI-77] Advancing physiological time series reconstruction and imputation via mixture of receptive fields and experts fusion
【速读】:该论文旨在解决医学时间序列信号重建中因多变量、高时间变异性、高噪声和伪影干扰而导致的深度学习方法性能受限的问题。其核心解决方案在于提出一种基于Mixture of Experts (MoE) 的噪声估计器,结合基于得分(score-based)的扩散框架:关键创新包括两个模块——Receptive Field Adaptive MoE (RFAMoE) 模块使每个通道在扩散过程中自适应选择最优感受野,从而提升对生理信号复杂动态的建模能力;以及Fusion MoE模块,通过MoE的路由机制并行生成K个噪声信号并融合,实现单次前向推理即可完成高质量重建,显著优于以往需多次推理平均的方法,在保持高性能的同时大幅降低计算开销与延迟。
链接: https://arxiv.org/abs/2512.07873
作者: Ci Zhang,Huayu Li,Changdi Yang,Jiangnan Xia,Yanzhi Wang,Xiaolong Ma,Jin Lu,Geng Yuan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent studies show that using diffusion models for time series signal reconstruction holds great promise. However, such approaches remain largely unexplored in the domain of medical time series. The unique characteristics of physiological time series signals, which are multivariate, highly temporally variable, noisy, and artifact-prone, make deep learning-based approaches still challenging for tasks such as imputation. Hence, we propose a novel Mixture of Experts (MoE)-based noise estimator within a score-based diffusion framework. Specifically, the Receptive Field Adaptive MoE (RFAMoE) module is designed to enable each channel to adaptively select desired receptive fields throughout the diffusion process. Moreover, recent literature has found that when generating a physiological signal, performing multiple inferences and averaging the reconstructed signals can effectively reduce reconstruction errors, but at the cost of significant computational and latency overhead. We design a Fusion MoE module and innovatively leverage the nature of the MoE module to generate K noise signals in parallel, fuse them using a routing mechanism, and complete signal reconstruction in a single inference step. This design not only improves performance over previous methods but also eliminates the substantial computational cost and latency associated with multiple inference processes. Extensive results demonstrate that our proposed framework consistently outperforms diffusion-based SOTA works on different tasks and datasets.
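Fusion MoE“并行生成 K 路噪声并用路由权重单次融合”的思想可以用如下极简模块示意;线性专家与维度均为假设,并非论文的模块设计:

```python
import torch
import torch.nn as nn

class FusionMoE(nn.Module):
    """并行产生 K 路噪声估计,再用路由器的 softmax 权重一次性融合,
    以单次前向替代"多次推理取平均"。"""
    def __init__(self, dim, k=4):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(k)])
        self.router = nn.Linear(dim, k)

    def forward(self, x):                                  # x: (batch, dim)
        weights = torch.softmax(self.router(x), dim=-1)    # (batch, k)
        noises = torch.stack([e(x) for e in self.experts], dim=1)  # (batch, k, dim)
        return (weights.unsqueeze(-1) * noises).sum(dim=1) # 单次前向完成融合

print(FusionMoE(16)(torch.randn(2, 16)).shape)   # torch.Size([2, 16])
```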
zh
[AI-78] Bayesian Optimization for Function-Valued Responses under Min-Max Criteria
【速读】:该论文旨在解决传统贝叶斯优化(Bayesian Optimization, BO)在处理函数型响应(functional response)时的局限性问题,即现有方法通常仅关注标量响应或最小化积分误差(integrated error),而忽视了对函数域内最大偏差(worst-case deviation)的建模,导致优化结果可能在局部区域表现不佳。其解决方案的关键在于提出一种最小-最大函数型贝叶斯优化(min-max Functional Bayesian Optimization, MM-FBO)框架,通过引入基于函数主成分分析(Functional Principal Component Analysis, FPCA)的表示方法对函数型响应进行降维建模,并构建高斯过程(Gaussian Process, GP)代理模型以捕捉各主成分得分的不确定性;在此基础上,设计了一个集成不确定性采集函数(acquisition function),在探索整个函数域的同时,显式地权衡最坏情况下的期望误差,从而直接最小化函数域上的最大误差。理论层面提供了离散化误差上界和一致性收敛结果,实验证明该方法在合成基准与物理驱动案例中显著优于现有基线。
链接: https://arxiv.org/abs/2512.07868
作者: Pouya Ahadi,Reza Marzban,Ali Adibi,Kamran Paynabar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 25 pages, 6 figures
Abstract:Bayesian optimization is widely used for optimizing expensive black box functions, but most existing approaches focus on scalar responses. In many scientific and engineering settings the response is functional, varying smoothly over an index such as time or wavelength, which makes classical formulations inadequate. Existing methods often minimize integrated error, which captures average performance but neglects worst case deviations. To address this limitation we propose min-max Functional Bayesian Optimization (MM-FBO), a framework that directly minimizes the maximum error across the functional domain. Functional responses are represented using functional principal component analysis, and Gaussian process surrogates are constructed for the principal component scores. Building on this representation, MM-FBO introduces an integrated uncertainty acquisition function that balances exploitation of worst case expected error with exploration across the functional domain. We provide two theoretical guarantees: a discretization bound for the worst case objective, and a consistency result showing that as the surrogate becomes accurate and uncertainty vanishes, the acquisition converges to the true min-max objective. We validate the method through experiments on synthetic benchmarks and physics inspired case studies involving electromagnetic scattering by metaphotonic devices and vapor phase infiltration. Results show that MM-FBO consistently outperforms existing baselines and highlights the importance of explicitly modeling functional uncertainty in Bayesian optimization.
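最小-最大式采集函数的核心是在函数域离散网格上取误差置信上界的最大值,再在候选间取最小;下面是一个与论文精确公式无关的示意版本:

```python
import numpy as np

def mm_fbo_acquisition(mean_err, std_err, beta=1.0):
    """mean_err / std_err: (n_candidates, n_grid),由 FPCA 得分上的
    GP 代理模型重构出的误差曲线后验均值与标准差。
    对每个候选取整个函数域上置信上界误差的最大值(最坏情况),
    再选择最坏情况最小的候选作为下一次实验点。"""
    upper = mean_err + beta * std_err      # 逐点的置信上界误差
    worst_case = upper.max(axis=1)         # 对函数索引(如波长/时间)取最大
    return int(np.argmin(worst_case))      # 最小化最坏情况误差

rng = np.random.default_rng(0)
mean = np.abs(rng.normal(size=(100, 50)))  # 演示用的后验均值/标准差
std = 0.1 * rng.random((100, 50))
print(mm_fbo_acquisition(mean, std))       # 下一个要评估的候选编号
```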
zh
[AI-79] Command & Control (C2) Traffic Detection via Algorithm Generated Domain (DGA) Classification Using Deep Learning and Natural Language Processing
【速读】:该论文旨在解决现代恶意软件(malware)通过域名生成算法(Domain Generation Algorithms, DGA)与命令与控制(Command and Control, C2)服务器通信所带来的安全挑战,此类技术可动态生成大量域名,使得基于静态黑名单的防御机制失效。解决方案的关键在于利用深度学习与自然语言处理(Natural Language Processing, NLP)技术,构建一个基于长短期记忆网络(Recurrent Neural Network, LSTM)的检测模型,通过对50,000个合法和50,000个恶意域名组成的混合数据库进行训练,提取词汇特征并识别DGA生成域名的复杂模式,最终实现97.2%的准确率,并有效降低在模糊合法流量场景下的误报率。
链接: https://arxiv.org/abs/2512.07866
作者: Maria Milena Araujo Felix
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: Language: Portuguese
Abstract:The sophistication of modern malware, specifically regarding communication with Command and Control (C2) servers, has rendered static blacklist-based defenses obsolete. The use of Domain Generation Algorithms (DGA) allows attackers to generate thousands of dynamic addresses daily, hindering blocking by traditional firewalls. This paper aims to propose and evaluate a method for detecting DGA domains using Deep Learning and Natural Language Processing (NLP) techniques. The methodology consisted of collecting a hybrid database containing 50,000 legitimate and 50,000 malicious domains, followed by the extraction of lexical features and the training of a Recurrent Neural Network (LSTM). Results demonstrated that while statistical entropy analysis is effective for simple DGAs, the Neural Network approach is superior at detecting complex patterns, reaching 97.2% accuracy and reducing the false-positive rate in ambiguous legitimate-traffic scenarios.
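文中所述的字符级 LSTM 域名分类器,大致对应如下 PyTorch 草图;字符表、各维度与示例域名均为假设,未经训练的输出无实际意义:

```python
import torch
import torch.nn as nn

class DGAClassifier(nn.Module):
    """字符级 LSTM:输入域名的字符序列,输出其为 DGA 域名的概率。"""
    def __init__(self, vocab_size=40, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, char_ids):                  # (batch, max_len)
        x = self.embed(char_ids)
        _, (h, _) = self.lstm(x)
        return torch.sigmoid(self.head(h[-1])).squeeze(-1)   # P(DGA)

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-._"
CHAR2ID = {c: i + 1 for i, c in enumerate(ALPHABET)}         # 0 作为 padding

def encode(domain, max_len=63):
    ids = [CHAR2ID.get(c, 0) for c in domain.lower()[:max_len]]
    return torch.tensor(ids + [0] * (max_len - len(ids))).unsqueeze(0)

model = DGAClassifier()
print(model(encode("kq3v9z1xw0pf.biz")))   # 未训练模型,输出仅演示接口
```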
zh
[AI-80] GPU Memory Prediction for Multimodal Model Training SOSP2025
【速读】:该论文旨在解决多模态生成式 AI (Generative AI) 系统中因模型规模扩大导致的 GPU 内存溢出(Out-of-Memory, OoM)问题,该问题会中断训练并浪费大量计算资源。现有方法仅针对单模态架构,难以推广至多模态模型。解决方案的关键在于提出一种基于模型结构与训练行为分析的框架,通过将多模态模型分解为各层并应用因子分解策略,精确估算每层的 GPU 内存占用,从而实现对峰值 GPU 内存使用量的高精度预测,实验表明其平均绝对百分比误差(MAPE)仅为 8.7%。
链接: https://arxiv.org/abs/2512.07853
作者: Jinwoo Jeong,Minchul Kang,Younghun Go,Changyong Shin,Hyunho Lee,Junho Yoon,Gyeongsik Yang,Chuck Yoo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: 1st Workshop on Systems for Agentic AI (SAA '25), co-located with SOSP 2025
Abstract:As deep learning models in agentic AI systems grow in scale and complexity, GPU memory requirements increase and often exceed the available GPU memory capacity, so that out-of-memory (OoM) errors occur. It is well known that OoM interrupts the whole training itself and wastes substantial computational resources. Therefore, to prevent OoM, accurate prediction of GPU memory usage is essential. However, previous studies focus only on unimodal architectures and fail to generalize to multimodal models, even though the multimodal models are a common choice in agentic AI systems. To address this limitation, we propose a framework that predicts the peak GPU memory usage by analyzing the model architecture and training behavior of multimodal models. Specifically, the framework decomposes the multimodal model into its constituent layers and applies factorization to estimate the memory usage of each layer. Our evaluation shows that our framework achieves high prediction accuracy of ~8.7% average MAPE.
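按层分解估算训练期 GPU 峰值内存的思路可粗略示意如下;权重、梯度、优化器状态、激活四项求和是常见的下界估算方式,系数与数字均为假设,并非论文的因子分解公式:

```python
def layer_memory_bytes(params, activations, dtype_bytes=2,
                       optimizer_states=2, grads=True):
    """估算单层训练时的显存占用(字节):
    权重 + 梯度 + 优化器状态(如 Adam 的两组动量)+ 反向传播所需激活。
    真实框架还有分配器等额外开销,此处仅为下界草图。"""
    weight_mem = params * dtype_bytes
    grad_mem = params * dtype_bytes if grads else 0
    opt_mem = params * dtype_bytes * optimizer_states
    act_mem = activations * dtype_bytes
    return weight_mem + grad_mem + opt_mem + act_mem

# 多模态模型的峰值占用约等于各分解层估算之和(参数/激活数为编造的演示值)
layers = [(85_000_000, 40_000_000), (300_000_000, 120_000_000)]
peak = sum(layer_memory_bytes(p, a) for p, a in layers)
print(f"predicted peak ~ {peak / 2**30:.2f} GiB")
```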
zh
[AI-81] SABER: Small Actions Big Errors - Safeguarding Mutating Steps in LLM Agents ICLR2026
【速读】:该论文旨在解决大语言模型代理(LLM agents)在长周期、工具调用任务中表现脆弱的问题,核心关注点在于:并非所有动作对失败的贡献均等。通过分析 τ-Bench(Airline/Retail)和 SWE-Bench Verified 的执行轨迹,研究发现,仅在环境改变型动作(mutating actions)中出现的“决定性偏离”(decisive deviations)显著降低成功率(每增加一个此类偏离,成功概率下降高达92%~96%),而非环境改变型动作则影响甚微。基于此,作者提出一种模型无关、无需梯度的测试时防护机制 SABER,其关键创新包括:(i) 增加仅在环境改变动作前触发的验证机制,(ii) 在 mutating 步骤前注入目标导向的反思(Targeted Reflection),以及 (iii) 实施基于区块的上下文清理策略。该方法在多个基准上实现显著提升(如 Qwen3-Thinking 在 Airline 上相对增益 +28%),并揭示了 τ-Bench 中标注错误与任务描述模糊导致的性能天花板问题,进而发布 τ-Bench Verified 以恢复评估空间。研究强调行动级分析、针对性防护与可靠评估是构建鲁棒多轮代理的前提。
链接: https://arxiv.org/abs/2512.07850
作者: Alejandro Cuadron,Pengfei Yu,Yang Liu,Arpit Gupta
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: submitted to ICLR2026
Abstract:Despite rapid progress in LLM agents, performance on long-horizon, tool-using tasks remains fragile. To better understand this fragility, we ask a simple question: do all actions contribute equally to failure? Analyzing execution traces on τ-Bench (Airline/Retail) and SWE-Bench Verified, we decompose trajectories into mutating (environment-changing) vs. non-mutating steps and formalize decisive deviations: earliest action-level divergences that flip success to failure. A logistic regression reveals that each additional deviation in a mutating action reduces the odds of success by up to 92% on Airline and up to 96% on Retail for SoTA models. In contrast, deviations in non-mutating actions have little to no effect. Errors also grow with context length as agents drift from role and act on stale constraints. Motivated by these observations, we introduce SABER, a model-agnostic, gradient-free, test-time safeguard that (i) adds mutation-gated verification, (ii) injects Targeted Reflection before mutating steps, and (iii) performs block-based context cleaning. SABER delivers consistent gains, e.g., Qwen3-Thinking: +28% relative on Airline, +11% on Retail, and +7% on SWE-Bench Verified; Claude: +9%/+7%. We further identify ceiling effects in τ-Bench, where annotation errors and underspecified tasks artificially cap model performance. To address this, we release τ-Bench Verified, which restores benchmark headroom through targeted revisions. Our results argue for action-level analysis, targeted safeguards, and reliable evaluations as prerequisites for robust multi-turn agents.
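摘要中“逐条轨迹回归成功率与各类偏离次数”的分析,可以用如下合成数据草图复现其形式;数据与系数均为虚构,仅演示方法本身:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
mut_dev = rng.poisson(1.0, 500)        # 环境改变动作中的偏离次数
non_mut_dev = rng.poisson(2.0, 500)    # 非改变动作中的偏离次数
# 合成的"真实机制":只有 mutating 偏离损害成功率
logit = 2.0 - 2.5 * mut_dev + 0.0 * non_mut_dev
success = rng.random(500) < 1 / (1 + np.exp(-logit))

X = np.column_stack([mut_dev, non_mut_dev])
model = LogisticRegression().fit(X, success)
# 每多一次偏离的乘性优势比:mutating 系数远小于 1,non-mutating 接近 1
print(np.exp(model.coef_[0]))
```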
zh
[AI-82] AudioScene: Integrating Object-Event Audio into 3D Scenes
【速读】:该论文旨在解决现有音频数据集缺乏空间上下文信息的问题,从而限制了音频引导的空间学习研究。其关键解决方案是构建两个新型的音频-空间场景数据集AudioScanNet和AudioRoboTHOR,通过将音频片段与空间对齐的3D场景融合,实现音频事件与空间信息的关联。为提升标注效率与准确性,作者利用大语言模型(Large Language Models, LLMs)的常识推理能力进行初步标注,并辅以严格的人工验证,该方法在保证高精度、完整性和多样性的同时,显著优于纯人工标注的可扩展性,且通过双基准任务——基于音频的3D视觉定位和基于音频的机器人零样本导航,验证了数据集的有效性与实用性。
链接: https://arxiv.org/abs/2512.07845
作者: Shuaihang Yuan,Congcong Wen,Muhammad Shafique,Anthony Tzes,Yi Fang
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:
Abstract:The rapid advances in audio analysis underscore its vast potential for human-computer interaction, environmental monitoring, and public safety; yet, existing audio-only datasets often lack spatial context. To address this gap, we present two novel audio-spatial scene datasets, AudioScanNet and AudioRoboTHOR, designed to explore audio-conditioned tasks within 3D environments. By integrating audio clips with spatially aligned 3D scenes, our datasets enable research on how audio signals interact with spatial context. To associate audio events with corresponding spatial information, we leverage the common sense reasoning ability of large language models and supplement them with rigorous human verification. This approach offers greater scalability compared to purely manual annotation while maintaining high standards of accuracy, completeness, and diversity, quantified through inter-annotator agreement and performance on two benchmark tasks: audio-based 3D visual grounding and audio-based robotic zero-shot navigation. The results highlight the limitations of current audio-centric methods and underscore the practical challenges and significance of our datasets in advancing audio-guided spatial learning.
zh
[AI-83] Space Alignment Matters: The Missing Piece for Inducing Neural Collapse in Long-Tailed Learning
【速读】:该论文旨在解决长尾分布(long-tailed regime)下神经网络训练中因样本不平衡导致的神经坍缩(Neural Collapse, NC)现象难以出现的问题,进而影响模型泛化性能。现有方法主要通过约束特征或分类器权重来恢复等距正交框架(ETF)几何结构,但忽略了特征空间与分类器权重空间之间显著的错位问题。论文的关键创新在于从理论上量化了这种空间错位带来的误差指数损害(optimal error exponent analysis),并据此提出三种可即插即用的对齐策略,无需改变网络架构即可提升现有长尾学习方法的性能,实验表明其在CIFAR-10-LT、CIFAR-100-LT和ImageNet-LT等数据集上均取得显著改进并达到当前最优效果。
链接: https://arxiv.org/abs/2512.07844
作者: Jinping Wang,Zhiqiang Gao,Zhiwu Xie
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent studies on Neural Collapse (NC) reveal that, under class-balanced conditions, the class feature means and classifier weights spontaneously align into a simplex equiangular tight frame (ETF). In long-tailed regimes, however, severe sample imbalance tends to prevent the emergence of the NC phenomenon, resulting in poor generalization performance. Current efforts predominantly seek to recover the ETF geometry by imposing constraints on features or classifier weights, yet overlook a critical problem: There is a pronounced misalignment between the feature and the classifier weight spaces. In this paper, we theoretically quantify the harm of such misalignment through an optimal error exponent analysis. Built on this insight, we propose three explicit alignment strategies that plug-and-play into existing long-tail methods without architectural change. Extensive experiments on the CIFAR-10-LT, CIFAR-100-LT, and ImageNet-LT datasets consistently boost examined baselines and achieve the state-of-the-art performances.
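论文强调的“特征空间与分类器权重空间的错位”可以用一个简单的逐类余弦度量来示意;这只是错位程度的一种假设性度量,并非论文提出的三种对齐策略本身:

```python
import torch

def alignment_error(class_means, classifier_weights):
    """对每一类计算归一化特征均值与对应分类器权重的余弦相似度,
    返回 1 - 平均相似度;在理想的 Neural Collapse(自对偶)下为 0。"""
    f = torch.nn.functional.normalize(class_means, dim=1)
    w = torch.nn.functional.normalize(classifier_weights, dim=1)
    return 1.0 - (f * w).sum(dim=1).mean()

means = torch.randn(10, 64)                    # 10 类、64 维的演示数据
weights = torch.randn(10, 64)
print(alignment_error(means, weights))         # 随机初始化:明显大于 0
print(alignment_error(means, means.clone()))   # 完全对齐:约等于 0
```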
zh
[AI-84] Impact of Data-Oriented and Object-Oriented Design on Performance and Cache Utilization with Artificial Intelligence Algorithms in Multi-Threaded CPUs
【速读】:该论文旨在解决多核中央处理器(CPU)与主内存之间日益扩大的性能差距问题,提出通过硬件感知的软件设计范式来提升计算效率。其解决方案的关键在于对比分析数据导向设计(Data Oriented Design, DOD)与传统面向对象设计(Object-Oriented Design, OOD)在多线程环境下的缓存利用率和执行效率差异。研究通过实现并比较四种A*搜索算法版本(单线程OOD、单线程DOD、多线程OOD、多线程DOD),发现DOD在多线程场景下显著减少缓存未命中次数和系统调用开销,展现出更高的数据密集型操作效率;尽管单线程版本因线程管理开销而优于多线程版本,但DOD在关键性能指标上的持续优势凸显了其在复杂AI和并行计算任务中对硬件资源利用的架构优越性。
链接: https://arxiv.org/abs/2512.07841
作者: Gabriel M. Arantes,Richard F. Pinto,Bruno L. Dalmazo,Eduardo N. Borges,Giancarlo Lucca,Viviane L. D. de Mattos,Fabian C. Cardoso,Rafael A. Berri
机构: 未知
类目: Artificial Intelligence (cs.AI); Performance (cs.PF)
备注:
Abstract:The growing performance gap between multi-core CPUs and main memory necessitates hardware-aware software design paradigms. This study provides a comprehensive performance analysis of Data Oriented Design (DOD) versus the traditional Object-Oriented Design (OOD), focusing on cache utilization and efficiency in multi-threaded environments. We developed and compared four distinct versions of the A* search algorithm: single-threaded OOD (ST-OOD), single-threaded DOD (ST-DOD), multi-threaded OOD (MT-OOD), and multi-threaded DOD (MT-DOD). The evaluation was based on metrics including execution time, memory usage, and CPU cache misses. In multi-threaded tests, the DOD implementation demonstrated considerable performance gains, with faster execution times and a lower number of raw system calls and cache misses. While OOD occasionally showed marginal advantages in memory usage or percentage-based cache miss rates, DOD’s efficiency in data-intensive operations was more evident. Furthermore, our findings reveal that for a fine-grained task like the A* algorithm, the overhead associated with thread management led to single-threaded versions significantly outperforming their multi-threaded counterparts in both paradigms. We conclude that even when performance differences appear subtle in simple algorithms, the consistent advantages of DOD in critical metrics highlight its foundational architectural superiority, suggesting it is a more effective approach for maximizing hardware efficiency in complex, large-scale AI and parallel computing tasks.
zh
[AI-85] Artificial Intelligence and Nuclear Weapons Proliferation: The Technological Arms Race for (In)visibility
【速读】:该论文试图解决的问题是:新兴和颠覆性技术正在重塑核风险格局,特别是由促进扩散的技术(Proliferation-Enabling Technologies, PETs)与增强检测能力的技术(Detection-Enhancing Technologies, DETs)之间的动态博弈,正导致核武器扩散的“可见性”边界日益模糊,传统不扩散机制面临失效风险。解决方案的关键在于构建一个以相对优势指数(Relative Advantage Index, RAI)为核心的量化模型,用于刻画PETs与DETs之间力量平衡的变化趋势,并通过情景模拟揭示不对称技术进步(如AI驱动的PET指数增长 vs. DET的渐进式改进)如何扩大核突破风险的不确定性区间。研究指出,仅依赖检测手段已不足以应对未来挑战,必须同步加强PET治理,形成前瞻性的政策与技术响应体系。
链接: https://arxiv.org/abs/2512.07487
作者: David M. Allison,Stephen Herzog
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注: Best Paper Award (2025) from Risk Analysis as one of the articles published in the journal that year with the most significant impacts to the theory or practice of risk analysis. Main text: 17 pages, 5 tables, 5 figures. Online appendix: 4 pages, 3 figures, 1 table. Online simulation tool for the formal model available here: this https URL
Abstract:A robust nonproliferation regime has contained the spread of nuclear weapons to just nine states. Yet, emerging and disruptive technologies are reshaping the landscape of nuclear risks, presenting a critical juncture for decision makers. This article lays out the contours of an overlooked but intensifying technological arms race for nuclear (in)visibility, driven by the interplay between proliferation-enabling technologies (PETs) and detection-enhancing technologies (DETs). We argue that the strategic pattern of proliferation will be increasingly shaped by the innovation pace in these domains. Artificial intelligence (AI) introduces unprecedented complexity to this equation, as its rapid scaling and knowledge substitution capabilities accelerate PET development and challenge traditional monitoring and verification methods. To analyze this dynamic, we develop a formal model centered on a Relative Advantage Index (RAI), quantifying the shifting balance between PETs and DETs. Our model explores how asymmetric technological advancement, particularly logistic AI-driven PET growth versus stepwise DET improvements, expands the band of uncertainty surrounding proliferation detectability. Through replicable scenario-based simulations, we evaluate the impact of varying PET growth rates and DET investment strategies on cumulative nuclear breakout risk. We identify a strategic fork ahead, where detection may no longer suffice without broader PET governance. Governments and international organizations should accordingly invest in policies and tools agile enough to keep pace with tomorrow’s technology.
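文中“对数增长的 PET 对阵阶梯式 DET”的相对优势指数(RAI)动态,可用如下玩具模拟体会;增长率、步长等参数均为任意假设,与论文的正式模型参数无关:

```python
import numpy as np

def simulate_rai(T=50, pet_rate=0.25, pet_cap=10.0,
                 det_step=1.0, det_interval=8):
    """PET 能力按 logistic 曲线增长(AI 加速扩散型技术的发展),
    DET 能力每隔 det_interval 期阶梯式提升;
    RAI > 0 有利于扩散方,RAI < 0 有利于探测方。"""
    pet, det, rai = 1.0, 1.0, []
    for t in range(T):
        pet += pet_rate * pet * (1 - pet / pet_cap)   # logistic 增长
        if t % det_interval == 0:
            det += det_step                           # 阶梯式升级
        rai.append(pet - det)
    return np.array(rai)

print(simulate_rai()[-5:])   # 后期 PET 与 DET 的力量对比
```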
zh
[AI-86] SAQ: Stabilizer-Aware Quantum Error Correction Decoder
【速读】:该论文旨在解决量子纠错码(Quantum Error Correction, QEC)解码中长期存在的准确性与效率权衡问题:传统方法如最小权重完美匹配(Minimum Weight Perfect Matching, MWPM)在不同噪声模型下性能波动且计算复杂度为多项式级别,而张量网络解码器虽精度高但计算成本过高;近期神经网络解码器虽降低了复杂度,却难以达到经典方法的准确性。解决方案的关键在于提出SAQ-Decoder框架,其核心创新包括:(1)基于双流Transformer架构,联合处理测量奇偶校验信息(syndrome)与逻辑信息,并引入不对称注意力机制以增强对关键错误模式的感知能力;(2)设计一种可微分的逻辑损失函数,通过有限域上的平滑近似直接优化逻辑错误率(Logical Error Rate, LER),从而实现对物理错误到逻辑错误映射的端到端学习;该方案在保持线性计算复杂度的同时逼近最大似然(Maximum Likelihood, ML)性能上限,在拓扑码上实现了接近理论极限的阈值(独立噪声下10.99%,去极化噪声下18.6%),显著优于现有神经与经典基线方法。
链接: https://arxiv.org/abs/2512.08914
作者: David Zenati,Eliya Nachmani
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注:
Abstract:Quantum Error Correction (QEC) decoding faces a fundamental accuracy-efficiency tradeoff. Classical methods like Minimum Weight Perfect Matching (MWPM) exhibit variable performance across noise models and suffer from polynomial complexity, while tensor network decoders achieve high accuracy but at prohibitively high computational cost. Recent neural decoders reduce complexity but lack the accuracy needed to compete with computationally expensive classical methods. We introduce SAQ-Decoder, a unified framework combining transformer-based learning with constraint aware post-processing that achieves both near Maximum Likelihood (ML) accuracy and linear computational scalability with respect to the syndrome size. Our approach combines a dual-stream transformer architecture that processes syndromes and logical information with asymmetric attention patterns, and a novel differentiable logical loss that directly optimizes Logical Error Rates (LER) through smooth approximations over finite fields. SAQ-Decoder achieves near-optimal performance, with error thresholds of 10.99% (independent noise) and 18.6% (depolarizing noise) on toric codes that approach the ML bounds of 11.0% and 18.9% while outperforming existing neural and classical baselines in accuracy, complexity, and parameter efficiency. Our findings establish that learned decoders can simultaneously achieve competitive decoding accuracy and computational efficiency, addressing key requirements for practical fault-tolerant quantum computing systems.
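摘要中“有限域上的平滑近似”通常指把 GF(2) 奇偶校验松弛为可微的概率运算;下面是一个与论文精确损失形式无关的 soft-XOR 示意:

```python
import torch
import torch.nn.functional as F

def soft_parity(probs):
    """给定各比特翻转概率 p_i,其 XOR 为 1 的概率为
    (1 - prod_i(1 - 2 p_i)) / 2:这是 GF(2) 加法的平滑松弛,
    让逻辑错误率型的损失能够反向传播到解码器参数。"""
    return 0.5 * (1.0 - torch.prod(1.0 - 2.0 * probs, dim=-1))

# 示例:推动解码器预测的翻转组合匹配目标逻辑比特(数值纯属演示)
pred_flips = torch.tensor([[0.9, 0.1, 0.8]], requires_grad=True)
target_logical = torch.tensor([0.0])
loss = F.binary_cross_entropy(soft_parity(pred_flips), target_logical)
loss.backward()
print(loss.item(), pred_flips.grad)
```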
zh
[AI-87] Harmonizing Community Science Datasets to Model Highly Pathogenic Avian Influenza (HPAI) in Birds in the Subantarctic PRICAI2025
【速读】:该论文旨在解决社区科学观测数据(community science observational datasets)在流行病学和生态学研究中因数据异质性带来的标准化、数据质量控制及工作流管理难题。其解决方案的关键在于构建一套系统性的数据处理流程,用于清洗和统一多个来源的社区科学数据集(如eBird、iNaturalist、GBIF等),并通过案例研究对南半球亚南极地区鸟类种群受高致病性禽流感(Highly Pathogenic Avian Influenza, HPAI)影响进行建模,从而估算未知种群结构下的个体数量及新型死亡率估计值。
链接: https://arxiv.org/abs/2512.07907
作者: Richard Littauer,Kris Bubendorfer
机构: 未知
类目: Populations and Evolution (q-bio.PE); Artificial Intelligence (cs.AI)
备注: Proceedings of Pacific Rim International Conference on Artificial Intelligence 2025 (PRICAI 2025): Artificial Intelligence for Earth and Environmental Science 2025 (AIEES 2025) Workshop, 17-21 Nov 2025, Wellington, New Zealand. Changes from presentation paper: small spelling edits, change of preferred email, inclusion of Codeberg source code
Abstract:Community science observational datasets are useful in epidemiology and ecology for modeling species distributions, but the heterogeneous nature of the data presents significant challenges for standardization, data quality assurance and control, and workflow management. In this paper, we present a data workflow for cleaning and harmonizing multiple community science datasets, which we implement in a case study using eBird, iNaturalist, GBIF, and other datasets to model the impact of highly pathogenic avian influenza in populations of birds in the subantarctic. We predict population sizes for several species where the demographics are not known, and we present novel estimates for potential mortality rates from HPAI for those species, based on a novel aggregated dataset of mortality rates in the subantarctic.
zh
[AI-88] Functional Random Forest with Adaptive Cost-Sensitive Splitting for Imbalanced Functional Data Classification
【速读】:该论文旨在解决功能数据(functional data)分类中因类别严重不平衡导致的少数类识别困难问题,传统随机森林算法在处理曲线或轨迹类观测时难以捕捉其内在结构且对少数类敏感性不足。解决方案的关键在于提出一种自适应代价敏感分割的函数随机森林(Functional Random Forest with Adaptive Cost-Sensitive Splitting, FRF-ACS):首先通过基展开和函数主成分分析(Functional Principal Component Analysis, FPCA)将高维曲线降维为低维特征;其次引入局部动态代价敏感分裂准则,在每个节点调整类别权重以增强对少数类的判别能力;同时结合函数SMOTE与加权自助采样策略实现混合采样优化;最后使用曲线特异性相似度度量替代欧氏距离进行叶节点分配,从而保留功能数据的连续性和形态特征。该方法显著提升了少数类召回率和整体预测性能,适用于生物医学信号、传感器轨迹等场景下的可解释性高维功能数据分析。
链接: https://arxiv.org/abs/2512.07888
作者: Fahad Mostafa,Hafiz Khan
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applications (stat.AP); Computation (stat.CO)
备注: 23 pages, 4 figures
Abstract:Classification of functional data, where observations are curves or trajectories, poses unique challenges, particularly under severe class imbalance. Traditional Random Forest algorithms, while robust for tabular data, often fail to capture the intrinsic structure of functional observations and struggle with minority class detection. This paper introduces Functional Random Forest with Adaptive Cost-Sensitive Splitting (FRF-ACS), a novel ensemble framework designed for imbalanced functional data classification. The proposed method leverages basis expansions and Functional Principal Component Analysis (FPCA) to represent curves efficiently, enabling trees to operate on low-dimensional functional features. To address imbalance, we incorporate a dynamic cost-sensitive splitting criterion that adjusts class weights locally at each node, combined with a hybrid sampling strategy integrating functional SMOTE and weighted bootstrapping. Additionally, curve-specific similarity metrics replace traditional Euclidean measures to preserve functional characteristics during leaf assignment. Extensive experiments on synthetic and real-world datasets, including biomedical signals and sensor trajectories, demonstrate that FRF-ACS significantly improves minority class recall and overall predictive performance compared to existing functional classifiers and imbalance-handling techniques. This work provides a scalable, interpretable solution for high-dimensional functional data analysis in domains where minority class detection is critical.
zh
[AI-89] Referenceless Proton Resonance Frequency Thermometry Using Deep Learning with Self-Attention
【速读】:该论文旨在解决高强聚焦超声(FUS)热消融过程中,由于组织界面处磁 susceptibility引起的相位不连续性导致的无参考型质子共振频率(PRF)磁共振测温方法(如复场估计法CFE和相位有限差分法PFD)出现误差的问题。解决方案的关键在于提出一种改进的相位处理策略,以有效抑制或校正这些由组织界面引发的相位伪影,从而提升测温精度与可靠性。
链接: https://arxiv.org/abs/2512.07882
作者: Yueran Zhao,Chang-Sheng Mei,Nathan J. McDannold,Shenyan Zong,Guofeng Shen
机构: 未知
类目: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI)
备注:
Abstract:Background: Accurate proton resonance frequency (PRF) MR thermometry is essential for monitoring temperature rise during thermal ablation with high intensity focused ultrasound (FUS). Conventional referenceless methods such as complex field estimation (CFE) and phase finite difference (PFD) tend to exhibit errors when susceptibility-induced phase discontinuities occur at tissue interfaces.
zh
[AI-90] Quantum Circuit Reasoning Models: A Variational Framework for Differentiable Logical Inference
【速读】:该论文旨在解决传统推理系统在处理复杂逻辑结构和不确定性时的局限性,尤其是如何将量子计算中的基本操作与逻辑推理过程深度融合,从而构建具备可微分优化能力的新型推理架构。其解决方案的关键在于提出量子电路推理模型(Quantum Circuit Reasoning Models, QCRM),通过将量子力学中的超位置(superposition)、纠缠(entanglement)、干涉(interference)和测量(measurement)等原语映射为假设分支、约束传播、一致性强制和决策等逻辑推理步骤,使推理过程表现为振幅演化和干涉驱动的状态自洽选择;同时,通过将逻辑规则编码为作用于命题量子比特(proposition-qubit)状态上的酉变换,并设计基于经典梯度下降的训练目标,实现可微分优化下的推理能力学习,最终提出量子推理层(Quantum Reasoning Layer, QRL)作为可组合的混合推理组件,适用于科学、生物医学和化学等领域的推理建模。
链接: https://arxiv.org/abs/2512.07871
作者: Andrew Kiruluta
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注:
Abstract:This report introduces a novel class of reasoning architectures, termed Quantum Circuit Reasoning Models (QCRM), which extend the concept of Variational Quantum Circuits (VQC) from energy minimization and classification tasks to structured logical inference and reasoning. We posit that fundamental quantum mechanical operations, superposition, entanglement, interference, and measurement, naturally map to essential reasoning primitives such as hypothesis branching, constraint propagation, consistency enforcement, and decision making. The resulting framework combines quantum-inspired computation with differentiable optimization, enabling reasoning to emerge as a process of amplitude evolution and interference-driven selection of self-consistent states. We develop the mathematical foundation of QCRM, define its parameterized circuit architecture, and show how logical rules can be encoded as unitary transformations over proposition-qubit states. We further formalize a training objective grounded in classical gradient descent over circuit parameters and discuss simulation-based implementations on classical hardware. Finally, we propose the Quantum Reasoning Layer (QRL) as a differentiable hybrid component for composable reasoning models applicable to scientific, biomedical, and chemical inference domains.
zh
[AI-91] Manifolds and Modules: How Function Develops in a Neural Foundation Model NEURIPS2025
【速读】:该论文旨在解决当前神经基础模型(neural foundation model)在神经科学应用中因“黑箱”特性而难以解释其内部机制的问题,从而限制了对生物视觉系统功能的理解。解决方案的关键在于采用生理学家视角,通过分析模型中每个“神经元”对参数化刺激的时域响应特性,构建解码流形(decoding manifolds)和编码流形(neural encoding manifolds),揭示不同模块(前馈编码器、递归模块与读出模块)在神经活动空间与刺激-响应空间中的表征结构差异。研究发现,递归模块通过“拉开”不同时间模式的表示显著提升处理能力,而读出模块则通过大量专用特征图实现生物学保真度,而非依赖生物合理机制。这一方法为理解神经基础模型内部运作提供了新的可解释性框架,并加深了对其生物相关性的认知。
链接: https://arxiv.org/abs/2512.07869
作者: Johannes Bertram,Luciano Dyballa,T. Anderson Keller,Savik Kinger,Steven W. Zucker
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
备注: 25 pages, 10 figures, accepted at Data on the Brain Mind Findings, NeurIPS 2025
Abstract:Foundation models have shown remarkable success in fitting biological visual systems; however, their black-box nature inherently limits their utility for understanding brain function. Here, we peek inside a SOTA foundation model of neural activity (Wang et al., 2025) as a physiologist might, characterizing each ‘neuron’ based on its temporal response properties to parametric stimuli. We analyze how different stimuli are represented in neural activity space by building decoding manifolds, and we analyze how different neurons are represented in stimulus-response space by building neural encoding manifolds. We find that the different processing stages of the model (i.e., the feedforward encoder, recurrent, and readout modules) each exhibit qualitatively different representational structures in these manifolds. The recurrent module shows a jump in capabilities over the encoder module by ‘pushing apart’ the representations of different temporal stimulus patterns; while the readout module achieves biological fidelity by using numerous specialized feature maps rather than biologically plausible mechanisms. Overall, we present this work as a study of the inner workings of a prominent neural foundation model, gaining insights into the biological relevance of its internals through the novel analysis of its neurons’ joint temporal response patterns.
zh
[AI-92] LLM -Generated Counterfactual Stress Scenarios for Portfolio Risk Simulation via Hybrid Prompt-RAG Pipeline
【速读】:该论文旨在解决传统宏观金融压力测试框架在生成经济情景时缺乏可解释性、灵活性不足以及难以规模化的问题。其核心挑战在于如何利用大语言模型(Large Language Models, LLMs)构建一个既透明又可审计的自动化流程,以生成符合实际经济逻辑的宏观经济情景,并将其转化为投资组合风险指标(如风险价值 Value-at-Risk 和预期短缺 Expected Shortfall)。解决方案的关键在于设计了一个结构化提示(structured prompting)与可选检索机制相结合的端到端管道:通过引入国家基本面数据和新闻信息增强情景生成的准确性,同时采用因子映射将宏观变量转换为风险损失,确保结果具备可计算性和可比性;此外,系统通过快照记录、确定性运行模式及哈希验证的输出产物保障了可复现性和审计能力,从而实现了对传统计量经济学基准的有效补充。
链接: https://arxiv.org/abs/2512.07867
作者: Masoud Soleimani
机构: 未知
类目: Risk Management (q-fin.RM); Artificial Intelligence (cs.AI); Econometrics (econ.EM)
备注: 22 pages, 8 figures, 10 tables
Abstract:We develop a transparent and fully auditable LLM-based pipeline for macro-financial stress testing, combining structured prompting with optional retrieval of country fundamentals and news. The system generates machine-readable macroeconomic scenarios for the G7, which cover GDP growth, inflation, and policy rates, and are translated into portfolio losses through a factor-based mapping that enables Value-at-Risk and Expected Shortfall assessment relative to classical econometric baselines. Across models, countries, and retrieval settings, the LLMs produce coherent and country-specific stress narratives, yielding stable tail-risk amplification with limited sensitivity to retrieval choices. Comprehensive plausibility checks, scenario diagnostics, and ANOVA-based variance decomposition show that risk variation is driven primarily by portfolio composition and prompt design rather than by the retrieval mechanism. The pipeline incorporates snapshotting, deterministic modes, and hash-verified artifacts to ensure reproducibility and auditability. Overall, the results demonstrate that LLM-generated macro scenarios, when paired with transparent structure and rigorous validation, can provide a scalable and interpretable complement to traditional stress-testing frameworks.
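下面用极简numpy示意“情景→因子映射→组合损失→VaR/ES”这一环节(因子暴露与情景冲击均为示例数值,非论文原始流水线):

```python
# 因子映射与尾部风险度量的极简示意(示例数值,非论文原始流水线)
import numpy as np

rng = np.random.default_rng(1)
# 假设的宏观因子冲击情景:每行为一个情景(GDP增速、通胀、政策利率的偏移)
scenarios = rng.normal(0.0, 1.0, size=(10_000, 3))
# 假设的组合因子敏感度:负GDP冲击、正通胀/利率冲击造成损失
exposures = np.array([-0.8, 0.5, 0.3])
losses = scenarios @ exposures  # 线性因子映射得到组合损失

def var_es(losses, alpha=0.99):
    """历史模拟法下的VaR与ES(Expected Shortfall)。"""
    var = np.quantile(losses, alpha)
    es = losses[losses >= var].mean()
    return var, es

print("VaR(99%%)=%.3f  ES(99%%)=%.3f" % var_es(losses))
```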
zh
[AI-93] Automating High Energy Physics Data Analysis with LLM -Powered Agents NEURIPS2025
【速读】:该论文旨在解决高能物理(High Energy Physics, HEP)分析中自动化流程的复杂性与人工干预依赖的问题,通过引入大语言模型(Large Language Model, LLM)代理来实现从代码生成到执行及纠错的全流程自动化。其解决方案的关键在于构建一个混合系统:以Snakemake工作流管理器保障分析步骤的可复现性和确定性,同时利用LLM驱动的监督-编码代理自主生成、执行并迭代修正分析代码,从而在真实科学计算环境中实现对HEP任务(如希格斯玻色子二光子截面测量)的端到端自动化处理。该架构不仅支持多阶段工作流的量化评估(如成功率、错误分布、任务成本等),还为不同LLM在实际科研场景中的能力、稳定性与局限性提供了系统性基准测试框架。
链接: https://arxiv.org/abs/2512.07785
作者: Eli Gendreau-Distler,Joshua Ho,Dongwon Kim,Luc Tomas Le Pottier,Haichen Wang,Chengxi Yang
机构: 未知
类目: Data Analysis, Statistics and Probability (physics.data-an); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
备注: 16 pages, 6 figures, 2 tables, the 39th Conference on Neural Information Processing Systems (NeurIPS 2025) - Machine Learning and the Physical Sciences (ML4PS) workshop (poster)
Abstract:We present a proof-of-principle study demonstrating the use of large language model (LLM) agents to automate a representative high energy physics (HEP) analysis. Using the Higgs boson diphoton cross-section measurement as a case study with ATLAS Open Data, we design a hybrid system that combines an LLM-based supervisor-coder agent with the Snakemake workflow manager. In this architecture, the workflow manager enforces reproducibility and determinism, while the agent autonomously generates, executes, and iteratively corrects analysis code in response to user instructions. We define quantitative evaluation metrics including success rate, error distribution, costs per specific task, and average number of API calls, to assess agent performance across multi-stage workflows. To characterize variability across architectures, we benchmark a representative selection of state-of-the-art LLMs spanning the Gemini and GPT-5 series, the Claude family, and leading open-weight models. While the workflow manager ensures deterministic execution of all analysis steps, the final outputs still show stochastic variation. Although we set the temperature to zero, other sampling parameters (e.g., top-p, top-k) remained at their defaults, and some reasoning-oriented models internally adjust these settings. Consequently, the models do not produce fully deterministic results. This study establishes the first LLM-agent-driven automated data-analysis framework in HEP, enabling systematic benchmarking of model capabilities, stability, and limitations in real-world scientific computing environments. The baseline code used in this work is available at this https URL. This work was accepted as a poster at the Machine Learning and the Physical Sciences (ML4PS) workshop at NeurIPS 2025. The initial submission was made on August 30, 2025.
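监督-编码代理的“生成→执行→依报错迭代修正”控制循环可以概括为如下Python示意(llm_generate为占位假设,真实系统中对接LLM与Snakemake;非论文代码):

```python
# 监督-编码代理的生成-执行-纠错循环示意(llm_generate为占位假设,非论文实现)
import subprocess, sys, tempfile

def llm_generate(task, error=None):
    """占位:真实系统中此处调用LLM生成分析代码;返回固定脚本使示例可运行。"""
    return "print('diphoton analysis step: ok')"

def run_with_retries(task, max_attempts=3):
    error = None
    for attempt in range(1, max_attempts + 1):
        code = llm_generate(task, error)
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, text=True)
        if proc.returncode == 0:
            print(f"attempt {attempt}: ok ->", proc.stdout.strip())
            return True
        error = proc.stderr   # 将traceback反馈给下一轮生成,实现迭代纠错
        print(f"attempt {attempt}: failed, retrying with error feedback")
    return False

run_with_retries("fit the diphoton invariant-mass spectrum")
```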
zh
机器学习
[LG-0] OSMO: Open-Source Tactile Glove for Human-to-Robot Skill Transfer
链接: https://arxiv.org/abs/2512.08920
作者: Jessica Yin,Haozhi Qi,Youngsun Wi,Sayantan Kundu,Mike Lambeta,William Yang,Changhao Wang,Tingfan Wu,Jitendra Malik,Tess Hellebrekers
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Project website: this https URL
Abstract:Human video demonstrations provide abundant training data for learning robot policies, but video alone cannot capture the rich contact signals critical for mastering manipulation. We introduce OSMO, an open-source wearable tactile glove designed for human-to-robot skill transfer. The glove features 12 three-axis tactile sensors across the fingertips and palm and is designed to be compatible with state-of-the-art hand-tracking methods for in-the-wild data collection. We demonstrate that a robot policy trained exclusively on human demonstrations collected with OSMO, without any real robot data, is capable of executing a challenging contact-rich manipulation task. By equipping both the human and the robot with the same glove, OSMO minimizes the visual and tactile embodiment gap, enabling the transfer of continuous shear and normal force feedback while avoiding the need for image inpainting or other vision-based force inference. On a real-world wiping task requiring sustained contact pressure, our tactile-aware policy achieves a 72% success rate, outperforming vision-only baselines by eliminating contact-related failure modes. We release complete hardware designs, firmware, and assembly instructions to support community adoption.
[LG-1] Open Polymer Challenge: Post-Competition Report WWW NEURIPS
链接: https://arxiv.org/abs/2512.08896
作者: Gang Liu,Sobin Alosious,Subhamoy Mahajan,Eric Inae,Yihan Zhu,Yuhan Liu,Renzheng Zhang,Jiaxin Xu,Addison Howard,Ying Li,Tengfei Luo,Meng Jiang
类目: Machine Learning (cs.LG)
*备注: The report for the competition: “NeurIPS - Open Polymer Prediction 2025”. Kaggle Page: this https URL . Website: this https URL
Abstract:Machine learning (ML) offers a powerful path toward discovering sustainable polymer materials, but progress has been limited by the lack of large, high-quality, and openly accessible polymer datasets. The Open Polymer Challenge (OPC) addresses this gap by releasing the first community-developed benchmark for polymer informatics, featuring a dataset with 10K polymers and 5 properties: thermal conductivity, radius of gyration, density, fractional free volume, and glass transition temperature. The challenge centers on multi-task polymer property prediction, a core step in virtual screening pipelines for materials discovery. Participants developed models under realistic constraints that include small data, label imbalance, and heterogeneous simulation sources, using techniques such as feature-based augmentation, transfer learning, self-supervised pretraining, and targeted ensemble strategies. The competition also revealed important lessons about data preparation, distribution shifts, and cross-group simulation consistency, informing best practices for future large-scale polymer datasets. The resulting models, analysis, and released data create a new foundation for molecular AI in polymer science and are expected to accelerate the development of sustainable and energy-efficient materials. Along with the competition, we release the test dataset at this https URL. We also release the data generation pipeline at this https URL, which simulates more than 25 properties, including thermal conductivity, radius of gyration, and density.
[LG-2] Unsupervised Learning of Density Estimates with Topological Optimization
链接: https://arxiv.org/abs/2512.08895
作者: Suina Tanweer,Firas A. Khasawneh
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Kernel density estimation is a key component of a wide variety of algorithms in machine learning, Bayesian inference, stochastic dynamics and signal processing. However, the unsupervised density estimation technique requires tuning a crucial hyperparameter: the kernel bandwidth. The choice of bandwidth is critical as it controls the bias-variance trade-off by over- or under-smoothing the topological features. Topological data analysis provides methods to mathematically quantify topological characteristics, such as connected components, loops, voids et cetera, even in high dimensions where visualization of density estimates is impossible. In this paper, we propose an unsupervised learning approach using a topology-based loss function for the automated and unsupervised selection of the optimal bandwidth and benchmark it against classical techniques – demonstrating its potential across different dimensions.
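一维情形下,峰的显著度(prominence)可作为0维拓扑持久性的代理。下面给出据此简化的拓扑式带宽选择示意(“真实结构含两个模态”为示例假设,非论文中的持久同调优化):

```python
# 以峰显著度近似0维拓扑持久性的带宽选择示意(简化版,非论文算法)
import numpy as np
from scipy.stats import gaussian_kde
from scipy.signal import find_peaks

rng = np.random.default_rng(2)
data = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(2, 0.5, 300)])
grid = np.linspace(-5, 5, 512)

def spurious_persistence(bw):
    """带宽因子bw下,密度估计中非显著峰的总显著度(越小越好)。"""
    density = gaussian_kde(data, bw_method=bw)(grid)
    peaks, props = find_peaks(density, prominence=0.0)
    prom = np.sort(props["prominences"])[::-1]
    return prom[2:].sum()  # 示例假设:真实结构有两个模态,其余峰视为噪声

bandwidths = np.linspace(0.02, 0.5, 25)
best = bandwidths[np.argmin([spurious_persistence(b) for b in bandwidths])]
print(f"selected bandwidth factor: {best:.3f}")
```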
[LG-3] Explainable Anomaly Detection for Industrial IoT Data Streams
链接: https://arxiv.org/abs/2512.08885
作者: Ana Rita Paupério,Diogo Risca,Afonso Lourenço,Goreti Marreiros,Ricardo Martins
类目: Machine Learning (cs.LG)
*备注: Accepted at 41st ACM/SIGAPP Symposium On Applied Computing (SAC 2026)
Abstract:Industrial maintenance is being transformed by the Internet of Things and edge computing, generating continuous data streams that demand real-time, adaptive decision-making under limited computational resources. While data stream mining (DSM) addresses this challenge, most methods assume fully supervised settings, yet in practice, ground-truth labels are often delayed or unavailable. This paper presents a collaborative DSM framework that integrates unsupervised anomaly detection with interactive, human-in-the-loop learning to support maintenance decisions. We employ an online Isolation Forest and enhance interpretability using incremental Partial Dependence Plots and a feature importance score, derived from deviations of Individual Conditional Expectation curves from a fading average, enabling users to dynamically reassess feature relevance and adjust anomaly thresholds. We describe the real-time implementation and provide initial results for fault detection in a Jacquard loom unit. Ongoing work targets continuous monitoring to predict and explain imminent bearing failures.
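“ICE曲线对衰减平均(fading average)的偏离”这一特征重要性打分,思路上可用如下离线简化示意表达(流式与增量PDP细节从略,非论文原始实现):

```python
# ICE偏离衰减平均的特征重要性示意(离线简化版,非论文实现)
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 4))
X[:, 2] = X[:, 2] * 3  # 让特征2对异常分数影响更大
model = IsolationForest(random_state=0).fit(X)

def ice_importance(model, X, feature, grid_size=20, decay=0.9):
    grid = np.linspace(X[:, feature].min(), X[:, feature].max(), grid_size)
    deviations = []
    for x in X[:50]:                      # 对前50个样本计算ICE曲线
        Xg = np.tile(x, (grid_size, 1))
        Xg[:, feature] = grid
        ice = model.score_samples(Xg)     # 该样本的ICE曲线
        fading = ice[0]
        dev = 0.0
        for v in ice[1:]:                 # 曲线对自身衰减平均的偏离
            dev += abs(v - fading)
            fading = decay * fading + (1 - decay) * v
        deviations.append(dev / (grid_size - 1))
    return float(np.mean(deviations))

print([round(ice_importance(model, X, j), 4) for j in range(4)])
```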
[LG-4] Decentralized Trust for Space AI: Blockchain-Based Federated Learning Across Multi-Vendor LEO Satellite Networks
链接: https://arxiv.org/abs/2512.08882
作者: Mohamed Elmahallawy,Asma Jodeiri Akbarfam
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:The rise of space AI is reshaping government and industry through applications such as disaster detection, border surveillance, and climate monitoring, powered by massive data from commercial and governmental low Earth orbit (LEO) satellites. Federated satellite learning (FSL) enables joint model training without sharing raw data, but suffers from slow convergence due to intermittent connectivity and introduces critical trust challenges–where biased or falsified updates can arise across satellite constellations, including those injected through cyberattacks on inter-satellite or satellite-ground communication links. We propose OrbitChain, a blockchain-backed framework that empowers trustworthy multi-vendor collaboration in LEO networks. OrbitChain (i) offloads consensus to high-altitude platforms (HAPs) with greater computational capacity, (ii) ensures transparent, auditable provenance of model updates from different orbits owned by different vendors, and (iii) prevents manipulated or incomplete contributions from affecting global FSL model aggregation. Extensive simulations show that OrbitChain reduces computational and communication overhead while improving privacy, security, and global model accuracy. Its permissioned proof-of-authority ledger finalizes over 1000 blocks with sub-second latency (0.16 s, 0.26 s, 0.35 s for 1-of-5, 3-of-5, and 5-of-5 quorums). Moreover, OrbitChain reduces convergence time by up to 30 hours on real satellite datasets compared to single-vendor, demonstrating its effectiveness for real-time, multi-vendor learning. Our code is available at this https URL
[LG-5] Secure and Privacy-Preserving Federated Learning for Next-Generation Underground Mine Safety
链接: https://arxiv.org/abs/2512.08862
作者: Mohamed Elmahallawy,Sanjay Madria,Samuel Frimpong
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:Underground mining operations depend on sensor networks to monitor critical parameters such as temperature, gas concentration, and miner movement, enabling timely hazard detection and safety decisions. However, transmitting raw sensor data to a centralized server for machine learning (ML) model training raises serious privacy and security concerns. Federated Learning (FL) offers a promising alternative by enabling decentralized model training without exposing sensitive local data. Yet, applying FL in underground mining presents unique challenges: (i) Adversaries may eavesdrop on shared model updates to launch model inversion or membership inference attacks, compromising data privacy and operational safety; (ii) Non-IID data distributions across mines and sensor noise can hinder model convergence. To address these issues, we propose FedMining–a privacy-preserving FL framework tailored for underground mining. FedMining introduces two core innovations: (1) a Decentralized Functional Encryption (DFE) scheme that keeps local models encrypted, thwarting unauthorized access and inference attacks; and (2) a balancing aggregation mechanism to mitigate data heterogeneity and enhance convergence. Evaluations on real-world mining datasets demonstrate FedMining’s ability to safeguard privacy while maintaining high model accuracy and achieving rapid convergence with reduced communication and computation overhead. These advantages make FedMining both secure and practical for real-time underground safety monitoring.
[LG-6] Refining Diffusion Models for Motion Synthesis with an Acceleration Loss to Generate Realistic IMU Data
链接: https://arxiv.org/abs/2512.08859
作者: Lars Ole Häusler,Lena Uhlenberg,Göran Köber,Diyora Salimova,Oliver Amft
类目: Machine Learning (cs.LG)
*备注: 7 pages, 3 figures, 1 table
Abstract:We propose a text-to-IMU (inertial measurement unit) motion-synthesis framework to obtain realistic IMU data by fine-tuning a pretrained diffusion model with an acceleration-based second-order loss (L_acc). L_acc enforces consistency in the discrete second-order temporal differences of the generated motion, thereby aligning the diffusion prior with IMU-specific acceleration patterns. We integrate L_acc into the training objective of an existing diffusion model, finetune the model to obtain an IMU-specific motion prior, and evaluate the model with an existing text-to-IMU framework that comprises surface modelling and virtual sensor simulation. We analysed acceleration signal fidelity and differences between synthetic motion representation and actual IMU recordings. As a downstream application, we evaluated Human Activity Recognition (HAR) and compared the classification performance using data of our method with the earlier diffusion model and two additional diffusion model baselines. When we augmented the earlier diffusion model objective with L_acc and continued training, L_acc decreased by 12.7% relative to the original model. The improvements were considerably larger in high-dynamic activities (i.e., running, jumping) compared to low-dynamic activities (i.e., sitting, standing). In a low-dimensional embedding, the synthetic IMU data produced by our refined model shifts closer to the distribution of real IMU recordings. HAR classification trained exclusively on our refined synthetic IMU data improved performance by 8.7% compared to the earlier diffusion model and by 7.6% over the best-performing comparison diffusion model. We conclude that acceleration-aware diffusion refinement provides an effective approach to align motion generation and IMU synthesis and highlights how flexible deep learning pipelines are for specialising generic text-to-motion priors to sensor-specific tasks.
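加速度损失L_acc的核心是对运动序列取离散二阶时间差分并与参考序列对齐,PyTorch最小示意如下(张量形状与权重为示例假设):

```python
# 二阶差分加速度损失 L_acc 的最小示意(PyTorch,形状/超参为假设)
import torch

def second_difference(x: torch.Tensor) -> torch.Tensor:
    """x: (batch, time, dims) 的运动序列,返回离散二阶时间差分。"""
    return x[:, 2:, :] - 2.0 * x[:, 1:-1, :] + x[:, :-2, :]

def acceleration_loss(generated: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    return torch.mean((second_difference(generated) - second_difference(target)) ** 2)

gen = torch.randn(8, 120, 3, requires_grad=True)
ref = torch.randn(8, 120, 3)
loss = acceleration_loss(gen, ref)   # 可与扩散模型原有目标加权相加
loss.backward()
print(float(loss))
```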
[LG-7] Reinforcement Learning From State and Temporal Differences
链接: https://arxiv.org/abs/2512.08855
作者: Lex Weaver,Jonathan Baxter
类目: Machine Learning (cs.LG)
*备注:
Abstract:TD(\lambda) with function approximation has proved empirically successful for some complex reinforcement learning problems. For linear approximation, TD(\lambda) has been shown to minimise the squared error between the approximate value of each state and the true value. However, as far as policy is concerned, it is error in the relative ordering of states that is critical, rather than error in the state values. We illustrate this point, both in simple two-state and three-state systems in which TD(\lambda)–starting from an optimal policy–converges to a sub-optimal policy, and also in backgammon. We then present a modified form of TD(\lambda), called STD(\lambda), in which function approximators are trained with respect to relative state values on binary decision problems. A theoretical analysis, including a proof of monotonic policy improvement for STD(\lambda) in the context of the two-state system, is presented, along with a comparison with Bertsekas’ differential training method [1]. This is followed by successful demonstrations of STD(\lambda) on the two-state system and a variation on the well known acrobot problem.
[LG-8] Forecasting Fails: Unveiling Evasion Attacks in Weather Prediction Models
链接: https://arxiv.org/abs/2512.08832
作者: Huzaifa Arif,Pin-Yu Chen,Alex Gittens,James Diffenderfer,Bhavya Kailkhura
类目: Machine Learning (cs.LG)
*备注:
Abstract:With the increasing reliance on AI models for weather forecasting, it is imperative to evaluate their vulnerability to adversarial perturbations. This work introduces Weather Adaptive Adversarial Perturbation Optimization (WAAPO), a novel framework for generating targeted adversarial perturbations that are both effective in manipulating forecasts and stealthy to avoid detection. WAAPO achieves this by incorporating constraints for channel sparsity, spatial localization, and smoothness, ensuring that perturbations remain physically realistic and imperceptible. Using the ERA5 dataset and FourCastNet (Pathak et al. 2022), we demonstrate WAAPO’s ability to generate adversarial trajectories that align closely with predefined targets, even under constrained conditions. Our experiments highlight critical vulnerabilities in AI-driven forecasting models, where small perturbations to initial conditions can result in significant deviations in predicted weather patterns. These findings underscore the need for robust safeguards to protect against adversarial exploitation in operational forecasting systems.
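定向、隐蔽的扰动优化可概括为“逼近目标轨迹 + 平滑与通道稀疏正则 + L2预算投影”的PGD式循环;下为玩具模型上的示意(各权重与模型均为示例假设,非WAAPO原始实现):

```python
# 带平滑与通道稀疏正则的定向扰动示意(PGD风格,示例假设,非WAAPO实现)
import torch

def craft_perturbation(model, x0, target, steps=50, lr=0.01,
                       budget=0.1, w_smooth=1e-2, w_sparse=1e-2):
    delta = torch.zeros_like(x0, requires_grad=True)   # 扰动,形状 (C, H, W)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        pred = model(x0 + delta)
        attack = torch.mean((pred - target) ** 2)       # 把预报推向目标轨迹
        smooth = torch.mean((delta[:, 1:, :] - delta[:, :-1, :]) ** 2) + \
                 torch.mean((delta[:, :, 1:] - delta[:, :, :-1]) ** 2)
        sparse = delta.flatten(1).norm(dim=1).sum()     # 通道组稀疏
        loss = attack + w_smooth * smooth + w_sparse * sparse
        opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():                           # 投影回全局L2预算球
            n = delta.norm()
            if n > budget:
                delta.mul_(budget / n)
    return delta.detach()

# 玩具“预报模型”:真实场景中应为FourCastNet等
model = torch.nn.Sequential(torch.nn.Flatten(start_dim=0),
                            torch.nn.Linear(3 * 16 * 16, 8))
x0, target = torch.randn(3, 16, 16), torch.zeros(8)
print(float(craft_perturbation(model, x0, target).norm()))
```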
[LG-9] Identifying counterfactual probabilities using bivariate distributions and uplift modeling
链接: https://arxiv.org/abs/2512.08805
作者: Théo Verhelst,Gianluca Bontempi
类目: Machine Learning (cs.LG)
*备注: 7 pages. Submitted to the 34th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning
Abstract:Uplift modeling estimates the causal effect of an intervention as the difference between potential outcomes under treatment and control, whereas counterfactual identification aims to recover the joint distribution of these potential outcomes (e.g., “Would this customer still have churned had we given them a marketing offer?”). This joint counterfactual distribution provides richer information than the uplift but is harder to estimate. However, the two approaches are synergistic: uplift models can be leveraged for counterfactual estimation. We propose a counterfactual estimator that fits a bivariate beta distribution to predicted uplift scores, yielding posterior distributions over counterfactual outcomes. Our approach requires no causal assumptions beyond those of uplift modeling. Simulations show the efficacy of the approach, which can be applied, for example, to the problem of customer churn in telecom, where it reveals insights unavailable to standard ML or uplift models alone.
[LG-10] De novo generation of functional terpene synthases using TpsGPT NEURIPS2025
链接: https://arxiv.org/abs/2512.08772
作者: Hamsini Ramanathan,Roman Bushuiev,Matouš Soldát,Jirí Kohout,Téo Hebra,Joshua David Smith,Josef Sivic,Tomáš Pluskal
类目: Machine Learning (cs.LG)
*备注: 11 pages, 8 figures, Accepted at the NeurIPS 2025 AI for Science and MLSB 2025 workshops
Abstract:Terpene synthases (TPS) are a key family of enzymes responsible for generating the diverse terpene scaffolds that underpin many natural products, including front-line anticancer drugs such as Taxol. However, de novo TPS design through directed evolution is costly and slow. We introduce TpsGPT, a generative model for scalable TPS protein design, built by fine-tuning the protein language model ProtGPT2 on 79k TPS sequences mined from UniProt. TpsGPT generated de novo enzyme candidates in silico and we evaluated them using multiple validation metrics, including EnzymeExplorer classification, ESMFold structural confidence (pLDDT), sequence diversity, CLEAN classification, InterPro domain detection, and Foldseek structure alignment. From an initial pool of 28k generated sequences, we identified seven putative TPS enzymes that satisfied all validation criteria. Experimental validation confirmed TPS enzymatic activity in at least two of these sequences. Our results show that fine-tuning of a protein language model on a carefully curated, enzyme-class-specific dataset, combined with rigorous filtering, can enable the de novo generation of functional, evolutionarily distant enzymes.
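微调主干可用Hugging Face的标准因果语言建模接口勾勒(假设可访问公开检查点nferruz/ProtGPT2;数据、超参与生成设置均为占位示意,非论文完整流水线):

```python
# ProtGPT2 因果语言建模微调示意(Hugging Face Trainer,细节为假设)
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tok = AutoTokenizer.from_pretrained("nferruz/ProtGPT2")
tok.pad_token = tok.eos_token                 # GPT2系分词器默认无pad token
model = AutoModelForCausalLM.from_pretrained("nferruz/ProtGPT2")

seqs = ["MSTLREQ", "MAKVLTD"]                 # 占位:应为清洗后的TPS序列
ds = Dataset.from_dict({"text": seqs}).map(
    lambda b: tok(b["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tps_gpt", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False))
trainer.train()

# 从微调后的模型采样候选序列,再进入pLDDT/CLEAN等过滤环节
inputs = tok("M", return_tensors="pt")
out = model.generate(**inputs, max_length=120, do_sample=True, top_k=50)
print(tok.decode(out[0]))
```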
[LG-11] Learning and Editing Universal Graph Prompt Tuning via Reinforcement Learning KDD2026
链接: https://arxiv.org/abs/2512.08763
作者: Jinfeng Xu,Zheyu Chen,Shuo Yang,Jinze Li,Hewei Wang,Yijie Li,Edith C. H. Ngai
类目: Machine Learning (cs.LG)
*备注: Accepted by KDD 2026
Abstract:Early graph prompt tuning approaches relied on task-specific designs for Graph Neural Networks (GNNs), limiting their adaptability across diverse pre-training strategies. In contrast, another promising line of research has investigated universal graph prompt tuning, which operates directly in the input graph’s feature space and builds a theoretical foundation that universal graph prompt tuning can theoretically achieve an equivalent effect of any prompting function, eliminating dependence on specific pre-training strategies. Recent works propose selective node-based graph prompt tuning to pursue more ideal prompts. However, we argue that selective node-based graph prompt tuning inevitably compromises the theoretical foundation of universal graph prompt tuning. In this paper, we strengthen the theoretical foundation of universal graph prompt tuning by introducing stricter constraints, demonstrating that adding prompts to all nodes is a necessary condition for achieving the universality of graph prompts. To this end, we propose a novel model and paradigm, Learning and Editing Universal GrAph Prompt Tuning (LEAP), which preserves the theoretical foundation of universal graph prompt tuning while pursuing more ideal prompts. Specifically, we first build the basic universal graph prompts to preserve the theoretical foundation and then employ actor-critic reinforcement learning to select nodes and edit prompts. Extensive experiments on graph- and node-level tasks across various pre-training strategies in both full-shot and few-shot scenarios show that LEAP consistently outperforms fine-tuning and other prompt-based approaches.
[LG-12] Neural Ordinary Differential Equations for Simulating Metabolic Pathway Dynamics from Time-Series Multiomics Data
链接: https://arxiv.org/abs/2512.08732
作者: Udesh Habaraduwa,Andrei Lixandru
类目: Machine Learning (cs.LG); Subcellular Processes (q-bio.SC)
*备注:
Abstract:The advancement of human healthspan and bioengineering relies heavily on predicting the behavior of complex biological systems. While high-throughput multiomics data is becoming increasingly abundant, converting this data into actionable predictive models remains a bottleneck. High-capacity, datadriven simulation systems are critical in this landscape; unlike classical mechanistic models restricted by prior knowledge, these architectures can infer latent interactions directly from observational data, allowing for the simulation of temporal trajectories and the anticipation of downstream intervention effects in personalized medicine and synthetic biology. To address this challenge, we introduce Neural Ordinary Differential Equations (NODEs) as a dynamic framework for learning the complex interplay between the proteome and metabolome. We applied this framework to time-series data derived from engineered Escherichia coli strains, modeling the continuous dynamics of metabolic pathways. The proposed NODE architecture demonstrates superior performance in capturing system dynamics compared to traditional machine learning pipelines. Our results show a greater than 90% improvement in root mean squared error over baselines across both Limonene (up to 94.38% improvement) and Isopentenol (up to 97.65% improvement) pathway datasets. Furthermore, the NODE models demonstrated a 1000x acceleration in inference time, establishing them as a scalable, high-fidelity tool for the next generation of metabolic engineering and biological discovery.
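NODE将状态导数参数化为神经网络,再经ODE求解器积分并对整条轨迹回传梯度;torchdiffeq最小示意如下(状态维度与数据为占位假设):

```python
# Neural ODE 最小示意(torchdiffeq,维度与数据均为示例假设)
import torch
from torchdiffeq import odeint

class PathwayODE(torch.nn.Module):
    """将 d(状态)/dt 参数化为MLP;状态可拼接蛋白组+代谢组浓度。"""
    def __init__(self, dim=6):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim, 64), torch.nn.Tanh(), torch.nn.Linear(64, dim))
    def forward(self, t, y):
        return self.net(y)

func = PathwayODE()
y0 = torch.rand(6)                      # 初始浓度(占位)
t = torch.linspace(0.0, 72.0, 9)        # 采样时间点(小时)
pred = odeint(func, y0, t)              # (9, 6) 的预测轨迹
target = torch.rand(9, 6)               # 实测时间序列(占位)
loss = torch.mean((pred - target) ** 2)
loss.backward()                          # 通过求解器反传训练func
print(pred.shape, float(loss))
```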
[LG-13] Exposing Hidden Biases in Text-to-Image Models via Automated Prompt Search
链接: https://arxiv.org/abs/2512.08724
作者: Manos Plitsis,Giorgos Bouritsas,Vassilis Katsouros,Yannis Panagakis
类目: Machine Learning (cs.LG)
*备注:
Abstract:Text-to-image (TTI) diffusion models have achieved remarkable visual quality, yet they have been repeatedly shown to exhibit social biases across sensitive attributes such as gender, race and age. To mitigate these biases, existing approaches frequently depend on curated prompt datasets - either manually constructed or generated with large language models (LLMs) - as part of their training and/or evaluation procedures. Besides the curation cost, this also risks overlooking unanticipated, less obvious prompts that trigger biased generation, even in models that have undergone debiasing. In this work, we introduce Bias-Guided Prompt Search (BGPS), a framework that automatically generates prompts that aim to maximize the presence of biases in the resulting images. BGPS comprises two components: (1) an LLM instructed to produce attribute-neutral prompts and (2) attribute classifiers acting on the TTI’s internal representations that steer the decoding process of the LLM toward regions of the prompt space that amplify the image attributes of interest. We conduct extensive experiments on Stable Diffusion 1.5 and a state-of-the-art debiased model and discover an array of subtle and previously undocumented biases that severely deteriorate fairness metrics. Crucially, the discovered prompts are interpretable, i.e., they may be entered by a typical user, quantitatively improving the perplexity metric compared to a prominent hard prompt optimization counterpart. Our findings uncover TTI vulnerabilities, while BGPS expands the bias search space and can act as a new evaluation tool for bias mitigation.
[LG-14] Gradient-Informed Monte Carlo Fine-Tuning of Diffusion Models for Low-Thrust Trajectory Design
链接: https://arxiv.org/abs/2512.08705
作者: Jannik Graebner,Ryne Beeson
类目: ystems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
[LG-15] An Additive Manufacturing Part Qualification Framework: Transferring Knowledge of Stress-strain Behaviors from Additively Manufactured Polymers to Metals
链接: https://arxiv.org/abs/2512.08699
作者: Chenglong Duan,Dazhong Wu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Part qualification is crucial in additive manufacturing (AM) because it ensures that additively manufactured parts can be consistently produced and reliably used in critical applications. Part qualification aims at verifying that an additively manufactured part meets performance requirements; therefore, predicting the complex stress-strain behaviors of additively manufactured parts is critical. We develop a dynamic time warping (DTW)-transfer learning (TL) framework for additive manufacturing part qualification by transferring knowledge of the stress-strain behaviors of additively manufactured low-cost polymers to metals. Specifically, the framework employs DTW to select a polymer dataset as the source domain that is the most relevant to the target metal dataset. Using a long short-term memory (LSTM) model, four source polymers (i.e., Nylon, PLA, CF-ABS, and Resin) and three target metals (i.e., AlSi10Mg, Ti6Al4V, and carbon steel) that are fabricated by different AM techniques are utilized to demonstrate the effectiveness of the DTW-TL framework. Experimental results show that the DTW-TL framework identifies the closest match between polymers and metals to select one single polymer dataset as the source domain. The DTW-TL model achieves the lowest mean absolute percentage error of 12.41% and highest coefficient of determination of 0.96 when three metals are used as the target domain, respectively, outperforming the vanilla LSTM model without TL as well as the TL model pre-trained on four polymer datasets as the source domain.
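源域选择一步可用经典DTW动态规划实现:对每种聚合物曲线计算与目标金属曲线的DTW距离并取最小者(下为自包含示意,曲线为合成占位数据):

```python
# 经典DTW距离 + 最近源域选择的自包含示意(合成占位数据)
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

strain = np.linspace(0, 1, 80)
polymers = {                          # 合成的“应力-应变”曲线,仅作占位
    "Nylon":  1 - np.exp(-3 * strain),
    "PLA":    1 - np.exp(-5 * strain),
    "CF-ABS": np.tanh(4 * strain),
    "Resin":  strain ** 0.5,
}
target_metal = 1 - np.exp(-4.5 * strain)   # 目标金属曲线(占位)
source = min(polymers, key=lambda k: dtw_distance(polymers[k], target_metal))
print("selected source domain:", source)
```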
[LG-16] DS FedProxGrad: Asymptotic Stationarity Without Noise Floor in Fair Federated Learning
链接: https://arxiv.org/abs/2512.08671
作者: Huzaifa Arif
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 8 pages
Abstract:Recent work [Arif et al.] introduced Federated Proximal Gradient (FedProxGrad) for solving non-convex composite optimization problems in group fair federated learning. However, the original analysis established convergence only to a noise-dominated neighborhood of stationarity, with explicit dependence on a variance-induced noise floor. In this work, we provide an improved asymptotic convergence analysis for a generalized FedProxGrad-type analytical framework with inexact local proximal solutions and explicit fairness regularization. We call this extended analytical framework DS FedProxGrad (Decay Step Size FedProxGrad). Under a Robbins-Monro step-size schedule (Robbins and Monro, 1951) and a mild decay condition on local inexactness, we prove that \liminf_{r\to\infty} \mathbb{E}[\|\nabla F(\mathbf{x}^r)\|^2] = 0, i.e., the algorithm is asymptotically stationary and the convergence rate does not depend on a variance-induced noise floor.
[LG-17] Direct transfer of optimized controllers to similar systems using dimensionless MPC
链接: https://arxiv.org/abs/2512.08667
作者: Josip Kir Hromatko,Shambhuraj Sawant,Šandor Ileš,Sébastien Gros
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注: 7 pages, 4 figures
Abstract:Scaled model experiments are commonly used in various engineering fields to reduce experimentation costs and overcome constraints associated with full-scale systems. The relevance of such experiments relies on dimensional analysis and the principle of dynamic similarity. However, transferring controllers to full-scale systems often requires additional tuning. In this paper, we propose a method to enable a direct controller transfer using dimensionless model predictive control, tuned automatically for closed-loop performance. With this reformulation, the closed-loop behavior of an optimized controller transfers directly to a new, dynamically similar system. Additionally, the dimensionless formulation allows for the use of data from systems of different scales during parameter optimization. We demonstrate the method on a cartpole swing-up and a car racing problem, applying either reinforcement learning or Bayesian optimization for tuning the controller parameters. Software used to obtain the results in this paper is publicly available at this https URL.
[LG-18] Long-Sequence LSTM Modeling for NBA Game Outcome Prediction Using a Novel Multi-Season Dataset
链接: https://arxiv.org/abs/2512.08591
作者: Charles Rios,Longzhen Han,Almas Baimagambetov,Nikolaos Polatidis
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:
Abstract:Predicting the outcomes of professional basketball games, particularly in the National Basketball Association (NBA), has become increasingly important for coaching strategy, fan engagement, and sports betting. However, many existing prediction models struggle with concept drift, limited temporal context, and instability across seasons. To advance forecasting in this domain, we introduce a newly constructed longitudinal NBA dataset covering the 2004-05 to 2024-25 seasons and present a deep learning framework designed to model long-term performance trends. Our primary contribution is a Long Short-Term Memory (LSTM) architecture that leverages an extended sequence length of 9,840 games (equivalent to eight full NBA seasons) to capture evolving team dynamics and season-over-season dependencies. We compare this model against several traditional Machine Learning (ML) and Deep Learning (DL) baselines, including Logistic Regression, Random Forest, Multi-Layer Perceptron (MLP), and Convolutional Neural Network (CNN). The LSTM achieves the best performance across all metrics, with 72.35% accuracy, 73.15% precision, and 76.13% AUC-ROC. These results demonstrate the importance of long-sequence temporal modeling in basketball outcome prediction and highlight the value of our new multi-season dataset for developing robust, generalizable NBA forecasting systems.
[LG-19] Optimal Perturbation Budget Allocation for Data Poisoning in Offline Reinforcement Learning
链接: https://arxiv.org/abs/2512.08485
作者: Junnan Qiu,Jie Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:Offline Reinforcement Learning (RL) enables policy optimization from static datasets but is inherently vulnerable to data poisoning attacks. Existing attack strategies typically rely on locally uniform perturbations, which treat all samples indiscriminately. This approach is inefficient, as it wastes the perturbation budget on low-impact samples, and lacks stealthiness due to significant statistical deviations. In this paper, we propose a novel Global Budget Allocation attack strategy. Leveraging the theoretical insight that a sample’s influence on value function convergence is proportional to its Temporal Difference (TD) error, we formulate the attack as a global resource allocation problem. We derive a closed-form solution where perturbation magnitudes are assigned proportional to the TD-error sensitivity under a global L2 constraint. Empirical results on D4RL benchmarks demonstrate that our method significantly outperforms baseline strategies, achieving up to 80% performance degradation with minimal perturbations that evade detection by state-of-the-art statistical and spectral defenses.
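在全局L2约束 Σεᵢ² ≤ B² 下最大化 Σ|δᵢ|·εᵢ,由Cauchy-Schwarz不等式得闭式解 εᵢ = B·|δᵢ|/‖δ‖₂,即扰动幅度正比于TD误差。numpy示意如下(非论文完整攻击流程):

```python
# TD误差敏感度下的全局L2扰动预算闭式分配示意(非论文完整攻击流程)
import numpy as np

def allocate_budget(td_errors: np.ndarray, budget: float) -> np.ndarray:
    """max Σ|δ_i|·ε_i  s.t. Σε_i² ≤ B² 的闭式解:ε_i = B·|δ_i|/‖δ‖₂。"""
    s = np.abs(td_errors)
    norm = np.linalg.norm(s)
    if norm == 0:
        return np.zeros_like(s)
    return budget * s / norm

td = np.array([0.05, 1.2, 0.3, 2.5, 0.01])   # 各样本的TD误差(示例数值)
eps = allocate_budget(td, budget=1.0)
print(eps, np.linalg.norm(eps))   # 总范数恰为预算;高TD误差样本获得更大扰动
```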
[LG-20] Solving Over-Smoothing in GNNs via Nonlocal Message Passing: Algebraic Smoothing and Depth Scalability
链接: https://arxiv.org/abs/2512.08475
作者: Weiqi Guan,Junlin He
类目: Machine Learning (cs.LG)
*备注: 18 pages, 4 figures
Abstract:The relationship between Layer Normalization (LN) placement and the over-smoothing phenomenon remains underexplored. We identify a critical dilemma: Pre-LN architectures avoid over-smoothing but suffer from the curse of depth, while Post-LN architectures bypass the curse of depth but experience over-smoothing. To resolve this, we propose a new method based on Post-LN that induces algebraic smoothing, preventing over-smoothing without the curse of depth. Empirical results across five benchmarks demonstrate that our approach supports deeper networks (up to 256 layers) and improves performance, requiring no additional parameters. Key contributions: (1) Theoretical Characterization: analysis of LN dynamics and their impact on over-smoothing and the curse of depth. (2) A Principled Solution: a parameter-efficient method that induces algebraic smoothing and avoids over-smoothing and the curse of depth. (3) Empirical Validation: extensive experiments showing the effectiveness of the method in deeper GNNs.
[LG-21] Transformers for Multimodal Brain State Decoding: Integrating Functional Magnetic Resonance Imaging Data and Medical Metadata
链接: https://arxiv.org/abs/2512.08462
作者: Danial Jafarzadeh Jazi,Maryam Hajiesmaeili
类目: Machine Learning (cs.LG)
*备注:
Abstract:Decoding brain states from functional magnetic resonance imaging (fMRI) data is vital for advancing neuroscience and clinical applications. While traditional machine learning and deep learning approaches have made strides in leveraging the high-dimensional and complex nature of fMRI data, they often fail to utilize the contextual richness provided by Digital Imaging and Communications in Medicine (DICOM) metadata. This paper presents a novel framework integrating transformer-based architectures with multimodal inputs, including fMRI data and DICOM metadata. By employing attention mechanisms, the proposed method captures intricate spatial-temporal patterns and contextual relationships, enhancing model accuracy, interpretability, and robustness. The potential of this framework spans applications in clinical diagnostics, cognitive neuroscience, and personalized medicine. Limitations, such as metadata variability and computational demands, are addressed, and future directions for optimizing scalability and generalizability are discussed.
[LG-22] Fully Decentralized Certified Unlearning
链接: https://arxiv.org/abs/2512.08443
作者: Hithem Lamri,Michail Maniatakos
类目: Machine Learning (cs.LG)
*备注:
Abstract:Machine unlearning (MU) seeks to remove the influence of specified data from a trained model in response to privacy requests or data poisoning. While certified unlearning has been analyzed in centralized and server-orchestrated federated settings (via guarantees analogous to differential privacy, DP), the decentralized setting – where peers communicate without a coordinator – remains underexplored. We study certified unlearning in decentralized networks with fixed topologies and propose RR-DU, a random-walk procedure that performs one projected gradient ascent step on the forget set at the unlearning client and a geometrically distributed number of projected descent steps on the retained data elsewhere, combined with subsampled Gaussian noise and projection onto a trust region around the original model. We provide (i) convergence guarantees in the convex case and stationarity guarantees in the nonconvex case, (ii) (\varepsilon,\delta) network-unlearning certificates on client views via subsampled Gaussian Rényi DP (RDP) with segment-level subsampling, and (iii) deletion-capacity bounds that scale with the forget-to-local data ratio and quantify the effect of decentralization (network mixing and randomized subsampling) on the privacy–utility trade-off. Empirically, on image benchmarks (MNIST, CIFAR-10), RR-DU matches a given (\varepsilon,\delta) while achieving higher test accuracy than decentralized DP baselines and reducing forget accuracy to random guessing (approximately 10%).
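RR-DU的单轮遗忘可概括为:遗忘客户端做一步投影梯度上升,再在保留数据上做几何分布次数的带噪下降步,并始终投影回以原模型为中心的信任域。参数向量层面的numpy示意如下(损失、步长、噪声与预算均为示例假设):

```python
# RR-DU式遗忘更新的参数层示意(numpy,步长/噪声/预算均为示例假设)
import numpy as np

rng = np.random.default_rng(7)

def grad(theta, data):                 # 占位:二次损失 0.5·||θ-μ||² 的梯度
    return theta - data.mean(axis=0)

def project(theta, center, radius):    # 投影回以原模型为中心的信任域
    d = theta - center
    n = np.linalg.norm(d)
    return center + d * (radius / n) if n > radius else theta

theta0 = rng.normal(size=16)           # 原始模型参数
forget = rng.normal(1.0, 1.0, (32, 16))
retain = rng.normal(0.0, 1.0, (256, 16))

theta = project(theta0 + 0.1 * grad(theta0, forget), theta0, 1.0)  # 一步上升
K = rng.geometric(p=0.25)              # 几何分布的下降步数
for _ in range(K):
    batch = retain[rng.choice(len(retain), 64, replace=False)]     # 子采样
    noisy_g = grad(theta, batch) + 0.01 * rng.normal(size=16)      # 高斯噪声
    theta = project(theta - 0.05 * noisy_g, theta0, 1.0)
print(K, np.linalg.norm(theta - theta0))
```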
[LG-23] Beyond Wave Variables: A Data-Driven Ensemble Approach for Enhanced Teleoperation Transparency and Stability
链接: https://arxiv.org/abs/2512.08436
作者: Nour Mitiche,Farid Ferguene,Mourad Oussalah
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注: 14 pages, 8 figures, 5 tables
Abstract:Time delays in communication channels present significant challenges for bilateral teleoperation systems, affecting both transparency and stability. Although traditional wave variable-based methods for a four-channel architecture ensure stability via passivity, they remain vulnerable to wave reflections and disturbances like variable delays and environmental noise. This article presents a data-driven hybrid framework that replaces the conventional wave-variable transform with an ensemble of three advanced sequence models, each optimized separately via the state-of-the-art Optuna optimizer, and combined through a stacking meta-learner. The base predictors include an LSTM augmented with Prophet for trend correction, an LSTM-based feature extractor paired with clustering and a random forest for improved regression, and a CNN-LSTM model for localized and long-term dynamics. Experimental validation was performed in Python using data generated from the baseline system implemented in MATLAB/Simulink. The results show that our optimized ensemble achieves a transparency comparable to the baseline wave-variable system under varying delays and noise, while ensuring stability through passivity constraints.
[LG-24] A Multivariate Bernoulli-Based Sampling Method for Multi-Label Data with Application to Meta-Research
链接: https://arxiv.org/abs/2512.08371
作者: Simon Chung,Colby J. Vorland,Donna L. Maney,Andrew W. Brown
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Datasets may contain observations with multiple labels. If the labels are not mutually exclusive, and if the labels vary greatly in frequency, obtaining a sample that includes sufficient observations with scarcer labels to make inferences about those labels, and which deviates from the population frequencies in a known manner, creates challenges. In this paper, we consider a multivariate Bernoulli distribution as our underlying distribution of a multi-label problem. We present a novel sampling algorithm that takes label dependencies into account. It uses observed label frequencies to estimate multivariate Bernoulli distribution parameters and calculate weights for each label combination. This approach ensures the weighted sampling acquires target distribution characteristics while accounting for label dependencies. We applied this approach to a sample of research articles from Web of Science labeled with 64 biomedical topic categories. We aimed to preserve category frequency order, reduce frequency differences between most and least common categories, and account for category dependencies. This approach produced a more balanced sub-sample, enhancing the representation of minority categories.
[LG-25] Magneton: Optimizing Energy Efficiency of ML Systems via Differential Energy Debugging
链接: https://arxiv.org/abs/2512.08365
作者: Yi Pan,Wenbo Qian,Dedong Xie,Ruiyan Hu,Yigong Hu,Baris Kasikci
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 12 pages, 10 figures
Abstract:The training and deployment of machine learning (ML) models have become extremely energy-intensive. While existing optimization efforts focus primarily on hardware energy efficiency, a significant but overlooked source of inefficiency is software energy waste caused by poor software design. This often includes redundant or poorly designed operations that consume more energy without improving performance. These inefficiencies arise in widely used ML frameworks and applications, yet developers often lack the visibility and tools to detect and diagnose them. We propose differential energy debugging, a novel approach that leverages the observation that competing ML systems often implement similar functionality with vastly different energy consumption. Building on this insight, we design and implement Magneton, an energy profiler that compares energy consumption between similar ML systems at the operator level and automatically pinpoints code regions and configuration choices responsible for excessive energy use. Applied to 9 popular ML systems spanning LLM inference, general ML frameworks, and image generation, Magneton detects and diagnoses 16 known cases of software energy inefficiency and further discovers 8 previously unknown cases, 7 of which have been confirmed by developers.
[LG-26] Multi-Agent Deep Reinforcement Learning for Collaborative UAV Relay Networks under Jamming Attacks
链接: https://arxiv.org/abs/2512.08341
作者: Thai Duong Nguyen,Ngoc-Tan Nguyen,Thanh-Dao Nguyen,Nguyen Van Huynh,Dinh-Hieu Tran,Symeon Chatzinotas
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注: IEEE ICC 2026
Abstract:The deployment of Unmanned Aerial Vehicle (UAV) swarms as dynamic communication relays is critical for next-generation tactical networks. However, operating in contested environments requires solving a complex trade-off, including maximizing system throughput while ensuring collision avoidance and resilience against adversarial jamming. Existing heuristic-based approaches often struggle to find effective solutions due to the dynamic and multi-objective nature of this problem. This paper formulates this challenge as a cooperative Multi-Agent Reinforcement Learning (MARL) problem, solved using the Centralized Training with Decentralized Execution (CTDE) framework. Our approach employs a centralized critic that uses global state information to guide decentralized actors which operate using only local observations. Simulation results show that our proposed framework significantly outperforms heuristic baselines, increasing the total system throughput by approximately 50% while simultaneously achieving a near-zero collision rate. A key finding is that the agents develop an emergent anti-jamming strategy without explicit programming. They learn to intelligently position themselves to balance the trade-off between mitigating interference from jammers and maintaining effective communication links with ground users.
[LG-27] Minimizing Layerwise Activation Norm Improves Generalization in Federated Learning WACV2024
链接: https://arxiv.org/abs/2512.08314
作者: M Yashwanth,Gaurav Kumar Nayak,Harsh Rangwani,Arya Singh,R. Venkatesh Babu,Anirban Chakraborty
类目: Machine Learning (cs.LG)
*备注: Accepted to WACV 2024
Abstract:Federated Learning (FL) is an emerging machine learning framework that enables multiple clients (coordinated by a server) to collaboratively train a global model by aggregating the locally trained models without sharing any client’s training data. It has been observed in recent works that learning in a federated manner may lead the aggregated global model to converge to a ‘sharp minimum’ thereby adversely affecting the generalizability of this FL-trained model. Therefore, in this work, we aim to improve the generalization performance of models trained in a federated setup by introducing a ‘flatness’ constrained FL optimization problem. This flatness constraint is imposed on the top eigenvalue of the Hessian computed from the training loss. As each client trains a model on its local data, we further re-formulate this complex problem utilizing the client loss functions and propose a new computationally efficient regularization technique, dubbed ‘MAN,’ which Minimizes Activation’s Norm of each layer on client-side models. We also theoretically show that minimizing the activation norm reduces the top eigenvalue of the layer-wise Hessian of the client’s loss, which in turn decreases the overall Hessian’s top eigenvalue, ensuring convergence to a flat minimum. We apply our proposed flatness-constrained optimization to the existing FL techniques and obtain significant improvements, thereby establishing new state-of-the-art.
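MAN正则可通过前向钩子收集各层激活范数并计入本地损失;PyTorch最小示意如下(网络结构与系数λ为示例假设):

```python
# MAN:逐层激活范数正则的前向钩子示意(PyTorch,模型/系数为假设)
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(20, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 10))

activation_norms = []
def hook(module, inputs, output):
    activation_norms.append(output.norm(p=2, dim=1).mean())

for layer in model:
    if isinstance(layer, torch.nn.Linear):
        layer.register_forward_hook(hook)

x, y = torch.randn(32, 20), torch.randint(0, 10, (32,))
activation_norms.clear()
logits = model(x)
ce = torch.nn.functional.cross_entropy(logits, y)
man = torch.stack(activation_norms).sum()     # 各层激活范数之和
loss = ce + 1e-3 * man                        # 本地目标 = 任务损失 + λ·MAN
loss.backward()
print(float(ce), float(man))
```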
[LG-28] Jacobian Aligned Random Forests
链接: https://arxiv.org/abs/2512.08306
作者: Sarwesh Rauniyar
类目: Machine Learning (cs.LG)
*备注:
Abstract:Axis-aligned decision trees are fast and stable but struggle on datasets with rotated or interaction-dependent decision boundaries, where informative splits require linear combinations of features rather than single-feature thresholds. Oblique forests address this with per-node hyperplane splits, but at added computational cost and implementation complexity. We propose a simple alternative: JARF, Jacobian-Aligned Random Forests. Concretely, we first fit an axis-aligned forest to estimate class probabilities or regression outputs, compute finite-difference gradients of these predictions with respect to each feature, aggregate them into an expected Jacobian outer product that generalizes the expected gradient outer product (EGOP), and use it as a single global linear preconditioner for all inputs. This supervised preconditioner applies a single global rotation of the feature space, then hands the transformed data back to a standard axis-aligned forest, preserving off-the-shelf training pipelines while capturing oblique boundaries and feature interactions that would otherwise require many axis-aligned splits to approximate. The same construction applies to any model that provides gradients, though we focus on random forests and gradient-boosted trees in this work. On tabular classification and regression benchmarks, this preconditioning consistently improves axis-aligned forests and often matches or surpasses oblique baselines while improving training time. Our experimental results and theoretical analysis together indicate that supervised preconditioning can recover much of the accuracy of oblique forests while retaining the simplicity and robustness of axis-aligned trees.
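JARF的预处理分四步:拟合轴对齐森林、有限差分估计预测梯度、聚合期望雅可比外积、用其特征向量全局旋转特征后重训。numpy + sklearn示意如下(非作者实现):

```python
# JARF式“期望雅可比外积”全局预处理的示意(numpy+sklearn,非作者实现)
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(600, 2))
y = ((X[:, 0] + X[:, 1]) > 0).astype(int)          # 斜向(oblique)决策边界

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

def prediction_gradients(model, X, eps=0.1):
    """对类别1概率做中心有限差分,得到 (n, d) 的预测梯度矩阵。"""
    n, d = X.shape
    G = np.zeros((n, d))
    for j in range(d):
        Xp, Xm = X.copy(), X.copy()
        Xp[:, j] += eps
        Xm[:, j] -= eps
        G[:, j] = (model.predict_proba(Xp)[:, 1]
                   - model.predict_proba(Xm)[:, 1]) / (2 * eps)
    return G

G = prediction_gradients(rf, X)
M = G.T @ G / len(X)               # 期望雅可比外积(EGOP的推广)
_, vecs = np.linalg.eigh(M)        # 特征向量给出全局旋转
X_rot = X @ vecs
rf2 = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_rot, y)
print("top direction (符号可能相反,应接近(1,1)/√2):", np.round(vecs[:, -1], 2))
```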
[LG-29] Probabilistic Multi-Agent Aircraft Landing Time Prediction
链接: https://arxiv.org/abs/2512.08281
作者: Kyungmin Kim,Seokbin Yoon,Keumjin Lee
类目: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
*备注: 13 pages, 8 figures, accepted at AIAA SciTech 2026
Abstract:Accurate and reliable aircraft landing time prediction is essential for effective resource allocation in air traffic management. However, the inherent uncertainty of aircraft trajectories and traffic flows poses significant challenges to both prediction accuracy and trustworthiness. Therefore, prediction models should not only provide point estimates of aircraft landing times but also the uncertainties associated with these predictions. Furthermore, aircraft trajectories are frequently influenced by the presence of nearby aircraft through air traffic control interventions such as radar vectoring. Consequently, landing time prediction models must account for multi-agent interactions in the airspace. In this work, we propose a probabilistic multi-agent aircraft landing time prediction framework that provides the landing times of multiple aircraft as distributions. We evaluate the proposed framework using an air traffic surveillance dataset collected from the terminal airspace of the Incheon International Airport in South Korea. The results demonstrate that the proposed model achieves higher prediction accuracy than the baselines and quantifies the associated uncertainties of its outcomes. In addition, the model uncovered underlying patterns in air traffic control through its attention scores, thereby enhancing explainability.
[LG-30] FedLAD: A Modular and Adaptive Testbed for Federated Log Anomaly Detection
链接: https://arxiv.org/abs/2512.08277
作者: Yihan Liao,Jacky Keung,Zhenyu Mao,Jingyu Zhang,Jialong Li
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注: Accepted Artifact at ACSOS 2025
Abstract:Log-based anomaly detection (LAD) is critical for ensuring the reliability of large-scale distributed systems. However, most existing LAD approaches assume centralized training, which is often impractical due to privacy constraints and the decentralized nature of system logs. While federated learning (FL) offers a promising alternative, there is a lack of dedicated testbeds tailored to the needs of LAD in federated settings. To address this, we present FedLAD, a unified platform for training and evaluating LAD models under FL constraints. FedLAD supports plug-and-play integration of diverse LAD models, benchmark datasets, and aggregation strategies, while offering runtime support for validation logging (self-monitoring), parameter tuning (self-configuration), and adaptive strategy control (self-adaptation). By enabling reproducible and scalable experimentation, FedLAD bridges the gap between FL frameworks and LAD requirements, providing a solid foundation for future research. Project code is publicly available at: this https URL.
[LG-31] gHAWK: Local and Global Structure Encoding for Scalable Training of Graph Neural Networks on Knowledge Graphs
链接: https://arxiv.org/abs/2512.08274
作者: Humera Sabir,Fatima Farooq,Ashraf Aboulnaga
类目: Machine Learning (cs.LG); Databases (cs.DB)
*备注:
Abstract:Knowledge Graphs (KGs) are a rich source of structured, heterogeneous data, powering a wide range of applications. A common approach to leverage this data is to train a graph neural network (GNN) on the KG. However, existing message-passing GNNs struggle to scale to large KGs because they rely on the iterative message passing process to learn the graph structure, which is inefficient, especially under mini-batch training, where a node sees only a partial view of its neighborhood. In this paper, we address this problem and present gHAWK, a novel and scalable GNN training framework for large KGs. The key idea is to precompute structural features for each node that capture its local and global structure before GNN training even begins. Specifically, gHAWK introduces a preprocessing step that computes: (a) Bloom filters to compactly encode local neighborhood structure, and (b) TransE embeddings to represent each node’s global position in the graph. These features are then fused with any domain-specific features (e.g., text embeddings), producing a node feature vector that can be incorporated into any GNN technique. By augmenting message-passing training with structural priors, gHAWK significantly reduces memory usage, accelerates convergence, and improves model accuracy. Extensive experiments on large datasets from the Open Graph Benchmark (OGB) demonstrate that gHAWK achieves state-of-the-art accuracy and lower training time on both node property prediction and link prediction tasks, topping the OGB leaderboard for three graphs.
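Of the two precomputed structural features, the Bloom-filter neighborhood encoding is simple to illustrate. A toy sketch is below; the bit width, hash scheme, and feature fusion are assumptions, and the TransE half of the recipe is omitted.

```python
# Toy Bloom-filter encoding of a node's local neighborhood.
import hashlib
import numpy as np

def bloom_encode(neighbor_ids, n_bits=64, n_hashes=3):
    bits = np.zeros(n_bits, dtype=np.float32)
    for nid in neighbor_ids:
        for h in range(n_hashes):
            digest = hashlib.sha1(f"{nid}-{h}".encode()).hexdigest()
            bits[int(digest, 16) % n_bits] = 1.0
    return bits

# Fuse the structural bits with any domain feature (e.g., a text embedding).
neighbors_of_v = [3, 17, 42]                               # hypothetical neighbor IDs
text_embedding = np.random.randn(128).astype(np.float32)   # placeholder
node_feature = np.concatenate([bloom_encode(neighbors_of_v), text_embedding])
```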
[LG-32] SOFA-FL: Self-Organizing Hierarchical Federated Learning with Adaptive Clustered Data Sharing
链接: https://arxiv.org/abs/2512.08267
作者: Yi Ni,Xinkun Wang,Han Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Federated Learning (FL) faces significant challenges in evolving environments, particularly regarding data heterogeneity and the rigidity of fixed network topologies. To address these issues, this paper proposes \textbf{SOFA-FL} (Self-Organizing Hierarchical Federated Learning with Adaptive Clustered Data Sharing), a novel framework that enables hierarchical federated systems to self-organize and adapt over time. The framework is built upon three core mechanisms: (1) \textbf{Dynamic Multi-branch Agglomerative Clustering (DMAC)}, which constructs an initial efficient hierarchical structure; (2) \textbf{Self-organizing Hierarchical Adaptive Propagation and Evolution (SHAPE)}, which allows the system to dynamically restructure its topology through atomic operations – grafting, pruning, consolidation, and purification – to adapt to changes in data distribution; and (3) \textbf{Adaptive Clustered Data Sharing}, which mitigates data heterogeneity by enabling controlled partial data exchange between clients and cluster nodes. By integrating these mechanisms, SOFA-FL effectively captures dynamic relationships among clients and enhances personalization capabilities without relying on predetermined cluster structures.
[LG-33] Mathematical Foundations of Neural Tangents and Infinite-Width Networks
链接: https://arxiv.org/abs/2512.08264
作者: Rachana Mysore,Preksha Girish,Kavitha Jayaram,Shrey Kumar,Shravan Sanjeev Bagal,Shreya Aravind Shastry
类目: Machine Learning (cs.LG)
*备注: 7 pages, 2 figures
Abstract:We investigate the mathematical foundations of neural networks in the infinite-width regime through the Neural Tangent Kernel (NTK). We propose the NTK-Eigenvalue-Controlled Residual Network (NTK-ECRN), an architecture integrating Fourier feature embeddings, residual connections with layerwise scaling, and stochastic depth to enable rigorous analysis of kernel evolution during training. Our theoretical contributions include deriving bounds on NTK dynamics, characterizing eigenvalue evolution, and linking spectral properties to generalization and optimization stability. Empirical results on synthetic and benchmark datasets validate the predicted kernel behavior and demonstrate improved training stability and generalization. This work provides a comprehensive framework bridging infinite-width theory and practical deep-learning architectures.
[LG-34] Geometric-Stochastic Multimodal Deep Learning for Predictive Modeling of SUDEP and Stroke Vulnerability
链接: https://arxiv.org/abs/2512.08257
作者: Preksha Girish,Rachana Mysore,Mahanthesha U,Shrey Kumar,Misbah Fatimah Annigeri,Tanish Jain
类目: Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注: 7 pages, 3 figures
Abstract:Sudden Unexpected Death in Epilepsy (SUDEP) and acute ischemic stroke are life-threatening conditions involving complex interactions across cortical, brainstem, and autonomic systems. We present a unified geometric-stochastic multimodal deep learning framework that integrates EEG, ECG, respiration, SpO2, EMG, and fMRI signals to model SUDEP and stroke vulnerability. The approach combines Riemannian manifold embeddings, Lie-group invariant feature representations, fractional stochastic dynamics, Hamiltonian energy-flow modeling, and cross-modal attention mechanisms. Stroke propagation is modeled using fractional epidemic diffusion over structural brain graphs. Experiments on the MULTI-CLARID dataset demonstrate improved predictive accuracy and interpretable biomarkers derived from manifold curvature, fractional memory indices, attention entropy, and diffusion centrality. The proposed framework provides a mathematically principled foundation for early detection, risk stratification, and interpretable multimodal modeling in neural-autonomic disorders.
[LG-35] Wavelet-Accelerated Physics-Informed Quantum Neural Network for Multiscale Partial Differential Equations
链接: https://arxiv.org/abs/2512.08256
作者: Deepak Gupta,Himanshu Pandey,Ratikanta Behera
类目: Machine Learning (cs.LG); Analysis of PDEs (math.AP); Quantum Algebra (math.QA)
*备注:
Abstract:This work proposes a wavelet-based physics-informed quantum neural network framework to efficiently address multiscale partial differential equations that involve sharp gradients, stiffness, rapid local variations, and highly oscillatory behavior. Traditional physics-informed neural networks (PINNs) have demonstrated substantial potential in solving differential equations, and their quantum counterparts, quantum-PINNs, exhibit enhanced representational capacity with fewer trainable parameters. However, both approaches face notable challenges in accurately solving multiscale features. Furthermore, their reliance on automatic differentiation for constructing loss functions introduces considerable computational overhead, resulting in longer training times. To overcome these challenges, we developed a wavelet-accelerated physics-informed quantum neural network that eliminates the need for automatic differentiation, significantly reducing computational complexity. The proposed framework incorporates the multiresolution property of wavelets within the quantum neural network architecture, thereby enhancing the network’s ability to effectively capture both local and global features of multiscale problems. Numerical experiments demonstrate that our proposed method achieves superior accuracy while requiring less than five percent of the trainable parameters compared to classical wavelet-based PINNs, resulting in faster convergence. Moreover, it offers a speedup of three to five times compared to existing quantum PINNs, highlighting the potential of the proposed approach for efficiently solving challenging multiscale and oscillatory problems.
[LG-36] SPROCKET: Extending ROCKET to Distance-Based Time-Series Transformations With Prototypes
链接: https://arxiv.org/abs/2512.08246
作者: Nicholas Harner
类目: Machine Learning (cs.LG)
*备注: 63 Pages, 28 in main body with 3 appendices for supplemental figures
Abstract:Classical Time Series Classification algorithms are dominated by feature engineering strategies. One of the most prominent of these transforms is ROCKET, which achieves strong performance through random kernel features. We introduce SPROCKET (Selected Prototype Random Convolutional Kernel Transform), which implements a new feature engineering strategy based on prototypes. On a majority of the UCR and UEA Time Series Classification archives, SPROCKET achieves performance comparable to existing convolutional algorithms, and the average accuracy ranking of the new MR-HY-SP (MultiROCKET-HYDRA-SPROCKET) ensemble exceeds that of HYDRA-MR, the previous best convolutional ensemble. These experimental results demonstrate that prototype-based feature transformation can enhance both accuracy and robustness in time series classification.
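The abstract does not spell out the transform, but the general idea of prototype-based features can be sketched quickly: distances from each series to a set of prototypes become the feature vector for a linear classifier, ROCKET-style. Random prototype selection and Euclidean distance here are assumptions; the paper's selection strategy may differ.

```python
# Rough prototype-distance transform for time series classification.
import numpy as np
from sklearn.linear_model import RidgeClassifierCV

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 150))   # 200 series of length 150 (synthetic)
y = rng.integers(0, 2, 200)

n_protos = 20
protos = X[rng.choice(len(X), n_protos, replace=False)]

# Feature j = distance from a series to prototype j.
features = np.linalg.norm(X[:, None, :] - protos[None, :, :], axis=2)
clf = RidgeClassifierCV(alphas=np.logspace(-3, 3, 10)).fit(features, y)
```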
[LG-37] Persistent Topological Structures and Cohomological Flows as a Mathematical Framework for Brain-Inspired Representation Learning
链接: https://arxiv.org/abs/2512.08241
作者: Preksha Girish,Rachana Mysore,Mahanthesha U,Shrey Kumar,Shipra Prashant
类目: Machine Learning (cs.LG)
*备注: 6 pages, 2 figures
Abstract:This paper presents a mathematically rigorous framework for brain-inspired representation learning founded on the interplay between persistent topological structures and cohomological flows. Neural computation is reformulated as the evolution of cochain maps over dynamic simplicial complexes, enabling representations that capture invariants across temporal, spatial, and functional brain states. The proposed architecture integrates algebraic topology with differential geometry to construct cohomological operators that generalize gradient-based learning within a homological landscape. Synthetic data with controlled topological signatures and real neural datasets are jointly analyzed using persistent homology, sheaf cohomology, and spectral Laplacians to quantify stability, continuity, and structural preservation. Empirical results demonstrate that the model achieves superior manifold consistency and noise resilience compared to graph neural and manifold-based deep architectures, establishing a coherent mathematical foundation for topology-driven representation learning.
[LG-38] Correction of Decoupled Weight Decay
链接: https://arxiv.org/abs/2512.08217
作者: Jason Chuan-Chih Chou
类目: Machine Learning (cs.LG)
*备注:
Abstract:Decoupled weight decay, solely responsible for the performance advantage of AdamW over Adam, has long been set proportional to the learning rate \gamma without question. Some researchers have recently challenged this assumption, arguing based on orthogonality considerations at steady state that decoupled weight decay should instead be set \propto \gamma^2 . To the contrary, we find that eliminating the contribution of the perpendicular component of the update to the weight norm leads to little change in the training dynamics. Instead, we derive that decoupled weight decay \propto \gamma^2 results in a stable weight norm based on the simple assumption that updates become independent of the weights at steady state, regardless of the nature of the optimizer. Based on the same assumption, we derive and empirically verify that the Total Update Contribution (TUC) of a minibatch under the Scion optimizer is better characterized by the momentum-dependent effective learning rate, whose optimal value transfers. We further show that decoupled weight decay \propto \gamma^2 leads to stable weight and gradient norms, allowing us to better control the training dynamics and improve model performance.
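The practical takeaway (set the decoupled decay proportional to \gamma^2 rather than \gamma ) is easy to wire into a training loop. A minimal PyTorch sketch, where the constant `c` and the cosine schedule are assumptions:

```python
# Keep AdamW's decoupled weight decay tied to lr**2 across a LR schedule.
import torch

model = torch.nn.Linear(10, 10)
base_lr, c = 1e-3, 1e4   # wd = c * lr**2 = 1e-2 at the base LR (assumed)

opt = torch.optim.AdamW(model.parameters(), lr=base_lr,
                        weight_decay=c * base_lr ** 2)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=100)

for step in range(100):
    loss = model(torch.randn(4, 10)).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    sched.step()
    # Re-tie weight decay to the current LR so that wd stays proportional
    # to lr**2 throughout training.
    for group in opt.param_groups:
        group["weight_decay"] = c * group["lr"] ** 2
```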
[LG-39] MobileFineTuner: A Unified End-to-End Framework for Fine-Tuning LLMs on Mobile Phones
链接: https://arxiv.org/abs/2512.08211
作者: Jiaxiang Geng,Lunyu Zhao,Yiyi Lu,Bing Luo
类目: Machine Learning (cs.LG)
*备注: 15 pages, 9 figures, submitted to Mobisys 2026
Abstract:Mobile phones are the most ubiquitous end devices, generating vast amounts of human-authored data and serving as the primary platform for end-side applications. As high-quality public data for large language models (LLMs) approaches exhaustion, on-device fine-tuning provides an opportunity to leverage private user data while preserving privacy. However, existing approaches are predominantly simulation-based or rely on IoT devices and PCs, leaving commodity mobile phones largely unexplored. A key gap is the absence of an open-source framework that enables practical LLM fine-tuning on mobile phones. We present MobileFineTuner, a unified open-source framework that enables end-to-end LLM fine-tuning directly on commodity mobile phones. MobileFineTuner is designed for efficiency, scalability, and usability, supporting full-parameters fine-tuning (Full-FT) and parameter-efficient fine-tuning (PEFT). To address the memory and energy limitations inherent to mobile phones, we introduce system-level optimizations including parameter sharding, gradient accumulation, and energy-aware computation scheduling. We demonstrate the practicality of MobileFineTuner by fine-tuning GPT-2, Gemma 3, and Qwen 2.5 on real mobile phones. Extensive experiments and ablation studies validate the effectiveness of the proposed optimizations and establish MobileFineTuner as a viable foundation for future research on on-device LLM training.
[LG-40] PolyLingua: Margin-based Inter-class Transformer for Robust Cross-domain Language Detection
链接: https://arxiv.org/abs/2512.08143
作者: Ali Lotfi Rezaabad,Bikram Khanal,Shashwat Chaurasia,Lu Zeng,Dezhi Hong,Hossein Beshashati,Thomas Butler,Megan Ganji
类目: Machine Learning (cs.LG)
*备注:
Abstract:Language identification is a crucial first step in multilingual systems such as chatbots and virtual assistants, enabling linguistically and culturally accurate user experiences. Errors at this stage can cascade into downstream failures, setting a high bar for accuracy. Yet, existing language identification tools struggle with key cases–such as music requests where the song title and user language differ. Open-source tools like LangDetect and FastText are fast but less accurate, while large language models, though effective, are often too costly for low-latency or low-resource settings. We introduce PolyLingua, a lightweight Transformer-based model for in-domain language detection and fine-grained language classification. It employs a two-level contrastive learning framework combining instance-level separation and class-level alignment with adaptive margins, yielding compact and well-separated embeddings even for closely related languages. Evaluated on two challenging datasets–Amazon Massive (multilingual digital assistant utterances) and a Song dataset (music requests with frequent code-switching)–PolyLingua achieves 99.25% F1 and 98.15% F1, respectively, surpassing Sonnet 3.5 while using 10x fewer parameters, making it ideal for compute- and latency-constrained environments.
[LG-41] Robust Agents in Open-Ended Worlds
链接: https://arxiv.org/abs/2512.08139
作者: Mikayel Samvelyan
类目: Machine Learning (cs.LG)
*备注: PhD Thesis
Abstract:The growing prevalence of artificial intelligence (AI) in various applications underscores the need for agents that can successfully navigate and adapt to an ever-changing, open-ended world. A key challenge is ensuring these AI agents are robust, excelling not only in familiar settings observed during training but also effectively generalising to previously unseen and varied scenarios. In this thesis, we harness methodologies from open-endedness and multi-agent learning to train and evaluate robust AI agents capable of generalising to novel environments, out-of-distribution inputs, and interactions with other co-player agents. We begin by introducing MiniHack, a sandbox framework for creating diverse environments through procedural content generation. Based on the game of NetHack, MiniHack enables the construction of new tasks for reinforcement learning (RL) agents with a focus on generalisation. We then present Maestro, a novel approach for generating adversarial curricula that progressively enhance the robustness and generality of RL agents in two-player zero-sum games. We further probe robustness in multi-agent domains, utilising quality-diversity methods to systematically identify vulnerabilities in state-of-the-art, pre-trained RL policies within the complex video game football domain, characterised by intertwined cooperative and competitive dynamics. Finally, we extend our exploration of robustness to the domain of LLMs. Here, our focus is on diagnosing and enhancing the robustness of LLMs against adversarial prompts, employing evolutionary search to generate a diverse range of effective inputs that aim to elicit undesirable outputs from an LLM. This work collectively paves the way for future advancements in AI robustness, enabling the development of agents that not only adapt to an ever-evolving world but also thrive in the face of unforeseen challenges and interactions.
[LG-42] Robust equilibria in continuous games: From strategic to dynamic robustness
链接: https://arxiv.org/abs/2512.08138
作者: Kyriakos Lotidis,Panayotis Mertikopoulos,Nicholas Bambos,Jose Blanchet
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 33 pages, 5 figures
Abstract:In this paper, we examine the robustness of Nash equilibria in continuous games, under both strategic and dynamic uncertainty. Starting with the former, we introduce the notion of a robust equilibrium as those equilibria that remain invariant to small – but otherwise arbitrary – perturbations to the game’s payoff structure, and we provide a crisp geometric characterization thereof. Subsequently, we turn to the question of dynamic robustness, and we examine which equilibria may arise as stable limit points of the dynamics of “follow the regularized leader” (FTRL) in the presence of randomness and uncertainty. Despite their very distinct origins, we establish a structural correspondence between these two notions of robustness: strategic robustness implies dynamic robustness, and, conversely, the requirement of strategic robustness cannot be relaxed if dynamic robustness is to be maintained. Finally, we examine the rate of convergence to robust equilibria as a function of the underlying regularizer, and we show that entropically regularized learning converges at a geometric rate in games with affinely constrained action spaces.
[LG-43] Multi-agent learning under uncertainty: Recurrence vs. concentration
链接: https://arxiv.org/abs/2512.08132
作者: Kyriakos Lotidis,Panayotis Mertikopoulos,Nicholas Bambos,Jose Blanchet
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 44 pages, 17 figures
Abstract:In this paper, we examine the convergence landscape of multi-agent learning under uncertainty. Specifically, we analyze two stochastic models of regularized learning in continuous games – one in continuous and one in discrete time – with the aim of characterizing the long-run behavior of the induced sequence of play. In stark contrast to deterministic, full-information models of learning (or models with a vanishing learning rate), we show that the resulting dynamics do not converge in general. In lieu of this, we ask instead which actions are played more often in the long run, and by how much. We show that, in strongly monotone games, the dynamics of regularized learning may wander away from equilibrium infinitely often, but they always return to its vicinity in finite time (which we estimate), and their long-run distribution is sharply concentrated around a neighborhood thereof. We quantify the degree of this concentration, and we show that these favorable properties may all break down if the underlying game is not strongly monotone – underscoring in this way the limits of regularized learning in the presence of persistent randomness and uncertainty.
[LG-44] Improving the Sensitivity of Backdoor Detectors via Class Subspace Orthogonalization
链接: https://arxiv.org/abs/2512.08129
作者: Guangmingmei Yang,David J. Miller,George Kesidis
类目: Machine Learning (cs.LG)
*备注:
Abstract:Most post-training backdoor detection methods rely on attacked models exhibiting extreme outlier detection statistics for the target class of an attack, compared to non-target classes. However, these approaches may fail: (1) when some (non-target) classes are easily discriminable from all others, in which case they may naturally achieve extreme detection statistics (e.g., decision confidence); and (2) when the backdoor is subtle, i.e., with its features weak relative to intrinsic class-discriminative features. A key observation is that the backdoor target class has contributions to its detection statistic from both the backdoor trigger and from its intrinsic features, whereas non-target classes only have contributions from their intrinsic features. To achieve more sensitive detectors, we thus propose to suppress intrinsic features while optimizing the detection statistic for a given class. For non-target classes, such suppression will drastically reduce the achievable statistic, whereas for the target class the (significant) contribution from the backdoor trigger remains. In practice, we formulate a constrained optimization problem, leveraging a small set of clean examples from a given class, and optimizing the detection statistic while orthogonalizing with respect to the class’s intrinsic features. We dub this plug-and-play approach Class Subspace Orthogonalization (CSO) and assess it against challenging mixed-label and adaptive attacks.
[LG-45] Complexity of One-Dimensional ReLU DNNs
链接: https://arxiv.org/abs/2512.08091
作者: Jonathan Kogan,Hayden Jananthan,Jeremy Kepner
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Presented at IEEE MIT URTC 2025
Abstract:We study the expressivity of one-dimensional (1D) ReLU deep neural networks through the lens of their linear regions. For randomly initialized, fully connected 1D ReLU networks (He scaling with nonzero bias) in the infinite-width limit, we prove that the expected number of linear regions grows as \sum_{\ell=1}^{L} n_\ell + o\left(\sum_{\ell=1}^{L} n_\ell\right) + 1 , where n_\ell denotes the number of neurons in the \ell -th hidden layer. We also propose a function-adaptive notion of sparsity that compares the expected regions used by the network to the minimal number needed to approximate a target within a fixed tolerance.
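The claimed count (roughly the total number of hidden neurons, plus one) can be checked numerically on a random 1D ReLU net by tracking ReLU activation patterns along a dense input grid. The grid window and density are heuristics of this sketch, not part of the paper.

```python
# Empirically count the linear regions of a random 1D ReLU network.
import numpy as np

rng = np.random.default_rng(0)
widths = [16, 16]
layers, fan_in = [], 1
for w in widths:
    W = rng.standard_normal((fan_in, w)) * np.sqrt(2.0 / fan_in)  # He scaling
    b = rng.standard_normal(w) * 0.1                              # nonzero bias
    layers.append((W, b))
    fan_in = w

def activation_pattern(x):
    """Binary pattern of active ReLUs at each input point."""
    h = x.reshape(-1, 1)
    pats = []
    for W, b in layers:
        pre = h @ W + b
        pats.append(pre > 0)
        h = np.maximum(pre, 0.0)
    return np.concatenate(pats, axis=1)

xs = np.linspace(-10, 10, 200_001)          # dense grid over a finite window
pat = activation_pattern(xs)
n_regions = 1 + int(np.any(pat[1:] != pat[:-1], axis=1).sum())
print("regions in window:", n_regions, "| sum of widths + 1:", sum(widths) + 1)
```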
[LG-46] Unveiling Latent Knowledge in Chemistry Language Models through Sparse Autoencoders
链接: https://arxiv.org/abs/2512.08077
作者: Jaron Cohen,Alexander G. Hasson,Sara Tanovic
类目: Machine Learning (cs.LG)
*备注:
Abstract:Since the advent of machine learning, interpretability has remained a persistent challenge, becoming increasingly urgent as generative models support high-stakes applications in drug and material discovery. Recent advances in large language model (LLM) architectures have yielded chemistry language models (CLMs) with impressive capabilities in molecular property prediction and molecular generation. However, how these models internally represent chemical knowledge remains poorly understood. In this work, we extend sparse autoencoder techniques to uncover and examine interpretable features within CLMs. Applying our methodology to the Foundation Models for Materials (FM4M) SMI-TED chemistry foundation model, we extract semantically meaningful latent features and analyse their activation patterns across diverse molecular datasets. Our findings reveal that these models encode a rich landscape of chemical concepts. We identify correlations between specific latent features and distinct domains of chemical knowledge, including structural motifs, physicochemical properties, and pharmacological drug classes. Our approach provides a generalisable framework for uncovering latent knowledge in chemistry-focused AI systems. This work has implications for both foundational understanding and practical deployment; with the potential to accelerate computational chemistry research.
[LG-47] CAMO: Causality-Guided Adversarial Multimodal Domain Generalization for Crisis Classification
链接: https://arxiv.org/abs/2512.08071
作者: Pingchuan Ma,Chengshuai Zhao,Bohan Jiang,Saketh Vishnubhatla,Ujun Jeong,Alimohammad Beigi,Adrienne Raglin,Huan Liu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Crisis classification in social media aims to extract actionable disaster-related information from multimodal posts, which is a crucial task for enhancing situational awareness and facilitating timely emergency responses. However, the wide variation in crisis types makes achieving generalizable performance across unseen disasters a persistent challenge. Existing approaches primarily leverage deep learning to fuse textual and visual cues for crisis classification, achieving numerically plausible results under in-domain settings. However, they exhibit poor generalization across unseen crisis types because they (1) do not disentangle spurious and causal features, resulting in performance degradation under domain shift, and (2) fail to align heterogeneous modality representations within a shared space, which hinders the direct adaptation of established single-modality domain generalization (DG) techniques to the multimodal setting. To address these issues, we introduce a causality-guided multimodal domain generalization (MMDG) framework that combines adversarial disentanglement with unified representation learning for crisis classification. The adversarial objective encourages the model to disentangle and focus on domain-invariant causal features, leading to more generalizable classifications grounded in stable causal mechanisms. The unified representation aligns features from different modalities within a shared latent space, enabling single-modality DG strategies to be seamlessly extended to multimodal learning. Experiments on different datasets demonstrate that our approach achieves the best performance in unseen disaster scenarios.
[LG-48] Deep Kernel Aalen-Johansen Estimator: An Interpretable and Flexible Neural Net Framework for Competing Risks ML4H ALT
链接: https://arxiv.org/abs/2512.08063
作者: Xiaobin Shen,George H. Chen
类目: Machine Learning (cs.LG)
*备注: Machine Learning for Health (ML4H) 2025 Spotlight
Abstract:We propose an interpretable deep competing risks model called the Deep Kernel Aalen-Johansen (DKAJ) estimator, which generalizes the classical Aalen-Johansen nonparametric estimate of cumulative incidence functions (CIFs). Each data point (e.g., patient) is represented as a weighted combination of clusters. If a data point has nonzero weight only for one cluster, then its predicted CIFs correspond to those of the classical Aalen-Johansen estimator restricted to data points from that cluster. These weights come from an automatically learned kernel function that measures how similar any two data points are. On four standard competing risks datasets, we show that DKAJ is competitive with state-of-the-art baselines while being able to provide visualizations to assist model interpretation.
[LG-49] LUNA: Linear Universal Neural Attention with Generalization Guarantees
链接: https://arxiv.org/abs/2512.08061
作者: Ashkan Shahbazi,Ping He,Ali Abbasi,Yikun Bai,Xinran Liu,Elaheh Akbari,Darian Salehi,Navid NaderiAlizadeh,Soheil Kolouri
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Scaling attention faces a critical bottleneck: the \mathcal{O}(n^2) quadratic computational cost of softmax attention, which limits its application in long-sequence domains. While linear attention mechanisms reduce this cost to \mathcal{O}(n) , they typically rely on fixed random feature maps, such as random Fourier features or hand-crafted functions. This reliance on static, data-agnostic kernels creates a fundamental trade-off, forcing practitioners to sacrifice significant model accuracy for computational efficiency. We introduce \textsc{LUNA}, a kernelized linear attention mechanism that eliminates this trade-off, retaining linear cost while matching and surpassing the accuracy of quadratic attention. \textsc{LUNA} is built on the key insight that the kernel feature map itself should be learned rather than fixed a priori. By parameterizing the kernel, \textsc{LUNA} learns a feature basis tailored to the specific data and task, overcoming the expressive limitations of fixed-feature methods. \textsc{LUNA} implements this with a learnable feature map that induces a positive-definite kernel and admits a streaming form, yielding linear time and memory scaling in the sequence length. Empirical evaluations validate our approach across diverse settings. On the Long Range Arena (LRA), \textsc{LUNA} achieves state-of-the-art average accuracy among efficient Transformers under compute parity, using the same parameter count, training steps, and approximate FLOPs. \textsc{LUNA} also excels at post-hoc conversion: replacing softmax in fine-tuned BERT and ViT-B/16 checkpoints and briefly fine-tuning recovers most of the original performance, substantially outperforming fixed linearizations.
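The paper's architecture is not public in this listing, but generic kernelized linear attention with a learnable positive feature map can be sketched in a few lines; the softplus feature map and the dimensions below are assumptions.

```python
# O(n) linear attention with a learnable positive feature map.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableFeatureMap(nn.Module):
    def __init__(self, d_head, d_feat):
        super().__init__()
        self.proj = nn.Linear(d_head, d_feat)

    def forward(self, x):
        # softplus keeps features positive, inducing a positive-definite kernel
        return F.softplus(self.proj(x))

def linear_attention(q, k, v, phi):
    # q, k, v: (batch, seq, d_head); cost is linear in the sequence length.
    q, k = phi(q), phi(k)
    kv = torch.einsum("bnd,bne->bde", k, v)              # feature-value summary
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + 1e-6)
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)

phi = LearnableFeatureMap(d_head=32, d_feat=64)
q, k, v = (torch.randn(2, 128, 32) for _ in range(3))
out = linear_attention(q, k, v, phi)                     # (2, 128, 32)
```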
[LG-50] Fairness-aware PageRank via Edge Reweighting
链接: https://arxiv.org/abs/2512.08055
作者: Honglian Wang,Haoyun Chen,Aristides Gionis
类目: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
*备注:
Abstract:Link-analysis algorithms, such as PageRank, are instrumental in understanding the structural dynamics of networks by evaluating the importance of individual vertices based on their connectivity. Recently, with the rising importance of responsible AI, the question of fairness in link-analysis algorithms has gained traction. In this paper, we present a new approach for incorporating group fairness into the PageRank algorithm by reweighting the transition probabilities in the underlying transition matrix. We formulate the problem of achieving fair PageRank by seeking to minimize the fairness loss, which is the difference between the original group-wise PageRank distribution and a target PageRank distribution. We further define a group-adapted fairness notion, which accounts for group homophily by considering random walks with group-biased restart for each group. Since the fairness loss is non-convex, we propose an efficient projected gradient-descent method for computing locally-optimal edge weights. Unlike earlier approaches, we do not recommend adding new edges to the network, nor do we adjust the restart vector. Instead, we keep the topology of the underlying network unchanged and only modify the relative importance of existing edges. We empirically compare our approach with state-of-the-art baselines and demonstrate the efficacy of our method, where very small changes in the transition matrix lead to significant improvement in the fairness of the PageRank algorithm.
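The main moving parts (a weighted PageRank, a group-wise fairness loss, and a gradient step that only touches existing edges) can be sketched on a toy graph. The graph, target shares, numerical gradient, and step size below are all illustrative.

```python
# One projected-gradient step toward fairness-aware PageRank edge weights.
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)   # fixed topology
group = np.array([0, 0, 1, 1])              # two protected groups
target = np.array([0.5, 0.5])               # desired group-wise PageRank mass
alpha = 0.85

def pagerank(W, n_iter=100):
    P = W / W.sum(axis=1, keepdims=True)    # row-stochastic transitions
    n = W.shape[0]
    r = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        r = alpha * r @ P + (1 - alpha) / n
    return r

def fairness_loss(W):
    r = pagerank(W)
    mass = np.array([r[group == 0].sum(), r[group == 1].sum()])
    return ((mass - target) ** 2).sum()

W, eps, lr = A.copy(), 1e-5, 1.0
base = fairness_loss(W)
grad = np.zeros_like(W)
for i, j in zip(*np.nonzero(A)):            # gradient only on existing edges
    Wp = W.copy(); Wp[i, j] += eps
    grad[i, j] = (fairness_loss(Wp) - base) / eps
W = np.clip(W - lr * grad, 1e-9, None) * A  # keep topology, weights positive
print(base, "->", fairness_loss(W))
```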
[LG-51] An Introduction to Deep Reinforcement and Imitation Learning
链接: https://arxiv.org/abs/2512.08052
作者: Pedro Santana
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
Abstract:Embodied agents, such as robots and virtual characters, must continuously select actions to execute tasks effectively, solving complex sequential decision-making problems. Given the difficulty of designing such controllers manually, learning-based approaches have emerged as promising alternatives, most notably Deep Reinforcement Learning (DRL) and Deep Imitation Learning (DIL). DRL leverages reward signals to optimize behavior, while DIL uses expert demonstrations to guide learning. This document introduces DRL and DIL in the context of embodied agents, adopting a concise, depth-first approach to the literature. It is self-contained, presenting all necessary mathematical and machine learning concepts as they are needed. It is not intended as a survey of the field; rather, it focuses on a small set of foundational algorithms and techniques, prioritizing in-depth understanding over broad coverage. The material ranges from Markov Decision Processes to REINFORCE and Proximal Policy Optimization (PPO) for DRL, and from Behavioral Cloning to Dataset Aggregation (DAgger) and Generative Adversarial Imitation Learning (GAIL) for DIL.
[LG-52] Learning Dynamics from Infrequent Output Measurements for Uncertainty-Aware Optimal Control
链接: https://arxiv.org/abs/2512.08013
作者: Robert Lefringhausen,Theodor Springer,Sandra Hirche
类目: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: Submitted to the 2026 IFAC World Congress
Abstract:Reliable optimal control is challenging when the dynamics of a nonlinear system are unknown and only infrequent, noisy output measurements are available. This work addresses this setting of limited sensing by formulating a Bayesian prior over the continuous-time dynamics and latent state trajectory in state-space form and updating it through a targeted marginal Metropolis-Hastings sampler equipped with a numerical ODE integrator. The resulting posterior samples are used to formulate a scenario-based optimal control problem that accounts for both model and measurement uncertainty and is solved using standard nonlinear programming methods. The approach is validated in a numerical case study on glucose regulation using a Type 1 diabetes model.
[LG-53] Benchmarking Offline Multi-Objective Reinforcement Learning in Critical Care
链接: https://arxiv.org/abs/2512.08012
作者: Aryaman Bansal,Divya Sharma
类目: Machine Learning (cs.LG)
*备注:
Abstract:In critical care settings such as the Intensive Care Unit, clinicians face the complex challenge of balancing conflicting objectives, primarily maximizing patient survival while minimizing resource utilization (e.g., length of stay). Single-objective Reinforcement Learning approaches typically address this by optimizing a fixed scalarized reward function, resulting in rigid policies that fail to adapt to varying clinical priorities. Multi-objective Reinforcement Learning (MORL) offers a solution by learning a set of optimal policies along the Pareto Frontier, allowing for dynamic preference selection at test time. However, applying MORL in healthcare necessitates strict offline learning from historical data. In this paper, we benchmark three offline MORL algorithms, Conditioned Conservative Pareto Q-Learning (CPQL), Adaptive CPQL, and a modified Pareto Efficient Decision Agent (PEDA) Decision Transformer (PEDA DT), against three scalarized single-objective baselines (BC, CQL, and DDQN) on the MIMIC-IV dataset. Using Off-Policy Evaluation (OPE) metrics, we demonstrate that the PEDA DT algorithm offers superior flexibility compared to static scalarized baselines. Notably, our results extend previous findings on single-objective Decision Transformers in healthcare, confirming that sequence modeling architectures remain robust and effective when scaled to multi-objective conditioned generation. These findings suggest that offline MORL is a promising framework for enabling personalized, adjustable decision-making in critical care without the need for retraining.
[LG-54] A Comparative Study of EMG- and IMU-based Gesture Recognition at the Wrist and Forearm
链接: https://arxiv.org/abs/2512.07997
作者: Soroush Baghernezhad,Elaheh Mohammadreza,Vinicius Prado da Fonseca,Ting Zou,Xianta Jiang
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:
Abstract:Gestures are an integral part of our daily interactions with the environment. Hand gesture recognition (HGR) is the process of interpreting human intent through various input modalities, such as visual data (images and videos) and bio-signals. Bio-signals are widely used in HGR due to their ability to be captured non-invasively via sensors placed on the arm. Among these, surface electromyography (sEMG), which measures the electrical activity of muscles, is the most extensively studied modality. However, less-explored alternatives such as inertial measurement units (IMUs) can provide complementary information on subtle muscle movements, which makes them valuable for gesture recognition. In this study, we investigate the potential of using IMU signals from different muscle groups to capture user intent. Our results demonstrate that IMU signals contain sufficient information to serve as the sole input sensor for static gesture recognition. Moreover, we compare different muscle groups and check the quality of pattern recognition on individual muscle groups. We further found that tendon-induced micro-movement captured by IMUs is a major contributor to static gesture recognition. We believe that leveraging muscle micro-movement information can enhance the usability of prosthetic arms for amputees. This approach also offers new possibilities for hand gesture recognition in fields such as robotics, teleoperation, sign language interpretation, and beyond.
[LG-55] Bridging the Clinical Expertise Gap: Development of a Web-Based Platform for Accessible Time Series Forecasting and Analysis
链接: https://arxiv.org/abs/2512.07992
作者: Aaron D. Mullen,Daniel R. Harris,Svetla Slavova,V.K. Cody Bumgardner
类目: Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注:
Abstract:Time series forecasting has applications across domains and industries, especially in healthcare, but the technical expertise required to analyze data, build models, and interpret results can be a barrier to using these techniques. This article presents a web platform that makes the process of analyzing and plotting data, training forecasting models, and interpreting and viewing results accessible to researchers and clinicians. Users can upload data and generate plots to showcase their variables and the relationships between them. The platform supports multiple forecasting models and training techniques which are highly customizable according to the user’s needs. Additionally, recommendations and explanations can be generated from a large language model that can help the user choose appropriate parameters for their data and understand the results for each model. The goal is to integrate this platform into learning health systems for continuous data collection and inference from clinical pipelines.
[LG-56] HOLE: Homological Observation of Latent Embeddings for Neural Network Interpretability
链接: https://arxiv.org/abs/2512.07988
作者: Sudhanva Manjunath Athreya,Paul Rosen
类目: Machine Learning (cs.LG); Graphics (cs.GR); Human-Computer Interaction (cs.HC)
*备注:
Abstract:Deep learning models have achieved remarkable success across various domains, yet their learned representations and decision-making processes remain largely opaque and hard to interpret. This work introduces HOLE (Homological Observation of Latent Embeddings), a method for analyzing and interpreting deep neural networks through persistent homology. HOLE extracts topological features from neural activations and presents them using a suite of visualization techniques, including Sankey diagrams, heatmaps, dendrograms, and blob graphs. These tools facilitate the examination of representation structure and quality across layers. We evaluate HOLE on standard datasets using a range of discriminative models, focusing on representation quality, interpretability across layers, and robustness to input perturbations and model compression. The results indicate that topological analysis reveals patterns associated with class separation, feature disentanglement, and model robustness, providing a complementary perspective for understanding and improving deep learning systems.
[LG-57] Towards symbolic regression for interpretable clinical decision scores
链接: https://arxiv.org/abs/2512.07961
作者: Guilherme Seidyo Imai Aldeia,Joseph D. Romano,Fabricio Olivetti de Franca,Daniel S. Herman,William G. La Cava
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 15 pages, 5 figures. Accepted for publication in Philosophical Transactions A. Author Accepted Manuscript version
Abstract:Medical decision-making makes frequent use of algorithms that combine risk equations with rules, providing clear and standardized treatment pathways. Symbolic regression (SR) traditionally limits its search space to continuous function forms and their parameters, making it difficult to model this decision-making. However, due to its ability to derive data-driven, interpretable models, SR holds promise for developing data-driven clinical risk scores. To that end we introduce Brush, an SR algorithm that combines decision-tree-like splitting algorithms with non-linear constant optimization, allowing for seamless integration of rule-based logic into symbolic regression and classification models. Brush achieves Pareto-optimal performance on SRBench, and was applied to recapitulate two widely used clinical scoring systems, achieving high accuracy and interpretable models. Compared to decision trees, random forests, and other SR methods, Brush achieves comparable or superior predictive performance while producing simpler models.
[LG-58] CrowdLLM : Building LLM -Based Digital Populations Augmented with Generative Models
链接: https://arxiv.org/abs/2512.07890
作者: Ryan Feng Lin,Keyu Tian,Hanming Zheng,Congjing Zhang,Li Zeng,Shuai Huang
类目: Multiagent Systems (cs.MA); Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:
Abstract:The emergence of large language models (LLMs) has sparked much interest in creating LLM-based digital populations that can be applied to many applications such as social simulation, crowdsourcing, marketing, and recommendation systems. A digital population can reduce the cost of recruiting human participants and alleviate many concerns related to human subject study. However, research has found that most of the existing works rely solely on LLMs and could not sufficiently capture the accuracy and diversity of a real human population. To address this limitation, we propose CrowdLLM that integrates pretrained LLMs and generative models to enhance the diversity and fidelity of the digital population. We conduct theoretical analysis of CrowdLLM regarding its great potential in creating cost-effective, sufficiently representative, scalable digital populations that can match the quality of a real crowd. Comprehensive experiments are also conducted across multiple domains (e.g., crowdsourcing, voting, user rating) and simulation studies which demonstrate that CrowdLLM achieves promising performance in both accuracy and distributional fidelity to human data.
[LG-59] Semi-Supervised Contrastive Learning with Orthonormal Prototypes
链接: https://arxiv.org/abs/2512.07880
作者: Huanran Li(1),Manh Nguyen(2),Daniel Pimentel-Alarcón(3) ((1) Department of Electrical Engineering, (2) Statistics, (3) Biostatistics, Wisconsin Institute of Discovery, University of Wisconsin-Madison)
类目: Machine Learning (cs.LG)
*备注:
Abstract:Contrastive learning has emerged as a powerful method in deep learning, excelling at learning effective representations through contrasting samples from different distributions. However, dimensional collapse, where embeddings converge into a lower-dimensional space, poses a significant challenge, especially in semi-supervised and self-supervised setups. In this paper, we first identify a critical learning-rate threshold, beyond which standard contrastive losses converge to collapsed solutions. Building on these insights, we propose CLOP, a novel semi-supervised loss function designed to prevent dimensional collapse by promoting the formation of orthogonal linear subspaces among class embeddings. Through extensive experiments on real and synthetic datasets, we demonstrate that CLOP improves performance in image classification and object detection tasks while also exhibiting greater stability across different learning rates and batch sizes.
[LG-60] Nonnegative Matrix Factorization through Cone Collapse
链接: https://arxiv.org/abs/2512.07879
作者: Manh Nguyen(1),Daniel Pimentel-Alarcón(2) ((1) Department of Statistics, (2) Department of Biostatistics and Medical Informatics, Wisconsin Institute of Discovery, University of Wisconsin-Madison)
类目: Machine Learning (cs.LG)
*备注:
Abstract:Nonnegative matrix factorization (NMF) is a widely used tool for learning parts-based, low-dimensional representations of nonnegative data, with applications in vision, text, and bioinformatics. In clustering applications, orthogonal NMF (ONMF) variants further impose (approximate) orthogonality on the representation matrix so that its rows behave like soft cluster indicators. Existing algorithms, however, are typically derived from optimization viewpoints and do not explicitly exploit the conic geometry induced by NMF: data points lie in a convex cone whose extreme rays encode fundamental directions or “topics”. In this work we revisit NMF from this geometric perspective and propose Cone Collapse, an algorithm that starts from the full nonnegative orthant and iteratively shrinks it toward the minimal cone generated by the data. We prove that, under mild assumptions on the data, Cone Collapse terminates in finitely many steps and recovers the minimal generating cone of \mathbf{X}^\top . Building on this basis, we then derive a cone-aware orthogonal NMF model (CC-NMF) by applying uni-orthogonal NMF to the recovered extreme rays. Across 16 benchmark gene-expression, text, and image datasets, CC-NMF consistently matches or outperforms strong NMF baselines-including multiplicative updates, ANLS, projective NMF, ONMF, and sparse NMF-in terms of clustering purity. These results demonstrate that explicitly recovering the data cone can yield both theoretically grounded and empirically strong NMF-based clustering methods.
[LG-61] Graph Contrastive Learning via Spectral Graph Alignment
链接: https://arxiv.org/abs/2512.07878
作者: Manh Nguyen,Joshua Cape(Department of Statistics, University of Wisconsin-Madison)
类目: Machine Learning (cs.LG)
*备注:
Abstract:Given augmented views of each input graph, contrastive learning methods (e.g., InfoNCE) optimize pairwise alignment of graph embeddings across views while providing no mechanism to control the global structure of the view specific graph-of-graphs built from these embeddings. We introduce SpecMatch-CL, a novel loss function that aligns the view specific graph-of-graphs by minimizing the difference between their normalized Laplacians. Theoretically, we show that under certain assumptions, the difference between normalized Laplacians provides an upper bound not only for the difference between the ideal Perfect Alignment contrastive loss and the current loss, but also for the uniformity loss. Empirically, SpecMatch-CL establishes new state of the art on eight TU benchmarks under unsupervised learning and semi-supervised learning at low label rates, and yields consistent gains in transfer learning on PPI-306K and ZINC 2M datasets.
[LG-62] Fourier-Enhanced Recurrent Neural Networks for Electrical Load Time Series Downscaling
链接: https://arxiv.org/abs/2512.07876
作者: Qi Chen,Mihai Anitescu
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Submitted to IEEE PES General Meeting 2026
Abstract:We present a Fourier-enhanced recurrent neural network (RNN) for downscaling electrical loads. The model combines (i) a recurrent backbone driven by low-resolution inputs, (ii) explicit Fourier seasonal embeddings fused in latent space, and (iii) a self-attention layer that captures dependencies among high-resolution components within each period. Across four PJM territories, the approach yields RMSE that is lower and flatter across horizons than classical Prophet baselines (with and without seasonality/LAA) and than RNN ablations without attention or Fourier features.
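As a structural illustration of the three named ingredients (recurrent backbone, Fourier seasonal embedding fused in latent space, self-attention over high-resolution components), a compact PyTorch sketch follows; all dimensions, the GRU choice, and the harmonic count are assumptions.

```python
# Skeleton of a Fourier-enhanced RNN downscaler.
import torch
import torch.nn as nn

class FourierRNNDownscaler(nn.Module):
    def __init__(self, hidden=64, period=24, n_harmonics=4):
        super().__init__()
        self.gru = nn.GRU(input_size=1, hidden_size=hidden, batch_first=True)
        self.fuse = nn.Linear(hidden + 2 * n_harmonics, hidden)
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.head = nn.Linear(hidden, period)   # low-res step -> `period` values
        self.n_harmonics, self.period = n_harmonics, period

    def forward(self, x_lowres):                # (batch, T, 1)
        h, _ = self.gru(x_lowres)               # (batch, T, hidden)
        t = torch.arange(x_lowres.shape[1], dtype=torch.float32)
        harm = torch.cat(
            [torch.stack([torch.sin(2 * torch.pi * k * t / self.period),
                          torch.cos(2 * torch.pi * k * t / self.period)], -1)
             for k in range(1, self.n_harmonics + 1)], dim=-1)
        z = torch.relu(self.fuse(torch.cat(
            [h, harm.expand(x_lowres.shape[0], -1, -1)], dim=-1)))
        z, _ = self.attn(z, z, z)               # within-period dependencies
        return self.head(z)                     # (batch, T, period)

model = FourierRNNDownscaler()
y_highres = model(torch.randn(2, 48, 1))        # 48 low-res steps -> 48x24 values
```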
[LG-63] Softly Symbolifying Kolmogorov-Arnold Networks
链接: https://arxiv.org/abs/2512.07875
作者: James Bagrow,Josh Bongard
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (stat.ML)
*备注: 13 pages, 5 figures, 3 tables
Abstract:Kolmogorov-Arnold Networks (KANs) offer a promising path toward interpretable machine learning: their learnable activations can be studied individually, while collectively fitting complex data accurately. In practice, however, trained activations often lack symbolic fidelity, learning pathological decompositions with no meaningful correspondence to interpretable forms. We propose Softly Symbolified Kolmogorov-Arnold Networks (S2KAN), which integrate symbolic primitives directly into training. Each activation draws from a dictionary of symbolic and dense terms, with learnable gates that sparsify the representation. Crucially, this sparsification is differentiable, enabling end-to-end optimization, and is guided by a principled Minimum Description Length objective. When symbolic terms suffice, S2KAN discovers interpretable forms; when they do not, it gracefully degrades to dense splines. We demonstrate competitive or superior accuracy with substantially smaller models across symbolic benchmarks, dynamical systems forecasting, and real-world prediction tasks, and observe evidence of emergent self-sparsification even without regularization pressure.
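The core mechanism (each activation is a gated mixture of symbolic primitives, with differentiable sparsification) can be conveyed in a toy module; the primitive dictionary, sigmoid gating, and penalty below are assumptions standing in for the paper's MDL objective.

```python
# Toy gated symbolic-dictionary activation in the spirit of S2KAN.
import torch
import torch.nn as nn

class GatedSymbolicActivation(nn.Module):
    def __init__(self):
        super().__init__()
        self.primitives = [torch.sin, torch.tanh, lambda x: x, lambda x: x ** 2]
        self.gates = nn.Parameter(torch.zeros(len(self.primitives)))

    def forward(self, x):
        g = torch.sigmoid(self.gates)            # soft, differentiable gates
        return sum(gi * f(x) for gi, f in zip(g, self.primitives))

    def sparsity_penalty(self):
        # Pushes gates toward 0/1 so surviving terms read as symbols.
        g = torch.sigmoid(self.gates)
        return (g * (1 - g)).sum()

act = GatedSymbolicActivation()
y = act(torch.linspace(-1, 1, 5))
loss = y.pow(2).mean() + 0.1 * act.sparsity_penalty()
loss.backward()
```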
[LG-64] Controllable risk scenario generation from human crash data for autonomous vehicle testing
链接: https://arxiv.org/abs/2512.07874
作者: Qiujing Lu,Xuanhan Wang,Runze Yuan,Wei Lu,Xinyi Gong,Shuo Feng
类目: Machine Learning (cs.LG)
*备注:
Abstract:Ensuring the safety of autonomous vehicles (AV) requires rigorous testing under both everyday driving and rare, safety-critical conditions. A key challenge lies in simulating environment agents, including background vehicles (BVs) and vulnerable road users (VRUs), that behave realistically in nominal traffic while also exhibiting risk-prone behaviors consistent with real-world accidents. We introduce Controllable Risk Agent Generation (CRAG), a framework designed to unify the modeling of dominant nominal behaviors and rare safety-critical behaviors. CRAG constructs a structured latent space that disentangles normal and risk-related behaviors, enabling efficient use of limited crash data. By combining risk-aware latent representations with optimization-based mode-transition mechanisms, the framework allows agents to shift smoothly and plausibly from safe to risk states over extended horizons, while maintaining high fidelity in both regimes. Extensive experiments show that CRAG improves diversity compared to existing baselines, while also enabling controllable generation of risk scenarios for targeted and efficient evaluation of AV robustness.
[LG-65] Using Text-Based Life Trajectories from Swedish Register Data to Predict Residential Mobility with Pretrained Transformers
链接: https://arxiv.org/abs/2512.07865
作者: Philipp Stark,Alexandros Sopasakis,Ola Hall,Markus Grillitsch
类目: Machine Learning (cs.LG)
*备注:
Abstract:We transform large-scale Swedish register data into textual life trajectories to address two long-standing challenges in data analysis: high cardinality of categorical variables and inconsistencies in coding schemes over time. Leveraging this uniquely comprehensive population register, we convert register data from 6.9 million individuals (2001-2013) into semantically rich texts and predict individuals’ residential mobility in later years (2013-2017). These life trajectories combine demographic information with annual changes in residence, work, education, income, and family circumstances, allowing us to assess how effectively such sequences support longitudinal prediction. We compare multiple NLP architectures (including LSTM, DistilBERT, BERT, and Qwen) and find that sequential and transformer-based models capture temporal and semantic structure more effectively than baseline models. The results show that textualized register data preserves meaningful information about individual pathways and supports complex, scalable modeling. Because few countries maintain longitudinal microdata with comparable coverage and precision, this dataset enables analyses and methodological tests that would be difficult or impossible elsewhere, offering a rigorous testbed for developing and evaluating new sequence-modeling approaches. Overall, our findings demonstrate that combining semantically rich register data with modern language models can substantially advance longitudinal analysis in social sciences.
[LG-66] Pattern Recognition of Ozone-Depleting Substance Exports in Global Trade Data
链接: https://arxiv.org/abs/2512.07864
作者: Muhammad Sukri Bin Ramli
类目: Machine Learning (cs.LG); Econometrics (econ.EM); General Economics (econ.GN)
*备注:
Abstract:New methods are needed to monitor environmental treaties, like the Montreal Protocol, by reviewing large, complex customs datasets. This paper introduces a framework using unsupervised machine learning to systematically detect suspicious trade patterns and highlight activities for review. Our methodology, applied to 100,000 trade records, combines several ML techniques. Unsupervised Clustering (K-Means) discovers natural trade archetypes based on shipment value and weight. Anomaly Detection (Isolation Forest and IQR) identifies rare “mega-trades” and shipments with commercially unusual price-per-kilogram values. This is supplemented by Heuristic Flagging to find tactics like vague shipment descriptions. These layers are combined into a priority score, which successfully identified 1,351 price outliers and 1,288 high-priority shipments for customs review. A key finding is that high-priority commodities show a distinct, higher value-to-weight ratio than general goods. This was validated using Explainable AI (SHAP), which confirmed vague descriptions and high value as the most significant risk predictors. The model’s sensitivity was validated by its detection of a massive spike in “mega-trades” in early 2021, correlating directly with the real-world regulatory impact of the US AIM Act. This work presents a repeatable unsupervised learning pipeline to turn raw trade data into prioritized, usable intelligence for regulatory groups.
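The layered design maps directly onto standard scikit-learn components. A condensed sketch with synthetic columns; the thresholds, score weights, and "vague description" list are placeholders, not the paper's calibration.

```python
# Condensed version of the layered trade-screening pipeline.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "value_usd": rng.lognormal(8, 1.5, 1000),
    "weight_kg": rng.lognormal(5, 1.0, 1000),
    "description": rng.choice(["refrigerant gas", "chemicals", "goods"], 1000),
})
df["usd_per_kg"] = df["value_usd"] / df["weight_kg"]
X = np.log(df[["value_usd", "weight_kg"]])

# Layer 1: trade archetypes. Layer 2: anomaly scores.
df["cluster"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
df["iso_score"] = -IsolationForest(random_state=0).fit(X).score_samples(X)

# Layer 3: IQR price outliers and heuristic description flags.
q1, q3 = df["usd_per_kg"].quantile([0.25, 0.75])
iqr = q3 - q1
df["price_outlier"] = ~df["usd_per_kg"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df["vague_desc"] = df["description"].isin(["chemicals", "goods"])

# Combine the layers into a single review-priority score.
df["priority"] = (df["iso_score"].rank(pct=True)
                  + df["price_outlier"].astype(float)
                  + 0.5 * df["vague_desc"].astype(float))
top10 = df.nlargest(10, "priority")
```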
[LG-67] SetAD: Semi-Supervised Anomaly Learning in Contextual Sets
链接: https://arxiv.org/abs/2512.07863
作者: Jianling Gao,Chongyang Tao,Xuelian Lin,Junfeng Liu,Shuai Ma
类目: Machine Learning (cs.LG)
*备注: 9 pages
Abstract:Semi-supervised anomaly detection (AD) has shown great promise by effectively leveraging limited labeled data. However, existing methods are typically structured around scoring individual points or simple pairs. Such a point- or pair-centric view not only overlooks the contextual nature of anomalies, which are defined by their deviation from a collective group, but also fails to exploit the rich supervisory signals that can be generated from the combinatorial composition of sets. Consequently, such models struggle to exploit the high-order interactions within the data, which are critical for learning discriminative representations. To address these limitations, we propose SetAD, a novel framework that reframes semi-supervised AD as a Set-level Anomaly Detection task. SetAD employs an attention-based set encoder trained via a graded learning objective, where the model learns to quantify the degree of anomalousness within an entire set. This approach directly models the complex group-level interactions that define anomalies. Furthermore, to enhance robustness and score calibration, we propose a context-calibrated anomaly scoring mechanism, which assesses a point’s anomaly score by aggregating its normalized deviations from peer behavior across multiple, diverse contextual sets. Extensive experiments on 10 real-world datasets demonstrate that SetAD significantly outperforms state-of-the-art models. Notably, we show that our model’s performance consistently improves with increasing set size, providing strong empirical support for the set-based formulation of anomaly detection.
[LG-68] FAIM: Frequency-Aware Interactive Mamba for Time Series Classification
链接: https://arxiv.org/abs/2512.07858
作者: Da Zhang,Bingyu Li,Zhiyuan Zhao,Yanhan Zhang,Junyu Gao,Feiping Nie,Xuelong Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:Time series classification (TSC) is crucial in numerous real-world applications, such as environmental monitoring, medical diagnosis, and posture recognition. TSC tasks require models to effectively capture discriminative information for accurate class identification. Although deep learning architectures excel at capturing temporal dependencies, they often suffer from high computational cost, sensitivity to noise perturbations, and susceptibility to overfitting on small-scale datasets. To address these challenges, we propose FAIM, a lightweight Frequency-Aware Interactive Mamba model. Specifically, we introduce an Adaptive Filtering Block (AFB) that leverages Fourier Transform to extract frequency-domain features from time series data. The AFB incorporates learnable adaptive thresholds to dynamically suppress noise and employs element-wise coupling of global and local semantic adaptive filtering, enabling in-depth modeling of the synergy among different frequency components. Furthermore, we design an Interactive Mamba Block (IMB) to facilitate efficient multi-granularity information interaction, balancing the extraction of fine-grained discriminative features and comprehensive global contextual information, thereby endowing FAIM with powerful and expressive representations for TSC tasks. Additionally, we incorporate a self-supervised pre-training mechanism to enhance FAIM’s understanding of complex temporal patterns and improve its robustness across various domains and high-noise scenarios. Extensive experiments on multiple benchmarks demonstrate that FAIM consistently outperforms existing state-of-the-art (SOTA) methods, achieving a superior trade-off between accuracy and efficiency and exhibiting outstanding performance.
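The AFB's core mechanism, frequency-domain filtering with learnable thresholds, can be sketched in a few lines of PyTorch. This is an illustrative reading of the block, not the paper's implementation; the shrinkage rule and shapes are assumptions:

```python
# Minimal PyTorch sketch of frequency-domain adaptive filtering in the
# spirit of the AFB: FFT the series, soft-threshold spectral magnitudes
# with a learnable per-frequency cutoff, then invert.
import torch
import torch.nn as nn

class AdaptiveFilter(nn.Module):
    def __init__(self, seq_len: int):
        super().__init__()
        n_freq = seq_len // 2 + 1
        self.threshold = nn.Parameter(torch.zeros(n_freq))  # learnable, per frequency

    def forward(self, x):                     # x: (batch, seq_len)
        spec = torch.fft.rfft(x, dim=-1)
        mag = spec.abs()
        # soft-shrinkage keeps components above the learned threshold
        keep = torch.clamp(mag - torch.nn.functional.softplus(self.threshold), min=0)
        spec = spec * (keep / (mag + 1e-8))
        return torch.fft.irfft(spec, n=x.shape[-1], dim=-1)

x = torch.randn(4, 128)
print(AdaptiveFilter(128)(x).shape)           # torch.Size([4, 128])
```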
[LG-69] SA2GFM: Enhancing Robust Graph Foundation Models with Structure-Aware Semantic Augmentation
链接: https://arxiv.org/abs/2512.07857
作者: Junhua Shi,Qingyun Sun,Haonan Yuan,Xingcheng Fu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Graph Foundation Models (GFMs) have made significant progress in various tasks, but their robustness against domain noise, structural perturbations, and adversarial attacks remains underexplored. A key limitation is the insufficient modeling of hierarchical structural semantics, which are crucial for generalization. In this paper, we propose SA^2GFM, a robust GFM framework that improves domain-adaptive representations through Structure-Aware Semantic Augmentation. First, we encode hierarchical structural priors by transforming entropy-based encoding trees into structure-aware textual prompts for feature augmentation. The enhanced inputs are processed by a self-supervised Information Bottleneck mechanism that distills robust, transferable representations via structure-guided compression. To address negative transfer in cross-domain adaptation, we introduce an expert adaptive routing mechanism, combining a mixture-of-experts architecture with a null expert design. For efficient downstream adaptation, we propose a fine-tuning module that optimizes hierarchical structures through joint intra- and inter-community structure learning. Extensive experiments demonstrate that SA^2GFM outperforms 9 state-of-the-art baselines in terms of effectiveness and robustness against random noise and adversarial perturbations for node and graph classification.
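One plausible reading of the null-expert design is a router that can assign weight to an expert whose output is identically zero, letting it opt out of transfer when every source expert would hurt. The sketch below illustrates that mechanism only; all architectural details are assumptions:

```python
# Sketch of a mixture-of-experts layer with a "null expert" whose output is
# zero, so the router can opt out of transfer for some inputs (one reading
# of the null-expert design; not the paper's architecture).
import torch
import torch.nn as nn

class MoEWithNullExpert(nn.Module):
    def __init__(self, dim: int, n_experts: int = 3):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.gate = nn.Linear(dim, n_experts + 1)   # +1 logit for the null expert

    def forward(self, x):                            # x: (batch, dim)
        w = self.gate(x).softmax(-1)                 # routing weights
        outs = torch.stack([e(x) for e in self.experts], dim=1)
        null = torch.zeros_like(x).unsqueeze(1)      # the null expert contributes 0
        outs = torch.cat([outs, null], dim=1)
        return (w.unsqueeze(-1) * outs).sum(1)

print(MoEWithNullExpert(16)(torch.randn(2, 16)).shape)  # torch.Size([2, 16])
```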
[LG-70] Medical Test-free Disease Detection Based on Big Data
链接: https://arxiv.org/abs/2512.07856
作者: Haokun Zhao,Yingzhe Bai,Qingyang Xu,Lixin Zhou,Jianxin Chen,Jicong Fan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Accurate disease detection is of paramount importance for effective medical treatment and patient care. However, the process of disease detection is often associated with extensive medical testing and considerable costs, making it impractical to perform all possible medical tests on a patient to diagnose or predict hundreds or thousands of diseases. In this work, we propose Collaborative Learning for Disease Detection (CLDD), a novel graph-based deep learning model that formulates disease detection as a collaborative learning task by exploiting associations among diseases and similarities among patients adaptively. CLDD integrates patient-disease interactions and demographic features from electronic health records to detect hundreds or thousands of diseases for every patient, with little to no reliance on the corresponding medical tests. Extensive experiments on a processed version of the MIMIC-IV dataset comprising 61,191 patients and 2,000 diseases demonstrate that CLDD consistently outperforms representative baselines across multiple metrics, achieving a 6.33% improvement in recall and 7.63% improvement in precision. Furthermore, case studies on individual patients illustrate that CLDD can successfully recover masked diseases within its top-ranked predictions, demonstrating both interpretability and reliability in disease prediction. By reducing diagnostic costs and improving accessibility, CLDD holds promise for large-scale disease screening and social health security.
[LG-71] HSTMixer: A Hierarchical MLP-Mixer for Large-Scale Traffic Forecasting
链接: https://arxiv.org/abs/2512.07854
作者: Yongyao Wang,Jingyuan Wang,Xie Yu,Jiahao Ji,Chao Li
类目: Machine Learning (cs.LG)
*备注: 10 pages, 9 figures
Abstract:Traffic forecasting task is significant to modern urban management. Recently, there is growing attention on large-scale forecasting, as it better reflects the complexity of real-world traffic networks. However, existing models often exhibit quadratic computational complexity, making them impractical for large-scale real-world scenarios. In this paper, we propose a novel framework, Hierarchical Spatio-Temporal Mixer (HSTMixer), which leverages an all-MLP architecture for efficient and effective large-scale traffic forecasting. HSTMixer employs a hierarchical spatiotemporal mixing block to extract multi-resolution features through bottom-up aggregation and top-down propagation. Furthermore, an adaptive region mixer generates transformation matrices based on regional semantics, enabling our model to dynamically capture evolving spatiotemporal patterns for different regions. Extensive experiments conducted on four large-scale real-world datasets demonstrate that the proposed method not only achieves state-of-the-art performance but also exhibits competitive computational efficiency.
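The all-MLP idea behind such models alternates mixing along the spatial (node) axis and the temporal axis. A generic sketch of one such mixing block, without HSTMixer's hierarchical aggregation or adaptive region mixer:

```python
# Generic spatio-temporal MLP-Mixer block: one MLP mixes along the node
# (spatial) axis, another along the time axis. A sketch of the all-MLP
# idea only; sizes are illustrative.
import torch
import torch.nn as nn

class STMixerBlock(nn.Module):
    def __init__(self, n_nodes: int, n_steps: int, hidden: int = 64):
        super().__init__()
        self.spatial = nn.Sequential(nn.Linear(n_nodes, hidden), nn.GELU(),
                                     nn.Linear(hidden, n_nodes))
        self.temporal = nn.Sequential(nn.Linear(n_steps, hidden), nn.GELU(),
                                      nn.Linear(hidden, n_steps))

    def forward(self, x):                     # x: (batch, n_steps, n_nodes)
        x = x + self.spatial(x)               # mix across nodes
        x = x + self.temporal(x.transpose(1, 2)).transpose(1, 2)  # mix across time
        return x

print(STMixerBlock(207, 12)(torch.randn(8, 12, 207)).shape)  # torch.Size([8, 12, 207])
```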
[LG-72] RaX-Crash: A Resource Efficient and Explainable Small Model Pipeline with an Application to City Scale Injury Severity Prediction
链接: https://arxiv.org/abs/2512.07848
作者: Di Zhu,Chen Xie,Ziwei Wang,Haoyun Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:New York City reports over one hundred thousand motor vehicle collisions each year, creating substantial injury and public health burden. We present RaX-Crash, a resource efficient and explainable small model pipeline for structured injury severity prediction on the official NYC Motor Vehicle Collisions dataset. RaX-Crash integrates three linked tables with tens of millions of records, builds a unified feature schema in partitioned storage, and trains compact tree based ensembles (Random Forest and XGBoost) on engineered tabular features, which are compared against locally deployed small language models (SLMs) prompted with textual summaries. On a temporally held out test set, XGBoost and Random Forest achieve accuracies of 0.7828 and 0.7794, clearly outperforming SLMs (0.594 and 0.496); class imbalance analysis shows that simple class weighting improves fatal recall with modest accuracy trade offs, and SHAP attribution highlights human vulnerability factors, timing, and location as dominant drivers of predicted severity. Overall, RaX-Crash indicates that interpretable small model ensembles remain strong baselines for city scale injury analytics, while hybrid pipelines that pair tabular predictors with SLM generated narratives improve communication without sacrificing scalability.
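The class-weighting experiment is straightforward to reproduce in spirit. A sketch on synthetic imbalanced data, where weighting the rare class trades a little accuracy for recall on it (features and class ratio are illustrative, not the NYC schema):

```python
# Sketch of the class-weighting experiment on synthetic, imbalanced data:
# weighting the rare "fatal" class trades a little accuracy for recall.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 10))
y = (X[:, 0] + 0.5 * rng.normal(size=5000) > 2.2).astype(int)  # ~2% positives
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0, stratify=y)

for weights in (None, "balanced"):
    clf = RandomForestClassifier(n_estimators=200, class_weight=weights,
                                 random_state=0).fit(Xtr, ytr)
    pred = clf.predict(Xte)
    print(weights, "acc=%.3f recall=%.3f"
          % (accuracy_score(yte, pred), recall_score(yte, pred)))
```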
[LG-73] CarBench: A Comprehensive Benchmark for Neural Surrogates on High-Fidelity 3D Car Aerodynamics
链接: https://arxiv.org/abs/2512.07847
作者: Mohamed Elrefaie,Dule Shu,Matt Klenk,Faez Ahmed
类目: Machine Learning (cs.LG)
*备注:
Abstract:Benchmarking has been the cornerstone of progress in computer vision, natural language processing, and the broader deep learning domain, driving algorithmic innovation through standardized datasets and reproducible evaluation protocols. The growing availability of large-scale Computational Fluid Dynamics (CFD) datasets has opened new opportunities for applying machine learning to aerodynamic and engineering design. Yet, despite this progress, there exists no standardized benchmark for large-scale numerical simulations in engineering design. In this work, we introduce CarBench, the first comprehensive benchmark dedicated to large-scale 3D car aerodynamics, performing a large-scale evaluation of state-of-the-art models on DrivAerNet++, the largest public dataset for automotive aerodynamics, containing over 8,000 high-fidelity car simulations. We assess eleven architectures spanning neural operator methods (e.g., Fourier Neural Operator), geometric deep learning (PointNet, RegDGCNN, PointMAE, PointTransformer), transformer-based neural solvers (Transolver, Transolver++, AB-UPT), and implicit field networks (TripNet). Beyond standard interpolation tasks, we perform cross-category experiments in which transformer-based solvers trained on a single car archetype are evaluated on unseen categories. Our analysis covers predictive accuracy, physical consistency, computational efficiency, and statistical uncertainty. To accelerate progress in data-driven engineering, we open-source the benchmark framework, including training pipelines, uncertainty estimation routines based on bootstrap resampling, and pretrained model weights, establishing the first reproducible foundation for large-scale learning from high-fidelity CFD simulations, available at this https URL.
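Bootstrap resampling for uncertainty on a surrogate's test error, as the benchmark's routines are described, can be sketched as follows; the per-sample errors here are synthetic placeholders:

```python
# Sketch of bootstrap uncertainty estimation for a surrogate's test error:
# resample the test set with replacement and report a percentile interval
# on the mean error. Per-sample errors are synthetic placeholders.
import numpy as np

rng = np.random.default_rng(0)
errors = np.abs(rng.normal(0.02, 0.01, size=800))   # per-sample |error|, synthetic

boot_means = [rng.choice(errors, size=len(errors), replace=True).mean()
              for _ in range(2000)]
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean error {errors.mean():.4f}, 95% CI [{lo:.4f}, {hi:.4f}]")
```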
[LG-74] Heuristics for Combinatorial Optimization via Value-based Reinforcement Learning: A Unified Framework and Analysis
链接: https://arxiv.org/abs/2512.08601
作者: Orit Davidovich,Shimrit Shtern,Segev Wasserkrug,Nimrod Megiddo
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:Since the 1990s, considerable empirical work has been carried out to train statistical models, such as neural networks (NNs), as learned heuristics for combinatorial optimization (CO) problems. When successful, such an approach eliminates the need for experts to design heuristics per problem type. Due to their structure, many hard CO problems are amenable to treatment through reinforcement learning (RL). Indeed, we find a wealth of literature training NNs using value-based, policy gradient, or actor-critic approaches, with promising results, both in terms of empirical optimality gaps and inference runtimes. Nevertheless, there has been a paucity of theoretical work undergirding the use of RL for CO problems. To this end, we introduce a unified framework to model CO problems through Markov decision processes (MDPs) and solve them using RL techniques. We provide easy-to-test assumptions under which CO problems can be formulated as equivalent undiscounted MDPs that provide optimal solutions to the original CO problems. Moreover, we establish conditions under which value-based RL techniques converge to approximate solutions of the CO problem with a guarantee on the associated optimality gap. Our convergence analysis provides: (1) a sufficient rate of increase in batch size and projected gradient descent steps at each RL iteration; (2) the resulting optimality gap in terms of problem parameters and targeted RL accuracy; and (3) the importance of a choice of state-space embedding. Together, our analysis illuminates the success (and limitations) of the celebrated deep Q-learning algorithm in this problem context.
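A toy instance of the framework: cast 0/1 knapsack as an undiscounted MDP whose state is (item index, remaining capacity), and solve it with tabular Q-learning. Purely didactic; the paper's analysis concerns function approximation and convergence rates:

```python
# Toy illustration: 0/1 knapsack as an undiscounted MDP solved by tabular
# Q-learning. State = (item index, remaining capacity); actions = take/skip.
import numpy as np

values, weights, cap = [6, 10, 12], [1, 2, 3], 5
Q = {}
rng = np.random.default_rng(0)

def step(s, a):
    i, c = s
    if a == 1 and weights[i] <= c:          # invalid "take" falls through to skip
        return (i + 1, c - weights[i]), values[i]
    return (i + 1, c), 0

for _ in range(5000):                        # Q-learning episodes
    s = (0, cap)
    while s[0] < len(values):
        if rng.random() < 0.2:               # epsilon-greedy exploration
            a = int(rng.integers(2))
        else:
            a = int(Q.get((s, 1), 0.0) > Q.get((s, 0), 0.0))
        s2, r = step(s, a)
        best_next = max(Q.get((s2, 0), 0.0), Q.get((s2, 1), 0.0))
        q_old = Q.get((s, a), 0.0)
        Q[(s, a)] = q_old + 0.1 * (r + best_next - q_old)   # undiscounted target
        s = s2

s, total = (0, cap), 0                       # greedy rollout of the learned policy
while s[0] < len(values):
    a = int(Q.get((s, 1), 0.0) > Q.get((s, 0), 0.0))
    s, r = step(s, a)
    total += r
print("greedy value:", total)                # optimal is 22 (items 2 and 3)
```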
[LG-75] Minimax and Bayes Optimal Adaptive Experimental Design for Treatment Choice
链接: https://arxiv.org/abs/2512.08513
作者: Masahiro Kato
类目: Econometrics (econ.EM); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:
Abstract:We consider an adaptive experiment for treatment choice and design a minimax and Bayes optimal adaptive experiment with respect to regret. Given binary treatments, the experimenter’s goal is to choose the treatment with the highest expected outcome through an adaptive experiment, in order to maximize welfare. We consider adaptive experiments that consist of two phases, the treatment allocation phase and the treatment choice phase. The experiment starts with the treatment allocation phase, where the experimenter allocates treatments to experimental subjects to gather observations. During this phase, the experimenter can adaptively update the allocation probabilities using the observations obtained in the experiment. After the allocation phase, the experimenter proceeds to the treatment choice phase, where one of the treatments is selected as the best. For this adaptive experimental procedure, we propose an adaptive experiment that splits the treatment allocation phase into two stages, where we first estimate the standard deviations and then allocate each treatment proportionally to its standard deviation. We show that this experiment, often referred to as Neyman allocation, is minimax and Bayes optimal in the sense that its regret upper bounds exactly match the lower bounds that we derive. To show this optimality, we derive minimax and Bayes lower bounds for the regret using change-of-measure arguments. Then, we evaluate the corresponding upper bounds using the central limit theorem and large deviation bounds.
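The proposed two-stage design is simple to simulate: spend a pilot budget estimating each arm's standard deviation, then allocate the remainder proportionally to those estimates (Neyman allocation) before picking the arm with the higher mean. A sketch with illustrative parameters:

```python
# Simulation sketch of the two-stage design: pilot to estimate standard
# deviations, then Neyman allocation of the remaining budget.
import numpy as np

rng = np.random.default_rng(0)
means, sds, budget, pilot = (0.0, 0.1), (1.0, 3.0), 2000, 100

draw = lambda arm, n: rng.normal(means[arm], sds[arm], n)
pilot_data = [draw(a, pilot) for a in (0, 1)]
sd_hat = np.array([d.std(ddof=1) for d in pilot_data])

n_rest = budget - 2 * pilot
n1 = int(n_rest * sd_hat[1] / sd_hat.sum())    # proportional to estimated sd
n0 = n_rest - n1
data = [np.concatenate([pilot_data[0], draw(0, n0)]),
        np.concatenate([pilot_data[1], draw(1, n1)])]
print("chosen arm:", int(data[1].mean() > data[0].mean()),
      "allocations:", pilot + n0, pilot + n1)
```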
[LG-76] Data-Efficient Learning of Anomalous Diffusion with Wavelet Representations: Enabling Direct Learning from Experimental Trajectories
链接: https://arxiv.org/abs/2512.08510
作者: Gongyi Wang,Yu Zhang,Zihan Huang
类目: Biological Physics (physics.bio-ph); Soft Condensed Matter (cond-mat.soft); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
*备注: 23 pages, 16 figures
Abstract:Machine learning (ML) has become a versatile tool for analyzing anomalous diffusion trajectories, yet most existing pipelines are trained on large collections of simulated data. In contrast, experimental trajectories, such as those from single-particle tracking (SPT), are typically scarce and may differ substantially from the idealized models used for simulation, leading to degradation or even breakdown of performance when ML methods are applied to real data. To address this mismatch, we introduce a wavelet-based representation of anomalous diffusion that enables data-efficient learning directly from experimental recordings. This representation is constructed by applying six complementary wavelet families to each trajectory and combining the resulting wavelet modulus scalograms. We first evaluate the wavelet representation on simulated trajectories from the andi-datasets benchmark, where it clearly outperforms both feature-based and trajectory-based methods with as few as 1000 training trajectories and still retains an advantage on large training sets. We then use this representation to learn directly from experimental SPT trajectories of fluorescent beads diffusing in F-actin networks, where the wavelet representation remains superior to existing alternatives for both diffusion-exponent regression and mesh-size classification. In particular, when predicting the diffusion exponents of experimental trajectories, a model trained on 1200 experimental tracks using the wavelet representation achieves significantly lower errors than state-of-the-art deep learning models trained purely on 10^6 simulated trajectories. We associate this data efficiency with the emergence of distinct scale fingerprints disentangling underlying diffusion mechanisms in the wavelet spectra.
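A scalogram-stack representation is easy to assemble with PyWavelets. The sketch below uses two continuous wavelet families as stand-ins for the six the paper combines; scales and trajectory are illustrative:

```python
# Sketch of building a wavelet-scalogram representation of one trajectory
# with PyWavelets; two continuous families stand in for the paper's six.
import numpy as np
import pywt

rng = np.random.default_rng(0)
traj = np.cumsum(rng.normal(size=256))            # a toy 1-D trajectory

scales = np.arange(1, 33)
channels = []
for family in ("morl", "mexh"):                   # illustrative wavelet families
    coeffs, _ = pywt.cwt(traj, scales, family)
    channels.append(np.abs(coeffs))               # modulus scalogram
rep = np.stack(channels)                          # (families, scales, time)
print(rep.shape)                                  # (2, 32, 256), e.g. as CNN input
```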
[LG-77] Fused Gromov-Wasserstein Contrastive Learning for Effective Enzyme-Reaction Screening
链接: https://arxiv.org/abs/2512.08508
作者: Gengmo Zhou,Feng Yu,Wenda Wang,Zhifeng Gao,Guolin Ke,Zhewei Wei,Zhen Wang
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注:
Abstract:Enzymes are crucial catalysts that enable a wide range of biochemical reactions. Efficiently identifying specific enzymes from vast protein libraries is essential for advancing biocatalysis. Traditional computational methods for enzyme screening and retrieval are time-consuming and resource-intensive. Recently, deep learning approaches have shown promise. However, these methods focus solely on the interaction between enzymes and reactions, overlooking the inherent hierarchical relationships within each domain. To address these limitations, we introduce FGW-CLIP, a novel contrastive learning framework based on optimizing the fused Gromov-Wasserstein distance. FGW-CLIP incorporates multiple alignments, including inter-domain alignment between reactions and enzymes and intra-domain alignment within enzymes and reactions. By introducing a tailored regularization term, our method minimizes the Gromov-Wasserstein distance between enzyme and reaction spaces, which enhances information integration across these domains. Extensive evaluations demonstrate the superiority of FGW-CLIP in challenging enzyme-reaction tasks. On the widely-used EnzymeMap benchmark, FGW-CLIP achieves state-of-the-art performance in enzyme virtual screening, as measured by BEDROC and EF metrics. Moreover, FGW-CLIP consistently outperforms across all three splits of ReactZyme, the largest enzyme-reaction benchmark, demonstrating robust generalization to novel enzymes and reactions. These results position FGW-CLIP as a promising framework for enzyme discovery in complex biochemical settings, with strong adaptability across diverse screening scenarios.
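The fused Gromov-Wasserstein distance itself is available in the POT library. A sketch computing FGW between two small synthetic "graphs", assuming POT's current ot.gromov.fused_gromov_wasserstein interface; in the paper the distance enters a contrastive training objective rather than being evaluated once:

```python
# Sketch of a fused Gromov-Wasserstein distance between two small "graphs"
# (node features plus structure matrices) using the POT library.
import numpy as np
import ot

rng = np.random.default_rng(0)
f1, f2 = rng.normal(size=(5, 8)), rng.normal(size=(6, 8))   # node features
C1 = np.abs(rng.normal(size=(5, 5))); C1 = (C1 + C1.T) / 2  # structure matrices
C2 = np.abs(rng.normal(size=(6, 6))); C2 = (C2 + C2.T) / 2
M = ot.dist(f1, f2)                                         # feature cost
p, q = ot.unif(5), ot.unif(6)                               # uniform node weights

T, log = ot.gromov.fused_gromov_wasserstein(
    M, C1, C2, p, q, alpha=0.5, log=True)    # alpha trades feature vs. structure cost
print("FGW distance:", log["fgw_dist"])
```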
[LG-78] Learned iterative networks: An operator learning perspective
链接: https://arxiv.org/abs/2512.08444
作者: Andreas Hauptmann,Ozan Öktem
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG); Functional Analysis (math.FA); Numerical Analysis (math.NA); Optimization and Control (math.OC)
*备注:
Abstract:Learned image reconstruction has become a pillar in computational imaging and inverse problems. Among the most successful approaches are learned iterative networks, which are formulated by unrolling classical iterative optimisation algorithms for solving variational problems. While the underlying algorithm is usually formulated in the functional analytic setting, learned approaches are often viewed as purely discrete. In this chapter we present a unified operator view for learned iterative networks. Specifically, we formulate a learned reconstruction operator, defining how to compute, and separately the learning problem, which defines what to compute. In this setting we present common approaches and show that many approaches are closely related in their core. We review linear as well as nonlinear inverse problems in this framework and present a short numerical study to conclude.
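The unrolling template the chapter formalizes can be sketched for a linear inverse problem y = Ax + noise: each unrolled iteration takes a gradient step on the data fit and adds a small learned corrector. Depth, step sizes, and corrector width below are illustrative:

```python
# Minimal unrolled gradient-descent network for y = Ax + noise: each
# "iteration" is a gradient step on the data fit plus a learned corrector.
import torch
import torch.nn as nn

class UnrolledNet(nn.Module):
    def __init__(self, A: torch.Tensor, n_iters: int = 5):
        super().__init__()
        self.A, self.n_iters = A, n_iters
        self.step = nn.Parameter(torch.full((n_iters,), 0.1))  # learned step sizes
        self.correct = nn.ModuleList(
            nn.Sequential(nn.Linear(A.shape[1], 64), nn.ReLU(),
                          nn.Linear(64, A.shape[1])) for _ in range(n_iters))

    def forward(self, y):                      # y: (batch, m)
        x = torch.zeros(y.shape[0], self.A.shape[1])
        for k in range(self.n_iters):
            grad = (x @ self.A.T - y) @ self.A          # gradient of 0.5||Ax - y||^2
            x = x - self.step[k] * grad + self.correct[k](x)
        return x

A = torch.randn(20, 40)
print(UnrolledNet(A)(torch.randn(8, 20)).shape)          # torch.Size([8, 40])
```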
[LG-79] Magnetic activity of ultracool dwarfs in the LAMOST DR11
链接: https://arxiv.org/abs/2512.08305
作者: Yue Xiang,Shenghong Gu,Dongtao Cao
类目: Solar and Stellar Astrophysics (astro-ph.SR); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注: 13 pages, 10 figures, accepted for publication in ApJ
Abstract:Ultracool dwarfs consist of lowest-mass stars and brown dwarfs. Their interior is fully convective, different from that of the partly-convective Sun-like stars. Magnetic field generation process beneath the surface of ultracool dwarfs is still poorly understood and controversial. To increase samples of active ultracool dwarfs significantly, we have identified 962 ultracool dwarfs in the latest LAMOST data release, DR11. We also simulate the Chinese Space Station Survey Telescope (CSST) low-resolution slitless spectra by degrading the LAMOST spectra. A semi-supervised machine learning approach with an autoencoder model is built to identify ultracool dwarfs with the simulated CSST spectra, which demonstrates the capability of the CSST all-sky slitless spectroscopic survey on the detection of ultracool dwarfs. Magnetic activity of the ultracool dwarfs is investigated by using the Hα line emission as a proxy. The rotational periods of 82 ultracool dwarfs are derived based on the Kepler/K2 light curves. We also derive the activity-rotation relation of the ultracool dwarfs, which is saturated around a Rossby number of 0.12.
[LG-80] Worst-case generation via minimax optimization in Wasserstein space
链接: https://arxiv.org/abs/2512.08176
作者: Xiuyuan Cheng,Yao Xie,Linglingzhi Zhu,Yunqin Zhu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:Worst-case generation plays a critical role in evaluating robustness and stress-testing systems under distribution shifts, in applications ranging from machine learning models to power grids and medical prediction systems. We develop a generative modeling framework for worst-case generation for a pre-specified risk, based on min-max optimization over continuous probability distributions, namely the Wasserstein space. Unlike traditional discrete distributionally robust optimization approaches, which often suffer from scalability issues, limited generalization, and costly worst-case inference, our framework exploits the Brenier theorem to characterize the least favorable (worst-case) distribution as the pushforward of a transport map from a continuous reference measure, enabling a continuous and expressive notion of risk-induced generation beyond classical discrete DRO formulations. Based on the min-max formulation, we propose a Gradient Descent Ascent (GDA)-type scheme that updates the decision model and the transport map in a single loop, establishing global convergence guarantees under mild regularity assumptions and possibly without convexity-concavity. We also propose to parameterize the transport map using a neural network that can be trained simultaneously with the GDA iterations by matching the transported training samples, thereby achieving a simulation-free approach. The efficiency of the proposed method as a risk-induced worst-case generator is validated by numerical experiments on synthetic and image data.
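The single-loop GDA scheme can be illustrated on a toy min-max problem where a shift parameter stands in for the transport map. This is a caricature of the setting (the real objective lives over Wasserstein space with a neural transport map), with an explicit penalty keeping the inner problem concave:

```python
# Single-loop gradient descent-ascent sketch on a toy min-max objective:
# theta (decision) descends while phi (a shift standing in for the
# transport map) ascends a penalized risk.
import torch

theta = torch.zeros(2, requires_grad=True)     # decision model
phi = torch.zeros(2, requires_grad=True)       # surrogate perturbation parameter
opt_min = torch.optim.SGD([theta], lr=0.05)
opt_max = torch.optim.SGD([phi], lr=0.05, maximize=True)

data = torch.randn(256, 2)
for _ in range(200):
    shifted = data + phi                       # stand-in for a transport map
    loss = ((shifted - theta) ** 2).mean() - 2.0 * (phi ** 2).sum()  # penalized risk
    opt_min.zero_grad(); opt_max.zero_grad()
    loss.backward()
    opt_min.step(); opt_max.step()             # one descent and one ascent step
print(theta.detach(), phi.detach())            # theta near the data mean, phi near 0
```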
[LG-81] Provable Diffusion Posterior Sampling for Bayesian Inversion
链接: https://arxiv.org/abs/2512.08022
作者: Jinyuan Chang,Chenguang Duan,Yuling Jiao,Ruoxuan Li,Jerry Zhijian Yang,Cheng Yuan
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA); Probability (math.PR); Statistics Theory (math.ST)
*备注:
Abstract:This paper proposes a novel diffusion-based posterior sampling method within a plug-and-play (PnP) framework. Our approach constructs a probability transport from an easy-to-sample terminal distribution to the target posterior, using a warm-start strategy to initialize the particles. To approximate the posterior score, we develop a Monte Carlo estimator in which particles are generated using Langevin dynamics, avoiding the heuristic approximations commonly used in prior work. The score governing the Langevin dynamics is learned from data, enabling the model to capture rich structural features of the underlying prior distribution. On the theoretical side, we provide non-asymptotic error bounds, showing that the method converges even for complex, multi-modal target posterior distributions. These bounds explicitly quantify the errors arising from posterior score estimation, the warm-start initialization, and the posterior sampling procedure. Our analysis further clarifies how the prior score-matching error and the condition number of the Bayesian inverse problem influence overall performance. Finally, we present numerical experiments demonstrating the effectiveness of the proposed method across a range of inverse problems.
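The particle-generation primitive, Langevin dynamics driven by a score, is easy to sketch on a toy target where the score is known in closed form; the paper instead uses a learned score within its Monte Carlo posterior-score estimator:

```python
# Sketch of unadjusted Langevin dynamics targeting a 2-D Gaussian; the
# closed-form score here stands in for the learned score in the paper.
import numpy as np

rng = np.random.default_rng(0)
mu, prec = np.array([1.0, -1.0]), np.eye(2)          # target N(mu, I)
score = lambda x: -prec @ (x - mu)                    # gradient of log density

x, step = np.zeros(2), 0.05
samples = []
for _ in range(5000):
    x = x + step * score(x) + np.sqrt(2 * step) * rng.normal(size=2)
    samples.append(x.copy())
print(np.mean(samples[1000:], axis=0))                # approx [1, -1] after burn-in
```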
[LG-82] Conformal Defects in Neural Network Field Theories
链接: https://arxiv.org/abs/2512.07946
作者: Pietro Capuozzo,Brandon Robinson,Benjamin Suzzoni
类目: High Energy Physics - Theory (hep-th); Machine Learning (cs.LG)
*备注: 23 pages, 1 figure
Abstract:Neural Network Field Theories (NN-FTs) represent a novel construction of arbitrary field theories, including those of conformal fields, through the specification of the network architecture and prior distribution for the network parameters. In this work, we present a formalism for the construction of conformally invariant defects in these NN-FTs. We demonstrate this new formalism in two toy models of NN scalar field theories. We develop an NN interpretation of an expansion akin to the defect OPE in two-point correlation functions in these models.
[LG-83] Integrating LSTM Networks with Neural Levy Processes for Financial Forecasting
链接: https://arxiv.org/abs/2512.07860
作者: Mohammed Alruqimi,Luca Di Persio
类目: Statistical Finance (q-fin.ST); Machine Learning (cs.LG)
*备注:
Abstract:This paper investigates an optimal integration of deep learning with financial models for robust asset price forecasting. Specifically, we developed a hybrid framework combining a Long Short-Term Memory (LSTM) network with the Merton-Lévy jump-diffusion model. To optimise this framework, we employed the Grey Wolf Optimizer (GWO) for the LSTM hyperparameter tuning, and we explored three calibration methods for the Merton-Levy model parameters: Artificial Neural Networks (ANNs), the Marine Predators Algorithm (MPA), and the PyTorch-based TorchSDE library. To evaluate the predictive performance of our hybrid model, we compared it against several benchmark models, including a standard LSTM and an LSTM combined with the Fractional Heston model. This evaluation used three real-world financial datasets: Brent oil prices, the STOXX 600 index, and the IT40 index. Performance was assessed using standard metrics, including Mean Squared Error (MSE), Mean Absolute Error(MAE), Mean Squared Percentage Error (MSPE), and the coefficient of determination (R2). Our experimental results demonstrate that the hybrid model, combining a GWO-optimized LSTM network with the Levy-Merton Jump-Diffusion model calibrated using an ANN, outperformed the base LSTM model and all other models developed in this study.
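The Merton jump-diffusion component can be simulated directly: geometric Brownian motion plus compound-Poisson lognormal jumps. Parameters below are illustrative; in the paper they are calibrated (via ANN, MPA, or torchsde) and the model is paired with a GWO-tuned LSTM:

```python
# Sketch of one Merton jump-diffusion path: GBM plus compound-Poisson
# lognormal jumps, simulated on a daily grid with illustrative parameters.
import numpy as np

rng = np.random.default_rng(0)
S0, mu, sigma = 100.0, 0.05, 0.2            # spot, drift, diffusion
lam, mJ, sJ = 0.5, -0.05, 0.1               # jump intensity and log-jump size
T, n = 1.0, 252
dt = T / n

logS = np.log(S0) + np.zeros(n + 1)
for t in range(n):
    jumps = rng.poisson(lam * dt)                       # number of jumps in [t, t+dt)
    jump_term = rng.normal(mJ, sJ, jumps).sum()         # sum of log-jump sizes
    logS[t + 1] = (logS[t] + (mu - 0.5 * sigma**2) * dt
                   + sigma * np.sqrt(dt) * rng.normal() + jump_term)
print("terminal price:", np.exp(logS[-1]))
```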
信息检索
[IR-0] VI-MMRec: Similarity-Aware Training Cost-free Virtual User-Item Interactions for Multimodal Recommendation KDD2026
链接: https://arxiv.org/abs/2512.08702
作者: Jinfeng Xu,Zheyu Chen,Shuo Yang,Jinze Li,Zitong Wan,Hewei Wang,Weijie Liu,Yijie Li,Edith C. H. Ngai
类目: Information Retrieval (cs.IR)
*备注: Accepted by KDD 2026
Abstract:Although existing multimodal recommendation models have shown promising performance, their effectiveness continues to be limited by the pervasive data sparsity problem. This problem arises because users typically interact with only a small subset of available items, leading existing models to arbitrarily treat unobserved items as negative samples. To this end, we propose VI-MMRec, a model-agnostic and training cost-free framework that enriches sparse user-item interactions via similarity-aware virtual user-item interactions. These virtual interactions are constructed based on modality-specific feature similarities of user-interacted items. Specifically, VI-MMRec introduces two different strategies: (1) Overlay, which independently aggregates modality-specific similarities to preserve modality-specific user preferences, and (2) Synergistic, which holistically fuses cross-modal similarities to capture complementary user preferences. To ensure high-quality augmentation, we design a statistically informed weight allocation mechanism that adaptively assigns weights to virtual user-item interactions based on dataset-specific modality relevance. As a plug-and-play framework, VI-MMRec seamlessly integrates with existing models to enhance their performance without modifying their core architecture. Its flexibility allows it to be easily incorporated into various existing models, maximizing performance with minimal implementation effort. Moreover, VI-MMRec introduces no additional overhead during training, making it significantly advantageous for practical deployment. Comprehensive experiments conducted on six real-world datasets using seven state-of-the-art multimodal recommendation models validate the effectiveness of our VI-MMRec.
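The Overlay-style construction can be sketched with plain matrix algebra: score unobserved items by their modality-feature similarity to a user's interacted items and add the top-scoring ones as down-weighted virtual interactions. Shapes, the top-k rule, and the 0.5 weight are illustrative assumptions:

```python
# Sketch of similarity-aware virtual interactions for one modality:
# score unobserved items by affinity to a user's interacted items, then
# add the top-k as down-weighted virtual links.
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, d = 4, 10, 16
R = (rng.random((n_users, n_items)) < 0.2).astype(float)   # sparse interactions
V = rng.normal(size=(n_items, d))                           # one modality's item features
V /= np.linalg.norm(V, axis=1, keepdims=True)

sim = V @ V.T                                               # item-item cosine similarity
scores = R @ sim                                            # affinity to interacted items
scores[R > 0] = -np.inf                                     # consider unobserved items only
top = np.argsort(-scores, axis=1)[:, :2]                    # 2 virtual items per user

R_virtual = R.copy()
for u, items in enumerate(top):
    R_virtual[u, items] = 0.5                               # down-weighted virtual links
print(int(R.sum()), "->", float(R_virtual.sum()))
```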
[IR-1] Exploiting the Randomness of Large Language Models (LLMs) in Text Classification Tasks: Locating Privileged Documents in Legal Matters
链接: https://arxiv.org/abs/2512.08083
作者: Keith Huffman,Jianping Zhang,Nathaniel Huber-Fliflet,Fusheng Wei,Peter Gronvall
类目: Information Retrieval (cs.IR)
*备注:
Abstract:In legal matters, text classification models are most often used to filter through large datasets in search of documents that meet certain pre-selected criteria like relevance to a certain subject matter, such as legally privileged communications and attorney-directed documents. In this context, large language models have demonstrated strong performance. This paper presents an empirical study investigating the role of randomness in LLM-based classification for attorney-client privileged document detection, focusing on four key dimensions: (1) the effectiveness of LLMs in identifying legally privileged documents, (2) the influence of randomness control parameters on classification outputs, (3) their impact on overall classification performance, and (4) a methodology for leveraging randomness to enhance accuracy. Experimental results showed that LLMs can identify privileged documents effectively, randomness control parameters have minimal impact on classification performance, and importantly, our developed methodology for leveraging randomness can have a significant impact on improving accuracy. Notably, this methodology that leverages randomness could also enhance a corporation’s confidence in an LLM’s output when incorporated into its sanctions-compliance processes. As organizations increasingly rely on LLMs to augment compliance workflows, reducing output variability helps build internal and regulatory confidence in LLM-derived sanctions-screening decisions.
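One simple way to exploit this randomness is repeated sampling with a majority vote, where the agreement rate doubles as a confidence signal. The classify_once stub below is hypothetical; in practice it would call an LLM at nonzero temperature:

```python
# Sketch of leveraging sampling randomness via majority voting. The
# classify_once stub is a hypothetical stand-in for a stochastic LLM call.
from collections import Counter
import random

def classify_once(doc: str) -> str:
    """Hypothetical stand-in for a single stochastic LLM classification."""
    return random.choices(["privileged", "not_privileged"], weights=[0.7, 0.3])[0]

def classify_majority(doc: str, n_runs: int = 7) -> tuple[str, float]:
    votes = Counter(classify_once(doc) for _ in range(n_runs))
    label, count = votes.most_common(1)[0]
    return label, count / n_runs            # agreement rate doubles as confidence

random.seed(0)
print(classify_majority("email thread re: outside counsel advice"))
```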
[IR-2] Leveraging Machine Learning and Large Language Models for Automated Image Clustering and Description in Legal Discovery
链接: https://arxiv.org/abs/2512.08079
作者: Qiang Mao,Fusheng Wei,Robert Neary,Charles Wang,Han Qin,Jianping Zhang,Nathaniel Huber-Fliflet
类目: Information Retrieval (cs.IR)
*备注:
Abstract:The rapid increase in digital image creation and retention presents substantial challenges during legal discovery, digital archive, and content management. Corporations and legal teams must organize, analyze, and extract meaningful insights from large image collections under strict time pressures, making manual review impractical and costly. These demands have intensified interest in automated methods that can efficiently organize and describe large-scale image datasets. This paper presents a systematic investigation of automated cluster description generation through the integration of image clustering, image captioning, and large language models (LLMs). We apply K-means clustering to group images into 20 visually coherent clusters and generate base captions using the Azure AI Vision API. We then evaluate three critical dimensions of the cluster description process: (1) image sampling strategies, comparing random, centroid-based, stratified, hybrid, and density-based sampling against using all cluster images; (2) prompting techniques, contrasting standard prompting with chain-of-thought prompting; and (3) description generation methods, comparing LLM-based generation with traditional TF-IDF and template-based approaches. We assess description quality using semantic similarity and coverage metrics. Results show that strategic sampling with 20 images per cluster performs comparably to exhaustive inclusion while significantly reducing computational cost, with only stratified sampling showing modest degradation. LLM-based methods consistently outperform TF-IDF baselines, and standard prompts outperform chain-of-thought prompts for this task. These findings provide practical guidance for deploying scalable, accurate cluster description systems that support high-volume workflows in legal discovery and other domains requiring automated organization of large image collections.
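Centroid-based sampling, one of the compared strategies, reduces to clustering the embeddings and keeping the items nearest each centroid. A sketch with synthetic placeholder embeddings:

```python
# Sketch of centroid-based sampling: cluster image embeddings with K-means,
# then keep the 20 items nearest each centroid for the caption/describe stage.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
emb = rng.normal(size=(2000, 64))                    # stand-in image embeddings

km = KMeans(n_clusters=20, n_init=10, random_state=0).fit(emb)
for c in range(3):                                   # show the first three clusters
    members = np.where(km.labels_ == c)[0]
    d = np.linalg.norm(emb[members] - km.cluster_centers_[c], axis=1)
    sample = members[np.argsort(d)[:20]]             # 20 nearest to the centroid
    print(f"cluster {c}: {len(members)} images, sampled {len(sample)}")
```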
[IR-3] A Comparative Study of Retrieval Methods in Azure AI Search
链接: https://arxiv.org/abs/2512.08078
作者: Qiang Mao,Han Qin,Robert Neary,Charles Wang,Fusheng Wei,Jianping Zhang,Nathaniel Huber-Fliflet
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Increasingly, attorneys are interested in moving beyond keyword and semantic search to improve the efficiency of how they find key information during a document review task. Large language models (LLMs) are now seen as tools that attorneys can use to ask natural language questions of their data during document review to receive accurate and concise answers. This study evaluates retrieval strategies within Microsoft Azure’s Retrieval-Augmented Generation (RAG) framework to identify effective approaches for Early Case Assessment (ECA) in eDiscovery. During ECA, legal teams analyze data at the outset of a matter to gain a general understanding of the data and attempt to determine key facts and risks before beginning full-scale review. In this paper, we compare the performance of Azure AI Search’s keyword, semantic, vector, hybrid, and hybrid-semantic retrieval methods. We then present the accuracy, relevance, and consistency of each method’s AI-generated responses. Legal practitioners can use the results of this study to enhance how they select RAG configurations in the future.
[IR-4] Detecting Privileged Documents by Ranking Connected Network Entities
链接: https://arxiv.org/abs/2512.08073
作者: Jianping Zhang,Han Qin,Nathaniel Huber-Fliflet
类目: Information Retrieval (cs.IR)
*备注:
Abstract:This paper presents a link analysis approach for identifying privileged documents by constructing a network of human entities derived from email header metadata. Entities are classified as either counsel or non-counsel based on a predefined list of known legal professionals. The core assumption is that individuals with frequent interactions with lawyers are more likely to participate in privileged communications. To quantify this likelihood, an algorithm assigns a score to each entity within the network. By utilizing both entity scores and the strength of their connections, the method enhances the identification of privileged documents. Experimental results demonstrate the algorithm’s effectiveness in ranking legal entities for privileged document detection.
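The scoring idea can be sketched with networkx: build a weighted co-communication graph from email headers, seed the known counsel list, and rank remaining entities by the share of their interaction weight that touches counsel. The specific scoring rule here is an illustrative assumption:

```python
# Sketch of counsel-connectivity scoring on a toy email graph; the scoring
# rule (counsel-weight share) is an illustrative assumption, not the
# paper's exact algorithm.
import networkx as nx

G = nx.Graph()
edges = [("alice", "counsel_bob", 12), ("alice", "carol", 3),
         ("carol", "counsel_dan", 1), ("erin", "frank", 8)]
G.add_weighted_edges_from(edges)                 # weight = email frequency
counsel = {"counsel_bob", "counsel_dan"}

def counsel_score(node: str) -> float:
    total = sum(d["weight"] for _, _, d in G.edges(node, data=True))
    with_counsel = sum(d["weight"] for _, v, d in G.edges(node, data=True)
                       if v in counsel)
    return with_counsel / total if total else 0.0

ranking = sorted((n for n in G if n not in counsel),
                 key=counsel_score, reverse=True)
print([(n, round(counsel_score(n), 2)) for n in ranking])
```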

