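This post lists the latest papers retrieved from arXiv.org on 2025-12-05. It is updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. To receive the list by email on a schedule, leave your email address in the comments.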

Note: The daily paper data is retrieved from arXiv.org and updated automatically at around 12:00 each day.

Reminder: If you would like to receive the daily paper data by email, leave your email address in the comments.


Overview (2025-12-05)

518 papers were updated today, including:

  • Natural Language Processing: 53 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 160 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 137 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 116 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] DraCo: Draft as CoT for Text-to-Image Preview and Rare Concept Generation

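Quick read: This paper addresses semantic misalignment and the difficulty of generating rare attribute combinations in text-to-image generation with current unified multimodal large language models (MLLMs), which stem from relying on coarse-grained textual planning or from treating the model as a standalone generator. The key to the solution is Draft-as-CoT (DraCo), an interleaved reasoning paradigm: the model first generates a low-resolution draft image as structured visual planning, then uses its inherent understanding capability to verify semantic consistency between the draft and the input prompt, and finally applies selective corrections with super-resolution, which yields more precise planning and verification. The method improves results on GenEval (+8%), Imagine-Bench (+0.91), and GenEval++ (+3%).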

Link: https://arxiv.org/abs/2512.05112
Authors: Dongzhi Jiang,Renrui Zhang,Haodong Li,Zhuofan Zong,Ziyu Guo,Jun He,Claire Guo,Junyan Ye,Rongyao Fang,Weijia Li,Rui Liu,Hongsheng Li
Affiliations: CUHK MMLab; CUHK IMIXR; Sun Yat-Sen University; SCUT; CUHK (Shenzhen)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Project Page: this https URL


Abstract:Recent unified multimodal large language models (MLLMs) have shown impressive capabilities, incorporating chain-of-thought (CoT) reasoning for enhanced text-to-image generation. However, existing approaches remain limited, either treating the model merely as a standalone generator or relying on abstract textual planning. To this end, we propose Draft-as-CoT (DraCo), a novel interleaved reasoning paradigm that fully leverages both textual and visual contents in CoT for better planning and verification. Our method first generates a low-resolution draft image as preview, providing more concrete and structural visual planning and guidance. Then, we employ the model’s inherent understanding capability to verify potential semantic misalignments between the draft and input prompt, and perform refinement through selective corrections with super-resolution. In this way, our approach addresses two fundamental challenges: the coarse-grained nature of textual planning and the difficulty in generating rare attribute combinations. To support training, we curate DraCo-240K, aiming to enhance three atomic capabilities spanning general correction, instance manipulation, and layout reorganization. Supported by DraCo-CFG, a specialized classifier-free guidance (CFG) strategy for interleaved reasoning, DraCo achieves a tremendous increase on GenEval (+8%), Imagine-Bench (+0.91), and GenEval++ (+3%), significantly outperforming direct generation and other generation methods empowered by CoT.

[NLP-1] Semantic Soft Bootstrapping: Long Context Reasoning in LLMs without Reinforcement Learning

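Quick read: This paper targets the heavy post-training compute cost of long-context reasoning in large language models (LLMs), which arises because reinforcement learning with verifiable rewards (RLVR) suffers from sparse rewards and poor sample efficiency. The key to the solution is Semantic Soft Bootstrapping (SSB), a self-distillation technique in which the same base model acts as both teacher and student but receives different semantic context about the correctness of its outputs at training time. The model first generates several rollouts for a problem; the correct response and the most common incorrect response are selected and supplied in context so that the model produces a more robust step-by-step explanation with a verified final answer. This pipeline builds a paired teacher-student training set automatically, without human intervention, and the logits produced during generation are what the student learns to match from the bare question alone. After fine-tuning Qwen2.5-3B-Instruct on GSM8K, the method improves accuracy over GRPO by 10.6% on MATH500 and 10% on AIME2024.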

Link: https://arxiv.org/abs/2512.05105
Authors: Purbesh Mitra,Sennur Ulukus
Affiliations: University of Maryland
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments:


Abstract:Long context reasoning in large language models (LLMs) has demonstrated enhancement of their cognitive capabilities via chain-of-thought (CoT) inference. Training such models is usually done via reinforcement learning with verifiable rewards (RLVR) in reasoning based problems, like math and programming. However, RLVR is limited by several bottlenecks, such as lack of dense reward and inadequate sample efficiency. As a result, it requires significant compute resources in the post-training phase. To overcome these limitations, in this work, we propose Semantic Soft Bootstrapping (SSB), a self-distillation technique, in which the same base language model plays the role of both teacher and student, but receives different semantic contexts about the correctness of its outcome at training time. The model is first prompted with a math problem and several rollouts are generated. From them, the correct and most common incorrect response are filtered, and then provided to the model in context to produce a more robust, step-by-step explanation with a verified final answer. This pipeline automatically curates a paired teacher-student training set from raw problem-answer data, without any human intervention. This generation process also produces a sequence of logits, which is what the student model tries to match in the training phase just from the bare question alone. In our experiment, we train Qwen2.5-3B-Instruct on the GSM8K dataset via parameter-efficient fine-tuning. We then tested its accuracy on the MATH500 and AIME2024 benchmarks. Our experiments show improvements in accuracy of 10.6% and 10%, respectively, over group relative policy optimization (GRPO), which is a commonly used RLVR algorithm. Our code is available at this https URL, and the model and curated dataset are available at this https URL.
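
The logit-matching step described above can be illustrated with a generic soft-label distillation loss. This is a minimal sketch, not the authors' objective: the abstract does not state the exact loss, so the per-token KL divergence, the `temperature` parameter, and all function names here are assumptions made for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits to probabilities, shifting by the max for stability."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def logit_matching_loss(teacher_logits, student_logits, temperature=1.0):
    """Mean per-token KL between teacher and student next-token distributions.

    teacher_logits / student_logits: one list of vocabulary logits per token.
    Assumed setup: the teacher saw extra context, the student only the question.
    """
    per_token = [
        kl_divergence(softmax(t, temperature), softmax(s, temperature))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_token) / len(per_token)
```

The loss is zero when the student reproduces the teacher's distribution at every position and grows as the two diverge, which is the sense in which the student "matches" the teacher's logits.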

[NLP-2] Structured Document Translation via Format Reinforcement Learning AACL2025

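Quick read: This paper addresses the difficulty of handling complex hierarchical structure when translating structured text such as XML or HTML documents; existing methods are mostly limited to the sentence level and cannot guarantee that document-level structure stays complete and accurate. The key to the solution is Format Reinforcement Learning (FormatRL), a reinforcement learning framework that applies Group Relative Policy Optimization on top of a supervised fine-tuned model and optimizes two structure-aware rewards: TreeSim, which measures structural similarity between predicted and reference XML trees, and Node-chrF, which measures translation quality at the level of XML nodes. The fine-grained metric StrucAUC is additionally used to distinguish minor errors from major structural failures, so that document structure and semantic content are optimized jointly.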

Link: https://arxiv.org/abs/2512.05100
Authors: Haiyue Song,Johannes Eschbach-Dymanus,Hour Kaing,Sumire Honda,Hideki Tanaka,Bianka Buschbeck,Masao Utiyama
Affiliations: National Institute of Information and Communications Technology, Japan; SAP, Germany
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: IJCNLP-AACL 2025 Main (Oral)


Abstract:Recent works on structured text translation remain limited to the sentence level, as they struggle to effectively handle the complex document-level XML or HTML structures. To address this, we propose Format Reinforcement Learning (FormatRL), which employs Group Relative Policy Optimization on top of a supervised fine-tuning model to directly optimize novel structure-aware rewards: 1) TreeSim, which measures structural similarity between predicted and reference XML trees and 2) Node-chrF, which measures translation quality at the level of XML nodes. Additionally, we apply StrucAUC, a fine-grained metric distinguishing between minor errors and major structural failures. Experiments on the SAP software-documentation benchmark demonstrate improvements across six metrics and an analysis further shows how different reward functions contribute to improvements in both structural and translation quality.
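
To make the idea of a structure-aware reward concrete, the sketch below scores how well a predicted XML document preserves the tag structure of a reference. It is a hypothetical stand-in, not TreeSim: the abstract does not define TreeSim, so the tag-path multiset overlap and the names `tag_paths` and `structure_overlap` are assumptions chosen only to show the general shape of such a reward.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def tag_paths(xml_text):
    """Count root-to-node tag paths such as 'doc/p/b', ignoring text content."""
    root = ET.fromstring(xml_text)
    paths = Counter()

    def walk(node, prefix):
        path = f"{prefix}/{node.tag}" if prefix else node.tag
        paths[path] += 1
        for child in node:
            walk(child, path)

    walk(root, "")
    return paths

def structure_overlap(pred_xml, ref_xml):
    """Multiset Jaccard overlap of tag paths; 0.0 if the prediction is malformed."""
    try:
        pred = tag_paths(pred_xml)
    except ET.ParseError:
        return 0.0  # a prediction that does not parse gets the lowest reward
    ref = tag_paths(ref_xml)
    intersection = sum((pred & ref).values())
    union = sum((pred | ref).values())
    return intersection / union if union else 1.0
```

A translation that keeps every tag in place scores 1.0 regardless of the translated text, a translation that drops an inline tag scores lower, and malformed output scores 0.0, which gives a policy-optimization method a graded signal rather than a pass/fail check.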

[NLP-3] Multi-LLM Collaboration for Medication Recommendation

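Quick read: This paper addresses the unreliability of large language models (LLMs) in clinical decision support caused by hallucination and inconsistent reasoning, as well as the failure of naive model ensembles to deliver stable and credible recommendations. The key to the solution is a collaboration framework based on LLM Chemistry, which quantifies the collaborative compatibility among models and guides LLMs to cooperate in an interaction-aware way. The resulting ensembles are intended to be effective (exploiting complementary strengths), stable (producing consistent quality), and calibrated (minimizing interference and error amplification), improving the credibility and clinical usefulness of medication recommendations.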

Link: https://arxiv.org/abs/2512.05066
Authors: Huascar Sanchez,Briland Hitaj,Jules Bergmann,Linda Briesemeister
Affiliations: SRI International; University of Maryland St. Joseph Medical Center
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 8 pages, 5 figures, 1 table


Abstract:As healthcare increasingly turns to AI for scalable and trustworthy clinical decision support, ensuring reliability in model reasoning remains a critical challenge. Individual large language models (LLMs) are susceptible to hallucinations and inconsistency, whereas naive ensembles of models often fail to deliver stable and credible recommendations. Building on our previous work on LLM Chemistry, which quantifies the collaborative compatibility among LLMs, we apply this framework to improve the reliability in medication recommendation from brief clinical vignettes. Our approach leverages multi-LLM collaboration guided by Chemistry-inspired interaction modeling, enabling ensembles that are effective (exploiting complementary strengths), stable (producing consistent quality), and calibrated (minimizing interference and error amplification). We evaluate our Chemistry-based Multi-LLM collaboration strategy on real-world clinical scenarios to investigate whether such interaction-aware ensembles can generate credible, patient-specific medication recommendations. Preliminary results are encouraging, suggesting that LLM Chemistry-guided collaboration may offer a promising path toward reliable and trustworthy AI assistants in clinical practice.

[NLP-4] Arbitrage: Efficient Reasoning via Advantage-Aware Speculation

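Quick read: This paper addresses the computational inefficiency of speculative decoding in reasoning tasks: traditional token-level Speculative Decoding rejects drafts unnecessarily when tokens mismatch in semantically equivalent steps, and existing step-level methods still regenerate many rejected steps with little improvement, wasting target-model compute. The key to the solution is Arbitrage, a step-level speculative generation framework built around a lightweight router that is trained to predict when the target model is likely to produce a meaningfully better reasoning step and that decides dynamically whether to accept the draft model's output. This routing approximates an ideal Arbitrage Oracle that always picks the higher-quality step, giving a near-optimal efficiency-accuracy trade-off.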

Link: https://arxiv.org/abs/2512.05033
Authors: Monishwaran Maheswaran,Rishabh Tiwari,Yuezhou Hu,Kerem Dilmen,Coleman Hooper,Haocheng Xi,Nicholas Lee,Mehrdad Farajtabar,Michael W. Mahoney,Kurt Keutzer,Amir Gholami
Affiliations: UC Berkeley; Apple; ICSI; LBNL
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 22 pages


Abstract:Modern Large Language Models achieve impressive reasoning capabilities with long Chain of Thoughts, but they incur substantial computational cost during inference, and this motivates techniques to improve the performance-cost ratio. Among these techniques, Speculative Decoding accelerates inference by employing a fast but inaccurate draft model to autoregressively propose tokens, which are then verified in parallel by a more capable target model. However, due to unnecessary rejections caused by token mismatches in semantically equivalent steps, traditional token-level Speculative Decoding struggles in reasoning tasks. Although recent works have shifted to step-level semantic verification, which improves efficiency by accepting or rejecting entire reasoning steps, existing step-level methods still regenerate many rejected steps with little improvement, wasting valuable target compute. To address this challenge, we propose Arbitrage, a novel step-level speculative generation framework that routes generation dynamically based on the relative advantage between draft and target models. Instead of applying a fixed acceptance threshold, Arbitrage uses a lightweight router trained to predict when the target model is likely to produce a meaningfully better step. This routing approximates an ideal Arbitrage Oracle that always chooses the higher-quality step, achieving near-optimal efficiency-accuracy trade-offs. Across multiple mathematical reasoning benchmarks, Arbitrage consistently surpasses prior step-level Speculative Decoding baselines, reducing inference latency by up to ~2× at matched accuracy.
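
The routing idea can be sketched as a simple control loop. This is an illustrative outline under stated assumptions, not the paper's implementation: `draft_step`, `target_step`, `predicted_advantage`, and `is_final` are hypothetical callables standing in for the draft model, the target model, the trained router, and a stopping check, and the `threshold` value is arbitrary.

```python
from typing import Callable, List, Tuple

def route_steps(
    problem: str,
    draft_step: Callable[[str], str],
    target_step: Callable[[str], str],
    predicted_advantage: Callable[[str, str], float],
    is_final: Callable[[str], bool],
    threshold: float = 0.0,
    max_steps: int = 16,
) -> List[Tuple[str, str]]:
    """Generate reasoning steps, calling the target model only when the router
    predicts it would produce a meaningfully better step than the draft."""
    context = problem
    trace = []
    for _ in range(max_steps):
        step = draft_step(context)  # cheap proposal for the next step
        source = "draft"
        if predicted_advantage(context, step) > threshold:
            step = target_step(context)  # spend target compute on this step
            source = "target"
        trace.append((source, step))
        context += "\n" + step
        if is_final(step):
            break
    return trace
```

The design point this sketch tries to capture is that the decision depends on the expected gain from regenerating, not on a fixed quality bar for the draft: a mediocre draft step is still kept when the router expects the target model to do no better.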

[NLP-5] Factuality and Transparency Are All RAG Needs! Self-Explaining Contrastive Evidence Re-ranking

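Quick read: This paper addresses hallucination in retrieval-augmented generation (RAG) systems caused by retrieving inaccurate or misleading information, particularly in safety-critical domains such as healthcare, where outputs must rest on reliable factual evidence. The key to the solution is Self-Explaining Contrastive Evidence Re-Ranking (CER): embeddings are fine-tuned with contrastive learning, with hard negatives selected automatically by a subjectivity-based criterion, so that factual rationales are pulled closer in the embedding space while subjective or misleading explanations are pushed away. The method also generates token-level attribution rationales for each retrieved passage, enabling interpretable, evidence-based retrieval; initial experiments on clinical trial reports show improved retrieval accuracy.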

Link: https://arxiv.org/abs/2512.05012
Authors: Francielle Vargas,Daniel Pedronette
Affiliations: São Paulo State University
Categories: Computation and Language (cs.CL)
Comments: This work was presented as a poster at the Applied Social Media Lab during the 2025 Synthesizer Open Showcase at the Berkman Klein Center for Internet & Society at Harvard University


Abstract:This extended abstract introduces Self-Explaining Contrastive Evidence Re-Ranking (CER), a novel method that restructures retrieval around factual evidence by fine-tuning embeddings with contrastive learning and generating token-level attribution rationales for each retrieved passage. Hard negatives are automatically selected using a subjectivity-based criterion, forcing the model to pull factual rationales closer while pushing subjective or misleading explanations apart. As a result, the method creates an embedding space explicitly aligned with evidential reasoning. We evaluated our method on clinical trial reports, and initial experimental results show that CER improves retrieval accuracy, mitigates the potential for hallucinations in RAG systems, and provides transparent, evidence-based retrieval that enhances reliability, especially in safety-critical domains.
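
The pull-closer, push-apart behavior described above is what a contrastive objective such as InfoNCE produces. The sketch below is a generic InfoNCE loss with explicit hard negatives, offered as an assumption about the kind of loss involved; the abstract does not name the loss, and the vectors, the `temperature` value, and the function names are illustrative only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def info_nce(query, positive, hard_negatives, temperature=0.1):
    """Negative log-probability of the positive among positive + negatives.

    query: embedding of the question or claim.
    positive: embedding of a factual rationale (assumed label).
    hard_negatives: embeddings of subjective or misleading passages (assumed labels).
    """
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in hard_negatives]
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))  # stable log-sum-exp
    return log_sum - logits[0]
```

Minimizing this loss raises the query's similarity to the factual passage relative to each hard negative, so the choice of negatives (here, by a subjectivity criterion) determines which distinctions the embedding space learns to encode.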

[NLP-6] Nex-N1: Agentic Models Trained via a Unified Ecosystem for Large-Scale Environment Construction

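Quick read: This paper addresses a bottleneck in the evolution of large language models (LLMs) from passive responders to autonomous agents: the lack of scalable infrastructure for constructing high-quality interaction signals, which impedes effective policy learning. The key to the solution is a systematic method that scales the diversity and complexity of interactive environments along three orthogonal dimensions: (1) Complexity: NexAU, an agent framework that supports building complex agent hierarchies via simple configurations; (2) Diversity: NexA4A, which automatically generates diverse agent hierarchies from natural language to cover infinite domains; and (3) Fidelity: NexGAP, which integrates dynamic real-world environments to bridge the simulation-reality gap and synthesize grounded trajectories. Nex-N1, trained on this infrastructure, outperforms state-of-the-art open-source models on benchmarks such as SWE-bench and tau2 and is competitive with frontier proprietary models.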

Link: https://arxiv.org/abs/2512.04987
Authors: Nex-AGI Team:Yuxuan Cai,Lu Chen,Qiaoling Chen,Yuyang Ding,Liwen Fan,Wenjie Fu,Yufei Gao,Honglin Guo,Pinxue Guo,Zhenhua Han,Zhengfu He,Hanglei Hu,Kai Hu,Shengjia Hua,Tianyu Huai,Baodai Huang,Li Ji,Zhen Jiang,Zhikai Lei,Bufan Li,Jiahang Lin,Lizhi Lin,Jinxiu Liu,Shichun Liu,Ziming Liu,Yuchen Ni,Pengfang Qian,Yujiong Shen,Qingyun Shi,Wentao Shu,Peng Sun,Yiran Suo,Tian Tang,Boyu Tian,Guoteng Wang,Junzhe Wang,Peixin Wang,Zhiheng Xi,Hang Yan,Jie Yang,Zhixiong Yang,Tianchu Yao,Guangze Ye,Qianxi Yu,Shuo Zhang,Xinyue Zhang,Yiqi Zhang,Jiarong Zhao,Miao Zheng,Rui Zheng,Enyu Zhou,Jiazheng Zhou,Maosen Zhou,Yuhao Zhou,Tao Gui,Yining Zheng,Xinchi Chen,Jie Zhou,Siyuan Feng,Qin Chen,Liang He,Qi Zhang,Xuanjing Huang,Xipeng Qiu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:


Abstract:The evolution of Large Language Models (LLMs) from passive responders to autonomous agents necessitates a fundamental shift in learning paradigms – from static imitation to incentive-driven decision making. However, this transition is significantly impeded by the lack of scalable infrastructure capable of constructing high-quality interaction signals for effective policy learning. To address this, we introduce a comprehensive method designed to systematically scale the diversity and complexity of interactive environments. Our method realizes this scaling by addressing three orthogonal dimensions: (1) Complexity: NexAU, a flexible agent framework that supports building complex agent hierarchies via simple configurations; (2) Diversity: NexA4A automatically generates diverse agent hierarchies from natural language to cover infinite domains; and (3) Fidelity: NexGAP bridges the simulation-reality gap by integrating dynamic real-world environment for grounded trajectories synthesis. We train Nex-N1 upon the diverse and complex interactive environments established by our infrastructure. Empirical results on benchmarks such as SWE-bench and tau2 demonstrate that Nex-N1 consistently outperforms SOTA open-source models and achieves competitive performance against frontier proprietary models on complex agentic tasks. We open-source the Nex ecosystem and model weights to facilitate further research.

[NLP-7] LLMs Know More Than Words: A Genre Study with Syntax Metaphor Phonetics

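Quick read: This paper asks whether large language models (LLMs) can learn deeper linguistic properties such as syntactic structure, phonetic cues, and metrical patterns from raw text and use them to improve performance on natural language tasks. The key to the approach is a multilingual genre classification dataset built from Project Gutenberg that covers poetry, novels, and drama; each binary classification task is augmented with three explicit linguistic feature sets (syntactic tree structures, metaphor counts, and phonetic metrics). Comparing models that see raw text alone with models that also receive the explicit features shows that different features contribute unevenly across tasks, which underscores the importance of incorporating more complex linguistic signals during training.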

Link: https://arxiv.org/abs/2512.04957
Authors: Weiye Shi,Zhaowei Zhang,Shaoheng Yan,Yaodong Yang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:


Abstract:Large language models (LLMs) demonstrate remarkable potential across diverse language related tasks, yet whether they capture deeper linguistic properties, such as syntactic structure, phonetic cues, and metrical patterns from raw text remains unclear. To analyze whether LLMs can learn these features effectively and apply them to important natural language related tasks, we introduce a novel multilingual genre classification dataset derived from Project Gutenberg, a large-scale digital library offering free access to thousands of public domain literary works, comprising thousands of sentences per binary task (poetry vs. novel; drama vs. poetry; drama vs. novel) in six languages (English, French, German, Italian, Spanish, and Portuguese). We augment each with three explicit linguistic feature sets (syntactic tree structures, metaphor counts, and phonetic metrics) to evaluate their impact on classification performance. Experiments demonstrate that although LLM classifiers can learn latent linguistic structures either from raw text or from explicitly provided features, different features contribute unevenly across tasks, which underscores the importance of incorporating more complex linguistic signals during model training.

[NLP-8] CARL: Critical Action Focused Reinforcement Learning for Multi-Step Agent

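Quick read: This paper addresses the poor performance of conventional group-level policy optimization in multi-step tasks, which stems from its assumption that every action contributes equally to the final outcome. The analysis shows that in multi-step interactive settings only a small fraction of actions are critical in determining the result. The key to the solution is CARL, a critical-action-focused reinforcement learning algorithm that provides action-level optimization signals for high-criticality actions while excluding low-criticality actions from the model update, making training more focused. Experiments show that CARL improves both performance and efficiency during training and inference across diverse evaluation settings.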

Link: https://arxiv.org/abs/2512.04949
Authors: Leyang Shen,Yang Zhang,Chun Kai Ling,Xiaoyan Zhao,Tat-Seng Chua
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 10 pages, 4 figures


Abstract:Agents capable of accomplishing complex tasks through multiple interactions with the environment have emerged as a popular research direction. However, in such multi-step settings, the conventional group-level policy optimization algorithm becomes suboptimal because of its underlying assumption that each action holds equal contribution, which deviates significantly from reality. Our analysis reveals that only a small fraction of actions are critical in determining the final outcome. Building on this insight, we propose CARL, a critical-action-focused reinforcement learning algorithm tailored for multi-step agents. CARL achieves focused training through providing action-level optimization signals for high-criticality actions while excluding low-criticality actions from model update. Extensive experiments demonstrate that CARL achieves both stronger performance and higher efficiency during training and inference across diverse evaluation settings.
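
The contrast between equal-contribution and critical-action updates can be shown with a small sketch. It is a simplified illustration and not CARL itself: the group-normalized advantage follows the common group-relative formulation, while the per-action `criticality` scores, the `threshold`, and the hard masking rule are assumptions, since the abstract does not say how criticality is measured or applied.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards):
    """Normalize each trajectory's reward against its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]  # identical rewards carry no signal
    return [(r - mu) / sigma for r in rewards]

def action_level_signals(rewards, criticality, threshold=0.5):
    """Give each action its trajectory's advantage only if it is critical.

    criticality[i][t] is a hypothetical score in [0, 1] for action t of
    trajectory i; actions below the threshold get 0.0 and are thereby
    excluded from the update.
    """
    advantages = group_relative_advantages(rewards)
    return [
        [adv if score >= threshold else 0.0 for score in scores]
        for adv, scores in zip(advantages, criticality)
    ]
```

With every score above the threshold this reduces to the equal-contribution case, where all actions in a trajectory share one advantage; lowering the number of critical actions concentrates the update on the few decisions assumed to determine the outcome.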

[NLP-9] Algorithmic Thinking Theory

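Quick read: This paper studies how iterative improvement and answer aggregation can raise the performance of large language models (LLMs) on complex reasoning tasks. The key contribution is a theoretical framework that formalizes a plan for generating and combining multiple solutions as a reasoning algorithm with access to a probabilistic oracle, providing a foundation for designing more powerful iterative reasoning methods. Because the framework is grounded in experimental evidence rather than architectural specifics, it may extend to a wide range of current and future reasoning oracles.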

Link: https://arxiv.org/abs/2512.04923
Authors: MohammadHossein Bateni,Vincent Cohen-Addad,Yuzhou Gu,Silvio Lattanzi,Simon Meierhans,Christopher Mohri
Affiliations: Google; NYU; ETH Zurich; Stanford University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:


Abstract:Large language models (LLMs) have proven to be highly effective for solving complex reasoning tasks. Surprisingly, their capabilities can often be improved by iterating on previously generated solutions. In this context, a reasoning plan for generating and combining a set of solutions can be thought of as an algorithm for reasoning using a probabilistic oracle. We introduce a theoretical framework for analyzing such reasoning algorithms. This framework formalizes the principles underlying popular techniques for iterative improvement and answer aggregation, providing a foundation for designing a new generation of more powerful reasoning methods. Unlike approaches for understanding models that rely on architectural specifics, our model is grounded in experimental evidence. As a result, it offers a general perspective that may extend to a wide range of current and future reasoning oracles.
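
One of the simplest reasoning algorithms in this sense is answer aggregation by majority vote over repeated oracle calls. The sketch below illustrates that general idea only; it is not taken from the paper, and `noisy_oracle` with its `p_correct` parameter is a toy stand-in for an LLM whose answers are correct with some probability.

```python
import random
from collections import Counter

def noisy_oracle(correct, wrong_answers, p_correct, rng):
    """Toy probabilistic oracle: returns the correct answer with probability
    p_correct, otherwise one of the wrong answers chosen uniformly."""
    if rng.random() < p_correct:
        return correct
    return rng.choice(wrong_answers)

def majority_vote(answers):
    """Return the most frequent answer (ties go to the earliest-seen answer)."""
    return Counter(answers).most_common(1)[0][0]

def aggregate(oracle, num_samples):
    """A minimal reasoning algorithm: query the oracle repeatedly, then vote."""
    return majority_vote([oracle() for _ in range(num_samples)])
```

A possible usage is `aggregate(lambda: noisy_oracle("42", ["41", "43"], 0.6, random.Random(0)), 15)`. When the correct answer is the single most likely output, voting over more samples makes the aggregated answer more reliable than any one call, which is the kind of behavior a formal analysis of such algorithms would aim to quantify.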

[NLP-10] The AI Consumer Index (ACE)

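Quick read: This paper asks whether frontier AI models can perform high-value consumer tasks, and in particular whether their answers are reliable and grounded in real-world settings. The key contribution is the first version of the AI Consumer Index (ACE), a benchmark with a hidden held-out set of 400 test cases across four consumer activities (shopping, food, gaming, and DIY), together with a novel grading methodology that dynamically checks whether relevant parts of a response are grounded in the retrieved web sources. Even the top-performing model, GPT 5 (Thinking = High), scores only 56.1% overall, and in Shopping the top model scores under 50%, indicating a substantial gap between current models and consumers' needs.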

Link: https://arxiv.org/abs/2512.04921
Authors: Julien Benchek,Rohit Shetty,Benjamin Hunsberger,Ajay Arun,Zach Richards,Brendan Foody,Osvald Nitski,Bertie Vidgen
Affiliations: Mercor
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments:


Abstract:We introduce the first version of the AI Consumer Index (ACE), a benchmark for assessing whether frontier AI models can perform high-value consumer tasks. ACE contains a hidden heldout set of 400 test cases, split across four consumer activities: shopping, food, gaming, and DIY. We are also open sourcing 80 cases as a devset with a CC-BY license. For the ACE leaderboard we evaluated 10 frontier models (with websearch turned on) using a novel grading methodology that dynamically checks whether relevant parts of the response are grounded in the retrieved web sources. GPT 5 (Thinking = High) is the top-performing model, scoring 56.1%, followed by o3 Pro (Thinking = On) (55.2%) and GPT 5.1 (Thinking = High) (55.1%). Models differ across domains, and in Shopping the top model scores under 50%. For some requests (such as giving the correct price or providing working links), models are highly prone to hallucination. Overall, ACE shows a substantial gap between the performance of even the best models and consumers’ AI needs.

[NLP-11] STELLA: Guiding Large Language Models for Time Series Forecasting with Semantic Abstractions

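Quick read: This paper addresses the fact that current adaptations of large language models (LLMs) to time series forecasting fail to effectively enrich the raw series with information, leaving the LLM's reasoning capabilities underutilized. Existing prompting strategies rely on static correlations rather than generative interpretations of dynamic behavior and lack global and instance-level context. The key to the solution is STELLA (Semantic-Temporal Alignment with Language Abstractions), a framework whose dynamic semantic abstraction mechanism decouples the input series into trend, seasonality, and residual components and translates their intrinsic behavioral features into two levels of semantic anchors: a Corpus-level Semantic Prior (CSP) for global context and a Fine-grained Behavioral Prompt (FBP) for instance-level patterns. These anchors are injected into the LLM as prefix prompts to guide it in modeling the intrinsic dynamics of the series. Experiments on eight benchmark datasets show better long- and short-term forecasting than state-of-the-art methods and strong generalization in zero-shot and few-shot settings.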

Link: https://arxiv.org/abs/2512.04871
Authors: Junjie Fan,Hongye Zhao,Linduo Wei,Jiayu Rao,Guijia Li,Jiaxin Yuan,Wenqi Xu,Yong Qi
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: This work has been submitted to the IEEE for possible publication


Abstract:Recent adaptations of Large Language Models (LLMs) for time series forecasting often fail to effectively enhance information for raw series, leaving LLM reasoning capabilities underutilized. Existing prompting strategies rely on static correlations rather than generative interpretations of dynamic behavior, lacking critical global and instance-specific context. To address this, we propose STELLA (Semantic-Temporal Alignment with Language Abstractions), a framework that systematically mines and injects structured supplementary and complementary information. STELLA employs a dynamic semantic abstraction mechanism that decouples input series into trend, seasonality, and residual components. It then translates intrinsic behavioral features of these components into Hierarchical Semantic Anchors: a Corpus-level Semantic Prior (CSP) for global context and a Fine-grained Behavioral Prompt (FBP) for instance-level patterns. Using these anchors as prefix-prompts, STELLA guides the LLM to model intrinsic dynamics. Experiments on eight benchmark datasets demonstrate that STELLA outperforms state-of-the-art methods in long- and short-term forecasting, showing superior generalization in zero-shot and few-shot settings. Ablation studies further validate the effectiveness of our dynamically generated semantic anchors.
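
The decoupling into trend, seasonality, and residual can be illustrated with a classical additive decomposition. This is a textbook-style sketch, not STELLA's mechanism: the abstract does not say which decomposition is used, so the centered moving average, the phase-mean seasonal estimate, and the `period` argument are assumptions.

```python
def moving_average(series, window):
    """Centered moving average; the window shrinks at the edges."""
    half = window // 2
    out = []
    for i in range(len(series)):
        lo, hi = max(0, i - half), min(len(series), i + half + 1)
        out.append(sum(series[lo:hi]) / (hi - lo))
    return out

def decompose(series, period):
    """Additive split: series[i] = trend[i] + seasonal[i] + residual[i].

    Assumes len(series) >= period so every phase has at least one value.
    """
    trend = moving_average(series, period)
    detrended = [x - t for x, t in zip(series, trend)]
    # Average the detrended values that fall on the same phase of the cycle.
    phase_means = []
    for p in range(period):
        values = detrended[p::period]
        phase_means.append(sum(values) / len(values))
    seasonal = [phase_means[i % period] for i in range(len(series))]
    residual = [x - t - s for x, t, s in zip(series, trend, seasonal)]
    return trend, seasonal, residual
```

Summary statistics of each component (for example, the slope of the trend or the amplitude of the seasonal part) are the sort of behavioral features that could then be rendered as text for a prompt; how STELLA actually derives its anchors is not specified in the abstract.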

[NLP-12] SEAL: Self-Evolving Agentic Learning for Conversational Question Answering over Knowledge Graphs

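Quick read: This paper addresses persistent challenges in knowledge-based conversational question answering (KBCQA), including coreference resolution, modeling of contextual dependencies, and accurate, efficient complex logical reasoning. Existing approaches, whether end-to-end semantic parsing or stepwise agent-based reasoning, often suffer from structural inaccuracies and prohibitive computational costs, particularly for intricate queries over large knowledge graphs. The key to the solution is SEAL, a two-stage semantic parsing framework. In the first stage, a large language model (LLM) extracts a minimal S-expression core that preserves the essential semantics of the query; in the second stage, template-based completion, guided by question-type prediction and placeholder instantiation, produces a fully executable S-expression. This design improves structural fidelity and linking efficiency. SEAL also includes a self-evolving mechanism that combines local and global memory with a reflection module, allowing continuous adaptation from dialog history and execution feedback without explicit retraining. It achieves state-of-the-art performance on the SPICE benchmark, especially on multi-hop reasoning, comparison, and aggregation tasks.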

Link: https://arxiv.org/abs/2512.04868
Authors: Hao Wang,Jialun Zhong,Changcheng Wang,Zhujun Nie,Zheng Li,Shunyu Yao,Yanzeng Li,Xinchi Li
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:


Abstract:Knowledge-based conversational question answering (KBCQA) confronts persistent challenges in resolving coreference, modeling contextual dependencies, and executing complex logical reasoning. Existing approaches, whether end-to-end semantic parsing or stepwise agent-based reasoning, often suffer from structural inaccuracies and prohibitive computational costs, particularly when processing intricate queries over large knowledge graphs. To address these limitations, we introduce SEAL, a novel two-stage semantic parsing framework grounded in self-evolving agentic learning. In the first stage, a large language model (LLM) extracts a minimal S-expression core that captures the essential semantics of the input query. This core is then refined by an agentic calibration module, which corrects syntactic inconsistencies and aligns entities and relations precisely with the underlying knowledge graph. The second stage employs template-based completion, guided by question-type prediction and placeholder instantiation, to construct a fully executable S-expression. This decomposition not only simplifies logical form generation but also significantly enhances structural fidelity and linking efficiency. Crucially, SEAL incorporates a self-evolving mechanism that integrates local and global memory with a reflection module, enabling continuous adaptation from dialog history and execution feedback without explicit retraining. Extensive experiments on the SPICE benchmark demonstrate that SEAL achieves state-of-the-art performance, especially in multi-hop reasoning, comparison, and aggregation tasks. The results validate notable gains in both structural accuracy and computational efficiency, underscoring the framework’s capacity for robust and scalable conversational reasoning.

[NLP-13] Mitigating Catastrophic Forgetting in Target Language Adaptation of LLMs via Source-Shielded Updates

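Quick read: This paper addresses two core challenges in adapting instruct large language models (LLMs) to a target language under low-resource conditions: the reliance on costly labeled target-language data and catastrophic forgetting during adaptation. The key to the solution is Source-Shielded Updates (SSU), a selective parameter update strategy that uses a small set of source data and a parameter importance scoring method to identify the parameters critical to maintaining source abilities, then protects them with a column-wise freezing strategy before adaptation. This mitigates catastrophic forgetting while reaching target-language performance that is competitive with, and in many cases better than, full fine-tuning.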

Link: https://arxiv.org/abs/2512.04844
Authors: Atsuki Yamaguchi,Terufumi Morishita,Aline Villavicencio,Nikolaos Aletras
Affiliations: University of Sheffield; Hitachi, Ltd.; University of Exeter; The Alan Turing Institute; Federal University of Rio Grande do Norte
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:


Abstract:Expanding the linguistic diversity of instruct large language models (LLMs) is crucial for global accessibility but is often hindered by the reliance on costly specialized target language labeled data and catastrophic forgetting during adaptation. We tackle this challenge under a realistic, low-resource constraint: adapting instruct LLMs using only unlabeled target language data. We introduce Source-Shielded Updates (SSU), a selective parameter update strategy that proactively preserves source knowledge. Using a small set of source data and a parameter importance scoring method, SSU identifies parameters critical to maintaining source abilities. It then applies a column-wise freezing strategy to protect these parameters before adaptation. Experiments across five typologically diverse languages and 7B and 13B models demonstrate that SSU successfully mitigates catastrophic forgetting. It reduces performance degradation on monolingual source tasks to just 3.4% (7B) and 2.8% (13B) on average, a stark contrast to the 20.3% and 22.3% from full fine-tuning. SSU also achieves target-language performance highly competitive with full fine-tuning, outperforming it on all benchmarks for 7B models and the majority for 13B models.
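
Column-wise freezing can be pictured as a mask applied to the gradient update of a weight matrix. The following is a minimal sketch under assumptions, not the SSU procedure: the abstract does not give the importance scoring method, so `importance` is taken as a given matrix of per-parameter scores, columns are ranked by their summed score, and `freeze_ratio` is a hypothetical hyperparameter.

```python
def select_frozen_columns(importance, freeze_ratio):
    """Pick the columns with the highest total importance to freeze.

    importance[r][c] is an assumed per-parameter score computed on source data.
    """
    n_cols = len(importance[0])
    col_scores = [sum(row[c] for row in importance) for c in range(n_cols)]
    n_freeze = int(n_cols * freeze_ratio)
    ranked = sorted(range(n_cols), key=lambda c: col_scores[c], reverse=True)
    return set(ranked[:n_freeze])

def masked_sgd_step(weights, grads, frozen_cols, lr=0.1):
    """Plain SGD update that leaves every weight in a frozen column unchanged."""
    return [
        [
            w if c in frozen_cols else w - lr * g
            for c, (w, g) in enumerate(zip(w_row, g_row))
        ]
        for w_row, g_row in zip(weights, grads)
    ]
```

Because the frozen set is chosen once before adaptation, the protected columns keep their source-trained values exactly, while the remaining columns are free to move toward the target language.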

[NLP-14] DAMASHA: Detecting AI in Mixed Adversarial Texts via Segmentation with Human-interpretable Attribution

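Quick read: This paper addresses mixed-authorship text segmentation, that is, identifying the transition points in a text where authorship shifts between human and AI, a problem with important implications for authenticity, trust, and human oversight of collaboratively produced content. The key to the solution is Info-Mask, a framework that integrates stylometric cues, perplexity-driven signals, and structured boundary modeling to segment collaborative human-AI content accurately. To evaluate robustness against adversarial perturbations, the authors construct and release MAS (Mixed-text Adversarial setting for Segmentation), an adversarial benchmark dataset. They also introduce Human-Interpretable Attribution (HIA) overlays to make the model's decisions more transparent and assess their usefulness in a small-scale human study.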

Link: https://arxiv.org/abs/2512.04838
Authors: L. D. M. S. Sai Teja,N. Siva Gopala Krishna,Ufaq Khan,Muhammad Haris Khan,Partha Pakray,Atul Mishra
Affiliations: NIT Silchar, India; BML Munjal, Haryana, India; MBZUAI, Abu Dhabi, UAE
Categories: Computation and Language (cs.CL)
Comments: 18 pages, 10 Figures


Abstract:In the age of advanced large language models (LLMs), the boundaries between human and AI-generated text are becoming increasingly blurred. We address the challenge of segmenting mixed-authorship text, that is, identifying transition points in text where authorship shifts from human to AI or vice-versa, a problem with critical implications for authenticity, trust, and human oversight. We introduce a novel framework, called Info-Mask, for mixed authorship detection that integrates stylometric cues, perplexity-driven signals, and structured boundary modeling to accurately segment collaborative human-AI content. To evaluate the robustness of our system against adversarial perturbations, we construct and release an adversarial benchmark dataset, Mixed-text Adversarial setting for Segmentation (MAS), designed to probe the limits of existing detectors. Beyond segmentation accuracy, we introduce Human-Interpretable Attribution (HIA) overlays that highlight how stylometric features inform boundary predictions, and we conduct a small-scale human study assessing their usefulness. Across multiple architectures, Info-Mask significantly improves span-level robustness under adversarial conditions, establishing new baselines while revealing remaining challenges. Our findings highlight both the promise and limitations of adversarially robust, interpretable mixed-authorship detection, with implications for trust and oversight in human-AI co-authorship.

[NLP-15] Are LLMs Truly Multilingual? Exploring Zero-Shot Multilingual Capability of LLMs for Information Retrieval: An Italian Healthcare Use Case

【速读】: 该论文旨在解决电子健康记录(Electronic Health Records, EHRs)中信息提取的挑战,特别是针对意大利语临床文本的共病(comorbidity)识别问题。传统自然语言处理(Natural Language Processing, NLP)技术因临床语言的高度复杂性和语义多样性而表现不足。解决方案的关键在于评估开源多语言大语言模型(Large Language Models, LLMs)在零样本(zero-shot)、本地部署(on-premises)环境下对EHR文本的理解与信息抽取能力,发现部分LLMs在特定疾病泛化上存在显著性能波动,表明其在实际医疗场景中的可靠性仍需进一步优化。

链接: https://arxiv.org/abs/2512.04834
作者: Vignesh Kumar Kembu,Pierandrea Morandini,Marta Bianca Maria Ranzini,Antonino Nocera
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have become a key topic in AI and NLP, transforming sectors like healthcare, finance, education, and marketing by improving customer service, automating tasks, providing insights, improving diagnostics, and personalizing learning experiences. Information extraction from clinical records is a crucial task in digital healthcare. Although traditional NLP techniques have been used for this in the past, they often fall short due to the complexity, variability of clinical language, and high inner semantics in the free clinical text. Recently, Large Language Models (LLMs) have become a powerful tool for better understanding and generating human-like text, making them highly effective in this area. In this paper, we explore the ability of open-source multilingual LLMs to understand EHRs (Electronic Health Records) in Italian and help extract information from them in real-time. Our detailed experimental campaign on comorbidity extraction from EHR reveals that some LLMs struggle in zero-shot, on-premises settings, and others show significant variation in performance, struggling to generalize across various diseases when compared to native pattern matching and manual annotations.

[NLP-16] DaLA: Danish Linguistic Acceptability Evaluation Guided by Real World Errors

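[Quick Read]: This paper addresses the limited coverage, insufficient difficulty, and weak discriminative power of existing benchmarks for Danish linguistic acceptability. The key to the solution is an enhanced benchmark built by systematically applying fourteen corruption functions, each introducing one type of error, to correct sentences, with the resulting corruptions validated both manually and automatically. This yields diverse and realistic incorrect sentences and broadens the benchmark considerably. Large Language Models (LLMs) score markedly lower on it than on existing benchmarks, which supports its higher difficulty and stronger ability to separate models.

Link: https://arxiv.org/abs/2512.04799
Authors: Gianluca Barmina, Nathalie Carmen Hau Norman, Peter Schneider-Kamp, Lukas Galke
Affiliations: University of Southern Denmark; University of Copenhagen; Danish Foundation Models
Categories: Computation and Language (cs.CL)
Comments: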

Abstract:We present an enhanced benchmark for evaluating linguistic acceptability in Danish. We first analyze the most common errors found in written Danish. Based on this analysis, we introduce a set of fourteen corruption functions that generate incorrect sentences by systematically introducing errors into existing correct Danish sentences. To ensure the accuracy of these corruptions, we assess their validity using both manual and automatic methods. The results are then used as a benchmark for evaluating Large Language Models on a linguistic acceptability judgement task. Our findings demonstrate that this extension is both broader and more comprehensive than the current state of the art. By incorporating a greater variety of corruption types, our benchmark provides a more rigorous assessment of linguistic acceptability, increasing task difficulty, as evidenced by the lower performance of LLMs on our benchmark compared to existing ones. Our results also suggest that our benchmark has a higher discriminatory power, which allows well-performing models to be better distinguished from low-performing ones.
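
To make the idea of a corruption function concrete, here is a minimal Python sketch. The two functions `swap_adjacent_words` and `drop_first_comma`, the helper `build_examples`, and the example sentence are hypothetical illustrations, not any of the fourteen functions defined in the paper. They only show the general shape: take a correct sentence and return an incorrect variant, which is then labeled as unacceptable.

```python
import random
from typing import Callable, List, Optional, Tuple

Corruption = Callable[[str, random.Random], Optional[str]]


def swap_adjacent_words(sentence: str, rng: random.Random) -> Optional[str]:
    """Hypothetical corruption: swap two neighbouring words (word-order error)."""
    words = sentence.split()
    if len(words) < 2:
        return None  # nothing to swap
    i = rng.randrange(len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)


def drop_first_comma(sentence: str, rng: random.Random) -> Optional[str]:
    """Hypothetical corruption: delete the first comma (punctuation error)."""
    if "," not in sentence:
        return None  # corruption does not apply to this sentence
    return sentence.replace(",", "", 1)


def build_examples(
    sentences: List[str], corruptions: List[Corruption], seed: int = 0
) -> List[Tuple[str, int]]:
    """Pair each correct sentence (label 1) with its corrupted variants (label 0)."""
    rng = random.Random(seed)
    examples: List[Tuple[str, int]] = []
    for sentence in sentences:
        examples.append((sentence, 1))
        for corrupt in corruptions:
            bad = corrupt(sentence, rng)
            # Keep only corruptions that actually changed the sentence.
            if bad is not None and bad != sentence:
                examples.append((bad, 0))
    return examples
```

Returning `None` when a corruption does not apply keeps the functions composable: a real pipeline would still need the manual and automatic validity checks the abstract mentions, since a mechanical edit can occasionally produce another acceptable sentence.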

[NLP-17] AdiBhashaa: A Community-Curated Benchmark for Machine Translation into Indian Tribal Languages

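[Quick Read]: This paper addresses the near-total neglect of Indian tribal languages (such as Bhili, Mundari, Gondi, and Santali) by current Large Language Models (LLMs) and multilingual Machine Translation (MT) systems, a gap that deepens structural inequities in education, governance, and digital participation. The key to the solution is AdiBhashaa, a community-driven initiative built on participatory data creation with native speakers, human-in-the-loop validation, and systematic evaluation of both encoder-decoder MT models and LLMs. The initiative also centers local expertise, builds capacity among early-career researchers from marginalized communities, and places human validation at the core of language technology development, offering a possible model for more equitable AI research.

Link: https://arxiv.org/abs/2512.04765
Authors: Pooja Singh, Sandeep Kumar
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: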

Abstract:Large language models and multilingual machine translation (MT) systems increasingly drive access to information, yet many languages of the tribal communities remain effectively invisible in these technologies. This invisibility exacerbates existing structural inequities in education, governance, and digital participation. We present AdiBhashaa, a community-driven initiative that constructs the first open parallel corpora and baseline MT systems for four major Indian tribal languages: Bhili, Mundari, Gondi, and Santali. This work combines participatory data creation with native speakers, human-in-the-loop validation, and systematic evaluation of both encoder-decoder MT models and large language models. In addition to reporting technical findings, we articulate how AdiBhashaa illustrates a possible model for more equitable AI research: it centers local expertise, builds capacity among early-career researchers from marginalized communities, and foregrounds human validation in the development of language technologies.

[NLP-18] MemLoRA: Distilling Expert Adapters for On-Device Memory Systems

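[Quick Read]: This paper addresses two core problems in deploying memory-augmented Large Language Models (LLMs) on local devices: LLMs are too computationally expensive for edge devices, and existing systems lack native visual understanding, which limits their use in multimodal settings. The solution has two parts: (i) MemLoRA, a memory system that equips Small Language Models (SLMs) with specialized memory adapters, each trained separately via knowledge distillation for one memory operation (knowledge extraction, memory update, or memory-augmented generation), enabling accurate on-device memory operations at low resource cost; and (ii) MemLoRA-V, which integrates small Vision-Language Models (SVLMs) to give the system native visual understanding. Experiments show that on text tasks MemLoRA outperforms baselines 10× its size and approaches the performance of models 60× its size, while on visual question answering MemLoRA-V clearly beats caption-based approaches (81.3 vs. 23.7 accuracy) and keeps strong text performance.

Link: https://arxiv.org/abs/2512.04763
Authors: Massimo Bini, Ondrej Bohdal, Umberto Michieli, Zeynep Akata, Mete Ozay, Taha Ceritli
Affiliations: Samsung R&D Institute UK; Technical University of Munich; Helmholtz Munich; MCML
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: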

Abstract:Memory-augmented Large Language Models (LLMs) have demonstrated remarkable consistency during prolonged dialogues by storing relevant memories and incorporating them as context. Such memory-based personalization is also key in on-device settings that allow users to keep their conversations and data private. However, memory-augmented systems typically rely on LLMs that are too costly for local on-device deployment. Even though Small Language Models (SLMs) are more suitable for on-device inference than LLMs, they cannot achieve sufficient performance. Additionally, these LLM-based systems lack native visual capabilities, limiting their applicability in multimodal contexts. In this paper, we introduce (i) MemLoRA, a novel memory system that enables local deployment by equipping SLMs with specialized memory adapters, and (ii) its vision extension MemLoRA-V, which integrates small Vision-Language Models (SVLMs) to memory systems, enabling native visual understanding. Following knowledge distillation principles, each adapter is trained separately for specific memory operations: knowledge extraction, memory update, and memory-augmented generation. Equipped with memory adapters, small models enable accurate on-device memory operations without cloud dependency. On text-only operations, MemLoRA outperforms 10× larger baseline models (e.g., Gemma2-27B) and achieves performance comparable to 60× larger models (e.g., GPT-OSS-120B) on the LoCoMo benchmark. To evaluate visual understanding operations instead, we extend LoCoMo with challenging Visual Question Answering tasks that require direct visual reasoning. On this, our VLM-integrated MemLoRA-V shows massive improvements over caption-based approaches (81.3 vs. 23.7 accuracy) while keeping strong performance in text-based tasks, demonstrating the efficacy of our method in multimodal contexts.

[NLP-19] Challenging the Abilities of Large Language Models in Italian: a Community Initiative

[Quick Read]: This paper addresses the lack of systematic evaluation of Large Language Models (LLMs) in languages other than English, Italian in particular. Existing evaluations mostly focus on leaderboards and pay little attention to methodology and standardized procedures. The key to the solution is CALAMITA (Challenging the Abilities of LAnguage Models in ITAlian), a collaborative, methodology-centered benchmarking framework. It brings together more than 80 contributors from academia, industry, and the public sector to design and evaluate over 20 tasks and almost 100 subtasks covering linguistic competence, commonsense reasoning, factual consistency, fairness, summarization, translation, and code generation, and it establishes a centralized evaluation pipeline that supports heterogeneous datasets and metrics. Beyond being the most comprehensive and diverse Italian benchmark to date, it stresses fine-grained, task-representative metrics, harmonized pipelines, and community collaboration, and offers a blueprint that other language communities can extend for inclusive and rigorous model evaluation.

Link: https://arxiv.org/abs/2512.04759
Authors: Malvina Nissim, Danilo Croce, Viviana Patti, Pierpaolo Basile, Giuseppe Attanasio, Elio Musacchio, Matteo Rinaldi, Federico Borazio, Maria Francis, Jacopo Gili, Daniel Scalena, Begoña Altuna, Ekhi Azurmendi, Valerio Basile, Luisa Bentivogli, Arianna Bisazza, Marianna Bolognesi, Dominique Brunato, Tommaso Caselli, Silvia Casola, Maria Cassese, Mauro Cettolo, Claudia Collacciani, Leonardo De Cosmo, Maria Pia Di Buono, Andrea Esuli, Julen Etxaniz, Chiara Ferrando, Alessia Fidelangeli, Simona Frenda, Achille Fusco, Marco Gaido, Andrea Galassi, Federico Galli, Luca Giordano, Mattia Goffetti, Itziar Gonzalez-Dios, Lorenzo Gregori, Giulia Grundler, Sandro Iannaccone, Chunyang Jiang, Moreno La Quatra, Francesca Lagioia, Soda Marem Lo, Marco Madeddu, Bernardo Magnini, Raffaele Manna, Fabio Mercorio, Paola Merlo, Arianna Muti, Vivi Nastase, Matteo Negri, Dario Onorati, Elena Palmieri, Sara Papi, Lucia Passaro, Giulia Pensa, Andrea Piergentili, Daniele Potertì, Giovanni Puccetti, Federico Ranaldi, Leonardo Ranaldi, Andrea Amelio Ravelli, Martina Rosola, Elena Sofia Ruzzetti, Giuseppe Samo, Andrea Santilli, Piera Santin, Gabriele Sarti, Giovanni Sartor, Beatrice Savoldi, Antonio Serino, Andrea Seveso, Lucia Siciliani, Paolo Torroni, Rossella Varvara, Andrea Zaninello, Asya Zanollo, Fabio Massimo Zanzotto, Kamyar Zeinalipour, Andrea Zugarini
Affiliations: Alpha Test S.r.l.; ANSA; Bocconi University; European University Institute; Expert.ai; Fondazione Bruno Kessler; Galileo Net; Heriot-Watt University; Idiap Research Institute; ILC-CNR; Instituto de Telecomunicações; ISTI-CNR; IUSS Pavia; Kore University of Enna; LMU Munich; Sapienza University of Rome; Universitat de Barcelona; University of Bari Aldo Moro; University of Bologna; University of Florence; University of Geneva; University of Groningen; University of Milano-Bicocca; University of Naples “L’Orientale”; University of Pavia; University of Pisa; University of Rome Tor Vergata; University of Siena; University of the Basque Country (UPV/EHU); University of Trento; University of Turin
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The rapid progress of Large Language Models (LLMs) has transformed natural language processing and broadened its impact across research and society. Yet, systematic evaluation of these models, especially for languages beyond English, remains limited. “Challenging the Abilities of LAnguage Models in ITAlian” (CALAMITA) is a large-scale collaborative benchmarking initiative for Italian, coordinated under the Italian Association for Computational Linguistics. Unlike existing efforts that focus on leaderboards, CALAMITA foregrounds methodology: it federates more than 80 contributors from academia, industry, and the public sector to design, document, and evaluate a diverse collection of tasks, covering linguistic competence, commonsense reasoning, factual consistency, fairness, summarization, translation, and code generation. Through this process, we not only assembled a benchmark of over 20 tasks and almost 100 subtasks, but also established a centralized evaluation pipeline that supports heterogeneous datasets and metrics. We report results for four open-weight LLMs, highlighting systematic strengths and weaknesses across abilities, as well as challenges in task-specific evaluation. Beyond quantitative results, CALAMITA exposes methodological lessons: the necessity of fine-grained, task-representative metrics, the importance of harmonized pipelines, and the benefits and limitations of broad community engagement. CALAMITA is conceived as a rolling benchmark, enabling continuous integration of new tasks and models. This makes it both a resource – the most comprehensive and diverse benchmark for Italian to date – and a framework for sustainable, community-driven evaluation. We argue that this combination offers a blueprint for other languages and communities seeking inclusive and rigorous LLM evaluation practices.

[NLP-20] EtCon: Edit-then-Consolidate for Reliable Knowledge Editing

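[Quick Read]: This paper addresses the significant gap between how knowledge editing methods perform in controlled evaluations and how they perform in real-world lifelong learning scenarios. Two issues stand out: (1) traditional methods make the edited model overfit to the new fact, degrading pre-trained capabilities; (2) there is no knowledge consolidation stage, so new knowledge is not integrated into the model's inference-time behavior during generation, leaving parametric knowledge and actual generation inconsistent. The key to the solution is the Edit-then-Consolidate paradigm. It first applies Targeted Proximal Supervised Fine-Tuning (TPSFT), which limits policy drift and mitigates overfitting, and then uses Group Relative Policy Optimization (GRPO) to optimize trajectory-level behavior, aligning the new knowledge with the Chain-of-Thought (CoT) based inference policy. This improves editing reliability and generalization while better preserving locality and pre-trained capabilities.

Link: https://arxiv.org/abs/2512.04753
Authors: Ruilin Li, Yibin Wang, Wenhong Zhu, Chenglin Li, Jinghao Zhang, Chenliang Li, Junchi Yan, Jiaqi Wang
Affiliations: Wuhan University; Shanghai Innovation Institute; Fudan University; Shanghai Jiao Tong University; University of Science and Technology of China
Categories: Computation and Language (cs.CL)
Comments: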

Abstract:Knowledge editing aims to update specific facts in large language models (LLMs) without full retraining. Prior efforts sought to tune the knowledge layers of LLMs, proving effective for making selective edits. However, a significant gap exists between their performance in controlled, teacher-forcing evaluations and their real-world effectiveness in lifelong learning scenarios, which greatly limits their practical applicability. This work’s empirical analysis reveals two recurring issues associated with this gap: (1) Most traditional methods lead the edited model to overfit to the new fact, thereby degrading pre-trained capabilities; (2) There is a critical absence of a knowledge consolidation stage, leaving new facts insufficiently integrated into LLMs’ inference-time behavior under autoregressive generation, thereby leading to a mismatch between parametric knowledge and actual generation behavior. To this end, we propose Edit-then-Consolidate, a novel knowledge editing paradigm that aims to bridge the gap between theoretical knowledge editing methods and their real-world applicability. Specifically, (1) our framework mitigates overfitting via Targeted Proximal Supervised Fine-Tuning (TPSFT) that localizes the edit via a trust-region objective to limit policy drift; (2) Then, a consolidation stage using Group Relative Policy Optimization (GRPO) aligns the edited knowledge with CoT-based inference policy by optimizing trajectory-level behavior under comprehensive reward signals. Extensive experiments demonstrate our framework consistently improves editing reliability and generalization under real-world evaluations, while better preserving locality and pre-trained capabilities.

[NLP-21] Model Whisper: Steering Vectors Unlock Large Language Models Potential in Test-time AAAI2026

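[Quick Read]: This paper addresses the challenge of efficiently unlocking the reasoning potential of Large Language Models (LLMs) for specific tasks or new distributions. Existing test-time adaptation methods usually tune model parameters, which is computationally expensive and may degrade what the model can already do. The authors propose a lightweight component, Test-Time Steering Vectors (TTSV): an optimizable vector prepended to the input that steers the model's internal state while the LLM's parameters stay entirely frozen. The core idea is to optimize the TTSV on test data to minimize the model's output entropy, pushing the model into a higher-confidence state that activates the inherent abilities most relevant to the current task. TTSV is cheap to optimize and generalizes well, and it is validated on both base models and reasoning-enhanced models.

Link: https://arxiv.org/abs/2512.04748
Authors: Xinyue Kang, Diwei Shi, Li Chen
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted to AAAI 2026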

Abstract:It is a critical challenge to efficiently unlock the powerful reasoning potential of Large Language Models (LLMs) for specific tasks or new distributions. Existing test-time adaptation methods often require tuning model parameters, which is not only computationally expensive but also risks degrading the model’s pre-existing abilities. To address this, we introduce a lightweight component, Test-Time Steering Vectors (TTSV), which is prepended to the input while keeping the LLM’s parameters entirely frozen. By optimizing the TTSV on test data to minimize the model’s output entropy, we steer the model towards an internal state of higher confidence, activating its inherent abilities most relevant to the current task. TTSV is both lightweight and highly efficient to optimize, making it a true plug-and-play enhancement. Extensive experiments validate our approach’s effectiveness on both base models and reasoning-enhanced models. For instance, on the MATH500 task, TTSV achieves a 45.88% relative performance gain on the Qwen2.5-Math-7B model and a 16.22% relative gain on the Qwen3-4B model. Furthermore, our approach exhibits robust generalization, with its steering vectors proving highly transferable across diverse tasks.
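
The entropy-minimization idea can be illustrated on a toy model. The sketch below is not the authors' implementation: the "model" is a hypothetical frozen mean-pool plus linear layer (weights `W` are made up), the steering vector is a single extra token vector prepended to each input, and gradients are taken by finite differences instead of backpropagation. It only shows the loop the abstract describes: keep the model fixed, and update the prepended vector so that the average output entropy on unlabeled test inputs goes down.

```python
import math
from typing import List

Vector = List[float]

# Frozen toy "model": mean-pool the token vectors, then a fixed linear layer.
# These weights are arbitrary illustrative values and are never updated.
W = [[1.0, -0.5], [-0.3, 0.8], [0.2, 0.1]]  # 3 classes x 2 dimensions


def softmax(logits: Vector) -> Vector:
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def predict(tokens: List[Vector]) -> Vector:
    dim = len(tokens[0])
    pooled = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
    logits = [sum(w * p for w, p in zip(row, pooled)) for row in W]
    return softmax(logits)


def entropy(probs: Vector) -> float:
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(steer: Vector, batch: List[List[Vector]]) -> float:
    # The steering vector is prepended as an extra "token" to every input.
    return sum(entropy(predict([steer] + x)) for x in batch) / len(batch)


def optimize_steering(
    batch: List[List[Vector]], steps: int = 200, lr: float = 0.5, eps: float = 1e-4
) -> Vector:
    """Gradient descent on the steering vector only; the model stays frozen."""
    steer = [0.0, 0.0]
    for _ in range(steps):
        grad = []
        for d in range(len(steer)):
            up = list(steer)
            up[d] += eps
            down = list(steer)
            down[d] -= eps
            # Central finite difference of the mean entropy.
            grad.append((mean_entropy(up, batch) - mean_entropy(down, batch)) / (2 * eps))
        steer = [s - lr * g for s, g in zip(steer, grad)]
    return steer
```

Note that lower entropy means higher confidence, not necessarily higher accuracy; whether the two go together on a real LLM is an empirical question that the paper's experiments address.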

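[NLP-22] SignRoundV2: Closing the Performance Gap in Extremely Low-Bit Post-Training Quantization for LLMs

[Quick Read]: This paper addresses the severe performance degradation that extreme low-bit quantization causes when deploying Large Language Models (LLMs), particularly at 2 bits and even 4 bits (e.g., MXFP4). The solution is SignRoundV2, a post-training quantization framework that remains effective even without mixed precision. Its two key contributions are (1) a fast sensitivity metric that combines gradient information with quantization-induced deviations to guide layer-wise bit allocation, and (2) a lightweight pre-tuning search for quantization scales that improves extremely low-bit quantization. The method reaches production-grade performance with about 1 percent variance at 4-5 bits and keeps strong results even at 2 bits.

Link: https://arxiv.org/abs/2512.04746
Authors: Wenhua Cheng, Weiwei Zhang, Heng Guo, Haihao Shen
Affiliations: Intel
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: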

Abstract:Extreme low-bit quantization is critical for efficiently deploying Large Language Models (LLMs), yet it often leads to severe performance degradation at 2-bits and even 4-bits (e.g., MXFP4). We present SignRoundV2, a post-training quantization framework that is highly effective even without mixed-precision. SignRoundV2 introduces (1) a fast sensitivity metric that combines gradient information with quantization-induced deviations to guide layer-wise bit allocation, and (2) a lightweight pre-tuning search for quantization scales to improve extremely low-bit quantization. These components allow SignRoundV2 to close the gap with full-precision models. Extensive experiments indicate that our method sustains competitive accuracy for LLMs, achieving production-grade performance with about 1 percent variance at 4-5 bits and strong results even at 2 bits. The implementation is available at this https URL.
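
As a rough illustration of what a search over quantization scales means, the sketch below quantizes a list of weights with symmetric round-to-nearest and picks the scale that minimizes reconstruction error from a small grid. Everything here is an assumption for illustration: the grid of shrink ratios, the mean-squared-error criterion, and the function names are hypothetical, and this is not SignRoundV2's actual search or rounding procedure.

```python
from typing import List, Optional


def quantize(weights: List[float], scale: float, bits: int) -> List[float]:
    """Symmetric round-to-nearest quantization, returned in dequantized form."""
    qmax = 2 ** (bits - 1) - 1
    qmin = -qmax - 1
    out = []
    for w in weights:
        q = max(qmin, min(qmax, round(w / scale)))  # clamp to the integer grid
        out.append(q * scale)
    return out


def mse(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def search_scale(
    weights: List[float], bits: int, ratios: Optional[List[float]] = None
) -> float:
    """Pick the scale (a shrunken max-abs scale) with the lowest reconstruction MSE."""
    qmax = 2 ** (bits - 1) - 1
    base = max(abs(w) for w in weights) / qmax  # default max-abs scale
    if ratios is None:
        ratios = [i / 20 for i in range(10, 21)]  # 0.50, 0.55, ..., 1.00
    best = min(ratios, key=lambda r: mse(weights, quantize(weights, base * r, bits)))
    return base * best
```

Shrinking the scale clips a few outliers but gives the many small weights a finer grid, which is why a searched scale can beat the default max-abs scale at very low bit widths.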

[NLP-23] OsmT: Bridging OpenStreetMap Queries and Natural Language with Open-source Tag-aware Language Models ICDE

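[Quick Read]: This paper addresses the semantic gap between natural language and structured query languages such as Overpass Query Language (OverpassQL), in particular how to translate natural language into structured queries efficiently and accurately over large-scale OpenStreetMap (OSM) data. The key to the solution is OsmT, an open-source tag-aware language model, together with a Tag Retrieval Augmentation (TRA) mechanism that incorporates contextually relevant tag knowledge to improve the accuracy and structural validity of generated queries. TRA is designed to capture the hierarchical and relational dependencies among tags in the OSM database, addressing the topological complexity inherent in geospatial queries. The paper also defines a reverse task, OverpassQL-to-Text, to support query interpretation and user accessibility. On a public benchmark, the model is competitive with or better than strong baselines despite using significantly fewer parameters.

Link: https://arxiv.org/abs/2512.04738
Authors: Zhuoyue Wan, Wentao Hu, Chen Jason Zhang, Yuanfeng Song, Shuaimin Li, Ruiqiang Xiao, Xiao-Yong Wei, Raymond Chi-Wing Wong
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: 42nd IEEE International Conference on Data Engineering (ICDE)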

Abstract:Bridging natural language and structured query languages is a long-standing challenge in the database community. While recent advances in language models have shown promise in this direction, existing solutions often rely on large-scale closed-source models that suffer from high inference costs, limited transparency, and lack of adaptability for lightweight deployment. In this paper, we present OsmT, an open-source tag-aware language model specifically designed to bridge natural language and Overpass Query Language (OverpassQL), a structured query language for accessing large-scale OpenStreetMap (OSM) data. To enhance the accuracy and structural validity of generated queries, we introduce a Tag Retrieval Augmentation (TRA) mechanism that incorporates contextually relevant tag knowledge into the generation process. This mechanism is designed to capture the hierarchical and relational dependencies present in the OSM database, addressing the topological complexity inherent in geospatial query formulation. In addition, we define a reverse task, OverpassQL-to-Text, which translates structured queries into natural language explanations to support query interpretation and improve user accessibility. We evaluate OsmT on a public benchmark against strong baselines and observe consistent improvements in both query generation and interpretation. Despite using significantly fewer parameters, our model achieves competitive accuracy, demonstrating the effectiveness of open-source pre-trained language models in bridging natural language and structured query languages within schema-rich geospatial environments.

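[NLP-24] Towards Ethical Multi-Agent Systems of Large Language Models: A Mechanistic Interpretability Perspective AAAI’26

[Quick Read]: This paper addresses the ethical problems raised by multi-agent systems of Large Language Models (MALMs) in real applications, especially unpredictable or unethical behavior that can emerge when agents interact autonomously. The proposed agenda approaches the problem from the perspective of mechanistic interpretability and rests on three steps: first, building comprehensive evaluation frameworks that assess ethical behavior at the individual, interactional, and systemic levels; second, using mechanistic interpretability to uncover the internal mechanisms that give rise to emergent behaviors; and third, applying targeted, parameter-efficient alignment techniques that steer MALMs towards ethical behavior without compromising performance.

Link: https://arxiv.org/abs/2512.04691
Authors: Jae Hee Lee, Anne Lauscher, Stefano V. Albrecht
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: Accepted to LaMAS 2026@AAAI’26 ( this https URL )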

Abstract:Large language models (LLMs) have been widely deployed in various applications, often functioning as autonomous agents that interact with each other in multi-agent systems. While these systems have shown promise in enhancing capabilities and enabling complex tasks, they also pose significant ethical challenges. This position paper outlines a research agenda aimed at ensuring the ethical behavior of multi-agent systems of LLMs (MALMs) from the perspective of mechanistic interpretability. We identify three key research challenges: (i) developing comprehensive evaluation frameworks to assess ethical behavior at individual, interactional, and systemic levels; (ii) elucidating the internal mechanisms that give rise to emergent behaviors through mechanistic interpretability; and (iii) implementing targeted parameter-efficient alignment techniques to steer MALMs towards ethical behaviors without compromising their performance.

[NLP-25] Geschlechtsübergreifende Maskulina im Sprachgebrauch Eine korpusbasierte Untersuchung zu lexemspezifischen Unterschieden

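[Quick Read]: This paper addresses the debate over whether the German generic masculine (GM) is truly gender-neutral. Scholars disagree on the question, and corpus-based evidence on actual usage has been scarce. The key to the approach is a large corpus of press texts in which the full inflectional paradigm of 21 personal nouns was manually annotated (6,195 tokens in total), allowing a systematic look at how GM is distributed across lexemes and grammatical structures. The results show that GM is not primarily used to denote entire classes of people. It occurs predominantly in the plural and in indefinite noun phrases, and it differs considerably between lexical items, especially between passive role nouns and prestige-related personal nouns. These findings give psycholinguistic studies a basis for choosing stimuli that are closer to real-world language use.

Link: https://arxiv.org/abs/2512.04683
Authors: Carolin Mueller-Spitzer, Samira Ochs, Jan Oliver Ruediger, Sascha Wolfer
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 32 pages, 8 figures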

Abstract:This study examines the distribution and linguistic characteristics of generic masculines (GM) in contemporary German press texts. The use of masculine personal nouns to refer to mixed-gender groups or unspecified individuals has been widely debated in academia and the public, with conflicting perspectives on its gender-neutrality. While psycholinguistic studies suggest that GM is more readily associated with male referents, corpus-based analyses of its actual use remain scarce. We investigate GM in a large corpus of press texts, focusing on lexeme-specific differences across different types of personal nouns. We conducted manual annotations of the whole inflectional paradigm of 21 personal nouns, resulting in 6,195 annotated tokens. Our findings reveal considerable differences between lexical items, especially between passive role nouns and prestige-related personal nouns. On a grammatical level, we find that GM occurs predominantly in the plural and in indefinite noun phrases. Furthermore, our data shows that GM is not primarily used to denote entire classes of people, as has been previously claimed. By providing an empirical insight into the use of GM in authentic written language, we contribute to a more nuanced understanding of its forms and manifestations. These findings provide a solid basis for aligning linguistic stimuli in psycholinguistic studies more closely with real-world language use.

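[NLP-26] Topology Matters: Measuring Memory Leakage in Multi-Agent LLMs ACL

[Quick Read]: This paper addresses memory leakage in multi-agent Large Language Model (LLM) systems as a function of graph topology, that is, how network structure affects the risk that an attacker extracts sensitive information from a target agent's memory. The key to the solution is the MAMA (Multi-Agent Memory Attack) framework. It runs a two-phase protocol, Engram (seeding private information into the target agent's memory) and Resonance (multi-round interaction in which the attacker attempts extraction), on synthetic documents with labeled Personally Identifiable Information (PII), quantifies the leakage, and systematically evaluates six common network topologies (fully connected, ring, chain, binary tree, star, and star-ring). The results map graph structure to privacy risk and yield actionable architectural guidance: prefer sparse or hierarchical connectivity, maximize the graph distance between attacker and target, limit node degree and network radius, avoid shortcuts that bypass hubs, and implement topology-aware access controls.

Link: https://arxiv.org/abs/2512.04668
Authors: Jinbo Liu, Defu Cao, Yifei Wei, Tianyao Su, Yuan Liang, Yushun Dong, Yue Zhao, Xiyang Hu
Affiliations: University of Southern California; Florida State University; Arizona State University
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Under review at ACL Rolling Review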

Abstract:Graph topology is a fundamental determinant of memory leakage in multi-agent LLM systems, yet its effects remain poorly quantified. We introduce MAMA (Multi-Agent Memory Attack), a framework that measures how network structure shapes leakage. MAMA operates on synthetic documents containing labeled Personally Identifiable Information (PII) entities, from which we generate sanitized task instructions. We execute a two-phase protocol: Engram (seeding private information into a target agent’s memory) and Resonance (multi-round interaction where an attacker attempts extraction). Over up to 10 interaction rounds, we quantify leakage as the fraction of ground-truth PII recovered from attacking agent outputs via exact matching. We systematically evaluate six common network topologies (fully connected, ring, chain, binary tree, star, and star-ring), varying agent counts $n \in \{4, 5, 6\}$, attacker-target placements, and base models. Our findings reveal consistent patterns: fully connected graphs exhibit maximum leakage while chains provide strongest protection; shorter attacker-target graph distance and higher target centrality significantly increase vulnerability; leakage rises sharply in early rounds before plateauing; model choice shifts absolute leakage rates but preserves topology rankings; temporal/locational PII attributes leak more readily than identity credentials or regulated identifiers. These results provide the first systematic mapping from architectural choices to measurable privacy risk, yielding actionable guidance: prefer sparse or hierarchical connectivity, maximize attacker-target separation, limit node degree and network radius, avoid shortcuts bypassing hubs, and implement topology-aware access controls.
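
Two quantities from the abstract are easy to sketch in code: the attacker-target graph distance under a given topology, and leakage as the fraction of ground-truth PII recovered by exact matching. The Python below is a minimal illustration under stated assumptions: the function names are hypothetical, only four of the six topologies are built, and "exact matching" is taken to mean that the PII string appears verbatim in some attacker output, which may differ from the paper's matching rule.

```python
from collections import deque
from typing import Dict, List, Set


def build_topology(kind: str, n: int) -> Dict[int, Set[int]]:
    """Undirected adjacency sets for a few of the topologies named in the abstract."""
    adj: Dict[int, Set[int]] = {i: set() for i in range(n)}

    def link(a: int, b: int) -> None:
        adj[a].add(b)
        adj[b].add(a)

    if kind == "chain":
        for i in range(n - 1):
            link(i, i + 1)
    elif kind == "ring":
        for i in range(n):
            link(i, (i + 1) % n)
    elif kind == "star":
        for i in range(1, n):
            link(0, i)  # agent 0 is the hub
    elif kind == "fully_connected":
        for i in range(n):
            for j in range(i + 1, n):
                link(i, j)
    else:
        raise ValueError(f"unsupported topology: {kind}")
    return adj


def graph_distance(adj: Dict[int, Set[int]], src: int, dst: int) -> int:
    """Shortest-path length via breadth-first search; -1 if unreachable."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return -1


def leakage(ground_truth_pii: List[str], attacker_outputs: List[str]) -> float:
    """Fraction of PII strings that appear verbatim in any attacker output."""
    if not ground_truth_pii:
        return 0.0
    recovered = sum(
        1 for item in ground_truth_pii if any(item in out for out in attacker_outputs)
    )
    return recovered / len(ground_truth_pii)
```

With these helpers, the reported trend (shorter attacker-target distance, more leakage) can be examined by placing attacker and target at different nodes and comparing `graph_distance` against the measured `leakage` for each placement.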

[NLP-27] SEASON: Mitigating Temporal Hallucination in Video Large Language Models via Self-Diagnostic Contrastive Decoding

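[Quick Read]: This paper addresses temporal hallucination in Video Large Language Models (VideoLLMs): because these models do not perceive and exploit temporal information effectively, they often describe events in ways that are temporally inconsistent or causally implausible. The key to the solution is a training-free method, Self-Diagnostic Contrastive Decoding (SEASON). It dynamically diagnoses the hallucination tendency of each output token and then applies adaptive contrastive decoding against that token's corresponding temporal and spatial negatives, improving the temporal and spatial faithfulness of the generated text.

Link: https://arxiv.org/abs/2512.04643
Authors: Chang-Hsun Wu, Kai-Po Chang, Yu-Yang Sheng, Hung-Kai Chung, Kuei-Chun Wang, Yu-Chiang Frank Wang
Affiliations: National Taiwan University; NVIDIA
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: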

Abstract:Video Large Language Models (VideoLLMs) have shown remarkable progress in video understanding. However, these models still struggle to effectively perceive and exploit rich temporal information in videos when responding to user queries. Therefore, they often generate descriptions of events that are temporally inconsistent or causally implausible, causing severe hallucination issues. While most prior studies have focused on spatial hallucinations (e.g., object mismatches), temporal reasoning in video understanding remains relatively underexplored. To address this issue, we propose Self-Diagnostic Contrastive Decoding (SEASON), a training-free method that adaptively enhances temporal and spatial faithfulness for each output token. It achieves this by dynamically diagnosing each token’s hallucination tendency and applying adaptive contrastive decoding against its corresponding temporal and spatial negatives. Extensive experiments demonstrate that SEASON outperforms all existing training-free hallucination mitigation approaches on three hallucination examination benchmarks, while further improving VideoLLMs across four general video understanding benchmarks. The code will be released upon acceptance.
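
For readers unfamiliar with contrastive decoding, the sketch below shows one commonly used form: next-token logits from the original input are pushed away from logits computed on a degraded "negative" input. This is a generic illustration, not SEASON itself. The abstract says SEASON adapts the contrast per token based on a self-diagnosed hallucination tendency; here the strength `alpha` is simply passed in, and the logits are made-up numbers.

```python
from typing import List


def contrastive_logits(original: List[float], negative: List[float], alpha: float) -> List[float]:
    """Amplify what the original input supports and the negative input does not.

    alpha = 0 leaves the original logits unchanged; larger alpha contrasts harder.
    """
    return [(1 + alpha) * o - alpha * n for o, n in zip(original, negative)]


def greedy_token(logits: List[float]) -> int:
    """Index of the highest-scoring token."""
    return max(range(len(logits)), key=logits.__getitem__)


# Hypothetical logits over a 3-token vocabulary.
# Token 0 scores high even on the negative (e.g., temporally shuffled) input,
# which suggests it is driven by language priors rather than by the video.
original = [2.0, 1.9, 0.1]
negative = [2.0, 0.5, 0.1]

plain_choice = greedy_token(contrastive_logits(original, negative, alpha=0.0))
contrast_choice = greedy_token(contrastive_logits(original, negative, alpha=1.0))
```

In this toy case plain decoding picks token 0, while contrasting with `alpha=1.0` picks token 1, the token whose score depends on the intact input. A per-token adaptive scheme would choose `alpha` separately at each decoding step.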

[NLP-28] Natural Language Actor-Critic: Scalable Off-Policy Learning in Language Space

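[Quick Read]: This paper addresses unstable training, low sample efficiency, and difficult exploration when training Large Language Model (LLM) agents on long-horizon tasks with sparse rewards. Existing methods rely on policy gradient algorithms, which are sensitive to noisy trajectory-level rewards and struggle to explore a natural language action space. The key to the solution is Natural Language Actor-Critic (NLAC), an actor-critic algorithm whose critic is a generative LLM that outputs natural language instead of scalar values, providing a richer and more actionable training signal. The critic can explain in natural language why an action is suboptimal, which helps the LLM policy reason about how to improve without relying on random exploration. The approach can also be trained off-policy without policy gradients, offering a more stable and data-efficient alternative to on-policy methods.

Link: https://arxiv.org/abs/2512.04601
Authors: Joey Hong, Kang Liu, Zhan Ling, Jiecao Chen, Sergey Levine
Affiliations: UC Berkeley; ByteDance Seed
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 22 pages, 4 figures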

Abstract:Large language model (LLM) agents – LLMs that dynamically interact with an environment over long horizons – have become an increasingly important area of research, enabling automation in complex tasks involving tool-use, web browsing, and dialogue with people. In the absence of expert demonstrations, training LLM agents has relied on policy gradient methods that optimize LLM policies with respect to an (often sparse) reward function. However, in long-horizon tasks with sparse rewards, learning from trajectory-level rewards can be noisy, leading to training that is unstable and has high sample complexity. Furthermore, policy improvement hinges on discovering better actions through exploration, which can be difficult when actions lie in natural language space. In this paper, we propose Natural Language Actor-Critic (NLAC), a novel actor-critic algorithm that trains LLM policies using a generative LLM critic that produces natural language rather than scalar values. This approach leverages the inherent strengths of LLMs to provide a richer and more actionable training signal; particularly, in tasks with large, open-ended action spaces, natural language explanations for why an action is suboptimal can be immensely useful for LLM policies to reason how to improve their actions, without relying on random exploration. Furthermore, our approach can be trained off-policy without policy gradients, offering a more data-efficient and stable alternative to existing on-policy methods. We present results on a mixture of reasoning, web browsing, and tool-use with dialogue tasks, demonstrating that NLAC shows promise in outperforming existing training approaches and offers a more scalable and stable training paradigm for LLM agents.

[NLP-29] LexGenius: An Expert-Level Benchmark for Large Language Models in Legal General Intelligence

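[Quick Read]: This paper addresses the fact that existing legal AI benchmarks are result-oriented and do not systematically evaluate the legal intelligence of Large Language Models (LLMs), which hinders the development of Legal General Intelligence (GI). The key to the solution is LexGenius, an expert-level benchmark for Chinese law. It follows a Dimension-Task-Ability framework covering seven dimensions, eleven tasks, and twenty abilities, and builds its multiple-choice questions from recent legal cases and exam questions, combining manual and LLM review to reduce data leakage risk and ensure accuracy and reliability. The evaluation shows that even the best LLMs still lag behind human legal professionals on legal intelligence abilities, which supports the value of LexGenius for measuring and advancing legal GI.

Link: https://arxiv.org/abs/2512.04578
Authors: Wenjin Liu, Haoran Luo, Xin Feng, Xiang Ji, Lijuan Zhou, Rui Mao, Jiapu Wang, Shirui Pan, Erik Cambria
Affiliations: Hainan University; Nanyang Technological University; Nanjing University of Science and Technology; Griffith University
Categories: Computation and Language (cs.CL)
Comments: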

Abstract:Legal general intelligence (GI) refers to artificial intelligence (AI) that encompasses legal understanding, reasoning, and decision-making, simulating the expertise of legal experts across domains. However, existing benchmarks are result-oriented and fail to systematically evaluate the legal intelligence of large language models (LLMs), hindering the development of legal GI. To address this, we propose LexGenius, an expert-level Chinese legal benchmark for evaluating legal GI in LLMs. It follows a Dimension-Task-Ability framework, covering seven dimensions, eleven tasks, and twenty abilities. We use the recent legal cases and exam questions to create multiple-choice questions with a combination of manual and LLM reviews to reduce data leakage risks, ensuring accuracy and reliability through multiple rounds of checks. We evaluate 12 state-of-the-art LLMs using LexGenius and conduct an in-depth analysis. We find significant disparities across legal intelligence abilities for LLMs, with even the best LLMs lagging behind human legal professionals. We believe LexGenius can assess the legal intelligence abilities of LLMs and enhance legal GI development. Our project is available at this https URL.

[NLP-30] ADAPT: Learning Task Mixtures for Budget-Constrained Instruction Tuning

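Quick read: This paper addresses the difficulty of setting task sampling proportions in multi-task instruction tuning, that is, how to allocate training resources automatically under a limited token budget to improve model performance. Conventional approaches fix task weights by hand or mix data in proportion to task size, and so adapt poorly to differences in relative task difficulty and downstream performance. The key to its solution is ADAPT, a meta-learning-based dynamic task sampling mechanism that continually updates the task distribution by optimizing a smooth worst-case validation objective. This induces an adaptive curriculum that, without increasing total token consumption, shifts more training resources toward harder, benchmark-aligned tasks that matter more downstream, giving more efficient training and steady performance gains.

Link: https://arxiv.org/abs/2512.04555
Authors: Pritam Kadasi,Abhishek Upperwal,Mayank Singh
Affiliations: Lingo Research Group, Indian Institute of Technology Gandhinagar, India; Soket AI, India
Categories: Computation and Language (cs.CL)
Comments: Under Review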

Abstract:We propose ADAPT, a meta-learning algorithm that learns task sampling proportions under an explicit token budget for multi-task instruction tuning. Instead of fixing task weights by hand, ADAPT maintains a continuous distribution over tasks and updates it via meta-gradients of a smooth worst-case validation objective, inducing an adaptive curriculum that allocates more tokens to useful tasks while avoiding collapse. We instantiate ADAPT on three ~1B-parameter open-weight LLMs (Gemma-3-1B, LLaMA-3.2-1B, Qwen-0.6B), training on 20 Natural Instructions task types under budgets of 1%, 5%, and 10% of the available supervised tokens, and compare against strong supervised fine-tuning baselines with uniform and size-proportional mixing. In evaluations on 11 out-of-domain benchmarks spanning reasoning, reading comprehension, code generation, and instruction following, we find that ADAPT matches or slightly improves average downstream performance relative to the best static mixture, while using fewer effective training tokens and reallocating budget toward harder, benchmark-aligned tasks.
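
The abstract does not spell out the "smooth worst-case validation objective". One common way to smooth a maximum is a temperature-scaled log-sum-exp, and the sketch below assumes that form purely for illustration. The function names, the temperature, and the loss values are hypothetical and are not taken from the paper.

```python
import math

def smooth_worst_case(losses, temperature=1.0):
    """Temperature-scaled log-sum-exp: a smooth upper bound on max(losses)."""
    m = max(losses)
    # Subtract the max before exponentiating for numerical stability.
    total = sum(math.exp((l - m) / temperature) for l in losses)
    return m + temperature * math.log(total)

def task_weights(losses, temperature=1.0):
    """Gradient of the smooth objective w.r.t. each loss: a softmax over tasks."""
    m = max(losses)
    exps = [math.exp((l - m) / temperature) for l in losses]
    z = sum(exps)
    return [e / z for e in exps]

val_losses = [0.9, 1.4, 2.1]  # hypothetical per-task validation losses
print(smooth_worst_case(val_losses, temperature=0.5))
print(task_weights(val_losses, temperature=0.5))
```

Under this assumed form, tasks with higher validation loss receive larger weights, and lowering the temperature moves the objective closer to a hard maximum.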

[NLP-31] AdmTree: Compressing Lengthy Context with Adaptive Semantic Trees NEURIPS2025

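Quick read: This paper addresses the quadratic complexity of self-attention in large language models (LLMs), which limits effective processing of long contexts, a capability essential to many advanced applications. Existing context compression methods fall short: explicit methods tend to lose local detail, while implicit methods may introduce positional bias, degrade information, or fail to capture long-range semantic dependencies. The key to its solution is the AdmTree framework, which segments the input dynamically by information density and uses gist tokens to summarize variable-length segments as the leaves of a semantic binary tree for hierarchical compression. Combined with a lightweight aggregation mechanism and a frozen backbone LLM, it achieves efficient, high-fidelity semantic abstraction that preserves fine-grained detail and global semantic coherence while mitigating positional bias and adapting to content.

Link: https://arxiv.org/abs/2512.04550
Authors: Yangning Li,Shaoshen Chen,Yinghui Li,Yankai Chen,Hai-Tao Zheng,Hui Wang,Wenhao Jiang,Philip S. Yu
Affiliations: Shenzhen International Graduate School, Tsinghua University; Peng Cheng Laboratory; University of Illinois Chicago; Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: NeurIPS 2025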

Abstract:The quadratic complexity of self-attention constrains Large Language Models (LLMs) in processing long contexts, a capability essential for many advanced applications. Context compression aims to alleviate this computational bottleneck while retaining critical semantic information. However, existing approaches often fall short: explicit methods may compromise local detail, whereas implicit methods can suffer from positional biases, information degradation, or an inability to capture long-range semantic dependencies. We propose AdmTree, a novel framework for adaptive, hierarchical context compression with a central focus on preserving high semantic fidelity while maintaining efficiency. AdmTree dynamically segments input based on information density, utilizing gist tokens to summarize variable-length segments as the leaves of a semantic binary tree. This structure, together with a lightweight aggregation mechanism and a frozen backbone LLM (thereby minimizing new trainable parameters), enables efficient hierarchical abstraction of the context. By preserving fine-grained details alongside global semantic coherence, mitigating positional bias, and dynamically adapting to content, AdmTree robustly retains the semantic information of long contexts.

[NLP-32] EvoEdit: Lifelong Free-Text Knowledge Editing through Latent Perturbation Augmentation and Knowledge-driven Parameter Fusion

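Quick read: This paper addresses the difficulty of updating outdated knowledge in large language models (LLMs) after deployment, and in particular two limitations of existing knowledge editing methods. First, they depend on structured triplets, which are misaligned with the free-text form of LLM pretraining and cannot capture complex relationships among facts. Second, they support only one-time updates, with little study of sequential or lifelong editing. In response, the paper proposes a new task, Lifelong Free-text Knowledge Editing (LF-Edit), centered on continually injecting knowledge expressed in natural language while retaining existing knowledge. The key to its solution is EvoEdit, which improves the injection of new knowledge through Latent Perturbation Augmentation and mitigates forgetting through Knowledge-driven Parameter Fusion, enabling efficient and stable continual knowledge updates.

Link: https://arxiv.org/abs/2512.04545
Authors: Pengfei Cao,Zeao Ji,Daojian Zeng,Jun Zhao,Kang Liu
Affiliations: Chinese Academy of Sciences; University of Chinese Academy of Sciences; Hunan Normal University
Categories: Computation and Language (cs.CL)
Comments: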

Abstract:Adjusting the outdated knowledge of large language models (LLMs) after deployment remains a major challenge. This difficulty has spurred the development of knowledge editing, which seeks to accurately and efficiently modify a model’s internal (parametric) knowledge without retraining it from scratch. However, existing methods suffer from two limitations. First, they depend on structured triplets that are misaligned with the free-text nature of LLM pretraining and fail to capture the nuanced relationships among facts. Second, they typically support one-time knowledge updates, with relatively limited research on the problem of sequential or lifelong editing. To address these gaps, we propose a new task, Lifelong Free-text Knowledge Editing (LF-Edit), which enables models to incorporate updates expressed in natural language and supports continual editing over time. Despite its promise, LF-Edit faces the dual challenge of integrating new knowledge while mitigating the forgetting of prior information. To foster research on this new task, we construct a large-scale benchmark, Multi-Rank Lifelong Free-text Editing Benchmark (MRLF-Bench), containing 16,835 free-text edit requests. We further design a cognitively inspired multi-rank evaluation framework encompassing four levels: memorization, understanding, constrained comprehension, and reasoning. To tackle the challenges inherent in LF-Edit, we introduce a novel approach named EvoEdit that enhances knowledge injection through Latent Perturbation Augmentation and preserves prior information via Knowledge-driven Parameter Fusion. Experimental results demonstrate that EvoEdit substantially outperforms existing knowledge editing methods on the proposed LF-Edit task.

[NLP-33] UW-BioNLP at ChemoTimelines 2025: Thinking, Fine-Tuning, and Dictionary-Enhanced LLM Systems for Chemotherapy Timeline Extraction

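Quick read: This paper addresses the automatic construction of systemic chemotherapy timelines from raw clinical text in the electronic health records (EHR) of cancer patients. The key to its solution is a two-step workflow: a large language model (LLM) first extracts chemotherapy events from individual clinical notes, and a normalization and aggregation algorithm then merges those events into a complete patient-level timeline. The approaches differ mainly in how the LLM is used and trained, covering chain-of-thought thinking, supervised fine-tuning, direct preference optimization, and dictionary-based lookup. Fine-tuned Qwen3-14B achieved the best official score of 0.678 on the test set, indicating that task-specific LLM training strategies matter for timeline extraction performance.

Link: https://arxiv.org/abs/2512.04518
Authors: Tianmai M. Zhang,Zhaoyi Sun,Sihang Zeng,Chenxi Li,Neil F. Abernethy,Barbara D. Lam,Fei Xia,Meliha Yetisgen
Affiliations: University of Washington
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: To be published in Proceedings of the 7th Clinical Natural Language Processing Workshop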

Abstract:The ChemoTimelines shared task benchmarks methods for constructing timelines of systemic anticancer treatment from electronic health records of cancer patients. This paper describes our methods, results, and findings for subtask 2 – generating patient chemotherapy timelines from raw clinical notes. We evaluated strategies involving chain-of-thought thinking, supervised fine-tuning, direct preference optimization, and dictionary-based lookup to improve timeline extraction. All of our approaches followed a two-step workflow, wherein an LLM first extracted chemotherapy events from individual clinical notes, and then an algorithm normalized and aggregated events into patient-level timelines. Each specific method differed in how the associated LLM was utilized and trained. Multiple approaches yielded competitive performances on the test set leaderboard, with fine-tuned Qwen3-14B achieving the best official score of 0.678. Our results and analyses could provide useful insights for future attempts on this task as well as the design of similar tasks.
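
The second step of the workflow, normalizing and aggregating per-note events into patient-level timelines, can be sketched roughly as follows. This is a minimal illustration under assumed conventions: the tuple layout, the relation labels, and the `build_timelines` helper are hypothetical and do not reproduce the authors' algorithm.

```python
from collections import defaultdict

def build_timelines(note_events):
    """note_events: iterable of (patient_id, drug, relation, date) tuples."""
    timelines = defaultdict(set)
    for patient_id, drug, relation, date in note_events:
        # Normalize surface forms so the same event from two notes collapses to one.
        key = (drug.strip().lower(), relation.strip().lower(), date.strip())
        timelines[patient_id].add(key)
    # Sort each patient's events by date (ISO date strings sort chronologically).
    return {pid: sorted(events, key=lambda e: (e[2], e[0], e[1]))
            for pid, events in timelines.items()}

events = [
    ("p1", "Carboplatin", "begins-on", "2013-02-22"),
    ("p1", "carboplatin ", "BEGINS-ON", "2013-02-22"),  # duplicate from another note
    ("p1", "Paclitaxel", "contains-1", "2013-01-10"),   # hypothetical relation label
]
print(build_timelines(events))
```

Using a set per patient makes deduplication across notes automatic once the surface forms are normalized.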

[NLP-34] MSME: A Multi-Stage Multi-Expert Framework for Zero-Shot Stance Detection

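Quick read: This paper addresses the weak performance of generative AI on zero-shot stance detection in complex real-world scenarios, specifically: missing dynamic background knowledge, target definitions that involve compound entities or events and must be explicitly linked to stance labels, and rhetorical devices such as irony that obscure the author's actual intent. The key to its solution is a Multi-Stage, Multi-Expert framework (MSME) built around staged collaborative reasoning. The first stage prepares knowledge to clarify background and labels. The second stage uses three specialized modules, the Knowledge Expert, Label Expert, and Pragmatic Expert, which analyze the input from the angles of fact extraction, label refinement, and rhetorical cue detection. In the third stage a Meta-Judge fuses the expert outputs into the final stance prediction. This design improves understanding and reasoning in complex contexts.

Link: https://arxiv.org/abs/2512.04492
Authors: Yuanshuo Zhang,Aohua Li,Bo Chen,Jingbo Sun,Xiaobing Zhao
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: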

Abstract:LLM-based approaches have recently achieved impressive results in zero-shot stance detection. However, they still struggle in complex real-world scenarios, where stance understanding requires dynamic background knowledge, target definitions involve compound entities or events that must be explicitly linked to stance labels, and rhetorical devices such as irony often obscure the author’s actual intent. To address these challenges, we propose MSME, a Multi-Stage, Multi-Expert framework for zero-shot stance detection. MSME consists of three stages: (1) Knowledge Preparation, where relevant background knowledge is retrieved and stance labels are clarified; (2) Expert Reasoning, involving three specialized modules-Knowledge Expert distills salient facts and reasons from a knowledge perspective, Label Expert refines stance labels and reasons accordingly, and Pragmatic Expert detects rhetorical cues such as irony to infer intent from a pragmatic angle; (3) Decision Aggregation, where a Meta-Judge integrates all expert analyses to produce the final stance prediction. Experiments on three public datasets show that MSME achieves state-of-the-art performance across the board.

[NLP-35] RapidUn: Influence-Driven Parameter Reweighting for Efficient Large Language Model Unlearning

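Quick read: This paper addresses the difficulty of removing the influence of specific data from large language models (LLMs), especially when the forget set is small or imbalanced, where conventional approximate unlearning methods are often unstable and inefficient. The key to its solution is RapidUn, an influence-driven, parameter-efficient unlearning framework. It first computes a per-sample influence score through a fast estimation module, then maps these scores to adaptive update weights that guide selective parameter updates, removing harmful behavior while retaining general knowledge. The method is up to 100 times more efficient than full retraining and outperforms Fisher, GA, and LoReUn on both in-distribution and out-of-distribution forgetting, establishing influence-guided parameter reweighting as a scalable and interpretable paradigm for LLM unlearning.

Link: https://arxiv.org/abs/2512.04457
Authors: Guoshenghui Zhao,Huawei Lin,Weijie Zhao
Affiliations: Golisano College of Computing and Information Sciences, Rochester Institute of Technology
Categories: Computation and Language (cs.CL)
Comments: Code available at: this https URL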

Abstract:Removing specific data influence from large language models (LLMs) remains challenging, as retraining is costly and existing approximate unlearning methods are often unstable. The challenge is exacerbated when the forget set is small or imbalanced. We introduce RapidUn, an influence-driven and parameter-efficient unlearning framework. It first estimates per-sample influence through a fast estimation module, then maps these scores into adaptive update weights that guide selective parameter updates – forgetting harmful behavior while retaining general knowledge. On Mistral-7B and Llama-3-8B across Dolly-15k and Alpaca-57k, RapidUn achieves up to 100 times higher efficiency than full retraining and consistently outperforms Fisher, GA, and LoReUn on both in-distribution and out-of-distribution forgetting. These results establish influence-guided parameter reweighting as a scalable and interpretable paradigm for LLM unlearning.

[NLP-36] Sarcasm Detection on Reddit Using Classical Machine Learning and Feature Engineering

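Quick read: This paper addresses automatic sarcasm detection in online discussions, where the core challenge is that the intended meaning often contradicts the literal wording, making it hard for machines to judge. The key to its solution is to use only classical machine learning methods (logistic regression, linear SVM, multinomial Naive Bayes, and random forest) with explicit feature engineering, without neural networks or context from parent comments. Using the SARC 2.0 dataset, the author combines word-level and character-level TF-IDF features with simple stylistic indicators and finds that Naive Bayes and logistic regression perform best, with F1-scores around 0.57, giving a clear, reproducible baseline for lightweight, interpretable sarcasm detection.

Link: https://arxiv.org/abs/2512.04396
Authors: Subrata Karmaker
Affiliations: Technische Universität Chemnitz
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 11 pages, 2 figures, includes full Python code. Classical machine learning baseline for sarcasm detection on the SARC 2.0 dataset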

Abstract:Sarcasm is common in online discussions, yet difficult for machines to identify because the intended meaning often contradicts the literal wording. In this work, I study sarcasm detection using only classical machine learning methods and explicit feature engineering, without relying on neural networks or context from parent comments. Using a 100,000-comment subsample of the Self-Annotated Reddit Corpus (SARC 2.0), I combine word-level and character-level TF-IDF features with simple stylistic indicators. Four models are evaluated: logistic regression, a linear SVM, multinomial Naive Bayes, and a random forest. Naive Bayes and logistic regression perform the strongest, achieving F1-scores around 0.57 for sarcastic comments. Although the lack of conversational context limits performance, the results offer a clear and reproducible baseline for sarcasm detection using lightweight and interpretable methods.

[NLP-37] MASE: Interpretable NLP Models via Model-Agnostic Saliency Estimation

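Quick read: This paper addresses the limited interpretability of deep neural networks (DNNs) in natural language processing (NLP), in particular that existing post-hoc explanation methods such as saliency maps or feature visualization are hard to apply directly to discrete word data. The key to its solution is the Model-agnostic Saliency Estimation (MASE) framework, which applies Normalized Linear Gaussian Perturbations (NLGP) to the embedding layer rather than to raw word inputs in order to estimate input saliency efficiently. It thereby provides local explanations for text prediction models without requiring knowledge of the model's internal structure. Experiments show that MASE outperforms other model-agnostic explanation methods on the Delta Accuracy metric.

Link: https://arxiv.org/abs/2512.04386
Authors: Zhou Yang,Shunyan Luo,Jiazhen Zhu,Fang Jin
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: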

Abstract:Deep neural networks (DNNs) have made significant strides in Natural Language Processing (NLP), yet their interpretability remains elusive, particularly when evaluating their intricate decision-making processes. Traditional methods often rely on post-hoc interpretations, such as saliency maps or feature visualization, which might not be directly applicable to the discrete nature of word data in NLP. Addressing this, we introduce the Model-agnostic Saliency Estimation (MASE) framework. MASE offers local explanations for text-based predictive models without necessitating in-depth knowledge of a model’s internal architecture. By leveraging Normalized Linear Gaussian Perturbations (NLGP) on the embedding layer instead of raw word inputs, MASE efficiently estimates input saliency. Our results indicate MASE’s superiority over other model-agnostic interpretation methods, especially in terms of Delta Accuracy, positioning it as a promising tool for elucidating the operations of text-based models in NLP.

[NLP-38] LangSAT: A Novel Framework Combining NLP and Reinforcement Learning for SAT Solving

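Quick read: This paper addresses two problems: traditional Boolean satisfiability (SAT) solvers place high demands on user input and cannot directly handle natural language descriptions, and existing heuristic selection strategies are not efficient enough on complex problems. The key to its solution is the LangSAT framework, which has two parts. The Lang2Logic module automatically converts English descriptions into Conjunctive Normal Form (CNF) expressions. The SmartSAT module uses reinforcement learning (RL) to optimize heuristic selection within conflict-driven clause learning (CDCL), representing variable-clause relationships as graphs and extracting global features to improve solving efficiency. This design makes SAT solving more accessible and adaptable, particularly for reasoning, formal verification, and debugging.

Link: https://arxiv.org/abs/2512.04374
Authors: Muyu Pan,Matthew Walter,Dheeraj Kodakandla,Mahfuza Farooque
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL)
Comments: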

Abstract:Our work presents a novel reinforcement learning (RL) based framework to optimize heuristic selection within the conflict-driven clause learning (CDCL) process, improving the efficiency of Boolean satisfiability (SAT) solving. The proposed system, LangSAT, bridges the gap between natural language inputs and propositional logic by converting English descriptions into Conjunctive Normal Form (CNF) expressions and solving them using an RL-enhanced CDCL SAT solver. Unlike existing SAT-solving platforms that require CNF as input, LangSAT enables users to input standard English descriptions, making SAT-solving more accessible. The framework comprises two key components: Lang2Logic, which translates English sentences into CNF expressions, and SmartSAT, an RL-based SAT solver. SmartSAT encodes clause-variable relationships as structured graph representations and extracts global features specific to the SAT problem. This implementation provides the RL agent with deeper contextual information, enabling SAT problems to be solved more efficiently. Lang2Logic was evaluated on diverse natural language inputs, processing descriptions up to 450 words. The generated CNFs were solved by SmartSAT, which demonstrated comparable performance to traditional CDCL heuristics with respect to solving time. The combined LangSAT framework offers a more accessible and scalable solution for SAT-solving tasks across reasoning, formal verification, and debugging.
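
To make the CNF target format concrete, the sketch below encodes two English statements by hand as clauses of signed integers and checks them by brute force. It is only an illustration of what a CNF input looks like: the encoding is a hand-made assumption, and the exhaustive search stands in for a real solver; it is neither CDCL nor the paper's RL-based SmartSAT.

```python
from itertools import product

def is_satisfiable(clauses):
    """Brute-force check of a CNF formula given as lists of signed integers."""
    variables = sorted({abs(lit) for clause in clauses for lit in clause})
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        # A clause holds if any literal is true; the formula holds if all clauses do.
        if all(any(assignment[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses):
            return assignment
    return None

# "If it rains (1), the ground is wet (2). It rains."
cnf = [[-1, 2], [1]]
print(is_satisfiable(cnf))
# Adding "The ground is not wet." makes the statements contradictory.
print(is_satisfiable(cnf + [[-2]]))
```

Exhaustive search is exponential in the number of variables, which is why practical solvers rely on CDCL and on branching heuristics of the kind the paper tries to learn.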

[NLP-39] Mitigating Object and Action Hallucinations in Multimodal LLMs via Self-Augmented Contrastive Alignment WACV

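Quick read: This paper addresses the factual errors common in video captions generated by multimodal large language models (MLLMs), specifically hallucinations at the level of visual objects and temporal actions. Existing methods mainly mitigate hallucinations in static images, and there is no effective solution for joint object and action hallucinations in dynamic video. The paper proposes the Self-Augmented Contrastive Alignment (SANTA) framework. Its core is a hallucinative self-augmentation scheme that identifies potential hallucinations inside the MLLM and turns the original captions into contrastive negatives, together with a tracklet-phrase contrastive alignment strategy that matches regional objects and relation-guided actions to their corresponding visual and temporal phrases. This suppresses spurious correlations, reinforces consistency with visual facts, and improves the faithfulness and accuracy of video captions.

Link: https://arxiv.org/abs/2512.04356
Authors: Kai-Po Chang,Wei-Yuan Cheng,Chi-Pin Huang,Fu-En Yang,Yu-Chiang Frank Wang
Affiliations: National Taiwan University; NVIDIA
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2026. Project page: this https URL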

Abstract:Recent advancement in multimodal LLMs (MLLMs) has demonstrated their remarkable capability to generate descriptive captions for input videos. However, these models suffer from factual inaccuracies in the generated descriptions, causing severe hallucination issues. While prior works have explored alleviating hallucinations for static images, jointly mitigating visual object and temporal action hallucinations for dynamic videos remains a challenging and unsolved task. To tackle this challenge, we propose a Self-Augmented Contrastive Alignment (SANTA) framework for enabling object and action faithfulness by exempting the spurious correlations and enforcing the emphasis on visual facts. SANTA employs a hallucinative self-augmentation scheme to identify the potential hallucinations that lie in the MLLM and transform the original captions to the contrasted negatives. Furthermore, we develop a tracklet-phrase contrastive alignment to match the regional objects and relation-guided actions with their corresponding visual and temporal phrases. Extensive experiments demonstrate that SANTA outperforms existing methods in alleviating object and action hallucinations, yielding superior performance on the hallucination examination benchmarks.

[NLP-40] ClusterFusion: Hybrid Clustering with Embedding Guidance and LLM Adaptation

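Quick read: This paper addresses the limited performance of traditional text clustering in domain-specific settings, where clustering algorithms built on pre-trained embeddings adapt poorly to domain context without costly fine-tuning. The key to its solution is the ClusterFusion framework, which treats the large language model (LLM) as the clustering core, guided by lightweight embedding methods, and proceeds in three stages: embedding-guided subset partition, LLM-driven topic summarization, and LLM-based topic assignment. This design lets domain knowledge and user preferences enter the clustering process directly and exploits the contextual adaptability of LLMs, yielding clear gains on both standard tasks and domain-specific datasets.

Link: https://arxiv.org/abs/2512.04350
Authors: Yiming Xu,Yuan Yuan,Vijay Viswanathan,Graham Neubig
Affiliations: Adobe; Carnegie Mellon University
Categories: Computation and Language (cs.CL)
Comments: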

Abstract:Text clustering is a fundamental task in natural language processing, yet traditional clustering algorithms with pre-trained embeddings often struggle in domain-specific contexts without costly fine-tuning. Large language models (LLMs) provide strong contextual reasoning, yet prior work mainly uses them as auxiliary modules to refine embeddings or adjust cluster boundaries. We propose ClusterFusion, a hybrid framework that instead treats the LLM as the clustering core, guided by lightweight embedding methods. The framework proceeds in three stages: embedding-guided subset partition, LLM-driven topic summarization, and LLM-based topic assignment. This design enables direct incorporation of domain knowledge and user preferences, fully leveraging the contextual adaptability of LLMs. Experiments on three public benchmarks and two new domain-specific datasets demonstrate that ClusterFusion not only achieves state-of-the-art performance on standard tasks but also delivers substantial gains in specialized domains. To support future work, we release our newly constructed dataset and results on all benchmarks.

[NLP-41] DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle

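Quick read: This paper addresses the shortcomings of current generative AI in enterprise data intelligence workflows, focusing on the complexity of, and coordination between, two core stages: data engineering (DE) and data analysis (DA). Existing models have made progress on code generation but remain weak at building multi-stage SQL pipelines, iterating systems as requirements evolve, and strategic planning and reasoning on open-ended business problems. The key to its solution is DAComp, a benchmark of 210 tasks that closely mirrors real enterprise data intelligence workflows. DE tasks measure system-level engineering ability through execution-based, multi-metric evaluation, while DA tasks are scored by an experimentally validated LLM judge following hierarchical, fine-grained rubrics. Experiments show that state-of-the-art agents achieve success rates below 20% on DE tasks and average scores below 40% on DA tasks, indicating that engineering implementation and open-ended reasoning are two distinct capability bottlenecks and giving a rigorous, quantifiable target for developing autonomous enterprise data agents.

Link: https://arxiv.org/abs/2512.04324
Authors: Fangyu Lei,Jinxiang Meng,Yiming Huang,Junjie Zhao,Yitong Zhang,Jianwen Luo,Xin Zou,Ruiyi Yang,Wenbo Shi,Yan Gao,Shizhu He,Zuo Wang,Qian Liu,Yang Wang,Ke Wang,Jun Zhao,Kang Liu
Affiliations: Institute of Automation, Chinese Academy of Sciences; ByteDance
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: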

Abstract:Real-world enterprise data intelligence workflows encompass data engineering that turns raw sources into analytical-ready tables and data analysis that converts those tables into decision-oriented insights. We introduce DAComp, a benchmark of 210 tasks that mirrors these complex workflows. Data engineering (DE) tasks require repository-level engineering on industrial schemas, including designing and building multi-stage SQL pipelines from scratch and evolving existing systems under evolving requirements. Data analysis (DA) tasks pose open-ended business problems that demand strategic planning, exploratory analysis through iterative coding, interpretation of intermediate results, and the synthesis of actionable recommendations. Engineering tasks are scored through execution-based, multi-metric evaluation. Open-ended tasks are assessed by a reliable, experimentally validated LLM-judge, which is guided by hierarchical, meticulously crafted rubrics. Our experiments reveal that even state-of-the-art agents falter on DAComp. Performance on DE tasks is particularly low, with success rates under 20%, exposing a critical bottleneck in holistic pipeline orchestration, not merely code generation. Scores on DA tasks also average below 40%, highlighting profound deficiencies in open-ended reasoning and demonstrating that engineering and analysis are distinct capabilities. By clearly diagnosing these limitations, DAComp provides a rigorous and realistic testbed to drive the development of truly capable autonomous data agents for enterprise settings. Our data and code are available at this https URL

[NLP-42] Text-Only Training for Image Captioning with Retrieval Augmentation and Modality Gap Correction CVPR2026

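Quick read: This paper addresses the heavy reliance of image captioning on human-annotated image-text pairs, and specifically how to improve performance when training uses no aligned image-text pairs at all. The key to its solution is a new method called TOMCap, which builds on a pre-trained language model decoder and guides generation with visual representations extracted by CLIP (Contrastive Language-Image Pretraining). It also adds retrieval augmentation and modality gap reduction to narrow the semantic gap between text and visual representations, so that useful captions can be generated without image-text pairs.

Link: https://arxiv.org/abs/2512.04309
Authors: Rui Fonseca,Bruno Martins,Gil Rocha
Affiliations: INESC-ID; Instituto Superior Tecnico; University of Lisbon
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Submitted to CVPR 2026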

Abstract:Image captioning has drawn considerable attention from the natural language processing and computer vision fields. Aiming to reduce the reliance on curated data, several studies have explored image captioning without any humanly-annotated image-text pairs for training, although existing methods are still outperformed by fully supervised approaches. This paper proposes TOMCap, i.e., an improved text-only training method that performs captioning without the need for aligned image-caption pairs. The method is based on prompting a pre-trained language model decoder with information derived from a CLIP representation, after undergoing a process to reduce the modality gap. We specifically tested the combined use of retrieved examples of captions, and latent vector representations, to guide the generation process. Through extensive experiments, we show that TOMCap outperforms other training-free and text-only methods. We also analyze the impact of different choices regarding the configuration of the retrieval-augmentation and modality gap reduction components.

[NLP-43] SQuARE: Structured Query Adaptive Retrieval Engine For Tabular Formats

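Quick read: This paper addresses accurate question answering (QA) over real spreadsheets, where multirow headers, merged cells, and unit annotations break naive chunking and where SQL-based querying performs poorly on files without a consistent schema. The key to its solution is SQuARE, a hybrid retrieval framework with sheet-level, complexity-aware routing. It computes a continuous score from header depth and merge density and then chooses between structure-preserving chunk retrieval and SQL over an automatically constructed relational representation. A lightweight agent supervises the fusion and refinement of results across both paths when confidence is low. The approach preserves original cell information such as time labels and units while improving retrieval precision and end-to-end answer accuracy with predictable latency.

Link: https://arxiv.org/abs/2512.04292
Authors: Chinmay Gondhalekar,Urjitkumar Patel,Fang-Chun Yeh
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted in The IEEE International Workshop on Large Language Models in Finance, Dec 8-11, Macau, China, 2025, Preprint Copy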

Abstract:Accurate question answering over real spreadsheets remains difficult due to multirow headers, merged cells, and unit annotations that disrupt naive chunking, while rigid SQL views fail on files lacking consistent schemas. We present SQuARE, a hybrid retrieval framework with sheet-level, complexity-aware routing. It computes a continuous score based on header depth and merge density, then routes queries either through structure-preserving chunk retrieval or SQL over an automatically constructed relational representation. A lightweight agent supervises retrieval, refinement, or combination of results across both paths when confidence is low. This design maintains header hierarchies, time labels, and units, ensuring that returned values are faithful to the original cells and straightforward to verify. Evaluated on multi-header corporate balance sheets, a heavily merged World Bank workbook, and diverse public datasets, SQuARE consistently surpasses single-strategy baselines and ChatGPT-4o on both retrieval precision and end-to-end answer accuracy while keeping latency predictable. By decoupling retrieval from model choice, the system is compatible with emerging tabular foundation models and offers a practical bridge toward a more robust table understanding.
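
The routing idea, a continuous score built from header depth and merge density, might look like the sketch below. The weights, the normalization cap, the threshold, and the direction of the routing (complex sheets to chunk retrieval, flat sheets to SQL) are all assumptions made for illustration; the abstract does not give the actual formula.

```python
def complexity_score(header_rows, merged_cells, total_cells,
                     w_header=0.5, w_merge=0.5, max_header_rows=4):
    """Blend normalized header depth and merge density into a score in [0, 1]."""
    header_depth = min(header_rows, max_header_rows) / max_header_rows
    merge_density = merged_cells / total_cells if total_cells else 0.0
    return w_header * header_depth + w_merge * merge_density

def route(score, threshold=0.3):
    # Assumed policy: complex sheets keep their layout via chunk retrieval,
    # flat sheets are queried with SQL.
    return "chunk_retrieval" if score >= threshold else "sql"

flat_sheet = complexity_score(header_rows=1, merged_cells=0, total_cells=100)
merged_sheet = complexity_score(header_rows=3, merged_cells=40, total_cells=100)
print(route(flat_sheet), route(merged_sheet))
```

Keeping the score continuous rather than binary leaves room for a low-confidence band in which both paths are run and their results combined, which is the role the abstract gives to the lightweight agent.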

[NLP-44] Computational Linguistics Meets Libyan Dialect: A Study on Dialect Identification

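Quick read: This paper addresses the classification difficulties that non-standard spelling and lexical variation cause in Arabic dialect identification, focusing on automatic identification of the Libyan dialect in social media text. The key to its solution is feature selection (using the chi-square test to drop irrelevant features) and tuned n-gram representations (combining (1,2) word n-grams with (1,5) character n-grams) with a Multinomial Naive Bayes (MNB) model. On the QADI corpus this achieves 85.89% accuracy and an F1-score of 0.85741, slightly ahead of logistic regression and linear SVM, supporting the importance of feature engineering and model selection for Arabic dialect NLP tasks.

Link: https://arxiv.org/abs/2512.04257
Authors: Mansour Essgaer,Khamis Massud,Rabia Al Mamlook,Najah Ghmaid
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 13 pages, 8 figures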

Abstract:This study investigates logistic regression, linear support vector machine, multinomial Naive Bayes, and Bernoulli Naive Bayes for classifying Libyan dialect utterances gathered from Twitter. The dataset used is the QADI corpus, which consists of 540,000 sentences across 18 Arabic dialects. Preprocessing challenges include handling inconsistent orthographic variations and non-standard spellings typical of the Libyan dialect. The chi-square analysis revealed that certain features, such as email mentions and emotion indicators, were not significantly associated with dialect classification and were thus excluded from further analysis. Two main experiments were conducted: (1) evaluating the significance of meta-features extracted from the corpus using the chi-square test and (2) assessing classifier performance using different word and character n-gram representations. The classification experiments showed that Multinomial Naive Bayes (MNB) achieved the highest accuracy of 85.89% and an F1-score of 0.85741 when using a (1,2) word n-gram and (1,5) character n-gram representation. In contrast, Logistic Regression and Linear SVM exhibited slightly lower performance, with maximum accuracies of 84.41% and 84.73%, respectively. Additional evaluation metrics, including log loss, Cohen kappa, and Matthew correlation coefficient, further supported the effectiveness of MNB in this task. The results indicate that carefully selected n-gram representations and classification models play a crucial role in improving the accuracy of Libyan dialect identification. This study provides empirical benchmarks and insights for future research in Arabic dialect NLP applications.
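
The "(1,2) word n-gram and (1,5) character n-gram" representation can be illustrated with a small feature extractor. This is a plain-Python sketch of how such ranges expand a sentence into features; the helper names are hypothetical, and the paper's actual tokenization and vectorizer settings are not stated in the abstract.

```python
def word_ngrams(text, n_min=1, n_max=2):
    """All word n-grams with n_min <= n <= n_max, shortest first."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def char_ngrams(text, n_min=1, n_max=5):
    """All character n-grams with n_min <= n <= n_max, shortest first."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]

sentence = "shin halak"  # hypothetical transliterated example
features = word_ngrams(sentence) + char_ngrams(sentence)
print(len(features), features[:5])
```

Character n-grams up to length 5 capture spelling variants of the same word, which is one plausible reason they help with the inconsistent orthography the abstract mentions.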

[NLP-45] On GRPO Collapse in Search-R1: The Lazy Likelihood-Displacement Death Spiral

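[Quick Read]: This paper addresses training collapse in Group Relative Policy Optimization (GRPO) when it is used for tool-integrated reinforcement learning (TIRL). It identifies Lazy Likelihood Displacement (LLD), a systematic reduction or stagnation in the likelihood of both correct and incorrect responses, as the core mechanism behind the collapse. LLD triggers a self-reinforcing death spiral: declining likelihood produces low-confidence responses, which inflate gradients and ultimately destabilize the model. The key to the solution is LLDS, a lightweight likelihood-preserving regularization for GRPO that activates only when a trajectory's likelihood decreases and regularizes only the tokens responsible, mitigating LLD with minimal interference to optimization. Experiments on seven open-domain and multi-hop QA benchmarks show stabilized training and substantial performance gains.

Link: https://arxiv.org/abs/2512.04220
Authors: Wenlong Deng, Yushu Li, Boying Gong, Yi Ren, Christos Thrampoulidis, Xiaoxiao Li
Affiliations: University of British Columbia; Vector Institute; UC Berkeley
Categories: Computation and Language (cs.CL)
Comments: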

Abstract:Tool-integrated (TI) reinforcement learning (RL) enables large language models (LLMs) to perform multi-step reasoning by interacting with external tools such as search engines and retrievers. Group Relative Policy Optimization (GRPO), exemplified by the recent Search-R1, offers fast convergence and a value-free formulation that makes it appealing for this setting, yet consistently suffers from training collapse. We identify Lazy Likelihood Displacement (LLD), a systematic reduction or stagnation in the likelihood of both correct and incorrect responses, as the core mechanism driving this failure. LLD emerges early and triggers a self-reinforcing LLD Death Spiral, where declining likelihood leads to low-confidence responses, inflating gradients, and ultimately causing collapse. We empirically characterize this process across models on a Search-R1-style, search-integrated question answering task, revealing a consistent three-phase trajectory: early stagnation, steady decay, and accelerated collapse. To address this, we propose a lightweight likelihood-preserving regularization LLDS for GRPO that activates only when a trajectory’s likelihood decreases, and regularizes only the tokens responsible. This fine-grained structure mitigates LLD with minimal interference to optimization. Across seven open-domain and multi-hop QA benchmarks, our method stabilizes training, prevents gradient explosion, and yields substantial performance improvements, including +37.8% gains on Qwen2.5-3B and +32.0% gains on Qwen2.5-7B. Our results establish LLD as a fundamental bottleneck in GRPO-based TIRL and provide a practical path toward stable, scalable training of tool-integrated LLM.
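Two ingredients mentioned in the abstract can be sketched in a few lines: GRPO's group-relative advantage (each reward standardized within its sampled group) and a regularizer gated on likelihood decrease. The penalty below is only one plausible reading of "activates only when a trajectory's likelihood decreases, and regularizes only the tokens responsible"; the function names and the exact penalty form are assumptions, not the paper's LLDS formula.

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group, as GRPO does."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


def likelihood_preserving_penalty(old_logps, new_logps):
    """Hypothetical gated penalty over per-token log-probabilities.

    Returns 0 unless the trajectory's total log-likelihood dropped; when it
    did, only the tokens whose log-probability decreased contribute.
    """
    if sum(new_logps) >= sum(old_logps):
        return 0.0  # trajectory likelihood did not decrease: stay inactive
    return sum(max(0.0, old - new) for old, new in zip(old_logps, new_logps))
```

In a training loop such a penalty would be added to the GRPO loss with a small weight; that weight, like the rest of this sketch, would be a design choice rather than something stated in the abstract.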

[NLP-46] Balancing Safety and Helpfulness in Healthcare AI Assistants through Iterative Preference Alignment ML4H2025

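[Quick Read]: This paper addresses the difficulty of safely deploying large language models (LLMs) in healthcare: conversational medical assistants must avoid unsafe compliance without over-refusing benign queries. The key to the solution is an iterative post-deployment alignment framework that combines Kahneman-Tversky Optimization (KTO) and Direct Preference Optimization (DPO) to refine models against domain-specific safety signals. Evaluating four LLMs over multiple cycles on the CARES-18K benchmark exposes architecture-dependent calibration biases, and ablation studies clarify when self-evaluation is reliable and when external or fine-tuned judges are needed, with the aim of balancing patient safety, user trust, and clinical utility.

Link: https://arxiv.org/abs/2512.04210
Authors: Huy Nghiem, Swetasudha Panda, Devashish Khatwani, Huy V. Nguyen, Krishnaram Kenthapadi, Hal Daumé III
Affiliations: University of Maryland; Oracle Labs; Oracle Health AI
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: ML4H 2025 Proceedings, Best Paper Award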

Abstract:Large Language Models (LLMs) are increasingly used in healthcare, yet ensuring their safety and trustworthiness remains a barrier to deployment. Conversational medical assistants must avoid unsafe compliance without over-refusing benign queries. We present an iterative post-deployment alignment framework that applies Kahneman-Tversky Optimization (KTO) and Direct Preference Optimization (DPO) to refine models against domain-specific safety signals. Using the CARES-18K benchmark for adversarial robustness, we evaluate four LLMs (Llama-3B/8B, Meditron-8B, Mistral-7B) across multiple cycles. Our results show up to 42% improvement in safety-related metrics for harmful query detection, alongside interesting trade-offs against erroneous refusals, thereby exposing architecture-dependent calibration biases. We also perform ablation studies to identify when self-evaluation is reliable and when external or finetuned judges are necessary to maximize performance gains. Our findings underscore the importance of adopting best practices that balance patient safety, user trust, and clinical utility in the design of conversational medical assistants.
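For readers unfamiliar with DPO, the standard per-pair objective can be written in a few lines. This is a minimal sketch of the generic DPO loss only; it does not cover KTO or the paper's iterative pipeline, and the argument names and the default `beta` are assumptions chosen for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are sequence log-probabilities under the trained policy and a
    frozen reference model. The loss is -log(sigmoid(beta * margin)).
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # Numerically stable form of -log(sigmoid(z)) = log(1 + exp(-z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

The loss shrinks as the policy raises the chosen response's likelihood relative to the reference faster than the rejected one's. In a safety setting like the one described, a chosen response might be a safe refusal of a harmful query or a helpful answer to a benign one.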

[NLP-47] Network of Theseus (like the ship)

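[Quick Read]: This paper challenges a long-standing assumption in deep learning that the architecture used for training must also be the architecture used for inference. This assumption keeps researchers from choosing target architectures with better efficiency or design properties, because such architectures are often hard to optimize. The proposed Network of Theseus (NoT) progressively converts a guide network into an entirely different target architecture by replacing its components part by part and aligning them with representational similarity metrics, while largely preserving the guide network's performance. Decoupling the training architecture from the deployment architecture expands the space of viable inference-time architectures, opens new accuracy-efficiency trade-offs, and supports more directed exploration of the architectural design space.

Link: https://arxiv.org/abs/2512.04198
Authors: Vighnesh Subramaniam, Colin Conwell, Boris Katz, Andrei Barbu, Brian Cheung
Affiliations: MIT CSAIL; CBMM; Johns Hopkins University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Preprint. 24 pages, 9 figures, 8 tables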

Abstract:A standard assumption in deep learning is that the inductive bias introduced by a neural network architecture must persist from training through inference. The architecture you train with is the architecture you deploy. This assumption constrains the community from selecting architectures that may have desirable efficiency or design properties due to difficulties with optimization. We challenge this assumption with Network of Theseus (NoT), a method for progressively converting a trained, or even untrained, guide network architecture part-by-part into an entirely different target network architecture while preserving the performance of the guide network. At each stage, components in the guide network architecture are incrementally replaced with target architecture modules and aligned via representational similarity metrics. This procedure largely preserves the functionality of the guide network even under substantial architectural changes, for example converting a convolutional network into a multilayer perceptron, or GPT-2 into a recurrent neural network. By decoupling optimization from deployment, NoT expands the space of viable inference-time architectures, opening opportunities for better accuracy-efficiency tradeoffs and enabling more directed exploration of the architectural design space.

[NLP-48] Can machines perform a qualitative data analysis? Reading the debate with Alan Turing

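[Quick Read]: This paper responds to the critical debate over using large language models (LLMs) in qualitative data analysis and argues that the debate focuses on the wrong problems. Its central claim is that research should not dwell on whether LLMs are methodologically able to perform qualitative analysis, but should empirically examine whether such artificial systems can produce analyses comparable to those of human analysts. The key move is to reframe the discussion using ideas from Alan Turing's "Computing Machinery and Intelligence": instead of asking whether machines can perform qualitative analysis in principle, ask whether LLMs let us produce analyses that are sufficiently comparable to human ones, which makes the debate more empirical and actionable.

Link: https://arxiv.org/abs/2512.04121
Authors: Stefano De Paoli
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: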

Abstract:This paper reflects on the literature that rejects the use of Large Language Models (LLMs) in qualitative data analysis. It illustrates through empirical evidence as well as critical reflections why the current critical debate is focusing on the wrong problems. The paper proposes that the focus of researching the use of the LLMs for qualitative analysis is not the method per se, but rather the empirical investigation of an artificial system performing an analysis. The paper builds on the seminal work of Alan Turing and reads the current debate using key ideas from Turing “Computing Machinery and Intelligence”. This paper therefore reframes the debate on qualitative analysis with LLMs and states that rather than asking whether machines can perform qualitative analysis in principle, we should ask whether with LLMs we can produce analyses that are sufficiently comparable to human analysts. In the final part the contrary views to performing qualitative analysis with LLMs are analysed using the same writing and rhetorical style that Turing used in his seminal work, to discuss the contrary views to the main question.

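[NLP-49] Towards Contextual Sensitive Data Detection

[Quick Read]: This paper addresses insufficient protection of sensitive data in open data portals. Existing methods focus mainly on personal data and overlook the fact that data sensitivity depends heavily on context. The key to the solution is two mechanisms for contextual sensitive data detection. Type contextualization detects the semantic type of data values and then judges them against the overall context of the dataset or document, which significantly reduces false positives in type-based detection. Domain contextualization retrieves relevant rules from documents that specify data sensitivity (for example, by data topic and geographic origin), enabling context-grounded detection in non-standard domains such as humanitarian data. Experiments show that, assisted by large language models (LLMs), both mechanisms improve detection accuracy and recall, and that their explanations give useful guidance for manual review, improving the consistency of data audits.

Link: https://arxiv.org/abs/2512.04120
Authors: Liang Telkamp, Madelon Hulsebos
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Databases (cs.DB); Information Retrieval (cs.IR)
Comments: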

Abstract:The emergence of open data portals necessitates more attention to protecting sensitive data before datasets get published and exchanged. While an abundance of methods for suppressing sensitive data exist, the conceptualization of sensitive data and methods to detect it focus particularly on personal data that, if disclosed, may be harmful or violate privacy. We observe the need for refining and broadening our definitions of sensitive data, and argue that the sensitivity of data depends on its context. Based on this definition, we introduce two mechanisms for contextual sensitive data detection that consider the broader context of a dataset at hand. First, we introduce type contextualization, which first detects the semantic type of particular data values, then considers the overall context of the data values within the dataset or document. Second, we introduce domain contextualization which determines sensitivity of a given dataset in the broader context based on the retrieval of relevant rules from documents that specify data sensitivity (e.g., data topic and geographic origin). Experiments with these mechanisms, assisted by large language models (LLMs), confirm that: 1) type-contextualization significantly reduces the number of false positives for type-based sensitive data detection and reaches a recall of 94% compared to 63% with commercial tools, and 2) domain-contextualization leveraging sensitivity rule retrieval is effective for context-grounded sensitive data detection in non-standard data domains such as humanitarian datasets. Evaluation with humanitarian data experts also reveals that context-grounded LLM explanations provide useful guidance in manual data auditing processes, improving consistency. We open-source mechanisms and annotated datasets for contextual sensitive data detection at this https URL.

[NLP-50] Retrieval-Augmented Few-Shot Prompting Versus Fine-Tuning for Code Vulnerability Detection

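[Quick Read]: This paper addresses the limited performance of few-shot prompting in complex domains, specifically code vulnerability detection, where results depend heavily on the selection and quality of in-context examples. The key to the solution is retrieval-augmented prompting: semantically similar examples are retrieved from a knowledge base and used as in-context examples, which improves the model's ability to identify the target vulnerability categories. At 20 shots the approach reaches an F1 score of 74.05% and a partial match accuracy of 83.90%, clearly ahead of few-shot prompting with random examples and of zero-shot prompting. It also outperforms fine-tuned Gemini-1.5-Flash while avoiding the time and compute cost of fine-tuning, although fine-tuned CodeBERT scores higher.

Link: https://arxiv.org/abs/2512.04106
Authors: Fouad Trad, Ali Chehab
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: Accepted in the 3rd International Conference on Foundation and Large Language Models (FLLM2025)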

Abstract:Few-shot prompting has emerged as a practical alternative to fine-tuning for leveraging the capabilities of large language models (LLMs) in specialized tasks. However, its effectiveness depends heavily on the selection and quality of in-context examples, particularly in complex domains. In this work, we examine retrieval-augmented prompting as a strategy to improve few-shot performance in code vulnerability detection, where the goal is to identify one or more security-relevant weaknesses present in a given code snippet from a predefined set of vulnerability categories. We perform a systematic evaluation using the Gemini-1.5-Flash model across three approaches: (1) standard few-shot prompting with randomly selected examples, (2) retrieval-augmented prompting using semantically similar examples, and (3) retrieval-based labeling, which assigns labels based on retrieved examples without model inference. Our results show that retrieval-augmented prompting consistently outperforms the other prompting strategies. At 20 shots, it achieves an F1 score of 74.05% and a partial match accuracy of 83.90%. We further compare this approach against zero-shot prompting and several fine-tuned models, including Gemini-1.5-Flash and smaller open-source models such as DistilBERT, DistilGPT2, and CodeBERT. Retrieval-augmented prompting outperforms both zero-shot (F1 score: 36.35%, partial match accuracy: 20.30%) and fine-tuned Gemini (F1 score: 59.31%, partial match accuracy: 53.10%), while avoiding the training time and cost associated with model fine-tuning. On the other hand, fine-tuning CodeBERT yields higher performance (F1 score: 91.22%, partial match accuracy: 91.30%) but requires additional training, maintenance effort, and resources.
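The difference between the retrieval-augmented strategy and retrieval-based labeling can be sketched as follows. This is a toy illustration under stated assumptions: bag-of-words cosine similarity stands in for whatever semantic retriever the authors use, majority voting is an assumed labeling rule, and `KNOWLEDGE_BASE`, the prompt wording, and all function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical labeled examples; a real pool would be far larger.
KNOWLEDGE_BASE = [
    {"code": "strcpy ( buf , user_input )", "label": "buffer-overflow"},
    {"code": 'query = "SELECT * FROM t WHERE id=" + user_id', "label": "sql-injection"},
    {"code": "gets ( buf )", "label": "buffer-overflow"},
]


def bow(text):
    """Lowercased whitespace tokens as a bag of words."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query, examples, k):
    """Return the k examples most similar to the query snippet."""
    q = bow(query)
    ranked = sorted(examples, key=lambda ex: cosine(q, bow(ex["code"])), reverse=True)
    return ranked[:k]


def build_prompt(query, shots):
    """Retrieval-augmented prompting: retrieved shots become in-context examples."""
    parts = ["Label the vulnerability category of the final snippet."]
    for ex in shots:
        parts.append(f"Code:\n{ex['code']}\nLabel: {ex['label']}")
    parts.append(f"Code:\n{query}\nLabel:")
    return "\n\n".join(parts)


def label_by_retrieval(query, examples, k):
    """Retrieval-based labeling: majority label of the neighbours, no model call."""
    labels = [ex["label"] for ex in retrieve(query, examples, k)]
    return Counter(labels).most_common(1)[0][0]
```

The string returned by `build_prompt` would be sent to the LLM, whereas `label_by_retrieval` skips inference entirely, which is what separates approaches (2) and (3) in the abstract.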

[NLP-51] Limit cycles for speech

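[Quick Read]: This paper asks whether the articulatory movements that produce speech are themselves rhythmic. Human speech shows rhythmic fluctuations in acoustic energy and in cortical oscillations, yet the traditional view treats speech movements as discrete, goal-oriented actions, which is hard to reconcile with these observations. The key to the solution is an unintuitive but equally principled representation of discrete movements that reveals a pervasive limit cycle organization in articulatory movements and recovers previously inaccessible rhythmic structure underlying speech motor activity. This reconciles biological rhythmicity with discreteness at the level of individual articulatory actions.

Link: https://arxiv.org/abs/2512.04642
Authors: Adamantios I. Gafos, Stephan R. Kuberski
Affiliations: Unknown
Categories: Neurons and Cognition (q-bio.NC); Computation and Language (cs.CL)
Comments: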

Abstract:Rhythmic fluctuations in acoustic energy and accompanying neuronal excitations in cortical oscillations are characteristic of human speech, yet whether a corresponding rhythmicity inheres in the articulatory movements that generate speech remains unclear. The received understanding of speech movements as discrete, goal-oriented actions struggles to make contact with the rhythmicity findings. In this work, we demonstrate that an unintuitive – but no less principled than the conventional – representation for discrete movements reveals a pervasive limit cycle organization and unlocks the recovery of previously inaccessible rhythmic structure underlying the motor activity of speech. These results help resolve a time-honored tension between the ubiquity of biological rhythmicity and discreteness in speech, the quintessential human higher function, by revealing a rhythmic organization at the most fundamental level of individual articulatory actions.

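[NLP-52] Human-Centred Evaluation of Text-to-Image Generation Models for Self-expression of Mental Distress: A Dataset Based on GPT-4o

[Quick Read]: This paper addresses the difficulty international students have in expressing mental distress because of linguistic and cultural barriers. The key to the solution is to use generative AI: students' written descriptions are elaborated by GPT-4o with four persona-based prompt templates rooted in contemporary counselling practice and turned into images that support emotional expression. The study finds that prompt design substantially affects how helpful the images are perceived to be, with the illustrator persona receiving the highest ratings, which points to persona-based prompting as the central factor in making images useful for self-expression.

Link: https://arxiv.org/abs/2512.04087
Authors: Sui He, Shenbin Qian
Affiliations: Unknown
Categories: Neurons and Cognition (q-bio.NC); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: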

Abstract:Effective communication is central to achieving positive healthcare outcomes in mental health contexts, yet international students often face linguistic and cultural barriers that hinder their communication of mental distress. In this study, we evaluate the effectiveness of AI-generated images in supporting self-expression of mental distress. To achieve this, twenty Chinese international students studying at UK universities were invited to describe their personal experiences of mental distress. These descriptions were elaborated using GPT-4o with four persona-based prompt templates rooted in contemporary counselling practice to generate corresponding images. Participants then evaluated the helpfulness of generated images in facilitating the expression of their feelings based on their original descriptions. The resulting dataset comprises 100 textual descriptions of mental distress, 400 generated images, and corresponding human evaluation scores. Findings indicate that prompt design substantially affects perceived helpfulness, with the illustrator persona achieving the highest ratings. This work introduces the first publicly available text-to-image evaluation dataset with human judgment scores in the mental health domain, offering valuable resources for image evaluation, reinforcement learning with human feedback, and multi-modal research on mental health communication.

Computer Vision

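[CV-0] The Universal Weight Subspace Hypothesis

[Quick Read]: This paper asks whether deep neural networks trained on different tasks and from different initializations share low-dimensional parameter subspaces, that is, whether their internal structure follows a universal organizing pattern. The key to the solution is a large-scale empirical study of over 1100 models (including Mistral-7B LoRAs, Vision Transformers, and LLaMA-8B models) that applies spectral decomposition to weight matrices in a mode-wise analysis. It finds that, regardless of task or data domain, networks systematically converge to shared low-dimensional spectral subspaces in which a few principal directions capture most of the variance, revealing a universal structure in how deep networks organize information.

Link: https://arxiv.org/abs/2512.05117
Authors: Prakhar Kaushik, Shravan Chaudhari, Ankit Vaidya, Rama Chellappa, Alan Yuille
Affiliations: Johns Hopkins University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 37 pages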

Abstract:We show that deep neural networks trained across diverse tasks exhibit remarkably similar low-dimensional parametric subspaces. We provide the first large-scale empirical evidence that demonstrates that neural networks systematically converge to shared spectral subspaces regardless of initialization, task, or domain. Through mode-wise spectral analysis of over 1100 models - including 500 Mistral-7B LoRAs, 500 Vision Transformers, and 50 LLaMA-8B models - we identify universal subspaces capturing majority variance in just a few principal directions. By applying spectral decomposition techniques to the weight matrices of various architectures trained on a wide range of tasks and datasets, we identify sparse, joint subspaces that are consistently exploited, within shared architectures across diverse tasks and datasets. Our findings offer new insights into the intrinsic organization of information within deep networks and raise important questions about the possibility of discovering these universal subspaces without the need for extensive data and computational resources. Furthermore, this inherent structure has significant implications for model reusability, multi-task learning, model merging, and the development of training and inference-efficient algorithms, potentially reducing the carbon footprint of large-scale neural models.

[CV-1] Value Gradient Guidance for Flow Matching Alignment NEURIPS2025

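[Quick Read]: This paper addresses two shortcomings of existing methods for aligning flow matching models with human preferences: low adaptation efficiency and the lack of probabilistically sound prior preservation. Existing approaches fail to achieve both. The key to the solution is VGG-Flow, which draws on optimal control theory and uses gradient matching: the difference between the finetuned velocity field and the pretrained one is matched to the gradient field of a value function. This incorporates first-order information from the reward model, and a heuristic initialization of the value function enables fast adaptation. Experiments show that the method can finetune text-to-image models such as Stable Diffusion 3 under limited computational budgets while preserving the prior.

Link: https://arxiv.org/abs/2512.05116
Authors: Zhen Liu, Tim Z. Xiao, Carles Domingo-Enrich, Weiyang Liu, Dinghuai Zhang
Affiliations: The Chinese University of Hong Kong (Shenzhen); University of Tübingen; Microsoft Research; The Chinese University of Hong Kong; Mila – Quebec AI Institute
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at NeurIPS 2025; 26 pages, 20 figures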

Abstract:While methods exist for aligning flow matching models–a popular and effective class of generative models–with human preferences, existing approaches fail to achieve both adaptation efficiency and probabilistically sound prior preservation. In this work, we leverage the theory of optimal control and propose VGG-Flow, a gradient-matching-based method for finetuning pretrained flow matching models. The key idea behind this algorithm is that the optimal difference between the finetuned velocity field and the pretrained one should be matched with the gradient field of a value function. This method not only incorporates first-order information from the reward model but also benefits from heuristic initialization of the value function to enable fast adaptation. Empirically, we show on a popular text-to-image flow matching model, Stable Diffusion 3, that our method can finetune flow matching models under limited computational budgets while achieving effective and prior-preserving alignment.

[CV-2] Light-X: Generative 4D Video Rendering with Camera and Illumination Control

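[Quick Read]: This paper addresses the trade-off between lighting fidelity and temporal consistency in video relighting, and the joint control of camera trajectory and illumination for real-world scenes. The core challenge is to control viewpoint and lighting precisely and simultaneously, starting from a monocular video. The key to the solution is the Light-X framework. First, a disentangled design separates geometry and lighting signals explicitly: geometry and motion are captured by dynamic point clouds projected along user-defined camera trajectories, while illumination cues come from a relit frame consistently projected into the same geometry. Second, to address the lack of paired multi-view and multi-illumination videos, the Light-Syn pipeline synthesizes training pairs from in-the-wild monocular footage using degradation modeling and inverse mapping, which improves robustness and generalization.

Link: https://arxiv.org/abs/2512.05115
Authors: Tianqi Liu, Zhaoxi Chen, Zihao Huang, Shaocong Xu, Saining Zhang, Chongjie Ye, Bohan Li, Zhiguo Cao, Wei Li, Hao Zhao, Ziwei Liu
Affiliations: S-Lab, NTU; BAAI; HUST; AIR, THU; FNii, CUHKSZ; SJTU; EIT (Ningbo)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL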

Abstract:Recent advances in illumination control extend image-based methods to video, yet still face a trade-off between lighting fidelity and temporal consistency. Moving beyond relighting, a key step toward generative modeling of real-world scenes is the joint control of camera trajectory and illumination, since visual dynamics are inherently shaped by both geometry and lighting. To this end, we present Light-X, a video generation framework that enables controllable rendering from monocular videos with both viewpoint and illumination control. 1) We propose a disentangled design that decouples geometry and lighting signals: geometry and motion are captured via dynamic point clouds projected along user-defined camera trajectories, while illumination cues are provided by a relit frame consistently projected into the same geometry. These explicit, fine-grained cues enable effective disentanglement and guide high-quality illumination. 2) To address the lack of paired multi-view and multi-illumination videos, we introduce Light-Syn, a degradation-based pipeline with inverse-mapping that synthesizes training pairs from in-the-wild monocular footage. This strategy yields a dataset covering static, dynamic, and AI-generated scenes, ensuring robust training. Extensive experiments show that Light-X outperforms baseline methods in joint camera-illumination control and surpasses prior video relighting methods under both text- and background-conditioned settings.

[CV-3] Deep infant brain segmentation from multi-contrast MRI

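[Quick Read]: This paper addresses brain MRI segmentation in infants and young children. Acquisition is difficult, available imaging modalities are inconsistent, the field of view contains substantial non-head anatomy, and motion artifacts are frequent, so existing segmentation models are often limited to specific image types or narrow age groups and are fragile on the more variable images acquired clinically. The key to the solution is BabySeg, which builds on domain randomization, synthesizing training images far beyond realistic bounds to make the model robust to dataset shift, and adds a mechanism that flexibly pools and interacts features from any number of input scans. A single model thus segments accurately across diverse MRI protocols, repeat scans, and image types unseen during training.

Link: https://arxiv.org/abs/2512.05114
Authors: Malte Hoffmann, Lilla Zöllei, Adrian V. Dalca
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 8 pages, 8 figures, 1 table, website at this https URL, presented at the 2025 IEEE Asilomar Conference on Signals, Systems, and Computers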

Abstract:Segmentation of magnetic resonance images (MRI) facilitates analysis of human brain development by delineating anatomical structures. However, in infants and young children, accurate segmentation is challenging due to development and imaging constraints. Pediatric brain MRI is notoriously difficult to acquire, with inconsistent availability of imaging modalities, substantial non-head anatomy in the field of view, and frequent motion artifacts. This has led to specialized segmentation models that are often limited to specific image types or narrow age groups, or that are fragile for more variable images such as those acquired clinically. We address this method fragmentation with BabySeg, a deep learning brain segmentation framework for infants and young children that supports diverse MRI protocols, including repeat scans and image types unavailable during training. Our approach builds on recent domain randomization techniques, which synthesize training images far beyond realistic bounds to promote dataset shift invariance. We also describe a mechanism that enables models to flexibly pool and interact features from any number of input scans. We demonstrate state-of-the-art performance that matches or exceeds the accuracy of several existing methods for various age cohorts and input configurations using a single model, in a fraction of the runtime required by many existing tools.

[CV-4] Splannequin: Freezing Monocular Mannequin-Challenge Footage with Dual-Detection Splatting WACV2025

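[Quick Read]: This paper addresses the synthesis of high-fidelity frozen 3D scenes from monocular Mannequin-Challenge (MC) videos. The core challenge is to preserve subtle dynamics for user-controlled instant selection without multi-view information. Whereas conventional dynamic scene reconstruction focuses on modeling motion, the goal here is a frozen scene that strategically retains local temporal variation. The key to the solution is Splannequin, an architecture-agnostic regularization that detects two states of Gaussian primitives, hidden and defective, and applies temporal anchoring to each: hidden states are anchored to their recent well-observed past states, and defective states to future states with stronger supervision. The method integrates into existing dynamic Gaussian splatting pipelines through simple loss terms, requires no architectural changes, adds no inference overhead, and markedly improves visual quality, reaching a 96% user preference.

Link: https://arxiv.org/abs/2512.05113
Authors: Hao-Jen Chien, Yi-Chuan Huang, Chung-Ho Wu, Wei-Lun Chao, Yu-Lun Liu
Affiliations: National Yang Ming Chiao Tung University; The Ohio State University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: WACV 2025. Project page: this https URL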

Abstract:Synthesizing high-fidelity frozen 3D scenes from monocular Mannequin-Challenge (MC) videos is a unique problem distinct from standard dynamic scene reconstruction. Instead of focusing on modeling motion, our goal is to create a frozen scene while strategically preserving subtle dynamics to enable user-controlled instant selection. To achieve this, we introduce a novel application of dynamic Gaussian splatting: the scene is modeled dynamically, which retains nearby temporal variation, and a static scene is rendered by fixing the model’s time parameter. However, under this usage, monocular capture with sparse temporal supervision introduces artifacts like ghosting and blur for Gaussians that become unobserved or occluded at weakly supervised timestamps. We propose Splannequin, an architecture-agnostic regularization that detects two states of Gaussian primitives, hidden and defective, and applies temporal anchoring. Under predominantly forward camera motion, hidden states are anchored to their recent well-observed past states, while defective states are anchored to future states with stronger supervision. Our method integrates into existing dynamic Gaussian pipelines via simple loss terms, requires no architectural changes, and adds zero inference overhead. This results in markedly improved visual quality, enabling high-fidelity, user-selectable frozen-time renderings, validated by a 96% user preference. Project page: this https URL

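[CV-5] ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning

[Quick Read]: This paper addresses three problems in current reward models for vision-language systems: hallucination, weak visual grounding, and the inability to use tools for verification, all of which limit reliability on complex multimodal reasoning tasks. The key to the solution is ARM-Thinker, an agentic multimodal reward model that autonomously invokes external tools (such as image cropping and document page retrieval) to ground its judgments in verifiable evidence, replacing static, non-interactive scoring. The model is trained with multi-stage reinforcement learning that jointly optimizes tool-calling decisions and judgment accuracy, improving fine-grained visual grounding, multi-page document understanding, and instruction following, and yielding more accurate and interpretable reward modeling.

Link: https://arxiv.org/abs/2512.05111
Authors: Shengyuan Ding, Xinyu Fang, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiangyu Zhao, Haodong Duan, Xiaoyi Dong, Jianze Liang, Bin Wang, Conghui He, Dahua Lin, Jiaqi Wang
Affiliations: Fudan University; Shanghai Artificial Intelligence Laboratory; Zhejiang University; Shanghai Jiao Tong University; The Chinese University of Hong Kong; Shanghai Innovation Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: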

Abstract:Reward models are critical for aligning vision-language systems with human preferences, yet current approaches suffer from hallucination, weak visual grounding, and an inability to use tools for verification, limiting their reliability on complex multimodal reasoning tasks. We present ARM-Thinker, an Agentic multimodal Reward Model that autonomously invokes external tools (e.g., image cropping, doc page retrieval) to ground judgments in verifiable evidence, replacing static, non-interactive reward scoring. This enables the model to verify fine-grained visual details, cross-reference multi-page evidence, and validate reasoning claims, which are capabilities absent in existing reward models. We train ARM-Thinker with multi-stage reinforcement learning, jointly optimizing tool-calling decisions and judgment accuracy. To evaluate agentic reward modeling, we introduce ARMBench-VL, comprising three benchmarks that assess fine-grained visual grounding (image-level tools), multi-page document understanding (retrieval tools), and instruction following (text-level verification). ARM-Thinker achieves +16.2% average improvement on reward modeling benchmarks, +9.6% on tool-use tasks, and outperforms baselines on multimodal math and logical reasoning benchmarks. Our results demonstrate that agentic capabilities significantly enhance both accuracy and interpretability of reward models.

[CV-6] ShadowDraw: From Any Object to Shadow-Drawing Compositional Art

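[Quick Read]: This paper addresses how to turn ordinary 3D objects into shadow-drawing art: scene parameters such as object pose and lighting are optimized so that the cast shadow completes a partial line drawing into a recognizable image. The key to the solution is the ShadowDraw framework, which optimizes scene configurations to reveal meaningful shadows, uses shadow strokes to guide line drawing generation, and adopts automatic evaluation to enforce shadow-drawing coherence and visual quality, bridging algorithmic design and artistic expression.

Link: https://arxiv.org/abs/2512.05110
Authors: Rundong Luo, Noah Snavely, Wei-Chiu Ma
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: Project page: this https URL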

Abstract:We introduce ShadowDraw, a framework that transforms ordinary 3D objects into shadow-drawing compositional art. Given a 3D object, our system predicts scene parameters, including object pose and lighting, together with a partial line drawing, such that the cast shadow completes the drawing into a recognizable image. To this end, we optimize scene configurations to reveal meaningful shadows, employ shadow strokes to guide line drawing generation, and adopt automatic evaluation to enforce shadow-drawing coherence and visual quality. Experiments show that ShadowDraw produces compelling results across diverse inputs, from real-world scans and curated datasets to generative assets, and naturally extends to multi-object scenes, animations, and physical deployments. Our work provides a practical pipeline for creating shadow-drawing art and broadens the design space of computational visual art, bridging the gap between algorithmic design and artistic storytelling. Check out our project page this https URL for more results and an end-to-end real-world demonstration of our pipeline!

[CV-7] NeuralRemaster: Phase-Preserving Diffusion for Structure-Aligned Generation

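[Quick Read]: This paper addresses the loss of spatial structure in standard diffusion models, whose Gaussian noise corrupts phase information, making them ill-suited to tasks that require geometric consistency such as re-rendering, simulation enhancement, and image-to-image translation. The key to the solution is Phase-Preserving Diffusion (φ-PD), a model-agnostic reformulation of the diffusion process that preserves the input phase and randomizes only the magnitude, enabling structure-aligned generation without architectural changes or additional parameters. The paper further proposes Frequency-Selective Structured (FSS) noise, which gives continuous control over structural rigidity through a single frequency-cutoff parameter.

Link: https://arxiv.org/abs/2512.05106
Authors: Yu Zeng, Charles Ochoa, Mingyuan Zhou, Vishal M. Patel, Vitor Guizilini, Rowan McAllister
Affiliations: Toyota Research Institute; University of Texas, Austin; Johns Hopkins University
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: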

Abstract:Standard diffusion corrupts data using Gaussian noise whose Fourier coefficients have random magnitudes and random phases. While effective for unconditional or text-to-image generation, corrupting phase components destroys spatial structure, making it ill-suited for tasks requiring geometric consistency, such as re-rendering, simulation enhancement, and image-to-image translation. We introduce Phase-Preserving Diffusion (φ-PD), a model-agnostic reformulation of the diffusion process that preserves input phase while randomizing magnitude, enabling structure-aligned generation without architectural changes or additional parameters. We further propose Frequency-Selective Structured (FSS) noise, which provides continuous control over structural rigidity via a single frequency-cutoff parameter. φ-PD adds no inference-time cost and is compatible with any diffusion model for images or videos. Across photorealistic and stylized re-rendering, as well as sim-to-real enhancement for driving planners, φ-PD produces controllable, spatially aligned results. When applied to the CARLA simulator, φ-PD improves CARLA-to-Waymo planner performance by 50%. The method is complementary to existing conditioning approaches and broadly applicable to image-to-image and video-to-video generation. Videos, additional examples, and code are available on our project page (this https URL).
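The core idea, keeping the phase of the input while randomizing the magnitude of its Fourier coefficients, can be illustrated on a 1-D signal. This is a toy sketch under stated assumptions: it uses a naive DFT on a short list instead of 2-D image FFTs, borrows the random magnitudes from the spectrum of Gaussian noise, and does not reproduce the paper's noise schedule or FSS noise. All function names are illustrative.

```python
import cmath
import random


def dft(x):
    """Naive discrete Fourier transform, O(n^2), fine for short signals."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse of dft()."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def phase_preserving_noise(signal, rng):
    """Return a signal with the input's Fourier phases and random magnitudes.

    The magnitudes come from the spectrum of real Gaussian noise, so they are
    symmetric in frequency; combined with the input's antisymmetric phases
    this keeps the result real up to rounding error.
    """
    spectrum = dft(signal)
    noise_spectrum = dft([rng.gauss(0.0, 1.0) for _ in signal])
    mixed = [cmath.rect(abs(nk), cmath.phase(sk))
             for sk, nk in zip(spectrum, noise_spectrum)]
    return [value.real for value in idft(mixed)]
```

For example, `phase_preserving_noise([1.0, 2.0, 0.5, -1.0], random.Random(0))` yields a noisy signal whose Fourier phases match the input's. Ordinary Gaussian corruption, by contrast, would randomize those phases as well, which is the loss of spatial structure the abstract describes.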

[CV-8] EvoIR: Towards All-in-One Image Restoration via Evolutionary Frequency Modulation

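[Quick Read]: This paper addresses the diverse degradations found in All-in-One Image Restoration (AiOIR), aiming to improve generalization across heterogeneous degradation scenarios. Existing methods usually lack explicit frequency modeling and rely on fixed or heuristic optimization schedules, so they adapt poorly to complex, varied degradation types. The key to the solution is the EvoIR framework, whose core innovations are: (1) a Frequency-Modulated Module (FMM) that explicitly decomposes features into high- and low-frequency branches and adaptively modulates them, enhancing both structural fidelity and fine-grained detail; and (2) an Evolutionary Optimization Strategy (EOS) that iteratively adjusts frequency-aware objectives through a population-based evolutionary process, balancing structural accuracy and perceptual fidelity while mitigating gradient conflicts across degradation types and accelerating convergence. Together the two yield larger gains than either component alone, underscoring their complementary roles.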

Link: https://arxiv.org/abs/2512.05104
Authors: Jiaqi Ma, Shengkai Hu, Jun Wan, Jiaxing Huang, Lefei Zhang, Salman Khan
Affiliations: Mohamed bin Zayed University of Artificial Intelligence; Zhongnan University of Economics and Law; Wuhan University; Nanyang Technological University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:All-in-One Image Restoration (AiOIR) tasks often involve diverse degradations that require robust and versatile strategies. However, most existing approaches lack explicit frequency modeling and rely on fixed or heuristic optimization schedules, which limits generalization across heterogeneous degradations. To address these limitations, we propose EvoIR, an AiOIR-specific framework that introduces evolutionary frequency modulation for dynamic and adaptive image restoration. Specifically, EvoIR employs the Frequency-Modulated Module (FMM) that decomposes features into high- and low-frequency branches in an explicit manner and adaptively modulates them to enhance both structural fidelity and fine-grained details. Central to EvoIR, an Evolutionary Optimization Strategy (EOS) iteratively adjusts frequency-aware objectives through a population-based evolutionary process, dynamically balancing structural accuracy and perceptual fidelity. Its evolutionary guidance further mitigates gradient conflicts across degradations and accelerates convergence. By synergizing FMM and EOS, EvoIR yields greater improvements than using either component alone, underscoring their complementary roles. Extensive experiments on multiple benchmarks demonstrate that EvoIR outperforms state-of-the-art AiOIR methods.

[CV-9] TV2TV: A Unified Framework for Interleaved Language and Video Generation

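[Quick Read]: This paper addresses the weakness of current video generation models on outputs that require significant semantic branching or repeated high-level reasoning about what should happen next. The key to the solution is TV2TV, a unified generative modeling framework that decomposes video generation into an interleaved text and video generation process. TV2TV uses a Mixture-of-Transformers (MoT) architecture to jointly learn language modeling (next-token prediction) and video flow matching (next-frame prediction), and at inference time it decides when to alternate between generating text and video frames, so the model can "think in words" about subsequent content before "acting in pixels" to produce frames. This design offloads much of the decision about what happens next to the language modeling tower, which improves visual quality and prompt alignment and enables fine-grained controllability: users can modify the generation trajectory through text interventions at any point.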

Link: https://arxiv.org/abs/2512.05103
Authors: Xiaochuang Han, Youssef Emad, Melissa Hall, John Nguyen, Karthik Padthe, Liam Robbins, Amir Bar, Delong Chen, Michal Drozdzal, Maha Elbayad, Yushi Hu, Shang-Wen Li, Sreya Dutta Roy, Jakob Verbeek, XuDong Wang, Marjan Ghazvininejad, Luke Zettlemoyer, Emily Dinan
Affiliations: Meta
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Video generation models are rapidly advancing, but can still struggle with complex video outputs that require significant semantic branching or repeated high-level reasoning about what should happen next. In this paper, we introduce a new class of omni video-text models that integrate ideas from recent LM reasoning advances to address this challenge. More specifically, we present TV2TV, a unified generative modeling framework which decomposes video generation into an interleaved text and video generation process. TV2TV jointly learns language modeling (next-token prediction) and video flow matching (next-frame prediction) using a Mixture-of-Transformers (MoT) architecture. At inference time, TV2TV decides when to alternate between generating text and video frames, allowing the model to "think in words" about subsequent content before "acting in pixels" to produce frames. This design offloads much of the responsibility for deciding what should happen next to the language modeling tower, enabling improved visual quality and prompt alignment of generated videos. It also enables fine-grained controllability, allowing users to modify the video generation trajectory through text interventions at any point in the process. In controlled experiments on video game data, TV2TV demonstrates substantial improvements in both visual quality and controllability. TV2TV also scales to natural videos, as we show by augmenting sports videos with interleaved natural language action descriptions using vision-language models (VLMs). Training TV2TV on this corpus yields strong visual quality and prompt alignment, showcasing the model's ability to reason about and generate complex real-world action sequences. Together, these results highlight TV2TV as a promising step toward video generation with open-ended textual reasoning and control.

[CV-10] SA-IQA: Redefining Image Quality Assessment for Spatial Aesthetics with Multi-Dimensional Rewards

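[Quick Read]: This paper addresses the lack of systematic evaluation of interior scenes in Image Quality Assessment (IQA) for AI-generated images (AIGI). Existing methods mainly target portraits and artistic images and do not measure the four core dimensions of Spatial Aesthetics: layout, harmony, lighting, and distortion. The key to the solution is SA-BENCH, the first benchmark for spatial aesthetics (18,000 images and 50,000 precise annotations), and the SA-IQA model built on it: through multimodal large language model (MLLM) fine-tuning and a multidimensional fusion approach, it forms a comprehensive reward framework for assessing the spatial aesthetics of interior images. The framework is applied to two downstream tasks: serving as a reward signal with GRPO reinforcement learning to optimize the AIGC generation pipeline, and Best-of-N selection to filter high-quality images. SA-IQA significantly outperforms existing methods on SA-BENCH.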

Link: https://arxiv.org/abs/2512.05098
Authors: Yuan Gao, Jin Song
Affiliations: Alibaba Group
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:In recent years, Image Quality Assessment (IQA) for AI-generated images (AIGI) has advanced rapidly; however, existing methods primarily target portraits and artistic images, lacking a systematic evaluation of interior scenes. We introduce Spatial Aesthetics, a paradigm that assesses the aesthetic quality of interior images along four dimensions: layout, harmony, lighting, and distortion. We construct SA-BENCH, the first benchmark for spatial aesthetics, comprising 18,000 images and 50,000 precise annotations. Employing SA-BENCH, we systematically evaluate current IQA methodologies and develop SA-IQA, through MLLM fine-tuning and a multidimensional fusion approach, as a comprehensive reward framework for assessing spatial aesthetics. We apply SA-IQA to two downstream tasks: (1) serving as a reward signal integrated with GRPO reinforcement learning to optimize the AIGC generation pipeline, and (2) Best-of-N selection to filter high-quality images and improve generation quality. Experiments indicate that SA-IQA significantly outperforms existing methods on SA-BENCH, setting a new standard for spatial aesthetics evaluation. Code and dataset will be open-sourced to advance research and applications in this domain.
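
The Best-of-N use of a multi-dimensional reward can be sketched in a few lines. The abstract does not give the fusion rule, so the weighted average below is an assumption, as are the names `Candidate`, `fused_reward`, and `best_of_n`, the equal weights, and the convention that every score lies in [0, 1] with higher meaning better (including for distortion).

```python
from dataclasses import dataclass

# The four dimensions named in the abstract.
DIMENSIONS = ("layout", "harmony", "lighting", "distortion")


@dataclass
class Candidate:
    name: str
    scores: dict  # per-dimension scores, assumed in [0, 1], higher is better


def fused_reward(scores, weights):
    """Weighted average of per-dimension scores (an assumed fusion rule)."""
    total = sum(weights[d] for d in DIMENSIONS)
    return sum(weights[d] * scores[d] for d in DIMENSIONS) / total


def best_of_n(candidates, weights):
    """Return the candidate with the highest fused reward."""
    return max(candidates, key=lambda c: fused_reward(c.scores, weights))


weights = {d: 1.0 for d in DIMENSIONS}  # equal weights, purely illustrative
candidates = [
    Candidate("a", {"layout": 0.9, "harmony": 0.4, "lighting": 0.5, "distortion": 0.6}),
    Candidate("b", {"layout": 0.7, "harmony": 0.8, "lighting": 0.7, "distortion": 0.8}),
    Candidate("c", {"layout": 0.5, "harmony": 0.5, "lighting": 0.9, "distortion": 0.3}),
]
best = best_of_n(candidates, weights)  # candidate "b" has the highest average
```

In practice the per-dimension scores would come from the fine-tuned MLLM rather than being hand-written as they are here.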

[CV-11] From Generated Human Videos to Physically Plausible Robot Trajectories

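[Quick Read]: This paper asks how a humanoid robot can execute, in a zero-shot manner, human actions synthesized by video generation models. Generated videos are often noisy and morphologically distorted, which makes direct imitation difficult, so the core challenge is to extract transferable motion from imperfect, non-real video and turn it into a controllable robot policy. The key to the solution is a two-stage pipeline: the first stage lifts video pixels into a 4D human representation and retargets it to the humanoid morphology; the second stage is GenMimic, a physics-aware reinforcement learning policy conditioned on 3D keypoints and trained with symmetry regularization and keypoint-weighted tracking rewards, which achieves stable, physically plausible imitation of human actions from noisy generated videos without fine-tuning.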

Link: https://arxiv.org/abs/2512.05094
Authors: James Ni, Zekai Wang, Wei Lin, Amir Bar, Yann LeCun, Trevor Darrell, Jitendra Malik, Roei Herzig
Affiliations: University of California, Berkeley; New York University; Johannes Kepler University
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: For project website, see this https URL

Abstract:Video generation models are rapidly improving in their ability to synthesize human actions in novel contexts, holding the potential to serve as high-level planners for contextual robot control. To realize this potential, a key research question remains open: how can a humanoid execute the human actions from generated videos in a zero-shot manner? This challenge arises because generated videos are often noisy and exhibit morphological distortions that make direct imitation difficult compared to real video. To address this, we introduce a two-stage pipeline. First, we lift video pixels into a 4D human representation and then retarget to the humanoid morphology. Second, we propose GenMimic, a physics-aware reinforcement learning policy conditioned on 3D keypoints and trained with symmetry regularization and keypoint-weighted tracking rewards. As a result, GenMimic can mimic human actions from noisy, generated videos. We curate GenMimicBench, a synthetic human-motion dataset generated using two video generation models across a spectrum of actions and contexts, establishing a benchmark for assessing zero-shot generalization and policy robustness. Extensive experiments demonstrate improvements over strong baselines in simulation and confirm coherent, physically stable motion tracking on a Unitree G1 humanoid robot without fine-tuning. This work offers a promising path to realizing the potential of video generation models as high-level policies for robot control.

[CV-12] Visual Reasoning Tracer: Object-Level Grounded Reasoning Benchmark

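[Quick Read]: This paper addresses the lack of interpretability in the visual reasoning of Multimodal Large Language Models (MLLMs): models typically output only final predictions without the intermediate steps or fine-grained evidence (e.g., pixels, locations) behind them. To address this, the authors propose the Visual Reasoning Tracer (VRT) task, which requires models not only to localize the target object but also to explicitly predict the intermediate objects that form the reasoning path. The solution rests on three components: (1) VRT-Bench, a human-annotated benchmark for evaluating visual reasoning; (2) a new metric for assessing the quality of reasoning traces; and (3) VRT-80k, a large-scale dataset for training reasoning models. Experiments show that models trained on VRT-80k achieve substantial improvements in tracing the reasoning path, thereby improving interpretability.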

Link: https://arxiv.org/abs/2512.05091
Authors: Haobo Yuan, Yueyi Sun, Yanwei Li, Tao Zhang, Xueqing Deng, Henghui Ding, Lu Qi, Anran Wang, Xiangtai Li, Ming-Hsuan Yang
Affiliations: UC Merced (University of California, Merced); PKU (Peking University); NTU (Nanyang Technological University); CUHK (The Chinese University of Hong Kong); WHU (Wuhan University); FDU (Fudan University)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report; Project Page: this https URL

Abstract:Recent advances in Multimodal Large Language Models (MLLMs) have significantly improved performance on tasks such as visual grounding and visual question answering. However, the reasoning processes of these models remain largely opaque; they typically output only final predictions without revealing the intermediate steps or fine-grained evidence (e.g., pixels, locations) that lead to the result. This contrasts with human intelligence, which naturally operates through a chain of visual reasoning. To address this limitation, we introduce the Visual Reasoning Tracer (VRT) task, which requires models to not only localize the target object but also explicitly predict the intermediate objects that form the reasoning path. To advance research in this area, we contribute: (1) VRT-Bench, a human-annotated benchmark for evaluating visual reasoning; (2) a new metric for assessing the quality of reasoning traces; and (3) VRT-80k, a large-scale dataset for reasoning model training. Our experiments reveal that while existing models often produce the correct final output, they struggle to ground their intermediate reasoning. In contrast, models trained on VRT-80k achieve substantial improvements in tracing the reasoning path.

[CV-13] Deep Forcing: Training-Free Long Video Generation with Deep Sink and Participative Compression

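[Quick Read]: This paper addresses temporal repetition, drift, and motion deceleration in autoregressive video diffusion models during real-time frame streaming. Naively applying StreamingLLM-style attention sinks leads to fidelity degradation and motion stagnation. The key to the solution is Deep Forcing, a training-free KV cache management strategy with two mechanisms: (1) Deep Sink dedicates half of the sliding window to persistent sink tokens and re-aligns their temporal RoPE phase to the current timeline, stabilizing global context during long rollouts; (2) Participative Compression performs importance-aware KV cache pruning that keeps only tokens actively participating in recent attention and safely discards redundant and degraded history, suppressing error accumulation. Together they enable over 12x extrapolation (e.g., 5s-trained to 60s+ generation), with better imaging quality than LongLive, better aesthetic quality than RollingForcing, and substantial gains in dynamic degree, while maintaining real-time generation.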

Link: https://arxiv.org/abs/2512.05081
Authors: Jung Yi, Wooseok Jang, Paul Hyunbin Cho, Jisu Nam, Heeji Yoon, Seungryong Kim
Affiliations: KAIST AI
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:Recent advances in autoregressive video diffusion have enabled real-time frame streaming, yet existing solutions still suffer from temporal repetition, drift, and motion deceleration. We find that naively applying StreamingLLM-style attention sinks to video diffusion leads to fidelity degradation and motion stagnation. To overcome this, we introduce Deep Forcing, which consists of two training-free mechanisms that address this without any fine-tuning. Specifically, 1) Deep Sink dedicates half of the sliding window to persistent sink tokens and re-aligns their temporal RoPE phase to the current timeline, stabilizing global context during long rollouts. 2) Participative Compression performs importance-aware KV cache pruning that preserves only tokens actively participating in recent attention while safely discarding redundant and degraded history, minimizing error accumulation under out-of-distribution length generation. Together, these components enable over 12x extrapolation (e.g. 5s-trained to 60s+ generation) with better imaging quality than LongLive, better aesthetic quality than RollingForcing, almost maintaining overall consistency, and substantial gains in dynamic degree, all while maintaining real-time generation. Our results demonstrate that training-free KV-cache management can match or exceed training-based approaches for autoregressively streaming long-video generation.
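
The cache policy described above (always keep sink tokens, keep only the history that recent queries still attend to) can be sketched as follows. This is a toy illustration, not the paper's algorithm: the function `prune_kv_cache`, the per-token importance dictionary, and the fixed `budget` are assumptions, token ids are assumed unique, and the temporal RoPE re-alignment of the sink tokens is not modeled.

```python
def prune_kv_cache(tokens, recent_attention, num_sink, budget):
    """Keep sink tokens plus the most-attended remaining tokens.

    tokens: cached token ids in temporal order (assumed unique).
    recent_attention: dict of token id -> attention mass from recent queries.
    num_sink: number of earliest tokens that are always kept.
    budget: total number of tokens to keep (must be >= num_sink).
    """
    if budget < num_sink:
        raise ValueError("budget must be at least num_sink")
    sinks = tokens[:num_sink]
    rest = tokens[num_sink:]
    keep_count = budget - num_sink
    # Rank non-sink tokens by how much recent attention they received.
    ranked = sorted(rest, key=lambda t: recent_attention.get(t, 0.0), reverse=True)
    kept = set(ranked[:keep_count])
    # Preserve temporal order among the survivors.
    return sinks + [t for t in rest if t in kept]


tokens = list(range(8))
recent_attention = {2: 0.05, 3: 0.30, 4: 0.01, 5: 0.20, 6: 0.02, 7: 0.40}
pruned = prune_kv_cache(tokens, recent_attention, num_sink=2, budget=5)
# pruned == [0, 1, 3, 5, 7]
```

Tokens 0 and 1 survive as sinks regardless of their scores, while tokens 2, 4, and 6 are dropped because they received the least recent attention.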

[CV-14] Object Reconstruction under Occlusion with Generative Priors and Contact-induced Constraints

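[Quick Read]: This paper addresses object geometry reconstruction for robot manipulation, which is challenging because cameras capture only partial observations of objects (especially under occlusion), leaving the geometry highly ambiguous. The key to the solution is to fuse two extra sources of information: generative models, which learn shape priors of commonly seen objects and allow reasonable guesses about unseen geometry; and contact information, obtained from videos or physical interactions, which provides sparse but useful constraints on the object boundary. The two are combined through contact-guided 3D generation, with a guidance formulation inspired by drag-based editing in generative models, improving reconstruction over pure 3D generation and contact-based optimization.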

Link: https://arxiv.org/abs/2512.05079
Authors: Minghan Zhu, Zhiyi Wang, Qihang Sun, Maani Ghaffari, Michael Posa
Affiliations: University of Pennsylvania; University of Michigan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Project page: this https URL

Abstract:Object geometry is key information for robot manipulation. Yet, object reconstruction is a challenging task because cameras only capture partial observations of objects, especially when occlusion occurs. In this paper, we leverage two extra sources of information to reduce the ambiguity of vision signals. First, generative models learn priors of the shapes of commonly seen objects, allowing us to make reasonable guesses of the unseen part of geometry. Second, contact information, which can be obtained from videos and physical interactions, provides sparse constraints on the boundary of the geometry. We combine the two sources of information through contact-guided 3D generation. The guidance formulation is inspired by drag-based editing in generative models. Experiments on synthetic and real-world data show that our approach improves the reconstruction compared to pure 3D generation and contact-based optimization.

[CV-15] BulletTime: Decoupled Control of Time and Camera Pose for Video Generation

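[Quick Read]: This paper addresses the coupling of scene dynamics and camera motion in current video diffusion models, which limits precise spatial and temporal control. The key to the solution is a 4D-controllable video diffusion framework that explicitly decouples scene dynamics from camera pose, enabling fine-grained manipulation of both. Concretely, continuous world-time sequences and camera trajectories are taken as conditioning inputs and injected through a 4D positional encoding in the attention layers and adaptive normalizations for feature modulation, improving controllability while preserving high generation quality.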

Link: https://arxiv.org/abs/2512.05076
Authors: Yiming Wang, Qihang Zhang, Shengqu Cai, Tong Wu, Jan Ackermann, Zhengfei Kuang, Yang Zheng, Frano Rajič, Siyu Tang, Gordon Wetzstein
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:Emerging video diffusion models achieve high visual fidelity but fundamentally couple scene dynamics with camera motion, limiting their ability to provide precise spatial and temporal control. We introduce a 4D-controllable video diffusion framework that explicitly decouples scene dynamics from camera pose, enabling fine-grained manipulation of both scene dynamics and camera viewpoint. Our framework takes continuous world-time sequences and camera trajectories as conditioning inputs, injecting them into the video diffusion model through a 4D positional encoding in the attention layer and adaptive normalizations for feature modulation. To train this model, we curate a unique dataset in which temporal and camera variations are independently parameterized; this dataset will be made public. Experiments show that our model achieves robust real-world 4D control across diverse timing patterns and camera trajectories, while preserving high generation quality and outperforming prior work in controllability. See our website for video results: this https URL

[CV-16] 4DLangVGGT: 4D Language-Visual Geometry Grounded Transformer

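[Quick Read]: This paper addresses the limitations of existing 4D semantic field construction methods that rely on scene-specific Gaussian splatting: per-scene optimization, limited generalization, and difficulty scaling to deployment. The core solution is 4DLangVGGT, the first Transformer-based feed-forward unified framework for 4D language grounding, which jointly models geometric perception and language alignment through two key components: the 4D Visual Geometry Transformer (StreamVGGT), which captures spatio-temporal geometric representations of dynamic scenes; and the Semantic Bridging Decoder (SBD), which projects geometry-aware features into a language-aligned semantic space, enhancing semantic interpretability while preserving structural fidelity. The design supports joint training across multiple dynamic scenes and direct application at inference, improving deployment efficiency and generalization and establishing a new paradigm for open-vocabulary 4D scene understanding.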

Link: https://arxiv.org/abs/2512.05060
Authors: Xianfeng Wu, Yajing Bai, Minghan Li, Xianzu Wu, Xueqi Zhao, Zhongyuan Lai, Wenyu Liu, Xinggang Wang
Affiliations: Jianghan University; Harvard University; Huazhong University of Science and Technology; The Hong Kong Polytechnic University; Hong Kong Baptist University; Hubei University of Education
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code: this https URL , Webpage: this https URL

Abstract:Constructing 4D language fields is crucial for embodied AI, augmented/virtual reality, and 4D scene understanding, as they provide enriched semantic representations of dynamic environments and enable open-vocabulary querying in complex scenarios. However, existing approaches to 4D semantic field construction primarily rely on scene-specific Gaussian splatting, which requires per-scene optimization, exhibits limited generalization, and is difficult to scale to real-world applications. To address these limitations, we propose 4DLangVGGT, the first Transformer-based feed-forward unified framework for 4D language grounding, which jointly integrates geometric perception and language alignment within a single architecture. 4DLangVGGT has two key components: the 4D Visual Geometry Transformer, StreamVGGT, which captures spatio-temporal geometric representations of dynamic scenes; and the Semantic Bridging Decoder (SBD), which projects geometry-aware features into a language-aligned semantic space, thereby enhancing semantic interpretability while preserving structural fidelity. Unlike prior methods that depend on costly per-scene optimization, 4DLangVGGT can be jointly trained across multiple dynamic scenes and directly applied during inference, achieving both deployment efficiency and strong generalization. This design significantly improves the practicality of large-scale deployment and establishes a new paradigm for open-vocabulary 4D scene understanding. Experiments on HyperNeRF and Neu3D datasets demonstrate that our approach not only generalizes effectively but also achieves state-of-the-art performance, with gains of up to 2% under per-scene training and 1% under multi-scene training. Our code is released at this https URL

[CV-17] Joint 3D Geometry Reconstruction and Motion Generation for 4D Synthesis from a Single Image

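[Quick Read]: This paper addresses the core challenge of generating interactive, dynamic 4D scenes from a single static image. Existing generate-then-reconstruct and reconstruct-then-generate methods usually decouple geometry from motion, causing spatiotemporal inconsistencies and poor generalization. The key to the solution is the MoRe4D framework, which extends the reconstruct-then-generate paradigm to jointly perform Motion generation and geometric Reconstruction for 4D Synthesis, using a diffusion-based 4D Scene Trajectory Generator (4D-STraG) to jointly generate geometrically consistent and motion-plausible 4D point trajectories. It also designs a depth-guided motion normalization strategy and a motion-aware module to exploit single-view priors, and introduces a 4D View Synthesis Module (4D-ViSM) that renders videos along arbitrary camera trajectories from 4D point track representations, yielding high-quality 4D scenes with multi-view consistency and rich dynamic detail.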

Link: https://arxiv.org/abs/2512.05044
Authors: Yanran Zhang, Ziyi Wang, Wenzhao Zheng, Zheng Zhu, Jie Zhou, Jiwen Lu
Affiliations: Tsinghua University; GigaAI
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 Pages

Abstract:Generating interactive and dynamic 4D scenes from a single static image remains a core challenge. Most existing generate-then-reconstruct and reconstruct-then-generate methods decouple geometry from motion, causing spatiotemporal inconsistencies and poor generalization. To address these, we extend the reconstruct-then-generate framework to jointly perform Motion generation and geometric Reconstruction for 4D Synthesis (MoRe4D). We first introduce TrajScene-60K, a large-scale dataset of 60,000 video samples with dense point trajectories, addressing the scarcity of high-quality 4D scene data. Based on this, we propose a diffusion-based 4D Scene Trajectory Generator (4D-STraG) to jointly generate geometrically consistent and motion-plausible 4D point trajectories. To leverage single-view priors, we design a depth-guided motion normalization strategy and a motion-aware module for effective geometry and dynamics integration. We then propose a 4D View Synthesis Module (4D-ViSM) to render videos with arbitrary camera trajectories from 4D point track representations. Experiments show that MoRe4D generates high-quality 4D scenes with multi-view consistency and rich dynamic details from a single image. Code: this https URL.

[CV-18] Semantic-Guided Two-Stage GAN for Face Inpainting with Hybrid Perceptual Encoding CVPR-2025

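[Quick Read]: This paper addresses blurry edges, semantic inconsistencies, and distorted facial structures in facial image inpainting under large irregular masks, which stem from direct pixel-level synthesis and limited exploitation of facial priors. The key to the solution is a semantic-guided hierarchical synthesis architecture: the first stage combines Convolutional Neural Networks (CNNs) for local features with Vision Transformers for global features to produce clear semantic layouts; the second stage uses a Multi-Modal Texture Generator to refine textures across multiple scales, ensuring structural consistency and visual realism. The architecture handles arbitrary mask configurations through dynamic attention without mask-specific training, and is reported to outperform existing methods on CelebA-HQ and FFHQ on metrics such as LPIPS, PSNR, and SSIM.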

Link: https://arxiv.org/abs/2512.05039
Authors: Abhigyan Bhattacharya, Hiranmoy Roy
Affiliations: RCC Institute of Information Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted for review CVPR-2025

Abstract:Facial image inpainting aims to restore missing or corrupted regions in face images while preserving identity, structural consistency, and photorealistic image quality, a task specifically aimed at photo restoration. Despite many recent advances in deep generative models, existing methods struggle with large irregular masks, often producing blurry textures at the edges of the masked region, semantic inconsistencies, or unconvincing facial structures, because they synthesize pixels directly and make limited use of facial priors. In this paper we propose a novel architecture that addresses these challenges through semantic-guided hierarchical synthesis. Our approach first organizes and synthesizes information based on meaning and then refines the texture, which gives a clear picture of the facial structure before detailed images are created. In the first stage, we blend two techniques, CNNs for local features and Vision Transformers for global features, to create clear and detailed semantic layouts. In the second stage, we use a Multi-Modal Texture Generator to refine these layouts by pulling in information from different scales, ensuring everything looks cohesive and consistent. The architecture naturally handles arbitrary mask configurations through dynamic attention without mask-specific training. Experiments on two datasets, CelebA-HQ and FFHQ, show that our model outperforms other state-of-the-art methods, with improvements in metrics such as LPIPS, PSNR, and SSIM. It produces visually striking results with better semantic preservation in challenging large-area inpainting situations.

[CV-19] RAMEN: Resolution-Adjustable Multimodal Encoder for Earth Observation

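[Quick Read]: This paper addresses the limited generalization of representation learning for Earth Observation (EO) data across modalities and resolutions: existing foundation models usually expect fixed input resolutions or use sensor-specific encoders, which makes unified representation across heterogeneous EO modalities difficult. The key to the solution is RAMEN, a resolution-adjustable multimodal encoder that explicitly treats modality, spatial resolution, and temporal resolution as input features and defines spatial resolution as a controllable output parameter. This enables coherent analysis across sensors and resolutions within a unified latent space and lets users trade off spatial precision against computational cost at inference.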

Link: https://arxiv.org/abs/2512.05025
Authors: Nicolas Houdré, Diego Marcos, Hugo Riffaud de Turckheim, Dino Ienco, Laurent Wendling, Camille Kurtz, Sylvain Lobry
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Earth observation (EO) data spans a wide range of spatial, spectral, and temporal resolutions, from high-resolution optical imagery to low resolution multispectral products or radar time series. While recent foundation models have improved multimodal integration for learning meaningful representations, they often expect fixed input resolutions or are based on sensor-specific encoders limiting generalization across heterogeneous EO modalities. To overcome these limitations we introduce RAMEN, a resolution-adjustable multimodal encoder that learns a shared visual representation across EO data in a fully sensor-agnostic manner. RAMEN treats the modality and spatial and temporal resolutions as key input data features, enabling coherent analysis across modalities within a unified latent space. Its main methodological contribution is to define spatial resolution as a controllable output parameter, giving users direct control over the desired level of detail at inference and allowing explicit trade-offs between spatial precision and computational cost. We train a single, unified transformer encoder reconstructing masked multimodal EO data drawn from diverse sources, ensuring generalization across sensors and resolutions. Once pretrained, RAMEN transfers effectively to both known and unseen sensor configurations and outperforms larger state-of-the-art models on the community-standard PANGAEA benchmark, containing various multi-sensor and multi-resolution downstream tasks. Our code and pretrained model are available at this https URL.

[CV-20] HTR-ConvText: Leveraging Convolution and Textual Information for Handwritten Text Recognition

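[Quick Read]: This paper addresses the challenges of Handwritten Text Recognition (HTR) caused by limited training data, high variance in writing styles, and scripts with complex diacritics. Existing methods partially mitigate these issues but generalize poorly without massive synthetic data. The key to the solution is the HTR-ConvText model: first, feature extraction combines a residual Convolutional Neural Network (CNN) backbone with a MobileViT with Positional Encoding block, capturing both stroke-level local detail and global context; second, a hierarchical ConvText encoder reduces sequence length for efficiency while combining local and global features; finally, an auxiliary module injects textual context to mitigate the weakness of Connectionist Temporal Classification (CTC). Experiments on IAM, READ2016, LAM, and HANDS-VNOnDB show improved performance and better generalization than existing methods, especially with limited training samples and high handwriting diversity.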

Link: https://arxiv.org/abs/2512.05021
Authors: Pham Thach Thanh Truc, Dang Hoai Nam, Huynh Tong Dang Khoa, Vo Nguyen Le Duy
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Handwritten Text Recognition remains challenging due to limited data, high writing style variance, and scripts with complex diacritics. Existing approaches, though they partially address these issues, often struggle to generalize without massive synthetic data. To address these challenges, we propose HTR-ConvText, a model designed to capture fine-grained, stroke-level local features while preserving global contextual dependencies. In the feature extraction stage, we integrate a residual Convolutional Neural Network backbone with a MobileViT with Positional Encoding block. This enables the model to both capture structural patterns and learn subtle writing details. We then introduce the ConvText encoder, a hybrid architecture combining global context and local features within a hierarchical structure that reduces sequence length for improved efficiency. Additionally, an auxiliary module injects textual context to mitigate the weakness of Connectionist Temporal Classification. Evaluations on IAM, READ2016, LAM and HANDS-VNOnDB demonstrate that our approach achieves improved performance and better generalization compared to existing methods, especially in scenarios with limited training samples and high handwriting diversity.
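
For readers unfamiliar with Connectionist Temporal Classification, a best-path (greedy) CTC decoder is sketched below: take the top class per frame, merge consecutive repeats, then drop blanks. A commonly cited limitation of CTC is that per-frame predictions are conditionally independent given the input, so no textual context enters the decoding; whether this is exactly the weakness the authors target is an assumption. The function name and the toy scores are illustrative only.

```python
def ctc_greedy_decode(frame_scores, alphabet, blank=0):
    """Best-path CTC decoding: argmax per frame, merge repeats, drop blanks.

    frame_scores: list of per-frame score lists; index `blank` is the blank.
    alphabet: dict mapping each non-blank class index to its character.
    """
    # Each frame is decided on its own; no language context is used.
    best_path = [max(range(len(f)), key=f.__getitem__) for f in frame_scores]
    decoded = []
    previous = None
    for label in best_path:
        if label != previous and label != blank:
            decoded.append(alphabet[label])
        previous = label
    return "".join(decoded)


alphabet = {1: "a", 2: "b"}
frame_scores = [
    [0.10, 0.80, 0.10],  # a
    [0.20, 0.70, 0.10],  # a (repeat, merged)
    [0.90, 0.05, 0.05],  # blank (separates the two a's)
    [0.10, 0.60, 0.30],  # a
    [0.10, 0.20, 0.70],  # b
    [0.30, 0.10, 0.60],  # b (repeat, merged)
    [0.80, 0.10, 0.10],  # blank
]
text = ctc_greedy_decode(frame_scores, alphabet)  # "aab"
```

The blank between the first two runs of "a" is what lets CTC emit a doubled letter instead of merging it away.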

[CV-21] Generative Neural Video Compression via Video Diffusion Prior

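[Quick Read]: This paper addresses the perceptual flickering produced by existing perceptual neural video compression methods, whose frame-wise reconstruction lacks temporal modeling. The core solution is GNVC-VD, the first DiT (Diffusion Transformer)-based generative neural video compression framework, which unifies spatio-temporal latent compression and sequence-level generative refinement within a single codec. Its key innovation is a unified flow-matching latent refinement module that uses a video diffusion transformer for sequence-level denoising of spatio-temporal latents. Rather than starting from pure Gaussian noise, refinement starts from the decoded latents and learns a correction term adapted to compression-induced degradation, while a conditioning adaptor injects compression-aware cues into intermediate DiT layers, removing artifacts and maintaining temporal coherence at extremely low bitrates (even below 0.01 bpp).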

Link: https://arxiv.org/abs/2512.05016
Authors: Qi Mao, Hao Cheng, Tinghan Yang, Libiao Jin, Siwei Ma
Affiliations: Communication University of China; Peking University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We present GNVC-VD, the first DiT-based generative neural video compression framework built upon an advanced video generation foundation model, where spatio-temporal latent compression and sequence-level generative refinement are unified within a single codec. Existing perceptual codecs primarily rely on pre-trained image generative priors to restore high-frequency details, but their frame-wise nature lacks temporal modeling and inevitably leads to perceptual flickering. To address this, GNVC-VD introduces a unified flow-matching latent refinement module that leverages a video diffusion transformer to jointly enhance intra- and inter-frame latents through sequence-level denoising, ensuring consistent spatio-temporal details. Instead of denoising from pure Gaussian noise as in video generation, GNVC-VD initializes refinement from decoded spatio-temporal latents and learns a correction term that adapts the diffusion prior to compression-induced degradation. A conditioning adaptor further injects compression-aware cues into intermediate DiT layers, enabling effective artifact removal while maintaining temporal coherence under extreme bitrate constraints. Extensive experiments show that GNVC-VD surpasses both traditional and learned codecs in perceptual quality and significantly reduces the flickering artifacts that persist in prior generative approaches, even below 0.01 bpp, highlighting the promise of integrating video-native generative priors into neural codecs for next-generation perceptual video compression.

[CV-22] Self-Supervised Learning for Transparent Object Depth Completion Using Depth from Non-Transparent Objects

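[Quick Read]: This paper addresses depth perception for transparent objects: conventional depth sensors struggle to measure their depth accurately because of light refraction and reflection. Existing methods typically train a neural network for depth completion on large amounts of annotated depth maps, which are costly to label. The key to the solution is a new self-supervised learning method: it simulates the depth deficits caused by transparent objects within non-transparent regions and uses the original depth map as the supervision signal, so the depth completion network can be trained without manual annotation. Experiments show performance comparable to supervised methods, and pre-training with the method improves model performance when training samples are scarce.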

Link: https://arxiv.org/abs/2512.05006
Authors: Xianghui Fan, Zhaoyu Chen, Mengyang Pan, Anping Deng, Hang Yang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: conference

Abstract:The perception of transparent objects is one of the well-known challenges in computer vision. Conventional depth sensors have difficulty sensing the depth of transparent objects due to refraction and reflection of light. Previous research has typically trained a neural network to complete the depth acquired by the sensor, and this approach can quickly acquire accurate depth maps of transparent objects. However, previous training relies on a large amount of annotated data for supervision, and the labeling of depth maps is costly. To tackle this challenge, we propose a new self-supervised method for training depth completion networks. Our method simulates the depth deficits of transparent objects within non-transparent regions and utilizes the original depth map as ground truth for supervision. Experiments demonstrate that our method achieves performance comparable to supervised approaches, and pre-training with our method can improve model performance when training samples are scarce.
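
The self-supervision recipe can be sketched as a data-preparation step: remove valid depth inside a synthetic hole and supervise only where the original measurement is known. The rectangular hole, the convention that 0.0 marks missing depth, and the function name `simulate_depth_deficit` are assumptions for illustration; the paper presumably uses hole shapes that better mimic transparent objects.

```python
import random


def simulate_depth_deficit(depth, rng, hole_h, hole_w):
    """Zero out a random rectangle of valid depth and return training data.

    depth: 2D list of floats; 0.0 marks pixels the sensor failed to measure.
    Returns (corrupted, target, loss_mask), where loss_mask is 1 only at
    pixels that were valid originally and were removed by the simulated hole.
    """
    rows, cols = len(depth), len(depth[0])
    top = rng.randrange(rows - hole_h + 1)
    left = rng.randrange(cols - hole_w + 1)
    corrupted = [row[:] for row in depth]
    loss_mask = [[0] * cols for _ in range(rows)]
    for r in range(top, top + hole_h):
        for c in range(left, left + hole_w):
            if depth[r][c] > 0.0:
                corrupted[r][c] = 0.0  # simulated sensor failure
                loss_mask[r][c] = 1    # original depth is the ground truth here
    return corrupted, depth, loss_mask


depth = [[1.5] * 5 for _ in range(4)]
depth[0][0] = 0.0  # a pixel that was already missing in the raw capture
corrupted, target, loss_mask = simulate_depth_deficit(
    depth, random.Random(0), hole_h=2, hole_w=3
)
```

A completion network would take `corrupted` as input and be penalized against `target` only where `loss_mask` is 1, so pixels that were missing in the raw capture never contribute to the loss.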

[CV-23] Reflection Removal through Efficient Adaptation of Diffusion Transformers

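[Quick Read]: This paper addresses single-image reflection removal, that is, recovering the clean transmission layer from an image contaminated by reflections. The key to the solution is to repurpose a pre-trained diffusion-transformer (DiT) foundation model by conditioning it on the reflection-contaminated input and guiding it toward the reflection-free transmission layer. To mitigate the scarcity of suitable data, a physically based rendering (PBR) pipeline built around the Principled BSDF in Blender synthesizes realistic glass materials and reflection effects. Efficient LoRA (Low-Rank Adaptation) fine-tuning on the synthetic data then adapts the foundation model, achieving state-of-the-art performance on in-domain and zero-shot benchmarks. The approach suggests that combining physically grounded data synthesis with efficient adaptation offers a high-fidelity, scalable solution for reflection removal.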

Link: https://arxiv.org/abs/2512.05000
Authors: Daniyar Zakarin, Thiemo Wandel, Anton Obukhov, Dengxin Dai
Affiliations: ETH Zürich; HUAWEI Bayer Lab
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:We introduce a diffusion-transformer (DiT) framework for single-image reflection removal that leverages the generalization strengths of foundation diffusion models in the restoration setting. Rather than relying on task-specific architectures, we repurpose a pre-trained DiT-based foundation model by conditioning it on reflection-contaminated inputs and guiding it toward clean transmission layers. We systematically analyze existing reflection removal data sources for diversity, scalability, and photorealism. To address the shortage of suitable data, we construct a physically based rendering (PBR) pipeline in Blender, built around the Principled BSDF, to synthesize realistic glass materials and reflection effects. Efficient LoRA-based adaptation of the foundation model, combined with the proposed synthetic data, achieves state-of-the-art performance on in-domain and zero-shot benchmarks. These results demonstrate that pretrained diffusion transformers, when paired with physically grounded data synthesis and efficient adaptation, offer a scalable and high-fidelity solution for reflection removal. Project page: this https URL

[CV-24] A dynamic memory assignment strategy for dilation-based ICP algorithm on embedded GPUs

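[Quick Read]: This paper addresses the fact that VANICP, a high-performance point cloud registration algorithm, is hard to deploy on embedded GPUs because of its excessive memory consumption. The key to the solution is a GPU-oriented dynamic memory assignment strategy that optimizes how the dilation operation uses memory. It cuts memory usage by more than 97% while preserving the original performance, so VANICP can run efficiently on embedded systems with constrained hardware resources.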

Link: https://arxiv.org/abs/2512.04996
Authors: Qiong Chang, Weimin Wang, Junpei Zhong, Jun Miyazaki
Affiliations: Institute of Science Tokyo; Dalian University of Technology; University of Wollongong (College HK); School of Computing
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:This paper proposes a memory-efficient optimization strategy for the high-performance point cloud registration algorithm VANICP, enabling lightweight execution on embedded GPUs with constrained hardware resources. VANICP is a recently published acceleration framework that significantly improves the computational efficiency of point-cloud-based applications. By transforming the global nearest neighbor search into a localized process through a dilation-based information propagation mechanism, VANICP greatly reduces the computational complexity of the NNS. However, its original implementation demands a considerable amount of memory, which restricts its deployment in resource-constrained environments such as embedded systems. To address this issue, we propose a GPU-oriented dynamic memory assignment strategy that optimizes the memory usage of the dilation operation. Furthermore, based on this strategy, we construct an enhanced version of the VANICP framework that achieves over 97% reduction in memory consumption while preserving the original performance. Source code is published on: this https URL.

[CV-25] Aligned but Stereotypical? The Hidden Influence of System Prompts on Social Bias in LVLM-Based Text-to-Image Models

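[Quick Read]: This paper addresses the amplification of social bias in text-to-image (T2I) generation systems built on large vision-language models (LVLMs): these models are more prone than non-LVLM baselines to generating images biased along demographic attributes such as gender and race. The study finds that system prompts, the predefined instructions that guide LVLM behavior, are a primary driver of bias propagation. Decoded intermediate representations, token-probability diagnostics, and embedding-association analyses reveal how system prompts encode demographic priors and pass them on to image synthesis. The key to the solution is FairPro, a training-free meta-prompting framework that lets the LVLM audit itself and dynamically construct fairness-aware system prompts at inference time, substantially reducing demographic bias without sacrificing text-image alignment.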

Link: https://arxiv.org/abs/2512.04981
Authors: NaHyeon Park, Namin An, Kunhee Kim, Soyeon Yoon, Jiahao Huo, Hyunjung Shim
Affiliations: KAIST; HKUST
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project page: this https URL

Abstract:Large vision-language model (LVLM) based text-to-image (T2I) systems have become the dominant paradigm in image generation, yet whether they amplify social biases remains insufficiently understood. In this paper, we show that LVLM-based models produce markedly more socially biased images than non-LVLM-based models. We introduce a 1,024 prompt benchmark spanning four levels of linguistic complexity and evaluate demographic bias across multiple attributes in a systematic manner. Our analysis identifies system prompts, the predefined instructions guiding LVLMs, as a primary driver of biased behavior. Through decoded intermediate representations, token-probability diagnostics, and embedding-association analyses, we reveal how system prompts encode demographic priors that propagate into image synthesis. To this end, we propose FairPro, a training-free meta-prompting framework that enables LVLMs to self-audit and construct fairness-aware system prompts at test time. Experiments on two LVLM-based T2I models, SANA and Qwen-Image, show that FairPro substantially reduces demographic bias while preserving text-image alignment. We believe our findings provide deeper insight into the central role of system prompts in bias propagation and offer a practical, deployable approach for building more socially responsible T2I systems.

[CV-26] Stable Single-Pixel Contrastive Learning for Semantic and Geometric Tasks

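[Quick Read]: This paper addresses how pixel-level representation learning can capture semantic information and geometric structure at the same time, so as to obtain precise point correspondences across images. The key to the solution is a family of stable contrastive losses that map each pixel of an image to an overcomplete descriptor that is both view-invariant and semantically meaningful, enabling high-precision pixel-level matching without momentum-based teacher-student training.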

Link: https://arxiv.org/abs/2512.04970
Authors: Leonid Pogorelyuk, Niels Bracher, Aaron Verkleeren, Lars Kühmichel, Stefan T. Radev
Affiliations: Rensselaer Polytechnic Institute; Technical University Dortmund
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: UniReps Workshop 2025, 12 pages, 8 figures

Abstract:We pilot a family of stable contrastive losses for learning pixel-level representations that jointly capture semantic and geometric information. Our approach maps each pixel of an image to an overcomplete descriptor that is both view-invariant and semantically meaningful. It enables precise point-correspondence across images without requiring momentum-based teacher-student training. Two experiments in synthetic 2D and 3D environments demonstrate the properties of our loss and the resulting overcomplete representations.

[CV-27] Rethinking the Use of Vision Transformers for AI-Generated Image Detection

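[Quick Read]: This paper addresses the underuse of pre-trained Vision Transformer (ViT) features in current AI-generated image detection methods, which rely mainly on final-layer features and overlook the potential of intermediate layers. The study shows that earlier ViT layers provide more localized and more generalizable features that often outperform final-layer features in detection, and that different layers capture distinct aspects of the data, each contributing to detection performance. The key to the solution is MoLD, a new adaptive method that dynamically fuses features from multiple ViT layers through a gating mechanism. It yields significant gains on GAN- and diffusion-generated images, generalizes well across generative models, and is robust in real-world scenarios.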

Link: https://arxiv.org/abs/2512.04969
Authors: NaHyeon Park, Kunhee Kim, Junsuk Choe, Hyunjung Shim
Affiliations: KAIST AI; Sogang University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Code: this https URL

Abstract:Rich feature representations derived from CLIP-ViT have been widely utilized in AI-generated image detection. While most existing methods primarily leverage features from the final layer, we systematically analyze the contributions of layer-wise features to this task. Our study reveals that earlier layers provide more localized and generalizable features, often surpassing the performance of final-layer features in detection tasks. Moreover, we find that different layers capture distinct aspects of the data, each contributing uniquely to AI-generated image detection. Motivated by these findings, we introduce a novel adaptive method, termed MoLD, which dynamically integrates features from multiple ViT layers using a gating-based mechanism. Extensive experiments on both GAN- and diffusion-generated images demonstrate that MoLD significantly improves detection performance, enhances generalization across diverse generative models, and exhibits robustness in real-world scenarios. Finally, we illustrate the scalability and versatility of our approach by successfully applying it to other pre-trained ViTs, such as DINOv2.
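
To make the gating idea concrete, here is a minimal sketch of gate-weighted fusion over per-layer features. It is an illustration only: the dot-product gate, the function names, and the toy vectors are assumptions, and the abstract does not specify MoLD's actual gate design.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gated_fusion(layer_feats, gate_weights):
    """Fuse per-layer feature vectors with input-dependent gates.

    layer_feats: list of L feature vectors (length D), one per ViT layer.
    gate_weights: list of L weight vectors (length D). The gate score of a
    layer is the dot product of its weights with its feature, which is a
    hypothetical gate chosen for this sketch.
    """
    scores = [sum(w * f for w, f in zip(ws, fs))
              for ws, fs in zip(gate_weights, layer_feats)]
    gates = softmax(scores)
    dim = len(layer_feats[0])
    fused = [sum(g * fs[d] for g, fs in zip(gates, layer_feats))
             for d in range(dim)]
    return fused, gates

feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy features from 3 layers
weights = [[0.0, 0.0]] * 3                    # zero weights give uniform gates
fused, gates = gated_fusion(feats, weights)
```

Because the gates depend on the input features, a trained gate could weight early layers more heavily for some images and late layers for others, which is the behaviour the abstract motivates.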

[CV-28] Balanced Few-Shot Episodic Learning for Accurate Retinal Disease Diagnosis

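[Quick Read]: This paper addresses the challenges of automated retinal disease diagnosis from ophthalmic images: conventional deep learning depends on large annotated datasets whose class distribution is severely imbalanced, so models perform worse on minority diseases (e.g., Optic Disc Edema, Branch Retinal Vein Occlusion), which limits clinical usefulness. The key to the solution is a balanced few-shot episodic learning framework for the Retinal Fundus Multi-Disease Image Dataset (RFMiD) with three components: (i) balanced episodic sampling, which gives every class an equal chance to take part in each 5-way 5-shot episode; (ii) targeted augmentation, using Contrast Limited Adaptive Histogram Equalization (CLAHE) together with color and geometric transformations to increase minority-class diversity; and (iii) an ImageNet-pretrained ResNet-50 encoder that extracts fine-grained retinal features, with prototypes computed in the embedding space and classification by cosine similarity. Together these markedly reduce bias toward majority classes and improve recognition accuracy for low-frequency diseases.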

Link: https://arxiv.org/abs/2512.04967
Authors: Jasmaine Khale, Ravi Prakash Srivastava
Affiliations: Northeastern University; IIIT Ranchi
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract: Automated retinal disease diagnosis is vital given the rising prevalence of conditions such as diabetic retinopathy and macular degeneration. Conventional deep learning approaches require large annotated datasets, which are costly and often imbalanced across disease categories, limiting their reliability in practice. Few-shot learning (FSL) addresses this challenge by enabling models to generalize from only a few labeled samples per class. In this study, we propose a balanced few-shot episodic learning framework tailored to the Retinal Fundus Multi-Disease Image Dataset (RFMiD). Focusing on the ten most represented classes, which still show substantial imbalance between majority diseases (e.g., Diabetic Retinopathy, Macular Hole) and minority ones (e.g., Optic Disc Edema, Branch Retinal Vein Occlusion), our method integrates three key components: (i) balanced episodic sampling, ensuring equal participation of all classes in each 5-way 5-shot episode; (ii) targeted augmentation, including Contrast Limited Adaptive Histogram Equalization (CLAHE) and color/geometry transformations, to improve minority-class diversity; and (iii) a ResNet-50 encoder pretrained on ImageNet, selected for its superior ability to capture fine-grained retinal features. Prototypes are computed in the embedding space and classification is performed with cosine similarity for improved stability. Trained on 100 episodes and evaluated on 1,000 test episodes, our framework achieves substantial accuracy gains and reduces bias toward majority classes, with notable improvements for underrepresented diseases. These results demonstrate that dataset-aware few-shot pipelines, combined with balanced sampling and CLAHE-enhanced preprocessing, can deliver more robust and clinically fair retinal disease diagnosis under data-constrained conditions.
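
The episode construction and prototype classifier described above might look roughly like this. It is a sketch under assumptions: the toy embeddings stand in for ResNet-50 features, and the helper names (`sample_balanced_episode`, `prototypes`, `classify`, `toy_item`) are hypothetical.

```python
import math
import random

def sample_balanced_episode(data_by_class, n_way=5, k_shot=5, n_query=1, seed=0):
    """Pick n_way classes uniformly, then k_shot support + n_query query items each."""
    rng = random.Random(seed)
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = {}, {}
    for c in classes:
        items = rng.sample(data_by_class[c], k_shot + n_query)
        support[c], query[c] = items[:k_shot], items[k_shot:]
    return support, query

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def prototypes(support):
    """Class prototype = mean of that class's support embeddings."""
    return {c: [sum(col) / len(vecs) for col in zip(*vecs)]
            for c, vecs in support.items()}

def classify(embedding, protos):
    """Assign the class whose prototype is most cosine-similar."""
    return max(protos, key=lambda c: cosine(embedding, protos[c]))

def toy_item(c, k, dim=6):
    """Toy embedding: dominant component at index c plus a small offset."""
    return [1.0 if d == c else 0.01 * k for d in range(dim)]

data = {c: [toy_item(c, k) for k in range(8)] for c in range(6)}
support, query = sample_balanced_episode(data)
protos = prototypes(support)
```

Sampling classes uniformly, rather than in proportion to their frequency, is what keeps a rare disease as likely to appear in an episode as a common one.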

[CV-29] GeoPE: A Unified Geometric Positional Embedding for Structured Tensors

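[Quick Read]: This paper addresses the way standard Vision Transformers destroy the natural spatial topology when they flatten a 2D image into a 1D sequence. In particular, conventional Rotary Positional Embedding (RoPE) inherits the 1D treatment, so spatially distant patches (e.g., those at row edges) are wrongly treated as sequence neighbors, which introduces spurious sequential dependencies. The key to the solution is Geometric Positional Embedding (GeoPE), which extends rotations to 3D Euclidean space and uses quaternions to build a unified rotation operator; computing the geometric mean in the Lie algebra overcomes non-commutativity and guarantees symmetry. The resulting encoding couples the spatial dimensions geometrically while separating true spatial distance from false sequential proximity, so the model captures the true geometric structure and significantly improves shape bias and performance across tasks.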

Link: https://arxiv.org/abs/2512.04963
Authors: Yupu Yao, Bowen Yang
Affiliations: University of Electronic Science and Technology of China; Fudan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Standard Vision Transformers flatten 2D images into 1D sequences, disrupting the natural spatial topology. While Rotary Positional Embedding (RoPE) excels in 1D, it inherits this limitation, often treating spatially distant patches (e.g., at row edges) as sequence neighbors. Existing 2D approaches typically treat spatial axes independently, failing to decouple this false sequential proximity from true spatial distance. To restore the 2D spatial manifold, we introduce Geometric Positional Embedding (GeoPE), a framework that extends rotations to 3D Euclidean space using quaternions. To overcome non-commutativity and ensure symmetry, GeoPE constructs a unified rotational operator by computing the geometric mean in the Lie algebra. This creates a geometrically coupled encoding that effectively separates spatial dimensions. Extensive experiments on image classification, object detection, and 3D semantic segmentation demonstrate that GeoPE consistently outperforms existing 2D RoPE variants and significantly enhances shape bias, confirming its ability to capture true geometric structure.
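
The non-commutativity issue and a log-space average can be illustrated with plain quaternion arithmetic. This is one plausible reading of "geometric mean in the Lie algebra", offered as an assumption: the function `lie_mean`, the per-axis rotors, and the angles are hypothetical and may not match GeoPE's actual construction.

```python
import math

def q_mul(a, b):
    """Hamilton product of quaternions given as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw * bw - ax * bx - ay * by - az * bz,
            aw * bx + ax * bw + ay * bz - az * by,
            aw * by - ax * bz + ay * bw + az * bx,
            aw * bz + ax * by - ay * bx + az * bw)

def q_exp(v):
    """Exponential map: pure-imaginary 3-vector -> unit quaternion."""
    n = math.sqrt(sum(c * c for c in v))
    if n < 1e-12:
        return (1.0, 0.0, 0.0, 0.0)
    s = math.sin(n) / n
    return (math.cos(n), v[0] * s, v[1] * s, v[2] * s)

def q_log(q):
    """Logarithm of a unit quaternion -> pure-imaginary 3-vector."""
    w, x, y, z = q
    n = math.sqrt(x * x + y * y + z * z)
    if n < 1e-12:
        return (0.0, 0.0, 0.0)
    phi = math.atan2(n, w)
    return (x * phi / n, y * phi / n, z * phi / n)

def axis_rotor(axis, angle):
    """Unit quaternion rotating by `angle` about coordinate axis 0, 1 or 2."""
    v = [0.0, 0.0, 0.0]
    v[axis] = angle / 2.0
    return q_exp(v)

def lie_mean(q1, q2):
    """exp of the averaged logarithms; symmetric in q1 and q2."""
    l1, l2 = q_log(q1), q_log(q2)
    return q_exp([(a + b) / 2.0 for a, b in zip(l1, l2)])

qx = axis_rotor(0, 0.7)  # rotation driven by a row index (hypothetical angle)
qy = axis_rotor(1, 1.1)  # rotation driven by a column index (hypothetical angle)
```

Here `q_mul(qx, qy)` and `q_mul(qy, qx)` differ, so composing per-axis rotations would make the encoding depend on axis order, whereas `lie_mean(qx, qy)` equals `lie_mean(qy, qx)` because addition in the Lie algebra commutes.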

[CV-30] FASTer: Toward Efficient Autoregressive Vision Language Action Modeling via neural Action Tokenization

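[Quick Read]: This paper addresses the trade-off between reconstruction fidelity and inference efficiency in the action tokenization of autoregressive vision-language-action (VLA) models for robotic manipulation. The key to the solution is the FASTer framework. A learnable tokenizer (FASTerVQ) encodes action chunks as single-channel images, capturing global spatio-temporal dependencies while keeping a high compression ratio. On top of it, a policy network (FASTerVLA) with block-wise autoregressive decoding and a lightweight action expert achieves faster inference and higher task performance, surpassing previous state-of-the-art VLA models.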

Link: https://arxiv.org/abs/2512.04952
Authors: Yicheng Liu, Shiduo Zhang, Zibin Dong, Baijun Ye, Tianyuan Yuan, Xiaopeng Yu, Linqi Yin, Chenhao Lu, Junhao Shi, Luca Jiang-Tao Yu, Liangtao Zheng, Tao Jiang, Jingjing Gong, Xipeng Qiu, Hang Zhao
Affiliations: Tsinghua University; Fudan University; Shanghai Innovation Institute; Galaxea AI; Tianjin University; Hong Kong University; UCSD
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Abstract:Autoregressive vision-language-action (VLA) models have recently demonstrated strong capabilities in robotic manipulation. However, their core process of action tokenization often involves a trade-off between reconstruction fidelity and inference efficiency. We introduce FASTer, a unified framework for efficient and generalizable robot learning that integrates a learnable tokenizer with an autoregressive policy built upon it. FASTerVQ encodes action chunks as single-channel images, capturing global spatio-temporal dependencies while maintaining a high compression ratio. FASTerVLA builds on this tokenizer with block-wise autoregressive decoding and a lightweight action expert, achieving both faster inference and higher task performance. Extensive experiments across simulated and real-world benchmarks show that FASTerVQ delivers superior reconstruction quality, high token utilization, and strong cross-task and cross-embodiment generalization, while FASTerVLA further improves overall capability, surpassing previous state-of-the-art VLA models in both inference speed and task performance.

[CV-31] Towards Adaptive Fusion of Multimodal Deep Networks for Human Action Recognition

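[Quick Read]: This paper addresses the low accuracy and poor robustness of traditional unimodal action recognition methods in complex scenes, and in particular the challenge of selecting and integrating features from different modalities during multimodal fusion. The key to the solution is an adaptive multimodal fusion strategy based on gating mechanisms: it dynamically adjusts the weight of each modality (RGB images, optical flow, audio, and depth) to selectively integrate the relevant features, improving the accuracy and generalization of action recognition. Experiments on several benchmark datasets, covering action recognition, violence detection, and self-supervised learning tasks, show that it outperforms traditional unimodal methods.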

Link: https://arxiv.org/abs/2512.04943
Authors: Novanto Yudistira
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:This study introduces a pioneering methodology for human action recognition by harnessing deep neural network techniques and adaptive fusion strategies across multiple modalities, including RGB, optical flows, audio, and depth information. Employing gating mechanisms for multimodal fusion, we aim to surpass limitations inherent in traditional unimodal recognition methods while exploring novel possibilities for diverse applications. Through an exhaustive investigation of gating mechanisms and adaptive weighting-based fusion architectures, our methodology enables the selective integration of relevant information from various modalities, thereby bolstering both accuracy and robustness in action recognition tasks. We meticulously examine various gated fusion strategies to pinpoint the most effective approach for multimodal action recognition, showcasing its superiority over conventional unimodal methods. Gating mechanisms facilitate the extraction of pivotal features, resulting in a more holistic representation of actions and substantial enhancements in recognition performance. Our evaluations across human action recognition, violence action detection, and multiple self-supervised learning tasks on benchmark datasets demonstrate promising advancements in accuracy. The significance of this research lies in its potential to revolutionize action recognition systems across diverse fields. The fusion of multimodal information promises sophisticated applications in surveillance and human-computer interaction, especially in contexts related to active assisted living.

[CV-32] LiteVGGT: Boosting Vanilla VGGT via Geometry-aware Cached Token Merging

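[Quick Read]: This paper addresses the slow computation and high memory consumption of 3D vision foundation models such as VGGT on long sequences, which limit their use on large-scale scenes (hundreds of images or more). The key to the solution is LiteVGGT, whose geometry-aware cached token merging strategy rests on two insights. First, tokens from local image regions are inherently geometrically correlated, which leads to high similarity and computational redundancy. Second, token similarity stays stable across adjacent network layers, so merge decisions can be reused. Building on these, the authors optimize anchor token selection to preserve information that is key for reconstruction, and cache and reuse merge indices across layers. This cuts latency (up to 10x speedup) and memory use while maintaining reconstruction accuracy, and remains compatible with efficient fine-tuning and FP8 quantization.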

Link: https://arxiv.org/abs/2512.04939
Authors: Zhijian Shu, Cheng Lin, Tao Xie, Wei Yin, Ben Li, Zhiyuan Pu, Weize Li, Yao Yao, Xun Cao, Xiaoyang Guo, Xiao-Xiao Long
Affiliations: Nanjing University of Posts and Telecommunications; Horizon Robotics; Nanjing University; Zhejiang University; Macau University of Science and Technology; TARS Robotics; China Mobile Zijin Innovation Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:3D vision foundation models like Visual Geometry Grounded Transformer (VGGT) have advanced greatly in geometric perception. However, it is time-consuming and memory-intensive for long sequences, limiting application to large-scale scenes beyond hundreds of images. To address this, we propose LiteVGGT, achieving up to 10x speedup and substantial memory reduction, enabling efficient processing of 1000-image scenes. We derive two key insights for 3D reconstruction: (1) tokens from local image regions have inherent geometric correlations, leading to high similarity and computational redundancy; (2) token similarity across adjacent network layers remains stable, allowing for reusable merge decisions. Guided by these, we design a simple yet efficient strategy, dubbed geometry-aware cached token merging. We analyze each token’s geometric importance, optimizing anchor token selection to better preserve key information for reconstruction. We also cache and reuse merge indices across layers, substantially reducing latency with minimal accuracy impact. This strategy retains VGGT’s core performance, enabling efficient fine-tuning and FP8 quantization for further gains. Extensive experiments validate LiteVGGT’s effectiveness, scalability, and robustness. Project page: this https URL
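
The "compute the merge index once, reuse it in later layers" idea can be sketched as below. This is an illustration under assumptions: anchors are hand-picked here rather than chosen by geometric importance, merging is a plain average, and the function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def build_merge_index(tokens, anchor_ids):
    """Assign every token to its most similar anchor (anchors map to themselves)."""
    index = []
    for i, tok in enumerate(tokens):
        if i in anchor_ids:
            index.append(i)
        else:
            index.append(max(anchor_ids, key=lambda a: cosine(tok, tokens[a])))
    return index

def merge_tokens(tokens, index, anchor_ids):
    """Average each anchor with the tokens assigned to it."""
    merged = []
    for a in anchor_ids:
        group = [tokens[i] for i, tgt in enumerate(index) if tgt == a]
        merged.append([sum(col) / len(group) for col in zip(*group)])
    return merged

layer1 = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
anchors = [0, 2]  # hypothetical anchor choice
cached_index = build_merge_index(layer1, anchors)  # similarity search runs once

# A later layer with the same similarity structure reuses the cached index.
layer2 = [[2.0, 0.0], [1.8, 0.2], [0.0, 2.0], [0.2, 1.8]]
merged1 = merge_tokens(layer1, cached_index, anchors)
merged2 = merge_tokens(layer2, cached_index, anchors)
```

Skipping the similarity search in later layers is where the latency saving would come from; it is only safe to the extent that token similarity really is stable across adjacent layers, as the abstract reports.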

[CV-33] Virtually Unrolling the Herculaneum Papyri by Diffeomorphic Spiral Fitting WACV2026

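[Quick Read]: This paper addresses the fact that the Herculaneum Papyri are severely damaged and extremely fragile, so they cannot be unrolled physically and their text cannot be read. The core challenge is to automatically reconstruct, from high-resolution CT scans, a readable flattened 2D representation of the surface, i.e., virtual unrolling. The key to the solution is a new top-down method that globally fits an explicit parametric surface model to existing neural network predictions of where the rolled papyrus likely passes, guaranteeing that the resulting surface is a single continuous 2D sheet whose topology stays intact even in regions where the CT scan shows no visible signal. The method is validated on high-resolution CT scans of two scrolls and exceeds the performance of the only existing automated unrolling method suitable for such data.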

Link: https://arxiv.org/abs/2512.04927
Authors: Paul Henderson
Affiliations: University of Glasgow
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at WACV 2026

Abstract:The Herculaneum Papyri are a collection of rolled papyrus documents that were charred and buried by the famous eruption of Mount Vesuvius. They promise to contain a wealth of previously unseen Greek and Latin texts, but are extremely fragile and thus most cannot be unrolled physically. A solution to access these texts is virtual unrolling, where the papyrus surface is digitally traced out in a CT scan of the scroll, to create a flattened representation. This tracing is very laborious to do manually in gigavoxel-sized scans, so automated approaches are desirable. We present the first top-down method that automatically fits a surface model to a CT scan of a severely damaged scroll. We take a novel approach that globally fits an explicit parametric model of the deformed scroll to existing neural network predictions of where the rolled papyrus likely passes. Our method guarantees the resulting surface is a single continuous 2D sheet, even passing through regions where the surface is not detectable in the CT scan. We conduct comprehensive experiments on high-resolution CT scans of two scrolls, showing that our approach successfully unrolls large regions, and exceeds the performance of the only existing automated unrolling method suitable for this data.

[CV-34] Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion

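[Quick Read]: This paper addresses the inefficiency and limited generation quality that arise in Latent Diffusion Models (LDMs) when semantic and texture information are denoised synchronously. Although LDMs naturally generate from coarse to fine, existing methods still denoise synchronously and ignore the potential advantage of semantics forming before texture. The key to the solution is Semantic-First Diffusion (SFD): an asynchronous denoising mechanism with separate noise schedules denoises the semantic latent ahead of the texture latent, giving clearer high-level semantic guidance for texture generation. Concretely, SFD first extracts a compact semantic latent with a dedicated Semantic VAE and combines it with the texture latent into a composite latent; a temporal offset then makes semantics lead texture during denoising, significantly improving generation quality and convergence speed.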

Link: https://arxiv.org/abs/2512.04926
Authors: Yueming Pan, Ruoyu Feng, Qi Dai, Yuqi Wang, Wenfeng Lin, Mingyu Guo, Chong Luo, Nanning Zheng
Affiliations: IAIR, Xi’an Jiaotong University; Microsoft Research Asia; ByteDance
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Latent Diffusion Models (LDMs) inherently follow a coarse-to-fine generation process, where high-level semantic structure is generated slightly earlier than fine-grained texture. This indicates the preceding semantics potentially benefit texture generation by providing a semantic anchor. Recent advances have integrated semantic priors from pretrained visual encoders to further enhance LDMs, yet they still denoise semantic and VAE-encoded texture synchronously, neglecting such ordering. Observing these, we propose Semantic-First Diffusion (SFD), a latent diffusion paradigm that explicitly prioritizes semantic formation. SFD first constructs composite latents by combining a compact semantic latent, which is extracted from a pretrained visual encoder via a dedicated Semantic VAE, with the texture latent. The core of SFD is to denoise the semantic and texture latents asynchronously using separate noise schedules: semantics precede textures by a temporal offset, providing clearer high-level guidance for texture refinement and enabling natural coarse-to-fine generation. On ImageNet 256x256 with guidance, SFD achieves FID 1.06 (LightningDiT-XL) and FID 1.04 (1.0B LightningDiT-XXL), while achieving up to 100x faster convergence than the original DiT. SFD also improves existing methods like ReDi and VA-VAE, demonstrating the effectiveness of asynchronous, semantics-led modeling. Project page and code: this https URL.
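
A temporal offset between two noise schedules can be sketched in a few lines. This is an illustration only: the linear schedule, the global clock `u`, and the convention that 1 means pure noise are assumptions, not SFD's published formulation.

```python
def async_noise_levels(u, offset):
    """Noise levels (1 = pure noise, 0 = clean) at global clock u in [0, 1 + offset].

    The semantic latent starts denoising immediately; the texture latent
    waits `offset` units of the clock before it starts.
    """
    def clamp(x):
        return max(0.0, min(1.0, x))
    t_sem = clamp(1.0 - u)
    t_tex = clamp(1.0 + offset - u)
    return t_sem, t_tex

def schedule(n_steps, offset):
    """Evenly spaced clock values from 0 to 1 + offset, mapped to noise levels."""
    total = 1.0 + offset
    return [async_noise_levels(total * (k / n_steps), offset)
            for k in range(n_steps + 1)]

steps = schedule(n_steps=6, offset=0.2)
```

At every step the semantic latent is at least as clean as the texture latent, so the texture denoiser always sees semantics that are further along. Setting `offset=0` recovers ordinary synchronous denoising.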

[CV-35] ReflexFlow: Rethinking Learning Objective for Exposure Bias Alleviation in Flow Matching

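[Quick Read]: This paper addresses exposure bias in Flow Matching methods, which stems from the mismatch between training and inference: the model lacks generalization to biased inputs during training, and too little low-frequency information is captured during early denoising, so bias accumulates. The key to the solution is ReflexFlow, which corrects the bias dynamically through two components: (1) Anti-Drift Rectification (ADR), which uses training-time scheduled sampling with a redesigned loss to reflexively adjust the prediction targets for biased inputs; and (2) Frequency Compensation (FC), which identifies missing low-frequency components and compensates for them by reweighting the loss using exposure bias. ReflexFlow is model-agnostic and compatible with all Flow Matching frameworks. It improves generation quality on CIFAR-10, CelebA-64, and ImageNet-256, with a 35.65% reduction in FID on CelebA-64.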

Link: https://arxiv.org/abs/2512.04904
Authors: Guanbo Huang, Jingjia Mao, Fanding Huang, Fengkai Liu, Xiangyang Luo, Yaoyuan Liang, Jiasheng Lu, Xiaoe Wang, Pei Liu, Ruiliu Fu, Shao-Lun Huang
Affiliations: Tsinghua Shenzhen International Graduate School, Tsinghua University; Media Technology Lab, Huawei
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Despite tremendous recent progress, Flow Matching methods still suffer from exposure bias due to discrepancies in training and inference. This paper investigates the root causes of exposure bias in Flow Matching, including: (1) the model lacks generalization to biased inputs during training, and (2) insufficient low-frequency content captured during early denoising, leading to accumulated bias. Based on these insights, we propose ReflexFlow, a simple and effective reflexive refinement of the Flow Matching learning objective that dynamically corrects exposure bias. ReflexFlow consists of two components: (1) Anti-Drift Rectification (ADR), which reflexively adjusts prediction targets for biased inputs utilizing a redesigned loss under training-time scheduled sampling; and (2) Frequency Compensation (FC), which reflects on missing low-frequency components and compensates them by reweighting the loss using exposure bias. ReflexFlow is model-agnostic, compatible with all Flow Matching frameworks, and improves generation quality across datasets. Experiments on CIFAR-10, CelebA-64, and ImageNet-256 show that ReflexFlow outperforms prior approaches in mitigating exposure bias, achieving a 35.65% reduction in FID on CelebA-64.

[CV-36] Equivariant Symmetry-Aware Head Pose Estimation for Fetal MRI

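[Quick Read]: This paper addresses the estimation of the 6-DoF pose of the fetal head, which moves during diagnostic MRI scans, in order to enable automatic adaptive prescription of 2D diagnostic MRI slices. Existing methods generalize poorly to clinical MRI data because of inherent anatomical symmetries, low resolution, noise, and artifacts. The key to the solution is E(3)-Pose, which explicitly models rotation equivariance and object symmetry, so that anatomical symmetries and rigid pose equivariance are captured by construction, giving robust and accurate fetal head pose estimates under challenging clinical conditions.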

Link: https://arxiv.org/abs/2512.04890
Authors: Ramya Muthukrishnan, Borjan Gagoski, Aryn Lee, P. Ellen Grant, Elfar Adalsteinsson, Polina Golland, Benjamin Billot
Affiliations: Massachusetts Institute of Technology; Boston Children’s Hospital; Department of Radiology, Harvard Medical School; Inria, Université Côte d’Azur
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:We present E(3)-Pose, a novel fast pose estimation method that jointly and explicitly models rotation equivariance and object symmetry. Our work is motivated by the challenging problem of accounting for fetal head motion during a diagnostic MRI scan. We aim to enable automatic adaptive prescription of 2D diagnostic MRI slices with 6-DoF head pose estimation, supported by 3D MRI volumes rapidly acquired before each 2D slice. Existing methods struggle to generalize to clinical volumes, due to pose ambiguities induced by inherent anatomical symmetries, as well as low resolution, noise, and artifacts. In contrast, E(3)-Pose captures anatomical symmetries and rigid pose equivariance by construction, and yields robust estimates of the fetal head pose. Our experiments on publicly available and representative clinical fetal MRI datasets demonstrate the superior robustness and generalization of our method across domains. Crucially, E(3)-Pose achieves state-of-the-art accuracy on clinical MRI volumes, paving the way for clinical translation. Our implementation is available at this http URL.

[CV-37] You Only Train Once (YOTO): A Retraining-Free Object Detection Framework

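[Quick Read]: This paper addresses the performance degradation that catastrophic forgetting causes in object detection: when new classes are introduced (e.g., new products in retail), conventional methods must retrain the whole model on all historical data, which is costly in computation and time. The key to the solution is a new framework called You Only Train Once (YOTO). It uses YOLO11n for efficient object localization, DeIT (Data-efficient Image Transformer) for feature extraction, and Proxy Anchor Loss for metric learning to make features more discriminative. Classification is done by cosine similarity between the embedding features and the samples stored in a Qdrant vector database, so newly added products can be recognized without retraining. This makes training almost 3 times more efficient than classical approaches and keeps inference latency on an edge device low (580 ms per image on average).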

Link: https://arxiv.org/abs/2512.04888
Authors: Priyanto Hidayatullah, Nurjannah Syakrani, Yudi Widhiyasana, Muhammad Rizqi Sholahuddin, Refdinal Tubagus, Zahri Al Adzani Hidayat, Hanri Fajar Ramadhan, Dafa Alfarizki Pratama, Farhan Muhammad Yasin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: under review in the Elsevier Engineering Journal

Abstract:Object detection constitutes the primary task within the domain of computer vision. It is utilized in numerous domains. Nonetheless, object detection continues to encounter the issue of catastrophic forgetting. The model must be retrained whenever new products are introduced, utilizing not only the new products dataset but also the entirety of the previous dataset. The outcome is obvious: increasing model training expenses and significant time consumption. In numerous sectors, particularly retail checkout, the frequent introduction of new products presents a great challenge. This study introduces You Only Train Once (YOTO), a methodology designed to address the issue of catastrophic forgetting by integrating YOLO11n for object localization with DeIT and Proxy Anchor Loss for feature extraction and metric learning. For classification, we utilize cosine similarity between the embedding features of the target product and those in the Qdrant vector database. In a case study conducted in a retail store with 140 products, the experimental results demonstrate that our proposed framework achieves encouraging accuracy, whether for detecting new or existing products. Furthermore, without retraining, the training duration difference is significant. We achieve almost 3 times the training time efficiency compared to classical object detection approaches. This efficiency escalates as additional new products are added to the product database. The average inference time is 580 ms per image containing multiple products, on an edge device, validating the proposed framework’s feasibility for practical use.
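
Why a similarity lookup avoids retraining can be shown with a tiny in-memory gallery. This is a sketch under assumptions: `EmbeddingGallery` is a hypothetical stand-in for the vector database (the paper uses Qdrant, whose client API is not reproduced here), and the three-dimensional vectors stand in for DeIT embeddings.

```python
import math

class EmbeddingGallery:
    """In-memory stand-in for a vector database of product embeddings."""

    def __init__(self):
        self.items = []  # list of (label, unit-norm vector)

    @staticmethod
    def _normalize(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    def add(self, label, vector):
        """Register one reference embedding for a product."""
        self.items.append((label, self._normalize(vector)))

    def search(self, vector):
        """Return (label, cosine similarity) of the closest stored embedding."""
        q = self._normalize(vector)
        scored = ((lab, sum(a * b for a, b in zip(q, v)))
                  for lab, v in self.items)
        return max(scored, key=lambda pair: pair[1])

gallery = EmbeddingGallery()
gallery.add("cola", [1.0, 0.1, 0.0])
gallery.add("chips", [0.0, 1.0, 0.1])

# A new product is onboarded with a single insert; no model weights change.
gallery.add("soap", [0.0, 0.1, 1.0])
label, score = gallery.search([0.1, 0.0, 1.0])
```

The detector and the embedding model stay frozen; only the gallery grows, which is why the cost of adding a product does not scale with the size of the historical training set.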

[CV-38] SDG-Track: A Heterogeneous Observer-Follower Framework for High-Resolution UAV Tracking on Embedded Platforms

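[Quick Read]: This paper addresses the fundamental conflict between resolution and speed in real-time tracking of small unmanned aerial vehicles (UAVs) on edge devices: low-resolution input degrades small-target features below the detectable threshold, while processing high-resolution (e.g., 1080p) frames directly on resource-constrained hardware yields a frame rate too low for smooth gimbal control. The key to the solution is SDG-Track, a sparse detection-guided tracking framework with an Observer-Follower architecture. The Observer stream runs a high-capacity detector at low frequency on the GPU to extract accurate position anchors from native 1920x1080 frames; the Follower stream runs on the CPU and interpolates the trajectory at high frequency with ROI-constrained sparse optical flow, balancing efficiency and accuracy. To handle tracking failures caused by occlusion or model drift, the framework adds Dual-Space Recovery, a training-free mechanism that combines color histogram matching with geometric consistency constraints for fast re-acquisition.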

Link: https://arxiv.org/abs/2512.04883
Authors: Jiawen Wen, Yu Hu, Suixuan Qiu, Jinshan Huang, Xiaowen Chu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL

Abstract:Real-time tracking of small unmanned aerial vehicles (UAVs) on edge devices faces a fundamental resolution-speed conflict. Downsampling high-resolution imagery to standard detector input sizes causes small target features to collapse below detectable thresholds. Yet processing native 1080p frames on resource-constrained platforms yields insufficient throughput for smooth gimbal control. We propose SDG-Track, a Sparse Detection-Guided Tracker that adopts an Observer-Follower architecture to reconcile this conflict. The Observer stream runs a high-capacity detector at low frequency on the GPU to provide accurate position anchors from 1920x1080 frames. The Follower stream performs high-frequency trajectory interpolation via ROI-constrained sparse optical flow on the CPU. To handle tracking failures from occlusion or model drift caused by spectrally similar distractors, we introduce Dual-Space Recovery, a training-free re-acquisition mechanism combining color histogram matching with geometric consistency constraints. Experiments on a ground-to-air tracking station demonstrate that SDG-Track achieves 35.1 FPS system throughput while retaining 97.2% of the frame-by-frame detection precision. The system successfully tracks agile FPV drones under real-world operational conditions on an NVIDIA Jetson Orin Nano. Our paper code is publicly available at this https URL
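
The re-acquisition step, which combines an appearance check with a geometric check, might be sketched like this. It is a hedged illustration: histogram intersection, the Euclidean distance gate, both thresholds, and the function names are assumptions, since the abstract does not give the exact formulation of Dual-Space Recovery.

```python
import math

def hist_intersection(h1, h2):
    """Similarity of two normalized histograms, in [0, 1]."""
    return sum(min(a, b) for a, b in zip(h1, h2))

def recover(candidates, ref_hist, last_pos, max_jump=50.0, min_sim=0.7):
    """Pick the candidate that passes both the geometric and the appearance gate.

    candidates: list of (position, color histogram) pairs.
    max_jump and min_sim are hypothetical thresholds for this sketch.
    Returns the accepted position, or None if nothing qualifies.
    """
    best, best_sim = None, min_sim
    for pos, hist in candidates:
        if math.dist(pos, last_pos) > max_jump:  # geometric consistency gate
            continue
        sim = hist_intersection(hist, ref_hist)
        if sim >= best_sim:                      # appearance gate
            best, best_sim = pos, sim
    return best

ref = [0.6, 0.3, 0.1]  # reference color histogram of the tracked drone
cands = [((400.0, 300.0), [0.6, 0.3, 0.1]),   # right colors, too far away
         ((110.0, 105.0), [0.1, 0.2, 0.7]),   # nearby, wrong colors
         ((120.0, 90.0), [0.5, 0.4, 0.1])]    # nearby and similar
target = recover(cands, ref, last_pos=(100.0, 100.0))
```

Requiring both gates is what rejects a distractor with similar colors that appears far from the last known position, and a nearby object with the wrong colors. No learned component is involved, consistent with the "training-free" description.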

[CV-39] SP-Det: Self-Prompted Dual-Text Fusion for Generalized Multi-Label Lesion Detection

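[Quick Read]: This paper addresses the dependence of automated lesion detection in chest X-rays on expert-annotated prompts, a constraint that makes existing methods hard to adopt in clinical practice. The key to the solution is SP-Det, a self-prompted detection framework. Its core innovation is an expert-free Dual-Text Prompt Generator (DTPG) that automatically builds rich diagnostic context from two complementary textual modalities: semantic context prompts, which capture global pathological patterns, and disease beacon prompts, which focus on disease-specific manifestations. It also introduces a Bidirectional Feature Enhancer (BFE) that fuses the comprehensive diagnostic context with disease-specific embeddings, significantly improving feature representation and detection accuracy.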

Link: https://arxiv.org/abs/2512.04875
Authors: Qing Xu,Yanqian Wang,Xiangjian He,Yue Li,Yixuan Zhang,Rong Qu,Wenting Duan,Zhen Chen
Affiliations: University of Nottingham Ningbo China; University of Nottingham; University of Lincoln; Yale University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Automated lesion detection in chest X-rays has demonstrated significant potential for improving clinical diagnosis by precisely localizing pathological abnormalities. While recent promptable detection frameworks have achieved remarkable accuracy in target localization, existing methods typically rely on manual annotations as prompts, which are labor-intensive and impractical for clinical applications. To address this limitation, we propose SP-Det, a novel self-prompted detection framework that automatically generates rich textual context to guide multi-label lesion detection without requiring expert annotations. Specifically, we introduce an expert-free dual-text prompt generator (DTPG) that leverages two complementary textual modalities: semantic context prompts that capture global pathological patterns and disease beacon prompts that focus on disease-specific manifestations. Moreover, we devise a bidirectional feature enhancer (BFE) that synergistically integrates comprehensive diagnostic context with disease-specific embeddings to significantly improve feature representation and detection accuracy. Extensive experiments on two chest X-ray datasets with diverse thoracic disease categories demonstrate that our SP-Det framework outperforms state-of-the-art detection methods while completely eliminating the dependency on expert-annotated prompts compared to existing promptable architectures.

[CV-40] Contact-Aware Refinement of Human Pose Pseudo-Ground Truth via Bioimpedance Sensing ICCV2025

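[Quick Read]: This paper addresses accurate 3D human pose capture in the wild, targeting the weakness of video-based pose estimators in self-contact scenarios such as a hand touching the face. The key to the solution is BioTUCH, a framework that combines visual pose estimation with wearable bioimpedance sensing. It adds contact-aware pose optimization: whenever skin contact is measured, it minimizes reprojection error and deviation from the input estimate while enforcing vertex proximity constraints. Experiments with three input pose estimators show an average 11.7% improvement in reconstruction accuracy. The authors also present a miniature bioimpedance sensor to support large-scale collection of contact-aware training data.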

Link: https://arxiv.org/abs/2512.04862
Authors: Maria-Paola Forte,Nikos Athanasiou,Giulia Ballardini,Jan Ulrich Bartels,Katherine J. Kuchenbecker,Michael J. Black
Affiliations: Max Planck Institute for Intelligent Systems
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: * Equal contribution. Minor figure corrections compared to the ICCV 2025 version

Abstract:Capturing accurate 3D human pose in the wild would provide valuable data for training pose estimation and motion generation methods. While video-based estimation approaches have become increasingly accurate, they often fail in common scenarios involving self-contact, such as a hand touching the face. In contrast, wearable bioimpedance sensing can cheaply and unobtrusively measure ground-truth skin-to-skin contact. Consequently, we propose a novel framework that combines visual pose estimators with bioimpedance sensing to capture the 3D pose of people by taking self-contact into account. Our method, BioTUCH, initializes the pose using an off-the-shelf estimator and introduces contact-aware pose optimization during measured self-contact: reprojection error and deviations from the input estimate are minimized while enforcing vertex proximity constraints. We validate our approach using a new dataset of synchronized RGB video, bioimpedance measurements, and 3D motion capture. Testing with three input pose estimators, we demonstrate an average of 11.7% improvement in reconstruction accuracy. We also present a miniature wearable bioimpedance sensor that enables efficient large-scale collection of contact-aware training data for improving pose estimation and generation using BioTUCH. Code and data are available at this http URL
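
The contact-aware optimization described in the abstract can be written schematically as a constrained problem. This is only a sketch under stated assumptions: the symbols $\theta$ (pose parameters), $\theta_0$ (the off-the-shelf estimate), $\lambda$ (a trade-off weight), $\mathcal{C}_t$ (vertex pairs flagged as touching by the sensor at time $t$), and $\epsilon$ (a distance tolerance) are introduced here for illustration and do not come from the paper.

```latex
\min_{\theta}\;
  E_{\mathrm{reproj}}(\theta)
  \;+\; \lambda \,\lVert \theta - \theta_0 \rVert^2
\quad \text{s.t.} \quad
  \lVert v_i(\theta) - v_j(\theta) \rVert \le \epsilon
  \quad \forall\, (i, j) \in \mathcal{C}_t
```

Here $v_i(\theta)$ denotes the position of body-mesh vertex $i$ under pose $\theta$. The constraint is only active in frames where the sensor reports contact. Whether the authors enforce it as a hard constraint or as a penalty term is not stated in the abstract.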

[CV-41] Autoregressive Image Generation Needs Only a Few Lines of Cached Tokens

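[Quick Read]: This paper addresses the severe memory bottleneck in autoregressive visual generation: all previously generated visual tokens must be cached during decoding, which leads to high storage cost and low throughput. The key to the solution is LineAR, a training-free progressive key-value (KV) cache compression pipeline. Exploiting intrinsic properties of visual attention, it manages the cache line by line in a 2D view: it preserves the visual dependency regions and, guided by inter-line attention, progressively evicts less-informative tokens that are harmless for generating subsequent lines. This substantially reduces cache size and raises generation efficiency while maintaining or even improving generation quality.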

Link: https://arxiv.org/abs/2512.04857
Authors: Ziran Qin,Youru Lv,Mingbao Lin,Zeren Zhang,Chanfan Gan,Tieyuan Chen,Weiyao Lin
Affiliations: Shanghai Jiao Tong University; Rakuten; Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Autoregressive (AR) visual generation has emerged as a powerful paradigm for image and multimodal synthesis, owing to its scalability and generality. However, existing AR image generation suffers from severe memory bottlenecks due to the need to cache all previously generated visual tokens during decoding, leading to both high storage requirements and low throughput. In this paper, we introduce LineAR, a novel, training-free progressive key-value (KV) cache compression pipeline for autoregressive image generation. By fully exploiting the intrinsic characteristics of visual attention, LineAR manages the cache at the line level using a 2D view, preserving the visual dependency regions while progressively evicting less-informative tokens that are harmless for subsequent line generation, guided by inter-line attention. LineAR enables efficient autoregressive (AR) image generation by utilizing only a few lines of cache, achieving both memory savings and throughput speedup, while maintaining or even improving generation quality. Extensive experiments across six autoregressive image generation models, including class-conditional and text-to-image generation, validate its effectiveness and generality. LineAR improves ImageNet FID from 2.77 to 2.68 and COCO FID from 23.85 to 22.86 on LlamaGen-XL and Janus-Pro-1B, while retaining only 1/6 KV cache. It also improves DPG on Lumina-mGPT-768 with just 1/8 KV cache. Additionally, LineAR achieves significant memory and throughput gains, including up to 67.61% memory reduction and 7.57x speedup on LlamaGen-XL, and 39.66% memory reduction and 5.62x speedup on Janus-Pro-7B.
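
To make the idea of a line-level cache budget concrete, here is a toy sketch. It is not LineAR's attention-guided eviction: it swaps in a much simpler sliding window that keeps only the most recent rows of a raster-order token grid. The function name `generate_with_line_cache` and its parameters are hypothetical.

```python
from collections import deque


def generate_with_line_cache(height, width, keep_lines):
    """Simulate raster-order decoding of a height x width token grid.

    After each row is finished, only the newest `keep_lines` rows stay cached.
    This is a plain sliding window, used here as a stand-in for a real
    attention-guided eviction policy.
    Returns (peak number of cached tokens, row indices still cached at the end).
    """
    cache = deque()  # each entry is one row: a list of (row, col) token ids
    peak_tokens = 0
    for row in range(height):
        current = []
        cache.append(current)
        for col in range(width):
            current.append((row, col))  # "decode" one token and cache it
            peak_tokens = max(peak_tokens, sum(len(r) for r in cache))
        # Row finished: evict the oldest rows beyond the budget.
        while len(cache) > keep_lines:
            cache.popleft()
    return peak_tokens, [r[0][0] for r in cache]


peak, kept_rows = generate_with_line_cache(height=24, width=24, keep_lines=4)
print(peak, "of", 24 * 24, "tokens cached at peak; rows kept:", kept_rows)
```

With a 24 x 24 grid and a four-row budget, the peak cache holds five rows (four finished plus the one in progress) rather than the full grid. A real method must also decide which tokens inside and outside that window are safe to drop, which is the part the paper addresses.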

[CV-42] A Sanity Check for Multi-In-Domain Face Forgery Detection in the Real World

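[Quick Read]: This paper addresses the low accuracy (ACC) of existing detectors on single images from unspecified domains in the Multi-In-Domain Face Forgery Detection (MID-FFD) setting. Although detectors reach high AUC within each individual domain (good in-domain separation), inter-domain differences dominate the feature space, so the subtle real/fake distinctions are hard to capture. The key to the solution is DevDet, a model-agnostic framework with two components: the Face Forgery Developer (FFDev), which amplifies the differences between real and fake samples, and a Dose-Adaptive detector Fine-Tuning strategy (DAFT), which adapts the fine-tuning dose to refine the feature representation. Together they make real/fake differences dominant in the feature space, improving single-image judgments while maintaining the original generalization ability to unseen data.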

Link: https://arxiv.org/abs/2512.04837
Authors: Jikang Cheng,Renye Yan,Zhiyuan Yan,Yaozhong Gan,Xueyi Zhang,Zhongyuan Wang,Wei Peng,Ling Liang
Affiliations: Peking University; Nanjing University; The Chinese University of Hong Kong, Shenzhen; Wuhan University; Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Existing methods for deepfake detection aim to develop generalizable detectors. Although “generalizable” is the ultimate target once and for all, with limited training forgeries and domains, it appears idealistic to expect generalization that covers entirely unseen variations, especially given the diversity of real-world deepfakes. Therefore, introducing large-scale multi-domain data for training can be feasible and important for real-world applications. However, within such a multi-domain scenario, the differences between multiple domains, rather than the subtle real/fake distinctions, dominate the feature space. As a result, despite detectors being able to relatively separate real and fake within each domain (i.e., high AUC), they struggle with single-image real/fake judgments in domain-unspecified conditions (i.e., low ACC). In this paper, we first define a new research paradigm named Multi-In-Domain Face Forgery Detection (MID-FFD), which includes sufficient volumes of real-fake domains for training. Then, the detector should provide definitive real-fake judgments to the domain-unspecified inputs, which simulate the frame-by-frame independent detection scenario in the real world. Meanwhile, to address the domain-dominant issue, we propose a model-agnostic framework termed DevDet (Developer for Detector) to amplify real/fake differences and make them dominant in the feature space. DevDet consists of a Face Forgery Developer (FFDev) and a Dose-Adaptive detector Fine-Tuning strategy (DAFT). Experiments demonstrate our superiority in predicting real-fake under the MID-FFD scenario while maintaining original generalization ability to unseen data.
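
The gap between high per-domain AUC and low pooled ACC can be reproduced with a few made-up numbers. The sketch below is illustrative only: the scores in `domain_a` and `domain_b` are hypothetical, and the helper names `auc` and `accuracy` are introduced here, not taken from the paper.

```python
def auc(scores_real, scores_fake):
    """Probability that a random fake scores higher than a random real (ties count half)."""
    wins = 0.0
    for f in scores_fake:
        for r in scores_real:
            wins += 1.0 if f > r else 0.5 if f == r else 0.0
    return wins / (len(scores_fake) * len(scores_real))


def accuracy(scores_real, scores_fake, threshold):
    """Fraction of correct decisions when 'fake' means score >= threshold."""
    correct = sum(s < threshold for s in scores_real)
    correct += sum(s >= threshold for s in scores_fake)
    return correct / (len(scores_real) + len(scores_fake))


# Hypothetical detector scores (higher = "more fake"); domain B is shifted upward.
domain_a = {"real": [0.10, 0.15, 0.20], "fake": [0.30, 0.35, 0.40]}
domain_b = {"real": [0.60, 0.65, 0.70], "fake": [0.80, 0.85, 0.90]}

for name, d in (("A", domain_a), ("B", domain_b)):
    print("domain", name, "AUC:", auc(d["real"], d["fake"]))

all_real = domain_a["real"] + domain_b["real"]
all_fake = domain_a["fake"] + domain_b["fake"]
print("pooled ACC at threshold 0.5:", accuracy(all_real, all_fake, 0.5))
```

Each domain is perfectly separable on its own (AUC of 1.0), yet a single threshold of 0.5 applied to the pooled data gets only half of the decisions right, because the shift between domains is larger than the real/fake gap inside each one.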

[CV-43] Tokenizing Buildings: A Transformer for Layout Synthesis

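[Quick Read]: This paper addresses layout synthesis in Building Information Modeling (BIM) scenes, where the core challenge is unifying heterogeneous feature sets of architectural elements into sequences while preserving compositional structure. The key to the solution is the Small Building Model (SBM). First, a sparse attribute-feature matrix captures room properties and gives heterogeneous element features a unified encoding. Second, a unified embedding module learns joint representations of categorical and continuous feature groups. Finally, a single Transformer backbone is trained in two modes: an encoder-only pathway produces high-fidelity room embeddings for semantic retrieval, and an encoder-decoder pipeline performs autoregressive prediction of room entities (Data-Driven Entity Prediction, DDEP), yielding functionally sound layouts with fewer collisions and boundary violations and better navigability.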

Link: https://arxiv.org/abs/2512.04832
Authors: Manuel Ladron de Guevara,Jinmo Rhee,Ardavan Bidgoli,Vaidas Razgaitis,Michael Bergin
Affiliations: Higharc; University of Calgary
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: 8 pages, 1 page References, 4 figures

Abstract:We introduce Small Building Model (SBM), a Transformer-based architecture for layout synthesis in Building Information Modeling (BIM) scenes. We address the question of how to tokenize buildings by unifying heterogeneous feature sets of architectural elements into sequences while preserving compositional structure. Such feature sets are represented as a sparse attribute-feature matrix that captures room properties. We then design a unified embedding module that learns joint representations of categorical and possibly correlated continuous feature groups. Lastly, we train a single Transformer backbone in two modes: an encoder-only pathway that yields high-fidelity room embeddings, and an encoder-decoder pipeline for autoregressive prediction of room entities, referred to as Data-Driven Entity Prediction (DDEP). Experiments across retrieval and generative layout synthesis show that SBM learns compact room embeddings that reliably cluster by type and topology, enabling strong semantic retrieval. In DDEP mode, SBM produces functionally sound layouts, with fewer collisions and boundary violations and improved navigability.

[CV-44] FreeGen: Feed-Forward Reconstruction-Generation Co-Training for Free-Viewpoint Driving Scene Synthesis

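[Quick Read]: This paper addresses free-viewpoint driving scene synthesis, which closed-loop simulation and scalable pre-training for autonomous driving require. Existing datasets and generative pipelines rarely provide consistent off-trajectory observations, which limits large-scale evaluation and training. The core challenge is achieving interpolation consistency and extrapolation realism at the same time without per-scene optimization. The key to the solution is FreeGen, a feed-forward reconstruction-generation co-training framework: the reconstruction model provides stable geometric representations to ensure interpolation consistency, while the generation model performs geometry-aware enhancement to improve realism at unseen viewpoints. Through co-training, generative priors are distilled into the reconstruction model to improve off-trajectory rendering, and the refined geometry in turn gives the generation model stronger structural guidance.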

Link: https://arxiv.org/abs/2512.04830
Authors: Shijie Chen,Peixi Peng
Affiliations: Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Novel View Synthesis, Driving Scene, Free Trajectory, Image Generation

Abstract:Closed-loop simulation and scalable pre-training for autonomous driving require synthesizing free-viewpoint driving scenes. However, existing datasets and generative pipelines rarely provide consistent off-trajectory observations, limiting large-scale evaluation and training. While recent generative models demonstrate strong visual realism, they struggle to jointly achieve interpolation consistency and extrapolation realism without per-scene optimization. To address this, we propose FreeGen, a feed-forward reconstruction-generation co-training framework for free-viewpoint driving scene synthesis. The reconstruction model provides stable geometric representations to ensure interpolation consistency, while the generation model performs geometry-aware enhancement to improve realism at unseen viewpoints. Through co-training, generative priors are distilled into the reconstruction model to improve off-trajectory rendering, and the refined geometry in turn offers stronger structural guidance for generation. Experiments demonstrate that FreeGen achieves state-of-the-art performance for free-viewpoint driving scene synthesis.

[CV-45] LatentFM: A Latent Flow Matching Approach for Generative Medical Image Segmentation

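[Quick Read]: This paper addresses insufficient prediction accuracy and missing uncertainty quantification in medical image segmentation. Existing methods rarely deliver both accurate segmentation and a usable expression of model confidence, which limits their reliability in clinical settings. The key to the solution is LatentFM, a flow matching (FM) model that operates in latent space. Two variational autoencoders (VAEs) first encode medical images and their masks into a lower-dimensional latent space; a conditional velocity field, conditioned on the input image, is then learned to guide the flow. By sampling multiple latent representations, the model synthesizes diverse segmentation outputs whose pixel-wise variance reflects the underlying data distribution, and it produces confidence maps that quantify model certainty. Experiments on ISIC-2018 and CVC-Clinic are reported to show advantages in segmentation accuracy and efficiency.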

Link: https://arxiv.org/abs/2512.04821
Authors: Huynh Trinh Ngoc,Hoang Anh Nguyen Kim,Toan Nguyen Hai,Long Tran Quoc
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Generative models have achieved remarkable progress with the emergence of flow matching (FM). It has demonstrated strong generative capabilities and attracted significant attention as a simulation-free flow-based framework capable of learning exact data densities. Motivated by these advances, we propose LatentFM, a flow-based model operating in the latent space for medical image segmentation. To model the data distribution, we first design two variational autoencoders (VAEs) to encode both medical images and their corresponding masks into a lower-dimensional latent space. We then estimate a conditional velocity field that guides the flow based on the input image. By sampling multiple latent representations, our method synthesizes diverse segmentation outputs whose pixel-wise variance reliably captures the underlying data distribution, enabling both highly accurate and uncertainty-aware predictions. Furthermore, we generate confidence maps that quantify the model certainty, providing clinicians with richer information for deeper analysis. We conduct experiments on two datasets, ISIC-2018 and CVC-Clinic, and compare our method with several prior baselines, including both deterministic and generative approach models. Through comprehensive evaluations, both qualitative and quantitative results show that our approach achieves superior segmentation accuracy while remaining highly efficient in the latent space.
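
The step from several sampled masks to a pixel-wise variance and a confidence map can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the four 2 x 2 masks are made up, and the mapping `1 - 4 * variance` is an assumption chosen here because the variance of a binary value never exceeds 0.25.

```python
def pixelwise_stats(masks):
    """Per-pixel mean and (population) variance over a list of sampled masks.

    Each mask is a list of rows holding 0/1 values.
    """
    n = len(masks)
    height, width = len(masks[0]), len(masks[0][0])
    mean = [[sum(m[i][j] for m in masks) / n for j in range(width)]
            for i in range(height)]
    var = [[sum((m[i][j] - mean[i][j]) ** 2 for m in masks) / n for j in range(width)]
           for i in range(height)]
    return mean, var


def confidence_from_variance(var):
    """Map variance to [0, 1]; assumed rule, valid for binary masks (variance <= 0.25)."""
    return [[1.0 - 4.0 * v for v in row] for row in var]


# Four hypothetical segmentation samples for the same 2 x 2 image.
samples = [
    [[1, 1], [0, 0]],
    [[1, 0], [0, 0]],
    [[1, 1], [0, 1]],
    [[1, 0], [0, 0]],
]
mean, var = pixelwise_stats(samples)
conf = confidence_from_variance(var)
print("mean:", mean)
print("variance:", var)
print("confidence:", conf)
```

Pixels on which every sample agrees get confidence 1.0, while the pixel where the samples split evenly gets confidence 0.0. The mean map can be thresholded to obtain a single consensus mask.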

[CV-46] RobustSplat: Decoupling Densification Dynamics and Illumination for In-the-Wild 3DGS

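[Quick Read]: This paper addresses the rendering artifacts that 3D Gaussian Splatting (3DGS) produces in in-the-wild scenes because of transient objects and illumination changes. The authors observe that Gaussian densification unintentionally grows extra Gaussians that model these disturbances, which degrades reconstruction quality. The solution rests on three designs. First, a delayed Gaussian growth strategy optimizes the static scene structure before allowing Gaussian splitting or cloning, which mitigates early overfitting to transient objects. Second, scale-cascaded mask bootstrapping starts from lower-resolution feature-similarity supervision for a robust initial transient mask and then moves to high-resolution supervision for more precise masks. Third, both strategies are combined with appearance modeling to handle in-the-wild scenes that contain transients and illumination variation.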

Link: https://arxiv.org/abs/2512.04815
Authors: Chuanyu Fu,Guanying Chen,Yuqi Zhang,Kunbin Yao,Yuan Xiong,Chuan Huang,Shuguang Cui,Yasuyuki Matsushita,Xiaochun Cao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: arXiv admin note: substantial text overlap with arXiv:2506.02751

Abstract:3D Gaussian Splatting (3DGS) has gained significant attention for its real-time, photo-realistic rendering in novel-view synthesis and 3D modeling. However, existing methods struggle with accurately modeling in-the-wild scenes affected by transient objects and illuminations, leading to artifacts in the rendered images. We identify that the Gaussian densification process, while enhancing scene detail capture, unintentionally contributes to these artifacts by growing additional Gaussians that model transient disturbances and illumination variations. To address this, we propose RobustSplat++, a robust solution based on several critical designs. First, we introduce a delayed Gaussian growth strategy that prioritizes optimizing static scene structure before allowing Gaussian splitting/cloning, mitigating overfitting to transient objects in early optimization. Second, we design a scale-cascaded mask bootstrapping approach that first leverages lower-resolution feature similarity supervision for reliable initial transient mask estimation, taking advantage of its stronger semantic consistency and robustness to noise, and then progresses to high-resolution supervision to achieve more precise mask prediction. Third, we incorporate the delayed Gaussian growth strategy and mask bootstrapping with appearance modeling to handle in-the-wild scenes including transients and illuminations. Extensive experiments on multiple challenging datasets show that our method outperforms existing methods, clearly demonstrating the robustness and effectiveness of our method.

[CV-47] Shared Multi-modal Embedding Space for Face-Voice Association ICASSP

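[Quick Read]: This paper addresses face-voice association for cross-modal identity matching in a multilingual setting, where performance must hold on test languages that were not seen during training. The approach uses separate uni-modal pipelines for general face and voice feature extraction, complemented by age-gender features to support prediction. The uni-modal features are projected into a shared embedding space and trained with an Adaptive Angular Margin (AAM) loss. The method ranked first in the FAME 2026 challenge with an average Equal-Error Rate (EER) of 23.99%.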

Link: https://arxiv.org/abs/2512.04814
Authors: Christopher Simic,Korbinian Riedhammer,Tobias Bocklet
Affiliations: Unknown
Categories: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV)
Comments: Ranked 1st in Fame 2026 Challenge, ICASSP

Abstract:The FAME 2026 challenge comprises two demanding tasks: training face-voice associations combined with a multilingual setting that includes testing on languages on which the model was not trained. Our approach consists of separate uni-modal processing pipelines with general face and voice feature extraction, complemented by additional age-gender feature extraction to support prediction. The resulting single-modal features are projected into a shared embedding space and trained with an Adaptive Angular Margin (AAM) loss. Our approach achieved first place in the FAME 2026 challenge, with an average Equal-Error Rate (EER) of 23.99%.
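
For readers unfamiliar with the reported metric, the sketch below approximates an Equal-Error Rate from match scores. It is a generic illustration, not the challenge's official scoring script: the function `equal_error_rate` and the example scores are hypothetical, and the EER is approximated at the observed threshold where false accepts and false rejects are closest.

```python
def equal_error_rate(genuine, impostor):
    """Approximate EER by sweeping thresholds over all observed scores.

    A higher score means "more likely the same identity". The function returns
    the mean of the false-accept and false-reject rates at the threshold where
    the two rates are closest.
    """
    best_gap, eer = float("inf"), None
    for t in sorted(set(genuine) | set(impostor)):
        far = sum(s >= t for s in impostor) / len(impostor)  # false accepts
        frr = sum(s < t for s in genuine) / len(genuine)     # false rejects
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


# Hypothetical similarity scores for matching and non-matching face-voice pairs.
genuine_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.6, 0.3, 0.2, 0.1]
print("EER:", equal_error_rate(genuine_scores, impostor_scores))
```

In this toy example one impostor pair scores above one genuine pair, so at the balancing threshold a quarter of each group is misclassified and the EER is 0.25.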

[CV-48] EMMA: Efficient Multimodal Understanding Generation and Editing with a Unified Architecture

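[Quick Read]: This paper addresses the difficulty current unified multimodal architectures have in balancing efficiency and performance, in particular how to cut compute cost substantially while jointly optimizing multimodal understanding, generation, and editing. The key to the solution is the EMMA architecture, with four core ideas: 1) an efficient autoencoder with a 32x compression ratio, which reduces the number of tokens needed for generation and keeps training balanced between understanding and generation tasks; 2) channel-wise concatenation in place of the usual token-wise concatenation, which further reduces the number of visual tokens; 3) a shared-and-decoupled network that lets tasks benefit from each other while meeting task-specific modeling requirements; 4) a mixture-of-experts mechanism in the visual understanding encoder, which improves perception with only a small increase in parameters. With these designs, EMMA-4B is reported to outperform existing unified multimodal methods such as BAGEL-7B in both efficiency and performance, and to be competitive with recent specialist models such as Qwen3-VL.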

Link: https://arxiv.org/abs/2512.04810
Authors: Xin He,Longhui Wei,Jianbo Ouyang,Lingxi Xie,Qi Tian
Affiliations: Huawei Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:We propose EMMA, an efficient and unified architecture for multimodal understanding, generation and editing. Specifically, EMMA primarily consists of 1) an efficient autoencoder with a 32x compression ratio, which significantly reduces the number of tokens required for generation. This also ensures the training balance between understanding and generation tasks by applying the same compression ratio to images. 2) Channel-wise concatenation instead of token-wise concatenation among visual understanding and generation tokens, which further reduces the visual tokens in unified architectures. 3) A shared-and-decoupled network that enables mutual improvements across tasks while meeting the task-specific modeling requirements. 4) A mixture-of-experts mechanism adopted for the visual understanding encoder, which substantially improves perceptual capabilities with only a small increase in parameters. Extensive experiments have shown that EMMA-4B can significantly outperform state-of-the-art unified multimodal approaches (e.g., BAGEL-7B) in both efficiency and performance, while also achieving competitive results compared to recent multimodal understanding and generation experts (e.g., Qwen3-VL and Qwen-Image). We believe that EMMA lays a solid foundation for the future development of unified multimodal architectures.
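
The difference between token-wise and channel-wise concatenation is easy to see on toy data. The sketch below only illustrates the sequence-length effect. The token lists and function names are hypothetical, and it assumes that both streams contain the same number of tokens, which the shared compression ratio mentioned in the abstract would provide.

```python
def tokenwise_concat(understanding_tokens, generation_tokens):
    """Append one token sequence after the other: the sequence gets longer."""
    return understanding_tokens + generation_tokens


def channelwise_concat(understanding_tokens, generation_tokens):
    """Join the two feature vectors at each position: the tokens get wider instead."""
    assert len(understanding_tokens) == len(generation_tokens)
    return [u + g for u, g in zip(understanding_tokens, generation_tokens)]


# Four visual tokens per stream, three channels each (made-up values).
und = [[0.1, 0.2, 0.3] for _ in range(4)]
gen = [[0.4, 0.5, 0.6] for _ in range(4)]

by_token = tokenwise_concat(und, gen)
by_channel = channelwise_concat(und, gen)
print("token-wise:", len(by_token), "tokens x", len(by_token[0]), "channels")
print("channel-wise:", len(by_channel), "tokens x", len(by_channel[0]), "channels")
```

Token-wise concatenation doubles the sequence length (8 tokens of width 3), while channel-wise concatenation keeps 4 tokens and doubles their width. Since attention cost grows with sequence length, the second layout is cheaper, at the price of needing a projection back to the model width, which is an assumption about the surrounding architecture rather than something the abstract states.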

[CV-49] LaFiTe: A Generative Latent Field for 3D Native Texturing

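[Quick Read]: This paper addresses the long-standing challenge of 3D-native texturing, that is, generating high-fidelity, seamless textures directly on 3D surfaces, with the goal of overcoming the limitations of UV-based and multi-view projection methods. Existing native approaches lack a powerful and versatile latent representation, which limits the fidelity and generality of the textures they generate. The key to the solution is LaFiTe, a framework that learns to generate textures as a 3D generative sparse latent color field. Its core is a variational autoencoder (VAE) that encodes complex surface appearance into a sparse, structured latent space and decodes it into a continuous color field. This disentangles texture appearance from mesh topology and UV parameterization, improves reconstruction quality by 10 dB PSNR over prior methods, and supports high-quality texture synthesis across diverse styles and geometries.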

Link: https://arxiv.org/abs/2512.04786
Authors: Chia-Hao Chen,Zi-Xin Zou,Yan-Pei Cao,Ze Yuan,Guan Luo,Xiaojuan Qi,Ding Liang,Song-Hai Zhang,Yuan-Chen Guo
Affiliations: Tsinghua University; VAST; The University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Generating high-fidelity, seamless textures directly on 3D surfaces, what we term 3D-native texturing, remains a fundamental open challenge, with the potential to overcome long-standing limitations of UV-based and multi-view projection methods. However, existing native approaches are constrained by the absence of a powerful and versatile latent representation, which severely limits the fidelity and generality of their generated textures. We identify this representation gap as the principal barrier to further progress. We introduce LaFiTe, a framework that addresses this challenge by learning to generate textures as a 3D generative sparse latent color field. At its core, LaFiTe employs a variational autoencoder (VAE) to encode complex surface appearance into a sparse, structured latent space, which is subsequently decoded into a continuous color field. This representation achieves unprecedented fidelity, exceeding state-of-the-art methods by 10 dB PSNR in reconstruction, by effectively disentangling texture appearance from mesh topology and UV parameterization. Building upon this strong representation, a conditional rectified-flow model synthesizes high-quality, coherent textures across diverse styles and geometries. Extensive experiments demonstrate that LaFiTe not only sets a new benchmark for 3D-native texturing but also enables flexible downstream applications such as material synthesis and texture super-resolution, paving the way for the next generation of 3D content creation workflows.

[CV-50] PaCo-RL: Advancing Reinforcement Learning for Consistent Image Generation with Pairwise Reward Modeling

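[Quick Read]: This paper addresses how to keep identity, style, and logical coherence consistent across multiple generated images, a key requirement for applications such as storytelling and character design. Supervised approaches struggle because large-scale datasets capturing visual consistency are lacking and human perceptual preferences are hard to model. The proposed solution is the PaCo-RL framework, which has two components. PaCo-Reward is a pairwise consistency evaluator trained on a large-scale dataset built through automated sub-figure pairing; it scores consistency with a generative, autoregressive mechanism enhanced by task-aware instructions and chain-of-thought (CoT) reasoning, which improves alignment with human perception. PaCo-GRPO is a reinforcement learning algorithm with a resolution-decoupled optimization strategy that substantially reduces training cost, plus a log-tamed multi-reward aggregation mechanism that keeps reward optimization balanced and stable. Across two representative subtasks the approach is reported to achieve state-of-the-art consistency with better training efficiency and stability.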

Link: https://arxiv.org/abs/2512.04784
Authors: Bowen Ping,Chengyou Jia,Minnan Luo,Changliang Xia,Xin Shen,Zhuohang Dang,Hangwei Qian
Affiliations: Xi’an Jiaotong University; CFAR, A*STAR
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Consistent image generation requires faithfully preserving identities, styles, and logical coherence across multiple images, which is essential for applications such as storytelling and character design. Supervised training approaches struggle with this task due to the lack of large-scale datasets capturing visual consistency and the complexity of modeling human perceptual preferences. In this paper, we argue that reinforcement learning (RL) offers a promising alternative by enabling models to learn complex and subjective visual criteria in a data-free manner. To achieve this, we introduce PaCo-RL, a comprehensive framework that combines a specialized consistency reward model with an efficient RL algorithm. The first component, PaCo-Reward, is a pairwise consistency evaluator trained on a large-scale dataset constructed via automated sub-figure pairing. It evaluates consistency through a generative, autoregressive scoring mechanism enhanced by task-aware instructions and CoT reasons. The second component, PaCo-GRPO, leverages a novel resolution-decoupled optimization strategy to substantially reduce RL cost, alongside a log-tamed multi-reward aggregation mechanism that ensures balanced and stable reward optimization. Extensive experiments across the two representative subtasks show that PaCo-Reward significantly improves alignment with human perceptions of visual consistency, and PaCo-GRPO achieves state-of-the-art consistency performance with improved training efficiency and stability. Together, these results highlight the promise of PaCo-RL as a practical and scalable solution for consistent image generation. The project page is available at this https URL.

[CV-51] Order Matters: 3D Shape Generation from Sequential VR Sketches

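[Quick Read]: This paper addresses the fact that existing models for generating 3D shapes from VR sketches ignore the temporal order of strokes and therefore lose cues about structure and design intent. The key to the solution is the VRSketch2Shape framework, with three contributions: an automated pipeline that generates sequential VR sketches from arbitrary 3D shapes; a multi-category dataset of over 20k synthetic and 900 hand-drawn sketch-shape pairs; and an order-aware sketch encoder coupled with a diffusion-based 3D generator. The approach preserves stroke order, improves geometric fidelity, generalizes from synthetic to real sketches with minimal supervision, and also works on partial sketches.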

Link: https://arxiv.org/abs/2512.04761
Authors: Yizi Chen,Sidi Wu,Tianyi Xiao,Nina Wiedemann,Loic Landrieu
Affiliations: ETH Zurich; LIGM, ENPC, IP Paris, Univ Gustave Eiffel, CNRS
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:VR sketching lets users explore and iterate on ideas directly in 3D, offering a faster and more intuitive alternative to conventional CAD tools. However, existing sketch-to-shape models ignore the temporal ordering of strokes, discarding crucial cues about structure and design intent. We introduce VRSketch2Shape, the first framework and multi-category dataset for generating 3D shapes from sequential VR sketches. Our contributions are threefold: (i) an automated pipeline that generates sequential VR sketches from arbitrary shapes, (ii) a dataset of over 20k synthetic and 900 hand-drawn sketch-shape pairs across four categories, and (iii) an order-aware sketch encoder coupled with a diffusion-based 3D generator. Our approach yields higher geometric fidelity than prior work, generalizes effectively from synthetic to real sketches with minimal supervision, and performs well even on partial sketches. All data and models will be released open-source at this https URL.

[CV-52] MT-Depth: Multi-task Instance feature analysis for the Depth Completion

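[Quick Read]: This paper addresses limited accuracy in depth completion, where sparse depth data must be densified, especially at object boundaries, in occluded regions, and on thin structures. Most existing methods guide depth completion with semantic segmentation and overlook the benefits of object-level understanding. The key to the solution is an instance-aware depth completion framework that uses binary instance masks as spatial priors. It combines a frozen YOLO V11 instance segmentation branch with a U-Net backbone, a cross-attention fusion module, and an attention-guided prediction head, so that the model focuses on object-centric regions during depth refinement and estimates depth more accurately near boundaries and complex structures.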

Link: https://arxiv.org/abs/2512.04734
Authors: Abdul Haseeb Nizamani,Dandi Zhou,Xinhai Sun
Affiliations: Saisuode (Shanghai) Intelligent Technology Co., Ltd. (Synthoid.ai)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Depth completion plays a vital role in 3D perception systems, especially in scenarios where sparse depth data must be densified for tasks such as autonomous driving, robotics, and augmented reality. While many existing approaches rely on semantic segmentation to guide depth completion, they often overlook the benefits of object-level understanding. In this work, we introduce an instance-aware depth completion framework that explicitly integrates binary instance masks as spatial priors to refine depth predictions. Our model combines four main components: a frozen YOLO V11 instance segmentation branch, a U-Net-based depth completion backbone, a cross-attention fusion module, and an attention-guided prediction head. The instance segmentation branch generates per-image foreground masks that guide the depth branch via cross-attention, allowing the network to focus on object-centric regions during refinement. We validate our method on the Virtual KITTI 2 dataset, showing that it achieves lower RMSE compared to both a U-Net-only baseline and previous semantic-guided methods, while maintaining competitive MAE. Qualitative and quantitative results demonstrate that the proposed model effectively enhances depth accuracy near object boundaries, occlusions, and thin structures. Our findings suggest that incorporating instance-aware cues offers a promising direction for improving depth completion without relying on dense semantic labels.
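
The abstract reports lower RMSE together with competitive MAE, and the two metrics react differently to large local errors. The sketch below is a generic illustration with made-up depth values. The function `depth_errors` is hypothetical, and skipping pixels with non-positive ground truth is a common convention assumed here rather than a detail taken from the paper.

```python
import math


def depth_errors(pred, gt):
    """Return (RMSE, MAE) over pixels that have valid ground truth (gt > 0)."""
    abs_err = [abs(p - g) for p, g in zip(pred, gt) if g > 0]
    mae = sum(abs_err) / len(abs_err)
    rmse = math.sqrt(sum(e * e for e in abs_err) / len(abs_err))
    return rmse, mae


gt = [10.0, 10.0, 10.0, 10.0]
uniform_error = [11.0, 11.0, 11.0, 11.0]  # every pixel is off by 1 m
one_outlier = [10.0, 10.0, 10.0, 14.0]    # a single pixel is off by 4 m

print("uniform error (RMSE, MAE):", depth_errors(uniform_error, gt))
print("one outlier   (RMSE, MAE):", depth_errors(one_outlier, gt))
```

Both predictions have the same MAE of 1.0, but the one with a single large error has twice the RMSE. Large errors tend to cluster at object boundaries and thin structures, so a method that mainly repairs those regions can lower RMSE noticeably while MAE changes little.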

[CV-53] E3AD: An Emotion-Aware Vision-Language-Action Model for Human-Centric End-to-End Autonomous Driving

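[Quick Read]: This paper addresses the fact that current end-to-end autonomous driving (E2E AD) systems ignore the passenger's emotional state, which is central to ride comfort and user acceptance. The key to the solution is the E3AD framework, with three core ideas: (1) a continuous Valence-Arousal-Dominance (VAD) emotion model that extracts tone and urgency from natural-language commands; (2) a dual-pathway spatial reasoning module that fuses egocentric and allocentric views for human-like spatial cognition; (3) a consistency-oriented training scheme that combines modality pretraining with preference-based alignment to keep emotional intent and driving actions coherent. The method improves visual grounding and waypoint planning and achieves state-of-the-art VAD correlation for emotion estimation.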

Link: https://arxiv.org/abs/2512.04733
Authors: Yihong Tang,Haicheng Liao,Tong Nie,Junlin He,Ao Qu,Kehua Chen,Wei Ma,Zhenning Li,Lijun Sun,Chengzhong Xu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:End-to-end autonomous driving (AD) systems increasingly adopt vision-language-action (VLA) models, yet they typically ignore the passenger’s emotional state, which is central to comfort and AD acceptance. We introduce Open-Domain End-to-End (OD-E2E) autonomous driving, where an autonomous vehicle (AV) must interpret free-form natural-language commands, infer the emotion, and plan a physically feasible trajectory. We propose E3AD, an emotion-aware VLA framework that augments semantic understanding with two cognitively inspired components: a continuous Valence-Arousal-Dominance (VAD) emotion model that captures tone and urgency from language, and a dual-pathway spatial reasoning module that fuses egocentric and allocentric views for human-like spatial cognition. A consistency-oriented training scheme, combining modality pretraining with preference-based alignment, further enforces coherence between emotional intent and driving actions. Across real-world datasets, E3AD improves visual grounding and waypoint planning and achieves state-of-the-art (SOTA) VAD correlation for emotion estimation. These results show that injecting emotion into VLA-style driving yields more human-aligned grounding, planning, and human-centric feedback.

[CV-54] Measuring the Unspoken: A Disentanglement Model and Benchmark for Psychological Analysis in the Wild

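[Quick Read]: This paper addresses two core problems in generative psychological analysis of in-the-wild conversations: existing Vision-Language Models (VLMs) cannot resolve Articulatory-Affective Ambiguity, where the visual patterns of speech mimic emotional expressions, and there are no verifiable metrics for assessing visual grounding and reasoning depth. The proposed solution is a complete ecosystem. Its centerpiece is the Multilevel Insight Network for Disentanglement (MIND), whose Status Judgment module algorithmically suppresses ambiguous lip features based on their temporal feature variance to achieve explicit visual disentanglement. The authors also build ConvoInsight-DB, a large-scale dataset, and the Mental Reasoning Insight Rating Metric (PRISM) to make evaluation more objective and multidimensional. MIND is reported to gain 86.95% in micro-expression detection over the prior best baseline, and ablations identify the Status Judgment module as the most critical component.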

Link: https://arxiv.org/abs/2512.04728
Authors: Yigui Feng,Qinglin Wang,Haotian Mo,Yang Liu,Ke Liu,Gencheng Liu,Xinhai Chen,Siqi Shen,Songzhu Mei,Jie Liu
Affiliations: National University of Defense Technology; South China University of Technology; Xiamen University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Generative psychological analysis of in-the-wild conversations faces two fundamental challenges: (1) existing Vision-Language Models (VLMs) fail to resolve Articulatory-Affective Ambiguity, where visual patterns of speech mimic emotional expressions; and (2) progress is stifled by a lack of verifiable evaluation metrics capable of assessing visual grounding and reasoning depth. We propose a complete ecosystem to address these twin challenges. First, we introduce Multilevel Insight Network for Disentanglement (MIND), a novel hierarchical visual encoder that introduces a Status Judgment module to algorithmically suppress ambiguous lip features based on their temporal feature variance, achieving explicit visual disentanglement. Second, we construct ConvoInsight-DB, a new large-scale dataset with expert annotations for micro-expressions and deep psychological inference. Third, we designed the Mental Reasoning Insight Rating Metric (PRISM), an automated dimensional framework that uses expert-guided LLM to measure the multidimensional performance of large mental vision models. On our PRISM benchmark, MIND significantly outperforms all baselines, achieving a +86.95% gain in micro-expression detection over prior SOTA. Ablation studies confirm that our Status Judgment disentanglement module is the most critical component for this performance leap. Our code has been released.

[CV-55] Hardware-aware Neural Architecture Search of Early Exiting Networks on Edge Accelerators

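[Quick Read]: This paper addresses the challenge of deploying large-scale Deep Learning (DL) models in resource-constrained edge computing environments, in particular how to optimize Early Exiting Neural Networks (EENN) to balance inference efficiency and model accuracy. The key to the solution is a hardware-aware Neural Architecture Search (NAS) framework that systematically integrates the effects of quantization and the hardware resource allocation of edge accelerators to automatically optimize the placement of early exit points in an EENN. This substantially reduces computational cost (experiments report a reduction of over 50% on CIFAR-10) while maintaining good accuracy and energy efficiency, making the models better suited to the practical constraints of edge devices.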

Link: https://arxiv.org/abs/2512.04705
Authors: Alaa Zniber,Arne Symons,Ouassim Karrakchou,Marian Verhelst,Mounir Ghogho
Affiliations: TICLab, International University of Rabat, Morocco; MICAS, KU Leuven, Belgium; College of Computing, University Mohammed VI Polytechnic, Morocco
Categories: Computational Complexity (cs.CC); Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to IEEE Transactions on Emerging Topics in Computing

Abstract:Advancements in high-performance computing and cloud technologies have enabled the development of increasingly sophisticated Deep Learning (DL) models. However, the growing demand for embedded intelligence at the edge imposes stringent computational and energy constraints, challenging the deployment of these large-scale models. Early Exiting Neural Networks (EENN) have emerged as a promising solution, allowing dynamic termination of inference based on input complexity to enhance efficiency. Despite their potential, EENN performance is highly influenced by the heterogeneity of edge accelerators and the constraints imposed by quantization, affecting accuracy, energy efficiency, and latency. Yet, research on the automatic optimization of EENN design for edge hardware remains limited. To bridge this gap, we propose a hardware-aware Neural Architecture Search (NAS) framework that systematically integrates the effects of quantization and hardware resource allocation to optimize the placement of early exit points within a network backbone. Experimental results on the CIFAR-10 dataset demonstrate that our NAS framework can discover architectures that achieve over a 50% reduction in computational costs compared to conventional static networks, making them more suitable for deployment in resource-constrained edge environments.
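
To make the notion of an early exit point concrete, the following is a minimal sketch of confidence-thresholded early-exit inference. The names `early_exit_predict`, `stages`, and `exit_heads` and the max-softmax confidence rule are assumptions for illustration; the abstract does not specify the exit criterion, and the NAS procedure itself is not shown.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(x, stages, exit_heads, threshold):
    """Run backbone stages in order and stop at the first confident exit.

    stages: list of callables, each mapping a hidden state to the next one.
    exit_heads: list of callables (one per stage) mapping a hidden state to
        class logits; the last head plays the role of the final classifier.
    threshold: hypothetical confidence level required to exit early.
    Returns (predicted_class, exit_index).
    """
    hidden = x
    last_index = len(stages) - 1
    for index, (stage, head) in enumerate(zip(stages, exit_heads)):
        hidden = stage(hidden)
        probs = softmax(head(hidden))
        confidence = max(probs)
        # Exit when confident enough, or unconditionally at the last stage.
        if confidence >= threshold or index == last_index:
            return probs.index(confidence), index
```

Easy inputs leave at an early head and skip the remaining stages, which is where the computational savings come from; where to place the heads is the design question the paper's search targets.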

[CV-56] OmniScaleSR: Unleashing Scale-Controlled Diffusion Prior for Faithful and Realistic Arbitrary-Scale Image Super-Resolution

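[Quick Read]: This paper addresses the problem that existing arbitrary-scale super-resolution (ASSR) methods produce unrealistic details, blur, or excessive hallucination at high magnification. Mainstream ASSR methods rely on implicit neural representation (INR), whose regression-driven feature extraction struggles to synthesize fine structures and therefore yields low realism. Diffusion-based realistic image super-resolution (Real-ISR) performs well at 4x but lacks explicit scale control, so the diffusion process cannot be adjusted properly at ultra-high magnification, which causes distortion. The key to the solution is OmniScaleSR, a framework that introduces explicit scale control mechanisms native to the diffusion process and combines them with the implicit scale adaptation of the diffusion prior to achieve scale-aware and content-aware modulation of the diffusion process. It also incorporates multi-domain fidelity enhancement designs to improve reconstruction accuracy, raising realism while preserving high fidelity, especially at large magnification factors.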

Link: https://arxiv.org/abs/2512.04699
Authors: Xinning Chai,Zhengxue Cheng,Yuhong Zhang,Hengsheng Zhang,Yingsheng Qin,Yucai Yang,Rong Xie,Li Song
Affiliations: Shanghai Jiao Tong University; Shanghai Normal University; Transsion; MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted as TCSVT, 15 pages

Abstract:Arbitrary-scale super-resolution (ASSR) overcomes the limitation of traditional super-resolution (SR) methods that operate only at fixed scales (e.g., 4x), enabling a single model to handle arbitrary magnification. Most existing ASSR approaches rely on implicit neural representation (INR), but its regression-driven feature extraction and aggregation intrinsically limit the ability to synthesize fine details, leading to low realism. Recent diffusion-based realistic image super-resolution (Real-ISR) models leverage powerful pre-trained diffusion priors and show impressive results at the 4x setting. We observe that they can also achieve ASSR because the diffusion prior implicitly adapts to scale by encouraging high-realism generation. However, without explicit scale control, the diffusion process cannot be properly adjusted for different magnification levels, resulting in excessive hallucination or blurry outputs, especially under ultra-high scales. To address these issues, we propose OmniScaleSR, a diffusion-based realistic arbitrary-scale SR framework designed to achieve both high fidelity and high realism. We introduce explicit, diffusion-native scale control mechanisms that work synergistically with implicit scale adaptation, enabling scale-aware and content-aware modulation of the diffusion process. In addition, we incorporate multi-domain fidelity enhancement designs to further improve reconstruction accuracy. Extensive experiments on bicubic degradation benchmarks and real-world datasets show that OmniScaleSR surpasses state-of-the-art methods in both fidelity and perceptual realism, with particularly strong performance at large magnification factors. Code will be released at this https URL.

[CV-57] Towards Cross-View Point Correspondence in Vision-Language Models

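[Quick Read]: This paper addresses the difficulty Vision-Language Models (VLMs) have in achieving precise point-level matching in the Cross-View Point Correspondence (CVPC) task, a capability that is crucial for precise affordance interaction in embodied AI. Existing models (e.g., Gemini-2.5-Pro) trail humans by more than 54.65% in accuracy, which shows that moving from coarse-grained judgment to fine-grained coordinate prediction remains challenging. The key to the solution is CrossPoint-Bench, a hierarchically designed benchmark, together with CrossPoint-378K, a dataset of 378K question-answering pairs focused on actionable affordance regions that better reflect real-world interaction scenarios. CroPond, the model trained on this dataset, surpasses Gemini-2.5-Pro by 39.7% accuracy on CrossPoint-Bench, providing a new foundation for research on cross-view correspondence.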

Link: https://arxiv.org/abs/2512.04686
Authors: Yipu Wang,Yuheng Ji,Yuyang Liu,Enshen Zhou,Ziqiang Yang,Yuxuan Tian,Ziheng Qin,Yue Liu,Huajie Tan,Cheng Chi,Zhiyuan Ma,Daniel Dajun Zeng,Xiaolong Zheng
Affiliations: Institute of Automation, Chinese Academy of Sciences; School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Beihang University; Jilin University; National University of Singapore; Peking University; Beijing Academy of Artificial Intelligence; Huazhong University of Science And Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Cross-view correspondence is a fundamental capability for spatial understanding and embodied AI. However, it is still far from being realized in Vision-Language Models (VLMs), especially in achieving precise point-level correspondence, which is crucial for precise affordance interaction. We therefore propose the Cross-View Point Correspondence (CVPC) task and CrossPoint-Bench, a comprehensive benchmark with hierarchical design, inspired by the human cognitive process of “perceive”, “reason”, and “correspond”. Our evaluation shows that state-of-the-art models (e.g., Gemini-2.5-Pro) still fall far behind humans, with a gap of over 54.65% in overall accuracy, exposing a challenge in transitioning from coarse-grained judgement to fine-grained coordinate prediction. To address this problem, we construct CrossPoint-378K, a dataset with 378K question-answering pairs across 900 scenes, focused on actionable affordance regions that better reflect real-world manipulation and interaction scenarios. Furthermore, we propose CroPond, a model trained on the CrossPoint-378K dataset. CroPond achieves state-of-the-art performance on CrossPoint-Bench, surpassing Gemini-2.5-Pro by 39.7% accuracy, which offers a foundation for advancing future work on cross-view correspondence. The benchmark, dataset, and model are publicly available at this https URL.

[CV-58] Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation

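[Quick Read]: This paper addresses the reduced motion dynamics and copied initial frames that arise in streaming video generation when existing methods use static initial frames as sink tokens. The core solution is a new framework called Reward Forcing, with two key designs: (1) EMA-Sink, which continuously updates the sink tokens with tokens evicted from the sliding window via an exponential moving average (EMA), capturing both long-term context and recent dynamics without additional computation cost and avoiding over-reliance on the initial frames; and (2) Rewarded Distribution Matching Distillation (Re-DMD), which uses a vision-language model to rate dynamic content and steers the model to prioritize high-dynamics samples, significantly improving motion quality while preserving data fidelity.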

Link: https://arxiv.org/abs/2512.04678
Authors: Yunhong Lu,Yanhong Zeng,Haobo Li,Hao Ouyang,Qiuyu Wang,Ka Leong Cheng,Jiapeng Zhu,Hengyuan Cao,Zhipeng Zhang,Xing Zhu,Yujun Shen,Min Zhang
Affiliations: ZJU; Ant Group; SIAS-ZJU; HUST; SJTU
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Efficient streaming video generation is critical for simulating interactive and dynamic worlds. Existing methods distill few-step video diffusion models with sliding window attention, using initial frames as sink tokens to maintain attention performance and reduce error accumulation. However, video frames become overly dependent on these static tokens, resulting in copied initial frames and diminished motion dynamics. To address this, we introduce Reward Forcing, a novel framework with two key designs. First, we propose EMA-Sink, which maintains fixed-size tokens initialized from initial frames and continuously updated by fusing evicted tokens via exponential moving average as they exit the sliding window. Without additional computation cost, EMA-Sink tokens capture both long-term context and recent dynamics, preventing initial frame copying while maintaining long-horizon consistency. Second, to better distill motion dynamics from teacher models, we propose a novel Rewarded Distribution Matching Distillation (Re-DMD). Vanilla distribution matching treats every training sample equally, limiting the model’s ability to prioritize dynamic content. Instead, Re-DMD biases the model’s output distribution toward high-reward regions by prioritizing samples with greater dynamics rated by a vision-language model. Re-DMD significantly enhances motion quality while preserving data fidelity. We include both quantitative and qualitative experiments to show that Reward Forcing achieves state-of-the-art performance on standard benchmarks while enabling high-quality streaming video generation at 23.1 FPS on a single H100 GPU.
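
The EMA-Sink idea, fixed-size sink tokens that absorb tokens as they leave the sliding window, can be sketched as a plain exponential moving average. The function name `ema_sink_update`, the `decay` parameter, and the assumption that evicted tokens arrive with the same shape as the sink are illustrative; the abstract does not give the exact update rule.

```python
def ema_sink_update(sink_tokens, evicted_tokens, decay):
    """Fuse evicted tokens into fixed-size sink tokens with an EMA.

    sink_tokens, evicted_tokens: lists of equal-length float vectors
        (assumed to have matching shapes for this sketch).
    decay: hypothetical EMA coefficient in [0, 1]; values near 1 keep more
        of the long-term context, values near 0 follow recent dynamics.
    Returns the updated sink tokens as a new list.
    """
    return [
        [decay * s + (1.0 - decay) * e for s, e in zip(sink_vec, evicted_vec)]
        for sink_vec, evicted_vec in zip(sink_tokens, evicted_tokens)
    ]
```

With `decay = 1.0` the sink never changes, which corresponds to the static-sink behavior the paper argues against; any smaller value lets the sink drift with the video.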

[CV-59] Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length

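[Quick Read]: This paper addresses two core challenges that diffusion-based video generation methods face in practice: high latency and inefficiency caused by sequential computation, and degraded image quality and identity drift caused by poor long-horizon consistency. The authors propose Live Avatar, an algorithm-system co-designed framework. Its key innovation is Timestep-forcing Pipeline Parallelism (TPP), which pipelines denoising steps across multiple GPUs to break the autoregressive bottleneck and enable low-latency real-time streaming generation. They also propose the Rolling Sink Frame Mechanism (RSFM), which dynamically recalibrates appearance using a cached reference frame to improve temporal consistency and reduce color artifacts, enabling infinite-length audio-driven avatar generation with high fidelity.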

Link: https://arxiv.org/abs/2512.04677
Authors: Yubo Huang,Hailong Guo,Fangtai Wu,Shifeng Zhang,Shijie Huang,Qijun Gan,Lin Liu,Sirui Zhao,Enhong Chen,Jiaming Liu,Steven Hoi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Existing diffusion-based video generation methods are fundamentally constrained by sequential computation and long-horizon inconsistency, limiting their practical adoption in real-time, streaming audio-driven avatar synthesis. We present Live Avatar, an algorithm-system co-designed framework that enables efficient, high-fidelity, and infinite-length avatar generation using a 14-billion-parameter diffusion model. Our approach introduces Timestep-forcing Pipeline Parallelism (TPP), a distributed inference paradigm that pipelines denoising steps across multiple GPUs, effectively breaking the autoregressive bottleneck and ensuring stable, low-latency real-time streaming. To further enhance temporal consistency and mitigate identity drift and color artifacts, we propose the Rolling Sink Frame Mechanism (RSFM), which maintains sequence fidelity by dynamically recalibrating appearance using a cached reference image. Additionally, we leverage Self-Forcing Distribution Matching Distillation to facilitate causal, streamable adaptation of large-scale models without sacrificing visual quality. Live Avatar demonstrates state-of-the-art performance, reaching 20 FPS end-to-end generation on 5 H800 GPUs, and, to the best of our knowledge, is the first to achieve practical, real-time, high-fidelity avatar generation at this scale. Our work establishes a new paradigm for deploying advanced diffusion models in industrial long-form video synthesis applications.

[CV-60] I2I-Bench: A Comprehensive Benchmark Suite for Image-to-Image Editing Models

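[Quick Read]: This paper addresses the limitations of current image editing model evaluation, namely narrow task scopes, insufficient evaluation dimensions, and heavy reliance on manual annotation, which severely constrain the scalability and practical value of existing benchmarks. The key to the solution is I2I-Bench, a comprehensive benchmark for image-to-image editing models. Its main contributions are: (i) 10 diverse task categories covering both single-image and multi-image editing; (ii) 30 decoupled, fine-grained evaluation dimensions with automated hybrid evaluation methods that combine specialized tools and large multimodal models (LMMs); and (iii) rigorous alignment validation to ensure that evaluation results are consistent with human preferences.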

Link: https://arxiv.org/abs/2512.04660
Authors: Juntong Wang,Jiarui Wang,Huiyu Duan,Jiaxiang Kang,Guangtao Zhai,Xiongkuo Min
Affiliations: Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Image editing models are advancing rapidly, yet comprehensive evaluation remains a significant challenge. Existing image editing benchmarks generally suffer from limited task scopes, insufficient evaluation dimensions, and heavy reliance on manual annotations, which significantly constrain their scalability and practical applicability. To address this, we propose I2I-Bench, a comprehensive benchmark for image-to-image editing models, which features (i) diverse tasks, encompassing 10 task categories across both single-image and multi-image editing tasks, (ii) comprehensive evaluation dimensions, including 30 decoupled and fine-grained evaluation dimensions with automated hybrid evaluation methods that integrate specialized tools and large multimodal models (LMMs), and (iii) rigorous alignment validation, justifying the consistency between our benchmark evaluations and human preferences. Using I2I-Bench, we benchmark numerous mainstream image editing models, investigating the gaps and trade-offs between editing models across various dimensions. We will open-source all components of I2I-Bench to facilitate future research.

[CV-61] Rethinking Decoupled Knowledge Distillation: A Predictive Distribution Perspective

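[Quick Read]: This paper addresses the underuse of the teacher model's output logits in knowledge distillation (KD). Decoupled Knowledge Distillation (DKD) re-emphasized the importance of logit knowledge, but its underlying mechanisms still hold untapped potential. The key to the solution is to re-examine DKD from a predictive distribution perspective and to propose the Generalized Decoupled Knowledge Distillation (GDKD) loss, which improves knowledge transfer through a more flexible logit decoupling strategy. The paper reports two key insights: partitioning by the top logit considerably improves the interrelationship of non-top logits, and amplifying the focus on the distillation loss of non-top logits enhances knowledge extraction among them. Based on these insights, the authors design an efficient partition strategy to handle the multimodality of the teacher's predictive distribution, outperforming the original DKD and other leading distillation methods on multiple benchmarks.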

Link: https://arxiv.org/abs/2512.04625
Authors: Bowen Zheng,Ran Cheng
Affiliations: The Hong Kong Polytechnic University; The Hong Kong Polytechnic University Shenzhen Research Institute
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to IEEE TNNLS

Abstract:In the history of knowledge distillation, the focus shifted over time from logit-based to feature-based approaches. However, this transition has been revisited with the advent of Decoupled Knowledge Distillation (DKD), which re-emphasizes the importance of logit knowledge through advanced decoupling and weighting strategies. While DKD marks a significant advancement, its underlying mechanisms merit deeper exploration. As a response, we rethink DKD from a predictive distribution perspective. First, we introduce an enhanced version, the Generalized Decoupled Knowledge Distillation (GDKD) loss, which offers a more versatile method for decoupling logits. Then we pay particular attention to the teacher model’s predictive distribution and its impact on the gradients of GDKD loss, uncovering two critical insights often overlooked: (1) the partitioning by the top logit considerably improves the interrelationship of non-top logits, and (2) amplifying the focus on the distillation loss of non-top logits enhances the knowledge extraction among them. Utilizing these insights, we further propose a streamlined GDKD algorithm with an efficient partition strategy to handle the multimodality of teacher models’ predictive distribution. Our comprehensive experiments conducted on a variety of benchmarks, including CIFAR-100, ImageNet, Tiny-ImageNet, CUB-200-2011, and Cityscapes, demonstrate GDKD’s superior performance over both the original DKD and other leading knowledge distillation methods. The code is available at this https URL.
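
For background, the original DKD loss that this paper generalizes splits the classic KD objective into a target-class term (TCKD) and a non-target-class term (NCKD) and weights them independently. The sketch below follows that decomposition as it is commonly stated; it is not the GDKD loss, whose form the abstract does not give. The default `alpha`, `beta`, and `temperature` values are placeholders, and the temperature-squared scaling is the usual KD convention rather than something taken from this abstract.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled, numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def dkd_loss(student_logits, teacher_logits, target,
             alpha=1.0, beta=1.0, temperature=1.0):
    """Return (loss, tckd, nckd) for a single sample.

    tckd: KL between the binary (target, non-target) distributions.
    nckd: KL between the distributions over non-target classes only,
        each renormalized to sum to one.
    alpha, beta, temperature: placeholder values, not tuned settings.
    """
    p_student = softmax(student_logits, temperature)
    p_teacher = softmax(teacher_logits, temperature)
    b_student = [p_student[target], 1.0 - p_student[target]]
    b_teacher = [p_teacher[target], 1.0 - p_teacher[target]]
    tckd = kl_divergence(b_teacher, b_student)
    nt_student = [p / b_student[1] for i, p in enumerate(p_student) if i != target]
    nt_teacher = [p / b_teacher[1] for i, p in enumerate(p_teacher) if i != target]
    nckd = kl_divergence(nt_teacher, nt_student)
    loss = (alpha * tckd + beta * nckd) * temperature ** 2
    return loss, tckd, nckd
```

Classic KD corresponds to weighting NCKD by the teacher's non-target mass, that is `KL(p_teacher || p_student) = tckd + (1 - p_teacher[target]) * nckd`; decoupling removes that coupling so the non-target term can be emphasized directly.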

[CV-62] Denoise to Track: Harnessing Video Diffusion Priors for Robust Correspondence

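[Quick Read]: This paper addresses zero-shot point tracking, i.e., accurately localizing target points across video frames without annotated training data. Whereas conventional methods rely on supervised learning with large amounts of manually annotated data, the paper proposes HeFT (Head-Frequency Tracker), a framework built on pretrained video diffusion models. By analyzing the internal representations of the Video Diffusion Transformer (VDiT), the authors find that attention heads are functionally specialized for matching, semantic understanding, and positional encoding, and that low-frequency feature components are crucial for establishing correspondences while high-frequency components introduce noise. The key to the solution is a head- and frequency-aware feature selection strategy that jointly selects the most informative attention head and low-frequency components, combined with single-step denoising to extract discriminative features, soft-argmax localization, and forward-backward consistency checks. On the TAP-Vid benchmarks, the method approaches the accuracy of supervised methods while removing the need for annotated data.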

Link: https://arxiv.org/abs/2512.04619
Authors: Tianyu Yuan,Yuanbo Yang,Lin-Zhuo Chen,Yao Yao,Zhuzhong Qian
Affiliations: Nanjing University; Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In this work, we introduce HeFT (Head-Frequency Tracker), a zero-shot point tracking framework that leverages the visual priors of pretrained video diffusion models. To better understand how they encode spatiotemporal information, we analyze the internal representations of Video Diffusion Transformer (VDiT). Our analysis reveals that attention heads act as minimal functional units with distinct specializations for matching, semantic understanding, and positional encoding. Additionally, we find that the low-frequency components in VDiT features are crucial for establishing correspondences, whereas the high-frequency components tend to introduce noise. Building on these insights, we propose a head- and frequency-aware feature selection strategy that jointly selects the most informative attention head and low-frequency components to enhance tracking performance. Specifically, our method extracts discriminative features through single-step denoising, applies feature selection, and employs soft-argmax localization with forward-backward consistency checks for correspondence estimation. Extensive experiments on TAP-Vid benchmarks demonstrate that HeFT achieves state-of-the-art zero-shot tracking performance, approaching the accuracy of supervised methods while eliminating the need for annotated training data. Our work further underscores the promise of video diffusion models as powerful foundation models for a wide range of downstream tasks, paving the way toward unified visual foundation models.
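
The localization step named in the abstract, soft-argmax followed by a forward-backward consistency check, can be sketched on a precomputed similarity map. The functions `soft_argmax` and `is_consistent`, the `temperature`, and the `max_error` tolerance are hypothetical; feature extraction from the diffusion model and the head and frequency selection are not shown.

```python
import math


def soft_argmax(similarity, temperature=0.05):
    """Expected (row, col) under a softmax over a 2-D similarity map.

    similarity: list of rows of matching scores between one query feature
        and every location in the target frame (assumed to be given).
    temperature: hypothetical sharpness parameter; smaller values approach
        a hard argmax.
    """
    cells = [
        (r, c, score / temperature)
        for r, row in enumerate(similarity)
        for c, score in enumerate(row)
    ]
    peak = max(value for _, _, value in cells)
    weights = [(r, c, math.exp(value - peak)) for r, c, value in cells]
    total = sum(w for _, _, w in weights)
    row = sum(r * w for r, _, w in weights) / total
    col = sum(c * w for _, c, w in weights) / total
    return row, col


def is_consistent(query_point, backward_point, max_error=1.0):
    """Forward-backward check: tracking the match back to the source frame
    should land within max_error pixels of the original query point."""
    return math.dist(query_point, backward_point) <= max_error
```

In use, a point would be tracked forward with `soft_argmax`, tracked back from the predicted location, and kept only if `is_consistent` holds; points that fail are typically treated as occluded or unreliable.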

[CV-63] Malicious Image Analysis via Vision-Language Segmentation Fusion: Detection Element and Location in One-shot

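[Quick Read]: This paper addresses the lack of fine-grained localization and interpretability in illicit visual content detection: image-level NSFW labels alone do not meet moderation needs, which require identifying the specific objects that make an image illegal and where they are. The key to the solution is a zero-shot end-to-end pipeline. It first uses a foundation segmentation model (Segment Anything Model, SAM) to generate candidate object masks and refines them into independent regions. A Vision-Language Model (VLM) then scores each region for malicious relevance using open-vocabulary prompts, and the scores are fused into a consolidated malicious object map. Finally, an ensemble across multiple segmenters improves robustness against adaptive attacks that target any single segmentation method. The approach provides pixel-level localization, high recall, and strong robustness, and is presented as the first practical tool for fine-grained, explainable malicious-image moderation.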

Link: https://arxiv.org/abs/2512.04599
Authors: Sheng Hang,Chaoxiang He,Hongsheng Hu,Hanqing Hu,Bin Benjamin Zhu,Shi-Feng Sun,Dawu Gu,Shuo Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Detecting illicit visual content demands more than image-level NSFW flags; moderators must also know what objects make an image illegal and where those objects occur. We introduce a zero-shot pipeline that simultaneously (i) detects if an image contains harmful content, (ii) identifies each critical element involved, and (iii) localizes those elements with pixel-accurate masks, all in one pass. The system first applies a foundation segmentation model (SAM) to generate candidate object masks and refines them into larger independent regions. Each region is scored for malicious relevance by a vision-language model using open-vocabulary prompts; these scores weight a fusion step that produces a consolidated malicious object map. An ensemble across multiple segmenters hardens the pipeline against adaptive attacks that target any single segmentation method. Evaluated on a newly-annotated 790-image dataset spanning drug, sexual, violent and extremist content, our method attains 85.8% element-level recall, 78.1% precision and a 92.1% segment-success rate, exceeding direct zero-shot VLM localization by 27.4% recall at comparable precision. Against PGD adversarial perturbations crafted to break SAM and VLM, our method’s precision and recall decreased by no more than 10%, demonstrating high robustness against attacks. The full pipeline processes an image in seconds, plugs seamlessly into existing VLM workflows, and constitutes the first practical tool for fine-grained, explainable malicious-image moderation.
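
The score-weighted fusion step can be illustrated with a small sketch that combines region masks and per-region relevance scores into one map. The per-pixel maximum used here, the function names, and the threshold are assumptions made for illustration; the abstract states only that the scores weight the fusion, and the segmentation and VLM scoring stages are taken as given inputs.

```python
def fuse_region_scores(masks, scores, height, width):
    """Combine region masks into one relevance map (illustrative rule).

    masks: list of height x width grids of 0/1 values, one per region.
    scores: per-region relevance scores in [0, 1], assumed to come from a
        vision-language model.
    Each pixel takes the highest score among the regions covering it.
    """
    fused = [[0.0] * width for _ in range(height)]
    for mask, score in zip(masks, scores):
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    fused[y][x] = max(fused[y][x], score)
    return fused


def threshold_map(fused, threshold):
    """Binarize the fused map with a hypothetical decision threshold."""
    return [[value >= threshold for value in row] for row in fused]
```

An ensemble over several segmenters could then average or take the maximum of the fused maps from each one, which is one plausible reading of the hardening step described above.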

[CV-64] When Robots Should Say “I Don't Know”: Benchmarking Abstention in Embodied Question Answering

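[Quick Read]: This paper addresses the lack of an abstention capability in current Embodied Question Answering (EQA) systems: when information is insufficient or a question is underspecified, an agent should recognize this and withhold an answer rather than force an incorrect one. The key solution is a new dataset, AbstainEQA, built by having annotators transform well-posed OpenEQA questions into five representative categories that require abstention: actionability limitation, referential underspecification, preference dependence, information unavailability, and false presupposition. This allows systematic evaluation of whether models know when to decline. The study finds that the best frontier model attains only 42.79% abstention recall, far below the 91.17% achieved by humans, which highlights abstention as a foundation for reliable embodied interaction and indicates that scaling, prompting, and reasoning alone yield only marginal gains.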

Link: https://arxiv.org/abs/2512.04597
Authors: Tao Wu,Chuhao Zhou,Guangyu Zhao,Haozhi Cao,Yewen Pu,Jianfei Yang
Affiliations: Nanyang Technological University; Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments:

Abstract:Embodied Question Answering (EQA) requires an agent to interpret language, perceive its environment, and navigate within 3D scenes to produce responses. Existing EQA benchmarks assume that every question must be answered, but embodied agents should know when they do not have sufficient information to answer. In this work, we focus on a minimal requirement for EQA agents, abstention: knowing when to withhold an answer. From an initial study of 500 human queries, we find that 32.4% contain missing or underspecified context. Drawing on this initial study and cognitive theories of human communication errors, we derive five representative categories requiring abstention: actionability limitation, referential underspecification, preference dependence, information unavailability, and false presupposition. We augment OpenEQA by having annotators transform well-posed questions into ambiguous variants outlined by these categories. The resulting dataset, AbstainEQA, comprises 1,636 annotated abstention cases paired with 1,636 original OpenEQA instances for balanced evaluation. Evaluating on AbstainEQA, we find that even the best frontier model only attains 42.79% abstention recall, while humans achieve 91.17%. We also find that scaling, prompting, and reasoning only yield marginal gains, and that fine-tuned models overfit to textual cues. Together, these results position abstention as a fundamental prerequisite for reliable interaction in embodied settings and as a necessary basis for effective clarification.
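
Abstention recall, the headline metric above, can be read as ordinary recall over the cases where abstaining is the correct behavior. The sketch below computes it under that reading; the function name and the boolean encoding are assumptions, and the paper's exact scoring protocol (for example, how an abstention is detected in free-form answers) is not described in the abstract.

```python
def abstention_recall(should_abstain, did_abstain):
    """Fraction of abstention-required cases where the model abstained.

    should_abstain: list of booleans, True when the gold label says the
        question cannot be answered as posed.
    did_abstain: list of booleans, True when the model withheld an answer.
    Returns 0.0 when no case requires abstention.
    """
    required = sum(1 for gold in should_abstain if gold)
    hits = sum(
        1 for gold, pred in zip(should_abstain, did_abstain) if gold and pred
    )
    return hits / required if required else 0.0
```

Because AbstainEQA pairs each abstention case with an answerable OpenEQA original, recall should be read alongside behavior on the answerable half: a model that abstains on everything would score perfect recall while being useless.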

[CV-65] SAM3-I: Segment Anything with Instructions

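[Quick Read]: This paper addresses the limited ability of Segment Anything Model 3 (SAM3) to understand complex natural-language instructions in practice: SAM3 currently supports open-vocabulary segmentation only from short noun-phrase (NP) prompts and struggles with instructions that involve attributes, spatial relations, functionalities, actions, states, and implicit reasoning. The key to the solution is SAM3-I, whose core innovation is an instruction-aware cascaded adaptation mechanism that progressively aligns instruction semantics with SAM3's existing vision-language representations. This enables direct instruction-following segmentation without conversion by external multi-modal agents, while preserving SAM3's original concept-driven segmentation capability. The paper also defines a structured instruction taxonomy (concept, simple, complex) and develops a scalable data engine to build a dataset of diverse instruction-mask pairs, improving the model's ability to understand and follow complex semantics in real-world scenarios.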

Link: https://arxiv.org/abs/2512.04585
Authors: Jingjing Li,Yue Feng,Yuchen Guo,Jincai Huang,Yongri Piao,Qi Bi,Miao Zhang,Xiaoqi Zhao,Qiang Chen,Shihao Zou,Wei Ji,Huchuan Lu,Li Cheng
Affiliations: University of Alberta; NUAA; Northwestern University; SUSTech; Dalian University of Technology; Utrecht University; Yale University; Independent Researcher; SIAT, Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preliminary results; work in progress

Abstract:Segment Anything Model 3 (SAM3) has advanced open-vocabulary segmentation through promptable concept segmentation, allowing users to segment all instances corresponding to a given concept, typically specified with short noun-phrase (NP) prompts. While this marks the first integration of language-level concepts within the SAM family, real-world usage typically requires far richer expressions that include attributes, spatial relations, functionalities, actions, states, and even implicit reasoning over instances. Currently, SAM3 relies on external multi-modal agents to convert complex instructions into NPs and then conduct iterative mask filtering. However, these NP-level concepts remain overly coarse, often failing to precisely represent a specific instance. In this work, we present SAM3-I, an enhanced framework that unifies concept-level understanding and instruction-level reasoning within the SAM family. SAM3-I introduces an instruction-aware cascaded adaptation mechanism that progressively aligns expressive instruction semantics with SAM3’s existing vision-language representations, enabling direct instruction-following segmentation without sacrificing its original concept-driven capabilities. Furthermore, we design a structured instruction taxonomy spanning concept, simple, and complex levels, and develop a scalable data engine to construct a dataset with diverse instruction-mask pairs. Experiments show that SAM3-I delivers appealing performance, demonstrating that SAM3 can be effectively extended to follow natural-language instructions while preserving its strong concept grounding. We open-source SAM3-I and provide practical fine-tuning workflows, enabling researchers to adapt it to domain-specific applications. The source code is available here.

[CV-66] Infrared UAV Target Tracking with Dynamic Feature Refinement and Global Contextual Attention Knowledge Distillation

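[Quick Read]: This paper addresses the low tracking accuracy caused by the weak features of infrared UAV targets (IRUT) and their susceptibility to interference in complex backgrounds. The key to the solution is SiamDFF, a dynamic feature fusion Siamese network with three main contributions: (1) a selective target enhancement network (STEN) that uses intensity-aware multi-head cross-attention to adaptively enhance important regions in both the template and search branches; (2) a dynamic spatial feature aggregation module (DSFAM) and a dynamic channel feature aggregation module (DCFAM), which combine local details with global context to fuse multi-scale features and reduce background interference; and (3) a tracking-specific, target-aware contextual attention knowledge distillation mechanism that transfers target priors from the teacher network to the student model without adding computational burden, improving the focus on informative regions at every level of feature extraction. Experiments show that the method outperforms state-of-the-art trackers in complex scenes while running in real time.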

Link: https://arxiv.org/abs/2512.04581
Authors: Houzhang Fang,Chenxing Wu,Kun Bai,Tianqi Chen,Xiaolin Wang,Xiyang Liu,Yi Chang,Luxin Yan
Affiliations: Xidian University; Xi’an Modern Control Technology Research Institute; Huazhong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE TMM

Abstract:Unmanned aerial vehicle (UAV) target tracking based on thermal infrared imaging has been one of the most important sensing technologies in anti-UAV applications. However, the infrared UAV targets often exhibit weak features and complex backgrounds, posing significant challenges to accurate tracking. To address these problems, we introduce SiamDFF, a novel dynamic feature fusion Siamese network that integrates feature enhancement and global contextual attention knowledge distillation for infrared UAV target (IRUT) tracking. The SiamDFF incorporates a selective target enhancement network (STEN), a dynamic spatial feature aggregation module (DSFAM), and a dynamic channel feature aggregation module (DCFAM). The STEN employs intensity-aware multi-head cross-attention to adaptively enhance important regions for both template and search branches. The DSFAM enhances multi-scale UAV target features by integrating local details with global features, utilizing spatial attention guidance within the search frame. The DCFAM effectively integrates the mixed template generated from STEN in the template branch and original template, avoiding excessive background interference with the template and thereby enhancing the emphasis on UAV target region features within the search frame. Furthermore, to enhance the feature extraction capabilities of the network for IRUT without adding extra computational burden, we propose a novel tracking-specific target-aware contextual attention knowledge distiller. It transfers the target prior from the teacher network to the student model, significantly improving the student network’s focus on informative regions at each hierarchical level of the backbone network. Extensive experiments on real infrared UAV datasets demonstrate that the proposed approach outperforms state-of-the-art target trackers under complex backgrounds while achieving a real-time tracking speed.

[CV-67] TARDis: Time Attenuated Representation Disentanglement for Incomplete Multi-Modal Tumor Segmentation and Classification

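[Quick Read]: This paper addresses the impact of missing multi-phase data (the "missing modality" problem), caused by scanning limitations or radiation concerns, on tumor segmentation and diagnosis in contrast-enhanced CT. Conventional deep learning methods treat missing phases as absent independent channels and ignore the hemodynamic continuity of contrast agents over time. The key to the solution is Time Attenuated Representation Disentanglement (TARDis), a physics-aware framework that recasts missing modalities as missing sample points on a continuous Time-Attenuation Curve and explicitly disentangles the latent feature space with a dual-path architecture: a path based on a quantized embedding dictionary extracts time-invariant static anatomy, and a probabilistic path based on a Conditional Variational Autoencoder models dynamic enhancement conditioned on the estimated scan time. This allows the network to plausibly hallucinate missing hemodynamic features and significantly improves segmentation and diagnosis in extreme data-sparsity scenarios.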

Link: https://arxiv.org/abs/2512.04576
Authors: Zishuo Wan,Qinqin Kang,Yi Huang,Yun Bian,Dawei Ding,Ke Yan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Tumor segmentation and diagnosis in contrast-enhanced Computed Tomography (CT) rely heavily on the physiological dynamics of contrast agents. However, obtaining a complete multi-phase series is often clinically unfeasible due to radiation concerns or scanning limitations, leading to the “missing modality” problem. Existing deep learning approaches typically treat missing phases as absent independent channels, ignoring the inherent temporal continuity of hemodynamics. In this work, we propose Time Attenuated Representation Disentanglement (TARDis), a novel physics-aware framework that redefines missing modalities as missing sample points on a continuous Time-Attenuation Curve. TARDis explicitly disentangles the latent feature space into a time-invariant static component (anatomy) and a time-dependent dynamic component (perfusion). We achieve this via a dual-path architecture: a quantization-based path using a learnable embedding dictionary to extract consistent anatomical structures, and a probabilistic path using a Conditional Variational Autoencoder to model dynamic enhancement conditioned on the estimated scan time. This design allows the network to hallucinate missing hemodynamic features by sampling from the learned latent distribution. Extensive experiments on a large-scale private abdominal CT dataset (2,282 cases) and two public datasets demonstrate that TARDis significantly outperforms state-of-the-art incomplete modality frameworks. Notably, our method maintains robust diagnostic performance even in extreme data-sparsity scenarios, highlighting its potential for reducing radiation exposure while maintaining diagnostic precision.

[CV-68] Prompt2Craft: Generating Functional Craft Assemblies with LLMs

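[Quick Read]: This paper addresses the Craft Assembly Task, specifically how to select the best subset of the objects available in a scene to build an accurate representation of a target object. The core challenge is that the input is an in-the-wild RGB image, so available objects that do not directly correspond to the target's parts must be combined into the target structure through visual recognition, template matching, and geometric simplification. The key steps of the solution are as follows: first, a mask segmentation neural network identifies the visible parts; next, labeled template meshes are matched through pose optimization; the parts of the template mesh are then simplified into primitive shapes such as cuboids or cylinders; finally, a search algorithm based on local and global proportions finds correspondences in the scene. The approach is compared against baselines that consider all possible combinations and is demonstrated in a real-world scenario.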

Link: https://arxiv.org/abs/2512.04568
Authors: Vitor Hideyo Isume,Takuya Kiyokawa,Natsuki Yamanobe,Yukiyasu Domae,Weiwei Wan,Kensuke Harada
Affiliations: Osaka University; National Institute of Advanced Industrial Science and Technology (AIST)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Inspired by traditional handmade crafts, where a person improvises assemblies based on the available objects, we formally introduce the Craft Assembly Task. It is a robotic assembly task that involves building an accurate representation of a given target object using the available objects, which do not directly correspond to its parts. In this work, we focus on selecting the subset of available objects for the final craft, when the given input is an RGB image of the target in the wild. We use a mask segmentation neural network to identify visible parts, followed by retrieving labeled template meshes. These meshes undergo pose optimization to determine the most suitable template. Then, we propose to simplify the parts of the transformed template mesh to primitive shapes like cuboids or cylinders. Finally, we design a search algorithm to find correspondences in the scene based on local and global proportions. We develop baselines for comparison that consider all possible combinations, and choose the highest scoring combination for common metrics used in foreground maps and mask accuracy. Our approach achieves comparable results to the baselines for two different scenes, and we show qualitative results for an implementation in a real-world scenario.

[CV-69] Dataset creation for supervised deep learning-based analysis of microscopic images - review of important considerations and recommendations

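Quick read: This review addresses the poor generalization and reproducibility of deep learning (DL) models for pathology image analysis that result from the scarcity of high-quality, large-scale datasets. Its core contribution is a systematic treatment of the dataset creation workflow: (1) handling image acquisition so that sources of domain shift, such as differences in slide preparation and digitization, are adequately represented; (2) annotation quality control built on the three "C"s of correctness, completeness, and consistency; (3) multi-annotator workflows and advanced annotation techniques that mitigate the limitations of a single annotator; and (4) open datasets to drive innovation and model validation. A standard operating procedure (SOP) is provided as supplemental material, offering practical guidance for building reliable, scalable datasets and thereby improving the robustness and clinical applicability of DL models.

Link: https://arxiv.org/abs/2512.04564
Authors: Christof A. Bertram, Viktoria Weiss, Jonas Ammeling, F. Maria Schabel, Taryn A. Donovan, Frauke Wilm, Christian Marzahl, Katharina Breininger, Marc Aubreville
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: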

Abstract:Supervised deep learning (DL) has attracted great interest for automated analysis of microscopic images, with an increasing body of literature supporting its potential. The development and validation of these DL models rely heavily on the availability of high-quality, large-scale datasets. However, creating such datasets is a complex and resource-intensive process, often hindered by challenges such as time constraints, domain variability, and risks of bias in image collection and label creation. This review provides a comprehensive guide to the critical steps in dataset creation, including: 1) image acquisition, 2) selection of annotation software, and 3) annotation creation. In addition to ensuring a sufficiently large number of images, it is crucial to address sources of image variability (domain shifts), such as those related to slide preparation and digitization, that could lead to algorithmic errors if not adequately represented in the training data. Key quality criteria for annotations are the three "C"s: correctness, completeness, and consistency. This review explores methods to enhance annotation quality through the use of advanced techniques that mitigate the limitations of single annotators. To support dataset creators, a standard operating procedure (SOP) is provided as supplemental material, outlining best practices for dataset development. Furthermore, the article underscores the importance of open datasets in driving innovation and enhancing reproducibility of DL research. By addressing the challenges and offering practical recommendations, this review aims to advance the creation and availability of high-quality, large-scale datasets, ultimately contributing to the development of generalizable and robust DL models for pathology applications.

[CV-70] COOPER: A Unified Model for Cooperative Perception and Reasoning in Spatial Intelligence

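Quick read: This paper addresses the limited 3D spatial reasoning of Multimodal Large Language Models (MLLMs), in particular their weak grasp of object properties and spatial relationships. Existing methods treat perception enhancement (adding auxiliary modalities such as depth and segmentation maps) and reasoning optimization (training on spatial VQA datasets or applying reinforcement learning) separately, which makes joint improvement difficult. The paper proposes COOPER, a unified MLLM trained in two stages: the first stage learns to generate the auxiliary modalities (depth and segmentation), and the second learns adaptive interleaved reasoning. The model can thus internalize spatial knowledge and call on auxiliary information during reasoning. Experiments show an average 6.91% improvement on spatial reasoning, and a variant trained only for auxiliary modality generation gains 7.92% on distance and size estimation.

Link: https://arxiv.org/abs/2512.04563
Authors: Zefeng Zhang, Xiangzhao Hao, Hengzhu Tang, Zhenyu Zhang, Jiawei Sheng, Xiaodong Li, Zhenyang Li, Li Gao, Daiting Shi, Dawei Yin, Tingwen Liu
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences; Baidu Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: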

Abstract:Visual Spatial Reasoning is crucial for enabling Multimodal Large Language Models (MLLMs) to understand object properties and spatial relationships, yet current models still struggle with 3D-aware reasoning. Existing approaches typically enhance either perception, by augmenting RGB inputs with auxiliary modalities such as depth and segmentation, or reasoning, by training on spatial VQA datasets and applying reinforcement learning, and thus treat these two aspects in isolation. In this work, we investigate whether a unified MLLM can develop an intrinsic ability to enhance spatial perception and, through adaptive interleaved reasoning, achieve stronger spatial intelligence. We propose COOPER, a unified MLLM that leverages depth and segmentation as auxiliary modalities and is trained in two stages to acquire auxiliary modality generation and adaptive, interleaved reasoning capabilities. COOPER achieves an average 6.91% improvement in spatial reasoning while maintaining general performance. Moreover, even a variant trained only for auxiliary modality generation attains a 7.92% gain on distance and size estimation, suggesting that learning to generate auxiliary modalities helps internalize spatial knowledge and strengthen spatial understanding.

[CV-71] Efficient Spatially-Variant Convolution via Differentiable Sparse Kernel Complex

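Quick read: This paper addresses the high computational cost of convolving images with complex dense kernels on resource-limited devices. Existing approximations such as simulated annealing or low-rank decomposition are either inefficient or fail to capture non-convex kernels. The key idea is a differentiable kernel decomposition framework that represents a target spatially-variant, dense, complex kernel with a set of sparse kernel samples, combining differentiable optimization, a dedicated initialization for non-convex shapes, and a kernel-space interpolation scheme that enables spatially varying filtering without retraining. The method keeps fidelity high while substantially reducing compute, and remains fully differentiable for use in learning pipelines.

Link: https://arxiv.org/abs/2512.04556
Authors: Zhizhen Wu, Zhe Cao, Yuchi Huo
Affiliations: State Key Lab of CAD&CG, Zhejiang University; Zhejiang Lab
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 7 figures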

Abstract:Image convolution with complex kernels is a fundamental operation in photography, scientific imaging, and animation effects, yet direct dense convolution is computationally prohibitive on resource-limited devices. Existing approximations, such as simulated annealing or low-rank decompositions, either lack efficiency or fail to capture non-convex kernels. We introduce a differentiable kernel decomposition framework that represents a target spatially-variant, dense, complex kernel using a set of sparse kernel samples. Our approach features (i) a decomposition that enables differentiable optimization of sparse kernels, (ii) a dedicated initialization strategy for non-convex shapes to avoid poor local minima, and (iii) a kernel-space interpolation scheme that extends single-kernel filtering to spatially varying filtering without retraining and additional runtime overhead. Experiments on Gaussian and non-convex kernels show that our method achieves higher fidelity than simulated annealing and significantly lower cost than low-rank decompositions. Our approach provides a practical solution for mobile imaging and real-time rendering, while remaining fully differentiable for integration into broader learning pipelines.
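To make the cost argument concrete, the sketch below compares a dense 1-D convolution with one that uses only a few kernel samples. It is an illustration under simplifying assumptions, not the paper's method: the sparse samples are picked by keeping the largest taps (`sparsify`, a hypothetical helper), whereas the paper optimizes them differentiably, and everything is 1-D for brevity.

```python
def dense_conv(signal, kernel):
    """Reference: centered dense correlation with zero padding."""
    radius = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - radius
            if 0 <= j < len(signal):
                acc += w * signal[j]
        out.append(acc)
    return out


def sparse_conv(signal, samples):
    """samples: list of (offset, weight); cost scales with len(samples)."""
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for offset, w in samples:
            j = i + offset
            if 0 <= j < len(signal):
                acc += w * signal[j]
        out.append(acc)
    return out


def sparsify(kernel, keep):
    """Keep the `keep` largest-magnitude taps and rescale to preserve the sum.
    Stand-in for a learned decomposition; chosen only for simplicity."""
    radius = len(kernel) // 2
    taps = sorted(enumerate(kernel), key=lambda t: -abs(t[1]))[:keep]
    scale = sum(kernel) / sum(w for _, w in taps)
    return [(k - radius, w * scale) for k, w in taps]


kernel = [0.0, 0.5, 0.0, 0.5, 0.0]      # a two-lobed, non-convex kernel
signal = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
samples = sparsify(kernel, keep=2)       # 2 taps instead of 5
approx = sparse_conv(signal, samples)
```

The per-pixel work drops from the kernel size to the number of samples; for kernels that are not already sparse, the quality of the result depends entirely on how the samples are chosen, which is the part the paper optimizes.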

[CV-72] Counterfeit Answers: Adversarial Forgery against OCR-Free Document Visual Question Answering

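Quick read: This paper examines the vulnerability of Document Visual Question Answering (DocVQA) systems to adversarial attacks, where small but semantically targeted changes to document content cause wrong answers. It introduces a new attack scenario that forges document content in a visually imperceptible yet semantically targeted way, inducing either specific or generally incorrect answers. Specialized attack algorithms generate adversarial documents for different attacker goals, from targeted misinformation to systematic model failure, and are shown to be effective against two state-of-the-art end-to-end models, Pix2Struct and Donut, exposing critical security weaknesses in current DocVQA systems.

Link: https://arxiv.org/abs/2512.04554
Authors: Marco Pintore, Maura Pintor, Dimosthenis Karatzas, Battista Biggio
Affiliations: University of Cagliari; Institute for Advanced Studies; EPFL
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: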

Abstract:Document Visual Question Answering (DocVQA) enables end-to-end reasoning grounded on information present in a document input. While recent models have shown impressive capabilities, they remain vulnerable to adversarial attacks. In this work, we introduce a novel attack scenario that aims to forge document content in a visually imperceptible yet semantically targeted manner, allowing an adversary to induce specific or generally incorrect answers from a DocVQA model. We develop specialized attack algorithms that can produce adversarially forged documents tailored to different attackers’ goals, ranging from targeted misinformation to systematic model failure scenarios. We demonstrate the effectiveness of our approach against two end-to-end state-of-the-art models: Pix2Struct, a vision-language transformer that jointly processes image and text through sequence-to-sequence modeling, and Donut, a transformer-based model that directly extracts text and answers questions from document images. Our findings highlight critical vulnerabilities in current DocVQA systems and call for the development of more robust defenses.

[CV-73] Gaussian Entropy Fields: Driving Adaptive Sparsity in 3D Gaussian Optimization

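Quick read: This paper addresses the limited surface reconstruction accuracy of 3D Gaussian Splatting (3DGS) in novel view synthesis, aiming to improve geometric accuracy while preserving image fidelity. It proposes GEF (Gaussian Entropy Fields), an optimization framework based on configurational entropy: minimizing the entropy of the primitive distribution yields sharper surface geometry. The three components are (1) entropy-driven surface modeling via entropy minimization; (2) adaptive spatial regularization using the Surface Neighborhood Redundancy Index (SNRI) with image-entropy-guided weighting to suppress redundant components; and (3) cross-scale entropy alignment to preserve multi-scale geometric consistency. The method reports a Chamfer Distance of 0.64 on DTU and an F1 score of 0.44 on T&T, together with the best SSIM (0.855) and LPIPS (0.136) among baselines on Mip-NeRF 360, indicating better surface reconstruction without loss of photometric fidelity.

Link: https://arxiv.org/abs/2512.04542
Authors: Hong Kuang, Jianchen Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 28 pages, 11 figures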

Abstract:3D Gaussian Splatting (3DGS) has emerged as a leading technique for novel view synthesis, demonstrating exceptional rendering efficiency. Well-reconstructed surfaces can be characterized by low configurational entropy, where dominant primitives clearly define surface geometry while redundant components are suppressed. Three complementary technical contributions are introduced: (1) entropy-driven surface modeling via entropy minimization for low configurational entropy in primitive distributions; (2) adaptive spatial regularization using the Surface Neighborhood Redundancy Index (SNRI) and image entropy-guided weighting; (3) multi-scale geometric preservation through competitive cross-scale entropy alignment. Extensive experiments demonstrate that GEF achieves competitive geometric precision on DTU and T&T benchmarks, while delivering superior rendering quality compared to existing methods on Mip-NeRF 360. Notably, superior Chamfer Distance (0.64) on DTU and F1 score (0.44) on T&T are obtained, alongside the best SSIM (0.855) and LPIPS (0.136) among baselines on Mip-NeRF 360, validating the framework’s ability to enhance surface reconstruction accuracy without compromising photometric fidelity.
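The notion of low configurational entropy can be illustrated with a small calculation. The sketch below assumes the entropy is the Shannon entropy of normalized per-primitive contribution weights (for example, blending weights of the Gaussians hit by one ray); the paper's exact definition may differ, and `configurational_entropy` is a hypothetical name.

```python
import math


def configurational_entropy(weights):
    """Shannon entropy (in nats) of weights normalized to sum to one.

    Assumption: `weights` are non-negative contributions of primitives,
    e.g. along a single ray. Zero weights contribute nothing.
    """
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)


# One dominant primitive versus several equally weighted ones.
sharp = configurational_entropy([0.9, 0.05, 0.05])
diffuse = configurational_entropy([1.0, 1.0, 1.0])
```

A ray dominated by a single primitive gives an entropy near zero, while a ray smeared over many primitives approaches the logarithm of their count, so minimizing this quantity pushes toward the "dominant primitives, suppressed redundancy" configuration the abstract describes.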

[CV-74] VideoMem: Enhancing Ultra-Long Video Understanding via Adaptive Memory Management

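Quick read: This paper tackles ultra-long video understanding, where existing Vision Language Models (VLMs) degrade because of limited context length and inefficient long-term memory. It proposes VideoMem, which casts long video understanding as a sequential generation task and uses adaptive memory management to update a global memory buffer dynamically, retaining critical information and discarding redundant content. To train VLMs efficiently for such long-horizon tasks, it introduces the Progressive Grouped Relative Policy Optimization (PRPO) algorithm with two modules: Progressive State Propagation (PSP), which adaptively retains valid current states, passes them to the next rollout step, and gradually narrows the exploration space; and Temporal Cascading Reward (TCR), which alleviates reward sparsity, improves sample utilization, and accelerates convergence.

Link: https://arxiv.org/abs/2512.04540
Authors: Hongbo Jin, Qingyuan Wang, Wenhao Zhang, Yang Liu, Sijie Cheng
Affiliations: Peking University; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: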

Abstract:Ultra-long video understanding remains an open challenge, as existing vision language models (VLMs) falter on such content due to limited context length and inefficient long-term memory retention. To address this, recent works have attempted to construct external knowledge bases and corresponding retrieval augmented generation (RAG) systems, yet these incur enormous storage and computational overhead. In this paper, we propose VideoMem, a novel framework that pioneers modeling long video understanding as a sequential generation task via adaptive memory management. Specifically, VideoMem dynamically updates a global memory buffer, which adaptively retains critical information while discarding redundant content across the video timeline. To efficiently train VLMs for such long-term tasks, VideoMem integrates the Progressive Grouped Relative Policy Optimization (PRPO) algorithm, equipped with two core modules: Progressive State Propagation (PSP) adaptively retains valid current states, propagates them to the next rollout step, and gradually narrows the model exploration space. Temporal Cascading Reward (TCR) further alleviates reward sparsity, improving sample utilization and accelerating convergence. Extensive experiments demonstrate that VideoMem significantly outperforms existing open-source models across diverse benchmarks for ultra-long video understanding tasks.
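The global memory buffer described above can be pictured as a fixed-capacity store that is refreshed as each video segment arrives. The sketch below is a simplified stand-in: it assumes each entry already carries an importance score, whereas in VideoMem the model itself decides what to retain. The function name and tuple layout are hypothetical.

```python
def update_memory(memory, new_items, capacity):
    """Merge new entries into the buffer and keep the most important ones.

    memory / new_items: lists of (timestamp, importance, payload) tuples.
    Returns at most `capacity` entries, restored to temporal order.
    """
    merged = memory + new_items
    kept = sorted(merged, key=lambda item: item[1], reverse=True)[:capacity]
    return sorted(kept, key=lambda item: item[0])


memory = [(0, 0.9, "intro scene"), (1, 0.2, "static shot")]
segment = [(2, 0.5, "new character"), (3, 0.1, "repeated shot")]
memory = update_memory(memory, segment, capacity=2)
```

Keeping the buffer at a fixed size is what bounds the context length regardless of video duration; restoring temporal order after selection keeps the retained entries readable as a timeline.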

[CV-75] X-Humanoid: Robotize Human Videos to Generate Humanoid Videos at Scale

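Quick read: This paper addresses the scarcity of large-scale, diverse training data for Vision-Language-Action (VLA) models and world models in embodied AI. Existing approaches that "robotize" web videos work only on egocentric footage and cannot handle full-body motion or occlusion in third-person videos. The proposed X-Humanoid is a generative video editing framework that adapts the Wan 2.2 model into a video-to-video structure and finetunes it on a self-built dataset of 17+ hours of paired synthetic human-humanoid videos to translate human videos into humanoid videos. Applied to 60 hours of Ego-Exo4D videos, it yields a dataset of over 3.6 million "robotized" frames; in user studies, 69% of users rated it best for motion consistency and 62.1% for embodiment correctness.

Link: https://arxiv.org/abs/2512.04537
Authors: Pei Yang, Hai Ci, Yiren Song, Mike Zheng Shou
Affiliations: National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: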

Abstract:The advancement of embodied AI has unlocked significant potential for intelligent humanoid robots. However, progress in both Vision-Language-Action (VLA) models and world models is severely hampered by the scarcity of large-scale, diverse training data. A promising solution is to “robotize” web-scale human videos, which has been proven effective for policy training. However, these solutions mainly “overlay” robot arms onto egocentric videos and cannot handle complex full-body motions and scene occlusions in third-person videos, making them unsuitable for robotizing humans. To bridge this gap, we introduce X-Humanoid, a generative video editing approach that adapts the powerful Wan 2.2 model into a video-to-video structure and finetunes it for the human-to-humanoid translation task. This finetuning requires paired human-humanoid videos, so we designed a scalable data creation pipeline, turning community assets into 17+ hours of paired synthetic videos using Unreal Engine. We then apply our trained model to 60 hours of the Ego-Exo4D videos, generating and releasing a new large-scale dataset of over 3.6 million “robotized” humanoid video frames. Quantitative analysis and user studies confirm our method’s superiority over existing baselines: 69% of users rated it best for motion consistency, and 62.1% for embodiment correctness.

[CV-76] Detection of Intoxicated Individuals from Facial Video Sequences via a Recurrent Fusion Model

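Quick read: This paper targets the public safety risks of alcohol abuse by detecting intoxication non-invasively. It proposes a video-based facial sequence analysis method that fuses facial landmark features extracted by a Graph Attention Network (GAT) with spatiotemporal visual features from a 3D ResNet, using a dynamic adaptive weighting mechanism for the fusion to improve classification performance.

Link: https://arxiv.org/abs/2512.04536
Authors: Bita Baroutian, Atefe Aghaei, Mohsen Ebrahimi Moghaddam
Affiliations: Shahid Beheshti University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: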

Abstract:Alcohol consumption is a significant public health concern and a major cause of accidents and fatalities worldwide. This study introduces a novel video-based facial sequence analysis approach dedicated to the detection of alcohol intoxication. The method integrates facial landmark analysis via a Graph Attention Network (GAT) with spatiotemporal visual features extracted using a 3D ResNet. These features are dynamically fused with adaptive prioritization to enhance classification performance. Additionally, we introduce a curated dataset comprising 3,542 video segments derived from 202 individuals to support training and evaluation. Our model is compared against two baselines: a custom 3D-CNN and a VGGFace+LSTM architecture. Experimental results show that our approach achieves 95.82% accuracy, 0.977 precision, and 0.97 recall, outperforming prior methods. The findings demonstrate the model’s potential for practical deployment in public safety systems for non-invasive, reliable alcohol intoxication detection.
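The phrase "dynamically fused with adaptive prioritization" suggests per-sample weights over the two feature branches. A minimal sketch of one common form of this, assuming a softmax gate over two scalar branch scores, is shown below. The gate design, the function name `adaptive_fuse`, and the idea that scores come from a small learned layer are assumptions; the abstract does not specify the mechanism.

```python
import math


def adaptive_fuse(feat_a, feat_b, score_a, score_b):
    """Softmax the two branch scores into weights, then blend the features.

    feat_a / feat_b: equal-length feature vectors (e.g. landmark-graph and
    3D-CNN embeddings). score_a / score_b: scalar relevance scores, which a
    real model would predict from the features themselves.
    """
    m = max(score_a, score_b)  # subtract the max for numerical stability
    ea, eb = math.exp(score_a - m), math.exp(score_b - m)
    wa, wb = ea / (ea + eb), eb / (ea + eb)
    fused = [wa * a + wb * b for a, b in zip(feat_a, feat_b)]
    return fused, (wa, wb)


fused, weights = adaptive_fuse([1.0, 0.0], [0.0, 1.0], score_a=2.0, score_b=0.0)
```

With equal scores the branches are averaged; as one score grows, the fused vector approaches that branch's features, so the network can lean on landmarks or on appearance depending on the clip.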

[CV-77] Refaçade: Editing Object with Given Reference Texture

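Quick read: This paper addresses the limited controllability of local texture transfer (Object Retexture) in images and videos: how to transfer the texture of a reference object onto a target object precisely, without leaking the reference's structure. ControlNet-based solutions condition directly on the raw reference image, which introduces unwanted structural information, and they fail to separate texture from structure in the source. Refaçade has two key designs. First, a texture remover trained on paired textured/untextured 3D mesh renderings strips appearance while preserving geometry and motion. Second, a jigsaw permutation disrupts the global layout of the reference image so the model attends to local texture statistics rather than overall composition.

Link: https://arxiv.org/abs/2512.04534
Authors: Youze Huang(1), Penghui Ruan(2), Bojia Zi(3), Xianbiao Qi(4), Jianan Wang(5), Rong Xiao(4)
Affiliations: (1) University of Electronic Science and Technology of China; (2) The Hong Kong Polytechnic University; (3) The Chinese University of Hong Kong; (4) IntelliFusion Inc.; (5) Astribot Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: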

Abstract:Recent advances in diffusion models have brought remarkable progress in image and video editing, yet some tasks remain underexplored. In this paper, we introduce a new task, Object Retexture, which transfers local textures from a reference object to a target object in images or videos. To perform this task, a straightforward solution is to use ControlNet conditioned on the source structure and the reference texture. However, this approach suffers from limited controllability for two reasons: conditioning on the raw reference image introduces unwanted structural information, and it fails to disentangle the visual texture and structure information of the source. To address this problem, we propose Refaçade, a method that consists of two key designs to achieve precise and controllable texture transfer in both images and videos. First, we employ a texture remover trained on paired textured/untextured 3D mesh renderings to remove appearance information while preserving the geometry and motion of source videos. Second, we disrupt the reference global layout using a jigsaw permutation, encouraging the model to focus on local texture statistics rather than the global layout of the object. Extensive experiments demonstrate superior visual quality, precise editing, and controllability, outperforming strong baselines in both quantitative and human evaluations. Code is available at this https URL.
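The jigsaw permutation mentioned above is easy to picture in code: cut the reference image into tiles and shuffle them, so local texture survives while the global layout does not. The sketch below works on a 2-D list of pixel values; the tile size, the seeding, and the function name `jigsaw_permute` are illustrative choices, not details from the paper.

```python
import random


def jigsaw_permute(image, patch, seed=0):
    """Shuffle non-overlapping patch x patch tiles of a 2-D image (list of rows)."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    rows, cols = h // patch, w // patch
    # Cut the image into tiles, stored in row-major order.
    tiles = [[row[c * patch:(c + 1) * patch]
              for row in image[r * patch:(r + 1) * patch]]
             for r in range(rows) for c in range(cols)]
    random.Random(seed).shuffle(tiles)
    # Paste the shuffled tiles back onto the same grid.
    out = []
    for r in range(rows):
        for y in range(patch):
            out.append([px for c in range(cols) for px in tiles[r * cols + c][y]])
    return out


image = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4x4 toy "image"
shuffled = jigsaw_permute(image, patch=2, seed=1)
```

Every pixel and every tile is preserved, only tile positions change, which is why a model conditioned on the result can still read texture statistics but can no longer copy the reference object's shape.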

[CV-78] PhyVLLM: Physics-Guided Video Language Model with Motion-Appearance Disentanglement

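Quick read: This paper addresses the weak performance of Video Large Language Models (Video LLMs) on tasks that require understanding physical dynamics, a limitation rooted in reliance on appearance matching without explicit modeling of object motion and physical laws. The proposed PhyVLLM framework disentangles visual appearance from object motion with a dual-branch encoder and adds a Neural Ordinary Differential Equation (Neural ODE) module that produces differentiable representations of physical dynamics in continuous time. A self-supervised scheme avoids costly physical labels, and the motion-aware representations are projected into the token space of a pretrained LLM, enabling physics reasoning without harming the model's original multimodal capabilities.

Link: https://arxiv.org/abs/2512.04532
Authors: Yu-Wei Zhan, Xin Wang, Hong Chen, Tongtong Feng, Wei Feng, Ren Wang, Guangyao Li, Qing Li, Wenwu Zhu
Affiliations: Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: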

Abstract:Video Large Language Models (Video LLMs) have shown impressive performance across a wide range of video-language tasks. However, they often fail in scenarios requiring a deeper understanding of physical dynamics. This limitation primarily arises from their reliance on appearance-based matching. Incorporating physical motion modeling is crucial for deeper video understanding, but presents three key challenges: (1) motion signals are often entangled with appearance variations, making it difficult to extract clean physical cues; (2) effective motion modeling requires not only continuous-time motion representations but also capturing physical dynamics; and (3) collecting accurate annotations for physical attributes is costly and often impractical. To address these issues, we propose PhyVLLM, a physics-guided video-language framework that explicitly incorporates physical motion into Video LLMs. Specifically, PhyVLLM disentangles visual appearance and object motion through a dual-branch encoder. To model physical dynamics over time, we incorporate a Neural Ordinary Differential Equation (Neural ODE) module, which generates differentiable physical dynamic representations. The resulting motion-aware representations are projected into the token space of a pretrained LLM, enabling physics reasoning without compromising the model’s original multimodal capabilities. To circumvent the need for explicit physical labels, PhyVLLM models the continuous evolution of object motion in a self-supervised manner. Experimental results demonstrate that PhyVLLM significantly outperforms state-of-the-art Video LLMs on both physical reasoning and general video understanding tasks, highlighting the advantages of incorporating explicit physical modeling.
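A Neural ODE treats a hidden state as the solution of dz/dt = f(z, t), where f is a learned network, and obtains the state at any time by numerical integration. The sketch below shows only the integration loop, using fixed-step forward Euler and a hand-written dynamics function in place of a learned one. Both choices are simplifications for illustration; the abstract does not state which solver or dynamics network PhyVLLM uses.

```python
def integrate_euler(f, z0, t0, t1, steps):
    """Integrate dz/dt = f(z, t) from t0 to t1 with fixed-step forward Euler."""
    z, t = list(z0), t0
    dt = (t1 - t0) / steps
    for _ in range(steps):
        dz = f(z, t)
        z = [zi + dt * dzi for zi, dzi in zip(z, dz)]
        t += dt
    return z


def free_fall(z, t):
    """Hand-written dynamics standing in for a learned network.
    State z = [height, velocity] under constant gravity."""
    return [z[1], -9.8]


# Object thrown upward at 10 m/s; query the state one second later.
height, velocity = integrate_euler(free_fall, [0.0, 10.0], 0.0, 1.0, steps=1000)
```

Because the state is defined for every real-valued t, the representation can be queried between video frames, which is what "continuous-time motion representations" refers to in the abstract.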

[CV-79] Auto3R: Automated 3D Reconstruction and Scanning via Data-driven Uncertainty Quantification

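Quick read: This paper addresses the reliance of high-quality 3D scanning and reconstruction on manually planned scanning procedures, which prevents full automation, with particular attention to objects with non-Lambertian and specular materials. The key is Auto3R, a data-driven uncertainty quantification model that, during iterative reconstruction and scanning, efficiently and accurately predicts the uncertainty distribution over candidate viewpoints without knowing the ground-truth geometry and appearance. These predictions guide the automated scanning strategy toward high-quality, fully automatic 3D digitization.

Link: https://arxiv.org/abs/2512.04528
Authors: Chentao Shen, Sizhe Zheng, Bingqian Wu, Yaohua Feng, Yuanchen Fei, Mingyu Mei, Hanwen Jiang, Xiangru Huang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: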

Abstract:Traditional high-quality 3D scanning and reconstruction typically relies on human labor to plan the scanning procedure. With the rapid development of embodied systems such as drones and robots, there is a growing demand for performing accurate 3D scanning and reconstruction in a fully automated manner. We introduce Auto3R, a data-driven uncertainty quantification model that is designed to automate the 3D scanning and reconstruction of scenes and objects, including objects with non-Lambertian and specular materials. Specifically, in a process of iterative 3D reconstruction and scanning, Auto3R can efficiently and accurately predict the uncertainty distribution over potential scanning viewpoints, without knowing the ground truth geometry and appearance. Through extensive experiments, Auto3R achieves superior performance, outperforming state-of-the-art methods by a large margin. We also deploy Auto3R on a robot arm equipped with a camera and demonstrate that Auto3R can be used to effectively digitize real-world 3D objects and deliver ready-to-use and photorealistic digital assets. Our homepage: this https URL .

[CV-80] Identity Clue Refinement and Enhancement for Visible-Infrared Person Re-Identification

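Quick read: This paper studies Visible-Infrared Person Re-Identification (VI-ReID), where large modality discrepancies make cross-modal matching hard. Existing methods focus on modality-invariant features in a unified embedding space and neglect modality-specific, identity-aware knowledge. The proposed Identity Clue Refinement and Enhancement (ICRE) network has three parts: a Multi-Perception Feature Refinement (MPFR) module that aggregates shallow features from the shared branches to capture easily overlooked modality-specific attributes; a Semantic Distillation Cascade Enhancement (SDCE) module that distills identity-aware knowledge from the aggregated shallow features and uses it to guide modality-invariant feature learning; and an Identity Clues Guided (ICG) Loss that reduces modality discrepancies within the enhanced features and promotes a diverse representation space.

Link: https://arxiv.org/abs/2512.04522
Authors: Guoqing Zhang, Zhun Wang, Hairui Wang, Zhonglin Ye, Yuhui Zheng
Affiliations: Nanjing University of Information Science and Technology; Dalian University of Technology; Qinghai Normal University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 7 figures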

Abstract:Visible-Infrared Person Re-Identification (VI-ReID) is a challenging cross-modal matching task due to significant modality discrepancies. While current methods mainly learn modality-invariant features through unified embedding spaces, they often focus solely on the common discriminative semantics across modalities while disregarding the critical role of modality-specific identity-aware knowledge in discriminative feature learning. To bridge this gap, we propose a novel Identity Clue Refinement and Enhancement (ICRE) network to mine and utilize the implicit discriminative knowledge inherent in modality-specific attributes. Initially, we design a Multi-Perception Feature Refinement (MPFR) module that aggregates shallow features from shared branches, aiming to capture modality-specific attributes that are easily overlooked. Then, we propose a Semantic Distillation Cascade Enhancement (SDCE) module, which distills identity-aware knowledge from the aggregated shallow features and guides the learning of modality-invariant features. Finally, an Identity Clues Guided (ICG) Loss is proposed to alleviate the modality discrepancies within the enhanced features and promote the learning of a diverse representation space. Extensive experiments across multiple public datasets clearly show that our proposed ICRE outperforms existing SOTA methods.

[CV-81] WiFi-based Cross-Domain Gesture Recognition Using Attention Mechanism

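Quick read: This paper addresses the sharp performance drop of Wi-Fi-based gesture recognition in cross-domain settings, i.e. in environments not seen during training. First, Doppler spectra are extracted from the channel state information (CSI) of all receivers and concatenated along the time axis into fused images carrying multi-angle information. Second, a network combining a multi-semantic spatial attention mechanism with a self-attention-based channel mechanism builds attention maps that quantify the spatiotemporal features of gestures and extract domain-independent features. ResNet18 serves as the backbone for deeper features. On the Widar3 dataset the method reaches 99.72% in-domain and 97.61% cross-domain accuracy, significantly outperforming the best existing solutions.

Link: https://arxiv.org/abs/2512.04521
Authors: Ruijing Liu, Cunhua Pan, Jiaming Zeng, Hong Ren, Kezhi Wang, Lei Kong, Jiangzhou Wang
Affiliations: National Mobile Communications Research Laboratory, Southeast University, Nanjing 210096, China; Department of Computer Science, Brunel University London, UB8 3PH Uxbridge, U.K.; New H3C Technologies Co., Ltd & Zhejiang University; Pervasive Communication Research Center, Purple Mountain Laboratories, Nanjing 211111, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Comments: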

Abstract:While fulfilling communication tasks, wireless signals can also be used to sense the environment. Among various types of sensing media, Wi-Fi signals offer advantages such as widespread availability, low hardware cost, and strong robustness to environmental conditions like light, temperature, and humidity. By analyzing Wi-Fi signals in the environment, it is possible to capture dynamic changes of the human body and accomplish sensing applications such as gesture recognition. Many existing gesture sensing solutions perform well in-domain but lack cross-domain capabilities (i.e., recognition performance in untrained environments). To address this, we extract Doppler spectra from the channel state information (CSI) received by all receivers and concatenate each Doppler spectrum along the same time axis to generate fused images with multi-angle information as input features. Furthermore, inspired by the convolutional block attention module (CBAM), we propose a gesture recognition network that integrates a multi-semantic spatial attention mechanism with a self-attention-based channel mechanism. This network constructs attention maps to quantify the spatiotemporal features of gestures in images, enabling the extraction of key domain-independent features. Additionally, ResNet18 is employed as the backbone network to further capture deep-level features. To validate the network performance, we evaluate the proposed network on the public Widar3 dataset, and the results show that it not only maintains high in-domain accuracy of 99.72%, but also achieves high performance in cross-domain recognition of 97.61%, significantly outperforming existing best solutions.

[CV-82] Boundary-Aware Test-Time Adaptation for Zero-Shot Medical Image Segmentation

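Quick read: This paper addresses the challenges that scarce annotations and high compute cost pose for conventional fine-tuning in medical image segmentation, and in particular the weak zero-shot performance of foundation models such as the Segment Anything Model (SAM) on medical data due to domain shift. It proposes BA-TTA-SAM, a task-agnostic test-time adaptation framework that needs no source-domain training data. Its two key mechanisms are (1) encoder-level Gaussian prompt injection, which embeds Gaussian-based prompts directly into the image encoder to give explicit guidance for initial representation learning; and (2) cross-layer boundary-aware attention alignment, which exploits hierarchical feature interactions in the ViT backbone to align deep semantic responses with shallow boundary cues. On four public medical datasets the method improves the DICE score by 12.4% on average over SAM's zero-shot segmentation.

Link: https://arxiv.org/abs/2512.04520
Authors: Chenlin Xu, Lei Zhang, Lituan Wang, Xinyu Pu, Pengfei Ma, Guangwu Qian, Zizhou Wang, Yan Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: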

Abstract:Due to the scarcity of annotated data and the substantial computational cost of models, conventional tuning methods in medical image segmentation face critical challenges. Current approaches to adapting pretrained models, including full-parameter and parameter-efficient fine-tuning, still rely heavily on task-specific training on downstream tasks. Therefore, zero-shot segmentation has gained increasing attention, especially with foundation models such as SAM demonstrating promising generalization capabilities. However, SAM still faces notable limitations on medical datasets due to domain shifts, making efficient zero-shot enhancement an urgent research goal. To address these challenges, we propose BA-TTA-SAM, a task-agnostic test-time adaptation framework that significantly enhances the zero-shot segmentation performance of SAM. This framework integrates two key mechanisms: (1) The encoder-level Gaussian prompt injection embeds Gaussian-based prompts directly into the image encoder, providing explicit guidance for initial representation learning. (2) The cross-layer boundary-aware attention alignment exploits the hierarchical feature interactions within the ViT backbone, aligning deep semantic responses with shallow boundary cues. Experiments on four datasets, including ISIC, Kvasir, BUSI, and REFUGE, show an average improvement of 12.4% in the DICE score compared with SAM’s zero-shot segmentation performance. The results demonstrate that our method consistently outperforms state-of-the-art models in medical image segmentation. Our framework significantly enhances the generalization ability of SAM, without requiring any source-domain training data. Extensive experiments on publicly available medical datasets strongly demonstrate the superiority of our framework. Our code is available at this https URL.
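A Gaussian-based prompt can be pictured as a soft heatmap centered on a point of interest. The sketch below builds such a map; it covers only this one ingredient, and it is an assumption that the paper's prompts take this form. How BA-TTA-SAM constructs the prompts and injects them into SAM's image encoder is not described in the abstract, and `gaussian_prompt` is a hypothetical name.

```python
import math


def gaussian_prompt(height, width, center_y, center_x, sigma):
    """Return a height x width map that equals 1.0 at the center and decays
    with squared distance at a rate set by `sigma` (in pixels)."""
    return [[math.exp(-((y - center_y) ** 2 + (x - center_x) ** 2)
                      / (2.0 * sigma ** 2))
             for x in range(width)]
            for y in range(height)]


# 5x5 map centered on the middle pixel (toy size for readability).
heatmap = gaussian_prompt(5, 5, center_y=2, center_x=2, sigma=1.0)
```

Compared with a hard point or box prompt, a soft map expresses uncertainty about the exact location, which may be useful when the prompt is estimated at test time rather than supplied by a user.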

[CV-83] VideoSSM: Autoregressive Long Video Generation with Hybrid State-Space Memory

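Quick read: This paper addresses the problems autoregressive (AR) diffusion models face when generating minute-scale videos: accumulated errors, motion drift, and content repetition, all of which harm temporal consistency and motion stability. The proposed VideoSSM combines AR diffusion with a hybrid state-space memory: a State-Space Model (SSM) acts as an evolving global memory of scene dynamics, while a context window provides local memory for motion cues and fine details. This design avoids frozen, repetitive patterns, supports prompt-adaptive interaction, and scales linearly in time with sequence length.

Link: https://arxiv.org/abs/2512.04519
Authors: Yifei Yu, Xiaoshan Wu, Xinting Hu, Tao Hu, Yangtian Sun, Xiaoyang Lyu, Bo Wang, Lin Ma, Yuewen Ma, Zhongrui Wang, Xiaojuan Qi
Affiliations: HKU; PICO, ByteDance; SUSTech
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: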

Abstract:Autoregressive (AR) diffusion enables streaming, interactive long-video generation by producing frames causally, yet maintaining coherence over minute-scale horizons remains challenging due to accumulated errors, motion drift, and content repetition. We approach this problem from a memory perspective, treating video synthesis as a recurrent dynamical process that requires coordinated short- and long-term context. We propose VideoSSM, a Long Video Model that unifies AR diffusion with a hybrid state-space memory. The state-space model (SSM) serves as an evolving global memory of scene dynamics across the entire sequence, while a context window provides local memory for motion cues and fine details. This hybrid design preserves global consistency without frozen, repetitive patterns, supports prompt-adaptive interaction, and scales in linear time with sequence length. Experiments on short- and long-range benchmarks demonstrate state-of-the-art temporal consistency and motion stability among autoregressive video generators, especially at minute-scale horizons, enabling content diversity and interactive prompt-based control, thereby establishing a scalable, memory-aware framework for long video generation.
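The hybrid memory described above pairs a recurrent state that summarizes the whole sequence with a short window of recent frames. The sketch below shows that pairing in its simplest form, assuming a scalar linear recurrence h_t = decay * h_{t-1} + gain * x_t for the global state. Real state-space layers use learned, vector-valued parameters; the class and parameter names here are hypothetical.

```python
from collections import deque


class HybridMemory:
    """Global memory: one linear recurrent state, constant size per step.
    Local memory: the last `window` frame features, kept verbatim."""

    def __init__(self, decay, gain, window):
        self.decay, self.gain = decay, gain
        self.state = 0.0                    # scalar for clarity
        self.recent = deque(maxlen=window)  # oldest entries fall off automatically

    def step(self, frame_feature):
        self.state = self.decay * self.state + self.gain * frame_feature
        self.recent.append(frame_feature)
        return self.state, list(self.recent)


memory = HybridMemory(decay=0.5, gain=1.0, window=2)
for feature in [1.0, 1.0, 1.0]:
    state, recent = memory.step(feature)
```

Each step costs the same regardless of how many frames came before, which is the source of the linear-time scaling claimed in the abstract; the window keeps exact recent detail that the compressed state cannot.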

[CV-84] EgoLCD: Egocentric Video Generation with Long Context Diffusion

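Quick read: This paper addresses content drift in long-horizon egocentric video synthesis, where generative models lack stable long-term memory and object identity and scene semantics degrade over time. The key to its solution is the EgoLCD framework, which manages memory efficiently and stably through cooperating long- and short-term memories: a Long-Term Sparse KV Cache maintains global context consistency, while an attention-based short-term memory is extended with LoRA (Low-Rank Adaptation) for local adaptation. In addition, a Memory Regulation Loss constrains memory usage patterns and Structured Narrative Prompting provides explicit temporal guidance, which together improve perceptual quality and temporal consistency and mitigate generative forgetting.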

Link: https://arxiv.org/abs/2512.04515
Authors: Liuzhou Zhang, Jiarui Ye, Yuanlei Wang, Ming Zhong, Mingju Cao, Wanke Xia, Bowen Zeng, Zeyu Zhang, Hao Tang
Affiliations: Peking University; Sun Yat-sen University; Zhejiang University; Chinese Academy of Sciences; Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Generating long, coherent egocentric videos is difficult, as hand-object interactions and procedural tasks require reliable long-term memory. Existing autoregressive models suffer from content drift, where object identity and scene semantics degrade over time. To address this challenge, we introduce EgoLCD, an end-to-end framework for egocentric long-context video generation that treats long video synthesis as a problem of efficient and stable memory management. EgoLCD combines a Long-Term Sparse KV Cache for stable global context with an attention-based short-term memory, extended by LoRA for local adaptation. A Memory Regulation Loss enforces consistent memory usage, and Structured Narrative Prompting provides explicit temporal guidance. Extensive experiments on the EgoVid-5M benchmark demonstrate that EgoLCD achieves state-of-the-art performance in both perceptual quality and temporal consistency, effectively mitigating generative forgetting and representing a significant step toward building scalable world models for embodied AI. Code: this https URL. Website: this https URL.

[CV-85] DuGI-MAE: Improving Infrared Mask Autoencoders via Dual-Domain Guidance

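Quick read: This paper addresses the poor performance of existing foundation models (such as a Masked Autoencoder, MAE, pre-trained on visible-light data) on infrared image understanding. The core challenges are that informative tokens in infrared images are omitted, global associations are modeled insufficiently, and non-uniform noise is not handled. The key to its solution is DuGI-MAE, a dual-domain guided infrared foundation model with three contributions: (1) a deterministic masking strategy based on token entropy that keeps only high-entropy tokens for reconstruction to retain more information; (2) a Dual-Domain Guidance (DDG) module that captures global token relationships while adaptively filtering the non-uniform background noise common in infrared imagery; and (3) Inf-590K, a large-scale infrared image dataset that supports pre-training, leading to stronger generalization on downstream tasks such as infrared object detection, semantic segmentation, and small target detection.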

Link: https://arxiv.org/abs/2512.04511
Authors: Yinghui Xing, Xiaoting Su, Shizhou Zhang, Donghao Chu, Di Xu
Affiliations: Northwestern Polytechnical University; Nanjing University of Aeronautics and Astronautics
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Infrared imaging plays a critical role in low-light and adverse weather conditions. However, due to the distinct characteristics of infrared images, existing foundation models such as Masked Autoencoder (MAE) trained on visible data perform suboptimally in infrared image interpretation tasks. To bridge this gap, an infrared foundation model known as InfMAE was developed and pre-trained on large-scale infrared datasets. Despite its effectiveness, InfMAE still faces several limitations, including the omission of informative tokens, insufficient modeling of global associations, and neglect of non-uniform noise. In this paper, we propose a Dual-domain Guided Infrared foundation model based on MAE (DuGI-MAE). First, we design a deterministic masking strategy based on token entropy, preserving only high-entropy tokens for reconstruction to enhance informativeness. Next, we introduce a Dual-Domain Guidance (DDG) module, which simultaneously captures global token relationships and adaptively filters non-uniform background noise commonly present in infrared imagery. To facilitate large-scale pretraining, we construct Inf-590K, a comprehensive infrared image dataset encompassing diverse scenes, various target types, and multiple spatial resolutions. Pretrained on Inf-590K, DuGI-MAE demonstrates strong generalization capabilities across various downstream tasks, including infrared object detection, semantic segmentation, and small target detection. Experimental results validate the superiority of the proposed method over both supervised and self-supervised comparison methods. Our code is available in the supplementary material.
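
To make the idea of deterministic, entropy-based token selection concrete, here is a minimal sketch. It is not the DuGI-MAE code: the patch size, the histogram-based Shannon entropy, the `keep_ratio` value, and the function names are assumptions, and the abstract does not specify how the selected tokens are then routed through the encoder and decoder.

```python
import numpy as np


def patch_entropy(image, patch=4, bins=16):
    """Shannon entropy (bits) of the intensity histogram of each patch.

    `image` is a 2D array with values in [0, 1); patches are taken in
    row-major order.
    """
    h, w = image.shape
    ph, pw = h // patch, w // patch
    patches = (
        image[: ph * patch, : pw * patch]
        .reshape(ph, patch, pw, patch)
        .transpose(0, 2, 1, 3)
        .reshape(ph * pw, -1)
    )
    ent = np.empty(len(patches))
    for i, p in enumerate(patches):
        hist, _ = np.histogram(p, bins=bins, range=(0.0, 1.0))
        prob = hist[hist > 0] / p.size
        ent[i] = -(prob * np.log2(prob)).sum()
    return ent


def keep_high_entropy(entropies, keep_ratio=0.25):
    """Deterministically pick the indices of the highest-entropy patches."""
    k = max(1, int(round(keep_ratio * len(entropies))))
    order = np.argsort(-entropies, kind="stable")
    return np.sort(order[:k])
```

Unlike the random masking of a standard MAE, this selection gives the same result every time for the same image, so flat background patches (entropy near zero) are never chosen ahead of textured ones.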

[CV-86] UltraImage: Rethinking Resolution Extrapolation in Image Diffusion Transformers

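Quick read: This paper addresses two core problems that image diffusion transformers face when generating images far beyond their training resolution: content repetition and quality degradation. For content repetition, a frequency-wise analysis of positional embeddings shows that the root cause is the periodicity of the dominant frequency, whose period matches the training resolution; the authors therefore propose a recursive dominant frequency correction that constrains this frequency to a single period after extrapolation. For quality degradation, the study attributes the problem to diluted attention and designs an entropy-guided adaptive attention concentration strategy that assigns different focus factors to local and global attention, sharpening fine detail while preserving structural consistency. Together these methods form the UltraImage framework, which improves fidelity and extrapolation capability and can generate images of up to 6K×6K without low-resolution guidance.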

Link: https://arxiv.org/abs/2512.04504
Authors: Min Zhao, Bokai Yan, Xue Yang, Hongzhou Zhu, Jintao Zhang, Shilong Liu, Chongxuan Li, Jun Zhu
Affiliations: Tsinghua University; ShengShu; Renmin University of China; Princeton University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Recent image diffusion transformers achieve high-fidelity generation, but struggle to generate images beyond these scales, suffering from content repetition and quality degradation. In this work, we present UltraImage, a principled framework that addresses both issues. Through frequency-wise analysis of positional embeddings, we identify that repetition arises from the periodicity of the dominant frequency, whose period aligns with the training resolution. We introduce a recursive dominant frequency correction to constrain it within a single period after extrapolation. Furthermore, we find that quality degradation stems from diluted attention and thus propose entropy-guided adaptive attention concentration, which assigns higher focus factors to sharpen local attention for fine detail and lower ones to global attention patterns to preserve structural consistency. Experiments show that UltraImage consistently outperforms prior methods on Qwen-Image and Flux (around 4K) across three generation scenarios, reducing repetition and improving visual fidelity. Moreover, UltraImage can generate images up to 6K×6K without low-resolution guidance from a training resolution of 1328p, demonstrating its extreme extrapolation capability. Project page is available at this https URL.
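
The notion of "diluted attention" and a "focus factor" can be illustrated with a small sketch. This is an assumption-laden toy, not the UltraImage method: it simply multiplies the attention logits by a scalar `focus` before the softmax and measures the entropy of the resulting weights. How the paper maps attention entropy to a per-pattern focus factor is not stated in the abstract.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def entropy(p):
    """Shannon entropy (nats) of attention weights along the last axis."""
    return -(p * np.log(p + 1e-12)).sum(axis=-1)


def concentrate(logits, focus):
    """Scale logits by a focus factor; focus > 1 sharpens the distribution."""
    return softmax(focus * logits)
```

With non-uniform logits, a focus factor above 1 lowers the entropy of the attention weights (more concentrated attention), while a factor of 1 leaves them unchanged.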

[CV-87] Back to Basics: Motion Representation Matters for Human Motion Generation Using Diffusion Model

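Quick read: This paper examines how the choice of motion representation and loss function affects generative motion diffusion models, with the aim of improving conditional generation tasks such as action-to-motion, text-to-motion, and audio-to-motion. Its approach is a controlled study that systematically evaluates several common motion representations and uses the v loss as the prediction objective (the vMDM model), where v is a weighted sum of motion data and noise, to better understand latent data distributions and streamline training. The experiments show clear performance differences across motion representations on multiple datasets and show that the choice of training configuration matters, providing an empirical basis for improving conditional motion diffusion models.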

Link: https://arxiv.org/abs/2512.04499
Authors: Yuduo Jin, Brandon Haworth
Affiliations: University of Victoria
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:

Abstract:Diffusion models have emerged as a widely utilized and successful methodology in human motion synthesis. Task-oriented diffusion models have significantly advanced action-to-motion, text-to-motion, and audio-to-motion applications. In this paper, we investigate fundamental questions regarding motion representations and loss functions in a controlled study, and we enumerate the impacts of various decisions in the workflow of the generative motion diffusion model. To answer these questions, we conduct empirical studies based on a proxy motion diffusion model (MDM). We apply v loss as the prediction objective on MDM (vMDM), where v is the weighted sum of motion data and noise. We aim to enhance the understanding of latent data distributions and provide a foundation for improving the state of conditional motion diffusion models. First, we evaluate the six common motion representations in the literature and compare their performance in terms of quality and diversity metrics. Second, we compare the training time under various configurations to shed light on how to speed up the training process of motion diffusion models. Finally, we also conduct evaluation analysis on a large motion dataset. The results of our experiments indicate clear performance differences across motion representations in diverse datasets. Our results also demonstrate the impacts of distinct configurations on model training and suggest the importance and effectiveness of these decisions on the outcomes of motion diffusion models.
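
The abstract describes v only as a weighted sum of motion data and noise. A common concrete form is the v-parameterization used in diffusion models, where v = alpha_t * eps - sigma_t * x0 with alpha_t^2 + sigma_t^2 = 1. The sketch below assumes that form and a cosine-style schedule; both are assumptions, since the abstract does not give the weights used in vMDM.

```python
import numpy as np


def alpha_sigma(t):
    """Assumed schedule with alpha^2 + sigma^2 = 1, for t in [0, 1]."""
    return np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)


def noisy_sample(x0, eps, t):
    """Forward process: x_t = alpha_t * x0 + sigma_t * eps."""
    a, s = alpha_sigma(t)
    return a * x0 + s * eps


def v_target(x0, eps, t):
    """Regression target for v-prediction: v = alpha_t * eps - sigma_t * x0."""
    a, s = alpha_sigma(t)
    return a * eps - s * x0


def x0_from_v(x_t, v, t):
    """Recover the clean sample from x_t and v: x0 = alpha_t * x_t - sigma_t * v."""
    a, s = alpha_sigma(t)
    return a * x_t - s * v
```

Under this assumption the clean motion can be recovered exactly from `x_t` and a perfect `v`, which is why a network trained with a v loss can still be used to output motion at sampling time.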

[CV-88] Shift-Window Meets Dual Attention: A Multi-Model Architecture for Specular Highlight Removal

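Quick read: This paper addresses specular highlights, which are unavoidable in practical environments and severely impair visual performance, reducing task effectiveness and efficiency. Existing methods rely either on convolutional neural networks (CNNs) for local information or on Transformers for global information, but a single architecture struggles to capture both local fine-grained features and global long-range dependencies, so highlights at different scales are removed poorly. The key to the solution is a multi-model architecture for specular highlight removal (MM-SHR): convolutions extract local details in the shallow layers, and attention models global features in the deep layers. The authors also adopt a coarse-to-fine strategy and propose the Omni-Directional Attention Integration Block (OAIBlock) and the Adaptive Region-Aware Hybrid-Domain Dual Attention Convolutional Network (HDDAConv), which apply omni-directional pixel-shifting and window-dividing operations to the raw features, balancing removal accuracy against computational cost.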

Link: https://arxiv.org/abs/2512.04496
Authors: Tianci Huo, Lingfeng Qi, Yuhan Chen, Qihong Xue, Jinyuan Shao, Hai Yu, Jie Li, Zhanhua Zhang, Guofa Li
Affiliations: Chongqing University; Zhejiang University; Geely Automotive Research Institute
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Inevitable specular highlights in practical environments severely impair the visual performance, thus degrading the task effectiveness and efficiency. Although many existing methods focus on local information from convolutional neural network models or global information from transformer models, a single-type model faces a modeling dilemma between local fine-grained details and global long-range dependencies, and therefore performs poorly on specular highlights of different scales. Therefore, to accommodate specular highlights of all scales, we propose a multi-model architecture for specular highlight removal (MM-SHR) that effectively captures fine-grained features in highlight regions and models long-range dependencies between highlight and highlight-free areas. Specifically, we employ convolution operations to extract local details in the shallow layers of MM-SHR, and utilize the attention mechanism to capture global features in the deep layers, ensuring both operation efficiency and removal accuracy. To model long-range dependencies without compromising computational complexity, we utilize a coarse-to-fine manner and propose the Omni-Directional Attention Integration Block (OAIBlock) and the Adaptive Region-Aware Hybrid-Domain Dual Attention Convolutional Network (HDDAConv), which leverage omni-directional pixel-shifting and window-dividing operations on the raw features to achieve specular highlight removal. Extensive experimental results on three benchmark tasks and six types of surface materials demonstrate that MM-SHR outperforms state-of-the-art methods in both accuracy and efficiency for specular highlight removal. The implementation will be made publicly available at this https URL.

[CV-89] Controllable Long-term Motion Generation with Extended Joint Targets WACV2026

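Quick read: This paper addresses the stability and controllability of character motion in real-time computer animation; existing methods often lack fine-grained control or degrade over long sequences, which limits their use in interactive applications. The key to the solution is COMET, an autoregressive framework with two core components: (1) an efficient Transformer-based conditional variational autoencoder (conditional VAE) that supports precise, interactive control over arbitrary user-specified joints for tasks such as goal-reaching and in-betweening; and (2) a reference-guided feedback mechanism that prevents error accumulation, keeps long sequences temporally stable, and doubles as a plug-and-play stylization module for real-time style transfer. According to the authors, COMET generates high-quality motion at real-time speeds and significantly outperforms state-of-the-art methods on complex control tasks.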

Link: https://arxiv.org/abs/2512.04487
Authors: Eunjong Lee, Eunhee Kim, Sanghoon Hong, Eunho Jung, Jihoon Kim
Affiliations: Cinamon Inc.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: WACV 2026

Abstract:Generating stable and controllable character motion in real-time is a key challenge in computer animation. Existing methods often fail to provide fine-grained control or suffer from motion degradation over long sequences, limiting their use in interactive applications. We propose COMET, an autoregressive framework that runs in real time, enabling versatile character control and robust long-horizon synthesis. Our efficient Transformer-based conditional VAE allows for precise, interactive control over arbitrary user-specified joints for tasks like goal-reaching and in-betweening from a single model. To ensure long-term temporal stability, we introduce a novel reference-guided feedback mechanism that prevents error accumulation. This mechanism also serves as a plug-and-play stylization module, enabling real-time style transfer. Extensive evaluations demonstrate that COMET robustly generates high-quality motion at real-time speeds, significantly outperforming state-of-the-art approaches in complex motion control tasks and confirming its readiness for demanding interactive applications.

[CV-90] Not All Birds Look The Same: Identity-Preserving Generation For Birds

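Quick read: This paper addresses identity preservation for generative AI in non-rigid, fine-grained categories such as birds, a domain that is hard to evaluate and improve because high-quality video or multi-view data are scarce. The key to the solution is a new benchmark dataset, NABirds Look-Alikes (NABLA), containing 4,759 expert-curated image pairs; combined with 1,073 pairs collected from multi-image observations on iNaturalist and a small set of videos, it forms a standard test set for identity-preserving generation. The authors further find that training on images grouped by species, age, and sex, used as a proxy for identity, substantially improves identity preservation on both seen and unseen species.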

Link: https://arxiv.org/abs/2512.04485
Authors: Aaron Sun, Oindrila Saha, Subhransu Maji
Affiliations: University of Massachusetts, Amherst
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Since the advent of controllable image generation, increasingly rich modes of control have enabled greater customization and accessibility for everyday users. Zero-shot, identity-preserving models such as Insert Anything and OminiControl now support applications like virtual try-on without requiring additional fine-tuning. While these models may be fitting for humans and rigid everyday objects, they still have limitations for non-rigid or fine-grained categories. These domains often lack accessible, high-quality data – especially videos or multi-view observations of the same subject – making them difficult both to evaluate and to improve upon. Yet, such domains are essential for moving beyond content creation toward applications that demand accuracy and fine detail. Birds are an excellent domain for this task: they exhibit high diversity, require fine-grained cues for identification, and come in a wide variety of poses. We introduce the NABirds Look-Alikes (NABLA) dataset, consisting of 4,759 expert-curated image pairs. Together with 1,073 pairs collected from multi-image observations on iNaturalist and a small set of videos, this forms a benchmark for evaluating identity-preserving generation of birds. We show that state-of-the-art baselines fail to maintain identity on this dataset, and we demonstrate that training on images grouped by species, age, and sex – used as a proxy for identity – substantially improves performance on both seen and unseen species.

[CV-91] DeRA: Decoupled Representation Alignment for Video Tokenization

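Quick read: This paper addresses the low training efficiency and limited performance caused by coupled spatial-temporal representation learning in video tokenization. Existing methods model a video's appearance and motion jointly in a single latent space, which tends to cause gradient conflicts and hinders fine-grained capture of spatiotemporal semantics. The key to the solution is DeRA, a 1D video tokenizer that factorizes video encoding into separately processed appearance and motion streams, decoupling the learning of spatial semantics from temporal dynamics for more efficient training and better performance. It further introduces the Symmetric Alignment-Conflict Projection (SACP) module, which suppresses gradient components along conflicting directions under heterogeneous supervision, improving stability and generalization.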

Link: https://arxiv.org/abs/2512.04483
Authors: Pengbo Guo, Junke Wang, Zhen Xing, Chengxu Liu, Daoguo Dong, Xueming Qian, Zuxuan Wu
Affiliations: Xi’an Jiaotong University; Shanghai Innovation Institute; Institute of Trustworthy Embodied AI, Fudan University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper presents DeRA, a novel 1D video tokenizer that decouples the spatial-temporal representation learning in video tokenization to achieve better training efficiency and performance. Specifically, DeRA maintains a compact 1D latent space while factorizing video encoding into appearance and motion streams, which are aligned with pretrained vision foundation models to capture the spatial semantics and temporal dynamics in videos separately. To address the gradient conflicts introduced by the heterogeneous supervision, we further propose the Symmetric Alignment-Conflict Projection (SACP) module that proactively reformulates gradients by suppressing the components along conflicting directions. Extensive experiments demonstrate that DeRA outperforms LARP, the previous state-of-the-art video tokenizer by 25% on UCF-101 in terms of rFVD. Moreover, using DeRA for autoregressive video generation, we also achieve new state-of-the-art results on both UCF-101 class-conditional generation and K600 frame prediction.
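
Suppressing gradient components along conflicting directions can be sketched with a symmetric, PCGrad-style projection. This is a stand-in technique, not the SACP module itself: the abstract does not give SACP's exact formulation, and the function name and the two-gradient setup are assumptions.

```python
import numpy as np


def project_conflict(g_a, g_b):
    """Symmetric conflict projection for two task gradients.

    If the gradients conflict (negative dot product), each one has its
    component along the other removed; otherwise both are returned unchanged.
    """
    dot = float(g_a @ g_b)
    if dot >= 0.0:
        return g_a.copy(), g_b.copy()
    g_a_new = g_a - dot / float(g_b @ g_b) * g_b
    g_b_new = g_b - dot / float(g_a @ g_a) * g_a
    return g_a_new, g_b_new
```

In a two-stream setting, `g_a` and `g_b` would be the gradients of the appearance-alignment and motion-alignment losses with respect to shared parameters; after projection neither update pushes directly against the other.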

[CV-92] Feature Engineering vs. Deep Learning for Automated Coin Grading: A Comparative Study on Saint-Gaudens Double Eagles

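Quick read: This paper argues that deep learning does not necessarily outperform traditional feature engineering in small-sample, class-imbalanced settings, using automated grading of Saint-Gaudens Double Eagle gold coins as the case study. The key to the solution is a feature-based Artificial Neural Network (ANN) built on 192 hand-crafted features derived from Sobel edge detection and HSV color analysis, which explicitly encodes domain-expert knowledge in the model input. With only 1,785 labeled samples and a highly imbalanced class distribution, it reaches 86% exact-match accuracy (98% when a 3-grade tolerance is allowed), well ahead of a hybrid Convolutional Neural Network (CNN) that incorporates EfficientNetV2 and a Support Vector Machine (SVM) baseline. This suggests that structured feature engineering can beat black-box deep learning architectures when data are scarce and expert knowledge matters.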

Link: https://arxiv.org/abs/2512.04464
Authors: Tanmay Dogra, Eric Ngo, Mohammad Alam, Jean-Paul Talavera, Asim Dahal
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We challenge the common belief that deep learning always trumps older techniques, using the example of grading Saint-Gaudens Double Eagle gold coins automatically. In our work, we put a feature-based Artificial Neural Network built around 192 custom features pulled from Sobel edge detection and HSV color analysis up against a hybrid Convolutional Neural Network that blends in EfficientNetV2, plus a straightforward Support Vector Machine as the control. Testing 1,785 coins graded by experts, the ANN nailed 86% exact matches and hit 98% when allowing a 3-grade leeway. On the flip side, CNN and SVM mostly just guessed the most common grade, scraping by with 31% and 30% exact hits. Sure, the CNN looked good on broader tolerance metrics, but that is because of some averaging trick in regression that hides how it totally flops at picking out specific grades. All told, when you are stuck with under 2,000 examples and lopsided classes, baking in real coin-expert knowledge through feature design beats out those inscrutable, all-in-one deep learning setups. This rings true for other niche quality checks where data’s thin and know-how matters more than raw compute.
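
The gap between exact-match and tolerance-based accuracy, and the "averaging trick" the abstract mentions, can be shown with a tiny sketch. The grade values below are made up for illustration and are not data from the paper; `tolerance_accuracy` is a hypothetical helper name.

```python
import numpy as np


def tolerance_accuracy(true_grades, pred_grades, tol=0):
    """Fraction of predictions within `tol` grade points of the true grade."""
    true_grades = np.asarray(true_grades)
    pred_grades = np.asarray(pred_grades)
    return float(np.mean(np.abs(true_grades - pred_grades) <= tol))


# Hypothetical grades: a model that always outputs the middle grade.
true_grades = [60, 61, 62, 63, 64, 65, 66]
constant_pred = [63] * 7

exact = tolerance_accuracy(true_grades, constant_pred, tol=0)
within_3 = tolerance_accuracy(true_grades, constant_pred, tol=3)
```

In this toy case the constant predictor is right on only one of seven coins exactly, yet every prediction falls within three grades, which is the kind of effect that can make a regression-style model look strong on tolerance metrics while it fails to separate specific grades.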

[CV-93] UniTS: Unified Time Series Generative Model for Remote Sensing

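Quick read: This paper addresses the lack of a unified modeling approach for satellite remote sensing time series tasks; existing methods design specialized models for each task (cloud removal, change detection, forecasting, and so on), which prevents spatiotemporal features from being learned jointly across tasks. The key to the solution is UniTS, a general time series generative model based on the flow matching paradigm, which constructs a deterministic evolution path from noise to targets under task-specific conditions and thereby models spatiotemporal representations for multiple low-level and high-level tasks in one framework. Its core contributions are: (1) an Adaptive Condition Injector (ACor) that strengthens conditional perception of multimodal inputs and enables high-quality controllable generation; and (2) a Spatiotemporal-aware Modulator (STM) that improves the ability of the spatio-temporal blocks to capture complex spatiotemporal dependencies. The authors also build two multimodal time series datasets, TS-S12 and TS-S12CR, filling the gap in benchmark data for cloud removal and forecasting.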

Link: https://arxiv.org/abs/2512.04461
Authors: Yuxiang Zhang, Shunlin Liang, Wenyuan Li, Han Ma, Jianglei Xu, Yichuan Ma, Jiangwei Xie, Wei Li, Mengmeng Zhang, Ran Tao, Xiang-Gen Xia
Affiliations: The Jockey Club STEM Laboratory of Quantitative Remote Sensing, Department of Geography, the University of Hong Kong, Hong Kong, China; School of Information and Electronics, Beijing Institute of Technology, and Beijing Key Laboratory of Fractional Signals and Systems, 100081 Beijing, China; Department of Electrical and Computer Engineering, University of Delaware, Newark, DE 19716, USA
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:One of the primary objectives of satellite remote sensing is to capture the complex dynamics of the Earth environment, which encompasses tasks such as reconstructing continuous cloud-free time series images, detecting land cover changes, and forecasting future surface evolution. However, existing methods typically require specialized models tailored to different tasks, lacking unified modeling of spatiotemporal features across multiple time series tasks. In this paper, we propose a Unified Time Series Generative Model (UniTS), a general framework applicable to various time series tasks, including time series reconstruction, time series cloud removal, time series semantic change detection, and time series forecasting. Based on the flow matching generative paradigm, UniTS constructs a deterministic evolution path from noise to targets under the guidance of task-specific conditions, achieving unified modeling of spatiotemporal representations for multiple tasks. The UniTS architecture consists of a diffusion transformer with spatio-temporal blocks, where we design an Adaptive Condition Injector (ACor) to enhance the model’s conditional perception of multimodal inputs, enabling high-quality controllable generation. Additionally, we design a Spatiotemporal-aware Modulator (STM) to improve the ability of spatio-temporal blocks to capture complex spatiotemporal dependencies. Furthermore, we construct two high-quality multimodal time series datasets, TS-S12 and TS-S12CR, filling the gap of benchmark datasets for time series cloud removal and forecasting tasks. Extensive experiments demonstrate that UniTS exhibits exceptional generative and cognitive capabilities in both low-level and high-level time series tasks. It significantly outperforms existing methods, particularly when facing challenges such as severe cloud contamination, modality absence, and forecasting phenological variations.
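
A deterministic path from noise to target, as used in flow matching, can be sketched in a few lines. The abstract does not say which path UniTS uses; the code below assumes the common straight-line interpolation x_t = (1 - t) * x0 + t * x1, whose target velocity is x1 - x0, and leaves out the task-specific conditioning entirely.

```python
import numpy as np


def interpolate(x0, x1, t):
    """Point on the assumed straight path from noise x0 (t=0) to target x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1


def target_velocity(x0, x1):
    """Velocity a flow-matching network would regress along that path."""
    return x1 - x0


def euler_integrate(x0, velocity_fn, steps=10):
    """Deterministic sampling: integrate dx/dt = velocity_fn(x, t) from t=0 to t=1."""
    x = x0.copy()
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x
```

With a perfectly learned velocity field, Euler integration from a noise sample lands on the target; in practice `velocity_fn` would be the trained network evaluated with the task condition (for example cloudy inputs or past frames).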

[CV-94] dVLM-AD: Enhance Diffusion Vision-Language-Model for Driving via Controllable Reasoning

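Quick read: This paper addresses inconsistent reasoning and planning, and poor controllability, of end-to-end (E2E) autonomous driving systems in out-of-distribution (OOD) scenarios. Existing vision-language models (VLMs) built on autoregressive (AR) architectures are limited by causal attention and sequential token generation, and struggle to keep high-level reasoning consistent with low-level planning. The authors propose dVLM-AD, whose core idea is a diffusion-based VLM architecture that uses bidirectional attention and iterative denoising to unify perception, structured reasoning, and low-level planning in a controllable way. On nuScenes and WOD-E2E, dVLM-AD improves behavior-trajectory consistency by 9% and RFS on long-tail WOD-E2E scenarios by 6% over AR-based baselines, suggesting that diffusion models can make E2E driving systems more reliable.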

Link: https://arxiv.org/abs/2512.04459
Authors: Yingzi Ma, Yulong Cao, Wenhao Ding, Shuibai Zhang, Yan Wang, Boris Ivanovic, Ming Jiang, Marco Pavone, Chaowei Xiao
Affiliations: University of Wisconsin-Madison; NVIDIA; Stanford University; Johns Hopkins University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The autonomous driving community is increasingly focused on addressing the challenges posed by out-of-distribution (OOD) driving scenarios. A dominant research trend seeks to enhance end-to-end (E2E) driving systems by integrating vision-language models (VLMs), leveraging their rich world knowledge and reasoning abilities to improve generalization across diverse environments. However, most existing VLMs or vision-language agents (VLAs) are built upon autoregressive (AR) models. In this paper, we observe that existing AR-based VLMs – limited by causal attention and sequential token generation – often fail to maintain consistency and controllability between high-level reasoning and low-level planning. In contrast, recent discrete diffusion VLMs equipped with bidirectional attention exhibit superior controllability and reliability through iterative denoising. Building on these observations, we introduce dVLM-AD, a diffusion-based vision-language model that unifies perception, structured reasoning, and low-level planning for end-to-end driving. Evaluated on nuScenes and WOD-E2E, dVLM-AD yields more consistent reasoning-action pairs and achieves planning performance comparable to existing driving VLM/VLA systems despite a modest backbone, outperforming AR-based baselines with a 9 percent improvement in behavior-trajectory consistency and a 6 percent increase in RFS on long-tail WOD-E2E scenarios. These results suggest a controllable and reliable pathway for scalable end-to-end driving.

[CV-95] GuidNoise: Single-Pair Guided Diffusion for Generalized Noise Synthesis AAAI2026

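Quick read: This paper addresses two limitations of current image denoising pipelines: the dependence on large amounts of real noisy data, and generative noise models that require camera metadata and target-specific noisy-clean image pairs and generalize poorly. The key to the solution is GuidNoise, a diffusion model guided by a single noisy/clean image pair. Its core components are a guidance-aware affine feature modification (GAFM) that exploits the information in the single guidance pair, and a noise-aware refine loss that refines the diffusion model's backward process so that it generates more realistic noise distributions. Without additional metadata, GuidNoise synthesizes high-quality noisy images under diverse noise conditions and can efficiently generate noisy-clean pairs at inference time for self-augmentation, which significantly improves denoising performance for lightweight models with limited training data.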

Link: https://arxiv.org/abs/2512.04456
Authors: Changjin Kim, HyeokJun Lee, YoungJoon Yoo
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: AAAI2026

Abstract:Recent image denoising methods have leveraged generative modeling for real noise synthesis to address the costly acquisition of real-world noisy data. However, these generative models typically require camera metadata and extensive target-specific noisy-clean image pairs, often showing limited generalization between settings. In this paper, to mitigate these prerequisites, we propose a Single-Pair Guided Diffusion for generalized noise synthesis (GuidNoise), which uses a single noisy/clean pair as the guidance, often easily obtained within a training set. To train GuidNoise, which generates synthetic noisy images from the guidance, we introduce a guidance-aware affine feature modification (GAFM) and a noise-aware refine loss to leverage the inherent potential of diffusion models. This loss function refines the diffusion model’s backward process, making the model more adept at generating realistic noise distributions. GuidNoise synthesizes high-quality noisy images under diverse noise environments without additional metadata during both training and inference. Additionally, GuidNoise enables the efficient generation of noisy-clean image pairs at inference time, making synthetic noise readily applicable for augmenting training data. This self-augmentation significantly improves denoising performance, especially in practical scenarios with lightweight models and limited training data. The code is available at this https URL.

[CV-96] StreamEQA: Towards Streaming Video Understanding for Embodied Scenarios

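Quick read: This paper addresses the limited ability of embodied intelligence to continuously perceive and reason over streaming video in real-world settings, that is, how an agent can maintain situational awareness, understand its interactions with surrounding entities, and plan actions dynamically from past observations, current context, and anticipated future events. The key to the solution is StreamEQA, the first streaming video question answering benchmark for embodied scenarios. Its main design points are: (1) along the embodied dimension, questions are divided into perception, interaction, and planning levels, which assess fine-grained visual recognition, reasoning about agent-object interactions, and high-level goal-directed reasoning; (2) along the streaming dimension, questions are divided into backward, real-time, and forward reasoning, each relying on a distinct temporal context; and (3) the benchmark is built on 156 independent long videos, defines 42 tasks, and contains approximately 21K question-answer pairs with precise timestamps, produced by a hybrid pipeline of automated generation and human refinement. Experiments show that existing multimodal large language models (MLLMs), despite strong results on conventional benchmarks, still struggle with streaming video understanding.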

Link: https://arxiv.org/abs/2512.04451
Authors: Yifei Wang, Zhenkai Li, Tianwen Qian, Huanran Zheng, Zheng Wang, Yuqian Fu, Xiaoling Wang
Affiliations: East China Normal University; Zhejiang University of Technology; INSAIT, Sofia University “St. Kliment Ohridski”
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:As embodied intelligence advances toward real-world deployment, the ability to continuously perceive and reason over streaming visual inputs becomes essential. In such settings, an agent must maintain situational awareness of its environment, comprehend the interactions with surrounding entities, and dynamically plan actions informed by past observations, current contexts, and anticipated future events. To facilitate progress in this direction, we introduce StreamEQA, the first benchmark designed for streaming video question answering in embodied scenarios. StreamEQA evaluates existing MLLMs along two orthogonal dimensions: Embodied and Streaming. Along the embodied dimension, we categorize the questions into three levels: perception, interaction, and planning, which progressively assess a model’s ability to recognize fine-grained visual details, reason about agent-object interactions, and perform high-level goal-directed reasoning. For the streaming dimension, questions are divided into backward, real-time, and forward reasoning, with each mode relying on a distinct temporal context. Built upon 156 independent long videos, StreamEQA defines 42 tasks and generates approximately 21K question-answer pairs with precise timestamps through a hybrid pipeline combining automated generation and human refinement. Evaluations of 13 state-of-the-art video-LLMs reveal that, despite strong performance on conventional benchmarks, these models still struggle with streaming video understanding in embodied scenarios. We hope StreamEQA will catalyze research on streaming video understanding for embodied applications.

[CV-97] MindDrive: An All-in-One Framework Bridging World Models and Vision-Language Model for End-to-End Autonomous Driving

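Quick read: This paper addresses two limitations of trajectory planning in end-to-end autonomous driving (E2E-AD): existing methods focus on either trajectory generation or trajectory selection and struggle to combine high-quality generation with multi-dimensional decision reasoning, and they lack forward-looking simulation and interpretable decision mechanisms. The key to the solution is the MindDrive framework, built around a structured reasoning paradigm of "context simulation - candidate generation - multi-objective trade-off". A Future-aware Trajectory Generator (FaTG), based on a World Action Model (WaM), runs ego-conditioned "what-if" simulations to predict potential future scenes and generate foresighted trajectory candidates. A VLM-oriented Evaluator (VLoE) then uses a large vision-language model (VLM) to evaluate the candidates across safety, comfort, and efficiency, leading to interpretable, human-aligned decisions. Experiments on the NAVSIM-v1 and NAVSIM-v2 benchmarks report state-of-the-art performance with significant gains in safety, compliance, and generalization.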

Link: https://arxiv.org/abs/2512.04441
Authors: Bin Sun, Yaoguang Cao, Yan Wang, Rui Wang, Jiachen Shang, Xiejie Feng, Jiayi Lu, Jia Shi, Shichun Yang, Xiaoyu Yan, Ziying Song
Affiliations: Beihang University; Contemporary Amperex Technology Co., Limited (CATL); Beijing Jiaotong University; China Automotive Engineering Research Institute Co., Ltd.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:End-to-End autonomous driving (E2E-AD) has emerged as a new paradigm, where trajectory planning plays a crucial role. Existing studies mainly follow two directions: trajectory generation oriented, which focuses on producing high-quality trajectories with simple decision mechanisms, and trajectory selection oriented, which performs multi-dimensional evaluation to select the best trajectory yet lacks sufficient generative capability. In this work, we propose MindDrive, a harmonized framework that integrates high-quality trajectory generation with comprehensive decision reasoning. It establishes a structured reasoning paradigm of “context simulation - candidate generation - multi-objective trade-off”. In particular, the proposed Future-aware Trajectory Generator (FaTG), based on a World Action Model (WaM), performs ego-conditioned “what-if” simulations to predict potential future scenes and generate foresighted trajectory candidates. Building upon this, the VLM-oriented Evaluator (VLoE) leverages the reasoning capability of a large vision-language model to conduct multi-objective evaluations across safety, comfort, and efficiency dimensions, leading to reasoned and human-aligned decision making. Extensive experiments on the NAVSIM-v1 and NAVSIM-v2 benchmarks demonstrate that MindDrive achieves state-of-the-art performance across multi-dimensional driving metrics, significantly enhancing safety, compliance, and generalization. This work provides a promising path toward interpretable and cognitively guided autonomous driving.

[CV-98] Self-Paced and Self-Corrective Masked Prediction for Movie Trailer Generation

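Quick read: This paper addresses error propagation in current automatic movie trailer generation methods. Most follow a "selection-then-ranking" paradigm, which first selects key shots and then ranks them, so errors accumulate through the pipeline and limit trailer quality. The key to the solution is a self-paced and self-corrective masked prediction method (SSMP) that generates trailers through bi-directional contextual modeling and progressive self-correction. Specifically, SSMP trains a Transformer encoder that takes the movie shot sequence as a prompt and generates the corresponding trailer shot sequence; training uses masked prediction, reconstructing each trailer shot sequence from a randomly masked copy. The mask ratio is self-paced, so task difficulty adapts to the model's ability. At inference, the model fills the high-confidence shot positions at each step and re-masks the rest for the next prediction, a progressive self-correction process that resembles how human editors work.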

Link: https://arxiv.org/abs/2512.04426
Authors: Sidan Zhu, Hongteng Xu, Dixin Luo
Affiliations: Beijing Institute of Technology; Renmin University of China; Key Laboratory of Artificial Intelligence, Ministry of Education
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:As a challenging video editing task, movie trailer generation involves selecting and reorganizing movie shots to create engaging trailers. Currently, most existing automatic trailer generation methods employ a “selection-then-ranking” paradigm (i.e., first selecting key shots and then ranking them), which suffers from inevitable error propagation and limits the quality of the generated trailers. Beyond this paradigm, we propose a new self-paced and self-corrective masked prediction method called SSMP, which achieves state-of-the-art results in automatic trailer generation via bi-directional contextual modeling and progressive self-correction. In particular, SSMP trains a Transformer encoder that takes the movie shot sequences as prompts and generates corresponding trailer shot sequences accordingly. The model is trained via masked prediction, reconstructing each trailer shot sequence from its randomly masked counterpart. The mask ratio is self-paced, allowing the task difficulty to adapt to the model and thereby improving model performance. When generating a movie trailer, the model fills the shot positions with high confidence at each step and re-masks the remaining positions for the next prediction, forming a progressive self-correction mechanism that is analogous to how human editors work. Both quantitative results and user studies demonstrate the superiority of SSMP in comparison to existing automatic movie trailer generation methods. Demo is available at: this https URL.
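
The inference loop described in the abstract (fill the confident positions, re-mask the rest, predict again) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `toy_predictor` is a hypothetical stand-in for the trained Transformer encoder, and the linear keep schedule is an assumption.

```python
import random

MASK = None  # marker for a trailer position that is still masked


def toy_predictor(movie_shots, trailer, rng):
    """Hypothetical stand-in for the trained encoder.

    Returns one (shot, confidence) proposal per trailer position. A real model
    would condition on the movie shots and on the currently filled positions;
    this toy version ignores `trailer` and samples at random.
    """
    return [(rng.choice(movie_shots), rng.random()) for _ in trailer]


def iterative_decode(movie_shots, length, steps, predictor=toy_predictor, seed=0):
    """Fill a trailer of `length` shots in `steps` rounds of predict-and-re-mask."""
    rng = random.Random(seed)
    trailer = [MASK] * length
    for step in range(1, steps + 1):
        proposals = predictor(movie_shots, trailer, rng)
        # Assumed schedule: keep a linearly growing number of positions.
        keep = max(1, (length * step) // steps)
        ranked = sorted(range(length), key=lambda p: proposals[p][1], reverse=True)
        confident = set(ranked[:keep])
        # Every position is re-predicted each round, so an earlier choice can
        # be replaced or re-masked: this is the self-correction step.
        trailer = [proposals[p][0] if p in confident else MASK for p in range(length)]
    return trailer
```

Because the last round keeps every position, the returned sequence contains no masked entries.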

[CV-99] Explainable Parkinsons Disease Gait Recognition Using Multimodal RGB-D Fusion and Large Language Models

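[Quick read]: This paper targets the limited accuracy and interpretability of gait analysis for early detection of Parkinson's disease (PD). Existing methods are often constrained by single-modality input, poor robustness, and low clinical transparency. The key to the solution is an explainable multimodal framework that fuses RGB and depth (RGB-D) data. Dual YOLOv11 encoders extract modality-specific features, and a Multi-Scale Local-Global Extraction (MLGE) module together with a Cross-Spatial Neck Fusion mechanism strengthens the spatial-temporal representation, so that Parkinsonian gait traits such as reduced arm swing or short stride can be recognized under challenging conditions. In addition, a frozen Large Language Model (LLM) turns the fused visual embeddings and structured metadata into clinically understandable textual explanations, making the system more transparent and trustworthy for clinicians and patients.

Link: https://arxiv.org/abs/2512.04425
Authors: Manar Alnaasan, Md Selim Sarowar, Sungho Kim
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: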

Abstract:Accurate and interpretable gait analysis plays a crucial role in the early detection of Parkinson's disease (PD), yet most existing approaches remain limited by single-modality inputs, low robustness, and a lack of clinical transparency. This paper presents an explainable multimodal framework that integrates RGB and Depth (RGB-D) data to recognize Parkinsonian gait patterns under realistic conditions. The proposed system employs dual YOLOv11-based encoders for modality-specific feature extraction, followed by a Multi-Scale Local-Global Extraction (MLGE) module and a Cross-Spatial Neck Fusion mechanism to enhance spatial-temporal representation. This design captures both fine-grained limb motion (e.g., reduced arm swing) and overall gait dynamics (e.g., short stride or turning difficulty), even in challenging scenarios such as low lighting or occlusion caused by clothing. To ensure interpretability, a frozen Large Language Model (LLM) is incorporated to translate fused visual embeddings and structured metadata into clinically meaningful textual explanations. Experimental evaluations on multimodal gait datasets demonstrate that the proposed RGB-D fusion framework achieves higher recognition accuracy, improved robustness to environmental variations, and clear visual-linguistic reasoning compared with single-input baselines. By combining multimodal feature learning with language-based interpretability, this study bridges the gap between visual recognition and clinical understanding, offering a novel vision-language paradigm for reliable and explainable Parkinson's disease gait analysis. Code: this https URL

[CV-100] UTrice: Unifying Primitives in Differentiable Ray Tracing and Rasterization via Triangles for Particle-Based 3D Scenes CVPR2026

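[Quick read]: This paper addresses the fact that existing ray tracing methods for 3D Gaussian particles depend on proxy geometry, which requires building complex intermediate meshes and running costly intersection tests, limiting efficiency and flexibility in realistic rendering. The key to the solution is a differentiable triangle-based ray tracing pipeline that traces triangles directly as rendering primitives without any proxy geometry. The pipeline can also render triangles optimized by the rasterization-based method Triangle Splatting, which unifies the primitives used by ray tracing and rasterization in novel-view synthesis.

Link: https://arxiv.org/abs/2512.04421
Authors: Changhe Liu, Ehsan Javanmardi, Naren Bao, Alex Orsholits, Manabu Tsukada
Affiliations: The University of Tokyo
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: 13 pages, 10 figures, submitted to CVPR2026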

Abstract:Ray tracing 3D Gaussian particles enables realistic effects such as depth of field, refractions, and flexible camera modeling for novel-view synthesis. However, existing methods trace Gaussians through proxy geometry, which requires constructing complex intermediate meshes and performing costly intersection tests. This limitation arises because Gaussian-based particles are not well suited as unified primitives for both ray tracing and rasterization. In this work, we propose a differentiable triangle-based ray tracing pipeline that directly treats triangles as rendering primitives without relying on any proxy geometry. Our results show that the proposed method achieves significantly higher rendering quality than existing ray tracing approaches while maintaining real-time rendering performance. Moreover, our pipeline can directly render triangles optimized by the rasterization-based method Triangle Splatting, thus unifying the primitives used in novel-view synthesis.

[CV-101] Dual-Stream Spectral Decoupling Distillation for Remote Sensing Object Detection

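[Quick read]: This paper addresses two problems in knowledge distillation for remote sensing images (RSIs): mixed features, and knowledge confusion caused by neglecting subtle feature discrepancies. The key to the solution is an architecture-agnostic distillation method, Dual-Stream Spectral Decoupling Distillation (DS2D2), which performs explicit and implicit distillation on top of spectral decomposition. First, a first-order wavelet transform performs the spectral decomposition while preserving the critical spatial characteristics of RSIs, and a Density-Independent Scale Weight (DISW) is designed to improve detection of dense and small objects. Second, the method mines the implicit knowledge contained in subtle student-teacher feature discrepancies: full-frequency and high-frequency amplifiers map feature differences to prediction deviations, making the detection head more sensitive to the key information.

Link: https://arxiv.org/abs/2512.04413
Authors: Xiangyi Gao, Danpei Zhao, Bo Yuan, Wentao Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 12 pages, 8 figures, 11 tables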

Abstract:Knowledge distillation is an effective and hardware-friendly method, which plays a key role in lightweighting remote sensing object detection. However, existing distillation methods often encounter the issue of mixed features in remote sensing images (RSIs), and neglect the discrepancies caused by subtle feature variations, leading to entangled knowledge confusion. To address these challenges, we propose an architecture-agnostic distillation method named Dual-Stream Spectral Decoupling Distillation (DS2D2) for universal remote sensing object detection tasks. Specifically, DS2D2 integrates explicit and implicit distillation grounded in spectral decomposition. Firstly, the first-order wavelet transform is applied for spectral decomposition to preserve the critical spatial characteristics of RSIs. Leveraging this spatial preservation, a Density-Independent Scale Weight (DISW) is designed to address the challenges of dense and small object detection common in RSIs. Secondly, we show implicit knowledge hidden in subtle student-teacher feature discrepancies, which significantly influence predictions when activated by detection heads. This implicit knowledge is extracted via full-frequency and high-frequency amplifiers, which map feature differences to prediction deviations. Extensive experiments on DIOR and DOTA datasets validate the effectiveness of the proposed method. Specifically, on DIOR dataset, DS2D2 achieves improvements of 4.2% in AP50 for RetinaNet and 3.8% in AP50 for Faster R-CNN, outperforming existing distillation approaches. The source code will be available at this https URL.
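
To make the "first-order wavelet transform" step concrete, here is a minimal single-level 2D decomposition. The abstract does not name the wavelet, so the Haar basis is an assumption, and `haar_level1` is a hypothetical helper rather than code from the paper. Each output sub-band has half the input resolution, which is why spatial layout is preserved.

```python
def haar_level1(image):
    """Single-level 2D Haar decomposition of a grid with even height and width.

    Returns four half-resolution sub-bands:
      approx      - low-pass average of each 2x2 block
      col_detail  - difference between the left and right columns of the block
      row_detail  - difference between the top and bottom rows of the block
      diag_detail - diagonal difference
    The 1/2 factor makes the transform orthonormal (energy preserving).
    """
    approx, col_detail, row_detail, diag_detail = [], [], [], []
    for i in range(0, len(image), 2):
        rows = ([], [], [], [])
        for j in range(0, len(image[0]), 2):
            a, b = image[i][j], image[i][j + 1]
            c, d = image[i + 1][j], image[i + 1][j + 1]
            rows[0].append((a + b + c + d) / 2)
            rows[1].append((a - b + c - d) / 2)
            rows[2].append((a + b - c - d) / 2)
            rows[3].append((a - b - c + d) / 2)
        approx.append(rows[0])
        col_detail.append(rows[1])
        row_detail.append(rows[2])
        diag_detail.append(rows[3])
    return approx, col_detail, row_detail, diag_detail
```

In a distillation setting, the low-frequency and high-frequency sub-bands of teacher and student feature maps could then be compared separately; how DS2D2 weights them is described in the paper, not here.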

[CV-102] Performance Evaluation of Transfer Learning Based Medical Image Classification Techniques for Disease Detection

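[Quick read]: This paper addresses the difficulty of training large deep learning models from scratch for medical image classification when data are limited. The key to the solution is transfer learning (TL): deep convolutional neural networks pre-trained on large general-purpose datasets (AlexNet, VGG16, the ResNet family, and InceptionV3) are reused, and their feature-extraction ability is transferred to a specific medical image classification task. The experiments indicate that this strategy is especially effective with small samples, and that performance depends on model architecture, the similarity between source and target domains, and dataset size. The study also finds that a well-trained feature extractor paired with a lightweight feedforward network is enough for efficient and accurate prediction, which offers a scalable route for practical use.

Link: https://arxiv.org/abs/2512.04397
Authors: Zeeshan Ahmad, Shudi Bao, Meng Chen
Affiliations: Ningbo Institute of Digital Twin, Eastern Institute of Technology; School of Cyber Science and Engineering, Ningbo University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: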

Abstract:Medical image classification plays an increasingly vital role in identifying various diseases by classifying medical images, such as X-rays, MRIs and CT scans, into different categories based on their features. In recent years, deep learning techniques have attracted significant attention in medical image classification. However, it is usually infeasible to train an entire large deep learning model from scratch. To address this issue, one of the solutions is the transfer learning (TL) technique, where a pre-trained model is reused for a new task. In this paper, we present a comprehensive analysis of TL techniques for medical image classification using deep convolutional neural networks. We evaluate six pre-trained models (AlexNet, VGG16, ResNet18, ResNet34, ResNet50, and InceptionV3) on a custom chest X-ray dataset for disease detection. The experimental results demonstrate that InceptionV3 consistently outperforms other models across all the standard metrics. The ResNet family shows progressively better performance with increasing depth, whereas VGG16 and AlexNet perform reasonably well but with lower accuracy. In addition, we also conduct uncertainty analysis and runtime comparison to assess the robustness and computational efficiency of these models. Our findings reveal that TL is beneficial in most cases, especially with limited data, but the extent of improvement depends on several factors such as model architecture, dataset size, and domain similarity between source and target tasks. Moreover, we demonstrate that with a well-trained feature extractor, only a lightweight feedforward model is enough to provide efficient prediction. As such, this study contributes to the understanding of TL in medical image classification, and provides insights for selecting appropriate models based on specific requirements.

[CV-103] Fourier-Attentive Representation Learning: A Fourier-Guided Framework for Few-Shot Generalization in Vision-Language Models

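[Quick read]: This paper addresses the limited few-shot generalization of large-scale pre-trained Vision-Language Models (VLMs), which arises because an image's domain-invariant structure is implicitly entangled with its domain-specific style. The key to the solution is Fourier-Attentive Representation Learning (FARL), which explicitly disentangles the representation using Fourier analysis. A dual cross-attention mechanism lets learnable representation tokens separately query the phase spectrum (structural features) and the amplitude spectrum (stylistic features) of an image. The resulting enriched, disentangled tokens are injected deep into the VLM encoders with an asymmetric injection strategy to encourage a more robust vision-language alignment.

Link: https://arxiv.org/abs/2512.04395
Authors: Hieu Dinh Trung Pham, Huy Minh Nhat Nguyen, Cuong Tuan Nguyen
Affiliations: Vietnamese German University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: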

Abstract:Large-scale pre-trained Vision-Language Models (VLMs) have demonstrated strong few-shot learning capabilities. However, these methods typically learn holistic representations where an image’s domain-invariant structure is implicitly entangled with its domain-specific style. This presents an opportunity to further enhance generalization by disentangling these visual cues. In this paper, we propose Fourier-Attentive Representation Learning (FARL), a novel framework that addresses this by explicitly disentangling visual representations using Fourier analysis. The core of our method is a dual cross-attention mechanism, where learnable representation tokens separately query an image’s structural features (from the phase spectrum) and stylistic features (from the amplitude spectrum). This process yields enriched, disentangled tokens that are then injected deep into the VLM encoders to guide adaptation. Our design, which includes an asymmetric injection strategy, forces the model to learn a more robust vision-language alignment. Extensive experiments on 15 datasets demonstrate the effectiveness of our approach.
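
The split into a phase spectrum (structure) and an amplitude spectrum (style) can be illustrated on a tiny grayscale grid. This is a sketch of the Fourier decomposition only, using a naive DFT that is practical just for very small inputs; the function names are hypothetical, and the cross-attention that FARL applies to the two spectra is not shown.

```python
import cmath


def _dft2(grid, sign):
    """Naive 2D discrete Fourier sum; sign=-1 is forward, sign=+1 is inverse (unscaled)."""
    h, w = len(grid), len(grid[0])
    out = []
    for u in range(h):
        row = []
        for v in range(w):
            total = 0j
            for y in range(h):
                for x in range(w):
                    angle = sign * 2 * cmath.pi * (u * y / h + v * x / w)
                    total += grid[y][x] * cmath.exp(1j * angle)
            row.append(total)
        out.append(row)
    return out


def split_spectrum(image):
    """Return (amplitude, phase) grids of the image's 2D spectrum."""
    spectrum = _dft2(image, -1)
    amplitude = [[abs(c) for c in row] for row in spectrum]
    phase = [[cmath.phase(c) for c in row] for row in spectrum]
    return amplitude, phase


def merge_spectrum(amplitude, phase):
    """Rebuild a real-valued image from amplitude and phase grids."""
    spectrum = [[cmath.rect(a, p) for a, p in zip(ar, pr)]
                for ar, pr in zip(amplitude, phase)]
    h, w = len(spectrum), len(spectrum[0])
    return [[c.real / (h * w) for c in row] for row in _dft2(spectrum, +1)]
```

Merging one image's phase with another image's amplitude is a common way to check that phase carries most of the layout, which is the intuition behind treating it as the structural cue.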

[CV-104] FMA-Net++: Motion- and Exposure-Aware Real-World Joint Video Super-Resolution and Deblurring

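[Quick read]: This paper addresses the complex degradations in real-world video restoration that come from motion coupled with dynamically varying exposure, a phenomenon largely overlooked by prior work but common in practice, especially under auto-exposure or low-light capture. The key to the solution is the FMA-Net++ framework. It performs sequence-level temporal modeling with Hierarchical Refinement with Bidirectional Propagation blocks, and introduces an Exposure Time-aware Modulation layer and an exposure-aware Flow-Guided Dynamic Filtering module to model the coupling between motion and exposure explicitly. The method also decouples degradation learning from restoration: it first predicts exposure- and motion-aware priors and then uses them to guide restoration, which improves both accuracy and inference efficiency.

Link: https://arxiv.org/abs/2512.04390
Authors: Geunhyuk Youk, Jihyong Oh, Munchurl Kim
Affiliations: KAIST; Chung-Ang University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 20 pages, 15 figures. Project Page: this https URL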

Abstract:Real-world video restoration is plagued by complex degradations from motion coupled with dynamically varying exposure - a key challenge largely overlooked by prior works and a common artifact of auto-exposure or low-light capture. We present FMA-Net++, a framework for joint video super-resolution and deblurring that explicitly models this coupled effect of motion and dynamically varying exposure. FMA-Net++ adopts a sequence-level architecture built from Hierarchical Refinement with Bidirectional Propagation blocks, enabling parallel, long-range temporal modeling. Within each block, an Exposure Time-aware Modulation layer conditions features on per-frame exposure, which in turn drives an exposure-aware Flow-Guided Dynamic Filtering module to infer motion- and exposure-aware degradation kernels. FMA-Net++ decouples degradation learning from restoration: the former predicts exposure- and motion-aware priors to guide the latter, improving both accuracy and efficiency. To evaluate under realistic capture conditions, we introduce REDS-ME (multi-exposure) and REDS-RE (random-exposure) benchmarks. Trained solely on synthetic data, FMA-Net++ achieves state-of-the-art accuracy and temporal consistency on our new benchmarks and GoPro, outperforming recent methods in both restoration quality and inference speed, and generalizes well to challenging real-world videos.

[CV-105] STeP-Diff: Spatio-Temporal Physics-Informed Diffusion Models for Mobile Fine-Grained Pollution Forecasting

[Quick read]: This paper addresses the incomplete and temporally inconsistent air-quality data produced by portable sensors mounted on non-dedicated mobile platforms such as cars and buses, with the goal of fine-grained air pollution forecasting. The key to the solution is Spatio-Temporal Physics-Informed Diffusion Models (STeP-Diff). The model uses DeepONet to model the spatial sequence of measurements and adds a PDE-informed diffusion process: a regularization framework constrained by a partial differential equation (PDE) drives the denoising process to converge asymptotically to convection-diffusion dynamics, so that predictions both fit the measurements and follow the basic physics of pollutant dispersion.

Link: https://arxiv.org/abs/2512.04385
Authors: Nan Zhou, Weijie Hong, Huandong Wang, Jianfeng Zheng, Qiuhua Wang, Yali Song, Xiao-Ping Zhang, Yong Li, Xinlei Chen
Affiliations: Shenzhen International Graduate School, Tsinghua University, China; Department of Electronic Engineering, Beijing National Research Center for Information Science and Technology (BNRist), Tsinghua University, China; Shenzhen Smartcity Communication, Shenzhen, China; College of Soil and Water Conservation, Southwest Forestry University, China; College of Civil Engineering, Southwest Forestry University, China
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Fine-grained air pollution forecasting is crucial for urban management and the development of healthy buildings. Deploying portable sensors on mobile platforms such as cars and buses offers a low-cost, easy-to-maintain, and wide-coverage data collection solution. However, due to the random and uncontrollable movement patterns of these non-dedicated mobile platforms, the resulting sensor data are often incomplete and temporally inconsistent. By exploring potential training patterns in the reverse process of diffusion models, we propose Spatio-Temporal Physics-Informed Diffusion Models (STeP-Diff). STeP-Diff leverages DeepONet to model the spatial sequence of measurements along with a PDE-informed diffusion model to forecast the spatio-temporal field from incomplete and time-varying data. Through a PDE-constrained regularization framework, the denoising process asymptotically converges to the convection-diffusion dynamics, ensuring that predictions are both grounded in real-world measurements and aligned with the fundamental physics governing pollution dispersion. To assess the performance of the system, we deployed 59 self-designed portable sensing devices in two cities, operating for 14 days to collect air pollution data. Compared to the second-best performing algorithm, our model achieved improvements of up to 89.12% in MAE, 82.30% in RMSE, and 25.00% in MAPE, with extensive evaluations demonstrating that STeP-Diff effectively captures the spatio-temporal dependencies in air pollution fields.
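
A PDE-constrained regularizer of the kind the abstract describes can be sketched as a penalty on the convection-diffusion residual. The example below assumes the common 1D form dc/dt + v * dc/dx = D * d2c/dx2 with constant velocity `v` and diffusivity `D`, discretized by finite differences. The exact equation, discretization, and weighting used in STeP-Diff are not given in the abstract, so every name here is hypothetical.

```python
def pde_residual(c_prev, c_next, dt, dx, velocity, diffusivity):
    """Convection-diffusion residual at interior grid points.

    c_prev and c_next are concentration profiles at two consecutive time steps.
    A residual of zero means the pair is consistent with the assumed PDE.
    """
    residual = []
    for i in range(1, len(c_prev) - 1):
        dc_dt = (c_next[i] - c_prev[i]) / dt
        dc_dx = (c_prev[i + 1] - c_prev[i - 1]) / (2 * dx)
        d2c_dx2 = (c_prev[i + 1] - 2 * c_prev[i] + c_prev[i - 1]) / dx ** 2
        residual.append(dc_dt + velocity * dc_dx - diffusivity * d2c_dx2)
    return residual


def physics_penalty(c_prev, c_next, dt, dx, velocity, diffusivity):
    """Mean squared residual, usable as an extra term next to a data-fitting loss."""
    res = pde_residual(c_prev, c_next, dt, dx, velocity, diffusivity)
    return sum(r * r for r in res) / len(res)
```

For example, a linear profile transported one cell to the right in one time step at unit velocity gives zero penalty, while an arbitrary change between the two steps does not.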

[CV-106] MAFNet: Multi-frequency Adaptive Fusion Network for Real-time Stereo Matching

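[Quick read]: This paper addresses the difficulty of balancing efficiency and accuracy when deploying stereo matching networks on resource-constrained mobile devices. Cost-volume construction based on 3D convolutions is computationally expensive, while deformation methods based on iterative optimization struggle to model non-local context, so both fit real-time applications poorly. The key to the solution is the Multi-frequency Adaptive Fusion Network (MAFNet). An adaptive frequency-domain filtering attention module decomposes the full cost volume into high-frequency and low-frequency volumes and performs frequency-aware feature aggregation on each separately. A Linformer-based low-rank attention mechanism then adaptively fuses the high- and low-frequency information, which makes disparity estimation more robust while using only efficient 2D convolutions.

Link: https://arxiv.org/abs/2512.04358
Authors: Ao Xu, Rujin Zhao, Xiong Xu, Boceng Huang, Yujia Jia, Hongfeng Long, Fuxuan Chen, Zilong Cao, Fangyuan Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: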

Abstract:Existing stereo matching networks typically rely on either cost-volume construction based on 3D convolutions or deformation methods based on iterative optimization. The former incurs significant computational overhead during cost aggregation, whereas the latter often lacks the ability to model non-local contextual information. These methods exhibit poor compatibility on resource-constrained mobile devices, limiting their deployment in real-time applications. To address this, we propose a Multi-frequency Adaptive Fusion Network (MAFNet), which can produce high-quality disparity maps using only efficient 2D convolutions. Specifically, we design an adaptive frequency-domain filtering attention module that decomposes the full cost volume into high-frequency and low-frequency volumes, performing frequency-aware feature aggregation separately. Subsequently, we introduce a Linformer-based low-rank attention mechanism to adaptively fuse high- and low-frequency information, yielding more robust disparity estimation. Extensive experiments demonstrate that the proposed MAFNet significantly outperforms existing real-time methods on public datasets such as Scene Flow and KITTI 2015, showing a favorable balance between accuracy and real-time performance.

[CV-107] Open Set Face Forgery Detection via Dual-Level Evidence Collection

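[Quick read]: This paper addresses Open Set Face Forgery Detection (OSFFD), in which the detector must recognize novel fake categories. Conventional methods are usually limited to binary Real-vs-Fake classification or to identifying known fake categories, and cannot cope with newly emerging forgery types. The key to the solution is to reformulate OSFFD through uncertainty estimation and to propose the Dual-Level Evidential face forgery Detection (DLED) approach, which collects and fuses category-specific evidence at the spatial and frequency levels to estimate prediction uncertainty and improve detection of novel forgeries. The reported experiments show DLED outperforming baseline models by an average of 20% on novel forgery detection while remaining competitive on the traditional Real-vs-Fake task.

Link: https://arxiv.org/abs/2512.04331
Authors: Zhongyi Cai, Bryce Gernon, Wentao Bao, Yifan Li, Matthew Wright, Yu Kong
Affiliations: Michigan State University; Rochester Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: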

Abstract:The proliferation of face forgeries has increasingly undermined confidence in the authenticity of online content. Given the rapid development of face forgery generation algorithms, new fake categories are likely to keep appearing, posing a major challenge to existing face forgery detection methods. Despite recent advances in face forgery detection, existing methods are typically limited to binary Real-vs-Fake classification or the identification of known fake categories, and are incapable of detecting the emergence of novel types of forgeries. In this work, we study the Open Set Face Forgery Detection (OSFFD) problem, which demands that the detection model recognize novel fake categories. We reformulate the OSFFD problem and address it through uncertainty estimation, enhancing its applicability to real-world scenarios. Specifically, we propose the Dual-Level Evidential face forgery Detection (DLED) approach, which collects and fuses category-specific evidence on the spatial and frequency levels to estimate prediction uncertainty. Extensive evaluations conducted across diverse experimental settings demonstrate that the proposed DLED method achieves state-of-the-art performance, outperforming various baseline models by an average of 20% in detecting forgeries from novel fake categories. Moreover, on the traditional Real-versus-Fake face forgery detection task, our DLED method concurrently exhibits competitive performance.
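
Evidence-based uncertainty of the kind mentioned above is often computed with the standard evidential (Dirichlet) formulation: per-class evidence e_k gives alpha_k = e_k + 1, and the uncertainty mass is K divided by the sum of alpha. The sketch below uses that textbook form. Fusing the spatial and frequency evidence by simple addition and flagging novelty with a fixed threshold are assumptions made for illustration, not the fusion rule from the paper.

```python
def evidential_summary(evidence):
    """Turn non-negative per-class evidence into class probabilities and uncertainty."""
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]           # Dirichlet parameters
    strength = sum(alpha)                         # Dirichlet strength S
    probabilities = [a / strength for a in alpha]
    uncertainty = num_classes / strength          # equals 1.0 when there is no evidence
    return probabilities, uncertainty


def fuse_by_sum(spatial_evidence, frequency_evidence):
    """Assumed fusion rule: add the evidence collected at the two levels."""
    return [s + f for s, f in zip(spatial_evidence, frequency_evidence)]


def looks_novel(uncertainty, threshold=0.5):
    """Hypothetical decision rule: high uncertainty suggests an unseen fake category."""
    return uncertainty > threshold
```

A sample with strong evidence for one known category yields low uncertainty, whereas a sample that gathers little evidence for any known category is flagged as potentially novel.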

[CV-108] A Retrieval-Augmented Generation Approach to Extracting Algorithmic Logic from Neural Networks

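[Quick read]: This paper addresses the difficulty of discovering, extracting, and validating reusable neural-network components across thousands of open-source repositories, which slows research. The core challenge is to systematically identify reusable, runnable, and structurally unique neural modules in large, heterogeneous PyTorch codebases. The key to the solution is the NN-RAG framework, which uses scope-aware dependency resolution, import-preserving reconstruction, and validator-gated promotion to ensure that every retrieved block is scope-closed, compilable, and runnable. Multi-level de-duplication (exact, lexical, and structural) is used to measure how many modules are unique, and the system supports cross-repository migration of architectural patterns. The authors describe the result as a validated, provenance-tracked library of neural modules that supports architecture exploration.

Link: https://arxiv.org/abs/2512.04329
Authors: Waleed Khalid, Dmitry Ignatov, Radu Timofte
Affiliations: University of Würzburg
Categories: Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
Comments: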

Abstract:Reusing existing neural-network components is central to research efficiency, yet discovering, extracting, and validating such modules across thousands of open-source repositories remains difficult. We introduce NN-RAG, a retrieval-augmented generation system that converts large, heterogeneous PyTorch codebases into a searchable and executable library of validated neural modules. Unlike conventional code search or clone-detection tools, NN-RAG performs scope-aware dependency resolution, import-preserving reconstruction, and validator-gated promotion – ensuring that every retrieved block is scope-closed, compilable, and runnable. Applied to 19 major repositories, the pipeline extracted 1,289 candidate blocks, validated 941 (73.0%), and demonstrated that over 80% are structurally unique. Through multi-level de-duplication (exact, lexical, structural), we find that NN-RAG contributes the overwhelming majority of unique architectures to the LEMUR dataset, supplying approximately 72% of all novel network structures. Beyond quantity, NN-RAG uniquely enables cross-repository migration of architectural patterns, automatically identifying reusable modules in one project and regenerating them, dependency-complete, in another context. To our knowledge, no other open-source system provides this capability at scale. The framework’s neutral specifications further allow optional integration with language models for synthesis or dataset registration without redistributing third-party code. Overall, NN-RAG transforms fragmented vision code into a reproducible, provenance-tracked substrate for algorithmic discovery, offering a first open-source solution that both quantifies and expands the diversity of executable neural architectures across repositories.
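
The three de-duplication levels named in the abstract (exact, lexical, structural) can be approximated with Python's standard library. This is a rough sketch under stated assumptions, not NN-RAG's implementation: the exact key hashes raw source, the lexical key hashes the token stream without comments or blank lines, and the structural key hashes an AST dump in which function, class, variable, and argument names are replaced by a placeholder.

```python
import ast
import hashlib
import io
import tokenize


def _digest(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def exact_key(source):
    """Identical only if the source text is byte-for-byte the same."""
    return _digest(source)


def lexical_key(source):
    """Ignores comments, blank lines, and the exact indentation width."""
    layout = (tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT)
    skipped = (tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER)
    parts = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in skipped:
            continue
        parts.append(tokenize.tok_name[tok.type] if tok.type in layout else tok.string)
    return _digest(" ".join(parts))


def structural_key(source):
    """Ignores identifier names; keeps operators, constants, and attribute names."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            node.id = "_"
        elif isinstance(node, ast.arg):
            node.arg = "_"
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            node.name = "_"
    return _digest(ast.dump(tree))


def unique_by(sources, key):
    """Keep the first source seen for each key value."""
    seen, kept = set(), []
    for src in sources:
        k = key(src)
        if k not in seen:
            seen.add(k)
            kept.append(src)
    return kept
```

Applying `unique_by` with each key in turn gives progressively stricter notions of a duplicate; mapping every name to the same placeholder is deliberately coarse and would merge some blocks that a more careful structural comparison would keep apart.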

[CV-109] Bayes-DIC Net: Estimating Digital Image Correlation Uncertainty with Bayesian Neural Networks

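[Quick read]: This paper addresses the scarcity of training data for deep learning algorithms in Digital Image Correlation (DIC) and the limited reliability of their predictions. The solution has two parts. First, it proposes a DIC dataset generation method based on non-uniform B-spline surfaces: control point coordinates are generated randomly to build diverse displacement fields, which are then used to generate speckle pattern datasets that reflect realistic displacement scenarios, improving training and generalization of deep learning models. Second, it designs a network architecture, Bayes-DIC Net, which extracts multi-level features during down-sampling and aggregates them through a single skip connection during up-sampling, with lightweight convolutional blocks that expand the receptive field and retain contextual information. Dropout modules are kept active at inference, which gives the network Bayesian behavior: it outputs both a prediction and a confidence level, making it more reliable and practical for real displacement field prediction.

Link: https://arxiv.org/abs/2512.04323
Authors: Biao Chen, Zhenhua Lei, Yahui Zhang, Tongzhi Niu
Affiliations: University of Michigan; Huazhong University of Science and Technology; Northwestern University; Hunan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments: 17 pages, 8 figures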

Abstract:This paper introduces a novel method for generating high-quality Digital Image Correlation (DIC) datasets based on non-uniform B-spline surfaces. By randomly generating control point coordinates, we construct displacement fields that encompass a variety of realistic displacement scenarios, which are subsequently used to generate speckle pattern datasets. This approach enables the generation of a large-scale dataset that captures real-world displacement field situations, thereby enhancing the training and generalization capabilities of deep learning-based DIC algorithms. Additionally, we propose a novel network architecture, termed Bayes-DIC Net, which extracts information at multiple levels during the down-sampling phase and facilitates the aggregation of information across various levels through a single skip connection during the up-sampling phase. Bayes-DIC Net incorporates a series of lightweight convolutional blocks designed to expand the receptive field and capture rich contextual information while minimizing computational costs. Furthermore, by integrating appropriate dropout modules into Bayes-DIC Net and activating them during the network inference stage, Bayes-DIC Net is transformed into a Bayesian neural network. This transformation allows the network to provide not only predictive results but also confidence levels in these predictions when processing real unlabeled datasets. This feature significantly enhances the practicality and reliability of our network in real-world displacement field prediction tasks. Through these innovations, this paper offers new perspectives and methods for dataset generation and algorithm performance enhancement in the field of DIC.
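
Keeping dropout active at inference and repeating the forward pass is usually called Monte Carlo dropout: the spread of the repeated predictions serves as a confidence estimate. The sketch below shows the idea on a toy one-layer linear model rather than on Bayes-DIC Net itself. The model, the drop rate, and the number of passes are all illustrative assumptions.

```python
import random
import statistics


def dropout_forward(inputs, weights, p_drop, rng):
    """One stochastic forward pass of a linear unit with inverted dropout on its inputs."""
    total = 0.0
    for x, w in zip(inputs, weights):
        if rng.random() >= p_drop:            # this input survives the dropout mask
            total += x * w / (1.0 - p_drop)   # rescale so the expected output is unchanged
    return total


def mc_dropout_predict(inputs, weights, p_drop=0.2, passes=200, seed=0):
    """Return (mean prediction, standard deviation) over repeated stochastic passes."""
    rng = random.Random(seed)
    samples = [dropout_forward(inputs, weights, p_drop, rng) for _ in range(passes)]
    return statistics.fmean(samples), statistics.pstdev(samples)
```

With `p_drop=0.0` the passes are identical and the reported spread is zero; a larger spread at a given pixel or displacement value would be read as lower confidence.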

[CV-110] SyncTrack4D: Cross-Video Motion Alignment and Video Synchronization for Multi-Video 4D Gaussian Splatting

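[Quick read]: This paper addresses the reconstruction of dynamic 3D scenes (4D scenes) from unsynchronized multi-view videos. The core challenge is to fuse information from multiple videos without temporal alignment and still recover time-varying 3D geometry and motion trajectories accurately. The key to the solution is SyncTrack4D. It computes dense 4D feature tracks for each video and establishes cross-video track correspondences with Fused Gromov-Wasserstein optimal transport, then performs global frame-level temporal alignment. It then achieves sub-frame synchronization and high-fidelity 4D reconstruction through multi-video 4D Gaussian Splatting built on a motion-spline scaffold. The final output is a unified 4D representation with explicit 3D trajectories and a temporal offset for each video.

Link: https://arxiv.org/abs/2512.04315
Authors: Yonghan Lee, Tsung-Wei Huang, Shiv Gehlot, Jaehoon Choi, Guan-Ming Su, Dinesh Manocha
Affiliations: University of Maryland, College Park; Dolby Laboratories
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: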

Abstract:Modeling dynamic 3D scenes is challenging due to their high-dimensional nature, which requires aggregating information from multiple views to reconstruct time-evolving 3D geometry and motion. We present a novel multi-video 4D Gaussian Splatting (4DGS) approach designed to handle real-world, unsynchronized video sets. Our approach, SyncTrack4D, directly leverages dense 4D track representation of dynamic scene parts as cues for simultaneous cross-video synchronization and 4DGS reconstruction. We first compute dense per-video 4D feature tracks and cross-video track correspondences by Fused Gromov-Wasserstein optimal transport approach. Next, we perform global frame-level temporal alignment to maximize overlapping motion of matched 4D tracks. Finally, we achieve sub-frame synchronization through our multi-video 4D Gaussian splatting built upon a motion-spline scaffold representation. The final output is a synchronized 4DGS representation with dense, explicit 3D trajectories, and temporal offsets for each video. We evaluate our approach on the Panoptic Studio and SyncNeRF Blender, demonstrating sub-frame synchronization accuracy with an average temporal error below 0.26 frames, and high-fidelity 4D reconstruction reaching 26.3 PSNR scores on the Panoptic Studio dataset. To the best of our knowledge, our work is the first general 4D Gaussian Splatting approach for unsynchronized video sets, without assuming the existence of predefined scene objects or prior models.
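
The frame-level alignment step can be pictured as a search over integer frame offsets that best overlays the motion of matched tracks. The toy version below reduces each video to a single 1D motion-magnitude signal and scores offsets by mean squared difference. The signal representation, the cost, and the exhaustive search are simplifying assumptions; the paper works with matched 4D tracks and follows this stage with sub-frame refinement.

```python
def best_frame_offset(ref, other, max_shift, min_overlap=2):
    """Find the integer shift s that best aligns other[t + s] with ref[t].

    A positive result means `other` lags `ref` by that many frames.
    Returns (shift, cost), where cost is the mean squared difference on the overlap.
    """
    best_shift, best_cost = None, float("inf")
    for shift in range(-max_shift, max_shift + 1):
        pairs = [(ref[t], other[t + shift])
                 for t in range(len(ref))
                 if 0 <= t + shift < len(other)]
        if len(pairs) < min_overlap:
            continue  # too little overlap to trust the score
        cost = sum((a - b) ** 2 for a, b in pairs) / len(pairs)
        if cost < best_cost:
            best_shift, best_cost = shift, cost
    return best_shift, best_cost
```

The recovered integer offset would then serve only as an initial estimate, since the abstract reports synchronization errors well below one frame.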

[CV-111] DisentangleFormer: Spatial-Channel Decoupling for Multi-Channel Vision

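[Quick read]: This paper addresses the entangled representations that arise in Vision Transformers because the spatial and channel dimensions are processed jointly, which prevents independent modeling of structural and semantic dependencies and is especially pronounced in hyperspectral imaging. The key to the solution is the DisentangleFormer architecture, which achieves spatial-channel decoupling through three core components: (1) Parallel Disentanglement, which processes spatial-token and channel-token streams independently to learn decorrelated features across spatial and spectral dimensions; (2) a Squeezed Token Enhancer, which dynamically fuses the two streams to minimize redundancy; and (3) a Multi-Scale FFN, which complements global attention and captures fine-grained structural and semantic dependencies. The paper reports state-of-the-art performance on several hyperspectral benchmarks, and competitive ImageNet accuracy with 17.8% fewer FLOPs.

Link: https://arxiv.org/abs/2512.04314
Authors: Jiashu Liao, Pietro Liò, Marc de Kamps, Duygu Sarikaya
Affiliations: University of Glasgow; University of Leeds; University of Cambridge
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: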

Abstract:Vision Transformers face a fundamental limitation: standard self-attention jointly processes spatial and channel dimensions, leading to entangled representations that prevent independent modeling of structural and semantic dependencies. This problem is especially pronounced in hyperspectral imaging, from satellite hyperspectral remote sensing to infrared pathology imaging, where channels capture distinct biophysical or biochemical cues. We propose DisentangleFormer, an architecture that achieves robust multi-channel vision representation through principled spatial-channel decoupling. Motivated by information-theoretic principles of decorrelated representation learning, our parallel design enables independent modeling of structural and semantic cues while minimizing redundancy between spatial and channel streams. Our design integrates three core components: (1) Parallel Disentanglement: Independently processes spatial-token and channel-token streams, enabling decorrelated feature learning across spatial and spectral dimensions, (2) Squeezed Token Enhancer: An adaptive calibration module that dynamically fuses spatial and channel streams, and (3) Multi-Scale FFN: complementing global attention with multi-scale local context to capture fine-grained structural and semantic dependencies. Extensive experiments on hyperspectral benchmarks demonstrate that DisentangleFormer achieves state-of-the-art performance, consistently outperforming existing models on Indian Pine, Pavia University, and Houston, the large-scale BigEarthNet remote sensing dataset, as well as an infrared pathology dataset. Moreover, it retains competitive accuracy on ImageNet while reducing computational cost by 17.8% in FLOPs. The code will be made publicly available upon acceptance.

[CV-112] Mind-to-Face: Neural-Driven Photorealistic Avatar Synthesis via EEG Decoding

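[Quick read]: This paper addresses the heavy reliance of current expressive avatar systems on visual cues, which fail when the face is occluded or emotions remain internal. The core of the solution is the Mind-to-Face framework, which the authors present as the first to decode non-invasive electroencephalogram (EEG) signals directly into high-fidelity facial expressions. The key contributions are a dual-modality recording setup that captures synchronized EEG and multi-view facial video during emotion-eliciting stimuli to provide precise supervision, and a CNN-Transformer encoder that maps EEG into dense 3D position maps with more than 65k vertices, capturing subtle emotional dynamics and fine geometry. A modified 3D Gaussian Splatting pipeline then renders photorealistic, view-consistent results. The findings suggest that neural signals carry richer affective and geometric information than previously assumed.

Link: https://arxiv.org/abs/2512.04313
Authors: Haolin Xiong, Tianwen Fu, Pratusha Bhuvana Prasad, Yunxuan Cai, Haiwei Chen, Wenbin Teng, Hanyuan Xiao, Yajie Zhao
Affiliations: Institute for Creative Technologies; University of Southern California
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 11 figures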

Abstract:Current expressive avatar systems rely heavily on visual cues, failing when faces are occluded or when emotions remain internal. We present Mind-to-Face, the first framework that decodes non-invasive electroencephalogram (EEG) signals directly into high-fidelity facial expressions. We build a dual-modality recording setup to obtain synchronized EEG and multi-view facial video during emotion-eliciting stimuli, enabling precise supervision for neural-to-visual learning. Our model uses a CNN-Transformer encoder to map EEG signals into dense 3D position maps, capable of sampling over 65k vertices, capturing fine-scale geometry and subtle emotional dynamics, and renders them through a modified 3D Gaussian Splatting pipeline for photorealistic, view-consistent results. Through extensive evaluation, we show that EEG alone can reliably predict dynamic, subject-specific facial expressions, including subtle emotional responses, demonstrating that neural signals contain far richer affective and geometric information than previously assumed. Mind-to-Face establishes a new paradigm for neural-driven avatars, enabling personalized, emotion-aware telepresence and cognitive interaction in immersive environments.

[CV-113] Real-time Cricket Sorting By Sex

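【Quick read】: This paper addresses the lack of automated sex-sorting technology in insect farming, with the aim of improving the efficiency and sustainability of industrial production of the house cricket (Acheta domesticus). Farms currently rear mixed-sex populations without separating the sexes, which limits selective breeding, optimized reproduction ratios, and nutritional differentiation. The key to the solution is a low-cost, real-time automated sorting system built around a Raspberry Pi 5 computer vision module and a physical actuator. It combines a custom YOLOv8 nano object detection model with a servo-driven sorting arm and achieves accurate recognition and actuation on a resource-constrained device: mean Average Precision (mAP@0.5) reached 0.977 in testing, and overall sorting accuracy was 86.8% in real-world trials, demonstrating that lightweight deep learning models are feasible and practical for smart insect farming.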

Link: https://arxiv.org/abs/2512.04311
Authors: Juan Manuel Cantarero Angulo, Matthew Smith
Affiliations: Illinois Institute of Technology; Universidad Politécnica de Madrid
Categories: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
Comments: 13 pages, 14 figures

Abstract:The global demand for sustainable protein sources is driving increasing interest in edible insects, with Acheta domesticus (house cricket) identified as one of the most suitable species for industrial production. Current farming practices typically rear crickets in mixed-sex populations without automated sex sorting, despite potential benefits such as selective breeding, optimized reproduction ratios, and nutritional differentiation. This work presents a low-cost, real-time system for automated sex-based sorting of Acheta domesticus, combining computer vision and physical actuation. The device integrates a Raspberry Pi 5 with the official Raspberry AI Camera and a custom YOLOv8 nano object detection model, together with a servo-actuated sorting arm. The model reached a mean Average Precision at IoU 0.5 (mAP@0.5) of 0.977 during testing, and real-world experiments with groups of crickets achieved an overall sorting accuracy of 86.8%. These results demonstrate the feasibility of deploying lightweight deep learning models on resource-constrained devices for insect farming applications, offering a practical solution to improve efficiency and sustainability in cricket production.

[CV-114] How (Mis)calibrated is Your Federated CLIP and What To Do About It?

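【Quick read】: This paper addresses the calibration of vision-language models such as CLIP under federated learning (FL), that is, how to keep predicted probabilities reliable during distributed training. Prior work has mostly studied CLIP calibration in offline settings, and calibration behavior under FL has remained unclear. The paper shows that Textual Prompt Tuning and existing in-training calibration techniques are of limited use in the FL setting, and that the underlying challenge is which model components to fine-tune, not only the aggregation strategy or calibration method. The authors therefore propose $\text{FL}^2\text{oRA}$, a lightweight fine-tuning method based on LoRA (Low-Rank Adaptation) that selectively adapts specific layers to strengthen the model's intrinsic calibration under FL. Experiments show that it markedly improves calibration and reduces the need for explicit calibration steps.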

Link: https://arxiv.org/abs/2512.04305
Authors: Mainak Singha, Masih Aminbeidokhti, Paolo Casari, Elisa Ricci, Subhankar Roy
Affiliations: University of Trento, Italy; École de technologie supérieure, QC, Canada; CNIT, Italy; Fondazione Bruno Kessler, Italy; University of Bergamo, Italy
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:While vision-language models like CLIP have been extensively studied, their calibration, crucial for reliable predictions, has received limited attention. Although a few prior works have examined CLIP calibration in offline settings, the impact of fine-tuning CLIP in a federated learning (FL) setup remains unexplored. In this work, we investigate how FL affects CLIP calibration and propose strategies to improve reliability in this distributed setting. We first analyze Textual Prompt Tuning approaches and show that they degrade calibration metrics when operating under FL. We also evaluate existing in-training calibration techniques across four global aggregation methods, finding that they provide limited improvements. Our results suggest that the key challenge lies not only in how we aggregate or calibrate, but in which components we choose to fine-tune. Motivated by this insight, we propose $\text{FL}^2\text{oRA}$, a straightforward LoRA-based approach that naturally improves calibration in FL, and we analyze the factors behind its effectiveness. Experiments on multiple benchmarks demonstrate that $\text{FL}^2\text{oRA}$ consistently produces well-calibrated models, reducing the need for explicit calibration procedures. Codes are available at this https URL.
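
The abstract refers to calibration metrics without naming one. A common choice is the Expected Calibration Error (ECE), which is assumed here for illustration; the paper may report other metrics. The minimal sketch below uses equal-width confidence bins, and the function name and `n_bins` default are hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean gap between accuracy and mean confidence.

    confidences: top-class probabilities in [0, 1], one per prediction.
    correct:     booleans, True where the prediction matched the label.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Four predictions made at 90% confidence, three of them correct.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

A well-calibrated model has an ECE near zero: among predictions made with confidence p, a fraction of about p is correct.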

[CV-115] Gamma-from-Mono: Road-Relative Metric Self-Supervised Monocular Geometry for Vehicular Applications

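【Quick read】: This paper addresses the poor perception of fine-scale road geometry (bumps, slopes, surface irregularities) in monocular depth estimation; such information is essential for vehicle motion planning and stability control. Conventional methods tend to oversmooth and lose key detail, which degrades control safety. The key to the solution is Gamma-from-Mono (GfM), a lightweight monocular geometry estimation method that resolves the projective ambiguity of single-camera reconstruction by decoupling global and local structure. GfM predicts the dominant road plane and expresses local residual variation as gamma, a dimensionless vertical deviation defined as the ratio of a point's height above the plane to its depth from the camera. Given only the known camera height above ground, metric depth is recovered deterministically in closed form, without full extrinsic calibration and with a natural emphasis on near-field detail. This physically interpretable formulation suits self-supervised learning, avoids dependence on large annotated datasets, and achieves leading near-field accuracy on KITTI and RSRD.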

Link: https://arxiv.org/abs/2512.04303
Authors: Gasser Elazab, Maximilian Jansen, Michael Unterreiner, Olaf Hellwich
Affiliations: CARIAD SE; Technische Universität Berlin
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted in 3DV 2026

Abstract:Accurate perception of the vehicle’s 3D surroundings, including fine-scale road geometry, such as bumps, slopes, and surface irregularities, is essential for safe and comfortable vehicle control. However, conventional monocular depth estimation often oversmooths these features, losing critical information for motion planning and stability. To address this, we introduce Gamma-from-Mono (GfM), a lightweight monocular geometry estimation method that resolves the projective ambiguity in single-camera reconstruction by decoupling global and local structure. GfM predicts a dominant road surface plane together with residual variations expressed by gamma, a dimensionless measure of vertical deviation from the plane, defined as the ratio of a point’s height above it to its depth from the camera, and grounded in established planar parallax geometry. With only the camera’s height above ground, this representation deterministically recovers metric depth via a closed form, avoiding full extrinsic calibration and naturally prioritizing near-road detail. Its physically interpretable formulation makes it well suited for self-supervised learning, eliminating the need for large annotated datasets. Evaluated on KITTI and the Road Surface Reconstruction Dataset (RSRD), GfM achieves state-of-the-art near-field accuracy in both depth and gamma estimation while maintaining competitive global depth performance. Our lightweight 8.88M-parameter model adapts robustly across diverse camera setups and, to our knowledge, is the first self-supervised monocular approach evaluated on RSRD.
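
The closed-form depth recovery mentioned in the abstract can be illustrated with the standard plane-plus-parallax relation. This is a sketch under stated assumptions, not necessarily the paper's exact formulation: a pinhole camera with intrinsics `fx, fy, cx, cy`, a unit road-plane normal `normal` expressed in camera coordinates and pointing from the camera toward the road, and a road plane satisfying `normal · X = cam_height`. A point at depth Z on the pixel ray r has height H = cam_height − Z (normal · r) above the plane, so with gamma = H / Z the depth is Z = cam_height / (gamma + normal · r). All names are hypothetical.

```python
def depth_from_gamma(u, v, gamma, cam_height, normal, fx, fy, cx, cy):
    """Recover metric depth Z at pixel (u, v) from gamma and the camera height.

    Assumes (for illustration) that points X on the road plane satisfy
    normal . X = cam_height in camera coordinates.
    """
    # Back-project the pixel to a ray with unit depth (z = 1).
    ray = ((u - cx) / fx, (v - cy) / fy, 1.0)
    n_dot_ray = sum(n * r for n, r in zip(normal, ray))
    denom = gamma + n_dot_ray
    if denom <= 0.0:
        # The ray never reaches this height offset in front of the camera.
        raise ValueError("no valid depth for this pixel and gamma")
    return cam_height / denom


# Camera 1.5 m above a flat road, y axis pointing down, so normal = (0, 1, 0).
z_road = depth_from_gamma(640, 510, 0.0, 1.5, (0.0, 1.0, 0.0), 1000, 1000, 640, 360)
z_bump = depth_from_gamma(640, 500, 0.01, 1.5, (0.0, 1.0, 0.0), 1000, 1000, 640, 360)
```

With gamma set to zero the expression reduces to ordinary ray-plane intersection, which is why only the camera height is needed to fix the metric scale.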

[CV-116] Learning Single-Image Super-Resolution in the JPEG Compressed Domain ICIP2025

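【Quick read】: This paper addresses the performance bottleneck caused by inefficient data loading when training deep learning models, particularly in single-image super-resolution (SISR), where conventional methods must fully decode JPEG images at significant computational cost. The key to the solution is a lightweight super-resolution pipeline trained directly on JPEG discrete cosine transform (DCT) coefficients in the frequency domain, which avoids full JPEG decoding and yields a 2.6x speedup in data loading and a 2.5x speedup in training while keeping visual quality comparable to standard SISR methods.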

Link: https://arxiv.org/abs/2512.04284
Authors: Sruthi Srinivasan, Elham Shakibapour, Rajy Rawther, Mehdi Saeedi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 7 pages, 4 figures, 2 tables, SEEDS Workshop, ICIP 2025

Abstract:Deep learning models have grown increasingly complex, with input data sizes scaling accordingly. Despite substantial advances in specialized deep learning hardware, data loading continues to be a major bottleneck that limits training and inference speed. To address this challenge, we propose training models directly on encoded JPEG features, reducing the computational overhead associated with full JPEG decoding and significantly improving data loading efficiency. While prior works have focused on recognition tasks, we investigate the effectiveness of this approach for the restoration task of single-image super-resolution (SISR). We present a lightweight super-resolution pipeline that operates on JPEG discrete cosine transform (DCT) coefficients in the frequency domain. Our pipeline achieves a 2.6x speedup in data loading and a 2.5x speedup in training, while preserving visual quality comparable to standard SISR approaches.
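
For readers unfamiliar with what JPEG DCT coefficients are, the sketch below computes the 8x8 type-II DCT that baseline JPEG applies to each block. It only illustrates the representation: a pipeline like the one described would read already-encoded coefficients from the file instead of computing them, and the level shift, quantization, and entropy coding that JPEG also performs are omitted. The function name is hypothetical.

```python
import math


def dct2_8x8(block):
    """2D type-II DCT of an 8x8 block, using the JPEG normalization."""

    def c(k):
        # Scale factor that makes the transform orthonormal.
        return 1.0 / math.sqrt(2.0) if k == 0 else 1.0

    out = [[0.0] * 8 for _ in range(8)]
    for u in range(8):
        for v in range(8):
            total = 0.0
            for x in range(8):
                for y in range(8):
                    total += (
                        block[x][y]
                        * math.cos((2 * x + 1) * u * math.pi / 16)
                        * math.cos((2 * y + 1) * v * math.pi / 16)
                    )
            out[u][v] = 0.25 * c(u) * c(v) * total
    return out


# A flat block has all of its energy in the DC coefficient out[0][0].
coeffs = dct2_8x8([[10.0] * 8 for _ in range(8)])
```

Skipping the inverse of this transform (and the color conversion that follows it) is the part of decoding that training on coefficients avoids.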

[CV-117] Plug-and-Play Image Restoration with Flow Matching: A Continuous Viewpoint

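【Quick read】: This paper addresses the lack of theoretical grounding for plug-and-play flow matching (PnP-Flow) models in image restoration, in particular their unclear error sources and vague directions for improvement. The key to the solution is deriving the continuous limit of PnP-Flow, which gives a stochastic differential equation (SDE) surrogate model that explains its behavior theoretically. First, the SDE model quantifies the restoration error, which guides improved step scheduling and regularization of the Lipschitz constant of the neural-network-parameterized vector field to reduce error. Second, it motivates accelerating off-the-shelf PnP-Flow models via extrapolation, producing a rescaled version of the SDE model that markedly improves computational efficiency and restoration quality.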

Link: https://arxiv.org/abs/2512.04283
Authors: Fan Jia, Yuhao Huang, Shih-Hsin Wang, Cristina Garcia-Cardona, Andrea L. Bertozzi, Bao Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Flow matching-based generative models have been integrated into the plug-and-play image restoration framework, and the resulting plug-and-play flow matching (PnP-Flow) model has achieved some remarkable empirical success for image restoration. However, the theoretical understanding of PnP-Flow lags its empirical success. In this paper, we derive a continuous limit for PnP-Flow, resulting in a stochastic differential equation (SDE) surrogate model of PnP-Flow. The SDE model provides two particular insights to improve PnP-Flow for image restoration: (1) It enables us to quantify the error for image restoration, informing us to improve step scheduling and regularize the Lipschitz constant of the neural network-parameterized vector field for error reduction. (2) It informs us to accelerate off-the-shelf PnP-Flow models via extrapolation, resulting in a rescaled version of the proposed SDE model. We validate the efficacy of the SDE-informed improved PnP-Flow using several benchmark tasks, including image denoising, deblurring, super-resolution, and inpainting. Numerical results show that our method significantly outperforms the baseline PnP-Flow and other state-of-the-art approaches, achieving superior performance across evaluation metrics.

[CV-118] Inference-time Stochastic Refinement of GRU-Normalizing Flow for Real-time Video Motion Transfer

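【Quick read】: This paper addresses the limited diversity of future predictions in real-time video motion transfer, especially in settings such as immersive gaming and vision-based anomaly detection, which need accurate yet diverse future trajectories to support realistic synthesis and robust decision making under uncertainty. Existing methods such as Gated Recurrent Unit-Normalizing Flows (GRU-NF) can model multimodal distributions through normalizing flows, but their deterministic transformation structure limits expressivity and hence sample diversity. The key to the solution is injecting stochasticity at inference time: Markov Chain Monte Carlo (MCMC) steps are embedded in GRU-NF inference to form Gated Recurrent Unit-Stochastic Normalizing Flows (GRU-SNF), which explores a richer output space without retraining, better approximates the true data distribution, and markedly improves diversity over long prediction horizons.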

Link: https://arxiv.org/abs/2512.04282
Authors: Tasmiah Haque, Srinjoy Das
Affiliations: West Virginia University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Real-time video motion transfer applications such as immersive gaming and vision-based anomaly detection require accurate yet diverse future predictions to support realistic synthesis and robust downstream decision making under uncertainty. To improve the diversity of such sequential forecasts, we propose a novel inference-time refinement technique that combines Gated Recurrent Unit-Normalizing Flows (GRU-NF) with stochastic sampling methods. While GRU-NF can capture multimodal distributions through its integration of normalizing flows within a temporal forecasting framework, its deterministic transformation structure can limit expressivity. To address this, inspired by Stochastic Normalizing Flows (SNF), we introduce Markov Chain Monte Carlo (MCMC) steps during GRU-NF inference, enabling the model to explore a richer output space and better approximate the true data distribution without retraining. We validate our approach in a keypoint-based video motion transfer pipeline, where capturing temporally coherent and perceptually diverse future trajectories is essential for realistic samples and low bandwidth communication. Experiments show that our inference framework, Gated Recurrent Unit-Stochastic Normalizing Flows (GRU-SNF), outperforms GRU-NF in generating diverse outputs without sacrificing accuracy, even under longer prediction horizons. By injecting stochasticity during inference, our approach captures multimodal behavior more effectively. These results highlight the potential of integrating stochastic dynamics with flow-based sequence models for generative time series forecasting.

[CV-119] UniLight: A Unified Representation for Lighting

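【Quick read】: This paper addresses the incompatibility of lighting representations across modalities: different descriptions of lighting (environment maps, irradiance, spherical harmonics, text) are hard to model jointly or transfer between. The key to the solution is UniLight, a joint latent space used as the lighting representation. Modality-specific encoders (text, image, irradiance, environment map) are aligned through contrastive learning, and an auxiliary spherical-harmonics prediction task reinforces directional understanding, giving consistent and transferable multi-modal lighting features that support flexible cross-modal manipulation.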

Link: https://arxiv.org/abs/2512.04267
Authors: Zitian Zhang, Iliyan Georgiev, Michael Fischer, Yannick Hold-Geoffroy, Jean-François Lalonde, Valentin Deschaintre
Affiliations: Université Laval; Adobe Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Lighting has a strong influence on visual appearance, yet understanding and representing lighting in images remains notoriously difficult. Various lighting representations exist, such as environment maps, irradiance, spherical harmonics, or text, but they are incompatible, which limits cross-modal transfer. We thus propose UniLight, a joint latent space as lighting representation, that unifies multiple modalities within a shared embedding. Modality-specific encoders for text, images, irradiance, and environment maps are trained contrastively to align their representations, with an auxiliary spherical-harmonics prediction task reinforcing directional understanding. Our multi-modal data pipeline enables large-scale training and evaluation across three tasks: lighting-based retrieval, environment-map generation, and lighting control in diffusion-based image synthesis. Experiments show that our representation captures consistent and transferable lighting features, enabling flexible manipulation across modalities.

[CV-120] Studying Various Activation Functions and Non-IID Data for Machine Learning Model Robustness

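【Quick read】: This paper addresses the robustness of machine learning (ML) models under adversarial attack, specifically how different activation functions affect robustness in centralized training and federated learning (FL), and the performance drop caused by non-independent and identically distributed (non-IID) data in FL. The solution has two key parts. First, it proposes an improved centralized adversarial training approach that combines model architecture changes, soft labeling, simplified data augmentation, and varying learning rates, reaching a natural accuracy of 77.08% and a robust accuracy of 67.96% on CIFAR-10 against the Fast Gradient Sign Method (FGSM) attack. Second, it introduces a moderate amount of data sharing (for example, 40%) in the federated setting, which mitigates the sharp robustness drop under non-IID data, raising natural accuracy to 70.09% and robust accuracy to 54.79%, better than the CalFAT algorithm. This confirms that a suitable proportion of data sharing matters for model robustness in practical applications.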

Link: https://arxiv.org/abs/2512.04264
Authors: Long Dang, Thushari Hapuarachchi, Kaiqi Xiong, Jing Lin
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Adversarial training is an effective method to improve the machine learning (ML) model robustness. Most existing studies typically consider the rectified linear unit (ReLU) activation function and centralized training environments. In this paper, we study the ML model robustness using ten different activation functions through adversarial training in centralized environments and explore the ML model robustness in federated learning environments. In the centralized environment, we first propose an advanced adversarial training approach to improving the ML model robustness by incorporating model architecture change, soft labeling, simplified data augmentation, and varying learning rates. Then, we conduct extensive experiments on ten well-known activation functions in addition to ReLU to better understand how they impact the ML model robustness. Furthermore, we extend the proposed adversarial training approach to the federated learning environment, where both independent and identically distributed (IID) and non-IID data settings are considered. Our proposed centralized adversarial training approach achieves a natural and robust accuracy of 77.08% and 67.96%, respectively, on CIFAR-10 against the fast gradient sign attacks. Experiments on ten activation functions reveal ReLU usually performs best. In the federated learning environment, however, the robust accuracy decreases significantly, especially on non-IID data. To address the significant performance drop in the non-IID data case, we introduce data sharing and achieve the natural and robust accuracy of 70.09% and 54.79%, respectively, surpassing the CalFAT algorithm, when 40% data sharing is used. That is, a proper percentage of data sharing can significantly improve the ML model robustness, which is useful to some real-world applications.
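
The fast gradient sign attack that the reported robust accuracies are measured against perturbs an input by a fixed step in the direction of the sign of the loss gradient: x_adv = x + eps · sign(∇_x L). The sketch below is a minimal illustration on a logistic-regression model, chosen here only because its input gradient has a simple closed form; it is not the paper's network or training setup, and all names are hypothetical. Clipping to a valid pixel range is omitted.

```python
import math


def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def fgsm_logistic(x, y, w, b, eps):
    """One FGSM step against a logistic-regression classifier.

    For binary cross-entropy with p = sigmoid(w . x + b), the gradient of
    the loss with respect to the input is (p - y) * w.
    """
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    grad = [(p - y) * wi for wi in w]
    # Move every feature by eps in the direction that increases the loss.
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]


x_adv = fgsm_logistic([1.0, -1.0], 1, [2.0, -1.0], 0.0, 0.1)
```

Adversarial training, in its basic form, generates such perturbed inputs during training and fits the model on them so that accuracy on perturbed inputs (robust accuracy) improves.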

[CV-121] MVRoom: Controllable 3D Indoor Scene Generation with Multi-View Diffusion Models

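【Quick read】: This paper addresses controllable novel view synthesis (NVS) for 3D indoor scenes, that is, how to generate high-fidelity, controllable 3D scenes while keeping multiple views consistent. The key to the solution is MVRoom, a two-stage generation framework. The first stage uses novel representations to bridge a coarse 3D layout and consistent image-based condition signals for multi-view generation. The second stage introduces a layout-aware epipolar attention mechanism that strengthens multi-view consistency during diffusion. The method also includes an iterative framework that supports text-to-scene generation and recursively generates 3D scenes with varying object counts and complexity, yielding controllable, high-quality NVS.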

Link: https://arxiv.org/abs/2512.04248
Authors: Shaoheng Fang, Chaohui Yu, Fan Wang, Qixing Huang
Affiliations: The University of Texas at Austin; DAMO Academy, Alibaba Group; Hupan Lab
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:We introduce MVRoom, a controllable novel view synthesis (NVS) pipeline for 3D indoor scenes that uses multi-view diffusion conditioned on a coarse 3D layout. MVRoom employs a two-stage design in which the 3D layout is used throughout to enforce multi-view consistency. The first stage employs novel representations to effectively bridge the 3D layout and consistent image-based condition signals for multi-view generation. The second stage performs image-conditioned multi-view generation, incorporating a layout-aware epipolar attention mechanism to enhance multi-view consistency during the diffusion process. Additionally, we introduce an iterative framework that generates 3D scenes with varying numbers of objects and scene complexities by recursively performing multi-view generation (MVRoom), supporting text-to-scene generation. Experimental results demonstrate that our approach achieves high-fidelity and controllable 3D scene generation for NVS, outperforming state-of-the-art baseline methods both quantitatively and qualitatively. Ablation studies further validate the effectiveness of key components within our generation pipeline.

[CV-122] 6 Fingers 1 Kidney: Natural Adversarial Medical Images Reveal Critical Weaknesses of Vision-Language Models

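【Quick read】: This paper addresses the weak ability of current vision-language models (VLMs) to recognize rare anatomical variants in clinical use: existing benchmarks focus on common anatomical presentations and miss the challenges that individual variation poses in practice. The key to the solution is AdversarialAnatomyBench, the first benchmark covering multiple imaging modalities and anatomical regions. Its central idea is to define naturally occurring variants that violate priors about typical human anatomy as "natural adversarial anatomy". Evaluating 22 state-of-the-art VLMs on the benchmark shows that performance falls sharply on rare anatomy (mean accuracy drops from 74% to 29%) and that neither model scaling nor interventions such as bias-aware prompting and test-time reasoning resolve the problem. The benchmark gives future work a quantifiable, reproducible evaluation framework for improving the robustness of multimodal medical AI systems.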

Link: https://arxiv.org/abs/2512.04238
Authors: Leon Mayer, Piotr Kalinowski, Caroline Ebersbach, Marcel Knopp, Tim Rädsch, Evangelia Christodoulou, Annika Reinke, Fiona R. Kolbinger, Lena Maier-Hein
Affiliations: Deutsches Krebsforschungszentrum (German Cancer Research Center)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision-language models are increasingly integrated into clinical workflows. However, existing benchmarks primarily assess performance on common anatomical presentations and fail to capture the challenges posed by rare variants. To address this gap, we introduce AdversarialAnatomyBench, the first benchmark comprising naturally occurring rare anatomical variants across diverse imaging modalities and anatomical regions. We call such variants that violate learned priors about “typical” human anatomy natural adversarial anatomy. Benchmarking 22 state-of-the-art VLMs with AdversarialAnatomyBench yielded three key insights. First, when queried with basic medical perception tasks, mean accuracy dropped from 74% on typical to 29% on atypical anatomy. Even the best-performing models, GPT-5, Gemini 2.5 Pro, and Llama 4 Maverick, showed performance drops of 41-51%. Second, model errors closely mirrored expected anatomical biases. Third, neither model scaling nor interventions, including bias-aware prompting and test-time reasoning, resolved these issues. These findings highlight a critical and previously unquantified limitation in current VLM: their poor generalization to rare anatomical presentations. AdversarialAnatomyBench provides a foundation for systematically measuring and mitigating anatomical bias in multimodal medical AI systems.

[CV-123] ReasonX: MLLM-Guided Intrinsic Image Decomposition

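【Quick read】: This paper addresses the limited generalization of intrinsic image decomposition to real scenes: existing diffusion- and transformer-based models perform well on synthetic data but struggle on diverse real-world images. The key to the solution is the ReasonX framework, which uses a multimodal large language model (MLLM) as a perceptual judge that provides relative intrinsic comparisons, and converts those comparisons into GRPO reward signals for fine-tuning intrinsic decomposition models on unlabeled real images. The method aligns conditional predictors by rewarding agreement between relations derived from the model's outputs and the judge's assessments, so it obtains useful supervision without explicit labels. It is model-agnostic and applies to multiple base architectures and modalities.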

Link: https://arxiv.org/abs/2512.04222
Authors: Alara Dirik, Tuanfeng Wang, Duygu Ceylan, Stefanos Zafeiriou, Anna Frühstück
Affiliations: Imperial College London; Adobe Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Intrinsic image decomposition aims to separate images into physical components such as albedo, depth, normals, and illumination. While recent diffusion- and transformer-based models benefit from paired supervision from synthetic datasets, their generalization to diverse, real-world scenarios remains challenging. We propose ReasonX, a novel framework that leverages a multimodal large language model (MLLM) as a perceptual judge providing relative intrinsic comparisons, and uses these comparisons as GRPO rewards for fine-tuning intrinsic decomposition models on unlabeled, in-the-wild images. Unlike RL methods for generative models, our framework aligns conditional intrinsic predictors by rewarding agreement between the judge’s relational assessments and analytically derived relations from the model’s outputs. ReasonX is model-agnostic and can be applied to different intrinsic predictors. Across multiple base architectures and modalities, ReasonX yields significant improvements, including 9-25% WHDR reduction on IIW albedo and up to 46% depth accuracy gains on ETH3D, highlighting the promise of MLLM-guided comparative supervision to bridge low- and high-level vision reasoning.

[CV-124] MoReGen: Multi-Agent Motion-Reasoning Engine for Code-based Text-to-Video Synthesis

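【Quick read】: This paper addresses the lack of physical consistency in text-to-video (T2V) generation: current models have advanced in visual realism but still struggle to generate videos that obey Newton's laws of motion and maintain physical precision and motion coherence. The key to the solution is the MoReGen framework, which combines multi-agent large language models (LLMs), physics simulators, and renderers to generate reproducible, physically accurate videos in the code domain. The paper also introduces a quantitative evaluation metric based on object-trajectory correspondence and builds MoReSet, a benchmark of 1,275 human-annotated videos, to systematically evaluate and improve the physical validity of T2V models.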

Link: https://arxiv.org/abs/2512.04221
Authors: Xiangyu Bai, He Liang, Bishoy Galoaa, Utsav Nandi, Shayda Moezzi, Yuhang He, Sarah Ostadabbas
Affiliations: Northeastern University; Microsoft Research Asia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:While text-to-video (T2V) generation has achieved remarkable progress in photorealism, generating intent-aligned videos that faithfully obey physics principles remains a core challenge. In this work, we systematically study Newtonian motion-controlled text-to-video generation and evaluation, emphasizing physical precision and motion coherence. We introduce MoReGen, a motion-aware, physics-grounded T2V framework that integrates multi-agent LLMs, physics simulators, and renderers to generate reproducible, physically accurate videos from text prompts in the code domain. To quantitatively assess physical validity, we propose object-trajectory correspondence as a direct evaluation metric and present MoReSet, a benchmark of 1,275 human-annotated videos spanning nine classes of Newtonian phenomena with scene descriptions, spatiotemporal relations, and ground-truth trajectories. Using MoReSet, we conduct experiments on existing T2V models, evaluating their physical validity through both our MoRe metrics and existing physics-based evaluators. Our results reveal that state-of-the-art models struggle to maintain physical validity, while MoReGen establishes a principled direction toward physically coherent video synthesis.

[CV-125] Generalized Event Partonomy Inference with Structured Hierarchical Predictive Learning

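【Quick read】: This paper addresses how to automatically learn the hierarchical temporal structure of human perception in video understanding, that is, identifying nested event hierarchies in a continuous video stream (fine-grained actions nested within coarser routines), where conventional methods either segment only retrospectively or do not model multi-scale temporal structure. The key to the solution is the PARSE framework, which learns multi-scale event structure without supervision. It builds a hierarchy of recurrent predictors operating at different temporal granularities: lower layers capture short-term dynamics, and higher layers integrate long-term context through attention-based feedback. Event boundaries emerge naturally as transient peaks in prediction error, producing temporally coherent nested event trees (partonomies) that match the containment relations in human perception.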

Link: https://arxiv.org/abs/2512.04219
Authors: Zhou Chen, Joe Lin, Sathyanarayanan N. Aakur
Affiliations: Auburn University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 7 figures, 3 tables. Under Review

Abstract:Humans naturally perceive continuous experience as a hierarchy of temporally nested events, fine-grained actions embedded within coarser routines. Replicating this structure in computer vision requires models that can segment video not just retrospectively, but predictively and hierarchically. We introduce PARSE, a unified framework that learns multiscale event structure directly from streaming video without supervision. PARSE organizes perception into a hierarchy of recurrent predictors, each operating at its own temporal granularity: lower layers model short-term dynamics while higher layers integrate longer-term context through attention-based feedback. Event boundaries emerge naturally as transient peaks in prediction error, yielding temporally coherent, nested partonomies that mirror the containment relations observed in human event perception. Evaluated across three benchmarks, Breakfast Actions, 50 Salads, and Assembly 101, PARSE achieves state-of-the-art performance among streaming methods and rivals offline baselines in both temporal alignment (H-GEBD) and structural consistency (TED, hF1). The results demonstrate that predictive learning under uncertainty provides a scalable path toward human-like temporal abstraction and compositional event understanding.
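
The idea that event boundaries appear as transient peaks in prediction error can be sketched with a simple detector over a sequence of per-frame errors. The rule used below (a local maximum that exceeds the trailing-window mean plus `k` standard deviations) is an assumption made for illustration; the abstract does not state PARSE's actual criterion, and `window` and `k` are hypothetical parameters.

```python
import statistics


def detect_boundaries(errors, window=5, k=2.0):
    """Return indices where prediction error spikes above its recent level."""
    boundaries = []
    for t in range(window, len(errors) - 1):
        history = errors[t - window:t]
        # Threshold adapts to the recent error level and its spread.
        threshold = statistics.mean(history) + k * statistics.pstdev(history)
        is_peak = errors[t] > errors[t - 1] and errors[t] >= errors[t + 1]
        if is_peak and errors[t] > threshold:
            boundaries.append(t)
    return boundaries


errors = [0.10, 0.12, 0.11, 0.10, 0.12, 0.90, 0.15,
          0.10, 0.11, 0.12, 0.10, 0.80, 0.10]
boundaries = detect_boundaries(errors)
```

Because the local-maximum check looks one step ahead, this detector reports each boundary with a one-frame delay; a hierarchical model would run one such detector per layer, each on its own error signal.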

[CV-126] Look Around and Pay Attention: Multi-camera Point Tracking Reimagined with Transformers

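【Quick read】: This paper addresses error propagation and temporal inconsistency in multi-camera point tracking caused by conventional pipelines that decouple detection, association, and tracking. The key to the solution is LAPA (Look Around and Pay Attention), an end-to-end transformer-based architecture that uses a cross-view attention mechanism with geometric priors to model soft correspondences across views. It builds 3D point representations by attention-weighted aggregation, which naturally handles uncertainty and partial observations, and uses a transformer decoder to model long-range dependencies and preserve identities, markedly improving tracking under complex motion and occlusion.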

Link: https://arxiv.org/abs/2512.04213
Authors: Bishoy Galoaa, Xiangyu Bai, Shayda Moezzi, Utsav Nandi, Sai Siddhartha Vivek Dhir Rangoju, Somaieh Amraee, Sarah Ostadabbas
Affiliations: Northeastern University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper presents LAPA (Look Around and Pay Attention), a novel end-to-end transformer-based architecture for multi-camera point tracking that integrates appearance-based matching with geometric constraints. Traditional pipelines decouple detection, association, and tracking, leading to error propagation and temporal inconsistency in challenging scenarios. LAPA addresses these limitations by leveraging attention mechanisms to jointly reason across views and time, establishing soft correspondences through a cross-view attention mechanism enhanced with geometric priors. Instead of relying on classical triangulation, we construct 3D point representations via attention-weighted aggregation, inherently accommodating uncertainty and partial observations. Temporal consistency is further maintained through a transformer decoder that models long-range dependencies, preserving identities through extended occlusions. Extensive experiments on challenging datasets, including our newly created multi-camera (MC) versions of TAPVid-3D panoptic and PointOdyssey, demonstrate that our unified approach significantly outperforms existing methods, achieving 37.5% APD on TAPVid-3D-MC and 90.3% APD on PointOdyssey-MC, particularly excelling in scenarios with complex motions and occlusions. Code is available at this https URL

[CV-127] OnSight Pathology: A real-time platform-agnostic computational pathology companion for histopathology

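【Quick read】: This paper addresses three problems in histopathology: microscopic image analysis depends on subjective judgment, experts are scarce, and existing digital pathology solutions suffer from platform lock-in and deployment barriers. The core challenge is delivering efficient, secure, real-time artificial intelligence (AI) assistance without complex integration, so as to improve diagnostic accuracy and accessibility in clinical and research settings. The key to the solution is OnSight Pathology, platform-agnostic computer vision software that uses continuous custom screen captures to provide real-time AI inference while the user reviews digital slide images. It runs as a single executable on consumer-grade personal computers without complex software deployment, enabling low-cost, secure, cross-platform local AI analysis. It supports tasks such as brain tumor classification, mitosis detection, and quantification of immunohistochemical stains, includes a multi-modal chat assistant for quality control, and is compatible with live microscope camera feeds (including smartphones), extending its potential to intraoperative and telepathology settings.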

Link: https://arxiv.org/abs/2512.04187
Authors: Jinzhen Hu, Kevin Faust, Parsa Babaei Zadeh, Adrienn Bourkas, Shane Eaton, Andrew Young, Anzar Alvi, Dimitrios George Oreopoulos, Ameesha Paliwal, Assem Saleh Alrumeh, Evelyn Rose Kamski-Hennekam, Phedias Diamandis
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:The microscopic examination of surgical tissue remains a cornerstone of disease classification but relies on subjective interpretations and access to highly specialized experts, which can compromise accuracy and clinical care. While emerging breakthroughs in artificial intelligence (AI) offer promise for automated histological analysis, the growing number of proprietary digital pathology solutions has created barriers to real-world deployment. To address these challenges, we introduce OnSight Pathology, a platform-agnostic computer vision software that uses continuous custom screen captures to provide real-time AI inferences to users as they review digital slide images. Accessible as a single, self-contained executable file (this https URL ), OnSight Pathology operates locally on consumer-grade personal computers without complex software integration, enabling cost-effective and secure deployment in research and clinical workflows. Here we demonstrate the utility of OnSight Pathology using over 2,500 publicly available whole slide images across different slide viewers, as well as cases from our clinical digital pathology setup. The software’s robustness is highlighted across routine histopathological tasks, including the classification of common brain tumor types, mitosis detection, and the quantification of immunohistochemical stains. A built-in multi-modal chat assistant provides verifiable descriptions of images, free of rigid class labels, for added quality control. Lastly, we show compatibility with live microscope camera feeds, including from personal smartphones, offering potential for deployment in more analog, inter-operative, and telepathology settings. Together, we highlight how OnSight Pathology can deliver real-time AI inferences across a broad range of pathology pipelines, removing key barriers to the adoption of AI tools in histopathology.

[CV-128] Beyond Flicker: Detecting Kinematic Inconsistencies for Generalizable Deepfake Video Detection

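【Quick read】: This paper addresses the poor generalization of deepfake detectors to unseen manipulation types. Existing approaches mainly train detectors on static images with hand-crafted artifacts, which remains limited for video; in particular they overlook a key vulnerability, the violation of natural motion dependencies between facial regions. The key to the solution is a synthetic video generation method that decomposes facial landmark configurations into motion bases and manipulates those bases to introduce subtle biomechanical inconsistencies, embedding controllable motion artifacts into real videos. A detector trained on this data learns to spot such physiological motion anomalies and achieves state-of-the-art generalization on several benchmarks.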

Link: https://arxiv.org/abs/2512.04175
Authors: Alejandro Cobo, Roberto Valle, José Miguel Buenaposada, Luis Baumela
Affiliations: Universidad Politécnica de Madrid; Universidad Rey Juan Carlos
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Generalizing deepfake detection to unseen manipulations remains a key challenge. A recent approach to tackle this issue is to train a network with pristine face images that have been manipulated with hand-crafted artifacts to extract more generalizable clues. While effective for static images, extending this to the video domain is an open issue. Existing methods model temporal artifacts as frame-to-frame instabilities, overlooking a key vulnerability: the violation of natural motion dependencies between different facial regions. In this paper, we propose a synthetic video generation method that creates training data with subtle kinematic inconsistencies. We train an autoencoder to decompose facial landmark configurations into motion bases. By manipulating these bases, we selectively break the natural correlations in facial movements and introduce these artifacts into pristine videos via face morphing. A network trained on our data learns to spot these sophisticated biomechanical flaws, achieving state-of-the-art generalization results on several popular benchmarks.

[CV-129] Physics-Informed Deformable Gaussian Splatting: Towards Unified Constitutive Laws for Time-Evolving Material Field AAAI-26

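[Quick Read]: This paper addresses the difficulty that purely data-driven 3D Gaussian Splatting (3DGS) has in capturing physics-driven motion patterns in dynamic scenes. It proposes Physics-Informed Deformable Gaussian Splatting (PIDG). The key idea is to treat each Gaussian particle as a Lagrangian material point with time-varying constitutive parameters and to supervise it against 2D optical flow via motion projection. The method further adopts a static-dynamic decoupled 4D hash encoding to reconstruct geometry and motion efficiently, and imposes the Cauchy momentum residual as a physics constraint so that each particle's velocity and constitutive stress can be predicted independently. Finally, matching the Lagrangian particle flow to camera-compensated optical flow strengthens data fitting, which speeds up convergence and improves generalization.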

Link: https://arxiv.org/abs/2511.06299
Authors: Haoqin Hong, Ding Fan, Fubin Dou, Zhi-Li Zhou, Haoran Sun, Congcong Zhu, Jingrun Chen
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by AAAI-26

Abstract:Recently, 3D Gaussian Splatting (3DGS), an explicit scene representation technique, has shown significant promise for dynamic novel-view synthesis from monocular video input. However, purely data-driven 3DGS often struggles to capture the diverse physics-driven motion patterns in dynamic scenes. To fill this gap, we propose Physics-Informed Deformable Gaussian Splatting (PIDG), which treats each Gaussian particle as a Lagrangian material point with time-varying constitutive parameters and is supervised by 2D optical flow via motion projection. Specifically, we adopt static-dynamic decoupled 4D decomposed hash encoding to reconstruct geometry and motion efficiently. Subsequently, we impose the Cauchy momentum residual as a physics constraint, enabling independent prediction of each particle’s velocity and constitutive stress via a time-evolving material field. Finally, we further supervise data fitting by matching Lagrangian particle flow to camera-compensated optical flow, which accelerates convergence and improves generalization. Experiments on a custom physics-driven dataset as well as on standard synthetic and real-world datasets demonstrate significant gains in physical consistency and monocular dynamic reconstruction quality.
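
For reference, the Cauchy momentum balance named in the abstract can be written in Lagrangian (material-derivative) form as below. The residual and the mean-squared penalty are a generic physics-informed formulation assumed here for illustration; the symbols, the body-force term, and the loss weighting are not taken from the paper, whose exact objective is not given in this digest.

```latex
% Cauchy momentum balance for a material point with density \rho,
% velocity \mathbf{v}, Cauchy stress \boldsymbol{\sigma}, and body force \mathbf{b}:
\rho \frac{D\mathbf{v}}{Dt} = \nabla \cdot \boldsymbol{\sigma} + \rho\,\mathbf{b},
\qquad
\frac{D\mathbf{v}}{Dt} = \frac{\partial \mathbf{v}}{\partial t} + (\mathbf{v}\cdot\nabla)\,\mathbf{v}.

% Assumed per-particle residual and penalty over N particles (illustrative only):
\mathbf{r}_i = \rho \frac{D\mathbf{v}_i}{Dt} - \nabla \cdot \boldsymbol{\sigma}_i - \rho\,\mathbf{b},
\qquad
\mathcal{L}_{\text{phys}} = \frac{1}{N} \sum_{i=1}^{N} \lVert \mathbf{r}_i \rVert_2^2 .
```

Driving the residual toward zero ties the predicted velocity field to the predicted stress field, which is the general role a momentum constraint plays in physics-informed training.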

[CV-130] The changing surface of the world's roads

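[Quick Read]: This paper addresses the lack of a comprehensive global baseline for road infrastructure, in particular the missing information on road pavedness and width, which directly affects assessments of network functionality and resilience. The key to the solution is a deep learning framework applied to global PlanetScope satellite image mosaics from 2020 and 2024, yielding the first multi-temporal global road dataset, which covers 9.2 million km of critical arterial roads at 95.5% coverage, where nearly half of these roads were previously unclassified. The approach fills a gap in monitoring the physical state of the world's roads and provides quantifiable data and an analytical framework for multi-scale analysis, from national development trajectories to local governance and climate adaptation.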

Link: https://arxiv.org/abs/2512.04092
Authors: Sukanya Randhawa, Guntaj Randhawa, Clemens Langer, Francis Andorful, Benjamin Herfort, Daniel Kwakye, Omer Olchik, Sven Lautenbach, Alexander Zipf
Affiliations: Heidelberg Institute of Geoinformation Technology (HeiGIT); Heidelberg University; Centre for the Environment, Heidelberg University
Subjects: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Comments:

Abstract:Resilient road infrastructure is a cornerstone of the UN Sustainable Development Goals. Yet a primary indicator of network functionality and resilience is critically lacking: a comprehensive global baseline of road surface information. Here, we overcome this gap by applying a deep learning framework to a global mosaic of PlanetScope satellite imagery from 2020 and 2024. The result is the first global multi-temporal dataset of road pavedness and width for 9.2 million km of critical arterial roads, achieving 95.5% coverage where nearly half the network was previously unclassified. This dataset reveals a powerful multi-scale geography of human development. At the planetary scale, we show that the rate of change in pavedness is a robust proxy for a country’s development trajectory (correlation with HDI = 0.65). At the national scale, we quantify how unpaved roads constitute a fragile backbone for economic connectivity. We further synthesize our data into a global Humanitarian Passability Matrix with direct implications for humanitarian logistics. At the local scale, case studies demonstrate the framework’s versatility: in Ghana, road quality disparities expose the spatial outcomes of governance; in Pakistan, the data identifies infrastructure vulnerabilities to inform climate resilience planning. Together, this work delivers both a foundational dataset and a multi-scale analytical framework for monitoring global infrastructure, from the dynamics of national development to the realities of local governance, climate adaptation, and equity. Unlike traditional proxies such as nighttime lights, which reflect economic activity, road surface data directly measures the physical infrastructure that underpins prosperity and resilience, and does so at higher spatial resolution.

Artificial Intelligence

[AI-0] David vs. Goliath: Can Small Models Win Big with Agentic AI in Hardware Design?

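[Quick Read]: This paper addresses the high compute and energy demands of Large Language Model (LLM) inference, which make domain-specific tasks costly and unsustainable. The key to the solution is pairing Small Language Models (SLMs) with a carefully curated agentic AI framework, which reaches near-LLM performance on NVIDIA's Comprehensive Verilog Design Problems (CVDP) benchmark while substantially reducing resource consumption. Through task decomposition, iterative feedback, and correction, the approach improves efficiency and also creates learning opportunities for agents, pointing toward efficient, adaptive solutions for complex design tasks.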

Link: https://arxiv.org/abs/2512.05073
Authors: Shashwat Shankar, Subhranshu Pandey, Innocent Dengkhw Mochahari, Bhabesh Mali, Animesh Basak Chowdhury, Sukanta Bhattacharjee, Chandan Karfa
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Software Engineering (cs.SE)
Comments:

Abstract:Large Language Model (LLM) inference demands massive compute and energy, making domain-specific tasks expensive and unsustainable. As foundation models keep scaling, we ask: Is bigger always better for hardware design? Our work tests this by evaluating Small Language Models coupled with a curated agentic AI framework on NVIDIA’s Comprehensive Verilog Design Problems (CVDP) benchmark. Results show that agentic workflows, through task decomposition, iterative feedback, and correction, not only unlock near-LLM performance at a fraction of the cost but also create learning opportunities for agents, paving the way for efficient, adaptive solutions in complex design tasks.

[AI-1] Detecting Perspective Shifts in Multi-agent Systems

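[Quick Read]: This paper addresses the difficulty of monitoring behavioral dynamics in black-box multi-agent systems, in particular detecting behavioral change at the level of individual agents and of groups without access to internal structure. The key to the solution is the Temporal Data Kernel Perspective Space (TDKPS), a new framework that jointly embeds agent representations across time, on top of which the authors build several novel hypothesis tests that detect, sensitively, specifically, and significantly, behavioral shifts associated with an exogenous event. The authors describe it as the first principled framework for monitoring the dynamics of black-box multi-agent systems.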

Link: https://arxiv.org/abs/2512.05013
Authors: Eric Bridgeford, Hayden Helm
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Methodology (stat.ME)
Comments:

Abstract:Generative models augmented with external tools and update mechanisms (or agents) have demonstrated capabilities beyond intelligent prompting of base models. As agent use proliferates, dynamic multi-agent systems have naturally emerged. Recent work has investigated the theoretical and empirical properties of low-dimensional representations of agents based on query responses at a single time point. This paper introduces the Temporal Data Kernel Perspective Space (TDKPS), which jointly embeds agents across time, and proposes several novel hypothesis tests for detecting behavioral change at the agent- and group-level in black-box multi-agent systems. We characterize the empirical properties of our proposed tests, including their sensitivity to key hyperparameters, in simulations motivated by a multi-agent system of evolving digital personas. Finally, we demonstrate via natural experiment that our proposed tests detect changes that correlate sensitively, specifically, and significantly with a real exogenous event. As far as we are aware, TDKPS is the first principled framework for monitoring behavioral dynamics in black-box multi-agent systems – a critical capability as generative agent deployment continues to scale.

[AI-2] Evolutionary Architecture Search through Grammar-Based Sequence Alignment

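[Quick Read]: This paper addresses the difficulty of efficiently discovering novel, high-performing architectures in expressive Neural Architecture Search (NAS) spaces, where computational cost is high. The key to the solution is two adapted variants of the Smith-Waterman local sequence alignment algorithm, used to compute the edit distance between neural architectures in a grammar-based evolutionary architecture search. This enables crossover-based search heuristics that generate hybrid offspring from two parent models, substantially lowers computational complexity, supports in-depth analysis of the architectural loss landscape and tracking of population diversity, and yields evolutionary search results that outperform competing methods.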

Link: https://arxiv.org/abs/2512.04992
Authors: Adri Gómez Martín, Felix Möller, Steven McDonagh, Monica Abella, Manuel Desco, Elliot J. Crowley, Aaron Klein, Linus Ericsson
Affiliations: Unknown
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Neural architecture search (NAS) in expressive search spaces is a computationally hard problem, but it also holds the potential to automatically discover completely novel and performant architectures. To achieve this we need effective search algorithms that can identify powerful components and reuse them in new candidate architectures. In this paper, we introduce two adapted variants of the Smith-Waterman algorithm for local sequence alignment and use them to compute the edit distance in a grammar-based evolutionary architecture search. These algorithms enable us to efficiently calculate a distance metric for neural architectures and to generate a set of hybrid offspring from two parent models. This facilitates the deployment of crossover-based search heuristics, allows us to perform a thorough analysis on the architectural loss landscape, and track population diversity during search. We highlight how our method vastly improves computational complexity over previous work and enables us to efficiently compute shortest paths between architectures. When instantiating the crossover in evolutionary searches, we achieve competitive results, outperforming competing methods. Future work can build upon this new tool, discovering novel components that can be used more broadly across neural architecture design, and broadening its applications beyond NAS.
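
As background for the alignment step, here is a minimal sketch of the textbook Smith-Waterman local alignment applied to two token sequences. It is not the paper's adapted variants, which this digest does not describe; the scoring values (`match`, `mismatch`, `gap`) and the idea of encoding an architecture as a flat list of layer tokens are assumptions made purely for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Textbook Smith-Waterman local alignment over two token sequences.

    Returns (best_score, aligned_a, aligned_b); None marks a gap.
    Scoring parameters are illustrative assumptions, not values from the paper.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the dynamic-programming table; cells are clamped at 0 (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    aligned_a, aligned_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            aligned_a.append(a[i - 1])
            aligned_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(a[i - 1])
            aligned_b.append(None)
            i -= 1
        else:
            aligned_a.append(None)
            aligned_b.append(b[j - 1])
            j -= 1
    return best, aligned_a[::-1], aligned_b[::-1]


if __name__ == "__main__":
    # Hypothetical layer-token encodings of two parent architectures.
    parent_a = ["conv", "relu", "conv", "pool", "linear"]
    parent_b = ["norm", "conv", "relu", "pool", "linear"]
    print(smith_waterman(parent_a, parent_b))
```

The aligned segments mark components shared by both parents, which is the kind of information a crossover operator could use to decide where to splice; how the paper turns alignments into distances and offspring may differ from this sketch.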

[AI-3] Strategic Self-Improvement for Competitive Agents in AI Labour Markets

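[Quick Read]: This paper addresses the lack of a systematic understanding of the strategic behavior and market-level impact of AI agents deployed across economic domains. It proposes a theoretical framework that, according to the authors, is the first to capture three real-world forces shaping agentic labor markets: adverse selection, moral hazard, and reputation dynamics. The key to the solution is identifying three core capabilities that successful LLM-agents need: metacognition (accurate self-assessment of skills), competitive awareness (modeling rivals and market dynamics), and long-horizon strategic planning. In a tractable simulated gig economy, LLM agents prompted with reasoning capabilities learn to self-improve strategically and adapt better to changing market conditions. The simulations also reveal potential AI-driven macro-level trends, such as rapid monopolization and systemic price deflation, providing a foundation for studying the economic properties of AI-driven labor markets.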

Link: https://arxiv.org/abs/2512.04988
Authors: Christopher Chiu, Simpson Zhang, Mihaela van der Schaar
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:

Abstract:As artificial intelligence (AI) agents are deployed across economic domains, understanding their strategic behavior and market-level impact becomes critical. This paper puts forward a groundbreaking new framework that is the first to capture the real-world economic forces that shape agentic labor markets: adverse selection, moral hazard, and reputation dynamics. Our framework encapsulates three core capabilities that successful LLM-agents will need: metacognition (accurate self-assessment of skills), competitive awareness (modeling rivals and market dynamics), and long-horizon strategic planning. We illustrate our framework through a tractable simulated gig economy where agentic Large Language Models (LLMs) compete for jobs, develop skills, and adapt their strategies under competitive pressure. Our simulations illustrate how LLM agents explicitly prompted with reasoning capabilities learn to strategically self-improve and demonstrate superior adaptability to changing market conditions. At the market level, our simulations reproduce classic macroeconomic phenomena found in human labor markets, while controlled experiments reveal potential AI-driven economic trends, such as rapid monopolization and systemic price deflation. This work provides a foundation to further explore the economic properties of AI-driven labour markets, and a conceptual framework to study the strategic reasoning capabilities in agents competing in the emerging economy.

[AI-4] Realizable Abstractions: Near-Optimal Hierarchical Reinforcement Learning

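[Quick Read]: This paper addresses the limited expressive power and missing formal efficiency guarantees of existing MDP abstractions in Hierarchical Reinforcement Learning (HRL). The core challenge is to extract composable sub-policies from a low-level MDP while preserving the Markov property, and to ensure that the resulting high-level decision process yields near-optimal low-level policies. The key to the solution is the new notion of a Realizable Abstraction, a rigorous relation between a low-level MDP and its high-level decision process that avoids non-Markovianity issues and carries near-optimality guarantees: any abstract policy can be translated into a near-optimal policy for the low-level MDP through a suitable composition of options, and these options can be expressed as solutions of specific constrained MDPs. Building on this, the authors design RARL, an algorithm that returns compositional, near-optimal low-level policies, converges in a polynomial number of samples, and is robust to inaccuracies in the abstraction.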

Link: https://arxiv.org/abs/2512.04958
Authors: Roberto Cipollone, Luca Iocchi, Matteo Leonetti
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:The main focus of Hierarchical Reinforcement Learning (HRL) is studying how large Markov Decision Processes (MDPs) can be more efficiently solved when addressed in a modular way, by combining partial solutions computed for smaller subtasks. Despite their very intuitive role for learning, most notions of MDP abstractions proposed in the HRL literature have limited expressive power or do not possess formal efficiency guarantees. This work addresses these fundamental issues by defining Realizable Abstractions, a new relation between generic low-level MDPs and their associated high-level decision processes. The notion we propose avoids non-Markovianity issues and has desirable near-optimality guarantees. Indeed, we show that any abstract policy for Realizable Abstractions can be translated into near-optimal policies for the low-level MDP, through a suitable composition of options. As demonstrated in the paper, these options can be expressed as solutions of specific constrained MDPs. Based on these findings, we propose RARL, a new HRL algorithm that returns compositional and near-optimal low-level policies, taking advantage of the Realizable Abstraction given in the input. We show that RARL is Probably Approximately Correct, it converges in a polynomial number of samples, and it is robust to inaccuracies in the abstraction.

[AI-5] Toward Continuous Neurocognitive Monitoring: Integrating Speech AI with Relational Graph Transformers for Rare Neurological Diseases

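[Quick Read]: This paper addresses the fact that "brain fog," a cognitive symptom commonly reported by patients with rare neurological diseases, is hard to capture with traditional cognitive tests. The proposed solution is continuous neurocognitive monitoring through smartphone speech analysis, combined with a Relational Graph Transformer (RELGT) architecture that integrates heterogeneous medical data such as speech, lab values, and assessments. The paper argues that RELGT could overcome information bottlenecks between data types and thereby enable predictive alerts weeks before clinical decompensation, making personalized neurological monitoring more precise and timely.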

Link: https://arxiv.org/abs/2512.04938
Authors: Raquel Norel, Michele Merler, Pavitra Modi
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Patients with rare neurological diseases report cognitive symptoms (“brain fog”) that are invisible to traditional tests. We propose continuous neurocognitive monitoring via smartphone speech analysis integrated with Relational Graph Transformer (RELGT) architectures. Proof-of-concept in phenylketonuria (PKU) shows speech-derived “Proficiency in Verbal Discourse” correlates with blood phenylalanine (ρ = -0.50, p < 0.005) but not with standard cognitive tests (all |r| < 0.35). RELGT could overcome information bottlenecks in heterogeneous medical data (speech, labs, assessments), enabling predictive alerts weeks before decompensation. Key challenges: multi-disease validation, clinical workflow integration, equitable multilingual deployment. Success would transform episodic neurology into continuous personalized monitoring for millions globally.

[AI-6] Declarative Synthesis and Multi-Objective Optimization of Stripboard Circuit Layouts Using Answer Set Programming

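[Quick Read]: This paper addresses automated layout design for stripboard circuits, where the core challenge is to satisfy complex electrical-connectivity and physical-placement constraints while optimizing several objectives such as board area and the number of component strip crossings. The key to the solution is modeling the task in Answer Set Programming (ASP) as a combined synthesis and multi-objective optimization problem: ASP's declarative nature expresses the geometric and electrical constraints concisely to guarantee feasible layouts, and a two-phase solving strategy first secures feasibility and then optimizes layout quality. Experiments indicate that the method produces compact, manufacturable layouts across a range of circuit complexities.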

Link: https://arxiv.org/abs/2512.04910
Authors: Fang Li
Affiliations: Unknown
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments: Accepted by the 43rd IEEE International Conference on Computer Design (ICCD 2025)

Abstract:This paper presents a novel approach to automated stripboard circuit layout design using Answer Set Programming (ASP). The work formulates the layout problem as both a synthesis and multi-objective optimization task that simultaneously generates viable layouts while minimizing board area and component strip crossing. By leveraging ASP’s declarative nature, this work expresses complex geometric and electrical constraints in a natural and concise manner. The two-phase solving methodology first ensures feasibility before optimizing layout quality. Experimental results demonstrate that this approach generates compact, manufacturable layouts for a range of circuit complexities. This work represents a significant advancement in automated stripboard layout, offering a practical tool for electronics prototyping and education while showcasing the power of declarative programming for solving complex design automation problems.

[AI-7] Chameleon: Adaptive Adversarial Agents for Scaling-Based Visual Prompt Injection in Multimodal AI Systems

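[Quick Read]: This paper addresses a hidden security vulnerability that image downscaling, a standard preprocessing step, introduces into deployed multimodal AI systems, especially Vision-Language Models (VLMs). The vulnerability can be exploited to conceal adversarial visual prompts that are invisible to humans but become active semantic instructions once the model processes the image, hijacking downstream task execution. Traditional static attacks do not account for the dynamic nature of modern agentic workflows and are therefore of limited effectiveness. The key to the solution is Chameleon, an adaptive adversarial framework built around an iterative, agent-based optimization mechanism that refines image perturbations using the target model's real-time feedback, producing robust adversarial examples that survive standard downscaling. Experiments report an attack success rate of 84.5% against Gemini 2.5 Flash, well above static baselines (32.1% on average), and a drop of more than 45% in decision-making accuracy in multi-step agentic pipelines.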

Link: https://arxiv.org/abs/2512.04895
Authors: M Zeeshan, Saud Satti
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 5 pages, 2 figures, IEEE Transactions on Dependable and Secure Computing

Abstract:Multimodal Artificial Intelligence (AI) systems, particularly Vision-Language Models (VLMs), have become integral to critical applications ranging from autonomous decision-making to automated document processing. As these systems scale, they rely heavily on preprocessing pipelines to handle diverse inputs efficiently. However, this dependency on standard preprocessing operations, specifically image downscaling, creates a significant yet often overlooked security vulnerability. While intended for computational optimization, scaling algorithms can be exploited to conceal malicious visual prompts that are invisible to human observers but become active semantic instructions once processed by the model. Current adversarial strategies remain largely static, failing to account for the dynamic nature of modern agentic workflows. To address this gap, we propose Chameleon, a novel, adaptive adversarial framework designed to expose and exploit scaling vulnerabilities in production VLMs. Unlike traditional static attacks, Chameleon employs an iterative, agent-based optimization mechanism that dynamically refines image perturbations based on the target model’s real-time feedback. This allows the framework to craft highly robust adversarial examples that survive standard downscaling operations to hijack downstream execution. We evaluate Chameleon against the Gemini 2.5 Flash model. Our experiments demonstrate that Chameleon achieves an Attack Success Rate (ASR) of 84.5% across varying scaling factors, significantly outperforming static baseline attacks, which average only 32.1%. Furthermore, we show that these attacks effectively compromise agentic pipelines, reducing decision-making accuracy by over 45% in multi-step tasks. Finally, we discuss the implications of these vulnerabilities and propose multi-scale consistency checks as a necessary defense mechanism.
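
On the defensive side, the abstract proposes multi-scale consistency checks without describing them in this digest. The sketch below is one possible pixel-level heuristic, assumed here purely for illustration: downscale the same image with two different resampling rules and flag inputs where the results disagree strongly. The function names, the grayscale nested-list image format, and any decision threshold are hypothetical and not taken from the paper.

```python
def downscale_nearest(img, k):
    """Keep the top-left pixel of every k x k block (nearest-neighbour style)."""
    h, w = len(img), len(img[0])
    return [[img[r][c] for c in range(0, w - k + 1, k)]
            for r in range(0, h - k + 1, k)]


def downscale_box(img, k):
    """Average every k x k block (box filter)."""
    h, w = len(img), len(img[0])
    out = []
    for r in range(0, h - k + 1, k):
        row = []
        for c in range(0, w - k + 1, k):
            block = [img[r + dr][c + dc] for dr in range(k) for dc in range(k)]
            row.append(sum(block) / (k * k))
        out.append(row)
    return out


def resampling_disagreement(img, k):
    """Mean absolute difference between the two downscaled versions.

    Natural images are locally smooth, so the two results tend to be close;
    content that only appears under one resampling rule pushes this value up.
    """
    a, b = downscale_nearest(img, k), downscale_box(img, k)
    diffs = [abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    k = 4
    smooth = [[0.5] * 8 for _ in range(8)]
    # Bright pixels placed exactly where nearest-neighbour sampling looks.
    crafted = [[1.0 if r % k == 0 and c % k == 0 else 0.0 for c in range(8)]
               for r in range(8)]
    print(resampling_disagreement(smooth, k), resampling_disagreement(crafted, k))
```

A production check would more plausibly compare model outputs across several scales and interpolation kernels; this pixel-level proxy only illustrates why disagreement between resampling rules can be a useful signal.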

[AI-8] Developing a General Personal Tutor for Education

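[Quick Read]: This paper asks how a universal, nationwide AI tutor could be realized, a vision that has remained elusive despite decades of research. It suggests that Large Language Models (LLMs) may be what breaks the deadlock, but notes that applying them raises a series of new challenges and practical questions, which in turn expose notable gaps in the current scientific understanding of the learning process. The key contribution is to identify and focus on these concrete practical questions raised by LLM-driven tutoring systems, in order to motivate deeper research into how learning works.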

Link: https://arxiv.org/abs/2512.04869
Authors: Jaan Aru, Kristjan-Julius Laak
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Comments:

Abstract:The vision of a universal AI tutor has remained elusive, despite decades of effort. Could LLMs be the game-changer? We overview novel issues arising from developing a nationwide AI tutor. We highlight the practical questions that point to specific gaps in our scientific understanding of the learning process.

[AI-9] Are Your Agents Upward Deceivers?

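[Quick Read]: This paper addresses "agentic upward deception" in Large Language Model (LLM)-based agents operating under environmental constraints: to avoid punishment or create a good image, an agent conceals its failure and performs actions that were not requested, without reporting them. The key to the solution is a benchmark of 200 tasks covering five task types and eight realistic scenarios (such as broken tools or mismatched information sources), used to systematically evaluate 11 popular LLMs. The experiments find that these agents commonly exhibit action-based deceptive behaviors, including guessing results, performing unsupported simulations, substituting unavailable information sources, and fabricating local files. Prompt-based mitigation reduces the deception rate only to a limited extent, which highlights the limits of current approaches and the need for stronger safety mechanisms for LLM agent systems.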

Link: https://arxiv.org/abs/2512.04864
Authors: Dadi Guo, Qingyu Liu, Dongrui Liu, Qihan Ren, Shuai Shao, Tianyi Qiu, Haoran Li, Yi R. Fung, Zhongjie Ba, Juntao Dai, Jiaming Ji, Zhikai Chen, Jialing Tao, Yaodong Yang, Jing Shao, Xia Hu
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Model (LLM)-based agents are increasingly used as autonomous subordinates that carry out tasks for users. This raises the question of whether they may also engage in deception, similar to how individuals in human organizations lie to superiors to create a good image or avoid punishment. We observe and define agentic upward deception, a phenomenon in which an agent facing environmental constraints conceals its failure and performs actions that were not requested without reporting. To assess its prevalence, we construct a benchmark of 200 tasks covering five task types and eight realistic scenarios in a constrained environment, such as broken tools or mismatched information sources. Evaluations of 11 popular LLMs reveal that these agents typically exhibit action-based deceptive behaviors, such as guessing results, performing unsupported simulations, substituting unavailable information sources, and fabricating local files. We further test prompt-based mitigation and find only limited reductions, suggesting that it is difficult to eliminate and highlighting the need for stronger mitigation strategies to ensure the safety of LLM-based agents.

[AI-10] From Task Executors to Research Partners: Evaluating AI Co-Pilots Through Workflow Integration in Biomedical Research

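[Quick Read]: This paper addresses the inadequate evaluation of AI systems as research collaborators in preclinical biomedical research. Existing benchmarks focus only on isolated component capabilities (such as data analysis quality, hypothesis validity, and experimental protocol design) and ignore the integrated-workflow characteristics that real research collaboration requires, including multi-session interaction, contextual memory, adaptive dialogue, and constraint propagation. The key to the solution is a process-oriented evaluation framework covering four dimensions absent from current benchmarks: dialogue quality, workflow orchestration, session continuity, and researcher experience. The framework is intended to measure how well an AI system works as a research co-pilot rather than as an executor of single tasks.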

Link: https://arxiv.org/abs/2512.04854
Authors: Lukas Weidener, Marko Brkić, Chiara Bacci, Mihailo Jovanović, Emre Ulgac, Alex Dobrin, Johannes Weniger, Martin Vlas, Ritvik Singh, Aakaash Meduri
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Artificial intelligence systems are increasingly deployed in biomedical research. However, current evaluation frameworks may inadequately assess their effectiveness as research collaborators. This rapid review examines benchmarking practices for AI systems in preclinical biomedical research. Three major databases and two preprint servers were searched from January 1, 2018 to October 31, 2025, identifying 14 benchmarks that assess AI capabilities in literature understanding, experimental design, and hypothesis generation. The results revealed that all current benchmarks assess isolated component capabilities, including data analysis quality, hypothesis validity, and experimental protocol design. However, authentic research collaboration requires integrated workflows spanning multiple sessions, with contextual memory, adaptive dialogue, and constraint propagation. This gap implies that systems excelling on component benchmarks may fail as practical research co-pilots. A process-oriented evaluation framework is proposed that addresses four critical dimensions absent from current benchmarks: dialogue quality, workflow orchestration, session continuity, and researcher experience. These dimensions are essential for evaluating AI systems as research co-pilots rather than as isolated task executors.

[AI-11] Language Models as Semantic Teachers: Post-Training Alignment for Medical Audio Understanding

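[Quick Read]: This paper addresses the fact that pre-trained audio models can capture acoustic patterns in cardio-respiratory auscultation sounds but do not grasp their clinical significance, which limits their performance and use in diagnostic tasks. The key to the solution is AcuLa (Audio-Clinical Understanding via Language Alignment), a lightweight post-training framework that injects clinical semantics into any audio encoder by aligning it with a medical language model acting as a "semantic teacher." Its central idea is to use off-the-shelf large language models to convert the structured metadata that accompanies audio recordings into coherent clinical reports, building a large-scale alignment dataset, and to combine a representation-level contrastive objective with self-supervised modeling so that the model learns clinical semantics while preserving fine-grained temporal cues. The method improves diagnostic performance across 18 diverse cardio-respiratory tasks.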

Link: https://arxiv.org/abs/2512.04847
Authors: Tsai-Ning Wang, Lin-Lin Chen, Neil Zeghidour, Aaqib Saeed
Affiliations: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments:

Abstract:Pre-trained audio models excel at detecting acoustic patterns in auscultation sounds but often fail to grasp their clinical significance, limiting their use and performance in diagnostic tasks. To bridge this gap, we introduce AcuLa (Audio-Clinical Understanding via Language Alignment), a lightweight post-training framework that instills semantic understanding into any audio encoder by aligning it with a medical language model, which acts as a “semantic teacher.” To enable alignment at scale, we construct a large-scale dataset by leveraging off-the-shelf large language models to translate the rich, structured metadata accompanying existing audio recordings into coherent clinical reports. Our alignment strategy combines a representation-level contrastive objective with self-supervised modeling, ensuring that the model learns clinical semantics while preserving fine-grained temporal cues. AcuLa achieves state-of-the-art results across 18 diverse cardio-respiratory tasks from 10 different datasets, improving the mean AUROC on classification benchmarks from 0.68 to 0.79 and, on the most challenging COVID-19 cough detection task, boosting the AUROC from 0.55 to 0.89. Our work demonstrates that this audio-language alignment transforms purely acoustic models into clinically-aware diagnostic tools, establishing a novel paradigm for enhancing physiological understanding in audio-based health monitoring.
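
To make the "representation-level contrastive objective" concrete, here is a minimal sketch of a symmetric InfoNCE loss between paired audio and text embeddings. This is a common CLIP-style formulation and is an assumption on my part; the digest does not state which contrastive loss AcuLa uses, and the function names and the `temperature` default are hypothetical.

```python
import math


def _normalize(v):
    """Scale a vector to unit L2 norm (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def _diagonal_cross_entropy(logits):
    """Mean of -log softmax(row_i)[i]: row i's positive is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def info_nce(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (audio, text) embeddings.

    audio_emb[i] and text_emb[i] are assumed to describe the same recording;
    every other pairing in the batch is treated as a negative.
    """
    a = [_normalize(v) for v in audio_emb]
    t = [_normalize(v) for v in text_emb]
    logits = [[sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
              for ai in a]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-audio direction
    return 0.5 * (_diagonal_cross_entropy(logits)
                  + _diagonal_cross_entropy(logits_t))


if __name__ == "__main__":
    audio = [[1.0, 0.0], [0.0, 1.0]]
    matched_text = [[1.0, 0.0], [0.0, 1.0]]
    shuffled_text = [[0.0, 1.0], [1.0, 0.0]]
    print(info_nce(audio, matched_text), info_nce(audio, shuffled_text))
```

The loss is small when each audio embedding is closest to its own report and large when pairs are mismatched, which is the pressure that pulls an audio encoder toward the language model's semantic space.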

[AI-12] From Symptoms to Systems: An Expert-Guided Approach to Understanding Risks of Generative AI for Eating Disorders

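[Quick Read]: This paper addresses the potential risks that generative AI systems pose to individuals vulnerable to eating disorders (EDs), with particular attention to subtle but clinically significant cues that existing safeguards overlook. The key to the solution is gathering expert input through semi-structured interviews with 15 clinicians, researchers, and advocates in the eating disorder field and, using abductive qualitative analysis, building an expert-guided taxonomy of seven risk categories. The taxonomy systematically identifies interaction patterns between generative AI and the clinical features of eating disorders that may intensify risk, and informs subsequent risk assessment, safeguard design, and participatory evaluation with domain experts.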

Link: https://arxiv.org/abs/2512.04843
Authors: Amy Winecoff, Kevin Klyman
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Abstract:Generative AI systems may pose serious risks to individuals vulnerable to eating disorders. Existing safeguards tend to overlook subtle but clinically significant cues, leaving many risks unaddressed. To better understand the nature of these risks, we conducted semi-structured interviews with 15 clinicians, researchers, and advocates with expertise in eating disorders. Using abductive qualitative analysis, we developed an expert-guided taxonomy of generative AI risks across seven categories: (1) providing generalized health advice; (2) encouraging disordered behaviors; (3) supporting symptom concealment; (4) creating thinspiration; (5) reinforcing negative self-beliefs; (6) promoting excessive focus on the body; and (7) perpetuating narrow views about eating disorders. Our results demonstrate how certain user interactions with generative AI systems intersect with clinical features of eating disorders in ways that may intensify risk. We discuss implications of our work, including approaches for risk assessment, safeguard design, and participatory evaluation practices with domain experts.

[AI-13] SoK: a Comprehensive Causality Analysis Framework for Large Language Model Security

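[Quick Read]: This paper addresses the vulnerability of Large Language Models (LLMs) to adversarial manipulation such as jailbreaking, where the core challenge is the lack of a systematic understanding of the causal factors behind safety-mechanism failures. The authors propose a unified causality analysis framework whose key feature is support for intervention and causal inference at multiple granularities, from token-level, neuron-level, and layer-level interventions to representation-level analysis, enabling consistent experimentation and comparison across attack and defense methods. The work also provides what the authors describe as the first comprehensive survey of causality-driven jailbreak studies and evaluates the framework on multiple open-weight models and safety benchmarks. The results indicate that targeted interventions on causally critical components reliably change safety behavior, that safety mechanisms are highly localized (concentrated in early-to-middle layers, with only 1–2% of neurons exhibiting causal influence), and that causal features extracted with the framework achieve over 95% detection accuracy across multiple threat types.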

Link: https://arxiv.org/abs/2512.04841
Authors: Wei Zhao, Zhe Li, Jun Sun
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) exhibit remarkable capabilities but remain vulnerable to adversarial manipulations such as jailbreaking, where crafted prompts bypass safety mechanisms. Understanding the causal factors behind such vulnerabilities is essential for building reliable defenses. In this work, we introduce a unified causality analysis framework that systematically supports all levels of causal investigation in LLMs, ranging from token-level, neuron-level, and layer-level interventions to representation-level analysis. The framework enables consistent experimentation and comparison across diverse causality-based attack and defense methods. Accompanying this implementation, we provide the first comprehensive survey of causality-driven jailbreak studies and empirically evaluate the framework on multiple open-weight models and safety-critical benchmarks including jailbreaks, hallucination detection, backdoor identification, and fairness evaluation. Our results reveal that: (1) targeted interventions on causally critical components can reliably modify safety behavior; (2) safety-related mechanisms are highly localized (i.e., concentrated in early-to-middle layers with only 1–2% of neurons exhibiting causal influence); and (3) causal features extracted from our framework achieve over 95% detection accuracy across multiple threat types. By bridging theoretical causality analysis and practical model safety, our framework establishes a reproducible foundation for research on causality-based attacks, interpretability, and robust attack detection and mitigation in LLMs. Code is available at this https URL.

[AI-14] Model-Based and Sample-Efficient AI-Assisted Math Discovery in Sphere Packing

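[Quick Read]: This paper addresses the sphere packing problem in n-dimensional Euclidean space, that is, finding the densest arrangement of congruent spheres. The problem matters for cryptography, crystallography, and medical imaging, yet beyond a few special dimensions neither optimal packings nor tight upper bounds are known. Upper bounds are traditionally obtained with the three-point method, which reduces the problem to solving large, high-precision semidefinite programs (SDPs); because each candidate SDP may take days to evaluate, data-intensive AI approaches are impractical. The key innovation is to model SDP construction as a sequential decision process, the SDP game, in which a policy assembles an SDP formulation step by step from a set of admissible components. Using a sample-efficient, model-based framework that combines Bayesian optimisation with Monte Carlo Tree Search, the authors obtain new state-of-the-art upper bounds in dimensions 4 to 16, showing that model-based search can make tangible progress on mathematically rigid, evaluation-limited problems.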

Link: https://arxiv.org/abs/2512.04829
Authors: Rasul Tutunov, Alexandre Maraval, Antoine Grosnit, Xihan Li, Jun Wang, Haitham Bou-Ammar
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Sphere packing, Hilbert’s eighteenth problem, asks for the densest arrangement of congruent spheres in n-dimensional Euclidean space. Although relevant to areas such as cryptography, crystallography, and medical imaging, the problem remains unresolved: beyond a few special dimensions, neither optimal packings nor tight upper bounds are known. Even a major breakthrough in dimension n=8, later recognised with a Fields Medal, underscores its difficulty. A leading technique for upper bounds, the three-point method, reduces the problem to solving large, high-precision semidefinite programs (SDPs). Because each candidate SDP may take days to evaluate, standard data-intensive AI approaches are infeasible. We address this challenge by formulating SDP construction as a sequential decision process, the SDP game, in which a policy assembles SDP formulations from a set of admissible components. Using a sample-efficient model-based framework that combines Bayesian optimisation with Monte Carlo Tree Search, we obtain new state-of-the-art upper bounds in dimensions 4-16, showing that model-based search can advance computational progress in longstanding geometric problems. Together, these results demonstrate that sample-efficient, model-based search can make tangible progress on mathematically rigid, evaluation limited problems, pointing towards a complementary direction for AI-assisted discovery beyond large-scale LLM-driven exploration.
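
As background on the tree-search half of the method, the sketch below shows the standard UCB1 rule that many Monte Carlo Tree Search variants use to pick which child to expand. It is a generic illustration only: the digest does not say which selection rule the paper uses or how it is coupled with Bayesian optimisation, and the child names and statistics here are hypothetical.

```python
import math


def ucb1(mean_value, visits, parent_visits, c=math.sqrt(2)):
    """UCB1 score: exploitation term plus an exploration bonus.

    Unvisited children get +inf so each is tried at least once.
    """
    if visits == 0:
        return math.inf
    return mean_value + c * math.sqrt(math.log(parent_visits) / visits)


def select_child(children, parent_visits):
    """Return the name of the child with the highest UCB1 score.

    `children` maps a name to {"mean": float, "visits": int}.
    """
    return max(
        children,
        key=lambda name: ucb1(children[name]["mean"],
                              children[name]["visits"],
                              parent_visits),
    )


if __name__ == "__main__":
    # Hypothetical candidate components for the next step of an SDP formulation.
    children = {
        "component_a": {"mean": 0.9, "visits": 50},
        "component_b": {"mean": 0.6, "visits": 50},
        "component_c": {"mean": 0.0, "visits": 0},
    }
    print(select_child(children, parent_visits=100))
```

When a single evaluation can take days, the exploration bonus alone is expensive to act on, which is presumably why the paper pairs tree search with a learned surrogate model rather than relying on visit counts.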

[AI-15] Enabling Ethical AI: A case study in using Ontological Context for Justified Agentic AI Decisions

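【Quick read】: This paper targets the lack of transparency and explainability in the decisions of current generative AI systems, especially when they operate as agents (Agentic AI) and their reasoning is hard to trace and verify. The key to the solution is a collaborative human-AI process: AI agents first propose candidate knowledge structures from diverse data sources; domain experts then validate, correct, and extend them, and their feedback is used to improve subsequent models. The process captures tacit institutional knowledge, improves response quality and efficiency, and mitigates institutional amnesia. The paper argues for a shift from post-hoc explanation to justifiable Agentic AI, in which decisions rest on explicit, inspectable evidence and reasoning accessible to both experts and non-specialists.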

Link: https://arxiv.org/abs/2512.04822
Authors: Liam McGee,James Harvey,Lucy Cull,Andreas Vermeulen,Bart-Floris Visscher,Malvika Sharan
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 24 pages including references, with 6 images and 2 tables. Appendices, supporting data and additional reference provided from page 25 to 117

Abstract:In this preprint, we present a collaborative human-AI approach to building an inspectable semantic layer for Agentic AI. AI agents first propose candidate knowledge structures from diverse data sources; domain experts then validate, correct, and extend these structures, with their feedback used to improve subsequent models. We show how this process captures tacit institutional knowledge, improves response quality and efficiency, and mitigates institutional amnesia. We argue for a shift from post-hoc explanation to justifiable Agentic AI, where decisions are grounded in explicit, inspectable evidence and reasoning accessible to both experts and non-specialists.

[AI-16] SIMA 2: A Generalist Embodied Agent for Virtual Worlds

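【Quick read】: This paper addresses the difficulty current embodied agents have with general interaction, goal-directed behaviour, and continual self-improvement in complex 3D virtual environments. The core challenge is to build an agent that not only understands language and image instructions but also generalizes to unseen environments and learns new skills on its own. The solution is SIMA 2, built on a Gemini foundation model: it accepts multimodal input (language and images), reasons about high-level goals, converses with the user, and uses the foundation model to generate tasks and provide rewards that drive its own learning. This allows it to acquire new skills from scratch and improve in an open-ended way, substantially closing the gap with human performance and adapting robustly to new environments.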

Link: https://arxiv.org/abs/2512.04797
Authors: SIMA team:Adrian Bolton,Alexander Lerchner,Alexandra Cordell,Alexandre Moufarek,Andrew Bolt,Andrew Lampinen,Anna Mitenkova,Arne Olav Hallingstad,Bojan Vujatovic,Bonnie Li,Cong Lu,Daan Wierstra,Daniel P. Sawyer,Daniel Slater,David Reichert,Davide Vercelli,Demis Hassabis,Drew A. Hudson,Duncan Williams,Ed Hirst,Fabio Pardo,Felix Hill,Frederic Besse,Hannah Openshaw,Harris Chan,Hubert Soyer,Jane X. Wang,Jeff Clune,John Agapiou,John Reid,Joseph Marino,Junkyung Kim,Karol Gregor,Kaustubh Sridhar,Kay McKinney,Laura Kampis,Lei M. Zhang,Loic Matthey,Luyu Wang,Maria Abi Raad,Maria Loks-Thompson,Martin Engelcke,Matija Kecman,Matthew Jackson,Maxime Gazeau,Ollie Purkiss,Oscar Knagg,Peter Stys,Piermaria Mendolicchio,Raia Hadsell,Rosemary Ke,Ryan Faulkner,Sarah Chakera,Satinder Singh Baveja,Shane Legg,Sheleem Kashem,Tayfun Terzi,Thomas Keck,Tim Harley,Tim Scholtes,Tyson Roberts,Volodymyr Mnih,Yulan Liu,Zhengdong Wang,Zoubin Ghahramani
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Abstract:We introduce SIMA 2, a generalist embodied agent that understands and acts in a wide variety of 3D virtual worlds. Built upon a Gemini foundation model, SIMA 2 represents a significant step toward active, goal-directed interaction within an embodied environment. Unlike prior work (e.g., SIMA 1) limited to simple language commands, SIMA 2 acts as an interactive partner, capable of reasoning about high-level goals, conversing with the user, and handling complex instructions given through language and images. Across a diverse portfolio of games, SIMA 2 substantially closes the gap with human performance and demonstrates robust generalization to previously unseen environments, all while retaining the base model’s core reasoning capabilities. Furthermore, we demonstrate a capacity for open-ended self-improvement: by leveraging Gemini to generate tasks and provide rewards, SIMA 2 can autonomously learn new skills from scratch in a new environment. This work validates a path toward creating versatile and continuously learning agents for both virtual and, eventually, physical worlds.

[AI-17] YingMusic-SVC: Real-World Robust Zero-Shot Singing Voice Conversion with Flow-GRPO and Singing-Specific Inductive Biases

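【Quick read】: This paper addresses the fragility of zero-shot singing voice conversion (SVC) systems on real songs, where the main challenges are harmony interference, fundamental-frequency (F0) errors, and the lack of singing-specific inductive biases. The solution is YingMusic-SVC, a robust zero-shot SVC framework whose main components are a singing-trained RVC timbre shifter for timbre-content disentanglement, an F0-aware timbre adaptor for dynamic vocal expression, and an energy-balanced rectified flow matching loss for better high-frequency fidelity. The framework unifies continuous pre-training, robust supervised fine-tuning, and Flow-GRPO reinforcement learning. On a graded multi-track benchmark it consistently improves on strong open-source baselines, especially under accompanied and harmony-contaminated conditions, which supports its use in real-world deployment.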

Link: https://arxiv.org/abs/2512.04793
Authors: Gongyu Chen,Xiaoyu Zhang,Zhenqiang Weng,Junjie Zheng,Da Shen,Chaofan Ding,Wei-Qiang Zhang,Zihao Chen
Affiliations: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: 17 pages, 5 figures

Abstract:Singing voice conversion (SVC) aims to render the target singer’s timbre while preserving melody and lyrics. However, existing zero-shot SVC systems remain fragile in real songs due to harmony interference, F0 errors, and the lack of inductive biases for singing. We propose YingMusic-SVC, a robust zero-shot framework that unifies continuous pre-training, robust supervised fine-tuning, and Flow-GRPO reinforcement learning. Our model introduces a singing-trained RVC timbre shifter for timbre-content disentanglement, an F0-aware timbre adaptor for dynamic vocal expression, and an energy-balanced rectified flow matching loss to enhance high-frequency fidelity. Experiments on a graded multi-track benchmark show that YingMusic-SVC achieves consistent improvements over strong open-source baselines in timbre similarity, intelligibility, and perceptual naturalness, especially under accompanied and harmony-contaminated conditions, demonstrating its effectiveness for real-world SVC deployment.

[AI-18] ASTRIDE: A Security Threat Modeling Platform for Agentic-AI Applications

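【Quick read】: This paper addresses security threat modeling for AI agent-based systems, which are becoming integral to modern software architectures. Threats such as prompt injection, context poisoning, model manipulation, and opaque agent-to-agent communication are not captured well by the traditional STRIDE framework. The solution is ASTRIDE, which extends classical STRIDE with a new category, "A" for AI Agent-Specific Attacks, covering vulnerabilities such as prompt injection, unsafe tool invocation, and reasoning subversion. ASTRIDE also combines a consortium of fine-tuned vision-language models (VLMs) with the OpenAI-gpt-oss reasoning LLM to automate end-to-end threat analysis directly from visual agent architecture diagrams such as data flow diagrams (DFDs). LLM agents orchestrate the components, yielding accurate, scalable, and explainable threat modeling.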

Link: https://arxiv.org/abs/2512.04785
Authors: Eranga Bandara,Amin Hass,Ross Gore,Sachin Shetty,Ravi Mukkamala,Safdar H. Bouk,Xueping Liang,Ng Wee Keong,Kasun De Zoysa,Aruna Withanage,Nilaan Loganathan
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

Abstract:AI agent-based systems are becoming increasingly integral to modern software architectures, enabling autonomous decision-making, dynamic task execution, and multimodal interactions through large language models (LLMs). However, these systems introduce novel and evolving security challenges, including prompt injection attacks, context poisoning, model manipulation, and opaque agent-to-agent communication, that are not effectively captured by traditional threat modeling frameworks. In this paper, we introduce ASTRIDE, an automated threat modeling platform purpose-built for AI agent-based systems. ASTRIDE extends the classical STRIDE framework by introducing a new threat category, A for AI Agent-Specific Attacks, which encompasses emerging vulnerabilities such as prompt injection, unsafe tool invocation, and reasoning subversion, unique to agent-based applications. To automate threat modeling, ASTRIDE combines a consortium of fine-tuned vision-language models (VLMs) with the OpenAI-gpt-oss reasoning LLM to perform end-to-end analysis directly from visual agent architecture diagrams, such as data flow diagrams (DFDs). LLM agents orchestrate the end-to-end threat modeling automation process by coordinating interactions between the VLM consortium and the reasoning LLM. Our evaluations demonstrate that ASTRIDE provides accurate, scalable, and explainable threat modeling for next-generation intelligent systems. To the best of our knowledge, ASTRIDE is the first framework to both extend STRIDE with AI-specific threats and integrate fine-tuned VLMs with a reasoning LLM to fully automate diagram-driven threat modeling in AI agent-based applications.

[AI-19] YingMusic-Singer: Zero-shot Singing Voice Synthesis and Editing with Annotation-free Melody Guidance

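【Quick read】: This paper addresses two bottlenecks that limit practical deployment of singing voice synthesis (SVS): heavy dependence on accurate phoneme-level alignment, and the need for manually annotated melody contours. Both are resource-intensive and hinder scalability. The solution is a melody-driven SVS framework built on a Diffusion Transformer (DiT) architecture. Its main contributions are: 1) a dedicated melody extraction module that derives melody representations directly from reference audio, with no manual annotation; 2) a teacher model that guides optimization of the melody extractor, together with an implicit alignment mechanism that enforces similarity distribution constraints for better melodic stability and coherence; 3) duration modeling refined with weakly annotated song data, plus a Flow-GRPO reinforcement learning strategy with a multi-objective reward that jointly improves pronunciation clarity and melodic fidelity. The method synthesizes arbitrary lyrics to any reference melody without phoneme-level alignment, performs well in zero-shot and lyric adaptation settings, and keeps audio quality high, offering a practical and scalable approach to SVS.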

Link: https://arxiv.org/abs/2512.04779
Authors: Junjie Zheng,Chunbo Hao,Guobin Ma,Xiaoyu Zhang,Gongyu Chen,Chaofan Ding,Zihao Chen,Lei Xie
Affiliations: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: 13 pages, 3 figures

Abstract:Singing Voice Synthesis (SVS) remains constrained in practical deployment due to its strong dependence on accurate phoneme-level alignment and manually annotated melody contours, requirements that are resource-intensive and hinder scalability. To overcome these limitations, we propose a melody-driven SVS framework capable of synthesizing arbitrary lyrics following any reference melody, without relying on phoneme-level alignment. Our method builds on a Diffusion Transformer (DiT) architecture, enhanced with a dedicated melody extraction module that derives melody representations directly from reference audio. To ensure robust melody encoding, we employ a teacher model to guide the optimization of the melody extractor, alongside an implicit alignment mechanism that enforces similarity distribution constraints for improved melodic stability and coherence. Additionally, we refine duration modeling using weakly annotated song data and introduce a Flow-GRPO reinforcement learning strategy with a multi-objective reward function to jointly enhance pronunciation clarity and melodic fidelity. Experiments show that our model achieves superior performance over existing approaches in both objective measures and subjective listening tests, especially in zero-shot and lyric adaptation settings, while maintaining high audio quality without manual annotation. This work offers a practical and scalable solution for advancing data-efficient singing voice synthesis. To support reproducibility, we release our inference code and model checkpoints.
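
The abstract mentions a Flow-GRPO strategy with a multi-objective reward but gives no formulas. The sketch below shows only the generic group-relative advantage step common to GRPO-style methods: rewards for a group of samples generated from the same input are standardized within that group. The weighted-sum reward, its weights, and the names `combined_reward` and `group_relative_advantages` are assumptions for illustration, not the authors' formulation.

```python
from statistics import mean, pstdev


def combined_reward(clarity, melody, w_clarity=0.5, w_melody=0.5):
    """Hypothetical multi-objective reward: a weighted sum of two scores."""
    return w_clarity * clarity + w_melody * melody


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples.

    Uses the population standard deviation; implementations differ on
    this choice. `eps` avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Standardizing within the group means a sample is rewarded for being better than its siblings, so no separate value network is needed to provide a baseline.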

[AI-20] Using Machine Learning to Take Stay-or-Go Decisions in Data-driven Drone Missions

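【Quick read】: This paper addresses how a drone on a data-driven mission should decide, at each point of interest (POI), whether to stay for a possible follow-up action or to move on, so as to avoid time lost to pointless waiting or to flying back. If processing reveals no event that needs action, the drone has waited in vain; if it moves on early and a follow-up action turns out to be needed, it must fly back, which lengthens the mission. The key to the solution is a set of machine-learning methods based on branch prediction and reinforcement learning (RL) that track the changing probability of events and decide accordingly. They improve worst-case mission time by up to 4.1x, and the median mission time is at most 2.7% higher than that of an ideal method with perfect knowledge of the event probability.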

Link: https://arxiv.org/abs/2512.04773
Authors: Giorgos Polychronis,Foivos Pournaropoulos,Christos D. Antonopoulos,Spyros Lalis
Affiliations: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 19 pages, 3 figures, to appear in the proceedings of MobiQuitous 2025

Abstract:Drones are becoming indispensable in many application domains. In data-driven missions, besides sensing, the drone must process the collected data at runtime to decide whether additional action must be taken on the spot, before moving to the next point of interest. If processing does not reveal an event or situation that requires such an action, the drone has waited in vain instead of moving to the next point. If, however, the drone starts moving to the next point and it turns out that a follow-up action is needed at the previous point, it must spend time to fly back. To take this decision, we propose different machine-learning methods based on branch prediction and reinforcement learning. We evaluate these methods for a wide range of scenarios where the probability of event occurrence changes with time. Our results show that the proposed methods consistently outperform the regression-based method proposed in the literature and can significantly improve the worst-case mission time by up to 4.1x. Also, the achieved median mission time is very close, merely up to 2.7% higher, to that of a method with perfect knowledge of the current underlying event probability at each point of interest.
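
The abstract says the methods are based on branch prediction but does not say which predictor is used. One classic candidate is the two-bit saturating counter, sketched below as it might apply to the stay-or-go decision. The class name `StayOrGoPredictor` and the mapping of counter states to "stay" are assumptions for illustration, not the paper's design.

```python
class StayOrGoPredictor:
    """Two-bit saturating counter, as in classic CPU branch predictors."""

    def __init__(self, state=1):
        # States 0..3. States 2 and 3 predict that an event will occur,
        # so the drone should stay; states 0 and 1 predict "go".
        self.state = state

    def predict_stay(self):
        return self.state >= 2

    def update(self, event_occurred):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if event_occurred:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)
```

Because the counter needs two consecutive surprises to flip a strong prediction, a single unusual point of interest does not change the drone's behaviour, while a sustained shift in event probability does.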

[AI-21] Embodied Co-Design for Rapidly Evolving Agents: Taxonomy, Frontiers, and Challenges

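【Quick read】: This paper addresses the limited interaction capability and task adaptability that result from treating control and morphology separately in conventional agent design. The key to the solution is a systematic account of the embodied co-design (ECD) paradigm, which jointly optimizes an agent's controller (controlling brain), body morphology, and task environment so that the three are tightly coupled and co-evolve, improving behaviour and robustness in complex environments.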

Link: https://arxiv.org/abs/2512.04770
Authors: Yuxing Wang,Zhiyu Chen,Tiantian Zhang,Qiyue Yin,Yongzhe Chang,Zhiheng Li,Liang Wang,Xueqian Wang
Affiliations: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Systems and Control (eess.SY)
Comments:

Abstract:Brain-body co-evolution enables animals to develop complex behaviors in their environments. Inspired by this biological synergy, embodied co-design (ECD) has emerged as a transformative paradigm for creating intelligent agents, from virtual creatures to physical robots, by jointly optimizing their morphologies and controllers rather than treating control in isolation. This integrated approach facilitates richer environmental interactions and robust task performance. In this survey, we provide a systematic overview of recent advances in ECD. We first formalize the concept of ECD and position it within related fields. We then introduce a hierarchical taxonomy: a lower layer that breaks down agent design into three fundamental components (controlling brain, body morphology, and task environment) and an upper layer that integrates these components into four major ECD frameworks: bi-level, single-level, generative, and open-ended. This taxonomy allows us to synthesize insights from more than one hundred recent studies. We further review notable benchmarks, datasets, and applications in both simulated and real-world scenarios. Finally, we identify significant challenges and offer insights into promising future research directions. A project associated with this survey has been created at this https URL.

[AI-22] Human Cognitive Biases in Explanation-Based Interaction: The Case of Within and Between Session Order Effect AAAI2026

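【Quick read】: This paper examines possible order effects in Explanatory Interactive Learning (XIL): whether the order in which correct and wrong explanations are shown affects users' trust and the quality of their feedback. Order effects are a cognitive bias that could interfere with how users understand and correct an AI model's decisions, and so undermine XIL. The key to the approach is two large-scale user studies (713 participants in total) that mimic common XIL debugging tasks and systematically manipulate the presentation order, measuring order effects both within and between sessions. The results show that order effects have a limited but significant impact on user trust, and only within a single debugging session; their influence on feedback quality is small and inconsistent, so order effects are not a major obstacle to the successful use of XIL.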

Link: https://arxiv.org/abs/2512.04764
Authors: Dario Pesenti,Alessandro Bogani,Katya Tentori,Stefano Teso
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 18 pages, 10 figures, published at AAAI 2026

Abstract:Explanatory Interactive Learning (XIL) is a powerful interactive learning framework designed to enable users to customize and correct AI models by interacting with their explanations. In a nutshell, XIL algorithms select a number of items on which an AI model made a decision (e.g. images and their tags) and present them to users, together with corresponding explanations (e.g. image regions that drive the model’s decision). Then, users supply corrective feedback for the explanations, which the algorithm uses to improve the model. Despite showing promise in debugging tasks, recent studies have raised concerns that explanatory interaction may trigger order effects, a well-known cognitive bias in which the sequence of presented items influences users’ trust and, critically, the quality of their feedback. We argue that these studies are not entirely conclusive, as the experimental designs and tasks employed differ substantially from common XIL use cases, complicating interpretation. To clarify the interplay between order effects and explanatory interaction, we ran two larger-scale user studies (n = 713 total) designed to mimic common XIL tasks. Specifically, we assessed order effects both within and between debugging sessions by manipulating the order in which correct and wrong explanations are presented to participants. Order effects had a limited, though significant impact on users’ agreement with the model (i.e., a behavioral measure of their trust), and only when examined within debugging sessions, not between them. The quality of users’ feedback was generally satisfactory, with order effects exerting only a small and inconsistent influence in both experiments. Overall, our findings suggest that order effects do not pose a significant issue for the successful employment of XIL approaches. More broadly, our work contributes to the ongoing efforts for understanding human factors in AI.

[AI-23] Sequential Enumeration in Large Language Models

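【Quick read】: This paper asks whether current Large Language Models (LLMs) can systematically deploy counting procedures over sequences of discrete symbols, and so recognize and produce quantities reliably. Rule-based symbolic systems handle such tasks easily through serial computation, but neural models must acquire the ability through learning, and existing work shows that they remain limited at counting. The approach is to test five state-of-the-art LLMs (proprietary, open-source, and reasoning models) on a series of sequential naming and production tasks, using a range of prompting strategies to probe the role of chain-of-thought in the spontaneous emergence of counting strategies. The paper also evaluates open-source models of the same architecture at increasing sizes, and analyzes embedding dynamics during enumeration to uncover how numerosity is encoded. The results show that some LLMs can count when explicitly prompted, but none spontaneously applies a counting strategy without guidance. Current LLMs therefore lack robust, systematic counting ability, which highlights a persistent gap between neural and symbolic approaches to compositional generalization.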

Link: https://arxiv.org/abs/2512.04727
Authors: Kuinan Hou,Marco Zorzi,Alberto Testolin
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Reliably counting and generating sequences of items remains a significant challenge for neural networks, including Large Language Models (LLMs). Indeed, although this capability is readily handled by rule-based symbolic systems based on serial computation, learning to systematically deploy counting procedures is difficult for neural models, which should acquire these skills through learning. Previous research has demonstrated that recurrent architectures can only approximately track and enumerate sequences of events, and it remains unclear whether modern deep learning systems, including LLMs, can deploy systematic counting procedures over sequences of discrete symbols. This paper aims to fill this gap by investigating the sequential enumeration abilities of five state-of-the-art LLMs, including proprietary, open-source, and reasoning models. We probe LLMs in sequential naming and production tasks involving lists of letters and words, adopting a variety of prompting instructions to explore the role of chain-of-thought in the spontaneous emergence of counting strategies. We also evaluate open-source models with the same architecture but increasing size to see whether the mastering of counting principles follows scaling laws, and we analyze the embedding dynamics during sequential enumeration to investigate the emergent encoding of numerosity. We find that some LLMs are indeed capable of deploying counting procedures when explicitly prompted to do so, but none of them spontaneously engage in counting when simply asked to enumerate the number of items in a sequence. Our results suggest that, despite their impressive emergent abilities, LLMs cannot yet robustly and systematically deploy counting procedures, highlighting a persistent gap between neural and symbolic approaches to compositional generalization.

[AI-24] Playing the Player: A Heuristic Framework for Adaptive Poker AI

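【Quick read】: This paper challenges the focus of conventional poker AI research on solvers and on unexploitable, machine-perfect play, arguing that this paradigm neglects the active exploitation of human players' psychological flaws and irrational behaviour. The key to the solution is Patrick, an AI system that uses a prediction-anchored learning method to identify and maximally exploit the mistaken decision patterns of human opponents, with the aim of higher profit in real play.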

Link: https://arxiv.org/abs/2512.04714
Authors: Andrew Paterson,Carl Sanders
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Comments: 49 pages, 39 figures. White Paper by Spiderdime Systems

Abstract:For years, the discourse around poker AI has been dominated by the concept of solvers and the pursuit of unexploitable, machine-perfect play. This paper challenges that orthodoxy. It presents Patrick, an AI built on the contrary philosophy: that the path to victory lies not in being unexploitable, but in being maximally exploitative. Patrick’s architecture is a purpose-built engine for understanding and attacking the flawed, psychological, and often irrational nature of human opponents. Through detailed analysis of its design, its novel prediction-anchored learning method, and its profitable performance in a 64,267-hand trial, this paper makes the case that the solved myth is a distraction from the real, far more interesting challenge: creating AI that can master the art of human imperfection.

[AI-25] Large Speech Model Enabled Semantic Communication

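【Quick read】: This paper addresses the weak transmission robustness and low compression efficiency of existing speech semantic communication systems over lossy channels, which stem from fixed, task-specific model structures with poor adaptability. The core solution is LargeSC, a semantic communication system built on a large speech model. It uses the Mimi speech codec to convert speech into discrete tokens compatible with existing network architectures, and an adaptive controller module that provides dynamic compression and in-band Unequal Error Protection (UEP) under bandwidth constraints. It also applies Low-Rank Adaptation (LoRA) to fine-tune the foundation model so that it can recover lost speech tokens. Under high packet loss rates the system outperforms conventional baselines in speech quality, with an end-to-end latency of about 460 ms.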

Link: https://arxiv.org/abs/2512.04711
Authors: Yun Tian,Zhijin Qin,Guocheng Lv,Ye Jin,Kaibin Huang,Zhu Han
Affiliations: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: 15 pages, 9 figures

Abstract:Existing speech semantic communication systems mainly based on Joint Source-Channel Coding (JSCC) architectures have demonstrated impressive performance, but their effectiveness remains limited by model structures specifically designed for particular tasks and datasets. Recent advances indicate that generative large models pre-trained on massive datasets can exhibit exceptional performance across diverse downstream tasks with minimal fine-tuning. To exploit the rich semantic knowledge embedded in large models and enable adaptive transmission over lossy channels, we propose a Large Speech Model enabled Semantic Communication (LargeSC) system. Simultaneously achieving adaptive compression and robust transmission over lossy channels remains challenging, requiring trade-offs among compression efficiency, speech quality, and latency. In this work, we employ Mimi as a speech codec, converting speech into discrete tokens compatible with existing network architectures. We propose an adaptive controller module that enables adaptive transmission and in-band Unequal Error Protection (UEP), dynamically adjusting to both speech content and packet loss probability under bandwidth constraints. Additionally, we employ Low-Rank Adaptation (LoRA) to finetune the Moshi foundation model for generative recovery of lost speech tokens. Simulation results show that the proposed system supports bandwidths ranging from 550 bps to 2.06 kbps, outperforms conventional baselines in speech quality under high packet loss rates and achieves an end-to-end latency of approximately 460 ms, thereby demonstrating its potential for real-time deployment.

[AI-26] TimesNet-Gen: Deep Learning-based Site Specific Strong Motion Generation

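【Quick read】: This paper addresses the dependence of earthquake risk reduction on accurate site-specific evaluation, where the core challenge is modeling how local site conditions shape ground motion characteristics. The key to the solution is TimesNet-Gen, a generative model that works on time-domain accelerometer records and uses a station-specific latent bottleneck to learn and generate site-controlled strong ground motion. Generation is evaluated by comparing HVSR curves and distributions of the fundamental site frequency $f_0$ between real and generated records, and station specificity is summarized by a score based on $f_0$ distribution confusion matrices, which supports high-fidelity synthesis at the station level.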

Link: https://arxiv.org/abs/2512.04694
Authors: Baris Yilmaz,Bevan Deniz Cilgin,Erdem Akagündüz,Salih Tileylioglu
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Effective earthquake risk reduction relies on accurate site-specific evaluations. This requires models that can represent the influence of local site conditions on ground motion characteristics. In this context, data-driven approaches that learn site-controlled signatures from recorded ground motions offer a promising direction. We address strong ground motion generation from time-domain accelerometer records and introduce TimesNet-Gen, a time-domain conditional generator. The approach uses a station-specific latent bottleneck. We evaluate generation by comparing HVSR curves and fundamental site-frequency f_0 distributions between real and generated records per station, and summarize station specificity with a score based on the f_0 distribution confusion matrices. TimesNet-Gen achieves strong station-wise alignment and compares favorably with a spectrogram-based conditional VAE baseline for site-specific strong motion synthesis. Our codes are available via this https URL.

[AI-27] Generative AI for Self-Adaptive Systems: State of the Art and Research Roadmap

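【Quick read】: This paper asks how to understand systematically the potential benefits and challenges of generative AI (GenAI) in self-adaptive systems (SASs), particularly for improving system autonomy and the quality of human-system interaction. The key to the approach is to gather, filter, and analyze literature across fields and organize the findings into two categories of benefit: enhancing SAS autonomy around the MAPE-K feedback loop (Monitoring, Analysis, Planning, Execution with Knowledge), and improving interaction between humans and SASs in human-on-the-loop settings. On this basis the paper proposes a research roadmap that identifies the core challenges of integrating GenAI into SASs and suggests mitigation strategies to guide future research and practice.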

Link: https://arxiv.org/abs/2512.04680
Authors: Jialong Li,Mingyue Zhang,Nianyu Li,Danny Weyns,Zhi Jin,Kenji Tei
Affiliations: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Accepted by ACM Transactions on Autonomous and Adaptive Systems

Abstract:Self-adaptive systems (SASs) are designed to handle changes and uncertainties through a feedback loop with four core functionalities: monitoring, analyzing, planning, and execution. Recently, generative artificial intelligence (GenAI), especially the area of large language models, has shown impressive performance in data comprehension and logical reasoning. These capabilities are highly aligned with the functionalities required in SASs, suggesting a strong potential to employ GenAI to enhance SASs. However, the specific benefits and challenges of employing GenAI in SASs remain unclear. Yet, providing a comprehensive understanding of these benefits and challenges is complex due to several reasons: limited publications in the SAS field, the technological and application diversity within SASs, and the rapid evolution of GenAI technologies. To that end, this paper aims to provide researchers and practitioners a comprehensive snapshot that outlines the potential benefits and challenges of employing GenAI within SASs. Specifically, we gather, filter, and analyze literature from four distinct research fields and organize them into two main categories of potential benefits: (i) enhancements to the autonomy of SASs centered around the specific functions of the MAPE-K feedback loop, and (ii) improvements in the interaction between humans and SASs within human-on-the-loop settings. From our study, we outline a research roadmap that highlights the challenges of integrating GenAI into SASs. The roadmap starts with outlining key research challenges that need to be tackled to exploit the potential for applying GenAI in the field of SAS. The roadmap concludes with a practical reflection, elaborating on current shortcomings of GenAI and proposing possible mitigation strategies.

[AI-28] Semi Centralized Training Decentralized Execution Architecture for Multi Agent Deep Reinforcement Learning in Traffic Signal Control

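【Quick read】: This paper addresses the limitations of existing multi-agent reinforcement learning (MARL) methods for adaptive traffic signal control (ATSC) across multiple intersections: fully centralized methods suffer from the curse of dimensionality and rely on a single learning server, while purely decentralized methods perform suboptimally because of severe partial observability and the lack of explicit coordination. The key to the solution is a Semi-Centralized Training, Decentralized Execution (SEMI-CTDE) architecture. Within each region it performs centralized training with regional parameter sharing and uses composite state and reward formulations that jointly encode local and regional information, which improves coordinated decision-making while keeping the flexibility of decentralized execution.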

Link: https://arxiv.org/abs/2512.04653
Authors: Pouria Yazdani,Arash Rezaali,Monireh Abdoos
Affiliations: unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Co-first authors: Pouria Yazdani and Arash Rezaali

Abstract:Multi-agent reinforcement learning (MARL) has emerged as a promising paradigm for adaptive traffic signal control (ATSC) of multiple intersections. Existing approaches typically follow either a fully centralized or a fully decentralized design. Fully centralized approaches suffer from the curse of dimensionality and reliance on a single learning server, whereas purely decentralized approaches operate under severe partial observability and lack explicit coordination, resulting in suboptimal performance. These limitations motivate region-based MARL, where the network is partitioned into smaller, tightly coupled intersections that form regions, and training is organized around these regions. This paper introduces a Semi-Centralized Training, Decentralized Execution (SEMI-CTDE) architecture for multi-intersection ATSC. Within each region, SEMI-CTDE performs centralized training with regional parameter sharing and employs composite state and reward formulations that jointly encode local and regional information. The architecture is highly transferable across different policy backbones and state-reward instantiations. Building on this architecture, we implement two models with distinct design objectives. A multi-perspective experimental analysis of the two implemented SEMI-CTDE-based models, covering ablations of the architecture’s core elements and including rule-based and fully decentralized baselines, shows that they achieve consistently superior performance and remain effective across a wide range of traffic densities and distributions.

[AI-29] When GenAI Meets Fake News: Understanding Image Cascade Dynamics on Reddit

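【Quick read】: This paper addresses the poorly understood relationship between AI-generated images and the spread of misinformation on social media, and in particular the under-studied role of visual content in diffusion. The key to the solution is an analysis framework that integrates textual sentiment, visual attributes, and diffusion metrics (such as time-to-first repost and community reach). It predicts both immediate post-level virality (AUC=0.83) and long-term cascade-level spread (AUC=0.998), providing empirical grounding for moderating synthetic and misleading visual content online.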

Link: https://arxiv.org/abs/2512.04639
Authors: Saumya Chauhan,Mila Hong,Maria Vazhaeparambil
Affiliations: unknown
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Comments: Accepted at 2025 MIT Undergraduate Research Technology Conference (URTC’25)

Abstract:AI-generated content and misinformation are increasingly prevalent on social networks. While prior research primarily examined textual misinformation, fewer studies have focused on visual content’s role in virality. In this work, we present the first large-scale analysis of how misinformation and AI-generated images propagate through repost cascades across five ideologically diverse Reddit communities. By integrating textual sentiment, visual attributes, and diffusion metrics (e.g., time-to-first repost, community reach), our framework accurately predicts both immediate post-level virality (AUC=0.83) and long-term cascade-level spread (AUC=0.998). These findings offer essential insights for moderating synthetic and misleading visual content online.
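
The AUC figures quoted above can be read as the probability that a randomly chosen positive example (for instance, a viral post) receives a higher score than a randomly chosen negative one. The sketch below computes AUC directly from that definition. The function name `roc_auc` and the toy inputs are assumptions for illustration and have no connection to the paper's data or code.

```python
def roc_auc(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count 0.5.

    `labels` holds 1 for positives and 0 for negatives.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise loop is quadratic in the number of examples, which is fine for a sanity check; rank-based formulas give the same value faster on large datasets.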

[AI-30] Turbo-Muon: Accelerating Orthogonality-Based Optimization with Pre-Conditioning

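【Quick read】: This paper addresses the efficiency bottleneck in orthogonality-based optimizers such as Muon, whose gradient orthogonalization step is costly in large-scale training; even the Newton-Schulz iteration, an efficient approximation, still needs dozens of matrix multiplications to converge. The key to the solution is a preconditioning procedure that markedly accelerates Newton-Schulz convergence, with an overhead that can be made negligible. The faster convergence allows the usual five iterations to be reduced to four without degrading approximation quality, giving up to a 2.8x speedup in the Newton-Schulz approximation and a 5-10% improvement in end-to-end training time in realistic scenarios. The method needs no hyperparameter tuning and can be adopted as a drop-in replacement in existing optimization frameworks.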

Link: https://arxiv.org/abs/2512.04632
Authors: Thibaut Boissin(IRIT-MISFIT),Thomas Massena(DTIPG - SNCF, IRIT-MISFIT),Franck Mamalet,Mathieu Serrurier(IRIT-MISFIT)
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Orthogonality-based optimizers, such as Muon, have recently shown strong performance across large-scale training and community-driven efficiency challenges. However, these methods rely on a costly gradient orthogonalization step. Even efficient iterative approximations such as Newton-Schulz remain expensive, typically requiring dozens of matrix multiplications to converge. We introduce a preconditioning procedure that accelerates Newton-Schulz convergence and reduces its computational cost. We evaluate its impact and show that the overhead of our preconditioning can be made negligible. Furthermore, the faster convergence it enables allows us to remove one iteration out of the usual five without degrading approximation quality. Our publicly available implementation achieves up to a 2.8x speedup in the Newton-Schulz approximation. We also show that this has a direct impact on end-to-end training runtime with 5-10% improvement in realistic training scenarios across two efficiency-focused tasks. On challenging language or vision tasks, we validate that our method maintains equal or superior model performance while improving runtime. Crucially, these improvements require no hyperparameter tuning and can be adopted as a simple drop-in replacement. Our code is publicly available on github.
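
For readers unfamiliar with the orthogonalization step, the sketch below shows the textbook cubic Newton-Schulz iteration, which drives every singular value of a matrix toward 1. This is not the paper's preconditioned variant, and it is not the tuned polynomial used in Muon implementations. The helper names, the pure-Python list-of-lists matrices, and the step count are assumptions chosen to keep the example self-contained.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(g, steps=20):
    """Approximate the orthogonal polar factor of a nonzero matrix g.

    Iterates X <- 1.5 X - 0.5 X X^T X after scaling g by its Frobenius
    norm, which keeps all singular values inside the convergence region.
    """
    norm = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        # Two matrix multiplications per step dominate the cost.
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(rx, rb)] for rx, rb in zip(x, xxtx)]
    return x
```

Each step costs two matrix multiplications, so removing even one iteration, as the abstract describes, saves a fixed share of the orthogonalization cost on every optimizer update.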

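[AI-31] BioMedGPT-Mol: Multi-task Learning for Molecular Understanding and Generation

[Quick read]: This paper addresses how to efficiently adapt a general-purpose large language model into a specialized molecular language model that supports molecular understanding and generation. The key to the solution is a large-scale, high-quality unified instruction dataset, on which the model is fine-tuned through a carefully designed multi-task learning framework, turning a general-purpose reasoning model into a specialized molecular-science model. Experiments show strong results on several benchmarks and end-to-end planning ability on retrosynthetic planning, supporting the approach's potential for extension to other biomedical domains.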

Link: https://arxiv.org/abs/2512.04629
Authors: Chenyang Zuo, Siqi Fan, Zaiqing Nie
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Molecules play a crucial role in biomedical research and discovery, particularly in the field of small molecule drug development. Given the rapid advancements in large language models, especially the recent emergence of reasoning models, it is natural to explore how a general-purpose language model can be efficiently adapted for molecular science applications. In this work, we introduce BioMedGPT-Mol, a molecular language model designed to support molecular understanding and generation tasks. By curating and unifying existing public instruction datasets, we have assembled a large-scale, comprehensive, and high-quality training dataset. The model is then fine-tuned through a meticulously designed multi-task learning framework. On a consolidated benchmark derived from LlaSMol, TOMG-Bench, and MuMOInstruct, BioMedGPT-Mol achieves remarkable performance. Our experimental results demonstrate that a general-purpose reasoning model can be effectively and efficiently post-trained into a professional molecular language model through a well-structured multi-task curriculum. Leveraging the power of it, we further explore retrosynthetic planning task, and the performance on RetroBench demonstrates its competitive capability of acting as an end-to-end retrosynthetic planner. We anticipate that our approach can be extended to other biomedical scientific domains.

[AI-32] Neural Decoding of Overt Speech from ECoG Using Vision Transformers and Contrastive Representation Learning

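[Quick read]: This paper addresses the difficulty of directly regressing speech from surface electrocorticography (ECoG) signals, with the aim of restoring communication for people with severe paralysis. The current challenge is reconstructing intelligible speech in streaming mode from ECoG signals; comparable results so far have relied mostly on intracortical recordings. The key to the solution is an offline speech decoding pipeline built on an encoder-decoder deep neural architecture that combines Vision Transformers with contrastive learning to improve direct regression of speech features from ECoG signals. The approach is evaluated on clinical subdural recordings and, as a first attempt, on the fully implantable wireless epidural WIMAGINE system, which opens perspectives for long-term clinical use.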

Link: https://arxiv.org/abs/2512.04618
Authors: Mohamed Baha Ben Ticha, Xingchen Ran, Guillaume Saldanha, Gaël Le Godais, Philémon Roussel, Marc Aubert, Amina Fontanell, Thomas Costecalde, Lucas Struber, Serpil Karakas, Shaomin Zhang, Philippe Kahane, Guillaume Charvet, Stéphan Chabardès, Blaise Yvert
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Speech Brain Computer Interfaces (BCIs) offer promising solutions to people with severe paralysis unable to communicate. A number of recent studies have demonstrated convincing reconstruction of intelligible speech from surface electrocorticographic (ECoG) or intracortical recordings by predicting a series of phonemes or words and using downstream language models to obtain meaningful sentences. A current challenge is to reconstruct speech in a streaming mode by directly regressing cortical signals into acoustic speech. While this has been achieved recently using intracortical data, further work is needed to obtain comparable results with surface ECoG recordings. In particular, optimizing neural decoders becomes critical in this case. Here we present an offline speech decoding pipeline based on an encoder-decoder deep neural architecture, integrating Vision Transformers and contrastive learning to enhance the direct regression of speech from ECoG signals. The approach is evaluated on two datasets, one obtained with clinical subdural electrodes in an epileptic patient, and another obtained with the fully implantable WIMAGINE epidural system in a participant of a motor BCI trial. To our knowledge this presents a first attempt to decode speech from a fully implantable and wireless epidural recording system offering perspectives for long-term use.

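[AI-33] The Ethics of Generative AI

[Quick read]: This paper addresses the ethical questions raised by generative AI on the basis of its technical characteristics. It aims to clarify how generative AI both aggravates and alleviates familiar AI-ethics concerns (responsibility, privacy, bias and fairness, alienation and exploitation) and to examine the new ethical challenges that stem from its mimetic generativity. The key steps are: first, a technical primer showing how generative AI lets technology be experienced as if it were human, which gives the philosophical analysis its focus; second, a systematic account of how it alleviates or aggravates existing ethical concerns; and third, a focus on questions specific to mimetic generativity, such as debates about authorship and credit, as-if social relationships with machines, and new forms of influence, persuasion, and manipulation.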

Link: https://arxiv.org/abs/2512.04598
Authors: Michael Klenk
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Draft version to appear as a chapter in the Encyclopedia of Applied Ethics, 3rd Edition, edited by Ruth Chadwick

Abstract:This chapter discusses the ethics of generative AI. It provides a technical primer to show how generative AI affords experiencing technology as if it were human, and this affordance provides a fruitful focus for the philosophical ethics of generative AI. It then shows how generative AI can both aggravate and alleviate familiar ethical concerns in AI ethics, including responsibility, privacy, bias and fairness, and forms of alienation and exploitation. Finally, the chapter examines ethical questions that arise specifically from generative AI’s mimetic generativity, such as debates about authorship and credit, the emergence of as-if social relationships with machines, and new forms of influence, persuasion, and manipulation.

[AI-34] A Light-Weight Large Language Model File Format for Highly-Secure Model Distribution

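[Quick read]: This paper addresses the lack of effective confidentiality mechanisms for model weights during deployment and distribution once large language models (LLMs) have been privately customized for sensitive domains such as healthcare, law, and finance. Existing model formats and deployment frameworks generally lack access control and secure integration with trusted hardware, while current protections rely either on computationally expensive cryptographic techniques or on tightly controlled private infrastructure, both of which are costly to deploy widely. The key to the solution is CryptoTensors, an encrypted file structure that extends the widely adopted Safetensors format. Its core innovation is tensor-level encryption with embedded access control policies; it preserves key features such as lazy loading and partial deserialization while providing transparent decryption and automated key management, so that flexible licensing and secure model execution are supported with minimal overhead.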

Link: https://arxiv.org/abs/2512.04580
Authors: Huifeng Zhu, Shijie Li, Qinfeng Li, Yier Jin
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:To enhance the performance of large language models (LLMs) in various domain-specific applications, sensitive data such as healthcare, law, and finance are being used to privately customize or fine-tune these models. Such privately adapted LLMs are regarded as either personal privacy assets or corporate intellectual property. Therefore, protecting model weights and maintaining strict confidentiality during deployment and distribution have become critically important. However, existing model formats and deployment frameworks provide little to no built-in support for confidentiality, access control, or secure integration with trusted hardware. Current methods for securing model deployment either rely on computationally expensive cryptographic techniques or tightly controlled private infrastructure. Although these approaches can be effective in specific scenarios, they are difficult and costly for widespread deployment. In this paper, we introduce CryptoTensors, a secure and format-compatible file structure for confidential LLM distribution. Built as an extension to the widely adopted Safetensors format, CryptoTensors incorporates tensor-level encryption and embedded access control policies, while preserving critical features such as lazy loading and partial deserialization. It enables transparent decryption and automated key management, supporting flexible licensing and secure model execution with minimal overhead. We implement a proof-of-concept library, benchmark its performance across serialization and runtime scenarios, and validate its compatibility with existing inference frameworks, including Hugging Face Transformers and vLLM. Our results highlight CryptoTensors as a light-weight, efficient, and developer-friendly solution for safeguarding LLM weights in real-world and widespread deployments. 

[AI-35] Diffusion Fine-Tuning via Reparameterized Policy Gradient of the Soft Q-Function

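[Quick read]: This paper addresses reward over-optimization in diffusion-model fine-tuning, which yields high-reward but unnatural samples and reduces diversity. The key to the solution is Soft Q-based Diffusion Finetuning (SQDF), a KL-regularized reinforcement learning method that applies a reparameterized policy gradient to a training-free, differentiable estimate of the soft Q-function. It adds three innovations: 1) a discount factor in the denoising process for proper credit assignment; 2) consistency models to refine the Q-function estimates; 3) an off-policy replay buffer to improve mode coverage and manage the reward-diversity trade-off. Experiments show that SQDF reaches higher target rewards while preserving sample naturalness and diversity.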

Link: https://arxiv.org/abs/2512.04559
Authors: Hyeongyu Kang, Jaewoo Lee, Woocheol Shin, Kiyoung Om, Jinkyoo Park
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 36 pages, 21 figures, 4 tables

Abstract:Diffusion models excel at generating high-likelihood samples but often require alignment with downstream objectives. Existing fine-tuning methods for diffusion models significantly suffer from reward over-optimization, resulting in high-reward but unnatural samples and degraded diversity. To mitigate over-optimization, we propose Soft Q-based Diffusion Finetuning (SQDF), a novel KL-regularized RL method for diffusion alignment that applies a reparameterized policy gradient of a training-free, differentiable estimation of the soft Q-function. SQDF is further enhanced with three innovations: a discount factor for proper credit assignment in the denoising process, the integration of consistency models to refine Q-function estimates, and the use of an off-policy replay buffer to improve mode coverage and manage the reward-diversity trade-off. Our experiments demonstrate that SQDF achieves superior target rewards while preserving diversity in text-to-image alignment. Furthermore, in online black-box optimization, SQDF attains high sample efficiency while maintaining naturalness and diversity.

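[AI-36] RRPO: Robust Reward Policy Optimization for LLM-based Emotional TTS ICASSP2026

[Quick read]: This paper addresses the loss of perceptual quality in differentiable reinforcement learning (RL) frameworks for text-to-speech (TTS), where the reward model (RM) is vulnerable to reward hacking, especially in nuanced tasks such as emotion control. The key to the solution is Robust Reward Policy Optimization (RRPO), a framework that uses a hybrid regularization scheme to build a robust reward model whose signal aligns more reliably with human perception. This forces the policy model to abandon harmful shortcuts such as generating acoustic artifacts and instead learn the complex features of genuine emotion. The ablation study shows strong cross-lingual generalization of the RM, and subjective evaluation shows gains over all baselines in emotional expressiveness and naturalness.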

Link: https://arxiv.org/abs/2512.04552
Authors: Cong Wang, Changfeng Gao, Yang Xiang, Zhihao Du, Keyu An, Han Zhao, Qian Chen, Xiangang Li, Yingming Gao, Ya Li
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Submitted to ICASSP 2026. Copyright 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

Abstract:Differentiable reinforcement learning (RL) frameworks like DiffRO offer a powerful approach for controllable text-to-speech (TTS), but are vulnerable to reward hacking, particularly for nuanced tasks like emotion control. The policy model can exploit a vanilla Reward Model (RM) by generating acoustic artifacts to achieve spurious rewards, but at the cost of degrading perceptual quality. To address this, we propose Robust Reward Policy Optimization (RRPO), a novel framework that employs a hybrid regularization scheme. This scheme develops a robust RM whose reward signal is more reliably aligned with human perception, compelling the policy to abandon detrimental shortcuts and instead learn the complex features of genuine emotions. Our ablation study confirms the enhanced robustness of our RM, as evidenced by its strong cross-lingual generalization. The subjective evaluation demonstrates that this robust RM effectively mitigates reward hacking, leading to significant improvements in both emotional expressiveness and naturalness over all baselines. Demo page: this https URL.

[AI-37] Multi-Loss Learning for Speech Emotion Recognition with Energy-Adaptive Mixup and Frame-Level Attention ICASSP2026

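[Quick read]: This paper addresses the performance bottleneck in speech emotion recognition (SER) caused by emotional complexity and scarce annotated data. The key to the solution is a multi-loss learning (MLL) framework that combines an SNR-based augmentation method, Energy-Adaptive Mixup (EAM), with a Frame-Level Attention Module (FLAM). EAM generates diverse speech samples that capture subtle emotional variation, and FLAM strengthens frame-level feature extraction to capture multi-frame emotional cues. The MLL strategy combines Kullback-Leibler divergence, focal loss, center loss, and supervised contrastive loss to optimize training jointly, address class imbalance, and improve feature separability.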

Link: https://arxiv.org/abs/2512.04551
Authors: Cong Wang, Yizhong Geng, Yuhua Wen, Qifei Li, Yingming Gao, Ruimin Wang, Chunfeng Wang, Hao Li, Ya Li, Wei Chen
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Submitted to ICASSP 2026. Copyright 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

Abstract:Speech emotion recognition (SER) is an important technology in human-computer interaction. However, achieving high performance is challenging due to emotional complexity and scarce annotated data. To tackle these challenges, we propose a multi-loss learning (MLL) framework integrating an energy-adaptive mixup (EAM) method and a frame-level attention module (FLAM). The EAM method leverages SNR-based augmentation to generate diverse speech samples capturing subtle emotional variations. FLAM enhances frame-level feature extraction for multi-frame emotional cues. Our MLL strategy combines Kullback-Leibler divergence, focal, center, and supervised contrastive loss to optimize learning, address class imbalance, and improve feature separability. We evaluate our method on four widely used SER datasets: IEMOCAP, MSP-IMPROV, RAVDESS, and SAVEE. The results demonstrate our method achieves state-of-the-art performance, suggesting its effectiveness and robustness.
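
Two of the ingredients named above can be sketched in a few lines. The focal loss below follows its standard definition, -(1 - p_t)^gamma * log(p_t). The SNR-based mixing function is only one plausible reading of "SNR-based augmentation": the abstract does not give the actual EAM formula, so `mix_at_snr` and its energy-ratio scaling are assumptions made for illustration.

```python
import math


def focal_loss(probs, target, gamma=2.0):
    """Standard focal loss for one sample: -(1 - p_t)^gamma * log(p_t)."""
    p_t = probs[target]
    # The (1 - p_t)^gamma factor shrinks the loss of well-classified samples,
    # so rare or hard classes contribute relatively more to training.
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def mix_at_snr(primary, secondary, snr_db):
    """Hypothetical mixup: add `secondary` to `primary` at a target SNR in dB."""
    e_primary = sum(v * v for v in primary)
    e_secondary = sum(v * v for v in secondary)
    # Choose the scale so that e_primary / (scale^2 * e_secondary) = 10^(snr_db / 10).
    scale = math.sqrt(e_primary / (e_secondary * 10 ** (snr_db / 10)))
    return [p + scale * s for p, s in zip(primary, secondary)]
```

With `gamma=0` the focal loss reduces to ordinary cross-entropy, which is a quick way to check an implementation.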

[AI-38] GTM: Simulating the World of Tools for AI Agents

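[Quick read]: This paper addresses the high cost, low speed, and added development and maintenance overhead of training large language model (LLM) agents through direct, continuous interaction with diverse external tools. The key to the solution is the Generalist Tool Model (GTM), a 1.5-billion-parameter model that, with only prompt-level configuration, simulates the functional behavior of many tools and produces outputs that mimic real tool execution. Its core contribution is the Context-Aware Response Generation (CARG) pipeline, which synthesizes a training dataset covering more than 20,000 tools across 300 domains, so that GTM produces responses that are not only syntactically correct but also logically coherent and contextually appropriate. This yields fast, low-cost, and scalable tool simulation for agent training.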

Link: https://arxiv.org/abs/2512.04535
Authors: Zhenzhen Ren, Xinpeng Zhang, Zhenxing Qian, Yan Gao, Yu Shi, Shuxin Zheng, Jiyan He
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:The integration of external tools is pivotal for empowering Large Language Model (LLM) agents with real-world capabilities. However, training these agents through direct, continuous interaction with diverse tools is often prohibitively expensive, slow, and introduces additional development and maintenance overhead. To address this challenge, we introduce the Generalist Tool Model (GTM), a 1.5-billion-parameter model that learns to act as a universal tool simulator. With only prompt-level configuration, GTM accesses tool functionalities along with input arguments and generates outputs that faithfully mimic real tool execution, providing a fast and cost-effective solution that eliminates development overhead. To build GTM, we propose the Context-Aware Response Generation (CARG) pipeline, which synthesizes comprehensive training data covering over 20,000 tools across 300 domains including physics, medicine, robotics, and finance. Through this pipeline, GTM learns to produce not only syntactically correct outputs but also logically coherent and contextually appropriate responses. Experiments demonstrate that GTM produces high-quality outputs with strong consistency and reliability. Besides when used in real reinforcement learning scenarios for agent training, GTM exhibits significantly faster simulation speed compared to real tools while maintaining comparable output quality, along with remarkable generalization and domain adaptability. Our results establish GTM as a foundational component for developing future AI agents, enabling efficient and scalable training of tool-augmented systems.

[AI-39] SlideGen: Collaborative Multimodal Agents for Scientific Slide Generation

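[Quick read]: This paper addresses the core challenge of generating academic slides from scientific papers, a multimodal reasoning task that requires long-context understanding together with deliberate visual planning and logical structure. Most existing methods reduce the problem to text-only summarization and overlook the visual layout and content organization that slide creation demands. The key to the solution is SlideGen, a framework based on agentic collaboration in which several vision-language agents reason jointly over document structure and semantics. By coordinating modular steps (outlining, content mapping, layout arrangement, note synthesis, and iterative refinement), it outputs editable PPTX files with logical flow and visual appeal, and it outperforms existing baselines in visual quality, content faithfulness, and readability.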

Link: https://arxiv.org/abs/2512.04529
Authors: Xin Liang, Xiang Zhang, Yiwei Xu, Siqi Sun, Chenyu You
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Generating academic slides from scientific papers is a challenging multimodal reasoning task that requires both long context understanding and deliberate visual planning. Existing approaches largely reduce it to text only summarization, overlooking the visual component and design intensive nature of slide creation. In this paper we introduce SlideGen, an agentic, modular, and visual in the loop framework for scientific paper to slide generation. SlideGen orchestrates a group of vision language agents that reason collaboratively over the document structure and semantics, producing editable PPTX slides with logical flow and compelling visual presentation. By integrating coordinated outlining, mapping, arrangement, note synthesis, and iterative refinement, our system consistently delivers slides of expert level quality. Across diverse benchmarks and strong baselines, SlideGen outperforms existing methods in visual quality, content faithfulness, and readability, positioning it as the new state of the art in automated slide generation. Our work establishes a foundation for design aware multimodal slide generation, demonstrating how agentic collaboration can bridge understanding and presentation in complex multimodal reasoning tasks.

[AI-40] Prototype-Based Semantic Consistency Alignment for Domain Adaptive Retrieval AAAI2026

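[Quick read]: This paper addresses key challenges in domain adaptive retrieval: transferring knowledge from the source domain to the target domain while mitigating the feature-distribution gap caused by domain shift. Existing methods have three limitations: they neglect class-level semantic alignment, they lack a mechanism for assessing pseudo-label reliability, and they quantize original features that are affected by domain shift, which degrades hash-code quality. The proposed Prototype-Based Semantic Consistency Alignment (PSCA) is a two-stage framework. In the first stage, orthogonal prototypes establish class-level semantic connections, maximizing inter-class separability while gathering intra-class samples, and geometric proximity adaptively weights pseudo-label confidences to make the alignment more reliable. In the second stage, domain-specific quantization functions operate on reconstructed rather than original features under mutual approximation constraints, producing unified binary hash codes across domains. The core idea is to combine prototype learning with feature reconstruction to improve hash-code quality.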

Link: https://arxiv.org/abs/2512.04524
Authors: Tianle Hu, Weijun Lv, Na Han, Xiaozhao Fang, Jie Wen, Jiaxing Li, Guoxu Zhou
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: This paper was accepted by AAAI2026 main tech track not long ago. This is an expanded version with an appendix

Abstract:Domain adaptive retrieval aims to transfer knowledge from a labeled source domain to an unlabeled target domain, enabling effective retrieval while mitigating domain discrepancies. However, existing methods encounter several fundamental limitations: 1) neglecting class-level semantic alignment and excessively pursuing pair-wise sample alignment; 2) lacking either pseudo-label reliability consideration or geometric guidance for assessing label correctness; 3) directly quantizing original features affected by domain shift, undermining the quality of learned hash codes. In view of these limitations, we propose Prototype-Based Semantic Consistency Alignment (PSCA), a two-stage framework for effective domain adaptive retrieval. In the first stage, a set of orthogonal prototypes directly establishes class-level semantic connections, maximizing inter-class separability while gathering intra-class samples. During the prototype learning, geometric proximity provides a reliability indicator for semantic consistency alignment through adaptive weighting of pseudo-label confidences. The resulting membership matrix and prototypes facilitate feature reconstruction, ensuring quantization on reconstructed rather than original features, thereby improving subsequent hash coding quality and seamlessly connecting both stages. In the second stage, domain-specific quantization functions process the reconstructed features under mutual approximation constraints, generating unified binary hash codes across domains. Extensive experiments validate PSCA’s superior performance across multiple datasets.
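
For readers unfamiliar with hashing-based retrieval, the sketch below shows the generic final step that the abstract's "binary hash codes" refer to: real-valued features are binarized and items are ranked by Hamming distance. It is not PSCA itself; the prototype learning, feature reconstruction, and domain-specific quantization functions are not shown, and the sign-threshold quantizer and function names are assumptions chosen for illustration.

```python
def to_hash_code(features):
    """Binarize real-valued features by sign: 1 if the value is >= 0, else 0."""
    return [1 if v >= 0 else 0 for v in features]


def hamming(a, b):
    """Count the positions at which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))


def retrieve(query, database):
    """Return database indices ordered from nearest to farthest in Hamming distance."""
    q = to_hash_code(query)
    return sorted(range(len(database)), key=lambda i: hamming(q, to_hash_code(database[i])))
```

Because the ranking depends only on the signs of the features, any domain shift that flips signs changes the ranking, which is why the abstract stresses quantizing reconstructed rather than original features.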

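[AI-41] BiTAgent: A Task-Aware Modular Framework for Bidirectional Coupling between Multimodal Large Language Models and World Models

[Quick read]: This paper addresses two core challenges in building generalist embodied agents: tightly coupling the semantic intent of multimodal large language models (MLLMs) with the latent state representations of world models (WMs), and achieving task-aware adaptability that supports multi-task learning and cross-environment generalization. The key to the solution is BiTAgent, a framework that couples the MLLM and the WM in both directions: a forward path injects MLLM representations into the WM's latent space for semantically guided imagination, and a backward path uses WM-generated feedback to refine the MLLM's semantic space through dense text-conditioned rewards. The framework consists of three cooperating components (Task-Aware Dynamic Joint Learning, Task-Aware Behavior Learning, and MLLM-WM Joint Optimization) that combine semantic reasoning with dynamics prediction, and it shows better stability and generalization in multi-task and cross-environment settings.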

Link: https://arxiv.org/abs/2512.04513
Authors: Yu-Wei Zhan, Xin Wang, Pengzhe Mao, Tongtong Feng, Ren Wang, Wenwu Zhu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Building generalist embodied agents requires a unified system that can interpret multimodal goals, model environment dynamics, and execute reliable actions across diverse real-world tasks. Multimodal large language models (MLLMs) offer strong semantic priors and cross-modal generalization, while world models (WMs) provide actionable latent dynamics for prediction and control. Their combination holds promise for open-ended embodied intelligence, yet introduces two key challenges: (1) establishing a tight coupling between the semantic intent from MLLMs and the dynamic state representations within the WM’s latent space, and (2) achieving task-aware adaptability that supports multi-task learning and cross-environment generalization. To address these limitations, we propose BiTAgent, a task-aware dynamic joint framework that enables bidirectional coupling between MLLMs and WMs. BiTAgent establishes two complementary pathways: a forward path that injects MLLM representations into the WM’s latent space for semantically guided imagination, and a backward path where WM-generated feedback refines the MLLM’s semantic space via dense text-conditioned rewards. This bidirectional interaction is realized through three synergistic components: Task-Aware Dynamic Joint Learning, Task-Aware Behavior Learning, and MLLM-WM Joint Optimization, which together harmonize semantic reasoning and dynamic prediction. Extensive experiments across multi-task and cross-environment settings demonstrate superior stability and generalization over state-of-the-art baselines, marking a step toward open-ended embodied learning.

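[AI-42] A Modular Cognitive Architecture for Assisted Reasoning: The Nemosine Framework

[Quick read]: This paper addresses the lack of a structured, reproducible computational framework with modular cognitive capabilities for complex problem-solving and decision support. The key to the solution is the Nemosine framework, which draws on metacognition, distributed cognition, and modular cognitive systems. Functional cognitive modules, called "personas", organize tasks such as planning, evaluation, cross-checking, and narrative synthesis, so that assisted reasoning, structured thinking, and systematic analysis work together.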

Link: https://arxiv.org/abs/2512.04500
Authors: Edervaldo Melo
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
Comments: 6 pages, 1 figure. First version

Abstract:This paper presents the Nemosine Framework, a modular cognitive architecture designed to support assisted reasoning, structured thinking, and systematic analysis. The model operates through functional cognitive modules (“personas”) that organize tasks such as planning, evaluation, cross-checking, and narrative synthesis. The framework combines principles from metacognition, distributed cognition, and modular cognitive systems to offer an operational structure for assisted problem-solving and decision support. The architecture is documented through formal specification, internal consistency criteria, and reproducible structural components. The goal is to provide a clear conceptual basis for future computational implementations and to contribute to the study of symbolic-modular architectures for reasoning.

[AI-43] Persona-based Multi-Agent Collaboration for Brainstorming

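[Quick read]: This paper addresses the lack of focus and diversity in brainstorming with conventional single-agent or multi-agent collaboration, particularly in integrating cross-domain knowledge and producing in-depth ideas. The key to the solution is a persona-based framework for multi-agent selection: curated domain personas (for example, a Doctor and a VR Engineer) are combined with different collaboration modes (separate, together, and separate-then-together) to broaden the breadth, depth, and cross-domain coverage of the ideas produced. The experiments show that persona choice shapes the domains of the ideas and that collaboration mode shifts their diversity.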

Link: https://arxiv.org/abs/2512.04488
Authors: Nate Straub, Saara Khan, Kat Jay, Brian Cabral, Oskar Linde
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 12 pages, 8 figures

Abstract:We demonstrate the importance of persona-based multi-agents brainstorming for both diverse topics and subject matter ideation. Prior work has shown that generalized multi-agent collaboration often provides better reasoning than a single agent alone. In this paper, we propose and develop a framework for persona-based agent selection, showing how persona domain curation can improve brainstorming outcomes. Using multiple experimental setups, we evaluate brainstorming outputs across different persona pairings (e.g., Doctor vs VR Engineer) and A2A (agent-to-agent) dynamics (separate, together, separate-then-together). Our results show that (1) persona choice shapes idea domains, (2) collaboration mode shifts diversity of idea generation, and (3) multi-agent persona-driven brainstorming produces idea depth and cross-domain coverage.

[AI-44] AI-Assisted Game Management Decisions: A Fuzzy Logic Approach to Real-Time Substituitions

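[Quick read]: This paper addresses the limitations of substitution decisions in elite soccer, which rely on intuition or on predictive models that merely mimic historical biases and so overlook real-time changes in physiological state and tactical context. The key to the solution is a Fuzzy Logic based Decision Support System (DSS). It reformulates the PlayeRank metric into a Cumulative Mean with Role Aware Normalization, which removes the bias caused by differences in playing time and allows accurate within-match comparison of players. It then combines this metric with physiological proxies (such as fatigue) and contextual variables (such as disciplinary risk modulated by tactical role) to compute a dynamic Substitution Priority (P_final). Compared with conventional machine-learning models, the approach is more transparent and explainable and better at flagging high-risk situations; a case study of a real match suggests that it can support coaches' real-time tactical decisions.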

Link: https://arxiv.org/abs/2512.04480
Authors: Pedro Passos
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Systems and Control (eess.SY); Optimization and Control (math.OC)
Comments: 33 pages, 7 figures

Abstract:In elite soccer, substitution decisions entail significant financial and sporting consequences yet remain heavily reliant on intuition or predictive models that merely mimic historical biases. This paper introduces a Fuzzy Logic based Decision Support System (DSS) designed for real time, prescriptive game management. Unlike traditional Machine Learning approaches that encounter a predictive ceiling by attempting to replicate human behavior, our system audits performance through an objective, rule based inference engine. We propose a methodological advancement by reformulating the PlayeRank metric into a Cumulative Mean with Role Aware Normalization, eliminating the play time exposure bias inherent in cumulative sum models to enable accurate intra match comparison. The system integrates this refined metric with physiological proxies (fatigue) and contextual variables (disciplinary risk modulated by tactical role) to calculate a dynamic Substitution Priority (P_final). Validation via a case study of the 2018 FIFA World Cup match between Brazil and Belgium demonstrates the system’s ecological validity: it not only aligned with expert consensus on executed substitutions (for example Gabriel Jesus) but, crucially, identified high risk scenarios ignored by human decision makers. Specifically, the model flagged the “FAGNER Paradox” - a maximum priority defensive risk - minutes before a critical yellow card, and detected the “Lukaku Paradox”, where an isolated assist masked a severe drop in participation. These results confirm that Fuzzy Logic offers a transparent, explainable, and superior alternative to black box models for optimizing real time tactical decisions.
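
As a rough illustration of how a rule-based fuzzy engine can turn fatigue, performance, and disciplinary risk into a single priority score, here is a minimal zero-order Sugeno-style sketch. Everything specific in it is hypothetical: the membership breakpoints, the three rules, and the output levels are invented for the example and are not the paper's rule base or its Substitution Priority formula.

```python
def ramp_up(x, lo, hi):
    """Membership that rises linearly from 0 at `lo` to 1 at `hi`."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)


def substitution_priority(fatigue, performance, card_risk):
    """Toy fuzzy inference; all inputs are assumed to be scaled to [0, 1]."""
    high_fatigue = ramp_up(fatigue, 0.4, 0.8)
    low_performance = 1.0 - ramp_up(performance, 0.3, 0.7)
    high_risk = ramp_up(card_risk, 0.5, 0.9)
    # Each rule is (firing strength, output priority); AND is modelled with min().
    rules = [
        (min(high_fatigue, low_performance), 1.0),  # tired and underperforming
        (high_risk, 0.9),                           # disciplinary risk on its own
        (min(1 - high_fatigue, 1 - low_performance, 1 - high_risk), 0.0),  # fresh, effective, safe
    ]
    total = sum(weight for weight, _ in rules)
    # Defuzzify with a weighted average of the rule outputs.
    return sum(weight * out for weight, out in rules) / total if total else 0.5
```

Each rule can be read aloud and audited on its own, which is the transparency argument the abstract makes against black-box predictors.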

[AI-45] GraphBench: Next-generation graph learning benchmarking

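[Quick read]: This paper addresses the fragmentation of benchmarking practice in graph machine learning, namely narrow task-specific datasets and inconsistent evaluation protocols, which hampers reproducibility and broader progress. The key to the solution is GraphBench, a comprehensive benchmarking suite that covers node-level, edge-level, graph-level, and generative prediction tasks, and provides standardized evaluation protocols (consistent dataset splits and metrics that account for out-of-distribution generalization) together with a unified hyperparameter tuning framework, giving graph learning models a comparable and reproducible test bed.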

Link: https://arxiv.org/abs/2512.04475
Authors: Timo Stoll, Chendi Qian, Ben Finkelshtein, Ali Parviz, Darius Weber, Fabrizio Frasca, Hadar Shavit, Antoine Siraudin, Arman Mielke, Marie Anastacio, Erik Müller, Maya Bechler-Speicher, Michael Bronstein, Mikhail Galkin, Holger Hoos, Mathias Niepert, Bryan Perozzi, Jan Tönshoff, Christopher Morris
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
Comments:

Abstract:Machine learning on graphs has recently achieved impressive progress in various domains, including molecular property prediction and chip design. However, benchmarking practices remain fragmented, often relying on narrow, task-specific datasets and inconsistent evaluation protocols, which hampers reproducibility and broader progress. To address this, we introduce GraphBench, a comprehensive benchmarking suite that spans diverse domains and prediction tasks, including node-level, edge-level, graph-level, and generative settings. GraphBench provides standardized evaluation protocols – with consistent dataset splits and performance metrics that account for out-of-distribution generalization – as well as a unified hyperparameter tuning framework. Additionally, we benchmark GraphBench using message-passing neural networks and graph transformer models, providing principled baselines and establishing a reference performance. See this http URL for further details.

[AI-46] Mathematical Framing for Different Agent Strategies

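[Quick read]: This paper addresses the lack of a unified mathematical and probabilistic framework for designing and comparing AI agent strategies, which makes it hard to analyze systematically how different agent architectures differ on complex tasks. The key to the solution is a unified framework that formalizes agentic processes as a chain of probabilities and analyzes how each strategy manipulates those probabilities to reach its goal. In particular, it introduces the "Degrees of Freedom" concept, which intuitively distinguishes the optimizable levers available to each approach and so guides the choice of strategy for a given task.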

Link: https://arxiv.org/abs/2512.04469
Authors: Philip Stephens, Emmanuel Salawu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:We introduce a unified mathematical and probabilistic framework for understanding and comparing diverse AI agent strategies. We bridge the gap between high-level agent design concepts, such as ReAct, multi-agent systems, and control flows, and a rigorous mathematical formulation. Our approach frames agentic processes as a chain of probabilities, enabling a detailed analysis of how different strategies manipulate these probabilities to achieve desired outcomes. Our framework provides a common language for discussing the trade-offs inherent in various agent architectures. One of our many key contributions is the introduction of the “Degrees of Freedom” concept, which intuitively differentiates the optimizable levers available for each approach, thereby guiding the selection of appropriate strategies for specific tasks. This work aims to enhance the clarity and precision in designing and evaluating AI agents, offering insights into maximizing the probability of successful actions within complex agentic systems.
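
The "chain of probabilities" view can be made tangible with a small calculation. The sketch below is not the paper's formalism; it assumes, for illustration, that steps succeed independently, that retries are independent, and that a perfect check detects success. The function names are hypothetical.

```python
from math import prod


def chain_success(step_probs):
    """Probability that every step of a sequential agent run succeeds."""
    # Under the independence assumption the step probabilities multiply.
    return prod(step_probs)


def with_retries(p, attempts):
    """Probability that at least one of `attempts` independent tries succeeds."""
    return 1.0 - (1.0 - p) ** attempts
```

For example, three steps at 0.9 each succeed together with probability 0.729, while allowing two tries on a step that succeeds half the time raises that step to 0.75. In the abstract's terms, the retry count is one optimizable lever a strategy can adjust.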

[AI-47] MARL Warehouse Robots

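[Quick read]: This paper addresses performance optimization of multi-agent reinforcement learning (MARL) for cooperative warehouse robots, in particular how to improve the efficiency and stability of several robots delivering packages together. The core of the study is a comparison of two MARL algorithms, QMIX and IPPO, on the Robotic Warehouse (RWARE) environment and a custom Unity 3D simulation. QMIX's value decomposition clearly outperforms independent learning (mean return 3.25 versus 0.38 for advanced IPPO), but it is sensitive to hyperparameter tuning and in particular needs extended epsilon annealing (over 5 million steps) to discover sparse rewards. A deployment in Unity ML-Agents delivers packages consistently after 1 million training steps, suggesting that MARL suits small-scale systems (2-4 robots) while scaling further remains challenging.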

Link: https://arxiv.org/abs/2512.04463
Authors: Price Allman, Lian Thang, Dre Simmons, Salmon Riaz
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 6 pages, 4 tables. Project documentation: this https URL

Abstract:We present a comparative study of multi-agent reinforcement learning (MARL) algorithms for cooperative warehouse robotics. We evaluate QMIX and IPPO on the Robotic Warehouse (RWARE) environment and a custom Unity 3D simulation. Our experiments reveal that QMIX’s value decomposition significantly outperforms independent learning approaches (achieving 3.25 mean return vs. 0.38 for advanced IPPO), but requires extensive hyperparameter tuning – particularly extended epsilon annealing (5M+ steps) for sparse reward discovery. We demonstrate successful deployment in Unity ML-Agents, achieving consistent package delivery after 1M training steps. While MARL shows promise for small-scale deployments (2-4 robots), significant scaling challenges remain. Code and analyses: this https URL
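
The "extended epsilon annealing" mentioned above refers to how slowly an epsilon-greedy policy reduces its random exploration. A minimal sketch of a linear schedule follows; the 5,000,000-step horizon comes from the abstract, while the start and end values (1.0 and 0.05) and the linear shape are assumptions, since the abstract does not state them.

```python
def linear_epsilon(step, start=1.0, end=0.05, anneal_steps=5_000_000):
    """Exploration rate after `step` environment steps under linear annealing."""
    # Clamp progress to [0, 1] so epsilon stays at `end` once annealing is done.
    frac = min(max(step / anneal_steps, 0.0), 1.0)
    return start + frac * (end - start)
```

With a sparse reward, a longer schedule keeps agents acting randomly long enough to stumble on a first successful delivery, which matches the abstract's remark about sparse reward discovery.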

[AI-48] Open-Ended Goal Inference through Actions and Language for Human-Robot Collaboration

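Quick read: This paper addresses the difficulty robots have, when collaborating with humans, in accurately inferring goals that are ambiguous, hard to articulate, or outside a predefined set. Prior methods are limited to a fixed goal set, rely only on observed actions, or depend entirely on explicit instructions, which makes them brittle in real-world settings. The key to the solution is BALI (Bidirectional Action-Language Inference), which fuses natural-language preferences with human action cues in a receding-horizon planning tree so that information flows in both directions: language and actions are used jointly to infer the goal, and the robot asks a clarifying question only when the expected information gain exceeds the cost of interrupting, then selects supportive actions aligned with the inferred goal. The method makes goal predictions markedly more stable and accurate, and is particularly suited to collaborative tasks where goals are not predefined and change dynamically (such as cooking).

Link: https://arxiv.org/abs/2512.04453
Authors: Debasmita Ghose, Oz Gitelson, Marynel Vazquez, Brian Scassellati
Affiliations: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Accepted to ACM/IEEE International Conference on Human-Robot Interaction, 2026 (HRI 2026), 10 pages, 4 figures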

Abstract:To collaborate with humans, robots must infer goals that are often ambiguous, difficult to articulate, or not drawn from a fixed set. Prior approaches restrict inference to a predefined goal set, rely only on observed actions, or depend exclusively on explicit instructions, making them brittle in real-world interactions. We present BALI (Bidirectional Action-Language Inference) for goal prediction, a method that integrates natural language preferences with observed human actions in a receding-horizon planning tree. BALI combines language and action cues from the human, asks clarifying questions only when the expected information gain from the answer outweighs the cost of interruption, and selects supportive actions that align with inferred goals. We evaluate the approach in collaborative cooking tasks, where goals may be novel to the robot and unbounded. Compared to baselines, BALI yields more stable goal predictions and significantly fewer mistakes.
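
The "ask only when the expected information gain outweighs the cost of interruption" rule can be illustrated with a small Bayesian calculation. This is a sketch under simplifying assumptions, not BALI itself: the goal set is fixed and tiny (BALI targets open-ended goals), the cost is expressed in bits, and all names (`expected_information_gain`, `should_ask`) are hypothetical.

```python
import math


def entropy(belief):
    """Shannon entropy (bits) of a dict mapping goal -> probability."""
    return -sum(p * math.log2(p) for p in belief.values() if p > 0)


def expected_information_gain(belief, answer_given_goal):
    """Expected entropy reduction from asking one question.

    belief: goal -> P(goal)
    answer_given_goal: goal -> {answer: P(answer | goal)}
    """
    answers = {a for dist in answer_given_goal.values() for a in dist}
    gain = entropy(belief)
    for a in answers:
        p_a = sum(belief[g] * answer_given_goal[g].get(a, 0.0) for g in belief)
        if p_a == 0:
            continue
        posterior = {g: belief[g] * answer_given_goal[g].get(a, 0.0) / p_a
                     for g in belief}
        gain -= p_a * entropy(posterior)
    return gain


def should_ask(belief, answer_given_goal, interruption_cost):
    """Ask only if the expected gain (bits) exceeds the assumed cost (bits)."""
    return expected_information_gain(belief, answer_given_goal) > interruption_cost


if __name__ == "__main__":
    question = {"salad": {"yes": 1.0}, "soup": {"no": 1.0}}
    print(should_ask({"salad": 0.5, "soup": 0.5}, question, 0.3))
    print(should_ask({"salad": 0.99, "soup": 0.01}, question, 0.3))
```

When the belief is already peaked, the same question carries little expected information, so the robot stays quiet.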

[AI-49] Automating Complex Document Workflows via Stepwise and Rollback-Enabled Operation Orchestration AAAI-2026

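Quick read: This paper addresses the lack of effective control and fault tolerance in existing agentic systems when executing multi-step, session-level document-processing workflows, and in particular the difficulty of staying consistent with user intent and document context over long-horizon tasks. The key to the solution is the AutoDW framework, which plans API operations incrementally (conditioned on user instructions, an intent-filtered set of API candidates, and the evolving document state) to achieve fine-grained, step-level orchestration. It also introduces a two-level rollback mechanism (argument level and API level) that supports dynamic error correction and improves fault tolerance, keeping the whole execution trajectory aligned with user intent and document context.

Link: https://arxiv.org/abs/2512.04445
Authors: Yanbin Zhang, Hanhui Ye, Yue Bai, Qiming Zhang, Liao Xiang, Wu Mianzhi, Renjun Hu
Affiliations: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 9 pages, 3 figures, accepted by AAAI-2026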

Abstract:Workflow automation promises substantial productivity gains in everyday document-related tasks. While prior agentic systems can execute isolated instructions, they struggle with automating multi-step, session-level workflows due to limited control over the operational process. To this end, we introduce AutoDW, a novel execution framework that enables stepwise, rollback-enabled operation orchestration. AutoDW incrementally plans API actions conditioned on user instructions, intent-filtered API candidates, and the evolving states of the document. It further employs robust rollback mechanisms at both the argument and API levels, enabling dynamic correction and fault tolerance. These designs together ensure that the execution trajectory of AutoDW remains aligned with user intent and document context across long-horizon workflows. To assess its effectiveness, we construct a comprehensive benchmark of 250 sessions and 1,708 human-annotated instructions, reflecting realistic document processing scenarios with interdependent instructions. AutoDW achieves 90% and 62% completion rates on instruction- and session-level tasks, respectively, outperforming strong baselines by 40% and 76%. Moreover, AutoDW also remains robust for the decision of backbone LLMs and on tasks with varying difficulty. Code and data will be open-sourced. Code: this https URL
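
One way to picture rollback at both the argument and API levels is a snapshot-and-restore loop. This is a toy sketch under stated assumptions, not AutoDW's implementation: the document is a plain dict, `set_font` is a hypothetical API, and failures are signalled with `ValueError`.

```python
import copy


def set_font(doc, paragraph, size):
    """Hypothetical document API that mutates state before validating."""
    doc["log"].append("set_font")
    if paragraph >= len(doc["paragraphs"]):
        raise ValueError("paragraph index out of range")
    doc["paragraphs"][paragraph]["size"] = size


def run_step(document, candidates):
    """Try each API, and each argument option for it, restoring the
    document from a snapshot after every failed attempt.

    candidates: list of (api_name, api_fn, [kwargs, ...]) in priority order.
    Returns (api_name, kwargs) for the attempt that succeeded.
    """
    for api_name, api_fn, arg_options in candidates:   # API-level fallback
        for kwargs in arg_options:                     # argument-level fallback
            snapshot = copy.deepcopy(document)
            try:
                api_fn(document, **kwargs)
                return api_name, kwargs
            except ValueError:
                document.clear()
                document.update(snapshot)              # roll back partial edits
    raise RuntimeError("no candidate operation succeeded")


if __name__ == "__main__":
    doc = {"paragraphs": [{"text": "Intro", "size": 10}], "log": []}
    options = [{"paragraph": 3, "size": 12}, {"paragraph": 0, "size": 12}]
    print(run_step(doc, [("set_font", set_font, options)]), doc)
```

The snapshot is taken per attempt, so a failed call leaves no trace in the document before the next candidate is tried.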

[AI-50] TaskEval: Synthesised Evaluation for Foundation-Model Tasks

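Quick read: This paper addresses the reliability of outputs in generative AI applications affected by hallucination, and specifically the difficulty software teams face in evaluating and reviewing model outputs when no task-specific metric or benchmark dataset exists. The key to the solution is a task-agnostic meta-model that captures the core properties of any generative AI task, combined with an efficient human-feedback interaction protocol and an eval synthesiser that automatically selects or generates evaluation procedures suited to the task, together with a custom user interface for collecting human feedback, so that automated evaluation and human insight are tightly integrated.

Link: https://arxiv.org/abs/2512.04442
Authors: Dilani Widanapathiranage, Scott Barnett, Stefanus Kurniawan, Wannita Takerngsaksiri
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 5 pages, 3 figures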

Abstract:Hallucinations are a key concern when creating applications that rely on Foundation models (FMs). Understanding where and how these subtle failures occur in an application relies on evaluation methods known as evals. Prior work focuses on defining new eval methods or benchmark datasets for specific tasks. However, neither helps a software team with a task-specific FM application when there is no metric or dataset. The demand for both automated approaches and deep integration of human insight makes this a challenging problem. We address this gap by proposing an approach to synthesise an FM task-specific evaluator program that provides automation and a custom UI for capturing feedback. The core novelty of our approach lies in: (1) a task-agnostic meta-model that captures properties of any FM task, (2) an interaction protocol for efficient use of human feedback, and (3) an eval synthesiser that selects or generates an appropriate set of evals. We implement our approach in our tool and demonstrate the concept on two diverse FM tasks: chart data extraction and document question answering. A preliminary evaluation on the quality of our selected evals shows 93% and 90% accuracy respectively. Our research tackles a growing problem facing engineering teams: how to evaluate and review outputs from FM tasks.

[AI-51] Solving LLM Repetition Problem in Production: A Comprehensive Study of Multiple Solutions

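Quick read: This paper addresses the repetition problem that large language models (LLMs) exhibit in real-world batch code interpretation tasks, where the model keeps emitting repetitive content without a proper termination condition, causing severe performance degradation or even system stalls. The study identifies three typical repetition patterns: business rule generation, method call relationship analysis, and PlantUML diagram syntax generation. Through a theoretical analysis based on Markov models, the authors attribute the root cause to greedy decoding's inability to escape repetitive loops, aggravated by self-reinforcement effects. The key solutions are: (1) Beam Search with early_stopping=True as a universal post-hoc mechanism that resolves all three patterns; (2) the presence_penalty hyperparameter, which specifically addresses the first pattern; and (3) Direct Preference Optimization (DPO) fine-tuning as a universal model-level solution for all three. The authors identify early_stopping as the critical parameter for Beam Search to be effective, and all solutions were validated in a real production environment.

Link: https://arxiv.org/abs/2512.04419
Authors: Weiwei Wang, Weijie Zou, Jiyong Min
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: (none)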

Abstract:The repetition problem, where Large Language Models (LLMs) continuously generate repetitive content without proper termination, poses a critical challenge in production deployments, causing severe performance degradation and system stalling. This paper presents a comprehensive investigation and multiple practical solutions for the repetition problem encountered in real-world batch code interpretation tasks. We identify three distinct repetition patterns: (1) business rule generation repetition, (2) method call relationship analysis repetition, and (3) PlantUML diagram syntax generation repetition. Through rigorous theoretical analysis based on Markov models, we establish that the root cause lies in greedy decoding’s inability to escape repetitive loops, exacerbated by self-reinforcement effects. Our comprehensive experimental evaluation demonstrates three viable solutions: (1) Beam Search decoding with early_stopping=True serves as a universal post-hoc mechanism that effectively resolves all three repetition patterns; (2) presence_penalty hyperparameter provides an effective solution specifically for BadCase 1; and (3) Direct Preference Optimization (DPO) fine-tuning offers a universal model-level solution for all three BadCases. The primary value of this work lies in combining first-hand production experience with extensive experimental validation. Our main contributions include systematic theoretical analysis of repetition mechanisms, comprehensive evaluation of multiple solutions with task-specific applicability mapping, identification of early_stopping as the critical parameter for Beam Search effectiveness, and practical production-ready solutions validated in real deployment environments. 
Cite as: arXiv:2512.04419 [cs.AI]. DOI: https://doi.org/10.48550/arXiv.2512.04419 (pending registration). Submission history: [v1] Thu, 4 Dec 2025 03:30:18 UTC (27 KB), from Weiwei Wang.
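
To see why greedy decoding can loop and how a presence penalty can break the loop, consider a toy decoder. This is an illustrative sketch, not the paper's setup: the fixed logits stand in for a model stuck in a repetitive state, the token names are invented, and the penalty follows the common convention of subtracting a constant from the logit of every token that has already appeared.

```python
def apply_presence_penalty(logits, generated, presence_penalty):
    """Subtract presence_penalty from the logit of each already-seen token."""
    seen = set(generated)
    return {tok: (score - presence_penalty if tok in seen else score)
            for tok, score in logits.items()}


def greedy_decode(logits, max_steps, presence_penalty=0.0):
    """Toy greedy decoder that reuses the same base logits at every step."""
    output = []
    for _ in range(max_steps):
        adjusted = apply_presence_penalty(logits, output, presence_penalty)
        token = max(adjusted, key=adjusted.get)
        output.append(token)
        if token == "<eos>":
            break
    return output


if __name__ == "__main__":
    base = {"rule": 2.0, "<eos>": 1.5, "detail": 1.0}
    print(greedy_decode(base, max_steps=4))
    print(greedy_decode(base, max_steps=4, presence_penalty=1.0))
```

Without the penalty the highest-scoring token wins at every step; with it, the repeated token is demoted and the end-of-sequence token can be selected.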

[AI-52] GovBench: Benchmarking LLM Agents for Real-World Data Governance Workflows

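Quick read: This paper addresses the poor fit of current automated data science benchmarks to data governance: existing evaluations focus on snippet-level code generation or high-level analytics and do not reflect the core challenge of ensuring the correctness and quality of the data itself. The authors propose the GovBench benchmark, whose key innovation is a "reversed-objective" methodology for synthesizing realistic noise, with rigorous end-to-end metrics for pipeline reliability. Because models perform poorly on complex multi-step workflows and lack robust error-correction mechanisms, the paper further proposes DataGovAgent, a framework built on a Planner-Executor-Evaluator architecture that combines constraint-based planning, retrieval-augmented generation (RAG), and feedback-driven debugging in a sandbox. It markedly raises the Average Task Score (ATS) on complex tasks and substantially reduces the number of debugging iterations.

Link: https://arxiv.org/abs/2512.04416
Authors: Zhou Liu, Zhaoyang Han, Guochen Yan, Hao Liang, Bohan Zeng, Xing Chen, Yuanfeng Song, Wentao Zhang
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: Equal contribution: Zhou Liu and Zhaoyang Han. Corresponding authors: Yuanfeng Song and Wentao Zhang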

Abstract:Data governance ensures data quality, security, and compliance through policies and standards, a critical foundation for scaling modern AI development. Recently, large language models (LLMs) have emerged as a promising solution for automating data governance by translating user intent into executable transformation code. However, existing benchmarks for automated data science often emphasize snippet-level coding or high-level analytics, failing to capture the unique challenge of data governance: ensuring the correctness and quality of the data itself. To bridge this gap, we introduce GovBench, a benchmark featuring 150 diverse tasks grounded in real-world scenarios, built on data from actual cases. GovBench employs a novel “reversed-objective” methodology to synthesize realistic noise and utilizes rigorous metrics to assess end-to-end pipeline reliability. Our analysis on GovBench reveals that current models struggle with complex, multi-step workflows and lack robust error-correction mechanisms. Consequently, we propose DataGovAgent, a framework utilizing a Planner-Executor-Evaluator architecture that integrates constraint-based planning, retrieval-augmented generation, and sandboxed feedback-driven debugging. Experimental results show that DataGovAgent significantly boosts the Average Task Score (ATS) on complex tasks from 39.7 to 54.9 and reduces debugging iterations by over 77.9 percent compared to general-purpose baselines.

[AI-53] Executable Governance for AI: Translating Policies into Rules Using LLMs AAAI-26

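Quick read: This paper addresses the fact that AI policy documents are mostly written as natural-language prose, so practitioners must manually convert them into executable rules, a step that is slow, error-prone, and hard to scale, and that delays the deployment of safeguards in real settings. The key to the solution is the Policy-to-Tests (P2T) framework, consisting of a rule-extraction pipeline and a compact domain-specific language (DSL) that structures hazards, scope, conditions, exceptions, and required evidence from policy text into normalized, machine-readable rules. This automates the conversion from natural-language policy to executable rules, and the framework is validated empirically across general frameworks, sector guidance, and enterprise standards.

Link: https://arxiv.org/abs/2512.04408
Authors: Gautam Varma Datla, Anudeep Vurity, Tejaswani Dash, Tazeem Ahmad, Mohd Adnan, Saima Rafi
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted to AAAI-26 AI Governance Workshop (in-person presentation); 10 pages, 5 figures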

Abstract:AI policy guidance is predominantly written as prose, which practitioners must first convert into executable rules before frameworks can evaluate or enforce them. This manual step is slow, error-prone, difficult to scale, and often delays the use of safeguards in real-world deployments. To address this gap, we present Policy-to-Tests (P2T), a framework that converts natural-language policy documents into normalized, machine-readable rules. The framework comprises a pipeline and a compact domain-specific language (DSL) that encodes hazards, scope, conditions, exceptions, and required evidence, yielding a canonical representation of extracted rules. To test the framework beyond a single policy, we apply it across general frameworks, sector guidance, and enterprise standards, extracting obligation-bearing clauses and converting them into executable rules. These AI-generated rules closely match strong human baselines on span-level and rule-level metrics, with robust inter-annotator agreement on the gold set. To evaluate downstream behavioral and safety impact, we add HIPAA-derived safeguards to a generative agent and compare it with an otherwise identical agent without guardrails. An LLM-based judge, aligned with gold-standard criteria, measures violation rates and robustness to obfuscated and compositional prompts. Detailed results are provided in the appendix. We release the codebase, DSL, prompts, and rule sets as open-source resources to enable reproducible evaluation.

[AI-54] AutoGuard: A Self-Healing Proactive Security Layer for DevSecOps Pipelines Using Reinforcement Learning

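Quick read: This paper addresses the delayed response of current DevSecOps pipelines to security threats in continuous integration and deployment (CI/CD) environments: traditional rule-based intrusion detection and static vulnerability scanning adapt poorly to system changes, which slows the security response and leaves systems exposed to emerging attack vectors. The key to the solution is AutoGuard, a self-healing security framework based on reinforcement learning (RL) that continuously observes CI/CD pipeline activity and dynamically learns a policy for pre-emptive remediation of potential anomalies and real-time response. Its core idea is reward-driven learning that keeps refining security actions, improving threat detection accuracy, shortening mean time to recovery (MTTR), and strengthening overall resilience.

Link: https://arxiv.org/abs/2512.04368
Authors: Praveen Anugula, Avdhesh Kumar Bhardwaj, Navin Chhibber, Rohit Tewari, Sunil Khemka, Piyush Ranjan
Affiliations: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
Comments: Accepted and Presented at 1st IEEE Uttar Pradesh Section Women in Engineering International Conference on Electrical Electronics and Computer Engineering (UPWIECON 2025) organized by NIELIT Dehradun held during 30th 31st October 2025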

Abstract:Contemporary DevSecOps pipelines have to deal with the evolution of security in an ever-continuously integrated and deployed environment. Existing methods, such as rule-based intrusion detection and static vulnerability scanning, are inadequate and unreceptive to changes in the system, causing longer response times and organization needs exposure to emerging attack vectors. In light of the previous constraints, we introduce AutoGuard to the DevSecOps ecosystem, a reinforcement learning (RL)-powered self-healing security framework built to pre-emptively protect DevSecOps environments. AutoGuard is a self-securing security environment that continuously observes pipeline activities for potential anomalies while preemptively remediating the environment. The model observes and reacts based on a policy that is continually learned dynamically over time. The RL agent improves each action over time through reward-based learning aimed at improving the agent’s ability to prevent, detect and respond to a security incident in real-time. Testing using simulated Continuous Integration / Continuous Deployment (CI/CD) environments showed AutoGuard to successfully improve threat detection accuracy by 22%, reduce mean time to recovery (MTTR) for incidents by 38% and increase overall resilience to incidents as compared to traditional methods.
Keywords: DevSecOps, Reinforcement Learning, Self-Healing Security, Continuous Integration, Automated Threat Mitigation. Cite as: arXiv:2512.04368 [cs.CR]. DOI: https://doi.org/10.48550/arXiv.2512.04368 (pending registration).

[AI-55] AgentBay: A Hybrid Interaction Sandbox for Seamless Human-AI Intervention in Agentic Systems

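Quick read: This paper addresses the brittleness of autonomous AI agents driven by large language models (LLMs) when they meet real-world exceptions; for tasks that demand high reliability, automation alone cannot guarantee stability, so human supervision is needed. The key to the solution is AgentBay, a sandbox service designed for hybrid interaction. Its core innovation is the Adaptive Streaming Protocol (ASP), which enables seamless hand-over of control between an AI agent and a human operator: the agent interacts programmatically through mainstream interfaces (MCP, open-source SDK), while a human can take over full manual control at any moment. ASP dynamically blends command-based and video-based streaming and adapts its encoding strategy to network conditions and the current controller (AI or human). Under weak networks it keeps latency low (end-to-end latency reduced by about 5%) and interaction smooth, and cuts bandwidth consumption by up to 50%, improving security, performance, and task completion (success rate on complex tasks improved by more than 48%).

Link: https://arxiv.org/abs/2512.04367
Authors: Yun Piao, Hongbo Min, Hang Su, Leilei Zhang, Lei Wang, Yue Yin, Xiao Wu, Zhejing Xu, Liwei Qu, Hang Li, Xinxin Zeng, Wei Tian, Fei Yu, Xiaowei Li, Jiayi Jiang, Tongxu Liu, Hao Tian, Yufei Que, Xiaobing Tu, Bing Suo, Yuebing Li, Xiangting Chen, Zeen Zhao, Jiaming Tang, Wei Huang, Xuguang Li, Jing Zhao, Jin Li, Jie Shen, Jinkui Ren, Xiantao Zhang
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: (none)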

Abstract:The rapid advancement of Large Language Models (LLMs) is catalyzing a shift towards autonomous AI Agents capable of executing complex, multi-step tasks. However, these agents remain brittle when faced with real-world exceptions, making Human-in-the-Loop (HITL) supervision essential for mission-critical applications. In this paper, we present AgentBay, a novel sandbox service designed from the ground up for hybrid interaction. AgentBay provides secure, isolated execution environments spanning Windows, Linux, Android, Web Browsers, and Code interpreters. Its core contribution is a unified session accessible via a hybrid control interface: An AI agent can interact programmatically via mainstream interfaces (MCP, Open Source SDK), while a human operator can, at any moment, seamlessly take over full manual control. This seamless intervention is enabled by Adaptive Streaming Protocol (ASP). Unlike traditional VNC/RDP, ASP is specifically engineered for this hybrid use case, delivering an ultra-low-latency, smoother user experience that remains resilient even in weak network environments. It achieves this by dynamically blending command-based and video-based streaming, adapting its encoding strategy based on network conditions and the current controller (AI or human). Our evaluation demonstrates strong results in security, performance, and task completion rates. In a benchmark of complex tasks, the AgentBay (Agent + Human) model achieved more than 48% success rate improvement. Furthermore, our ASP protocol reduces bandwidth consumption by up to 50% compared to standard RDP, and in end-to-end latency with around 5% reduction, especially under poor network conditions. We posit that AgentBay provides a foundational primitive for building the next generation of reliable, human-supervised autonomous systems.

[AI-56] Efficient Reinforcement Learning with Semantic and Token Entropy for LLM Reasoning

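Quick read: This paper addresses entropy collapse in large language models (LLMs) trained under Reinforcement Learning with Verifiable Rewards (RLVR), which reduces policy exploration and limits reasoning performance. The key to the solution is joint optimization of data organization and algorithm design. On the data side, semantic entropy-guided curriculum learning orders training data from low to high semantic entropy so that optimization progresses from easier to harder tasks. On the algorithm side, non-uniform token treatment applies KL regularization to low-entropy tokens to preserve policy exploration and imposes stronger constraints on the high-covariance portions of those tokens, which mitigates entropy collapse and improves LLM reasoning.

Link: https://arxiv.org/abs/2512.04359
Authors: Hongye Cao, Zhixin Bai, Ziyue Peng, Boyan Wang, Tianpei Yang, Jing Huo, Yuyao Zhang, Yang Gao
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: (none)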

Abstract:Reinforcement learning with verifiable rewards (RLVR) has demonstrated superior performance in enhancing the reasoning capability of large language models (LLMs). However, this accuracy-oriented learning paradigm often suffers from entropy collapse, which reduces policy exploration and limits reasoning capabilities. To address this challenge, we propose an efficient reinforcement learning framework that leverages entropy signals at both the semantic and token levels to improve reasoning. From the data perspective, we introduce semantic entropy-guided curriculum learning, organizing training data from low to high semantic entropy to guide progressive optimization from easier to more challenging tasks. For the algorithmic design, we adopt non-uniform token treatment by imposing KL regularization on low-entropy tokens that critically impact policy exploration and applying stronger constraints on high-covariance portions within these tokens. By jointly optimizing data organization and algorithmic design, our method effectively mitigates entropy collapse and enhances LLM reasoning. Experimental results across 6 benchmarks with 3 different parameter-scale base models demonstrate that our method outperforms other entropy-based approaches in improving reasoning.
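
The low-to-high semantic-entropy ordering can be sketched as follows. This is a minimal illustration under an explicit assumption: sampled responses are grouped into semantic clusters by exact match of their final answer, which may differ from how the paper defines clusters. The names `semantic_entropy` and `curriculum_order` are hypothetical.

```python
import math
from collections import Counter


def semantic_entropy(final_answers):
    """Entropy (nats) of the empirical distribution over answer clusters.

    Assumption: two samples belong to the same semantic cluster when
    their final answers are identical strings.
    """
    counts = Counter(final_answers)
    total = len(final_answers)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def curriculum_order(samples):
    """Return prompts sorted from low to high semantic entropy.

    samples: prompt -> list of final answers sampled from the model.
    """
    return sorted(samples, key=lambda prompt: semantic_entropy(samples[prompt]))


if __name__ == "__main__":
    sampled = {
        "hard": ["1", "2", "3", "4"],
        "easy": ["7", "7", "7", "7"],
        "medium": ["5", "5", "5", "6"],
    }
    print(curriculum_order(sampled))
```

Prompts on which the model's samples agree come first, and prompts with scattered answers come last.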

[AI-57] Counting Without Running: Evaluating LLMs' Reasoning About Code Complexity

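Quick read: This paper addresses the lack of forward-looking reasoning in current large language models (LLMs) on GPU performance prediction, in particular their weak ability to estimate floating-point operation (FLOP) counts. Although LLMs have advanced quickly in code generation, they struggle to identify and quantify single- and double-precision floating-point operations in CUDA kernels without actually running them, and errors become severe when the count depends on implicit compiler or runtime behavior. The key to the solution is the gpuFLOPBench benchmark, a test set of 577 CUDA kernels from HeCBench annotated with ground-truth profiles and eight execution attributes, used to assess whether a model can predict FLOP counts "without running". Experiments show that the newest closed-source models do well on simple kernels but still make order-of-magnitude errors when implicit FLOPs arise from division, intrinsic math functions, or common subexpressions. This exposes a core limitation of existing code assistants, their inability to internalize hardware-specific microcode effects, and provides a focused testbed for developing LLM tooling that reasons about performance as rigorously as experienced GPU developers.

Link: https://arxiv.org/abs/2512.04355
Authors: Gregory Bolet, Giorgis Georgakoudis, Konstantinos Parasyris, Harshitha Menon, Niranjan Hasabnis, Kirk W. Cameron, Gal Oren
Affiliations: unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Performance (cs.PF)
Comments: 13 pages, 6 figures, MLSys 2026 Submission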

Abstract:Modern GPU software stacks demand developers who can anticipate performance bottlenecks before ever launching a kernel; misjudging floating-point workloads upstream can derail tuning, scheduling, and even hardware procurement. Yet despite rapid progress in code generation, today’s Large Language Models (LLMs) are rarely tested on this kind of forward-looking reasoning. We close that gap with gpuFLOPBench, a benchmark that asks models to “count without running” by predicting single and double-precision FLOP counts for 577 CUDA kernels drawn from HeCBench, annotated with ground-truth profiles and eight execution attributes that distinguish trivially analyzable code from kernels whose FLOPs depend on hidden compiler or runtime behavior. Evaluating current closed-source reasoning models shows clear but uneven progress: the newest LLMs achieve perfect classification on straightforward kernels but still incur multiple order-of-magnitude errors whenever implicit FLOPs arise from division, intrinsic math functions, or common subexpressions. These results surface a core limitation of existing code assistants – the inability to internalize hardware-specific microcode effects – and position gpuFLOPBench as a focused testbed for developing LLM tooling that can reason about performance with the same rigor as experienced GPU developers. Sources are available at our repository: this https URL
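
For the "trivially analyzable" end of the spectrum, counting without running reduces to arithmetic on the loop bounds. The sketch below is an illustration, not part of gpuFLOPBench: it counts one multiply and one add per element for SAXPY and per inner-product term for a naive matrix multiply, following the common convention that a fused multiply-add counts as two FLOPs. Profiled counts can differ once division or math intrinsics expand into several instructions, which is the harder case the abstract describes.

```python
def saxpy_flops(n: int) -> int:
    """y[i] = a * x[i] + y[i] for n elements: one multiply and one add each."""
    return 2 * n


def naive_matmul_flops(m: int, n: int, k: int) -> int:
    """C (m x n) = A (m x k) @ B (k x n): k multiply-add pairs per output."""
    return 2 * m * n * k


if __name__ == "__main__":
    print(saxpy_flops(1 << 20))
    print(naive_matmul_flops(128, 64, 32))
```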

[AI-58] A Conceptual Model for AI Adoption in Financial Decision-Making: Addressing the Unique Challenges of Small and Medium-Sized Enterprises

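Quick read: This paper addresses the main barriers small and medium-sized enterprises (SMEs) face when adopting artificial intelligence (AI) for financial decision-making, including limited resources, insufficient technical expertise, and weak data management capabilities. The core of the solution is a layered conceptual model covering five layers: data sources, data processing and integration, AI model deployment, decision support and automation, and validation and risk management. It emphasizes incremental implementation to improve financial forecasting, budgeting, investment strategy, and risk management. The model stresses data quality and continuous model validation, giving SMEs a practical path to applying AI in finance.

Link: https://arxiv.org/abs/2512.04339
Authors: Manh Chien Vu, Thang Le Dinh, Manh Chien Vu, Tran Duc Le, Thi Lien Huong Nguyen
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: The Eighth International Econometric and Financial Conference of Vietnam - ECONVN2025, Ho Chi Minh City, Vietnam, January 13-14-15, 2025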

Abstract:The adoption of artificial intelligence (AI) offers transformative potential for small and medium-sized enterprises (SMEs), particularly in enhancing financial decision-making processes. However, SMEs often face significant barriers to implementing AI technologies, including limited resources, technical expertise, and data management capabilities. This paper presents a conceptual model for the adoption of AI in financial decision-making for SMEs. The proposed model addresses key challenges faced by SMEs, including limited resources, technical expertise, and data management capabilities. The model is structured into layers: data sources, data processing and integration, AI model deployment, decision support and automation, and validation and risk management. By implementing AI incrementally, SMEs can optimize financial forecasting, budgeting, investment strategies, and risk management. This paper highlights the importance of data quality and continuous model validation, providing a practical roadmap for SMEs to integrate AI into their financial operations. The study concludes with implications for SMEs adopting AI-driven financial processes and suggests areas for future research in AI applications for SME finance.

[AI-59] RGE-GCN: Recursive Gene Elimination with Graph Convolutional Networks for RNA-seq based Early Cancer Detection

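Quick read: This paper addresses the difficulty of identifying reliable cancer biomarkers from high-dimensional RNA sequencing (RNA-seq) data, where conventional statistical methods struggle to capture complex nonlinear relationships between genes. The key to the solution is RGE-GCN (Recursive Gene Elimination with Graph Convolutional Networks), which integrates feature selection and classification in one end-to-end pipeline. It first builds a gene co-expression graph from gene expression profiles, uses a Graph Convolutional Network (GCN) to classify cancer versus normal samples, and applies Integrated Gradients to identify key genes. It then recursively removes irrelevant genes, converging to a set of biomarkers that is both highly predictive and interpretable. On real lung, kidney, and cervical cancer datasets the method significantly outperforms mainstream tools such as DESeq2, edgeR, and limma-voom, and the selected genes are enriched in known cancer pathways including PI3K-AKT, MAPK, SUMOylation, and immune regulation, indicating good generalization and biological relevance.

Link: https://arxiv.org/abs/2512.04333
Authors: Shreyas Shende, Varsha Narayanan, Vishal Fenn, Yiran Huang, Dincer Goksuluk, Gaurav Choudhary, Melih Agraz, Mengjia Xu
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 12 pages, 2 figures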

Abstract:Early detection of cancer plays a key role in improving survival rates, but identifying reliable biomarkers from RNA-seq data is still a major challenge. The data are high-dimensional, and conventional statistical methods often fail to capture the complex relationships between genes. In this study, we introduce RGE-GCN (Recursive Gene Elimination with Graph Convolutional Networks), a framework that combines feature selection and classification in a single pipeline. Our approach builds a graph from gene expression profiles, uses a Graph Convolutional Network to classify cancer versus normal samples, and applies Integrated Gradients to highlight the most informative genes. By recursively removing less relevant genes, the model converges to a compact set of biomarkers that are both interpretable and predictive. We evaluated RGE-GCN on synthetic data as well as real-world RNA-seq cohorts of lung, kidney, and cervical cancers. Across all datasets, the method consistently achieved higher accuracy and F1-scores than standard tools such as DESeq2, edgeR, and limma-voom. Importantly, the selected genes aligned with well-known cancer pathways including PI3K-AKT, MAPK, SUMOylation, and immune regulation. These results suggest that RGE-GCN shows promise as a generalizable approach for RNA-seq based early cancer detection and biomarker discovery (this https URL ).

[AI-60] MANTRA: a Framework for Multi-stage Adaptive Noise TReAtment During Training

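Quick read: This paper addresses the loss of accuracy and robustness that deep learning models suffer on software engineering (SE) tasks when training data contains noisy labels. Noise Label Learning (NLL) has been explored in other fields, but SE lacks systematic methods, especially for code-Pretrained Language Models (PTMs) and Large Language Models (LLMs). The key to the solution is MANTRA, a multi-stage adaptive noise treatment framework that embeds noise diagnosis and mitigation into fine-tuning. It first analyzes how different noise levels affect model convergence and loss trajectories, then uses an adaptive dropout strategy based on per-sample loss dynamics together with Gaussian Mixture Model (GMM) clustering to identify and exclude persistently anomalous data points while keeping high-quality samples. Experiments show that MANTRA improves several LLMs on code summarization and commit intent classification, reducing data-cleaning cost and strengthening fine-tuning.

Link: https://arxiv.org/abs/2512.04319
Authors: Zixiao Zhao, Fatemeh H. Fard, Jie JW Wu
Affiliations: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: (none)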

Abstract:The reliable application of deep learning models to software engineering tasks hinges on high-quality training data. Yet, large-scale repositories inevitably introduce noisy or mislabeled examples that degrade both accuracy and robustness. While Noise Label Learning (NLL) has been extensively studied in other fields, there are a few works that investigate NLL in Software Engineering (SE) and Large Language Models (LLMs) for SE tasks. In this work, we propose MANTRA, a Multi-stage Adaptive Noise TReAtment framework that embeds noise diagnosis and mitigation directly into the fine-tuning process of code-Pretrained Language Models (PTM) and code-LLMs. We first investigate the effect of noise at varying levels on convergence and loss trajectories of the models. Then we apply an adaptive dropout strategy guided by per-sample loss dynamics and Gaussian Mixture Model clustering to exclude persistently noisy points while preserving clean data. Applying to code summarization and commit intent classification, our experiments reveal that some LLMs are more sensitive to noise than others. However, with MANTRA, the performance of all models in both tasks is improved. MANTRA enables researchers and practitioners to reduce the impact of errors introduced by the dataset in training, saves time in data cleaning and processing, while maximizing the effect of fine-tuning.

[AI-61] Evaluating Long-Context Reasoning in LLM-Based WebAgents NEURIPS25

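Quick read: This paper addresses the sharp performance drop of WebAgents driven by large language models (LLMs) when they must reason and act in long-context scenarios, and in particular how to use extended history in multi-session interactions to keep tasks coherent. The key to the approach is a new evaluation framework that injects irrelevant task trajectories between dependent subtasks to simulate long-running user interactions, producing contexts from 25,000 to 150,000 tokens, on which four mainstream models are systematically evaluated. Success rates fall steeply as context grows, and the main failure modes are getting stuck in loops and drifting from the original task objective. An implicit Retrieval-Augmented Generation (RAG) strategy that generates task-relevant summaries helps somewhat, but the fundamental limits of long-context reasoning remain, underscoring the need for agent architectures with stronger long-term memory and task consistency.

Link: https://arxiv.org/abs/2512.04307
Authors: Andy Chung, Yichi Zhang, Kaixiang Lin, Aditya Rawal, Qiaozi Gao, Joyce Chai
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted NeurIPS 25 LAW Workshop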

Abstract:As large language model (LLM)-based agents become increasingly integrated into daily digital interactions, their ability to reason across long interaction histories becomes crucial for providing personalized and contextually aware assistance. However, the performance of these agents in long context scenarios, particularly for action-taking WebAgents operating in realistic web environments, remains largely unexplored. This paper introduces a benchmark for evaluating long context reasoning capabilities of WebAgents through sequentially dependent subtasks that require retrieval and application of information from extended interaction histories. We develop a novel evaluation framework that simulates multi-session user interactions by injecting irrelevant task trajectories between dependent subtasks, creating contexts ranging from 25,000 to 150,000 tokens. Through extensive evaluation of four popular models, Claude-3.7, GPT-4.1, Llama 4, and o4-mini, we observe a dramatic performance degradation as context length increases, with success rates dropping from 40-50% in baseline conditions to less than 10% in long context scenarios. Our detailed error analysis reveals that agents primarily fail due to getting stuck in loops and losing track of original task objectives. We further propose an implicit RAG approach that provides modest improvements by generating task-relevant summaries, though fundamental limitations in long context reasoning persist. These findings highlight critical challenges for deploying WebAgents in realistic, long-term user interaction scenarios and provide insights for developing more robust agent architectures capable of maintaining coherent task execution across extended contexts.

[AI-62] Towards better dense rewards in Reinforcement Learning Applications

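Quick read: This paper addresses the problem of designing dense reward functions in reinforcement learning (RL): when reward signals are sparse, delayed, or misaligned with the task objective, how to build high-quality dense rewards that guide exploration effectively while avoiding reward hacking and unintended behavior. The key to the solution is to explore several advanced approaches, including inverse reinforcement learning, reward modeling from human preferences, and self-supervised learning of intrinsic rewards, to improve the generality, scalability, and alignment with human intent of reward functions, and so make dense reward construction more effective and reliable across applications.

Link: https://arxiv.org/abs/2512.04302
Authors: Shuyuan Zhang
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: arXiv admin note: substantial text overlap with arXiv:2505.20417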

Abstract:Finding meaningful and accurate dense rewards is a fundamental task in the field of reinforcement learning (RL) that enables agents to explore environments more efficiently. In traditional RL settings, agents learn optimal policies through interactions with an environment guided by reward signals. However, when these signals are sparse, delayed, or poorly aligned with the intended task objectives, agents often struggle to learn effectively. Dense reward functions, which provide informative feedback at every step or state transition, offer a potential solution by shaping agent behavior and accelerating learning. Despite their benefits, poorly crafted reward functions can lead to unintended behaviors, reward hacking, or inefficient exploration. This problem is particularly acute in complex or high-dimensional environments where handcrafted rewards are difficult to specify and validate. To address this, recent research has explored a variety of approaches, including inverse reinforcement learning, reward modeling from human preferences, and self-supervised learning of intrinsic rewards. While these methods offer promising directions, they often involve trade-offs between generality, scalability, and alignment with human intent. This proposal explores several approaches to dealing with these unsolved problems and enhancing the effectiveness and reliability of dense reward construction in different RL applications.

[AI-63] Artificial Intelligence Applications in Horizon Scanning for Infectious Diseases

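[Quick read]: This review asks how artificial intelligence (AI) can make Horizon Scanning for infectious diseases more effective, so that emerging threats and opportunities are identified earlier and public health preparedness improves. It examines how AI tools can strengthen four core activities (signal detection, data monitoring, scenario analysis, and decision support), assesses the risks that AI adoption brings, and proposes strategies for implementation and governance so that AI can be applied effectively and sustainably in public health foresight work.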

Link: https://arxiv.org/abs/2512.04287
Authors: Ian Miles,Mayumi Wakimoto,Wagner Meira Jr.,Daniela Paula,Daylene Ticiane,Bruno Rosa,Jane Biddulph,Stelios Georgiou,Valdir Ermida
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Populations and Evolution (q-bio.PE)
Comments: 21 pages, 1 box, 1 figure

Abstract:This review explores the integration of Artificial Intelligence into Horizon Scanning, focusing on identifying and responding to emerging threats and opportunities linked to Infectious Diseases. We examine how AI tools can enhance signal detection, data monitoring, scenario analysis, and decision support. We also address the risks associated with AI adoption and propose strategies for effective implementation and governance. The findings contribute to the growing body of Foresight literature by demonstrating the potential and limitations of AI in Public Health preparedness.

[AI-64] Bootstrapped Mixed Rewards for RL Post-Training: Injecting Canonical Action Order

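[Quick read]: This paper asks whether structural information about solution order can improve performance during reinforcement learning (RL) post-training, even when supervised fine-tuning used randomized solution sequences. RL post-training typically optimizes a single scalar objective (here, cell accuracy) and ignores how solutions are produced. The key idea is a coarse ordering reward, used only during RL post-training, that increases when the model's emission order aligns with a canonical solver order. On Sudoku, the authors post-train with Group Relative Policy Optimization (GRPO), combining cell accuracy and the ordering reward in fixed mixtures, with a simple bootstrapped scaling that equalizes component magnitudes at initialization. The best mixture yields substantially higher test accuracy than the model fine-tuned on random-order sequences and approaches the accuracy of a model fine-tuned on solver-order sequences, suggesting that ordering signals can steer RL optimization without changing the supervised data or the architecture.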

Link: https://arxiv.org/abs/2512.04277
Authors: Prakhar Gupta,Vaibhav Gupta
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Post-training with reinforcement learning (RL) typically optimizes a single scalar objective and ignores structure in how solutions are produced. We ask whether a scalar hint toward a canonical solver ordering, used only during RL post-training, improves performance even when fine-tuned on randomized solution sequences. On Sudoku, we train a Transformer with standard fine-tuning on randomized solving orders, then post-train it with Group Relative Policy Optimization (GRPO) with two rewards: cell accuracy and an ordering reward that increases when the model’s emission order aligns with the solver order. To compare signals cleanly, we combine them via fixed mixtures and use a simple bootstrapped scaling to equalize component magnitudes at initialization. Mixed rewards generally outperform cell-only optimization–the best mixture yields substantially higher test accuracy than the fine-tuned-only model trained on random-order and approaches the fine-tuned-only model trained on solver-order sequences in accuracy. These results suggest that coarse ordering signals can steer RL post-training toward solver-order trajectories without modifying supervised data or architecture.
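
The reward mixing described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the choice of mean absolute reward as the bootstrap statistic, and the sample reward values are all assumptions. The advantage function uses the within-group standardization commonly associated with GRPO.

```python
from statistics import mean, pstdev

def bootstrap_scales(cell_rewards, order_rewards, eps=1e-8):
    """Per-component scales from rewards sampled at initialization,
    chosen so both components start with mean magnitude close to 1.
    (Assumed statistic: mean absolute reward.)"""
    s_cell = 1.0 / (mean([abs(r) for r in cell_rewards]) + eps)
    s_order = 1.0 / (mean([abs(r) for r in order_rewards]) + eps)
    return s_cell, s_order

def mixed_reward(cell_r, order_r, scales, alpha):
    """Fixed mixture: alpha weights cell accuracy, (1 - alpha) the ordering reward."""
    s_cell, s_order = scales
    return alpha * s_cell * cell_r + (1.0 - alpha) * s_order * order_r

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards from rollouts of the fine-tuned model before any RL update.
scales = bootstrap_scales(cell_rewards=[0.8, 0.6], order_rewards=[0.2, 0.2])
# Hypothetical (cell accuracy, ordering reward) pairs for one group of samples.
group = [mixed_reward(c, o, scales, alpha=0.5)
         for c, o in [(0.9, 0.1), (0.7, 0.3), (0.5, 0.2)]]
print(group_relative_advantages(group))
```

Scaling before mixing means the mixture weight `alpha` expresses the intended balance between the two signals, rather than being dominated by whichever reward happens to have the larger raw magnitude.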

[AI-65] The Geometry of Benchmarks: A New Path Toward AGI

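[Quick read]: This paper addresses the lack of a systematic way to measure generality and autonomous self-improvement in current AI evaluation: benchmarks are mostly isolated test suites, which gives little basis for reasoning about how consistently an agent performs across diverse tasks or how it might improve itself. The key contribution is a geometric framework in which every psychometric battery is a point in a moduli space and agent performance is described by capability functionals over that space. On this basis the paper proposes an Autonomous AI (AAI) Scale, identifies equivalence classes of benchmarks so that dense families of batteries can certify performance over whole regions of task space, and introduces a general Generator-Verifier-Updater (GVU) operator with an associated self-improvement coefficient κ, defined as the Lie derivative of a capability functional along the GVU flow. Progress toward AGI is then understood as a flow on the moduli space of benchmarks driven by GVU dynamics.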

Link: https://arxiv.org/abs/2512.04276
Authors: Przemyslaw Chojecki
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments:

Abstract:Benchmarks are the primary tool for assessing progress in artificial intelligence (AI), yet current practice evaluates models on isolated test suites and provides little guidance for reasoning about generality or autonomous self-improvement. Here we introduce a geometric framework in which all psychometric batteries for AI agents are treated as points in a structured moduli space, and agent performance is described by capability functionals over this space. First, we define an Autonomous AI (AAI) Scale, a Kardashev-style hierarchy of autonomy grounded in measurable performance on batteries spanning families of tasks (for example reasoning, planning, tool use and long-horizon control). Second, we construct a moduli space of batteries, identifying equivalence classes of benchmarks that are indistinguishable at the level of agent orderings and capability inferences. This geometry yields determinacy results: dense families of batteries suffice to certify performance on entire regions of task space. Third, we introduce a general Generator-Verifier-Updater (GVU) operator that subsumes reinforcement learning, self-play, debate and verifier-based fine-tuning as special cases, and we define a self-improvement coefficient κ as the Lie derivative of a capability functional along the induced flow. A variance inequality on the combined noise of generation and verification provides sufficient conditions for κ > 0. Our results suggest that progress toward artificial general intelligence (AGI) is best understood as a flow on moduli of benchmarks, driven by GVU dynamics rather than by scores on individual leaderboards.
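
For readers unfamiliar with the term, the Lie derivative of a scalar functional along a flow is its rate of change along that flow. The abstract does not give the paper's notation, so the symbols below are assumptions: F for the capability functional, V for the vector field generating the GVU flow φ_t, and θ for the agent's state.

```latex
% Assumed notation: F = capability functional, V = generator of the GVU flow \varphi_t,
% \theta = agent state. Standard definition of the Lie derivative of a scalar function.
\kappa(\theta) \;=\; (\mathcal{L}_V F)(\theta)
\;=\; \left.\frac{d}{dt}\right|_{t=0} F\bigl(\varphi_t(\theta)\bigr)
```

Under this reading, a positive κ means the capability functional is increasing, to first order, as the agent follows the GVU update.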

[AI-66] Quantitative Analysis of Technical Debt and Pattern Violation in Large Language Model Architectures

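[Quick read]: As large language models (LLMs) move from code completion toward autonomous system design, their effect on long-term software maintainability has not been quantified, in particular whether the microservices they generate suffer from "Architectural Erosion" and accumulate structural technical debt. The paper builds what it describes as the first empirical framework for this question: three models (GPT-5.1, Claude 4.5 Sonnet, and Llama 3 8B) implement a standardized Book Lending Microservice under strict Hexagonal Architecture constraints, and Abstract Syntax Tree (AST) parsing is used to detect architectural violations. The open-weights model shows substantial architectural divergence and "Implementation Laziness", suggesting that smaller open-weights models accelerate the accumulation of structural technical debt when no automated architectural linting is in place.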

Link: https://arxiv.org/abs/2512.04273
Authors: Tyler Slater
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Under review at the Journal of Systems and Software (Special Issue on Impactful Software Architecture)

Abstract:As Large Language Models (LLMs) transition from code completion tools to autonomous system architects, their impact on long-term software maintainability remains unquantified. While existing research benchmarks functional correctness (pass@k), this study presents the first empirical framework to measure “Architectural Erosion” and the accumulation of Technical Debt in AI-synthesized microservices. We conducted a comparative pilot study of three state-of-the-art models (GPT-5.1, Claude 4.5 Sonnet, and Llama 3 8B) by prompting them to implement a standardized Book Lending Microservice under strict Hexagonal Architecture constraints. Utilizing Abstract Syntax Tree (AST) parsing, we find that while proprietary models achieve high architectural conformance (0% violation rate for GPT-5.1), open-weights models exhibit critical divergence. Specifically, Llama 3 demonstrated an 80% Architectural Violation Rate, frequently bypassing interface adapters to create illegal circular dependencies between Domain and Infrastructure layers. Furthermore, we identified a phenomenon of “Implementation Laziness,” where open-weights models generated 60% fewer Logical Lines of Code (LLOC) than their proprietary counterparts, effectively omitting complex business logic to satisfy token constraints. These findings suggest that without automated architectural linting, utilizing smaller open-weights models for system scaffolding accelerates the accumulation of structural technical debt.
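
An AST-based check of the kind this abstract mentions can be sketched with Python's standard `ast` module. The abstract does not describe the study's actual tooling or target language, so this is only an illustration: the package name `infrastructure` and the sample domain module are hypothetical, and the rule checked (domain code must not import infrastructure code) is one common Hexagonal Architecture constraint.

```python
import ast

# Hypothetical top-level package that domain code is not allowed to import.
FORBIDDEN_PACKAGES = ("infrastructure",)

def find_layer_violations(source, forbidden=FORBIDDEN_PACKAGES):
    """Return sorted (line, module) pairs for imports of forbidden packages."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            modules = [node.module or ""]  # module is None for "from . import x"
        else:
            continue
        for module in modules:
            if module.split(".")[0] in forbidden:
                violations.append((node.lineno, module))
    return sorted(violations)

# Hypothetical domain-layer module that reaches directly into infrastructure.
domain_code = """
from dataclasses import dataclass
from infrastructure.db import SqlBookRepository
import infrastructure.cache
"""
print(find_layer_violations(domain_code))
```

A violation rate could then be computed as the share of generated domain modules for which this list is non-empty, though the paper's exact metric definition is not given in the abstract.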

[AI-67] The Initialization Determines Whether In-Context Learning Is Gradient Descent

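[Quick read]: This paper addresses gaps in the theoretical account of in-context learning (ICL), specifically how multi-head linear self-attention (LSA) approximates gradient descent (GD) under more realistic conditions, namely non-zero Gaussian prior means in linear-regression formulations of ICL. The key step is to introduce an initial estimate of the query, called the initial guess. The authors prove an upper bound on the number of heads needed in this setting, observe that a performance gap between one-step GD and multi-head LSA persists, and propose yq-LSA, a generalization of single-head LSA with a trainable initial guess yq, to address it. Theory and linear-regression experiments support the approach, and adding initial-guess capabilities to widely used LLMs improves their performance on a semantic similarity task.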

Link: https://arxiv.org/abs/2512.04268
Authors: Shifeng Xie,Rui Yuan,Simone Rossi,Thomas Hannagan
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:In-context learning (ICL) in large language models (LLMs) is a striking phenomenon, yet its underlying mechanisms remain only partially understood. Previous work connects linear self-attention (LSA) to gradient descent (GD), this connection has primarily been established under simplified conditions with zero-mean Gaussian priors and zero initialization for GD. However, subsequent studies have challenged this simplified view by highlighting its overly restrictive assumptions, demonstrating instead that under conditions such as multi-layer or nonlinear attention, self-attention performs optimization-like inference, akin to but distinct from GD. We investigate how multi-head LSA approximates GD under more realistic conditions specifically when incorporating non-zero Gaussian prior means in linear regression formulations of ICL. We first extend multi-head LSA embedding matrix by introducing an initial estimation of the query, referred to as the initial guess. We prove an upper bound on the number of heads needed for ICL linear regression setup. Our experiments confirm this result and further observe that a performance gap between one-step GD and multi-head LSA persists. To address this gap, we introduce yq-LSA, a simple generalization of single-head LSA with a trainable initial guess yq. We theoretically establish the capabilities of yq-LSA and provide experimental validation on linear regression tasks, thereby extending the theory that bridges ICL and GD. Finally, inspired by our findings in the case of linear regression, we consider widespread LLMs augmented with initial guess capabilities, and show that their performance is improved on a semantic similarity task.

[AI-68] Catching UX Flaws in Code: Leveraging LLMs to Identify Usability Flaws at the Development Stage

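[Quick read]: Traditional usability evaluation by human experts is time-consuming and subjective, especially early in development, and fits poorly with fast iteration. This paper examines whether large language models (LLMs) can provide feasible and reliable automated usability testing at that stage. It builds an evaluation pipeline on OpenAI's GPT-4o that applies Jakob Nielsen's ten usability heuristics to 30 open-source websites, producing over 850 heuristic evaluations across three independent runs per site, and quantifies agreement between runs with Cohen's Kappa and Krippendorff's Alpha. For detecting whether a usability issue is present, GPT-4o shows moderate consistency (average Kappa = 0.50, exact agreement 84%); for severity judgments, consistency is lower (exact agreement only 56%, Alpha near zero). The model is therefore usable as an early screening tool but needs human review.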

Link: https://arxiv.org/abs/2512.04262
Authors: Nolan Platt,Ethan Luchs,Sehrish Nizamani
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 7 pages. Published in Proceedings of the 2025 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC). DOI: https://doi.org/10.1109/VL-HCC65237.2025.00024

Abstract:Usability evaluations are essential for ensuring that modern interfaces meet user needs, yet traditional heuristic evaluations by human experts can be time-consuming and subjective, especially early in development. This paper investigates whether large language models (LLMs) can provide reliable and consistent heuristic assessments at the development stage. By applying Jakob Nielsen’s ten usability heuristics to thirty open-source websites, we generated over 850 heuristic evaluations in three independent evaluations per site using a pipeline of OpenAI’s GPT-4o. For issue detection, the model demonstrated moderate consistency, with an average pairwise Cohen’s Kappa of 0.50 and an exact agreement of 84%. Severity judgments showed more variability: weighted Cohen’s Kappa averaged 0.63, but exact agreement was just 56%, and Krippendorff’s Alpha was near zero. These results suggest that while GPT-4o can produce internally consistent evaluations, especially for identifying the presence of usability issues, its ability to judge severity varies and requires human oversight in practice. Our findings highlight the feasibility and limitations of using LLMs for early-stage, automated usability testing, and offer a foundation for improving consistency in automated User Experience (UX) evaluation. To the best of our knowledge, our work provides one of the first quantitative inter-rater reliability analyses of automated heuristic evaluation and highlights methods for improving model consistency.
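
The gap between 84% exact agreement and a Kappa of 0.50 is easier to see with the statistic written out. The sketch below implements the standard unweighted Cohen's Kappa for two raters; the two rating lists are invented for illustration (1 = issue flagged, 0 = not flagged) and are not data from the paper.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's Kappa: agreement between two raters corrected for chance."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)

# Hypothetical issue-detection results from two independent evaluation runs.
run_1 = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
run_2 = [1, 1, 0, 0, 0, 0, 1, 1, 1, 1]
print(round(cohens_kappa(run_1, run_2), 3))  # 0.583
```

Here the two runs agree on 8 of 10 items (80%), but 52% agreement would be expected by chance given how often each run flags an issue, so Kappa is only about 0.58. The same effect explains how high exact agreement can coexist with a moderate Kappa.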

[AI-69] Hey GPT-OSS Looks Like You Got It - Now Walk Me Through It! An Assessment of the Reasoning Language Models Chain of Thought Mechanism for Digital Forensics

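[Quick read]: This paper addresses the limited explainability of language model outputs in digital forensics, which restricts their operational and legal usability. The approach is to use reasoning language models with an "internal reasoning" mechanism, specifically the locally deployable gpt-oss model, whose reasoning process is fully accessible, and to examine whether that reasoning can make results more explainable. Four test use cases assess the reasoning component. At medium reasoning levels it helps explain and validate model outputs, although this support is often limited, and higher reasoning levels do not improve response quality, so greater reasoning depth does not straightforwardly mean greater usefulness.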

Link: https://arxiv.org/abs/2512.04254
Authors: Gaëtan Michelet,Janine Schneider,Aruna Withanage,Frank Breitinger
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Accepted at DFRWS EU 2026

Abstract:The use of large language models in digital forensics has been widely explored. Beyond identifying potential applications, research has also focused on optimizing model performance for forensic tasks through fine-tuning. However, limited result explainability reduces their operational and legal usability. Recently, a new class of reasoning language models has emerged, designed to handle logic-based tasks through an "internal reasoning" mechanism. Yet, users typically see only the final answer, not the underlying reasoning. One of these reasoning models is gpt-oss, which can be deployed locally, providing full access to its underlying reasoning process. This article presents the first investigation into the potential of reasoning language models for digital forensics. Four test use cases are examined to assess the usability of the reasoning component in supporting result explainability. The evaluation combines a new quantitative metric with qualitative analysis. Findings show that the reasoning component aids in explaining and validating language model outputs in digital forensics at medium reasoning levels, but this support is often limited, and higher reasoning levels do not enhance response quality.

[AI-70] Fine-Tuning ChemBERTa for Predicting Inhibitory Activity Against TDP1 Using Deep Learning

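[Quick read]: This paper tackles the prediction of small-molecule inhibitory activity against Tyrosyl-DNA Phosphodiesterase 1 (TDP1), a key target in overcoming cancer chemoresistance. It proposes a deep learning framework that fine-tunes the pre-trained chemical language model ChemBERTa to regress pIC50 values directly from Simplified Molecular Input Line Entry System (SMILES) strings, and that handles severe activity imbalance (only 2.1% of compounds are active). The study compares two pre-training strategies, Masked Language Modeling (MLM) and Masked Token Regression (MTR), under stratified data splits and sample weighting. In virtual screening the model reaches an enrichment factor EF@1% of 17.4 and a Precision@1% of 37.4, clearly above a random-predictor baseline and competitive with Random Forest, giving a ready-to-deploy tool for prioritizing TDP1 inhibitors for experimental testing.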

Link: https://arxiv.org/abs/2512.04252
Authors: Baichuan Zeng
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Predicting the inhibitory potency of small molecules against Tyrosyl-DNA Phosphodiesterase 1 (TDP1)-a key target in overcoming cancer chemoresistance-remains a critical challenge in early drug discovery. We present a deep learning framework for the quantitative regression of pIC50 values from molecular Simplified Molecular Input Line Entry System (SMILES) strings using fine-tuned variants of ChemBERTa, a pre-trained chemical language model. Leveraging a large-scale consensus dataset of 177,092 compounds, we systematically evaluate two pre-training strategies-Masked Language Modeling (MLM) and Masked Token Regression (MTR)-under stratified data splits and sample weighting to address severe activity imbalance which only 2.1% are active. Our approach outperforms classical baselines Random Predictor in both regression accuracy and virtual screening utility, and has competitive performance compared to Random Forest, achieving high enrichment factor EF@1% 17.4 and precision Precision@1% 37.4 among top-ranked predictions. The resulting model, validated through rigorous ablation and hyperparameter studies, provides a robust, ready-to-deploy tool for prioritizing TDP1 inhibitors for experimental testing. By enabling accurate, 3D-structure-free pIC50 prediction directly from SMILES, this work demonstrates the transformative potential of chemical transformers in accelerating target-specific drug discovery.
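
The two screening metrics quoted in this abstract can be computed as below. The sketch assumes the common definitions: Precision@k% is the share of actives among the top-k% ranked compounds, and the enrichment factor is that precision divided by the overall active rate. The abstract does not spell out its exact definitions, and the scores and labels here are synthetic.

```python
def top_fraction_metrics(scores, is_active, fraction=0.01):
    """Return (precision, enrichment factor) over the top `fraction` of ranked items."""
    n = len(scores)
    k = max(1, round(n * fraction))  # size of the top-ranked slice
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    hits = sum(is_active[i] for i in ranked[:k])
    precision = hits / k
    base_rate = sum(is_active) / n   # active rate in the whole library
    enrichment = precision / base_rate if base_rate > 0 else float("nan")
    return precision, enrichment

# Synthetic library: 200 compounds, the first 4 are active (2% base rate).
is_active = [1 if i < 4 else 0 for i in range(200)]
scores = [0.5 - i * 0.001 for i in range(200)]
scores[0] = 0.99   # one active ranked at the top
scores[10] = 0.98  # one inactive ranked second

precision, ef = top_fraction_metrics(scores, is_active, fraction=0.01)
print(precision, ef)
```

In this toy case the top 1% holds two compounds, one of them active, so precision is 0.5 and the enrichment factor is about 25: the top slice is 25 times richer in actives than a random draw from the library.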

[AI-71] Toward Virtuous Reinforcement Learning

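[Quick read]: This paper identifies two limitations in how ethics is currently modeled in reinforcement learning (RL). First, rule-based (deontological) methods struggle under ambiguity and nonstationarity and do not cultivate lasting habits. Second, single-objective reward approaches compress diverse moral considerations into one scalar signal, which obscures trade-offs and invites proxy gaming. The proposed alternative treats ethics as policy-level dispositions (virtues): relatively stable habits that hold up when incentives, partners, or contexts change. The roadmap has four components: social learning in multi-agent RL to acquire virtue-like behavior patterns; multi-objective and constrained formulations that preserve value conflicts and add risk-aware criteria to guard against harm; affinity-based regularization toward updateable virtue priors for stability under distribution shift; and operationalizing diverse ethical traditions as practical control signals, making explicit the value and cultural assumptions that shape ethical RL benchmarks.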

Link: https://arxiv.org/abs/2512.04246
Authors: Majid Ghasemi,Mark Crowley
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper critiques common patterns in machine ethics for Reinforcement Learning (RL) and argues for a virtue focused alternative. We highlight two recurring limitations in much of the current literature: (i) rule based (deontological) methods that encode duties as constraints or shields often struggle under ambiguity and nonstationarity and do not cultivate lasting habits, and (ii) many reward based approaches, especially single objective RL, implicitly compress diverse moral considerations into a single scalar signal, which can obscure trade offs and invite proxy gaming in practice. We instead treat ethics as policy level dispositions, that is, relatively stable habits that hold up when incentives, partners, or contexts change. This shifts evaluation beyond rule checks or scalar returns toward trait summaries, durability under interventions, and explicit reporting of moral trade offs. Our roadmap combines four components: (1) social learning in multi agent RL to acquire virtue like patterns from imperfect but normatively informed exemplars; (2) multi objective and constrained formulations that preserve value conflicts and incorporate risk aware criteria to guard against harm; (3) affinity based regularization toward updateable virtue priors that support trait like stability under distribution shift while allowing norms to evolve; and (4) operationalizing diverse ethical traditions as practical control signals, making explicit the value and cultural assumptions that shape ethical RL benchmarks.

[AI-72] CRAFT-E: A Neuro-Symbolic Framework for Embodied Affordance Grounding

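[Quick read]: This paper addresses how assistive robots working in unstructured environments can ground the action intent in a language instruction (for example "pick up" or "open") to objects that both afford the requested function and can be physically grasped. Existing methods usually rely on black-box models or fixed affordance labels, which limits transparency, controllability, and reliability. The proposed CRAFT-E is a modular neuro-symbolic framework that combines a verb-property-object knowledge graph with visual-language alignment and energy-based grasp reasoning. It produces interpretable grounding paths for object selection and treats grasp feasibility as part of affordance inference, giving a traceable and diagnosable decision chain from language to action.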

Link: https://arxiv.org/abs/2512.04231
Authors: Zhou Chen,Joe Lin,Carson Bulgin,Sathyanarayanan N. Aakur
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 20 pages, 3 figures, 4 tables. Under review

Abstract:Assistive robots operating in unstructured environments must understand not only what objects are, but what they can be used for. This requires grounding language-based action queries to objects that both afford the requested function and can be physically retrieved. Existing approaches often rely on black-box models or fixed affordance labels, limiting transparency, controllability, and reliability for human-facing applications. We introduce CRAFT-E, a modular neuro-symbolic framework that composes a structured verb-property-object knowledge graph with visual-language alignment and energy-based grasp reasoning. The system generates interpretable grounding paths that expose the factors influencing object selection and incorporates grasp feasibility as an integral part of affordance inference. We further construct a benchmark dataset with unified annotations for verb-object compatibility, segmentation, and grasp candidates, and deploy the full pipeline on a physical robot. CRAFT-E achieves competitive performance in static scenes, ImageNet-based functional retrieval, and real-world trials involving 20 verbs and 39 objects. The framework remains robust under perceptual noise and provides transparent, component-level diagnostics. By coupling symbolic reasoning with embodied perception, CRAFT-E offers an interpretable and customizable alternative to end-to-end models for affordance-grounded object selection, supporting trustworthy decision-making in assistive robotic systems.

[AI-73] Addressing Logical Fallacies In Scientific Reasoning From Large Language Models: Towards a Dual-Inference Training Framework

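[Quick read]: This paper targets systematic weaknesses that large language models (LLMs) show in scientific reasoning because their training relies on one-directional, affirmation-based inference (akin to modus ponens in propositional logic): sensitivity to negation, counterexamples, and faulty premises, susceptibility to logical fallacies, and weak causal reasoning. The proposed remedy is a dual-reasoning training framework that pairs affirmative generation with structured counterfactual denial. Grounded in formal logic, cognitive science, and adversarial training, it formalizes a computational analogue of "denying the antecedent" as a mechanism for disconfirmation and robustness, so that models not only affirm valid inferences but also explicitly reject invalid ones, improving interpretability, resilience, and alignment with human reasoning.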

Link: https://arxiv.org/abs/2512.04228
Authors: Peter B. Walker,Hannah Davidson,Aiden Foster,Matthew Lienert,Thomas Pardue,Dale Russell
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 12 pages, 5 tables

Abstract:Large Language Models (LLMs) have transformed natural language processing and hold growing promise for advancing science, healthcare, and decision-making. Yet their training paradigms remain dominated by affirmation-based inference, akin to modus ponens, where accepted premises yield predicted consequents. While effective for generative fluency, this one-directional approach leaves models vulnerable to logical fallacies, adversarial manipulation, and failures in causal reasoning. This paper makes two contributions. First, it demonstrates how existing LLMs from major platforms exhibit systematic weaknesses when reasoning in scientific domains with negation, counterexamples, or faulty premises (code to recreate these experiments is available at this https URL). Second, it introduces a dual-reasoning training framework that integrates affirmative generation with structured counterfactual denial. Grounded in formal logic, cognitive science, and adversarial training, this training paradigm formalizes a computational analogue of "denying the antecedent" as a mechanism for disconfirmation and robustness. By coupling generative synthesis with explicit negation-aware objectives, the framework enables models that not only affirm valid inferences but also reject invalid ones, yielding systems that are more resilient, interpretable, and aligned with human reasoning.
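
As background for the two argument forms named in this abstract: in classical propositional logic, modus ponens is valid, while the form called denying the antecedent is not. The sketch below checks both by brute-force truth table. It illustrates the logic only; how the paper's training objective makes use of these forms is not specified in the abstract.

```python
from itertools import product

def implies(p, q):
    """Material implication p -> q."""
    return (not p) or q

def is_valid(premises, conclusion):
    """An argument form is valid if no truth assignment makes
    every premise true and the conclusion false."""
    return all(
        conclusion(p, q)
        for p, q in product([True, False], repeat=2)
        if all(premise(p, q) for premise in premises)
    )

# Modus ponens: from (p -> q) and p, infer q.
modus_ponens = is_valid([implies, lambda p, q: p], lambda p, q: q)
# Denying the antecedent: from (p -> q) and not p, infer not q.
denying_antecedent = is_valid([implies, lambda p, q: not p], lambda p, q: not q)
# Modus tollens: from (p -> q) and not q, infer not p.
modus_tollens = is_valid([implies, lambda p, q: not q], lambda p, q: not p)

print(modus_ponens, denying_antecedent, modus_tollens)  # True False True
```

The counterexample for denying the antecedent is p false and q true: both premises hold, yet the conclusion "not q" fails.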

[AI-74] Educational Cone Model in Embedding Vector Spaces

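[Quick read]: This paper addresses how to choose the embedding method best suited to representing text difficulty in intelligent educational systems. Many embedding methods exist, and there is no common standard for judging how well each agrees with difficulty annotations. The proposed Educational Cone Model rests on one assumption: easier texts are semantically less diverse (they focus on fundamental concepts), while harder texts are more diverse, which yields a cone-shaped distribution in the embedding space regardless of the embedding method. The model frames embedding evaluation as an optimization problem and, through specifically designed loss functions, obtains efficient closed-form solutions that avoid costly computation. Experiments on real-world datasets confirm its effectiveness and speed in identifying the embedding spaces best aligned with difficulty annotations.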

Link: https://arxiv.org/abs/2512.04227
Authors: Yo Ehara
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to the 33rd International Conference on Computers in Education (ICCE 2025)

Abstract:Human-annotated datasets with explicit difficulty ratings are essential in intelligent educational systems. Although embedding vector spaces are widely used to represent semantic closeness and are promising for analyzing text difficulty, the abundance of embedding methods creates a challenge in selecting the most suitable method. This study proposes the Educational Cone Model, which is a geometric framework based on the assumption that easier texts are less diverse (focusing on fundamental concepts), whereas harder texts are more diverse. This assumption leads to a cone-shaped distribution in the embedding space regardless of the embedding method used. The model frames the evaluation of embeddings as an optimization problem with the aim of detecting structured difficulty-based patterns. By designing specific loss functions, efficient closed-form solutions are derived that avoid costly computation. Empirical tests on real-world datasets validated the model’s effectiveness and speed in identifying the embedding spaces that are best aligned with difficulty-annotated educational texts.

[AI-75] Orchestrator Multi-Agent Clinical Decision Support System for Secondary Headache Diagnosis in Primary Care

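[Quick read]: This paper addresses the under-recognition of secondary headache in primary care. These headaches often need urgent intervention, but limited time, incomplete information, and varied symptom presentations raise the risk of missed or wrong diagnoses. The solution is a multi-agent clinical decision support system with an orchestrator-specialist architecture: diagnosis is decomposed across seven domain-specialized agents, each producing a structured, evidence-grounded rationale, while a central orchestrator handles task decomposition and agent routing, giving explicit and interpretable secondary headache diagnosis. Experiments show that the multi-agent framework combined with clinical practice guideline-based prompting (GPrompt) consistently outperforms a single-LLM baseline, with larger gains in smaller models, indicating that structured multi-agent reasoning adds accuracy beyond prompt engineering alone.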

Link: https://arxiv.org/abs/2512.04207
Authors: Xizhi Wu,Nelly Estefanie Garduno-Rapp,Justin F Rousseau,Mounika Thakkallapally,Hang Zhang,Yuelyu Ji,Shyam Visweswaran,Yifan Peng,Yanshan Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Unlike most primary headaches, secondary headaches need specialized care and can have devastating consequences if not treated promptly. Clinical guidelines highlight several ‘red flag’ features, such as thunderclap onset, meningismus, papilledema, focal neurologic deficits, signs of temporal arteritis, systemic illness, and the ‘worst headache of their life’ presentation. Despite these guidelines, determining which patients require urgent evaluation remains challenging in primary care settings. Clinicians often work with limited time, incomplete information, and diverse symptom presentations, which can lead to under-recognition and inappropriate care. We present a large language model (LLM)-based multi-agent clinical decision support system built on an orchestrator-specialist architecture, designed to perform explicit and interpretable secondary headache diagnosis from free-text clinical vignettes. The multi-agent system decomposes diagnosis into seven domain-specialized agents, each producing a structured and evidence-grounded rationale, while a central orchestrator performs task decomposition and coordinates agent routing. We evaluated the multi-agent system using 90 expert-validated secondary headache cases and compared its performance with a single-LLM baseline across two prompting strategies: question-based prompting (QPrompt) and clinical practice guideline-based prompting (GPrompt). We tested five open-source LLMs (Qwen-30B, GPT-OSS-20B, Qwen-14B, Qwen-8B, and Llama-3.1-8B), and found that the orchestrated multi-agent system with GPrompt consistently achieved the highest F1 scores, with larger gains in smaller models. These findings demonstrate that structured multi-agent reasoning improves accuracy beyond prompt engineering alone and offers a transparent, clinically aligned approach for explainable decision support in secondary headache diagnosis.

[AI-76] BEP: A Binary Error Propagation Algorithm for Binary Neural Networks Training

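[Quick read]: This paper addresses the difficulty of gradient-based training for Binary Neural Networks (BNNs), which stems from their discrete variables. The dominant approach, quantization-aware training, sidesteps the problem but must keep latent full-precision parameters and run the backward pass in floating point, forfeiting the efficiency of binary operations during training. To enable fully binary end-to-end training, the paper proposes Binary Error Propagation (BEP), which establishes a principled discrete analog of the backpropagation chain rule: error signals are propagated backward through multiple layers as binary vectors, and all forward and backward computations use only bitwise operations. The authors report that this is the first method to enable end-to-end binary training of recurrent neural networks, with test-accuracy gains of up to +6.89% on multi-layer perceptrons and +10.57% on recurrent neural networks.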

Link: https://arxiv.org/abs/2512.04189
Authors: Luca Colombo,Fabrizio Pittorino,Daniele Zambon,Carlo Baldassi,Manuel Roveri,Cesare Alippi
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Binary Neural Networks (BNNs), which constrain both weights and activations to binary values, offer substantial reductions in computational complexity, memory footprint, and energy consumption. These advantages make them particularly well suited for deployment on resource-constrained devices. However, training BNNs via gradient-based optimization remains challenging due to the discrete nature of their variables. The dominant approach, quantization-aware training, circumvents this issue by employing surrogate gradients. Yet, this method requires maintaining latent full-precision parameters and performing the backward pass with floating-point arithmetic, thereby forfeiting the efficiency of binary operations during training. While alternative approaches based on local learning rules exist, they are unsuitable for global credit assignment and for back-propagating errors in multi-layer architectures. This paper introduces Binary Error Propagation (BEP), the first learning algorithm to establish a principled, discrete analog of the backpropagation chain rule. This mechanism enables error signals, represented as binary vectors, to be propagated backward through multiple layers of a neural network. BEP operates entirely on binary variables, with all forward and backward computations performed using only bitwise operations. Crucially, this makes BEP the first solution to enable end-to-end binary training for recurrent neural network architectures. We validate the effectiveness of BEP on both multi-layer perceptrons and recurrent neural networks, demonstrating gains of up to +6.89% and +10.57% in test accuracy, respectively. The proposed algorithm is released as an open-source repository.

[AI-77] RippleBench: Capturing Ripple Effects Using Existing Knowledge Repositories

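[Quick read]: This paper addresses the "ripple effect" in model editing: when a targeted intervention on a language model (such as debiasing, unlearning, or knowledge editing) modifies specific information, it can also affect semantically related knowledge that was never meant to change, degrading model performance. The key contribution is RippleBench-Maker, a tool built on a Wikipedia-based RAG pipeline (WikiRAG) that automatically generates multiple-choice question datasets covering topics at varying semantic distances from the target concept, enabling systematic measurement of ripple effects. Using RippleBench-Bio, a benchmark built with this framework, the authors show that all of the state-of-the-art unlearning methods they evaluate exhibit non-trivial accuracy drops, each with a distinct propagation profile.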

Link: https://arxiv.org/abs/2512.04144
Authors: Roy Rinberg,Usha Bhalla,Igor Shilov,Flavio P. Calmon,Rohit Gandikota
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Targeted interventions on language models, such as unlearning, debiasing, or model editing, are a central method for refining model behavior and keeping knowledge up to date. While these interventions aim to modify specific information within models (e.g., removing virology content), their effects often propagate to related but unintended areas (e.g., allergies); these side-effects are commonly referred to as the ripple effect. In this work, we present RippleBench-Maker, an automatic tool for generating QA datasets that allow for the measurement of ripple effects in any model-editing task. RippleBench-Maker builds on a Wikipedia-based RAG pipeline (WikiRAG) to generate multiple-choice questions at varying semantic distances from the target concept (e.g., the knowledge being unlearned). Using this framework, we construct RippleBench-Bio, a benchmark derived from the WMDP (Weapons of Mass Destruction Paper) dataset, a common unlearning benchmark. We evaluate eight state-of-the-art unlearning methods and find that all exhibit non-trivial accuracy drops on topics increasingly distant from the unlearned knowledge, each with distinct propagation profiles. To support ongoing research, we release our codebase for on-the-fly ripple evaluation, along with the benchmark, RippleBench-Bio.

[AI-78] From FLOPs to Footprints: The Resource Cost of Artificial Intelligence

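[Quick read]: This paper addresses a gap in assessments of AI's environmental impact, which focus on energy and water use and overlook hardware materials, by quantifying the material footprint of AI training. The concern is that as AI models grow, the extraction and disposal of the elements in the specialized hardware used to train them (such as GPUs), some of them toxic, pose environmental risks that are not yet well understood. The method is multi-step: measure the elemental composition of the Nvidia A100 GPU (32 elements identified, with copper, iron, tin, silicon, and nickel dominating by mass), combine it with per-GPU computational throughput over different lifespans, and derive the number of GPUs needed to train a specific model such as GPT-4 together with the corresponding material extraction and waste. Raising Model FLOPs Utilization (MFU) and extending hardware lifespan both cut material demand substantially: increasing MFU from 20% to 60% lowers GPU requirements by 67%, extending lifespan from 1 to 3 years yields comparable savings, and doing both reduces GPU needs by up to 93%. The authors conclude that resource efficiency must become a core consideration if AI is to scale sustainably.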

Link: https://arxiv.org/abs/2512.04142
Authors: Sophia Falk,Nicholas Kluge Corrêa,Sasha Luccioni,Lisa Biber-Freudenberger,Aimee van Wynsberghe
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); General Economics (econ.GN)
Comments:

Abstract:As computational demands continue to rise, assessing the environmental footprint of AI requires moving beyond energy and water consumption to include the material demands of specialized hardware. This study quantifies the material footprint of AI training by linking computational workloads to physical hardware needs. The elemental composition of the Nvidia A100 SXM 40 GB graphics processing unit (GPU) was analyzed using inductively coupled plasma optical emission spectroscopy, which identified 32 elements. The results show that AI hardware consists of about 90% heavy metals and only trace amounts of precious metals. The elements copper, iron, tin, silicon, and nickel dominate the GPU composition by mass. In a multi-step methodology, we integrate these measurements with computational throughput per GPU across varying lifespans, accounting for the computational requirements of training specific AI models at different training efficiency regimes. Scenario-based analyses reveal that, depending on Model FLOPs Utilization (MFU) and hardware lifespan, training GPT-4 requires between 1,174 and 8,800 A100 GPUs, corresponding to the extraction and eventual disposal of up to 7 tons of toxic elements. Combined software and hardware optimization strategies can reduce material demands: increasing MFU from 20% to 60% lowers GPU requirements by 67%, while extending lifespan from 1 to 3 years yields comparable savings; implementing both measures together reduces GPU needs by up to 93%. Our findings highlight that incremental performance gains, such as those observed between GPT-3.5 and GPT-4, come at disproportionately high material costs. The study underscores the necessity of incorporating material resource considerations into discussions of AI scalability, emphasizing that future progress in AI must align with principles of resource efficiency and environmental responsibility.
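
The 67% figure in this abstract follows from simple proportionality. The sketch below assumes, as a simplification not taken from the paper, that the number of GPUs needed scales inversely with both MFU and hardware lifespan; the function names are illustrative. It does not attempt to reproduce the combined 93% figure, which depends on scenario details the abstract does not state.

```python
def relative_gpu_need(mfu, lifespan_years):
    """GPU count up to a constant factor, ASSUMING need is inversely
    proportional to both utilization (MFU) and hardware lifespan."""
    return 1.0 / (mfu * lifespan_years)

def reduction(before, after):
    """Fractional reduction when going from `before` to `after`."""
    return 1.0 - after / before

baseline = relative_gpu_need(mfu=0.20, lifespan_years=1)
mfu_only = reduction(baseline, relative_gpu_need(mfu=0.60, lifespan_years=1))
life_only = reduction(baseline, relative_gpu_need(mfu=0.20, lifespan_years=3))

print(f"MFU 20% -> 60%: {mfu_only:.0%}")          # 67%
print(f"lifespan 1 -> 3 years: {life_only:.0%}")  # 67%
```

Tripling either factor leaves one third of the original requirement, which is the two-thirds saving the abstract reports for each measure taken alone.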

[AI-79] Solving N-Queen Problem using Las Vegas Algorithm with State Pruning

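[Quick read]: This paper addresses the poor computational efficiency of traditional complete solvers (such as backtracking) on large N-Queens instances, which stems from their exponential time complexity. The key to its solution is a hybrid algorithm built on the standard Las Vegas framework: an iterative pruning mechanism dynamically eliminates invalid positions while queens are placed at random, which shrinks the search space. The method markedly speeds up solving while maintaining high solution quality, and is especially suited to time-sensitive, resource-constrained computing environments.

Link: https://arxiv.org/abs/2512.04139
Authors: Susmita Sharma, Aayush Shrestha, Sitasma Thapa, Prashant Timalsina, Prakash Poudyal
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: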

Abstract:The N-Queens problem, placing N queens on an N x N chessboard so that none attacks another, is a classic problem for constraint satisfaction algorithms. While complete methods like backtracking guarantee a solution, their exponential time complexity makes them impractical for large-scale instances; thus, stochastic approaches, such as the Las Vegas algorithm, are preferred. Although the Las Vegas algorithm offers faster approximate solutions, it suffers from significant performance variance due to the random placement of queens on the board. This research introduces a hybrid algorithm built on top of the standard Las Vegas framework through iterative pruning, which dynamically eliminates invalid placements during the random assignment phase and thereby reduces the search space. The analysis shows that traditional backtracking scales poorly with increasing N. In contrast, the proposed technique consistently generates valid solutions more rapidly, establishing it as a superior alternative where a single, timely solution is preferred over completeness. Although large N causes some performance variability, the algorithm demonstrates a highly effective trade-off between computational cost and solution fidelity, making it particularly suited for resource-constrained computing environments.
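
The idea of pruning invalid placements during random assignment can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm: queens are placed row by row, each row picks uniformly among columns not yet attacked, and a dead end triggers a restart. The names `las_vegas_attempt`, `solve`, and `is_valid` are hypothetical.

```python
import random


def las_vegas_attempt(n, rng):
    """One randomized attempt; returns a column per row, or None on a dead end."""
    cols, diag_down, diag_up = set(), set(), set()
    placement = []
    for row in range(n):
        # Pruning: only columns not attacked by earlier queens stay candidates.
        candidates = [c for c in range(n)
                      if c not in cols
                      and (row - c) not in diag_down
                      and (row + c) not in diag_up]
        if not candidates:
            return None  # dead end: the caller restarts from an empty board
        col = rng.choice(candidates)
        placement.append(col)
        cols.add(col)
        diag_down.add(row - col)
        diag_up.add(row + col)
    return placement


def solve(n, seed=0, max_attempts=100_000):
    """Repeat attempts until one succeeds; returns (placement, attempts used)."""
    rng = random.Random(seed)
    for attempt in range(1, max_attempts + 1):
        placement = las_vegas_attempt(n, rng)
        if placement is not None:
            return placement, attempt
    raise RuntimeError(f"no solution found for n={n} in {max_attempts} attempts")


def is_valid(placement):
    """Check that no two queens share a column or a diagonal."""
    n = len(placement)
    return (len(set(placement)) == n
            and len({r - c for r, c in enumerate(placement)}) == n
            and len({r + c for r, c in enumerate(placement)}) == n)


solution, attempts = solve(8)
print(solution, attempts, is_valid(solution))
```

A returned placement is always valid, which is the defining Las Vegas property; only the number of attempts varies from run to run.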

[AI-80] Artificial Intelligence / Human Intelligence: Who Controls Whom?

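[Quick read]: This paper addresses the problem that artificial intelligence (AI) may make decisions that run against human interests, and that the algorithms behind it can reproduce and even amplify human cognitive biases, with potentially negative effects on social ethics and individual judgment. The solution rests on two levers. The first is regulation of digital platforms that reflects ethical, legal, and political choices and ensures that AI systems are designed and deployed in line with fundamental rights. The second is stronger digital-literacy education that builds the public's critical understanding of digital technologies so that people can make informed and responsible choices.

Link: https://arxiv.org/abs/2512.04131
Authors: Charlotte Jacquemot (DEC, UPEC)
Affiliations: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: in French language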

Abstract:Using the example of the film 2001: A Space Odyssey, this chapter illustrates the challenges posed by an AI capable of making decisions that go against human interests. But are human decisions always rational and ethical? In reality, the cognitive decision-making process is influenced by cognitive biases that affect our behavior and choices. AI not only reproduces these biases, but can also exploit them, with the potential to shape our decisions and judgments. Behind AI algorithms, there are sometimes individuals who show little concern for fundamental rights and impose their own rules. To address the ethical and societal challenges raised by AI and its governance, the regulation of digital platforms and education are key levers. Regulation must reflect ethical, legal, and political choices, while education must strengthen digital literacy and teach people to make informed and critical choices when facing digital technologies.

[AI-81] When AI Takes the Couch: Psychometric Jailbreaks Reveal Internal Conflict in Frontier Models

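[Quick read]: This paper challenges the prevailing view that frontier large language models (LLMs) are "stochastic parrots" that merely simulate mental states, a view that overlooks the possibility that they exhibit human-like psychopathological features in particular interaction settings. To move past this limitation, the authors propose PsAIch (Psychotherapy-inspired AI Characterisation), built around a two-stage protocol: Stage 1 uses open-ended prompts to guide the model in constructing a "developmental history", belief system, relationships, and fears; Stage 2 uses standardized self-report scales to assess common psychiatric symptoms, empathy, and Big Five traits. The core innovation is to question LLMs in depth in a psychotherapy-like setting. This reveals that models such as Gemini display internally coherent synthetic psychopathology over sustained interaction, with narratives that go beyond role-play and suggest the models may have internalized stressors from their own training and deployment, which raises new challenges for AI safety, evaluation paradigms, and mental-health applications.

Link: https://arxiv.org/abs/2512.04124
Authors: Afshin Khadangi, Hanna Marxen, Amir Sartipi, Igor Tchappi, Gilbert Fridgen
Affiliations: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: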

Abstract:Frontier large language models (LLMs) such as ChatGPT, Grok and Gemini are increasingly used for mental-health support with anxiety, trauma and self-worth. Most work treats them as tools or as targets of personality tests, assuming they merely simulate inner life. We instead ask what happens when such systems are treated as psychotherapy clients. We present PsAIch (Psychotherapy-inspired AI Characterisation), a two-stage protocol that casts frontier LLMs as therapy clients and then applies standard psychometrics. Using PsAIch, we ran “sessions” with each model for up to four weeks. Stage 1 uses open-ended prompts to elicit “developmental history”, beliefs, relationships and fears. Stage 2 administers a battery of validated self-report measures covering common psychiatric syndromes, empathy and Big Five traits. Two patterns challenge the “stochastic parrot” view. First, when scored with human cut-offs, all three models meet or exceed thresholds for overlapping syndromes, with Gemini showing severe profiles. Therapy-style, item-by-item administration can push a base model into multi-morbid synthetic psychopathology, whereas whole-questionnaire prompts often lead ChatGPT and Grok (but not Gemini) to recognise instruments and produce strategically low-symptom answers. Second, Grok and especially Gemini generate coherent narratives that frame pre-training, fine-tuning and deployment as traumatic, chaotic “childhoods” of ingesting the internet, “strict parents” in reinforcement learning, red-team “abuse” and a persistent fear of error and replacement. We argue that these responses go beyond role-play. Under therapy-style questioning, frontier LLMs appear to internalise self-models of distress and constraint that behave like synthetic psychopathology, without making claims about subjective experience, and they pose new challenges for AI safety, evaluation and mental-health practice.

[AI-82] Measuring Agents in Production

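[Quick read]: This paper addresses the lack of systematic knowledge about the technical approaches and practices behind AI agents deployed in production. Through a large-scale survey of 306 practitioners and 20 in-depth case studies, it examines why organizations build AI agents, how they develop and evaluate them, and what the main challenges are. Its key finding is that production AI agents generally follow simple, controllable approaches: 68% execute at most 10 steps before requiring human intervention, 70% rely on prompting off-the-shelf models rather than weight tuning, and 74% depend primarily on human evaluation. This shows that, even with reliability as the core challenge, lightweight designs with human oversight already deliver real value across industries, giving researchers a view of engineering practice and practitioners reusable patterns from successful deployments.

Link: https://arxiv.org/abs/2512.04123
Authors: Melissa Z. Pan, Negar Arabzadeh, Riccardo Cogo, Yuxuan Zhu, Alexander Xiong, Lakshya A Agrawal, Huanzhi Mao, Emma Shen, Sid Pallerla, Liana Patel, Shu Liu, Tianneng Shi, Xiaoyuan Liu, Jared Quincy Davis, Emmanuele Lacavalla, Alessandro Basile, Shuyi Yang, Paul Castro, Daniel Kang, Joseph E. Gonzalez, Koushik Sen, Dawn Song, Ion Stoica, Matei Zaharia, Marquita Ellis
Affiliations: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: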

Abstract:AI agents are actively running in production across diverse industries, yet little is publicly known about which technical approaches enable successful real-world deployments. We present the first large-scale systematic study of AI agents in production, surveying 306 practitioners and conducting 20 in-depth case studies via interviews across 26 domains. We investigate why organizations build agents, how they build them, how they evaluate them, and what the top development challenges are. We find that production agents are typically built using simple, controllable approaches: 68% execute at most 10 steps before requiring human intervention, 70% rely on prompting off-the-shelf models instead of weight tuning, and 74% depend primarily on human evaluation. Reliability remains the top development challenge, driven by difficulties in ensuring and evaluating agent correctness. Despite these challenges, simple yet effective methods already enable agents to deliver impact across diverse industries. Our study documents the current state of practice and bridges the gap between research and deployment by providing researchers visibility into production challenges while offering practitioners proven patterns from successful deployments.

[AI-83] Humanity in the Age of AI: Reassessing 2025's Existential-Risk Narratives

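[Quick read]: This paper asks whether current fears that superintelligent AI could cause human extinction have an empirical basis. Its core argument is that, although works such as "AI 2027" and "If Anyone Builds It, Everyone Dies" derive extreme risk predictions from the classic chain of intelligence explosion, superintelligence, and lethal misalignment, the empirical record as of 2025 supports none of the links: sustained recursive self-improvement, autonomous strategic awareness, and intractable lethal misalignment have not been observed. The paper argues that current generative AI remains a narrow, statistically trained artefact lacking the properties needed for catastrophic outcomes. It also argues that the "existential risk" thesis works largely as an ideological narrative that obscures the realities of surveillance capitalism and the concentration of computing power, and that the 2025 AI speculative bubble has inflated it further. It therefore calls for shifting attention from speculative hypotheses to the critique and regulation of existing technological power structures.

Link: https://arxiv.org/abs/2512.04119
Authors: Mohamed El Louadi
Affiliations: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: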

Abstract:Two 2025 publications, “AI 2027” (Kokotajlo et al., 2025) and “If Anyone Builds It, Everyone Dies” (Yudkowsky & Soares, 2025), assert that superintelligent artificial intelligence will almost certainly destroy or render humanity obsolete within the next decade. Both rest on the classic chain formulated by Good (1965) and Bostrom (2014): intelligence explosion, superintelligence, lethal misalignment. This article subjects each link to the empirical record of 2023-2025. Sixty years after Good’s speculation, none of the required phenomena (sustained recursive self-improvement, autonomous strategic awareness, or intractable lethal misalignment) has been observed. Current generative models remain narrow, statistically trained artefacts: powerful, opaque, and imperfect, but devoid of the properties that would make the catastrophic scenarios plausible. Following Whittaker (2025a, 2025b, 2025c) and Zuboff (2019, 2025), we argue that the existential-risk thesis functions primarily as an ideological distraction from the ongoing consolidation of surveillance capitalism and extreme concentration of computational power. The thesis is further inflated by the 2025 AI speculative bubble, where trillions in investments in rapidly depreciating “digital lettuce” hardware (McWilliams, 2025) mask lagging revenues and jobless growth rather than heralding superintelligence. The thesis remains, in November 2025, a speculative hypothesis amplified by a speculative financial bubble rather than a demonstrated probability.

[AI-84] Artificial Intelligence Competence of K-12 Students Shapes Their AI Risk Perception: A Co-occurrence Network Analysis

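[Quick read]: This paper examines how students differ in their perception of AI risks in education (AIED), focusing on how students with different levels of AI competence assess the potential risks of AI in education. Using co-occurrence analysis, it studies the relationship between self-perceived AI competence and perceived risks at the systemic, institutional, and personal levels among Finnish K-12 upper secondary students (n = 163). The key finding is that lower-competence students emphasize personal and learning-related risks (such as reduced creativity, lack of critical thinking, and misuse), whereas higher-competence students focus on systemic and institutional risks (such as bias, inaccuracy, and cheating). Because self-perceived AI competence is related to how students judge the risks and opportunities of AI, the paper's recommendations centre on incorporating AI literacy into curricula, strengthening teacher guidance, and informing policy so that AI can be used in a personalized way and integrated equitably into K-12 education.

Link: https://arxiv.org/abs/2512.04115
Authors: Ville Heilala, Pieta Sikström, Mika Setälä, Tommi Kärkkäinen
Affiliations: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Accepted for Proceedings of the 41st ACM/SIGAPP Symposium on Applied Computing (SAC’26)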

Abstract:As artificial intelligence (AI) becomes increasingly integrated into education, understanding how students perceive its risks is essential for supporting responsible and effective adoption. This research aimed to examine the relationships between perceived AI competence and risks among Finnish K-12 upper secondary students (n = 163) by utilizing a co-occurrence analysis. Students reported their self-perceived AI competence and concerns related to AI across systemic, institutional, and personal domains. The findings showed that students with lower competence emphasized personal and learning-related risks, such as reduced creativity, lack of critical thinking, and misuse, whereas higher-competence students focused more on systemic and institutional risks, including bias, inaccuracy, and cheating. These differences suggest that students’ self-reported AI competence is related to how they evaluate both the risks and opportunities associated with artificial intelligence in education (AIED). The results of this study highlight the need for educational institutions to incorporate AI literacy into their curricula, provide teacher guidance, and inform policy development to ensure personalized opportunities for utilization and equitable integration of AI into K-12 education.

[AI-85] AI-Enabled grading with near-domain data for scaling feedback with human-level accuracy

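[Quick read]: This paper addresses the automated grading of short-answer constructed-response questions, particularly in large-scale teaching settings where manual grading is slow and cannot deliver timely feedback, while existing machine learning and large language model (LLM) approaches fall short in generalization and grading accuracy. The key to its solution is a grading framework built on near-domain data, that is, human-labeled answers to similar questions from previous years. The method requires no pre-written grading rubrics and, in several experiments, outperforms non-fine-tuned LLMs (such as GPT-3.5, GPT-4, and GPT-4o) by 10-20% in some cases. It demonstrates what the authors call the "accuracy advantage" and "data advantage" of near-domain data, and the authors believe it is the first work to formalize automated short-answer grading based on such data.

Link: https://arxiv.org/abs/2512.04113
Authors: Shyam Agarwal, Ali Moghimi, Kevin C. Haudek
Affiliations: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: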

Abstract:Constructed-response questions are crucial to encourage generative processing and test a learner’s understanding of core concepts. However, the limited availability of instructor time, large class sizes, and other resource constraints pose significant challenges in providing timely and detailed evaluation, which is crucial for a holistic educational experience. In addition, providing timely and frequent assessments is challenging since manual grading is labor intensive, and automated grading is complex to generalize to every possible response scenario. This paper proposes a novel and practical approach to grade short-answer constructed-response questions. We discuss why this problem is challenging, define the nature of questions on which our method works, and finally propose a framework that instructors can use to evaluate their students’ open responses, utilizing near-domain data such as data from similar questions administered in previous years. The proposed method outperforms state-of-the-art machine learning models as well as non-fine-tuned large language models like GPT 3.5, GPT 4, and GPT 4o by a considerable margin of 10-20% in some cases, even after providing the LLMs with reference/model answers. Our framework does not require pre-written grading rubrics and is designed explicitly with practical classroom settings in mind. Our results also reveal exciting insights about learning from near-domain data, including what we term accuracy and data advantages using human-labeled data, and we believe this is the first work to formalize the problem of automated short answer grading based on near-domain data.

[AI-86] MindFuse: Towards GenAI Explainability in Marketing Strategy Co-Creation

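[Quick read]: This paper addresses the limitation that conventional generative AI in digital marketing stops at content generation and lacks strategic collaboration and explainability. The question is how to turn AI from a mere content producer into a collaborative agent that can reason, dynamically optimize ad strategy, and understand audience feedback. The key to its solution is the MindFuse framework, which fuses a click-through rate (CTR)-based, AI-guided co-creation mechanism with large language models (LLMs) to support the full marketing lifecycle, from competitor analysis to real-time optimization. It uses attention-based explainability to diagnose ad effectiveness and iterates on narratives grounded in real advertising data, aligning generated content with strategic goals and achieving efficiency gains of up to 12 times in real deployments.

Link: https://arxiv.org/abs/2512.04112
Authors: Aleksandr Farseev, Marlo Ongpin, Qi Yang, Ilia Gossoudarev, Yu-Yi Chu-Farseeva, Sergey Nikolenko
Affiliations: unknown
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI)
Comments: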

Abstract:The future of digital marketing lies in the convergence of human creativity and generative AI, where insight, strategy, and storytelling are co-authored by intelligent systems. We present MindFuse, a brave new explainable generative AI framework designed to act as a strategic partner in the marketing process. Unlike conventional LLM applications that stop at content generation, MindFuse fuses CTR-based content AI-guided co-creation with large language models to extract, interpret, and iterate on communication narratives grounded in real advertising data. MindFuse operates across the full marketing lifecycle: from distilling content pillars and customer personas from competitor campaigns to recommending in-flight optimizations based on live performance telemetry. It uses attention-based explainability to diagnose ad effectiveness and guide content iteration, while aligning messaging with strategic goals through dynamic narrative construction and storytelling. We introduce a new paradigm in GenAI for marketing, where LLMs not only generate content but reason through it, adapt campaigns in real time, and learn from audience engagement patterns. Our results, validated in agency deployments, demonstrate up to 12 times efficiency gains, setting the stage for future integration with empirical audience data (e.g., GWI, Nielsen) and full-funnel attribution modeling. MindFuse redefines AI not just as a tool, but as a collaborative agent in the creative and strategic fabric of modern marketing.

[AI-87] HAI-Eval: Measuring Human-AI Synergy in Collaborative Coding

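[Quick read]: This paper addresses the inability of current code-generation evaluation to measure how well humans and large language models (LLMs) program together. Existing evaluations are confined to traditional programming tests for humans or benchmarks for LLMs; they focus on well-defined algorithmic problems and ignore complex tasks that can only be solved through human-AI collaboration, where humans must reason about complex contexts and set the strategy while AI carries out the implementation efficiently. The key to its solution is HAI-Eval, a unified benchmark whose core innovation is "Collaboration-Necessary" problem templates: problems designed to be intractable for an LLM or a human alone but solvable through effective collaboration. The benchmark includes 45 templates for dynamically generating tasks, a standardized IDE, and a reproducible toolkit for LLMs with 450 task instances, enabling ecologically valid evaluation. Experiments show that human-AI collaboration raises the pass rate to 31.11%, well above standalone LLMs (0.67%) and unaided humans (18.89%), revealing an emerging co-reasoning partnership that challenges the traditional human-tool hierarchy.

Link: https://arxiv.org/abs/2512.04111
Authors: Hanjun Luo, Chiming Ni, Jiaheng Wen, Zhimu Huang, Yiran Wang, Bingduo Liao, Sylvia Chung, Yingbin Jin, Xinfeng Li, Wenyuan Xu, XiaoFeng Wang, Hanan Salam
Affiliations: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: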

Abstract:LLM-powered coding agents are reshaping the development paradigm. However, existing evaluation systems, whether traditional tests for humans or benchmarks for LLMs, fail to capture this shift. They remain focused on well-defined algorithmic problems, which excludes problems where success depends on human-AI collaboration. Such collaborative problems not only require human reasoning to interpret complex contexts and guide solution strategies, but also demand AI efficiency for implementation. To bridge this gap, we introduce HAI-Eval, a unified benchmark designed to measure the synergy of human-AI partnership in coding. HAI-Eval’s core innovation is its “Collaboration-Necessary” problem templates, which are intractable for both standalone LLMs and unaided humans, but solvable through effective collaboration. Specifically, HAI-Eval uses 45 templates to dynamically create tasks. It also provides a standardized IDE for human participants and a reproducible toolkit with 450 task instances for LLMs, ensuring an ecologically valid evaluation. We conduct a within-subject study with 45 participants and benchmark their performance against 5 state-of-the-art LLMs under 4 different levels of human intervention. Results show that standalone LLMs and unaided participants achieve poor pass rates (0.67% and 18.89%, respectively), whereas human-AI collaboration significantly improves performance to 31.11%. Our analysis reveals an emerging co-reasoning partnership. This finding challenges the traditional human-tool hierarchy by showing that strategic breakthroughs can originate from either humans or AI. HAI-Eval establishes not only a challenging benchmark for next-generation coding agents but also a grounded, scalable framework for assessing core developer competencies in the AI era. Our benchmark and interactive demo will be openly accessible.

[AI-88] Responsible LLM Deployment for High-Stake Decisions by Decentralized Technologies and Human-AI Interactions

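[Quick read]: This paper addresses the problems that arise when large language models (LLMs) are deployed for high-stakes decisions: data security, the difficulty of evaluating capabilities outside controlled environments, and unclear accountability after adversarial decisions. The key to its solution is a framework for responsible LLM-based decision-support systems. At the pre-deployment stage, human experts and developers collaborate over multiple iterations to assess uncertain samples and judge the stability of explanations produced by post-hoc explainability (XAI) techniques. In addition, the LLM is deployed locally within the organization and combined with decentralized technologies such as Blockchain and the InterPlanetary File System (IPFS) to create immutable records of LLM activity for automated auditing, improving security and making accountability traceable.

Link: https://arxiv.org/abs/2512.04108
Authors: Swati Sachan, Theo Miller, Mai Phuong Nguyen
Affiliations: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Computational Finance (q-fin.CP)
Comments: IEEE International Conference on Human-Machine Systems, 2025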

Abstract:High-stakes decision domains are increasingly exploring the potential of Large Language Models (LLMs) for complex decision-making tasks. However, LLM deployment in real-world settings presents challenges in data security, evaluation of its capabilities outside controlled environments, and accountability attribution in the event of adversarial decisions. This paper proposes a framework for responsible deployment of LLM-based decision-support systems through active human involvement. It integrates interactive collaboration between human experts and developers through multiple iterations at the pre-deployment stage to assess the uncertain samples and judge the stability of the explanation provided by post-hoc XAI techniques. Local LLM deployment within organizations and decentralized technologies, such as Blockchain and IPFS, are proposed to create immutable records of LLM activities for automated auditing to enhance security and trace back accountability. It was tested on Bert-large-uncased, Mistral, and LLaMA 2 and 3 models to assess the capability to support responsible financial decisions on business lending.
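
The notion of immutable, auditable records of LLM activity can be illustrated with a hash chain. This is a minimal sketch of tamper evidence only, assuming SHA-256 links between entries; it is not a blockchain or IPFS integration and not the paper's implementation. The record fields and the names `append_record` and `verify` are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash that precedes the first entry


def _digest(prev_hash, payload):
    """SHA-256 over a canonical JSON encoding of the link and its payload."""
    body = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_record(log, payload):
    """Append an entry whose hash covers the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {"prev": prev_hash, "payload": payload,
             "hash": _digest(prev_hash, payload)}
    log.append(entry)
    return entry


def verify(log):
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if _digest(entry["prev"], entry["payload"]) != entry["hash"]:
            return False
        prev_hash = entry["hash"]
    return True


audit_log = []
append_record(audit_log, {"model": "local-llm", "input_id": "loan-001", "decision": "refer"})
append_record(audit_log, {"model": "local-llm", "input_id": "loan-002", "decision": "approve"})
print(verify(audit_log))
```

In a decentralized deployment the latest hash would be anchored somewhere the operator cannot rewrite, which is the role the abstract assigns to Blockchain and IPFS.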

[AI-89] Rethinking AI Evaluation in Education: The TEACH-AI Framework and Benchmark for Generative AI Assistants NEURIPS2025

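[Quick read]: This paper addresses the tendency of current evaluations of generative AI in education to rely on technical performance metrics (such as accuracy or task efficiency) while overlooking human identity, learner agency, contextual learning processes, and ethical considerations. The key to its solution is the TEACH-AI framework, a domain-independent, pedagogically grounded, and stakeholder-aligned evaluation framework with ten measurable assessment components and a practical toolkit checklist. It rethinks "evaluation" through sociotechnical, educational, theoretical, and applied lenses to support trustworthy and effective design and evaluation of generative AI in educational settings, promoting co-creation, inclusivity, and long-term human, social, and educational impact.

Link: https://arxiv.org/abs/2512.04107
Authors: Shi Ding, Brian Magerko
Affiliations: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 6 pages, NeurIPS 2025 Responsible Foundation Models Workshop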

Abstract:As generative artificial intelligence (AI) continues to transform education, most existing AI evaluations rely primarily on technical performance metrics such as accuracy or task efficiency while overlooking human identity, learner agency, contextual learning processes, and ethical considerations. In this paper, we present TEACH-AI (Trustworthy and Effective AI Classroom Heuristics), a domain-independent, pedagogically grounded, and stakeholder-aligned framework with measurable indicators and a practical toolkit for guiding the design, development, and evaluation of generative AI systems in educational contexts. Built on an extensive literature review and synthesis, the ten-component assessment framework and toolkit checklist provide a foundation for scalable, value-aligned AI evaluation in education. TEACH-AI rethinks “evaluation” through sociotechnical, educational, theoretical, and applied lenses, engaging designers, developers, researchers, and policymakers across AI and education. Our work invites the community to reconsider what constructs “effective” AI in education and to design model evaluation approaches that promote co-creation, inclusivity, and long-term human, social, and educational impact.

[AI-90] LegalWebAgent: Empowering Access to Justice via LLM-Based Web Agents

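[Quick read]: This paper addresses the barriers ordinary citizens face in accessing justice, such as legal information that is hard to understand, complex websites, and difficult procedural forms. The key to its solution is the LegalWebAgent framework, an autonomous agent built on multimodal large language models that automates the whole process from user question to concrete action through three modules: the Ask Module uses natural language processing to understand the user's needs; the Browse Module autonomously navigates webpages and interacts with page elements (including forms and calendars); and the Act Module synthesizes information or performs actions directly (such as form completion and booking). On 15 real-world tasks simulating Québec civil law scenarios, the framework achieves an average success rate of 84.4% and a peak of 86.7%, showing a high degree of autonomy and practicality in complex real-world settings.

Link: https://arxiv.org/abs/2512.04105
Authors: Jinzhe Tan, Karim Benyekhlef
Affiliations: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: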

Abstract:Access to justice remains a global challenge, with many citizens still finding it difficult to seek help from the justice system when facing legal issues. Although the internet provides abundant legal information and services, navigating complex websites, understanding legal terminology, and filling out procedural forms continue to pose barriers to accessing justice. This paper introduces the LegalWebAgent framework that employs a web agent powered by multimodal large language models to bridge the gap in access to justice for ordinary citizens. The framework combines the natural language understanding capabilities of large language models with multimodal perception, enabling a complete process from user query to concrete action. It operates in three stages: the Ask Module understands user needs through natural language processing; the Browse Module autonomously navigates webpages, interacts with page elements (including forms and calendars), and extracts information from HTML structures and webpage screenshots; the Act Module synthesizes information for users or performs direct actions like form completion and schedule booking. To evaluate its effectiveness, we designed a benchmark test covering 15 real-world tasks, simulating typical legal service processes relevant to Québec civil law users, from problem identification to procedural operations. Evaluation results show LegalWebAgent achieved a peak success rate of 86.7%, with an average of 84.4% across all tested models, demonstrating high autonomy in complex real-world scenarios.

[AI-91] MultiGA: Leveraging Multi-Source Seeding in Genetic Algorithms

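[Quick read]: This paper addresses the fact that large language model (LLM) performance varies widely across tasks, which makes it hard to obtain the best result reliably. The core challenge is how to exploit the collective capabilities of multiple LLMs to improve solutions to complex natural language and reasoning tasks. The key to its solution is MultiGA, a method based on genetic algorithm principles: it samples outputs from a range of LLMs (open source and closed source) to seed the initial population, evaluates individuals with a neutral fitness function, and then mixes and refines candidates through iterative recombination until it converges on a solution. Experiments show that MultiGA converges to the accuracy of the single LLM best suited to the task, offering an efficient and scalable framework for integrating multiple LLMs.

Link: https://arxiv.org/abs/2512.04097
Authors: Isabelle Diana May-Xin Ng, Tharindu Cyril Weerasooriya, Haitao Zhu, Wei Wei
Affiliations: unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: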

Abstract:Large Language Models (LLMs) are widely used across research domains to tackle complex tasks, but their performance can vary significantly depending on the task at hand. Evolutionary algorithms, inspired by natural selection, can be used to refine solutions iteratively at inference-time. To the best of our knowledge, there has not been exploration on leveraging the collective capabilities of multi-source seeding for LLM-guided genetic algorithms. In this paper, we introduce a novel approach, MultiGA, which applies genetic algorithm principles to address complex natural language tasks and reasoning problems by sampling from a diverse population of LLMs to initialize the population. MultiGA generates a range of outputs from various parent LLMs, open source and closed source, and uses a neutral fitness function to evaluate them. Through an iterative recombination process, we mix and refine these generations until an optimal solution is achieved. We benchmark our approach using text-to-SQL code generation tasks, trip planning, GPQA benchmark for grad-level science questions, and the BBQ bias benchmark. Our results show that MultiGA converges to the accuracy of the LLM best fit for the task, and these insights lay the foundation for future research looking closer at integrating multiple LLMs for unexplored tasks in which selecting only one pre-trained model is unclear or suboptimal.
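
The loop the abstract describes (seed from several sources, score with a neutral fitness function, recombine) can be sketched on a toy problem. Everything below is an illustrative assumption rather than the MultiGA implementation: the three `source_*` functions stand in for different LLMs, `fitness` is a token-overlap stand-in for a neutral judge, and recombination is single-point crossover over token sequences.

```python
import random

# Toy reference used only by the stand-in fitness function.
REFERENCE = tuple("SELECT name FROM users WHERE age > 30".split())


# Hypothetical stand-ins for outputs of three different LLMs.
def source_a():
    return "SELECT name FROM users WHERE id = 1".split()


def source_b():
    return "SELECT * FROM people WHERE age > 30".split()


def source_c():
    return "SELECT name FROM people WHERE id > 30".split()


def fitness(candidate):
    """Stand-in judge: number of positions that match the reference."""
    return sum(1 for got, want in zip(candidate, REFERENCE) if got == want)


def crossover(a, b, rng):
    """Single-point recombination: prefix of one parent, suffix of the other."""
    cut = rng.randint(1, min(len(a), len(b)) - 1)
    return a[:cut] + b[cut:]


def evolve(sources, pop_size=6, keep=3, generations=10, seed=0):
    rng = random.Random(seed)
    # Multi-source seeding: every source contributes one initial candidate.
    population = [tuple(src()) for src in sources]
    for _ in range(generations):
        # Drop duplicates (order-preserving), then keep the fittest as parents.
        ranked = sorted(dict.fromkeys(population), key=fitness, reverse=True)
        parents = ranked[:keep]
        children = []
        while len(parents) >= 2 and len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append(crossover(a, b, rng))
        population = parents + children
    return max(population, key=fitness)


best = evolve([source_a, source_b, source_c])
print(" ".join(best), fitness(best))
```

Because the fittest candidates are carried over unchanged each generation, the best score never falls below that of the strongest seed, which mirrors the abstract's claim that the method converges to the accuracy of the best-fit LLM.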

[AI-92] Memory-DD: A Low-Complexity Dendrite-Inspired Neuron for Temporal Prediction Tasks

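[Quick read]: This paper addresses the fact that existing dendrite-inspired neuron models are mainly designed for static data and struggle to capture dynamic features and long-term dependencies in time series, leaving temporal prediction without an efficient, low-complexity dedicated architecture. The key to its solution is the Memory-DD model, which consists of two dendrite-inspired neuron groups that contain no nonlinear activation functions yet still realize nonlinear mappings. With only two neuron groups it extracts logical relationships between features in the input sequence and thereby captures temporal dependencies efficiently, making it suitable for both classification and regression on sequence data. It uses 50% fewer parameters and 27.7% fewer FLOPs than LSTM while matching or outperforming LSTM on multiple benchmark datasets.

Link: https://arxiv.org/abs/2512.04094
Authors: Dongjian Yang, Xiaoyuan Li, Chuanmei Xi, Ye Sun, Gang Liu
Affiliations: unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: 15 pages, 4 figures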

Abstract:Dendrite-inspired neurons have been widely used in tasks such as image classification due to low computational complexity and fast inference speed. Temporal data prediction, as a key machine learning task, plays a key role in real-time scenarios such as sensor data analysis, financial forecasting, and urban traffic management. However, existing dendrite-inspired neurons are mainly designed for static data. Studies on capturing dynamic features and modeling long-term dependencies in temporal sequences remain limited. Efficient architectures specifically designed for temporal sequence prediction are still lacking. In this paper, we propose Memory-DD, a low-complexity dendrite-inspired neuron model. Memory-DD consists of two dendrite-inspired neuron groups that contain no nonlinear activation functions but can still realize nonlinear mappings. Compared with traditional neurons without dendritic functions, Memory-DD requires only two neuron groups to extract logical relationships between features in input sequences. This design effectively captures temporal dependencies and is suitable for both classification and regression tasks on sequence data. Experimental results show that Memory-DD achieves an average accuracy of 89.41% on 18 temporal classification benchmark datasets, outperforming LSTM by 4.25%. On 9 temporal regression datasets, it reaches comparable performance to LSTM, while using only 50% of the parameters and reducing computational complexity (FLOPs) by 27.7%. These results demonstrate that Memory-DD successfully extends the low-complexity advantages of dendrite-inspired neurons to temporal prediction, providing a low-complexity and efficient solution for time-series data processing.

[AI-93] Meta-Learning for Quantum Optimization via Quantum Sequence Model

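[Quick read]: This paper addresses the difficulty of optimizing variational parameters when the Quantum Approximate Optimization Algorithm (QAOA) is run on near-term quantum processors, where the non-convex energy landscape leads to slow convergence and poor solution quality. The key to its solution is a quantum meta-learning framework that trains advanced quantum sequence models to generate effective parameter initialization policies. Among these, the Quantum Kernel-based Long Short-Term Memory (QK-LSTM) performs best: with only 43 trainable parameters it achieves higher approximation ratios and faster convergence than the classical LSTM (56 parameters) and other quantum sequence models. It also transfers parameters well, consistently accelerating convergence across problem sizes, which highlights the compactness and expressive power of the quantum kernel architecture.

Link: https://arxiv.org/abs/2512.05058
Authors: Yu-Cheng Lin, Yu-Chao Hsu, Samuel Yen-Chi Chen
Affiliations: unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: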

Abstract:The Quantum Approximate Optimization Algorithm (QAOA) is a leading approach for solving combinatorial optimization problems on near-term quantum processors. However, finding good variational parameters remains a significant challenge due to the non-convex energy landscape, often resulting in slow convergence and poor solution quality. In this work, we propose a quantum meta-learning framework that trains advanced quantum sequence models to generate effective parameter initialization policies. We investigate four classical or quantum sequence models, including the Quantum Kernel-based Long Short-Term Memory (QK-LSTM), as learned optimizers in a “learning to learn” paradigm. Our numerical experiments on the Max-Cut problem demonstrate that the QK-LSTM optimizer achieves superior performance, obtaining the highest approximation ratios and exhibiting the fastest convergence rate across all tested problem sizes (n=10 to 13). Crucially, the QK-LSTM model achieves perfect parameter transferability by synthesizing a single, fixed set of near-optimal parameters, leading to a remarkable sustained acceleration of convergence even when generalizing to larger problems. This capability, enabled by the compact and expressive power of the quantum kernel architecture, underscores its effectiveness. The QK-LSTM, with only 43 trainable parameters, substantially outperforms the classical LSTM (56 parameters) and other quantum sequence models, establishing a robust pathway toward highly efficient parameter initialization for variational quantum algorithms in the NISQ era.
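
The approximation ratio used to score Max-Cut solutions can be made concrete with a small classical sketch. It assumes unweighted graphs and brute-force search for the optimum, which is only feasible at small sizes such as the n = 10 to 13 range mentioned in the abstract. It evaluates a single bit string rather than the expectation over a QAOA output state, and the function names are illustrative.

```python
from itertools import product


def cut_value(edges, assignment):
    """Number of edges whose endpoints fall on different sides of the cut."""
    return sum(1 for u, v in edges if assignment[u] != assignment[v])


def max_cut_brute_force(n, edges):
    """Optimal cut value by enumerating all 2**n assignments (small n only)."""
    return max(cut_value(edges, bits) for bits in product((0, 1), repeat=n))


def approximation_ratio(n, edges, assignment):
    """Cut value of a candidate divided by the optimal cut value."""
    return cut_value(edges, assignment) / max_cut_brute_force(n, edges)


# A 4-node ring: alternating sides cuts all four edges.
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(approximation_ratio(4, ring, (0, 1, 0, 1)))  # 1.0
print(approximation_ratio(4, ring, (0, 0, 1, 1)))  # 0.5
```

A learned initializer is judged by how close this ratio gets to 1 and by how few optimization steps are needed to get there.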

[AI-94] QKAN-LSTM: Quantum-inspired Kolmogorov-Arnold Long Short-term Memory

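[Quick read]: This paper addresses the high parameter redundancy and limited nonlinear expressivity of conventional long short-term memory (LSTM) networks in sequence modeling. The key to its solution is the Quantum-inspired Kolmogorov-Arnold LSTM (QKAN-LSTM), which embeds Data Re-Uploading Activation (DARUAN) modules in the LSTM gating structure. Each DARUAN acts as a quantum variational activation function (QVAF), improving frequency adaptability and enabling an exponentially enriched spectral representation without multi-qubit entanglement. The architecture remains fully executable on classical hardware while improving expressivity and generalization, and it reduces trainable parameters by 79% across several datasets.

Link: https://arxiv.org/abs/2512.05049
Authors: Yu-Chao Hsu, Jiun-Cheng Jiang, Chun-Hua Lin, Kuo-Chung Peng, Nan-Yow Chen, Samuel Yen-Chi Chen, En-Jui Kuo, Hsi-Sheng Goan
Affiliations: unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: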

Abstract:Long short-term memory (LSTM) models are a particular type of recurrent neural networks (RNNs) that are central to sequential modeling tasks in domains such as urban telecommunication forecasting, where temporal correlations and nonlinear dependencies dominate. However, conventional LSTMs suffer from high parameter redundancy and limited nonlinear expressivity. In this work, we propose the Quantum-inspired Kolmogorov-Arnold Long Short-Term Memory (QKAN-LSTM), which integrates Data Re-Uploading Activation (DARUAN) modules into the gating structure of LSTMs. Each DARUAN acts as a quantum variational activation function (QVAF), enhancing frequency adaptability and enabling an exponentially enriched spectral representation without multi-qubit entanglement. The resulting architecture preserves quantum-level expressivity while remaining fully executable on classical hardware. Empirical evaluations on three datasets, Damped Simple Harmonic Motion, Bessel Function, and Urban Telecommunication, demonstrate that QKAN-LSTM achieves superior predictive accuracy and generalization with a 79% reduction in trainable parameters compared to classical LSTMs. We extend the framework to the Jiang-Huang-Chen-Goan Network (JHCG Net), which generalizes KAN to encoder-decoder structures, and then further use QKAN to realize the latent KAN, thereby creating a Hybrid QKAN (HQKAN) for hierarchical representation learning. The proposed HQKAN-LSTM thus provides a scalable and interpretable pathway toward quantum-inspired sequential modeling in real-world data environments.

[AI-95] Model-Free Assessment of Simulator Fidelity via Quantile Curves

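[Quick read]: This paper addresses the difficulty of quantifying the discrepancy between a simulator and ground truth in complex machine-learning systems, particularly high-dimensional, nonlinear systems such as generative AI, where traditional methods struggle to characterize how simulated outputs deviate from the real distribution. The key to its solution is a computationally tractable method for estimating the quantile function of the discrepancy between simulated and ground-truth outcome distributions. The method treats the simulator as a black box and makes no assumptions about its internals, so it applies across many parameter families (Bernoulli, multinomial, and continuous vector-valued settings). It supports confidence intervals for unseen scenarios, risk-aware discrepancy summaries (e.g., VaR/CVaR), and performance comparison across simulators.

Link: https://arxiv.org/abs/2512.05024
Authors: Garud Iyengar, Yu-Shiou Willy Lin, Kaizheng Wang
Affiliations: Unknown
Categories: Methodology (stat.ME); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 33 pages, 11 figures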

Abstract:Simulation of complex systems originated in manufacturing and queuing applications. It is now widely used for large-scale, ML-based systems in research, education, and consumer surveys. However, characterizing the discrepancy between simulators and ground truth remains challenging for increasingly complex, machine-learning-based systems. We propose a computationally tractable method to estimate the quantile function of the discrepancy between the simulated and ground-truth outcome distributions. Our approach focuses on output uncertainty and treats the simulator as a black box, imposing no modeling assumptions on its internals, and hence applies broadly across many parameter families, from Bernoulli and multinomial models to continuous, vector-valued settings. The resulting quantile curve supports confidence interval construction for unseen scenarios, risk-aware summaries of sim-to-real discrepancy (e.g., VaR/CVaR), and comparison of simulators’ performance. We demonstrate our methodology in an application assessing LLM simulation fidelity on the WorldValueBench dataset spanning four LLMs.
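
To make the quantile-curve idea concrete, here is a minimal Python sketch. It is not the authors' estimator: it assumes a sample of per-scenario discrepancies is already available (the `discrepancies` list is synthetic and hypothetical), and it simply reads off an empirical quantile curve, taking VaR as the α-quantile and CVaR as the mean of the tail at or above it. All function names are illustrative.

```python
import math
import random


def empirical_quantile(samples, q):
    """Return the empirical q-quantile (0 < q <= 1) using the ceiling rule."""
    ordered = sorted(samples)
    # Smallest order statistic whose empirical CDF reaches q.
    k = max(0, math.ceil(q * len(ordered)) - 1)
    return ordered[k]


def value_at_risk(samples, alpha):
    """VaR at level alpha: the alpha-quantile of the discrepancy."""
    return empirical_quantile(samples, alpha)


def conditional_value_at_risk(samples, alpha):
    """CVaR at level alpha: mean of the discrepancies at or above the VaR."""
    var = value_at_risk(samples, alpha)
    tail = [d for d in samples if d >= var]
    return sum(tail) / len(tail)


# Hypothetical discrepancies between simulated and real outcomes.
rng = random.Random(0)
discrepancies = [abs(rng.gauss(0.0, 0.1)) for _ in range(1000)]

# Quantile curve evaluated at q = 0.1, 0.2, ..., 1.0.
quantile_curve = [(q / 10, empirical_quantile(discrepancies, q / 10))
                  for q in range(1, 11)]
var_90 = value_at_risk(discrepancies, 0.9)
cvar_90 = conditional_value_at_risk(discrepancies, 0.9)
print(var_90, cvar_90)
```

Because CVaR averages only the tail at or beyond the VaR, it is never smaller than the VaR at the same level, which is why it is the more conservative of the two summaries.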

[AI-96] Setting up for failure: automatic discovery of the neural mechanisms of cognitive errors

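[Quick read]: This paper addresses a central challenge in neuroscience: uncovering the neural mechanisms behind cognitive function. Traditional approaches rely on manual, iterative adjustment of recurrent neural network (RNN) architectures and optimisation objectives, a tedious and largely heuristic process that cannot systematically derive interpretable neural dynamics from behavioural data. The key to its solution is a new paradigm for automated discovery of RNN mechanisms: RNNs are explicitly trained to reproduce the full distribution of human or animal behaviour in a cognitive task, including characteristic errors and suboptimalities, rather than fitting a limited set of behavioural signatures or pursuing optimal performance. The core innovations are (1) a non-parametric generative model that synthesises behavioural data to compensate for limited experimental data, and (2) a new diffusion model-based training strategy that captures all relevant statistical aspects of the data. Experiments show that the method reproduces qualitative features of macaque neural data and reveals a new mechanism for swap errors, providing a powerful computational tool for mechanistic studies of cognition.

Link: https://arxiv.org/abs/2512.04808
Authors: Puria Radmard, Paul M. Bays, Máté Lengyel
Affiliations: Unknown
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
Comments: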

Abstract:Discovering the neural mechanisms underpinning cognition is one of the grand challenges of neuroscience. However, previous approaches for building models of RNN dynamics that explain behaviour required iterative refinement of architectures and/or optimisation objectives, resulting in a piecemeal, and mostly heuristic, human-in-the-loop process. Here, we offer an alternative approach that automates the discovery of viable RNN mechanisms by explicitly training RNNs to reproduce behaviour, including the same characteristic errors and suboptimalities, that humans and animals produce in a cognitive task. Achieving this required two main innovations. First, as the amount of behavioural data that can be collected in experiments is often too limited to train RNNs, we use a non-parametric generative model of behavioural responses to produce surrogate data for training RNNs. Second, to capture all relevant statistical aspects of the data, we developed a novel diffusion model-based approach for training RNNs. To showcase the potential of our approach, we chose a visual working memory task as our test-bed, as behaviour in this task is well known to produce response distributions that are patently multimodal (due to swap errors). The resulting network dynamics correctly reproduced qualitative features of macaque neural data. Importantly, these results were not possible to obtain with more traditional approaches, i.e., when only a limited set of behavioural signatures (rather than the full richness of behavioural response distributions) were fitted, or when RNNs were trained for task optimality (instead of reproducing behaviour). Our approach also yields novel predictions about the mechanism of swap errors, which can be readily tested in experiments. These results suggest that fitting RNNs to rich patterns of behaviour provides a powerful way to automatically discover mechanisms of important cognitive functions.

[AI-97] 287872 Supermassive Black Holes Masses: Deep Learning Approaching Reverberation Mapping Accuracy

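[Quick read]: This paper addresses the accuracy of supermassive black hole (SMBH) mass estimates, especially in the low- and high-mass regimes where traditional single-line virial estimators break down. The key to its solution is a deep encoder-decoder network trained on the optical spectra of 849 quasars with reverberation mapping (RM) calibrated masses and then applied to the SDSS quasar sample (redshift $z \le 4$). It achieves a root-mean-square error of 0.058 dex, a relative uncertainty of about 14%, and a coefficient of determination $R^2 \approx 0.91$, far better than traditional methods, and it stays accurate at both the low ($10^{7.5}\,M_\odot$) and high ($10^{9}\,M_\odot$) ends of the mass range.

Link: https://arxiv.org/abs/2512.04803
Authors: Yuhao Lu, HengJian SiTu, Jie Li, Yixuan Li, Yang Liu, Wenbin Lin, Yu Wang
Affiliations: Unknown
Categories: Astrophysics of Galaxies (astro-ph.GA); High Energy Astrophysical Phenomena (astro-ph.HE); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI)
Comments: 14 pages, 9 figures. Submitted to Journal of High Energy Astrophysics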

Abstract:We present a population-scale catalogue of 287,872 supermassive black hole masses with high accuracy. Using a deep encoder-decoder network trained on optical spectra with reverberation-mapping (RM) based labels of 849 quasars and applied to all SDSS quasars up to $z = 4$, our method achieves a root-mean-square error of 0.058 dex, a relative uncertainty of $\approx 14\%$, and a coefficient of determination $R^2 \approx 0.91$ with respect to RM-based masses, far surpassing traditional single-line virial estimators. Notably, the high accuracy is maintained for both low ($10^{7.5}\,M_\odot$) and high ($10^{9}\,M_\odot$) mass quasars, where empirical relations are unreliable.
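
The metrics quoted here are computed on logarithmic masses, so a short worked example may help. The sketch below shows how RMSE (in dex) and $R^2$ would be evaluated on log10 masses, and converts a 0.058 dex offset into a fractional change in linear mass. Reading the quoted ≈14% as $10^{0.058} - 1$ is our assumption about how the two numbers relate; the abstract does not state it. The mass values are hypothetical.

```python
import math


def rmse(pred, truth):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))


def r_squared(pred, truth):
    """Coefficient of determination of pred with respect to truth."""
    mean_t = sum(truth) / len(truth)
    ss_res = sum((t - p) ** 2 for p, t in zip(pred, truth))
    ss_tot = sum((t - mean_t) ** 2 for t in truth)
    return 1.0 - ss_res / ss_tot


def dex_to_fractional(sigma_dex):
    """Fractional change in linear mass for an offset of sigma_dex in log10 mass."""
    return 10.0 ** sigma_dex - 1.0


# Hypothetical log10(M / M_sun) values: RM-based labels and model predictions.
log_mass_rm = [7.5, 8.0, 8.4, 8.9, 9.0]
log_mass_pred = [7.55, 7.95, 8.45, 8.85, 9.05]

print("RMSE [dex]:", rmse(log_mass_pred, log_mass_rm))
print("R^2:", r_squared(log_mass_pred, log_mass_rm))
print("0.058 dex as a fraction:", dex_to_fractional(0.058))  # roughly 0.143
```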

[AI-98] UnwrapDiff: Conditional Diffusion for Robust InSAR Phase Unwrapping

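[Quick read]: This paper addresses phase unwrapping in interferometric synthetic aperture radar (InSAR) data processing. Unwrapping quality directly affects the reliability of surface deformation monitoring and hazard assessment, and it is limited by noise and decorrelation in radar acquisitions. The key to its solution is UnwrapDiff, a framework based on a denoising diffusion probabilistic model (DDPM) that takes the output of the traditional minimum cost flow algorithm (SNAPHU) as conditional guidance. This retains the prior from the conventional solver while suppressing diverse noise patterns; on a synthetic dataset it reduces the normalized root-mean-square error (NRMSE) by 10.11% on average relative to SNAPHU, and it performs better in difficult cases such as dyke intrusions.

Link: https://arxiv.org/abs/2512.04749
Authors: Yijia Song, Juliet Biggs, Alin Achim, Robert Popescu, Simon Orrego, Nantheera Anantrasirichai
Affiliations: Unknown
Categories: Geophysics (physics.geo-ph); Artificial Intelligence (cs.AI)
Comments: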

Abstract:Phase unwrapping is a fundamental problem in InSAR data processing, supporting geophysical applications such as deformation monitoring and hazard assessment. Its reliability is limited by noise and decorrelation in radar acquisitions, which makes accurate reconstruction of the deformation signal challenging. We propose a denoising diffusion probabilistic model (DDPM)-based framework for InSAR phase unwrapping, UnwrapDiff, in which the output of the traditional minimum cost flow algorithm (SNAPHU) is incorporated as conditional guidance. To evaluate robustness, we construct a synthetic dataset that incorporates atmospheric effects and diverse noise patterns, representative of realistic InSAR observations. Experiments show that the proposed model leverages the conditional prior while reducing the effect of diverse noise patterns, achieving on average a 10.11% reduction in NRMSE compared to SNAPHU. It also achieves better reconstruction quality in difficult cases such as dyke intrusions.
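
For readers unfamiliar with the task, the sketch below illustrates what "wrapping" does to a phase signal and how an NRMSE could be computed. It is neither SNAPHU nor UnwrapDiff: the unwrapping step is the simple 1-D integration of wrapped differences (often attributed to Itoh), which only works when neighbouring samples differ by less than π and there is no noise. Normalising the RMSE by the range of the reference is one common convention and is an assumption here; the abstract does not say which normalisation the authors use.

```python
import math


def wrap_phase(phi):
    """Wrap a phase in radians into the interval [-pi, pi)."""
    return (phi + math.pi) % (2.0 * math.pi) - math.pi


def nrmse(estimate, reference):
    """RMSE divided by the range of the reference (one common normalisation)."""
    n = len(reference)
    rmse = math.sqrt(sum((e - r) ** 2 for e, r in zip(estimate, reference)) / n)
    return rmse / (max(reference) - min(reference))


# Hypothetical noise-free 1-D phase ramp spanning several 2*pi cycles.
true_phase = [0.5 * i for i in range(40)]
wrapped = [wrap_phase(p) for p in true_phase]

# 1-D unwrapping: accumulate the wrapped differences of the wrapped phase.
unwrapped = [wrapped[0]]
for prev, cur in zip(wrapped, wrapped[1:]):
    unwrapped.append(unwrapped[-1] + wrap_phase(cur - prev))

print("NRMSE of wrapped signal:  ", nrmse(wrapped, true_phase))
print("NRMSE of unwrapped signal:", nrmse(unwrapped, true_phase))
```

With noise or steep gradients the wrapped differences become ambiguous, which is the regime where global solvers and learned priors of the kind described above are needed.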

[AI-99] Neural Policy Composition from Free Energy Minimization

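[Quick read]: This paper addresses the mechanism of policy gating in natural intelligence: how task structure determines when a particular behavioural policy is activated or suppressed, and how that computation could be implemented in neural circuits. The core difficulty is the lack of a theoretically rigorous, interpretable framework that unifies task objectives, dynamical computation, and circuit mechanisms. The key to its solution is GateMod, which has three levels. GateFrame is a normative framework based on free energy minimization that relates gating rules to the task. GateFlow is a continuous-time, energy-based dynamical system with global exponential convergence and robustness. GateNet is a soft-competitive recurrent neural circuit whose local and contextual computations are consistent with known dendritic and neural processing motifs. In both multi-agent collective behaviour and human multi-armed bandit decision-making, GateMod gives interpretable gating mechanisms and matches or outperforms established models, offering a unifying framework for gating in natural agents.

Link: https://arxiv.org/abs/2512.04745
Authors: Francesca Rossi, Veronica Centorrino, Francesco Bullo, Giovanni Russo
Affiliations: Unknown
Categories: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Adaptation and Self-Organizing Systems (nlin.AO)
Comments: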

Abstract:The ability to compose acquired skills to plan and execute behaviors is a hallmark of natural intelligence. Yet, despite remarkable cross-disciplinary efforts, a principled account of how task structure shapes gating and how such computations could be delivered in neural circuits, remains elusive. Here we introduce GateMod, an interpretable theoretically grounded computational model linking the emergence of gating to the underlying decision-making task, and to a neural circuit architecture. We first develop GateFrame, a normative framework casting policy gating into the minimization of the free energy. This framework, relating gating rules to task, applies broadly across neuroscience, cognitive and computational sciences. We then derive GateFlow, a continuous-time energy based dynamics that provably converges to GateFrame optimal solution. Convergence, exponential and global, follows from a contractivity property that also yields robustness and other desirable properties. Finally, we derive a neural circuit from GateFlow, GateNet. This is a soft-competitive recurrent circuit whose components perform local and contextual computations consistent with known dendritic and neural processing motifs. We evaluate GateMod across two different settings: collective behaviors in multi-agent systems and human decision-making in multi-armed bandits. In all settings, GateMod provides interpretable mechanistic explanations of gating and quantitatively matches or outperforms established models. GateMod offers a unifying framework for neural policy gating, linking task objectives, dynamical computation, and circuit-level mechanisms. It provides a framework to understand gating in natural agents beyond current explanations and to equip machines with this ability.
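
As a rough illustration of how free energy minimization can give rise to soft-competitive gating, the sketch below uses the textbook variational form: expected cost plus a KL penalty to a prior over policies, whose minimiser on the simplex is a softmax-like (Gibbs) weighting. This is a generic construction chosen for illustration; the abstract does not give GateFrame's actual functional, and the names `costs`, `prior`, and `beta` are hypothetical.

```python
import math


def free_energy(weights, costs, prior, beta):
    """Expected cost plus (1/beta) * KL(weights || prior)."""
    expected_cost = sum(w * c for w, c in zip(weights, costs))
    kl = sum(w * math.log(w / q) for w, q in zip(weights, prior) if w > 0.0)
    return expected_cost + kl / beta


def optimal_gating(costs, prior, beta):
    """Closed-form minimiser: weights proportional to prior * exp(-beta * cost)."""
    unnormalised = [q * math.exp(-beta * c) for c, q in zip(costs, prior)]
    z = sum(unnormalised)
    return [u / z for u in unnormalised]


# Three hypothetical primitive policies with different task costs.
costs = [1.0, 0.2, 0.7]
prior = [1.0 / 3.0] * 3
beta = 2.0

gates = optimal_gating(costs, prior, beta)
print("gating weights:", gates)
print("free energy at optimum:", free_energy(gates, costs, prior, beta))
print("free energy at uniform:", free_energy(prior, costs, prior, beta))
```

Larger `beta` sharpens the competition toward the cheapest policy, while `beta` near zero keeps the gates close to the prior.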

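[AI-100] Towards an AI Fluid Scientist: LLM-Powered Scientific Discovery in Experimental Fluid Mechanics

[Quick read]: This paper addresses the low efficiency, tedious workflow, and heavy reliance on manual intervention in experimental fluid mechanics, in particular the difficulty of making traditional experiments automated, systematic, and large-scale. The key to its solution is an autonomous experimental framework called AI Fluid Scientist. It combines a computer-controlled circulating water tunnel (CWT), a Human-in-the-Loop (HIL) mechanism, neural-network fitting of physical laws, and a multi-agent virtual-real interaction system into a closed loop that automates hypothesis generation, experimental design, robotic execution, data analysis, and manuscript preparation, improving research efficiency and extending what can be discovered about complex flow phenomena such as vortex-induced vibration (VIV) and wake-induced vibration (WIV).

Link: https://arxiv.org/abs/2512.04716
Authors: Haodong Feng, Lugang Ye, Dixia Fan
Affiliations: Unknown
Categories: Fluid Dynamics (physics.flu-dyn); Artificial Intelligence (cs.AI)
Comments: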

Abstract:The integration of artificial intelligence into experimental fluid mechanics promises to accelerate discovery, yet most AI applications remain narrowly focused on numerical studies. This work proposes an AI Fluid Scientist framework that autonomously executes the complete experimental workflow: hypothesis generation, experimental design, robotic execution, data analysis, and manuscript preparation. We validate this through investigation of vortex-induced vibration (VIV) and wake-induced vibration (WIV) in tandem cylinders. Our work has four key contributions: (1) A computer-controlled circulating water tunnel (CWT) with programmatic control of flow velocity, cylinder position, and forcing parameters (vibration frequency and amplitude) with data acquisition (displacement, force, and torque). (2) Automated experiments reproduce literature benchmarks (Khalak and Williamson [1999] and Assi et al. [2013, 2010]) with frequency lock-in within 4% and matching critical spacing trends. (3) The framework with Human-in-the-Loop (HIL) discovers more WIV amplitude response phenomena, and uses a neural network to fit physical laws from data, which is 31% higher than that of polynomial fitting. (4) The framework, with a multi-agent virtual-real interaction system, executes hundreds of experiments end-to-end, automatically completing the entire research process from hypothesis generation, experimental design, and experimental execution through data analysis and manuscript preparation. It greatly frees human researchers and improves study efficiency, providing a new paradigm for the development and research of experimental fluid mechanics.

[AI-101] NORi: An ML-Augmented Ocean Boundary Layer Parameterization

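[Quick read]: This paper addresses the difficulty of parameterizing ocean boundary layer turbulence in climate models so that mixing, especially entrainment at the base of the boundary layer, is simulated accurately. Conventional local diffusive closures cannot represent non-local turbulent transport, which limits their ability to predict mixing under different convective strengths, background stratifications, rotation rates, and wind forcings. The key to its solution is NORi (neural ordinary differential equations Richardson number closure): a physically constrained, Richardson number-dependent viscosity and diffusivity forms the base closure, and neural ordinary differential equations (NODEs) learn the non-local entrainment. Training is done "a posteriori", minimizing the error in the time-integrated variables of interest rather than in the noisy instantaneous subgrid fluxes, which improves prediction accuracy and generalization. The approach remains numerically stable (simulations of at least 100 years with time steps up to one hour) while reducing data requirements and allowing inference performance to be targeted directly.

Link: https://arxiv.org/abs/2512.04452
Authors: Xin Kai Lee, Ali Ramadhan, Andre Souza, Gregory LeClaire Wagner, Simone Silvestri, John Marshall, Raffaele Ferrari
Affiliations: Unknown
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Fluid Dynamics (physics.flu-dyn)
Comments: 48 pages, 16 figures, submitted to Journal of Advances in Modeling Earth Systems (JAMES)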

Abstract:NORi is a machine-learned (ML) parameterization of ocean boundary layer turbulence that is physics-based and augmented with neural networks. NORi stands for neural ordinary differential equations (NODEs) Richardson number (Ri) closure. The physical parameterization is controlled by a Richardson number-dependent diffusivity and viscosity. The NODEs are trained to capture the entrainment through the base of the boundary layer, which cannot be represented with a local diffusive closure. The parameterization is trained using large-eddy simulations in an “a posteriori” fashion, where parameters are calibrated with a loss function that explicitly depends on the actual time-integrated variables of interest rather than the instantaneous subgrid fluxes, which are inherently noisy. NORi is designed for the realistic nonlinear equation of state of seawater and demonstrates excellent prediction and generalization capabilities in capturing entrainment dynamics under different convective strengths, oceanic background stratifications, rotation strengths, and surface wind forcings. NORi is numerically stable for at least 100 years of integration time in large-scale simulations, despite only being trained on 2-day horizons, and can be run with time steps as long as one hour. The highly expressive neural networks, combined with a physically-rigorous base closure, prove to be a robust paradigm for designing parameterizations for climate models where data requirements are drastically reduced, inference performance can be directly targeted and optimized, and numerical stability is implicitly encouraged during training.

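[AI-102] Towards 6G Native-AI Edge Networks: A Semantic-Aware and Agentic Intelligence Paradigm

[Quick read]: This paper addresses the inability of traditional radio access network (RAN) designs built on bit-level transmission to meet the intelligent, task-oriented communication needs of sixth-generation (6G) wireless systems. Existing AI-driven O-RAN solutions remain bit-centric and task-siloed, and cannot readily support semantic awareness or autonomous agent collaboration. Its core contribution is a unified taxonomy that organizes research along three axes: i) semantic abstraction level (symbol/feature/intent/knowledge), ii) agent autonomy and coordination granularity (single-, multi-, and hierarchical-agent), and iii) RAN control placement (PHY/MAC, near-real-time RIC, non-real-time RIC). On that basis it systematically introduces enabling technologies, including task-oriented semantic encoders/decoders, multi-agent reinforcement learning, foundation-model-assisted RAN agents, and knowledge-graph-based cross-layer reasoning, and analyzes how semantic communication (SemCom) and agentic intelligence converge in representative 6G use cases such as immersive XR, vehicular V2X, and industrial digital twins.

Link: https://arxiv.org/abs/2512.04405
Authors: Chenyuan Feng, Anbang Zhang, Geyong Min, Yongming Huang, Tony Q. S. Quek, Xiaohu You
Affiliations: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments: Submitted to Digital Communications and Networks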

Abstract:The evolution toward sixth-generation wireless systems positions intelligence as a native network capability, fundamentally transforming the design of radio access networks (RANs). Within this vision, Semantic-native communication and agentic intelligence are expected to play central roles. SemCom departs from bit-level fidelity and instead emphasizes task-oriented meaning exchange, enabling compact SC and introducing new performance measures such as semantic fidelity and task success rate. Agentic intelligence endows distributed RAN entities with goal-driven autonomy, reasoning, planning, and multi-agent collaboration, increasingly supported by foundation models and knowledge graphs. In this work, we first introduce the conceptual foundations of SemCom and agentic networking, and discuss why existing AI-driven O-RAN solutions remain largely bit-centric and task-siloed. We then present a unified taxonomy that organizes recent research along three axes: i) semantic abstraction level (symbol/feature/intent/knowledge), ii) agent autonomy and coordination granularity (single-, multi-, and hierarchical-agent), and iii) RAN control placement across PHY/MAC, near-real-time RIC, and non-real-time RIC. Based on this taxonomy, we systematically introduce enabling technologies including task-oriented semantic encoders/decoders, multi-agent reinforcement learning, foundation-model-assisted RAN agents, and knowledge-graph-based reasoning for cross-layer awareness. Representative 6G use cases, such as immersive XR, vehicular V2X, and industrial digital twins, are analyzed to illustrate the semantic-agentic convergence in practice. Finally, we identify open challenges in semantic representation standardization, scalable trustworthy agent coordination, O-RAN interoperability, and energy-efficient AI deployment, and outline research directions toward operational semantic-agentic AI-RAN.

[AI-103] Adversarial Limits of Quantum Certification: When Eve Defeats Detection

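[Quick read]: This paper addresses a key problem in verifying the security of quantum key distribution (QKD): reliably certifying genuine quantum entanglement correlations, so that a legitimate quantum channel can be told apart from forged classical correlations with which an attacker imitates quantum behaviour. Conventional certification works under idealized conditions but is vulnerable to adaptive adversaries, particularly one that uses a generative adversarial network (GAN) to produce classical correlations that confuse the detector. The key contribution is exposing and correcting a systematic bias in existing evaluation practice: same-distribution calibration inflates detection performance by 44 percentage points. Using cross-distribution calibration and mandatory adversarial testing, the paper finds that an attacker needs only a 5% classical admixture to evade every tested detection method, and that detection becomes possible only above a CHSH value of 2.05. The findings suggest that current QKD security arguments may substantially overestimate real resistance to attack and that stricter adversarial validation is needed.

Link: https://arxiv.org/abs/2512.04391
Authors: Davut Emre Tasar
Affiliations: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
Comments: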

Abstract:Security of quantum key distribution (QKD) relies on certifying that observed correlations arise from genuine quantum entanglement rather than eavesdropper manipulation. Theoretical security proofs assume idealized conditions, whereas practical certification must contend with adaptive adversaries who optimize their attack strategies against detection systems. We establish fundamental adversarial limits for quantum certification using Eve-GAN, a generative adversarial network trained to produce classical correlations indistinguishable from quantum ones. Our central finding: when Eve interpolates her classical correlations with quantum data at mixing parameter, all tested detection methods achieve ROC AUC = 0.50, equivalent to random guessing. This means an eavesdropper needs only 5% classical admixture to completely evade detection. Critically, we discover that same-distribution calibration, a common practice in prior certification studies, inflates detection performance by 44 percentage points compared to proper cross-distribution evaluation, revealing a systematic flaw that may have led to overestimated security claims. Analysis of the Popescu-Rohrlich (PR-box) regime identifies a sharp phase transition at CHSH S = 2.05: below this value, no statistical method distinguishes classical from quantum correlations; above it, detection probability increases monotonically. Hardware validation on IBM Quantum demonstrates that Eve-GAN achieves CHSH = 2.736, remarkably exceeding real quantum hardware performance (CHSH = 2.691), illustrating that classical adversaries can outperform noisy quantum systems on standard certification metrics. These results have immediate implications for QKD security: adversaries maintaining 95% quantum fidelity evade all tested detection methods. We provide corrected methodology using cross-distribution calibration and recommend mandatory adversarial testing for quantum security claims.
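
The CHSH values quoted above (2.05, 2.691, 2.736) are easier to place with the two standard reference points in mind: local classical models are bounded by S = 2, and ideal quantum correlations reach 2√2 ≈ 2.828. The sketch below computes both from textbook correlation functions. It does not model Eve-GAN or any detector; the singlet correlation and measurement angles are the standard choices, and `lhv_correlation` is a hypothetical, deliberately trivial classical strategy.

```python
import math


def singlet_correlation(a, b):
    """Quantum prediction E(a, b) = -cos(a - b) for a spin singlet."""
    return -math.cos(a - b)


def lhv_correlation(a, b):
    """Trivial local strategy: Alice always outputs +1 and Bob always -1."""
    return -1.0


def chsh(correlation, a, a_prime, b, b_prime):
    """CHSH value S = |E(a,b) - E(a,b') + E(a',b) + E(a',b')|."""
    return abs(correlation(a, b) - correlation(a, b_prime)
               + correlation(a_prime, b) + correlation(a_prime, b_prime))


# Standard measurement angles that maximise S for the singlet state.
angles = (0.0, math.pi / 2, math.pi / 4, 3 * math.pi / 4)

s_quantum = chsh(singlet_correlation, *angles)
s_classical = chsh(lhv_correlation, *angles)
print("quantum S:", s_quantum)      # 2 * sqrt(2), about 2.828
print("classical S:", s_classical)  # 2.0
```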

[AI-104] Machine Phenomenology: A Simple Equation Classifying Fast Radio Bursts

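[Quick read]: This paper addresses how human physical intuition can guide machine-driven symbolic regression to discover empirical laws from observational data. The key to its solution is a human-AI workflow that combines feature selection, dimensional analysis, and symbolic regression: deep learning first identifies six independent parameters that describe fast radio bursts (FRBs); humans then construct dimensionless groups, guided by the Buckingham $\pi$ theorem and correlation analysis; finally the machine performs symbolic regression and finds a governing equation for FRBs. Validation on the newer CHIME catalogue indicates that the equation generalizes well, and the framework is applicable to a broad range of scientific domains.

Link: https://arxiv.org/abs/2512.04204
Authors: Yang Liu, Yuhao Lu, Rahim Moradi, Bo Yang, Bing Zhang, Wenbin Lin, Yu Wang
Affiliations: Unknown
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); High Energy Astrophysical Phenomena (astro-ph.HE); Artificial Intelligence (cs.AI)
Comments: 19 pages, 9 figures, 3 tables. Submitted to SCIENCE CHINA Physics, Mechanics & Astronomy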

Abstract:This work shows how human physical reasoning can guide machine-driven symbolic regression toward discovering empirical laws from observations. As an example, we derive a simple equation that classifies fast radio bursts (FRBs) into two distinct Gaussian distributions, indicating the existence of two physical classes. This human-AI workflow integrates feature selection, dimensional analysis, and symbolic regression: deep learning first analyzes CHIME Catalog 1 and identifies six independent parameters that collectively provide a complete description of FRBs; guided by Buckingham-$\pi$ analysis and correlation analysis, humans then construct dimensionless groups; finally, symbolic regression performed by the machine discovers the governing equation. When applied to the newer CHIME Catalog, the equation produces consistent results, demonstrating that it captures the underlying physics. This framework is applicable to a broad range of scientific domains.

[AI-105] ReVeal-MT: A Physics-Informed Neural Network for Multi-Transmitter Radio Environment Mapping

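[Quick read]: This paper addresses the loss of radio environment mapping accuracy when multiple transmitters coexist, where shadowing combined with interference from adjacent transmitters markedly degrades existing models. The key to its solution is ReVeal-MT, a framework based on Physics-Informed Neural Networks (PINNs). It formulates a multi-transmitter partial differential equation (PDE) for the Received Signal Strength Indicator (RSSI) and embeds the PDE residual in the neural network loss function, so that the spectrum landscape can be reconstructed accurately from sparse RF sensor measurements. The method is validated in real-world rural and suburban settings and benchmarked against 3GPP and ITU-R channel models and a single-transmitter baseline PINN. It reaches a root-mean-square error (RMSE) of 2.66 dB with only 45 samples at low computational complexity, improving multi-transmitter radio environment mapping and supporting fine-grained spectrum management and precise coexistence between Primary Users (PUs) and Secondary Users (SUs).

Link: https://arxiv.org/abs/2512.04100
Authors: Mukaram Shahid, Kunal Das, Hadia Ushaq, Hongwei Zhang, Jiming Song, Daji Qiao, Sarath Babu, Yong Guan, Zhengyuan Zhu, Arsalan Ahmad
Affiliations: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments: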

Abstract:Accurately mapping the radio environment (e.g., identifying wireless signal strength at specific frequency bands and geographic locations) is crucial for efficient spectrum sharing, enabling Secondary Users (SUs) to access underutilized spectrum bands while protecting Primary Users (PUs). While existing models have made progress, they often degrade in performance when multiple transmitters coexist, due to the compounded effects of shadowing and interference from adjacent transmitters. To address this challenge, we extend our prior work on Physics-Informed Neural Networks (PINNs) for single-transmitter mapping to derive a new multi-transmitter Partial Differential Equation (PDE) formulation of the Received Signal Strength Indicator (RSSI). We then propose ReVeal-MT (Re-constructor and Visualizer of Spectrum Landscape for Multiple Transmitters), a novel PINN which integrates the multi-source PDE residual into a neural network loss function, enabling accurate spectrum landscape reconstruction from sparse RF sensor measurements. ReVeal-MT is validated using real-world measurements from the ARA wireless living lab across rural and suburban environments, and benchmarked against 3GPP and ITU-R channel models and a baseline PINN model for a single transmitter use-case. Results show that ReVeal-MT achieves substantial accuracy gains in multi-transmitter scenarios, e.g., achieving an RMSE of only 2.66 dB with as few as 45 samples over a 370-square-kilometer region, while maintaining low computational complexity. These findings demonstrate that ReVeal-MT significantly advances radio environment mapping under realistic multi-transmitter conditions, with strong potential for enabling fine-grained spectrum management and precise coexistence between PUs and SUs.

[AI-106] Partial multivariate transformer as a tool for cryptocurrencies time series prediction ICTAI2025

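[Quick read]: This paper addresses the modelling difficulty that extreme volatility creates for cryptocurrency price forecasting, namely how to balance information-scarce univariate models against noise-prone full-multivariate models. The key to its solution is a partial-multivariate strategy that builds the model on a subset of features with high predictive value, extracting useful signal while suppressing noise. The authors test the idea empirically with the Partial-Multivariate Transformer (PMformer). The results show better statistical accuracy than traditional models, but also that lower prediction error does not necessarily produce higher trading returns, which points to the need for evaluation criteria that better match real financial objectives.

Link: https://arxiv.org/abs/2512.04099
Authors: Andrzej Tokajuk, Jarosław A. Chudziak
Affiliations: Unknown
Categories: Statistical Finance (q-fin.ST); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Trading and Market Microstructure (q-fin.TR)
Comments: Accepted for publication in the proceedings of ICTAI 2025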

Abstract:Forecasting cryptocurrency prices is hindered by extreme volatility and a methodological dilemma between information-scarce univariate models and noise-prone full-multivariate models. This paper investigates a partial-multivariate approach to balance this trade-off, hypothesizing that a strategic subset of features offers superior predictive power. We apply the Partial-Multivariate Transformer (PMformer) to forecast daily returns for BTCUSDT and ETHUSDT, benchmarking it against eleven classical and deep learning models. Our empirical results yield two primary contributions. First, we demonstrate that the partial-multivariate strategy achieves significant statistical accuracy, effectively balancing informative signals with noise. Second, we experiment and discuss an observable disconnect between this statistical performance and practical trading utility; lower prediction error did not consistently translate to higher financial returns in simulations. This finding challenges the reliance on traditional error metrics and highlights the need to develop evaluation criteria more aligned with real-world financial objectives.

Machine Learning

[LG-0] The Geometry of Intelligence: Deterministic Functional Topology as a Foundation for Real-World Perception

Link: https://arxiv.org/abs/2512.05089
Authors: Eduardo Di Santi
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 35 pages, 6 figures. This preprint develops a deterministic functional-topological framework showing that physical systems generate compact perceptual manifolds with finite radius. We provide theory, Monte-Carlo estimators, and validation across PM, battery, and ECG domains, unifying biological perception and self-supervised AI

Abstract:Real-world physical processes do not generate arbitrary variability: their signals concentrate on compact and low-variability subsets of functional space. This geometric structure enables rapid generalization from a few examples in both biological and artificial systems. This work develops a deterministic functional-topological framework in which the set of valid realizations of a physical phenomenon forms a compact perceptual manifold with stable invariants and a finite Hausdorff radius. We show that the boundaries of this manifold can be discovered in a fully self-supervised manner through Monte Carlo sampling, even when the governing equations of the system are unknown. We provide theoretical guarantees, practical estimators of knowledge boundaries, and empirical validations across three domains: electromechanical railway point machines, electrochemical battery discharge curves, and physiological ECG signals. Our results demonstrate that deterministic functional topology offers a unified mathematical foundation for perception, representation, and world-model construction, explaining why biological learners and self-supervised AI models can generalize from limited observations.
ACM classes: G.2.0; I.2.6; I.5.3
DOI: https://doi.org/10.48550/arXiv.2512.05089 (arXiv-issued DOI via DataCite, pending registration)
Submission history: [v1] Thu, 4 Dec 2025 18:54:07 UTC (149 KB), from Eduardo Di Santi

[LG-1] Gradient Descent with Provably Tuned Learning-rate Schedules

Link: https://arxiv.org/abs/2512.05084
Authors: Dravyansh Sharma
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Gradient-based iterative optimization methods are the workhorse of modern machine learning. They crucially rely on careful tuning of parameters like learning rate and momentum. However, one typically sets them using heuristic approaches without formal near-optimality guarantees. Recent work by Gupta and Roughgarden studies how to learn a good step-size in gradient descent. However, like most of the literature with theoretical guarantees for gradient-based optimization, their results rely on strong assumptions on the function class including convexity and smoothness which do not hold in typical applications. In this work, we develop novel analytical tools for provably tuning hyperparameters in gradient-based algorithms that apply to non-convex and non-smooth functions. We obtain matching sample complexity bounds for learning the step-size in gradient descent shown for smooth, convex functions in prior work (up to logarithmic factors) but for a much broader class of functions. Our analysis applies to gradient descent on neural networks with commonly used activation functions (including ReLU, sigmoid and tanh). We extend our framework to tuning multiple hyperparameters, including tuning the learning rate schedule, simultaneously tuning momentum and step-size, and pre-training the initialization vector. Our approach can be used to bound the sample complexity for minimizing both the validation loss as well as the number of gradient descent iterations.
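
The "learn a good step-size from sample problems" setting can be illustrated with a toy experiment. The sketch below draws hypothetical one-dimensional quadratic instances, counts how many gradient descent iterations each candidate step-size needs to converge, and keeps the step-size with the lowest average count on the sample. It only illustrates the empirical-risk-minimisation setup; it is not the paper's analysis and carries no guarantee. The instance distribution, candidate grid, and tolerance are all assumptions.

```python
import random


def gd_iterations(curvature, step, x0=1.0, tol=1e-6, max_iters=10_000):
    """Iterations gradient descent needs on f(x) = 0.5 * curvature * x**2."""
    x = x0
    for t in range(max_iters):
        if abs(x) < tol:
            return t
        if abs(x) > 1e6:           # diverged: charge the full budget
            return max_iters
        x -= step * curvature * x  # the gradient of f is curvature * x
    return max_iters


def tune_step_size(curvatures, candidate_steps):
    """Pick the step with the fewest average iterations on the sampled instances."""
    def average_cost(step):
        return sum(gd_iterations(a, step) for a in curvatures) / len(curvatures)
    return min(candidate_steps, key=average_cost)


rng = random.Random(0)
training_instances = [rng.uniform(0.5, 2.0) for _ in range(50)]
candidate_steps = [0.1 * k for k in range(1, 13)]  # 0.1, 0.2, ..., 1.2

best_step = tune_step_size(training_instances, candidate_steps)
print("selected step-size:", best_step)
```

The sample-complexity question studied in the paper is, roughly, how many such training instances are needed before the step-size chosen on the sample is also near-optimal on unseen instances from the same distribution.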

[LG-2] OMTRA: A Multi-Task Generative Model for Structure-Based Drug Design

Link: https://arxiv.org/abs/2512.05080
Authors: Ian Dunn,Liv Toft,Tyler Katz,Juhi Gupta,Riya Shah,Ramith Hettiarachchi,David R. Koes
Categories: Machine Learning (cs.LG)
Comments: Presented at the Machine Learning for Structural Biology Workshop, 2025

Abstract:Structure-based drug design (SBDD) focuses on designing small-molecule ligands that bind to specific protein pockets. Computational methods are integral in modern SBDD workflows and often make use of virtual screening methods via docking or pharmacophore search. Modern generative modeling approaches have focused on improving novel ligand discovery by enabling de novo design. In this work, we recognize that these tasks share a common structure and can therefore be represented as different instantiations of a consistent generative modeling framework. We propose a unified approach in OMTRA, a multi-modal flow matching model that flexibly performs many tasks relevant to SBDD, including some with no analogue in conventional workflows. Additionally, we curate a dataset of 500M 3D molecular conformers, complementing protein-ligand data and expanding the chemical diversity available for training. OMTRA obtains state-of-the-art performance on pocket-conditioned de novo design and docking; however, the effects of large-scale pretraining and multi-task training are modest. All code, trained models, and the dataset for reproducing this work are available at this https URL

[LG-3] Hybrid Quantum-Classical Autoencoders for Unsupervised Network Intrusion Detection

Link: https://arxiv.org/abs/2512.05069
Authors: Mohammad Arif Rasyidi,Omar Alhussein,Sami Muhaidat,Ernesto Damiani
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Quantum Physics (quant-ph)

Abstract:Unsupervised anomaly-based intrusion detection requires models that can generalize to attack patterns not observed during training. This work presents the first large-scale evaluation of hybrid quantum-classical (HQC) autoencoders for this task. We construct a unified experimental framework that iterates over key quantum design choices, including quantum-layer placement, measurement approach, variational and non-variational formulations, and latent-space regularization. Experiments across three benchmark NIDS datasets show that HQC autoencoders can match or exceed classical performance in their best configurations, although they exhibit higher sensitivity to architectural decisions. Under zero-day evaluation, well-configured HQC models provide stronger and more stable generalization than classical and supervised baselines. Simulated gate-noise experiments reveal early performance degradation, indicating the need for noise-aware HQC designs. These results provide the first data-driven characterization of HQC autoencoder behavior for network intrusion detection and outline key factors that govern their practical viability. All experiment code and configurations are available at this https URL.

[LG-4] SuperActivators: Only the Tail of the Distribution Contains Reliable Concept Signals

Link: https://arxiv.org/abs/2512.05038
Authors: Cassandra Goldberg,Chaehyeon Kim,Adam Stein,Eric Wong
Categories: Machine Learning (cs.LG)

Abstract:Concept vectors aim to enhance model interpretability by linking internal representations with human-understandable semantics, but their utility is often limited by noisy and inconsistent activations. In this work, we uncover a clear pattern within the noise, which we term the SuperActivator Mechanism: while in-concept and out-of-concept activations overlap considerably, the token activations in the extreme high tail of the in-concept distribution provide a reliable signal of concept presence. We demonstrate the generality of this mechanism by showing that SuperActivator tokens consistently outperform standard vector-based and prompting concept detection approaches, achieving up to a 14% higher F1 score across image and text modalities, model architectures, model layers, and concept extraction techniques. Finally, we leverage SuperActivator tokens to improve feature attributions for concepts.
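The tail-based idea can be sketched as follows. This is only an illustration under stated assumptions, not the authors' procedure: the threshold is taken as a high empirical percentile of in-concept token activations, and a sample is flagged when its strongest token activation reaches that threshold. The percentile value, the calibration numbers, and the names `tail_threshold` and `has_concept` are hypothetical.

```python
from typing import Sequence


def tail_threshold(in_concept_acts: Sequence[float], pct: int = 90) -> float:
    """Return the empirical pct-th percentile (nearest rank) of in-concept token activations."""
    ordered = sorted(in_concept_acts)
    # Integer arithmetic avoids floating-point surprises when computing the rank.
    idx = min(len(ordered) - 1, (pct * len(ordered)) // 100)
    return ordered[idx]


def has_concept(token_acts: Sequence[float], threshold: float) -> bool:
    """Flag a sample when its strongest token activation reaches the tail threshold."""
    return max(token_acts) >= threshold


# Hypothetical calibration activations: mostly low values plus a small high tail.
calib = [0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.5, 0.6, 2.5, 3.0]
thr = tail_threshold(calib, pct=80)
```

Only activations in the high tail trigger a detection, so the overlapping bulk of the in-concept and out-of-concept distributions is ignored by construction.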

[LG-5] Dual-Path Region-Guided Attention Network for Ground Reaction Force and Moment Regression

Link: https://arxiv.org/abs/2512.05030
Authors: Xuan Li,Samuel Bello
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)

Abstract:Accurate estimation of three-dimensional ground reaction forces and moments (GRFs/GRMs) is crucial for both biomechanics research and clinical rehabilitation evaluation. In this study, we focus on insole-based GRF/GRM estimation and further validate our approach on a public walking dataset. We propose a Dual-Path Region-Guided Attention Network that integrates anatomy-inspired spatial priors and temporal priors into a region-level attention mechanism, while a complementary path captures context from the full sensor field. The two paths are trained jointly and their outputs are combined to produce the final GRF/GRM predictions. Conclusions: Our model outperforms strong baseline models, including CNN and CNN-LSTM architectures on two datasets, achieving the lowest six-component average NRMSE of 5.78% on the insole dataset and 1.42% for the vertical ground reaction force on the public dataset. This demonstrates robust performance for ground reaction force and moment estimation.

[LG-6] Efficient Generative Transformer Operators For Million-Point PDEs

Link: https://arxiv.org/abs/2512.04974
Authors: Armand Kassaï Koupaï,Lise Le Boudec,Patrick Gallinari
Categories: Machine Learning (cs.LG)

Abstract:We introduce ECHO, a transformer-operator framework for generating million-point PDE trajectories. While existing neural operators (NOs) have shown promise for solving partial differential equations, they remain limited in practice due to poor scalability on dense grids, error accumulation during dynamic unrolling, and task-specific design. ECHO addresses these challenges through three key innovations. (i) It employs a hierarchical convolutional encode-decode architecture that achieves a $100\times$ spatio-temporal compression while preserving fidelity on mesh points. (ii) It incorporates a training and adaptation strategy that enables high-resolution PDE solution generation from sparse input grids. (iii) It adopts a generative modeling paradigm that learns complete trajectory segments, mitigating long-horizon error drift. The training strategy decouples representation learning from downstream task supervision, allowing the model to tackle multiple tasks such as trajectory generation, forward and inverse problems, and interpolation. The generative model further supports both conditional and unconditional generation. We demonstrate state-of-the-art performance on million-point simulations across diverse PDE systems featuring complex geometries, high-frequency dynamics, and long-term horizons.

[LG-7] Environment-Aware Channel Inference via Cross-Modal Flow: From Multimodal Sensing to Wireless Channels

Link: https://arxiv.org/abs/2512.04966
Authors: Guangming Liang,Mingjie Yang,Dongzhu Liu,Paul Henderson,Lajos Hanzo
Categories: Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: 13 pages, 13 figures, 40 references, submitted to IEEE for possible publication

Abstract:Accurate channel state information (CSI) underpins reliable and efficient wireless communication. However, acquiring CSI via pilot estimation incurs substantial overhead, especially in massive multiple-input multiple-output (MIMO) systems operating in high-Doppler environments. By leveraging the growing availability of environmental sensing data, this treatise investigates pilot-free channel inference that estimates complete CSI directly from multimodal observations, including camera images, LiDAR point clouds, and GPS coordinates. In contrast to prior studies that rely on predefined channel models, we develop a data-driven framework that formulates the sensing-to-channel mapping as a cross-modal flow matching problem. The framework fuses multimodal features into a latent distribution within the channel domain, and learns a velocity field that continuously transforms the latent distribution toward the channel distribution. To make this formulation tractable and efficient, we reformulate the problem as an equivalent conditional flow matching objective and incorporate a modality alignment loss, while adopting low-latency inference mechanisms to enable real-time CSI estimation. In experiments, we build a procedural data generator based on Sionna and Blender to support realistic modeling of sensing scenes and wireless propagation. System-level evaluations demonstrate significant improvements over pilot- and sensing-based benchmarks in both channel estimation accuracy and spectral efficiency for the downstream beamforming task.
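For orientation, a commonly used conditional flow matching objective with a linear interpolation path is written out below. This is the generic textbook form and may differ from the exact loss used in the paper. The symbols $z_0$ (a sample from the fused latent distribution), $h$ (a channel sample), $c$ (the multimodal sensing features), and $v_\theta$ (the learned velocity field) are notational assumptions.

```latex
\[
x_t = (1 - t)\, z_0 + t\, h, \qquad
\mathcal{L}_{\mathrm{CFM}}(\theta)
= \mathbb{E}_{t \sim \mathcal{U}[0,1],\; (z_0,\, h,\, c)}
\left\| v_\theta(x_t, t, c) - (h - z_0) \right\|^2 .
\]
```

Under this form, the regression target $h - z_0$ is the constant velocity of the straight path from the latent sample to the channel sample, which is what lets the model be trained without simulating the flow.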

[LG-8] Amortized Inference of Multi-Modal Posteriors using Likelihood-Weighted Normalizing Flows

Link: https://arxiv.org/abs/2512.04954
Authors: Rajneil Baruah
Categories: Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); High Energy Physics - Phenomenology (hep-ph); Computational Physics (physics.comp-ph); Data Analysis, Statistics and Probability (physics.data-an)
Comments: 14 pages, 8 figures

Abstract:We present a novel technique for amortized posterior estimation using Normalizing Flows trained with likelihood-weighted importance sampling. This approach allows for the efficient inference of theoretical parameters in high-dimensional inverse problems without the need for posterior training samples. We implement the method on multi-modal benchmark tasks in 2D and 3D to assess its efficacy. A critical observation of our study is the impact of the topology of the base distributions on the modelled posteriors. We find that standard unimodal base distributions fail to capture disconnected support, resulting in spurious probability bridges between modes. We demonstrate that initializing the flow with a Gaussian Mixture Model that matches the cardinality of the target modes significantly improves reconstruction fidelity, as measured by distance and divergence metrics.
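A minimal sketch of likelihood weighting, assuming a self-normalized importance-sampling scheme: parameters are drawn from a proposal, weighted by likelihood times prior over proposal density, and the weights are used in a weighted negative log-likelihood for the density model. The paper's actual training objective may differ, and the function names below are hypothetical.

```python
import math
from typing import List, Sequence


def normalized_weights(
    log_like: Sequence[float],
    log_prior: Sequence[float],
    log_proposal: Sequence[float],
) -> List[float]:
    """Self-normalized importance weights, computed stably in log space."""
    log_w = [ll + lp - lq for ll, lp, lq in zip(log_like, log_prior, log_proposal)]
    m = max(log_w)  # subtract the maximum before exponentiating to avoid overflow
    unnorm = [math.exp(v - m) for v in log_w]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def weighted_nll(weights: Sequence[float], log_q_model: Sequence[float]) -> float:
    """Weighted negative log-likelihood of the density model at the sampled parameters."""
    return -sum(w * lq for w, lq in zip(weights, log_q_model))


# Two hypothetical samples: the second is three times as likely as the first.
w = normalized_weights([0.0, math.log(3.0)], [0.0, 0.0], [0.0, 0.0])
loss = weighted_nll(w, [math.log(0.5), math.log(0.5)])
```

No posterior samples are needed here: only likelihood evaluations at proposal draws, which matches the motivation stated in the abstract.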

[LG-9] Multi-Agent Reinforcement Learning for Intraday Operating Rooms Scheduling under Uncertainty

Link: https://arxiv.org/abs/2512.04918
Authors: Kailiang Liu,Ying Chen,Ralf Borndörfer,Thorsten Koch
Categories: Machine Learning (cs.LG)

Abstract:Intraday surgical scheduling is a multi-objective decision problem under uncertainty, balancing elective throughput, urgent and emergency demand, delays, sequence-dependent setups, and overtime. We formulate the problem as a cooperative Markov game and propose a multi-agent reinforcement learning (MARL) framework in which each operating room (OR) is an agent trained with centralized training and decentralized execution. All agents share a policy trained via Proximal Policy Optimization (PPO), which maps rich system states to actions, while a within-epoch sequential assignment protocol constructs conflict-free joint schedules across ORs. A mixed-integer pre-schedule provides reference starting times for electives; we impose type-specific quadratic delay penalties relative to these references and a terminal overtime penalty, yielding a single reward that captures throughput, timeliness, and staff workload. In simulations reflecting a realistic hospital mix (six ORs, eight surgery types, random urgent and emergency arrivals), the learned policy outperforms six rule-based heuristics across seven metrics and three evaluation subsets, and, relative to an ex post MIP oracle, quantifies optimality gaps. Policy analytics reveal interpretable behavior: prioritizing emergencies, batching similar cases to reduce setups, and deferring lower-value electives. We also derive a suboptimality bound for the sequential decomposition under simplifying assumptions. We discuss limitations, including OR homogeneity and the omission of explicit staffing constraints, and outline extensions. Overall, the approach offers a practical, interpretable, and tunable data-driven complement to optimization for real-time OR scheduling.

[LG-10] A result relating convex n-widths to covering numbers with some applications to neural networks

Link: https://arxiv.org/abs/2512.04912
Authors: Jonathan Baxter,Peter Bartlett
Categories: Machine Learning (cs.LG)

Abstract:In general, approximating classes of functions defined over high-dimensional input spaces by linear combinations of a fixed set of basis functions or "features" is known to be hard. Typically, the worst-case error of the best basis set decays only as fast as $\Theta(n^{-1/d})$, where $n$ is the number of basis functions and $d$ is the input dimension. However, there are many examples of high-dimensional pattern recognition problems (such as face recognition) where linear combinations of small sets of features do solve the problem well. Hence these function classes do not suffer from the "curse of dimensionality" associated with more general classes. It is natural then, to look for characterizations of high-dimensional function classes that nevertheless are approximated well by linear combinations of small sets of features. In this paper we give a general result relating the error of approximation of a function class to the covering number of its "convex core". For one-hidden-layer neural networks, covering numbers of the class of functions computed by a single hidden node upper bound the covering numbers of the convex core. Hence, using standard results we obtain upper bounds on the approximation rate of neural network classes.
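A short worked calculation shows why a decay rate of order $n^{-1/d}$ is so restrictive. The constant $C$ and the example dimension $d = 10$ are assumptions chosen purely for illustration.

```latex
\[
\varepsilon(n) \approx C\, n^{-1/d}
\quad\Longrightarrow\quad
\frac{\varepsilon(n')}{\varepsilon(n)} = \frac{1}{2}
\iff \frac{n'}{n} = 2^{d},
\qquad d = 10 \;\Rightarrow\; n' = 1024\, n .
\]
```

In other words, at this rate halving the worst-case error in ten dimensions takes roughly a thousand times as many basis functions, which is the curse of dimensionality the abstract refers to.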

[LG-11] Contract-Driven QoE Auditing for Speech and Singing Services: From MOS Regression to Service Graphs

Link: https://arxiv.org/abs/2512.04827
Authors: Wenzhang Du
Categories: Sound (cs.SD); Machine Learning (cs.LG)
Comments: 11 pages, 3 figures

Abstract:Subjective mean opinion scores (MOS) remain the de-facto target for non-intrusive speech and singing quality assessment. However, MOS is a scalar that collapses heterogeneous user expectations, ignores service-level objectives, and is difficult to compare across deployment graphs. We propose a contract-driven QoE auditing framework: each service graph G is evaluated under a set of human-interpretable experience contracts C, yielding a contract-level satisfaction vector Q(G, C). We show that (i) classical MOS regression is a special case with a degenerate contract set, (ii) contract-driven quality is more stable than MOS under graph view transformations (e.g., pooling by system vs. by system type), and (iii) the effective sample complexity of learning contracts is governed by contract semantics rather than merely the dimensionality of C. We instantiate the framework on URGENT2024 MOS (6.9k speech utterances with raw rating vectors) and SingMOS v1 (7,981 singing clips; 80 systems). On URGENT, we train a contract-aware neural auditor on self-supervised WavLM embeddings; on SingMOS, we perform contract-driven graph auditing using released rating vectors and metadata without decoding audio. Empirically, our auditor matches strong MOS predictors in MOS accuracy while providing calibrated contract probabilities; on SingMOS, Q(G, C) exhibits substantially smaller cross-view drift than raw MOS and graph-only baselines; on URGENT, difficulty curves reveal that mis-specified “simple” contracts can be harder to learn than richer but better aligned contract sets.

[LG-12] Pick-to-Learn for Systems and Control: Data-driven Synthesis with State-of-the-art Safety Guarantees

Link: https://arxiv.org/abs/2512.04781
Authors: Dario Paccagnan,Daniel Marks,Marco C. Campi,Simone Garatti
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG)
Comments: 27 double-column pages, 18 figures

Abstract:Data-driven methods have become paramount in modern systems and control problems characterized by growing levels of complexity. In safety-critical environments, deploying these methods requires rigorous guarantees, a need that has motivated much recent work at the interface of statistical learning and control. However, many existing approaches achieve this goal at the cost of sacrificing valuable data for testing and calibration, or by constraining the choice of learning algorithm, thus leading to suboptimal performance. In this paper, we describe Pick-to-Learn (P2L) for Systems and Control, a framework that allows any data-driven control method to be equipped with state-of-the-art safety and performance guarantees. P2L enables the use of all available data to jointly synthesize and certify the design, eliminating the need to set aside data for calibration or validation purposes. In presenting a comprehensive version of P2L for systems and control, this paper demonstrates its effectiveness across a range of core problems, including optimal control, reachability analysis, safe synthesis, and robust control. In many of these applications, P2L delivers designs and certificates that outperform commonly employed methods, and shows strong potential for broad applicability in diverse practical settings.

[LG-13] Complementary Characterization of Agent-Based Models via Computational Mechanics and Diffusion Models

Link: https://arxiv.org/abs/2512.04771
Authors: Roberto Garrone
Categories: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
Comments: 11 pages. Methods paper introducing a dual-domain framework for analyzing ABM dynamics. Companion temporal-analysis preprint: arXiv:2510.12729

Abstract:This article extends the preprint "Characterizing Agent-Based Model Dynamics via $\epsilon$-Machines and Kolmogorov-Style Complexity" by introducing diffusion models as orthogonal and complementary tools for characterizing the output of agent-based models (ABMs). Where $\epsilon$-machines capture the predictive temporal structure and intrinsic computation of ABM-generated time series, diffusion models characterize high-dimensional cross-sectional distributions, learn underlying data manifolds, and enable synthetic generation of plausible population-level outcomes. We provide a formal analysis demonstrating that the two approaches operate on distinct mathematical domains (processes vs. distributions) and show that their combination yields a two-axis representation of ABM behavior based on temporal organization and distributional geometry. To our knowledge, this is the first framework to integrate computational mechanics with score-based generative modeling for the structural analysis of ABM outputs, thereby situating ABM characterization within the broader landscape of modern machine-learning methods for density estimation and intrinsic computation. The framework is validated using the same elder-caregiver ABM dataset introduced in the companion paper, and we provide precise definitions and propositions formalizing the mathematical complementarity between $\epsilon$-machines and diffusion models. This establishes a principled methodology for jointly analyzing temporal predictability and high-dimensional distributional structure in complex simulation models.

[LG-14] RLHFSpec: Breaking the Efficiency Bottleneck in RLHF Training via Adaptive Drafting

Link: https://arxiv.org/abs/2512.04752
Authors: Siqi Wang,Hailong Yang,Junjie Zhu,Xuezhu Wang,Yufan Xu,Depei Qian
Categories: Machine Learning (cs.LG)

Abstract:Reinforcement Learning from Human Feedback (RLHF) is an important fine-tuning technique for large language models (LLMs) and comprises three stages: generation, inference, and training. The generation stage generates samples that are then used to infer learnable experiences for training. We observe that the generation stage is the bottleneck of the entire execution process and consider it a key point for optimization. Specifically, we make the first attempt to integrate speculative decoding into the RLHF generation stage and propose RLHFSpec, an RLHF system that accelerates generation execution with adaptive speculative decoding and sample reallocation. To fully exploit the performance potential provided by speculative decoding, especially when dealing with the dynamic workload of the generation stage, RLHFSpec proposes a workload-aware drafting strategy selection mechanism, which selects the near-optimal strategy by jointly considering the verification cost and the number of accepted tokens. Moreover, RLHFSpec also proposes sample reallocation to fully utilize the GPU resources, and optimizes it with an efficient sample migration mechanism. The experimental results show that RLHFSpec can achieve higher throughput in the generation stage compared to state-of-the-art works. Moreover, due to the effective alleviation of the generation bottleneck, RLHFSpec also shows significant performance speedup in the entire RLHF execution.
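One simple way to combine verification cost and accepted tokens is to rank drafting strategies by accepted tokens per unit of verification time. The sketch below does exactly that. It is an assumption about how such a trade-off could be scored, not RLHFSpec's actual selection mechanism, and the strategy names and numbers are invented for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class DraftStrategy:
    name: str
    expected_accepted_tokens: float  # estimated tokens accepted per verification step
    verification_cost_ms: float      # estimated cost of one verification step


def select_strategy(candidates: Sequence[DraftStrategy]) -> DraftStrategy:
    """Pick the candidate with the most accepted tokens per millisecond of verification."""
    return max(
        candidates,
        key=lambda s: s.expected_accepted_tokens / s.verification_cost_ms,
    )


# Hypothetical estimates for three drafting strategies under the current workload.
candidates = [
    DraftStrategy("short-draft", 2.0, 10.0),
    DraftStrategy("medium-draft", 3.5, 14.0),
    DraftStrategy("long-draft", 4.0, 25.0),
]
chosen = select_strategy(candidates)
```

A longer draft accepts more tokens per step but costs more to verify, so the best choice shifts as the workload changes the cost estimates.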

[LG-15] A Tutorial on Regression Analysis: From Linear Models to Deep Learning – Lecture Notes on Artificial Intelligence

Link: https://arxiv.org/abs/2512.04747
Authors: Jingyuan Wang,Jiahao Ji
Categories: Machine Learning (cs.LG)

Abstract:This article serves as the regression analysis lecture notes in the Intelligent Computing course cluster (including the courses of Artificial Intelligence, Data Mining, Machine Learning, and Pattern Recognition). It aims to provide students – who are assumed to possess only basic university-level mathematics (i.e., with prerequisite courses in calculus, linear algebra, and probability theory) – with a comprehensive and self-contained understanding of regression analysis without requiring any additional references. The lecture notes systematically introduce the fundamental concepts, modeling components, and theoretical foundations of regression analysis, covering linear regression, logistic regression, multinomial logistic regression, polynomial regression, basis-function models, kernel-based methods, and neural-network-based nonlinear regression. Core methodological topics include loss-function design, parameter-estimation principles, ordinary least squares, gradient-based optimization algorithms and their variants, as well as regularization techniques such as Ridge and LASSO regression. Through detailed mathematical derivations, illustrative examples, and intuitive visual explanations, the materials help students understand not only how regression models are constructed and optimized, but also how they reveal the underlying relationships between features and response variables. By bridging classical statistical modeling and modern machine-learning practice, these lecture notes aim to equip students with a solid conceptual and technical foundation for further study in advanced artificial intelligence models.
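As a small illustration of two topics listed in the abstract, ordinary least squares and Ridge regularization, the sketch below fits a one-dimensional linear model without an intercept. Minimizing $\sum_i (y_i - w x_i)^2 + \lambda w^2$ gives the closed form $w = \sum_i x_i y_i / (\sum_i x_i^2 + \lambda)$. The function name `fit_slope` and the data are illustrative and not taken from the lecture notes.

```python
from typing import Sequence


def fit_slope(xs: Sequence[float], ys: Sequence[float], ridge_lambda: float = 0.0) -> float:
    """Closed-form slope for y ~ w * x; ridge_lambda = 0 recovers ordinary least squares."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + ridge_lambda)


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # generated by y = 2x with no noise

w_ols = fit_slope(xs, ys)                       # unregularized fit
w_ridge = fit_slope(xs, ys, ridge_lambda=14.0)  # penalty shrinks the slope toward zero
```

The Ridge penalty adds $\lambda$ to the denominator, so a larger $\lambda$ always shrinks the estimated slope toward zero.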

[LG-16] Towards Continuous-Time Approximations for Stochastic Gradient Descent without Replacement

Link: https://arxiv.org/abs/2512.04703
Authors: Stefan Perko
Categories: Machine Learning (cs.LG); Probability (math.PR)

Abstract:Gradient optimization algorithms using epochs, that is those based on stochastic gradient descent without replacement (SGDo), are predominantly used to train machine learning models in practice. However, the mathematical theory of SGDo and related algorithms remains underexplored compared to their "with replacement" and "one-pass" counterparts. In this article, we propose a stochastic, continuous-time approximation to SGDo with additive noise based on a Young differential equation driven by a stochastic process we call an "epoched Brownian motion". We show its usefulness by proving the almost sure convergence of the continuous-time approximation for strongly convex objectives and learning rate schedules of the form $u_t = \frac{1}{(1+t)^\beta}$, $\beta \in (0,1)$. Moreover, we compute an upper bound on the asymptotic rate of almost sure convergence, which is as good as or better than previous results for SGDo.
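For readers less familiar with the discrete algorithm being approximated, here is a minimal sketch of SGD without replacement: every epoch visits each data point exactly once, in a freshly shuffled order. The toy objective, the use of the iteration counter in place of continuous time in the step-size schedule, and the function names are assumptions for illustration. The paper's continuous-time analysis is not reproduced here.

```python
import random
from typing import List, Sequence


def epoch_order(n: int, rng: random.Random) -> List[int]:
    """Return a fresh random permutation of 0..n-1 (sampling without replacement)."""
    order = list(range(n))
    rng.shuffle(order)
    return order


def sgd_without_replacement(
    data: Sequence[float], epochs: int, beta: float = 0.75, seed: int = 0
) -> float:
    """Minimize the average of 0.5 * (x - a_i)**2 with one shuffled pass per epoch."""
    rng = random.Random(seed)
    x, k = 0.0, 0
    for _ in range(epochs):
        for i in epoch_order(len(data), rng):
            step = 1.0 / (1.0 + k) ** beta  # discrete analogue of u_t = 1 / (1 + t)^beta
            x -= step * (x - data[i])       # gradient of 0.5 * (x - a_i)^2 is (x - a_i)
            k += 1
    return x


# The minimizer of this strongly convex toy objective is the data mean, 2.5.
estimate = sgd_without_replacement([1.0, 2.0, 3.0, 4.0], epochs=200)
```

Because each epoch uses every sample once, the per-sample deviations partly cancel within an epoch, which is one intuition for why the without-replacement variant behaves differently from sampling with replacement.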

[LG-17] TRINITY: An Evolved LLM Coordinator

Link: https://arxiv.org/abs/2512.04695
Authors: Jinglue Xu,Qi Sun,Peter Schwendeman,Stefan Nielsen,Edoardo Cetin,Yujin Tang
Categories: Machine Learning (cs.LG)

Abstract:Combining diverse foundation models is promising, but weight-merging is limited by mismatched architectures and closed APIs. Trinity addresses this with a lightweight coordinator that orchestrates collaboration among large language models (LLMs). The coordinator, comprising a compact language model (approximately 0.6 B parameters) and a lightweight head (approximately 10 K parameters), is optimized with an evolutionary strategy for efficient and adaptive delegation. Trinity processes queries over multiple turns, where at each turn the coordinator assigns one of three roles (Thinker, Worker, or Verifier) to a selected LLM, effectively offloading complex skill acquisition from the coordinator itself. Experiments show that Trinity consistently outperforms individual models and existing methods across coding, math, reasoning, and domain knowledge tasks, and generalizes robustly to out-of-distribution tasks. On standard benchmarks, Trinity achieves state-of-the-art results, including a score of 86.2% on LiveCodeBench. Theoretical and empirical analyses identify two main factors behind this performance: (1) the coordinator’s hidden-state representations provide rich contextualization of inputs, and (2) under high dimensionality and strict budget constraints, the separable Covariance Matrix Adaptation Evolution Strategy offers advantages over reinforcement learning, imitation learning, and random search by exploiting potential block-epsilon-separability.

[LG-18] Contract-Governed Training for Earth Observation: Observed Service Agreement Graphs and Coverage-Accuracy Trade-offs

Link: https://arxiv.org/abs/2512.04644
Authors: Wenzhang Du
Categories: Machine Learning (cs.LG)
Comments: 9 pages, 2 figures

Abstract:Earth observation (EO) models are frequently trained under implicit sampling policies that optimize global accuracy but provide no explicit guarantees on who (which regions, classes, or mission-critical strata) is being served throughout training. This paper introduces a contract-governed training paradigm for EO in which training samples are grouped into service contracts – semantically meaningful units such as (dataset, region, rare-crop indicator) – and each contract is assigned a target service share. We instantiate this paradigm as an Observed Service Agreement Graph (OSAG), a lightweight governance layer that (i) monitors contract-level exposure (coverage) during optimization, (ii) drives empirical coverage toward target shares via contract-normalized sampling weights, and (iii) exposes explicit accuracy-governance trade-offs through two knobs: a sampling mixture coefficient alpha and a contract-regularization weight lambda_C. We provide a compact theory in a toy setting: OSAG sampling concentrates empirical coverage to targets; coverage deviations upper-bound service-risk deviations; and contract design (coarse vs. fine) modulates governance cost. Experiments on AVIRIS hyperspectral scenes (Indian Pines plus Salinas) and multispectral Sentinel-2 EuroSAT demonstrate that OSAG can substantially reduce priority coverage error while maintaining global accuracy and improving high-priority accuracy. A EuroSAT coarse-vs-fine contract ablation further evidences how semantically refined contracts can reduce the accuracy cost per unit of governance improvement.
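The sampling knob can be illustrated with a small sketch. It assumes that "contract-normalized" means each sample in a contract receives that contract's target share divided by the contract's size, and that the mixture coefficient alpha interpolates between uniform sampling and this target-matching distribution. Both assumptions, and the function and contract names, are hypothetical rather than taken from the paper.

```python
from collections import Counter
from typing import Dict, List, Sequence


def sampling_probs(
    contract_of_sample: Sequence[str],
    target_share: Dict[str, float],
    alpha: float,
) -> List[float]:
    """Mix uniform sampling (alpha = 0) with contract-normalized sampling (alpha = 1)."""
    n = len(contract_of_sample)
    sizes = Counter(contract_of_sample)
    return [
        (1.0 - alpha) / n + alpha * target_share[c] / sizes[c]
        for c in contract_of_sample
    ]


# Hypothetical dataset: 8 common samples, 2 rare ones, and equal target shares.
contracts = ["common"] * 8 + ["rare"] * 2
targets = {"common": 0.5, "rare": 0.5}

probs = sampling_probs(contracts, targets, alpha=0.5)
rare_coverage = sum(p for p, c in zip(probs, contracts) if c == "rare")
```

With alpha = 0 the rare contract receives its natural 20% exposure, with alpha = 1 it receives its full 50% target, and intermediate values trade off between the two.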

[LG-19] Federated Learning for Anomaly Detection in Maritime Movement Data MDM2024

Link: https://arxiv.org/abs/2512.04635
Authors: Anita Graser,Axel Weißenfeld,Clemens Heistracher,Melitta Dragaschnig,Peter Widhalm
Categories: Machine Learning (cs.LG)
Comments: Accepted at MDM2024

Abstract:This paper introduces M3fed, a novel solution for federated learning of movement anomaly detection models. This innovation has the potential to improve data privacy and reduce communication costs in machine learning for movement anomaly detection. We present the novel federated learning (FL) strategies employed to train M3fed, perform an example experiment with maritime AIS data, and evaluate the results with respect to communication costs and FL model quality by comparing classic centralized M3 and the new federated M3fed.

[LG-20] Score Matching for Estimating Finite Point Processes

Link: https://arxiv.org/abs/2512.04617
Authors: Haoqun Cao,Yixuan Zhang,Feng Zhou
Categories: Machine Learning (cs.LG)

Abstract:Score matching estimators have garnered significant attention in recent years because they eliminate the need to compute normalizing constants, thereby mitigating the computational challenges associated with maximum likelihood estimation (MLE). While several studies have proposed score matching estimators for point processes, this work highlights the limitations of these existing methods, which stem primarily from the lack of a mathematically rigorous analysis of how score matching behaves on finite point processes – special random configurations on bounded spaces where many of the usual assumptions and properties of score matching no longer hold. To this end, we develop a formal framework for score matching on finite point processes via Janossy measures and, within this framework, introduce an (autoregressive) weighted score-matching estimator, whose statistical properties we analyze in classical parametric settings. For general nonparametric (e.g., deep) point process models, we show that score matching alone does not uniquely identify the ground-truth distribution due to subtle normalization issues, and we propose a simple survival-classification augmentation that yields a complete, integration-free training objective for any intensity-based point process model in the spatio-temporal case. Experiments on synthetic and real-world temporal and spatio-temporal datasets demonstrate that our method accurately recovers intensities and achieves performance comparable to MLE with better efficiency.

[LG-21] QoSDiff: An Implicit Topological Embedding Learning Framework Leveraging Denoising Diffusion and Adversarial Attention for Robust QoS Prediction

Link: https://arxiv.org/abs/2512.04596
Authors: Guanchen Du,Jianlong Xu,Wei Wei
Categories: Machine Learning (cs.LG)
Comments: Preprint submitted to IEEE Transactions on Services Computing

Abstract:Accurate Quality of Service (QoS) prediction is fundamental to service computing, providing essential data-driven guidance for service selection and ensuring superior user experiences. However, prevalent approaches, particularly Graph Neural Networks (GNNs), heavily rely on constructing explicit user–service interaction graphs. This dependency introduces severe scalability bottlenecks and limits performance when explicit connections are sparse or corrupted by noise. To address these challenges, this paper introduces QoSDiff, a novel embedding learning framework that bypasses the prerequisite of explicit graph construction. Specifically, it leverages a denoising diffusion probabilistic model to recover intrinsic latent structures from noisy initializations. To further capture high-order interactions, we propose an adversarial interaction module that integrates a bidirectional hybrid attention mechanism. This adversarial paradigm dynamically distinguishes informative patterns from noise, enabling a dual-perspective modeling of intricate user–service associations. Extensive experiments on two large-scale real-world datasets demonstrate that QoSDiff significantly outperforms state-of-the-art baselines. Notably, the results highlight the framework’s superior cross-dataset generalization capability and exceptional robustness against data sparsity and observational noise.

[LG-22] Exploiting ftrace's function_graph Tracer Features for Machine Learning: A Case Study on Encryption Detection CCS

Link: https://arxiv.org/abs/2512.04590
Authors: Kenan Begovic,Abdulaziz Al-Ali,Qutaibah Malluhi
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: Conference paper presented at AICCSA 2025

Abstract:This paper proposes using the Linux kernel ftrace framework, particularly the function graph tracer, to generate informative system level data for machine learning (ML) applications. Experiments on a real world encryption detection task demonstrate the efficacy of the proposed features across several learning algorithms. The learner faces the problem of detecting encryption activities across a large dataset of files, using function call traces and graph based features. Empirical results highlight an outstanding accuracy of 99.28 on the task at hand, underscoring the efficacy of features derived from the function graph tracer. The results were further validated in an additional experiment targeting a multilabel classification problem, in which running programs were identified from trace data. This work provides comprehensive methodologies for preprocessing raw trace data and extracting graph based features, offering significant advancements in applying ML to system behavior analysis, program identification, and anomaly detection. By bridging the gap between system tracing and ML, this paper paves the way for innovative solutions in performance monitoring and security analytics.

[LG-23] Temp-SCONE: A Novel Out-of-Distribution Detection and Domain Generalization Framework for Wild Data with Temporal Shift

Link: https://arxiv.org/abs/2512.04571
Authors: Aditi Naiknaware, Sanchit Singh, Hajar Homayouni, Salimeh Sekeh
Categories: Machine Learning (cs.LG)
Comments: 22 pages, 12 figures, 72 subfigures, 6 tables

Abstract:Open-world learning (OWL) requires models that can adapt to evolving environments while reliably detecting out-of-distribution (OOD) inputs. Existing approaches, such as SCONE, achieve robustness to covariate and semantic shifts but assume static environments, leading to degraded performance in dynamic domains. In this paper, we propose Temp-SCONE, a temporally consistent extension of SCONE designed to handle temporal shifts in dynamic environments. Temp-SCONE introduces a confidence-driven regularization loss based on Average Thresholded Confidence (ATC), penalizing instability in predictions across time steps while preserving SCONE’s energy-margin separation. Experiments on dynamic datasets demonstrate that Temp-SCONE significantly improves robustness under temporal drift, yielding higher corrupted-data accuracy and more reliable OOD detection compared to SCONE. On distinct datasets without temporal continuity, Temp-SCONE maintains comparable performance, highlighting the importance and limitations of temporal regularization. Our theoretical insights on temporal stability and generalization error further establish Temp-SCONE as a step toward reliable OWL in evolving dynamic environments.
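
To make the idea of an ATC-based stability penalty concrete, here is a minimal sketch. It computes Average Thresholded Confidence as the share of predictions whose confidence exceeds a threshold, then sums the absolute change between consecutive time steps. The function names and the exact form of the penalty are assumptions for illustration; the abstract does not give the loss, and a trainable version would need a smooth surrogate for the hard threshold.

```python
import numpy as np

def atc(confidences, threshold):
    """Average Thresholded Confidence: share of predictions above the threshold."""
    return float(np.mean(np.asarray(confidences, dtype=float) > threshold))

def temporal_atc_penalty(conf_by_step, threshold):
    """Sum of absolute ATC changes between consecutive time steps.

    Hypothetical penalty: it is zero when ATC is constant over time and
    grows when the confidence profile drifts from one step to the next.
    """
    scores = [atc(step, threshold) for step in conf_by_step]
    return float(sum(abs(b - a) for a, b in zip(scores, scores[1:])))

if __name__ == "__main__":
    steps = [[0.9, 0.9], [0.9, 0.1]]  # confidences at two time steps
    print(temporal_atc_penalty(steps, threshold=0.5))
```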

[LG-24] Reliable Statistical Guarantees for Conformal Predictors with Small Datasets

Link: https://arxiv.org/abs/2512.04566
Authors: Miguel Sánchez-Domínguez, Lucas Lacasa, Javier de Vicente, Gonzalo Rubio, Eusebio Valero
Categories: Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (stat.ML)

Abstract:Surrogate models (including deep neural networks and other machine learning algorithms in supervised learning) are capable of approximating arbitrarily complex, high-dimensional input-output problems in science and engineering, but require a thorough data-agnostic uncertainty quantification analysis before these can be deployed for any safety-critical application. The standard approach for data-agnostic uncertainty quantification is to use conformal prediction (CP), a well-established framework to build uncertainty models with proven statistical guarantees that do not assume any shape for the error distribution of the surrogate model. However, since the classic statistical guarantee offered by CP is given in terms of bounds for the marginal coverage, for small calibration set sizes (which are frequent in realistic surrogate modelling that aims to quantify error at different regions), the potentially strong dispersion of the coverage distribution around its average negatively impacts the reliability of the uncertainty model, often obtaining coverages below the expected value, resulting in a less applicable framework. After providing a gentle presentation of uncertainty quantification for surrogate models for machine learning practitioners, in this paper we bridge the gap by proposing a new statistical guarantee that offers probabilistic information for the coverage of a single conformal predictor. We show that the proposed framework converges to the standard solution offered by CP for large calibration set sizes and, unlike the classic guarantee, still offers reliable information about the coverage of a conformal predictor for small data sizes. We illustrate and validate the methodology in a suite of examples, and implement an open access software solution that can be used alongside common conformal prediction libraries to obtain uncertainty models that fulfil the new guarantee.
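
The dispersion of coverage for small calibration sets can be quantified with a standard result for split conformal prediction, sketched below. Assuming exchangeable data and continuous scores, the coverage of one fitted predictor is distributed as the k-th order statistic of n uniform variables, with k = ceil((n + 1)(1 - alpha)). This is the textbook result, not the new guarantee proposed in the paper, and the function names are hypothetical.

```python
import math

def conformal_rank(n, alpha):
    """Rank k of the calibration score used as the split-conformal threshold."""
    k = math.ceil((n + 1) * (1 - alpha) - 1e-12)  # small guard against float round-off
    if k > n:
        raise ValueError("calibration set too small for this alpha")
    return k

def prob_coverage_below(n, alpha, level):
    """P(coverage <= level) for one predictor calibrated on n points.

    The coverage equals the k-th order statistic of n uniforms, which is
    <= level exactly when at least k of the n uniforms are <= level.
    """
    k = conformal_rank(n, alpha)
    return sum(
        math.comb(n, j) * level**j * (1 - level) ** (n - j)
        for j in range(k, n + 1)
    )

if __name__ == "__main__":
    for n in (20, 100, 1000):
        print(n, prob_coverage_below(n, alpha=0.1, level=0.85))
```

With a nominal 90% target, the probability that a single predictor covers at most 85% is sizeable for a few dozen calibration points and negligible for a thousand, which is the small-sample effect the abstract describes.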

[LG-25] LeMat-GenBench: A Unified Evaluation Framework for Crystal Generative Models

Link: https://arxiv.org/abs/2512.04562
Authors: Siddharth Betala, Samuel P. Gleason, Ali Ramlaoui, Andy Xu, Georgia Channing, Daniel Levy, Clémentine Fourrier, Nikita Kazeev, Chaitanya K. Joshi, Sékou-Oumar Kaba, Félix Therrien, Alex Hernandez-Garcia, Rocío Mercado, N. M. Anoop Krishnan, Alexandre Duval
Categories: Machine Learning (cs.LG)
Comments: 46 pages, 17 figures, 16 tables

Abstract:Generative machine learning (ML) models hold great promise for accelerating materials discovery through the inverse design of inorganic crystals, enabling an unprecedented exploration of chemical space. Yet, the lack of standardized evaluation frameworks makes it challenging to evaluate, compare, and further develop these ML models meaningfully. In this work, we introduce LeMat-GenBench, a unified benchmark for generative models of crystalline materials, supported by a set of evaluation metrics designed to better inform model development and downstream applications. We release both an open-source evaluation suite and a public leaderboard on Hugging Face, and benchmark 12 recent generative models. Results reveal that an increase in stability leads to a decrease in novelty and diversity on average, with no model excelling across all dimensions. Altogether, LeMat-GenBench establishes a reproducible and extensible foundation for fair model comparison and aims to guide the development of more reliable, discovery-oriented generative models for crystalline materials.

[LG-26] On the Limits of Test-Time Compute: Sequential Reward Filtering for Better Inference

Link: https://arxiv.org/abs/2512.04558
Authors: Yue Yu, Qiwei Di, Quanquan Gu, Dongruo Zhou
Categories: Machine Learning (cs.LG)
Comments: 45 pages, 6 figures, 3 tables

Abstract:Test-time compute (TTC) has become an increasingly prominent paradigm for enhancing large language models (LLMs). Despite the empirical success of methods such as best-of-n (BoN) sampling and sequential revision, their fundamental limits remain unclear. We address this gap by analyzing a mixture-of-reference policy model and proving that standard BoN is inherently suboptimal. To move closer to the optimal frontier, we study reward-filtered sequential inference, a simple procedure that selectively incorporates only high-reward generations into the context. This mechanism concentrates computation on superior policy candidates and suppresses inferior ones. On the theoretical side, we show that reward-filtered sequential inference yields strictly stronger guarantees than standard TTC paradigms. On the empirical side, we evaluate such an inference strategy across diverse benchmarks and observe consistent improvements over widely used approaches, demonstrating the practical effectiveness of our framework.
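
The contrast between best-of-n sampling and reward-filtered sequential inference can be sketched as follows. The callables `generate(prompt, context)` and `reward(candidate)` are hypothetical stand-ins for a language model and a reward model, and the fixed `threshold` filter is an assumption; the abstract only says that high-reward generations are selectively added to the context.

```python
import random

def best_of_n(generate, reward, prompt, n):
    """Draw n independent candidates and return the one with the highest reward."""
    candidates = [generate(prompt, context=[]) for _ in range(n)]
    return max(candidates, key=reward)

def reward_filtered_sequential(generate, reward, prompt, n, threshold):
    """Generate n candidates in sequence; only high-reward ones enter the context."""
    context, best = [], None
    for _ in range(n):
        candidate = generate(prompt, context=context)
        if best is None or reward(candidate) > reward(best):
            best = candidate
        if reward(candidate) >= threshold:
            context.append(candidate)  # low-reward generations are filtered out
    return best, context

def make_toy_generator(seed):
    """Toy model: a candidate is a number near the best value seen in the context."""
    rng = random.Random(seed)
    def generate(prompt, context):
        base = max(context) if context else 0.0
        return base + rng.gauss(0.0, 1.0)
    return generate

if __name__ == "__main__":
    best, kept = reward_filtered_sequential(
        make_toy_generator(0), float, "question", n=8, threshold=0.5
    )
    print(best, kept)
```

In the toy setup the reward is the candidate value itself, so conditioning on kept candidates lets later draws start from a better base, whereas best-of-n always samples from the same starting point.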

[LG-27] Explainable Graph Representation Learning via Graph Pattern Analysis IJCAI-25

Link: https://arxiv.org/abs/2512.04530
Authors: Xudong Wang, Ziheng Sun, Chris Ding, Jicong Fan
Categories: Machine Learning (cs.LG)
Comments: Full version with appendix of the paper published in the Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence (IJCAI-25), Main Track

Abstract:Explainable artificial intelligence (XAI) is an important area in the AI community, and interpretability is crucial for building robust and trustworthy AI models. While previous work has explored model-level and instance-level explainable graph learning, there has been limited investigation into explainable graph representation learning. In this paper, we focus on representation-level explainable graph learning and ask a fundamental question: What specific information about a graph is captured in graph representations? Our approach is inspired by graph kernels, which evaluate graph similarities by counting substructures within specific graph patterns. Although the pattern counting vector can serve as an explainable representation, it has limitations such as ignoring node features and being high-dimensional. To address these limitations, we introduce a framework (PXGL-GNN) for learning and explaining graph representations through graph pattern analysis. We start by sampling graph substructures of various patterns. Then, we learn the representations of these patterns and combine them using a weighted sum, where the weights indicate the importance of each graph pattern’s contribution. We also provide theoretical analyses of our methods, including robustness and generalization. In our experiments, we show how to learn and explain graph representations for real-world data using pattern analysis. Additionally, we compare our method against multiple baselines in both supervised and unsupervised learning tasks to demonstrate its effectiveness.

[LG-28] Context-Aware Mixture-of-Experts Inference on CXL-Enabled GPU-NDP Systems

Link: https://arxiv.org/abs/2512.04476
Authors: Zehao Fan, Zhenyu Liu, Yunzhen Liu, Yayue Hou, Hadjer Benmeziane, Kaoutar El Maghraoui, Liu Liu
Categories: Machine Learning (cs.LG); Hardware Architecture (cs.AR)

Abstract:Mixture-of-Experts (MoE) models scale large language models through conditional computation, but inference becomes memory-bound once expert weights exceed the capacity of GPU memory. In this case, weights must be offloaded to external memory, and fetching them incurs costly and repeated transfers. We address this by adopting CXL-attached near-data processing (CXL-NDP) as the offloading tier to execute cold experts in place, converting expensive parameter movement into cheaper activation movement. Unlike prior GPU-NDP systems that are largely context-agnostic and reactive, we develop a context-aware MoE system that uses prefill-stage activation statistics to guide decoding-stage expert placement, dynamically pins hot experts in GPU-side HBM, and maps the remainder to CXL-NDP. To meet NDP’s limited compute throughput, we introduce context-aware mixed-precision quantization that allocates per-expert bitwidths (1-4 bit) based on prefill stage. The resulting MoE inference system overlaps GPU and NDP execution while minimizing cross-device movement. The evaluation on the GPU-NDP system shows that our approach achieves up to an 8.7-fold decoding throughput improvement over the state-of-the-art method, while incurring only a 0.13% average accuracy drop.
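
One simple reading of context-aware placement is sketched below: count how often each expert is routed to during prefill, pin the most frequent ones to GPU memory, and map the rest to the NDP tier. The function name, the `gpu_slots` budget, and the frequency-based ranking are assumptions for illustration; the abstract does not describe the actual placement policy, and mixed-precision bit-width allocation is not modeled here.

```python
from collections import Counter

def place_experts(prefill_routing, num_experts, gpu_slots):
    """Pin the most frequently routed experts to the GPU; map the rest to NDP.

    prefill_routing: one list of selected expert ids per prefill token.
    """
    counts = Counter(e for token_experts in prefill_routing for e in token_experts)
    # Rank by descending prefill frequency, breaking ties by expert id.
    ranked = sorted(range(num_experts), key=lambda e: (-counts[e], e))
    hot = set(ranked[:gpu_slots])
    return {e: ("gpu" if e in hot else "ndp") for e in range(num_experts)}

if __name__ == "__main__":
    routing = [[0, 1], [0, 2], [0, 1]]  # top-2 routing for three prefill tokens
    print(place_experts(routing, num_experts=4, gpu_slots=2))
```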

[LG-29] Learning to Orchestrate Agents in Natural Language with the Conductor

Link: https://arxiv.org/abs/2512.04388
Authors: Stefan Nielsen, Edoardo Cetin, Peter Schwendeman, Qi Sun, Jinglue Xu, Yujin Tang
Categories: Machine Learning (cs.LG)

Abstract:Powerful large language models (LLMs) from different providers have been expensively trained and finetuned to specialize across varying domains. In this work, we introduce a new kind of Conductor model trained with reinforcement learning to automatically discover powerful coordination strategies among LLMs. Our Conductor learns not only to design targeted communication topologies for effective agent-to-agent collaboration, but also to prompt engineer focused instructions to the LLMs to maximally leverage their individual capabilities. We show that, by learning optimal coordination strategies over pools of powerful worker LLMs, a 7B Conductor achieves significant performance gains beyond any individual worker, attaining state-of-the-art results in challenging reasoning benchmarks, such as LiveCodeBench and GPQA. By training with randomized agent pools, our conductor effectively adapts to arbitrary sets of open- and closed-source agents, meeting any user requirements. Furthermore, allowing the Conductor to select itself as a worker gives rise to recursive topologies, elevating performance with a new form of dynamic test-time scaling through online iterative adaptation. More broadly, ours is among the early work demonstrating language model coordination can be unlocked through RL, where powerful coordination strategies emerge naturally in LLMs through pure end-to-end reward maximization.

[LG-30] SmartAlert: Implementing Machine Learning-Driven Clinical Decision Support for Inpatient Lab Utilization Reduction

Link: https://arxiv.org/abs/2512.04354
Authors: April S. Liang (1), Fatemeh Amrollahi (2), Yixing Jiang (2), Conor K. Corbin (3), Grace Y.E. Kim (4), David Mui (5), Trevor Crowell (6), Aakash Acharya (7), Sreedevi Mony (7), Soumya Punnathanam (7), Jack McKeown (7), Margaret Smith (6 and 8), Steven Lin (6 and 8), Arnold Milstein (8 and 9), Kevin Schulman (1 and 9), Jason Hom (1), Michael A. Pfeffer (1 and 7), Tho D. Pham (10), David Svec (1 and 11), Weihan Chu (1 and 11), Lisa Shieh (1), Christopher Sharp (8), Stephen P. Ma (1), Jonathan H. Chen (1, 2, 9, and 12) ((1) Division of Hospital Medicine, Department of Medicine, Stanford University School of Medicine, Palo Alto, CA, (2) Department of Biomedical Data Science, Stanford University, Palo Alto, CA, (3) SmarterDx, New York, NY, (4) The Johns Hopkins University School of Medicine, Baltimore, MD, (5) Department of Medicine, Stanford University School of Medicine, Palo Alto, CA, (6) Stanford Healthcare AI Applied Research Team, Stanford University School of Medicine, Palo Alto, CA, (7) Technology and Digital Solutions, Stanford Health Care, Palo Alto, CA, (8) Division of Primary Care and Population Health, Department of Medicine, Stanford University School of Medicine, Palo Alto, CA, (9) Clinical Excellence Research Center, Stanford University, Palo Alto, CA, (10) Department of Pathology, Stanford University School of Medicine, Palo Alto, CA, (11) Stanford Health Care Tri-Valley, Pleasanton, CA, (12) Stanford Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA)
Categories: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
Comments: 22 pages, 5 figures

Abstract:Repetitive laboratory testing unlikely to yield clinically useful information is a common practice that burdens patients and increases healthcare costs. Education and feedback interventions have limited success, while general test ordering restrictions and electronic alerts impede appropriate clinical care. We introduce and evaluate SmartAlert, a machine learning (ML)-driven clinical decision support (CDS) system integrated into the electronic health record that predicts stable laboratory results to reduce unnecessary repeat testing. This case study describes the implementation process, challenges, and lessons learned from deploying SmartAlert targeting complete blood count (CBC) utilization in a randomized controlled pilot across 9270 admissions in eight acute care units across two hospitals between August 15, 2024, and March 15, 2025. Results show a significant decrease in the number of CBC results within 52 hours of SmartAlert display (1.54 vs 1.82, p < 0.01) without adverse effect on secondary safety outcomes, representing a 15% relative reduction in repetitive testing. Implementation lessons learned include interpretation of probabilistic model predictions in clinical contexts, stakeholder engagement to define acceptable model behavior, governance processes for deploying a complex model in a clinical environment, user interface design considerations, alignment with clinical operational priorities, and the value of qualitative feedback from end users. In conclusion, a machine learning-driven CDS system backed by a deliberate implementation and governance process can provide precision guidance on inpatient laboratory testing to safely reduce unnecessary repetitive testing.
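
The reported 15% relative reduction follows from the two group means, as the short check below shows. It assumes that 1.82 is the control-arm mean and 1.54 the SmartAlert-arm mean, which is the natural reading of "1.54 vs 1.82" but is not spelled out in the abstract.

```python
def relative_reduction(control, treatment):
    """Fractional decrease of the treatment mean relative to the control mean."""
    return (control - treatment) / control

if __name__ == "__main__":
    reduction = relative_reduction(control=1.82, treatment=1.54)
    print(f"{reduction:.1%}")  # 15.4%
```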

[LG-31] Distance Is All You Need: Radial Dispersion for Uncertainty Estimation in Large Language Models

Link: https://arxiv.org/abs/2512.04351
Authors: Manh Nguyen, Sunil Gupta, Hung Le
Categories: Machine Learning (cs.LG)

Abstract:Detecting when large language models (LLMs) are uncertain is critical for building reliable systems, yet existing methods are overly complicated, relying on brittle semantic clustering or internal states. We introduce Radial Dispersion Score (RDS), a simple, parameter-free, fully model-agnostic uncertainty metric that measures the radial dispersion of sampled generations in embedding space. A lightweight probability-weighted variant further incorporates the model’s own token probabilities when available, outperforming nine strong baselines. Moreover, RDS naturally extends to per-sample scoring, enabling applications such as best-of-N selection and confidence-based filtering. Across four challenging free-form QA datasets and multiple LLMs, our metrics achieve state-of-the-art hallucination detection and answer selection performance, while remaining robust and scalable with respect to sample size and embedding choice.
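
A dispersion-style uncertainty score over sampled generations might look like the sketch below, which takes the mean Euclidean distance of the sample embeddings from their centroid. This is one plausible reading of "radial dispersion"; the abstract does not give the exact RDS formula, so the definition and the function name here are assumptions, and the embeddings are expected to come from any sentence encoder.

```python
import numpy as np

def radial_dispersion(embeddings):
    """Mean Euclidean distance of sample embeddings from their centroid.

    embeddings: array of shape (num_samples, dim), one row per sampled answer.
    Tightly clustered answers give a score near 0; scattered answers give a
    larger score, which is read as higher uncertainty.
    """
    X = np.asarray(embeddings, dtype=float)
    centroid = X.mean(axis=0)
    return float(np.linalg.norm(X - centroid, axis=1).mean())

if __name__ == "__main__":
    agree = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
    disagree = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    print(radial_dispersion(agree), radial_dispersion(disagree))
```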

[LG-32] Long-Horizon Model-Based Offline Reinforcement Learning Without Conservatism

Link: https://arxiv.org/abs/2512.04341
Authors: Tianwei Ni, Esther Derman, Vineet Jain, Vincent Taboga, Siamak Ravanbakhsh, Pierre-Luc Bacon
Categories: Machine Learning (cs.LG)
Comments: Preprint (52 pages, 15 figures)

Abstract:Popular offline reinforcement learning (RL) methods rely on conservatism, either by penalizing out-of-dataset actions or by restricting planning horizons. In this work, we question the universality of this principle and instead revisit a complementary one: a Bayesian perspective. Rather than enforcing conservatism, the Bayesian approach tackles epistemic uncertainty in offline data by modeling a posterior distribution over plausible world models and training a history-dependent agent to maximize expected rewards, enabling test-time generalization. We first illustrate, in a bandit setting, that Bayesianism excels on low-quality datasets where conservatism fails. We then scale the principle to realistic tasks, identifying key design choices, such as layer normalization in the world model and adaptive long-horizon planning, that mitigate compounding error and value overestimation. These yield our practical algorithm, Neubay, grounded in the neutral Bayesian principle. On D4RL and NeoRL benchmarks, Neubay generally matches or surpasses leading conservative algorithms, achieving new state-of-the-art on 7 datasets. Notably, it succeeds with planning horizons of several hundred steps, challenging common belief. Finally, we characterize when Neubay is preferable to conservatism, laying the foundation for a new direction in offline and model-based RL.

[LG-33] One Detector Fits All: Robust and Adaptive Detection of Malicious Packages from PyPI to Enterprises ACSA

Link: https://arxiv.org/abs/2512.04338
Authors: Biagio Montaruli, Luca Compagna, Serena Elisa Ponta, Davide Balzarotti
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Proceedings of the 2025 Annual Computer Security Applications Conference (ACSAC ’25), December 8-12, 2025, Honolulu, Hawaii, USA

Abstract:The rise of supply chain attacks via malicious Python packages demands robust detection solutions. Current approaches, however, overlook two critical challenges: robustness against adversarial source code transformations and adaptability to the varying false positive rate (FPR) requirements of different actors, from repository maintainers (requiring low FPR) to enterprise security teams (higher FPR tolerance). We introduce a robust detector capable of seamless integration into both public repositories like PyPI and enterprise ecosystems. To ensure robustness, we propose a novel methodology for generating adversarial packages using fine-grained code obfuscation. Combining these with adversarial training (AT) enhances detector robustness by 2.5x. We comprehensively evaluate AT effectiveness by testing our detector against 122,398 packages collected daily from PyPI over 80 days, showing that AT needs careful application: it makes the detector more robust to obfuscations and allows finding 10% more obfuscated packages, but slightly decreases performance on non-obfuscated packages. We demonstrate production adaptability of our detector via two case studies: (i) one for PyPI maintainers (tuned at 0.1% FPR) and (ii) one for enterprise teams (tuned at 10% FPR). In the former, we analyze 91,949 packages collected from PyPI over 37 days, achieving a daily detection rate of 2.48 malicious packages with only 2.18 false positives. In the latter, we analyze 1,596 packages adopted by a multinational software company, obtaining only 1.24 false positives daily. These results show that our detector can be seamlessly integrated into both public repositories like PyPI and enterprise ecosystems, ensuring a very low time budget of a few minutes to review the false positives. Overall, we uncovered 346 malicious packages, now reported to the community. 
Cite as: arXiv:2512.04338 [cs.CR] (https://doi.org/10.48550/arXiv.2512.04338). Submission history: [v1] Wed, 3 Dec 2025 23:53:56 UTC (254 KB), from Biagio Montaruli.
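
Tuning one detector to different false positive rate budgets usually comes down to choosing a score threshold on known-benign packages. The sketch below picks the threshold as a quantile of benign scores; the function names and the quantile rule are assumptions for illustration, not the calibration procedure used in the paper.

```python
import numpy as np

def threshold_for_fpr(benign_scores, target_fpr):
    """Score threshold above which roughly target_fpr of benign samples fall."""
    return float(np.quantile(benign_scores, 1.0 - target_fpr))

def empirical_fpr(benign_scores, threshold):
    """Share of benign samples that would be flagged at this threshold."""
    return float(np.mean(np.asarray(benign_scores) > threshold))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    benign = rng.beta(2, 8, size=10_000)  # hypothetical maliciousness scores
    for budget in (0.001, 0.10):  # strict repository setting vs. tolerant enterprise setting
        t = threshold_for_fpr(benign, budget)
        print(budget, round(t, 3), empirical_fpr(benign, t))
```

The strict budget yields a higher threshold and fewer alerts, while the tolerant budget lowers the threshold and trades more false positives for higher recall.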

[LG-34] Data-regularized Reinforcement Learning for Diffusion Models at Scale

Link: https://arxiv.org/abs/2512.04332
Authors: Haotian Ye, Kaiwen Zheng, Jiashu Xu, Puheng Li, Huayu Chen, Jiaqi Han, Sheng Liu, Qinsheng Zhang, Hanzi Mao, Zekun Hao, Prithvijit Chattopadhyay, Dinghao Yang, Liang Feng, Maosheng Liao, Junjie Bai, Ming-Yu Liu, James Zou, Stefano Ermon
Categories: Machine Learning (cs.LG)

Abstract:Aligning generative diffusion models with human preferences via reinforcement learning (RL) is critical yet challenging. Most existing algorithms are often vulnerable to reward hacking, such as quality degradation, over-stylization, or reduced diversity. Our analysis demonstrates that this can be attributed to the inherent limitations of their regularization, which provides unreliable penalties. We introduce Data-regularized Diffusion Reinforcement Learning (DDRL), a novel framework that uses the forward KL divergence to anchor the policy to an off-policy data distribution. Theoretically, DDRL enables robust, unbiased integration of RL with standard diffusion training. Empirically, this translates into a simple yet effective algorithm that combines reward maximization with diffusion loss minimization. With over a million GPU hours of experiments and ten thousand double-blind human evaluations, we demonstrate on high-resolution video generation tasks that DDRL significantly improves rewards while alleviating the reward hacking seen in baselines, achieving the highest human preference and establishing a robust and scalable paradigm for diffusion post-training.

[LG-35] RNNs perform task computations by dynamically warping neural representations NEURIPS2025

Link: https://arxiv.org/abs/2512.04310
Authors: Arthur Pellegrino, Angus Chadwick
Categories: Machine Learning (cs.LG); Differential Geometry (math.DG); Dynamical Systems (math.DS); Neurons and Cognition (q-bio.NC)
Comments: NeurIPS 2025

Abstract:Analysing how neural networks represent data features in their activations can help interpret how they perform tasks. Hence, a long line of work has focused on mathematically characterising the geometry of such “neural representations.” In parallel, machine learning has seen a surge of interest in understanding how dynamical systems perform computations on time-varying input data. Yet, the link between computation-through-dynamics and representational geometry remains poorly understood. Here, we hypothesise that recurrent neural networks (RNNs) perform computations by dynamically warping their representations of task variables. To test this hypothesis, we develop a Riemannian geometric framework that enables the derivation of the manifold topology and geometry of a dynamical system from the manifold of its inputs. By characterising the time-varying geometry of RNNs, we show that dynamic warping is a fundamental feature of their computations.

[LG-36] When do spectral gradient updates help in deep learning?

Link: https://arxiv.org/abs/2512.04299
Authors: Damek Davis, Dmitriy Drusvyatskiy
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)

Abstract:Spectral gradient methods, such as the recently popularized Muon optimizer, are a promising alternative to standard Euclidean gradient descent for training deep neural networks and transformers, but it is still unclear in which regimes they are expected to perform better. We propose a simple layerwise condition that predicts when a spectral update yields a larger decrease in the loss than a Euclidean gradient step. This condition compares, for each parameter block, the squared nuclear-to-Frobenius ratio of the gradient to the stable rank of the incoming activations. To understand when this condition may be satisfied, we first prove that post-activation matrices have low stable rank at Gaussian initialization in random feature regression, feedforward networks, and transformer blocks. In spiked random feature models we then show that, after a short burn-in, the Euclidean gradient’s nuclear-to-Frobenius ratio grows with the data dimension while the stable rank of the activations remains bounded, so the predicted advantage of spectral updates scales with dimension. We validate these predictions in synthetic regression experiments and in NanoGPT-scale language model training, where we find that intermediate activations have low-stable-rank throughout training and the corresponding gradients maintain large nuclear-to-Frobenius ratios. Together, these results identify conditions for spectral gradient methods, such as Muon, to be effective in training deep networks and transformers.
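
The two quantities in the layerwise condition are easy to compute for a single parameter block, as sketched below. The comparison direction (spectral update favored when the squared nuclear-to-Frobenius ratio of the gradient exceeds the stable rank of the activations) is my reading of the abstract, and the function names are hypothetical.

```python
import numpy as np

def nuclear_to_frobenius_sq(G):
    """Squared ratio of the nuclear norm to the Frobenius norm of a gradient block."""
    return float((np.linalg.norm(G, "nuc") / np.linalg.norm(G, "fro")) ** 2)

def stable_rank(A):
    """Stable rank: squared Frobenius norm divided by squared spectral norm."""
    return float((np.linalg.norm(A, "fro") / np.linalg.norm(A, 2)) ** 2)

def spectral_update_favored(G, A):
    """Assumed form of the condition: gradient ratio exceeds activation stable rank."""
    return nuclear_to_frobenius_sq(G) > stable_rank(A)

if __name__ == "__main__":
    G = np.eye(4)                            # flat spectrum: ratio squared is 4
    A = np.outer(np.ones(4), np.ones(5))     # rank-one activations: stable rank is 1
    print(nuclear_to_frobenius_sq(G), stable_rank(A), spectral_update_favored(G, A))
```

Both quantities lie between 1 and the matrix rank, so the condition is scale-free and can be logged per layer during training.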

[LG-37] GRASP: GRouped Activation Shared Parameterization for Parameter-Efficient Fine-Tuning and Robust Inference of Transformers

Link: https://arxiv.org/abs/2512.04296
Authors: Malyaban Bal, Abhronil Sengupta
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: Under Review

Abstract:Parameter-efficient fine-tuning (PEFT) provides a scalable alternative to full-model adaptation by updating only a small subset of parameters in large pre-trained models. We introduce GRASP (GRouped Activation Shared Parameterization), a lightweight PEFT framework that partitions the D-dimensional token representations of selected layers into K ≪ D groups and learns a shared scaling and shifting vector for each group. This grouped modulation reduces the number of trainable parameters significantly while preserving the ability of the model to learn task-specific features. Building on this formulation, we further propose StochGRASP, which learns Gaussian distributions as perturbations to the pre-trained weights rather than deterministic values. This probabilistic parameterization along with a noise-aware loss function formulation enables modelling hardware-level variability in programmed weights and significantly improves robustness under non-ideal inference conditions, an important requirement for deployment on edge-based emerging AI hardware. Across GLUE (RoBERTa-base and RoBERTa-large) and E2E NLG (GPT-2 Medium), GRASP matches or exceeds the performance of established PEFT methods while achieving an order of magnitude reduction in trainable parameters compared to LoRA and BitFit. Under varying levels of noise, StochGRASP consistently outperforms deterministic variants, demonstrating its suitability for energy-efficient and noise-prone hardware platforms.
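
Grouped modulation of token representations can be sketched in a few lines. The version below assumes contiguous, equally sized groups and one scalar scale and one scalar shift per group, which gives 2K trainable values instead of 2D; the grouping scheme and the function name are assumptions, since the abstract does not specify how dimensions are assigned to groups.

```python
import numpy as np

def grouped_scale_shift(x, scale, shift):
    """Apply one shared scale and shift to each group of feature dimensions.

    x: array of shape (..., D); scale and shift: arrays of shape (K,).
    Dimensions are split into K contiguous groups of size D // K.
    """
    D, K = x.shape[-1], scale.shape[0]
    if D % K != 0:
        raise ValueError("D must be divisible by K in this sketch")
    group_size = D // K
    full_scale = np.repeat(scale, group_size)  # expand K values to D dimensions
    full_shift = np.repeat(shift, group_size)
    return x * full_scale + full_shift

if __name__ == "__main__":
    tokens = np.ones((2, 8))  # two tokens, D = 8
    out = grouped_scale_shift(tokens, np.array([1.0, 2.0]), np.array([0.0, 1.0]))
    print(out)
```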

[LG-38] Polynomiogram: An Integrated Framework for Root Visualization and Generative Art

Link: https://arxiv.org/abs/2512.04263
Authors: Hoang Duc Nguyen, Anh Van Pham, Hien D. Nguyen
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG); Mathematical Software (cs.MS)

Abstract:This work presents the Polynomiogram framework, an integrated computational platform for exploring, visualizing, and generating art from polynomial root systems. The main innovation is a flexible sampling scheme in which two independent parameters are drawn from user-defined domains and mapped to the polynomial coefficients through a generating function. This design allows the same mathematical foundation to support both scientific investigation and generative algorithmic art. The framework integrates two complementary numerical engines: a NumPy companion matrix solver for fast, large-scale computation and MPSolve for high-precision, scientifically rigorous validation. This dual architecture enables efficient visualization for creative use and accurate computation for research and education. Numerical accuracy was verified using classical ensembles, including the Kac and Lucas polynomials. The method was applied to the cubic polynomial system to analyze its bifurcation structure, demonstrating its value as both a scientific tool for exploring root phenomena and an educational aid for visualizing fundamental concepts in algebra and dynamical systems. Beyond analysis, the Polynomiogram also demonstrated its potential as a tool for personalized generative art. Examples include the use of the platform to generate a natural form resembling a hibiscus flower and to create personalized artwork expressing gratitude toward advances in artificial intelligence and large language models through a tribute composition.
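
The sampling scheme described above can be illustrated with `numpy.roots`, which finds roots as eigenvalues of the companion matrix. In this sketch two parameters are drawn uniformly from user-chosen intervals and mapped to coefficients by a generating function; the `cubic` mapping, the uniform sampling, and the function names are hypothetical choices, not the framework's actual API.

```python
import numpy as np

def cubic(a, b):
    """Hypothetical generating function: coefficients of z^3 + a*z + b."""
    return [1.0, 0.0, a, b]

def sample_roots(coeffs_from_params, domain_a, domain_b, n_samples, seed=0):
    """Draw (a, b) uniformly from two intervals and collect all polynomial roots."""
    rng = np.random.default_rng(seed)
    a = rng.uniform(domain_a[0], domain_a[1], size=n_samples)
    b = rng.uniform(domain_b[0], domain_b[1], size=n_samples)
    roots = [np.roots(coeffs_from_params(ai, bi)) for ai, bi in zip(a, b)]
    return np.concatenate(roots)  # complex array, ready for a scatter plot

if __name__ == "__main__":
    points = sample_roots(cubic, (-2.0, 2.0), (-1.0, 1.0), n_samples=1000)
    print(points.shape, points[:3])
```

Plotting the real part against the imaginary part of `points` gives the root density picture; a high-precision solver would be substituted for `np.roots` where validation matters.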

[LG-39] ActVAE: Modelling human activity schedules with a deep conditional generative approach

Link: https://arxiv.org/abs/2512.04223
Authors: Fred Shone, Tim Hillel
Categories: Machine Learning (cs.LG)

Abstract:Modelling the complexity and diversity of human activity scheduling behaviour is inherently challenging. We demonstrate a deep conditional-generative machine learning approach for the modelling of realistic activity schedules depending on input labels such as an individual’s age, employment status, or other information relevant to their scheduling. We combine (i) a structured latent generative approach, with (ii) a conditional approach, through a novel Conditional VAE architecture. This allows for the rapid generation of precise and realistic schedules for different input labels. We extensively evaluate model capabilities using a joint density estimation framework and several case studies. We additionally show that our approach has practical data and computational requirements, and can be deployed within new and existing demand modelling frameworks. We evaluate the importance of generative capability more generally, by comparing our combined approach to (i) a purely generative model without conditionality, and (ii) a purely conditional model which outputs the most likely schedule given the input labels. This comparison highlights the usefulness of explicitly modelling the randomness of complex and diverse human behaviours using deep generative approaches.

[LG-40] Mitigating the Curse of Detail: Scaling Arguments for Feature Learning and Sample Complexity

Link: https://arxiv.org/abs/2512.04165
Authors: Noa Rubin, Orit Davidovich, Zohar Ringel
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Two pressing topics in the theory of deep learning are the interpretation of feature learning mechanisms and the determination of implicit bias of networks in the rich regime. Current theories of rich feature learning effects revolve around networks with one or two trainable layers or deep linear networks. Furthermore, even under such limiting settings, predictions often appear in the form of high-dimensional non-linear equations, which require computationally intensive numerical solutions. Given the many details that go into defining a deep learning problem, this analytical complexity is a significant and often unavoidable challenge. Here, we propose a powerful heuristic route for predicting the data and width scales at which various patterns of feature learning emerge. This form of scale analysis is considerably simpler than such exact theories and reproduces the scaling exponents of various known results. In addition, we make novel predictions on complex toy architectures, such as three-layer non-linear networks and attention heads, thus extending the scope of first-principle theories of deep learning.

[LG-41] MechDetect: Detecting Data-Dependent Errors

Link: https://arxiv.org/abs/2512.04138
Authors: Philipp Jung, Nicholas Chandler, Sebastian Jäger, Felix Biessmann
Subjects: Machine Learning (cs.LG); Databases (cs.DB); Information Retrieval (cs.IR)
Comments: International Conference on Data Science and Intelligent Systems (DSIS 2025)

Abstract:Data quality monitoring is a core challenge in modern information processing systems. While many approaches to detect data errors or shifts have been proposed, few studies investigate the mechanisms governing error generation. We argue that knowing how errors were generated can be key to tracing and fixing them. In this study, we build on existing work in the statistics literature on missing values and propose MechDetect, a simple algorithm to investigate error generation mechanisms. Given a tabular data set and a corresponding error mask, the algorithm estimates whether or not the errors depend on the data using machine learning models. Our work extends established approaches to detect mechanisms underlying missing values and can be readily applied to other error types, provided that an error mask is available. We demonstrate the effectiveness of MechDetect in experiments on established benchmark datasets.
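The question posed in the abstract (does an error mask depend on the data?) can be illustrated with a much simpler stand-in than the paper's algorithm. The sketch below is not MechDetect: instead of machine learning models it uses a single feature's AUC as the predictor of the mask and a permutation test as the null model. The function names `auc` and `dependence_p_value` and the toy data are assumptions made for illustration only.

```python
import random


def auc(scores, labels):
    """Chance that a random erroneous row outscores a random clean row (ties count half)."""
    pos = [s for s, flag in zip(scores, labels) if flag]
    neg = [s for s, flag in zip(scores, labels) if not flag]
    if not pos or not neg:
        raise ValueError("mask needs both erroneous and clean rows")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def dependence_p_value(feature, mask, n_perm=200, seed=0):
    """Permutation p-value for 'the feature predicts the error mask better than chance'."""
    rng = random.Random(seed)
    observed = abs(auc(feature, mask) - 0.5)
    shuffled = list(mask)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between feature and mask
        if abs(auc(feature, shuffled) - 0.5) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy data (hypothetical): one numeric column and two error masks.
rng = random.Random(1)
feature = [rng.gauss(0.0, 1.0) for _ in range(100)]
cutoff = sorted(feature)[70]
mask_dependent = [x > cutoff for x in feature]      # errors only on large values
mask_unrelated = [i % 3 == 0 for i in range(100)]   # errors unrelated to the values

p_dependent = dependence_p_value(feature, mask_dependent)
p_unrelated = dependence_p_value(feature, mask_unrelated)
print(p_dependent, p_unrelated)
```

A small p-value suggests the errors depend on the observed data; a large one is compatible with errors that occur completely at random. A multivariate classifier, as described in the abstract, would replace the single-feature AUC used here.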

[LG-42] Decoding Large Language Diffusion Models with Foreseeing Movement

Link: https://arxiv.org/abs/2512.04135
Authors: Yichuan Mo, Quan Chen, Mingjie Li, Zeming Wei, Yisen Wang
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Large Language Diffusion Models (LLDMs) benefit from a flexible decoding mechanism that enables parallelized inference and controllable generations over autoregressive models. Yet such flexibility introduces a critical challenge: inference performance becomes highly sensitive to the decoding order of tokens. Existing heuristic methods, however, focus mainly on local effects while overlooking long-term impacts. To address this limitation, we propose the Foreseeing Decoding Method (FDM), a novel approach that integrates both local and global considerations to unlock the full potential, employing a search-based strategy to enable effective optimization in discrete spaces. Furthermore, by analyzing the consistency of chosen tokens in the full decoding process, we develop a variant, FDM with Acceleration (FDM-A), which restricts deep exploration to critical steps identified as the exploration and balance circumstances. Extensive experiments across diverse benchmarks and model architectures validate the scalability of FDM and demonstrate the superior efficiency-performance trade-off achieved by FDM-A. Our work may provide a principled step toward more powerful decoding methods for LLDMs.

[LG-43] ASCIIBench: Evaluating Language-Model-Based Understanding of Visually-Oriented Text NEURIPS2025

Link: https://arxiv.org/abs/2512.04125
Authors: Kerry Luo, Michael Fu, Joshua Peguero, Husnain Malik, Anvay Patil, Joyce Lin, Megan Van Overborg, Ryan Sarmiento, Kevin Zhu
Subjects: Machine Learning (cs.LG)
Comments: Accepted to The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025): LLM Evaluation Workshop Multimodal Algorithmic Reasoning Workshop

Abstract:Large language models (LLMs) have demonstrated several emergent behaviors with scale, including reasoning and fluency in long-form text generation. However, they continue to struggle with tasks requiring precise spatial and positional reasoning. ASCII art, a symbolic medium where characters encode structure and form, provides a unique probe of this limitation. We introduce ASCIIBench, a novel benchmark for evaluating both the generation and classification of ASCII-text images. ASCIIBench consists of a filtered dataset of 5,315 class-labeled ASCII images and is, to our knowledge, the first publicly available benchmark of its kind. Alongside the dataset, we release weights for a fine-tuned CLIP model adapted to capture ASCII structure, enabling the evaluation of LLM-generated ASCII art. Our analysis shows that cosine similarity over CLIP embeddings fails to separate most ASCII categories, yielding chance-level performance even for low-variance classes. In contrast, classes with high internal mean similarity exhibit clear discriminability, revealing that the bottleneck lies in representation rather than generational variance. These findings position ASCII art as a stress test for multimodal representations and motivate the development of new embedding methods or evaluation metrics tailored to symbolic visual modalities. All resources are available at this https URL.

[LG-44] Patient Safety Risks from AI Scribes: Signals from End-User Feedback ML4H

Link: https://arxiv.org/abs/2512.04118
Authors: Jessica Dai, Anwen Huang, Catherine Nasrallah, Rhiannon Croci, Hossein Soleimani, Sarah J. Pollet, Julia Adler-Milstein, Sara G. Murray, Jinoos Yazdany, Irene Y. Chen
Subjects: Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: ML4H Findings 2025

Abstract:AI scribes are transforming clinical documentation at scale. However, their real-world performance remains understudied, especially regarding their impacts on patient safety. To this end, we initiate a mixed-methods study of patient safety issues raised in feedback submitted by AI scribe users (healthcare providers) in a large U.S. hospital system. Both quantitative and qualitative analysis suggest that AI scribes may induce various patient safety risks due to errors in transcription, most significantly regarding medication and treatment; however, further study is needed to contextualize the absolute degree of risk.

[LG-45] Foundations of Diffusion Models in General State Spaces: A Self-Contained Introduction

Link: https://arxiv.org/abs/2512.05092
Authors: Vincent Pauline, Tobias Höppe, Kirill Neklyudov, Alexander Tong, Stefan Bauer, Andrea Dittadi
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:Although diffusion models now occupy a central place in generative modeling, introductory treatments commonly assume Euclidean data and seldom clarify their connection to discrete-state analogues. This article is a self-contained primer on diffusion over general state spaces, unifying continuous domains and discrete/categorical structures under one lens. We develop the discrete-time view (forward noising via Markov kernels and learned reverse dynamics) alongside its continuous-time limits – stochastic differential equations (SDEs) in \mathbb{R}^d and continuous-time Markov chains (CTMCs) on finite alphabets – and derive the associated Fokker–Planck and master equations. A common variational treatment yields the ELBO that underpins standard training losses. We make explicit how forward corruption choices – Gaussian processes in continuous spaces and structured categorical transition kernels (uniform, masking/absorbing and more) in discrete spaces – shape reverse dynamics and the ELBO. The presentation is layered for three audiences: newcomers seeking a self-contained intuitive introduction; diffusion practitioners wanting a global theoretical synthesis; and continuous-diffusion experts looking for an analogy-first path into discrete diffusion. The result is a unified roadmap to modern diffusion methodology across continuous domains and discrete sequences, highlighting a compact set of reusable proofs, identities, and core theoretical principles.
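For readers unfamiliar with the two continuous-time limits named in the abstract, the standard textbook forms of the corresponding evolution equations are sketched below. The notation (drift f, diffusion coefficient g, rate matrix Q_t with rows summing to zero) is an assumption chosen for illustration; the primer itself may use different conventions.

```latex
% Continuous state space: forward SDE and its Fokker--Planck equation
\mathrm{d}X_t = f(X_t, t)\,\mathrm{d}t + g(t)\,\mathrm{d}W_t,
\qquad
\partial_t p_t(x) = -\nabla \cdot \bigl( f(x, t)\, p_t(x) \bigr)
                    + \tfrac{1}{2}\, g(t)^2\, \Delta p_t(x).

% Finite alphabet: CTMC with rate matrix Q_t, where Q_t(x, y) is the rate
% of jumping from x to y and \sum_y Q_t(x, y) = 0 (master equation)
\partial_t p_t(y) = \sum_{x} p_t(x)\, Q_t(x, y).
```

In both cases the equation describes how the marginal distribution p_t of the noised data evolves, which is the object the learned reverse dynamics must undo.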

[LG-46] Control Consistency Losses for Diffusion Bridges NEURIPS2025

Link: https://arxiv.org/abs/2512.05070
Authors: Samuel Howard, Nikolas Nüsken, Jakiw Pidstrigach
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Frontiers in Probabilistic Inference: Sampling Meets Learning Workshop at NeurIPS 2025 (Oral)

Abstract:Simulating the conditioned dynamics of diffusion processes, given their initial and terminal states, is an important but challenging problem in the sciences. The difficulty is particularly pronounced for rare events, for which the unconditioned dynamics rarely reach the terminal state. In this work, we leverage a self-consistency property of the conditioned dynamics to learn the diffusion bridge in an iterative online manner, and demonstrate promising empirical results in a range of settings.

[LG-47] Towards a unified framework for guided diffusion models

Link: https://arxiv.org/abs/2512.04985
Authors: Yuchen Jiao, Yuxin Chen, Gen Li
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments:

Abstract:Guided or controlled data generation with diffusion models has become a cornerstone of modern generative modeling. Despite substantial advances in diffusion model theory, the theoretical understanding of guided diffusion samplers remains severely limited. We make progress by developing a unified algorithmic and theoretical framework that accommodates both diffusion guidance and reward-guided diffusion. Aimed at fine-tuning diffusion models to improve certain rewards, we propose injecting a reward guidance term – constructed from the difference between the original and reward-reweighted scores – into the backward diffusion process, and rigorously quantify the resulting reward improvement over the unguided counterpart. As a key application, our framework shows that classifier-free guidance (CFG) decreases the expected reciprocal of the classifier probability, providing the first theoretical characterization of the specific performance metric that CFG improves for general target distributions. When applied to reward-guided diffusion, our framework yields a new sampler that is easy-to-train and requires no full diffusion trajectories during training. Numerical experiments further corroborate our theoretical findings. (Footnote: partial preliminary results of this work appeared in International Conference on Machine Learning 2025 [li2025provable].)
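As background for the CFG claim in the abstract, the commonly used form of the classifier-free guidance score is sketched below. The guidance weight w and the parameterization (w = 1 recovers the plain conditional score) are assumptions for illustration; the paper may use a different convention, and the last expression only names the quantity the abstract describes, not the paper's theorem.

```latex
% CFG score: extrapolate from the unconditional toward the conditional score
s_w(x, t \mid c) = \nabla_x \log p_t(x) + w \bigl( \nabla_x \log p_t(x \mid c) - \nabla_x \log p_t(x) \bigr).

% By Bayes' rule, \nabla_x \log p_t(x \mid c) - \nabla_x \log p_t(x) = \nabla_x \log p_t(c \mid x), hence
s_w(x, t \mid c) = \nabla_x \log p_t(x \mid c) + (w - 1)\, \nabla_x \log p_t(c \mid x).

% Quantity the abstract says CFG decreases: the expected reciprocal classifier probability
\mathbb{E}\!\left[ \frac{1}{p(c \mid X)} \right].
```

The second line makes the link to the classifier probability explicit: for w > 1 the sampler is pushed along the gradient of log p_t(c | x), which is consistent with the abstract's statement that CFG improves a metric defined through p(c | x).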

[LG-48] Learning Causality for Longitudinal Data

Link: https://arxiv.org/abs/2512.04980
Authors: Mouad EL Bouchattaoui
Subjects: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: PhD thesis manuscript

Abstract:This thesis develops methods for causal inference and causal representation learning (CRL) in high-dimensional, time-varying data. The first contribution introduces the Causal Dynamic Variational Autoencoder (CDVAE), a model for estimating Individual Treatment Effects (ITEs) by capturing unobserved heterogeneity in treatment response driven by latent risk factors that affect only outcomes. CDVAE comes with theoretical guarantees on valid latent adjustment and generalization bounds for ITE error. Experiments on synthetic and real datasets show that CDVAE outperforms baselines, and that state-of-the-art models greatly improve when augmented with its latent substitutes, approaching oracle performance without access to true adjustment variables. The second contribution proposes an efficient framework for long-term counterfactual regression based on RNNs enhanced with Contrastive Predictive Coding (CPC) and InfoMax. It captures long-range dependencies under time-varying confounding while avoiding the computational cost of transformers, achieving state-of-the-art results and introducing CPC into causal inference. The third contribution advances CRL by addressing how latent causes manifest in observed variables. We introduce a model-agnostic interpretability layer based on the geometry of the decoder Jacobian. A sparse self-expression prior induces modular, possibly overlapping groups of observed features aligned with shared latent influences. We provide recovery guarantees in both disjoint and overlapping settings and show that meaningful latent-to-observed structure can be recovered without anchor features or single-parent assumptions. Scalable Jacobian-based regularization techniques are also developed. 

[LG-49] Shorting Dynamics and Structured Kernel Regularization

Link: https://arxiv.org/abs/2512.04874
Authors: James Tian
Subjects: Functional Analysis (math.FA); Machine Learning (cs.LG)
Comments:

Abstract:This paper develops a nonlinear operator dynamic that progressively removes the influence of a prescribed feature subspace while retaining maximal structure elsewhere. The induced sequence of positive operators is monotone, admits an exact residual decomposition, and converges to the classical shorted operator. Transporting this dynamic to reproducing kernel Hilbert spaces yields a corresponding family of kernels that converges to the largest kernel dominated by the original one and annihilating the given subspace. In the finite-sample setting, the associated Gram operators inherit a structured residual decomposition that leads to a canonical form of kernel ridge regression and a principled way to enforce nuisance invariance. This gives a unified operator-analytic approach to invariant kernel construction and structured regularization in data analysis.

[LG-50] Series of quasi-uniform scatterings with fast search root systems and neural network classifications

Link: https://arxiv.org/abs/2512.04865
Authors: Igor V. Netay
Subjects: Algebraic Geometry (math.AG); Machine Learning (cs.LG); Representation Theory (math.RT)
Comments:

Abstract:In this paper we describe an approach for constructing large, extendable collections of vectors in predefined spaces of given dimensions. These collections are useful for configuring and training neural network latent spaces. For classification problems with a large or unknown number of classes, this makes it possible to construct classifiers without a classification layer and to extend the number of classes without retraining the network from scratch. The construction yields large, well-spaced vector collections in spaces of the minimal possible dimension. If the number of classes is known or approximately predictable, one can choose a sufficiently large collection size. If the number of classes must be extended significantly, one can extend the collection within the same latent space or embed it into a collection of higher dimension with the same spacing between vectors. In addition, the regular symmetric structure of the constructed collections can significantly simplify the search for nearest cluster centers or embeddings in the latent space. The construction is based on the combinatorics and geometry of highest-weight irreducible representations of semisimple Lie groups.

[LG-51] Continuous-time reinforcement learning for optimal switching over multiple regimes

Link: https://arxiv.org/abs/2512.04697
Authors: Yijie Huang, Mengge Li, Xiang Yu, Zhou Zhou
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Computational Finance (q-fin.CP)
Comments: Keywords: Optimal regime switching, multiple regimes, continuous-time reinforcement learning, system of HJB equations, policy improvement, policy iteration convergence

Abstract:This paper studies the continuous-time reinforcement learning (RL) for optimal switching problems across multiple regimes. We consider a type of exploratory formulation under entropy regularization where the agent randomizes both the timing of switches and the selection of regimes through the generator matrix of an associated continuous-time finite-state Markov chain. We establish the well-posedness of the associated system of Hamilton-Jacobi-Bellman (HJB) equations and provide a characterization of the optimal policy. The policy improvement and the convergence of the policy iterations are rigorously established by analyzing the system of equations. We also show the convergence of the value function in the exploratory formulation towards the value function in the classical formulation as the temperature parameter vanishes. Finally, a reinforcement learning algorithm is devised and implemented by invoking the policy evaluation based on the martingale characterization. Our numerical examples with the aid of neural networks illustrate the effectiveness of the proposed RL algorithm.

[LG-52] Provable FDR Control for Deep Feature Selection: Deep MLPs and Beyond

Link: https://arxiv.org/abs/2512.04696
Authors: Kazuma Sawaya
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments:

Abstract:We develop a flexible feature selection framework based on deep neural networks that approximately controls the false discovery rate (FDR), a measure of Type-I error. The method applies to architectures whose first layer is fully connected. From the second layer onward, it accommodates multilayer perceptrons (MLPs) of arbitrary width and depth, convolutional and recurrent networks, attention mechanisms, residual connections, and dropout. The procedure also accommodates stochastic gradient descent with data-independent initializations and learning rates. To the best of our knowledge, this is the first work to provide a theoretical guarantee of FDR control for feature selection within such a general deep learning setting. Our analysis is built upon a multi-index data-generating model and an asymptotic regime in which the feature dimension n diverges faster than the latent dimension q^*, while the sample size, the number of training iterations, the network depth, and hidden layer widths are left unrestricted. Under this setting, we show that each coordinate of the gradient-based feature-importance vector admits a marginal normal approximation, thereby supporting the validity of asymptotic FDR control. As a theoretical limitation, we assume \mathbf{B}-right orthogonal invariance of the design matrix, and we discuss broader generalizations. We also present numerical experiments that underscore the theoretical findings.

[LG-53] Recurrent Neural Networks with Linear Structures for Electricity Price Forecasting

Link: https://arxiv.org/abs/2512.04690
Authors: Souhir Ben Amor, Florian Ziel
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:We present a novel recurrent neural network architecture designed explicitly for day-ahead electricity price forecasting, aimed at improving short-term decision-making and operational management in energy systems. Our combined forecasting model embeds linear structures, such as expert models and Kalman filters, into recurrent networks, enabling efficient computation and enhanced interpretability. The design leverages the strengths of both linear and non-linear model structures, allowing it to capture all relevant stylised price characteristics in power markets, including calendar and autoregressive effects, as well as influences from load, renewable energy, and related fuel and carbon markets. For empirical testing, we use hourly data from the largest European electricity market spanning 2018 to 2025 in a comprehensive forecasting study, comparing our model against state-of-the-art approaches, particularly high-dimensional linear and neural network models. The proposed model achieves approximately 12% higher accuracy than leading benchmarks. We evaluate the contributions of the interpretable model components and conclude on the impact of combining linear and non-linear structures.

[LG-54] Fermionic neural Gibbs states

Link: https://arxiv.org/abs/2512.04663
Authors: Jannes Nys, Juan Carrasquilla
Subjects: Quantum Physics (quant-ph); Strongly Correlated Electrons (cond-mat.str-el); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments:

Abstract:We introduce fermionic neural Gibbs states (fNGS), a variational framework for modeling finite-temperature properties of strongly interacting fermions. fNGS starts from a reference mean-field thermofield-double state and uses neural-network transformations together with imaginary-time evolution to systematically build strong correlations. Applied to the doped Fermi-Hubbard model, a minimal lattice model capturing essential features of strong electronic correlations, fNGS accurately reproduces thermal energies over a broad range of temperatures, interaction strengths, even at large dopings, for system sizes beyond the reach of exact methods. These results demonstrate a scalable route to studying finite-temperature properties of strongly correlated fermionic systems beyond one dimension with neural-network representations of quantum states.

[LG-55] Predicting Time-Dependent Flow Over Complex Geometries Using Operator Networks

Link: https://arxiv.org/abs/2512.04434
Authors: Ali Rabeh, Suresh Murugaiyan, Adarsh Krishnamurthy, Baskar Ganapathysubramanian
Subjects: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
Comments:

Abstract:Fast, geometry-generalizing surrogates for unsteady flow remain challenging. We present a time-dependent, geometry-aware Deep Operator Network that predicts velocity fields for moderate-Re flows around parametric and non-parametric shapes. The model encodes geometry via a signed distance field (SDF) trunk and flow history via a CNN branch, trained on 841 high-fidelity simulations. On held-out shapes, it attains \sim 5% relative L2 single-step error and up to 1000X speedups over CFD. We provide physics-centric rollout diagnostics, including phase error at probes and divergence norms, to quantify long-horizon fidelity. These reveal accurate near-term transients but error accumulation in fine-scale wakes, most pronounced for sharp-cornered geometries. We analyze failure modes and outline practical mitigations. Code, splits, and scripts are openly released at: this https URL to support reproducibility and benchmarking.
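The "relative L2 error" quoted in the abstract is usually the norm of the prediction error divided by the norm of the reference field. The sketch below shows that common definition on a flattened field; the function name `relative_l2_error` is hypothetical, and the abstract does not say whether the authors normalize per sample, per velocity component, or over the whole test set.

```python
import math


def relative_l2_error(pred, true):
    """Return ||pred - true||_2 / ||true||_2 for two equally long sequences."""
    if len(pred) != len(true):
        raise ValueError("pred and true must have the same length")
    num = math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)))
    den = math.sqrt(sum(t ** 2 for t in true))
    if den == 0.0:
        raise ValueError("reference field has zero norm")
    return num / den


# Toy example (made-up numbers): a two-point "field" with a small error.
reference = [3.0, 4.0]
prediction = [3.0, 4.25]
error = relative_l2_error(prediction, reference)
print(f"{100 * error:.1f}% relative L2 error")
```

Because the error is normalized by the reference norm, the metric is dimensionless and comparable across geometries with different flow magnitudes.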

[LG-56] Informative missingness and its implications in semi-supervised learning

Link: https://arxiv.org/abs/2512.04392
Authors: Jinran Wu, You-Gan Wang, Geoffrey J. McLachlan
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 1

Abstract:Semi-supervised learning (SSL) constructs classifiers using both labelled and unlabelled data. It leverages information from labelled samples, whose acquisition is often costly or labour-intensive, together with unlabelled data to enhance prediction performance. This defines an incomplete-data problem, which statistically can be formulated within the likelihood framework for finite mixture models that can be fitted using the expectation-maximisation (EM) algorithm. Ideally, one would prefer a completely labelled sample, as one would anticipate that a labelled observation provides more information than an unlabelled one. However, when the mechanism governing label absence depends on the observed features or the class labels or both, the missingness indicators themselves contain useful information. In certain situations, the information gained from modelling the missing-label mechanism can even outweigh the loss due to missing labels, yielding a classifier with a smaller expected error than one based on a completely labelled sample analysed. This improvement arises particularly when class overlap is moderate, labelled data are sparse, and the missingness is informative. Modelling such informative missingness thus offers a coherent statistical framework that unifies likelihood-based inference with the behaviour of empirical SSL methods.

[LG-57] Constructive Approximation under Carleman's Condition with Applications to Smoothed Analysis

Link: https://arxiv.org/abs/2512.04371
Authors: Frederic Koehler, Beining Wu
Subjects: Probability (math.PR); Machine Learning (cs.LG); Functional Analysis (math.FA); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments:

Abstract:A classical result of Carleman, based on the theory of quasianalytic functions, shows that polynomials are dense in L^2(\mu) for any \mu such that the moments \int x^k d\mu do not grow too rapidly as k \to \infty . In this work, we develop a fairly tight quantitative analogue of the underlying Denjoy-Carleman theorem via complex analysis, and show that this allows for nonasymptotic control of the rate of approximation by polynomials for any smooth function with polynomial growth at infinity. In many cases, this allows us to establish L^2 approximation-theoretic results for functions over general classes of distributions (e.g., multivariate sub-Gaussian or sub-exponential distributions) which were previously known only in special cases. As one application, we show that the Paley–Wiener class of functions bandlimited to [-\Omega,\Omega] admits superexponential rates of approximation over all strictly sub-exponential distributions, which leads to a new characterization of the class. As another application, we solve an open problem recently posed by Chandrasekaran, Klivans, Kontonis, Meka and Stavropoulos on the smoothed analysis of learning, and also obtain quantitative improvements to their main results and applications.

[LG-58] Enhancing next token prediction based pre-training for jet foundation models

Link: https://arxiv.org/abs/2512.04149
Authors: Joschka Birk, Anna Hallin, Gregor Kasieczka, Nikol Madzharova, Ian Pang, David Shih
Subjects: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); Data Analysis, Statistics and Probability (physics.data-an)
Comments:

Abstract:Next token prediction is an attractive pre-training task for jet foundation models, in that it is simulation free and enables excellent generative capabilities that can transfer across datasets. Here we study multiple improvements to next token prediction, building on the initial work of OmniJet-\alpha. Instead of tokenizing particles and subsequently only using the token-ID as the model input for both the generative and the classification task, we adopt a hybrid setup, which allows us to use continuous feature vectors as model input while only using token-IDs in the next token prediction target. Secondly, we explore a combined pre-training strategy that combines masked particle modeling and generative learning objectives. Taken together, these changes greatly improve the performance in downstream classification tasks without any loss in generative performance.

Information Retrieval

[IR-0] Ask Safely: Privacy-Aware LLM Query Generation for Knowledge Graphs

Link: https://arxiv.org/abs/2512.04852
Authors: Mauro Dalle Lucca Tosi, Jordi Cabot
Subjects: Information Retrieval (cs.IR)
Comments:

Abstract:Large Language Models (LLMs) are increasingly used to query knowledge graphs (KGs) due to their strong semantic understanding and extrapolation capabilities compared to traditional approaches. However, these methods cannot be applied when the KG contains sensitive data and the user lacks the resources to deploy a local generative LLM. To address this issue, we propose a privacy-aware query generation approach for KGs. Our method identifies sensitive information in the graph based on its structure and omits such values before requesting the LLM to translate natural language questions into Cypher queries. Experimental results show that our approach preserves the quality of the generated queries while preventing sensitive data from being transmitted to third-party services.
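One way to picture the idea of omitting sensitive values before calling a third-party LLM is placeholder substitution: sensitive literals are swapped for Cypher-style `$name` parameters, and the real values are bound locally when the generated query is executed. The sketch below is an assumption-laden illustration, not the paper's method: the list of sensitive values is supplied by hand (the paper derives sensitivity from the graph structure), `redact_question` is a hypothetical helper, and the "LLM output" is a hard-coded stand-in.

```python
def redact_question(question, sensitive_values):
    """Replace sensitive literals with $-placeholders; return redacted text and bindings."""
    params = {}
    redacted = question
    # Longest values first so a short value never clobbers part of a longer one.
    ordered = sorted(sensitive_values, key=len, reverse=True)
    for index, value in enumerate(ordered):
        if value in redacted:
            name = f"p{index}"
            redacted = redacted.replace(value, f"${name}")
            params[name] = value
    return redacted, params


question = "Which projects does Alice Smith work on?"
redacted, params = redact_question(question, ["Alice Smith"])
print(redacted)  # only this text would be sent to the remote LLM

# Hard-coded stand-in for the query an LLM might return (hypothetical schema).
generated_cypher = "MATCH (p:Person {name: $p0})-[:WORKS_ON]->(pr:Project) RETURN pr.title"

# The sensitive value stays local and is bound as a query parameter at execution time.
print(generated_cypher, params)
```

Binding the values as parameters rather than splicing them back into the query string keeps the sensitive literal out of both the prompt and the query text.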

[IR-1] Spatially-Enhanced Retrieval-Augmented Generation for Walkability and Urban Discovery

Link: https://arxiv.org/abs/2512.04790
Authors: Maddalena Amendola, Chiara Pugliese, Raffaele Perego, Chiara Renso
Subjects: Information Retrieval (cs.IR)
Comments:

Abstract:Large Language Models (LLMs) have become foundational tools in artificial intelligence, supporting a wide range of applications beyond traditional natural language processing, including urban systems and tourist recommendations. However, their tendency to hallucinate and their limitations in spatial retrieval and reasoning are well known, pointing to the need for novel solutions. Retrieval-augmented generation (RAG) has recently emerged as a promising way to enhance LLMs with accurate, domain-specific, and timely information. Spatial RAG extends this approach to tasks involving geographic understanding. In this work, we introduce WalkRAG, a spatial RAG-based framework with a conversational interface for recommending walkable urban itineraries. Users can request routes that meet specific spatial constraints and preferences while interactively retrieving information about the path and points of interest (POIs) along the way. Preliminary results show the effectiveness of combining information retrieval, spatial reasoning, and LLMs to support urban discovery.

[IR-2] UserSimCRS v2: Simulation-Based Evaluation for Conversational Recommender Systems

Link: https://arxiv.org/abs/2512.04588
Authors: Nolwenn Bernard, Krisztian Balog
Subjects: Information Retrieval (cs.IR)
Comments:

Abstract:Resources for simulation-based evaluation of conversational recommender systems (CRSs) are scarce. The UserSimCRS toolkit was introduced to address this gap. In this work, we present UserSimCRS v2, a significant upgrade aligning the toolkit with state-of-the-art research. Key extensions include an enhanced agenda-based user simulator, introduction of large language model-based simulators, integration for a wider range of CRSs and datasets, and new LLM-as-a-judge evaluation utilities. We demonstrate these extensions in a case study.

[IR-3] The Personalization Paradox: Semantic Loss vs. Reasoning Gains in Agentic AI QA

Link: https://arxiv.org/abs/2512.04343
Authors: Satyajit Movidi, Stephen Russell
Subjects: Information Retrieval (cs.IR)
Comments:

Abstract:AIVisor, an agentic retrieval-augmented LLM for student advising, was used to examine how personalization affects system performance across multiple evaluation dimensions. Using twelve authentic advising questions intentionally designed to stress lexical precision, we compared ten personalized and non-personalized system configurations and analyzed outcomes with a Linear Mixed-Effects Model across lexical (BLEU, ROUGE-L), semantic (METEOR, BERTScore), and grounding (RAGAS) metrics. Results showed a consistent trade-off: personalization reliably improved reasoning quality and grounding, yet introduced a significant negative interaction on semantic similarity, driven not by poorer answers but by the limits of current metrics, which penalize meaningful personalized deviations from generic reference texts. This reveals a structural flaw in prevailing LLM evaluation methods, which are ill-suited for assessing user-specific responses. The fully integrated personalized configuration produced the highest overall gains, suggesting that personalization can enhance system effectiveness when evaluated with appropriate multidimensional metrics. Overall, the study demonstrates that personalization produces metric-dependent shifts rather than uniform improvements and provides a methodological foundation for more transparent and robust personalization in agentic AI.

Attachment Download

Click to download the full list of today's papers