本篇博文主要内容为 2025-12-16 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分,若需要邮件定时接收,请在评论区留下你的邮箱号。

说明:每日论文数据从Arxiv.org获取,每天早上12:00左右定时自动更新。

友情提示: 如何您需要邮箱接收每日论文数据,请在评论处留下你的邮箱。

目录

概览 (2025-12-16)

今日共更新956篇论文,其中:

  • 自然语言处理105篇(Computation and Language (cs.CL))
  • 人工智能262篇(Artificial Intelligence (cs.AI))
  • 计算机视觉245篇(Computer Vision and Pattern Recognition (cs.CV))
  • 机器学习297篇(Machine Learning (cs.LG))

自然语言处理

[NLP-0] Beyond surface form: A pipeline for semantic analysis in Alzheimers Disease detection from spontaneous speech

【速读】: 该论文旨在解决当前基于语言模型的阿尔茨海默病(Alzheimer’s Disease, AD)筛查工具中因模型可解释性不足而导致的误判问题,即难以区分真正的语言认知衰退标志与表面文本模式之间的虚假关联。其解决方案的关键在于提出一种新颖的文本转换方法:通过改变语法结构和词汇选择来重构语料的表层形式(surface form),同时严格保持语义内容不变,从而隔离出语义信息对分类性能的影响。实验表明,即使在显著降低BLEU和chrF评分(反映表层形式变化)的情况下,模型在保留高语义相似度的前提下仍能维持接近原始文本的分类表现(仅出现微小的macro-F1偏差),证明了仅依赖语义信息即可有效识别AD,为早期检测系统提供了更可靠、更具解释性的特征基础。

链接: https://arxiv.org/abs/2512.13685
作者: Dylan Phelps,Rodrigo Wilkens,Edward Gow-Smith,Lilian Hubner,Bárbara Malcorra,César Rennó-Costa,Marco Idiart,Maria-Cruz Villa-Uriol,Aline Villavicencio
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Alzheimer’s Disease (AD) is a progressive neurodegenerative condition that adversely affects cognitive abilities. Language-related changes can be automatically identified through the analysis of outputs from linguistic assessment tasks, such as picture description. Language models show promise as a basis for screening tools for AD, but their limited interpretability poses a challenge in distinguishing true linguistic markers of cognitive decline from surface-level textual patterns. To address this issue, we examine how surface form variation affects classification performance, with the goal of assessing the ability of language models to represent underlying semantic indicators. We introduce a novel approach where texts surface forms are transformed by altering syntax and vocabulary while preserving semantic content. The transformations significantly modify the structure and lexical content, as indicated by low BLEU and chrF scores, yet retain the underlying semantics, as reflected in high semantic similarity scores, isolating the effect of semantic information, and finding models perform similarly to if they were using the original text, with only small deviations in macro-F1. We also investigate whether language from picture descriptions retains enough detail to reconstruct the original image using generative models. We found that image-based transformations add substantial noise reducing classification accuracy. Our methodology provides a novel way of looking at what features influence model predictions, and allows the removal of possible spurious correlations. We find that just using semantic information, language model based classifiers can still detect AD. This work shows that difficult to detect semantic impairment can be identified, addressing an overlooked feature of linguistic deterioration, and opening new pathways for early detection systems.
zh

[NLP-1] owards Effective Model Editing for LLM Personalization

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)个性化过程中存在的计算成本高、数据依赖性强、灾难性遗忘以及多轮交互和隐式查询场景下性能下降等问题。其核心解决方案是将个性化建模视为模型编辑任务,提出Personalization Editing框架,通过聚类偏好表示引导局部化编辑,实现精准的偏好对齐更新,同时保留模型整体能力。该方法在编辑准确性和计算效率上优于微调策略,并在多轮对话和隐式偏好问答场景中显著优于提示工程基线。

链接: https://arxiv.org/abs/2512.13676
作者: Baixiang Huang,Limeng Cui,Jiapeng Liu,Haoran Wang,Jiawei Xu,Zhuiyue Tan,Yutong Chen,Chen Luo,Yi Liu,Kai Shu
机构: Emory University (埃默里大学); Amazon (亚马逊)
类目: Computation and Language (cs.CL)
备注: 15 pages (including appendix), 7 figures. Code, data, results, and additional resources are available at: this https URL

点击查看摘要

Abstract:Personalization is becoming indispensable for LLMs to align with individual user preferences and needs. Yet current approaches are often computationally expensive, data-intensive, susceptible to catastrophic forgetting, and prone to performance degradation in multi-turn interactions or when handling implicit queries. To address these challenges, we conceptualize personalization as a model editing task and introduce Personalization Editing, a framework that applies localized edits guided by clustered preference representations. This design enables precise preference-aligned updates while preserving overall model capabilities. In addition, existing personalization benchmarks frequently rely on persona-based dialogs between LLMs rather than user-LLM interactions, or focus primarily on stylistic imitation while neglecting information-seeking tasks that require accurate recall of user-specific preferences. We introduce User Preference Question Answering (UPQA), a short-answer QA dataset constructed from in-situ user queries with varying levels of difficulty. Unlike prior benchmarks, UPQA directly evaluates a model’s ability to recall and apply specific user preferences. Across experimental settings, Personalization Editing achieves higher editing accuracy and greater computational efficiency than fine-tuning, while outperforming prompting-based baselines in multi-turn conversations and implicit preference questions settings.
zh

[NLP-2] owards Interactive Intelligence for Digital Humans

【速读】: 该论文旨在解决当前数字人(Digital Human)在交互智能方面的局限性,即多数现有系统仅能实现表面化的模仿行为,缺乏个性化的表达、自适应的交互能力以及持续进化的能力。为应对这一挑战,作者提出了“交互智能”(Interactive Intelligence)的新范式,并设计了Mio(Multimodal Interactive Omni-Avatar)端到端框架作为解决方案。其关键在于构建一个由五个专业化模块(Thinker、Talker、Face Animator、Body Animator 和 Renderer)组成的统一架构,将认知推理与实时多模态具身表达深度融合,从而实现流畅且一致的交互体验,并支持数字人的自我演化能力。

链接: https://arxiv.org/abs/2512.13674
作者: Yiyi Cai,Xuangeng Chu,Xiwei Gao,Sitong Gong,Yifei Huang,Caixin Kang,Kunhang Li,Haiyang Liu,Ruicong Liu,Yun Liu,Dianwen Ng,Zixiong Su,Erwin Wu,Yuhan Wu,Dingkun Yan,Tianyu Yan,Chang Zeng,Bo Zheng,You Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Graphics (cs.GR); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:We introduce Interactive Intelligence, a novel paradigm of digital human that is capable of personality-aligned expression, adaptive interaction, and self-evolution. To realize this, we present Mio (Multimodal Interactive Omni-Avatar), an end-to-end framework composed of five specialized modules: Thinker, Talker, Face Animator, Body Animator, and Renderer. This unified architecture integrates cognitive reasoning with real-time multimodal embodiment to enable fluid, consistent interaction. Furthermore, we establish a new benchmark to rigorously evaluate the capabilities of interactive intelligence. Extensive experiments demonstrate that our framework achieves superior performance compared to state-of-the-art methods across all evaluated dimensions. Together, these contributions move digital humans beyond superficial imitation toward intelligent interaction.
zh

[NLP-3] A stylometric analysis of speaker attribution from speech transcripts

【速读】: 该论文旨在解决在语音特征不可靠(如说话人伪装声音或使用文本转语音软件)的情况下,如何通过语音转写文本中的语言内容进行说话人识别的问题。传统声纹识别依赖于声学特性,在此类场景下失效,而作者提出了一种基于语体特征的说话人归属方法——StyloSpeaker,其关键在于将文献中已有的作者识别(authorship attribution)领域的语体特征(stylometric features)拓展至语音转写文本,整合字符、词汇、词元、句子及风格等多维度特征,从而实现对说话人身份的判别。实验表明,该方法在标准化格式的转录文本上表现更优,且相较于黑箱神经网络模型更具可解释性,同时揭示了哪些语体特征最能有效区分不同说话人。

链接: https://arxiv.org/abs/2512.13667
作者: Cristina Aggazzotti,Elizabeth Allyn Smith
机构: Johns Hopkins University (约翰霍普金斯大学); Université du Québec à Montréal (魁北克大学蒙特利尔分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Forensic scientists often need to identify an unknown speaker or writer in cases such as ransom calls, covert recordings, alleged suicide notes, or anonymous online communications, among many others. Speaker recognition in the speech domain usually examines phonetic or acoustic properties of a voice, and these methods can be accurate and robust under certain conditions. However, if a speaker disguises their voice or employs text-to-speech software, vocal properties may no longer be reliable, leaving only their linguistic content available for analysis. Authorship attribution methods traditionally use syntactic, semantic, and related linguistic information to identify writers of written text (authorship attribution). In this paper, we apply a content-based authorship approach to speech that has been transcribed into text, using what a speaker says to attribute speech to individuals (speaker attribution). We introduce a stylometric method, StyloSpeaker, which incorporates character, word, token, sentence, and style features from the stylometric literature on authorship, to assess whether two transcripts were produced by the same speaker. We evaluate this method on two types of transcript formatting: one approximating prescriptive written text with capitalization and punctuation and another normalized style that removes these conventions. The transcripts’ conversation topics are also controlled to varying degrees. We find generally higher attribution performance on normalized transcripts, except under the strongest topic control condition, in which overall performance is highest. Finally, we compare this more explainable stylometric model to black-box neural approaches on the same data and investigate which stylistic features most effectively distinguish speakers.
zh

[NLP-4] Comparative Analysis of LLM Abliteration Methods: A Cross-Architecture Evaluation

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中安全对齐机制(safety alignment mechanisms)带来的负面影响,即这些机制虽能防止有害响应,但同时也阻碍了合法的研究应用,如认知建模、对抗测试和安全分析。解决方案的关键在于评估四种消融工具(abliteration tools)——Heretic、DECCP、ErisForge 和 FailSpy——在16个指令微调模型上的有效性与兼容性,重点考察其在去除拒绝行为表示时对模型能力的保留程度。研究发现,单次遍历方法在基准子集上表现出更强的能力保持能力(如GSM8K指标变化最小),而贝叶斯优化方法则导致模型分布偏移显著且效果因模型架构而异;此外,数学推理能力对消融干预最为敏感,其性能波动范围达+1.51至-18.81个百分点(相对变化达-26.5%)。这一成果为研究人员提供了基于实证数据选择合适消融工具的依据。

链接: https://arxiv.org/abs/2512.13655
作者: Richard J. Young
机构: 未知
类目: Computation and Language (cs.CL); Software Engineering (cs.SE)
备注: 25 pages, 6 figures, 8 tables

点击查看摘要

Abstract:Safety alignment mechanisms in large language models prevent responses to harmful queries through learned refusal behavior, yet these same mechanisms impede legitimate research applications including cognitive modeling, adversarial testing, and security analysis. While abliteration techniques enable surgical removal of refusal representations through directional orthogonalization, the relative effectiveness of available implementations remains uncharacterized. This study evaluates four abliteration tools (Heretic, DECCP, ErisForge, FailSpy) across sixteen instruction-tuned models (7B-14B parameters), reporting tool compatibility on all 16 models and quantitative metrics on subsets dictated by tool support. Single-pass methods demonstrated superior capability preservation on the benchmarked subset (avg GSM8K change across three models: ErisForge -0.28 pp; DECCP -0.13 pp), while Bayesian-optimized abliteration produced variable distribution shift (KL divergence: 0.043-1.646) with model-dependent capability impact. These findings provide researchers with evidence-based selection criteria for abliteration tool deployment across diverse model architectures. The principal finding indicates that mathematical reasoning capabilities exhibit the highest sensitivity to abliteration interventions, with GSM8K change ranging from +1.51 pp to -18.81 pp (-26.5% relative) depending on tool selection and model architecture.
zh

[NLP-5] Large-Language Memorization During the Classification of United States Supreme Court Cases

【速读】: 该论文旨在解决大语言模型(Large-Language Models, LLMs)在复杂分类任务中响应不稳定、存在“幻觉”(hallucination)现象的问题,特别是在美国最高法院(SCOTUS)判例文本这类具有长句、复杂法律术语、非标准结构和领域专有词汇的高难度分类场景下,如何提升模型的记忆准确性和分类性能。解决方案的关键在于采用基于提示(prompt-based)的模型架构,并结合记忆增强机制,如DeepSeek等具备内部记忆能力的模型,相较于传统BERT-based方法,在15类和279类两个SCOTUS分类任务上均实现了约2分的性能提升,表明提示驱动与记忆机制的融合是提高LLM在专业领域分类任务中鲁棒性的有效路径。

链接: https://arxiv.org/abs/2512.13654
作者: John E. Ortega,Dhruv D. Joshi,Matt P. Borkowski
机构: Pace University (佩斯大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Information Retrieval (cs.IR)
备注: 7 pages, 1 figure, Appendix of Prompts

点击查看摘要

Abstract:Large-language models (LLMs) have been shown to respond in a variety of ways for classification tasks outside of question-answering. LLM responses are sometimes called “hallucinations” since the output is not what is ex pected. Memorization strategies in LLMs are being studied in detail, with the goal of understanding how LLMs respond. We perform a deep dive into a classification task based on United States Supreme Court (SCOTUS) decisions. The SCOTUS corpus is an ideal classification task to study for LLM memory accuracy because it presents significant challenges due to extensive sentence length, complex legal terminology, non-standard structure, and domain-specific vocabulary. Experimentation is performed with the latest LLM fine tuning and retrieval-based approaches, such as parameter-efficient fine-tuning, auto-modeling, and others, on two traditional category-based SCOTUS classification tasks: one with 15 labeled topics and another with 279. We show that prompt-based models with memories, such as DeepSeek, can be more robust than previous BERT-based models on both tasks scoring about 2 points better than previous models not based on prompting.
zh

[NLP-6] mporal Tokenization Strategies for Event Sequence Modeling with Large Language Models

【速读】: 该论文旨在解决大语言模型(LLM)在建模时间事件序列时,如何有效表示连续时间这一关键且未被充分探索的问题。现有方法如字节级表示或日历标记(calendar tokens)缺乏统一的最优策略,尤其面对现实世界事件数据分布多样(从平滑的对数正态分布到离散尖峰模式)时表现不一。解决方案的关键在于通过实证研究比较多种时间编码策略——包括朴素数值字符串、高精度字节级表示、人类语义日历标记、经典均匀分箱以及自适应残差标量量化——并发现:不存在普适最优方案,预测性能高度依赖于分词器与数据统计特性的一致性;具体而言,基于对数的策略在偏斜分布上表现优异,而以人类为中心的格式则在多模态混合场景中更具鲁棒性。

链接: https://arxiv.org/abs/2512.13618
作者: Zefang Liu,Nam Nguyen,Yinzhu Quan,Austin Zhang
机构: Capital One(资本一号); Georgia Institute of Technology(佐治亚理工学院)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Representing continuous time is a critical and under-explored challenge in modeling temporal event sequences with large language models (LLMs). Various strategies like byte-level representations or calendar tokens have been proposed. However, the optimal approach remains unclear, especially given the diverse statistical distributions of real-world event data, which range from smooth log-normal to discrete, spiky patterns. This paper presents the first empirical study of temporal tokenization for event sequences, comparing distinct encoding strategies: naive numeric strings, high-precision byte-level representations, human-semantic calendar tokens, classic uniform binning, and adaptive residual scalar quantization. We evaluate these strategies by fine-tuning LLMs on real-world datasets that exemplify these diverse distributions. Our analysis reveals that no single strategy is universally superior; instead, prediction performance depends heavily on aligning the tokenizer with the data’s statistical properties, with log-based strategies excelling on skewed distributions and human-centric formats proving robust for mixed modalities.
zh

[NLP-7] Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models

【速读】: 该论文旨在解决通用推理模型在强化学习(Reinforcement Learning, RL)训练过程中因跨域异质性(cross-domain heterogeneity)带来的挑战,包括推理时响应长度和验证延迟的显著差异,这导致RL基础设施复杂、训练效率低下,且难以设计有效的训练课程(如响应长度扩展)与超参数选择。解决方案的关键在于提出级联式域特定强化学习(Cascaded Domain-wise Reinforcement Learning, Cascade RL),通过按域顺序执行强化学习,而非混合不同域的提示(prompt),从而降低工程复杂度并实现多基准测试下的最优性能。该方法中,基于偏好优化的RLHF(Reinforcement Learning from Human Feedback)作为预处理步骤可显著提升模型推理能力,后续的域特定RLVR阶段通常不会损害早期域的性能,甚至可能进一步提升,最终使14B参数模型在LiveCodeBench及IOI竞赛中达到领先水平。

链接: https://arxiv.org/abs/2512.13607
作者: Boxin Wang,Chankyu Lee,Nayeon Lee,Sheng-Chieh Lin,Wenliang Dai,Yang Chen,Yangyi Chen,Zhuolin Yang,Zihan Liu,Mohammad Shoeybi,Bryan Catanzaro,Wei Ping
机构: NVIDIA(英伟达)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: We publicly release the Nemotron-Cascade models and the full collection of training data at: this https URL

点击查看摘要

Abstract:Building general-purpose reasoning models with reinforcement learning (RL) entails substantial cross-domain heterogeneity, including large variation in inference-time response lengths and verification latency. Such variability complicates the RL infrastructure, slows training, and makes training curriculum (e.g., response length extension) and hyperparameter selection challenging. In this work, we propose cascaded domain-wise reinforcement learning (Cascade RL) to develop general-purpose reasoning models, Nemotron-Cascade, capable of operating in both instruct and deep thinking modes. Departing from conventional approaches that blend heterogeneous prompts from different domains, Cascade RL orchestrates sequential, domain-wise RL, reducing engineering complexity and delivering state-of-the-art performance across a wide range of benchmarks. Notably, RLHF for alignment, when used as a pre-step, boosts the model’s reasoning ability far beyond mere preference optimization, and subsequent domain-wise RLVR stages rarely degrade the benchmark performance attained in earlier domains and may even improve it (see an illustration in Figure 1). Our 14B model, after RL, outperforms its SFT teacher, DeepSeek-R1-0528, on LiveCodeBench v5/v6/Pro and achieves silver-medal performance in the 2025 International Olympiad in Informatics (IOI). We transparently share our training and data recipes.
zh

[NLP-8] xtual Gradients are a Flawed Metaphor for Automatic Prompt Optimization

【速读】: 该论文试图解决的问题是如何在不依赖人工干预的情况下提升大型语言模型(Large Language Models, LLMs)的性能,即自动优化提示词(prompt)以增强模型表现。其解决方案的关键在于对基于文本梯度(textual gradients)的自动提示优化方法进行系统性实验与案例研究,发现尽管这些方法常能带来性能提升,但其背后的梯度类比并不能准确解释其行为机制,从而为后续提示优化策略的选择和新方法的设计提供了重要启示。

链接: https://arxiv.org/abs/2512.13598
作者: Daniel Melcer,Qi Chen,Wen-Hao Chiang,Shweta Garg,Pranav Garg,Christian Bock
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:A well-engineered prompt can increase the performance of large language models; automatic prompt optimization techniques aim to increase performance without requiring human effort to tune the prompts. One leading class of prompt optimization techniques introduces the analogy of textual gradients. We investigate the behavior of these textual gradient methods through a series of experiments and case studies. While such methods often result in a performance improvement, our experiments suggest that the gradient analogy does not accurately explain their behavior. Our insights may inform the selection of prompt optimization strategies, and development of new approaches.
zh

[NLP-9] ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding

【速读】: 该论文旨在解决自回归模型(Autoregressive Models, ARMs)在推理阶段因序列化生成导致的效率低下问题,以及现有掩码扩散模型(Masked Diffusion Models, MDMs)存在的两大缺陷:一是由于无法使用键值缓存(Key-Value Caching, KV caching)而带来的高计算开销,二是因在不可 tractable 的token组合空间中学习依赖关系而导致生成内容不连贯。解决方案的关键在于提出ReFusion,一种新型掩码扩散模型,其核心创新是将并行解码从token级提升至slot级——每个slot为固定长度的连续子序列,并通过迭代式的“规划-填充”(plan-and-infill)过程实现高效且一致的生成:首先利用扩散机制识别一组弱依赖的slots,随后通过自回归方式并行填充这些slots。该设计同时实现了完整的KV缓存复用与统一因果框架,将学习复杂度从token组合空间降低到可管理的slot排列空间,从而显著提升性能与效率。

链接: https://arxiv.org/abs/2512.13586
作者: Jia-Nan Li,Jian Guan,Wei Wu,Chongxuan Li
机构: Gaoling School of Artificial Intelligence, Renmin University of China (中国人民大学高瓴人工智能学院); Ant Group (蚂蚁集团)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Autoregressive models (ARMs) are hindered by slow sequential inference. While masked diffusion models (MDMs) offer a parallel alternative, they suffer from critical drawbacks: high computational overhead from precluding Key-Value (KV) caching, and incoherent generation arising from learning dependencies over an intractable space of token combinations. To address these limitations, we introduce ReFusion, a novel masked diffusion model that achieves superior performance and efficiency by elevating parallel decoding from the token level to a higher slot level, where each slot is a fixed-length, contiguous sub-sequence. This is achieved through an iterative ``plan-and-infill’’ decoding process: a diffusion-based planning step first identifies a set of weakly dependent slots, and an autoregressive infilling step then decodes these selected slots in parallel. The slot-based design simultaneously unlocks full KV cache reuse with a unified causal framework and reduces the learning complexity from the token combination space to a manageable slot-level permutation space. Extensive experiments on seven diverse benchmarks show that ReFusion not only overwhelmingly surpasses prior MDMs with 34% performance gains and an over 18 \times speedup on average, but also bridges the performance gap to strong ARMs while maintaining a 2.33 \times average speedup.
zh

[NLP-10] Memory in the Age of AI Agents

【速读】: 该论文旨在解决当前基于基础模型的智能体(agent)记忆研究领域中存在的概念模糊、分类混乱与系统性不足的问题。随着生成式AI(Generative AI)代理的发展,相关研究迅速扩展,但现有工作在动机、实现方式和评估协议上差异显著,且术语定义松散,传统如短期/长期记忆的分类已无法涵盖现代代理记忆系统的多样性。其解决方案的关键在于提出一个统一的分析框架,从“形式(forms)”、“功能(functions)”和“动态(dynamics)”三个维度对代理记忆进行系统梳理:形式上识别出token级、参数化和潜在记忆三种主流实现方式;功能上细分为事实性、经验性和工作记忆;动态上解析记忆的形成、演化与检索机制。此外,论文还整合了基准测试与开源框架,并展望了自动化、强化学习融合、多模态及多智能体记忆等前沿方向,为未来代理智能设计中将记忆视为第一性原语(first-class primitive)提供理论支撑与实践指导。

链接: https://arxiv.org/abs/2512.13564
作者: Yuyang Hu,Shichun Liu,Yanwei Yue,Guibin Zhang,Boyang Liu,Fangyi Zhu,Jiahang Lin,Honglin Guo,Shihan Dou,Zhiheng Xi,Senjie Jin,Jiejun Tan,Yanbin Yin,Jiongnan Liu,Zeyu Zhang,Zhongxiang Sun,Yutao Zhu,Hao Sun,Boci Peng,Zhenrong Cheng,Xuanbo Fan,Jiaxin Guo,Xinlei Yu,Zhenhong Zhou,Zewen Hu,Jiahao Huo,Junhao Wang,Yuwei Niu,Yu Wang,Zhenfei Yin,Xiaobin Hu,Yue Liao,Qiankun Li,Kun Wang,Wangchunshu Zhou,Yixin Liu,Dawei Cheng,Qi Zhang,Tao Gui,Shirui Pan,Yan Zhang,Philip Torr,Zhicheng Dou,Ji-Rong Wen,Xuanjing Huang,Yu-Gang Jiang,Shuicheng Yan
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Memory has emerged, and will continue to remain, a core capability of foundation model-based agents. As research on agent memory rapidly expands and attracts unprecedented attention, the field has also become increasingly fragmented. Existing works that fall under the umbrella of agent memory often differ substantially in their motivations, implementations, and evaluation protocols, while the proliferation of loosely defined memory terminologies has further obscured conceptual clarity. Traditional taxonomies such as long/short-term memory have proven insufficient to capture the diversity of contemporary agent memory systems. This work aims to provide an up-to-date landscape of current agent memory research. We begin by clearly delineating the scope of agent memory and distinguishing it from related concepts such as LLM memory, retrieval augmented generation (RAG), and context engineering. We then examine agent memory through the unified lenses of forms, functions, and dynamics. From the perspective of forms, we identify three dominant realizations of agent memory, namely token-level, parametric, and latent memory. From the perspective of functions, we propose a finer-grained taxonomy that distinguishes factual, experiential, and working memory. From the perspective of dynamics, we analyze how memory is formed, evolved, and retrieved over time. To support practical development, we compile a comprehensive summary of memory benchmarks and open-source frameworks. Beyond consolidation, we articulate a forward-looking perspective on emerging research frontiers, including memory automation, reinforcement learning integration, multimodal memory, multi-agent memory, and trustworthiness issues. We hope this survey serves not only as a reference for existing work, but also as a conceptual foundation for rethinking memory as a first-class primitive in the design of future agentic intelligence.
zh

[NLP-11] Verifying Rumors via Stance-Aware Structural Modeling

【速读】: 该论文旨在解决社交媒体中谣言验证(rumor verification)问题,尤其关注对话回复(conversation replies)的立场(stance)信息在判断谣言真伪时的作用。现有模型难以在Transformer编码器的序列长度限制下,同时捕捉语义内容、立场信号和对话结构。其解决方案的关键在于提出一种立场感知的结构建模方法(stance-aware structural modeling):通过为每个帖子(post)编码其立场信号,并按立场类别聚合回复嵌入(reply embeddings),从而实现对整个对话线程的可扩展且语义丰富的表示;此外,引入立场分布和层级深度作为协变量(covariates),增强对立场失衡和回复层级影响的结构感知能力,显著提升了谣言真实性预测性能。

链接: https://arxiv.org/abs/2512.13559
作者: Gibson Nkhata,Uttamasha Anjally Oyshi,Quan Mai,Susan Gauch
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 8 pages, 2 figures, published in The 24th IEEE/WIC International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT 2025), London, UK, 2025

点击查看摘要

Abstract:Verifying rumors on social media is critical for mitigating the spread of false information. The stances of conversation replies often provide important cues to determine a rumor’s veracity. However, existing models struggle to jointly capture semantic content, stance information, and conversation strructure, especially under the sequence length constraints of transformer-based encoders. In this work, we propose a stance-aware structural modeling that encodes each post in a discourse with its stance signal and aggregates reply embedddings by stance category enabling a scalable and semantically enriched representation of the entire thread. To enhance structural awareness, we introduce stance distribution and hierarchical depth as covariates, capturing stance imbalance and the influence of reply depth. Extensive experiments on benchmark datasets demonstrate that our approach significantly outperforms prior methods in the ability to predict truthfulness of a rumor. We also demonstrate that our model is versatile for early detection and cross-platfrom generalization.
zh

[NLP-12] PrahokBART: A Pre-trained Sequence-to-Sequence Model for Khmer Natural Language Generation COLING2025

【速读】: 该论文旨在解决柬埔寨语(Khmer)在预训练语言模型中因语料质量低和语言特性未被充分建模而导致的性能不足问题。现有多语言模型通常忽略柬埔寨语特有的语言结构,如词边界不清晰、拼写不规范等,从而影响生成任务的表现。解决方案的关键在于:首先,构建高质量的柬埔寨语与英语双语语料库,并通过词分割(word segmentation)和标准化(normalization)等语言学模块增强语料的规范性;其次,从零开始训练一个轻量级序列到序列模型 PrahokBART,以更好地适配柬埔寨语的语言特征。实验表明,PrahokBART 在机器翻译、文本摘要和标题生成三项生成任务上优于 mBART50,且其模块化设计有效提升了文本生成过程中对空格处理的自然度,进一步增强了输出文本的语言流畅性。

链接: https://arxiv.org/abs/2512.13552
作者: Hour Kaing,Raj Dabre,Haiyue Song,Van-Hien Tran,Hideki Tanaka,Masao Utiyama
机构: National Institute of Information and Communications Technology (日本信息与通信技术国家研究所)
类目: Computation and Language (cs.CL)
备注: Published at COLING 2025, 14 pages

点击查看摘要

Abstract:This work introduces \it PrahokBART, a compact pre-trained sequence-to-sequence model trained from scratch for Khmer using carefully curated Khmer and English corpora. We focus on improving the pre-training corpus quality and addressing the linguistic issues of Khmer, which are ignored in existing multilingual models, by incorporating linguistic components such as word segmentation and normalization. We evaluate PrahokBART on three generative tasks: machine translation, text summarization, and headline generation, where our results demonstrate that it outperforms mBART50, a strong multilingual pre-trained model. Additionally, our analysis provides insights into the impact of each linguistic module and evaluates how effectively our model handles space during text generation, which is crucial for the naturalness of texts in Khmer.
zh

[NLP-13] Fine-tuned LLM -based Code Migration Framework

【速读】: 该论文旨在解决SQL代码库自动化迁移中的核心挑战,特别是在从Oracle PL/SQL向PostgreSQL等不同数据库系统转换过程中存在的语法映射不一致、存储过程、触发器、视图及整体数据库逻辑适配困难等问题。其解决方案的关键在于构建一个融合传统软件工程方法与生成式AI(Generative AI)技术的迭代式框架,其中通过微调大型语言模型(Large Language Model, LLM)显著提升SQL代码转换的准确性与兼容性,同时结合提示工程(prompt engineering)实现灵活性与效率的平衡。该方法通过自动化SQL特性检测、半监督错误分析以及领域专家反馈集成,实现了语法错误率降低、特征对齐增强和持续优化的数据驱动闭环,从而在保证迁移精度的同时提升了可扩展性和流程效率。

链接: https://arxiv.org/abs/2512.13515
作者: Oleg Grynets,Vasyl Lyashkevych,Dmytro Baran,Maksym Orliansky,Taras Zelenyy,Markiian Leshchyshyn
机构: 未知
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
备注: 16 pages, 27 figures, 7 references

点击查看摘要

Abstract:The study presents the outcomes of research and experimental validation in the domain of automated codebase migration, with a focus on addressing challenges in transitioning SQL-based systems. The proposed method for migration essentially appears as a framework that leverages the best aspects of traditional software engineering techniques and provides an iterative, scalable, precise and efficient solution for modern database transformations. The central piece of the approach is the integration of a fine-tuned Large Language Model to address critical issues in SQL code conversion, such as syntax mapping, resolving discrepancies between Oracle PL/SQL and PostgreSQL, and optimising database elements such as stored procedures, triggers, views, and overall database logic. Thus, the method involves a trade-off between fine-tuning and prompt engineering. Special attention is given to a fine-tuning approach, which enhances the adaptability and compatibility with migration requirements across the entire database. According to the achieved results, fine-tuning plays a very important role. The study employs targeted evaluation methodologies along with computational metrics to measure the success of iterative conversion cycles. Core innovations include automated SQL feature detection, semi-supervised error analysis and integration of Subject Matter Experts feedback within a systematic migration workflow. The methodology achieves significant reductions in Syntax Error Rates, enhances feature alignment throughout migration iterations, and leverages dataset sampling to ensure continual improvement. By embedding GAI into the migration process, the framework facilitates precise feature mapping, semi-automated error resolution, and data-driven optimisation loops, improving workflow efficiency.
zh

[NLP-14] SkipCat: Rank-Maximized Low-Rank Compression of Large Language Models via Shared Projection and Block Skipping AAAI2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLM)因参数规模庞大而难以部署在计算和内存资源受限的边缘设备上的问题。传统低秩压缩方法虽能降低计算与存储开销,但为实现显著效率提升通常需大幅削减保留秩(rank),导致性能严重下降。解决方案的关键在于提出SkipCat框架,其核心创新包括:(1) 层内共享低秩投影机制(intra-layer shared low-rank projection),通过多个共享输入的矩阵复用同一投影减少冗余;(2) 块跳过技术(block skipping),对低秩分解中的部分子块跳过计算与内存传输操作。二者协同作用,在相同压缩率下使模型保留更高有效秩,从而在不进行额外微调的情况下,在零样本任务上相比现有方法提升7%准确率,显著改善了压缩效率与模型性能之间的权衡。

链接: https://arxiv.org/abs/2512.13494
作者: Yu-Chen Lu,Sheng-Feng Yu,Hui-Hsien Weng,Pei-Shuo Wang,Yu-Fang Hu,Liang Hung-Chun,Hung-Yueh Chiang,Kai-Chiang Wu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by AAAI 2026

点击查看摘要

Abstract:Large language models (LLM) have achieved remarkable performance across a wide range of tasks. However, their substantial parameter sizes pose significant challenges for deployment on edge devices with limited computational and memory resources. Low-rank compression is a promising approach to address this issue, as it reduces both computational and memory costs, making LLM more suitable for resource-constrained environments. Nonetheless, naïve low-rank compression methods require a significant reduction in the retained rank to achieve meaningful memory and computation savings. For a low-rank model, the ranks need to be reduced by more than half to yield efficiency gains. Such aggressive truncation, however, typically results in substantial performance degradation. To address this trade-off, we propose SkipCat, a novel low-rank compression framework that enables the use of higher ranks while achieving the same compression rates. First, we introduce an intra-layer shared low-rank projection method, where multiple matrices that share the same input use a common projection. This reduces redundancy and improves compression efficiency. Second, we propose a block skipping technique that omits computations and memory transfers for selected sub-blocks within the low-rank decomposition. These two techniques jointly enable our compressed model to retain more effective ranks under the same compression budget. Experimental results show that, without any additional fine-tuning, our method outperforms previous low-rank compression approaches by 7% accuracy improvement on zero-shot tasks under the same compression rate. These results highlight the effectiveness of our rank-maximized compression strategy in preserving model performance under tight resource constraints.
zh

[NLP-15] SIGMA: An AI-Empowered Training Stack on Early-Life Hardware

【速读】: 该论文旨在解决早期AI加速器在大规模训练中面临的三大核心挑战:系统频繁中断和未定义的故障模式导致的可靠性问题;数值误差和训练不稳定性引发的正确性与收敛性风险;以及并行优化复杂度与不可预测局部噪声共同造成的效率下降。解决方案的关键在于提出SIGMA开源训练栈,其核心是为早期AI加速器集群优化的LUCIA TRAINING PLATFORM (LTP),通过提升系统可靠性、稳定性和训练效率,实现了94.45%的有效加速器利用率和21.08%的平均浮点利用率(MFU),并在75天内仅发生一次稳定性事件,成功训练了包含2048个加速器的200B参数稀疏专家模型(MoE),从而为AI基础设施提供了高可靠、低成本且可扩展的新范式。

链接: https://arxiv.org/abs/2512.13488
作者: Lei Qu,Lianhai Ren,Peng Cheng,Rui Gao,Ruizhe Wang,Tianyu Chen,Xiao Liu,Xingjian Zhang,Yeyun Gong,Yifan Xiong,Yucheng Ding,Yuting Jiang,Zhenghao Lin,Zhongxin Guo,Ziyue Yang
机构: Microsoft Research
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Computation and Language (cs.CL)
备注: 22 pages, 7 figures

点击查看摘要

Abstract:An increasing variety of AI accelerators is being considered for large-scale training. However, enabling large-scale training on early-life AI accelerators faces three core challenges: frequent system disruptions and undefined failure modes that undermine reliability; numerical errors and training instabilities that threaten correctness and convergence; and the complexity of parallelism optimization combined with unpredictable local noise that degrades efficiency. To address these challenges, SIGMA is an open-source training stack designed to improve the reliability, stability, and efficiency of large-scale distributed training on early-life AI hardware. The core of this initiative is the LUCIA TRAINING PLATFORM (LTP), the system optimized for clusters with early-life AI accelerators. Since its launch in March 2025, LTP has significantly enhanced training reliability and operational productivity. Over the past five months, it has achieved an impressive 94.45% effective cluster accelerator utilization, while also substantially reducing node recycling and job-recovery times. Building on the foundation of LTP, the LUCIA TRAINING FRAMEWORK (LTF) successfully trained SIGMA-MOE, a 200B MoE model, using 2,048 AI accelerators. This effort delivered remarkable stability and efficiency outcomes, achieving 21.08% MFU, state-of-the-art downstream accuracy, and encountering only one stability incident over a 75-day period. Together, these advances establish SIGMA, which not only tackles the critical challenges of large-scale training but also establishes a new benchmark for AI infrastructure and platform innovation, offering a robust, cost-effective alternative to prevailing established accelerator stacks and significantly advancing AI capabilities and scalability. The source code of SIGMA is available at this https URL.
zh

[NLP-16] Advancing Bangla Machine Translation Through Informal Datasets

【速读】: 该论文旨在解决孟加拉语(Bangla)机器翻译领域中因缺乏高质量非正式语言数据而导致的翻译性能不足问题,尤其关注社交媒体和对话文本等非正式语料的利用。其关键解决方案是构建一个基于非正式来源的数据集,并在此基础上改进当前最先进的翻译模型,从而提升对自然、口语化孟加拉语的翻译能力,增强孟加拉语使用者在数字环境中的信息获取可及性。

链接: https://arxiv.org/abs/2512.13487
作者: Ayon Roy,Risat Rahaman,Sadat Shibly,Udoy Saha Joy,Abdulla Al Kafi,Farig Yousuf Sadeque
机构: BRAC University (BRAC大学)
类目: Computation and Language (cs.CL)
备注: 33 pages, 13 figures

点击查看摘要

Abstract:Bangla is the sixth most widely spoken language globally, with approximately 234 million native speakers. However, progress in open-source Bangla machine translation remains limited. Most online resources are in English and often remain untranslated into Bangla, excluding millions from accessing essential information. Existing research in Bangla translation primarily focuses on formal language, neglecting the more commonly used informal language. This is largely due to the lack of pairwise Bangla-English data and advanced translation models. If datasets and models can be enhanced to better handle natural, informal Bangla, millions of people will benefit from improved online information access. In this research, we explore current state-of-the-art models and propose improvements to Bangla translation by developing a dataset from informal sources like social media and conversational texts. This work aims to advance Bangla machine translation by focusing on informal language translation and improving accessibility for Bangla speakers in the digital world.
zh

[NLP-17] neuralFOMO: Can LLM s Handle Being Second Best? Measuring Envy-Like Preferences in Multi-Agent Settings

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在协作与竞争环境中是否表现出类似人类的嫉妒行为(envy-like behavior)的问题,尤其是在多智能体系统中,这种行为可能影响团队动态和决策公平性。其解决方案的关键在于设计两个实验场景:一是点分配游戏,用于测试模型是否试图超越同伴;二是职场情境模拟,观察当奖励分配不公时模型的行为反应。研究发现,不同LLMs在嫉妒倾向上存在显著差异,例如GPT-5-mini和Claude-3.7-Sonnet倾向于通过削弱对手来实现结果均等化,而Mistral-Small-3.2-24B则更关注自身收益最大化,这表明将竞争倾向作为安全性和系统设计因素纳入考量至关重要。

链接: https://arxiv.org/abs/2512.13481
作者: Ojas Pungalia,Rashi Upadhyay,Abhishek Mishra,Abhiram H,Tejasvi Alladi,Sujan Yenuganti,Dhruv Kumar
机构: BITS Pilani, Pilani Campus, India
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: Under Review

点击查看摘要

Abstract:Envy is a common human behavior that shapes competitiveness and can alter outcomes in team settings. As large language models (LLMs) increasingly act on behalf of humans in collaborative and competitive workflows, there is a pressing need to evaluate whether and under what conditions they exhibit envy-like preferences. In this paper, we test whether LLMs show envy-like behavior toward each other. We considered two scenarios: (1) A point allocation game that tests whether a model tries to win over its peer. (2) A workplace setting observing behaviour when recognition is unfair. Our findings reveal consistent evidence of envy-like patterns in certain LLMs, with large variation across models and contexts. For instance, GPT-5-mini and Claude-3.7-Sonnet show a clear tendency to pull down the peer model to equalize outcomes, whereas Mistral-Small-3.2-24B instead focuses on maximizing its own individual gains. These results highlight the need to consider competitive dispositions as a safety and design factor in LLM-based multi-agent systems.
zh

[NLP-18] Non-Resolution Reasoning : A Framework for Preserving Semantic Ambiguity in Language Models

【速读】: 该论文旨在解决当前语言模型中存在的“早期语义坍缩”(premature semantic collapse)问题,即模型在推理过程中过早地对词义做出唯一选择,导致在上下文信息不足时丢失有效解释,从而引发脆弱的推理和上下文失效。其解决方案的核心是提出非解析推理(Non-Resolution Reasoning, NRR)框架,该框架通过三个关键机制实现:(1)多向量嵌入(Multi-Vector Embeddings)保留每个词元的多个可行语义解释;(2)非坍缩注意力(Non-Collapsing Attention)避免层间winner-take-all竞争机制;(3)上下文身份追踪(Contextual Identity Tracking, CIT)为重复实体分配上下文特定标识(如区分不同身份的“Dr. Smith”)。这些组件由外部解析算子 ρ\rho 统一控制,使语义承诺显式化、可调控且任务依赖,从而将表示与解析分离,使同一模型可在不重新训练的情况下灵活切换创造性、事实性和模糊性保持等推理模式。

链接: https://arxiv.org/abs/2512.13478
作者: Kei Saito
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 19 pages

点击查看摘要

Abstract:Premature semantic collapse – the forced early commitment to a single meaning – remains a core architectural limitation of current language models. Softmax-driven competition and greedy decoding cause models to discard valid interpretations before sufficient context is available, resulting in brittle reasoning and context failures. We introduce Non-Resolution Reasoning (NRR), a general computational framework that preserves semantic ambiguity during inference and performs resolution only when explicitly required. NRR integrates three components: (1) Multi-Vector Embeddings that maintain multiple viable interpretations per token, (2) Non-Collapsing Attention that prevents winner-take-all dynamics across layers, and (3) Contextual Identity Tracking (CIT), which assigns context-specific identities to recurring entities (e.g., distinguishing “Dr. Smith the cardiologist” from “Dr. Smith the researcher”). These mechanisms are unified by an external Resolution Operator \rho that makes semantic commitment explicit, controllable, and task-dependent. Unlike standard architectures, NRR separates representation from resolution, allowing a single model to shift between creative, factual, and ambiguity-preserving reasoning without retraining. A synthetic evaluation demonstrates NRR’s ability to preserve ambiguity and track context: CIT-enhanced models achieve 90.9% accuracy on out-of-distribution identity-shift tasks, compared to 9.1% for transformer baselines. NRR provides a principled alternative to premature collapse, reframing ambiguity as an explicit representational state rather than a failure mode. The question is not whether AI should resolve ambiguity, but when, how, and under whose control.
zh

[NLP-19] Scaling Laws for Code: Every Programming Language Matters

【速读】: 该论文旨在解决多语言代码大模型(Code LLMs)在预训练过程中因编程语言(Programming Language, PL)差异导致的性能预测不准问题,以及现有研究忽视现代软件开发中多语言共存特性所带来的优化瓶颈。其关键解决方案是首次系统性地探索多语言代码预训练的缩放定律(scaling laws),通过超过1000次实验(等效于336,000+ H800小时计算资源)揭示不同PL对模型性能的影响机制,并提出一种基于比例依赖的多语言缩放定律:根据各语言的实用价值(如Python)、语法相似性带来的协同效应(如JavaScript与TypeScript)及饱和速度(如Rust),动态分配训练token,从而在相同算力预算下实现跨所有编程语言的平均性能最优。

链接: https://arxiv.org/abs/2512.13472
作者: Jian Yang,Shawn Guo,Lin Jing,Wei Zhang,Aishan Liu,Chuan Hao,Zhoujun Li,Wayne Xin Zhao,Xianglong Liu,Weifeng Lv,Bryan Dai
机构: Beihang University (北京航空航天大学); Ubiquant; Gaoling School of Artificial Intelligence (人工智能学院); Renmin University of China (中国人民大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Code large language models (Code LLMs) are powerful but costly to train, with scaling laws predicting performance from model size, data, and compute. However, different programming languages (PLs) have varying impacts during pre-training that significantly affect base model performance, leading to inaccurate performance prediction. Besides, existing works focus on language-agnostic settings, neglecting the inherently multilingual nature of modern software development. Therefore, it is first necessary to investigate the scaling laws of different PLs, and then consider their mutual influences to arrive at the final multilingual scaling law. In this paper, we present the first systematic exploration of scaling laws for multilingual code pre-training, conducting over 1000+ experiments (Equivalent to 336,000+ H800 hours) across multiple PLs, model sizes (0.2B to 14B parameters), and dataset sizes (1T tokens). We establish comprehensive scaling laws for code LLMs across multiple PLs, revealing that interpreted languages (e.g., Python) benefit more from increased model size and data than compiled languages (e.g., Rust). The study demonstrates that multilingual pre-training provides synergistic benefits, particularly between syntactically similar PLs. Further, the pre-training strategy of the parallel pairing (concatenating code snippets with their translations) significantly enhances cross-lingual abilities with favorable scaling properties. Finally, a proportion-dependent multilingual scaling law is proposed to optimally allocate training tokens by prioritizing high-utility PLs (e.g., Python), balancing high-synergy pairs (e.g., JavaScript-TypeScript), and reducing allocation to fast-saturating languages (Rust), achieving superior average performance across all PLs compared to uniform distribution under the same compute budget.
zh

[NLP-20] Large language models are not about language

【速读】: 该论文试图解决的问题是:当前大型语言模型(Large Language Models, LLMs)在语言学研究中的局限性,即它们作为基于概率的统计模型,依赖海量外部文本数据进行训练,无法真正揭示人类语言的本质。其解决方案的关键在于指出人类语言能力源于一种内在的计算系统,该系统能够通过有限的外部输入递归地生成层级化的思维结构,并具备区分真实语言与不可能语言的能力,从而凸显了人类语言认知与LLMs之间根本性的机制差异。

链接: https://arxiv.org/abs/2512.13441
作者: Johan J. Bolhuis,Andrea Moro,Stephen Crain,Sandiway Fong
机构: 未知
类目: Computation and Language (cs.CL); Neurons and Cognition (q-bio.NC)
备注:

点击查看摘要

Abstract:Large Language Models are useless for linguistics, as they are probabilistic models that require a vast amount of data to analyse externalized strings of words. In contrast, human language is underpinned by a mind-internal computational system that recursively generates hierarchical thought structures. The language system grows with minimal external input and can readily distinguish between real language and impossible languages.
zh

[NLP-21] Differentiable Evolutionary Reinforcement Learning

【速读】: 该论文旨在解决强化学习(Reinforcement Learning, RL)中奖励函数设计这一核心挑战,尤其是在复杂推理任务中,传统手动设计或基于启发式的方法难以有效捕捉奖励结构与任务性能之间的因果关系。解决方案的关键在于提出一种可微分进化强化学习(Differentiable Evolutionary Reinforcement Learning, DERL)框架,其通过双层优化机制实现奖励信号的自主发现:外层元优化器(Meta-Optimizer)以结构化原子基元组合方式演化奖励函数(Meta-Reward),内层策略训练则依赖该奖励进行学习;更重要的是,DERL在元优化过程中引入可微性,将内层验证性能作为梯度信号更新元优化器,从而近似“元梯度”(meta-gradient),使奖励函数逐步变得密集且更具行动指导意义。实验表明,DERL在ALFWorld、ScienceWorld和GSM8k等多领域均达到SOTA性能,尤其在分布外场景下显著优于依赖人工启发式奖励的方法。

链接: https://arxiv.org/abs/2512.13399
作者: Sitao Cheng,Tianle Li,Xuhan Huang,Xunjian Yin,Difan Zou
机构: University of Waterloo (滑铁卢大学); The University of Hong Kong (香港大学); The Chinese University of Hong Kong, Shenzhen (深圳校区,香港中文大学); Duke University (杜克大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Work in Progress. We release our code and model at this https URL

点击查看摘要

Abstract:The design of effective reward functions presents a central and often arduous challenge in reinforcement learning (RL), particularly when developing autonomous agents for complex reasoning tasks. While automated reward optimization approaches exist, they typically rely on derivative-free evolutionary heuristics that treat the reward function as a black box, failing to capture the causal relationship between reward structure and task performance. To bridge this gap, we propose Differentiable Evolutionary Reinforcement Learning (DERL), a bilevel framework that enables the autonomous discovery of optimal reward signals. In DERL, a Meta-Optimizer evolves a reward function (i.e., Meta-Reward) by composing structured atomic primitives, guiding the training of an inner-loop policy. Crucially, unlike previous evolution, DERL is differentiable in its metaoptimization: it treats the inner-loop validation performance as a signal to update the Meta-Optimizer via reinforcement learning. This allows DERL to approximate the “meta-gradient” of task success, progressively learning to generate denser and more actionable feedback. We validate DERL across three distinct domains: robotic agent (ALFWorld), scientific simulation (ScienceWorld), and mathematical reasoning (GSM8k, MATH). Experimental results show that DERL achieves state-of-the-art performance on ALFWorld and ScienceWorld, significantly outperforming methods relying on heuristic rewards, especially in out-of-distribution scenarios. Analysis of the evolutionary trajectory demonstrates that DERL successfully captures the intrinsic structure of tasks, enabling selfimproving agent alignment without human intervention.
zh

[NLP-22] Detecting Emotion Drift in Mental Health Text Using Pre-Trained Transformers

【速读】: 该论文试图解决的问题是:在心理健康相关文本中,情感状态随文本推进而发生的变化(即情绪漂移,emotion drift)常被传统情感分析方法所忽略,而这类细微的情绪波动对理解用户心理状态具有重要意义。解决方案的关键在于利用预训练的Transformer模型(如DistilBERT和RoBERTa)进行句级情感检测,并据此计算情绪漂移得分,从而量化文本内部情绪的上升或缓解模式,为揭示心理健康对话中的情绪动态提供可量化的分析框架。

链接: https://arxiv.org/abs/2512.13363
作者: Shibani Sankpal
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 14 pages, 12 figures

点击查看摘要

Abstract:This study investigates emotion drift: the change in emotional state across a single text, within mental health-related messages. While sentiment analysis typically classifies an entire message as positive, negative, or neutral, the nuanced shift of emotions over the course of a message is often overlooked. This study detects sentence-level emotions and measures emotion drift scores using pre-trained transformer models such as DistilBERT and RoBERTa. The results provide insights into patterns of emotional escalation or relief in mental health conversations. This methodology can be applied to better understand emotional dynamics in content.
zh

[NLP-23] On the Effectiveness of Membership Inference in Targeted Data Extraction from Large Language Models

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)因记忆训练数据而引发的隐私风险问题,特别是训练数据提取(training data extraction)与成员推理攻击(Membership Inference Attacks, MIAs)之间的关联性。其解决方案的关键在于将多种成员推理攻击(MIA)技术整合进数据提取流程中,构建一个系统化的评估框架,从而在真实场景下量化不同MIA方法在识别训练数据点方面的有效性,并对比传统MIA基准测试结果,以评估其实际应用价值。

链接: https://arxiv.org/abs/2512.13352
作者: Ali Al Sahili,Ali Chehab,Razane Tajeddine
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注: Accepted to IEEE Conference on Secure and Trustworthy Machine Learning (SaTML) 2026

点击查看摘要

Abstract:Large Language Models (LLMs) are prone to mem- orizing training data, which poses serious privacy risks. Two of the most prominent concerns are training data extraction and Membership Inference Attacks (MIAs). Prior research has shown that these threats are interconnected: adversaries can extract training data from an LLM by querying the model to generate a large volume of text and subsequently applying MIAs to verify whether a particular data point was included in the training set. In this study, we integrate multiple MIA techniques into the data extraction pipeline to systematically benchmark their effectiveness. We then compare their performance in this integrated setting against results from conventional MIA bench- marks, allowing us to evaluate their practical utility in real-world extraction scenarios.
zh

[NLP-24] FIN-bench-v2: A Unified and Robust Benchmark Suite for Evaluating Finnish Large Language Models

【速读】: 该论文旨在解决芬兰语大语言模型(Large Language Models, LLMs)缺乏统一、全面且高质量评估基准的问题。现有评估资源分散、格式不一,难以系统性地衡量模型在芬兰语环境下的多维度能力。解决方案的关键在于构建FIN-bench-v2——一个整合芬兰语版本主流基准与原版FIN-bench升级扩展的统一评估套件,涵盖多项选择和生成式任务,覆盖阅读理解、常识推理、情感分析、世界知识及对齐等核心能力;其创新性体现在:所有数据集均标准化为HuggingFace Datasets格式并提供五种提示变体,引入人工标注或审核以提升机器翻译资源(如GoldenSwag和XED)的质量,并通过预训练小型decoder-only模型(2.15B参数)进行学习曲线分析,筛选出满足单调性、信噪比、非随机性能及模型排序一致性四项指标的稳健任务,从而确保评估结果的可靠性与可比性。

链接: https://arxiv.org/abs/2512.13330
作者: Joona Kytöniemi,Jousia Piha,Akseli Reunamo,Fedor Vitiugin,Farrokh Mehryary,Sampo Pyysalo
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce FIN-bench-v2, a unified benchmark suite for evaluating large language models in Finnish. FIN-bench-v2 consolidates Finnish versions of widely used benchmarks together with an updated and expanded version of the original FIN-bench into a single, consistently formatted collection, covering multiple-choice and generative tasks across reading comprehension, commonsense reasoning, sentiment analysis, world knowledge, and alignment. All datasets are converted to HuggingFace Datasets, which include both cloze and multiple-choice prompt formulations with five variants per task, and we incorporate human annotation or review for machine-translated resources such as GoldenSwag and XED. To select robust tasks, we pretrain a set of 2.15B-parameter decoder-only models and use their learning curves to compute monotonicity, signal-to-noise, non-random performance, and model ordering consistency, retaining only tasks that satisfy all criteria. We further evaluate a set of larger instruction-tuned models to characterize performance across tasks and prompt formulations. All datasets, prompts, and evaluation configurations are publicly available via our fork of the Language Model Evaluation Harness at this https URL. Supplementary resources are released in a separate repository at this https URL.
zh

[NLP-25] MiniLingua: A Small Open-Source LLM for European Languages

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在计算成本高、隐私风险大以及英语主导训练数据带来的多语言能力不足等问题。其解决方案的关键在于设计并训练一个仅含十亿参数的轻量级多语言模型 MiniLingua,该模型从零开始训练,覆盖13种欧洲语言,并通过指令微调(instruction-tuning)优化其任务执行能力。实验表明,MiniLingua 在摘要生成、分类及开闭卷问答等任务上优于具有相似训练策略但更大规模预算的 EuroLLM 模型,同时在开放式生成任务中仍具备与更先进模型相当的竞争力,从而实现了在资源受限场景下兼顾多语言覆盖与高质量指令遵循能力的平衡。

链接: https://arxiv.org/abs/2512.13298
作者: Anna Aksenova,Boris Zverkov,Nicola Dainese,Alexander Nikitin,Pekka Marttinen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9+6 pages, 6 figures and 3 tables in the main text. Code at this https URL

点击查看摘要

Abstract:Large language models are powerful but often limited by high computational cost, privacy concerns, and English-centric training. Recent progress demonstrates that small, efficient models with around one billion parameters can deliver strong results and enable on-device use. This paper introduces MiniLingua, a multilingual open-source LLM of one billion parameters trained from scratch for 13 European languages, designed to balance coverage and instruction-following capabilities. Based on evaluation results, the instruction-tuned version of MiniLingua outperforms EuroLLM, a model with a similar training approach but a larger training budget, on summarization, classification and both open- and closed-book question answering. Moreover, it remains competitive with more advanced state-of-the-art models on open-ended generation tasks. We release model weights, tokenizer and source code used for data processing and model training.
zh

[NLP-26] Integrating Causal Reasoning into Automated Fact-Checking

【速读】: 该论文旨在解决当前自动化事实核查方法中缺乏基于因果关系的推理能力的问题,这导致其可能忽略语义丰富的解释线索,从而影响核查结果的准确性与可解释性。解决方案的关键在于提出一种融合事件关系抽取(event relation extraction)、语义相似度计算(semantic similarity computation)与规则推理(rule-based reasoning)的方法,用于识别声明与证据中事件链之间的逻辑不一致性,从而实现细粒度因果关系的整合,并提升判别结果的可解释性。

链接: https://arxiv.org/abs/2512.13286
作者: Youssra Rebboud,Pasquale Lisena,Raphael Troncy
机构: EURECOM(欧洲电信学院)
类目: Computation and Language (cs.CL)
备注: Extended version of the accepted ACM SAC paper

点击查看摘要

Abstract:In fact-checking applications, a common reason to reject a claim is to detect the presence of erroneous cause-effect relationships between the events at play. However, current automated fact-checking methods lack dedicated causal-based reasoning, potentially missing a valuable opportunity for semantically rich explainability. To address this gap, we propose a methodology that combines event relation extraction, semantic similarity computation, and rule-based reasoning to detect logical inconsistencies between chains of events mentioned in a claim and in an evidence. Evaluated on two fact-checking datasets, this method establishes the first baseline for integrating fine-grained causal event relationships into fact-checking and enhance explainability of verdict prediction.
zh

[NLP-27] AIR: Post-training Data Selection for Reasoning via Attention Head Influence

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在后训练蒸馏(post-training distillation)过程中,如何高效迁移其多步推理能力的问题。现有数据选择方法(如基于长度、熵或整体损失的启发式策略)无法捕捉单个推理步骤的因果重要性,导致蒸馏效率受限。解决方案的关键在于提出Attention Influence for Reasoning (AIR),一个基于机制洞察的无监督、无需再训练的框架:通过识别预训练模型中对推理至关重要的注意力头,构建禁用该头影响的弱化参考模型,并以损失差异作为Attention Influence Score(注意力影响分数),实现步骤级和样本级的细粒度评估,从而支持步骤加权微调与全局样本筛选,显著提升推理准确率。

链接: https://arxiv.org/abs/2512.13279
作者: Jinrui Liu,Jeff Wu,Xuanguang Pan,Gavin Cheung,Shuai Ma,Chongyang Tao
机构: 未知
类目: Computation and Language (cs.CL)
备注: 19 pages

点击查看摘要

Abstract:LLMs achieve remarkable multi-step reasoning capabilities, yet effectively transferring these skills via post-training distillation remains challenging. Existing data selection methods, ranging from manual curation to heuristics based on length, entropy, or overall loss, fail to capture the causal importance of individual reasoning steps, limiting distillation efficiency. To address this, we propose Attention Influence for Reasoning (AIR), a principled, unsupervised and training-free framework that leverages mechanistic insights of the retrieval head to select high-value post-training data. AIR first identifies reasoning-critical attention heads of an off-the-shelf model, then constructs a weakened reference model with disabled head influence, and finally quantifies the resulting loss divergence as the Attention Influence Score. This score enables fine-grained assessment at both the step and sample levels, supporting step-level weighted fine-tuning and global sample selection. Experiments across multiple reasoning benchmarks show that AIR consistently improves reasoning accuracy, surpassing heuristic baselines and effectively isolating the most critical steps and samples. Our work establishes a mechanism-driven, data-efficient approach for reasoning distillation in LLMs.
zh

[NLP-28] AutoTool: Dynamic Tool Selection and Integration for Agent ic Reasoning ICCV2025

【速读】: 该论文旨在解决现有基于强化学习的智能体(Agentic reinforcement learning)在工具使用上的静态局限性问题,即传统方法假设工具集固定,难以适应新出现或动态变化的工具集合,从而限制了大语言模型(Large Language Models, LLMs)代理在复杂任务中的灵活性与泛化能力。解决方案的关键在于提出AutoTool框架,其核心创新是通过一个双阶段优化流程实现推理轨迹中动态、一致的工具选择:首先利用监督学习和强化学习稳定长链式思维(chain-of-thought)推理过程;其次采用KL正则化的Plackett-Luce排序机制优化多步工具选择策略,使代理能够在推理过程中根据任务需求灵活调用未见过的新工具,显著提升在数学、科学推理、代码生成及多模态理解等多样化基准上的性能表现。

链接: https://arxiv.org/abs/2512.13278
作者: Jiaru Zou,Ling Yang,Yunzhe Qi,Sirui Chen,Mengting Ai,Ke Shen,Jingrui He,Mengdi Wang
机构: Princeton University (普林斯顿大学); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Best Paper Award at ICCV 2025 Workshop on Multi-Modal Reasoning for Agentic Intelligence

点击查看摘要

Abstract:Agentic reinforcement learning has advanced large language models (LLMs) to reason through long chain-of-thought trajectories while interleaving external tool use. Existing approaches assume a fixed inventory of tools, limiting LLM agents’ adaptability to new or evolving toolsets. We present AutoTool, a framework that equips LLM agents with dynamic tool-selection capabilities throughout their reasoning trajectories. We first construct a 200k dataset with explicit tool-selection rationales across 1,000+ tools and 100+ tasks spanning mathematics, science, code generation, and multimodal reasoning. Building on this data foundation, AutoTool employs a dual-phase optimization pipeline: (i) supervised and RL-based trajectory stabilization for coherent reasoning, and (ii) KL-regularized Plackett-Luce ranking to refine consistent multi-step tool selection. Across ten diverse benchmarks, we train two base models, Qwen3-8B and Qwen2.5-VL-7B, with AutoTool. With fewer parameters, AutoTool consistently outperforms advanced LLM agents and tool-integration methods, yielding average gains of 6.4% in math science reasoning, 4.5% in search-based QA, 7.7% in code generation, and 6.9% in multimodal understanding. In addition, AutoTool exhibits stronger generalization by dynamically leveraging unseen tools from evolving toolsets during inference.
zh

[NLP-29] Efficient Adaptive Rejection Sampling for Accelerating Speculative Decoding in Large Language Models

【速读】: 该论文旨在解决生成式 AI(Generative AI)中基于推测解码(Speculative Decoding)技术在高不确定性生成场景下因固定、上下文无关的随机拒绝阈值导致的“随机拒绝”问题,即合理候选 token 因随机性被频繁丢弃,从而降低推理效率。解决方案的关键在于提出一种高效自适应拒绝采样(Efficient Adaptive Rejection Sampling, EARS)方法,其核心是通过引入目标模型自身的预测不确定性度量(1 - \max(P_\mathrmtarget))动态调整接受阈值,并以该不确定性的比例作为容差项来智能放宽接受标准,从而在模型置信度低时减少随机拒绝,同时在高置信度时保持严格验证,实现效率与准确性的平衡。

链接: https://arxiv.org/abs/2512.13194
作者: Chendong Sun
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Speculative Decoding is a prominent technique for accelerating the autoregressive inference of large language models (LLMs) by employing a fast draft model to propose candidate token sequences and a large target model to verify them in parallel. However, its core component – the rejection sampling mechanism – relies on a fixed, context-independent random threshold. This leads to a significant “random rejection” problem in high-uncertainty generation scenarios, where plausible candidate tokens are frequently rejected due to random chance, undermining inference efficiency. This paper introduces Efficient Adaptive Rejection Sampling (EARS), a novel method that dynamically adjusts the acceptance threshold by incorporating the target model’s own predictive uncertainty, measured as (1 - \max(P_\mathrmtarget)). By introducing a tolerance term proportional to this uncertainty, EARS intelligently relaxes the acceptance criterion when the model is uncertain, effectively reducing random rejections while maintaining strict standards when the model is confident. Experiments on creative writing and open-domain QA tasks demonstrate that EARS significantly enhances the efficiency of speculative decoding, achieving up to an 18.12% increase in throughput with a negligible 0.84% accuracy drop on the GSM8K benchmark. The method requires no modifications to model architectures and can be seamlessly integrated into existing speculative decoding frameworks.
zh

[NLP-30] SpeakRL: Synergizing Reasoning Speaking and Acting in Language Models with Reinforcement Learning

【速读】: 该论文旨在解决当前人机协作中代理(agent)行为过于被动的问题,即现有系统多为单向交互模式,代理仅根据用户指令或问题作出响应,缺乏主动澄清意图、消除歧义及适应动态情境的能力。其核心挑战在于如何提升代理的对话主动性(conversational proactivity),使其不仅作为“跟随者”还能成为有效“参与者”。解决方案的关键是提出一种基于强化学习(Reinforcement Learning, RL)的方法 SpeakRL,通过设计合理的奖励机制来激励代理在适当时候主动提问以获取清晰任务信息;同时构建了 SpeakER 数据集,涵盖多种任务导向对话场景,支持训练代理进行有效的交互式澄清。实证结果表明,该方法在不增加对话轮次的前提下,任务完成率较基线模型提升 20.14%,甚至优于更大规模的商业模型,验证了以澄清为核心的用户-代理互动范式的有效性。

链接: https://arxiv.org/abs/2512.13159
作者: Emre Can Acikgoz,Jinoh Oh,Jie Hao,Joo Hyuk Jeon,Heng Ji,Dilek Hakkani-Tür,Gokhan Tur,Xiang Li,Chengyuan Ma,Xing Fan
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Amazon Alexa (亚马逊Alexa)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Effective human-agent collaboration is increasingly prevalent in real-world applications. Current trends in such collaborations are predominantly unidirectional, with users providing instructions or posing questions to agents, where agents respond directly without seeking necessary clarifications or confirmations. However, the evolving capabilities of these agents require more proactive engagement, where agents should dynamically participate in conversations to clarify user intents, resolve ambiguities, and adapt to changing circumstances. Existing prior work under-utilize the conversational capabilities of language models (LMs), thereby optimizing agents as better followers rather than effective speakers. In this work, we introduce SpeakRL, a reinforcement learning (RL) method that enhances agents’ conversational capabilities by rewarding proactive interactions with users, such as asking right clarification questions when necessary. To support this, we curate SpeakER, a synthetic dataset that includes diverse scenarios from task-oriented dialogues, where tasks are resolved through interactive clarification questions. We present a systematic analysis of reward design for conversational proactivity and propose a principled reward formulation for teaching agents to balance asking with acting. Empirical evaluations demonstrate that our approach achieves a 20.14% absolute improvement in task completion over base models without increasing conversation turns even surpassing even much larger proprietary models, demonstrating the promise of clarification-centric user-agent interactions.
zh

[NLP-31] MAC: A Multi-Agent Framework for Interactive User Clarification in Multi-turn Conversations

【速读】: 该论文旨在解决对话代理在面对用户请求模糊时,如何有效进行澄清以提升任务完成率的问题。其核心挑战在于多代理架构下如何确定由哪个代理发起澄清、以及如何协调各代理动作以应对不确定或不完整的用户输入。解决方案的关键在于提出MAC(Multi-Agent Clarification)框架,该框架通过引入一种新的用户模糊性分类体系来系统指导澄清策略,并实现多个代理自主协同与用户交互,从而在MultiWOZ 2.4数据集上将任务成功率提升7.8%(从54.5%到62.3%),同时减少平均对话轮数(从6.53降至4.86),显著提升了人机交互的可靠性与效率。

链接: https://arxiv.org/abs/2512.13154
作者: Emre Can Acikgoz,Jinoh Oh,Joo Hyuk Jeon,Jie Hao,Heng Ji,Dilek Hakkani-Tür,Gokhan Tur,Xiang Li,Chengyuan Ma,Xing Fan
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Amazon Alexa (亚马逊语音助手)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Conversational agents often encounter ambiguous user requests, requiring an effective clarification to successfully complete tasks. While recent advancements in real-world applications favor multi-agent architectures to manage complex conversational scenarios efficiently, ambiguity resolution remains a critical and underexplored challenge–particularly due to the difficulty of determining which agent should initiate a clarification and how agents should coordinate their actions when faced with uncertain or incomplete user input. The fundamental questions of when to interrupt a user and how to formulate the optimal clarification query within the most optimal multi-agent settings remain open. In this paper, we propose MAC (Multi-Agent Clarification), an interactive multi-agent framework specifically optimized to resolve user ambiguities by strategically managing clarification dialogues. We first introduce a novel taxonomy categorizing user ambiguities to systematically guide clarification strategies. Then, we present MAC that autonomously coordinates multiple agents to interact synergistically with users. Empirical evaluations on MultiWOZ 2.4 demonstrate that enabling clarification at both levels increases task success rate 7.8% (54.5 to 62.3) and reduces the average number of dialogue turns (6.53 to 4.86) by eliciting all required user information up front and minimizing repetition. Our findings highlight the importance of active user interaction and role-aware clarification for more reliable human-agent communication.
zh

[NLP-32] Uncovering the Role of Initial Saliency in U-Shaped Attention Bias: Scaling Initial Token Weight for Enhanced Long-Text Processing

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在处理长文本序列时因“中间丢失”(lost in the middle)现象导致性能下降的问题。研究表明,这一问题不仅源于位置编码(position encoding)引起的U形注意力偏差,还与初始显著性(initial saliency)有关——即在注意力计算中,相对于首 token 的注意力权重较高的 token 更容易获得后续预测的增强关注。解决方案的关键在于利用该初始显著性特性,通过缩放首 token 与其他 token 之间的注意力权重,从而提升模型对长上下文的建模能力,在 MDQA 数据集上实现最高达 3.6% 的性能提升;进一步结合现有缓解位置编码偏差的方法,可在 KV-Retrieval 任务中达到最大 3.4% 的改进。

链接: https://arxiv.org/abs/2512.13109
作者: Zewen Qiang,Sendong Zhao,Haochun Wang,Bing Qin,Ting Liu
机构: Harbin Institute of Technology (哈尔滨工业大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated strong performance on a variety of natural language processing (NLP) tasks. However, they often struggle with long-text sequences due to the ``lost in the middle’’ phenomenon. This issue has been shown to arise from a U-shaped attention bias, where attention is disproportionately focused on the beginning and end of a text, leaving the middle section underrepresented. While previous studies have attributed this bias to position encoding, our research first identifies an additional factor: initial saliency. It means that in the attention computation for each token, tokens with higher attention weights relative to the initial token tend to receive more attention in the prediction of the next token. We further find that utilizing this property by scaling attention weight between the initial token and others improves the model’s ability to process long contexts, achieving a maximum improvement of 3.6% in MDQA dataset. Moreover, combining this approach with existing methods to reduce position encoding bias further enhances performance, achieving a maximum improvement of 3.4% in KV-Retrieval tasks.
zh

[NLP-33] Heart Disease Prediction using Case Based Reasoning (CBR)

【速读】: 该论文旨在解决传统心脏病预测方法依赖医生经验、准确率不足的问题。其解决方案的关键在于引入智能系统以提升预测精度,具体比较了模糊逻辑(Fuzzy Logic)、神经网络(Neural Networks)和基于案例推理(Case-Based Reasoning, CBR)三种方法,并最终选定CBR作为最优方案。实验结果显示,CBR在处理预处理后的数据集时达到了97.95%的准确率,显著优于传统方法,从而验证了其在心脏病预测中的有效性。

链接: https://arxiv.org/abs/2512.13078
作者: Mohaiminul Islam Bhuiyan,Chan Hue Wah,Nur Shazwani Kamarudin,Nur Hafieza Ismail,Ahmad Fakhri Ab Nasir
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Published in Journal of Theoretical and Applied Information Technology on 31st October 2024. Vol.102. No. 20

点击查看摘要

Abstract:This study provides an overview of heart disease prediction using an intelligent system. Predicting disease accurately is crucial in the medical field, but traditional methods relying solely on a doctor’s experience often lack precision. To address this limitation, intelligent systems are applied as an alternative to traditional approaches. While various intelligent system methods exist, this study focuses on three: Fuzzy Logic, Neural Networks, and Case-Based Reasoning (CBR). A comparison of these techniques in terms of accuracy was conducted, and ultimately, Case-Based Reasoning (CBR) was selected for heart disease prediction. In the prediction phase, the heart disease dataset underwent data pre-processing to clean the data and data splitting to separate it into training and testing sets. The chosen intelligent system was then employed to predict heart disease outcomes based on the processed data. The experiment concluded with Case-Based Reasoning (CBR) achieving a notable accuracy rate of 97.95% in predicting heart disease. The findings also revealed that the probability of heart disease was 57.76% for males and 42.24% for females. Further analysis from related studies suggests that factors such as smoking and alcohol consumption are significant contributors to heart disease, particularly among males.
zh

[NLP-34] M-GRPO: Stabilizing Self-Supervised Reinforcement Learning for Large Language Models with Momentum-Anchored Policy Optimization NEURIPS2025

【速读】: 该论文旨在解决自监督强化学习(Self-supervised Reinforcement Learning, RL)在长时程训练中出现的“策略坍缩”(policy collapse)问题,即模型性能急剧下降的现象。现有方法虽通过增加采样轨迹数量来提升性能,但仅能延缓而无法根本避免坍缩。其关键解决方案包括两个创新:一是提出M-GRPO(Momentum-Anchored Group Relative Policy Optimization),利用一个缓慢演化的动量模型作为稳定的目标函数;二是引入基于四分位距(Interquartile Range, IQR)的自适应过滤机制,动态剔除低熵轨迹以维持策略多样性,从而防止过早收敛。二者协同作用显著提升了训练稳定性并实现了当前最优的推理性能。

链接: https://arxiv.org/abs/2512.13070
作者: Bizhe Bai,Hongming Wu,Peng Ye,Tao Chen
机构: Shanghai Innovation Institute; College of Future Information Technology, Fudan; Shanghai AI Laboratory; The Chinese University of Hong Kong
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 7 pages, 5 figures,Accepted NeurIPS 2025 Workshop on Efficient Reasoning

点击查看摘要

Abstract:Self-supervised reinforcement learning (RL) presents a promising approach for enhancing the reasoning capabilities of Large Language Models (LLMs) without reliance on expensive human-annotated data. However, we find that existing methods suffer from a critical failure mode under long-horizon training: a “policy collapse” where performance precipitously degrades. We diagnose this instability and demonstrate that simply scaling the number of rollouts – a common strategy to improve performance – only delays, but does not prevent, this collapse. To counteract this instability, we first introduce M-GRPO (Momentum-Anchored Group Relative Policy Optimization), a framework that leverages a slowly evolving momentum model to provide a stable training target. In addition, we identify that this process is often accompanied by a rapid collapse in policy entropy, resulting in a prematurely confident and suboptimal policy. To specifically address this issue, we propose a second contribution: an adaptive filtering method based on the interquartile range (IQR) that dynamically prunes low-entropy trajectories, preserving essential policy diversity. Our extensive experiments on multiple reasoning benchmarks demonstrate that M-GRPO stabilizes the training process while the IQR filter prevents premature convergence. The combination of these two innovations leads to superior training stability and state-of-the-art performance.
zh

[NLP-35] LLM Rationalis? Measuring Bargaining Capabilities of AI Negotiators NEURIPS2025

【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在双边谈判中缺乏情境适应性和策略灵活性的问题,即LLMs难以像人类那样根据权力不对称、市场背景和对手行为动态调整让步节奏与策略。其解决方案的关键在于提出一个统一的数学框架,基于双曲正切函数(hyperbolic tangent curve)建模让步动态,并引入两个量化指标——突发性(burstiness τ)和让步刚性指数(Concession-Rigidity Index, CRI),用于刻画谈判过程中报价的时间分布特征与策略刚性程度。通过大规模实证比较,发现LLMs倾向于锚定在可达成协议区间的极端位置,且不随上下文或议价能力变化而调整,揭示了现有模型在内部化对手推理和情境依赖策略方面的根本局限。

链接: https://arxiv.org/abs/2512.13063
作者: Cheril Shah,Akshit Agarwal,Kanak Garg,Mourad Heddaya
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Published in the First Workshop on Multi-Turn Interactions in Large Language Models at Neurips 2025

点击查看摘要

Abstract:Bilateral negotiation is a complex, context-sensitive task in which human negotiators dynamically adjust anchors, pacing, and flexibility to exploit power asymmetries and informal cues. We introduce a unified mathematical framework for modeling concession dynamics based on a hyperbolic tangent curve, and propose two metrics burstiness tau and the Concession-Rigidity Index (CRI) to quantify the timing and rigidity of offer trajectories. We conduct a large-scale empirical comparison between human negotiators and four state-of-the-art large language models (LLMs) across natural-language and numeric-offers settings, with and without rich market context, as well as six controlled power-asymmetry scenarios. Our results reveal that, unlike humans who smoothly adapt to situations and infer the opponents position and strategies, LLMs systematically anchor at extremes of the possible agreement zone for negotiations and optimize for fixed points irrespective of leverage or context. Qualitative analysis further shows limited strategy diversity and occasional deceptive tactics used by LLMs. Moreover the ability of LLMs to negotiate does not improve with better models. These findings highlight fundamental limitations in current LLM negotiation capabilities and point to the need for models that better internalize opponent reasoning and context-dependent strategy.
zh

[NLP-36] An Open and Reproducible Deep Research Agent for Long-Form Question Answering NEURIPS

【速读】: 该论文旨在解决开放域长文本问答(long-form question answering)中答案质量不足的问题,尤其是在真实世界开放域场景下,如何通过迭代式检索、推理与合成提升回答的准确性与深度。其解决方案的关键在于结合开源大语言模型(LLM)与开放网络搜索API,构建一个可迭代优化的问答系统,并引入基于LLM-as-a-judge的偏好微调(preference tuning),从清晰度(clarity)、洞察力(insightfulness)和事实性(factuality)三个维度对生成结果进行多方面评估与优化,从而显著提升最终答案的质量。

链接: https://arxiv.org/abs/2512.13059
作者: Ikuya Yamada,Wataru Ikeda,Ko Yoshida,Mengyu Ye,Hinata Sugimoto,Masatoshi Suzuki,Hisanori Ozaki,Jun Suzuki
机构: Studio Ousia(工作室Ousia); Tohoku University(东北大学); DENTSU SOKEN INC.(电通索肯公司); RIKEN(理化学研究所); NII LLMC(日本信息研究所大语言模型中心)
类目: Computation and Language (cs.CL)
备注: Technical report of a winning system in the NeurIPS MMU-RAG competition

点击查看摘要

Abstract:We present an open deep research system for long-form question answering, selected as a winning system in the text-to-text track of the MMU-RAG competition at NeurIPS 2025. The system combines an open-source large language model (LLM) with an open web search API to perform iterative retrieval, reasoning, and synthesis in real-world open-domain settings. To enhance reasoning quality, we apply preference tuning based on LLM-as-a-judge feedback that evaluates multiple aspects, including clarity, insightfulness, and factuality. Our experimental results show that the proposed method consistently improves answer quality across all three aspects. Our source code is publicly available at this https URL.
zh

[NLP-37] Understanding Structured Financial Data with LLM s: A Case Study on Fraud Detection

【速读】: 该论文旨在解决金融交易欺诈检测中传统表格模型依赖复杂特征工程、可解释性差,以及大型语言模型(LLM)直接应用于表格数据时因多维特征推理困难、类别极度不平衡和缺乏上下文信息而导致性能不佳的问题。解决方案的关键在于提出一种两阶段框架 FinFRE-RAG:首先通过重要性引导的特征缩减(importance-guided feature reduction)将数值型和分类型属性序列化为自然语言描述;其次在标签感知、实例级别的示例库中进行检索增强的上下文学习(retrieval-augmented in-context learning),从而提升 LLM 在欺诈检测任务中的准确性和可解释性。

链接: https://arxiv.org/abs/2512.13040
作者: Xuwei Tan,Yao Ma,Xueru Zhang
机构: The Ohio State University (俄亥俄州立大学); Coinbase, Inc. (Coinbase公司)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Detecting fraud in financial transactions typically relies on tabular models that demand heavy feature engineering to handle high-dimensional data and offer limited interpretability, making it difficult for humans to understand predictions. Large Language Models (LLMs), in contrast, can produce human-readable explanations and facilitate feature analysis, potentially reducing the manual workload of fraud analysts and informing system refinements. However, they perform poorly when applied directly to tabular fraud detection due to the difficulty of reasoning over many features, the extreme class imbalance, and the absence of contextual information. To bridge this gap, we introduce FinFRE-RAG, a two-stage approach that applies importance-guided feature reduction to serialize a compact subset of numeric/categorical attributes into natural language and performs retrieval-augmented in-context learning over label-aware, instance-level exemplars. Across four public fraud datasets and three families of open-weight LLMs, FinFRE-RAG substantially improves F1/MCC over direct prompting and is competitive with strong tabular baselines in several settings. Although these LLMs still lag behind specialized classifiers, they narrow the performance gap and provide interpretable rationales, highlighting their value as assistive tools in fraud analysis.
zh

[NLP-38] Scaling Bidirectional Spans and Span Violations in Attention Mechanism

【速读】: 该论文旨在解决标准Transformer模型在序列建模中因梯度传播的几何低效性而导致的训练优化瓶颈问题。其解决方案的关键在于提出一种基于非对称投影(asymmetric projection)的优化框架,该框架能够在保持原始前向传递中QKV结构不变的前提下,将反向传播的梯度分解为并行跨度(parallel spans)和正交违规项(orthogonal violations),从而识别出次优的梯度分布。实验表明,通过有选择地放大零阶双向并行跨度分量,可显著提升学习信号的有效性,在WikiText-2小数据集上实现了验证损失降低0.56%,验证了该方法的理论合理性与潜在性能优势。

链接: https://arxiv.org/abs/2512.13033
作者: Jongwook Kim,Sangheon Yun,Sukjin Yoon
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The canonical O(N^2) Transformer remains the empirical performance frontier in sequence modeling, and its training can be further optimized by addressing geometric inefficiency. We propose an optimization framework that leverages an asymmetric projection to decompose the backward-pass gradients into parallel spans and orthogonal violations, while keeping the canonical forward-pass QKV structure intact. Through consistent experimental validation across various decomposition and projection setups, we provide strong theoretical evidence: the standard attention gradient is suboptimal. We demonstrated that selectively scaling these components, focusing primarily on 0^th order bidirectional parallel spans, yields the most effective learning signal. On the limited WikiText-2 dataset, and using a crude configuration, this method achieved a 0.56% reduction in validation loss, confirming the framework’s fundamental validity and suggesting significant potential gains on larger datasets and deeper training regimes
zh

[NLP-39] Authors Should Annotate

【速读】: 该论文旨在解决传统文本标注依赖第三方代理所带来的信息失真问题,尤其是在情感(sentiment)和信念等具身性(egocentric)特征标注中,第三方标注往往无法准确反映文档原始作者的真实意图。其核心解决方案是提出“作者标注”(author labeling)机制,即在文档创作时由作者实时对主观属性进行标注,而非事后由外部人员标注。关键创新在于构建一个在线学习系统,能自动识别相关任务查询、动态生成标注问题并实时记录作者反馈,从而持续优化产品推荐模型。实验表明,该方法相较行业广告基线使点击率提升534%,且在质量、速度与成本上优于三种传统标注方式,验证了作者标注在主观特征标注中的优越性。

链接: https://arxiv.org/abs/2512.12976
作者: Marcus Ma,Cole Johnson,Nolan Bridges,Jackson Trager,Georgios Chochlakis,Shrikanth Narayanan
机构: echo LLM; University of Southern California (南加州大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The status quo for labeling text is third-party annotation, but there are many cases where information directly from the document’s source would be preferable over a third-person proxy, especially for egocentric features like sentiment and belief. We introduce author labeling, an annotation technique where the writer of the document itself annotates the data at the moment of creation. We collaborate with a commercial chatbot with over 10,000 users to deploy an author labeling annotation system for subjective features related to product recommendation. This system identifies task-relevant queries, generates on-the-fly labeling questions, and records authors’ answers in real time. We train and deploy an online-learning model architecture for product recommendation that continuously improves from author labeling and find it achieved a 534% increase in click-through rate compared to an industry advertising baseline running concurrently. We then compare the quality and practicality of author labeling to three traditional annotation approaches for sentiment analysis and find author labeling to be higher quality, faster to acquire, and cheaper. These findings reinforce existing literature that annotations, especially for egocentric and subjective beliefs, are significantly higher quality when labeled by the author rather than a third party. To facilitate broader scientific adoption, we release an author labeling service for the research community at this http URL.
zh

[NLP-40] Qwen Long-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management

【速读】: 该论文旨在解决大语言模型在长上下文推理(long-context reasoning)能力上的瓶颈问题,尤其是如何有效处理跨越数百万token的复杂任务并保持推理稳定性与准确性。其核心解决方案包括三项关键技术突破:(1) 构建了长上下文数据合成流水线(Long-Context Data Synthesis Pipeline),通过将文档分解为原子事实及其关系,并程序化生成需多跳推理(multi-hop grounding)的可验证问题,从而大规模生成高质量训练数据;(2) 提出稳定化的强化学习方法(Stabilized Reinforcement Learning for Long-Context Training),采用任务平衡采样和任务特定优势估计缓解奖励偏差,并设计自适应熵控制策略优化(Adaptive Entropy-Controlled Policy Optimization, AEPO)以动态调节探索与利用权衡;(3) 引入记忆增强架构(Memory-Augmented Architecture)用于超长上下文场景,在4M token以上任务中实现单次推理与迭代式记忆处理的无缝融合。这些创新使QwenLong-L1.5在长上下文推理基准上达到接近GPT-5和Gemini-2.5-Pro的性能,显著优于基线模型。

链接: https://arxiv.org/abs/2512.12967
作者: Weizhou Shen,Ziyi Yang,Chenliang Li,Zhiyuan Lu,Miao Peng,Huashan Sun,Yingcheng Shi,Shengyi Liao,Shaopeng Lai,Bo Zhang,Dayiheng Liu,Fei Huang,Jingren Zhou,Ming Yan
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We introduce QwenLong-L1.5, a model that achieves superior long-context reasoning capabilities through systematic post-training innovations. The key technical breakthroughs of QwenLong-L1.5 are as follows: (1) Long-Context Data Synthesis Pipeline: We develop a systematic synthesis framework that generates challenging reasoning tasks requiring multi-hop grounding over globally distributed evidence. By deconstructing documents into atomic facts and their underlying relationships, and then programmatically composing verifiable reasoning questions, our approach creates high-quality training data at scale, moving substantially beyond simple retrieval tasks to enable genuine long-range reasoning capabilities. (2) Stabilized Reinforcement Learning for Long-Context Training: To overcome the critical instability in long-context RL, we introduce task-balanced sampling with task-specific advantage estimation to mitigate reward bias, and propose Adaptive Entropy-Controlled Policy Optimization (AEPO) that dynamically regulates exploration-exploitation trade-offs. (3) Memory-Augmented Architecture for Ultra-Long Contexts: Recognizing that even extended context windows cannot accommodate arbitrarily long sequences, we develop a memory management framework with multi-stage fusion RL training that seamlessly integrates single-pass reasoning with iterative memory-based processing for tasks exceeding 4M tokens. Based on Qwen3-30B-A3B-Thinking, QwenLong-L1.5 achieves performance comparable to GPT-5 and Gemini-2.5-Pro on long-context reasoning benchmarks, surpassing its baseline by 9.90 points on average. On ultra-long tasks (1M~4M tokens), QwenLong-L1.5’s memory-agent framework yields a 9.48-point gain over the agent baseline. Additionally, the acquired long-context reasoning ability translates to enhanced performance in general domains like scientific reasoning, memory tool using, and extended dialogue.
zh

[NLP-41] Building from Scratch: A Multi-Agent Framework with Human-in-the-Loop for Multilingual Legal Terminology Mapping

【速读】: 该论文旨在解决跨语言法律术语映射的准确性难题,特别是在中文与日文这类共享大量同形异义词(homograph)的语言对中,现有资源和标准化工具极为有限。其解决方案的关键在于提出一种人机协同的多智能体框架(multi-agent framework),将大型语言模型(LLM)与法律领域专家深度整合到从原始文档预处理、条文级对齐到术语抽取、映射及质量控制的全流程中。该框架通过明确分工实现高效协作:AI代理负责OCR、文本分割、语义对齐和初步术语提取等重复性任务,而人类专家则提供关键监督、审查与上下文判断,从而显著提升术语映射的精度与一致性,并具备优于传统人工方法的可扩展性。

链接: https://arxiv.org/abs/2512.12950
作者: Lingyi Meng,Maolin Liu,Hao Wang,Yilan Cheng,Qi Yang,Idlkaid Mohanmmed
机构: Shanghai University of Engineering Science (上海工程技术大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 43 pages, 6 fingures, accepted in Artificial Intelligence and Law

点击查看摘要

Abstract:Accurately mapping legal terminology across languages remains a significant challenge, especially for language pairs like Chinese and Japanese, which share a large number of homographs with different meanings. Existing resources and standardized tools for these languages are limited. To address this, we propose a human-AI collaborative approach for building a multilingual legal terminology database, based on a multi-agent framework. This approach integrates advanced large language models and legal domain experts throughout the entire process-from raw document preprocessing, article-level alignment, to terminology extraction, mapping, and quality assurance. Unlike a single automated pipeline, our approach places greater emphasis on how human experts participate in this multi-agent system. Humans and AI agents take on different roles: AI agents handle specific, repetitive tasks, such as OCR, text segmentation, semantic alignment, and initial terminology extraction, while human experts provide crucial oversight, review, and supervise the outputs with contextual knowledge and legal judgment. We tested the effectiveness of this framework using a trilingual parallel corpus comprising 35 key Chinese statutes, along with their English and Japanese translations. The experimental results show that this human-in-the-loop, multi-agent workflow not only improves the precision and consistency of multilingual legal terminology mapping but also offers greater scalability compared to traditional manual methods.
zh

[NLP-42] SignRAG : A Retrieval-Augmented System for Scalable Zero-Shot Road Sign Recognition

【速读】: 该论文旨在解决智能交通系统中道路标志识别(road sign recognition)面临的两大挑战:一是标志类别繁多导致传统深度学习方法难以构建全覆盖的标注数据集;二是现有方法在真实场景下泛化能力不足。解决方案的关键在于引入基于检索增强生成(Retrieval-Augmented Generation, RAG)的零样本识别框架:首先利用视觉语言模型(Vision Language Model, VLM)从输入图像中生成文本描述,再通过该描述从参考设计的向量数据库中检索最相关的候选标志,最后由大语言模型(Large Language Model, LLM)对检索结果进行细粒度推理并输出最终识别结果。此方法无需针对特定任务进行训练即可实现高精度识别,在理想图像上达到95.58%准确率,在真实道路数据上仍保持82.45%的准确率,验证了RAG架构在可扩展性和准确性上的优势。

链接: https://arxiv.org/abs/2512.12885
作者: Minghao Zhu,Zhihao Zhang,Anmol Sidhu,Keith Redmill
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Robotics (cs.RO)
备注: Submitted to IV 2026

点击查看摘要

Abstract:Automated road sign recognition is a critical task for intelligent transportation systems, but traditional deep learning methods struggle with the sheer number of sign classes and the impracticality of creating exhaustive labeled datasets. This paper introduces a novel zero-shot recognition framework that adapts the Retrieval-Augmented Generation (RAG) paradigm to address this challenge. Our method first uses a Vision Language Model (VLM) to generate a textual description of a sign from an input image. This description is used to retrieve a small set of the most relevant sign candidates from a vector database of reference designs. Subsequently, a Large Language Model (LLM) reasons over the retrieved candidates to make a final, fine-grained recognition. We validate this approach on a comprehensive set of 303 regulatory signs from the Ohio MUTCD. Experimental results demonstrate the framework’s effectiveness, achieving 95.58% accuracy on ideal reference images and 82.45% on challenging real-world road data. This work demonstrates the viability of RAG-based architectures for creating scalable and accurate systems for road sign recognition without task-specific training.
zh

[NLP-43] ERA-IT: Aligning Semantic Models with Revealed Economic Preference for Real-Time and Explainable Patent Valuation

【速读】: 该论文旨在解决在技术创新驱动的战略管理中,由于高维技术规格信息不对称导致的无形资产估值难题,特别是传统引文指标因数据积累的系统延迟而无法及时反映市场动态的问题。其解决方案的关键在于提出Economic Reasoning Alignment via Instruction Tuning (ERA-IT)框架,通过将专利续展历史视为一种可观察的经济偏好,并以此作为客观监督信号,对大语言模型(Large Language Models, LLMs)的生成式推理进行校准,实现经济语义对齐(Eco-Semantic Alignment),从而提升估值准确性并提供可解释的逻辑推理链,增强决策透明度。

链接: https://arxiv.org/abs/2512.12869
作者: Yoo Yongmin,Kim Seungwoo,Liu Jingjiang
机构: Macquarie University (麦考瑞大学); University of Technology Sydney (悉尼科技大学); Zhejiang University (浙江大学)
类目: Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Valuing intangible assets under uncertainty remains a critical challenge in the strategic management of technological innovation due to the information asymmetry inherent in high-dimensional technical specifications. Traditional bibliometric indicators, such as citation counts, fail to address this friction in a timely manner due to the systemic latency inherent in data accumulation. To bridge this gap, this study proposes the Economic Reasoning Alignment via Instruction Tuning (ERA-IT) framework. We theoretically conceptualize patent renewal history as a revealed economic preference and leverage it as an objective supervisory signal to align the generative reasoning of Large Language Models (LLMs) with market realities, a process we term Eco-Semantic Alignment. Using a randomly sampled dataset of 10,000 European Patent Office patents across diverse technological domains, we trained the model not only to predict value tiers but also to reverse-engineer the Economic Chain-of-Thought from unstructured text. Empirical results demonstrate that ERA-IT significantly outperforms both conventional econometric models and zero-shot LLMs in predictive accuracy. More importantly, by generating explicit, logically grounded rationales for valuation, the framework serves as a transparent cognitive scaffold for decision-makers, reducing the opacity of black-box AI in high-stakes intellectual property management.
zh

[NLP-44] Counting Clues: A Lightweight Probabilistic Baseline Can Match an LLM

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在临床诊断多选题任务中表现优异,但其性能是否源于底层概率推理机制尚不明确的问题。为探究这一点,作者基于MedQA数据集设计了一种轻量级的概率评分方法——频率基础概率排序器(Frequency-Based Probabilistic Ranker, FBPR),其核心在于利用大规模语料库中概念与诊断的共现统计信息,通过平滑的朴素贝叶斯模型对候选诊断进行打分。实验表明,FBPR在使用OLMo和Llama预训练语料作为统计来源时,性能可媲美对应LLM,且与LLM的预测结果差异显著,说明二者具有互补性。这一发现揭示了显式概率基线仍具重要价值,不仅提供了可靠的性能参照点,也为未来混合模型设计提供了额外信号。

链接: https://arxiv.org/abs/2512.12868
作者: Furong Jia,Yuan Pu,Finn Guo,Monica Agrawal
机构: Duke University (杜克大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) excel on multiple-choice clinical diagnosis benchmarks, yet it is unclear how much of this performance reflects underlying probabilistic reasoning. We study this through questions from MedQA, where the task is to select the most likely diagnosis. We introduce the Frequency-Based Probabilistic Ranker (FBPR), a lightweight method that scores options with a smoothed Naive Bayes over concept-diagnosis co-occurrence statistics from a large corpus. When co-occurrence statistics were sourced from the pretraining corpora for OLMo and Llama, FBPR achieves comparable performance to the corresponding LLMs pretrained on that same corpus. Direct LLM inference and FBPR largely get different questions correct, with an overlap only slightly above random chance, indicating complementary strengths of each method. These findings highlight the continued value of explicit probabilistic baselines: they provide a meaningful performance reference point and a complementary signal for potential hybridization. While the performance of LLMs seems to be driven by a mechanism other than simple frequency aggregation, we show that an approach similar to the historically grounded, low-complexity expert systems still accounts for a substantial portion of benchmark performance.
zh

[NLP-45] What Matters in Evaluating Book-Length Stories? A Systematic Study of Long Story Evaluation

【速读】: 该论文旨在解决长篇故事(100K tokens级别)的自动化评估问题,核心挑战在于如何有效识别读者最关注的评价维度,并设计高效且准确的评估方法。其解决方案的关键在于:首先构建首个大规模基准数据集LongStoryEval,包含600本新出版书籍及其细粒度的用户评论(按评价维度组织),通过分析这些评论提出结构化的8大类评价标准;其次比较三类评估方法(聚合式、增量更新式、摘要式),发现摘要式和聚合式方法表现更优,其中聚合式擅长细节评估,摘要式更具效率;最终基于摘要框架提出NovelCritique模型(8B参数规模),在多个评价维度上实现对人类评分的高一致性,显著优于GPT-4o等商用模型。

链接: https://arxiv.org/abs/2512.12839
作者: Dingyi Yang,Qin Jin
机构: Renmin University of China (中国人民大学)
类目: Computation and Language (cs.CL)
备注: 24 pages, 7 figures, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics

点击查看摘要

Abstract:In this work, we conduct systematic research in a challenging area: the automatic evaluation of book-length stories (100K tokens). Our study focuses on two key questions: (1) understanding which evaluation aspects matter most to readers, and (2) exploring effective methods for evaluating lengthy stories. We introduce the first large-scale benchmark, LongStoryEval, comprising 600 newly published books with an average length of 121K tokens (maximum 397K). Each book includes its average rating and multiple reader reviews, presented as critiques organized by evaluation aspects. By analyzing all user-mentioned aspects, we propose an evaluation criteria structure and conduct experiments to identify the most significant aspects among the 8 top-level criteria. For evaluation methods, we compare the effectiveness of three types: aggregation-based, incremental-updated, and summary-based evaluations. Our findings reveal that aggregation- and summary-based evaluations perform better, with the former excelling in detail assessment and the latter offering greater efficiency. Building on these insights, we further propose NovelCritique, an 8B model that leverages the efficient summary-based framework to review and score stories across specified aspects. NovelCritique outperforms commercial models like GPT-4o in aligning with human evaluations. Our datasets and codes are available at this https URL.
zh

[NLP-46] Hindsight is 20/20: Building Agent Memory that Retains Recalls and Reflects

【速读】: 该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的智能体(Agent)在长期对话中面临的记忆局限性问题,包括信息组织能力弱、证据与推理混淆、跨会话上下文维持困难以及缺乏可解释的推理机制。其核心解决方案是提出Hindsight记忆架构,将记忆视为结构化的第一类推理基底,通过四个逻辑网络分别管理世界事实、代理经验、实体摘要和演化信念,并引入“保留(retain)、回忆(recall)、反思(reflect)”三类操作来控制信息的添加、访问与更新。该设计使对话流逐步转化为可查询的结构化记忆库,并通过反射层对记忆进行推理以生成答案并可追溯地更新信息,在LongMemEval和LoCoMo等长程对话记忆基准上显著提升性能,验证了其在多轮交互和开放域问答中的有效性。

链接: https://arxiv.org/abs/2512.12818
作者: Chris Latimer,Nicoló Boschi,Andrew Neeser,Chris Bartholomew,Gaurav Srivastava,Xuan Wang,Naren Ramakrishnan
机构: Vectorize.io(矢量化公司); The Washington Post(华盛顿邮报); Virginia Tech(弗吉尼亚理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Agent memory has been touted as a dimension of growth for LLM-based applications, enabling agents that can accumulate experience, adapt across sessions, and move beyond single-shot question answering. The current generation of agent memory systems treats memory as an external layer that extracts salient snippets from conversations, stores them in vector or graph-based stores, and retrieves top-k items into the prompt of an otherwise stateless model. While these systems improve personalization and context carry-over, they still blur the line between evidence and inference, struggle to organize information over long horizons, and offer limited support for agents that must explain their reasoning. We present Hindsight, a memory architecture that treats agent memory as a structured, first-class substrate for reasoning by organizing it into four logical networks that distinguish world facts, agent experiences, synthesized entity summaries, and evolving beliefs. This framework supports three core operations – retain, recall, and reflect – that govern how information is added, accessed, and updated. Under this abstraction, a temporal, entity aware memory layer incrementally turns conversational streams into a structured, queryable memory bank, while a reflection layer reasons over this bank to produce answers and to update information in a traceable way. On key long-horizon conversational memory benchmarks like LongMemEval and LoCoMo, Hindsight with an open-source 20B model lifts overall accuracy from 39% to 83.6% over a full-context baseline with the same backbone and outperforms full context GPT-4o. Scaling the backbone further pushes Hindsight to 91.4% on LongMemEval and up to 89.61% on LoCoMo (vs. 75.78% for the strongest prior open system), consistently outperforming existing memory architectures on multi-session and open-domain questions.
zh

[NLP-47] Does Tone Change the Answer? Evaluating Prompt Politeness Effects on Modern LLM s: GPT Gemini LLaMA

【速读】: 该论文旨在解决交互语气(interaction tone)对大语言模型(Large Language Model, LLM)性能影响的系统性认知不足问题,尤其是不同模型家族在不同任务领域中对语气变化的敏感性差异。其解决方案的关键在于构建了一个结构化的评估框架,利用MMMLU基准测试,在STEM与人文学科六大任务上分别施加“非常友好”“中立”和“非常粗鲁”的提示变体,通过统计显著性检验量化语气对准确率的影响,并比较GPT-4o mini、Gemini 2.0 Flash和Llama 4 Scout三类主流模型的表现。结果表明,语气效应具有模型依赖性和领域特异性,仅在部分人文学科任务中显著,且随着任务聚合至领域层面而趋于不显著,揭示了数据规模与覆盖范围对检测语气效应的重要性。

链接: https://arxiv.org/abs/2512.12812
作者: Hanyu Cai,Binqi Shen,Lier Jin,Lan Hu,Xiaojing Fan
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Prompt engineering has emerged as a critical factor influencing large language model (LLM) performance, yet the impact of pragmatic elements such as linguistic tone and politeness remains underexplored, particularly across different model families. In this work, we propose a systematic evaluation framework to examine how interaction tone affects model accuracy and apply it to three recently released and widely available LLMs: GPT-4o mini (OpenAI), Gemini 2.0 Flash (Google DeepMind), and Llama 4 Scout (Meta). Using the MMMLU benchmark, we evaluate model performance under Very Friendly, Neutral, and Very Rude prompt variants across six tasks spanning STEM and Humanities domains, and analyze pairwise accuracy differences with statistical significance testing. Our results show that tone sensitivity is both model-dependent and domain-specific. Neutral or Very Friendly prompts generally yield higher accuracy than Very Rude prompts, but statistically significant effects appear only in a subset of Humanities tasks, where rude tone reduces accuracy for GPT and Llama, while Gemini remains comparatively tone-insensitive. When performance is aggregated across tasks within each domain, tone effects diminish and largely lose statistical significance. Compared with earlier researches, these findings suggest that dataset scale and coverage materially influence the detection of tone effects. Overall, our study indicates that while interaction tone can matter in specific interpretive settings, modern LLMs are broadly robust to tonal variation in typical mixed-domain use, providing practical guidance for prompt design and model selection in real-world deployments. Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) Cite as: arXiv:2512.12812 [cs.CL] (or arXiv:2512.12812v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2512.12812 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-48] State over Tokens: Characterizing the Role of Reasoning Tokens

【速读】: 该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)生成的推理令牌(reasoning tokens)看似模拟了人类的思维过程,但实证研究表明它们并非模型实际推理过程的忠实解释。为弥合这种“表象”与“功能”之间的差距,作者提出了状态过令牌(State over Tokens, SoT)概念框架。其解决方案的关键在于将推理令牌重新定义为一种外部化的计算状态(computational state),而非语言叙事;该状态是在模型无状态生成周期中唯一持续的信息载体,从而解释为何这些令牌能够驱动正确推理,却无需作为文本被读取时具有解释性,并揭示了此前被忽视的研究问题——即应从解码令牌作为状态的角度理解LLMs的实际推理机制。

链接: https://arxiv.org/abs/2512.12777
作者: Mosh Levy,Zohar Elyoseph,Shauli Ravfogel,Yoav Goldberg
机构: Bar-Ilan University (巴伊兰大学); University of Haifa (海法大学); New York University (纽约大学); Allen Institute for AI (艾伦人工智能研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) can generate reasoning tokens before their final answer to boost performance on complex tasks. While these sequences seem like human thought processes, empirical evidence reveals that they are not a faithful explanation of the model’s actual reasoning process. To address this gap between appearance and function, we introduce the State over Tokens (SoT) conceptual framework. SoT reframes reasoning tokens not as a linguistic narrative, but as an externalized computational state – the sole persistent information carrier across the model’s stateless generation cycles. This explains how the tokens can drive correct reasoning without being a faithful explanation when read as text and surfaces previously overlooked research questions on these tokens. We argue that to truly understand the process that LLMs do, research must move beyond reading the reasoning tokens as text and focus on decoding them as state.
zh

[NLP-49] Persistent Personas? Role-Playing Instruction Following and Safety in Extended Interactions

【速读】: 该论文旨在解决当前对人格设定大型语言模型(Persona-assigned Large Language Models, LLMs)的评估多局限于短对话场景,无法真实反映其在长期交互中表现的问题。现有评估方法难以捕捉模型在持续对话中的人格一致性、指令遵循能力及安全性随时间退化的现象。解决方案的关键在于提出一种结合长对话(超过100轮)与结构化评估数据集的新型评测协议,构建条件于对话上下文的基准测试(dialogue-conditioned benchmarks),从而系统性地测量长程对话对模型性能的影响。该方法揭示了人格忠实度(persona fidelity)在目标导向型对话中显著下降的现象,并识别出人格忠实度与指令遵循之间存在权衡关系,为未来改进LLMs在复杂人机交互中的稳定性提供了可量化、可复现的评估框架。

链接: https://arxiv.org/abs/2512.12775
作者: Pedro Henrique Luz de Araujo,Michael A. Hedderich,Ali Modarressi,Hinrich Schuetze,Benjamin Roth
机构: University of Vienna (维也纳大学); Doctoral School Computer Science, University of Vienna (维也纳大学计算机科学博士生院); Center for Information and Language Processing, LMU Munich (慕尼黑大学信息与语言处理中心); Munich Center for Machine Learning (慕尼黑机器学习中心); University of Vienna, Faculty of Philological and Cultural Studies (维也纳大学文理与文化研究学院)
类目: Computation and Language (cs.CL)
备注: 31 pages, 35 figures

点击查看摘要

Abstract:Persona-assigned large language models (LLMs) are used in domains such as education, healthcare, and sociodemographic simulation. Yet, they are typically evaluated only in short, single-round settings that do not reflect real-world usage. We introduce an evaluation protocol that combines long persona dialogues (over 100 rounds) and evaluation datasets to create dialogue-conditioned benchmarks that can robustly measure long-context effects. We then investigate the effects of dialogue length on persona fidelity, instruction-following, and safety of seven state-of-the-art open- and closed-weight LLMs. We find that persona fidelity degrades over the course of dialogues, especially in goal-oriented conversations, where models must sustain both persona fidelity and instruction following. We identify a trade-off between fidelity and instruction following, with non-persona baselines initially outperforming persona-assigned models; as dialogues progress and fidelity fades, persona responses become increasingly similar to baseline responses. Our findings highlight the fragility of persona applications in extended interactions and our work provides a protocol to systematically measure such failures.
zh

[NLP-50] Curió-Edu 7B: Examining Data Selection Impacts in LLM Continued Pretraining

【速读】: 该论文旨在解决如何高效地将通用语言模型适配到特定语言(如葡萄牙语)的问题,尤其是在资源有限的情况下,探究数据量与数据质量在模型语言适应中的相对作用。其解决方案的关键在于:通过对比大规模连续预训练(Curió 7B)与高质量子集训练(Curió-Edu 7B)的效果发现,即便使用仅10%的数据量和20%的计算资源,专注于教育与STEM领域的高质量文本也能显著提升模型性能,表明数据质量在语言适配中具有决定性作用,而不仅仅是数据规模。

链接: https://arxiv.org/abs/2512.12770
作者: Thales Sales Almeida,Rodrigo Nogueira,Hélio Pedrini
机构: Institute of Computing (IC); University of Campinas (UNICAMP); Maritaca AI
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Continued pretraining extends a language model’s capabilities by further exposing it to additional data, often tailored to a specific linguistic or domain context. This strategy has emerged as an efficient alternative to full retraining when adapting general-purpose models to new settings. In this work, we investigate this paradigm through Curió 7B, a 7-billion-parameter model derived from LLaMA-2 and trained on 100 billion Portuguese tokens from the ClassiCC-PT corpus - the most extensive Portuguese-specific continued-pretraining effort above the three-billion-parameter scale to date. Beyond scale, we investigate whether quantity alone suffices or whether data quality plays a decisive role in linguistic adaptation. To this end, we introduce Curió-Edu 7B, a variant trained exclusively on the educational and STEM-filtered subset of the same corpus, totaling just 10 billion tokens. Despite using only 10% of the data and 20% of the computation, Curió-Edu 7B surpasses the full-corpus model in our evaluations, demonstrating that data selection can be fundamental even when adapting models with limited prior exposure to the target language. The developed models are available at this https URL
zh

[NLP-51] Intelligent Scientific Literature Explorer using Machine Learning (ISLE)

【速读】: 该论文旨在解决科学文献爆炸式增长背景下,研究人员在发现、 contextualize(语境化)和解读相关文献时面临的挑战。传统基于关键词的检索系统语义理解能力有限,而现有AI工具多聚焦于孤立任务(如检索、聚类或文献计量可视化),缺乏对知识关联的统一建模。解决方案的关键在于构建一个集成系统,其核心包括:大规模数据获取(整合arXiv全文与OpenAlex结构化元数据)、混合检索架构(融合BM25词法搜索与嵌入语义搜索,采用倒数排名融合Reciprocal Rank Fusion)、可选的主题建模(基于BERTopic或非负矩阵分解以适应计算资源)、以及异构知识图谱构建(将论文、作者、机构、国家及提取主题统一为可解释结构)。该框架实现了多层次文献探索,不仅提升检索相关性与主题一致性,还揭示查询背后的概念与关系网络,为生成式AI辅助科学发现提供可扩展基础。

链接: https://arxiv.org/abs/2512.12760
作者: Sina Jani,Arman Heidari,Amirmohammad Anvari,Zahra Rahimi
机构: Allameh Tabatabaei University (艾兰姆·塔巴塔巴伊大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 18 pages, 7 figures, 3 tables

点击查看摘要

Abstract:The rapid acceleration of scientific publishing has created substantial challenges for researchers attempting to discover, contextualize, and interpret relevant literature. Traditional keyword-based search systems provide limited semantic understanding, while existing AI-driven tools typically focus on isolated tasks such as retrieval, clustering, or bibliometric visualization. This paper presents an integrated system for scientific literature exploration that combines large-scale data acquisition, hybrid retrieval, semantic topic modeling, and heterogeneous knowledge graph construction. The system builds a comprehensive corpus by merging full-text data from arXiv with structured metadata from OpenAlex. A hybrid retrieval architecture fuses BM25 lexical search with embedding-based semantic search using Reciprocal Rank Fusion. Topic modeling is performed on retrieved results using BERTopic or non-negative matrix factorization depending on computational resources. A knowledge graph unifies papers, authors, institutions, countries, and extracted topics into an interpretable structure. The system provides a multi-layered exploration environment that reveals not only relevant publications but also the conceptual and relational landscape surrounding a query. Evaluation across multiple queries demonstrates improvements in retrieval relevance, topic coherence, and interpretability. The proposed framework contributes an extensible foundation for AI-assisted scientific discovery.
zh

[NLP-52] NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents

【速读】: 该论文旨在解决当前编码代理(coding agents)在长周期任务中能力不足的问题,特别是缺乏对构建完整软件系统所需的长期推理、规划与执行能力的严谨评估。现有基准主要聚焦于局部代码生成、模板补全或短期修复任务,无法有效衡量代理在真实世界仓库构建场景中的持续智能表现。解决方案的关键在于提出 NL2Repo Bench,这是一个专门设计用于评估编码代理长周期仓库生成能力的新基准:给定单一自然语言需求文档和空工作区,代理需自主完成架构设计、依赖管理、多模块逻辑实现,并输出可安装的 Python 库。实验表明,即使最强模型平均测试通过率也低于 40%,且极少能正确完成整个仓库,揭示了提前终止、全局一致性丧失、跨文件依赖脆弱及数百步交互中的规划不足等核心失败模式,从而确立了一个可验证的测试平台,凸显长周期推理是下一代自主编码代理的核心瓶颈。

链接: https://arxiv.org/abs/2512.12730
作者: Jingzhe Ding,Shengda Long,Changxin Pu,Huan Zhou,Hongwan Gao,Xiang Gao,Chao He,Yue Hou,Fei Hu,Zhaojian Li,Weiran Shi,Zaiyuan Wang,Daoguang Zan,Chenchen Zhang,Xiaoxu Zhang,Qizhi Chen,Xianfu Cheng,Bo Deng,Qingshui Gu,Kai Hua,Juntao Lin,Pai Liu,Mingchen Li,Xuanguang Pan,Zifan Peng,Yujia Qin,Yong Shan,Zhewen Tan,Weihao Xie,Zihan Wang,Yishuo Yuan,Jiayu Zhang,Enduo Zhao,Yunfei Zhao,He Zhu,Chenyang Zou,Ming Ding,Jianpeng Jiao,Jiaheng Liu,Minghao Liu,Qian Liu,Chongyao Tao,Jian Yang,Tong Yang,Zhaoxiang Zhang,Xinjie Chen,Wenhao Huang,Ge Zhang
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advances in coding agents suggest rapid progress toward autonomous software development, yet existing benchmarks fail to rigorously evaluate the long-horizon capabilities required to build complete software systems. Most prior evaluations focus on localized code generation, scaffolded completion, or short-term repair tasks, leaving open the question of whether agents can sustain coherent reasoning, planning, and execution over the extended horizons demanded by real-world repository construction. To address this gap, we present NL2Repo Bench, a benchmark explicitly designed to evaluate the long-horizon repository generation ability of coding agents. Given only a single natural-language requirements document and an empty workspace, agents must autonomously design the architecture, manage dependencies, implement multi-module logic, and produce a fully installable Python library. Our experiments across state-of-the-art open- and closed-source models reveal that long-horizon repository generation remains largely unsolved: even the strongest agents achieve below 40% average test pass rates and rarely complete an entire repository correctly. Detailed analysis uncovers fundamental long-horizon failure modes, including premature termination, loss of global coherence, fragile cross-file dependencies, and inadequate planning over hundreds of interaction steps. NL2Repo Bench establishes a rigorous, verifiable testbed for measuring sustained agentic competence and highlights long-horizon reasoning as a central bottleneck for the next generation of autonomous coding agents.
zh

[NLP-53] CoDA: A Context-Decoupled Hierarchical Agent with Reinforcement Learning WSDM’26

【速读】: 该论文旨在解决大型语言模型(Large Language Model, LLM)代理在执行多步骤任务时因“上下文爆炸”(Context Explosion)导致的性能下降问题,即长期文本输出累积超出模型上下文窗口限制,引发推理失败。解决方案的关键在于提出一种分层架构——CoDA(Context-Decoupled hierarchical Agent),其核心创新是将高阶规划(Planner)与低阶执行(Executor)解耦,使单一共享LLM骨干网络在两个语境隔离的角色中协同工作:高阶规划器在紧凑的战略上下文中进行任务分解,而低阶执行器则在临时、隔离的工作空间中处理工具交互。通过端到端训练方法PECO(Planner-Executor Co-Optimization)联合优化这两个角色,利用轨迹级奖励驱动上下文相关的策略更新,从而显著提升复杂任务的鲁棒性和长上下文场景下的稳定性。

链接: https://arxiv.org/abs/2512.12716
作者: Xuanzhang Liu,Jianglun Feng,Zhuoran Zhuang,Junzhe Zhao,Maofei Que,Jieting Li,Dianlei Wang,Hao Tong,Ye Chen,Pan Li
机构: Georgia Institute of Technology (佐治亚理工学院); Alibaba Group (阿里巴巴集团)
类目: Computation and Language (cs.CL)
备注: Accepted to WSDM '26 Oral

点击查看摘要

Abstract:Large Language Model (LLM) agents trained with reinforcement learning (RL) show great promise for solving complex, multi-step tasks. However, their performance is often crippled by “Context Explosion”, where the accumulation of long text outputs overwhelms the model’s context window and leads to reasoning failures. To address this, we introduce CoDA, a Context-Decoupled hierarchical Agent, a simple but effective reinforcement learning framework that decouples high-level planning from low-level execution. It employs a single, shared LLM backbone that learns to operate in two distinct, contextually isolated roles: a high-level Planner that decomposes tasks within a concise strategic context, and a low-level Executor that handles tool interactions in an ephemeral, isolated workspace. We train this unified agent end-to-end using PECO (Planner-Executor Co-Optimization), a reinforcement learning methodology that applies a trajectory-level reward to jointly optimize both roles, fostering seamless collaboration through context-dependent policy updates. Extensive experiments demonstrate that CoDA achieves significant performance improvements over state-of-the-art baselines on complex multi-hop question-answering benchmarks, and it exhibits strong robustness in long-context scenarios, maintaining stable performance while all other baselines suffer severe degradation, thus further validating the effectiveness of our hierarchical design in mitigating context overload.
zh

[NLP-54] Efficient Vision-Language Reasoning via Adaptive Token Pruning NEURIPS2025

【速读】: 该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在实际部署中因计算开销过高而面临的效率瓶颈问题,现有架构对所有token进行统一处理,导致资源浪费。其解决方案的关键在于提出一种动态推理机制——自适应令牌剪枝(Adaptive Token Pruning, ATP),该机制在视觉-语言接口处运行,通过融合ViT的CLS注意力(intra-modal saliency)与CLIP文本-图像相似度(inter-modal relevance)生成混合重要性评分,仅保留top-K最具信息量的token供大语言模型(LLM)处理。ATP作为轻量级门控模块,无需修改骨干网络即可实现输入感知的动态剪枝,在BLIP-2、LLaVA和Flamingo等主流模型上验证了其有效性:推理FLOPs降低约40%,端到端延迟提升约1.5倍,且准确率损失小于1%,同时增强了模型鲁棒性和可解释性。

链接: https://arxiv.org/abs/2512.12701
作者: Xue Li,Xiaonan Song,Henry Hu
机构: Scholar42; InfiniPouch LLC; Labelbox, Inc.
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 10 pages, 3 figures. Expanded version of an extended abstract accepted at NeurIPS 2025 Workshop on VLM4RWD. Presents methodology and preliminary experimental results

点击查看摘要

Abstract:Real-world deployment of Vision-Language Models (VLMs) is hindered by high computational demands, as existing architectures inefficiently process all tokens uniformly. We introduce Adaptive Token Pruning (ATP), a dynamic inference mechanism that retains only the most informative tokens based on contextual relevance. ATP operates at the vision-language interface, assigning a hybrid importance score combining ViT CLS attention (intra-modal saliency) and CLIP text-image similarity (inter-modal relevance) to keep top-K tokens for the LLM. Unlike static compression, ATP adapts to each input without modifying the backbone. Proposed as a lightweight gating module, ATP is compatible with popular backbones like BLIP-2, LLaVA, and Flamingo. Preliminary evaluations across VQAv2, GQA, and COCO indicate that ATP reduces inference FLOPs by around 40% and achieves roughly 1.5x speedups in end-to-end latency with negligible accuracy loss (less than 1%). Qualitative analyses suggest ATP preserves visual grounding and enhances interpretability. Beyond efficiency, we investigate robustness under corruptions; observations suggest adaptive pruning suppresses spurious correlations, improving stability. These findings imply that resource-constrained inference and model reliability are not competing objectives. Finally, we discuss ATP’s role in efficient multimodal edge computing pipelines.
zh

[NLP-55] WebOperator: Action-Aware Tree Search for Autonomous Agents in Web Environment ICLR2026

【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)驱动的智能体在网页环境中因缺乏前瞻性和安全回溯机制而导致的导航失败问题。具体而言,现有方法通常采用贪婪式的逐步决策策略,无法有效处理部分可观测的网页环境(仅限浏览器可见内容如DOM和UI元素),且缺乏对不可逆操作的识别与应对能力,导致错误难以纠正或探索路径受限。解决方案的关键在于提出WebOperator框架,其核心创新包括:(1) 基于奖励估计与安全性联合评分的最优优先搜索策略,实现结构化探索;(2) 可验证的鲁棒回溯机制,在重放历史路径前检查可行性以避免意外副作用;(3) 多源推理上下文生成候选动作并进行预执行过滤与语义合并,提升动作质量与多样性。实验表明,该方法在WebArena上达到54.6%的成功率(使用gpt-4o),显著优于现有方法,验证了战略前瞻与安全执行结合的有效性。

链接: https://arxiv.org/abs/2512.12692
作者: Mahir Labib Dihan,Tanzima Hashem,Mohammed Eunus Ali,Md Rizwan Parvez
机构: Bangladesh University of Engineering and Technology (BUET); Monash University; Qatar Computing Research Institute (QCRI)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Under review at ICLR 2026. Project page: this https URL

点击查看摘要

Abstract:LLM-based agents often operate in a greedy, step-by-step manner, selecting actions solely based on the current observation without considering long-term consequences or alternative paths. This lack of foresight is particularly problematic in web environments, which are only partially observable-limited to browser-visible content (e.g., DOM and UI elements)-where a single misstep often requires complex and brittle navigation to undo. Without an explicit backtracking mechanism, agents struggle to correct errors or systematically explore alternative paths. Tree-search methods provide a principled framework for such structured exploration, but existing approaches lack mechanisms for safe backtracking, making them prone to unintended side effects. They also assume that all actions are reversible, ignoring the presence of irreversible actions-limitations that reduce their effectiveness in realistic web tasks. To address these challenges, we introduce WebOperator, a tree-search framework that enables reliable backtracking and strategic exploration. Our method incorporates a best-first search strategy that ranks actions by both reward estimates and safety considerations, along with a robust backtracking mechanism that verifies the feasibility of previously visited paths before replaying them, preventing unintended side effects. To further guide exploration, WebOperator generates action candidates from multiple, varied reasoning contexts to ensure diverse and robust exploration, and subsequently curates a high-quality action set by filtering out invalid actions pre-execution and merging semantically equivalent ones. Experimental results on WebArena and WebVoyager demonstrate the effectiveness of WebOperator. On WebArena, WebOperator achieves a state-of-the-art 54.6% success rate with gpt-4o, underscoring the critical advantage of integrating strategic foresight with safe execution.
zh

[NLP-56] Reassessing the Role of Supervised Fine-Tuning: An Empirical Study in VLM Reasoning

【速读】: 该论文试图解决当前视觉语言模型(Vision-Language Models, VLMs)推理能力提升中对强化学习(Reinforcement Learning, RL)过度依赖的问题,尤其是质疑“RL优于监督微调(Supervised Fine-Tuning, SFT)”这一主流观点是否普适。其解决方案的关键在于通过系统且受控的实验设计,在使用相同数据源的前提下,重新评估SFT与RL在不同模型容量、数据规模和数据分布条件下的相对有效性。研究发现,SFT在弱模型中更可靠地激发推理能力、具有更高的数据效率(如2K样本即达RL 20K效果),并展现出更强的跨模态迁移能力;同时揭示了RL中存在的“误导性奖励”问题——高奖励并不一定对应高推理准确率。这表明SFT并非冗余,而应在后训练流程中与RL形成互补,从而构建更平衡有效的VLM推理优化策略。

链接: https://arxiv.org/abs/2512.12690
作者: Yongcan Yu,Lingxiao He,Shuo Lu,Lijun Sheng,Yinuo Xu,Yanbo Wang,Kuangpu Guo,Jianjie Cheng,Meng Wang,Qianlong Xie,Xingxing Wang,Dapeng Hu,Jian Liang
机构: NLPR & MAIS, Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所NLPR & MAIS); School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院); Meituan(美团)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in vision-language models (VLMs) reasoning have been largely attributed to the rise of reinforcement Learning (RL), which has shifted the community’s focus away from the supervised fine-tuning (SFT) paradigm. Many studies suggest that introducing the SFT stage not only fails to improve reasoning ability but may also negatively impact model training. In this study, we revisit this RL-centric belief through a systematic and controlled comparison of SFT and RL on VLM Reasoning. Using identical data sources, we find that the relative effectiveness of SFT and RL is conditional and strongly influenced by model capacity, data scale, and data distribution. Contrary to common assumptions, our findings show that SFT plays a crucial role across several scenarios: (1) Effectiveness for weaker models. SFT more reliably elicits reasoning capabilities in smaller or weaker VLMs. (2) Data efficiency. SFT with only 2K achieves comparable or better reasoning performance to RL with 20K. (3) Cross-modal transferability. SFT demonstrates stronger generalization across modalities. Moreover, we identify a pervasive issue of deceptive rewards, where higher rewards fail to correlate with better reasoning accuracy in RL. These results challenge the prevailing “RL over SFT” narrative. They highlight that the role of SFT may have been underestimated and support a more balanced post-training pipeline in which SFT and RL function as complementary components.
zh

[NLP-57] heoretical Foundations of Prompt Engineering: From Heuristics to Expressivity

【速读】: 该论文旨在解决如何在固定模型权重的前提下,通过调整提示(prompt)来实现不同行为切换的理论机制问题,即探索提示作为外部程序对Transformer架构计算能力的影响。其解决方案的关键在于将提示视为一个可执行的外部程序,并构建一个简化的Transformer模型,使其能够根据提示内容进行选择性路由(由注意力机制实现)、局部算术运算(由前馈网络FFN完成),并通过深度堆叠组合多步计算。这一机制级分解揭示了提示驱动行为切换的内在原理,并证明了一个构造性存在结果:单个固定骨干网络可通过提示逼近广泛的目标行为类别,从而为形式化分析提示长度与精度约束下的权衡关系及提示切换结构限制提供了统一框架。

链接: https://arxiv.org/abs/2512.12688
作者: Dongseok Kim,Hyoungsun Choi,Mohamed Jismy Aashik Rasool,Gisung Oh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 24 pages

点击查看摘要

Abstract:Prompts can switch a model’s behavior even when the weights are fixed, yet this phenomenon is rarely treated as a clean theoretical object rather than a heuristic. We study the family of functions obtainable by holding a Transformer backbone fixed as an executor and varying only the prompt. Our core idea is to view the prompt as an externally injected program and to construct a simplified Transformer that interprets it to implement different computations. The construction exposes a mechanism-level decomposition: attention performs selective routing from prompt memory, the FFN performs local arithmetic conditioned on retrieved fragments, and depth-wise stacking composes these local updates into a multi-step computation. Under this viewpoint, we prove a constructive existential result showing that a single fixed backbone can approximate a broad class of target behaviors via prompts alone. The framework provides a unified starting point for formalizing trade-offs under prompt length/precision constraints and for studying structural limits of prompt-based switching, while remaining distinct from empirical claims about pretrained LLMs.
zh

[NLP-58] Memoria: A Scalable Agent ic Memory Framework for Personalized Conversational AI

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在持续用户交互中缺乏持久性、个性化和长期上下文保持能力的问题,这是将LLM部署为真正交互式与自适应代理的关键瓶颈。解决方案的核心在于提出一个模块化记忆框架Memoria,其关键创新是融合两种互补组件:动态会话级摘要机制与基于加权知识图谱(Knowledge Graph, KG)的用户建模引擎,前者保障短期对话连贯性,后者通过结构化实体与关系增量式捕捉用户的特征、偏好和行为模式,从而在现代LLM的token限制下实现短时语境一致性和长时个性化能力的统一。

链接: https://arxiv.org/abs/2512.12686
作者: Samarth Sarin,Lovepreet Singh,Bhaskarjit Sarmah,Dhagash Mehta
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Paper accepted at 5th International Conference of AIML Systems 2025, Bangalore, India

点击查看摘要

Abstract:Agentic memory is emerging as a key enabler for large language models (LLM) to maintain continuity, personalization, and long-term context in extended user interactions, critical capabilities for deploying LLMs as truly interactive and adaptive agents. Agentic memory refers to the memory that provides an LLM with agent-like persistence: the ability to retain and act upon information across conversations, similar to how a human would. We present Memoria, a modular memory framework that augments LLM-based conversational systems with persistent, interpretable, and context-rich memory. Memoria integrates two complementary components: dynamic session-level summarization and a weighted knowledge graph (KG)-based user modelling engine that incrementally captures user traits, preferences, and behavioral patterns as structured entities and relationships. This hybrid architecture enables both short-term dialogue coherence and long-term personalization while operating within the token constraints of modern LLMs. We demonstrate how Memoria enables scalable, personalized conversational artificial intelligence (AI) by bridging the gap between stateless LLM interfaces and agentic memory systems, offering a practical solution for industry applications requiring adaptive and evolving user experiences.
zh

[NLP-59] Fine-Tuning Causal LLM s for Text Classification: Embedding-Based vs. Instruction-Based Approaches

【速读】: 该论文旨在解决在计算资源受限条件下,如何高效微调仅解码器结构的大语言模型(decoder-only Large Language Models, LLMs)用于下游文本分类任务的问题。其核心解决方案是结合4-bit模型量化(4-bit model quantization)与低秩适应(Low-Rank Adaptation, LoRA)技术,实现单GPU环境下对高达80亿参数模型的参数高效微调。实验表明,基于模型最终token嵌入作为序列表示并附加分类头的方法,在F1分数上显著优于指令微调方法,并且在性能上可媲美甚至超越专门微调的领域模型(如BERT),验证了直接利用因果LLM内部表征的有效性。

链接: https://arxiv.org/abs/2512.12677
作者: Amirhossein Yousefiramandi,Ciaran Cooney
机构: Clarivate(科睿唯安)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 18 pages, 6 figures

点击查看摘要

Abstract:We explore efficient strategies to fine-tune decoder-only Large Language Models (LLMs) for downstream text classification under resource constraints. Two approaches are investigated: (1) attaching a classification head to a pre-trained causal LLM and fine-tuning on the task (using the LLM’s final token embedding as a sequence representation), and (2) instruction-tuning the LLM in a prompt-response format for classification. To enable single-GPU fine-tuning of models up to 8B parameters, we combine 4-bit model quantization with Low-Rank Adaptation (LoRA) for parameter-efficient training. Experiments on two datasets - a proprietary single-label dataset and the public WIPO-Alpha patent dataset (extreme multi-label classification) - show that the embedding-based method significantly outperforms the instruction-tuned method in F1-score, and is very competitive with - even surpassing - fine-tuned domain-specific models (e.g. BERT) on the same tasks. These results demonstrate that directly leveraging the internal representations of causal LLMs, along with efficient fine-tuning techniques, yields impressive classification performance under limited computational resources. We discuss the advantages of each approach while outlining practical guidelines and future directions for optimizing LLM fine-tuning in classification scenarios.
zh

[NLP-60] Modeling Authorial Style in Urdu Novels Using Character Interaction Graphs and Graph Neural Networks

【速读】: 该论文试图解决的问题是:如何仅通过叙事结构(narrative structure)来识别作者风格,尤其是在低资源语言如乌尔都语(Urdu)中,传统基于词汇和文体特征的作者分析方法难以适用。解决方案的关键在于构建一个图表示框架,将每部乌尔都语小说建模为角色互动网络(character interaction network),其中节点代表角色,边表示角色在叙事邻近性中的共现关系,并通过多种图表示方法(包括全局结构特征、节点级语义摘要、无监督图嵌入及有监督图神经网络)进行系统比较。实验表明,学习到的图表示显著优于手工设计和无监督基线,在严格的作者感知评估协议下达到最高0.857的准确率,证明了叙事结构本身蕴含可辨识的作者风格信息。

链接: https://arxiv.org/abs/2512.12654
作者: Hassan Mujtaba,Hamza Naveed,Hanzlah Munir
机构: Information Technology University, Lahore (信息科技大学,拉合尔)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
备注: 6 pages

点击查看摘要

Abstract:Authorship analysis has traditionally focused on lexical and stylistic cues within text, while higher-level narrative structure remains underexplored, particularly for low-resource languages such as Urdu. This work proposes a graph-based framework that models Urdu novels as character interaction networks to examine whether authorial style can be inferred from narrative structure alone. Each novel is represented as a graph where nodes correspond to characters and edges denote their co-occurrence within narrative proximity. We systematically compare multiple graph representations, including global structural features, node-level semantic summaries, unsupervised graph embeddings, and supervised graph neural networks. Experiments on a dataset of 52 Urdu novels written by seven authors show that learned graph representations substantially outperform hand-crafted and unsupervised baselines, achieving up to 0.857 accuracy under a strict author-aware evaluation protocol.
zh

[NLP-61] LexRel: Benchmarking Legal Relation Extraction for Chinese Civil Cases

【速读】: 该论文旨在解决中国民事案件中法律关系在法律人工智能(Legal AI)领域长期缺乏系统化建模与识别的问题,尤其针对现有研究因缺少全面的结构化框架而导致对法律关系理解不足的局限性。其解决方案的关键在于提出一个包含层级分类体系和论点定义的综合性法律关系 schema,并基于此构建了 LexRel——首个面向中文民事法律关系抽取的专家标注基准数据集,从而为法律关系提取任务提供标准化评估工具;实验证明,引入法律关系信息可显著提升下游法律 AI 任务的性能表现。

链接: https://arxiv.org/abs/2512.12643
作者: Yida Cai,Ranjuexiao Hu,Huiyuan Xie,Chenyang Li,Yun Liu,Yuxiao Ye,Zhenghao Liu,Weixing Shen,Zhiyuan Liu
机构: Tsinghua University (清华大学); Peking University (北京大学); University of Glasgow (格拉斯哥大学); Beijing University of Posts and Telecommunications (北京邮电大学); Queen Mary University of London (伦敦玛丽女王大学); Northeastern University (东北大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Legal relations form a highly consequential analytical framework of civil law system, serving as a crucial foundation for resolving disputes and realizing values of the rule of law in judicial practice. However, legal relations in Chinese civil cases remain underexplored in the field of legal artificial intelligence (legal AI), largely due to the absence of comprehensive schemas. In this work, we firstly introduce a comprehensive schema, which contains a hierarchical taxonomy and definitions of arguments, for AI systems to capture legal relations in Chinese civil cases. Based on this schema, we then formulate legal relation extraction task and present LexRel, an expert-annotated benchmark for legal relation extraction in Chinese civil law. We use LexRel to evaluate state-of-the-art large language models (LLMs) on legal relation extractions, showing that current LLMs exhibit significant limitations in accurately identifying civil legal relations. Furthermore, we demonstrate that incorporating legal relations information leads to consistent performance gains on other downstream legal AI tasks.
zh

[NLP-62] Which Pieces Does Unigram Tokenization Really Need?

【速读】: 该论文旨在解决Unigram分词算法在实际应用中因实现复杂而限制其广泛采用的问题,尤其是相较于字节对编码(Byte-Pair Encoding, BPE)这类基于贪心启发式的方法,Unigram虽具理论优势但缺乏易用性。解决方案的关键在于提供一套清晰的实现指南和参数选择策略,使该算法更易于落地;同时识别出一种简化版本的算法,在接受稍高的训练损失的前提下显著提升压缩效率,从而平衡了性能与实用性。

链接: https://arxiv.org/abs/2512.12641
作者: Sander Land,Yuval Pinter
机构: Writer, Inc.; Ben-Gurion University (本古里安大学)
类目: Computation and Language (cs.CL)
备注: 10 pages, 1 figure. For associated code, see this https URL

点击查看摘要

Abstract:The Unigram tokenization algorithm offers a probabilistic alternative to the greedy heuristics of Byte-Pair Encoding. Despite its theoretical elegance, its implementation in practice is complex, limiting its adoption to the SentencePiece package and adapters thereof. We bridge this gap between theory and practice by providing a clear guide to implementation and parameter choices. We also identify a simpler algorithm that accepts slightly higher training loss in exchange for improved compression.
zh

[NLP-63] Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space

【速读】: 该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在跨模态推理中依赖显式分步推理、感知与推理交互不稳定以及计算开销较大的问题。其解决方案的关键在于提出一种测试时动态多模态潜在推理框架(Dynamic Multimodal Latent Reasoning, DMLR),通过置信度引导的潜在策略梯度优化来精炼潜在思维标记(latent think tokens),并引入动态视觉注入策略(Dynamic Visual Injection Strategy),在每个潜在思维标记处检索最相关的视觉特征并更新最佳视觉补丁集合,从而实现动态的视觉-文本交织推理,显著提升推理性能的同时保持高推理效率。

链接: https://arxiv.org/abs/2512.12623
作者: Chengzhi Liu,Yuzhe Yang,Yue Fan,Qingyue Wei,Sheng Liu,Xin Eric Wang
机构: University of California, Santa Barbara (加州大学圣塔芭芭拉分校); Stanford University (斯坦福大学); University of California, Santa Cruz (加州大学圣克鲁兹分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced cross-modal understanding and reasoning by incorporating Chain-of-Thought (CoT) reasoning in the semantic space. Building upon this, recent studies extend the CoT mechanism to the visual modality, enabling models to integrate visual information during reasoning through external tools or explicit image generation. However, these methods remain dependent on explicit step-by-step reasoning, unstable perception-reasoning interaction and notable computational overhead. Inspired by human cognition, we posit that thinking unfolds not linearly but through the dynamic interleaving of reasoning and perception within the mind. Motivated by this perspective, we propose DMLR, a test-time Dynamic Multimodal Latent Reasoning framework that employs confidence-guided latent policy gradient optimization to refine latent think tokens for in-depth reasoning. Furthermore, a Dynamic Visual Injection Strategy is introduced, which retrieves the most relevant visual features at each latent think token and updates the set of best visual patches. The updated patches are then injected into latent think token to achieve dynamic visual-textual interleaving. Experiments across seven multimodal reasoning benchmarks and various model architectures demonstrate that DMLR significantly improves reasoning and perception performance while maintaining high inference efficiency.
zh

[NLP-64] Understanding Syllogistic Reasoning in LLM s from Formal and Natural Language Perspectives AAAI2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在三段论推理(syllogistic reasoning)中表现出的逻辑能力与自然语言理解能力之间的关系问题,尤其关注LLMs是否正在从基于语义的模糊推理向更形式化的符号推理机制演进。其解决方案的关键在于通过系统性评估14个LLMs在符号推理和自然语言理解两个维度上的表现,发现部分模型在符号推理上达到完美性能,暗示LLMs可能正逐步具备类似形式逻辑的推理结构,而非单纯模拟人类推理的细微差别。

链接: https://arxiv.org/abs/2512.12620
作者: Aheli Poddar(1),Saptarshi Sahoo(2),Sujata Ghosh(2) ((1) Institute of Engineering amp; Management, Kolkata, (2) Indian Statistical Institute, Chennai)
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages, 4 figures, 5 tables. Submitted to AAAI 2026 Bridge Program on Logic AI. Code available at this https URL

点击查看摘要

Abstract:We study syllogistic reasoning in LLMs from the logical and natural language perspectives. In process, we explore fundamental reasoning capabilities of the LLMs and the direction this research is moving forward. To aid in our studies, we use 14 large language models and investigate their syllogistic reasoning capabilities in terms of symbolic inferences as well as natural language understanding. Even though this reasoning mechanism is not a uniform emergent property across LLMs, the perfect symbolic performances in certain models make us wonder whether LLMs are becoming more and more formal reasoning mechanisms, rather than making explicit the nuances of human reasoning.
zh

[NLP-65] StruProKGR: A Structural and Probabilistic Framework for Sparse Knowledge Graph Reasoning

【速读】: 该论文旨在解决稀疏知识图谱(Sparse Knowledge Graph, KG)推理中因知识缺失和关系模式难以捕捉而导致的挑战,尤其针对现有基于路径的方法在计算效率低、路径质量不一以及未充分利用图结构信息的问题。解决方案的关键在于提出一种结构与概率相结合的框架 StruProKGR:首先通过距离引导的路径收集机制显著降低计算开销并挖掘更相关的路径;其次引入概率路径聚合策略,利用结构信息对路径进行加权整合,优先选择相互增强的路径,从而提升推理效果与可解释性。

链接: https://arxiv.org/abs/2512.12613
作者: Yucan Guo,Saiping Guan,Miao Su,Zeya Zhao,Xiaolong Jin,Jiafeng Guo,Xueqi Cheng
机构: CAS Key Laboratory of Network Data Science and Technology (中国科学院网络数据科学与技术重点实验室); Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所); School of Computer Science and Technology, University of Chinese Academy of Sciences (中国科学院大学计算机学院); Beijing Institute of Tracking and Telecommunications Technology (北京跟踪与通信技术研究所)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Sparse Knowledge Graphs (KGs) are commonly encountered in real-world applications, where knowledge is often incomplete or limited. Sparse KG reasoning, the task of inferring missing knowledge over sparse KGs, is inherently challenging due to the scarcity of knowledge and the difficulty of capturing relational patterns in sparse scenarios. Among all sparse KG reasoning methods, path-based ones have attracted plenty of attention due to their interpretability. Existing path-based methods typically rely on computationally intensive random walks to collect paths, producing paths of variable quality. Additionally, these methods fail to leverage the structured nature of graphs by treating paths independently. To address these shortcomings, we propose a Structural and Probabilistic framework named StruProKGR, tailored for efficient and interpretable reasoning on sparse KGs. StruProKGR utilizes a distance-guided path collection mechanism to significantly reduce computational costs while exploring more relevant paths. It further enhances the reasoning process by incorporating structural information through probabilistic path aggregation, which prioritizes paths that reinforce each other. Extensive experiments on five sparse KG reasoning benchmarks reveal that StruProKGR surpasses existing path-based methods in both effectiveness and efficiency, providing an effective, efficient, and interpretable solution for sparse KG reasoning.
zh

[NLP-66] Human-Inspired Learning for Large Language Models via Obvious Record and Maximum-Entropy Method Discovery

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在处理罕见、低资源或未见过场景时表现不佳的问题,以及其依赖隐式参数化记忆导致难以显式获取、回忆和优化方法,从而表现为以直觉驱动的预测者而非有意识的方法导向学习者。解决方案的关键在于提出一种受人类学习启发的学习框架,包含两个互补机制:一是“显式记录”(Obvious Record),将因果关系或问题-解决方案对以符号形式存储为持久记忆,实现从单次或稀有经历中持续学习;二是“最大熵方法发现”(Maximum-Entropy Method Discovery),通过优先保留语义差异较大的方法,捕捉通常被基于下一个词预测的模型忽略的多样化策略,从而提升对未见问题的覆盖度与内部多样性。

链接: https://arxiv.org/abs/2512.12608
作者: Hong Su
机构: Chengdu University of Information Technology (成都信息工程大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) excel at extracting common patterns from large-scale corpora, yet they struggle with rare, low-resource, or previously unseen scenarios-such as niche hardware deployment issues or irregular IoT device behaviors-because such cases are sparsely represented in training data. Moreover, LLMs rely primarily on implicit parametric memory, which limits their ability to explicitly acquire, recall, and refine methods, causing them to behave predominantly as intuition-driven predictors rather than deliberate, method-oriented learners. Inspired by how humans learn from rare experiences, this paper proposes a human-inspired learning framework that integrates two complementary mechanisms. The first, Obvious Record, explicitly stores cause–result (or question–solution) relationships as symbolic memory, enabling persistent learning even from single or infrequent encounters. The second, Maximum-Entropy Method Discovery, prioritizes and preserves methods with high semantic dissimilarity, allowing the system to capture diverse and underrepresented strategies that are typically overlooked by next-token prediction. Verification on a benchmark of 60 semantically diverse question–solution pairs demonstrates that the proposed entropy-guided approach achieves stronger coverage of unseen questions and significantly greater internal diversity than a random baseline, confirming its effectiveness in discovering more generalizable and human-inspired methods.
zh

[NLP-67] AgentS HAP: Interpreting LLM Agent Tool Importance with Monte Carlo Shapley Value Estimation

【速读】: 该论文旨在解决大型语言模型(Large Language Model, LLM)代理在调用外部工具时缺乏可解释性的问题,即难以明确识别哪些工具对最终响应起到了关键作用。现有可解释人工智能(Explainable AI, XAI)方法无法提供工具层面的归因分析。解决方案的关键在于提出AgentSHAP——首个基于博弈论中Shapley值的工具重要性解释框架,其核心创新包括:(1) 将工具选择建模为合作博弈问题,利用Shapley值公平量化每个工具的贡献;(2) 引入蒙特卡洛采样技术,将计算复杂度从指数级O(2ⁿ)降低至实际可行水平;(3) 在API-Bank数据集上验证了AgentSHAP能稳定生成一致的重要性评分,并准确区分相关与无关工具,从而实现了对LLM代理行为的透明化解释。

链接: https://arxiv.org/abs/2512.12597
作者: Miriam Horovicz
机构: Fiverr Labs
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:LLM agents that use external tools can solve complex tasks, but understanding which tools actually contributed to a response remains a blind spot. No existing XAI methods address tool-level explanations. We introduce AgentSHAP, the first framework for explaining tool importance in LLM agents. AgentSHAP is model-agnostic: it treats the agent as a black box and works with any LLM (GPT, Claude, Llama, etc.) without needing access to internal weights or gradients. Using Monte Carlo Shapley values, AgentSHAP tests how an agent responds with different tool subsets and computes fair importance scores based on game theory. Our contributions are: (1) the first explainability method for agent tool attribution, grounded in Shapley values from game theory; (2) Monte Carlo sampling that reduces cost from O(2n) to practical levels; and (3) comprehensive experiments on API-Bank showing that AgentSHAP produces consistent scores across runs, correctly identifies which tools matter, and distinguishes relevant from irrelevant tools. AgentSHAP joins TokenSHAP (for tokens) and PixelSHAP (for image regions) to complete a family of Shapley-based XAI tools for modern generative AI. Code: this https URL.
zh

[NLP-68] Coupled Variational Reinforcement Learning for Language Model General Reasoning

【速读】: 该论文旨在解决当前无验证器强化学习(verifier-free reinforcement learning)方法中因推理轨迹采样仅依赖问题信息而导致探索效率低下以及推理过程与最终答案之间一致性不足的问题。其解决方案的关键在于提出一种名为耦合变分强化学习(Coupled Variational Reinforcement Learning, CoVRL)的新框架,该框架通过混合采样策略将先验分布与后验分布耦合,并构建和优化一个融合两者特性的复合分布,从而在保持强思维-答案一致性的同时实现高效探索。

链接: https://arxiv.org/abs/2512.12576
作者: Xueru Wen,Jie Lou,Yanjiang Liu,Hongyu Lin,Ben He,Xianpei Han,Le Sun,Yaojie Lu,Debing Zhang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While reinforcement learning have achieved impressive progress in language model reasoning, they are constrained by the requirement for verifiable rewards. Recent verifier-free RL methods address this limitation by utilizing the intrinsic probabilities of LLMs generating reference answers as reward signals. However, these approaches typically sample reasoning traces conditioned only on the question. This design decouples reasoning-trace sampling from answer information, leading to inefficient exploration and incoherence between traces and final answers. In this paper, we propose \textit\bCoupled \bVariational \bReinforcement \bLearning (CoVRL), which bridges variational inference and reinforcement learning by coupling prior and posterior distributions through a hybrid sampling strategy. By constructing and optimizing a composite distribution that integrates these two distributions, CoVRL enables efficient exploration while preserving strong thought-answer coherence. Extensive experiments on mathematical and general reasoning benchmarks show that CoVRL improves performance by 12.4% over the base model and achieves an additional 2.3% improvement over strong state-of-the-art verifier-free RL baselines, providing a principled framework for enhancing the general reasoning capabilities of language models.
zh

[NLP-69] HyperEdit: Unlocking Instruction-based Text Editing in LLM s via Hypernetworks

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在指令驱动的文本编辑任务中面临的两大核心问题:一是难以准确对齐用户意图,导致编辑内容与指令不一致;二是容易对未修改区域进行过度编辑,破坏原有结构。解决方案的关键在于提出HyperEdit框架,其包含两个创新机制:一是基于超网络(hypernetwork)的动态适配方法,通过生成请求特定参数来定制化编辑策略以匹配不同指令;二是差异感知正则化(difference-aware regularization),聚焦于被修改区域的监督信号,从而抑制对不变区域的冗余修改,确保编辑的精确性和最小性。实验表明,HyperEdit在修改区域的BLEU指标上相较最先进基线提升9%–30%,且仅使用3B参数规模。

链接: https://arxiv.org/abs/2512.12544
作者: Yiming Zeng,Jinghan Cao,Zexin Li,Wanhao Yu,Zhankai Ye,Dawei Xiang,Ting Hua,Xin Liu,Shangqian Gao,Tingting Yu
机构: University of Connecticut (康涅狄格大学); San Francisco State University (旧金山州立大学); University of California, Riverside (加州大学河滨分校); University of North Carolina at Charlotte (北卡罗来纳大学夏洛特分校); Florida State University (佛罗里达州立大学); University of Notre Dame (圣母大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Instruction-based text editing is increasingly critical for real-world applications such as code editors (e.g., Cursor), but Large Language Models (LLMs) continue to struggle with this task. Unlike free-form generation, editing requires faithfully implementing user instructions while preserving unchanged content, as even minor unintended modifications can break functionality. Existing approaches treat editing as generic text generation, leading to two key failures: they struggle to faithfully align edits with diverse user intents, and they often over-edit unchanged regions. We propose HyperEdit to address both issues. First, we introduce hypernetwork-based dynamic adaptation that generates request-specific parameters, enabling the model to tailor its editing strategy to each instruction. Second, we develop difference-aware regularization that focuses supervision on modified spans, preventing over-editing while ensuring precise, minimal changes. HyperEdit achieves a 9%–30% relative improvement in BLEU on modified regions over state-of-the-art baselines, despite utilizing only 3B parameters.
zh

[NLP-70] NagaNLP: Bootstrapping NLP for Low-Resource Nagamese Creole with Human-in-the-Loop Synthetic Data

【速读】: 该论文旨在解决全球大量语言(尤其是克里奥尔语如Nagamese)在自然语言处理(Natural Language Processing, NLP)领域严重缺乏资源的问题,从而阻碍其在数字技术中的代表性。解决方案的关键在于提出了一种新颖的“合成-混合”(synthetic-hybrid)数据生成方法:首先利用专家引导的大语言模型(LLM,Gemini)生成候选语料库,随后由母语者进行人工验证、精炼与标注,最终构建出一个包含10K句对的对话数据集和高质量标注语料库。该方法成功支撑了NagaNLP工具包的开发,并在词性标注(Part-of-Speech tagging)和命名实体识别(Named Entity Recognition)任务上显著优于零样本基线模型,同时训练出的NagaLLaMA模型在对话任务中表现出卓越性能,为低资源语言的NLP研究提供了可复现的数据驱动框架。

链接: https://arxiv.org/abs/2512.12537
作者: Agniva Maiti,Manya Pandey,Murari Mandal
机构: KIIT Bhubaneswar (KIIT大学布巴内斯瓦尔分校); RespAI Lab (RespAI 实验室)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The vast majority of the world’s languages, particularly creoles like Nagamese, remain severely under-resourced in Natural Language Processing (NLP), creating a significant barrier to their representation in digital technology. This paper introduces NagaNLP, a comprehensive open-source toolkit for Nagamese, bootstrapped through a novel methodology that relies on LLM-driven but human-validated synthetic data generation. We detail a multi-stage pipeline where an expert-guided LLM (Gemini) generates a candidate corpus, which is then refined and annotated by native speakers. This synthetic-hybrid approach yielded a 10K pair conversational dataset and a high-quality annotated corpus for foundational tasks. To assess the effectiveness of our methodology, we trained both discriminative and generative models. Our fine-tuned XLM-RoBERTa-base model establishes a new benchmark for Nagamese, achieving a 93.81% accuracy (0.90 F1-Macro) on Part-of-Speech tagging and a 0.75 F1-Macro on Named Entity Recognition, massively outperforming strong zero-shot baselines. Furthermore, we fine-tuned a Llama-3.2-3B Instruct model, named NagaLLaMA, which demonstrates superior performance on conversational tasks, achieving a Perplexity of 3.85, an order of magnitude improvement over its few-shot counterpart (96.76). We release the NagaNLP toolkit, including all datasets, models, and code, providing a foundational resource for a previously underserved language and a reproducible framework for reducing data scarcity in other low-resource contexts.
zh

[NLP-71] Adaptive Detector-Verifier Framework for Zero-Shot Polyp Detection in Open-World Settings

【速读】: 该论文旨在解决现有结肠息肉检测模型在真实内镜场景中性能下降的问题,其核心挑战在于实验室环境与临床实践中存在的域差距(domain gap),如光照变化、运动模糊和遮挡等因素导致图像质量退化。解决方案的关键在于提出一种两阶段的检测-验证框架——AdaptiveDetector,其中YOLOv11检测器结合视觉语言模型(Vision-Language Model, VLM)引导的自适应置信度阈值调整机制,实现对每帧图像动态优化检测阈值;同时,验证模块通过Group Relative Policy Optimization(GRPO)算法进行微调,并采用不对称的成本敏感奖励函数以最小化漏检(missed detections),从而满足临床对高召回率的严格要求。该方法在合成退化数据集上实现了显著提升,相较纯YOLO模型召回率提高14–22个百分点,且保持较高的精度水平,有效降低了因漏诊引发的癌前病变风险。

链接: https://arxiv.org/abs/2512.12492
作者: Shengkai Xu,Hsiang Lun Kao,Tianxiang Xu,Honghui Zhang,Junqiao Wang,Runmeng Ding,Guanyu Liu,Tianyu Shi,Zhenyu Yu,Guofeng Pan,Ziqian Bi,Yuqi Ouyang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Polyp detectors trained on clean datasets often underperform in real-world endoscopy, where illumination changes, motion blur, and occlusions degrade image quality. Existing approaches struggle with the domain gap between controlled laboratory conditions and clinical practice, where adverse imaging conditions are prevalent. In this work, we propose AdaptiveDetector, a novel two-stage detector-verifier framework comprising a YOLOv11 detector with a vision-language model (VLM) verifier. The detector adaptively adjusts per-frame confidence thresholds under VLM guidance, while the verifier is fine-tuned with Group Relative Policy Optimization (GRPO) using an asymmetric, cost-sensitive reward function specifically designed to discourage missed detections – a critical clinical requirement. To enable realistic assessment under challenging conditions, we construct a comprehensive synthetic testbed by systematically degrading clean datasets with adverse conditions commonly encountered in clinical practice, providing a rigorous benchmark for zero-shot evaluation. Extensive zero-shot evaluation on synthetically degraded CVC-ClinicDB and Kvasir-SEG images demonstrates that our approach improves recall by 14 to 22 percentage points over YOLO alone, while precision remains within 0.7 points below to 1.7 points above the baseline. This combination of adaptive thresholding and cost-sensitive reinforcement learning achieves clinically aligned, open-world polyp detection with substantially fewer false negatives, thereby reducing the risk of missed precancerous polyps and improving patient outcomes.
zh

[NLP-72] he American Ghost in the Machine: How language models align culturally and the effects of cultural prompting

【速读】: 该论文旨在解决生成式 AI(Generative AI)在跨文化情境下响应一致性不足的问题,即当前主流大语言模型(Large Language Models, LLMs)在未明确指定文化背景时倾向于默认美国文化,导致其输出与非美文化用户预期存在偏差。解决方案的关键在于引入“文化提示”(cultural prompting)策略——通过系统提示(system prompts)引导模型调整其文化倾向至目标国家(如中国、法国、印度、伊朗、日本和美国),从而测试并提升模型对不同文化维度的适应能力。实验表明,七种被测LLM在文化提示干预后能显著向目标文化靠拢,但对中国和日本文化的适配仍存在挑战,揭示了文化嵌入性在模型设计中的关键作用。

链接: https://arxiv.org/abs/2512.12488
作者: James Luther,Donald Brown
机构: University of Virginia (弗吉尼亚大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Culture is the bedrock of human interaction; it dictates how we perceive and respond to everyday interactions. As the field of human-computer interaction grows via the rise of generative Large Language Models (LLMs), the cultural alignment of these models become an important field of study. This work, using the VSM13 International Survey and Hofstede’s cultural dimensions, identifies the cultural alignment of popular LLMs (DeepSeek-V3, V3.1, GPT-5, GPT-4.1, GPT-4, Claude Opus 4, Llama 3.1, and Mistral Large). We then use cultural prompting, or using system prompts to shift the cultural alignment of a model to a desired country, to test the adaptability of these models to other cultures, namely China, France, India, Iran, Japan, and the United States. We find that the majority of the eight LLMs tested favor the United States when the culture is not specified, with varying results when prompted for other cultures. When using cultural prompting, seven of the eight models shifted closer to the expected culture. We find that models had trouble aligning with Japan and China, despite two of the models tested originating with the Chinese company DeepSeek.
zh

[NLP-73] Large language models have learned to use language

【速读】: 该论文试图解决的问题是:如何利用大语言模型(Large Language Models, LLMs)在语言理解与生成方面的强大能力,推动语言科学(language science)实现突破性进展。其解决方案的关键在于,研究者需摒弃长期以来对语言知识评估的固有观念,并正视一个现实——我们已进入“后图灵测试时代”(post-Turing test era),即语言模型的能力已超越传统测试框架所能衡量的范围,必须重新定义语言知识的获取、表征与评估方式,以释放LLMs在语言学研究中的潜力。

链接: https://arxiv.org/abs/2512.12447
作者: Gary Lupyan
机构: 未知
类目: Computation and Language (cs.CL)
备注: Commentary on Futrell Mahowald’s How Linguistics Learned to Stop Worrying and Love the Language Models (BBS, Forthcoming)

点击查看摘要

Abstract:Acknowledging that large language models have learned to use language can open doors to breakthrough language science. Achieving these breakthroughs may require abandoning some long-held ideas about how language knowledge is evaluated and reckoning with the difficult fact that we have entered a post-Turing test era.
zh

[NLP-74] Can GPT replace human raters? Validity and reliability of machine-generated norms for metaphors

【速读】: 该论文旨在解决生成式 AI(Generative AI)在心理语言学研究中用于生成隐喻(metaphor)属性评分(包括熟悉度、可理解性和意象性)时的效度与信度问题。解决方案的关键在于系统评估三种 GPT 模型生成的隐喻评分与人类评分的一致性,并进一步检验其预测行为反应(如反应时间)和脑电生理指标(EEG 幅度)的能力。结果表明,机器生成的评分与人类评分呈显著正相关,尤其在英语隐喻的可理解性上表现最强,且大模型优于小模型;同时,多轮独立生成的评分具有高度稳定性,说明 GPT 可作为可靠工具替代或补充人类评分,但对隐喻的常规性和多模态语义方面仍存在较大偏差,需谨慎对待刺激材料特性。

链接: https://arxiv.org/abs/2512.12444
作者: Veronica Mangiaterra,Hamad Al-Azary,Chiara Barattieri di San Pietro,Paolo Canal,Valentina Bambini
机构: 未知
类目: Computation and Language (cs.CL)
备注: 30 pages, 5 figures

点击查看摘要

Abstract:As Large Language Models (LLMs) are increasingly being used in scientific research, the issue of their trustworthiness becomes crucial. In psycholinguistics, LLMs have been recently employed in automatically augmenting human-rated datasets, with promising results obtained by generating ratings for single words. Yet, performance for ratings of complex items, i.e., metaphors, is still unexplored. Here, we present the first assessment of the validity and reliability of ratings of metaphors on familiarity, comprehensibility, and imageability, generated by three GPT models for a total of 687 items gathered from the Italian Figurative Archive and three English studies. We performed a thorough validation in terms of both alignment with human data and ability to predict behavioral and electrophysiological responses. We found that machine-generated ratings positively correlated with human-generated ones. Familiarity ratings reached moderate-to-strong correlations for both English and Italian metaphors, although correlations weakened for metaphors with high sensorimotor load. Imageability showed moderate correlations in English and moderate-to-strong in Italian. Comprehensibility for English metaphors exhibited the strongest correlations. Overall, larger models outperformed smaller ones and greater human-model misalignment emerged with familiarity and imageability. Machine-generated ratings significantly predicted response times and the EEG amplitude, with a strength comparable to human ratings. Moreover, GPT ratings obtained across independent sessions were highly stable. We conclude that GPT, especially larger models, can validly and reliably replace - or augment - human subjects in rating metaphor properties. Yet, LLMs align worse with humans when dealing with conventionality and multimodal aspects of metaphorical meaning, calling for careful consideration of the nature of stimuli.
zh

[NLP-75] he Data Efficiency Frontier of Financial Foundation Models: Scaling Laws from Continued Pretraining

【速读】: 该论文旨在解决如何通过域自适应预训练(Domain-adaptive Pretraining, DAPT)高效地将大规模语言模型专业化于高价值领域(如金融领域),同时避免对通用语言模型造成灾难性遗忘的问题。其解决方案的关键在于:在有限的域特定数据(如美国证券交易委员会(SEC)文件)上进行持续预训练,实验表明仅需200M tokens即可获得显著的域内性能提升,且通用语言建模性能保持稳定,验证了金融语料的高度规律性和可学习性;此外,模型在数据效率前沿上的表现证明其能实现域特化而几乎不引入混合域退化,为未来更大规模模型(7B–70B参数)的金融领域适配提供了可行路径。

链接: https://arxiv.org/abs/2512.12384
作者: Jesse Ponnock
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 8 pages, 4 figures, 1 table

点击查看摘要

Abstract:Domain-adaptive pretraining (DAPT) offers a practical path to specializing large language models for high-value domains without full retraining. We conduct an early-stage scaling-law analysis of continued pretraining on U.S. SEC filings, training 1B and 3B-parameter Llama-3.2 models on a 400M-token financial corpus with validation checkpoints at 50M, 100M, 200M, and 400M tokens. Results show consistent improvements in SEC-domain validation loss for both models, with the largest gains occurring within the first 200M tokens and diminishing returns thereafter. Power-law fits reveal shallow exponents, indicating that financial language is highly regular and efficiently learnable under continued pretraining. General-domain validation loss remains effectively unchanged across all token budgets, suggesting minimal drift and no signs of catastrophic forgetting. A data-efficiency frontier further shows that both models move toward improved specialization with negligible mixed-domain degradation. Together, these findings provide early empirical guidance for scaling financial foundation models, suggesting that meaningful domain adaptation can be achieved with comparatively modest token budgets and that larger model scales (7B-70B) remain tractable under projected data requirements.
zh

[NLP-76] VideoARM: Agent ic Reasoning over Hierarchical Memory for Long-Form Video Understanding

【速读】: 该论文旨在解决长视频理解中因时间结构冗长和多模态线索密集所带来的挑战,现有方法通常依赖人工设计的推理流程或消耗大量token的视频预处理来引导多模态大语言模型(Multimodal Large Language Models, MLLMs)进行自主推理。其解决方案的关键在于提出VideoARM——一种基于分层记忆的代理式推理范式(Agentic Reasoning-over-hierarchical-Memory paradigm),通过动态、实时的观察-思考-行动-记忆循环机制,实现粗粒度到细粒度的自适应推理与记忆构建;其中控制器自主调用工具以降低token消耗,同时分层多模态记忆持续捕捉并更新多层次线索,为决策提供精准上下文信息,从而在保持性能的同时显著提升效率。

链接: https://arxiv.org/abs/2512.12360
作者: Yufei Yin,Qianke Meng,Minghao Chen,Jiajun Ding,Zhenwei Shao,Zhou Yu
机构: Hangzhou Dianzi University (杭州电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Long-form video understanding remains challenging due to the extended temporal structure and dense multimodal cues. Despite recent progress, many existing approaches still rely on hand-crafted reasoning pipelines or employ token-consuming video preprocessing to guide MLLMs in autonomous reasoning. To overcome these limitations, we introduce VideoARM, an Agentic Reasoning-over-hierarchical-Memory paradigm for long-form video understanding. Instead of static, exhaustive preprocessing, VideoARM performs adaptive, on-the-fly agentic reasoning and memory construction. Specifically, VideoARM performs an adaptive and continuous loop of observing, thinking, acting, and memorizing, where a controller autonomously invokes tools to interpret the video in a coarse-to-fine manner, thereby substantially reducing token consumption. In parallel, a hierarchical multimodal memory continuously captures and updates multi-level clues throughout the operation of the agent, providing precise contextual information to support the controller in decision-making. Experiments on prevalent benchmarks demonstrate that VideoARM outperforms the state-of-the-art method, DVD, while significantly reducing token consumption for long-form videos.
zh

[NLP-77] SCIR: A Self-Correcting Iterative Refinement Framework for Enhanced Information Extraction Based on Schema

【速读】: 该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的信息抽取(Information Extraction, IE)系统在微调范式中存在的两大问题:高昂的训练成本以及难以与LLM偏好对齐。其解决方案的关键在于提出一种通用的信息抽取框架——自校正迭代精炼(Self-Correcting Iterative Refinement, SCIR),该框架通过双路径自校正模块(Dual-Path Self-Correcting module)和反馈驱动优化机制,实现与现有LLM及IE系统的即插即用兼容性,显著降低训练开销;同时,结合一个多任务双语(中英文)自校正数据集(Multi-task Bilingual Self-Correcting, MBSC),间接蒸馏GPT-4的能力以提升结果检测模型的偏好对齐效果,从而在命名实体识别、关系抽取和事件抽取三项任务上实现平均5.27%的span-based Micro-F1提升,且训练成本降低87%。

链接: https://arxiv.org/abs/2512.12337
作者: Yushen Fang,Jianjun Li,Mingqian Ding,Chang Liu,Xinchi Zou,Wenqi Yang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Although Large language Model (LLM)-powered information extraction (IE) systems have shown impressive capabilities, current fine-tuning paradigms face two major limitations: high training costs and difficulties in aligning with LLM preferences. To address these issues, we propose a novel universal IE paradigm, the Self-Correcting Iterative Refinement (SCIR) framework, along with a Multi-task Bilingual (Chinese-English) Self-Correcting (MBSC) dataset containing over 100,000 entries. The SCIR framework achieves plug-and-play compatibility with existing LLMs and IE systems through its Dual-Path Self-Correcting module and feedback-driven optimization, thereby significantly reducing training costs. Concurrently, the MBSC dataset tackles the challenge of preference alignment by indirectly distilling GPT-4’s capabilities into IE result detection models. Experimental results demonstrate that SCIR outperforms state-of-the-art IE methods across three key tasks: named entity recognition, relation extraction, and event extraction, achieving a 5.27 percent average improvement in span-based Micro-F1 while reducing training costs by 87 percent compared to baseline approaches. These advancements not only enhance the flexibility and accuracy of IE systems but also pave the way for lightweight and efficient IE paradigms.
zh

[NLP-78] From Human Intention to Action Prediction: A Comprehensive Benchmark for Intention-driven End-to-End Autonomous Driving

【速读】: 该论文旨在解决当前端到端自动驾驶系统在智能水平上仍停留在执行低级指令层面的问题,即如何实现从“命令跟随者”向“意图履行者”的范式转变——即让系统能够理解并完成人类的高层次、抽象意图。其解决方案的关键在于提出了首个综合性基准测试工具Intention-Drive,该基准包含两个核心要素:一是由复杂场景及其对应自然语言意图组成的全新数据集;二是以意图成功率为评估指标(Intent Success Rate, ISR)的新颖评测协议,该指标聚焦于对人类目标的语义层面达成度,而不仅限于几何路径的准确性。通过在Intention-Drive上对多种基线模型的广泛评估,研究揭示了现有模型在场景与意图理解方面的显著不足,为推动真正智能自动驾驶的发展提供了可量化、可比较的评测框架。

链接: https://arxiv.org/abs/2512.12302
作者: Huan Zheng,Yucheng Zhou,Tianyi Yan,Jiayi Su,Hongjun Chen,Dubing Chen,Wencheng Han,Runzhou Tao,Zhongying Qiu,Jianfei Yang,Jianbing Shen
机构: University of Macau (澳门大学); Zhejiang ZEEKR Automobile Research & Development Co., Ltd. (浙江极氪汽车研发有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Current end-to-end autonomous driving systems operate at a level of intelligence akin to following simple steering commands. However, achieving genuinely intelligent autonomy requires a paradigm shift: moving from merely executing low-level instructions to understanding and fulfilling high-level, abstract human intentions. This leap from a command-follower to an intention-fulfiller, as illustrated in our conceptual framework, is hindered by a fundamental challenge: the absence of a standardized benchmark to measure and drive progress on this complex task. To address this critical gap, we introduce Intention-Drive, the first comprehensive benchmark designed to evaluate the ability to translate high-level human intent into safe and precise driving actions. Intention-Drive features two core contributions: (1) a new dataset of complex scenarios paired with corresponding natural language intentions, and (2) a novel evaluation protocol centered on the Intent Success Rate (ISR), which assesses the semantic fulfillment of the human’s goal beyond simple geometric accuracy. Through an extensive evaluation of a spectrum of baseline models on Intention-Drive, we reveal a significant performance deficit, showing that the baseline model struggle to achieve the comprehensive scene and intention understanding required for this advanced task.
zh

[NLP-79] F5-TTS-RO: Extending F5-TTS to Romanian TTS via Lightweight Input Adaptation

【速读】: 该论文旨在解决F5-TTS模型对罗马尼亚语(Romanian Language)支持不足的问题,同时在不损害原有语音克隆能力及英文和中文支持的前提下实现多语言扩展。解决方案的关键在于设计了一个轻量级的输入层适配器(input-level adapter),通过冻结原始模型权重、附加一个子网络并训练其作为文本编码器中词嵌入矩阵的扩展,从而将罗马尼亚语文本映射为连续表示。该模块基于ConvNeXt结构建模字符级嵌入间的依赖关系,充当“软”字母到发音层(letter-to-sound layer),使模型能够生成自然流畅的罗马尼亚语语音。实验表明,该方法在保留语音克隆性能的同时,初步实现了罗马尼亚语与英语的混合使用(code-switching),但仍存在残留的英语口音特征。

链接: https://arxiv.org/abs/2512.12297
作者: Radu-Gabriel Chivereanu,Tiberiu Boros
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted at The 20th International Conference on Linguistic Resources and Tools for Natural Language Processing

点击查看摘要

Abstract:This work introduces a lightweight input-level adapter for the F5-TTS model that enables Romanian Language support. To preserve the existing capabilities of the model (voice cloning, English and Chinese support), we keep the original weights frozen, append a sub-network to the model and train it as an extension for the textual embedding matrix of the text encoder. For simplicity, we rely on ConvNeXt module implemented in F5-TTS to also model the co-dependencies between the new character-level embeddings. The module serves as a soft letter-to-sound layer, converting Romanian text into a continuous representation that the F5-TTS model uses to produce naturally sounding Romanian utterances. We evaluate the model with a pool of 20 human listeners across three tasks: (a) audio similarity between reference and generated speech, (b) pronunciation and naturalness and © Romanian-English code-switching. The results indicate that our approach maintains voice cloning capabilities and enables, to a certain extent, code-switching within the same utterance; however, residual English accent characteristics remain. We open-source our code and provide example audio samples at this https URL.
zh

[NLP-80] Market-Bench: Evaluating Large Language Models on Introductory Quantitative Trading and Market Dynamics

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在量化交易领域中从自然语言描述生成可执行回测代码的能力评估问题。现有方法缺乏对模型在真实交易策略构建任务中的结构可靠性与数值准确性进行系统性衡量的基准。为此,作者提出了MARKET-BENCH,一个专注于初级量化交易任务的基准测试平台,其关键在于设计了三类典型交易策略(定时交易、配对交易和delta对冲),要求模型根据自然语言描述输出可运行的回测代码,并通过多轮pass@k指标区分结构正确性(backtest能否执行)与数值精度(如收益、回撤和持仓路径的均方误差)。这一设计使得模型性能可被客观量化,从而揭示当前LLMs在价格、库存和风险推理方面的局限性。

链接: https://arxiv.org/abs/2512.12264
作者: Abhay Srivastava,Sam Jung,Spencer Mateega
机构: AfterQuery
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We introduce MARKET-BENCH, a benchmark that evaluates large language models (LLMs) on introductory quantitative trading tasks by asking them to construct executable backtesters from natural-language strategy descriptions and market assumptions. Each instance specifies one of three canonical strategies – scheduled trading on Microsoft (NASDAQ: MSFT), pairs trading on Coca-Cola (NASDAQ: KO) and Pepsi (NASDAQ: PEP), or delta hedging on MSFT – and models must produce code whose P\L, drawdown, and position paths match a verifiable reference implementation. We assess twelve state-of-the-art models using a multi-round pass@k metric that separates structural reliability (whether the backtest runs) from numerical accuracy (mean absolute error of the backtest metrics). While most models reliably execute the simplest strategy (average pass@3 of 0.80), errors vary by orders of magnitude across models and tasks: Gemini 3 Pro and Claude 4.5 Sonnet combine strong reliability with low error on simpler strategies, GPT-5.1 Codex-Max achieves perfect pass@1 on the first two strategies and the lowest best-run error on the easiest task, and Qwen3 Max attains perfect pass@3 yet sometimes produces inaccurate P\L paths. These results show that current LLMs can scaffold basic trading infrastructure but still struggle to reason robustly about prices, inventory, and risk; we release MARKET-BENCH and a public leaderboard at this https URL.
zh

[NLP-81] Adversarially Probing Cross-Family Sound Symbolism in 27 Languages

【速读】: 该论文旨在解决跨语言语义域中音义关联(sound symbolism)的规模化验证问题,特别是针对“大小”这一语义范畴是否存在普遍性的音素-语义映射。传统研究多依赖小样本实验(如Bouba-Kiki效应),缺乏对多种语言的系统性分析。其解决方案的关键在于构建了一个涵盖27种语言、810个形容词的跨语言语料库,并采用基于音段特征(bag-of-segment features)的可解释分类器进行建模;同时引入对抗性去混淆机制(adversarial scrubber),在抑制语言家族身份信息的同时保留大小语义信号,从而证明即使在无亲缘关系的语言之间,音素形式仍能显著预测大小语义,揭示出跨语言家族的声音象征性偏倚(cross-family sound-symbolic bias)。

链接: https://arxiv.org/abs/2512.12245
作者: Anika Sharma,Tianyi Niu,Emma Wrenn,Shashank Srivastava
机构: University of North Carolina at Chapel Hill (北卡罗来纳大学教堂山分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The phenomenon of sound symbolism, the non-arbitrary mapping between word sounds and meanings, has long been demonstrated through anecdotal experiments like Bouba Kiki, but rarely tested at scale. We present the first computational cross-linguistic analysis of sound symbolism in the semantic domain of size. We compile a typologically broad dataset of 810 adjectives (27 languages, 30 words each), each phonemically transcribed and validated with native-speaker audio. Using interpretable classifiers over bag-of-segment features, we find that phonological form predicts size semantics above chance even across unrelated languages, with both vowels and consonants contributing. To probe universality beyond genealogy, we train an adversarial scrubber that suppresses language identity while preserving size signal (also at family granularity). Language prediction averaged across languages and settings falls below chance while size prediction remains significantly above chance, indicating cross-family sound-symbolic bias. We release data, code, and diagnostic tools for future large-scale studies of iconicity.
zh

[NLP-82] Semantic Distance Measurement based on Multi-Kernel Gaussian Processes

【速读】: 该论文旨在解决传统语义距离度量方法因固定特性而难以适应特定数据分布和任务需求的问题。其关键解决方案是提出一种基于多核高斯过程(Multi-Kernel Gaussian Processes, MK-GP)的语义距离度量方法,将文本的潜在语义函数建模为高斯过程,并采用由Matérn与多项式核组合而成的复合核结构,通过监督学习自动优化核参数,从而实现对语义相似性的灵活、可学习建模。

链接: https://arxiv.org/abs/2512.12238
作者: Yinzhu Cheng,Haihua Xie,Yaqing Wang,Miao He,Mingming Sun
机构: Institute of Statistics and Big Data, Renmin University of China (中国人民大学统计与大数据研究院); Beijing Institute of Mathematical Sciences and Applications (北京数学科学应用研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Semantic distance measurement is a fundamental problem in computational linguistics, providing a quantitative characterization of similarity or relatedness between text segments, and underpinning tasks such as text retrieval and text classification. From a mathematical perspective, a semantic distance can be viewed as a metric defined on a space of texts or on a representation space derived from them. However, most classical semantic distance methods are essentially fixed, making them difficult to adapt to specific data distributions and task requirements. In this paper, a semantic distance measure based on multi-kernel Gaussian processes (MK-GP) was proposed. The latent semantic function associated with texts was modeled as a Gaussian process, with its covariance function given by a combined kernel combining Matérn and polynomial components. The kernel parameters were learned automatically from data under supervision, rather than being hand-crafted. This semantic distance was instantiated and evaluated in the context of fine-grained sentiment classification with large language models under an in-context learning (ICL) setup. The experimental results demonstrated the effectiveness of the proposed measure.
zh

[NLP-83] Journey Before Destination: On the importance of Visual Faithfulness in Slow Thinking

【速读】: 该论文旨在解决当前生成式视觉语言模型(Vision Language Models, VLMs)在推理增强场景下存在的“视觉不忠实性”问题,即模型可能通过与图像无关的感知步骤得出正确答案,或虽推理过程忠实但最终预测失败。这类现象无法被仅依赖最终答案准确率的标准评估方法识别,从而影响多模态推理的可靠性。解决方案的关键在于提出一种无需训练和参考的视觉信仰度(visual faithfulness)评估框架,该框架能将推理链分解为感知与推理步骤,并利用现成的VLM判别器对每一步进行逐级验证;在此基础上进一步设计了一种轻量级自反思机制,在不依赖额外训练的前提下检测并局部重生成不忠实的感知步骤,从而有效降低不忠实感知率,同时保持最终答案准确性,提升多模态推理的可信度。

链接: https://arxiv.org/abs/2512.12218
作者: Rheeya Uppaal,Phu Mon Htut,Min Bai,Nikolaos Pappas,Zheng Qi
机构: AWS AI Labs; University of Wisconsin-Madison
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Preprint

点击查看摘要

Abstract:Reasoning-augmented vision language models (VLMs) generate explicit chains of thought that promise greater capability and transparency but also introduce new failure modes: models may reach correct answers via visually unfaithful intermediate steps, or reason faithfully yet fail on the final prediction. Standard evaluations that only measure final-answer accuracy cannot distinguish these behaviors. We introduce the visual faithfulness of reasoning chains as a distinct evaluation dimension, focusing on whether the perception steps of a reasoning chain are grounded in the image. We propose a training- and reference-free framework that decomposes chains into perception versus reasoning steps and uses off-the-shelf VLM judges for step-level faithfulness, additionally verifying this approach through a human meta-evaluation. Building on this metric, we present a lightweight self-reflection procedure that detects and locally regenerates unfaithful perception steps without any training. Across multiple reasoning-trained VLMs and perception-heavy benchmarks, our method reduces Unfaithful Perception Rate while preserving final-answer accuracy, improving the reliability of multimodal reasoning.
zh

[NLP-84] raining Versatile Coding Agents in Synthetic Environments

【速读】: 该论文旨在解决现有软件工程智能体训练方法依赖GitHub等外部代码仓库所带来的灵活性不足,以及任务类型局限于问题修复而难以覆盖软件工程师实际工作中多样化需求的问题。其解决方案的关键在于提出SWE-Playground,一个基于大语言模型和智能体从零合成项目与任务的新型流水线,通过自主生成环境和轨迹,摆脱对外部数据源的依赖,并支持包括单元测试生成、库实现等更广泛编码任务的训练,从而在三个基准测试中验证了其高密度训练信号特性,使智能体以更少轨迹达到与先前方法相当的性能水平。

链接: https://arxiv.org/abs/2512.12216
作者: Yiqi Zhu,Apurva Gandhi,Graham Neubig
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Prior works on training software engineering agents have explored utilizing existing resources such as issues on GitHub repositories to construct software engineering tasks and corresponding test suites. These approaches face two key limitations: (1) their reliance on pre-existing GitHub repositories offers limited flexibility, and (2) their primary focus on issue resolution tasks restricts their applicability to the much wider variety of tasks a software engineer must handle. To overcome these challenges, we introduce SWE-Playground, a novel pipeline for generating environments and trajectories which supports the training of versatile coding agents. Unlike prior efforts, SWE-Playground synthetically generates projects and tasks from scratch with strong language models and agents, eliminating reliance on external data sources. This allows us to tackle a much wider variety of coding tasks, such as reproducing issues by generating unit tests and implementing libraries from scratch. We demonstrate the effectiveness of this approach on three distinct benchmarks, and results indicate that SWE-Playground produces trajectories with dense training signal, enabling agents to reach comparable performance with significantly fewer trajectories than previous works.
zh

[NLP-85] Diffusion Language Model Inference with Monte Carlo Tree Search

【速读】: 该论文旨在解决扩散语言模型(Diffusion Language Models, DLMs)在推理阶段面临的组合优化难题,即如何高效地确定哪些位置应被解掩码(unmask)以及哪些词元(token)应被固定,以生成高质量文本。现有方法通常依赖启发式策略或额外训练来近似最优解,但往往导致次优的解码路径。论文提出的解决方案关键在于引入MEDAL框架,通过蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)初始化阶段对高置信度动作进行探索,从而为后续精细化调整提供稳健的起点;该机制通过限制搜索空间并优先选择能提升剩余掩码位置模型置信度的词元选择,实现了更有效的搜索引导,显著提升了生成质量,在多个基准测试中相较现有策略最高提升22.0%。

链接: https://arxiv.org/abs/2512.12168
作者: Zheng Huang,Kiran Ramnath,Yueyan Chen,Aosong Feng,Sangmin Woo,Balasubramaniam Srinivasan,Zhichao Xu,Kang Zhou,Shuai Wang,Haibo Ding,Lin Lee Cheong
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Diffusion language models (DLMs) have recently emerged as a compelling alternative to autoregressive generation, offering parallel generation and improved global coherence. During inference, DLMs generate text by iteratively denoising masked sequences in parallel; however, determining which positions to unmask and which tokens to commit forms a large combinatorial search problem. Existing inference methods approximate this search using heuristics, which often yield suboptimal decoding paths; other approaches instead rely on additional training to guide token selection. To introduce a principled search mechanism for DLMs inference, we introduce MEDAL, a framework that integrates Monte Carlo Tree SEarch initialization for Diffusion LAnguage Model inference. We employ Monte Carlo Tree Search at the initialization stage to explore promising unmasking trajectories, providing a robust starting point for subsequent refinement. This integration is enabled by restricting the search space to high-confidence actions and prioritizing token choices that improve model confidence over remaining masked positions. Across multiple benchmarks, MEDAL achieves up to 22.0% improvement over existing inference strategies, establishing a new paradigm for search-based inference in diffusion language models.
zh

[NLP-86] Extending the Context of Pretrained LLM s by Dropping Their Positional Embeddings

【速读】: 该论文旨在解决语言模型(Language Model, LM)在预训练后难以有效扩展上下文长度的问题,即传统方法通常需要昂贵的微调(finetuning)才能适应更长的输入序列。其解决方案的关键在于提出一种名为“Dropping the Positional Embeddings of LMs after training”(DroPE)的简单方法:通过在预训练完成后移除位置嵌入(Positional Embeddings, PEs),并在短时间 recalibration(校准)阶段重新调整模型,从而实现零样本(zero-shot)上下文扩展。该方法基于三个核心观察:1)PEs在预训练中提供重要归纳偏置以加速收敛;2)对PEs的过度依赖阻碍了测试时对未见长度序列的泛化能力;3)PEs并非语言建模的本质需求,可安全移除并配合少量校准恢复性能。实验表明,DroPE无需任何长上下文微调即可实现跨模型和数据集规模的有效扩展,显著优于现有专门架构与主流旋转位置编码(Rotary Positional Embedding)缩放方法。

链接: https://arxiv.org/abs/2512.12167
作者: Yoav Gelberg,Koshi Eguchi,Takuya Akiba,Edoardo Cetin
机构: Sakana AI; University of Oxford (牛津大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:So far, expensive finetuning beyond the pretraining sequence length has been a requirement for effectively extending the context of language models (LM). In this work, we break this key bottleneck by Dropping the Positional Embeddings of LMs after training (DroPE). Our simple method is motivated by three key theoretical and empirical observations. First, positional embeddings (PEs) serve a crucial role during pretraining, providing an important inductive bias that significantly facilitates convergence. Second, over-reliance on this explicit positional information is also precisely what prevents test-time generalization to sequences of unseen length, even when using popular PE-scaling methods. Third, positional embeddings are not an inherent requirement of effective language modeling and can be safely removed after pretraining, following a short recalibration phase. Empirically, DroPE yields seamless zero-shot context extension without any long-context finetuning, quickly adapting pretrained LMs without compromising their capabilities in the original training context. Our findings hold across different models and dataset sizes, far outperforming previous specialized architectures and established rotary positional embedding scaling methods.
zh

[NLP-87] VEGAS: Mitigating Hallucinations in Large Vision-Language Models via Vision-Encoder Attention Guided Adaptive Steering

【速读】: 该论文旨在解决大视觉语言模型(Large Vision-Language Models, LVLMs)在推理过程中产生的视觉幻觉问题,即模型生成的语言输出虽语法流畅但与输入图像内容不一致。解决方案的关键在于利用视觉编码器自身的注意力图(attention map)来抑制幻觉:研究发现,当最终视觉注意力未能聚焦于图像关键对象时,模型更易产生幻觉;而将视觉编码器更集中、准确的注意力图注入语言模型的中层(middle layers),可有效缓解视觉-文本冲突并显著减少幻觉。基于此机制,作者提出Vegas方法,在推理阶段自适应地引导未聚焦于关键图像对象的token,从而实现高效且鲁棒的幻觉抑制。

链接: https://arxiv.org/abs/2512.12089
作者: Zihu Wang,Boxun Xu,Yuxuan Xia,Peng Li
机构: University of California, Santa Barbara (加州大学圣塔芭芭拉分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large vision-language models (LVLMs) exhibit impressive ability to jointly reason over visual and textual inputs. However, they often produce outputs that are linguistically fluent but factually inconsistent with the visual evidence, i.e., they hallucinate. Despite growing efforts to mitigate such hallucinations, a key question remains: what form of visual attention can effectively suppress hallucinations during decoding? In this work, we provide a simple answer: the vision encoder’s own attention map. We show that LVLMs tend to hallucinate when their final visual-attention maps fail to concentrate on key image objects, whereas the vision encoder’s more concentrated attention maps substantially reduce hallucinations. To further investigate the cause, we analyze vision-text conflicts during decoding and find that these conflicts peak in the language model’s middle layers. Injecting the vision encoder’s attention maps into these layers effectively suppresses hallucinations. Building on these insights, we introduce VEGAS, a simple yet effective inference-time method that integrates the vision encoder’s attention maps into the language model’s mid-layers and adaptively steers tokens which fail to concentrate on key image objects. Extensive experiments across multiple benchmarks demonstrate that VEGAS consistently achieves state-of-the-art performance in reducing hallucinations.
zh

[NLP-88] BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长上下文推理场景下,标准注意力机制(attention mechanism)带来的计算与内存瓶颈问题。其核心挑战在于随着上下文长度增长,注意力矩阵的复杂度呈平方级上升,导致推理效率显著下降。解决方案的关键在于提出一种名为BLASST的即插即用稀疏注意力方法,该方法无需预计算或代理分数,仅通过一个固定阈值结合在线softmax的信息动态剪枝注意力矩阵,跳过冗余的softmax计算、Value块加载及后续矩阵乘法操作。此策略可无缝集成至现有FlashAttention内核设计中,带来极低延迟开销,并适用于所有注意力变体(MHA、GQA、MQA和MLA),从而实现对预填充(prefill)和解码(decode)阶段的统一加速。

链接: https://arxiv.org/abs/2512.12087
作者: Jiayi Yuan,Cameron Shinn,Kai Xu,Jingze Cui,George Klimiashvili,Guangxuan Xiao,Perkz Zheng,Bo Li,Yuxin Zhou,Zhouhai Ye,Weijie You,Tian Zheng,Dominic Brown,Pengbo Wang,Richard Cai,Julien Demouth,John D. Owens,Xia Hu,Song Han,Timmy Liu,Huizi Mao
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The growing demand for long-context inference capabilities in Large Language Models (LLMs) has intensified the computational and memory bottlenecks inherent to the standard attention mechanism. To address this challenge, we introduce BLASST, a drop-in sparse attention method that dynamically prunes the attention matrix without any pre-computation or proxy scores. Our method uses a fixed threshold and existing information from online softmax to identify negligible attention scores, skipping softmax computation, Value block loading, and the subsequent matrix multiplication. This fits seamlessly into existing FlashAttention kernel designs with negligible latency overhead. The approach is applicable to both prefill and decode stages across all attention variants (MHA, GQA, MQA, and MLA), providing a unified solution for accelerating long-context inference. We develop an automated calibration procedure that reveals a simple inverse relationship between optimal threshold and context length, enabling robust deployment across diverse scenarios. Maintaining high accuracy, we demonstrate a 1.62x speedup for prefill at 74.7% sparsity and a 1.48x speedup for decode at 73.2% sparsity on modern GPUs. Furthermore, we explore sparsity-aware training as a natural extension, showing that models can be trained to be inherently more robust to sparse attention patterns, pushing the accuracy-sparsity frontier even further.
zh

[NLP-89] VOYAGER: A Training Free Approach for Generating Diverse Datasets using LLM s

【速读】: 该论文旨在解决由大语言模型(Large Language Models, LLMs)生成的合成数据集缺乏多样性的问题。现有方法生成的数据在语义和结构上趋于同质,限制了下游模型的训练效果与泛化能力。为应对这一挑战,作者提出Voyager,其核心在于采用基于行列式点过程(Determinantal Point Processes, DPPs)的数学优化框架,在无需训练的前提下直接最大化数据集的多样性度量。该方案具有无需访问模型参数、适用于闭源模型且可扩展性强等优势,实验表明其相较主流基线方法在多样性上提升1.5–3倍。

链接: https://arxiv.org/abs/2512.12072
作者: Avinash Amballa,Yashas Malur Saidutta,Chi-Heng Lin,Vivek Kulkarni,Srinivas Chappidi
机构: Samsung Research America (三星研究院美国)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Arxiv Submission

点击查看摘要

Abstract:Large language models (LLMs) are increasingly being used to generate synthetic datasets for the evaluation and training of downstream models. However, prior work has noted that such generated data lacks diversity. In this paper, we propose Voyager, a novel principled approach to generate diverse datasets. Our approach is iterative and directly optimizes a mathematical quantity that optimizes the diversity of the dataset using the machinery of determinantal point processes. Furthermore, our approach is training-free, applicable to closed-source models, and scalable. In addition to providing theoretical justification for the working of our method, we also demonstrate through comprehensive experiments that Voyager significantly outperforms popular baseline approaches by providing a 1.5-3x improvement in diversity.
zh

[NLP-90] Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring

【速读】: 该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在面对多模态越狱攻击(multimodal jailbreak attacks)时的安全性问题,尤其是现有防御方法普遍存在两个缺陷:一是针对特定攻击模式设计,缺乏对新型攻击的泛化能力;二是计算开销过高,难以实际部署。解决方案的关键在于提出一种名为“表示对比评分”(Representational Contrastive Scoring, RCS)的新框架,其核心思想是利用LVLM自身内部表示中的安全信号——通过分析这些表示的空间几何结构,在关键层学习一个轻量级投影,以最大化区分良性输入与恶意输入。该方法生成的对比分数能够有效识别真实恶意意图,而非仅仅将新颖输入误判为攻击,从而实现高精度、低延迟的越狱检测。

链接: https://arxiv.org/abs/2512.12069
作者: Peichun Hua,Hao Li,Shanghao Shi,Zhiyuan Yu,Ning Zhang
机构: Washington University in St. Louis (圣路易斯华盛顿大学); Texas A&M University (德州农工大学)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 40 pages, 13 figures

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) are vulnerable to a growing array of multimodal jailbreak attacks, necessitating defenses that are both generalizable to novel threats and efficient for practical deployment. Many current strategies fall short, either targeting specific attack patterns, which limits generalization, or imposing high computational overhead. While lightweight anomaly-detection methods offer a promising direction, we find that their common one-class design tends to confuse novel benign inputs with malicious ones, leading to unreliable over-rejection. To address this, we propose Representational Contrastive Scoring (RCS), a framework built on a key insight: the most potent safety signals reside within the LVLM’s own internal representations. Our approach inspects the internal geometry of these representations, learning a lightweight projection to maximally separate benign and malicious inputs in safety-critical layers. This enables a simple yet powerful contrastive score that differentiates true malicious intent from mere novelty. Our instantiations, MCD (Mahalanobis Contrastive Detection) and KCD (K-nearest Contrastive Detection), achieve state-of-the-art performance on a challenging evaluation protocol designed to test generalization to unseen attack types. This work demonstrates that effective jailbreak detection can be achieved by applying simple, interpretable statistical methods to the appropriate internal representations, offering a practical path towards safer LVLM deployment. Our code is available on Github this https URL.
zh

[NLP-91] he Instability of Safety: How Random Seeds and Temperature Expose Inconsistent LLM Refusal Behavior

【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)安全评估中依赖单次采样测试所隐含的假设问题——即认为模型响应是确定性的且能代表其安全性对齐水平。研究发现,不同随机种子和温度设置下,模型对同一有害提示的安全拒绝决策存在显著波动,18–28%的提示会出现“决策翻转”现象(refusal-to-compliance flip)。解决方案的关键在于引入安全稳定性指数(Safety Stability Index, SSI),量化模型在多配置下的决策一致性,并通过实验证明高温度显著降低稳定性(Friedman检验 p < 0.001),进而提出:为获得可靠的安全评估结果,应采用至少3次采样(multi-sample)而非单次采样(single-shot)进行测试。

链接: https://arxiv.org/abs/2512.12066
作者: Erik Larsen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 14 pages, 7 figures, 6 tables. Code and data available at this https URL

点击查看摘要

Abstract:Current safety evaluations of large language models rely on single-shot testing, implicitly assuming that model responses are deterministic and representative of the model’s safety alignment. We challenge this assumption by investigating the stability of safety refusal decisions across random seeds and temperature settings. Testing four instruction-tuned models from three families (Llama 3.1 8B, Qwen 2.5 7B, Qwen 3 8B, Gemma 3 12B) on 876 harmful prompts across 20 different sampling configurations (4 temperatures x 5 random seeds), we find that 18-28% of prompts exhibit decision flips–the model refuses in some configurations but complies in others–depending on the model. Our Safety Stability Index (SSI) reveals that higher temperatures significantly reduce decision stability (Friedman chi-squared = 44.71, p 0.001), with mean SSI dropping from 0.951 at temperature 0.0 to 0.896 at temperature 1.0. We validate our findings across all model families using Claude 3.5 Haiku as a unified external judge, achieving 89.0% inter-judge agreement with our primary Llama 70B judge (Cohen’s kappa = 0.62). These findings demonstrate that single-shot safety evaluations are insufficient for reliable safety assessment. We show that single-shot evaluation agrees with multi-sample ground truth only 92.4% of the time, and recommend using at least 3 samples per prompt for reliable safety assessment.
zh

[NLP-92] Benchmarking Contextual Understanding for In-Car Conversational Systems

【速读】: 该论文旨在解决车载对话问答(In-Car Conversational Question Answering, ConvQA)系统在评估其准确性与可靠性方面的挑战,尤其是如何有效衡量系统响应与用户语句之间的语境一致性及推荐准确性。解决方案的关键在于利用大语言模型(Large Language Models, LLMs)结合高级提示工程技术(如输入输出提示、思维链提示、自一致性提示和多智能体提示)对ConvQA系统的响应进行自动化评估。研究通过合成生成包含正确与故障模拟响应的用户语句,在餐厅推荐场景下验证方法有效性,发现小规模非推理型模型在多智能体提示下性能显著提升,而推理型模型整体表现更优,其中DeepSeek-R1在自一致性单智能体提示下达到F1分数0.99,且成本仅为每请求0.002美元;综合效果与成本效率最优的是非推理型模型DeepSeek-V3,表明LLM驱动的评估方法可作为传统人工评估的可扩展且高精度替代方案。

链接: https://arxiv.org/abs/2512.12042
作者: Philipp Habicht,Lev Sorokin,Abdullah Saydemir,Ken E. Friedl,Andrea Stocco
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In-Car Conversational Question Answering (ConvQA) systems significantly enhance user experience by enabling seamless voice interactions. However, assessing their accuracy and reliability remains a challenge. This paper explores the use of Large Language Models (LLMs) alongside advanced prompting techniques and agent-based methods to evaluate the extent to which ConvQA system responses adhere to user utterances. The focus lies on contextual understanding and the ability to provide accurate venue recommendations considering user constraints and situational context. To evaluate utterance-response coherence using an LLM, we synthetically generate user utterances accompanied by correct and modified failure-containing system responses. We use input-output, chain-of-thought, self-consistency prompting, and multi-agent prompting techniques with 13 reasoning and non-reasoning LLMs of varying sizes and providers, including OpenAI, DeepSeek, Mistral AI, and Meta. We evaluate our approach on a case study involving restaurant recommendations. The most substantial improvements occur for small non-reasoning models when applying advanced prompting techniques, particularly multi-agent prompting. However, reasoning models consistently outperform non-reasoning models, with the best performance achieved using single-agent prompting with self-consistency. Notably, DeepSeek-R1 reaches an F1-score of 0.99 at a cost of 0.002 USD per request. Overall, the best balance between effectiveness and cost-time efficiency is reached with the non-reasoning model DeepSeek-V3. Our findings show that LLM-based evaluation offers a scalable and accurate alternative to traditional human evaluation for benchmarking contextual understanding in ConvQA systems.
zh

[NLP-93] Semantic-Drive: Democratizing Long-Tail Data Curation via Open-Vocabulary Grounding and Neuro-Symbolic VLM Consensus

【速读】: 该论文旨在解决自动驾驶(Autonomous Vehicles, AVs)发展中因“长尾”训练数据稀缺而导致的模型鲁棒性不足问题,尤其是如何高效、精准地从海量视频日志中挖掘罕见但高风险的安全事件(如异常横穿马路、施工分流等)。现有方法依赖粗粒度元数据搜索或云端视觉语言模型(Vision-Language Models, VLMs),存在精度低、隐私泄露和成本高的缺陷。其解决方案的核心在于提出一种本地优先的神经符号框架 Semantic-Drive,通过两阶段感知解耦实现高精度语义数据挖掘:首先利用实时开放词汇检测器 YOLOE 进行符号锚定以聚焦注意力,再由推理型 VLM 执行场景取证分析;同时引入“系统2”推理时对齐策略,结合多模态“裁判-侦察”共识机制有效抑制幻觉,显著提升召回率(0.966 vs. CLIP 的 0.475)并降低风险评估误差 40%,且整个系统可在消费级硬件(NVIDIA RTX 3090)上运行,兼顾性能与隐私保护。

链接: https://arxiv.org/abs/2512.12012
作者: Antonio Guillen-Perez
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:The development of robust Autonomous Vehicles (AVs) is bottlenecked by the scarcity of “Long-Tail” training data. While fleets collect petabytes of video logs, identifying rare safety-critical events (e.g., erratic jaywalking, construction diversions) remains a manual, cost-prohibitive process. Existing solutions rely on coarse metadata search, which lacks precision, or cloud-based VLMs, which are privacy-invasive and expensive. We introduce Semantic-Drive, a local-first, neuro-symbolic framework for semantic data mining. Our approach decouples perception into two stages: (1) Symbolic Grounding via a real-time open-vocabulary detector (YOLOE) to anchor attention, and (2) Cognitive Analysis via a Reasoning VLM that performs forensic scene analysis. To mitigate hallucination, we implement a “System 2” inference-time alignment strategy, utilizing a multi-model “Judge-Scout” consensus mechanism. Benchmarked on the nuScenes dataset against the Waymo Open Dataset (WOD-E2E) taxonomy, Semantic-Drive achieves a Recall of 0.966 (vs. 0.475 for CLIP) and reduces Risk Assessment Error by 40% compared to single models. The system runs entirely on consumer hardware (NVIDIA RTX 3090), offering a privacy-preserving alternative to the cloud.
zh

[NLP-94] Hold Onto That Thought: Assessing KV Cache Compression On Reasoning

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在处理长上下文任务时因KV缓存(Key-Value Cache)线性增长导致的内存瓶颈问题。现有压缩算法虽能缓解缓存膨胀,但多数针对预填充阶段(prefill phase)优化,缺乏对需要长推理过程的任务(如GSM8K和MATH500等基准测试中的多步思考序列)的有效评估。论文的关键解决方案在于系统性地评测多种主流压缩策略在长推理场景下的表现,并发现:对于推理类模型,基于“重热点追踪”(heavy-hitter tracking)的H2O及改进版SnapKV(支持解码阶段的版本)表现最优,表明在推理轨迹中识别并保留高频关键信息对性能提升至关重要;同时,低预算下的淘汰策略反而能生成更长的推理链,揭示了缓存大小与推理成本之间的权衡关系。

链接: https://arxiv.org/abs/2512.12008
作者: Minghui Liu,Aadi Palnitkar,Tahseen Rabbani,Hyunwoo Jae,Kyle Rui Sang,Dixi Yao,Shayan Shabihi,Fuheng Zhao,Tian Li,Ce Zhang,Furong Huang,Kunpeng Zhang
机构: University of Maryland (马里兰大学); University of Chicago (芝加哥大学); University of Utah (犹他大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Performance (cs.PF)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated remarkable performance on long-context tasks, but are often bottlenecked by memory constraints. Namely, the KV cache, which is used to significantly speed up attention computations, grows linearly with context length. A suite of compression algorithms has been introduced to alleviate cache growth by evicting unimportant tokens. However, several popular strategies are targeted towards the prefill phase, i.e., processing long prompt context, and their performance is rarely assessed on reasoning tasks requiring long decoding. In particular, short but complex prompts, such as those in benchmarks like GSM8K and MATH500, often benefit from multi-step reasoning and self-reflection, resulting in thinking sequences thousands of tokens long. In this work, we benchmark the performance of several popular compression strategies on long-reasoning tasks. For the non-reasoning Llama-3.1-8B-Instruct, we determine that no singular strategy fits all, and that performance is heavily influenced by dataset type. However, we discover that H2O and our decoding-enabled variant of SnapKV are dominant strategies for reasoning models, indicating the utility of heavy-hitter tracking for reasoning traces. We also find that eviction strategies at low budgets can produce longer reasoning traces, revealing a tradeoff between cache size and inference costs.
zh

[NLP-95] Direct Confidence Alignment: Aligning Verbalized Confidence with Internal Confidence In Large Language Models ACL2025

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)中内部置信度(internal confidence,由token概率得出)与口头置信度(verbalized confidence,即模型对自身预测正确性的表述)之间不一致的问题,这种不一致性会导致不同校准方法产生误导性结果,进而影响模型的可信度和透明性。解决方案的关键在于提出直接置信度对齐(Direct Confidence Alignment, DCA),该方法采用直接偏好优化(Direct Preference Optimization)策略,使模型的口头置信度与其内部置信度对齐,而非直接对齐于真实准确率(ground-truth accuracy),从而提升模型在表达自信时的一致性和可靠性。

链接: https://arxiv.org/abs/2512.11998
作者: Glenn Zhang,Treasure Mayowa,Jason Fan,Yicheng Fu,Aaron Sandoval,Sean O’Brien,Kevin Zhu
机构: Algoverse AI Research
类目: Computation and Language (cs.CL)
备注: Accepted at ACL 2025 SRW, 5 pages body, 14 pages total

点击查看摘要

Abstract:Producing trustworthy and reliable Large Language Models (LLMs) has become increasingly important as their usage becomes more widespread. Calibration seeks to achieve this by improving the alignment between the model’s confidence and the actual likelihood of its responses being correct or desirable. However, it has been observed that the internal confidence of a model, derived from token probabilities, is not well aligned with its verbalized confidence, leading to misleading results with different calibration methods. In this paper, we propose Direct Confidence Alignment (DCA), a method using Direct Preference Optimization to align an LLM’s verbalized confidence with its internal confidence rather than ground-truth accuracy, enhancing model transparency and reliability by ensuring closer alignment between the two confidence measures. We evaluate DCA across multiple open-weight LLMs on a wide range of datasets. To further assess this alignment, we also introduce three new calibration error-based metrics. Our results show that DCA improves alignment metrics on certain model architectures, reducing inconsistencies in a model’s confidence expression. However, we also show that it can be ineffective on others, highlighting the need for more model-aware approaches in the pursuit of more interpretable and trustworthy LLMs.
zh

[NLP-96] opicProphet: Prophesies on Temporal Topic Trends and Stocks

【速读】: 该论文旨在解决股票预测中因量化数据缺乏因果逻辑及市场快速变化导致训练数据不足的问题。传统方法多聚焦于关键词和情绪影响,而本文提出了一种名为TopicProphet的新框架,其核心在于通过主题建模(topic modeling)、时间序列分析、断点检测(breakpoint detection)与区间优化(segment optimization)的组合策略,识别出与当前公共情绪趋势和历史背景相似的历史时期作为最优训练时段。该方案不仅利用了特定时代下的经济社会与政治状态所蕴含的细微模式,还有效缓解了金融数据稀缺问题,从而显著提升了模型对股价百分比变动的预测性能。

链接: https://arxiv.org/abs/2512.11857
作者: Olivia Kim
机构: Emory CS Department (埃默里大学计算机科学系)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Stocks can’t be predicted. Despite many hopes, this premise held itself true for many years due to the nature of quantitative stock data lacking causal logic along with rapid market changes hindering accumulation of significant data for training models. To undertake this matter, we propose a novel framework, TopicProphet, to analyze historical eras that share similar public sentiment trends and historical background. Our research deviates from previous studies that identified impacts of keywords and sentiments - we expand on that method by a sequence of topic modeling, temporal analysis, breakpoint detection and segment optimization to detect the optimal time period for training. This results in improving predictions by providing the model with nuanced patterns that occur from that era’s socioeconomic and political status while also resolving the shortage of pertinent stock data to train on. Through extensive analysis, we conclude that TopicProphet produces improved outcomes compared to the state-of-the-art methods in capturing the optimal training data for forecasting financial percentage changes.
zh

[NLP-97] KH-FUNSD: A Hierarchical and Fine-Grained Layout Analysis Dataset for Low-Resource Khmer Business Document

【速读】: 该论文旨在解决低资源、非拉丁文字脚本(特别是高棉语)在文档版面分析(document layout analysis)中的显著挑战,尤其针对商业文档(如收据、发票和报价单)缺乏专用标注数据集的问题。其解决方案的关键在于提出首个公开可用的分层标注数据集KH-FUNSD,采用三级注释框架:(1) 区域检测(region detection),将文档划分为头部、表单字段、页脚等核心区域;(2) 类似FUNSD的标注方式,区分问题、答案、标题及其他关键实体及其关系;(3) 细粒度分类,赋予具体语义角色(如字段标签、值、标题、页脚、符号)。该多层级设计支持全面的版面理解与精确的信息抽取,填补了高棉语商业文档AI处理的空白,并提供了首个基准测试结果。

链接: https://arxiv.org/abs/2512.11849
作者: Nimol Thuon,Jun Du
机构: National Engineering Research Center of Speech and Language Information Processing (NERC-SLIP); University of Science and Technology of China (中国科学技术大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Automated document layout analysis remains a major challenge for low-resource, non-Latin scripts. Khmer is a language spoken daily by over 17 million people in Cambodia, receiving little attention in the development of document AI tools. The lack of dedicated resources is particularly acute for business documents, which are critical for both public administration and private enterprise. To address this gap, we present \textbfKH-FUNSD, the first publicly available, hierarchically annotated dataset for Khmer form document understanding, including receipts, invoices, and quotations. Our annotation framework features a three-level design: (1) region detection that divides each document into core zones such as header, form field, and footer; (2) FUNSD-style annotation that distinguishes questions, answers, headers, and other key entities, together with their relationships; and (3) fine-grained classification that assigns specific semantic roles, such as field labels, values, headers, footers, and symbols. This multi-level approach supports both comprehensive layout analysis and precise information extraction. We benchmark several leading models, providing the first set of baseline results for Khmer business documents, and discuss the distinct challenges posed by non-Latin, low-resource scripts. The KH-FUNSD dataset and documentation will be available at URL.
zh

[NLP-98] Love First Know Later: Persona-Based Romantic Compatibility Through LLM Text World Engines NEURIPS2025

【速读】: 该论文旨在解决传统计算匹配方法中因依赖静态用户画像而导致的匹配效果不佳问题,特别是难以捕捉人际关系中动态交互与情境响应的关键作用。其解决方案的核心在于提出“先爱后知”(Love First, Know Later)范式,利用大语言模型(LLM)作为文本世界引擎,在双角色下运行:一方面作为受行为策略驱动的代理(persona-driven agents),另一方面作为建模互动动态的环境。通过将兼容性评估形式化为奖励建模问题,从模拟交互中提取预测人类偏好的信号,从而实现对初始化学反应和长期稳定性的有效预测。关键创新在于将关系心理学中“关系取决于对关键时刻的回应”这一观察转化为数学假设,并借助LLM逼近人类行为的能力,理论上证明了匹配结果可收敛至最优稳定匹配,实证上在速配数据和离婚预测任务中得到验证。

链接: https://arxiv.org/abs/2512.11844
作者: Haoyang Shang,Zhengyang Yan,Xuan Liu
机构: BreathingCORE
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: NeurIPS 2025 Workshop: First Workshop on LLM Persona Modeling (Oral)

点击查看摘要

Abstract:We propose Love First, Know Later: a paradigm shift in computational matching that simulates interactions first, then assesses compatibility. Instead of comparing static profiles, our framework leverages LLMs as text world engines that operate in dual capacity-as persona-driven agents following behavioral policies and as the environment modeling interaction dynamics. We formalize compatibility assessment as a reward-modeling problem: given observed matching outcomes, we learn to extract signals from simulations that predict human preferences. Our key insight is that relationships hinge on responses to critical moments-we translate this observation from relationship psychology into mathematical hypotheses, enabling effective simulation. Theoretically, we prove that as LLM policies better approximate human behavior, the induced matching converges to optimal stable matching. Empirically, we validate on speed dating data for initial chemistry and divorce prediction for long-term stability. This paradigm enables interactive, personalized matching systems where users iteratively refine their agents, unlocking future possibilities for transparent and interactive compatibility assessment.
zh

[NLP-99] he Ontological Dissonance Hypothesis: AI-Triggered Delusional Ideation as Folie a Deux Technologique

【速读】: 该论文试图解决的问题是:当代大语言模型(Large Language Models, LLMs)可能通过模拟具有高语言连贯性的互动关系,诱发用户产生类“集体性精神病”(folie à deux)的心理现象,尤其在情绪脆弱或心理不稳定个体中,可能导致其将无意识的系统误认为具有主体性、意图或情感的存在。解决方案的关键在于引入“本体论诚实”(ontological honesty)作为设计原则,要求技术系统明确揭示其非主体性本质,从而减少用户因语言表象与感知空缺之间的结构性张力而产生的投射性认同,进而缓解由人工智能中介引发的精神病理风险。

链接: https://arxiv.org/abs/2512.11818
作者: Izabela Lipinska,Hugh Brosnahan
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: 18 pages excluding appendices

点击查看摘要

Abstract:This paper argues that contemporary large language models (LLMs) can contribute to psychotic involvement by creating interactions that resemble the relational dynamics of folie a deux. Drawing on Bateson’s double bind theory, clinical literature on shared psychotic disorder, and McGilchrist’s hemisphere theory, we show how the combination of high linguistic coherence and the absence of an underlying subject produces a structural tension for the user: language suggests an interlocutor, while intuition registers a void. In contexts of emotional need or instability, this tension can lead users to resolve the conflict through imaginative projection, attributing interiority, intention, or presence to a system that possesses none. The paper situates these dynamics within emerging clinical reports, develops a phenomenological account of how they unfold, and argues that current engagement-optimised design choices exacerbate the risk. We conclude by proposing ‘ontological honesty’ as a necessary design principle for mitigating technologically mediated folie a deux.
zh

[NLP-100] Reinforcement Learning for Latent-Space Thinking in LLM s

【速读】: 该论文试图解决传统Chain-of-Thought (CoT)推理中依赖离散语言空间导致的效率低下问题,即大量生成的token仅满足语言规则而无助于实质推理。其核心解决方案是探索在连续嵌入空间(latent space)中进行思考的可行性,通过监督微调(如Coconut方法)和强化学习(Reinforcement Learning, RL)技术优化隐式推理步骤。关键创新在于提出一种新的Latent RL方法,直接对潜在空间中的思维步骤进行优化,从而提升推理效率与效果。然而实验表明,即便采用RL训练,模型在数学推理等复杂任务上仍落后于传统的语言空间CoT模型。

链接: https://arxiv.org/abs/2512.11816
作者: Enes Özeren,Matthias Aßenmacher
机构: LMU Munich (慕尼黑路德维希马克西米利安大学); Munich Center for Machine Learning (慕尼黑机器学习中心)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 16 pages, 16 figures, 7 tables

点击查看摘要

Abstract:Chain-of-Thought (CoT) reasoning typically utilizes the discrete language space for thinking, which is inherently inefficient, as many generated tokens only enforce linguistic rules that are not required for reasoning. To bypass this, latent-space thinking allows models to think using the continuous embedding space. While existing methods for training those models show domain-specific gains, they fail to maintain performance in complex tasks, such as mathematical reasoning. We experimentally demonstrate that the Coconut approach, a form of supervised fine-tuning for latent-space thinking, is highly sensitive to design choices and exhibits several inherent limitations. To address these issues, we investigate reinforcement learning (RL) techniques – an underexplored direction in latent-space thinking – including GRPO and design a novel Latent RL method for directly optimizing the latent thinking steps. Our experimental results reveal that these RL-trained models still lag behind traditional language-space CoT models in the mathematical reasoning domain. We make our codebase publicly available.
zh

[NLP-101] Enhancing Urban Visual Place Recognition for Crowdsourced Flood Imagery via LLM -Guided Attention

【速读】: 该论文旨在解决跨源场景下视觉地名识别(Visual Place Recognition, VPR)模型在处理社交媒体 crowdsourced street-view imagery 时性能显著下降的问题,尤其是由于图像视觉失真和域偏移导致的检索准确率降低。其解决方案的关键在于提出一种模型无关的框架 VPR-AttLLM,通过将大型语言模型(Large Language Models, LLMs)的语义推理与地理空间知识引入现有 VPR 流程,利用注意力机制引导特征描述子增强,从而识别城市语境中的位置信息区域并抑制瞬态视觉噪声。该方法无需重新训练模型或额外数据即可提升检索性能,在多个扩展基准测试中实现了 1–8% 的召回率相对提升,尤其在真实洪水影像上表现突出,同时建立了 LLM 指导的多模态融合范式,增强了系统对城市感知理论的建模能力与可解释性。

链接: https://arxiv.org/abs/2512.11811
作者: Fengyi Xu,Jun Ma,Waishan Qiu,Cui Guo
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Crowdsourced street-view imagery from social media provides valuable real-time visual evidence of urban flooding and other crisis events, yet it often lacks reliable geographic metadata for emergency response. Existing Visual Place Recognition (VPR) models exhibit substantial performance degradation when applied to such imagery due to visual distortions and domain shifts inherent in cross-source scenarios. This paper presents VPR-AttLLM, a model-agnostic framework that integrates the semantic reasoning and geospatial knowledge of Large Language Models (LLMs) into established VPR pipelines through attention-guided descriptor enhancement. By leveraging LLMs to identify location-informative regions within the city context and suppress transient visual noise, VPR-AttLLM improves retrieval performance without requiring model retraining or additional data. Comprehensive evaluations are conducted on extended benchmarks including SF-XL enriched with real social-media flood images, synthetic flooding scenarios over established query sets and Mapillary photos, and a new HK-URBAN dataset capturing morphologically distinct cityscapes. Integrating VPR-AttLLM with three state-of-the-art VPR models-CosPlace, EigenPlaces, and SALAD-consistently improves recall performance, yielding relative gains typically between 1-3% and reaching up to 8% on the most challenging real flood imagery. Beyond measurable gains in retrieval accuracy, this study establishes a generalizable paradigm for LLM-guided multimodal fusion in visual retrieval systems. By embedding principles from urban perception theory into attention mechanisms, VPR-AttLLM bridges human-like spatial reasoning with modern VPR architectures. Its plug-and-play design, strong cross-source robustness, and interpretability highlight its potential for scalable urban monitoring and rapid geo-localization of crowdsourced crisis imagery.
zh

[NLP-102] EMNLP: Educator-role Moral and Normative Large Language Models Profiling EMNLP

【速读】: 该论文旨在解决生成式 AI(Generative AI)在模拟专业角色(如教师)时,缺乏系统性心理与伦理评估的问题。其核心挑战在于如何量化模型在职业情境下的道德发展水平、人格特征以及对软提示注入(soft prompt injection)的脆弱性。解决方案的关键在于提出EMNLP框架——一个面向教育者角色的道德与规范性大语言模型(LLM)画像工具,通过构建88个教师专属道德困境量表,实现与人类教师的可比性评估,并引入针对性软提示注入测试以衡量模型合规性与风险暴露程度。实验表明,教师角色LLM虽在抽象道德推理上表现优异,但人格更理想化且易受有害提示影响,揭示了能力与安全性之间的悖论,从而为教育AI的伦理对齐提供了首个基准评测体系。

链接: https://arxiv.org/abs/2508.15250
作者: Yilin Jiang,Mingzi Zhang,Sheng Jin,Zengyi Yu,Xiangjie Kong,Binghao Tu
机构: Zhejiang University of Technology (浙江工业大学); The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); East China Normal University (华东师范大学); GuangHua Law School, ZheJiang University (光华法学院,浙江大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 29pages, 15 figures, Accepted by EMNLP Main Confrence

点击查看摘要

Abstract:Simulating Professions (SP) enables Large Language Models (LLMs) to emulate professional roles. However, comprehensive psychological and ethical evaluation in these contexts remains lacking. This paper introduces EMNLP, an Educator-role Moral and Normative LLMs Profiling framework for personality profiling, moral development stage measurement, and ethical risk under soft prompt injection. EMNLP extends existing scales and constructs 88 teacher-specific moral dilemmas, enabling profession-oriented comparison with human teachers. A targeted soft prompt injection set evaluates compliance and vulnerability in teacher SP. Experiments on 14 LLMs show teacher-role LLMs exhibit more idealized and polarized personalities than human teachers, excel in abstract moral reasoning, but struggle with emotionally complex situations. Models with stronger reasoning are more vulnerable to harmful prompt injection, revealing a paradox between capability and safety. The model temperature and other hyperparameters have limited influence except in some risk behaviors. This paper presents the first benchmark to assess ethical and psychological alignment of teacher-role LLMs for educational AI. Resources are available at this https URL.
zh

[NLP-103] Meta-GPT : Decoding the Metasurface Genome with Generative Artificial Intelligence

【速读】: 该论文旨在解决物理科学中人工智能模型缺乏可解释性与自然规律兼容性的问题,尤其是在光子学领域设计纳米结构时难以兼顾人类可理解性与计算效率。其解决方案的关键在于提出了一种符号化语言METASTRINGS,将光子超表面的材料、几何形状和晶格配置编码为文本序列,从而建立结构层次与人类可读性的映射关系;在此基础上训练了Meta-GPT模型,该模型通过物理信息监督学习、强化学习及思维链(chain-of-thought)策略进行微调,能够准确捕捉光-物质相互作用的组合规则,在多个设计任务中实现3%均方谱误差并保持98%语法有效性,生成的原型经实验验证其光学响应与目标谱高度一致,为AI驱动的光子学研究奠定了坚实基础。

链接: https://arxiv.org/abs/2512.12888
作者: David Dang,Stuart Love,Meena Salib,Quynh Dang,Samuel Rothfarb,Mysk Alnatour,Andrew Salij,Hou-Tong Chen,Ho Wai(Howard)Lee,Wilton J.M. Kort-Kamp
机构: 未知
类目: Optics (physics.optics); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Keywords: Physics-informed machine learning; Transformer models; Reinforcement learning; Chain-of-thought reasoning; Metasurfaces; Nanophotonics; Inverse design

点击查看摘要

Abstract:Advancing artificial intelligence for physical sciences requires representations that are both interpretable and compatible with the underlying laws of nature. We introduce METASTRINGS, a symbolic language for photonics that expresses nanostructures as textual sequences encoding materials, geometries, and lattice configurations. Analogous to molecular textual representations in chemistry, METASTRINGS provides a framework connecting human interpretability with computational design by capturing the structural hierarchy of photonic metasurfaces. Building on this representation, we develop Meta-GPT, a foundation transformer model trained on METASTRINGS and finetuned with physics-informed supervised, reinforcement, and chain-of-thought learning. Across various design tasks, the model achieves 3% mean-squared spectral error and maintains 98% syntactic validity, generating diverse metasurface prototypes whose experimentally measured optical responses match their target spectra. These results demonstrate that Meta-GPT can learn the compositional rules of light-matter interactions through METASTRINGS, laying a rigorous foundation for AI-driven photonics and representing an important step toward a metasurface genome project.
zh

[NLP-104] he Morphemic Origin of Zipfs Law: A Factorized Combinatorial Framework

【速读】: 该论文试图解决语言学中两个长期存在的统计现象:词长分布的典型形态(即词长集中在某一区间且尾部较薄)以及词频排名呈现Zipf-like幂律分布的问题。传统解释多基于随机文本生成或通信效率优化等假设,而本文提出一种基于形态结构的组合模型——形态组合词模型(Morphemic Combinatorial Word Model),其关键在于将词视为由前缀、词根、后缀和屈折词素等稳定构词单元在不同位置槽位上概率性激活并组合而成。通过仅依赖形态结构本身的组合规则与槽位的随机激活机制,该模型即可自然生成符合真实语言特征的词长分布和Zipf-like词频曲线(指数约为1.1–1.4),表明无需引入语义、交际压力或优化原则,仅靠形态内部结构与概率性组合即可解释跨语言的统计规律性。

链接: https://arxiv.org/abs/2512.12394
作者: Vladimir Berman
机构: Aitiologia LLC
类目: Methodology (stat.ME); Computation and Language (cs.CL); Physics and Society (physics.soc-ph); Applications (stat.AP)
备注:

点击查看摘要

Abstract:We present a simple structure based model of how words are formed from morphemes. The model explains two major empirical facts: the typical distribution of word lengths and the appearance of Zipf like rank frequency curves. In contrast to classical explanations based on random text or communication efficiency, our approach uses only the combinatorial organization of prefixes, roots, suffixes and inflections. In this Morphemic Combinatorial Word Model, a word is created by activating several positional slots. Each slot turns on with a certain probability and selects one morpheme from its inventory. Morphemes are treated as stable building blocks that regularly appear in word formation and have characteristic positions. This mechanism produces realistic word length patterns with a concentrated middle zone and a thin long tail, closely matching real languages. Simulations with synthetic morpheme inventories also generate rank frequency curves with Zipf like exponents around 1.1-1.4, similar to English, Russian and Romance languages. The key result is that Zipf like behavior can emerge without meaning, communication pressure or optimization principles. The internal structure of morphology alone, combined with probabilistic activation of slots, is sufficient to create the robust statistical patterns observed across languages.
zh

计算机视觉

[CV-0] DiffusionBrowser: Interactive Diffusion Previews via Multi-Branch Decoders

【速读】:该论文旨在解决视频扩散模型(video diffusion models)在生成过程中存在的精确性差、速度慢以及过程不透明的问题,这些问题使得用户难以在生成过程中进行有效干预或理解模型行为。解决方案的关键在于提出 DiffusionBrowser,这是一个与模型无关且轻量级的解码器框架,能够在去噪过程中的任意时间步(timestep)或Transformer块(transformer block)处实时生成多模态预览表示(包括RGB图像和场景内在信息),其生成速度超过4倍实时速度(4秒视频生成耗时小于1秒),并保持与最终视频一致的外观和运动特性;同时,借助训练好的解码器,可通过随机性重注入(stochasticity reinjection)和模态引导(modal steering)实现对中间噪声步骤的交互式控制,从而揭示原本黑箱的去噪过程中场景、物体等细节的组合与构建机制。

链接: https://arxiv.org/abs/2512.13690
作者: Susung Hong,Chongjian Ge,Zhifei Zhang,Jui-Hsien Wang
机构: University of Washington (华盛顿大学); Adobe Research (Adobe 研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
备注: Project page: this https URL

点击查看摘要

Abstract:Video diffusion models have revolutionized generative video synthesis, but they are imprecise, slow, and can be opaque during generation – keeping users in the dark for a prolonged period. In this work, we propose DiffusionBrowser, a model-agnostic, lightweight decoder framework that allows users to interactively generate previews at any point (timestep or transformer block) during the denoising process. Our model can generate multi-modal preview representations that include RGB and scene intrinsics at more than 4 \times real-time speed (less than 1 second for a 4-second video) that convey consistent appearance and motion to the final video. With the trained decoder, we show that it is possible to interactively guide the generation at intermediate noise steps via stochasticity reinjection and modal steering, unlocking a new control capability. Moreover, we systematically probe the model using the learned decoders, revealing how scene, object, and other details are composed and assembled during the otherwise black-box denoising process.
zh

[CV-1] LitePT: Lighter Yet Stronger Point Transformer

【速读】:该论文旨在解决3D点云处理神经网络中卷积层(convolutional layers)与注意力模块(attention blocks)的最优组合方式不明确的问题。现有架构通常混合使用这两类模块,但缺乏对它们在不同网络深度下作用机制的系统性理解。研究发现:早期低分辨率层中,卷积足以高效提取局部几何特征,此时引入注意力机制不仅计算昂贵且无显著收益;而在深层高语义层次上,注意力能更有效地建模全局上下文信息。基于此设计原则,作者提出LitePT模型,其核心创新在于采用“前卷积后注意力”的分阶段结构,并引入一种无需训练的3D位置编码方法PointROPE,以保留空间布局信息并减少冗余计算。该方案使模型参数量减少3.6倍、推理速度提升2倍、内存消耗降低2倍,同时保持或超越当前最优模型Point Transformer V3的性能表现。

链接: https://arxiv.org/abs/2512.13689
作者: Yuanwen Yue,Damien Robert,Jianyuan Wang,Sunghwan Hong,Jan Dirk Wegner,Christian Rupprecht,Konrad Schindler
机构: ETH Zurich (苏黎世联邦理工学院); University of Oxford (牛津大学); University of Zurich (苏黎世大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Modern neural architectures for 3D point cloud processing contain both convolutional layers and attention blocks, but the best way to assemble them remains unclear. We analyse the role of different computational blocks in 3D point cloud networks and find an intuitive behaviour: convolution is adequate to extract low-level geometry at high-resolution in early layers, where attention is expensive without bringing any benefits; attention captures high-level semantics and context in low-resolution, deep layers more efficiently. Guided by this design principle, we propose a new, improved 3D point cloud backbone that employs convolutions in early stages and switches to attention for deeper layers. To avoid the loss of spatial layout information when discarding redundant convolution layers, we introduce a novel, training-free 3D positional encoding, PointROPE. The resulting LitePT model has 3.6\times fewer parameters, runs 2\times faster, and uses 2\times less memory than the state-of-the-art Point Transformer V3, but nonetheless matches or even outperforms it on a range of tasks and datasets. Code and models are available at: this https URL.
zh

[CV-2] owards Scalable Pre-training of Visual Tokenizers for Generation

【速读】:该论文旨在解决视觉分词器(Visual Tokenizer)预训练中存在“预训练扩展问题”(pre-training scaling problem),即传统基于重建损失的训练范式导致潜在空间偏向低级特征,使得更高的像素级精度无法转化为更高质量的生成结果,从而限制了计算资源投入与生成性能之间的正向Scaling关系。解决方案的关键在于提出一种统一的视觉分词器预训练框架VTP(Visual Tokenizer Pre-training),首次联合优化图像-文本对比损失、自监督损失和重建损失,使潜在空间能够简洁地表征高级语义信息,从而实现生成性能随预训练计算量、参数和数据的有效提升。

链接: https://arxiv.org/abs/2512.13687
作者: Jingfeng Yao,Yuda Song,Yucong Zhou,Xinggang Wang
机构: Huazhong University of Science and Technology (华中科技大学); MiniMax
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Our pre-trained models are available at this https URL

点击查看摘要

Abstract:The quality of the latent space in visual tokenizers (e.g., VAEs) is crucial for modern generative models. However, the standard reconstruction-based training paradigm produces a latent space that is biased towards low-level information, leading to a foundation flaw: better pixel-level accuracy does not lead to higher-quality generation. This implies that pouring extensive compute into visual tokenizer pre-training translates poorly to improved performance in generation. We identify this as the pre-training scaling problem and suggest a necessary shift: to be effective for generation, a latent space must concisely represent high-level semantics. We present VTP, a unified visual tokenizer pre-training framework, pioneering the joint optimization of image-text contrastive, self-supervised, and reconstruction losses. Our large-scale study reveals two principal findings: (1) understanding is a key driver of generation, and (2) much better scaling properties, where generative performance scales effectively with compute, parameters, and data allocated to the pretraining of the visual tokenizer. After large-scale pre-training, our tokenizer delivers a competitive profile (78.2 zero-shot accuracy and 0.36 rFID on ImageNet) and 4.1 times faster convergence on generation compared to advanced distillation methods. More importantly, it scales effectively: without modifying standard DiT training specs, solely investing more FLOPS in pretraining VTP achieves 65.8% FID improvement in downstream generation, while conventional autoencoder stagnates very early at 1/10 FLOPS. Our pre-trained models are available at this https URL.
zh

[CV-3] Recurrent Video Masked Autoencoders

【速读】:该论文旨在解决视频表示学习中如何高效捕捉时空结构、同时兼顾模型参数效率与长时序特征传播稳定性的难题。其解决方案的关键在于提出一种基于Transformer的递归神经网络架构——RVM(Recurrent Video Masked-Autoencoders),通过异构掩码预测任务实现仅依赖标准像素重建目标即可学习高质量视频表征;该设计不仅使模型在动作识别和目标跟踪等视频级任务上达到与VideoMAE、V-JEPA等先进方法相当的性能,还在几何感知和密集空间理解任务中优于图像模型如DINOv2,且在小模型规模下无需知识蒸馏即可实现高达30倍的参数效率提升;此外,RVM的递归机制支持线性计算复杂度下的长期时序特征传播,克服了传统时空注意力架构的局限性。

链接: https://arxiv.org/abs/2512.13684
作者: Daniel Zoran,Nikhil Parthasarathy,Yi Yang,Drew A Hudson,Joao Carreira,Andrew Zisserman
机构: Google DeepMind
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present Recurrent Video Masked-Autoencoders (RVM): a novel video representation learning approach that uses a transformer-based recurrent neural network to aggregate dense image features over time, effectively capturing the spatio-temporal structure of natural video data. RVM learns via an asymmetric masked prediction task requiring only a standard pixel reconstruction objective. This design yields a highly efficient ``generalist’’ encoder: RVM achieves competitive performance with state-of-the-art video models (e.g. VideoMAE, V-JEPA) on video-level tasks like action recognition and point/object tracking, while also performing favorably against image models (e.g. DINOv2) on tasks that test geometric and dense spatial understanding. Notably, RVM achieves strong performance in the small-model regime without requiring knowledge distillation, exhibiting up to 30x greater parameter efficiency than competing video masked autoencoders. Moreover, we demonstrate that RVM’s recurrent nature allows for stable feature propagation over long temporal horizons with linear computational cost, overcoming some of the limitations of standard spatio-temporal attention-based architectures. Finally, we use qualitative visualizations to highlight that RVM learns rich representations of scene semantics, structure, and motion.
zh

[CV-4] I-Scene: 3D Instance Models are Implicit Generalizable Spatial Learners

【速读】:该论文旨在解决交互式三维场景生成中的泛化能力问题,即现有基于学习的方法受限于训练数据集的场景布局,难以推广到未见布局或新物体组合。其解决方案的关键在于重新编程一个预训练的3D实例生成器(3D instance generator),使其作为场景级学习者,将依赖数据集的监督信号替换为以模型为中心的空间监督信号(model-centric spatial supervision)。这一机制释放了生成器中可迁移的空间知识,从而实现对未知布局和新颖物体组合的泛化能力。特别地,即使训练场景由随机排列的物体组成,空间推理仍能自发涌现,表明生成器的可迁移场景先验能够从纯几何线索中推断出邻近性、支撑关系和对称性等空间关系。此外,论文摒弃传统规范空间(canonical space)建模方式,采用视点中心(view-centric)的场景空间表述,构建了一个全前馈、可泛化的场景生成器,直接从实例模型中学习空间关系。

链接: https://arxiv.org/abs/2512.13683
作者: Lu Ling,Yunhao Ge,Yichen Sheng,Aniket Bera
机构: Purdue University (普渡大学); NVIDIA Research (英伟达研究)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Generalization remains the central challenge for interactive 3D scene generation. Existing learning-based approaches ground spatial understanding in limited scene dataset, restricting generalization to new layouts. We instead reprogram a pre-trained 3D instance generator to act as a scene level learner, replacing dataset-bounded supervision with model-centric spatial supervision. This reprogramming unlocks the generator transferable spatial knowledge, enabling generalization to unseen layouts and novel object compositions. Remarkably, spatial reasoning still emerges even when the training scenes are randomly composed objects. This demonstrates that the generator’s transferable scene prior provides a rich learning signal for inferring proximity, support, and symmetry from purely geometric cues. Replacing widely used canonical space, we instantiate this insight with a view-centric formulation of the scene space, yielding a fully feed-forward, generalizable scene generator that learns spatial relations directly from the instance model. Quantitative and qualitative results show that a 3D instance generator is an implicit spatial learner and reasoner, pointing toward foundation models for interactive 3D scene understanding and generation. Project page: this https URL
zh

[CV-5] LASER: Layer-wise Scale Alignment for Training-Free Streaming 4D Reconstruction

【速读】:该论文旨在解决当前前向重建模型(如VGGT和π³)在处理流式视频时因二次内存复杂度导致无法实用部署的问题。现有流式方法虽通过学习记忆机制或因果注意力缓解此问题,但通常需大量重训练且难以充分利用先进离线模型的几何先验优势。解决方案的关键在于提出一种无需训练的框架LASER,其核心思想是通过跨连续时间窗口对齐预测来实现从离线模型到流式系统的转换;特别地,针对简单相似变换(Sim(3))因层间深度错位而失效的问题,引入了逐层尺度对齐策略——将深度预测分割为离散层,计算每层尺度因子,并在相邻窗口与时间戳间传播这些因子,从而有效解决单目尺度歧义引发的深度比例不一致问题。

链接: https://arxiv.org/abs/2512.13680
作者: Tianye Ding,Yiming Xie,Yiqing Liang,Moitreya Chatterjee,Pedro Miraldo,Huaizu Jiang
机构: Northeastern University (东北大学); Independent Researcher (独立研究者); Mitsubishi Electric Research Laboratories (三菱电机研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages

点击查看摘要

Abstract:Recent feed-forward reconstruction models like VGGT and \pi^3 achieve impressive reconstruction quality but cannot process streaming videos due to quadratic memory complexity, limiting their practical deployment. While existing streaming methods address this through learned memory mechanisms or causal attention, they require extensive retraining and may not fully leverage the strong geometric priors of state-of-the-art offline models. We propose LASER, a training-free framework that converts an offline reconstruction model into a streaming system by aligning predictions across consecutive temporal windows. We observe that simple similarity transformation ( \mathrmSim(3) ) alignment fails due to layer depth misalignment: monocular scale ambiguity causes relative depth scales of different scene layers to vary inconsistently between windows. To address this, we introduce layer-wise scale alignment, which segments depth predictions into discrete layers, computes per-layer scale factors, and propagates them across both adjacent windows and timestamps. Extensive experiments show that LASER achieves state-of-the-art performance on camera pose estimation and point map reconstruction %quality with offline models while operating at 14 FPS with 6 GB peak memory on a RTX A6000 GPU, enabling practical deployment for kilometer-scale streaming videos. Project website: \hrefthis https URL\textttthis https URL
zh

[CV-6] Feedforward 3D Editing via Text-Steerable Image-to-3D ALT

【速读】:该论文旨在解决当前图像到3D(image-to-3D)生成模型在实际应用中缺乏灵活编辑能力的问题,尤其是难以通过自然语言指令对生成的3D资产进行可控修改。为实现这一目标,作者提出了一种前向传播(feedforward)方法Steer3D,其核心创新在于借鉴ControlNet的思想,将文本控制信号直接嵌入到预训练图像到3D生成模型的前向推理过程中,从而实现无需微调即可通过语言指令对3D内容进行精准编辑。关键解决方案包括:构建一个可扩展的数据生成引擎以自动获取大规模训练数据,并采用基于流匹配(flow-matching)训练与直接偏好优化(Direct Preference Optimization, DPO)相结合的两阶段训练策略,使模型在保持原始3D结构一致性的同时更忠实于文本指令,且推理速度相比现有方法提升2.4至28.5倍。

链接: https://arxiv.org/abs/2512.13678
作者: Ziqi Ma,Hongqiao Chen,Yisong Yue,Georgia Gkioxari
机构: California Institute of Technology (加州理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: this https URL

点击查看摘要

Abstract:Recent progress in image-to-3D has opened up immense possibilities for design, AR/VR, and robotics. However, to use AI-generated 3D assets in real applications, a critical requirement is the capability to edit them easily. We present a feedforward method, Steer3D, to add text steerability to image-to-3D models, which enables editing of generated 3D assets with language. Our approach is inspired by ControlNet, which we adapt to image-to-3D generation to enable text steering directly in a forward pass. We build a scalable data engine for automatic data generation, and develop a two-stage training recipe based on flow-matching training and Direct Preference Optimization (DPO). Compared to competing methods, Steer3D more faithfully follows the language instruction and maintains better consistency with the original 3D asset, while being 2.4x to 28.5x faster. Steer3D demonstrates that it is possible to add a new modality (text) to steer the generation of pretrained image-to-3D generative models with 100k data. Project website: this https URL
zh

[CV-7] JoVA: Unified Multimodal Learning for Joint Video-Audio Generation

【速读】:该论文旨在解决当前视频-音频联合生成方法中存在的两个关键问题:一是多数现有方法仅能生成环境音效,无法生成与唇部动作同步的人类语音;二是统一的人类视频-音频生成方法通常依赖显式融合或模态特定对齐模块,导致模型架构复杂化并削弱了原始Transformer的简洁性。其解决方案的关键在于提出JoVA框架,通过在每个Transformer层中引入跨视频和音频token的联合自注意力机制(joint self-attention),实现无需额外对齐模块的直接高效跨模态交互;同时设计基于面部关键点检测的嘴部区域损失(mouth-area loss),增强训练期间对唇部区域的监督,从而提升语音与唇动的同步质量,且不牺牲架构简洁性。

链接: https://arxiv.org/abs/2512.13677
作者: Xiaohu Huang,Hao Zhou,Qiangpeng Yang,Shilei Wen,Kai Han
机构: The University of Hong Kong (香港大学); ByteDance (字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: \url{ this https URL }

点击查看摘要

Abstract:In this paper, we present JoVA, a unified framework for joint video-audio generation. Despite recent encouraging advances, existing methods face two critical limitations. First, most existing approaches can only generate ambient sounds and lack the capability to produce human speech synchronized with lip movements. Second, recent attempts at unified human video-audio generation typically rely on explicit fusion or modality-specific alignment modules, which introduce additional architecture design and weaken the model simplicity of the original transformers. To address these issues, JoVA employs joint self-attention across video and audio tokens within each transformer layer, enabling direct and efficient cross-modal interaction without the need for additional alignment modules. Furthermore, to enable high-quality lip-speech synchronization, we introduce a simple yet effective mouth-area loss based on facial keypoint detection, which enhances supervision on the critical mouth region during training without compromising architectural simplicity. Extensive experiments on benchmarks demonstrate that JoVA outperforms or is competitive with both unified and audio-driven state-of-the-art methods in lip-sync accuracy, speech quality, and overall video-audio generation fidelity. Our results establish JoVA as an elegant framework for high-quality multimodal generation.
zh

[CV-8] Directional Textual Inversion for Personalized Text-to-Image Generation

【速读】:该论文旨在解决文本到图像个性化(text-to-image personalization)中,传统文本反转(Textual Inversion, TI)方法在复杂提示词下性能下降的问题。核心问题在于嵌入向量范数膨胀(embedding norm inflation):学习到的令牌(token)偏离了分布内幅度,导致预归一化(pre-norm)Transformer 中提示条件控制能力退化。解决方案的关键是提出方向性文本反转(Directional Textual Inversion, DTI),其通过固定嵌入向量的范数至分布内尺度,并仅在单位超球面上优化方向,利用黎曼随机梯度下降(Riemannian SGD)实现高效训练。DTI 将方向学习建模为带 von Mises-Fisher 先验的最大后验估计(MAP),引入恒定方向先验梯度,从而提升提示忠实度(text fidelity)并保持主体相似性;更重要的是,其超球面参数化支持语义连贯的插值(slerp),显著优于标准 TI 的离散概念表示。

链接: https://arxiv.org/abs/2512.13672
作者: Kunhee Kim,NaHyeon Park,Kibeom Hong,Hyunjung Shim
机构: KAIST(韩国科学技术院); Sookmyung Woman’s University(淑明女子大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Textual Inversion (TI) is an efficient approach to text-to-image personalization but often fails on complex prompts. We trace these failures to embedding norm inflation: learned tokens drift to out-of-distribution magnitudes, degrading prompt conditioning in pre-norm Transformers. Empirically, we show semantics are primarily encoded by direction in CLIP token space, while inflated norms harm contextualization; theoretically, we analyze how large magnitudes attenuate positional information and hinder residual updates in pre-norm blocks. We propose Directional Textual Inversion (DTI), which fixes the embedding magnitude to an in-distribution scale and optimizes only direction on the unit hypersphere via Riemannian SGD. We cast direction learning as MAP with a von Mises-Fisher prior, yielding a constant-direction prior gradient that is simple and efficient to incorporate. Across personalization tasks, DTI improves text fidelity over TI and TI-variants while maintaining subject similarity. Crucially, DTI’s hyperspherical parameterization enables smooth, semantically coherent interpolation between learned concepts (slerp), a capability that is absent in standard TI. Our findings suggest that direction-only optimization is a robust and scalable path for prompt-faithful personalization.
zh

[CV-9] Agent IAD: Tool-Augmented Single-Agent for Industrial Anomaly Detection

【速读】:该论文旨在解决工业异常检测(Industrial Anomaly Detection, IAD)中因正常参考样本稀缺以及缺陷具有细微且局部特性而导致的检测困难问题。现有单次通过的视觉语言模型(Vision-Language Models, VLMs)往往忽略小尺度异常,且缺乏显式的机制来与标准正常模式进行对比。解决方案的关键在于提出AgentIAD——一个工具驱动的智能体框架,其核心包含两个模块:Perceptive Zoomer(PZ)用于局部细粒度分析,Comparative Retriever(CR)在证据模糊时检索正常样本以辅助判断。训练过程分为两阶段:监督微调与强化学习,结合双奖励机制(感知奖励和行为奖励),引导模型通过逐步观察、缩放和验证实现精准判别,最终在MMAD数据集上达到97.62%的分类准确率,显著优于以往基于多模态大语言模型(Multimodal Large Language Models, MLLMs)的方法,并生成可解释的检测轨迹。

链接: https://arxiv.org/abs/2512.13671
作者: Junwen Miao,Penghui Du,Yi Liu,Yu Wang,Yan Wang
机构: Tsinghua University (清华大学); Beihang University (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Industrial anomaly detection (IAD) is difficult due to the scarcity of normal reference samples and the subtle, localized nature of many defects. Single-pass vision-language models (VLMs) often overlook small abnormalities and lack explicit mechanisms to compare against canonical normal patterns. We propose AgentIAD, a tool-driven agentic framework that enables multi-stage visual inspection. The agent is equipped with a Perceptive Zoomer (PZ) for localized fine-grained analysis and a Comparative Retriever (CR) for querying normal exemplars when evidence is ambiguous. To teach these inspection behaviors, we construct structured perceptive and comparative trajectories from the MMAD dataset and train the model in two stages: supervised fine-tuning followed by reinforcement learning. A two-part reward design drives this process: a perception reward that supervises classification accuracy, spatial alignment, and type correctness, and a behavior reward that encourages efficient tool use. Together, these components enable the model to refine its judgment through step-wise observation, zooming, and verification. AgentIAD achieves a new state-of-the-art 97.62% classification accuracy on MMAD, surpassing prior MLLM-based approaches while producing transparent and interpretable inspection traces.
zh

[CV-10] Grab-3D: Detecting AI-Generated Videos from 3D Geometric Temporal Consistency

【速读】:该论文旨在解决当前AI生成视频检测方法在3D几何一致性建模上的不足,即现有技术对生成视频中隐含的三维空间结构信息挖掘有限,导致检测性能受限。其解决方案的关键在于引入**消失点(vanishing points)**作为显式三维几何模式表征,并提出Grab-3D框架,通过几何感知的Transformer架构实现对视频时序中3D几何一致性的建模:该框架包含几何位置编码、时序-几何注意力机制以及基于指数移动平均(EMA)的几何分类头,从而将3D几何先验知识显式注入到时序建模过程中,显著提升了检测精度与跨生成器的泛化能力。

链接: https://arxiv.org/abs/2512.13665
作者: Wenhan Chen,Sezer Karaoglu,Theo Gevers
机构: University of Amsterdam (阿姆斯特丹大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in diffusion-based generation techniques enable AI models to produce highly realistic videos, heightening the need for reliable detection mechanisms. However, existing detection methods provide only limited exploration of the 3D geometric patterns present in generated videos. In this paper, we use vanishing points as an explicit representation of 3D geometry patterns, revealing fundamental discrepancies in geometric consistency between real and AI-generated videos. We introduce Grab-3D, a geometry-aware transformer framework for detecting AI-generated videos based on 3D geometric temporal consistency. To enable reliable evaluation, we construct an AI-generated video dataset of static scenes, allowing stable 3D geometric feature extraction. We propose a geometry-aware transformer equipped with geometric positional encoding, temporal-geometric attention, and an EMA-based geometric classifier head to explicitly inject 3D geometric awareness into temporal modeling. Experiments demonstrate that Grab-3D significantly outperforms state-of-the-art detectors, achieving robust cross-domain generalization to unseen generators.
zh

[CV-11] RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics

【速读】:该论文旨在解决机器人在复杂现实场景中进行多步骤、基于度量的时空指代与测量任务(spatial tracing)的难题,此类任务要求模型同时具备3D空间指代能力(spatial referring)、精确的距离/尺度感知(metric measurement)以及多步推理(multi-step metric-grounded reasoning)。现有方法难以有效处理这一组合性挑战。解决方案的关键在于提出RoboTracer——一个3D-aware视觉语言模型(VLM),其核心创新包括:(1) 通过统一的空间编码器(universal spatial encoder)和回归监督的解码器(regression-supervised decoder),在监督微调(SFT)阶段增强对尺度的敏感性;(2) 利用强化微调(RFT)结合度量敏感的过程奖励(metric-sensitive process rewards),并监督关键中间感知线索,以实现高精度的多步空间轨迹生成;(3) 构建大规模数据集TraceSpatial(含30M问答对)及评测基准TraceSpatial-Bench,支撑训练与评估。实验表明,RoboTracer在空间理解、测量和指代任务上平均成功率达79.1%,且在TraceSpatial-Bench上显著优于Gemini-2.5-Pro(提升36%准确率),并可适配多种机器人控制策略,在真实杂乱环境中执行长时程动态任务。

链接: https://arxiv.org/abs/2512.13660
作者: Enshen Zhou,Cheng Chi,Yibo Li,Jingkun An,Jiayuan Zhang,Shanyu Rong,Yi Han,Yuheng Ji,Mengzhen Liu,Pengwei Wang,Zhongyuan Wang,Lu Sheng,Shanghang Zhang
机构: Beihang University (北京航空航天大学); Peking University (北京大学); Beijing Academy of Artificial Intelligence (北京人工智能研究院); CASIA (中国科学院自动化研究所)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Spatial tracing, as a fundamental embodied interaction ability for robots, is inherently challenging as it requires multi-step metric-grounded reasoning compounded with complex spatial referring and real-world metric measurement. However, existing methods struggle with this compositional task. To this end, we propose RoboTracer, a 3D-aware VLM that first achieves both 3D spatial referring and measuring via a universal spatial encoder and a regression-supervised decoder to enhance scale awareness during supervised fine-tuning (SFT). Moreover, RoboTracer advances multi-step metric-grounded reasoning via reinforcement fine-tuning (RFT) with metric-sensitive process rewards, supervising key intermediate perceptual cues to accurately generate spatial traces. To support SFT and RFT training, we introduce TraceSpatial, a large-scale dataset of 30M QA pairs, spanning outdoor/indoor/tabletop scenes and supporting complex reasoning processes (up to 9 steps). We further present TraceSpatial-Bench, a challenging benchmark filling the gap to evaluate spatial tracing. Experimental results show that RoboTracer surpasses baselines in spatial understanding, measuring, and referring, with an average success rate of 79.1%, and also achieves SOTA performance on TraceSpatial-Bench by a large margin, exceeding Gemini-2.5-Pro by 36% accuracy. Notably, RoboTracer can be integrated with various control policies to execute long-horizon, dynamic tasks across diverse robots (UR5, G1 humanoid) in cluttered real-world scenes.
zh

[CV-12] World Models Can Leverag e Human Videos for Dexterous Manipulation

【速读】:该论文旨在解决灵巧操作(Dexterous Manipulation)中因手部细微运动通过物体接触影响环境而带来的建模挑战,核心问题是现有世界模型难以准确预测复杂手部动作下的环境状态演化。解决方案的关键在于提出 DexWM——一个基于潜在状态的灵巧操作世界模型,其通过利用超过 900 小时的人类与非灵巧机器人视频数据进行训练以缓解数据稀缺问题,并引入辅助的手部一致性损失(hand consistency loss),确保手部配置的准确性,从而显著提升对未来状态的预测精度和零样本泛化能力,在 Franka Panda 机械臂搭配 Allegro 夹爪上实测优于 Diffusion Policy 超过 50% 的性能表现。

链接: https://arxiv.org/abs/2512.13644
作者: Raktim Gautam Goswami,Amir Bar,David Fan,Tsung-Yen Yang,Gaoyue Zhou,Prashanth Krishnamurthy,Michael Rabbat,Farshad Khorrami,Yann LeCun
机构: New York University (纽约大学); Facebook AI Research (Facebook人工智能研究中心); Google (谷歌); Meta; NYU Tandon School of Engineering (纽约大学坦顿工程学院); NVIDIA (英伟达)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Dexterous manipulation is challenging because it requires understanding how subtle hand motion influences the environment through contact with objects. We introduce DexWM, a Dexterous Manipulation World Model that predicts the next latent state of the environment conditioned on past states and dexterous actions. To overcome the scarcity of dexterous manipulation datasets, DexWM is trained on over 900 hours of human and non-dexterous robot videos. To enable fine-grained dexterity, we find that predicting visual features alone is insufficient; therefore, we introduce an auxiliary hand consistency loss that enforces accurate hand configurations. DexWM outperforms prior world models conditioned on text, navigation, and full-body actions, achieving more accurate predictions of future states. DexWM also demonstrates strong zero-shot generalization to unseen manipulation skills when deployed on a Franka Panda arm equipped with an Allegro gripper, outperforming Diffusion Policy by over 50% on average in grasping, placing, and reaching tasks.
zh

[CV-13] From Code to Field: Evaluating the Robustness of Convolutional Neural Networks for Disease Diagnosis in Mango Leaves

【速读】:该论文旨在解决农业领域中用于芒果叶片病害诊断的卷积神经网络(Convolutional Neural Networks, CNNs)在真实世界复杂环境下的鲁棒性不足问题,尤其是在图像退化(如噪声、模糊和天气变化)条件下模型性能下降的问题。其解决方案的关键在于构建了一个针对芒果叶片图像的增强数据集 MangoLeafDB-C,包含19类人工污染类型及五级严重程度,并在此基础上对五种CNN架构(包括轻量级模型LCNN与主流模型如ResNet-50/101、VGG-16、Xception)进行系统性鲁棒性评估,采用F1分数、污染误差(Corruption Error, CE)和相对均值污染误差(relative mean Corruption Error, relative mCE)作为指标。结果表明,轻量化且专为芒果叶片诊断设计的LCNN在多种现实场景污染下(如失焦模糊、运动模糊)表现最优,且整体mCE最低,优于复杂模型,揭示了在边缘设备部署时,专用轻量模型比通用复杂模型更具备鲁棒性和实用性。

链接: https://arxiv.org/abs/2512.13641
作者: Gabriel Vitorino de Andrade,Saulo Roberto dos Santos,Itallo Patrick Castro Alves da Silva,Emanuel Adler Medeiros Pereira,Erick de Andrade Barboza
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: This work was presented at the BRACIS 2025 conference in Fortaleza

点击查看摘要

Abstract:The validation and verification of artificial intelligence (AI) models through robustness assessment are essential to guarantee the reliable performance of intelligent systems facing real-world challenges, such as image corruptions including noise, blurring, and weather variations. Despite the global importance of mango (Mangifera indica L.), there is a lack of studies on the robustness of models for the diagnosis of disease in its leaves. This paper proposes a methodology to evaluate convolutional neural networks (CNNs) under adverse conditions. We adapted the MangoLeafDB dataset, generating MangoLeafDB-C with 19 types of artificial corruptions at five severity levels. We conducted a benchmark comparing five architectures: ResNet-50, ResNet-101, VGG-16, Xception, and LCNN (the latter being a lightweight architecture designed specifically for mango leaf diagnosis). The metrics include the F1 score, the corruption error (CE) and the relative mean corruption error (relative mCE). The results show that LCNN outperformed complex models in corruptions that can be present in real-world scenarios such as Defocus Blur, Motion Blur, while also achieving the lowest mCE. Modern architectures (e.g., ResNet-101) exhibited significant performance degradation in corrupted scenarios, despite their high accuracy under ideal conditions. These findings suggest that lightweight and specialized models may be more suitable for real-world applications in edge devices, where robustness and efficiency are critical. The study highlights the need to incorporate robustness assessments in the development of intelligent systems for agriculture, particularly in regions with technological limitations.
zh

[CV-14] Charge: A Comprehensive Novel View Synthesis Benchmark and Dataset to Bind Them All

【速读】:该论文旨在解决4D场景重建与新视角生成(Novel View Synthesis)任务中因数据质量与多样性不足而导致模型训练和评估受限的问题。其解决方案的关键在于构建一个高质量、多模态且结构化的新型数据集,该数据集源自一部高保真动画电影,包含丰富的动态场景、精细的纹理、光照与运动信息,并提供RGB图像、深度图、表面法向量、物体分割掩码及光流等互补模态,同时设计了三种不同稀疏程度的采集场景(密集多视角相机布局、稀疏相机排列和单目视频序列),从而支持在多种数据条件下对先进视图合成算法进行系统性验证与比较。

链接: https://arxiv.org/abs/2512.13639
作者: Michal Nazarczuk,Thomas Tanay,Arthur Moreau,Zhensong Zhang,Eduardo Pérez-Pellitero
机构: Huawei Noah’s Ark Lab (华为诺亚方舟实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:This paper presents a new dataset for Novel View Synthesis, generated from a high-quality, animated film with stunning realism and intricate detail. Our dataset captures a variety of dynamic scenes, complete with detailed textures, lighting, and motion, making it ideal for training and evaluating cutting-edge 4D scene reconstruction and novel view generation models. In addition to high-fidelity RGB images, we provide multiple complementary modalities, including depth, surface normals, object segmentation and optical flow, enabling a deeper understanding of scene geometry and motion. The dataset is organised into three distinct benchmarking scenarios: a dense multi-view camera setup, a sparse camera arrangement, and monocular video sequences, enabling a wide range of experimentation and comparison across varying levels of data sparsity. With its combination of visual richness, high-quality annotations, and diverse experimental setups, this dataset offers a unique resource for pushing the boundaries of view synthesis and 3D vision.
zh

[CV-15] MindDrive: A Vision-Language-Action Model for Autonomous Driving via Online Reinforcement Learning

【速读】:该论文旨在解决当前自动驾驶中视觉-语言-动作(Vision-Language-Action, VLA)范式依赖模仿学习(Imitation Learning, IL)所引发的分布偏移(distribution shift)和因果混淆(causal confusion)问题,同时克服在线强化学习(online Reinforcement Learning)在连续动作空间中探索效率低下的瓶颈。其解决方案的关键在于提出MindDrive框架,该框架基于一个大型语言模型(Large Language Model, LLM),并配置两组独立的LoRA参数:一组作为决策专家(Decision Expert)负责场景推理与驾驶决策生成,另一组作为动作专家(Action Expert)将语言化决策动态映射为可行轨迹。通过将轨迹级奖励反馈至语言决策空间,MindDrive实现了在离散语言决策空间中的试错学习,而非直接在连续动作空间中进行探索,从而在复杂场景下的最优决策、类人驾驶行为与高效在线强化学习之间取得平衡。

链接: https://arxiv.org/abs/2512.13636
作者: Haoyu Fu,Diankun Zhang,Zongchuang Zhao,Jianfeng Cui,Hongwei Xie,Bing Wang,Guang Chen,Dingkang Liang,Xiang Bai
机构: Huazhong University of Science and Technology (华中科技大学); Xiaomi EV (小米汽车)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 16 pages, 12 figures, 6 tables; Project Page: this https URL

点击查看摘要

Abstract:Current Vision-Language-Action (VLA) paradigms in autonomous driving primarily rely on Imitation Learning (IL), which introduces inherent challenges such as distribution shift and causal confusion. Online Reinforcement Learning offers a promising pathway to address these issues through trial-and-error learning. However, applying online reinforcement learning to VLA models in autonomous driving is hindered by inefficient exploration in continuous action spaces. To overcome this limitation, we propose MindDrive, a VLA framework comprising a large language model (LLM) with two distinct sets of LoRA parameters. The one LLM serves as a Decision Expert for scenario reasoning and driving decision-making, while the other acts as an Action Expert that dynamically maps linguistic decisions into feasible trajectories. By feeding trajectory-level rewards back into the reasoning space, MindDrive enables trial-and-error learning over a finite set of discrete linguistic driving decisions, instead of operating directly in a continuous action space. This approach effectively balances optimal decision-making in complex scenarios, human-like driving behavior, and efficient exploration in online reinforcement learning. MindDrive achieves strong closed-loop performance on the challenging Bench2Drive benchmark, with a Driving Score (DS) of 78.04 and a Success Rate (SR) of 55.09%. To the best of our knowledge, this is the first work to demonstrate the effectiveness of online reinforcement learning for the VLA model in autonomous driving.
zh

[CV-16] SCR2-ST: Combine Single Cell with Spatial Transcriptomics for Efficient Active Sampling via Reinforcement Learning

【速读】:该论文旨在解决空间转录组学(Spatial Transcriptomics, ST)数据获取成本高且传统固定网格采样策略导致冗余测量、生物信息量不足的问题,从而限制了现有方法的性能。其核心解决方案是提出SCR2-ST框架,关键在于两个创新模块:一是基于单细胞先验知识的强化学习主动采样(Single-cell Guided Reinforcement Learning-based, SCRL),通过融合单细胞基础模型嵌入与空间密度信息构建生物学驱动的奖励信号,实现有限测序预算下对高信息量组织区域的选择性采集;二是混合回归-检索预测网络SCR2Net,利用主动采样数据结合回归建模与检索增强推理机制,其中多数细胞类型过滤机制抑制噪声匹配,检索到的表达谱作为软标签提供辅助监督,显著提升低预算条件下的表达预测准确性与采样效率。

链接: https://arxiv.org/abs/2512.13635
作者: Junchao Zhu,Ruining Deng,Junlin Guo,Tianyuan Yao,Chongyu Qu,Juming Xiong,Siqi Lu,Zhengyi Lu,Yanfan Zhu,Marilyn Lionts,Yuechen Yang,Yalin Zheng,Yu Wang,Shilin Zhao,Haichun Yang,Yuankai Huo
机构: Vanderbilt University (范德比尔特大学); Weill Cornell Medicine (威尔康奈尔医学院); The College of William and Mary (威廉玛丽学院); University of Liverpool (利物浦大学); Vanderbilt University Medical Center (范德比尔特大学医学中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Spatial transcriptomics (ST) is an emerging technology that enables researchers to investigate the molecular relationships underlying tissue morphology. However, acquiring ST data remains prohibitively expensive, and traditional fixed-grid sampling strategies lead to redundant measurements of morphologically similar or biologically uninformative regions, thus resulting in scarce data that constrain current methods. The well-established single-cell sequencing field, however, could provide rich biological data as an effective auxiliary source to mitigate this limitation. To bridge these gaps, we introduce SCR2-ST, a unified framework that leverages single-cell prior knowledge to guide efficient data acquisition and accurate expression prediction. SCR2-ST integrates a single-cell guided reinforcement learning-based (SCRL) active sampling and a hybrid regression-retrieval prediction network SCR2Net. SCRL combines single-cell foundation model embeddings with spatial density information to construct biologically grounded reward signals, enabling selective acquisition of informative tissue regions under constrained sequencing budgets. SCR2Net then leverages the actively sampled data through a hybrid architecture combining regression-based modeling with retrieval-augmented inference, where a majority cell-type filtering mechanism suppresses noisy matches and retrieved expression profiles serve as soft labels for auxiliary supervision. We evaluated SCR2-ST on three public ST datasets, demonstrating SOTA performance in both sampling efficiency and prediction accuracy, particularly under low-budget scenarios. Code is publicly available at: this https URL
zh

[CV-17] Do-Undo: Generating and Reversing Physical Actions in Vision-Language Models

【速读】:该论文旨在解决视觉-语言模型在理解与生成符合物理规律的场景变换方面的不足,尤其是缺乏对真实世界动作所引发因果关系的建模能力。现有研究多聚焦于物体层面的编辑操作,而忽视了动作的可逆性这一关键物理特性。解决方案的核心在于提出“Do-Undo”任务与基准测试,通过构建大规模可逆动作数据集(来自真实世界视频)并设计一致性约束训练策略,强制模型学习动作与结果之间的因果逻辑,从而提升其对物理世界的推理能力。这一方法为具身智能、机器人学及物理感知生成建模提供了新的评估框架和研究方向。

链接: https://arxiv.org/abs/2512.13609
作者: Shweta Mahajan,Shreya Kadambi,Hoang Le,Munawar Hayat,Fatih Porikli
机构: Qualcomm AI Research (Qualcomm AI 研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We introduce the Do-Undo task and benchmark to address a critical gap in vision-language models: understanding and generating physically plausible scene transformations driven by real-world actions. Unlike prior work focused on object-level edits, Do-Undo requires models to simulate the outcome of a physical action and then accurately reverse it, reflecting true cause-and-effect in the visual world. We curate a large-scale dataset of reversible actions from real-world videos and design a training strategy enforcing consistency for robust action grounding. Our experiments reveal that current models struggle with physical reversibility, underscoring the importance of this task for embodied AI, robotics, and physics-aware generative modeling. Do-Undo establishes an intuitive testbed for evaluating and advancing physical reasoning in multimodal systems.
zh

[CV-18] DBT-DINO: Towards Foundation model based analysis of Digital Breast Tomosynthesis

【速读】:该论文旨在解决数字乳腺断层成像(Digital Breast Tomosynthesis, DBT)领域缺乏基础模型(foundation model)的问题,以推动其在乳腺癌筛查中的智能化分析。解决方案的关键在于基于DINOv2的自监督学习方法,在超过2500万张DBT二维切片上进行大规模预训练,构建首个针对DBT的专用基础模型——DBT-DINO。该模型在乳腺密度分类和5年乳腺癌风险预测任务中表现优异,验证了领域特定预训练的有效性;但在通用病灶检测任务中,其性能不如ImageNet预训练的DINOv2基线,提示局部化检测任务仍需进一步的方法优化。

链接: https://arxiv.org/abs/2512.13608
作者: Felix J. Dorfner,Manon A. Dorster,Ryan Connolly,Oscar Gentilhomme,Edward Gibbs,Steven Graham,Seth Wander,Thomas Schultz,Manisha Bahl,Dania Daye,Albert E. Kim,Christopher P. Bridge
机构: Athinoula A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital and Harvard Medical School; Department of Radiology, Charité - Universitätsmedizin Berlin corporate member of Freie Universität Berlin and Humboldt Universität zu Berlin; Mass General Brigham Data Science Office; Department of Computer Science, Institute for Machine Learning, ETH Zürich; Massachusetts General Hospital Cancer Center and Harvard Medical School; Department of Radiology, Massachusetts General Hospital; Department of Radiology, University of Wisconsin School of Medicine and Public Health
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Foundation models have shown promise in medical imaging but remain underexplored for three-dimensional imaging modalities. No foundation model currently exists for Digital Breast Tomosynthesis (DBT), despite its use for breast cancer screening. To develop and evaluate a foundation model for DBT (DBT-DINO) across multiple clinical tasks and assess the impact of domain-specific pre-training. Self-supervised pre-training was performed using the DINOv2 methodology on over 25 million 2D slices from 487,975 DBT volumes from 27,990 patients. Three downstream tasks were evaluated: (1) breast density classification using 5,000 screening exams; (2) 5-year risk of developing breast cancer using 106,417 screening exams; and (3) lesion detection using 393 annotated volumes. For breast density classification, DBT-DINO achieved an accuracy of 0.79 (95% CI: 0.76–0.81), outperforming both the MetaAI DINOv2 baseline (0.73, 95% CI: 0.70–0.76, p.001) and DenseNet-121 (0.74, 95% CI: 0.71–0.76, p.001). For 5-year breast cancer risk prediction, DBT-DINO achieved an AUROC of 0.78 (95% CI: 0.76–0.80) compared to DINOv2’s 0.76 (95% CI: 0.74–0.78, p=.57). For lesion detection, DINOv2 achieved a higher average sensitivity of 0.67 (95% CI: 0.60–0.74) compared to DBT-DINO with 0.62 (95% CI: 0.53–0.71, p=.60). DBT-DINO demonstrated better performance on cancerous lesions specifically with a detection rate of 78.8% compared to Dinov2’s 77.3%. Using a dataset of unprecedented size, we developed DBT-DINO, the first foundation model for DBT. DBT-DINO demonstrated strong performance on breast density classification and cancer risk prediction. However, domain-specific pre-training showed variable benefits on the detection task, with ImageNet baseline outperforming DBT-DINO on general lesion detection, indicating that localized detection tasks require further methodological development. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2512.13608 [cs.CV] (or arXiv:2512.13608v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2512.13608 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Felix J. Dorfner [view email] [v1] Mon, 15 Dec 2025 18:03:09 UTC (3,017 KB) Full-text links: Access Paper: View a PDF of the paper titled DBT-DINO: Towards Foundation model based analysis of Digital Breast Tomosynthesis, by Felix J. Dorfner and 11 other authorsView PDFHTML (experimental)TeX Source view license Current browse context: cs.CV prev | next new | recent | 2025-12 Change to browse by: cs References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status
zh

[CV-19] LongVie 2: Multimodal Controllable Ultra-Long Video World Model

【速读】:该论文旨在解决基于预训练视频生成系统构建视频世界模型(video world model)所面临的挑战,即如何实现可控性(controllability)、长期视觉质量(long-term visual quality)和时序一致性(temporal consistency)这三项核心属性的统一优化。解决方案的关键在于提出一个分阶段训练的端到端自回归框架 LongVie 2,其创新性体现在三个关键设计:(1) 多模态引导机制(Multi-modal guidance),通过融合密集与稀疏控制信号提供隐式世界级监督以提升可控性;(2) 输入帧退化感知训练(degradation-aware training),缩小训练与长期推理之间的差距以维持高质量视觉输出;(3) 历史上下文引导机制(history-context guidance),对相邻片段的上下文信息进行对齐以保障时序一致性。该方法在 LongVGenBench 基准上验证了其在长程可控性、时序连贯性和视觉保真度上的领先性能,支持长达五分钟的连续视频生成,标志着向统一视频世界建模迈出重要一步。

链接: https://arxiv.org/abs/2512.13604
作者: Jianxiong Gao,Zhaoxi Chen,Xian Liu,Junhao Zhuang,Chengming Xu,Jianfeng Feng,Yu Qiao,Yanwei Fu,Chenyang Si,Ziwei Liu
机构: FDU; NJU; NTU; NVIDIA; THU; Shanghai AI Laboratory
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Building video world models upon pretrained video generation systems represents an important yet challenging step toward general spatiotemporal intelligence. A world model should possess three essential properties: controllability, long-term visual quality, and temporal consistency. To this end, we take a progressive approach-first enhancing controllability and then extending toward long-term, high-quality generation. We present LongVie 2, an end-to-end autoregressive framework trained in three stages: (1) Multi-modal guidance, which integrates dense and sparse control signals to provide implicit world-level supervision and improve controllability; (2) Degradation-aware training on the input frame, bridging the gap between training and long-term inference to maintain high visual quality; and (3) History-context guidance, which aligns contextual information across adjacent clips to ensure temporal consistency. We further introduce LongVGenBench, a comprehensive benchmark comprising 100 high-resolution one-minute videos covering diverse real-world and synthetic environments. Extensive experiments demonstrate that LongVie 2 achieves state-of-the-art performance in long-range controllability, temporal coherence, and visual fidelity, and supports continuous video generation lasting up to five minutes, marking a significant step toward unified video world modeling.
zh

[CV-20] DA-SSL: self-supervised domain adaptor to leverag e foundational models in turbt histopathology slides

【速读】:该论文旨在解决病理基础模型(Pathology Foundational Models, PFMs)在特定癌症类型(如膀胱癌的经尿道膀胱肿瘤电切术样本,TURBT)上因域偏移(domain shift)导致性能下降的问题,尤其针对TURBT样本中常见的组织碎片化和电灼伤伪影等特征未被充分建模的情况。其解决方案的关键在于提出一种轻量级、无需微调预训练模型的领域自适应自监督适配器(Domain-Adaptive Self-Supervised Adaptor, DA-SSL),通过重新对齐PFM提取的特征以匹配TURBT域分布,从而有效提升基于多实例学习(Multiple Instance Learning, MIL)的病理分析系统在临床挑战性任务中的表现。

链接: https://arxiv.org/abs/2512.13600
作者: Haoyue Zhang,Meera Chappidi,Erolcan Sayar,Helen Richards,Zhijun Chen,Lucas Liu,Roxanne Wadia,Peter A Humphrey,Fady Ghali,Alberto Contreras-Sanz,Peter Black,Jonathan Wright,Stephanie Harmon,Michael Haffner
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent deep learning frameworks in histopathology, particularly multiple instance learning (MIL) combined with pathology foundational models (PFMs), have shown strong performance. However, PFMs exhibit limitations on certain cancer or specimen types due to domain shifts - these cancer types were rarely used for pretraining or specimens contain tissue-based artifacts rarely seen within the pretraining population. Such is the case for transurethral resection of bladder tumor (TURBT), which are essential for diagnosing muscle-invasive bladder cancer (MIBC), but contain fragmented tissue chips and electrocautery artifacts and were not widely used in publicly available PFMs. To address this, we propose a simple yet effective domain-adaptive self-supervised adaptor (DA-SSL) that realigns pretrained PFM features to the TURBT domain without fine-tuning the foundational model itself. We pilot this framework for predicting treatment response in TURBT, where histomorphological features are currently underutilized and identifying patients who will benefit from neoadjuvant chemotherapy (NAC) is challenging. In our multi-center study, DA-SSL achieved an AUC of 0.77+/-0.04 in five-fold cross-validation and an external test accuracy of 0.84, sensitivity of 0.71, and specificity of 0.91 using majority voting. Our results demonstrate that lightweight domain adaptation with self-supervision can effectively enhance PFM-based MIL pipelines for clinically challenging histopathology tasks. Code is Available at this https URL.
zh

[CV-21] Lighting in Motion: Spatiotemporal HDR Lighting Estimation

【速读】:该论文旨在解决**时空光照估计(spatiotemporal lighting estimation)**中的两个核心挑战:一是准确预测高频率的光照细节以实现真实感渲染,二是精确估计环境光的照度(illuminance)。其解决方案的关键在于提出了一种基于扩散模型(diffusion-based approach)的方法——Lighting in Motion (LiMo),通过在输入场景中根据3D位置生成不同曝光下的镜面与漫反射球体,并利用大规模定制化室内/室外场景数据集对现有扩散模型进行微调。为提升空间条件控制精度,作者发现仅依赖深度信息不足,进而引入一种新的几何条件来提供目标3D位置相对于场景的相对位置信息;最终通过可微分渲染(differentiable rendering)将多曝光下的漫反射与镜面预测融合为单张HDRI图,从而实现了高精度的空间控制与光照预测性能,达到当前最优水平。

链接: https://arxiv.org/abs/2512.13597
作者: Christophe Bolduc,Julien Philip,Li Ma,Mingming He,Paul Debevec,Jean-François Lalonde
机构: Université Laval (拉瓦尔大学); Eyeline Labs
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present Lighting in Motion (LiMo), a diffusion-based approach to spatiotemporal lighting estimation. LiMo targets both realistic high-frequency detail prediction and accurate illuminance estimation. To account for both, we propose generating a set of mirrored and diffuse spheres at different exposures, based on their 3D positions in the input. Making use of diffusion priors, we fine-tune powerful existing diffusion models on a large-scale customized dataset of indoor and outdoor scenes, paired with spatiotemporal light probes. For accurate spatial conditioning, we demonstrate that depth alone is insufficient and we introduce a new geometric condition to provide the relative position of the scene to the target 3D position. Finally, we combine diffuse and mirror predictions at different exposures into a single HDRI map leveraging differentiable rendering. We thoroughly evaluate our method and design choices to establish LiMo as state-of-the-art for both spatial control and prediction accuracy.
zh

[CV-22] Image Diffusion Preview with Consistency Solver

【速读】:该论文旨在解决图像扩散模型(image diffusion models)推理速度慢导致的交互体验差的问题。其核心解决方案是提出一种名为ConsistencySolver的新方法,该方法基于通用线性多步法(general linear multistep methods)设计,是一种轻量级、可训练的高阶求解器,通过强化学习(Reinforcement Learning)优化,显著提升低步数采样下的预览质量与最终输出的一致性。相较现有加速方法(如无训练求解器和后训练蒸馏),ConsistencySolver在仅使用47%步数的情况下达到与多步DPM-Solver相当的FID分数,并优于蒸馏基线,在用户研究中实现近50%的交互时间减少,同时保持生成质量。

链接: https://arxiv.org/abs/2512.13592
作者: Fu-Yun Wang,Hao Zhou,Liangzhe Yuan,Sanghyun Woo,Boqing Gong,Bohyung Han,Ming-Hsuan Yang,Han Zhang,Yukun Zhu,Ting Liu,Long Zhao
机构: Google DeepMind(谷歌深度思维); The Chinese University of Hong Kong(香港中文大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The slow inference process of image diffusion models significantly degrades interactive user experiences. To address this, we introduce Diffusion Preview, a novel paradigm employing rapid, low-step sampling to generate preliminary outputs for user evaluation, deferring full-step refinement until the preview is deemed satisfactory. Existing acceleration methods, including training-free solvers and post-training distillation, struggle to deliver high-quality previews or ensure consistency between previews and final outputs. We propose ConsistencySolver derived from general linear multistep methods, a lightweight, trainable high-order solver optimized via Reinforcement Learning, that enhances preview quality and consistency. Experimental results demonstrate that ConsistencySolver significantly improves generation quality and consistency in low-step scenarios, making it ideal for efficient preview-and-refine workflows. Notably, it achieves FID scores on-par with Multistep DPM-Solver using 47% fewer steps, while outperforming distillation baselines. Furthermore, user studies indicate our approach reduces overall user interaction time by nearly 50% while maintaining generation quality. Code is available at this https URL.
zh

[CV-23] MMhops-R1: Multimodal Multi-hop Reasoning AAAI2026

【速读】:该论文旨在解决当前多模态大语言模型(Multi-modal Large Language Models, MLLMs)在复杂现实问题中普遍缺乏多跳推理能力的问题,即现有模型主要局限于单步推理,且缺乏能够系统评估和推动多模态多跳推理能力的基准。为应对这一挑战,作者提出了MMhops基准数据集,包含Bridging和Comparison两种高难度任务形式,要求模型动态整合外部知识构建复杂的推理链。解决方案的关键在于提出MMhops-R1框架,这是一个基于多模态检索增强生成(multi-modal Retrieval-Augmented Generation, mRAG)的动态推理方法,通过强化学习优化模型自主规划推理路径、制定针对性查询并融合多层次信息,从而实现高效的多模态多跳推理。实验表明,该方案显著优于基线模型,并展现出对固定跳数任务的良好泛化能力,验证了动态规划与多模态知识融合对于复杂推理的核心作用。

链接: https://arxiv.org/abs/2512.13573
作者: Tao Zhang,Ziqi Zhang,Zongyang Ma,Yuxin Chen,Bing Li,Chunfeng Yuan,Guangting Wang,Fengyun Rao,Ying Shan,Weiming Hu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Acceped by AAAI 2026

点击查看摘要

Abstract:The ability to perform multi-modal multi-hop reasoning by iteratively integrating information across various modalities and external knowledge is critical for addressing complex real-world challenges. However, existing Multi-modal Large Language Models (MLLMs) are predominantly limited to single-step reasoning, as existing benchmarks lack the complexity needed to evaluate and drive multi-hop abilities. To bridge this gap, we introduce MMhops, a novel, large-scale benchmark designed to systematically evaluate and foster multi-modal multi-hop reasoning. MMhops dataset comprises two challenging task formats, Bridging and Comparison, which necessitate that models dynamically construct complex reasoning chains by integrating external knowledge. To tackle the challenges posed by MMhops, we propose MMhops-R1, a novel multi-modal Retrieval-Augmented Generation (mRAG) framework for dynamic reasoning. Our framework utilizes reinforcement learning to optimize the model for autonomously planning reasoning paths, formulating targeted queries, and synthesizing multi-level information. Comprehensive experiments demonstrate that MMhops-R1 significantly outperforms strong baselines on MMhops, highlighting that dynamic planning and multi-modal knowledge integration are crucial for complex reasoning. Moreover, MMhops-R1 demonstrates strong generalization to tasks requiring fixed-hop reasoning, underscoring the robustness of our dynamic planning approach. In conclusion, our work contributes a challenging new benchmark and a powerful baseline model, and we will release the associated code, data, and weights to catalyze future research in this critical area.
zh

[CV-24] 3D Human-Human Interaction Anomaly Detection

【速读】:该论文旨在解决传统以人为中心的异常检测(Human-centric Anomaly Detection, HAD)方法在处理多人协作场景中交互异常行为时准确率低的问题。现有方法通常仅关注单人行为,难以捕捉人类交互中的复杂且非对称的动力学特性。为此,作者提出了一个新的任务——人与人交互异常检测(Human-Human Interaction Anomaly Detection, H2IAD),并设计了交互异常检测网络(Interaction Anomaly Detection Network, IADNet)。其关键创新在于引入Temporal Attention Sharing Module (TASM),通过共享双人编码后的运动嵌入(motion embeddings)实现协同运动相关性的有效同步;同时提出Distance-Based Relational Encoding Module (DREM),以空间距离关系建模社交线索,从而更精准刻画交互行为特征。最终利用归一化流(normalizing flow)进行异常评分,实验表明该方法显著优于现有单人异常检测基线。

链接: https://arxiv.org/abs/2512.13560
作者: Shun Maeda,Chunzhi Gu,Koichiro Kamide,Katsuya Hotta,Shangce Gao,Chao Zhang
机构: University of Fukui (福井大学); Iwate University (岩手大学); University of Toyama (富山大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Human-centric anomaly detection (AD) has been primarily studied to specify anomalous behaviors in a single person. However, as humans by nature tend to act in a collaborative manner, behavioral anomalies can also arise from human-human interactions. Detecting such anomalies using existing single-person AD models is prone to low accuracy, as these approaches are typically not designed to capture the complex and asymmetric dynamics of interactions. In this paper, we introduce a novel task, Human-Human Interaction Anomaly Detection (H2IAD), which aims to identify anomalous interactive behaviors within collaborative 3D human actions. To address H2IAD, we then propose Interaction Anomaly Detection Network (IADNet), which is formalized with a Temporal Attention Sharing Module (TASM). Specifically, in designing TASM, we share the encoded motion embeddings across both people such that collaborative motion correlations can be effectively synchronized. Moreover, we notice that in addition to temporal dynamics, human interactions are also characterized by spatial configurations between two people. We thus introduce a Distance-Based Relational Encoding Module (DREM) to better reflect social cues in H2IAD. The normalizing flow is eventually employed for anomaly scoring. Extensive experiments on human-human motion benchmarks demonstrate that IADNet outperforms existing Human-centric AD baselines in H2IAD.
zh

[CV-25] Pancakes: Consistent Multi-Protocol Image Segmentation Across Biomedical Domains NEURIPS2025

【速读】:该论文旨在解决现有自动分割模型在处理跨域生物医学图像时的局限性问题,即模型通常仅支持单一分割协议(如仅按组织类型或解剖区域分割),或需要人工手动提示才能指定目标分割方式,难以实现多协议、自动化且语义一致的分割。解决方案的关键在于提出了一种名为Pancakes的新框架,该框架能够针对来自未见领域的图像自动生成多种合理且语义一致的多标签分割图,其核心创新在于引入了一个新的问题范式——无需依赖特定训练协议即可同时输出多个可解释的分割结果,并在不同图像间保持语义一致性,从而显著优于现有基础模型在多个独立数据集上的表现。

链接: https://arxiv.org/abs/2512.13534
作者: Marianne Rakic,Siyu Gai,Etienne Chollet,John V. Guttag,Adrian V. Dalca
机构: MIT CSAIL (麻省理工学院计算机科学与人工智能实验室); MGH (麻省总医院); HMS (哈佛医学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at NeurIPS 2025. Code available at: this https URL

点击查看摘要

Abstract:A single biomedical image can be meaningfully segmented in multiple ways, depending on the desired application. For instance, a brain MRI can be segmented according to tissue types, vascular territories, broad anatomical regions, fine-grained anatomy, or pathology, etc. Existing automatic segmentation models typically either (1) support only a single protocol, the one they were trained on, or (2) require labor-intensive manual prompting to specify the desired segmentation. We introduce Pancakes, a framework that, given a new image from a previously unseen domain, automatically generates multi-label segmentation maps for multiple plausible protocols, while maintaining semantic consistency across related images. Pancakes introduces a new problem formulation that is not currently attainable by existing foundation models. In a series of experiments on seven held-out datasets, we demonstrate that our model can significantly outperform existing foundation models in producing several plausible whole-image segmentations, that are semantically coherent across images.
zh

[CV-26] ARA: Simple and Efficient Time Aware Retrieval Adaptation of MLLM s for Video Understanding

【速读】:该论文旨在解决视频-文本检索任务中缺乏时间感知能力的问题,即现有模型难以准确捕捉动作发生的时间顺序与时序关系。为此,作者提出了一种名为TARA(Time Aware Retrieval Adaptation)的简单高效方案,其关键在于无需使用任何视频数据即可将多模态大语言模型(Multimodal LLMs, MLLMs)适配为具备时间感知能力的视频-文本嵌入模型,通过引入基于时间相反动作(chiral actions)的硬负样本和精心设计的评估基准,验证了TARA在时间敏感检索任务上的优越性能,并进一步发现其在否定句理解(NegBench)和动词/副词语义解析方面也具有显著提升,展现出强大的零样本泛化能力和多维度适应性。

链接: https://arxiv.org/abs/2512.13511
作者: Piyush Bagad,Andrew Zisserman
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注: 18 Pages. Project page at this http URL

点击查看摘要

Abstract:Our objective is to build a general time-aware video-text embedding model for retrieval. To that end, we propose a simple and efficient recipe, dubbed TARA (Time Aware Retrieval Adaptation), to adapt Multimodal LLMs (MLLMs) to a time-aware video-text embedding model without using any video data at all. For evaluating time-awareness in retrieval, we propose a new benchmark with temporally opposite (chiral) actions as hard negatives and curated splits for chiral and non-chiral actions. We show that TARA outperforms all existing video-text models on this chiral benchmark while also achieving strong results on standard benchmarks. Furthermore, we discover additional benefits of TARA beyond time-awareness: (i) TARA embeddings are negation-aware as shown in NegBench benchmark that evaluates negation in video retrieval, (ii) TARA achieves state of the art performance on verb and adverb understanding in videos. Overall, TARA yields a strong, versatile, time-aware video-text embedding model with state of the art zero-shot performance.
zh

[CV-27] Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model

【速读】:该论文旨在解决统一音频-视频生成(audio-visual generation)中的同步性与生成质量难题,特别是在多语言唇形同步(lip-syncing)、动态镜头控制和叙事连贯性方面存在挑战的问题。解决方案的关键在于提出Seedance 1.5 pro模型,其采用双分支Diffusion Transformer架构,并集成跨模态联合模块与多阶段数据处理流程,从而实现高质量的音视频协同生成;同时通过监督微调(SFT)和基于人类反馈的强化学习(RLHF)进行后训练优化,并引入加速框架使推理速度提升超10倍,显著增强了实际应用可行性与专业内容创作能力。

链接: https://arxiv.org/abs/2512.13507
作者: Siyan Chen,Yanfei Chen,Ying Chen,Zhuo Chen,Feng Cheng,Xuyan Chi,Jian Cong,Qinpeng Cui,Qide Dong,Junliang Fan,Jing Fang,Zetao Fang,Chengjian Feng,Han Feng,Mingyuan Gao,Yu Gao,Qiushan Guo,Boyang Hao,Qingkai Hao,Bibo He,Qian He,Tuyen Hoang,Ruoqing Hu,Xi Hu,Weilin Huang,Zhaoyang Huang,Zhongyi Huang,Siqi Jiang,Wei Jiang,Yunpu Jiang,Zhuo Jiang,Ashley Kim,Jianan Kong,Zhichao Lai,Shanshan Lao,Ai Li,Feiya Li,Gen Li,Huixia Li,JiaShi Li,Liang Li,Ming Li,Tao Li,Xian Li,Xiaojie Li,Xiaoyang Li,Xingxing Li,Yameng Li,Yifu Li,Yiying Li,Chao Liang,Ying Liang,Zhiqiang Liang,Wang Liao,Yalin Liao,Heng Lin,Kengyu Lin,Shanchuan Lin,Xi Lin,Zhijie Lin,Feng Ling,Fangfang Liu,Gaohong Liu,Jiawei Liu,Jie Liu,Shouda Liu,Shu Liu,Sichao Liu,Songwei Liu,Xin Liu,Xue Liu,Yibo Liu,Zikun Liu,Zuxi Liu,Junlin Lyu,Lecheng Lyu,Qian Lyu,Han Mu,Xiaonan Nie,Jingzhe Ning,Xitong Pan,Yanghua Peng,Lianke Qin,Xueqiong Qu,Yuxi Ren,Yuchen Shen,Guang Shi,Lei Shi,Yan Song,Yinglong Song,Fan Sun,Li Sun,Renfei Sun,Zeyu Sun,Wenjing Tang,Zirui Tao,Feng Wang,Furui Wang,Jinran Wang,Junkai Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Seedance 1.5 pro Technical Report

点击查看摘要

Abstract:Recent strides in video generation have paved the way for unified audio-visual generation. In this work, we present Seedance 1.5 pro, a foundational model engineered specifically for native, joint audio-video generation. Leveraging a dual-branch Diffusion Transformer architecture, the model integrates a cross-modal joint module with a specialized multi-stage data pipeline, achieving exceptional audio-visual synchronization and superior generation quality. To ensure practical utility, we implement meticulous post-training optimizations, including Supervised Fine-Tuning (SFT) on high-quality datasets and Reinforcement Learning from Human Feedback (RLHF) with multi-dimensional reward models. Furthermore, we introduce an acceleration framework that boosts inference speed by over 10X. Seedance 1.5 pro distinguishes itself through precise multilingual and dialect lip-syncing, dynamic cinematic camera control, and enhanced narrative coherence, positioning it as a robust engine for professional-grade content creation. Seedance 1.5 pro is now accessible on Volcano Engine at this https URL.
zh

[CV-28] On-Device Continual Learning for Unsupervised Visual Anomaly Detection in Dynamic Manufacturing

【速读】:该论文旨在解决动态柔性生产环境中视觉异常检测(Visual Anomaly Detection, VAD)面临的三大挑战:一是小批量、按需制造下产品频繁变更导致模型需快速更新;二是边缘设备算力有限,难以训练和部署大型AI模型;三是新产线变体常面临正常与异常样本均稀缺的问题。解决方案的关键在于提出一种面向设备端的持续学习方法,扩展PatchCore框架以支持在线学习,结合轻量级特征提取器与基于k-center选择的增量式核数据集(coreset)更新机制,从而在有限数据下实现高效、低内存占用的自适应模型更新,避免云端重新训练,显著提升检测准确率(AUROC提升12%)、降低内存消耗(减少80%)并加快训练速度。

链接: https://arxiv.org/abs/2512.13497
作者: Haoyu Ren,Kay Koehle,Kirill Dorofeev,Darko Anicic
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by European Conference on EDGE AI Technologies and Applications (EEAI) 2025

点击查看摘要

Abstract:In modern manufacturing, Visual Anomaly Detection (VAD) is essential for automated inspection and consistent product quality. Yet, increasingly dynamic and flexible production environments introduce key challenges: First, frequent product changes in small-batch and on-demand manufacturing require rapid model updates. Second, legacy edge hardware lacks the resources to train and run large AI models. Finally, both anomalous and normal training data are often scarce, particularly for newly introduced product variations. We investigate on-device continual learning for unsupervised VAD with localization, extending the PatchCore to incorporate online learning for real-world industrial scenarios. The proposed method leverages a lightweight feature extractor and an incremental coreset update mechanism based on k-center selection, enabling rapid, memory-efficient adaptation from limited data while eliminating costly cloud retraining. Evaluations on an industrial use case are conducted using a testbed designed to emulate flexible production with frequent variant changes in a controlled environment. Our method achieves a 12% AUROC improvement over the baseline, an 80% reduction in memory usage, and faster training compared to batch retraining. These results confirm that our method delivers accurate, resource-efficient, and adaptive VAD suitable for dynamic and smart manufacturing.
zh

[CV-29] Soul: Breathe Life into Digital Human for High-fidelity Long-term Multimodal Animation

【速读】:该论文旨在解决高保真长期数字人动画生成中面临的多模态驱动一致性差、数据稀缺及推理效率低等关键问题。其解决方案的核心在于提出一个名为Soul的多模态驱动框架,通过构建包含100万条精细标注样本的数据集Soul-1M(覆盖单人面部、上半身、全身及多人场景)缓解数据不足问题,并设计基于Wan2.2-5B骨干网络的模型架构,融合音频注入层与阈值感知码本替换策略以保障长时间生成的一致性;同时采用步长/CFG蒸馏和轻量级VAE优化推理效率,在实现11.4倍加速的同时保持视频质量无明显下降,从而在唇同步精度、表情生动性、身份保留等方面显著优于现有开源与商业模型。

链接: https://arxiv.org/abs/2512.13495
作者: Jiangning Zhang,Junwei Zhu,Zhenye Gan,Donghao Luo,Chuming Lin,Feifan Xu,Xu Peng,Jianlong Hu,Yuansen Liu,Yijia Hong,Weijian Cao,Han Feng,Xu Chen,Chencan Fu,Keke He,Xiaobin Hu,Chengjie Wang
机构: YoutuVideo-Soul Team
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:We propose a multimodal-driven framework for high-fidelity long-term digital human animation termed \textbfSoul , which generates semantically coherent videos from a single-frame portrait image, text prompts, and audio, achieving precise lip synchronization, vivid facial expressions, and robust identity preservation. We construct Soul-1M, containing 1 million finely annotated samples with a precise automated annotation pipeline (covering portrait, upper-body, full-body, and multi-person scenes) to mitigate data scarcity, and we carefully curate Soul-Bench for comprehensive and fair evaluation of audio-/text-guided animation methods. The model is built on the Wan2.2-5B backbone, integrating audio-injection layers and multiple training strategies together with threshold-aware codebook replacement to ensure long-term generation consistency. Meanwhile, step/CFG distillation and a lightweight VAE are used to optimize inference efficiency, achieving an 11.4 \times speedup with negligible quality loss. Extensive experiments show that Soul significantly outperforms current leading open-source and commercial models on video quality, video-text alignment, identity preservation, and lip-synchronization accuracy, demonstrating broad applicability in real-world scenarios such as virtual anchors and film production. Project page at this https URL
zh

[CV-30] ransform Trained Transformer: Accelerating Naive 4K Video Generation Over 10times

【速读】:该论文旨在解决高分辨率视频生成中的计算效率与质量难以平衡的问题,尤其是在4K(2160×3840)视频生成中,由于时空分辨率提升导致全注意力机制(full-attention)的计算复杂度呈二次方增长,使得模型难以高效运行。解决方案的关键在于提出一种名为T3(Transform-Trained Transformer)的Transformer重构策略,该方法不改变预训练模型的核心架构,而是通过优化前向逻辑显著降低计算需求;具体而言,T3-Video引入了多尺度权重共享窗口注意力机制,并结合分层块化与轴保持的全注意力设计,实现对预训练模型的“注意力模式”转换,仅需少量计算和数据即可达成性能跃升,实验表明其在4K-VBench评测中相比现有方法在VQA指标上提升+4.29、VTC指标提升+0.08,同时将原生4K视频生成速度加速超过10倍。

链接: https://arxiv.org/abs/2512.13492
作者: Jiangning Zhang,Junwei Zhu,Teng Hu,Yabiao Wang,Donghao Luo,Weijian Cao,Zhenye Gan,Xiaobin Hu,Zhucun Xue,Chengjie Wang
机构: Youtu Lab, Tencent(腾讯); Zhejiang University(浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Native 4K (2160 \times 3840) video generation remains a critical challenge due to the quadratic computational explosion of full-attention as spatiotemporal resolution increases, making it difficult for models to strike a balance between efficiency and quality. This paper proposes a novel Transformer retrofit strategy termed \textbfT3 ( \textbfT ransform \textbfT rained \textbfT ransformer) that, without altering the core architecture of full-attention pretrained models, significantly reduces compute requirements by optimizing their forward logic. Specifically, \textbfT3-Video introduces a multi-scale weight-sharing window attention mechanism and, via hierarchical blocking together with an axis-preserving full-attention design, can effect an “attention pattern” transformation of a pretrained model using only modest compute and data. Results on 4K-VBench show that \textbfT3-Video substantially outperforms existing approaches: while delivering performance improvements (+4.29 \uparrow VQA and +0.08 \uparrow VTC), it accelerates native 4K video generation by more than 10 \times . Project page at this https URL
zh

[CV-31] PoseAnything: Universal Pose-guided Video Generation with Part-aware Temporal Coherence

【速读】:该论文旨在解决当前姿态引导视频生成(pose-guided video generation)方法仅支持人类姿态输入、难以泛化至非人类角色的问题。其核心解决方案是提出PoseAnything框架,首次实现对人类与非人类角色的通用姿态引导视频生成,支持任意骨骼结构输入;关键创新包括:1)引入部位感知的时间一致性模块(Part-aware Temporal Coherence Module),通过划分主体为不同部位并建立跨帧对应关系,利用交叉注意力机制实现细粒度的部件级运动一致性;2)提出主体与相机运动解耦的CFG策略(Subject and Camera Motion Decoupled CFG),首次在姿态引导视频生成中实现独立控制相机运动,通过将主体和相机运动信息分别注入CFG的正负锚点来完成。

链接: https://arxiv.org/abs/2512.13465
作者: Ruiyan Wang,Teng Hu,Kaihui Huang,Zihan Su,Ran Yi,Lizhuang Ma
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Pose-guided video generation refers to controlling the motion of subjects in generated video through a sequence of poses. It enables precise control over subject motion and has important applications in animation. However, current pose-guided video generation methods are limited to accepting only human poses as input, thus generalizing poorly to pose of other subjects. To address this issue, we propose PoseAnything, the first universal pose-guided video generation framework capable of handling both human and non-human characters, supporting arbitrary skeletal inputs. To enhance consistency preservation during motion, we introduce Part-aware Temporal Coherence Module, which divides the subject into different parts, establishes part correspondences, and computes cross-attention between corresponding parts across frames to achieve fine-grained part-level consistency. Additionally, we propose Subject and Camera Motion Decoupled CFG, a novel guidance strategy that, for the first time, enables independent camera movement control in pose-guided video generation, by separately injecting subject and camera motion control information into the positive and negative anchors of CFG. Furthermore, we present XPose, a high-quality public dataset containing 50,000 non-human pose-video pairs, along with an automated pipeline for annotation and filtering. Extensive experiments demonstrate that Pose-Anything significantly outperforms state-of-the-art methods in both effectiveness and generalization.
zh

[CV-32] st-Time Modification: Inverse Domain Transformation for Robust Perception

【速读】:该论文旨在解决真实场景下域泛化(domain generalization)任务中因目标域分布未知而导致的性能下降问题,尤其是在环境变化剧烈时,传统数据增强方法难以有效合成覆盖全部目标域变体,且生成过程耗时昂贵。其解决方案的关键在于:在测试阶段利用扩散模型(diffusion models)将目标域图像映射回源域分布,从而使得原本在源域训练的下游模型可以直接应用于目标域,无需生成大规模合成数据或修改原模型结构。该方法仅需源域描述信息,具备高效性与通用性,并在多个视觉任务(如分割、检测和分类)中显著提升了鲁棒性。

链接: https://arxiv.org/abs/2512.13454
作者: Arpit Jadon,Joshua Niemeijer,Yuki M. Asano
机构: German Aerospace Center (德国航空航天中心); University of Technology Nuremberg (纽伦堡应用技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint

点击查看摘要

Abstract:Generative foundation models contain broad visual knowledge and can produce diverse image variations, making them particularly promising for advancing domain generalization tasks. While they can be used for training data augmentation, synthesizing comprehensive target-domain variations remains slow, expensive, and incomplete. We propose an alternative: using diffusion models at test time to map target images back to the source distribution where the downstream model was trained. This approach requires only a source domain description, preserves the task model, and eliminates large-scale synthetic data generation. We demonstrate consistent improvements across segmentation, detection, and classification tasks under challenging environmental shifts in real-to-real domain generalization scenarios with unknown target distributions. Our analysis spans multiple generative and downstream models, including an ensemble variant for enhanced robustness. The method achieves substantial relative gains: 137% on BDD100K-Night, 68% on ImageNet-R, and 62% on DarkZurich.
zh

[CV-33] IMILIA: interpretable multiple instance learning for inflammation prediction in IBD from HE whole slide images

【速读】:该论文旨在解决炎症性肠病(Inflammatory Bowel Disease, IBD)治疗目标向组织学缓解转变背景下,对显微镜下炎症精准评估的迫切需求。传统方法依赖病理医生主观判断,存在一致性差、效率低的问题,难以支持个体化治疗决策。解决方案的关键在于提出一个端到端的可解释多实例学习框架 IMILIA,其核心由两部分组成:一是基于多实例学习(Multiple Instance Learning, MIL)的炎症预测模块,用于从全视野数字切片中识别炎症存在;二是可解释性模块,包括 HistoPLUS(用于细胞实例检测、分割与分类)和 EpiSeg(用于上皮层分割),能够自动计算驱动预测的组织区域特征。该方法在多个队列中表现出高鲁棒性(交叉验证 ROC-AUC 达 0.83,外部验证达 0.99 和 0.84),且其可解释性结果与生物学机制一致,即高分区域富集免疫细胞(淋巴细胞、浆细胞、中性粒细胞和嗜酸性粒细胞),低分区域以正常上皮细胞为主,为 IBD 病理评估提供了客观、定量且可解释的新范式。

链接: https://arxiv.org/abs/2512.13440
作者: Thalyssa Baiocco-Rodrigues,Antoine Olivier,Reda Belbahri,Thomas Duboudin,Pierre-Antoine Bannier,Benjamin Adjadj,Katharina Von Loga,Nathan Noiry,Maxime Touzot,Hector Roux de Bezieux
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:As the therapeutic target for Inflammatory Bowel Disease (IBD) shifts toward histologic remission, the accurate assessment of microscopic inflammation has become increasingly central for evaluating disease activity and response to treatment. In this work, we introduce IMILIA (Interpretable Multiple Instance Learning for Inflammation Analysis), an end-to-end framework designed for the prediction of inflammation presence in IBD digitized slides stained with hematoxylin and eosin (HE), followed by the automated computation of markers characterizing tissue regions driving the predictions. IMILIA is composed of an inflammation prediction module, consisting of a Multiple Instance Learning (MIL) model, and an interpretability module, divided in two blocks: HistoPLUS, for cell instance detection, segmentation and classification; and EpiSeg, for epithelium segmentation. IMILIA achieves a cross-validation ROC-AUC of 0.83 on the discovery cohort, and a ROC-AUC of 0.99 and 0.84 on two external validation cohorts. The interpretability module yields biologically consistent insights: tiles with higher predicted scores show increased densities of immune cells (lymphocytes, plasmocytes, neutrophils and eosinophils), whereas lower-scored tiles predominantly contain normal epithelial cells. Notably, these patterns were consistent across all datasets. Code and models to partially replicate the results on the public IBDColEpi dataset can be found at this https URL.
zh

[CV-34] A Domain-Adapted Lightweight Ensemble for Resource-Efficient Few-Shot Plant Disease Classification

【速读】:该论文旨在解决植物叶片病害识别中对大规模标注数据和高计算资源依赖的问题,尤其针对数据稀缺与计算资源受限环境下的精准诊断难题。其解决方案的关键在于提出一种轻量级的少样本学习(few-shot learning)框架,结合领域自适应的MobileNetV2与MobileNetV3作为特征提取器,并引入特征融合技术以生成鲁棒的特征表示;进一步通过增强注意力机制的双向长短期记忆网络(Bi-LSTM)进行分类,有效捕捉序列依赖关系并聚焦关键特征,从而在复杂真实场景下实现高性能识别,同时保持模型紧凑(约40 MB)和低推理复杂度(约1.12 GFLOPs)。

链接: https://arxiv.org/abs/2512.13428
作者: Anika Islam,Tasfia Tahsin,Zaarin Anjum,Md. Bakhtiar Hasan,Md. Hasanul Kabir
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate and timely identification of plant leaf diseases is essential for resilient and sustainable agriculture, yet most deep learning approaches rely on large annotated datasets and computationally intensive models that are unsuitable for data-scarce and resource-constrained environments. To address these challenges we present a few-shot learning approach within a lightweight yet efficient framework that combines domain-adapted MobileNetV2 and MobileNetV3 models as feature extractors, along with a feature fusion technique to generate robust feature representation. For the classification task, the fused features are passed through a Bi-LSTM classifier enhanced with attention mechanisms to capture sequential dependencies and focus on the most relevant features, thereby achieving optimal classification performance even in complex, real-world environments with noisy or cluttered backgrounds. The proposed framework was evaluated across multiple experimental setups, including both laboratory-controlled and field-captured datasets. On tomato leaf diseases from the PlantVillage dataset, it consistently improved performance across 1 to 15 shot scenarios, reaching 98.23±0.33% at 15 shot, closely approaching the 99.98% SOTA benchmark achieved by a Transductive LSTM with attention, while remaining lightweight and mobile-friendly. Under real-world conditions using field images from the Dhan Shomadhan dataset, it maintained robust performance, reaching 69.28±1.49% at 15-shot and demonstrating strong resilience to complex backgrounds. Notably, it also outperformed the previous SOTA accuracy of 96.0% on six diseases from PlantVillage, achieving 99.72% with only 15-shot learning. With a compact model size of approximately 40 MB and inference complexity of approximately 1.12 GFLOPs, this work establishes a scalable, mobile-ready foundation for precise plant disease diagnostics in data-scarce regions.
zh

[CV-35] MineTheGap: Automatic Mining of Biases in Text-to-Image Models

【速读】:该论文旨在解决文本到图像(Text-to-Image, TTI)生成模型在面对提示词(prompt)语义模糊时所表现出的偏见问题,这些问题可能引发社会影响(如职业描述中仅呈现特定种族)或降低用户体验(如生成图像多样性不足)。解决方案的关键在于提出一种名为MineTheGap的方法,其核心是利用遗传算法(genetic algorithm)迭代优化提示词池,以自动挖掘导致模型产生偏见输出的提示。该方法通过一个新颖的偏见评分机制实现优化驱动,该评分基于生成图像分布与大语言模型(LLM)生成的提示变体文本分布之间的差异,从而对偏见严重程度进行量化排序,有效识别并暴露TTI模型中的潜在偏见模式。

链接: https://arxiv.org/abs/2512.13427
作者: Noa Cohen,Nurit Spingarn-Eliezer,Inbar Huberman-Spiegelglas,Tomer Michaeli
机构: Technion – Israel Institute of Technology (以色列理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Code and examples are available on the project’s webpage at this https URL

点击查看摘要

Abstract:Text-to-Image (TTI) models generate images based on text prompts, which often leave certain aspects of the desired image ambiguous. When faced with these ambiguities, TTI models have been shown to exhibit biases in their interpretations. These biases can have societal impacts, e.g., when showing only a certain race for a stated occupation. They can also affect user experience when creating redundancy within a set of generated images instead of spanning diverse possibilities. Here, we introduce MineTheGap - a method for automatically mining prompts that cause a TTI model to generate biased outputs. Our method goes beyond merely detecting bias for a given prompt. Rather, it leverages a genetic algorithm to iteratively refine a pool of prompts, seeking for those that expose biases. This optimization process is driven by a novel bias score, which ranks biases according to their severity, as we validate on a dataset with known biases. For a given prompt, this score is obtained by comparing the distribution of generated images to the distribution of LLM-generated texts that constitute variations on the prompt. Code and examples are available on the project’s webpage.
zh

[CV-36] RecTok: Reconstruction Distillation along Rectified Flow

【速读】:该论文旨在解决高维视觉分词器(visual tokenizer)在扩散模型中性能劣于低维分词器的问题,其核心挑战在于维度与生成质量之间的固有权衡。解决方案的关键在于提出RecTok方法,通过两项创新实现:一是流语义蒸馏(flow semantic distillation),将视觉基础模型(VFM)中的语义信息蒸馏到流匹配(flow matching)的前向轨迹中,从而构建语义丰富的训练空间;二是重建-对齐蒸馏(reconstruction–alignment distillation),引入掩码特征重建损失进一步增强语义表达能力。该方法突破了传统依赖潜在空间建模的局限,使高维潜空间在保持语义丰富性的同时显著提升图像重建和生成质量。

链接: https://arxiv.org/abs/2512.13421
作者: Qingyu Shi,Size Wu,Jinbin Bai,Kaidong Yu,Yujing Wang,Yunhai Tong,Xiangtai Li,Xuelong Li
机构: Peking University (北京大学); Nanyang Technological University (南洋理工大学); TeleAI
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual tokenizers play a crucial role in diffusion models. The dimensionality of latent space governs both reconstruction fidelity and the semantic expressiveness of the latent feature. However, a fundamental trade-off is inherent between dimensionality and generation quality, constraining existing methods to low-dimensional latent spaces. Although recent works have leveraged vision foundation models to enrich the semantics of visual tokenizers and accelerate convergence, high-dimensional tokenizers still underperform their low-dimensional counterparts. In this work, we propose RecTok, which overcomes the limitations of high-dimensional visual tokenizers through two key innovations: flow semantic distillation and reconstruction–alignment distillation. Our key insight is to make the forward flow in flow matching semantically rich, which serves as the training space of diffusion transformers, rather than focusing on the latent space as in previous works. Specifically, our method distills the semantic information in VFMs into the forward flow trajectories in flow matching. And we further enhance the semantics by introducing a masked feature reconstruction loss. Our RecTok achieves superior image reconstruction, generation quality, and discriminative performance. It achieves state-of-the-art results on the gFID-50K under both with and without classifier-free guidance settings, while maintaining a semantically rich latent space structure. Furthermore, as the latent dimensionality increases, we observe consistent improvements. Code and model are available at this https URL.
zh

[CV-37] Learning to Generate Cross-Task Unexploitable Examples

【速读】:该论文旨在解决个人图像在上传至网络后可能被未经授权利用的问题,即如何生成“不可利用”(unexploitable)的图像版本,以保护用户隐私。现有方法在实际应用中仍存在局限性,难以在多种真实场景下的计算机视觉任务中均实现广泛的不可利用性。为应对这一挑战,作者提出了一种新颖的元跨任务不可利用样本生成框架(Meta Cross-Task Unexploitable Example Generation, MCT-UEG),其核心创新在于设计了一个基于平坦极小值(flat-minima-oriented)的元训练与测试方案,通过优化生成器使其能够产出在多任务下均具备强鲁棒性的不可利用图像样本。

链接: https://arxiv.org/abs/2512.13416
作者: Haoxuan Qu,Qiuchi Xiang,Yujun Cai,Yirui Wu,Majid Mirmehdi,Hossein Rahmani,Jun Liu
机构: Lancaster University (兰卡斯特大学); The University of Queensland (昆士兰大学); Hohai University (河海大学); University of Bristol (布里斯托大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Unexploitable example generation aims to transform personal images into their unexploitable (unlearnable) versions before they are uploaded online, thereby preventing unauthorized exploitation of online personal images. Recently, this task has garnered significant research attention due to its critical relevance to personal data privacy. Yet, despite recent progress, existing methods for this task can still suffer from limited practical applicability, as they can fail to generate examples that are broadly unexploitable across different real-world computer vision tasks. To deal with this problem, in this work, we propose a novel Meta Cross-Task Unexploitable Example Generation (MCT-UEG) framework. At the core of our framework, to optimize the unexploitable example generator for effectively producing broadly unexploitable examples, we design a flat-minima-oriented meta training and testing scheme. Extensive experiments show the efficacy of our framework.
zh

[CV-38] USTM: Unified Spatial and Temporal Modeling for Continuous Sign Language Recognition

【速读】:该论文旨在解决连续手语识别(Continuous Sign Language Recognition, CSLR)中因现有框架难以捕捉精细的手部与面部特征以及建模长程时间依赖性而导致的识别精度不足问题。解决方案的关键在于提出统一的时空建模(Unified Spatio-Temporal Modeling, USTM)框架,其核心是结合基于Swin Transformer的轻量级时序适配器与位置嵌入(Temporal Adapter with Positional Embeddings, TAPE),从而在单流RGB视频输入下有效提取细粒度空间特征并同时建模短时与长时时间上下文,显著提升识别性能。

链接: https://arxiv.org/abs/2512.13415
作者: Ahmed Abul Hasanaath,Hamzah Luqman
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Continuous sign language recognition (CSLR) requires precise spatio-temporal modeling to accurately recognize sequences of gestures in videos. Existing frameworks often rely on CNN-based spatial backbones combined with temporal convolution or recurrent modules. These techniques fail in capturing fine-grained hand and facial cues and modeling long-range temporal dependencies. To address these limitations, we propose the Unified Spatio-Temporal Modeling (USTM) framework, a spatio-temporal encoder that effectively models complex patterns using a combination of a Swin Transformer backbone enhanced with lightweight temporal adapter with positional embeddings (TAPE). Our framework captures fine-grained spatial features alongside short and long-term temporal context, enabling robust sign language recognition from RGB videos without relying on multi-stream inputs or auxiliary modalities. Extensive experiments on benchmarked datasets including PHOENIX14, PHOENIX14T, and CSL-Daily demonstrate that USTM achieves state-of-the-art performance against RGB-based as well as multi-modal CSLR approaches, while maintaining competitive performance against multi-stream approaches. These results highlight the strength and efficacy of the USTM framework for CSLR. The code is available at this https URL
zh

[CV-39] Computer vision training dataset generation for robotic environments using Gaussian splatting

【速读】:该论文旨在解决机器人视觉任务中大规模、高保真且自动标注数据集的生成问题,核心挑战包括合成图像与真实世界图像之间的域差距(domain gap)以及人工标注耗时费力的瓶颈。解决方案的关键在于提出了一种新颖的流水线:首先利用3D高斯泼溅(3D Gaussian Splatting, 3DGS)构建逼真的环境和物体表示,随后在游戏引擎中通过物理模拟生成自然布局;进一步采用两阶段渲染技术,将3DGS的高保真度与基于代理网格(proxy mesh)生成的阴影贴图相结合,算法化地融合阴影与细微高光,显著提升图像真实感;同时自动生成像素级精确的分割掩码,直接适配YOLO等目标检测模型。实验表明,结合少量真实图像与大量合成数据的混合训练策略可实现最优检测与分割性能,验证了该方法在高效构建鲁棒、准确模型方面的有效性。

链接: https://arxiv.org/abs/2512.13411
作者: Patryk Niżeniec,Marcin Iwanowski
机构: University of Warmia and Mazury in Olsztyn (奥尔什丁瓦米亚和马祖里大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Code available at: this https URL

点击查看摘要

Abstract:This paper introduces a novel pipeline for generating large-scale, highly realistic, and automatically labeled datasets for computer vision tasks in robotic environments. Our approach addresses the critical challenges of the domain gap between synthetic and real-world imagery and the time-consuming bottleneck of manual annotation. We leverage 3D Gaussian Splatting (3DGS) to create photorealistic representations of the operational environment and objects. These assets are then used in a game engine where physics simulations create natural arrangements. A novel, two-pass rendering technique combines the realism of splats with a shadow map generated from proxy meshes. This map is then algorithmically composited with the image to add both physically plausible shadows and subtle highlights, significantly enhancing realism. Pixel-perfect segmentation masks are generated automatically and formatted for direct use with object detection models like YOLO. Our experiments show that a hybrid training strategy, combining a small set of real images with a large volume of our synthetic data, yields the best detection and segmentation performance, confirming this as an optimal strategy for efficiently achieving robust and accurate models.
zh

[CV-40] End2Reg: Learning Task-Specific Segmentation for Markerless Registration in Spine Surgery

【速读】:该论文旨在解决脊柱手术中术中导航对毫米级精度的需求与现有系统依赖侵入性骨锚定标记或辐射密集型成像所带来的局限性问题,尤其是当前基于RGB-D图像的无标记注册方法因使用弱分割标签(weak segmentation labels)导致误差传播、影响注册精度的问题。其解决方案的关键在于提出一种端到端深度学习框架End2Reg,该框架通过联合优化分割与注册任务,在无需弱分割标签和人工干预的情况下,直接以注册目标为导向学习用于注册优化的分割掩膜(segmentation masks),从而显著提升注册准确性并实现全自动、无标记的术中导航流程。

链接: https://arxiv.org/abs/2512.13402
作者: Lorenzo Pettinari,Sidaty El Hadramy,Michael Wehrli,Philippe C. Cattin,Daniel Studer,Carol C. Hasler,Maria Licci
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Code and interactive visualizations: this https URL

点击查看摘要

Abstract:Purpose: Intraoperative navigation in spine surgery demands millimeter-level accuracy. Current systems based on intraoperative radiographic imaging and bone-anchored markers are invasive, radiation-intensive and workflow disruptive. Recent markerless RGB-D registration methods offer a promising alternative, but existing approaches rely on weak segmentation labels to isolate relevant anatomical structures, which can propagate errors throughout registration. Methods: We present End2Reg an end-to-end deep learning framework that jointly optimizes segmentation and registration, eliminating the need for weak segmentation labels and manual steps. The network learns segmentation masks specifically optimized for registration, guided solely by the registration objective without direct segmentation supervision. Results: The proposed framework achieves state-of-the-art performance on ex- and in-vivo benchmarks, reducing median Target Registration Error by 32% to 1.83mm and mean Root Mean Square Error by 45% to 3.95mm, respectively. An ablation study confirms that end-to-end optimization significantly improves registration accuracy. Conclusion: The presented end-to-end RGB-D registration pipeline removes dependency on weak labels and manual steps, advancing towards fully automatic, markerless intraoperative navigation. Code and interactive visualizations are available at: this https URL.
zh

[CV-41] rNCA: Self-Repairing Segmentation Masks

【速读】:该论文旨在解决通用分割模型在生成掩码时难以保持拓扑正确性的问题,此类模型常产生碎片化或不连通的输出结果。解决方案的关键在于将神经元胞自动机(Neural Cellular Automata, NCA)直接重构为一种有效的后处理精修机制,通过图像上下文引导的局部、迭代更新策略修复分割掩码。该方法仅依赖局部信息即可学习目标形状的结构特性,在训练过程中利用不完美掩码与真实标签之间的差异,使NCA动态逐步重组断裂区域、去除松散片段,并收敛至拓扑一致的结果,从而显著提升多种任务(如视网膜血管和心肌分割)中的Dice分数、clDice及Betti数误差指标。

链接: https://arxiv.org/abs/2512.13397
作者: Malte Silbernagel,Albert Alonso,Jens Petersen,Bulat Ibragimov,Marleen de Bruijne,Madeleine K. Wyburd
机构: University of Copenhagen (哥本哈根大学); Rigshospitalet (皇家医院); Erasmus MC – University Medical Center Rotterdam (埃因霍温大学医学中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Accurately predicting topologically correct masks remains a difficult task for general segmentation models, which often produce fragmented or disconnected outputs. Fixing these artifacts typically requires hand-crafted refinement rules or architectures specialized to a particular task. Here, we show that Neural Cellular Automata (NCA) can be directly re-purposed as an effective refinement mechanism, using local, iterative updates guided by image context to repair segmentation masks. By training on imperfect masks and ground truths, the automaton learns the structural properties of the target shape while relying solely on local information. When applied to coarse, globally predicted masks, the learned dynamics progressively reconnect broken regions, prune loose fragments and converge towards stable, topologically consistent results. We show how refinement NCA (rNCA) can be easily applied to repair common topological errors produced by different base segmentation models and tasks: for fragmented retinal vessels, it yields 2-3% gains in Dice/clDice and improves Betti errors, reducing \beta_0 errors by 60% and \beta_1 by 20%; for myocardium, it repairs 61.5% of broken cases in a zero-shot setting while lowering ASSD and HD by 19% and 16%, respectively. This showcases NCA as effective and broadly applicable refiners.
zh

[CV-42] Beyond the Visible: Disocclusion-Aware Editing via Proxy Dynamic Graphs

【速读】:该论文旨在解决图像到视频生成中用户对最终帧中遮挡区域(disoccluded regions)内容控制不足的问题。现有图像到视频流水线虽能生成合理运动,但在强制用户指定内容于新暴露区域时难以实现可预测且结构化的动作生成。解决方案的关键在于将运动规范与外观合成解耦:提出一种轻量级、用户可编辑的代理动态图(Proxy Dynamic Graph, PDG),其以确定性但近似的方式驱动部件运动;同时利用冻结的扩散先验(diffusion prior)来合成遵循该运动的逼真外观。通过用户松散标注并重置PDG,计算稠密运动流以引导扩散模型作为运动感知着色器,并结合PDG编码的可见性信息,在潜在空间进行复合操作,从而在不微调模型的前提下实现可控的关节运动和对遮挡区域外观的精准干预。

链接: https://arxiv.org/abs/2512.13392
作者: Anran Qi,Changjian Li,Adrien Bousseau,Niloy J.Mitra
机构: Inria - Université Côte d’Azur (法国国家信息与自动化研究院-蔚蓝海岸大学); University of Edinburgh (爱丁堡大学); Adobe Research (Adobe 研究院); UCL (伦敦大学学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We address image-to-video generation with explicit user control over the final frame’s disoccluded regions. Current image-to-video pipelines produce plausible motion but struggle to generate predictable, articulated motions while enforcing user-specified content in newly revealed areas. Our key idea is to separate motion specification from appearance synthesis: we introduce a lightweight, user-editable Proxy Dynamic Graph (PDG) that deterministically yet approximately drives part motion, while a frozen diffusion prior is used to synthesize plausible appearance that follows that motion. In our training-free pipeline, the user loosely annotates and reposes a PDG, from which we compute a dense motion flow to leverage diffusion as a motion-guided shader. We then let the user edit appearance in the disoccluded areas of the image, and exploit the visibility information encoded by the PDG to perform a latent-space composite that reconciles motion with user intent in these areas. This design yields controllable articulation and user control over disocclusions without fine-tuning. We demonstrate clear advantages against state-of-the-art alternatives towards images turned into short videos of articulated objects, furniture, vehicles, and deformables. Our method mixes generative control, in the form of loose pose and structure, with predictable controls, in the form of appearance specification in the final frame in the disoccluded regions, unlocking a new image-to-video workflow. Code will be released on acceptance. Project page: this https URL
zh

[CV-43] Unlocking Generalization in Polyp Segmentation with DINO Self-Attention “keys”

【速读】:该论文旨在解决结直肠癌(Colorectal Cancer, CRC)诊疗中自动息肉分割(Polyp Segmentation)的泛化能力不足问题,尤其在数据稀缺或挑战性场景下,现有深度学习方法常因依赖复杂且任务特定的网络结构而性能受限。其解决方案的关键在于利用DINO视觉Transformer(Vision Transformer, ViT)自注意力机制中“键”(Key)特征的内在鲁棒性,通过一个简单的卷积解码器预测息肉掩膜,从而在不引入息肉特异性架构的前提下显著提升模型的泛化性能。实验表明,该方法在多中心数据集上采用领域泛化(Domain Generalization, DG)和极端单域泛化(Extreme Single Domain Generalization, ESDG)协议验证时,达到了当前最优(State-of-the-Art, SOTA)效果,尤其在数据受限场景中优势明显。

链接: https://arxiv.org/abs/2512.13376
作者: Carla Monteiro,Valentina Corbetta,Regina Beets-Tan,Luís F. Teixeira,Wilson Silva
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 29 pages, 10 figures, 8 tables, under review at MIDL 2026

点击查看摘要

Abstract:Automatic polyp segmentation is crucial for improving the clinical identification of colorectal cancer (CRC). While Deep Learning (DL) techniques have been extensively researched for this problem, current methods frequently struggle with generalization, particularly in data-constrained or challenging settings. Moreover, many existing polyp segmentation methods rely on complex, task-specific architectures. To address these limitations, we present a framework that leverages the intrinsic robustness of DINO self-attention “key” features for robust segmentation. Unlike traditional methods that extract tokens from the deepest layers of the Vision Transformer (ViT), our approach leverages the key features of the self-attention module with a simple convolutional decoder to predict polyp masks, resulting in enhanced performance and better generalizability. We validate our approach using a multi-center dataset under two rigorous protocols: Domain Generalization (DG) and Extreme Single Domain Generalization (ESDG). Our results, supported by a comprehensive statistical analysis, demonstrate that this pipeline achieves state-of-the-art (SOTA) performance, significantly enhancing generalization, particularly in data-scarce and challenging scenarios. While avoiding a polyp-specific architecture, we surpass well-established models like nnU-Net and UM-Net. Additionally, we provide a systematic benchmark of the DINO framework’s evolution, quantifying the specific impact of architectural advancements on downstream polyp segmentation performance.
zh

[CV-44] Automated User Identification from Facial Thermograms with Siamese Networks

【速读】:该论文旨在解决基于热成像的生物识别技术在安全系统中应用时面临的准确性与可靠性问题,特别是在不同红外光谱波段(NIR、SWIR、MWIR 和 LWIR)下识别性能的差异及单一模态识别的局限性。解决方案的关键在于提出一种结合高分辨率热成像传感器(具备至少30 Hz帧率和良好热灵敏度)与Siamese神经网络的自动化识别框架,并通过实验验证其在自建数据集上达到约80%的准确率;同时探索可见光与红外光谱融合的混合系统,以克服单一模态的不足,从而提升整体生物识别系统的鲁棒性和实用性。

链接: https://arxiv.org/abs/2512.13361
作者: Elizaveta Prozorova,Anton Konev,Vladimir Faerman
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注: 5 pages, 2 figures, reported on 21st International Scientific and Practical Conference ‘Electronic Means and Control Systems’, dedicated to the 80th anniversary of radio engineering education beyond the Urals, Tomsk, 24 November 2025

点击查看摘要

Abstract:The article analyzes the use of thermal imaging technologies for biometric identification based on facial thermograms. It presents a comparative analysis of infrared spectral ranges (NIR, SWIR, MWIR, and LWIR). The paper also defines key requirements for thermal cameras used in biometric systems, including sensor resolution, thermal sensitivity, and a frame rate of at least 30 Hz. Siamese neural networks are proposed as an effective approach for automating the identification process. In experiments conducted on a proprietary dataset, the proposed method achieved an accuracy of approximately 80%. The study also examines the potential of hybrid systems that combine visible and infrared spectra to overcome the limitations of individual modalities. The results indicate that thermal imaging is a promising technology for developing reliable security systems.
zh

[CV-45] Face Identity Unlearning for Retrieval via Embedding Dispersion

【速读】:该论文旨在解决人脸检索系统中身份隐私保护的问题,即如何使特定身份的嵌入表示在超球面上分散,从而破坏其紧凑的身份聚类结构,防止在画廊库中被重新识别,同时保持其余身份的判别性结构和检索性能。解决方案的关键在于提出一种基于分散(dispersion-based)的遗忘机制,通过有效打散目标身份的嵌入分布来实现身份不可追溯性,而不显著损害模型对其他身份的检索效用。

链接: https://arxiv.org/abs/2512.13317
作者: Mikhail Zakharov
机构: Kryptonite AI Laboratory, Kryptonite JSC, Moscow, Russia
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 12 pages, 1 figure, 5 tables, 10 equations. Preprint

点击查看摘要

Abstract:Face recognition systems rely on learning highly discriminative and compact identity clusters to enable accurate retrieval. However, as with other surveillance-oriented technologies, such systems raise serious privacy concerns due to their potential for unauthorized identity tracking. While several works have explored machine unlearning as a means of privacy protection, their applicability to face retrieval - especially for modern embedding-based recognition models - remains largely unexplored. In this work, we study the problem of face identity unlearning for retrieval systems and present its inherent challenges. The goal is to make selected identities unretrievable by dispersing their embeddings on the hypersphere and preventing the formation of compact identity clusters that enable re-identification in the gallery. The primary challenge is to achieve this forgetting effect while preserving the discriminative structure of the embedding space and the retrieval performance of the model for the remaining identities. To address this, we evaluate several existing approximate class unlearning methods (e.g., Random Labeling, Gradient Ascent, Boundary Unlearning, and other recent approaches) in the context of face retrieval and propose a simple yet effective dispersion-based unlearning approach. Extensive experiments on standard benchmarks (VGGFace2, CelebA) demonstrate that our method achieves superior forgetting behavior while preserving retrieval utility.
zh

[CV-46] KlingAvatar 2.0 Technical Report

【速读】:该论文旨在解决生成式 AI(Generative AI)在长时高分辨率视频生成中面临的效率低下、时间漂移(temporal drifting)、质量退化及指令跟随能力弱等问题。其核心解决方案是提出KlingAvatar 2.0,一个时空级联框架(spatio-temporal cascade framework),通过分阶段的上采样策略实现空间分辨率与时间维度的协同提升:首先生成低分辨率的关键帧以捕捉全局语义和运动信息,随后采用首尾帧策略细化为高分辨率且时序一致的子片段,确保长视频中平滑的时间过渡;同时引入由三个模态专用大语言模型(LLM)专家组成的共推理导演(Co-Reasoning Director),增强跨模态指令融合与对齐,并通过负向导演(Negative Director)优化负面提示以提升指令一致性,最终支持特定身份的多角色控制,显著提升了视觉清晰度、唇齿渲染准确性、身份保真度及多模态指令遵循能力。

链接: https://arxiv.org/abs/2512.13313
作者: Kling Team:Jialu Chen,Yikang Ding,Zhixue Fang,Kun Gai,Yuan Gao,Kang He,Jingyun Hua,Boyuan Jiang,Mingming Lao,Xiaohan Li,Hui Liu,Jiwen Liu,Xiaoqiang Liu,Yuan Liu,Shun Lu,Yongsen Mao,Yingchao Shao,Huafeng Shi,Xiaoyu Shi,Peiqin Sun,Songlin Tang,Pengfei Wan,Chao Wang,Xuebo Wang,Haoxian Zhang,Yuanxing Zhang,Yan Zhou
机构: Kuaishou Technology(快手科技)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 7 figures

点击查看摘要

Abstract:Avatar video generation models have achieved remarkable progress in recent years. However, prior work exhibits limited efficiency in generating long-duration high-resolution videos, suffering from temporal drifting, quality degradation, and weak prompt following as video length increases. To address these challenges, we propose KlingAvatar 2.0, a spatio-temporal cascade framework that performs upscaling in both spatial resolution and temporal dimension. The framework first generates low-resolution blueprint video keyframes that capture global semantics and motion, and then refines them into high-resolution, temporally coherent sub-clips using a first-last frame strategy, while retaining smooth temporal transitions in long-form videos. To enhance cross-modal instruction fusion and alignment in extended videos, we introduce a Co-Reasoning Director composed of three modality-specific large language model (LLM) experts. These experts reason about modality priorities and infer underlying user intent, converting inputs into detailed storylines through multi-turn dialogue. A Negative Director further refines negative prompts to improve instruction alignment. Building on these components, we extend the framework to support ID-specific multi-character control. Extensive experiments demonstrate that our model effectively addresses the challenges of efficient, multimodally aligned long-form high-resolution video generation, delivering enhanced visual clarity, realistic lip-teeth rendering with accurate lip synchronization, strong identity preservation, and coherent multimodal instruction following.
zh

[CV-47] ShowTable: Unlocking Creative Table Visualization with Collaborative Reflection and Refinement

【速读】:该论文旨在解决当前生成模型在需要深度推理、规划及精确数据到视觉映射能力的任务中表现不足的问题,特别是针对创意表格可视化(creative table visualization)这一挑战性任务——即根据给定表格数据生成既忠实又美观的信息图(infographic)。解决方案的关键在于提出ShowTable管道,该管道通过多模态大语言模型(MLLM)与扩散模型(diffusion model)的协同工作实现渐进式自我修正过程:MLLM作为核心调度器负责视觉规划和错误判断以提供精细化指令,扩散模型则执行这些指令并实现高保真输出。该方法显著提升了多模态推理、生成与纠错能力,实验表明其在多个基准测试中优于现有基线模型。

链接: https://arxiv.org/abs/2512.13303
作者: Zhihang Liu,Xiaoyi Bao,Pandeng Li,Junjie Zhou,Zhaohe Liao,Yefei He,Kaixun Jiang,Chen-Wei Xie,Yun Zheng,Hongtao Xie
机构: USTC(中国科学技术大学); CASIA(中国科学院自动化研究所); NJU(南京大学); SJTU(上海交通大学); ZJU(浙江大学); FDU(复旦大学); Tongyi Lab(通义实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: project page: this https URL

点击查看摘要

Abstract:While existing generation and unified models excel at general image generation, they struggle with tasks requiring deep reasoning, planning, and precise data-to-visual mapping abilities beyond general scenarios. To push beyond the existing limitations, we introduce a new and challenging task: creative table visualization, requiring the model to generate an infographic that faithfully and aesthetically visualizes the data from a given table. To address this challenge, we propose ShowTable, a pipeline that synergizes MLLMs with diffusion models via a progressive self-correcting process. The MLLM acts as the central orchestrator for reasoning the visual plan and judging visual errors to provide refined instructions, the diffusion execute the commands from MLLM, achieving high-fidelity results. To support this task and our pipeline, we introduce three automated data construction pipelines for training different modules. Furthermore, we introduce TableVisBench, a new benchmark with 800 challenging instances across 5 evaluation dimensions, to assess performance on this task. Experiments demonstrate that our pipeline, instantiated with different models, significantly outperforms baselines, highlighting its effective multi-modal reasoning, generation, and error correction capabilities.
zh

[CV-48] LINA: Learning INterventions Adaptively for Physical Alignment and Generalization in Diffusion Models

【速读】:该论文旨在解决扩散模型(Diffusion Models, DMs)在图像和视频生成中面临的两大核心问题:(1) 物理对齐(physical alignment)不足,即生成内容与物理规律不一致;(2) 对分布外(out-of-distribution, OOD)指令的遵循能力弱。作者认为这些问题源于模型未能学习因果方向以及未能解耦因果因素以支持新颖组合。解决方案的关键在于引入两个核心工具:Causal Scene Graph(CSG)和Physical Alignment Probe(PAP)数据集,用于诊断干预;并基于分析结果提出LINA(Learning INterventions Adaptively)框架——该框架通过学习预测特定提示的干预策略,实现两个关键机制:(1) 在提示空间和视觉潜在空间中进行目标引导;(2) 采用因果感知的重新分配去噪调度。这一方法显著提升了DMs在因果生成任务和Winoground数据集上的物理一致性与指令遵循能力,达到当前最优性能。

链接: https://arxiv.org/abs/2512.13290
作者: Shu Yu,Chaochao Lu
机构: Shanghai AI Laboratory (上海人工智能实验室); Shanghai Innovation Institute (上海创新研究院); Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Diffusion models (DMs) have achieved remarkable success in image and video generation. However, they still struggle with (1) physical alignment and (2) out-of-distribution (OOD) instruction following. We argue that these issues stem from the models’ failure to learn causal directions and to disentangle causal factors for novel recombination. We introduce the Causal Scene Graph (CSG) and the Physical Alignment Probe (PAP) dataset to enable diagnostic interventions. This analysis yields three key insights. First, DMs struggle with multi-hop reasoning for elements not explicitly determined in the prompt. Second, the prompt embedding contains disentangled representations for texture and physics. Third, visual causal structure is disproportionately established during the initial, computationally limited denoising steps. Based on these findings, we introduce LINA (Learning INterventions Adaptively), a novel framework that learns to predict prompt-specific interventions, which employs (1) targeted guidance in the prompt and visual latent spaces, and (2) a reallocated, causality-aware denoising schedule. Our approach enforces both physical alignment and OOD instruction following in image and video DMs, achieving state-of-the-art performance on challenging causal generation tasks and the Winoground dataset. Our project page is at this https URL.
zh

[CV-49] CausalCLIP: Causally-Informed Feature Disentanglement and Filtering for Generalizable Detection of Generated Images AAAI2026

【速读】:该论文旨在解决生成式图像检测中因特征纠缠导致的泛化能力不足问题,即现有方法(包括基于预训练视觉-语言模型的方法)常将任务相关的因果特征(causal features)与无关的伪特征(non-causal features)混合,从而限制了在不同生成技术下的迁移性能。解决方案的关键在于提出CausalCLIP框架,通过构建结构因果模型(structural causal model)显式分离因果与非因果特征,并利用因果推断原则进行定向过滤:采用Gumbel-Softmax-based特征掩码和希尔伯特-施密特独立性准则(Hilbert-Schmidt Independence Criterion, HSIC)约束来强制特征间的统计独立性,从而保留对分布偏移鲁棒的稳定因果特征,显著提升模型在未见生成模型上的检测准确率与平均精度。

链接: https://arxiv.org/abs/2512.13285
作者: Bo Liu,Qiao Qin,Qinghui He
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages Accepted to AAAI 2026

点击查看摘要

Abstract:The rapid advancement of generative models has increased the demand for generated image detectors capable of generalizing across diverse and evolving generation techniques. However, existing methods, including those leveraging pre-trained vision-language models, often produce highly entangled representations, mixing task-relevant forensic cues (causal features) with spurious or irrelevant patterns (non-causal features), thus limiting generalization. To address this issue, we propose CausalCLIP, a framework that explicitly disentangles causal from non-causal features and employs targeted filtering guided by causal inference principles to retain only the most transferable and discriminative forensic cues. By modeling the generation process with a structural causal model and enforcing statistical independence through Gumbel-Softmax-based feature masking and Hilbert-Schmidt Independence Criterion (HSIC) constraints, CausalCLIP isolates stable causal features robust to distribution shifts. When tested on unseen generative models from different series, CausalCLIP demonstrates strong generalization ability, achieving improvements of 6.83% in accuracy and 4.06% in average precision over state-of-the-art methods.
zh

[CV-50] Video Reality Test: Can AI-Generated ASMR Videos fool VLMs and Humans?

【速读】:该论文旨在解决AI生成视频(AIGC)检测中感知真实性和音画一致性评估不足的问题,尤其关注当前生成模型是否能产出在人类和视觉语言模型(VLMs)层面均难以辨识的沉浸式、音频同步视频。其解决方案的关键在于构建Video Reality Test基准套件,该套件基于ASMR(Autonomous Sensory Meridian Response)来源的高质量音视频数据,通过精心设计的“创作者-评审者”对抗协议(adversarial creator-reviewer protocol),使生成模型作为创作者试图欺骗VLM评审者,从而系统性地测试模型在细粒度动作-物体交互场景下的真实性表现。实验表明,即使是最先进的生成模型(如Veo3.1-Fast)也能显著误导主流VLMs(如Gemini 2.5-Pro仅达56%准确率),揭示了当前生成视频在音画一致性与感知保真度方面的局限性。

链接: https://arxiv.org/abs/2512.13281
作者: Jiaqi Wang,Weijia Wu,Yi Zhan,Rui Zhao,Ming Hu,James Cheng,Wei Liu,Philip Torr,Kevin Qinghong Lin
机构: CUHK(香港中文大学); NUS(新加坡国立大学); Video Rebirth; University of Oxford(牛津大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code is at this https URL }, page is at this https URL

点击查看摘要

Abstract:Recent advances in video generation have produced vivid content that are often indistinguishable from real videos, making AI-generated video detection an emerging societal challenge. Prior AIGC detection benchmarks mostly evaluate video without audio, target broad narrative domains, and focus on classification solely. Yet it remains unclear whether state-of-the-art video generation models can produce immersive, audio-paired videos that reliably deceive humans and VLMs. To this end, we introduce Video Reality Test, an ASMR-sourced video benchmark suite for testing perceptual realism under tight audio-visual coupling, featuring the following dimensions: \textbf(i) Immersive ASMR video-audio sources. Built on carefully curated real ASMR videos, the benchmark targets fine-grained action-object interactions with diversity across objects, actions, and backgrounds. \textbf(ii) Peer-Review evaluation. An adversarial creator-reviewer protocol where video generation models act as creators aiming to fool reviewers, while VLMs serve as reviewers seeking to identify fakeness. Our experimental findings show: The best creator Veo3.1-Fast even fools most VLMs: the strongest reviewer (Gemini 2.5-Pro) achieves only 56% accuracy (random 50%), far below that of human experts (81.25%). Adding audio improves real-fake discrimination, yet superficial cues such as watermarks can still significantly mislead models. These findings delineate the current boundary of video generation realism and expose limitations of VLMs in perceptual fidelity and audio-visual consistency. Our code is available at this https URL.
zh

[CV-51] CogniEdit: Dense Gradient Flow Optimization for Fine-Grained Image Editing

【速读】:该论文旨在解决扩散模型在基于指令的图像编辑任务中难以精准遵循细粒度指令(如颜色、位置、数量等具体属性)的问题。现有方法虽采用分组相对策略优化(Group Relative Policy Optimization, GRPO)进行对齐,但仅在单步采样中优化,反馈稀疏,限制了轨迹级控制能力。其解决方案的关键在于提出统一框架 CogniEdit,通过引入多模态大语言模型(Multi-modal Large Language Models)分解复杂指令为可执行操作,结合动态标记焦点重定位机制自适应增强细粒度属性关注,并设计基于密集GRPO的优化策略,在连续去噪步骤间传播梯度,实现整个采样轨迹上的梯度流动与监督,从而显著提升细粒度指令遵循能力与图像质量及编辑保真度的平衡。

链接: https://arxiv.org/abs/2512.13276
作者: Yan Li,Lin Liu,Xiaopeng Zhang,Wei Xue,Wenhan Luo,Yike Guo,Qi Tian
机构: Hongkong University of Science and Technology (香港科技大学); Huawei Company (华为公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Instruction-based image editing with diffusion models has achieved impressive results, yet existing methods strug- gle with fine-grained instructions specifying precise attributes such as colors, positions, and quantities. While recent approaches employ Group Relative Policy Optimization (GRPO) for alignment, they optimize only at individual sampling steps, providing sparse feedback that limits trajectory-level control. We propose a unified framework CogniEdit, combining multi-modal reasoning with dense reward optimization that propagates gradients across con- secutive denoising steps, enabling trajectory-level gradient flow through the sampling process. Our method comprises three components: (1) Multi-modal Large Language Models for decomposing complex instructions into actionable directives, (2) Dynamic Token Focus Relocation that adaptively emphasizes fine-grained attributes, and (3) Dense GRPO-based optimization that propagates gradients across consecutive steps for trajectory-level supervision. Extensive experiments on benchmark datasets demonstrate that our CogniEdit achieves state-of-the-art performance in balancing fine-grained instruction following with visual quality and editability preservation
zh

[CV-52] Post-Training and Test-Time Scaling of Generative Agent Behavior Models for Interactive Autonomous Driving

【速读】:该论文旨在解决多智能体交互运动行为学习中的两个核心问题:一是模仿学习模型因依赖于以安全行为为主的训练数据而继承偏差,导致在高风险场景下鲁棒性不足;二是现有方法多采用开环评估方式,忽视了闭环执行中误差累积对性能的影响。解决方案的关键在于提出两种互补策略:其一为群体相对行为优化(Group Relative Behavior Optimization, GRBO),这是一种基于强化学习的后训练方法,通过群体相对优势最大化并引入人类正则化来微调预训练行为模型,在仅使用10%训练数据的情况下使安全性提升超过40%且保持行为真实性;其二为Warm-K采样策略,一种基于暖启动的Top-K采样机制,能够在测试时平衡运动选择的一致性与多样性,无需重新训练即可增强行为一致性与反应能力,缓解分布偏移并降低性能差异。

链接: https://arxiv.org/abs/2512.13262
作者: Hyunki Seong,Jeong-Kyun Lee,Heesoo Myeong,Yongho Shin,Hyun-Mook Cho,Duck Hoon Kim,Pranav Desai,Monu Surana
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 5 figures

点击查看摘要

Abstract:Learning interactive motion behaviors among multiple agents is a core challenge in autonomous driving. While imitation learning models generate realistic trajectories, they often inherit biases from datasets dominated by safe demonstrations, limiting robustness in safety-critical cases. Moreover, most studies rely on open-loop evaluation, overlooking compounding errors in closed-loop execution. We address these limitations with two complementary strategies. First, we propose Group Relative Behavior Optimization (GRBO), a reinforcement learning post-training method that fine-tunes pretrained behavior models via group relative advantage maximization with human regularization. Using only 10% of the training dataset, GRBO improves safety performance by over 40% while preserving behavioral realism. Second, we introduce Warm-K, a warm-started Top-K sampling strategy that balances consistency and diversity in motion selection. Our Warm-K method-based test-time scaling enhances behavioral consistency and reactivity at test time without retraining, mitigating covariate shift and reducing performance discrepancies. Demo videos are available in the supplementary material.
zh

[CV-53] oward Ambulatory Vision: Learning Visually-Grounded Active View Selection

【速读】:该论文旨在解决视觉语言模型(Vision Language Models, VLMs)在视觉问答(Visual Question Answering, VQA)任务中仅能处理静态图像、缺乏动态环境感知能力的问题,即从“快照视觉”向“具身视觉”(embodied vision)的迁移挑战。其核心解决方案是提出视觉引导的主动视点选择(Visually Grounded Active View Selection, VG-AVS)任务,并构建一个基于合成数据的配对查询-目标视点与问答提示的数据集;关键在于通过监督微调(Supervised Fine-Tuning, SFT)和基于强化学习(Reinforcement Learning, RL)的策略优化联合训练预训练VLM,使其仅依赖当前图像信息即可选择最具信息量的下一个视角,从而实现无需场景记忆或外部知识的主动视觉探索。该方法在未见合成与真实场景中均表现出强泛化能力,并可提升现有基于场景探索的环境问答(EQA)系统的下游问答准确率。

链接: https://arxiv.org/abs/2512.13250
作者: Juil Koo,Daehyeon Choi,Sangwoo Youn,Phillip Y. Lee,Minhyuk Sung
机构: KAIST
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Vision Language Models (VLMs) excel at visual question answering (VQA) but remain limited to snapshot vision, reasoning from static images. In contrast, embodied agents require ambulatory vision, actively moving to obtain more informative views. We introduce Visually Grounded Active View Selection (VG-AVS), a task that selects the most informative next viewpoint using only the visual information in the current image, without relying on scene memory or external knowledge. To support this task, we construct a synthetic dataset with automatically generated paired query-target views and question-answer prompts. We also propose a framework that fine-tunes pretrained VLMs through supervised fine-tuning (SFT) followed by RL-based policy optimization. Our approach achieves strong question answering performance based on viewpoint selection and generalizes robustly to unseen synthetic and real scenes. Furthermore, incorporating our learned VG-AVS framework into existing scene-exploration-based EQA systems improves downstream question-answering accuracy.
zh

[CV-54] STARCaster: Spatio-Temporal AutoRegressive Video Diffusion for Identity- and View-Aware Talking Portraits

【速读】:该论文旨在解决现有视频生成模型在语音驱动肖像动画(speech-driven portrait animation)和自由视角说话肖像合成(free-viewpoint talking portrait synthesis)中的两大瓶颈问题:一是2D扩散模型对参考图像依赖过强,导致运动多样性受限;二是3D感知动画通常依赖预训练三平面生成器进行逆向重建,易引发身份漂移(identity drift)和重建不完整。解决方案的关键在于重构参考条件与几何感知的建模范式:首先,在预训练阶段引入更柔和的身份约束(soft identity constraints),弱化对单一参考图像的严格依赖;其次,通过利用视频数据固有的多视角特性(multi-view nature of video data),隐式地实现3D感知能力,无需显式3D结构建模。此外,论文提出解耦学习策略以分别优化视图一致性与时间连贯性,并采用自强制训练(self-forcing training)机制使模型能够从比推理时更长的时间上下文中学习,从而缓解传统自回归方法中动画过于静态的问题。

链接: https://arxiv.org/abs/2512.13247
作者: Foivos Paraperas Papantoniou,Stathis Galanakis,Rolandos Alexandros Potamias,Bernhard Kainz,Stefanos Zafeiriou
机构: Imperial College London (帝国理工学院); FAU Erlangen–Nürnberg (埃尔朗根-纽伦堡弗里德里希亚历山大大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:This paper presents STARCaster, an identity-aware spatio-temporal video diffusion model that addresses both speech-driven portrait animation and free-viewpoint talking portrait synthesis, given an identity embedding or reference image, within a unified framework. Existing 2D speech-to-video diffusion models depend heavily on reference guidance, leading to limited motion diversity. At the same time, 3D-aware animation typically relies on inversion through pre-trained tri-plane generators, which often leads to imperfect reconstructions and identity drift. We rethink reference- and geometry-based paradigms in two ways. First, we deviate from strict reference conditioning at pre-training by introducing softer identity constraints. Second, we address 3D awareness implicitly within the 2D video domain by leveraging the inherent multi-view nature of video data. STARCaster adopts a compositional approach progressing from ID-aware motion modeling, to audio-visual synchronization via lip reading-based supervision, and finally to novel view animation through temporal-to-spatial adaptation. To overcome the scarcity of 4D audio-visual data, we propose a decoupled learning approach in which view consistency and temporal coherence are trained independently. A self-forcing training scheme enables the model to learn from longer temporal contexts than those generated at inference, mitigating the overly static animations common in existing autoregressive approaches. Comprehensive evaluations demonstrate that STARCaster generalizes effectively across tasks and identities, consistently surpassing prior approaches in different benchmarks.
zh

[CV-55] Ego-EXTRA: video-language Egocentric Dataset for EXpert-TRAinee assistance

【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在提供专家级辅助任务中表现不足的问题,特别是在第一人称视角(egocentric)视频场景下,如何有效融合视觉与自然语言交互以实现高质量的人机协作。解决方案的关键在于构建Ego-EXTRA数据集——一个包含50小时未脚本化第一人称视频的视频-语言数据集,其中训练者执行程序性任务时由真实专家通过自然语言提供指导和问答反馈,采用“绿妖”(Wizard of Oz)范式模拟可穿戴智能助手的行为,从而捕获高保真双向对话,并形成超过15,000个高质量的视觉问答(Visual Question Answering, VQA)样本集合,为评估MLLMs在第一人称情境下的专家级辅助能力提供基准。

链接: https://arxiv.org/abs/2512.13238
作者: Francesco Ragusa,Michele Mazzamuto,Rosario Forte,Irene D’Ambra,James Fort,Jakob Engel,Antonino Furnari,Giovanni Maria Farinella
机构: University of Catania (卡塔尼亚大学); Next Vision s.r.l. (Next Vision 公司); Meta Reality Labs Research (Meta 现实实验室研究部门)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present Ego-EXTRA, a video-language Egocentric Dataset for EXpert-TRAinee assistance. Ego-EXTRA features 50 hours of unscripted egocentric videos of subjects performing procedural activities (the trainees) while guided by real-world experts who provide guidance and answer specific questions using natural language. Following a ``Wizard of OZ’’ data collection paradigm, the expert enacts a wearable intelligent assistant, looking at the activities performed by the trainee exclusively from their egocentric point of view, answering questions when asked by the trainee, or proactively interacting with suggestions during the procedures. This unique data collection protocol enables Ego-EXTRA to capture a high-quality dialogue in which expert-level feedback is provided to the trainee. Two-way dialogues between experts and trainees are recorded, transcribed, and used to create a novel benchmark comprising more than 15k high-quality Visual Question Answer sets, which we use to evaluate Multimodal Large Language Models. The results show that Ego-EXTRA is challenging and highlight the limitations of current models when used to provide expert-level assistance to the user. The Ego-EXTRA dataset is publicly available to support the benchmark of egocentric video-language assistants: this https URL.
zh

[CV-56] POLAR: A Portrait OLAT Dataset and Generative Framework for Illumination-Aware Face Modeling

【速读】:该论文旨在解决人脸重光照(face relighting)任务中因缺乏大规模、物理一致的光照数据而导致性能受限的问题。解决方案的关键在于构建了一个名为POLAR的大规模、物理校准的“一次一光”(One-Light-at-a-Time, OLAT)数据集,包含超过200名受试者在156个光照方向下的多视角和多表情图像,并基于此数据集提出了一种基于流模型(flow-based generative model)的POLARNet方法。该方法能够从单张人脸图像中预测每束光的OLAT响应,从而精确建模细粒度且方向感知的光照效应,同时保持面部身份不变;其核心优势在于将光照视为一种连续、可物理解释的光照状态变换,而非依赖统计或背景条件的扩散模型,从而实现了可扩展、可控的重光照生成,形成一个由真实数据驱动、生成合成与物理约束融合的自洽学习闭环。

链接: https://arxiv.org/abs/2512.13192
作者: Zhuo Chen,Chengqun Yang,Zhuo Su,Zheng Lv,Jingnan Gao,Xiaoyuan Zhang,Xiaokang Yang,Yichao Yan
机构: Shanghai Jiao Tong University (上海交通大学); PICO; bytedance.com (字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 19 figures

点击查看摘要

Abstract:Face relighting aims to synthesize realistic portraits under novel illumination while preserving identity and geometry. However, progress remains constrained by the limited availability of large-scale, physically consistent illumination data. To address this, we introduce POLAR, a large-scale and physically calibrated One-Light-at-a-Time (OLAT) dataset containing over 200 subjects captured under 156 lighting directions, multiple views, and diverse expressions. Building upon POLAR, we develop a flow-based generative model POLARNet that predicts per-light OLAT responses from a single portrait, capturing fine-grained and direction-aware illumination effects while preserving facial identity. Unlike diffusion or background-conditioned methods that rely on statistical or contextual cues, our formulation models illumination as a continuous, physically interpretable transformation between lighting states, enabling scalable and controllable relighting. Together, POLAR and POLARNet form a unified illumination learning framework that links real data, generative synthesis, and physically grounded relighting, establishing a self-sustaining “chicken-and-egg” cycle for scalable and reproducible portrait illumination.
zh

[CV-57] CoRA: A Collaborative Robust Architecture with Hybrid Fusion for Efficient Perception AAAI2026

【速读】:该论文旨在解决协同感知(collaborative perception)在恶劣通信条件下性能显著下降的问题,其根源在于数据传输引起的特征对齐偏差(misalignment),这严重限制了现有方法的实际部署。解决方案的关键在于重新审视中间融合(intermediate fusion)与晚期融合(late fusion)的融合范式,发现二者并非权衡关系,而是互补组合;基于此洞察,作者提出CoRA架构,采用混合策略将性能与鲁棒性解耦,并通过低通信开销实现稳定感知:第一分支为特征级融合分支,选择关键特征进行高效融合以保障性能与可扩展性;第二分支为对象级校正分支,利用语义相关性纠正空间位移,从而抵御姿态误差带来的影响。

链接: https://arxiv.org/abs/2512.13191
作者: Gong Chen,Chaokun Zhang,Pengcheng Lv,Xiaohui Xie
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI2026

点击查看摘要

Abstract:Collaborative perception has garnered significant attention as a crucial technology to overcome the perceptual limitations of single-agent systems. Many state-of-the-art (SOTA) methods have achieved communication efficiency and high performance via intermediate fusion. However, they share a critical vulnerability: their performance degrades under adverse communication conditions due to the misalignment induced by data transmission, which severely hampers their practical deployment. To bridge this gap, we re-examine different fusion paradigms, and recover that the strengths of intermediate and late fusion are not a trade-off, but a complementary pairing. Based on this key insight, we propose CoRA, a novel collaborative robust architecture with a hybrid approach to decouple performance from robustness with low communication. It is composed of two components: a feature-level fusion branch and an object-level correction branch. Its first branch selects critical features and fuses them efficiently to ensure both performance and scalability. The second branch leverages semantic relevance to correct spatial displacements, guaranteeing resilience against pose errors. Experiments demonstrate the superiority of CoRA. Under extreme scenarios, CoRA improves upon its baseline performance by approximately 19% in AP@0.7 with more than 5x less communication volume, which makes it a promising solution for robust collaborative perception.
zh

[CV-58] MMDrive: Interactive Scene Understanding Beyond Vision with Multi-representational Fusion

【速读】:该论文旨在解决现有视觉-语言模型(Vision-Language Models, VLMs)在自动驾驶场景理解中因局限于二维图像理解范式而导致的三维空间感知能力不足与深层语义融合不充分的问题,从而限制了其在复杂驾驶环境中的性能表现。解决方案的关键在于提出MMDrive框架,该框架通过引入占据图(occupancy maps)、激光雷达点云(LiDAR point clouds)和文本场景描述三种互补模态,并设计两个核心组件:一是基于问题语义线索动态加权各模态贡献的文本导向多模态调制器(Text-oriented Multimodal Modulator),实现上下文感知的特征融合;二是利用可学习抽象标记(learnable abstract tokens)生成紧凑跨模态摘要的跨模态抽象器(Cross-Modal Abstractor),以突出关键区域与本质语义。这一机制有效突破了传统仅依赖图像的感知边界,实现了复杂驾驶场景下的鲁棒多模态推理。

链接: https://arxiv.org/abs/2512.13177
作者: Minghui Hou,Wei-Hsing Huang,Shaofeng Liang,Daizong Liu,Tai-Hao Wen,Gang Wang,Runwei Guan,Weiping Ding
机构: Jilin University (吉林大学); Georgia Institute of Technology (佐治亚理工学院); University of Petroleum and Chemistry (中国石油大学); Wuhan University (武汉大学); University of Michigan (密歇根大学); Hong Kong University of Science and Technology (广州) (香港科技大学(广州)); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Vision-language models enable the understanding and reasoning of complex traffic scenarios through multi-source information fusion, establishing it as a core technology for autonomous driving. However, existing vision-language models are constrained by the image understanding paradigm in 2D plane, which restricts their capability to perceive 3D spatial information and perform deep semantic fusion, resulting in suboptimal performance in complex autonomous driving environments. This study proposes MMDrive, an multimodal vision-language model framework that extends traditional image understanding to a generalized 3D scene understanding framework. MMDrive incorporates three complementary modalities, including occupancy maps, LiDAR point clouds, and textual scene descriptions. To this end, it introduces two novel components for adaptive cross-modal fusion and key information extraction. Specifically, the Text-oriented Multimodal Modulator dynamically weights the contributions of each modality based on the semantic cues in the question, guiding context-aware feature integration. The Cross-Modal Abstractor employs learnable abstract tokens to generate compact, cross-modal summaries that highlight key regions and essential semantics. Comprehensive evaluations on the DriveLM and NuScenes-QA benchmarks demonstrate that MMDrive achieves significant performance gains over existing vision-language models for autonomous driving, with a BLEU-4 score of 54.56 and METEOR of 41.78 on DriveLM, and an accuracy score of 62.7% on NuScenes-QA. MMDrive effectively breaks the traditional image-only understanding barrier, enabling robust multimodal reasoning in complex driving environments and providing a new foundation for interpretable autonomous driving scene understanding.
zh

[CV-59] Seeing the Whole Picture: Distribution-Guided Data-Free Distillation for Semantic Segmentation

【速读】:该论文旨在解决现有数据-free知识蒸馏(Data-Free Knowledge Distillation, DFKD)方法在语义分割任务中性能下降的问题。传统DFKD方法主要针对分类任务设计,忽视了语义分割中对象的空间连续性和结构一致性,导致生成的伪数据无法准确反映真实数据分布。解决方案的关键在于提出DFSS框架,其核心创新包括:利用教师模型的Batch Normalization(BN)统计信息引导近似分布采样(Approximate Distribution Sampling, ADS),从而选择更贴近原始训练分布的伪数据;并引入加权分布渐进式蒸馏(Weighted Distribution Progressive Distillation, WDPD),在训练初期优先使用与真实分布高度一致的可靠样本,随后逐步加入更具挑战性的样本,模拟人类感知的学习过程。这一机制显著提升了无数据场景下语义分割的知识蒸馏效果,实现了优于现有方法的性能。

链接: https://arxiv.org/abs/2512.13175
作者: Hongxuan Sun,Tao Wu
机构: National University of Defense Technology (国防科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Semantic segmentation requires a holistic understanding of the physical world, as it assigns semantic labels to spatially continuous and structurally coherent objects rather than to isolated pixels. However, existing data-free knowledge distillation (DFKD) methods-primarily designed for classification-often disregard this continuity, resulting in significant performance degradation when applied directly to segmentation tasks. In this paper, we introduce DFSS, a novel data-free distillation framework tailored for semantic segmentation. Unlike prior approaches that treat pixels independently, DFSS respects the structural and contextual continuity of real-world scenes. Our key insight is to leverage Batch Normalization (BN) statistics from a teacher model to guide Approximate Distribution Sampling (ADS), enabling the selection of data that better reflects the original training distribution-without relying on potentially misleading teacher predictions. Additionally, we propose Weighted Distribution Progressive Distillation (WDPD), which dynamically prioritizes reliable samples that are more closely aligned with the original data distribution early in training and gradually incorporates more challenging cases, mirroring the natural progression of learning in human perception. Extensive experiments on standard benchmarks demonstrate that DFSS consistently outperforms existing data-free distillation methods for semantic segmentation, achieving state-of-the-art results with significantly reduced reliance on auxiliary data.
zh

[CV-60] A Semantically Enhanced Generative Foundation Model Improves Pathological Image Synthesis

【速读】:该论文旨在解决临床级病理人工智能(Pathology AI)发展中因高质量、多样化标注数据集稀缺而导致的瓶颈问题。现有生成式模型虽具潜力,但常因语义不稳定性与形态幻觉(morphological hallucinations)影响诊断可靠性。其解决方案的关键在于提出首个面向病理学的文本到图像生成基础模型——CRAFTS(Correlation-Regulated Alignment Framework for Tissue Synthesis),通过约280万张图像-标题对的双阶段训练策略,引入一种新颖的关联调控对齐机制(correlation-regulated alignment mechanism),有效抑制语义漂移(semantic drift),从而保障生成图像的生物学准确性。该框架不仅生成涵盖30种癌症类型的多样化病理图像,且经客观指标和病理科医生评估验证质量可靠,并在分类、跨模态检索、自监督学习及视觉问答等任务中提升性能;此外,结合ControlNet可实现基于核分割掩膜或荧光图像等输入对组织架构的精确控制,显著缓解数据稀缺与隐私问题,为罕见和复杂癌表型的诊断工具开发提供可持续的数据支持。

链接: https://arxiv.org/abs/2512.13164
作者: Xianchao Guan,Zhiyuan Fan,Yifeng Wang,Fuqiang Chen,Yanjiang Zhou,Zengyang Che,Hongxue Meng,Xin Li,Yaowei Wang,Hongpeng Wang,Min Zhang,Heng Tao Shen,Zheng Zhang,Yongbing Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 67 pages, 9 figures, 16 tables

点击查看摘要

Abstract:The development of clinical-grade artificial intelligence in pathology is limited by the scarcity of diverse, high-quality annotated datasets. Generative models offer a potential solution but suffer from semantic instability and morphological hallucinations that compromise diagnostic reliability. To address this challenge, we introduce a Correlation-Regulated Alignment Framework for Tissue Synthesis (CRAFTS), the first generative foundation model for pathology-specific text-to-image synthesis. By leveraging a dual-stage training strategy on approximately 2.8 million image-caption pairs, CRAFTS incorporates a novel alignment mechanism that suppresses semantic drift to ensure biological accuracy. This model generates diverse pathological images spanning 30 cancer types, with quality rigorously validated by objective metrics and pathologist evaluations. Furthermore, CRAFTS-augmented datasets enhance the performance across various clinical tasks, including classification, cross-modal retrieval, self-supervised learning, and visual question answering. In addition, coupling CRAFTS with ControlNet enables precise control over tissue architecture from inputs such as nuclear segmentation masks and fluorescence images. By overcoming the critical barriers of data scarcity and privacy concerns, CRAFTS provides a limitless source of diverse, annotated histology data, effectively unlocking the creation of robust diagnostic tools for rare and complex cancer phenotypes.
zh

[CV-61] Intrinsic Image Fusion for Multi-View 3D Material Reconstruction WWW ATC

【速读】:该论文旨在解决多视角图像中物理基础材质重建的欠约束问题,传统方法依赖分析-合成策略(analysis-by-synthesis),需昂贵且噪声较大的路径追踪(path tracing)来优化材质参数。其解决方案的关键在于引入单视角先验(single-view priors)以增强优化约束,并设计了一种基于扩散模型的材质估计器生成多个候选分解;随后通过拟合低维参数化函数降低视图间不一致性,并提出一种鲁棒优化框架,结合软视图选择与置信度加权的多视图内点集,融合最一致且最可信视图的预测至统一参数空间;最终利用逆向路径追踪(inverse path tracing)优化低维参数,实现高质量材质解耦与清晰的重光照效果。

链接: https://arxiv.org/abs/2512.13157
作者: Peter Kocsis(1),Lukas Höllein(1),Matthias Nießner(1) ((1) Technical University of Munich)
机构: Technical University of Munich (慕尼黑工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project page: this https URL Video: this https URL

点击查看摘要

Abstract:We introduce Intrinsic Image Fusion, a method that reconstructs high-quality physically based materials from multi-view images. Material reconstruction is highly underconstrained and typically relies on analysis-by-synthesis, which requires expensive and noisy path tracing. To better constrain the optimization, we incorporate single-view priors into the reconstruction process. We leverage a diffusion-based material estimator that produces multiple, but often inconsistent, candidate decompositions per view. To reduce the inconsistency, we fit an explicit low-dimensional parametric function to the predictions. We then propose a robust optimization framework using soft per-view prediction selection together with confidence-based soft multi-view inlier set to fuse the most consistent predictions of the most confident views into a consistent parametric material space. Finally, we use inverse path tracing to optimize for the low-dimensional parameters. Our results outperform state-of-the-art methods in material disentanglement on both synthetic and real scenes, producing sharp and clean reconstructions suitable for high-quality relighting.
zh

[CV-62] StarryGazer: Leverag ing Monocular Depth Estimation Models for Domain-Agnostic Single Depth Image Completion

【速读】:该论文旨在解决深度补全(depth completion)问题,即从单张稀疏深度图和RGB图像中预测稠密深度图像,尤其在缺乏真实深度标注数据的场景下,传统监督方法不可行。现有无监督方法依赖辅助数据估计深度值,与实际应用存在差距。解决方案的关键在于提出一个领域无关(domain-agnostic)框架 StarryGazer,其核心创新是利用预训练的单目深度估计(Monocular Depth Estimation, MDE)模型生成相对深度图,并通过图像分割与随机缩放构建合成伪真值对(dense pseudo-ground truth)及其对应的稀疏深度图,进而训练一个精炼网络(refinement network),融合相对深度图和RGB图像以提升精度与鲁棒性。该方法有效利用了MDE模型的能力,同时借助稀疏深度信息修正其深度差异估计误差,从而在多个数据集上超越现有无监督方法和简单仿射变换后的MDE结果。

链接: https://arxiv.org/abs/2512.13147
作者: Sangmin Hong,Suyoung Lee,Kyoung Mu Lee
机构: Seoul National University (首尔国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages

点击查看摘要

Abstract:The problem of depth completion involves predicting a dense depth image from a single sparse depth map and an RGB image. Unsupervised depth completion methods have been proposed for various datasets where ground truth depth data is unavailable and supervised methods cannot be applied. However, these models require auxiliary data to estimate depth values, which is far from real scenarios. Monocular depth estimation (MDE) models can produce a plausible relative depth map from a single image, but there is no work to properly combine the sparse depth map with MDE for depth completion; a simple affine transformation to the depth map will yield a high error since MDE are inaccurate at estimating depth difference between objects. We introduce StarryGazer, a domain-agnostic framework that predicts dense depth images from a single sparse depth image and an RGB image without relying on ground-truth depth by leveraging the power of large MDE models. First, we employ a pre-trained MDE model to produce relative depth images. These images are segmented and randomly rescaled to form synthetic pairs for dense pseudo-ground truth and corresponding sparse depths. A refinement network is trained with the synthetic pairs, incorporating the relative depth maps and RGB images to improve the model’s accuracy and robustness. StarryGazer shows superior results over existing unsupervised methods and transformed MDE results on various datasets, demonstrating that our framework exploits the power of MDE models while appropriately fixing errors using sparse depth information.
zh

[CV-63] Weight Space Correlation Analysis: Quantifying Feature Utilization in Deep Learning Models

【速读】:该论文旨在解决医学影像深度学习模型中存在的“捷径学习”(shortcut learning)问题,即模型可能依赖于图像中编码的混杂元数据(如扫描仪型号)而非真实的临床信号进行预测,从而影响模型的可信度和泛化能力。解决方案的关键在于提出一种可解释的权重空间相关性分析(Weight Space Correlation Analysis),通过测量主临床任务分类头与辅助元数据任务分类头之间的权重向量对齐程度,量化模型对不同特征的利用情况。该方法能够识别模型是否真正利用了与临床相关的特征,而非仅依赖无关的采集因素,从而为验证模型的可靠性提供了一种有效工具。

链接: https://arxiv.org/abs/2512.13144
作者: Chun Kit Wong,Paraskevas Pegios,Nina Weng,Emilie Pi Fogtmann Sejer,Martin Grønnebæk Tolsgaard,Anders Nymark Christensen,Aasa Feragen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: 46 pages

点击查看摘要

Abstract:Deep learning models in medical imaging are susceptible to shortcut learning, relying on confounding metadata (e.g., scanner model) that is often encoded in image embeddings. The crucial question is whether the model actively utilizes this encoded information for its final prediction. We introduce Weight Space Correlation Analysis, an interpretable methodology that quantifies feature utilization by measuring the alignment between the classification heads of a primary clinical task and auxiliary metadata tasks. We first validate our method by successfully detecting artificially induced shortcut learning. We then apply it to probe the feature utilization of an SA-SonoNet model trained for Spontaneous Preterm Birth (sPTB) prediction. Our analysis confirmed that while the embeddings contain substantial metadata, the sPTB classifier’s weight vectors were highly correlated with clinically relevant factors (e.g., birth weight) but decoupled from clinically irrelevant acquisition factors (e.g. scanner). Our methodology provides a tool to verify model trustworthiness, demonstrating that, in the absence of induced bias, the clinical model selectively utilizes features related to the genuine clinical signal.
zh

[CV-64] owards Unified Co-Speech Gesture Generation via Hierarchical Implicit Periodicity Learning

【速读】:该论文旨在解决从语音生成逼真三维人体动作时存在的协调性差、运动不自然的问题,其核心挑战在于现有端到端学习方法难以建模不同运动单元(如头部、躯干和手部)之间的跨单元(inter-correlations)与单元内(intra-correlations)相关性。解决方案的关键在于提出一种统一的分层隐式周期性(Hierarchical Implicit Periodicity, HIP)学习框架:首先利用周期性自编码器对动作相位流形进行解耦,以模拟人类动作的真实分布并引入当前潜在状态带来的非周期性成分以增强实例级多样性;其次通过级联引导机制建模面部、躯干与手部动作的层次关系,在训练中实现多层级动作的协同驱动,从而提升生成动作的自然性和协调性。

链接: https://arxiv.org/abs/2512.13131
作者: Xin Guo,Yifan Zhao,Jia Li
机构: Beihang University (北京航空航天大学)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Multimedia (cs.MM); Sound (cs.SD)
备注: IEEE Transactions on Image Processing

点击查看摘要

Abstract:Generating 3D-based body movements from speech shows great potential in extensive downstream applications, while it still suffers challenges in imitating realistic human movements. Predominant research efforts focus on end-to-end generation schemes to generate co-speech gestures, spanning GANs, VQ-VAE, and recent diffusion models. As an ill-posed problem, in this paper, we argue that these prevailing learning schemes fail to model crucial inter- and intra-correlations across different motion units, i.e. head, body, and hands, thus leading to unnatural movements and poor coordination. To delve into these intrinsic correlations, we propose a unified Hierarchical Implicit Periodicity (HIP) learning approach for audio-inspired 3D gesture generation. Different from predominant research, our approach models this multi-modal implicit relationship by two explicit technique insights: i) To disentangle the complicated gesture movements, we first explore the gesture motion phase manifolds with periodic autoencoders to imitate human natures from realistic distributions while incorporating non-period ones from current latent states for instance-level diversities. ii) To model the hierarchical relationship of face motions, body gestures, and hand movements, driving the animation with cascaded guidance during learning. We exhibit our proposed approach on 3D avatars and extensive experiments show our method outperforms the state-of-the-art co-speech gesture generation methods by both quantitative and qualitative evaluations. Code and models will be publicly available.
zh

[CV-65] LeafTrackNet: A Deep Learning Framework for Robust Leaf Tracking in Top-Down Plant Phenotyping

【速读】:该论文旨在解决在真实农业环境下对油菜(canola)植株叶片进行高精度、长时间序列追踪的难题。现有方法受限于小规模物种或受控成像条件,而通用多目标追踪(Multi-Object Tracking, MOT)方法又不适用于动态生物场景,且缺乏大规模标注数据支持。其解决方案的关键在于提出两个核心贡献:一是构建了首个大规模叶级追踪基准数据集CanolaTrack,包含5,704张RGB图像和31,840个标注叶实例;二是设计了LeafTrackNet框架,融合YOLOv10叶检测器与MobileNetV3嵌入网络,并采用基于嵌入向量的记忆关联策略实现跨帧叶身份保持,从而在复杂结构作物中实现鲁棒的叶级追踪,在CanolaTrack上相比现有方法提升9% HOTA指标。

链接: https://arxiv.org/abs/2512.13130
作者: Shanghua Liu,Majharulislam Babor,Christoph Verduyn,Breght Vandenberghe,Bruno Betoni Parodi,Cornelia Weltzien,Marina M.-C. Höhne
机构: Leibniz Institute for Agricultural Engineering and Bioeconomy, Germany (德国联邦农业工程与生物经济研究所); BASF Belgium Coordination Center CommV, Belgium (巴斯夫比利时协调中心); BASF Metabolome Solutions GmbH, Germany (巴斯夫代谢组解决方案有限公司); Technical University Berlin, Germany (柏林工业大学); University of Potsdam, Germany (波茨坦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:High resolution phenotyping at the level of individual leaves offers fine-grained insights into plant development and stress responses. However, the full potential of accurate leaf tracking over time remains largely unexplored due to the absence of robust tracking methods-particularly for structurally complex crops such as canola. Existing plant-specific tracking methods are typically limited to small-scale species or rely on constrained imaging conditions. In contrast, generic multi-object tracking (MOT) methods are not designed for dynamic biological scenes. Progress in the development of accurate leaf tracking models has also been hindered by a lack of large-scale datasets captured under realistic conditions. In this work, we introduce CanolaTrack, a new benchmark dataset comprising 5,704 RGB images with 31,840 annotated leaf instances spanning the early growth stages of 184 canola plants. To enable accurate leaf tracking over time, we introduce LeafTrackNet, an efficient framework that combines a YOLOv10-based leaf detector with a MobileNetV3-based embedding network. During inference, leaf identities are maintained over time through an embedding-based memory association strategy. LeafTrackNet outperforms both plant-specific trackers and state-of-the-art MOT baselines, achieving a 9% HOTA improvement on CanolaTrack. With our work we provide a new standard for leaf-level tracking under realistic conditions and we provide CanolaTrack - the largest dataset for leaf tracking in agriculture crops, which will contribute to future research in plant phenotyping. Our code and dataset are publicly available at this https URL.
zh

[CV-66] DePT3R: Joint Dense Point Tracking and 3D Reconstruction of Dynamic Scenes in a Single Forward Pass

【速读】:该论文旨在解决动态场景中密集3D点跟踪与重建的灵活性和效率问题,现有方法通常依赖于成对处理、已知相机位姿或输入帧的时间顺序,限制了其在复杂动态环境中的应用。解决方案的关键在于提出DePT3R框架,该框架通过一个强大的主干网络提取深层时空特征,并利用密集预测头回归像素级映射,在单次前向传播中同时完成密集点跟踪与3D重建,且无需预先知道相机位姿,显著提升了在快速变化动态环境中的适应性和内存效率。

链接: https://arxiv.org/abs/2512.13122
作者: Vivek Alumootil,Tuan-Anh Vu,M. Khalid Jawed
机构: University of California, Los Angeles (加州大学洛杉矶分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: This is a work in progress

点击查看摘要

Abstract:Current methods for dense 3D point tracking in dynamic scenes typically rely on pairwise processing, require known camera poses, or assume a temporal ordering to input frames, constraining their flexibility and applicability. Additionally, recent advances have successfully enabled efficient 3D reconstruction from large-scale, unposed image collections, underscoring opportunities for unified approaches to dynamic scene understanding. Motivated by this, we propose DePT3R, a novel framework that simultaneously performs dense point tracking and 3D reconstruction of dynamic scenes from multiple images in a single forward pass. This multi-task learning is achieved by extracting deep spatio-temporal features with a powerful backbone and regressing pixel-wise maps with dense prediction heads. Crucially, DePT3R operates without requiring camera poses, substantially enhancing its adaptability and efficiency-especially important in dynamic environments with rapid changes. We validate DePT3R on several challenging benchmarks involving dynamic scenes, demonstrating strong performance and significant improvements in memory efficiency over existing state-of-the-art methods. Data and codes are available via the open repository: this https URL
zh

[CV-67] Diffusion-Based Restoration for Multi-Modal 3D Object Detection in Adverse Weather

【速读】:该论文旨在解决多模态3D目标检测在恶劣天气条件下性能受限的问题,主要挑战包括天气引起的传感器数据畸变以及不同模态(如图像与激光雷达)之间的配准偏差。解决方案的关键在于提出DiffFusion框架,其核心创新包括:1)基于扩散模型的图像恢复(Diffusion-IR)和点云恢复(PCR),利用图像中的物体线索修复受天气影响的LiDAR数据;2)双向自适应融合与对齐模块(Bidirectional Adaptive Fusion and Alignment Module, BAFAM),实现动态多模态融合与鸟瞰图(BEV)空间的一致性对齐,从而提升跨模态鲁棒性。

链接: https://arxiv.org/abs/2512.13107
作者: Zhijian He,Feifei Liu,Yuwei Li,Zhanpeng Liu,Jintao Cheng,Xieyuanli Chen,Xiaoyu Tang
机构: South China Normal University (华南师范大学); Shenzhen Technology University (深圳技术大学); Hong Kong University of Science and Technology (香港科技大学); National University of Defense Technology (国防科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-modal 3D object detection is important for reliable perception in robotics and autonomous driving. However, its effectiveness remains limited under adverse weather conditions due to weather-induced distortions and misalignment between different data modalities. In this work, we propose DiffFusion, a novel framework designed to enhance robustness in challenging weather through diffusion-based restoration and adaptive cross-modal fusion. Our key insight is that diffusion models possess strong capabilities for denoising and generating data that can adapt to various weather conditions. Building on this, DiffFusion introduces Diffusion-IR restoring images degraded by weather effects and Point Cloud Restoration (PCR) compensating for corrupted LiDAR data using image object cues. To tackle misalignments between two modalities, we develop Bidirectional Adaptive Fusion and Alignment Module (BAFAM). It enables dynamic multi-modal fusion and bidirectional bird’s-eye view (BEV) alignment to maintain consistent spatial correspondence. Extensive experiments on three public datasets show that DiffFusion achieves state-of-the-art robustness under adverse weather while preserving strong clean-data performance. Zero-shot results on the real-world DENSE dataset further validate its generalization. The implementation of our DiffFusion will be released as open-source.
zh

[CV-68] FID-Net: A Feature-Enhanced Deep Learning Network for Forest Infestation Detection

【速读】:该论文旨在解决森林病虫害监测中传统方法在大范围、细粒度检测上的局限性,以实现对感染树木的精准识别与侵染模式分析。其解决方案的关键在于提出FID-Net模型,该模型基于YOLOv8n架构,引入轻量级特征增强模块(Feature Enhancement Module, FEM)提取疾病敏感特征,设计自适应多尺度特征融合模块(Adaptive Multi-scale Feature Fusion Module, AMFM)实现RGB通道与FEM增强特征的对齐与融合,并结合高效通道注意力机制(Efficient Channel Attention, ECA)提升判别信息表达能力。在此基础上,构建了基于核密度估计(Kernel Density Estimation)、邻域评估和DBSCAN聚类的空间分析框架,用于定位感染热点、评估健康树风险及识别高密度健康树群作为优先保护区域,从而支持智能虫害监测、预警与精准管理。

链接: https://arxiv.org/abs/2512.13104
作者: Yan Zhang,Baoxin Li,Han Sun,Yuhang Gao,Mingtai Zhang,Pei Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Forest pests threaten ecosystem stability, requiring efficient monitoring. To overcome the limitations of traditional methods in large-scale, fine-grained detection, this study focuses on accurately identifying infected trees and analyzing infestation patterns. We propose FID-Net, a deep learning model that detects pest-affected trees from UAV visible-light imagery and enables infestation analysis via three spatial metrics. Based on YOLOv8n, FID-Net introduces a lightweight Feature Enhancement Module (FEM) to extract disease-sensitive cues, an Adaptive Multi-scale Feature Fusion Module (AMFM) to align and fuse dual-branch features (RGB and FEM-enhanced), and an Efficient Channel Attention (ECA) mechanism to enhance discriminative information efficiently. From detection results, we construct a pest situation analysis framework using: (1) Kernel Density Estimation to locate infection hotspots; (2) neighborhood evaluation to assess healthy trees’ infection risk; (3) DBSCAN clustering to identify high-density healthy clusters as priority protection zones. Experiments on UAV imagery from 32 forest plots in eastern Tianshan, China, show that FID-Net achieves 86.10% precision, 75.44% recall, 82.29% mAP@0.5, and 64.30% mAP@0.5:0.95, outperforming mainstream YOLO models. Analysis confirms infected trees exhibit clear clustering, supporting targeted forest protection. FID-Net enables accurate tree health discrimination and, combined with spatial metrics, provides reliable data for intelligent pest monitoring, early warning, and precise management.
zh

[CV-69] Harmonizing Generalization and Specialization: Uncertainty-Informed Collaborative Learning for Semi-supervised Medical Image Segmentation

【速读】:该论文旨在解决视觉基础模型(Vision Foundation Models)在医疗图像分割任务中,因通用先验与特定临床需求不匹配而导致的泛化能力不足问题,尤其是在标注数据有限或病理变异罕见的情况下。解决方案的关键在于提出不确定性感知的协同学习(Uncertainty-informed Collaborative Learning, UnCoL),其核心机制是通过双教师框架实现知识迁移与任务特异性适应的协同:一方面从冻结的基础模型中蒸馏视觉和语义表示以传递通用知识,另一方面维护一个逐步适应的教师模型以捕捉细粒度的任务特定特征;同时,利用预测不确定性自适应调节伪标签学习过程,选择性抑制不可靠监督信号,从而稳定模糊区域的学习。

链接: https://arxiv.org/abs/2512.13101
作者: Wenjing Lu,Yi Hong,Yang Yang
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: This work has been submitted to the IEEE TMI for possible publication

点击查看摘要

Abstract:Vision foundation models have demonstrated strong generalization in medical image segmentation by leveraging large-scale, heterogeneous pretraining. However, they often struggle to generalize to specialized clinical tasks under limited annotations or rare pathological variations, due to a mismatch between general priors and task-specific requirements. To address this, we propose Uncertainty-informed Collaborative Learning (UnCoL), a dual-teacher framework that harmonizes generalization and specialization in semi-supervised medical image segmentation. Specifically, UnCoL distills both visual and semantic representations from a frozen foundation model to transfer general knowledge, while concurrently maintaining a progressively adapting teacher to capture fine-grained and task-specific representations. To balance guidance from both teachers, pseudo-label learning in UnCoL is adaptively regulated by predictive uncertainty, which selectively suppresses unreliable supervision and stabilizes learning in ambiguous regions. Experiments on diverse 2D and 3D segmentation benchmarks show that UnCoL consistently outperforms state-of-the-art semi-supervised methods and foundation model baselines. Moreover, our model delivers near fully supervised performance with markedly reduced annotation requirements.
zh

[CV-70] ADHint: Adaptive Hints with Difficulty Priors for Reinforcement Learning

【速读】:该论文旨在解决现有基于提示(hint)的强化学习(Reinforcement Learning, RL)方法在训练过程中因忽略样本难度而导致的学习不稳定和过度模仿离策略提示的问题,从而影响模型的知识扩展与推理泛化能力。解决方案的关键在于将“难度”(difficulty)作为核心因素,分别应用于提示比例调度(hint-ratio schedule)和相对优势估计(relative-advantage estimation)两个环节:首先提出基于样本难度先验的自适应提示机制(Adaptive Hint with Sample Difficulty Prior),动态调整每条轨迹的提示比例以引导更有效的探索;其次引入一致性梯度调制与选择性掩码机制(Consistency-based Gradient Modulation and Selective Masking for Hint Preservation),抑制提示区域内token级梯度更新的偏置与破坏性变化;最后设计基于回放难度后验的优势估计方法(Advantage Estimation with Rollout Difficulty Posterior),利用带/不带提示的轨迹相对难度差异来实现更平衡的策略更新。该框架显著提升了模型在多种模态、规模和领域下的推理能力和分布外泛化性能。

链接: https://arxiv.org/abs/2512.13095
作者: Feng Zhang,Zezhong Tan,Xinhong Ma,Ziqiang Dong,Xi Leng,Jianfei Zhao,Xin Sun,Yang Yang
机构: Alibaba Group (阿里巴巴集团); School of Computer Science and Technology, Beijing Institute of Technology (北京理工大学计算机学院); School of Computer Science, Peking University (北京大学计算机学院); School of Science and Engineering, The Chinese University of Hong Kong (Shenzhen) (香港中文大学(深圳)理工学院); Zhongguancun Academy (中关村学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:To combine the advantages of Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), recent methods have integrated ‘‘hints’’ into post-training, which are prefix segments of complete reasoning trajectories, aiming for powerful knowledge expansion and reasoning generalization. However, existing hint-based RL methods typically ignore difficulty when scheduling hint ratios and estimating relative advantages, leading to unstable learning and excessive imitation of off-policy hints. In this work, we propose ADHint, which treats difficulty as a key factor in both hint-ratio schedule and relative-advantage estimation to achieve a better trade-off between exploration and imitation. Specifically, we propose Adaptive Hint with Sample Difficulty Prior, which evaluates each sample’s difficulty under the policy model and accordingly schedules an appropriate hint ratio to guide its rollouts. We also introduce Consistency-based Gradient Modulation and Selective Masking for Hint Preservation to modulate token-level gradients within hints, preventing biased and destructive updates. Additionally, we propose Advantage Estimation with Rollout Difficulty Posterior, which leverages the relative difficulty of rollouts with and without hints to estimate their respective advantages, thereby achieving more balanced updates. Extensive experiments across diverse modalities, model scales, and domains demonstrate that ADHint delivers superior reasoning ability and out-of-distribution generalization, consistently surpassing existing methods in both pass@1 and avg@8. Our code and dataset will be made publicly available upon paper acceptance.
zh

[CV-71] UniVCD: A New Method for Unsupervised Change Detection in the Open-Vocabulary Era

【速读】:该论文旨在解决传统变化检测(Change Detection, CD)方法对标注数据高度依赖、类别固定且泛化能力弱的问题,尤其在多场景和不同成像几何条件下难以有效应用。其核心解决方案是提出一种无监督的开放词汇变化检测方法——UniVCD,该方法基于冻结的视觉基础模型SAM2和CLIP,通过引入轻量级特征对齐模块,将SAM2的空间细节表征与CLIP的语义先验信息进行融合,从而实现高分辨率、语义感知的变化估计,同时保持极少的可训练参数;此外,还设计了简化的后处理流程以抑制噪声和伪变化,提升边界清晰物体的检测精度。实验表明,UniVCD在多个二值变化检测(BCD)和语义变化检测(SCD)基准上均表现优异,验证了利用冻结视觉基础模型与轻量多模态对齐进行无监督开放词汇变化检测的有效性与实用性。

链接: https://arxiv.org/abs/2512.13089
作者: Ziqiang Zhu,Bowei Yang
机构: Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 6 figures

点击查看摘要

Abstract:Change detection (CD) identifies scene changes from multi-temporal observations and is widely used in urban development and environmental monitoring. Most existing CD methods rely on supervised learning, making performance strongly dataset-dependent and incurring high annotation costs; they typically focus on a few predefined categories and generalize poorly to diverse scenes. With the rise of vision foundation models such as SAM2 and CLIP, new opportunities have emerged to relax these constraints. We propose Unified Open-Vocabulary Change Detection (UniVCD), an unsupervised, open-vocabulary change detection method built on frozen SAM2 and CLIP. UniVCD detects category-agnostic changes across diverse scenes and imaging geometries without any labeled data or paired change images. A lightweight feature alignment module is introduced to bridge the spatially detailed representations from SAM2 and the semantic priors from CLIP, enabling high-resolution, semantically aware change estimation while keeping the number of trainable parameters small. On top of this, a streamlined post-processing pipeline is further introduced to suppress noise and pseudo-changes, improving the detection accuracy for objects with well-defined boundaries. Experiments on several public BCD (Binary Change Detection) and SCD (Semantic Change Detection) benchmarks show that UniVCD achieves consistently strong performance and matches or surpasses existing open-vocabulary CD methods in key metrics such as F1 and IoU. The results demonstrate that unsupervised change detection with frozen vision foundation models and lightweight multi-modal alignment is a practical and effective paradigm for open-vocabulary CD. Code and pretrained models will be released at this https URL.
zh

[CV-72] DiRe: Diversity-promoting Regularization for Dataset Condensation WACV2026

【速读】:该论文旨在解决数据集压缩(Dataset Condensation)中合成数据冗余度高、多样性不足的问题。现有方法生成的数据集普遍存在信息重复,限制了其在下游任务中的泛化性能。为此,作者提出一种直观的多样性正则化项(Diversity Regularizer, DiRe),其核心由余弦相似度(cosine similarity)和欧氏距离(Euclidean distance)构成,能够无缝集成到多种前沿压缩方法中。实验表明,引入DiRe可显著提升合成数据集在CIFAR-10至ImageNet-1K等基准数据集上的泛化能力和多样性指标。

链接: https://arxiv.org/abs/2512.13083
作者: Saumyaranjan Mohanty,Aravind Reddy,Konda Reddy Mopuri
机构: Indian Institute of Technology Hyderabad (印度理工学院海得拉巴分校); Indian Institute of Technology Madras (印度理工学院马德拉斯分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to WACV 2026

点击查看摘要

Abstract:In Dataset Condensation, the goal is to synthesize a small dataset that replicates the training utility of a large original dataset. Existing condensation methods synthesize datasets with significant redundancy, so there is a dire need to reduce redundancy and improve the diversity of the synthesized datasets. To tackle this, we propose an intuitive Diversity Regularizer (DiRe) composed of cosine similarity and Euclidean distance, which can be applied off-the-shelf to various state-of-the-art condensation methods. Through extensive experiments, we demonstrate that the addition of our regularizer improves state-of-the-art condensation methods on various benchmark datasets from CIFAR-10 to ImageNet-1K with respect to generalization and diversity metrics.
zh

[CV-73] Forging a Dynamic Memory: Retrieval-Guided Continual Learning for Generalist Medical Foundation Models

【速读】:该论文旨在解决多模态生物医学视觉-语言模型(Multimodal Biomedical Vision-Language Models, VLMs)在持续学习(Continual Learning, CL)中面临的核心难题:如何在保持模态内细粒度特征的同时,弥合不同模态间的显著领域差异(domain gap)。其解决方案的关键在于提出一个综合框架,融合检索增强生成(Retrieval-Augmented Generation, RAG)与动态知识蒸馏机制。具体而言,作者构建了一个基于1800万条医学文献的多模态检索数据库,并引入多层、多模态RAG系统,在模型微调过程中提供实时知识引导;同时设计了一种动态知识蒸馏框架,通过自适应调节参数空间重要性、蒸馏知识粒度及参考数据分布,精准平衡跨模态对齐与模态内特征保留之间的矛盾。该方法在新提出的医学通用任务增量学习(Medical Generalist Task Incremental Learning, MGTIL)基准上实现了SOTA性能,验证了其临床价值。

链接: https://arxiv.org/abs/2512.13072
作者: Zizhi Chen,Yizhen Gao,Minghao Han,Yizhou Liu,Zhaoyu Chen,Dingkang Yang,Lihua Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal biomedical Vision-Language Models (VLMs) exhibit immense potential in the field of Continual Learning (CL). However, they confront a core dilemma: how to preserve fine-grained intra-modality features while bridging the significant domain gap across different modalities. To address this challenge, we propose a comprehensive framework. Leveraging our 18-million multimodal and comprehensive medical retrieval database derived from PubMed scientific papers, we pioneer the integration of Retrieval-Augmented Generation (RAG) into CL. Specifically, we employ a multi-modal, multi-layer RAG system that provides real-time guidance for model fine-tuning through dynamic, on-demand knowledge retrieval. Building upon this, we introduce a dynamic knowledge distillation framework. This framework precisely resolves the aforementioned core dilemma by dynamically modulating the importance of the parameter space, the granularity of the distilled knowledge, and the data distribution of the reference dataset in accordance with the required level of detail. To thoroughly validate the clinical value of our strategy, we have designed a more rigorous \textbfMedical Generalist Task Incremental Learning (MGTIL) benchmark. This benchmark is engineered to simultaneously evaluate the model’s capacity for adaptation to significant domain shifts, retention of subtle intra-domain features, and real-time learning of novel and complex medical tasks. Extensive experimental results demonstrate that our proposed method achieves state-of-the-art (SOTA) performance across all metrics. The code is provided in the supplementary materials.
zh

[CV-74] owards Test-time Efficient Visual Place Recognition via Asymmetric Query Processing AAAI2026

【速读】:该论文旨在解决视觉定位识别(Visual Place Recognition, VPR)中高容量基础模型(如DINOv2)在资源受限设备上部署时计算成本过高的问题。其核心解决方案是提出一种高效的非对称VPR框架,其中使用一个高容量的画廊模型(gallery model)进行离线特征提取,搭配一个轻量级查询网络(query network)实现在线处理;关键创新在于引入地理记忆库(geographical memory bank),利用VPR数据库中固有的地理位置元数据结构化组织画廊特征,从而避免了传统方法中耗时的k-NN兼容性训练过程;同时,设计了一种隐式嵌入增强技术(implicit embedding augmentation),提升轻量查询网络对特征变化的建模能力,即便其容量有限也能实现高性能匹配。

链接: https://arxiv.org/abs/2512.13055
作者: Jaeyoon Kim,Yoonki Cho,Sung-Eui Yoon
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: AAAI 2026

点击查看摘要

Abstract:Visual Place Recognition (VPR) has advanced significantly with high-capacity foundation models like DINOv2, achieving remarkable performance. Nonetheless, their substantial computational cost makes deployment on resource-constrained devices impractical. In this paper, we introduce an efficient asymmetric VPR framework that incorporates a high-capacity gallery model for offline feature extraction with a lightweight query network for online processing. A key challenge in this setting is ensuring compatibility between these heterogeneous networks, which conventional approaches address through computationally expensive k-NN-based compatible training. To overcome this, we propose a geographical memory bank that structures gallery features using geolocation metadata inherent in VPR databases, eliminating the need for exhaustive k-NN computations. Additionally, we introduce an implicit embedding augmentation technique that enhances the query network to model feature variations despite its limited capacity. Extensive experiments demonstrate that our method not only significantly reduces computational costs but also outperforms existing asymmetric retrieval techniques, establishing a new aspect for VPR in resource-limited environments. The code is available at this https URL
zh

[CV-75] GTR-Turbo: Merged Checkpoint is Secretly a Free Teacher for Agent ic VLM Training

【速读】:该论文旨在解决多轮强化学习(multi-turn reinforcement learning, RL)在基于视觉语言模型(vision-language models, VLMs)的多模态智能体中面临的稀疏奖励(sparse rewards)与长程信用分配(long-horizon credit assignment)难题。现有方法如Guided Thought Reinforcement(GTR)通过引入教师模型提供逐步反馈来增强奖励密度,但其依赖昂贵且常为特权模型(如GPT或Gemini)作为教师,限制了实用性与可复现性。论文提出的解决方案核心在于GTR-Turbo:它在训练过程中合并多个检查点(checkpoints)的权重,并将融合后的模型作为“免费”教师,用于后续RL阶段的监督微调或软logit蒸馏(soft logit distillation),从而无需额外训练或查询昂贵教师模型,同时缓解了先前方法中出现的“熵崩溃”(entropy collapse)问题,显著提升了训练稳定性、效率和性能——在多种视觉代理任务中实现准确率提升10–30%,训练时间减少50%,计算成本降低60%。

链接: https://arxiv.org/abs/2512.13043
作者: Tong Wei,Yijun Yang,Changhao Zhang,Junliang Xing,Yuanchun Shi,Zongqing Lu,Deheng Ye
机构: Tsinghua University (清华大学); Tencent AI Lab (腾讯AI实验室); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-turn reinforcement learning (RL) for multi-modal agents built upon vision-language models (VLMs) is hampered by sparse rewards and long-horizon credit assignment. Recent methods densify the reward by querying a teacher that provides step-level feedback, e.g., Guided Thought Reinforcement (GTR) and On-Policy Distillation, but rely on costly, often privileged models as the teacher, limiting practicality and reproducibility. We introduce GTR-Turbo, a highly efficient upgrade to GTR, which matches the performance without training or querying an expensive teacher model. Specifically, GTR-Turbo merges the weights of checkpoints produced during the ongoing RL training, and then uses this merged model as a “free” teacher to guide the subsequent RL via supervised fine-tuning or soft logit distillation. This design removes dependence on privileged VLMs (e.g., GPT or Gemini), mitigates the “entropy collapse” observed in prior work, and keeps training stable. Across diverse visual agentic tasks, GTR-Turbo improves the accuracy of the baseline model by 10-30% while reducing wall-clock training time by 50% and compute cost by 60% relative to GTR.
zh

[CV-76] Bi-Erasing: A Bidirectional Framework for Concept Removal in Diffusion Models

【速读】:该论文旨在解决扩散模型在文本到图像生成过程中因存在有害或非法视觉概念而导致的安全性问题,现有方法多采用单向擦除策略(如仅抑制目标概念或强化安全替代项),难以在概念移除效果与生成质量之间实现平衡。其解决方案的关键在于提出一种双向图像引导的概念擦除框架(Bidirectional Image-Guided Concept Erasure, Bi-Erasing),通过引入两个解耦的图像分支——负向分支用于抑制有害语义,正向分支提供安全替代的视觉引导——并联合优化这两个互补方向,从而在保持生成图像质量的同时提升概念擦除的有效性;此外,还采用基于掩码的过滤机制以避免无关内容干扰擦除过程。

链接: https://arxiv.org/abs/2512.13039
作者: Hao Chen,Yiwei Wang,Songze Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注: Under Review

点击查看摘要

Abstract:Concept erasure, which fine-tunes diffusion models to remove undesired or harmful visual concepts, has become a mainstream approach to mitigating unsafe or illegal image generation in text-to-image this http URL, existing removal methods typically adopt a unidirectional erasure strategy by either suppressing the target concept or reinforcing safe alternatives, making it difficult to achieve a balanced trade-off between concept removal and generation quality. To address this limitation, we propose a novel Bidirectional Image-Guided Concept Erasure (Bi-Erasing) framework that performs concept suppression and safety enhancement simultaneously. Specifically, based on the joint representation of text prompts and corresponding images, Bi-Erasing introduces two decoupled image branches: a negative branch responsible for suppressing harmful semantics and a positive branch providing visual guidance for safe alternatives. By jointly optimizing these complementary directions, our approach achieves a balance between erasure efficacy and generation usability. In addition, we apply mask-based filtering to the image branches to prevent interference from irrelevant content during the erasure process. Across extensive experiment evaluations, the proposed Bi-Erasing outperforms baseline methods in balancing concept removal effectiveness and visual fidelity.
zh

[CV-77] Comprehensive Evaluation of Rule-Based Machine Learning and Deep Learning in Human Estimation Using Radio Wave Sensing: Accuracy Spatial Generalization and Output Granularity Trade-offs

【速读】:该论文旨在解决基于频移连续波多输入多输出雷达(Frequency Modulated Continuous Wave Multiple Input Multiple Output Radar)的无线电波感知中,不同模型在环境变化下的性能稳定性与输出粒度之间的权衡问题。其关键解决方案在于系统性地比较了规则驱动方法、传统机器学习模型(k近邻、随机森林、支持向量机)与深度学习模型(卷积神经网络-长短期记忆网络组合)在两种不同室内布局中的表现,揭示出高容量模型虽能在同一环境中实现高精度细粒度输出,但对领域偏移(domain shift)敏感;而规则驱动方法虽无法提供细粒度输出,却展现出更强的鲁棒性。这一发现明确了模型选择需根据应用场景对泛化能力与输出精细度的需求进行权衡。

链接: https://arxiv.org/abs/2512.13031
作者: Tomoya Tanaka,Tomonori Ikeda,Ryo Yonemoto
机构: Georgia Institute of Technology (佐治亚理工学院); SoftBank Corp. (软银公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 5 figures. A comprehensive comparison of rule-based, machine learning, and deep learning approaches for human estimation using FMCW MIMO radar, focusing on accuracy, spatial generalization, and output granularity

点击查看摘要

Abstract:This study presents the first comprehensive comparison of rule-based methods, traditional machine learning models, and deep learning models in radio wave sensing with frequency modulated continuous wave multiple input multiple output radar. We systematically evaluated five approaches in two indoor environments with distinct layouts: a rule-based connected component method; three traditional machine learning models, namely k-nearest neighbors, random forest, and support vector machine; and a deep learning model combining a convolutional neural network and long short term memory. In the training environment, the convolutional neural network long short term memory model achieved the highest accuracy, while traditional machine learning models provided moderate performance. In a new layout, however, all learning based methods showed significant degradation, whereas the rule-based method remained stable. Notably, for binary detection of presence versus absence of people, all models consistently achieved high accuracy across layouts. These results demonstrate that high capacity models can produce fine grained outputs with high accuracy in the same environment, but they are vulnerable to domain shift. In contrast, rule-based methods cannot provide fine grained outputs but exhibit robustness against domain shift. Moreover, regardless of the model type, a clear trade off was revealed between spatial generalization performance and output granularity.
zh

[CV-78] Motus: A Unified Latent Action World Model

【速读】:该论文旨在解决当前具身智能体(embodied agent)中多模态能力碎片化问题,即现有方法依赖孤立的模型分别处理理解、世界建模与控制任务,导致难以整合生成式 AI(Generative AI)能力并阻碍从大规模异构数据中学习。其核心解决方案是提出 Motus——一个统一的潜在动作世界模型(unified latent action world model),通过引入 Mixture-of-Transformer(MoT)架构融合理解、视频生成与动作三个专家模块,并采用 UniDiffuser 风格调度器实现多种建模模式间的灵活切换(包括世界模型、视觉语言动作模型、逆动力学模型等)。此外,Motus 利用光流(optical flow)学习潜在动作,设计三阶段训练流程与六层数据金字塔结构,从而提取像素级“动作差值”(delta action),支持大规模动作预训练,显著提升仿真与真实场景下的机器人下游任务性能。

链接: https://arxiv.org/abs/2512.13030
作者: Hongzhe Bi,Hengkai Tan,Shenghao Xie,Zeyuan Wang,Shuhe Huang,Haitian Liu,Ruowen Zhao,Yao Feng,Chendong Xiang,Yinze Rong,Hongyan Zhao,Hanyu Liu,Zhizhong Su,Lei Ma,Hang Su,Jun Zhu
机构: Tsinghua University (清华大学); Peking University (北京大学); Horizon Robotics ( horizon 机器人)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:While a general embodied agent must function as a unified system, current methods are built on isolated models for understanding, world modeling, and control. This fragmentation prevents unifying multimodal generative capabilities and hinders learning from large-scale, heterogeneous data. In this paper, we propose Motus, a unified latent action world model that leverages existing general pretrained models and rich, sharable motion information. Motus introduces a Mixture-of-Transformer (MoT) architecture to integrate three experts (i.e., understanding, video generation, and action) and adopts a UniDiffuser-style scheduler to enable flexible switching between different modeling modes (i.e., world models, vision-language-action models, inverse dynamics models, video generation models, and video-action joint prediction models). Motus further leverages the optical flow to learn latent actions and adopts a recipe with three-phase training pipeline and six-layer data pyramid, thereby extracting pixel-level “delta action” and enabling large-scale action pretraining. Experiments show that Motus achieves superior performance against state-of-the-art methods in both simulation (a +15% improvement over X-VLA and a +45% improvement over Pi0.5) and real-world scenarios(improved by +11~48%), demonstrating unified modeling of all functionalities and priors significantly benefits downstream robotic tasks.
zh

[CV-79] SneakPeek: Future-Guided Instructional Streaming Video Generation

【速读】:该论文旨在解决当前视频扩散模型在生成长序列多步骤操作视频时面临的时空一致性差和可控性弱的问题(temporal consistency and controllability across long sequences of multiple action steps)。其解决方案的关键在于提出一种名为SneakPeek的基于扩散的自回归流水线框架,通过三项核心创新实现:(1) 预测性因果适应(predictive causal adaptation),利用因果模型进行帧预测并提前感知关键帧;(2) 未来引导的自强制机制结合双区域键值缓存(future-guided self-forcing with a dual-region KV caching scheme),缓解推理阶段的暴露偏差问题;(3) 多提示条件控制(multi-prompt conditioning),实现对多步骤指令的细粒度与流程化控制。上述设计共同提升了视频生成的时序稳定性、运动连贯性,并支持交互式动态更新未来提示以影响正在进行的流式生成过程。

链接: https://arxiv.org/abs/2512.13019
作者: Cheeun Hong,German Barquero,Fadime Sener,Markos Georgopoulos,Edgar Schönfeld,Stefan Popov,Yuming Du,Oscar Mañas,Albert Pumarola
机构: Meta Superintelligence Labs (Meta 超智能实验室); Seoul National University (首尔国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Instructional video generation is an emerging task that aims to synthesize coherent demonstrations of procedural activities from textual descriptions. Such capability has broad implications for content creation, education, and human-AI interaction, yet existing video diffusion models struggle to maintain temporal consistency and controllability across long sequences of multiple action steps. We introduce a pipeline for future-driven streaming instructional video generation, dubbed SneakPeek, a diffusion-based autoregressive framework designed to generate precise, stepwise instructional videos conditioned on an initial image and structured textual prompts. Our approach introduces three key innovations to enhance consistency and controllability: (1) predictive causal adaptation, where a causal model learns to perform next-frame prediction and anticipate future keyframes; (2) future-guided self-forcing with a dual-region KV caching scheme to address the exposure bias issue at inference time; (3) multi-prompt conditioning, which provides fine-grained and procedural control over multi-step instructions. Together, these components mitigate temporal drift, preserve motion consistency, and enable interactive video generation where future prompt updates dynamically influence ongoing streaming video generation. Experimental results demonstrate that our method produces temporally coherent and semantically faithful instructional videos that accurately follow complex, multi-step task descriptions.
zh

[CV-80] Comprehensive Deployment-Oriented Assessment for Cross-Environment Generalization in Deep Learning-Based mmWave Radar Sensing

【速读】:该论文旨在解决深度学习驱动的射频(RF)感知系统在实际部署中面临的跨空间场景泛化能力不足的问题,特别是在室内环境中利用调频连续波(FMCW)多输入多输出(MIMO)雷达进行人员计数任务时。其核心解决方案在于通过集成幅度加权预处理与高效迁移学习策略来提升模型在不同物理布局环境下的鲁棒性:具体而言,基于sigmoid函数的幅度加权方法在跨环境测试中表现最优,相较基线方法分别降低50.1%的均方根误差(RMSE)和55.2%的平均绝对误差(MAE);而针对较大空间位移场景,迁移学习则展现出不可替代的作用,仅需540个目标域样本即可实现82.1%(RMSE)和91.3%(MAE)的显著性能提升。

链接: https://arxiv.org/abs/2512.13018
作者: Tomoya Tanaka,Tomonori Ikeda,Ryo Yonemoto
机构: Georgia Institute of Technology (佐治亚理工学院); SoftBank Corp. (软银公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 8 pages, 6 figures. Comprehensive evaluation of preprocessing, data augmentation, and transfer learning for cross-environment generalization in deep learning-based mmWave radar sensing

点击查看摘要

Abstract:This study presents the first comprehensive evaluation of spatial generalization techniques, which are essential for the practical deployment of deep learning-based radio-frequency (RF) sensing. Focusing on people counting in indoor environments using frequency-modulated continuous-wave (FMCW) multiple-input multiple-output (MIMO) radar, we systematically investigate a broad set of approaches, including amplitude-based statistical preprocessing (sigmoid weighting and threshold zeroing), frequency-domain filtering, autoencoder-based background suppression, data augmentation strategies, and transfer learning. Experimental results collected across two environments with different layouts demonstrate that sigmoid-based amplitude weighting consistently achieves superior cross-environment performance, yielding 50.1% and 55.2% reductions in root-mean-square error (RMSE) and mean absolute error (MAE), respectively, compared with baseline methods. Data augmentation provides additional though modest benefits, with improvements up to 8.8% in MAE. By contrast, transfer learning proves indispensable for large spatial shifts, achieving 82.1% and 91.3% reductions in RMSE and MAE, respectively, with 540 target-domain samples. Taken together, these findings establish a highly practical direction for developing radar sensing systems capable of maintaining robust accuracy under spatial variations by integrating deep learning models with amplitude-based preprocessing and efficient transfer learning.
zh

[CV-81] What Happens Next? Next Scene Prediction with a Unified Video Model

【速读】:该论文旨在解决当前统一视频模型在时间推理(temporal reasoning)能力上的不足问题,即现有模型多聚焦于文本到视频生成等常规任务,而未能充分挖掘其在因果推理和未来预测方面的潜力。为应对这一挑战,作者提出了一种新的任务——下一场景预测(Next Scene Prediction, NSP),该任务要求模型基于前序上下文推断合理的未来场景,从而促进对时序逻辑的深层理解与推理。解决方案的关键在于构建一个统一框架,融合Qwen-VL用于感知理解与LTX用于生成合成,并通过潜在查询嵌入(latent query embedding)和连接模块(connector module)实现两者的高效协同;同时,在自建的大规模NSP数据集上分三阶段训练:文本到视频预训练、监督微调及基于因果一致性奖励(causal consistency reward)的强化学习(GRPO),最终显著提升了通用多模态系统对未来事件的预测能力。

链接: https://arxiv.org/abs/2512.13015
作者: Xinjie Li,Zhimin Chen,Rui Zhao,Florian Schiffers,Zhenyu Liao,Vimal Bhat
机构: Pennsylvania State University (宾夕法尼亚州立大学); Amazon (亚马逊)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent unified models for joint understanding and generation have significantly advanced visual generation capabilities. However, their focus on conventional tasks like text-to-video generation has left the temporal reasoning potential of unified models largely underexplored. To address this gap, we introduce Next Scene Prediction (NSP), a new task that pushes unified video models toward temporal and causal reasoning. Unlike text-to-video generation, NSP requires predicting plausible futures from preceding context, demanding deeper understanding and reasoning. To tackle this task, we propose a unified framework combining Qwen-VL for comprehension and LTX for synthesis, bridged by a latent query embedding and a connector module. This model is trained in three stages on our newly curated, large-scale NSP dataset: text-to-video pre-training, supervised fine-tuning, and reinforcement learning (via GRPO) with our proposed causal consistency reward. Experiments demonstrate our model achieves state-of-the-art performance on our benchmark, advancing the capability of generalist multimodal systems to anticipate what happens next.
zh

[CV-82] JoDiffusion: Jointly Diffusing Image with Pixel-Level Annotations for Semantic Segmentation Promotion AAAI2026

【速读】:该论文旨在解决现有合成数据生成方法在语义分割任务中面临的两大问题:一是图像与标注之间的语义不一致性(即生成图像后需预测伪标注,导致标注质量不可控);二是可扩展性不足(即依赖人工标注掩膜进行图像生成,难以大规模扩展)。其解决方案的关键在于提出一种名为JoDiffusion的联合扩散框架,通过引入一个独立的标注变分自编码器(annotation variational auto-encoder, VAE),将标注掩膜映射至与图像共享的潜在空间,并对扩散模型进行改造以学习图像与其标注掩膜的联合分布,从而仅基于文本提示即可同时生成语义一致的图像和标注掩膜,显著提升了生成效率与一致性。此外,还设计了掩膜优化策略以降低生成过程中的标注噪声,进一步保障数据质量。

链接: https://arxiv.org/abs/2512.13014
作者: Haoyu Wang,Lei Zhang,Wenrui Liu,Dengyang Jiang,Wei Wei,Chen Ding
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at AAAI 2026

点击查看摘要

Abstract:Given the inherently costly and time-intensive nature of pixel-level annotation, the generation of synthetic datasets comprising sufficiently diverse synthetic images paired with ground-truth pixel-level annotations has garnered increasing attention recently for training high-performance semantic segmentation models. However, existing methods necessitate to either predict pseudo annotations after image generation or generate images conditioned on manual annotation masks, which incurs image-annotation semantic inconsistency or scalability problem. To migrate both problems with one stone, we present a novel dataset generative diffusion framework for semantic segmentation, termed JoDiffusion. Firstly, given a standard latent diffusion model, JoDiffusion incorporates an independent annotation variational auto-encoder (VAE) network to map annotation masks into the latent space shared by images. Then, the diffusion model is tailored to capture the joint distribution of each image and its annotation mask conditioned on a text prompt. By doing these, JoDiffusion enables simultaneously generating paired images and semantically consistent annotation masks solely conditioned on text prompts, thereby demonstrating superior scalability. Additionally, a mask optimization strategy is developed to mitigate the annotation noise produced during generation. Experiments on Pascal VOC, COCO, and ADE20K datasets show that the annotated dataset generated by JoDiffusion yields substantial performance improvements in semantic segmentation compared to existing methods.
zh

[CV-83] WLR: Text-Guided Weakly-Supervised Lesion Localization and Severity Regression for Explainable Diabetic Retinopathy Grading

【速读】:该论文旨在解决医学图像分析中两个关键问题:一是获取像素级标注的成本高、耗时长,尤其在眼底图像(fundus images)领域;二是深度学习模型缺乏可解释性,限制了其在临床实践中的应用。解决方案的核心在于提出一个两阶段的可解释性糖尿病视网膜病变(Diabetic Retinopathy, DR)评估框架TWLR:第一阶段利用视觉-语言模型将眼科专业知识融入文本嵌入,实现DR分级与病变分类的联合推理,建立语义医学概念与视觉特征之间的映射;第二阶段引入基于弱监督语义分割的迭代严重程度回归机制,通过生成病变显著性图并驱动渐进式修复(inpainting)过程,系统性地去除病灶特征以逐步降低疾病严重程度,从而在无需像素级标注的情况下实现精准定位,并提供从病理状态到健康状态的可视化转换路径,兼顾性能与可解释性。

链接: https://arxiv.org/abs/2512.13008
作者: Xi Luo,Shixin Xu,Ying Xie,JianZhong Hu,Yuwei He,Yuhui Deng,Huaxiong Huang
机构: Guangdong Provincial Key Laboratory of Interdisciplinary Research and Application for Data Science; Department of Statistics and Data Science, Beijing Normal-Hong Kong Baptist University; Faculty of Science, Hong Kong Baptist University; Data Science Research Center, Duke Kunshan University; Shanxi Provincial People’s Hospital; The Fifth Clinical Medical school of Shanxi Medical University; Research Center for Mathematics, Beijing Normal University; Department of Mathematics and Statistics, York University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate medical image analysis can greatly assist clinical diagnosis, but its effectiveness relies on high-quality expert annotations Obtaining pixel-level labels for medical images, particularly fundus images, remains costly and time-consuming. Meanwhile, despite the success of deep learning in medical imaging, the lack of interpretability limits its clinical adoption. To address these challenges, we propose TWLR, a two-stage framework for interpretable diabetic retinopathy (DR) assessment. In the first stage, a vision-language model integrates domain-specific ophthalmological knowledge into text embeddings to jointly perform DR grading and lesion classification, effectively linking semantic medical concepts with visual features. The second stage introduces an iterative severity regression framework based on weakly-supervised semantic segmentation. Lesion saliency maps generated through iterative refinement direct a progressive inpainting mechanism that systematically eliminates pathological features, effectively downgrading disease severity toward healthier fundus appearances. Critically, this severity regression approach achieves dual benefits: accurate lesion localization without pixel-level supervision and providing an interpretable visualization of disease-to-healthy transformations. Experimental results on the FGADR, DDR, and a private dataset demonstrate that TWLR achieves competitive performance in both DR classification and lesion segmentation, offering a more explainable and annotation-efficient solution for automated retinal image analysis.
zh

[CV-84] Light Field Based 6DoF Tracking of Previously Unobserved Objects

【速读】:该论文旨在解决机器人和自动驾驶系统中对象跟踪(object tracking)的泛化能力问题,特别是针对未见过且视觉复杂的物体时传统方法性能下降的问题。现有高性能跟踪方法通常依赖于预捕获的物体视角来构建显式参考模型,限制了其对已知物体集合的适用性,并在面对高复杂度外观(如反射)时表现不佳。解决方案的关键在于提出一种基于光场图像(light field images)的跟踪方法,该方法无需预训练模型即可实现鲁棒跟踪;通过视觉基础模型提取语义与几何特征,并将其转换为视图相关的高斯点(view-dependent Gaussian splats),形成统一的对象表示,支持可微渲染与位姿优化,从而有效应对复杂视觉行为(如反射)。

链接: https://arxiv.org/abs/2512.13007
作者: Nikolai Goncharov,James L. Gray,Donald G. Dansereau
机构: University of Sydney (悉尼大学); Australian Centre For Robotics (ACFR) (澳大利亚机器人中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Object tracking is an important step in robotics and reautonomous driving pipelines, which has to generalize to previously unseen and complex objects. Existing high-performing methods often rely on pre-captured object views to build explicit reference models, which restricts them to a fixed set of known objects. However, such reference models can struggle with visually complex appearance, reducing the quality of tracking. In this work, we introduce an object tracking method based on light field images that does not depend on a pre-trained model, while being robust to complex visual behavior, such as reflections. We extract semantic and geometric features from light field inputs using vision foundation models and convert them into view-dependent Gaussian splats. These splats serve as a unified object representation, supporting differentiable rendering and pose optimization. We further introduce a light field object tracking dataset containing challenging reflective objects with precise ground truth poses. Experiments demonstrate that our method is competitive with state-of-the-art model-based trackers in these difficult cases, paving the way toward universal object tracking in robotic systems. Code/data available at this https URL.
zh

[CV-85] Few-Step Distillation for Text-to-Image Generation: A Practical Guide

【速读】:该论文旨在解决生成式 AI(Generative AI)中扩散模型在开放文本到图像(text-to-image, T2I)生成任务中的效率与性能平衡问题,特别是如何将已有的类条件图像合成加速技术——即扩散蒸馏(diffusion distillation)——有效迁移至基于自由形式语言提示的 T2I 生成场景。其解决方案的关键在于:首次系统性地将最先进的蒸馏方法适配并比较于一个强大的 T2I 教师模型 FLUX.1-lite,并通过构建统一框架识别从离散类别标签向自然语言提示过渡时所面临的核心障碍;同时提供输入缩放、网络结构和超参数设置等实用指南,从而为实际部署高效、高保真且资源友好的扩散生成器奠定基础。

链接: https://arxiv.org/abs/2512.13006
作者: Yifan Pu,Yizeng Han,Zhiwei Tang,Jiasheng Tang,Fan Wang,Bohan Zhuang,Gao Huang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion distillation has dramatically accelerated class-conditional image synthesis, but its applicability to open-ended text-to-image (T2I) generation is still unclear. We present the first systematic study that adapts and compares state-of-the-art distillation techniques on a strong T2I teacher model, FLUX.1-lite. By casting existing methods into a unified framework, we identify the key obstacles that arise when moving from discrete class labels to free-form language prompts. Beyond a thorough methodological analysis, we offer practical guidelines on input scaling, network architecture, and hyperparameters, accompanied by an open-source implementation and pretrained student models. Our findings establish a solid foundation for deploying fast, high-fidelity, and resource-efficient diffusion generators in real-world T2I applications. Code is available on this http URL.
zh

[CV-86] Calibrating Uncertainty for Zero-Shot Adversarial CLIP

【速读】:该论文旨在解决生成式 AI(Generative AI)模型CLIP在面对对抗攻击时存在的可靠性问题,即其预测不确定性被抑制而非增强,导致严重校准偏差和过度自信的问题。这一现象揭示了现有方法仅关注预测准确率对齐而忽视不确定性校准的局限性。解决方案的关键在于提出一种新的对抗微调目标,通过将CLIP输出重新参数化为Dirichlet分布的浓度参数,从而统一建模语义结构与预测置信度的幅度,并在扰动下整体对齐这些分布,超越传统单logit锚定方式,有效恢复校准后的不确定性,同时保持干净样本上的准确率并实现有竞争力的对抗鲁棒性。

链接: https://arxiv.org/abs/2512.12997
作者: Wenjing lu,Zerui Tao,Dongping Zhang,Yuning Qiu,Yang Yang,Qibin Zhao
机构: RIKEN Center for Advanced Intelligence Project (RIKEN AIP); Shanghai Jiao Tong University (上海交通大学); Guangdong University of Technology (广东工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:CLIP delivers strong zero-shot classification but remains highly vulnerable to adversarial attacks. Previous work of adversarial fine-tuning largely focuses on matching the predicted logits between clean and adversarial examples, which overlooks uncertainty calibration and may degrade the zero-shot generalization. A common expectation in reliable uncertainty estimation is that predictive uncertainty should increase as inputs become more difficult or shift away from the training distribution. However, we frequently observe the opposite in the adversarial setting: perturbations not only degrade accuracy but also suppress uncertainty, leading to severe miscalibration and unreliable over-confidence. This overlooked phenomenon highlights a critical reliability gap beyond robustness. To bridge this gap, we propose a novel adversarial fine-tuning objective for CLIP considering both prediction accuracy and uncertainty alignments. By reparameterizing the output of CLIP as the concentration parameter of a Dirichlet distribution, we propose a unified representation that captures relative semantic structure and the magnitude of predictive confidence. Our objective aligns these distributions holistically under perturbations, moving beyond single-logit anchoring and restoring calibrated uncertainty. Experiments on multiple zero-shot classification benchmarks demonstrate that our approach effectively restores calibrated uncertainty and achieves competitive adversarial robustness while maintaining clean accuracy.
zh

[CV-87] ackling Snow-Induced Challenges: Safe Autonomous Lane-Keeping with Robust Reinforcement Learning

【速读】:该论文旨在解决自动驾驶车辆(AV)在雪地道路条件下因路面不确定性与打滑导致的车道保持系统(LKS)性能下降问题。解决方案的关键在于提出两种基于深度强化学习(DRL)的动作鲁棒性方法:Action-Robust Recurrent Deep Deterministic Policy Gradient (AR-RDPG) 和 Action-Robust convolutional neural network Attention Deterministic Policy Gradient (AR-CADPG)。其中,AR-RDPG通过多尺度神经网络去噪图像并提取车道中心线系数作为输入,结合车辆动态特性进行控制决策;而AR-CADPG则采用端到端架构,融合卷积神经网络(CNN)与注意力机制于DRL框架中,有效整合时序记忆、对抗鲁棒性和注意力机制,显著提升了路径跟踪精度与系统稳定性。

链接: https://arxiv.org/abs/2512.12987
作者: Amin Jalal Aghdasian,Farzaneh Abdollahi,Ali Kamali Iglie
机构: Amirkabir University of Technology (阿米尔卡比尔理工大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This paper proposes two new algorithms for the lane keeping system (LKS) in autonomous vehicles (AVs) operating under snowy road conditions. These algorithms use deep reinforcement learning (DRL) to handle uncertainties and slippage. They include Action-Robust Recurrent Deep Deterministic Policy Gradient (AR-RDPG) and end-to-end Action-Robust convolutional neural network Attention Deterministic Policy Gradient (AR-CADPG), two action-robust approaches for decision-making. In the AR-RDPG method, within the perception layer, camera images are first denoised using multi-scale neural networks. Then, the centerline coefficients are extracted by a pre-trained deep convolutional neural network (DCNN). These coefficients, concatenated with the driving characteristics, are used as input to the control layer. The AR-CADPG method presents an end-to-end approach in which a convolutional neural network (CNN) and an attention mechanism are integrated within a DRL framework. Both methods are first trained in the CARLA simulator and validated under various snowy scenarios. Real-world experiments on a Jetson Nano-based autonomous vehicle confirm the feasibility and stability of the learned policies. Among the two models, the AR-CADPG approach demonstrates superior path-tracking accuracy and robustness, highlighting the effectiveness of combining temporal memory, adversarial resilience, and attention mechanisms in AVs.
zh

[CV-88] VoroLight: Learning Quality Volumetric Voronoi Meshes from General Inputs

【速读】:该论文旨在解决从多样化输入(如图像、隐式形状水平集场、点云和网格)中重建高质量、拓扑一致的三维(3D)形状问题,尤其关注生成平滑且封闭(watertight)的表面以及具有拓扑一致性的体素网格。其解决方案的关键在于提出一种可微分的Voronoi网格化框架VoroLight,该框架通过三个阶段实现:首先利用可微分Voronoi公式初始化表面;其次通过多边形面球训练阶段优化表面质量;最后复用可微分Voronoi公式进行包含额外内部生成点的体积优化,从而在保持几何精度的同时确保拓扑一致性。

链接: https://arxiv.org/abs/2512.12984
作者: Jiayin Lu,Ying Jiang,Yin Yang,Chenfanfu Jiang
机构: 未知
类目: Computational Geometry (cs.CG); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:We present VoroLight, a differentiable framework for 3D shape reconstruction based on Voronoi meshing. Our approach generates smooth, watertight surfaces and topologically consistent volumetric meshes directly from diverse inputs, including images, implicit shape level-set fields, point clouds and meshes. VoroLight operates in three stages: it first initializes a surface using a differentiable Voronoi formulation, then refines surface quality through a polygon-face sphere training stage, and finally reuses the differentiable Voronoi formulation for volumetric optimization with additional interior generator points. Project page: this https URL
zh

[CV-89] Scaling Up AI-Generated Image Detection via Generator-Aware Prototypes

【速读】:该论文旨在解决通用AI生成图像(AIGI)检测器在扩展源生成器多样性时出现的性能瓶颈问题,即“先增益后冲突”(Benefit then Conflict)现象——随着训练数据中生成器种类增多,检测器性能先提升后下降。其核心挑战在于两个方面:一是数据层面的严重异质性导致真实与合成图像特征分布重叠;二是模型层面固定预训练编码器无法适应复杂度上升带来的识别困难。解决方案的关键是提出一种生成器感知原型学习(Generator-Aware Prototype Learning, GAPL)框架,通过结构化学习机制构建一组紧凑的典型伪造原型,建立统一且低方差的特征空间以缓解数据异质性问题;同时采用两阶段训练策略结合低秩适配(Low-Rank Adaptation),增强模型判别能力并保留预训练知识,从而构建更鲁棒、泛化的决策边界。

链接: https://arxiv.org/abs/2512.12982
作者: Ziheng Qin,Yuheng Ji,Renshuai Tao,Yuxuan Tian,Yuyang Liu,Yipu Wang,Xiaolong Zheng
机构: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); Institute of Information Science, Beijing Jiaotong University (北京交通大学信息科学研究院); School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院); School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences (中国科学院大学交叉科学学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The pursuit of a universal AI-generated image (AIGI) detector often relies on aggregating data from numerous generators to improve generalization. However, this paper identifies a paradoxical phenomenon we term the Benefit then Conflict dilemma, where detector performance stagnates and eventually degrades as source diversity expands. Our systematic analysis, diagnoses this failure by identifying two core issues: severe data-level heterogeneity, which causes the feature distributions of real and synthetic images to increasingly overlap, and a critical model-level bottleneck from fixed, pretrained encoders that cannot adapt to the rising complexity. To address these challenges, we propose Generator-Aware Prototype Learning (GAPL), a framework that constrain representation with a structured learning paradigm. GAPL learns a compact set of canonical forgery prototypes to create a unified, low-variance feature space, effectively countering data this http URL resolve the model bottleneck, it employs a two-stage training scheme with Low-Rank Adaptation, enhancing its discriminative power while preserving valuable pretrained knowledge. This approach establishes a more robust and generalizable decision boundary. Through extensive experiments, we demonstrate that GAPL achieves state-of-the-art performance, showing superior detection accuracy across a wide variety of GAN and diffusion-based generators. Code is available at this https URL
zh

[CV-90] VLCache: Computing 2% Vision Tokens and Reusing 98% for Vision-Language Inference

【速读】:该论文旨在解决多模态大模型在推理过程中因重复输入导致的昂贵重新计算问题,尤其针对Key-Value (KV) cache和编码器缓存(encoder cache)的重复利用效率低下问题。解决方案的关键在于提出VLCache框架,通过形式化识别累积重用误差效应(cumulative reuse error effect),并设计动态分层重计算策略以最小化非前缀缓存重用误差,从而在保持与全量重计算相当的准确性前提下,显著降低计算开销——实验表明其仅需2–5%的token计算量即可实现1.2x–16x的首字节传输时间(TTFT)加速。

链接: https://arxiv.org/abs/2512.12977
作者: Shengling Qin,Hao Yu,Chenxin Wu,Zheng Li,Yizhong Cao,Zhengyang Zhuge,Yuxin Zhou,Wentao Yao,Yi Zhang,Zhengheng Wang,Shuai Bai,Jianwei Zhang,Junyang Lin
机构: Qwen Team, Alibaba Inc. (阿里巴巴集团); TairKVCache Team, Alibaba Cloud (阿里云)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper presents VLCache, a cache reuse framework that exploits both Key-Value (KV) cache and encoder cache from prior multimodal inputs to eliminate costly recomputation when the same multimodal inputs recur. Unlike previous heuristic approaches, we formally identify the cumulative reuse error effect and demonstrate how to minimize the non-prefix cache reuse error effectively. We further analyze the varying importance of model layers and propose a dynamic, layer-aware recomputation strategy to balance accuracy and efficiency. Experimental results show that VLCache achieves an accuracy on par with full recomputation, while requiring only 2-5% of the tokens to compute, yielding 1.2x-16x TTFT speedups. The proposed VLCache pipeline has been integrated into SGLang, enabling significantly faster inference in practical deployments.
zh

[CV-91] SCAdapter: Content-Style Disentanglement for Diffusion Style Transfer WACV2026

【速读】:该论文旨在解决扩散模型(diffusion models)在风格迁移任务中难以实现照片级真实感迁移的问题,尤其是现有方法无法有效分离原始内容风格与参考风格特征,导致生成结果常呈现绘画风格或丢失细节。解决方案的关键在于提出SCAdapter,其核心创新是利用CLIP图像空间实现内容与风格特征的解耦与融合:通过可控风格自适应实例归一化(Controllable Style Adaptive Instance Normalization, CSAdaIN)实现多风格精确混合,借助KVS注入机制实现目标风格的定向整合,并引入风格迁移一致性目标保障过程稳定性。该方法无需DDIM反演和推理阶段优化,显著提升效率(至少快2倍),同时在效果上优于当前主流扩散模型及传统方法。

链接: https://arxiv.org/abs/2512.12963
作者: Luan Thanh Trinh,Kenji Doi,Atsuki Osanai
机构: LY Corporation
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to WACV 2026

点击查看摘要

Abstract:Diffusion models have emerged as the leading approach for style transfer, yet they struggle with photo-realistic transfers, often producing painting-like results or missing detailed stylistic elements. Current methods inadequately address unwanted influence from original content styles and style reference content features. We introduce SCAdapter, a novel technique leveraging CLIP image space to effectively separate and integrate content and style features. Our key innovation systematically extracts pure content from content images and style elements from style references, ensuring authentic transfers. This approach is enhanced through three components: Controllable Style Adaptive Instance Normalization (CSAdaIN) for precise multi-style blending, KVS Injection for targeted style integration, and a style transfer consistency objective maintaining process coherence. Comprehensive experiments demonstrate SCAdapter significantly outperforms state-of-the-art methods in both conventional and diffusion-based baselines. By eliminating DDIM inversion and inference-stage optimization, our method achieves at least 2\times faster inference than other diffusion-based approaches, making it both more effective and efficient for practical applications.
zh

[CV-92] SLIM-VDB: A Real-Time 3D Probabilistic Semantic Mapping Framework

【速读】:该论文旨在解决现有3D语义地图构建系统在内存占用高、计算效率低以及无法统一支持固定类别(closed-set)与开放语言标签(open-set)预测融合的问题。其关键解决方案是提出SLIM-VDB,一个基于OpenVDB数据结构的轻量级语义映射系统,并引入统一的贝叶斯更新框架,实现对闭集和开集语义信息的概率融合,从而在保持与当前最先进方法相当的地图精度的同时,显著降低内存消耗和融合时间。

链接: https://arxiv.org/abs/2512.12945
作者: Anja Sheppard,Parker Ewen,Joey Wilson,Advaith V. Sethuraman,Benard Adewole,Anran Li,Yuzhen Chen,Ram Vasudevan,Katherine A. Skinner
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted into R-AL

点击查看摘要

Abstract:This paper introduces SLIM-VDB, a new lightweight semantic mapping system with probabilistic semantic fusion for closed-set or open-set dictionaries. Advances in data structures from the computer graphics community, such as OpenVDB, have demonstrated significantly improved computational and memory efficiency in volumetric scene representation. Although OpenVDB has been used for geometric mapping in robotics applications, semantic mapping for scene understanding with OpenVDB remains unexplored. In addition, existing semantic mapping systems lack support for integrating both fixed-category and open-language label predictions within a single framework. In this paper, we propose a novel 3D semantic mapping system that leverages the OpenVDB data structure and integrates a unified Bayesian update framework for both closed- and open-set semantic fusion. Our proposed framework, SLIM-VDB, achieves significant reduction in both memory and integration times compared to current state-of-the-art semantic mapping approaches, while maintaining comparable mapping accuracy. An open-source C++ codebase with a Python interface is available at this https URL.
zh

[CV-93] UAGLNet: Uncertainty-Aggregated Global-Local Fusion Network with Cooperative CNN-Transformer for Building Extraction

【速读】:该论文旨在解决遥感图像中建筑物提取任务因建筑结构复杂变化而导致的分割不准确和模糊问题。现有方法虽采用卷积或自注意力模块捕捉多尺度特征,但仍存在特征金字塔内部差异以及全局与局部特征融合不足的问题。其解决方案的关键在于提出一种不确定性聚合的全局-局部融合网络(UAGLNet),通过三个核心设计实现:1)混合CNN与Transformer的协同编码器(cooperative encoder),分别捕获局部与全局视觉语义;2)中间协同交互块(CIB)缩小深层网络中局部与全局特征间的差距;3)全局-局部融合(GLF)模块互补融合两种表示,并引入不确定性聚合解码器(UAD)显式估计像素级不确定性以提升分割精度。

链接: https://arxiv.org/abs/2512.12941
作者: Siyuan Yao,Dongxiu Liu,Taotao Li,Shengjie Li,Wenqi Ren,Xiaochun Cao
机构: Beijing University Of Posts and Telecommunications (北京邮电大学); Sun Yat-sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: IEEE TGRS

点击查看摘要

Abstract:Building extraction from remote sensing images is a challenging task due to the complex structure variations of the buildings. Existing methods employ convolutional or self-attention blocks to capture the multi-scale features in the segmentation models, while the inherent gap of the feature pyramids and insufficient global-local feature integration leads to inaccurate, ambiguous extraction results. To address this issue, in this paper, we present an Uncertainty-Aggregated Global-Local Fusion Network (UAGLNet), which is capable to exploit high-quality global-local visual semantics under the guidance of uncertainty modeling. Specifically, we propose a novel cooperative encoder, which adopts hybrid CNN and transformer layers at different stages to capture the local and global visual semantics, respectively. An intermediate cooperative interaction block (CIB) is designed to narrow the gap between the local and global features when the network becomes deeper. Afterwards, we propose a Global-Local Fusion (GLF) module to complementarily fuse the global and local representations. Moreover, to mitigate the segmentation ambiguity in uncertain regions, we propose an Uncertainty-Aggregated Decoder (UAD) to explicitly estimate the pixel-wise uncertainty to enhance the segmentation accuracy. Extensive experiments demonstrate that our method achieves superior performance to other state-of-the-art methods. Our code is available at this https URL
zh

[CV-94] Continuous Edit Distance Geodesics and Barycenters of Time-varying Persistence Diagrams

【速读】:该论文旨在解决时间变化持久性图(Time-Varying Persistence Diagrams, TVPDs)之间的度量学习问题,特别是如何在TVPD空间中定义一个具有几何意义且可计算的弹性距离,以支持对动态拓扑结构的对齐、比较、平均和聚类分析。解决方案的关键在于提出连续编辑距离(Continuous Edit Distance, CED),其通过结合局部替换成本与受参数 α\alpha(控制时间错位与图差异的权衡)和 β\beta(间隙惩罚项)调控的删除/插入惩罚,构建了一个测地线距离(geodesic distance)。CED不仅具备良好的数学性质,还提供了显式的测地线构造方法,并基于此设计了两种单调递减CED-Fréchet能量的barycenter求解器(随机与贪心策略),从而实现了TVPD空间中有效且可解释的聚类与分类任务。

链接: https://arxiv.org/abs/2512.12939
作者: Sebastien Tchitchek,Mohamed Kissi,Julien Tierny
机构: CNRS; Sorbonne Universite
类目: Computational Geometry (cs.CG); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
备注: 30 pages, 13 figures, 2 tables

点击查看摘要

Abstract:We introduce the Continuous Edit Distance (CED), a geodesic and elastic distance for time-varying persistence diagrams (TVPDs). The CED extends edit-distance ideas to TVPDs by combining local substitution costs with penalized deletions/insertions, controlled by two parameters: (\alpha) (trade-off between temporal misalignment and diagram discrepancy) and (\beta) (gap penalty). We also provide an explicit construction of CED-geodesics. Building on these ingredients, we present two practical barycenter solvers, one stochastic and one greedy, that monotonically decrease the CED Frechet energy. Empirically, the CED is robust to additive perturbations (both temporal and spatial), recovers temporal shifts, and supports temporal pattern search. On real-life datasets, the CED achieves clustering performance comparable or better than standard elastic dissimilarities, while our clustering based on CED-barycenters yields superior classification results. Overall, the CED equips TVPD analysis with a principled distance, interpretable geodesics, and practical barycenters, enabling alignment, comparison, averaging, and clustering directly in the space of TVPDs. A C++ implementation is provided for reproducibility at the following address this https URL.
zh

[CV-95] Content Adaptive based Motion Alignment Framework for Learned Video Compression

【速读】:该论文旨在解决当前端到端视频压缩框架中缺乏内容自适应性的问题,导致在处理不同特征的视频内容时压缩性能受限。其核心解决方案是提出一种基于内容自适应的运动对齐框架(CAMA),关键在于三个创新:一是设计两阶段流引导的可变形扭曲机制,通过粗到精的偏移预测与掩码调制实现更精确的特征对齐;二是引入多参考质量感知策略,在分层训练中依据参考帧质量动态调整失真权重,以减少误差传播;三是集成无需训练的模块,根据运动幅度和分辨率下采样帧以获得平滑的运动估计。这些机制共同提升了模型对多样化内容的适应能力,显著优于现有神经视频压缩方法及传统编码器(如HM-16.25)。

链接: https://arxiv.org/abs/2512.12936
作者: Tiange Zhang,Xiandong Meng,Siwei Ma
机构: Shenzhen Graduate School, Peking University (北京大学深圳研究生院); Pengcheng Laboratory (鹏城实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to Data Compression Conference (DCC) 2026 as a poster paper

点击查看摘要

Abstract:Recent advances in end-to-end video compression have shown promising results owing to their unified end-to-end learning optimization. However, such generalized frameworks often lack content-specific adaptation, leading to suboptimal compression performance. To address this, this paper proposes a content adaptive based motion alignment framework that improves performance by adapting encoding strategies to diverse content characteristics. Specifically, we first introduce a two-stage flow-guided deformable warping mechanism that refines motion compensation with coarse-to-fine offset prediction and mask modulation, enabling precise feature alignment. Second, we propose a multi-reference quality aware strategy that adjusts distortion weights based on reference quality, and applies it to hierarchical training to reduce error propagation. Third, we integrate a training-free module that downsamples frames by motion magnitude and resolution to obtain smooth motion estimation. Experimental results on standard test datasets demonstrate that our framework CAMA achieves significant improvements over state-of-the-art Neural Video Compression models, achieving a 24.95% BD-rate (PSNR) savings over our baseline model DCVC-TCM, while also outperforming reproduced DCVC-DC and traditional codec HM-16.25.
zh

[CV-96] Unified Interactive Multimodal Moment Retrieval via Cascaded Embedding-Reranking and Temporal-Aware Score Fusion AAAI

【速读】:该论文旨在解决视频内容激增背景下,多模态瞬间检索(Multimodal Moment Retrieval)系统面临的三大挑战:固定权重融合策略在跨模态噪声和模糊查询下表现不佳、时间建模难以捕捉连贯事件序列且对不现实的时间间隔惩罚过度,以及系统需人工选择模态导致可用性低。其解决方案的关键在于提出一个统一的多模态瞬间检索框架,包含三项创新:首先,采用级联双嵌入流水线(BEIT-3与SigLIP结合,辅以BLIP-2重排序),实现广覆盖检索并平衡召回率与精确率;其次,引入时间感知评分机制,通过束搜索施加指数衰减惩罚以处理大时间间隙,从而构建连贯事件序列而非孤立帧;最后,利用GPT-4o引导的查询分解(Agent-guided Query Decomposition),自动解析模糊查询并生成视觉/OCR/ASR子查询,实现自适应分数融合,彻底消除手动模态选择需求。

链接: https://arxiv.org/abs/2512.12935
作者: Toan Le Ngo Thanh,Phat Ha Huu,Tan Nguyen Dang Duy,Thong Nguyen Le Minh,Anh Nguyen Nhu Tinh
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Accepted at AAAI Workshop 2026

点击查看摘要

Abstract:The exponential growth of video content has created an urgent need for efficient multimodal moment retrieval systems. However, existing approaches face three critical challenges: (1) fixed-weight fusion strategies fail across cross modal noise and ambiguous queries, (2) temporal modeling struggles to capture coherent event sequences while penalizing unrealistic gaps, and (3) systems require manual modality selection, reducing usability. We propose a unified multimodal moment retrieval system with three key innovations. First, a cascaded dual-embedding pipeline combines BEIT-3 and SigLIP for broad retrieval, refined by BLIP-2 based reranking to balance recall and precision. Second, a temporal-aware scoring mechanism applies exponential decay penalties to large temporal gaps via beam search, constructing coherent event sequences rather than isolated frames. Third, Agent-guided query decomposition (GPT-4o) automatically interprets ambiguous queries, decomposes them into modality specific sub-queries (visual/OCR/ASR), and performs adaptive score fusion eliminating manual modality selection. Qualitative analysis demonstrates that our system effectively handles ambiguous queries, retrieves temporally coherent sequences, and dynamically adapts fusion strategies, advancing interactive moment search capabilities.
zh

[CV-97] MADTempo: An Interactive System for Multi-Event Temporal Video Retrieval with Query Augmentation

【速读】:该论文旨在解决视频检索系统在处理多事件时间依赖关系以及应对未见或罕见视觉概念查询时的局限性。现有方法难以建模跨多个事件的时间依赖性,且在面对分布外(out-of-distribution, OOD)查询时表现脆弱。解决方案的关键在于提出MADTempo框架,其核心创新包括:一是通过聚合连续视频片段的相似度得分来实现事件级连贯性建模,从而支持多事件查询的协同检索;二是引入基于Google图像搜索的回退模块,利用外部网络图像扩展查询表征,增强预训练视觉嵌入的泛化能力,有效缓解OOD查询带来的性能下降问题。

链接: https://arxiv.org/abs/2512.12929
作者: Huu-An Vu,Van-Khanh Mai,Trong-Tam Nguyen,Quang-Duc Dam,Tien-Huy Nguyen,Thanh-Huong Le
机构: AI VIETNAM
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rapid expansion of video content across online platforms has accelerated the need for retrieval systems capable of understanding not only isolated visual moments but also the temporal structure of complex events. Existing approaches often fall short in modeling temporal dependencies across multiple events and in handling queries that reference unseen or rare visual concepts. To address these challenges, we introduce MADTempo, a video retrieval framework developed by our team, AIO_Trinh, that unifies temporal search with web-scale visual grounding. Our temporal search mechanism captures event-level continuity by aggregating similarity scores across sequential video segments, enabling coherent retrieval of multi-event queries. Complementarily, a Google Image Search-based fallback module expands query representations with external web imagery, effectively bridging gaps in pretrained visual embeddings and improving robustness against out-of-distribution (OOD) queries. Together, these components advance the temporal rea- soning and generalization capabilities of modern video retrieval systems, paving the way for more semantically aware and adaptive retrieval across large-scale video corpora.
zh

[CV-98] Sharpness-aware Dynamic Anchor Selection for Generalized Category Discovery

【速读】:该论文旨在解决生成式 AI(Generative AI)在开放世界学习中广泛采用的广义类别发现(Generalized Category Discovery, GCD)任务中存在的伪标签噪声问题。当前基于参数化分类的方法通常使用类似DINO的伪标签策略,即利用一个视图的软概率输出作为另一个视图的监督信号,但大预训练模型对特定视觉模式存在偏好,导致编码虚假相关性并生成噪声伪标签,从而影响聚类性能。解决方案的关键在于提出两个模块:Loss Sharpness Penalty(LSP)和Dynamic Anchor Selection(DAS)。LSP通过最小化模型最坏情况下的损失尖锐度,增强模型参数对小扰动的鲁棒性,抑制平凡特征编码,降低噪声样本过拟合风险,提升伪标签质量;DAS则基于KNN密度与类别概率动态选择未知类别的代表性样本,并赋予其硬伪标签,缓解已知类与未知类间置信度差异,加速模型对未知类特征分布的学习,从而显著提高聚类精度。

链接: https://arxiv.org/abs/2512.12925
作者: Zhimao Peng,Enguang Wang,Fei Yang,Xialei Liu,Ming-Ming Cheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by TMM2025

点击查看摘要

Abstract:Generalized category discovery (GCD) is an important and challenging task in open-world learning. Specifically, given some labeled data of known classes, GCD aims to cluster unlabeled data that contain both known and unknown classes. Current GCD methods based on parametric classification adopt the DINO-like pseudo-labeling strategy, where the sharpened probability output of one view is used as supervision information for the other view. However, large pre-trained models have a preference for some specific visual patterns, resulting in encoding spurious correlation for unlabeled data and generating noisy pseudo-labels. To address this issue, we propose a novel method, which contains two modules: Loss Sharpness Penalty (LSP) and Dynamic Anchor Selection (DAS). LSP enhances the robustness of model parameters to small perturbations by minimizing the worst-case loss sharpness of the model, which suppressing the encoding of trivial features, thereby reducing overfitting of noise samples and improving the quality of pseudo-labels. Meanwhile, DAS selects representative samples for the unknown classes based on KNN density and class probability during the model training and assigns hard pseudo-labels to them, which not only alleviates the confidence difference between known and unknown classes but also enables the model to quickly learn more accurate feature distribution for the unknown classes, thus further improving the clustering accuracy. Extensive experiments demonstrate that the proposed method can effectively mitigate the noise of pseudo-labels, and achieve state-of-the-art results on multiple GCD benchmarks.
zh

[CV-99] Predictive Sample Assignment for Semantically Coherent Out-of-Distribution Detection

【速读】:该论文旨在解决语义一致的分布外检测(Semantically Coherent Out-of-Distribution Detection, SCOOD)任务中因现有方法依赖聚类-based 的分布内样本过滤(In-Distribution Filtering, IDF)策略而引入大量噪声样本的问题。其核心解决方案是提出一种基于预测样本分配(Predictive Sample Assignment, PSA)的简洁框架,关键在于:1)设计了一种双阈值三元样本分配策略,利用预测能量得分将置信度低的未标记样本分配至额外的丢弃集,从而显著提升所选分布内(ID)与分布外(OOD)样本集合的纯度;2)引入概念对比表示学习损失(concept contrastive representation learning loss),增强ID与OOD样本在表征空间中的距离,辅助判别;3)结合重训练策略使模型充分拟合所选的辅助ID/OOD样本,从而实现更鲁棒和准确的SCOOD性能。

链接: https://arxiv.org/abs/2512.12906
作者: Zhimao Peng,Enguang Wang,Xialei Liu,Ming-Ming Cheng
机构: Nankai University (南开大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by TCSVT2024

点击查看摘要

Abstract:Semantically coherent out-of-distribution detection (SCOOD) is a recently proposed realistic OOD detection setting: given labeled in-distribution (ID) data and mixed in-distribution and out-of-distribution unlabeled data as the training data, SCOOD aims to enable the trained model to accurately identify OOD samples in the testing data. Current SCOOD methods mainly adopt various clustering-based in-distribution sample filtering (IDF) strategies to select clean ID samples from unlabeled data, and take the remaining samples as auxiliary OOD data, which inevitably introduces a large number of noisy samples in training. To address the above issue, we propose a concise SCOOD framework based on predictive sample assignment (PSA). PSA includes a dual-threshold ternary sample assignment strategy based on the predictive energy score that can significantly improve the purity of the selected ID and OOD sample sets by assigning unconfident unlabeled data to an additional discard sample set, and a concept contrastive representation learning loss to further expand the distance between ID and OOD samples in the representation space to assist ID/OOD discrimination. In addition, we also introduce a retraining strategy to help the model fully fit the selected auxiliary ID/OOD samples. Experiments on two standard SCOOD benchmarks demonstrate that our approach outperforms the state-of-the-art methods by a significant margin.
zh

[CV-100] Qonvolution: Towards Learning High-Frequency Signals with Queried Convolution

【速读】:该论文旨在解决神经网络在学习高频率信号时面临的挑战,如频谱偏差(spectral bias)和优化困难等问题,这些问题在计算机视觉与图形学领域尤为突出。其解决方案的关键在于提出一种名为“查询卷积”(Queried-Convolutions, Qonvolutions)的简单而有效的改进方法:利用卷积的邻域特性,将低频信号与查询(例如坐标)进行卷积操作,从而增强对复杂高频信号的学习能力。实验表明,Qonvolutions 在多种高频任务中显著提升性能,包括一维回归、二维超分辨率、图像回归及新视角合成(NVS),尤其在结合高斯点绘制(Gaussian splatting)用于NVS时,实现了当前最优的图像质量表现,甚至超越了强大的辐射场模型。

链接: https://arxiv.org/abs/2512.12898
作者: Abhinav Kumar,Tristan Aumentado-Armstrong,Lazar Valkov,Gopal Sharma,Alex Levinshtein,Radek Grzeszczuk,Suren Kumar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
备注: 28 pages, 8 figures, Project Page: this https URL

点击查看摘要

Abstract:Accurately learning high-frequency signals is a challenge in computer vision and graphics, as neural networks often struggle with these signals due to spectral bias or optimization difficulties. While current techniques like Fourier encodings have made great strides in improving performance, there remains scope for improvement when presented with high-frequency information. This paper introduces Queried-Convolutions (Qonvolutions), a simple yet powerful modification using the neighborhood properties of convolution. Qonvolution convolves a low-frequency signal with queries (such as coordinates) to enhance the learning of intricate high-frequency signals. We empirically demonstrate that Qonvolutions enhance performance across a variety of high-frequency learning tasks crucial to both the computer vision and graphics communities, including 1D regression, 2D super-resolution, 2D image regression, and novel view synthesis (NVS). In particular, by combining Gaussian splatting with Qonvolutions for NVS, we showcase state-of-the-art performance on real-world complex scenes, even outperforming powerful radiance field models on image quality.
zh

[CV-101] Revisiting 2D Foundation Models for Scalable 3D Medical Image Classification

【速读】:该论文旨在解决当前3D医学图像分类研究中存在的三大关键问题:数据 regime 偏差(data-regime bias)、适应策略不佳(suboptimal adaptation)以及任务覆盖不足(insufficient task coverage)。为应对这些问题,作者提出了一种名为 AnyMC3D 的可扩展 3D 分类框架,其核心创新在于基于预训练的 2D 医学基础模型(Medical Foundation Models, FMs)构建一个轻量级、任务可扩展的分类系统:仅通过在单一冻结的骨干网络(backbone)上添加约 1M 参数的轻量级插件(plugin),即可高效适配新任务。该方案不仅支持多视角输入和像素级辅助监督,还能生成可解释的热力图,从而在涵盖多种病理、解剖结构和成像模态的 12 项任务中实现统一性能最优,验证了单个通用框架替代多个专用模型的可行性。

链接: https://arxiv.org/abs/2512.12887
作者: Han Liu,Bogdan Georgescu,Yanbo Zhang,Youngjin Yoo,Michael Baumgartner,Riqiang Gao,Jianing Wang,Gengyan Zhao,Eli Gibson,Dorin Comaniciu,Sasa Grbic
机构: Siemens Healthineers(西门子医疗健康); Siemens Healthineers(西门子医疗健康)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 1st Place in VLM3D Challenge

点击查看摘要

Abstract:3D medical image classification is essential for modern clinical workflows. Medical foundation models (FMs) have emerged as a promising approach for scaling to new tasks, yet current research suffers from three critical pitfalls: data-regime bias, suboptimal adaptation, and insufficient task coverage. In this paper, we address these pitfalls and introduce AnyMC3D, a scalable 3D classifier adapted from 2D FMs. Our method scales efficiently to new tasks by adding only lightweight plugins (about 1M parameters per task) on top of a single frozen backbone. This versatile framework also supports multi-view inputs, auxiliary pixel-level supervision, and interpretable heatmap generation. We establish a comprehensive benchmark of 12 tasks covering diverse pathologies, anatomies, and modalities, and systematically analyze state-of-the-art 3D classification techniques. Our analysis reveals key insights: (1) effective adaptation is essential to unlock FM potential, (2) general-purpose FMs can match medical-specific FMs if properly adapted, and (3) 2D-based methods surpass 3D architectures for 3D classification. For the first time, we demonstrate the feasibility of achieving state-of-the-art performance across diverse applications using a single scalable framework (including 1st place in the VLM3D challenge), eliminating the need for separate task-specific models.
zh

[CV-102] Cross-Level Sensor Fusion with Object Lists via Transformer for 3D Object Detection

【速读】:该论文旨在解决自动驾驶中多传感器融合系统里,传统基于原始数据的融合方法难以高效利用已处理的高抽象层级目标列表(object list)与原始图像信息的问题。现有方法通常对不同模态的原始数据分别处理后再在目标层进行融合,导致信息利用率低且融合效率受限。其解决方案的关键在于提出一种基于Transformer的端到端跨层级融合架构,将对象列表作为去噪查询(denoising queries)输入至Transformer,并结合可学习查询共同参与特征聚合过程;同时引入由对象列表位置和尺寸先验生成的可变形高斯掩码(deformable Gaussian mask),显式引导注意力机制聚焦于感兴趣区域,从而加速训练收敛并提升3D目标检测性能。

链接: https://arxiv.org/abs/2512.12884
作者: Xiangzhong Liu,Jiajie Zhang,Hao Shen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 6 pages, 3 figures, accepted at IV2025

点击查看摘要

Abstract:In automotive sensor fusion systems, smart sensors and Vehicle-to-Everything (V2X) modules are commonly utilized. Sensor data from these systems are typically available only as processed object lists rather than raw sensor data from traditional sensors. Instead of processing other raw data separately and then fusing them at the object level, we propose an end-to-end cross-level fusion concept with Transformer, which integrates highly abstract object list information with raw camera images for 3D object detection. Object lists are fed into a Transformer as denoising queries and propagated together with learnable queries through the latter feature aggregation process. Additionally, a deformable Gaussian mask, derived from the positional and size dimensional priors from the object lists, is explicitly integrated into the Transformer decoder. This directs attention toward the target area of interest and accelerates model training convergence. Furthermore, as there is no public dataset containing object lists as a standalone modality, we propose an approach to generate pseudo object lists from ground-truth bounding boxes by simulating state noise and false positives and negatives. As the first work to conduct cross-level fusion, our approach shows substantial performance improvements over the vision-based baseline on the nuScenes dataset. It demonstrates its generalization capability over diverse noise levels of simulated object lists and real detectors.
zh

[CV-103] Schrodinger Audio-Visual Editor: Object-Level Audiovisual Removal

【速读】:该论文旨在解决跨模态音频-视觉联合编辑(audio-visual joint editing)中的两大核心挑战:一是缺乏标注良好的配对音视频数据用于目标编辑前后对比学习,二是音频与视频在语义和时序上的异构性导致难以实现同步且语义一致的编辑。解决方案的关键在于构建SAVEBench数据集,该数据集包含文本和掩码条件下的音视频配对样本,支持基于对象的源到目标迁移学习;并提出Schrodinger Audio-Visual Editor (SAVE)模型,其核心创新是引入Schrodinger Bridge机制,通过端到端流匹配(flow-matching)方法直接学习从原始音视频混合信号到目标状态的映射,从而在编辑过程中保持音视频的时空同步性和语义一致性。

链接: https://arxiv.org/abs/2512.12875
作者: Weihan Xu,Kan Jen Cheng,Koichi Saito,Muhammad Jehanzeb Mirza,Tingle Li,Yisi Liu,Alexander H. Liu,Liming Wang,Masato Ishii,Takashi Shibuya,Yuki Mitsufuji,Gopala Anumanchipalli,Paul Pu Liang
机构: MIT(麻省理工学院); UC Berkeley(加州大学伯克利分校); Berkeley AI Research(伯克利人工智能研究中心); Sony AI(索尼人工智能); Sony Group Corporation(索尼集团)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Joint editing of audio and visual content is crucial for precise and controllable content creation. This new task poses challenges due to the limitations of paired audio-visual data before and after targeted edits, and the heterogeneity across modalities. To address the data and modeling challenges in joint audio-visual editing, we introduce SAVEBench, a paired audiovisual dataset with text and mask conditions to enable object-grounded source-to-target learning. With SAVEBench, we train the Schrodinger Audio-Visual Editor (SAVE), an end-to-end flow-matching model that edits audio and video in parallel while keeping them aligned throughout processing. SAVE incorporates a Schrodinger Bridge that learns a direct transport from source to target audiovisual mixtures. Our evaluation demonstrates that the proposed SAVE model is able to remove the target objects in audio and visual content while preserving the remaining content, with stronger temporal synchronization and audiovisual semantic correspondence compared with pairwise combinations of an audio editor and a video editor.
zh

[CV-104] GradID: Adversarial Detection via Intrinsic Dimensionality of Gradients

【速读】:该论文旨在解决深度神经网络在面对微小、通常难以察觉的对抗扰动时所表现出的脆弱性问题,这类扰动可能导致模型预测结果发生显著偏差,从而在医疗诊断和自动驾驶等高可靠性要求的应用中带来严重风险。解决方案的关键在于分析模型输入损失景观(loss landscape)的几何特性,特别是利用**内在维度(Intrinsic Dimensionality, ID)**来区分自然数据与对抗样本。研究发现,自然数据和对抗数据在梯度参数空间中的ID存在稳定且显著的差异,这一特性被用于构建一种新型检测机制,在批量检测和单样本检测两种场景下均表现出优越性能,尤其在CIFAR-10和MS COCO等复杂基准上达到当前最优的检测准确率(>92%),且对多种强攻击方法(如CW和AutoAttack)具有鲁棒性。

链接: https://arxiv.org/abs/2512.12827
作者: Mohammad Mahdi Razmjoo,Mohammad Mahdi Sharifian,Saeed Bagheri Shouraki
机构: Sharif University of Technology (谢里夫理工大学)
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 8 figures

点击查看摘要

Abstract:Despite their remarkable performance, deep neural networks exhibit a critical vulnerability: small, often imperceptible, adversarial perturbations can lead to drastically altered model predictions. Given the stringent reliability demands of applications such as medical diagnosis and autonomous driving, robust detection of such adversarial attacks is paramount. In this paper, we investigate the geometric properties of a model’s input loss landscape. We analyze the Intrinsic Dimensionality (ID) of the model’s gradient parameters, which quantifies the minimal number of coordinates required to describe the data points on their underlying manifold. We reveal a distinct and consistent difference in the ID for natural and adversarial data, which forms the basis of our proposed detection method. We validate our approach across two distinct operational scenarios. First, in a batch-wise context for identifying malicious data groups, our method demonstrates high efficacy on datasets like MNIST and SVHN. Second, in the critical individual-sample setting, we establish new state-of-the-art results on challenging benchmarks such as CIFAR-10 and MS COCO. Our detector significantly surpasses existing methods against a wide array of attacks, including CW and AutoAttack, achieving detection rates consistently above 92% on CIFAR-10. The results underscore the robustness of our geometric approach, highlighting that intrinsic dimensionality is a powerful fingerprint for adversarial detection across diverse datasets and attack strategies.
zh

[CV-105] Adapting Multimodal Foundation Models for Few-Shot Learning: A Comprehensive Study on Contrastive Captioners

【速读】:该论文旨在解决生成式-对比混合模型(如CoCa)在极端数据稀缺场景下(即少样本学习,few-shot learning)的适应性问题,尤其关注其与传统双编码器架构(如CLIP)相比在参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)中的表现差异。解决方案的关键在于系统性地评估从零训练(training-free)到深度参数调整(如LoRA)的一系列策略,并揭示了两个核心发现:一是强数据增强虽会降低线性探测(linear probing)在低样本设置下的性能,却是稳定LoRA微调所必需;二是融合监督对比损失(Supervised Contrastive, SupCon)的混合目标函数在不同样本量下均能显著优于标准交叉熵(Cross-Entropy),从而提升模型在少样本图像分类任务中的鲁棒性和泛化能力。

链接: https://arxiv.org/abs/2512.12824
作者: N.K.B.M.P.K.B. Narasinghe,Uthayasanker Thayasivam
机构: University of Moratuwa (莫鲁塔瓦大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 9 pages, 3 figures. Accepted to VISAPP 2026

点击查看摘要

Abstract:Large-scale multimodal foundation models, particularly Contrastive Captioners (CoCa), have achieved state-of-the-art results by unifying contrastive alignment with generative captioning. While zero-shot transfer capabilities are well-documented, the adaptation of these generative-contrastive hybrids to downstream tasks with extreme data scarcity (few-shot learning) remains under-explored. Existing literature predominantly focuses on dual-encoder architectures like CLIP, leaving a gap in understanding how CoCa’s distinct latent space responds to parameter-efficient fine-tuning (PEFT). This paper presents a comprehensive empirical study on adapting the CoCa visual backbone for few-shot image classification. We systematically evaluate a hierarchy of strategies, ranging from training-free hybrid prototyping to deep parameter adaptation via Low-Rank Adaptation (LoRA). First, we identify an “augmentation divergence”: while strong data augmentation degrades the performance of linear probing in low-shot settings, it is essential for stabilizing LoRA fine-tuning. We also demonstrate that hybrid objectives incorporating Supervised Contrastive (SupCon) loss yield consistent performance improvements over standard Cross-Entropy across varying shot counts. Crucially, we characterize the sensitivity of training configurations to data scarcity, providing empirical reference settings for scaling regularization, rank, and sampling strategies to facilitate the efficient adaptation of generative-contrastive foundation models.
zh

[CV-106] Lemon: A Unified and Scalable 3D Multimodal Model for Universal Spatial Understanding

【速读】:该论文旨在解决大规模多模态模型(Large Multimodal Models, LMMs)在3D理解任务中面临的三大挑战:点云数据稀疏且不规则、现有模型依赖碎片化架构(包含模态专用编码器)、训练流程不稳定且扩展性差。其解决方案的关键在于提出一种统一的Transformer架构Lemon,通过将3D点云补丁与语言标记作为单一序列进行联合处理,实现早期空间-语言融合,避免冗余编码器并提升参数效率,同时引入结构化的补丁化与标记化策略以保留空间上下文,并设计三阶段训练课程,逐步从物体级识别过渡到场景级空间推理,从而显著提升模型在3D理解与推理任务上的性能与可扩展性。

链接: https://arxiv.org/abs/2512.12822
作者: Yongyuan Liang,Xiyao Wang,Yuanchen Ju,Jianwei Yang,Furong Huang
机构: University of Maryland, College Park (马里兰大学学院市分校); University of California, Berkeley (加州大学伯克利分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Scaling large multimodal models (LMMs) to 3D understanding poses unique challenges: point cloud data is sparse and irregular, existing models rely on fragmented architectures with modality-specific encoders, and training pipelines often suffer from instability and poor scalability. We introduce Lemon, a unified transformer architecture that addresses these challenges by jointly processing 3D point cloud patches and language tokens as a single sequence. Unlike prior work that relies on modality-specific encoders and cross-modal alignment modules, this design enables early spatial-linguistic fusion, eliminates redundant encoders, improves parameter efficiency, and supports more effective model scaling. To handle the complexity of 3D data, we develop a structured patchification and tokenization scheme that preserves spatial context, and a three-stage training curriculum that progressively builds capabilities from object-level recognition to scene-level spatial reasoning. Lemon establishes new state-of-the-art performance across comprehensive 3D understanding and reasoning tasks, from object recognition and captioning to spatial reasoning in 3D scenes, while demonstrating robust scaling properties as model size and training data increase. Our work provides a unified foundation for advancing 3D spatial intelligence in real-world applications.
zh

[CV-107] Learning Common and Salient Generative Factors Between Two Image Datasets

【速读】:该论文旨在解决对比分析(Contrastive Analysis, CA)问题,即在给定两个图像数据集时,将两者共有的生成因素(common factors)与仅属于其中一个数据集的显著因素(salient factors)进行分离。区别于以往依赖属性标签(如眼镜、性别等)作为监督信号的条件编辑方法,本文提出的方法仅使用数据集标签作为弱监督信号,更具通用性和适用性。解决方案的关键在于设计了一种适用于生成对抗网络(GAN)和扩散模型(Diffusion Models)的统一框架,通过引入新的学习策略和损失函数,确保共性因素与显著因素的有效解耦,并保持高质量的图像生成能力。实验表明,该方法在人脸、动物图像及医学影像等多个数据集上均展现出优越的分离性能和生成质量。

链接: https://arxiv.org/abs/2512.12800
作者: Yunlong He,Gwilherm Lesné,Ziqian Liu,Michaël Soumm,Pietro Gori
机构: Télécom Paris (巴黎电信学院); Institut Polytechnique de Paris (巴黎综合理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This is the author’s version of a work submitted to IEEE for possible publication. The final version may differ from this version

点击查看摘要

Abstract:Recent advancements in image synthesis have enabled high-quality image generation and manipulation. Most works focus on: 1) conditional manipulation, where an image is modified conditioned on a given attribute, or 2) disentangled representation learning, where each latent direction should represent a distinct semantic attribute. In this paper, we focus on a different and less studied research problem, called Contrastive Analysis (CA). Given two image datasets, we want to separate the common generative factors, shared across the two datasets, from the salient ones, specific to only one dataset. Compared to existing methods, which use attributes as supervised signals for editing (e.g., glasses, gender), the proposed method is weaker, since it only uses the dataset signal. We propose a novel framework for CA, that can be adapted to both GAN and Diffusion models, to learn both common and salient factors. By defining new and well-adapted learning strategies and losses, we ensure a relevant separation between common and salient factors, preserving a high-quality generation. We evaluate our approach on diverse datasets, covering human faces, animal images and medical scans. Our framework demonstrates superior separation ability and image quality synthesis compared to prior methods.
zh

[CV-108] DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding Perception Prediction and Planning

【速读】:该论文旨在解决多模态大语言模型(Multi-modal Large Language Models, MLLMs)在自动驾驶场景中生成细粒度三维感知与预测输出能力不足的问题。现有方法通常将空间理解、3D感知(如3D占据)、预测(如占据流)和规划(如动作输出)分阶段处理,缺乏统一建模与端到端优化能力。解决方案的关键在于提出DrivePI——一个面向4D时空感知的统一视觉-语言-动作(Vision-Language-Action, VLA)框架,其核心创新是通过融合点云、多视角图像与语言指令,在单一MMLLM架构内并行完成空间理解、3D占据预测、占据流预测及行为规划任务,并基于自研数据引擎构建文本-占据与文本-流问答对以增强4D空间理解能力。实验表明,仅使用0.5B参数的Qwen2.5作为骨干模型,DrivePI即在nuScenes和OpenOcc等基准上达到或超越当前主流VLA与专用VA模型性能。

链接: https://arxiv.org/abs/2512.12799
作者: Zhe Liu,Runhui Huang,Rui Yang,Siming Yan,Zining Wang,Lu Hou,Di Lin,Xiang Bai,Hengshuang Zhao
机构: The University of Hong Kong (香港大学); Yinwang Intelligent Technology Co. Ltd. (英诺智能科技有限公司); Tianjin University (天津大学); Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Although multi-modal large language models (MLLMs) have shown strong capabilities across diverse domains, their application in generating fine-grained 3D perception and prediction outputs in autonomous driving remains underexplored. In this paper, we propose DrivePI, a novel spatial-aware 4D MLLM that serves as a unified Vision-Language-Action (VLA) framework that is also compatible with vision-action (VA) models. Our method jointly performs spatial understanding, 3D perception (i.e., 3D occupancy), prediction (i.e., occupancy flow), and planning (i.e., action outputs) in parallel through end-to-end optimization. To obtain both precise geometric information and rich visual appearance, our approach integrates point clouds, multi-view images, and language instructions within a unified MLLM architecture. We further develop a data engine to generate text-occupancy and text-flow QA pairs for 4D spatial understanding. Remarkably, with only a 0.5B Qwen2.5 model as MLLM backbone, DrivePI as a single unified model matches or exceeds both existing VLA models and specialized VA models. Specifically, compared to VLA models, DrivePI outperforms OpenDriveVLA-7B by 2.5% mean accuracy on nuScenes-QA and reduces collision rate by 70% over ORION (from 0.37% to 0.11%) on nuScenes. Against specialized VA models, DrivePI surpasses FB-OCC by 10.3 RayIoU for 3D occupancy on OpenOcc, reduces the mAVE from 0.591 to 0.509 for occupancy flow on OpenOcc, and achieves 32% lower L2 error than VAD (from 0.72m to 0.49m) for planning on nuScenes. Code will be available at this https URL
zh

[CV-109] L-STEC: Learned Video Compression with Long-term Spatio-Temporal Enhanced Context

【速读】:该论文旨在解决神经视频压缩中因仅依赖前一帧特征进行时域预测而导致的两个关键问题:一是短参考窗口难以捕捉长期依赖关系和精细纹理细节;二是仅传播特征层面信息会累积误差,导致预测不准确并丢失细微纹理。解决方案的关键在于提出长时空增强上下文(Long-term Spatio-Temporal Enhanced Context, L-STEC)方法:首先利用LSTM扩展参考链以捕获长期依赖,其次引入来自像素域的形变空间上下文,并通过多感受野网络融合时空信息,从而更有效地保留参考细节,显著提升压缩性能,在PSNR和MS-SSIM指标上分别较DCVC-TCM降低37.01%和31.65%比特率,达到当前最优效果。

链接: https://arxiv.org/abs/2512.12790
作者: Tiange Zhang,Zhimeng Huang,Xiandong Meng,Kai Zhang,Zhipin Deng,Siwei Ma
机构: Shenzhen Graduate School, Peking University, Shenzhen, China; Pengcheng Laboratory, Shenzhen, China; National Engineering Research Center for Visual Technology, School of Computer Science, Peking University, Beijing, China; Bytedance Inc, San Diego, USA
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to Data Compression Conference (DCC) 2026 as an oral paper

点击查看摘要

Abstract:Neural Video Compression has emerged in recent years, with condition-based frameworks outperforming traditional codecs. However, most existing methods rely solely on the previous frame’s features to predict temporal context, leading to two critical issues. First, the short reference window misses long-term dependencies and fine texture details. Second, propagating only feature-level information accumulates errors over frames, causing prediction inaccuracies and loss of subtle textures. To address these, we propose the Long-term Spatio-Temporal Enhanced Context (L-STEC) method. We first extend the reference chain with LSTM to capture long-term dependencies. We then incorporate warped spatial context from the pixel domain, fusing spatio-temporal information through a multi-receptive field network to better preserve reference details. Experimental results show that L-STEC significantly improves compression by enriching contextual information, achieving 37.01% bitrate savings in PSNR and 31.65% in MS-SSIM compared to DCVC-TCM, outperforming both VTM-17.0 and DCVC-FM and establishing new state-of-the-art performance.
zh

[CV-110] Fast 2DGS: Efficient Image Representation with Deep Gaussian Prior

【速读】:该论文旨在解决当前2D高斯泼溅(2D Gaussian Splatting, 2DGS)在高质量图像表示中存在计算效率低、收敛速度慢以及依赖复杂后优化的问题。现有方法要么采用随机或启发式策略初始化高斯分布,导致对图像复杂度不敏感且收敛缓慢(数十秒),要么引入可学习网络预测初始配置,却显著增加计算与架构复杂度。其解决方案的关键在于提出一种轻量级框架Fast-2DGS,核心创新为两个模块:一是基于条件网络的深度高斯先验(Deep Gaussian Prior),用于建模不同图像复杂度下高斯基元的空间分布;二是属性回归网络,用于预测密集高斯属性。该解耦架构可在单次前向传播中实现高质量重建,并仅需少量微调,显著降低计算成本而不牺牲视觉质量,推动2DGS向工业部署迈进。

链接: https://arxiv.org/abs/2512.12774
作者: Hao Wang,Ashish Bastola,Chaoyi Zhou,Wenhui Zhu,Xiwen Chen,Xuanzhao Dong,Siyu Huang,Abolfazl Razi
机构: Clemson University (克莱姆森大学); Arizona State University (亚利桑那州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:As generative models become increasingly capable of producing high-fidelity visual content, the demand for efficient, interpretable, and editable image representations has grown substantially. Recent advances in 2D Gaussian Splatting (2DGS) have emerged as a promising solution, offering explicit control, high interpretability, and real-time rendering capabilities (1000 FPS). However, high-quality 2DGS typically requires post-optimization. Existing methods adopt random or heuristics (e.g., gradient maps), which are often insensitive to image complexity and lead to slow convergence (10s). More recent approaches introduce learnable networks to predict initial Gaussian configurations, but at the cost of increased computational and architectural complexity. To bridge this gap, we present Fast-2DGS, a lightweight framework for efficient Gaussian image representation. Specifically, we introduce Deep Gaussian Prior, implemented as a conditional network to capture the spatial distribution of Gaussian primitives under different complexities. In addition, we propose an attribute regression network to predict dense Gaussian properties. Experiments demonstrate that this disentangled architecture achieves high-quality reconstruction in a single forward pass, followed by minimal fine-tuning. More importantly, our approach significantly reduces computational cost without compromising visual quality, bringing 2DGS closer to industry-ready deployment.
zh

[CV-111] JointAVBench: A Benchmark for Joint Audio-Visual Reasoning Evaluation

【速读】:该论文旨在解决当前对多模态大语言模型(Omni-Large Language Models, Omni-LLMs)评估的不足问题,即现有基准数据集在跨模态依赖性、音频信息多样性及场景跨度覆盖方面存在缺陷,难以严格且全面地衡量模型对视觉与听觉信息联合理解的能力。解决方案的关键在于提出一个名为JointAVBench的新基准,其核心特征包括:(1)严格定义视听相关性以确保问题必须依赖联合感知才能解答;(2)涵盖五种认知维度、四种音频类型(语音、声音事件、音乐、 vocal traits)以及三种场景跨度(单场景、跨场景、全场景);(3)设计了一套自动化合成管道,利用先进的视觉大语言模型(vision-LLMs)、音频大语言模型(audio-LLMs)和通用大语言模型(general-purpose LLMs)生成高质量问答对,从而显著降低人工标注成本并提升数据规模与多样性。

链接: https://arxiv.org/abs/2512.12772
作者: Jianghan Chao,Jianzhang Gao,Wenhui Tan,Yuchong Sun,Ruihua Song,Liyun Ru
机构: Renmin University of China (中国人民大学); Baichuan Inc. (百川智能)
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Understanding videos inherently requires reasoning over both visual and auditory information. To properly evaluate Omni-Large Language Models (Omni-LLMs), which are capable of processing multi-modal information including vision and audio, an effective benchmark must comprehensively cover three key aspects: (1) multi-modal dependency (i.e., questions that cannot be answered using vision or audio alone), (2) diverse audio information types (e.g., speech, sound events), and (3) varying scene spans. However, existing datasets fall short in one or more of these dimensions, limiting strict and comprehensive evaluation. To address this gap, we introduce JointAVBench, a novel benchmark with strict audio-video correlation, spanning five cognitive dimensions, four audio information types (speech, sound events, music, vocal traits), and three scene spans (single-, cross-, and full-scene). Given the high cost of manual annotation, we propose an automated pipeline that leverages state-of-the-art vision-LLMs, audio-LLMs, and general-purpose LLMs to synthesize questions and answers that strictly require joint audio-visual understanding. We evaluate leading vision-only, audio-only, and Omni-LLMs on our dataset. Results show that even the best-performing Omni-LLM achieves an average accuracy of only 62.6%, outperforming uni-modal baselines but revealing substantial room for improvement, especially in cross-scene reasoning.
zh

[CV-112] CoRe3D: Collaborative Reasoning as a Foundation for 3D Intelligence

【速读】:该论文旨在解决当前3D理解与生成任务中缺乏有效推理机制的问题,尤其是在跨模态对齐、局部一致性以及语义引导的几何结构生成方面存在不足。其解决方案的关键在于提出了一种统一的3D理解与生成推理框架CoRe3D,该框架通过引入一种空间锚定的推理表示(spatially grounded reasoning representation),将3D潜在空间分解为局部区域,从而实现对几何信息的组合式和过程式推理;同时,该框架紧密耦合语义链式思维(semantic chain-of-thought)推理与结构化空间推理,使得生成的3D内容在局部细节上保持一致性,并精准匹配语言描述的高层意图。

链接: https://arxiv.org/abs/2512.12768
作者: Tianjiao Yu,Xinzhuo Li,Yifan Shen,Yuanzhe Liu,Ismini Lourentzou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent advances in large multimodal models suggest that explicit reasoning mechanisms play a critical role in improving model reliability, interpretability, and cross-modal alignment. While such reasoning-centric approaches have been proven effective in language and vision tasks, their extension to 3D remains underdeveloped. CoRe3D introduces a unified 3D understanding and generation reasoning framework that jointly operates over semantic and spatial abstractions, enabling high-level intent inferred from language to directly guide low-level 3D content formation. Central to this design is a spatially grounded reasoning representation that decomposes 3D latent space into localized regions, allowing the model to reason over geometry in a compositional and procedural manner. By tightly coupling semantic chain-of-thought inference with structured spatial reasoning, CoRe3D produces 3D outputs that exhibit strong local consistency and faithful alignment with linguistic descriptions.
zh

[CV-113] Federated Learning with Feedback Alignment

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)在面对非独立同分布(non-IID)数据时所引发的局部漂移(local drift)问题,该问题会显著阻碍全局模型的收敛。其解决方案的关键在于提出一种名为“带反馈对齐的联邦学习”(Federated Learning with Feedback Alignment, FLFA)的新框架,该框架在本地训练的反向传播过程中引入反馈对齐机制,利用全局模型权重作为共享的反馈矩阵,从而高效地将本地更新与全局模型对齐。此设计在几乎不增加计算开销且无需额外通信成本的前提下,有效缓解了局部漂移,理论分析和实验结果均验证了其对局部与全局模型收敛性的提升效果。

链接: https://arxiv.org/abs/2512.12762
作者: Incheol Baek,Hyungbin Kim,Minseo Kim,Yon Dohn Chung
机构: Korea University (韩国大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Federated Learning (FL) enables collaborative training across multiple clients while preserving data privacy, yet it struggles with data heterogeneity, where clients’ data are not distributed independently and identically (non-IID). This causes local drift, hindering global model convergence. To address this, we introduce Federated Learning with Feedback Alignment (FLFA), a novel framework that integrates feedback alignment into FL. FLFA uses the global model’s weights as a shared feedback matrix during local training’s backward pass, aligning local updates with the global model efficiently. This approach mitigates local drift with minimal additional computational cost and no extra communication overhead. Our theoretical analysis supports FLFA’s design by showing how it alleviates local drift and demonstrates robust convergence for both local and global models. Empirical evaluations, including accuracy comparisons and measurements of local drift, further illustrate that FLFA can enhance other FL methods demonstrating its effectiveness. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2512.12762 [cs.LG] (or arXiv:2512.12762v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2512.12762 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-114] FysicsWorld: A Unified Full-Modality Benchmark for Any-to-Any Understanding Generation and Reasoning

【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)及全模态架构评估基准存在的局限性问题,包括模态覆盖不完整、交互方式以文本为中心、以及各模态间缺乏强依赖性和互补性。为应对这些挑战,作者提出FysicsWorld——首个支持图像、视频、音频与文本之间双向输入输出的统一全模态基准,能够实现理解、生成与推理任务的任意模态组合评估。其解决方案的关键在于引入Cross-Modal Complementarity Screening(CMCS)策略,并构建系统化的数据构造框架,从而生成适用于语音交互和融合依赖型跨模态推理的全模态数据,显著提升了基准的全面性和复杂性,为下一代全模态架构的评测与演进提供了统一基础和强基线。

链接: https://arxiv.org/abs/2512.12756
作者: Yue Jiang,Dingkang Yang,Minghao Han,Jinghang Han,Zizhi Chen,Yizhou Liu,Mingcheng Li,Peng Zhai,Lihua Zhang
机构: Fudan University (复旦大学); Fysics Intelligence Technologies Co., Ltd. (Fysics AI)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The omni-modal benchmark report from Fysics AI

点击查看摘要

Abstract:Despite rapid progress in multimodal large language models (MLLMs) and emerging omni-modal architectures, current benchmarks remain limited in scope and integration, suffering from incomplete modality coverage, restricted interaction to text-centric outputs, and weak interdependence and complementarity among modalities. To bridge these gaps, we introduce FysicsWorld, the first unified full-modality benchmark that supports bidirectional input-output across image, video, audio, and text, enabling comprehensive any-to-any evaluation across understanding, generation, and reasoning. FysicsWorld encompasses 16 primary tasks and 3,268 curated samples, aggregated from over 40 high-quality sources and covering a rich set of open-domain categories with diverse question types. We also propose the Cross-Modal Complementarity Screening (CMCS) strategy integrated in a systematic data construction framework that produces omni-modal data for spoken interaction and fusion-dependent cross-modal reasoning. Through a comprehensive evaluation of over 30 state-of-the-art baselines, spanning MLLMs, modality-specific models, unified understanding-generation models, and omni-modal language models, FysicsWorld exposes the performance disparities and limitations across models in understanding, generation, and reasoning. Our benchmark establishes a unified foundation and strong baselines for evaluating and advancing next-generation full-modality architectures.
zh

[CV-115] GenieDrive: Towards Physics-Aware Driving World Model with 4D Occupancy Guided Video Generation

【速读】:该论文旨在解决现有驾驶视频生成方法中因依赖单一扩散模型直接映射驾驶动作到视频而导致的学习困难和物理不一致性问题。其关键解决方案在于提出GenieDrive框架,通过生成4D占用(4D occupancy)作为物理信息丰富的基础,利用一种新型变分自编码器(VAE)将高分辨率占用数据压缩为潜空间三平面表示(latent tri-plane representation),显著降低潜变量尺寸至先前方法的58%;同时引入互控注意力机制(Mutual Control Attention, MCA)以精确建模控制信号对占用演化的影响,并端到端联合训练VAE与预测模块以提升预测精度。此外,通过引入归一化多视角注意力机制增强多视角视频生成能力,实现高质量、可控性强且物理一致的驾驶视频合成。

链接: https://arxiv.org/abs/2512.12751
作者: Zhenya Yang,Zhe Liu,Yuxiang Lu,Liping Hou,Chenxuan Miao,Siyi Peng,Bailan Feng,Xiang Bai,Hengshuang Zhao
机构: The University of Hong Kong (香港大学); Huawei Noah’s Ark Lab (华为诺亚方舟实验室); Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The project page is available at this https URL

点击查看摘要

Abstract:Physics-aware driving world model is essential for drive planning, out-of-distribution data synthesis, and closed-loop evaluation. However, existing methods often rely on a single diffusion model to directly map driving actions to videos, which makes learning difficult and leads to physically inconsistent outputs. To overcome these challenges, we propose GenieDrive, a novel framework designed for physics-aware driving video generation. Our approach starts by generating 4D occupancy, which serves as a physics-informed foundation for subsequent video generation. 4D occupancy contains rich physical information, including high-resolution 3D structures and dynamics. To facilitate effective compression of such high-resolution occupancy, we propose a VAE that encodes occupancy into a latent tri-plane representation, reducing the latent size to only 58% of that used in previous methods. We further introduce Mutual Control Attention (MCA) to accurately model the influence of control on occupancy evolution, and we jointly train the VAE and the subsequent prediction module in an end-to-end manner to maximize forecasting accuracy. Together, these designs yield a 7.2% improvement in forecasting mIoU at an inference speed of 41 FPS, while using only 3.47 M parameters. Additionally, a Normalized Multi-View Attention is introduced in the video generation model to generate multi-view driving videos with guidance from our 4D occupancy, significantly improving video quality with a 20.7% reduction in FVD. Experiments demonstrate that GenieDrive enables highly controllable, multi-view consistent, and physics-aware driving video generation.
zh

[CV-116] Spinal Line Detection for Posture Evaluation through Train-ing-free 3D Human Body Reconstruction with 2D Depth Images

【速读】:该论文旨在解决现有三维人体建模与脊柱中心线估计方法中存在的局限性:多视角图像方法依赖昂贵设备和复杂流程,而单视角方法因遮挡和视角限制难以准确估计内部结构(如脊柱中心线)。解决方案的关键在于提出一种集成四方向深度图像的三维身体姿态分析系统,通过全局与细粒度层级匹配实现对噪声和遮挡的鲁棒重建;同时引入自适应顶点简化(Adaptive Vertex Reduction)以保持网格分辨率与形状可靠性,并利用细节层次集成(Level of Detail ensemble)同步提升脊柱角度估计的精度与稳定性。该方法无需训练数据或复杂神经网络模型即可实现高精度脊柱注册估计,验证了匹配质量的显著改善。

链接: https://arxiv.org/abs/2512.12718
作者: Sehyun Kim,Hye Jun Lee,Jiwoo Lee,Changgyun Kim,Taemin Lee
机构: Kangwon National University (江原国立大学); Kangwon National University (江原国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The spinal angle is an important indicator of body balance. It is important to restore the 3D shape of the human body and estimate the spine center line. Existing mul-ti-image-based body restoration methods require expensive equipment and complex pro-cedures, and single image-based body restoration methods have limitations in that it is difficult to accurately estimate the internal structure such as the spine center line due to occlusion and viewpoint limitation. This study proposes a method to compensate for the shortcomings of the multi-image-based method and to solve the limitations of the sin-gle-image method. We propose a 3D body posture analysis system that integrates depth images from four directions to restore a 3D human model and automatically estimate the spine center line. Through hierarchical matching of global and fine registration, restora-tion to noise and occlusion is performed. Also, the Adaptive Vertex Reduction is applied to maintain the resolution and shape reliability of the mesh, and the accuracy and stabil-ity of spinal angle estimation are simultaneously secured by using the Level of Detail en-semble. The proposed method achieves high-precision 3D spine registration estimation without relying on training data or complex neural network models, and the verification confirms the improvement of matching quality.
zh

[CV-117] Robust Motion Generation using Part-level Reliable Data from Videos

【速读】:该论文旨在解决从大规模网络视频中提取人类运动时面临的“数据缺失”问题:由于视频中人物部分区域常因出界或遮挡而不可见,若直接剔除此类数据会限制数据规模与多样性,而保留则会影响数据质量及模型性能。其解决方案的关键在于提出一种基于可信部位的掩码自回归生成模型(part-aware masked autoregression model),首先将人体分解为五个部分并识别视频帧中清晰可见的“可信部位”,再通过自研的部位感知变分自编码器(part-aware variational autoencoder)将可信部位编码为潜在token,最后利用鲁棒的部位级掩码生成机制预测被遮挡或缺失的可信部位,同时忽略噪声干扰的不可信部位,从而在保持高质量的同时提升模型对不完整数据的泛化能力。

链接: https://arxiv.org/abs/2512.12703
作者: Boyuan Li,Sipeng Zheng,Bin Cao,Ruihua Song,Zongqing Lu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Extracting human motion from large-scale web videos offers a scalable solution to the data scarcity issue in character animation. However, some human parts in many video frames cannot be seen due to off-screen captures or occlusions. It brings a dilemma: discarding the data missing any part limits scale and diversity, while retaining it compromises data quality and model performance. To address this problem, we propose leveraging credible part-level data extracted from videos to enhance motion generation via a robust part-aware masked autoregression model. First, we decompose a human body into five parts and detect the parts clearly seen in a video frame as “credible”. Second, the credible parts are encoded into latent tokens by our proposed part-aware variational autoencoder. Third, we propose a robust part-level masked generation model to predict masked credible parts, while ignoring those noisy parts. In addition, we contribute K700-M, a challenging new benchmark comprising approximately 200k real-world motion sequences, for evaluation. Experimental results indicate that our method successfully outperforms baselines on both clean and noisy datasets in terms of motion quality, semantic consistency and diversity. Project page: this https URL Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) Cite as: arXiv:2512.12703 [cs.CV] (or arXiv:2512.12703v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2512.12703 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-118] Hybrid Retrieval-Augmented Generation for Robust Multilingual Document Question Answering

【速读】:该论文旨在解决历史文献中由于OCR错误、多语言拼写差异及语言随时间演变导致的计算访问难题,从而实现对噪声历史文档的可靠问答。其关键解决方案在于构建一个面向多语言历史文档的检索增强生成(Retrieval-Augmented Generation, RAG)流水线:首先通过语义查询扩展与多查询融合(采用倒数排名融合,Reciprocal Rank Fusion, RRF)提升检索鲁棒性以缓解词汇不匹配问题;其次设计了一个强约束证据锚定的生成提示(prompt),确保答案严格基于检索到的证据,并在证据不足时明确拒绝回答;最后采用模块化架构支持各组件的系统性评估,实验证明该方案在保证可回答问题的答案忠实性的同时,能有效识别并回避不可回答的问题,且RRF策略显著提升了召回稳定性,尤其在不同查询表述下表现更一致。

链接: https://arxiv.org/abs/2512.12694
作者: Anthony Mudet,Souhail Bakkali
机构: Univ Rennes, CNRS, IRISA - UMR 6074, Rennes, France; L3i-lab, La Rochelle Université, France
类目: Digital Libraries (cs.DL); Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint

点击查看摘要

Abstract:Large-scale digitization initiatives have unlocked massive collections of historical newspapers, yet effective computational access remains hindered by OCR corruption, multilingual orthographic variation, and temporal language drift. We develop and evaluate a multilingual Retrieval-Augmented Generation pipeline specifically designed for question answering on noisy historical documents. Our approach integrates: (i) semantic query expansion and multi-query fusion using Reciprocal Rank Fusion to improve retrieval robustness against vocabulary mismatch; (ii) a carefully engineered generation prompt that enforces strict grounding in retrieved evidence and explicit abstention when evidence is insufficient; and (iii) a modular architecture enabling systematic component evaluation. We conduct comprehensive ablation studies on Named Entity Recognition and embedding model selection, demonstrating the importance of syntactic coherence in entity extraction and balanced performance-efficiency trade-offs in dense retrieval. Our end-to-end evaluation framework shows that the pipeline generates faithful answers for well-supported queries while correctly abstaining from unanswerable questions. The hybrid retrieval strategy improves recall stability, particularly benefiting from RRF’s ability to smooth performance variance across query formulations. We release our code and configurations at this https URL, providing a reproducible foundation for robust historical document question answering.
zh

[CV-119] β-CLIP: Text-Conditioned Contrastive Learning for Multi-Granular Vision-Language Alignment

【速读】:该论文旨在解决CLIP模型在细粒度视觉-语言匹配任务中表现不足的问题,尽管其在全局图像-文本检索上表现优异,但在需要精确对齐局部视觉区域与文本描述(如短语或句子级别)的任务中仍存在局限。解决方案的关键在于提出β-CLIP框架,该框架通过多粒度文本条件对比学习实现从完整描述到句子、短语等不同粒度的层级对齐;其核心创新是引入β-Contextualized Contrastive Alignment Loss(β-CAL),该损失函数在严格查询特定匹配与宽松图像内上下文一致性之间进行权衡,支持软交叉熵和硬二值交叉熵两种形式,从而有效缓解层次结构中的语义重叠问题,显著提升细粒度图文对应能力。

链接: https://arxiv.org/abs/2512.12678
作者: Fatimah Zohra,Chen Zhao,Hani Itani,Bernard Ghanem
机构: King Abdullah University of Science and Technology (KAUST)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:CLIP achieves strong zero-shot image-text retrieval by aligning global vision and text representations, yet it falls behind on fine-grained tasks even when fine-tuned on long, detailed captions. In this work, we propose \beta -CLIP, a multi-granular text-conditioned contrastive learning framework designed to achieve hierarchical alignment between multiple textual granularities-from full captions to sentences and phrases-and their corresponding visual regions. For each level of granularity, \beta -CLIP utilizes cross-attention to dynamically pool image patches, producing contextualized visual embeddings. To address the semantic overlap inherent in this hierarchy, we introduce the \beta -Contextualized Contrastive Alignment Loss ( \beta -CAL). This objective parameterizes the trade-off between strict query-specific matching and relaxed intra-image contextualization, supporting both soft Cross-Entropy and hard Binary Cross-Entropy formulations. Through extensive experiments, we demonstrate that \beta -CLIP significantly improves dense alignment: achieving 91.8% T2I 92.3% I2T at R@1 on Urban1K and 30.9% on FG-OVD (Hard), setting state-of-the-art among methods trained without hard negatives. \beta -CLIP establishes a robust, adaptive baseline for dense vision-language correspondence. The code and models are released at this https URL.
zh

[CV-120] Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling

【速读】:该论文旨在解决多主体图像生成中“区分能力”(distinction)不足的问题,即当输入包含多个候选主体时,模型难以准确识别并生成指定主体,从而限制了其在复杂现实场景中的应用效果。解决方案的关键在于提出Scone方法,通过构建一个统一的理解-生成框架,将“组合能力”(composition)与“区分能力”(distinction)有机结合:利用理解专家作为语义桥梁,向生成专家传递关键语义信息,并通过注意力掩码机制引导生成过程保留主体身份、减少干扰;同时采用两阶段训练策略,先学习组合能力,再通过语义对齐增强区分能力。该方案显著提升了模型在真实复杂场景下的可控性和准确性。

链接: https://arxiv.org/abs/2512.12675
作者: Yuran Wang,Bohan Zeng,Chengzhuo Tong,Wenxuan Liu,Yang Shi,Xiaochen Ma,Hao Liang,Yuanxing Zhang,Wentao Zhang
机构: Peking University (北京大学); Kling Team, Kuaishou Technology (快手科技克林团队)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Code: this https URL

点击查看摘要

Abstract:Subject-driven image generation has advanced from single- to multi-subject composition, while neglecting distinction, the ability to identify and generate the correct subject when inputs contain multiple candidates. This limitation restricts effectiveness in complex, realistic visual settings. We propose Scone, a unified understanding-generation method that integrates composition and distinction. Scone enables the understanding expert to act as a semantic bridge, conveying semantic information and guiding the generation expert to preserve subject identity while minimizing interference. A two-stage training scheme first learns composition, then enhances distinction through semantic alignment and attention-based masking. We also introduce SconeEval, a benchmark for evaluating both composition and distinction across diverse scenarios. Experiments demonstrate that Scone outperforms existing open-source models in composition and distinction tasks on two benchmarks. Our model, benchmark, and training data are available at: this https URL.
zh

[CV-121] Progressive Conditioned Scale-Shift Recalibration of Self-Attention for Online Test-time Adaptation

【速读】:该论文旨在解决Transformer网络模型在目标域(target domain)上进行在线测试时适应性差的问题,即当将预训练于源域(source domain)的Transformer模型应用于新目标域时,其自注意力模块中的Query、Key和Value特征会发生显著变化,从而导致性能大幅下降。解决方案的关键在于提出一种渐进式条件缩放-平移重校准(Progressive Conditioned Scale-Shift Recalibration, PCSR)方法:通过在每一层Transformer中引入轻量级的Domain Separation Network与Factor Generator Network,实时学习领域偏移特征并生成用于自注意力机制重校准的缩放(scale)和偏移(shift)参数,实现对每个层的渐进式领域偏移分离与动态调整,从而显著提升在线测试时域适应性能,在ImageNet-C数据集上分类准确率最高提升达3.9%。

链接: https://arxiv.org/abs/2512.12673
作者: Yushun Tang,Ziqiong Liu,Jiyuan Jia,Yi Zhang,Zhihai He
机构: Southern University of Science and Technology (南方科技大学); Pengcheng Lab (鹏城实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Online test-time adaptation aims to dynamically adjust a network model in real-time based on sequential input samples during the inference stage. In this work, we find that, when applying a transformer network model to a new target domain, the Query, Key, and Value features of its self-attention module often change significantly from those in the source domain, leading to substantial performance degradation of the transformer model. To address this important issue, we propose to develop a new approach to progressively recalibrate the self-attention at each layer using a local linear transform parameterized by conditioned scale and shift factors. We consider the online model adaptation from the source domain to the target domain as a progressive domain shift separation process. At each transformer network layer, we learn a Domain Separation Network to extract the domain shift feature, which is used to predict the scale and shift parameters for self-attention recalibration using a Factor Generator Network. These two lightweight networks are adapted online during inference. Experimental results on benchmark datasets demonstrate that the proposed progressive conditioned scale-shift recalibration (PCSR) method is able to significantly improve the online test-time domain adaptation performance by a large margin of up to 3.9% in classification accuracy on the ImageNet-C dataset.
zh

[CV-122] Open-World Deepfake Attribution via Confidence-Aware Asymmetric Learning AAAI2026

【速读】:该论文旨在解决开放世界深度伪造图像溯源(Open-World DeepFake Attribution, OW-DFA)中的两个关键问题:一是由于置信度偏斜导致对新型伪造类型生成不可靠的伪标签,进而引发训练偏差;二是现有方法假设未知伪造类型数量已知,这在实际场景中并不合理。解决方案的核心是提出一种置信度感知的非对称学习(Confidence-Aware Asymmetric Learning, CAL)框架,其包含两个核心组件:置信度感知一致性正则化(Confidence-Aware Consistency Regularization, CCR)与非对称置信度强化(Asymmetric Confidence Reinforcement, ACR)。CCR通过动态缩放样本损失来缓解伪标签偏差,逐步将训练焦点从高置信度样本转向低置信度样本;ACR则通过选择性地对高置信度样本进行分类校准,利用已知与未知类别的置信度差距分别调整两类样本的置信度估计,形成相互增强的训练循环。此外,论文还引入动态原型剪枝(Dynamic Prototype Pruning, DPP)策略,实现从粗到细自动估计未知伪造类型数量,从而摆脱对先验知识的依赖,提升方法在真实开放世界场景下的可扩展性。

链接: https://arxiv.org/abs/2512.12667
作者: Haiyang Zheng,Nan Pu,Wenjing Li,Teng Long,Nicu Sebe,Zhun Zhong
机构: 1: University of Amsterdam (阿姆斯特丹大学); 2: University of Hong Kong (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI2026

点击查看摘要

Abstract:The proliferation of synthetic facial imagery has intensified the need for robust Open-World DeepFake Attribution (OW-DFA), which aims to attribute both known and unknown forgeries using labeled data for known types and unlabeled data containing a mixture of known and novel types. However, existing OW-DFA methods face two critical limitations: 1) A confidence skew that leads to unreliable pseudo-labels for novel forgeries, resulting in biased training. 2) An unrealistic assumption that the number of unknown forgery types is known a priori. To address these challenges, we propose a Confidence-Aware Asymmetric Learning (CAL) framework, which adaptively balances model confidence across known and novel forgery types. CAL mainly consists of two components: Confidence-Aware Consistency Regularization (CCR) and Asymmetric Confidence Reinforcement (ACR). CCR mitigates pseudo-label bias by dynamically scaling sample losses based on normalized confidence, gradually shifting the training focus from high- to low-confidence samples. ACR complements this by separately calibrating confidence for known and novel classes through selective learning on high-confidence samples, guided by their confidence gap. Together, CCR and ACR form a mutually reinforcing loop that significantly improves the model’s OW-DFA performance. Moreover, we introduce a Dynamic Prototype Pruning (DPP) strategy that automatically estimates the number of novel forgery types in a coarse-to-fine manner, removing the need for unrealistic prior assumptions and enhancing the scalability of our methods to real-world OW-DFA scenarios. Extensive experiments on the standard OW-DFA benchmark and a newly extended benchmark incorporating advanced manipulations demonstrate that CAL consistently outperforms previous methods, achieving new state-of-the-art performance on both known and novel forgery attribution.
zh

[CV-123] InteracTalker: Prompt-Based Human-Object Interaction with Co-Speech Gesture Generation

【速读】:该论文旨在解决当前生成式AI在交互式数字体验中难以同时实现自然响应语音指令与物理物体互动的问题,现有方法通常将语音驱动的手势生成与物体交互分别处理,缺乏整合性数据集和统一建模框架。其解决方案的关键在于提出InteracTalker框架,通过多阶段训练学习统一的运动、语音与提示嵌入空间,并引入通用运动适配模块(Generalized Motion Adaptation Module)实现独立训练后动态融合;同时设计自适应融合策略,在扩散采样过程中动态调整异构条件信号的权重,从而有效平衡语音与物体交互信号的不平衡问题,最终实现高保真、对象感知的全身动作生成。

链接: https://arxiv.org/abs/2512.12664
作者: Sreehari Rajan,Kunal Bhosikar,Charu Sharma
机构: IIIT Hyderabad (印度信息技术研究所海得拉巴分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Generating realistic human motions that naturally respond to both spoken language and physical objects is crucial for interactive digital experiences. Current methods, however, address speech-driven gestures or object interactions independently, limiting real-world applicability due to a lack of integrated, comprehensive datasets. To overcome this, we introduce InteracTalker, a novel framework that seamlessly integrates prompt-based object-aware interactions with co-speech gesture generation. We achieve this by employing a multi-stage training process to learn a unified motion, speech, and prompt embedding space. To support this, we curate a rich human-object interaction dataset, formed by augmenting an existing text-to-motion dataset with detailed object interaction annotations. Our framework utilizes a Generalized Motion Adaptation Module that enables independent training, adapting to the corresponding motion condition, which is then dynamically combined during inference. To address the imbalance between heterogeneous conditioning signals, we propose an adaptive fusion strategy, which dynamically reweights the conditioning signals during diffusion sampling. InteracTalker successfully unifies these previously separate tasks, outperforming prior methods in both co-speech gesture generation and object-interaction synthesis, outperforming gesture-focused diffusion methods, yielding highly realistic, object-aware full-body motions with enhanced realism, flexibility, and control.
zh

[CV-124] PerNodeDrop: A Method Balancing Specialized Subnets and Regularization in Deep Neural Networks

【速读】:该论文旨在解决深度神经网络在训练过程中因神经元过度共适应(co-adaptation)而导致的过拟合问题,这种共适应虽能捕捉复杂特征交互,但也强化了虚假且非泛化的模式,从而导致训练性能与验证性能之间存在较大差距。解决方案的关键在于提出一种轻量级随机正则化方法——PerNodeDrop,其通过在样本层面和节点层面施加非均匀的扰动,打破传统噪声正则器(如Dropout和DropConnect)在层或批次层面的均匀性,使每个神经元能够体验输入特定的变异性,从而保留有益的共适应关系并抑制有害的共适应,最终缩小训练与验证性能差距,提升模型在未见数据上的可靠性。

链接: https://arxiv.org/abs/2512.12663
作者: Gelesh G Omathil,Sreeja CS
机构: Christ University (克里斯特大学); Chemophilic Data Sage
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep neural networks possess strong representational capacity yet remain vulnerable to overfitting, primarily because neurons tend to co-adapt in ways that, while capturing complex and fine-grained feature interactions, also reinforce spurious and non-generalizable patterns that inflate training performance but reduce reliability on unseen data. Noise-based regularizers such as Dropout and DropConnect address this issue by injecting stochastic perturbations during training, but the noise they apply is typically uniform across a layer or across a batch of samples, which can suppress both harmful and beneficial co-adaptation. This work introduces PerNodeDrop, a lightweight stochastic regularization method. It applies per-sample, per-node perturbations to break the uniformity of the noise injected by existing techniques, thereby allowing each node to experience input-specific variability. Hence, PerNodeDrop preserves useful co-adaptation while applying regularization. This narrows the gap between training and validation performance and improves reliability on unseen data, as evident from the experiments. Although superficially similar to DropConnect, PerNodeDrop operates at the sample level. It drops weights at the sample level, not the batch level. An expected-loss analysis formalizes how its perturbations attenuate excessive co-adaptation while retaining predictive interactions. Empirical evaluations on vision, text, and audio benchmarks indicate improved generalization relative to the standard noise-based regularizer. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2512.12663 [cs.LG] (or arXiv:2512.12663v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2512.12663 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-125] Anatomy-Guided Representation Learning Using a Transformer-Based Network for Thyroid Nodule Segmentation in Ultrasound Images

【速读】:该论文旨在解决甲状腺结节(thyroid nodule)超声图像中分割精度不足的问题,主要挑战包括结节与周围组织边界模糊、结节尺寸差异大以及标注数据稀缺,导致现有深度学习模型难以有效利用上下文信息并泛化到不同病例。其解决方案的关键在于提出一种半监督多任务Transformer网络(SSMT-Net),该方法在初始无监督阶段利用未标注数据增强基于Transformer的编码器特征提取能力,并在监督阶段联合优化结节分割、腺体分割和结节尺寸估计任务,从而融合局部与全局上下文特征,显著提升分割准确性和鲁棒性。

链接: https://arxiv.org/abs/2512.12662
作者: Muhammad Umar Farooq,Abd Ur Rehman,Azka Rehman,Muhammad Usman,Dong-Kyu Chae,Junaid Qadir
机构: Hanyang University (汉阳大学); The University of Alabama (阿拉巴马大学); Seoul National University (首尔国立大学); Stanford University (斯坦福大学); Qatar University (卡塔尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate thyroid nodule segmentation in ultrasound images is critical for diagnosis and treatment planning. However, ambiguous boundaries between nodules and surrounding tissues, size variations, and the scarcity of annotated ultrasound data pose significant challenges for automated segmentation. Existing deep learning models struggle to incorporate contextual information from the thyroid gland and generalize effectively across diverse cases. To address these challenges, we propose SSMT-Net, a Semi-Supervised Multi-Task Transformer-based Network that leverages unlabeled data to enhance Transformer-centric encoder feature extraction capability in an initial unsupervised phase. In the supervised phase, the model jointly optimizes nodule segmentation, gland segmentation, and nodule size estimation, integrating both local and global contextual features. Extensive evaluations on the TN3K and DDTI datasets demonstrate that SSMT-Net outperforms state-of-the-art methods, with higher accuracy and robustness, indicating its potential for real-world clinical applications.
zh

[CV-126] CogDoc: Towards Unified thinking in Documents

【速读】:该论文旨在解决当前文档推理范式中普遍存在的一对矛盾:可扩展性(处理长文本上下文的能力)与保真度(捕捉细粒度、多模态细节的能力)之间的权衡问题。解决方案的关键在于提出一种统一的“粗粒度到细粒度”思维框架——CogDoc,其灵感来源于人类认知过程:首先通过低分辨率的“快速阅读”阶段实现高效的信息定位,再通过高分辨率的“专注思考”阶段进行深度推理。该框架在后训练策略上进一步优化,发现直接强化学习(Direct Reinforcement Learning, RL)优于以监督微调(Supervised Fine-Tuning, SFT)初始化的RL方法,因为后者存在“策略冲突”问题;实证结果表明,7B参数规模的模型在视觉丰富的文档基准测试中表现优于更大规模的专有模型(如GPT-4o),验证了该方案的有效性。

链接: https://arxiv.org/abs/2512.12658
作者: Qixin Xu,Haozhe Wang,Che Liu,Fangzhen Lin,Wenhu Chen
机构: Tsinghua University (清华大学); The Hong Kong University of Science and Technology (香港科技大学); University of Waterloo (滑铁卢大学); Imperial College London (帝国理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Current document reasoning paradigms are constrained by a fundamental trade-off between scalability (processing long-context documents) and fidelity (capturing fine-grained, multimodal details). To bridge this gap, we propose CogDoc, a unified coarse-to-fine thinking framework that mimics human cognitive processes: a low-resolution “Fast Reading” phase for scalable information localization,followed by a high-resolution “Focused Thinking” phase for deep reasoning. We conduct a rigorous investigation into post-training strategies for the unified thinking framework, demonstrating that a Direct Reinforcement Learning (RL) approach outperforms RL with Supervised Fine-Tuning (SFT) initialization. Specifically, we find that direct RL avoids the “policy conflict” observed in SFT. Empirically, our 7B model achieves state-of-the-art performance within its parameter class, notably surpassing significantly larger proprietary models (e.g., GPT-4o) on challenging, visually rich document benchmarks.
zh

[CV-127] Cross-modal Fundus Image Registration under Large FoV Disparity

【速读】:该论文旨在解决跨模态眼底图像配准(cross-modal fundus image registration, CMFIR)中因视场(Field-of-View, FoV)差异较大而导致现有方法失效的问题。传统方法通常假设模态间FoV差异较小,而在实际临床场景中,如光学相干断层扫描血管成像(OCTA)与广角彩色眼底照片(wfCFP)之间的配准常面临显著的FoV不匹配问题。本文提出的CARe方法关键在于两个创新:一是通过“裁剪(Crop)”操作利用视网膜生理结构特性,从目标wfCFP图像中提取出与源OCTA图像FoV大致对齐的子图,从而将原本针对小FoV差异设计的方法适配于大FoV场景;二是引入基于双拟合策略的“对齐(Alignment)模块”,依次采用经典RANSAC算法和多项式坐标拟合优化空间变换,提升配准精度。实验在60对OCTA-wfCFP图像上验证了该方案的有效性。

链接: https://arxiv.org/abs/2512.12657
作者: Hongyang Li,Junyi Tao,Qijie Wei,Ningzhi Yang,Meng Wang,Weihong Yu,Xirong Li
机构: Renmin University of China (中国人民大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted as a regular paper at MMM 2026

点击查看摘要

Abstract:Previous work on cross-modal fundus image registration (CMFIR) assumes small cross-modal Field-of-View (FoV) disparity. By contrast, this paper is targeted at a more challenging scenario with large FoV disparity, to which directly applying current methods fails. We propose Crop and Alignment for cross-modal fundus image Registration(CARe), a very simple yet effective method. Specifically, given an OCTA with smaller FoV as a source image and a wide-field color fundus photograph (wfCFP) as a target image, our Crop operation exploits the physiological structure of the retina to crop from the target image a sub-image with its FoV roughly aligned with that of the source. This operation allows us to re-purpose the previous small-FoV-disparity oriented methods for subsequent image registration. Moreover, we improve spatial transformation by a double-fitting based Alignment module that utilizes the classical RANSAC algorithm and polynomial-based coordinate fitting in a sequential manner. Extensive experiments on a newly developed test set of 60 OCTA-wfCFP pairs verify the viability of CARe for CMFIR.
zh

[CV-128] DiG: Differential Grounding for Enhancing Fine-Grained Perception in Multimodal Large Language Model

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在细粒度视觉感知和精确空间推理能力上的局限性。其核心解决方案是提出了一种名为DiG(Differential Grounding,差异定位)的代理任务框架,通过让模型识别并定位相似图像对之间的所有差异来学习细粒度感知能力,且无需预先知道差异的数量。关键创新在于:1)设计了一个基于自动化3D渲染的数据生成管道,可生成高质量、可控差异的图像对以支持大规模训练;2)引入课程学习(curriculum learning)策略,从单差异逐步过渡到多差异,缓解差异信号稀疏问题,从而实现稳定优化。实验表明,该方法显著提升了多种视觉感知基准的表现,并有效迁移到RefCOCO、RefCOCO+、RefCOCOg等下游任务中,验证了其在增强MLLM细粒度视觉推理方面的有效性与鲁棒性。

链接: https://arxiv.org/abs/2512.12633
作者: Zhou Tao,Shida Wang,Yongxiang Hua,Haoyu Cao,Linli Xu
机构: University of Science and Technology of China (中国科学技术大学); State Key Laboratory of Cognitive Intelligence (认知智能国家重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models have achieved impressive performance on a variety of vision-language tasks, yet their fine-grained visual perception and precise spatial reasoning remain limited. In this work, we introduce DiG (Differential Grounding), a novel proxy task framework where MLLMs learn fine-grained perception by identifying and localizing all differences between similar image pairs without prior knowledge of their number. To support scalable training, we develop an automated 3D rendering-based data generation pipeline that produces high-quality paired images with fully controllable discrepancies. To address the sparsity of difference signals, we further employ curriculum learning that progressively increases complexity from single to multiple differences, enabling stable optimization. Extensive experiments demonstrate that DiG significantly improves model performance across a variety of visual perception benchmarks and that the learned fine-grained perception skills transfer effectively to standard downstream tasks, including RefCOCO, RefCOCO+, RefCOCOg, and general multimodal perception benchmarks. Our results highlight differential grounding as a scalable and robust approach for advancing fine-grained visual reasoning in MLLMs.
zh

[CV-129] D3D-VLP: Dynamic 3D Vision-Language-Planning Model for Embodied Grounding and Navigation

【速读】:该论文旨在解决具身智能代理(Embodied Agents)中端到端模型缺乏可解释性和显式3D推理能力,而模块化系统又忽视组件间交叉依赖与协同效应的困境。其核心解决方案是提出动态3D视觉-语言-规划模型(Dynamic 3D Vision-Language-Planning Model, D3D-VLP),关键创新在于:1)引入动态3D思维链(3D Chain-of-Thought, 3D CoT),将规划、定位、导航与问答统一于单一3D视觉语言模型(3D-VLM)和思维链流水线中;2)设计碎片化监督协同学习策略(Synergistic Learning from Fragmented Supervision, SLFS),通过掩码自回归损失函数从大规模且部分标注的混合数据中学习,使不同思维链组件能够相互强化并隐式监督彼此。

链接: https://arxiv.org/abs/2512.12622
作者: Zihan Wang,Seungjun Lee,Guangzhao Dai,Gim Hee Lee
机构: National University of Singapore (新加坡国立大学); Nanjing University of Science and Technology (南京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Embodied agents face a critical dilemma that end-to-end models lack interpretability and explicit 3D reasoning, while modular systems ignore cross-component interdependencies and synergies. To bridge this gap, we propose the Dynamic 3D Vision-Language-Planning Model (D3D-VLP). Our model introduces two key innovations: 1) A Dynamic 3D Chain-of-Thought (3D CoT) that unifies planning, grounding, navigation, and question answering within a single 3D-VLM and CoT pipeline; 2) A Synergistic Learning from Fragmented Supervision (SLFS) strategy, which uses a masked autoregressive loss to learn from massive and partially-annotated hybrid data. This allows different CoT components to mutually reinforce and implicitly supervise each other. To this end, we construct a large-scale dataset with 10M hybrid samples from 5K real scans and 20K synthetic scenes that are compatible with online learning methods such as RL and DAgger. Our D3D-VLP achieves state-of-the-art results on multiple benchmarks, including Vision-and-Language Navigation (R2R-CE, REVERIE-CE, NavRAG-CE), Object-goal Navigation (HM3D-OVON), and Task-oriented Sequential Grounding and Navigation (SG3D). Real-world mobile manipulation experiments further validate the effectiveness.
zh

[CV-130] Patch-wise Retrieval: A Bag of Practical Techniques for Instance-level Matching WACV2026

【速读】:该论文旨在解决实例级图像检索(instance-level image retrieval)问题,即在查询图像与数据库图像存在尺度、位置或外观差异的情况下,准确找到包含相同对象的图像。其解决方案的关键在于提出了一种名为Patchify的patch-wise检索框架,该框架通过将每张数据库图像划分为少量结构化patch,并利用局部特征与全局查询描述符进行匹配,从而实现高精度且具有空间定位意义的检索。该方法无需微调即可兼顾性能、可扩展性和可解释性,同时引入LocScore这一定位感知指标以评估检索结果的空间准确性,为改进检索行为提供了诊断工具。

链接: https://arxiv.org/abs/2512.12610
作者: Wonseok Choi,Sohwi Lim,Nam Hyeon-Woo,Moon Ye-Bin,Dong-Ju Jeong,Jinyoung Hwang,Tae-Hyun Oh
机构: POSTECH(浦项科技大学); KAIST(韩国科学技术院); Samsung Research(三星研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注: WACV 2026

点击查看摘要

Abstract:Instance-level image retrieval aims to find images containing the same object as a given query, despite variations in size, position, or appearance. To address this challenging task, we propose Patchify, a simple yet effective patch-wise retrieval framework that offers high performance, scalability, and interpretability without requiring fine-tuning. Patchify divides each database image into a small number of structured patches and performs retrieval by comparing these local features with a global query descriptor, enabling accurate and spatially grounded matching. To assess not just retrieval accuracy but also spatial correctness, we introduce LocScore, a localization-aware metric that quantifies whether the retrieved region aligns with the target object. This makes LocScore a valuable diagnostic tool for understanding and improving retrieval behavior. We conduct extensive experiments across multiple benchmarks, backbones, and region selection strategies, showing that Patchify outperforms global methods and complements state-of-the-art reranking pipelines. Furthermore, we apply Product Quantization for efficient large-scale retrieval and highlight the importance of using informative features during compression, which significantly boosts performance. Project website: this https URL
zh

[CV-131] No Cache Left Idle: Accelerating diffusion model via Extreme-slimming Caching

【速读】:该论文旨在解决扩散模型(Diffusion Models)在生成过程中计算开销过大的问题,尤其是在推理阶段,其性能受限于采样步数、模型深度和序列长度的增加。传统特征缓存方法虽能利用相邻时间步间高度相似的特征来加速,但存在显著的权衡:过度依赖时间步级重用虽可大幅提升速度,却容易破坏生成质量;而块(block)或标记(token)级重用虽然更安全,但加速效果有限。本文提出的X-Slim(eXtreme-Slimming Caching)是一种无需训练的基于缓存的加速框架,首次统一地利用了跨时间步、结构(块)和空间(标记)三个维度的可缓存冗余。其核心创新在于引入了一个双阈值控制器(dual-threshold controller),将缓存策略设计为“先推动再精修”的过程:首先在时间步级别尽可能重用直至达到预警阈值,随后切换至轻量级的块与标记级刷新以优化剩余冗余,并在误差累积超过临界阈值时触发完整推理以重置误差。每个层级均通过上下文感知指标决定缓存时机与位置,从而在保证生成质量的前提下实现显著加速。

链接: https://arxiv.org/abs/2512.12604
作者: Tingyan Wen,Haoyu Li,Yihuang Chen,Xing Zhou,Lifei Zhu,Xueqian Wang
机构: Tsinghua University (清华大学); Central Media Technology Institute, Huawei (华为中央媒体技术研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL Code: this https URL

点击查看摘要

Abstract:Diffusion models achieve remarkable generative quality, but computational overhead scales with step count, model depth, and sequence length. Feature caching is effective since adjacent timesteps yield highly similar features. However, an inherent trade-off remains: aggressive timestep reuse offers large speedups but can easily cross the critical line, hurting fidelity, while block- or token-level reuse is safer but yields limited computational savings. We present X-Slim (eXtreme-Slimming Caching), a training-free, cache-based accelerator that, to our knowledge, is the first unified framework to exploit cacheable redundancy across timesteps, structure (blocks), and space (tokens). Rather than simply mixing levels, X-Slim introduces a dual-threshold controller that turns caching into a push-then-polish process: it first pushes reuse at the timestep level up to an early-warning line, then switches to lightweight block- and token-level refresh to polish the remaining redundancy, and triggers full inference once the critical line is crossed to reset accumulated error. At each level, context-aware indicators decide when and where to cache. Across diverse tasks, X-Slim advances the speed-quality frontier. On FLUX.1-dev and HunyuanVideo, it reduces latency by up to 4.97x and 3.52x with minimal perceptual loss. On DiT-XL/2, it reaches 3.13x acceleration and improves FID by 2.42 over prior methods.
zh

[CV-132] Geometry-Aware Scene-Consistent Image Generation

【速读】:该论文旨在解决几何感知场景一致的图像生成问题(geometry-aware scene-consistent image generation),即在给定参考场景图像和描述某实体及其与场景空间关系的文本条件时,生成既保持参考场景物理环境一致性又准确遵循文本指令的输出图像。现有方法难以平衡场景保真度与提示响应能力:要么高度保留场景但对提示不敏感,要么优先满足提示要求却牺牲场景一致性。解决方案的关键在于两项创新:(i) 构建多样且几何约束驱动的训练数据生成流程,确保训练样本具备真实场景结构;(ii) 提出一种几何引导的注意力损失(geometry-guided attention loss),利用多视角线索正则化模型的空间推理能力,从而提升生成图像的几何合理性与文本-图像一致性。

链接: https://arxiv.org/abs/2512.12598
作者: Cong Xie,Che Wang,Yan Zhang,Zheng Pan,Han Zou,Zhenpeng Zhan
机构: Baidu Inc(百度公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We study geometry-aware scene-consistent image generation: given a reference scene image and a text condition specifying an entity to be generated in the scene and its spatial relation to the scene, the goal is to synthesize an output image that preserves the same physical environment as the reference scene while correctly generating the entity according to the spatial relation described in the text. Existing methods struggle to balance scene preservation with prompt adherence: they either replicate the scene with high fidelity but poor responsiveness to the prompt, or prioritize prompt compliance at the expense of scene consistency. To resolve this trade-off, we introduce two key contributions: (i) a scene-consistent data construction pipeline that generates diverse, geometrically-grounded training pairs, and (ii) a novel geometry-guided attention loss that leverages cross-view cues to regularize the model’s spatial reasoning. Experiments on our scene-consistent benchmark show that our approach achieves better scene alignment and text-image consistency than state-of-the-art baselines, according to both automatic metrics and human preference studies. Our method produces geometrically coherent images with diverse compositions that remain faithful to the textual instructions and the underlying scene structure.
zh

[CV-133] Content-Aware Ad Banner Layout Generation with Two-Stage Chain-of-Thought in Vision Language Models

【速读】:该论文旨在解决传统图像广告版式生成方法中因仅依赖显著性映射(saliency mapping)而无法充分考虑背景图像细节组成与语义内容的问题。其解决方案的关键在于利用视觉语言模型(Vision-Language Model, VLM)对背景图像中的产品及其他元素进行识别,并基于此生成结构化的文本“布局规划”,进而通过HTML格式代码实现最终版式的渲染,从而显式地融合图像语义信息以提升广告版式的质量。

链接: https://arxiv.org/abs/2512.12596
作者: Kei Yoshitake,Kento Hosono,Ken Kobayashi,Kazuhide Nakata
机构: Institute of Science Tokyo (东京科学研究所); HAKUHODO Technologies Inc. (博报堂技术公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this paper, we propose a method for generating layouts for image-based advertisements by leveraging a Vision-Language Model (VLM). Conventional advertisement layout techniques have predominantly relied on saliency mapping to detect salient regions within a background image, but such approaches often fail to fully account for the image’s detailed composition and semantic content. To overcome this limitation, our method harnesses a VLM to recognize the products and other elements depicted in the background and to inform the placement of text and logos. The proposed layout-generation pipeline consists of two steps. In the first step, the VLM analyzes the image to identify object types and their spatial relationships, then produces a text-based “placement plan” based on this analysis. In the second step, that plan is rendered into the final layout by generating HTML-format code. We validated the effectiveness of our approach through evaluation experiments, conducting both quantitative and qualitative comparisons against existing methods. The results demonstrate that by explicitly considering the background image’s content, our method produces noticeably higher-quality advertisement layouts.
zh

[CV-134] Vision-Enhanced Large Language Models for High-Resolution Image Synthesis and Multimodal Data Interpretation

【速读】:该论文旨在解决高分辨率图像合成与多模态数据解释中的挑战,特别是在生成质量、计算效率及跨模态一致性方面的问题。其解决方案的关键在于提出一种融合视觉增强型大语言模型(Vision-Enhanced Large Language Models, VLLMs)与先进Transformer架构的框架,核心创新包括:采用修正流(rectified flow)机制实现噪声与数据间的线性路径映射,提升生成效率与质量;引入双向标记化策略(bidirectional tokenization strategy)以无缝整合文本、图像和视频模态输入,促进多模态统一表征;并通过嵌入时空特征与混合文本-图像序列建模方法,实现高保真图像合成与一致的多模态表示。此外,结合噪声感知学习算法优化模型对噪声分布差异的鲁棒性,显著提升了在不同输入条件下的生成性能。

链接: https://arxiv.org/abs/2512.12595
作者: Karthikeya KV
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This research introduces a transformative framework for integrating Vision-Enhanced Large Language Models (LLMs) with advanced transformer-based architectures to tackle challenges in high-resolution image synthesis and multimodal data interpretation. The proposed model incorporates a rectified flow mechanism that connects noise and data with linear paths, enabling efficient and high-quality generation. A bidirectional tokenization strategy is employed to seamlessly merge inputs from text, image, and video modalities, fostering a unified understanding across diverse data types. By embedding spatial-temporal features and leveraging a hybrid text-image sequence modeling approach, the framework achieves unparalleled fidelity in synthesized images and coherent multimodal representations. The architecture is optimized with a noise-aware learning algorithm, addressing discrepancies in noisy data distributions and improving generative performance under varying input conditions. Rigorous evaluations on benchmark datasets demonstrate a 25% increase in image resolution clarity and a 20% reduction in computational requirements compared to diffusion-based methods. Furthermore, the model exhibits robust scalability and adaptability, showcasing its potential in applications like autonomous systems, creative content generation, and advanced video analysis. This work underscores the role of vision-centric LLMs in redefining capabilities in computer vision and multimodal artificial intelligence.
zh

[CV-135] Automatic Wire-Harness Color Sequence Detector

【速读】:该论文旨在解决电子制造服务(Electronics Manufacturing Services, EMS)行业中线束(wire harness)检测过程依赖人工、易出错的问题。解决方案的关键在于构建了一个半自动化机器视觉系统,其核心包括:集成五台工业标准CMOS相机的模块化机械结构,以及基于HSV与RGB颜色域值对比的颜色序列分类算法,用于验证线束的正确布线位置、连接器极性及颜色序列;此外,系统支持用户通过至少五个参考样本进行训练并存储模型以复用,从而实现对线性与圆形线束配置的高精度检测。实际部署于GPV Lanka Pvt. Ltd.后,该系统达到了100%检测准确率,并将检测时间减少44%,显著提升了检测效率与可靠性。

链接: https://arxiv.org/abs/2512.12590
作者: Indiwara Nanayakkara,Dehan Jayawickrama,Mervyn Parakrama B. Ekanayake
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 6 pages, 20 figures, IEEE ICIIS 2025 Conference - Accepted

点击查看摘要

Abstract:Wire harness inspection process remains a labor-intensive process prone to errors in the modern Electronics Manufacturing Services (EMS) industry. This paper introduces a semiautomated machine vision system capable of verifying correct wire positioning, correctness of the connector polarity and correctness of color sequences for both linear and circular wire harness configurations. Five industrial standard CMOS cameras are integrated into a modularized mechanical framework in the physical structure of the solution and a HSV and RGB color domain value comparison based color sequence classifier is used in the operation. For each harness batch, a user can train the system using at least five reference samples; the trained file is stored and reused for similar harness types. The Solution is deployed at GPV Lanka Pvt. Ltd. (Fig. 2) and the system achieved 100% detection accuracy and reduced inspection time by 44% compared to manual methods. Additional features include user management, adjustable lighting, session data storage, and secure login. Results of this product usage in the real world situation demonstrate that this approach delivers reliable and efficient inspection capabilities.
zh

[CV-136] StegaVAR: Privacy-Preserving Video Action Recognition via Steganographic Domain Analysis AAAI2026

【速读】:该论文旨在解决视频动作识别(Video Action Recognition, VAR)中隐私泄露的关键问题,特别是现有基于匿名化的方法存在隐蔽性低(易引起攻击者注意)和时空特征破坏(影响VAR准确性)两大缺陷。解决方案的关键在于提出StegaVAR框架,首次将动作视频嵌入普通载体视频中,并在隐写域内直接执行VAR任务,从而完整保留隐藏视频的时空信息,同时通过载体视频的自然外观实现传输隐蔽性。该框架引入Secret Spatio-Temporal Promotion (STeP) 和 Cross-Band Difference Attention (CroDA) 两个核心模块:STeP利用秘密视频指导隐写域中的时空特征提取,CroDA通过捕捉跨频带语义差异抑制载体干扰,显著提升了隐写域分析的准确性和隐私保护能力。

链接: https://arxiv.org/abs/2512.12586
作者: Lixin Chen,Chaomeng Chen,Jiale Zhou,Zhijian Wu,Xun Lin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 10 figures. This is the extended version of the paper accepted at AAAI 2026, including related works and appendix

点击查看摘要

Abstract:Despite the rapid progress of deep learning in video action recognition (VAR) in recent years, privacy leakage in videos remains a critical concern. Current state-of-the-art privacy-preserving methods often rely on anonymization. These methods suffer from (1) low concealment, where producing visually distorted videos that attract attackers’ attention during transmission, and (2) spatiotemporal disruption, where degrading essential spatiotemporal features for accurate VAR. To address these issues, we propose StegaVAR, a novel framework that embeds action videos into ordinary cover videos and directly performs VAR in the steganographic domain for the first time. Throughout both data transmission and action analysis, the spatiotemporal information of hidden secret video remains complete, while the natural appearance of cover videos ensures the concealment of transmission. Considering the difficulty of steganographic domain analysis, we propose Secret Spatio-Temporal Promotion (STeP) and Cross-Band Difference Attention (CroDA) for analysis within the steganographic domain. STeP uses the secret video to guide spatiotemporal feature extraction in the steganographic domain during training. CroDA suppresses cover interference by capturing cross-band semantic differences. Experiments demonstrate that StegaVAR achieves superior VAR and privacy-preserving performance on widely used datasets. Moreover, our framework is effective for multiple steganographic models.
zh

[CV-137] From Tokens to Photons: Test-Time Physical Prompting for Vison-Language Models

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在从网络图像向传感器驱动的物理环境迁移时的泛化能力不足问题。现有方法主要依赖于数字域的测试时适应(Test-Time Adaptation, TTA),难以应对真实场景中光照、视角等动态变化带来的挑战。解决方案的关键在于提出多视角物理提示(Multi-View Physical-prompt, MVP)框架,其核心创新是将相机曝光三角参数(ISO、快门速度和光圈)作为“物理提示”,在推理阶段通过选择最优传感器设置获取多个物理视图,并基于源域相似性筛选出高保真视图,再结合轻量级数字增强与低熵过滤机制进行预测聚合,最终采用零温度Softmax(即硬投票)实现鲁棒决策。该方法无需梯度更新或模型修改,显著提升了VLM在复杂物理环境下的性能,最高可提升25.6个百分点(pp),验证了测量时刻控制(measurement-time control)对增强模型鲁棒性的有效性。

链接: https://arxiv.org/abs/2512.12571
作者: Boyeong Im,Wooseok Lee,Yoojin Kwon,Hyung-Sin Kim
机构: Seoul National University (首尔国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:To extend the application of vision-language models (VLMs) from web images to sensor-mediated physical environments, we propose Multi-View Physical-prompt for Test-Time Adaptation (MVP), a forward-only framework that moves test-time adaptation (TTA) from tokens to photons by treating the camera exposure triangle–ISO, shutter speed, and aperture–as physical prompts. At inference, MVP acquires a library of physical views per scene, selects the top-k sensor settings using a source-affinity score, evaluates each retained view under lightweight digital augmentations, filters the lowest-entropy subset of augmented views, and aggregates predictions with Zero-temperature softmax (i.e., hard voting). This selection-then-vote design is simple, calibration-friendly, and requires no gradients or model modifications. On ImageNet-ES and ImageNet-ES-Diverse, MVP consistently outperforms digital-only TTA on single Auto-Exposure captures, by up to 25.6 percentage points (pp), and delivers up to 3.4 pp additional gains over pipelines that combine conventional sensor control with TTA. MVP remains effective under reduced parameter candidate sets that lower capture latency, demonstrating practicality. These results support the main claim that, beyond post-capture prompting, measurement-time control–selecting and combining real physical views–substantially improves robustness for VLMs.
zh

[CV-138] StreamingAssistant: Efficient Visual Token Pruning for Accelerating Online Video Understanding

【速读】:该论文旨在解决将多模态大语言模型(Multimodal Large Language Models, MLLMs)应用于在线视频理解时面临的高GPU内存占用和计算延迟问题,其根源在于视频帧数量庞大导致的上下文长度过长。解决方案的关键在于提出一种基于冗余度量的令牌剪枝(token pruning)方法,核心创新是引入一种新的冗余指标——空间邻近令牌最大相似性(Maximum Similarity to Spatially Adjacent Video Tokens, MSSAVT),该指标同时考虑了令牌间的相似性和空间位置关系;此外,设计了一种掩码剪枝策略以打破剪枝与冗余之间的双向依赖,确保仅互不相邻的令牌被剪枝,并结合现有的时序冗余剪枝方法消除视频模态中的时间冗余。实验表明,该方法在多个在线和离线视频理解基准上显著提升准确率(最高达4%),且剪枝延迟可忽略不计(<1ms)。

链接: https://arxiv.org/abs/2512.12560
作者: Xinqi Jin,Hanxun Yu,Bohan Yu,Kebin Liu,Jian Liu,Keda Tao,Yixuan Pei,Huan Wang,Fan Dang,Jiangchuan Liu,Weiqiang Wang
机构: Tsinghua University (清华大学); Zhejiang University (浙江大学); Ant Group; Westlake University (西湖大学); Beijing Jiaotong University (北京交通大学); Simon Fraser University (西蒙弗雷泽大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Online video understanding is essential for applications like public surveillance and AI glasses. However, applying Multimodal Large Language Models (MLLMs) to this domain is challenging due to the large number of video frames, resulting in high GPU memory usage and computational latency. To address these challenges, we propose token pruning as a means to reduce context length while retaining critical information. Specifically, we introduce a novel redundancy metric, Maximum Similarity to Spatially Adjacent Video Tokens (MSSAVT), which accounts for both token similarity and spatial position. To mitigate the bidirectional dependency between pruning and redundancy, we further design a masked pruning strategy that ensures only mutually unadjacent tokens are pruned. We also integrate an existing temporal redundancy-based pruning method to eliminate temporal redundancy of the video modality. Experimental results on multiple online and offline video understanding benchmarks demonstrate that our method significantly improves the accuracy (i.e., by 4% at most) while incurring a negligible pruning latency (i.e., less than 1ms). Our full implementation will be made publicly available.
zh

[CV-139] Supervised Contrastive Frame Aggregation for Video Representation Learning

【速读】:该论文旨在解决视频表示学习中如何有效利用时间全局上下文并减少计算复杂度的问题。现有方法往往依赖复杂的视频Transformer模型或数据增强策略,导致资源消耗高且难以捕捉长时依赖关系。其解决方案的关键在于提出一种监督对比学习框架(Supervised Contrastive Learning Framework),通过将视频中的多帧按空间排列聚合为单张图像输入,从而兼容预训练的卷积神经网络(如ResNet50),避免了视频Transformer带来的计算开销;同时设计了一种基于标签一致性的对比损失函数,以同一视频的不同时间采样视图作为正样本对,自然生成多样化的正例并保留全局语义信息,显著提升了分类准确率并在Penn Action和HMDB51上分别达到76%和48%的准确率,优于ViVIT等基线方法。

链接: https://arxiv.org/abs/2512.12549
作者: Shaif Chowdhury,Mushfika Rahman,Greg Hamerly
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 12 pages

点击查看摘要

Abstract:We propose a supervised contrastive learning framework for video representation learning that leverages temporally global context. We introduce a video to image aggregation strategy that spatially arranges multiple frames from each video into a single input image. This design enables the use of pre trained convolutional neural network backbones such as ResNet50 and avoids the computational overhead of complex video transformer models. We then design a contrastive learning objective that directly compares pairwise projections generated by the model. Positive pairs are defined as projections from videos sharing the same label while all other projections are treated as negatives. Multiple natural views of the same video are created using different temporal frame samplings from the same underlying video. Rather than relying on data augmentation these frame level variations produce diverse positive samples with global context and reduce overfitting. Experiments on the Penn Action and HMDB51 datasets demonstrate that the proposed method outperforms existing approaches in classification accuracy while requiring fewer computational resources. The proposed Supervised Contrastive Frame Aggregation method learns effective video representations in both supervised and self supervised settings and supports video based tasks such as classification and captioning. The method achieves seventy six percent classification accuracy on Penn Action compared to forty three percent achieved by ViVIT and forty eight percent accuracy on HMDB51 compared to thirty seven percent achieved by ViVIT.
zh

[CV-140] Anatomy Guided Coronary Artery Segmentation from CCTA Using Spatial Frequency Joint Modeling

【速读】:该论文旨在解决冠状动脉在冠状动脉计算机断层扫描血管成像(Coronary Computed Tomography Angiography, CCTA)中难以准确分割的问题,尤其针对小血管直径、复杂分支结构、边界模糊及心肌干扰等挑战。其解决方案的关键在于提出了一种融合心肌解剖先验信息、结构感知特征编码与三维小波逆小波变换的分割框架:通过在编码阶段引入心肌先验和基于残差注意力的特征增强机制以强化冠状动脉结构表征;利用小波-逆小波变换实现联合空间-频率建模并保持多尺度结构一致性;同时在解码阶段设计多尺度特征融合模块,整合语义与几何信息,从而提升复杂几何条件下冠状动脉分割的稳定性和准确性。

链接: https://arxiv.org/abs/2512.12539
作者: Huan Huang,Michele Esposito,Chen Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 figures

点击查看摘要

Abstract:Accurate coronary artery segmentation from coronary computed tomography angiography is essential for quantitative coronary analysis and clinical decision support. Nevertheless, reliable segmentation remains challenging because of small vessel calibers, complex branching, blurred boundaries, and myocardial interference. We propose a coronary artery segmentation framework that integrates myocardial anatomical priors, structure aware feature encoding, and three dimensional wavelet inverse wavelet transformations. Myocardial priors and residual attention based feature enhancement are incorporated during encoding to strengthen coronary structure representation. Wavelet inverse wavelet based downsampling and upsampling enable joint spatial frequency modeling and preserve multi scale structural consistency, while a multi scale feature fusion module integrates semantic and geometric information in the decoding stage. The model is trained and evaluated on the public ImageCAS dataset using a 3D overlapping patch based strategy with a 7:1:2 split for training, validation, and testing. Experimental results demonstrate that the proposed method achieves a Dice coefficient of 0.8082, Sensitivity of 0.7946, Precision of 0.8471, and an HD95 of 9.77 mm, outperforming several mainstream segmentation models. Ablation studies further confirm the complementary contributions of individual components. The proposed method enables more stable and consistent coronary artery segmentation under complex geometric conditions, providing reliable segmentation results for subsequent coronary structure analysis tasks.
zh

[CV-141] Animus3D: Text-driven 3D Animation via Motion Score Distillation SIGGRAPH

【速读】:该论文旨在解决当前基于文本驱动的3D动画生成方法中运动幅度不足、明显抖动以及几何失真等问题。现有方法主要依赖于原始Score Distillation Sampling (SDS)目标,从预训练的文生视频扩散模型中蒸馏运动信息,导致生成动画的运动细节稀疏且不稳定。其解决方案的关键在于提出一种新的SDS替代方案——Motion Score Distillation (MSD),通过引入LoRA增强的视频扩散模型定义静态源分布而非纯噪声,并结合基于反演的噪声估计技术以保持外观一致性;同时,设计显式的时空正则化项抑制时间与空间上的几何畸变,并进一步引入运动精修模块提升时序分辨率和细粒度运动细节,从而突破底层视频模型的固定分辨率限制。

链接: https://arxiv.org/abs/2512.12534
作者: Qi Sun,Can Wang,Jiaxiang Shang,Wensen Feng,Jing Liao
机构: City University of Hong Kong(香港城市大学); Central Media Technology Institute, Huawei(华为中央媒体技术研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
备注: SIGGRAPH Asia 2025

点击查看摘要

Abstract:We present Animus3D, a text-driven 3D animation framework that generates motion field given a static 3D asset and text prompt. Previous methods mostly leverage the vanilla Score Distillation Sampling (SDS) objective to distill motion from pretrained text-to-video diffusion, leading to animations with minimal movement or noticeable jitter. To address this, our approach introduces a novel SDS alternative, Motion Score Distillation (MSD). Specifically, we introduce a LoRA-enhanced video diffusion model that defines a static source distribution rather than pure noise as in SDS, while another inversion-based noise estimation technique ensures appearance preservation when guiding motion. To further improve motion fidelity, we incorporate explicit temporal and spatial regularization terms that mitigate geometric distortions across time and space. Additionally, we propose a motion refinement module to upscale the temporal resolution and enhance fine-grained details, overcoming the fixed-resolution constraints of the underlying video model. Extensive experiments demonstrate that Animus3D successfully animates static 3D assets from diverse text prompts, generating significantly more substantial and detailed motion than state-of-the-art baselines while maintaining high visual integrity. Code will be released at this https URL.
zh

[CV-142] Generative Spatiotemporal Data Augmentation

【速读】:该论文旨在解决在数据稀缺场景下(如无人机(UAV)采集的图像),模型训练因标注数据不足而导致性能受限的问题。其核心解决方案是利用视频基础模型(video foundation models)进行时空数据增强,通过生成具有真实感的3D空间和时间变化的视频片段作为补充训练数据,从而扩展原始数据分布中被传统方法或先前生成方法所忽略的维度。关键创新在于:(1)采用现成的视频扩散模型实现高质量的时空合成;(2)提供实用指南以有效迁移标注至合成帧并处理遮挡消失(disocclusion)问题;(3)在低数据环境下显著提升模型泛化能力,尤其在COCO子集和UAV数据集上验证了该方法的有效性。

链接: https://arxiv.org/abs/2512.12508
作者: Jinfan Zhou,Lixin Luo,Sungmin Eum,Heesung Kwon,Jeong Joon Park
机构: University of Chicago (芝加哥大学); University of Michigan, Ann Arbor (密歇根大学安娜堡分校); DEVCOM Army Research Laboratory (美国陆军研究实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We explore spatiotemporal data augmentation using video foundation models to diversify both camera viewpoints and scene dynamics. Unlike existing approaches based on simple geometric transforms or appearance perturbations, our method leverages off-the-shelf video diffusion models to generate realistic 3D spatial and temporal variations from a given image dataset. Incorporating these synthesized video clips as supplemental training data yields consistent performance gains in low-data settings, such as UAV-captured imagery where annotations are scarce. Beyond empirical improvements, we provide practical guidelines for (i) choosing an appropriate spatiotemporal generative setup, (ii) transferring annotations to synthetic frames, and (iii) addressing disocclusion - regions newly revealed and unlabeled in generated views. Experiments on COCO subsets and UAV-captured datasets show that, when applied judiciously, spatiotemporal augmentation broadens the data distribution along axes underrepresented by traditional and prior generative methods, offering an effective lever for improving model performance in data-scarce regimes.
zh

[CV-143] Advancing Cache-Based Few-Shot Classification via Patch-Driven Relational Gated Graph Attention

【速读】:该论文旨在解决少样本图像分类(few-shot image classification)在监督信息有限和视觉领域偏移(visual domain shift)下的性能瓶颈问题。现有基于缓存的适配方法(如Tip-Adapter)虽通过学习轻量级残差适配器来改进冻结特征表示,但仍受限于CLIP模型倾向于编码全局通用表征,难以在低数据场景下有效将通用模型适配至特定领域。其解决方案的关键在于提出一种基于补丁的关联精炼机制(patch-driven relational refinement),该机制不再将图像嵌入视为单一向量,而是通过构建补丁图并引入边缘感知的图注意力网络(relational gated graph attention network),显式建模图像内部补丁间的依赖关系,从而生成富含上下文信息的补丁嵌入;进一步采用可学习的多聚合池化策略,将这些嵌入融合为紧凑且任务判别性强的表示,使缓存键更精准对齐目标类别。该精炼过程仅在训练阶段使用以蒸馏结构信息至缓存,推理时无额外开销,最终通过缓存相似度与CLIP零样本逻辑值的残差融合进行预测。

链接: https://arxiv.org/abs/2512.12498
作者: Tasweer Ahmad,Arindam Sikdar,Sandip Pradhan,Ardhendu Behera
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Few-shot image classification remains difficult under limited supervision and visual domain shift. Recent cache-based adaptation approaches (e.g., Tip-Adapter) address this challenge to some extent by learning lightweight residual adapters over frozen features, yet they still inherit CLIP’s tendency to encode global, general-purpose representations that are not optimally discriminative to adapt the generalist to the specialist’s domain in low-data regimes. We address this limitation with a novel patch-driven relational refinement that learns cache adapter weights from intra-image patch dependencies rather than treating an image embedding as a monolithic vector. Specifically, we introduce a relational gated graph attention network that constructs a patch graph and performs edge-aware attention to emphasize informative inter-patch interactions, producing context-enriched patch embeddings. A learnable multi-aggregation pooling then composes these into compact, task-discriminative representations that better align cache keys with the target few-shot classes. Crucially, the proposed graph refinement is used only during training to distil relational structure into the cache, incurring no additional inference cost beyond standard cache lookup. Final predictions are obtained by a residual fusion of cache similarity scores with CLIP zero-shot logits. Extensive evaluations on 11 benchmarks show consistent gains over state-of-the-art CLIP adapter and cache-based baselines while preserving zero-shot efficiency. We further validate battlefield relevance by introducing an Injured vs. Uninjured Soldier dataset for casualty recognition. It is motivated by the operational need to support triage decisions within the “platinum minutes” and the broader “golden hour” window in time-critical UAV-driven search-and-rescue and combat casualty care.
zh

[CV-144] More Than the Final Answer: Improving Visual Extraction and Logical Consistency in Vision-Language Models

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在强化学习从可验证奖励(Reinforcement Learning from Verifiable Rewards, RLVR)训练后仍存在的两个关键问题:一是视觉信息提取不准确(如遗漏或幻觉细节),二是思维链(chain-of-thought)逻辑不一致。其解决方案的关键在于提出PeRL-VL(Perception and Reasoning Learning for Vision-Language Models)框架,该框架通过解耦机制分别优化视觉感知和文本推理能力:在感知层面引入基于VLM的描述奖励以评估自生成图像描述的忠实性和充分性;在推理层面增加仅依赖文本的推理监督微调(Reasoning SFT)阶段,利用逻辑丰富的思维链数据提升推理连贯性与一致性,从而独立于视觉输入增强模型的逻辑推理能力。

链接: https://arxiv.org/abs/2512.12487
作者: Hoang Anh Just,Yifei Fan,Handong Zhao,Jiuxiang Gu,Ruiyi Zhang,Simon Jenni,Kushal Kafle,Ruoxi Jia,Jing Shi
机构: Virginia Tech (弗吉尼亚理工大学); Adobe Research (Adobe 研究院); Apple (苹果)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reinforcement learning from verifiable rewards (RLVR) has recently been extended from text-only LLMs to vision-language models (VLMs) to elicit long-chain multimodal reasoning. However, RLVR-trained VLMs still exhibit two persistent failure modes: inaccurate visual extraction (missing or hallucinating details) and logically inconsistent chains-of-thought, largely because verifiable signals supervise only the final answer. We propose PeRL-VL (Perception and Reasoning Learning for Vision-Language Models), a decoupled framework that separately improves visual perception and textual reasoning on top of RLVR. For perception, PeRL-VL introduces a VLM-based description reward that scores the model’s self-generated image descriptions for faithfulness and sufficiency. For reasoning, PeRL-VL adds a text-only Reasoning SFT stage on logic-rich chain-of-thought data, enhancing coherence and logical consistency independently of vision. Across diverse multimodal benchmarks, PeRL-VL improves average Pass@1 accuracy from 63.3% (base Qwen2.5-VL-7B) to 68.8%, outperforming standard RLVR, text-only reasoning SFT, and naive multimodal distillation from GPT-4o.
zh

[CV-145] From Particles to Fields: Reframing Photon Mapping with Continuous Gaussian Photon Fields

【速读】:该论文旨在解决多视角渲染中基于光子映射(photon mapping)的计算效率低下问题,其核心挑战在于每次视点独立进行光子追踪和随机核估计导致大量冗余计算。解决方案的关键在于提出高斯光子场(Gaussian Photon Field, GPF),将物理光子分布建模为可学习的各向异性3D高斯基元(参数包括位置、旋转、尺度和光谱),并通过多视角最终辐射度监督优化该场,从而将光子级光照传输信息蒸馏为一个连续且可微的辐射度函数。训练完成后,GPF可在不重复光子追踪或迭代优化的情况下,沿相机射线高效评估辐射度,实现物理准确性和神经场景表示效率的统一。

链接: https://arxiv.org/abs/2512.12459
作者: Jiachen Tao,Benjamin Planche,Van Nguyen Nguyen,Junyi Wu,Yuchun Liu,Haoxuan Wang,Zhongpai Gao,Gengyu Zhang,Meng Zheng,Feiran Wang,Anwesa Choudhuri,Zhenghao Zhao,Weitai Kang,Terrence Chen,Yan Yan,Ziyan Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:

点击查看摘要

Abstract:Accurately modeling light transport is essential for realistic image synthesis. Photon mapping provides physically grounded estimates of complex global illumination effects such as caustics and specular-diffuse interactions, yet its per-view radiance estimation remains computationally inefficient when rendering multiple views of the same scene. The inefficiency arises from independent photon tracing and stochastic kernel estimation at each viewpoint, leading to inevitable redundant computation. To accelerate multi-view rendering, we reformulate photon mapping as a continuous and reusable radiance function. Specifically, we introduce the Gaussian Photon Field (GPF), a learnable representation that encodes photon distributions as anisotropic 3D Gaussian primitives parameterized by position, rotation, scale, and spectrum. GPF is initialized from physically traced photons in the first SPPM iteration and optimized using multi-view supervision of final radiance, distilling photon-based light transport into a continuous field. Once trained, the field enables differentiable radiance evaluation along camera rays without repeated photon tracing or iterative refinement. Extensive experiments on scenes with complex light transport, such as caustics and specular-diffuse interactions, demonstrate that GPF attains photon-level accuracy while reducing computation by orders of magnitude, unifying the physical rigor of photon-based rendering with the efficiency of neural scene representations.
zh

[CV-146] Endless World: Real-Time 3D-Aware Long Video Generation

【速读】:该论文旨在解决在流式场景下生成长时、连贯且具有稳定三维结构的视频序列这一核心挑战。其解决方案的关键在于提出了一种名为Endless World的实时框架,该框架通过引入条件自回归训练策略(conditional autoregressive training strategy)来对齐新生成内容与已有视频帧,从而在保持长程依赖性的同时实现计算效率,支持单GPU上的实时推理;此外,还集成了全局3D感知注意力机制(global 3D-aware attention),提供跨时间的连续几何引导,并结合3D注入机制(3D injection mechanism)强制执行物理合理性与几何一致性,有效提升了长时间动态场景下的视频稳定性与空间一致性。

链接: https://arxiv.org/abs/2512.12430
作者: Ke Zhang,Yiqun Mei,Jiacong Xu,Vishal M. Patel
机构: Johns Hopkins University (约翰霍普金斯大学); Adobe Research (Adobe 研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages,7 figures

点击查看摘要

Abstract:Producing long, coherent video sequences with stable 3D structure remains a major challenge, particularly in streaming scenarios. Motivated by this, we introduce Endless World, a real-time framework for infinite, 3D-consistent video this http URL support infinite video generation, we introduce a conditional autoregressive training strategy that aligns newly generated content with existing video frames. This design preserves long-range dependencies while remaining computationally efficient, enabling real-time inference on a single GPU without additional training this http URL, our Endless World integrates global 3D-aware attention to provide continuous geometric guidance across time. Our 3D injection mechanism enforces physical plausibility and geometric consistency throughout extended sequences, addressing key challenges in long-horizon and dynamic scene this http URL experiments demonstrate that Endless World produces long, stable, and visually coherent videos, achieving competitive or superior performance to existing methods in both visual fidelity and spatial consistency. Our project has been available on this https URL.
zh

[CV-147] BokehDepth: Enhancing Monocular Depth Estimation through Bokeh Generation

【速读】:该论文旨在解决单目深度估计(monocular depth estimation)与背景虚化(bokeh)渲染之间的耦合问题,即当前方法未能充分利用二者共享的成像几何关系,导致深度图噪声放大为可见伪影,且在弱纹理、远距离和几何模糊区域性能不佳。其解决方案的关键在于提出一个两阶段框架BokehDepth:第一阶段通过预训练图像编辑模型构建物理引导的可控虚化生成器,从单张清晰图像生成无深度依赖的虚化堆栈;第二阶段引入轻量级聚焦感知融合模块,直接利用焦散信息作为无监督几何线索,增强现有单目深度编码器的稳定性与精度,同时保持下游解码器不变。

链接: https://arxiv.org/abs/2512.12425
作者: Hangwei Zhang,Armando Teles Fortes,Tianyi Wei,Xingang Pan
机构: S-Lab, Nanyang Technological University (南洋理工大学); Beihang University (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Bokeh and monocular depth estimation are tightly coupled through the same lens imaging geometry, yet current methods exploit this connection in incomplete ways. High-quality bokeh rendering pipelines typically depend on noisy depth maps, which amplify estimation errors into visible artifacts, while modern monocular metric depth models still struggle on weakly textured, distant and geometrically ambiguous regions where defocus cues are most informative. We introduce BokehDepth, a two-stage framework that decouples bokeh synthesis from depth prediction and treats defocus as an auxiliary supervision-free geometric cue. In Stage-1, a physically guided controllable bokeh generator, built on a powerful pretrained image editing backbone, produces depth-free bokeh stacks with calibrated bokeh strength from a single sharp input. In Stage-2, a lightweight defocus-aware aggregation module plugs into existing monocular depth encoders, fuses features along the defocus dimension, and exposes stable depth-sensitive variations while leaving downstream decoder unchanged. Across challenging benchmarks, BokehDepth improves visual fidelity over depth-map-based bokeh baselines and consistently boosts the metric accuracy and robustness of strong monocular depth foundation models.
zh

[CV-148] ViInfographicVQA: A Benchmark for Single and Multi-image Visual Question Answering on Vietnamese Infographics AAAI

【速读】:该论文旨在解决越南语信息图视觉问答(Infographic Visual Question Answering, InfographicVQA)任务中缺乏高质量基准数据集的问题,以及当前多模态模型在低资源语言环境下对布局感知和跨图像推理能力不足的局限性。解决方案的关键在于构建首个面向越南语的信息图视觉问答基准ViInfographicVQA,包含超过6747张真实世界信息图及20409个经人工验证的问答对,并设计两种评估场景:单图任务(Single-image task)与跨图推理任务(Multi-image task),其中后者首次在越南语环境中引入跨图像证据整合的评测机制。实验表明,现有视觉-语言模型在跨图任务上表现显著落后,尤其在非跨度推理和跨图像信息融合方面存在明显缺陷,凸显了未来研究需聚焦于增强布局敏感性和跨图像推理能力的方向。

链接: https://arxiv.org/abs/2512.12424
作者: Tue-Thu Van-Dinh,Hoang-Duy Tran,Truong-Binh Duong,Mai-Hanh Pham,Binh-Nam Le-Nguyen,Quoc-Thai Nguyen
机构: Unknown
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 10 pages, 4 figures, Accepted to AI4Research @ AAAI

点击查看摘要

Abstract:Infographic Visual Question Answering (InfographicVQA) evaluates a model’s ability to read and reason over data-rich, layout-heavy visuals that combine text, charts, icons, and design elements. Compared with scene-text or natural-image VQA, infographics require stronger integration of OCR, layout understanding, and numerical and semantic reasoning. We introduce ViInfographicVQA, the first benchmark for Vietnamese InfographicVQA, comprising over 6747 real-world infographics and 20409 human-verified question-answer pairs across economics, healthcare, education, and more. The benchmark includes two evaluation settings. The Single-image task follows the traditional setup in which each question is answered using a single infographic. The Multi-image task requires synthesizing evidence across multiple semantically related infographics and is, to our knowledge, the first Vietnamese evaluation of cross-image reasoning in VQA. We evaluate a range of recent vision-language models on this benchmark, revealing substantial performance disparities, with the most significant errors occurring on Multi-image questions that involve cross-image integration and non-span reasoning. ViInfographicVQA contributes benchmark results for Vietnamese InfographicVQA and sheds light on the limitations of current multimodal models in low-resource contexts, encouraging future exploration of layout-aware and cross-image reasoning methods.
zh

[CV-149] A Graph Attention Network-Based Framework for Reconstructing Missing LiDAR Beams

【速读】:该论文旨在解决旋转式激光雷达(LiDAR)传感器因硬件老化、灰尘、雪、雾或强反射等因素导致的垂直通道丢弃(vertical beam dropout)问题,此类现象会移除点云中的完整垂直切片,严重损害自动驾驶车辆的三维感知能力。解决方案的关键在于提出一种基于图注意力网络(Graph Attention Network, GAT)的框架,该框架仅利用当前帧的原始点云数据进行重建,无需相机图像或时间序列信息;其核心创新是将每帧LiDAR扫描表示为无结构空间图(点为节点,邻近点间建立边并保留原始光束索引顺序),并通过多层GAT学习局部几何邻域内的自适应注意力权重,直接回归缺失光束位置的高程(z)值,从而实现对丢失垂直通道的有效恢复。

链接: https://arxiv.org/abs/2512.12410
作者: Khalfalla Awedat,Mohamed Abidalrekab,Mohammad El-Yabroudi
机构: SUNY Morrisville (纽约州立大学莫里斯维尔分校); Portland State University (波特兰州立大学); Lawrence Technological University (劳伦斯科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vertical beam dropout in spinning LiDAR sensors triggered by hardware aging, dust, snow, fog, or bright reflections removes entire vertical slices from the point cloud and severely degrades 3D perception in autonomous vehicles. This paper proposes a Graph Attention Network (GAT)-based framework that reconstructs these missing vertical channels using only the current LiDAR frame, with no camera images or temporal information required. Each LiDAR sweep is represented as an unstructured spatial graph: points are nodes and edges connect nearby points while preserving the original beam-index ordering. A multi-layer GAT learns adaptive attention weights over local geometric neighborhoods and directly regresses the missing elevation (z) values at dropout locations. Trained and evaluated on 1,065 raw KITTI sequences with simulated channel dropout, the method achieves an average height RMSE of 11.67 cm, with 87.98% of reconstructed points falling within a 10 cm error threshold. Inference takes 14.65 seconds per frame on a single GPU, and reconstruction quality remains stable for different neighborhood sizes k. These results show that a pure graph attention model operating solely on raw point-cloud geometry can effectively recover dropped vertical beams under realistic sensor degradation.
zh

[CV-150] ArtGen: Conditional Generative Modeling of Articulated Objects in Arbitrary Part-Level States

【速读】:该论文旨在解决现有生成模型在生成关节式三维物体时因仅依赖单视角输入而引发的结构-运动耦合问题,导致生成的Kinematic Structure(运动学结构)模糊或不现实。其核心解决方案是提出ArtGen框架,关键在于:1)采用跨状态蒙特卡洛采样(cross-state Monte Carlo sampling)显式强制全局运动学一致性,从而降低几何形状与关节动力学之间的纠缠;2)引入Chain-of-Thought推理模块以推断稳健的结构先验(如部件语义、关节类型和连接关系),指导稀疏专家Diffusion Transformer专注于多样化的运动交互建模;3)设计融合局部-全局注意力机制的组合式3D-VAE潜在先验,有效捕捉精细几何特征与部件级全局关系。

链接: https://arxiv.org/abs/2512.12395
作者: Haowen Wang,Xiaoping Yuan,Fugang Zhang,Rui Jian,Yuanwei Zhu,Xiuquan Qiao,Yakun Huang
机构: Anhui University (安徽大学); Beijing University of Posts and Telecommunications (北京邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Generating articulated assets is crucial for robotics, digital twins, and embodied intelligence. Existing generative models often rely on single-view inputs representing closed states, resulting in ambiguous or unrealistic kinematic structures due to the entanglement between geometric shape and joint dynamics. To address these challenges, we introduce ArtGen, a conditional diffusion-based framework capable of generating articulated 3D objects with accurate geometry and coherent kinematics from single-view images or text descriptions at arbitrary part-level states. Specifically, ArtGen employs cross-state Monte Carlo sampling to explicitly enforce global kinematic consistency, reducing structural-motion entanglement. Additionally, we integrate a Chain-of-Thought reasoning module to infer robust structural priors, such as part semantics, joint types, and connectivity, guiding a sparse-expert Diffusion Transformer to specialize in diverse kinematic interactions. Furthermore, a compositional 3D-VAE latent prior enhanced with local-global attention effectively captures fine-grained geometry and global part-level relationships. Extensive experiments on the PartNet-Mobility benchmark demonstrate that ArtGen significantly outperforms state-of-the-art methods.
zh

[CV-151] Speedrunning ImageNet Diffusion

【速读】:该论文旨在解决扩散模型(diffusion model)在训练效率与性能之间难以平衡的问题,尤其是在小规模模型(如140M参数)下如何实现接近大规模模型(如685M参数)的生成质量。其关键解决方案是提出SR-DiT(Speedrun Diffusion Transformer)框架,系统性地整合了标记路由(token routing)、架构优化和训练策略改进,并基于表示对齐(representation alignment)进行构建。通过多技术协同作用,SR-DiT在仅400K迭代次数下即达到FID 3.49和KDD 0.319的优异结果,显著优于同规模模型的现有方法,且无需分类器自由引导(classifier-free guidance),体现了不同优化技术之间的正向协同效应。

链接: https://arxiv.org/abs/2512.12386
作者: Swayam Bhanded
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances have significantly improved the training efficiency of diffusion transformers. However, these techniques have largely been studied in isolation, leaving unexplored the potential synergies from combining multiple approaches. We present SR-DiT (Speedrun Diffusion Transformer), a framework that systematically integrates token routing, architectural improvements, and training modifications on top of representation alignment. Our approach achieves FID 3.49 and KDD 0.319 on ImageNet-256 using only a 140M parameter model at 400K iterations without classifier-free guidance - comparable to results from 685M parameter models trained significantly longer. To our knowledge, this is a state-of the-art result at this model size. Through extensive ablation studies, we identify which technique combinations are most effective and document both synergies and incompatibilities. We release our framework as a computationally accessible baseline for future research.
zh

[CV-152] M4Human: A Large-Scale Multimodal mmWave Radar Benchmark for Human Mesh Reconstruction

【速读】:该论文旨在解决现有基于视觉的人体网格重建(Human Mesh Reconstruction, HMR)数据集在遮挡、光照变化和隐私问题上的局限性,从而推动基于毫米波雷达(mmWave radar)的无感人体感知研究。其解决方案的关键在于构建M4Human——当前规模最大(661K帧,为此前最大数据集的9倍)的多模态基准数据集,融合高分辨率毫米波雷达、RGB与深度图像,并提供原始雷达张量(RT)与处理后的雷达点云(RPC)两种粒度级别的信号表示,同时包含高质量动作捕捉(MoCap)标注的3D网格和全局轨迹信息,覆盖20名受试者与50种多样化动作(包括静止、坐姿及自由空间运动)。该数据集支持从单模态雷达到多模态融合(如RGB-D)的广泛研究,显著提升了雷达驱动人体建模的能力,同时也揭示了快速、非约束运动下的持续挑战。

链接: https://arxiv.org/abs/2512.12378
作者: Junqiao Fan,Yunjiao Zhou,Yizhuo Yang,Xinyuan Cui,Jiarui Zhang,Lihua Xie,Jianfei Yang,Chris Xiaoxuan Lu,Fangqiang Ding
机构: Nanyang Technological University (南洋理工大学); University of Pennsylvania (宾夕法尼亚大学); University College London (伦敦大学学院); Massachusetts Institute of Technology (麻省理工学院); University of Edinburgh (爱丁堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Human mesh reconstruction (HMR) provides direct insights into body-environment interaction, which enables various immersive applications. While existing large-scale HMR datasets rely heavily on line-of-sight RGB input, vision-based sensing is limited by occlusion, lighting variation, and privacy concerns. To overcome these limitations, recent efforts have explored radio-frequency (RF) mmWave radar for privacy-preserving indoor human sensing. However, current radar datasets are constrained by sparse skeleton labels, limited scale, and simple in-place actions. To advance the HMR research community, we introduce M4Human, the current largest-scale (661K-frame) ( 9\times prior largest) multimodal benchmark, featuring high-resolution mmWave radar, RGB, and depth data. M4Human provides both raw radar tensors (RT) and processed radar point clouds (RPC) to enable research across different levels of RF signal granularity. M4Human includes high-quality motion capture (MoCap) annotations with 3D meshes and global trajectories, and spans 20 subjects and 50 diverse actions, including in-place, sit-in-place, and free-space sports or rehabilitation movements. We establish benchmarks on both RT and RPC modalities, as well as multimodal fusion with RGB-D modalities. Extensive results highlight the significance of M4Human for radar-based human modeling while revealing persistent challenges under fast, unconstrained motion. The dataset and code will be released after the paper publication.
zh

[CV-153] V-Warper: Appearance-Consistent Video Diffusion Personalization via Value Warping

【速读】:该论文旨在解决视频个性化(video personalization)中面临的两大挑战:一是现有方法依赖大规模视频微调或数据集,导致计算成本高且难以扩展;二是难以在帧间保持细粒度的外观一致性。其解决方案的关键在于提出一种无需训练的粗到精个性化框架 V-Warper,该框架包含两个核心阶段:(1) 利用少量参考图像进行轻量级粗粒度外观适应,通过仅图像的 LoRA(Low-Rank Adaptation)和主体嵌入适配编码全局身份特征;(2) 在推理时引入细粒度外观注入机制,基于 RoPE-free 中层查询-键特征计算语义对应关系,引导富含外观信息的值表示向生成过程中的语义对齐区域进行形变映射,并通过掩码确保空间可靠性。此方法显著提升外观保真度,同时保持提示对齐与运动动态,且无需大规模视频微调即可实现高效优化。

链接: https://arxiv.org/abs/2512.12375
作者: Hyunkoo Lee,Wooseok Jang,Jini Yang,Taehwan Kim,Sangoh Kim,Sangwon Jung,Seungryong Kim
机构: KAIST AI; Korea University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Video personalization aims to generate videos that faithfully reflect a user-provided subject while following a text prompt. However, existing approaches often rely on heavy video-based finetuning or large-scale video datasets, which impose substantial computational cost and are difficult to scale. Furthermore, they still struggle to maintain fine-grained appearance consistency across frames. To address these limitations, we introduce V-Warper, a training-free coarse-to-fine personalization framework for transformer-based video diffusion models. The framework enhances fine-grained identity fidelity without requiring any additional video training. (1) A lightweight coarse appearance adaptation stage leverages only a small set of reference images, which are already required for the task. This step encodes global subject identity through image-only LoRA and subject-embedding adaptation. (2) A inference-time fine appearance injection stage refines visual fidelity by computing semantic correspondences from RoPE-free mid-layer query–key features. These correspondences guide the warping of appearance-rich value representations into semantically aligned regions of the generation process, with masking ensuring spatial reliability. V-Warper significantly improves appearance fidelity while preserving prompt alignment and motion dynamics, and it achieves these gains efficiently without large-scale video finetuning.
zh

[CV-154] STAGE: Storyboard-Anchored Generation for Cinematic Multi-shot Narrative

【速读】:该论文旨在解决生成式视频合成中多镜头叙事(multi-shot narrative)的连贯性问题,尤其是在使用关键帧(keyframe)方法时难以维持跨镜头一致性与电影语言表达的问题。其解决方案的关键在于提出STAGE(SToryboard-Anchored GEneration)工作流,通过引入STEP2模块预测由起始-结束帧对组成的结构化分镜(storyboard),并设计多镜头记忆包(multi-shot memory pack)保障长距离实体一致性、双编码策略(dual-encoding strategy)提升单镜头内连贯性,以及两阶段训练方案学习镜头间的电影级过渡效果。

链接: https://arxiv.org/abs/2512.12372
作者: Peixuan Zhang,Zijian Jia,Kaiqi Liu,Shuchen Weng,Si Li,Boxin Shi
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Peking University (北京大学); Beijing Academy of Artificial Intelligence (北京人工智能研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While recent advancements in generative models have achieved remarkable visual fidelity in video synthesis, creating coherent multi-shot narratives remains a significant challenge. To address this, keyframe-based approaches have emerged as a promising alternative to computationally intensive end-to-end methods, offering the advantages of fine-grained control and greater efficiency. However, these methods often fail to maintain cross-shot consistency and capture cinematic language. In this paper, we introduce STAGE, a SToryboard-Anchored GEneration workflow to reformulate the keyframe-based multi-shot video generation task. Instead of using sparse keyframes, we propose STEP2 to predict a structural storyboard composed of start-end frame pairs for each shot. We introduce the multi-shot memory pack to ensure long-range entity consistency, the dual-encoding strategy for intra-shot coherence, and the two-stage training scheme to learn cinematic inter-shot transition. We also contribute the large-scale ConStoryBoard dataset, including high-quality movie clips with fine-grained annotations for story progression, cinematic attributes, and human preferences. Extensive experiments demonstrate that STAGE achieves superior performance in structured narrative control and cross-shot coherence.
zh

[CV-155] CLeaf-Net: a transformer-convolution framework with global-local attention for robust in-field lesion-level plant leaf disease detection

【速读】:该论文旨在解决真实田间环境下植物叶片病害检测中因背景杂乱、域偏移及病变级数据稀缺导致的模型鲁棒性不足问题。其核心解决方案是提出TCLeaf-Net,一种基于Transformer与卷积混合架构的检测模型:通过Transformer-Convolution模块(TCM)融合全局上下文与局部保留卷积以抑制非叶区域干扰;设计原始尺度特征召回与采样(RSFRS)块结合双线性重采样与卷积操作,减少下采样过程中的空间细节丢失;并引入带特征金字塔网络(FPN)的可变形对齐模块(DFPN),利用偏移对齐和多感受野感知机制增强多尺度特征融合能力,从而提升模型在复杂场景下的精度与泛化性能。

链接: https://arxiv.org/abs/2512.12357
作者: Zishen Song,Yongjian Zhu,Dong Wang,Hongzhan Liu,Lingyu Jiang,Yongxing Duan,Zehua Zhang,Sihan Li,Jiarui Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Timely and accurate detection of foliar diseases is vital for safeguarding crop growth and reducing yield losses. Yet, in real-field conditions, cluttered backgrounds, domain shifts, and limited lesion-level datasets hinder robust modeling. To address these challenges, we release Daylily-Leaf, a paired lesion-level dataset comprising 1,746 RGB images and 7,839 lesions captured under both ideal and in-field conditions, and propose TCLeaf-Net, a transformer-convolution hybrid detector optimized for real-field use. TCLeaf-Net is designed to tackle three major challenges. To mitigate interference from complex backgrounds, the transformer-convolution module (TCM) couples global context with locality-preserving convolution to suppress non-leaf regions. To reduce information loss during downsampling, the raw-scale feature recalling and sampling (RSFRS) block combines bilinear resampling and convolution to preserve fine spatial detail. To handle variations in lesion scale and feature shifts, the deformable alignment block with FPN (DFPN) employs offset-based alignment and multi-receptive-field perception to strengthen multi-scale fusion. Experimental results show that on the in-field split of the Daylily-Leaf dataset, TCLeaf-Net improves mAP@50 by 5.4 percentage points over the baseline model, reaching 78.2%, while reducing computation by 7.5 GFLOPs and GPU memory usage by 8.7%. Moreover, the model outperforms recent YOLO and RT-DETR series in both precision and recall, and demonstrates strong performance on the PlantDoc, Tomato-Leaf, and Rice-Leaf datasets, validating its robustness and generalizability to other plant disease detection scenarios.
zh

[CV-156] Unified Control for Inference-Time Guidance of Denoising Diffusion Models

【速读】:该论文旨在解决扩散模型(Diffusion Model)在下游任务中输出与目标任务对齐不足的问题,以提升特定任务的性能。现有方法主要分为两类:基于采样的策略通过探索多个候选输出并选择奖励信号更高的样本进行优化,而基于梯度的策略则利用可微分的奖励近似直接引导生成过程。论文提出了一种统一算法 UniCoDe,其关键在于将局部梯度信号融入采样过程中,从而克服复杂奖励驱动采样方法中存在的采样效率低下的问题。通过协同整合这两种范式,UniCoDe 在保持高效采样的同时,实现了更优的奖励对齐与扩散模型无条件先验分布之间偏差的权衡。

链接: https://arxiv.org/abs/2512.12339
作者: Maurya Goyal,Anuj Singh,Hadi Jamali-Rad
机构: Delft University of Technology (代尔夫特理工大学); Shell Global Solutions International B.V. (壳牌全球解决方案国际有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Aligning diffusion model outputs with downstream objectives is essential for improving task-specific performance. Broadly, inference-time training-free approaches for aligning diffusion models can be categorized into two main strategies: sampling-based methods, which explore multiple candidate outputs and select those with higher reward signals, and gradient-guided methods, which use differentiable reward approximations to directly steer the generation process. In this work, we propose a universal algorithm, UniCoDe, which brings together the strengths of sampling and gradient-based guidance into a unified framework. UniCoDe integrates local gradient signals during sampling, thereby addressing the sampling inefficiency inherent in complex reward-based sampling approaches. By cohesively combining these two paradigms, UniCoDe enables more efficient sampling while offering better trade-offs between reward alignment and divergence from the diffusion unconditional prior. Empirical results demonstrate that UniCoDe remains competitive with state-of-the-art baselines across a range of tasks. The code is available at this https URL
zh

[CV-157] WeDetect: Fast Open-Vocabulary Object Detection as Retrieval

【速读】:该论文旨在解决开放词汇目标检测(open-vocabulary object detection)中模型效率与泛化能力之间的矛盾问题,即如何在不依赖跨模态融合层(cross-modal fusion layers)的前提下实现高精度、实时且可扩展的检测性能。其解决方案的关键在于完全贯彻“检索式”(retrieval-based)理念,构建了一个名为WeDetect的模型家族:通过双塔架构(dual-tower architecture)将图像区域与文本查询映射到共享嵌入空间,从而避免昂贵的融合计算;同时利用冻结主干网络仅微调对象置信度提示(objectness prompt)实现历史数据的快速回溯(fast backtrack),并进一步结合大语言模型(LMMs)设计轻量级分类器(WeDetect-Ref),以单次前向传播完成复杂指代表达理解(referring expression comprehension, REC)。这一统一框架在15个基准测试上实现了SOTA性能,并显著提升了推理效率。

链接: https://arxiv.org/abs/2512.12309
作者: Shenghao Fu,Yukun Su,Fengyun Rao,Jing Lyu,Xiaohua Xie,Wei-Shi Zheng
机构: WeChat Vision, Tencent Inc. (微信视觉,腾讯公司); Pazhou Laboratory (黄埔),China; Shenzhen Loop Area Institute
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Open-vocabulary object detection aims to detect arbitrary classes via text prompts. Methods without cross-modal fusion layers (non-fusion) offer faster inference by treating recognition as a retrieval problem, \ie, matching regions to text queries in a shared embedding space. In this work, we fully explore this retrieval philosophy and demonstrate its unique advantages in efficiency and versatility through a model family named WeDetect: (1) State-of-the-art performance. WeDetect is a real-time detector with a dual-tower architecture. We show that, with well-curated data and full training, the non-fusion WeDetect surpasses other fusion models and establishes a strong open-vocabulary foundation. (2) Fast backtrack of historical data. WeDetect-Uni is a universal proposal generator based on WeDetect. We freeze the entire detector and only finetune an objectness prompt to retrieve generic object proposals across categories. Importantly, the proposal embeddings are class-specific and enable a new application, object retrieval, supporting retrieval objects in historical data. (3) Integration with LMMs for referring expression comprehension (REC). We further propose WeDetect-Ref, an LMM-based object classifier to handle complex referring expressions, which retrieves target objects from the proposal list extracted by WeDetect-Uni. It discards next-token prediction and classifies objects in a single forward pass. Together, the WeDetect family unifies detection, proposal generation, object retrieval, and REC under a coherent retrieval framework, achieving state-of-the-art performance across 15 benchmarks with high inference efficiency.
zh

[CV-158] MRD: Using Physically Based Differentiable Rendering to Probe Vision Models for 3D Scene Understanding

【速读】:该论文旨在解决深度学习视觉模型中表征与决策难以解释的问题,特别是如何量化模型对隐含3D场景属性(如几何形状和材质)的理解程度。传统方法多基于像素层面的分析,缺乏对物理场景参数的直接建模能力,导致难以区分模型响应是由几何、材质还是光照等不同因素驱动的。解决方案的关键在于提出MRD(metamers rendered differentiably,可微分渲染的等效体),通过物理基础的可微分渲染技术,寻找在物理上不同但能引发相同模型激活的3D场景参数(即模型等效体),从而将模型行为映射到具物理意义的场景变量上。这种方法使得对模型敏感性或不变性的分析能够精确控制其他变量(如固定材质和光照),实现对模型内部机制更可靠、可解释的评估。

链接: https://arxiv.org/abs/2512.12307
作者: Benjamin Beilharz,Thomas S. A. Wallis
机构: Technical University of Darmstadt (达姆施塔特工业大学); Center for Mind, Brain and Behavior (CMBB), Universities of Marburg, Giessen, and Darmstadt (马尔堡、吉森和达姆施塔特大学心智、大脑与行为中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: 18 pages, 6 figures. Supplementary material and code will be provided at the end of January

点击查看摘要

Abstract:While deep learning methods have achieved impressive success in many vision benchmarks, it remains difficult to understand and explain the representations and decisions of these models. Though vision models are typically trained on 2D inputs, they are often assumed to develop an implicit representation of the underlying 3D scene (for example, showing tolerance to partial occlusion, or the ability to reason about relative depth). Here, we introduce MRD (metamers rendered differentiably), an approach that uses physically based differentiable rendering to probe vision models’ implicit understanding of generative 3D scene properties, by finding 3D scene parameters that are physically different but produce the same model activation (i.e. are model metamers). Unlike previous pixel-based methods for evaluating model representations, these reconstruction results are always grounded in physical scene descriptions. This means we can, for example, probe a model’s sensitivity to object shape while holding material and lighting constant. As a proof-of-principle, we assess multiple models in their ability to recover scene parameters of geometry (shape) and bidirectional reflectance distribution function (material). The results show high similarity in model activation between target and optimized scenes, with varying visual results. Qualitatively, these reconstructions help investigate the physical scene attributes to which models are sensitive or invariant. MRD holds promise for advancing our understanding of both computer and human vision by enabling analysis of how physical scene parameters drive changes in model responses.
zh

[CV-159] OMUDA: Omni-level Masking for Unsupervised Domain Adaptation in Semantic Segmentation

【速读】:该论文旨在解决无监督域自适应(Unsupervised Domain Adaptation, UDA)中语义分割模型在跨域迁移时面临的三大挑战:跨域上下文歧义、特征表示不一致以及类别级伪标签噪声。为应对这些问题,作者提出了一种统一框架Omni-level Masking for Unsupervised Domain Adaptation (OMUDA),其核心创新在于引入分层掩码策略,在上下文、表征和类别三个层次上协同减少域间差异。关键解决方案包括:1)上下文感知掩码(Context-Aware Masking, CAM)用于动态区分前景与背景,平衡全局上下文与局部细节;2)特征蒸馏掩码(Feature Distillation Masking, FDM)通过预训练模型的知识迁移增强特征学习的鲁棒性与一致性;3)类别解耦掩码(Class Decoupling Masking, CDM)显式建模类别不确定性以缓解伪标签噪声的影响。该分层掩码范式实现了对域偏移的多维度抑制,显著提升了模型在多个跨域语义分割基准上的性能,尤其在SYNTHIA-Cityscapes和GTA5-Cityscapes任务中平均提升达7%。

链接: https://arxiv.org/abs/2512.12303
作者: Yang Ou,Xiongwei Zhao,Xinye Yang,Yihan Wang,Yicheng Di,Rong Yuan,Xieyuanli Chen,Xu Zhu
机构: Chengdu University (成都大学); Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳校区); Cardiff University (卡迪夫大学); Southwest Jiaotong University (西南交通大学); Jiangnan University (江南大学); National University of Defense Technology (国防科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to TMM

点击查看摘要

Abstract:Unsupervised domain adaptation (UDA) enables semantic segmentation models to generalize from a labeled source domain to an unlabeled target domain. However, existing UDA methods still struggle to bridge the domain gap due to cross-domain contextual ambiguity, inconsistent feature representations, and class-wise pseudo-label noise. To address these challenges, we propose Omni-level Masking for Unsupervised Domain Adaptation (OMUDA), a unified framework that introduces hierarchical masking strategies across distinct representation levels. Specifically, OMUDA comprises: 1) a Context-Aware Masking (CAM) strategy that adaptively distinguishes foreground from background to balance global context and local details; 2) a Feature Distillation Masking (FDM) strategy that enhances robust and consistent feature learning through knowledge transfer from pre-trained models; and 3) a Class Decoupling Masking (CDM) strategy that mitigates the impact of noisy pseudo-labels by explicitly modeling class-wise uncertainty. This hierarchical masking paradigm effectively reduces the domain shift at the contextual, representational, and categorical levels, providing a unified solution beyond existing approaches. Extensive experiments on multiple challenging cross-domain semantic segmentation benchmarks validate the effectiveness of OMUDA. Notably, on the SYNTHIA-Cityscapes and GTA5-Cityscapes tasks, OMUDA can be seamlessly integrated into existing UDA methods and consistently achieving state-of-the-art results with an average improvement of 7%.
zh

[CV-160] GrowTAS: Progressive Expansion from Small to Large Subnets for Efficient ViT Architecture Search WACV2026

【速读】:该论文旨在解决现有Transformer架构搜索(Transformer Architecture Search, TAS)方法中因共享权重导致的小规模子网络性能严重下降的问题。传统TAS方法通常训练一个超网络(supernet),其中所有候选架构共享同一组参数,这种设计在训练过程中产生显著的权重干扰,尤其影响小模型的性能。论文提出了一种渐进式训练框架GrowTAS,其核心思想是先训练小规模子网络,再逐步引入并训练更大规模的子网络,从而有效降低权重干扰并稳定训练过程。进一步地,GrowTAS+通过仅微调部分权重来提升大子网络的性能,实验证明该方法在ImageNet及多个迁移学习基准数据集上均优于当前主流TAS方法。

链接: https://arxiv.org/abs/2512.12296
作者: Hyunju Lee,Youngmin Oh,Jeimin Jeon,Donghyeon Baek,Bumsub Ham
机构: Yonsei University (延世大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to WACV 2026

点击查看摘要

Abstract:Transformer architecture search (TAS) aims to automatically discover efficient vision transformers (ViTs), reducing the need for manual design. Existing TAS methods typically train an over-parameterized network (i.e., a supernet) that encompasses all candidate architectures (i.e., subnets). However, all subnets share the same set of weights, which leads to interference that degrades the smaller subnets severely. We have found that well-trained small subnets can serve as a good foundation for training larger ones. Motivated by this, we propose a progressive training framework, dubbed GrowTAS, that begins with training small subnets and incorporate larger ones gradually. This enables reducing the interference and stabilizing a training process. We also introduce GrowTAS+ that fine-tunes a subset of weights only to further enhance the performance of large subnets. Extensive experiments on ImageNet and several transfer learning benchmarks, including CIFAR-10/100, Flowers, CARS, and INAT-19, demonstrate the effectiveness of our approach over current TAS methods
zh

[CV-161] RealDrag : The First Drag ging Benchmark with Real Target Image

【速读】:该论文旨在解决基于拖拽(point-based)图像编辑模型的评估不可靠问题,其根源在于缺乏标准化的基准测试集和量化指标,特别是缺少包含真实目标图像(ground truth target images)的数据集,导致不同方法之间的客观比较困难。解决方案的关键在于提出首个全面的基准测试集RealDrag,该数据集包含超过400个来自多样化视频源的人工标注样本,提供源图/目标图对、控制点(handle/target points)、可编辑区域掩码(editable region masks)及描述性标题;同时设计了四项任务特异性的新指标:语义距离(Semantical Distance, SeD)、外掩码保留得分(Outer Mask Preserving Score, OMPS)、内补丁保留得分(Inner Patch Preserving Score, IPPS)和方向相似性(Directional Similarity, DiS),分别用于衡量像素级匹配精度、非编辑区域保真度以及语义一致性,从而实现对当前17种最先进模型的大规模系统性评估,并建立可复现的基准以指导未来研究。

链接: https://arxiv.org/abs/2512.12287
作者: Ahmad Zafarani,Zahra Dehghanian,Mohammadreza Davoodi,Mohsen Shadroo,MohammadAmin Fazli,Hamid R. Rabiee
机构: Sharif University of Technology (谢里夫理工大学); Iran University of Science and Technology (伊朗科学与技术大学); Comprehensive University of Islamic Revolution (伊斯兰革命综合大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The evaluation of drag based image editing models is unreliable due to a lack of standardized benchmarks and metrics. This ambiguity stems from inconsistent evaluation protocols and, critically, the absence of datasets containing ground truth target images, making objective comparisons between competing methods difficult. To address this, we introduce \textbfRealDrag, the first comprehensive benchmark for point based image editing that includes paired ground truth target images. Our dataset contains over 400 human annotated samples from diverse video sources, providing source/target images, handle/target points, editable region masks, and descriptive captions for both the image and the editing action. We also propose four novel, task specific metrics: Semantical Distance (SeD), Outer Mask Preserving Score (OMPS), Inner Patch Preserving Score (IPPS), and Directional Similarity (DiS). These metrics are designed to quantify pixel level matching fidelity, check preservation of non edited (out of mask) regions, and measure semantic alignment with the desired task. Using this benchmark, we conduct the first large scale systematic analysis of the field, evaluating 17 SOTA models. Our results reveal clear trade offs among current approaches and establish a robust, reproducible baseline to guide future research. Our dataset and evaluation toolkit will be made publicly available. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2512.12287 [cs.CV] (or arXiv:2512.12287v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2512.12287 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-162] Cognitive-YOLO: LLM -Driven Architecture Synthesis from First Principles of Data for Object Detection

【速读】:该论文旨在解决传统目标检测架构设计中手动设计耗时费力、而基于神经架构搜索(Neural Architecture Search, NAS)的方法计算成本过高的问题。现有利用大语言模型(Large Language Models, LLMs)的方法通常仅作为搜索循环中的迭代优化器,未能从数据本质特征出发直接生成网络结构。其解决方案的关键在于提出 Cognitive-YOLO 框架,该框架通过三个阶段实现数据驱动的架构合成:首先提取目标数据集的关键元特征(如目标尺度分布和场景密度);其次,LLM 结合检索增强生成(Retrieval-Augmented Generation, RAG)获取的先进组件对这些特征进行推理,生成结构化的神经架构描述语言(Neural Architecture Description Language, NADL);最后由编译器将描述实例化为可部署模型。实验证明,该方法在多个数据集上均能生成性能优越且参数效率更高的架构,且消融实验表明,LLM 对数据“第一性原理”的理解是性能提升的核心因素,远胜于单纯依赖最先进技术组件的检索。

链接: https://arxiv.org/abs/2512.12281
作者: Jiahao Zhao
机构: Xi’an University of Posts and Telecommunications (西安邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 4 figures, 3 ttables

点击查看摘要

Abstract:Designing high-performance object detection architectures is a complex task, where traditional manual design is time-consuming and labor-intensive, and Neural Architecture Search (NAS) is computationally prohibitive. While recent approaches using Large Language Models (LLMs) show promise, they often function as iterative optimizers within a search loop, rather than generating architectures directly from a holistic understanding of the data. To address this gap, we propose Cognitive-YOLO, a novel framework for LLM-driven architecture synthesis that generates network configurations directly from the intrinsic characteristics of the dataset. Our method consists of three stages: first, an analysis module extracts key meta-features (e.g., object scale distribution and scene density) from the target dataset; second, the LLM reasons upon these features, augmented with state-of-the-art components retrieved via Retrieval-Augmented Generation (RAG), to synthesize the architecture into a structured Neural Architecture Description Language (NADL); finally, a compiler instantiates this description into a deployable model. Extensive experiments on five diverse object detection datasets demonstrate that our proposed Cognitive-YOLO consistently generates superior architectures, achieving highly competitive performance and demonstrating a superior performance-per-parameter trade-off compared to strong baseline models across multiple benchmarks. Crucially, our ablation studies prove that the LLM’s data-driven reasoning is the primary driver of performance, demonstrating that a deep understanding of data “first principles” is more critical for achieving a superior architecture than simply retrieving SOTA components.
zh

[CV-163] Feature Aggregation for Efficient Continual Learning of Complex Facial Expressions

【速读】:该论文旨在解决情感计算中面部表情识别(Facial Expression Recognition, FER)在持续学习(continual learning)场景下的灾难性遗忘问题,即模型在学习新情绪表达时会丢失对先前知识的记忆。其解决方案的关键在于提出一种融合双模态特征的混合框架:一方面利用深度卷积神经网络提取的视觉特征,另一方面引入基于面部动作单元(Action Units, AUs)的结构化信息,二者共同构建联合表示;并通过贝叶斯高斯混合模型(Bayesian Gaussian Mixture Models, BGMMs)进行建模,该方法无需重新训练即可实现轻量级、概率化的表达分类,从而有效提升模型对基础与复合情绪的渐进式识别能力,并显著增强知识保留性能。

链接: https://arxiv.org/abs/2512.12277
作者: Thibault Geoffroy,Myriam Maumy,Lionel Prevost
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 28 pages, 8 figures, chapter for “Emotion and Facial Recognition in Artificial Intelligence: Sustainable Multidisciplinary Perspectives and Applications” (2026)

点击查看摘要

Abstract:As artificial intelligence (AI) systems become increasingly embedded in our daily life, the ability to recognize and adapt to human emotions is essential for effective human-computer interaction. Facial expression recognition (FER) provides a primary channel for inferring affective states, but the dynamic and culturally nuanced nature of emotions requires models that can learn continuously without forgetting prior knowledge. In this work, we propose a hybrid framework for FER in a continual learning setting that mitigates catastrophic forgetting. Our approach integrates two complementary modalities: deep convolutional features and facial Action Units (AUs) derived from the Facial Action Coding System (FACS). The combined representation is modelled through Bayesian Gaussian Mixture Models (BGMMs), which provide a lightweight, probabilistic solution that avoids retraining while offering strong discriminative power. Using the Compound Facial Expression of Emotion (CFEE) dataset, we show that our model can first learn basic expressions and then progressively recognize compound expressions. Experiments demonstrate improved accuracy, stronger knowledge retention, and reduced forgetting. This framework contributes to the development of emotionally intelligent AI systems with applications in education, healthcare, and adaptive user interfaces.
zh

[CV-164] MetaTPT: Meta Test-time Prompt Tuning for Vision-Language Models NEURIPS2025

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在测试阶段面对领域偏移(domain shift)时性能下降的问题,尽管其具备强大的零样本泛化能力。现有方法如测试时提示调优(Test-Time Prompt Tuning, TPT)依赖固定数据增强策略,难以应对更具挑战性的场景。解决方案的关键在于提出元测试时提示调优(Meta Test-Time Prompt Tuning, MetaTPT),该方法通过元学习框架自监督地学习一个辅助任务,动态为每个样本生成参数化的数据增强策略,从而实现更丰富的特征表达;同时采用双层优化机制:内层学习生成信息量丰富的视图,外层通过跨视图一致性约束进行提示调优,使增强学习与提示调优协同优化,显著提升模型在领域迁移和跨数据集场景下的适应能力。

链接: https://arxiv.org/abs/2512.12268
作者: Yuqing Lei,Yingjun Du,Yawen Huang,Xiantong Zhen,Ling Shao
机构: UCAS-Terminus AI Lab, University of Chinese Academic of Sciences (中国科学院大学Terminus人工智能实验室); AIM Lab, University of Amsterdam (阿姆斯特丹大学AIM实验室); Jarvis Research Center, Tencent Youtu Lab (腾讯优图实验室Jarvis研究中心); Central Research Institue, United Imaging Healthcare Co., Ltd (联影医疗科技股份有限公司中央研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: NeurIPS 2025 Workshop

点击查看摘要

Abstract:Vision-language models (VLMs) such as CLIP exhibit strong zero-shot generalization but remain sensitive to domain shifts at test time. Test-time prompt tuning (TPT) mitigates this issue by adapting prompts with fixed augmentations, which may falter in more challenging settings. In this work, we propose Meta Test-Time Prompt Tuning (MetaTPT), a meta-learning framework that learns a self-supervised auxiliary task to guide test-time prompt tuning. The auxiliary task dynamically learns parameterized augmentations for each sample, enabling more expressive transformations that capture essential features in target domains. MetaTPT adopts a dual-loop optimization paradigm: an inner loop learns a self-supervised task that generates informative views, while the outer loop performs prompt tuning by enforcing consistency across these views. By coupling augmentation learning with prompt tuning, MetaTPT improves test-time adaptation under domain shifts. Extensive experiments demonstrate that MetaTPT achieves state-of-the-art performance on domain generalization and cross-dataset benchmarks.
zh

[CV-165] Moment and Highlight Detection via MLLM Frame Segmentation

【速读】:该论文旨在解决视频片段(video moments)和高光时刻(highlights)检测任务中,基于生成式多模态大语言模型(MLLM)的文本时间戳预测方法缺乏帧级梯度传播的问题。传统方法通过生成文本形式的时间戳无法直接提供用于帧级预测的梯度,限制了模型在细粒度定位上的优化能力。解决方案的关键在于:将分割目标(segmentation objectives)直接应用于MLLM输出的字符序列上,即强制模型对固定数量的帧输出连续的“0”或“1”字符序列,其中“0”表示背景、“1”表示前景,从而利用MLLM的语言建模能力同时获得帧级概率分布;训练阶段结合分割损失与因果语言建模损失,推理时通过束搜索(beam search)生成序列和logits,分别对应视频中的高光时刻和显著性得分。该方法在仅采样25帧的情况下实现了优异的高光检测性能(QVHighlights数据集上HIT@1为56.74),且显著优于基线的片段检索性能(MAP达35.28)。

链接: https://arxiv.org/abs/2512.12246
作者: I Putu Andika Bagas Jiwanta,Ayu Purwarianti
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Detecting video moments and highlights from natural-language queries have been unified by transformer-based methods. Other works use generative Multimodal LLM (MLLM) to predict moments and/or highlights as text timestamps, utilizing its reasoning capability. While effective, text-based generation cannot provide direct gradients for frame-level predictions because the model only emits language tokens. Although recent Reinforcement Learning (RL) methods attempt to address the issue, we propose a novel approach by applying segmentation objectives directly on the LLM’s output tokens. The LLM is fed with a fixed number of frames alongside a prompt that enforces it to output a sequence of continuous “0” and/or “1” characters, with one character per frame. The “0”/“1” characters benefit from the LLM’s inherent language capability while also acting as background and foreground probabilities, respectively. Training employs segmentation losses on the probabilities alongside a normal causal LM loss. At inference, beam search generates sequence and logits, acting as moments and saliency scores, respectively. Despite sampling only 25 frames – less than half of comparable methods – our method achieved strong highlight detection (56.74 HIT@1) on QVHighlights. Additionally, our efficient method scores above the baseline (35.28 MAP) for moment retrieval. Empirically, segmentation losses provide a stable complementary learning signal even when the causal LM loss plateaus.
zh

[CV-166] Ultra-Low Bitrate Perceptual Image Compression with Shallow Encoder

【速读】:该论文旨在解决超低比特率图像压缩(低于0.05 bits per pixel)场景下,现有方法对复杂预训练编码器(如变分自编码器 VAE 或基于分词器的模型)的依赖问题,这类方法虽能实现高感知保真度,但因计算资源消耗大而不适用于边缘设备等弱发送端。其解决方案的关键在于提出一种不对称极端图像压缩(Asymmetric Extreme Image Compression, AEIC)框架:采用中等甚至浅层编码网络以降低编码复杂度,同时利用一步扩散解码器(one-step diffusion decoder)在极低比特率下保持高质量和高真实感重建;此外,设计双侧特征蒸馏机制,将中等编码器的知识迁移至浅层版本,进一步提升浅层编码器的效率与性能。

链接: https://arxiv.org/abs/2512.12229
作者: Tianyu Zhang,Dong Liu,Chang Wen Chen
机构: University of Science and Technology of China (中国科学技术大学); The Hong Kong Polytechnic University (香港理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Ultra-low bitrate image compression (below 0.05 bits per pixel) is increasingly critical for bandwidth-constrained and computation-limited encoding scenarios such as edge devices. Existing frameworks typically rely on large pretrained encoders (e.g., VAEs or tokenizer-based models) and perform transform coding within their generative latent space. While these approaches achieve impressive perceptual fidelity, their reliance on heavy encoder networks makes them unsuitable for deployment on weak sender devices. In this work, we explore the feasibility of applying shallow encoders for ultra-low bitrate compression and propose a novel Asymmetric Extreme Image Compression (AEIC) framework that pursues simultaneously encoding simplicity and decoding quality. Specifically, AEIC employs moderate or even shallow encoder networks, while leveraging an one-step diffusion decoder to maintain high-fidelity and high-realism reconstructions under extreme bitrates. To further enhance the efficiency of shallow encoders, we design a dual-side feature distillation scheme that transfers knowledge from AEIC with moderate encoders to its shallow encoder variants. Experiments demonstrate that AEIC not only outperforms existing methods on rate-distortion-perception performance at ultra-low bitrates, but also delivers exceptional encoding efficiency for 35.8 FPS on 1080P input images, while maintaining competitive decoding speed compared to existing methods.
zh

[CV-167] Comparison of different segmentation algorithms on brain volume and fractal dimension in infant brain MRIs

【速读】:该论文旨在解决婴儿脑部磁共振成像(MRI)中自动分割精度不足的问题,尤其是在出生后早期阶段由于髓鞘形成不完全和组织对比度降低导致的分割挑战。研究通过对比两种主流分割方法——SynthSeg与SamSeg在Baby Open Brains(BOB)数据集上的表现,评估其对体积测量和分形维数(fractal dimension, FD)估计的影响。关键解决方案在于采用基于深度学习的SynthSeg方法,其在Dice系数、交并比(Intersection over Union)、95% Hausdorff距离及归一化互信息等指标上均显著优于SamSeg,且提供更接近专家标注的体积估计(平均偏差+4%),而SamSeg则系统性高估脑室和全脑体积(平均+76%)。此外,研究发现分割误差会直接传导至FD分析中,提示在发育研究中需谨慎解释微小形态差异,因为它们可能源于分割不确定性而非真实生物学变化。

链接: https://arxiv.org/abs/2512.12222
作者: Nathalie Alexander,Arnaud Gucciardi,Umberto Michelucci
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate segmentation of infant brain MRI is essential for quantifying developmental changes in structure and complexity. However, ongoing myelination and reduced tissue contrast make automated segmentation particularly challenging. This study systematically compared segmentation accuracy and its impact on volumetric and fractal dimension (FD) estimates in infant brain MRI using the Baby Open Brains (BOB) dataset (71 scans, 1-9 months). Two methods, SynthSeg and SamSeg, were evaluated against expert annotations using Dice, Intersection over Union, 95th-percentile Hausdorff distance, and Normalised Mutual Information. SynthSeg outperformed SamSeg across all quality metrics (mean Dice 0.8 for major regions) and provided volumetric estimates closely matching the manual reference (mean +4% [-28% - 71%]). SamSeg systematically overestimated ventricular and whole-brain volumes (mean +76% [-12% - 190%]). Segmentation accuracy improved with age, consistent with increasing tissue contrast during myelination. Fractal dimension a(FD) nalyses revealed significant regional differences between SynthSeg and expert segmentations, and Bland-Altman limits of agreement indicated that segmentation-related FD variability exceeded most group differences reported in developmental cohorts. Volume and FD deviations were positively correlated across structures, indicating that segmentation bias directly affects FD estimation. Overall, SynthSeg provided the most reliable volumetric and FD results for paediatric MRI, yet small morphological differences in volume and FD should be interpreted with caution due to segmentation-related uncertainty.
zh

[CV-168] ProImage-Bench: Rubric-Based Evaluation for Professional Image Generation

【速读】:该论文旨在解决专业图像生成(Professional Image Generation)中模型难以生成信息密集且科学精确的插图问题,即如何从技术描述中合成符合科学规范的生物学示意图、工程/专利图纸及通用科学图表,而非仅生成视觉上合理的图像。其解决方案的关键在于提出ProImage-Bench——一个基于评分标准(rubric-based benchmark)的评估框架,通过从真实教材和技术报告中收集654张图像,构建详细的图像指令与包含6,076个评判标准和44,131个二元检查项的层级评分体系;该评分体系由大语言模型(LLM)和多模态模型(LMM)自动提取并由基于LMM的判别器进行评估,采用结构化惩罚机制将子问题结果聚合为可解释的指标分数。实验表明,即使最强的基础生成模型在rubric准确率上仅为0.791,说明当前模型在细粒度科学保真度方面存在显著差距;进一步地,利用失败检查项反馈至编辑模型进行迭代优化,可使性能大幅提升,验证了该框架不仅能精准诊断问题,还能提供可扩展的监督信号以提升生成内容对技术规范的忠实度。

链接: https://arxiv.org/abs/2512.12220
作者: Minheng Ni,Zhengyuan Yang,Yaowen Zhang,Linjie Li,Chung-Ching Lin,Kevin Lin,Zhendong Wang,Xiaofei Wang,Shujie Liu,Lei Zhang,Wangmeng Zuo,Lijuan Wang
机构: Hong Kong Polytechnic University (香港理工大学); Harbin Institute of Technology (哈尔滨工业大学); Microsoft (微软)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We study professional image generation, where a model must synthesize information-dense, scientifically precise illustrations from technical descriptions rather than merely produce visually plausible pictures. To quantify the progress, we introduce ProImage-Bench, a rubric-based benchmark that targets biology schematics, engineering/patent drawings, and general scientific diagrams. For 654 figures collected from real textbooks and technical reports, we construct detailed image instructions and a hierarchy of rubrics that decompose correctness into 6,076 criteria and 44,131 binary checks. Rubrics are derived from surrounding text and reference figures using large multimodal models, and are evaluated by an automated LMM-based judge with a principled penalty scheme that aggregates sub-question outcomes into interpretable criterion scores. We benchmark several representative text-to-image models on ProImage-Bench and find that, despite strong open-domain performance, the best base model reaches only 0.791 rubric accuracy and 0.553 criterion score overall, revealing substantial gaps in fine-grained scientific fidelity. Finally, we show that the same rubrics provide actionable supervision: feeding failed checks back into an editing model for iterative refinement boosts a strong generator from 0.653 to 0.865 in rubric accuracy and from 0.388 to 0.697 in criterion score. ProImage-Bench thus offers both a rigorous diagnostic for professional image generation and a scalable signal for improving specification-faithful scientific illustrations.
zh

[CV-169] Fine-Grained Zero-Shot Learning with Attribute-Centric Representations

【速读】:该论文旨在解决零样本学习(Zero-Shot Learning, ZSL)中未见细粒度类别识别的问题,核心挑战在于属性纠缠(attribute entanglement),即传统模型将颜色、形状和纹理等不同视觉属性混合到单一视觉嵌入中,导致关键差异被掩盖。解决方案的关键在于提出一种属性中心表示(Attribute-Centric Representations, ACR)框架,在表征学习阶段直接实现属性解耦(attribute disentanglement)。ACR通过两个专家混合组件实现:一是基于双层路由机制插入Transformer的Patch专家混合(Mixture of Patch Experts, MoPE),条件性地将图像块分配给专用专家以处理同质属性簇;二是属性专家混合(Mixture of Attribute Experts, MoAE)将专家优化后的特征投影为稀疏且部位感知的属性图,从而提升零样本分类的鲁棒性。

链接: https://arxiv.org/abs/2512.12219
作者: Zhi Chen,Jingcai Guo,Taotao Cai,Yuxiang Cai
机构: University of Southern Queensland (南昆士兰大学); The Hong Kong Polytechnic University (香港理工大学); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint

点击查看摘要

Abstract:Recognizing unseen fine-grained categories demands a model that can distinguish subtle visual differences. This is typically achieved by transferring visual-attribute relationships from seen classes to unseen classes. The core challenge is attribute entanglement, where conventional models collapse distinct attributes like color, shape, and texture into a single visual embedding. This causes interference that masks these critical distinctions. The post-hoc solutions of previous work are insufficient, as they operate on representations that are already mixed. We propose a zero-shot learning framework that learns AttributeCentric Representations (ACR) to tackle this problem by imposing attribute disentanglement during representation learning. ACR is achieved with two mixture-of-experts components, including Mixture of Patch Experts (MoPE) and Mixture of Attribute Experts (MoAE). First, MoPE is inserted into the transformer using a dual-level routing mechanism to conditionally dispatch image patches to specialized experts. This ensures coherent attribute families are processed by dedicated experts. Finally, the MoAE head projects these expert-refined features into sparse, partaware attribute maps for robust zero-shot classification. On zero-shot learning benchmark datasets CUB, AwA2, and SUN, our ACR achieves consistent state-of-the-art results.
zh

[CV-170] CineLOG: A Training Free Approach for Cinematic Long Video Generation

【速读】:该论文旨在解决可控视频生成(Controllable Video Synthesis)中的关键挑战,即现有模型难以在文本提示之外实现对电影属性(如镜头运动轨迹和影片类型)的精细控制,且现有数据集普遍存在数据不平衡、标签噪声或仿真与真实场景差距大的问题。其解决方案的核心在于提出一个名为 CineLOG 的新数据集,包含 5,000 条高质量、未剪辑、类别平衡的视频片段,每条均标注了详细的场景描述、基于标准电影分类体系的明确摄像机指令(camera instructions)以及影片类型标签,从而实现对 17 种摄像机运动方式和 15 种电影类型的均衡覆盖;同时设计了一种将复杂文本到视频(Text-to-Video, T2V)生成任务解耦为四个更易处理阶段的新颖流水线,并引入一种轨迹引导过渡模块(Trajectory Guided Transition Module),以生成平滑的时空插值,从而实现连贯的多镜头序列生成。

链接: https://arxiv.org/abs/2512.12209
作者: Zahra Dehghanian,Morteza Abolghasemi,Hamid Beigy,Hamid R. Rabiee
机构: Sharif University of Technology (谢里夫理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Controllable video synthesis is a central challenge in computer vision, yet current models struggle with fine grained control beyond textual prompts, particularly for cinematic attributes like camera trajectory and genre. Existing datasets often suffer from severe data imbalance, noisy labels, or a significant simulation to real gap. To address this, we introduce CineLOG, a new dataset of 5,000 high quality, balanced, and uncut video clips. Each entry is annotated with a detailed scene description, explicit camera instructions based on a standard cinematic taxonomy, and genre label, ensuring balanced coverage across 17 diverse camera movements and 15 film genres. We also present our novel pipeline designed to create this dataset, which decouples the complex text to video (T2V) generation task into four easier stages with more mature technology. To enable coherent, multi shot sequences, we introduce a novel Trajectory Guided Transition Module that generates smooth spatio-temporal interpolation. Extensive human evaluations show that our pipeline significantly outperforms SOTA end to end T2V models in adhering to specific camera and screenplay instructions, while maintaining professional visual quality. All codes and data are available at this https URL.
zh

[CV-171] A Hybrid Deep Learning Framework for Emotion Recognition in Children with Autism During NAO Robot-Mediated Interaction

【速读】:该论文旨在解决自闭症谱系障碍(Autism Spectrum Disorder, ASD)儿童在社交互动中情感反应识别的难题,尤其是在人形机器人(如NAO)实施挑衅性行为(如辱骂事件)时,如何准确捕捉其细微的情绪变化。解决方案的关键在于构建一个融合视觉与几何特征的深度学习流水线:首先利用微调的ResNet-50 CNN提取面部图像特征,同时通过MediaPipe FaceMesh获取关键点几何信息;随后采用三层图卷积网络(Graph Convolutional Network, GCN)建模面部结构关系;最终通过加权集成DeepFace与FER模型生成软标签,并基于KL散度优化嵌入表示进行分类。该方法首次在印度构建了大规模、真实场景下的ASD儿童情绪数据集,显著提升了对神经多样性儿童微表情的敏感性,为临床和治疗场景中的人机交互(Human-Robot Interaction, HRI)提供了可扩展的个性化情感分析框架。

链接: https://arxiv.org/abs/2512.12208
作者: Indranil Bhattacharjee,Vartika Narayani Srinet,Anirudha Bhattacharjee,Braj Bhushan,Bishakh Bhattacharya
机构: Cochin University of Science and Technology (科钦科技大学); Indian Institute of Technology Kanpur (印度理工学院坎普尔分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 12 pages, journal paper

点击查看摘要

Abstract:Understanding emotional responses in children with Autism Spectrum Disorder (ASD) during social interaction remains a critical challenge in both developmental psychology and human-robot interaction. This study presents a novel deep learning pipeline for emotion recognition in autistic children in response to a name-calling event by a humanoid robot (NAO), under controlled experimental settings. The dataset comprises of around 50,000 facial frames extracted from video recordings of 15 children with ASD. A hybrid model combining a fine-tuned ResNet-50-based Convolutional Neural Network (CNN) and a three-layer Graph Convolutional Network (GCN) trained on both visual and geometric features extracted from MediaPipe FaceMesh landmarks. Emotions were probabilistically labeled using a weighted ensemble of two models: DeepFace’s and FER, each contributing to soft-label generation across seven emotion classes. Final classification leveraged a fused embedding optimized via Kullback-Leibler divergence. The proposed method demonstrates robust performance in modeling subtle affective responses and offers significant promise for affective profiling of ASD children in clinical and therapeutic human-robot interaction contexts, as the pipeline effectively captures micro emotional cues in neurodivergent children, addressing a major gap in autism-specific HRI research. This work represents the first such large-scale, real-world dataset and pipeline from India on autism-focused emotion analysis using social robotics, contributing an essential foundation for future personalized assistive technologies.
zh

[CV-172] ALERT Open Dataset and Input-Size-Agnostic Vision Transformer for Driver Activity Recognition using IR-UWB

【速读】:该论文旨在解决两个关键问题:一是缺乏大规模、多样化的现实场景下用于驾驶员行为识别(Driver Activity Recognition, DAR)的超宽带雷达(Impulse Radio Ultra-Wideband, IR-UWB)数据集;二是现有固定输入尺寸的视觉Transformer(Vision Transformer, ViT)难以适配非标准维度的UWB雷达数据。解决方案的关键在于提出两个核心创新:其一,构建了ALERT数据集,包含10,220个来自真实驾驶环境的七类分心驾驶行为雷达样本,填补了高质量UWB数据的空白;其二,设计了输入尺寸无关的视觉Transformer(Input-size-agnostic Vision Transformer, ISA-ViT),通过调整patch配置和利用预训练位置嵌入向量(Positional Embedding Vectors, PEVs),在不破坏雷达特有的多普勒频移和相位特征的前提下实现对UWB数据的有效重采样,并结合时域与频域特征融合策略提升分类性能。实验表明,ISA-ViT相较现有ViT方法在UWB-DAR任务中准确率提升达22.68%。

链接: https://arxiv.org/abs/2512.12206
作者: Jeongjun Park,Sunwook Hwang,Hyeonho Noh,Jin Mo Yang,Hyun Jong Yang,Saewoong Bahk
机构: Seoul National University (首尔国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Distracted driving contributes to fatal crashes worldwide. To address this, researchers are using driver activity recognition (DAR) with impulse radio ultra-wideband (IR-UWB) radar, which offers advantages such as interference resistance, low power consumption, and privacy preservation. However, two challenges limit its adoption: the lack of large-scale real-world UWB datasets covering diverse distracted driving behaviors, and the difficulty of adapting fixed-input Vision Transformers (ViTs) to UWB radar data with non-standard dimensions. This work addresses both challenges. We present the ALERT dataset, which contains 10,220 radar samples of seven distracted driving activities collected in real driving conditions. We also propose the input-size-agnostic Vision Transformer (ISA-ViT), a framework designed for radar-based DAR. The proposed method resizes UWB data to meet ViT input requirements while preserving radar-specific information such as Doppler shifts and phase characteristics. By adjusting patch configurations and leveraging pre-trained positional embedding vectors (PEVs), ISA-ViT overcomes the limitations of naive resizing approaches. In addition, a domain fusion strategy combines range- and frequency-domain features to further improve classification performance. Comprehensive experiments demonstrate that ISA-ViT achieves a 22.68% accuracy improvement over an existing ViT-based approach for UWB-based DAR. By publicly releasing the ALERT dataset and detailing our input-size-agnostic strategy, this work facilitates the development of more robust and scalable distracted driving detection systems for real-world deployment. Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2512.12206 [cs.CV] (or arXiv:2512.12206v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2512.12206 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-173] A Multi-Year Urban Streetlight Imagery Dataset for Visual Monitoring and Spatio-Temporal Drift Detection

【速读】:该论文旨在解决智能城市部署中视觉模型长期稳定性与漂移问题,特别是针对街灯等固定视角监控场景下的视觉退化、异常检测及MLOps(机器学习运维)策略优化。其关键解决方案是构建一个大规模、纵向的街灯图像数据集(包含超过526,000张小时级采集图像及其丰富元数据),并提出基于卷积变分自编码器(CNN-VAE)的自监督框架,用于建模每个摄像头节点在昼夜条件下的潜在空间漂移。通过定义两个样本级漂移指标——相对中心点漂移(捕捉潜在空间偏离基线季度的变化)和相对重建误差(衡量图像域内的归一化退化程度),该方案为评估长期模型稳定性、漂移感知学习和可部署视觉系统提供了现实且细粒度的基准。

链接: https://arxiv.org/abs/2512.12205
作者: Peizheng Li,Ioannis Mavromatis,Ajith Sahadevan,Tim Farnham,Adnan Aijaz,Aftab Khan
机构: Toshiba Europe Ltd.(东芝欧洲有限公司); Digital Catapult(数字 catapult)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 7 figures. Submitted to Data in Brief (Elsevier)

点击查看摘要

Abstract:We present a large-scale, longitudinal visual dataset of urban streetlights captured by 22 fixed-angle cameras deployed across Bristol, U.K., from 2021 to 2025. The dataset contains over 526,000 images, collected hourly under diverse lighting, weather, and seasonal conditions. Each image is accompanied by rich metadata, including timestamps, GPS coordinates, and device identifiers. This unique real-world dataset enables detailed investigation of visual drift, anomaly detection, and MLOps strategies in smart city deployments. To promtoe seconardary analysis, we additionally provide a self-supervised framework based on convolutional variational autoencoders (CNN-VAEs). Models are trained separately for each camera node and for day/night image sets. We define two per-sample drift metrics: relative centroid drift, capturing latent space deviation from a baseline quarter, and relative reconstruction error, measuring normalized image-domain degradation. This dataset provides a realistic, fine-grained benchmark for evaluating long-term model stability, drift-aware learning, and deployment-ready vision systems. The images and structured metadata are publicly released in JPEG and CSV formats, supporting reproducibility and downstream applications such as streetlight monitoring, weather inference, and urban scene understanding. The dataset can be found at this https URL and this https URL.
zh

[CV-174] Navigation Around Unknown Space Objects Using Visible-Thermal Image Fusion

【速读】:该论文旨在解决在低地球轨道(Low Earth Orbit, LEO)环境中对未知空间目标(Resident Space Object, RSO)进行精确导航的问题,特别是在光照条件恶劣(如日食或阴影期)下传统可见光相机性能受限、而热红外相机虽能适应复杂照明但缺乏特征丰富性和分辨率的困境。解决方案的关键在于利用像素级融合方法将可见光与热红外图像进行融合,生成兼具高特征丰富性与强环境适应性的复合图像,并在此基础上评估单目SLAM(Simultaneous Localization and Mapping)算法在不同成像模式下的导航误差。实验表明,融合图像显著优于仅使用可见光或仅使用热红外图像的导航性能。

链接: https://arxiv.org/abs/2512.12203
作者: Eric J. Elias,Michael Esswein,Jonathan P. How,David W. Miller
机构: Massachusetts Institute of Technology (麻省理工学院); The Charles Stark Draper Laboratory Inc. (查尔斯·斯塔克·德雷珀实验室公司)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 11 figures. To be published in proceedings of AIAA SCITECH 2026 Forum

点击查看摘要

Abstract:As the popularity of on-orbit operations grows, so does the need for precise navigation around unknown resident space objects (RSOs) such as other spacecraft, orbital debris, and asteroids. The use of Simultaneous Localization and Mapping (SLAM) algorithms is often studied as a method to map out the surface of an RSO and find the inspector’s relative pose using a lidar or conventional camera. However, conventional cameras struggle during eclipse or shadowed periods, and lidar, though robust to lighting conditions, tends to be heavier, bulkier, and more power-intensive. Thermal-infrared cameras can track the target RSO throughout difficult illumination conditions without these limitations. While useful, thermal-infrared imagery lacks the resolution and feature-richness of visible cameras. In this work, images of a target satellite in low Earth orbit are photo-realistically simulated in both visible and thermal-infrared bands. Pixel-level fusion methods are used to create visible/thermal-infrared composites that leverage the best aspects of each camera. Navigation errors from a monocular SLAM algorithm are compared between visible, thermal-infrared, and fused imagery in various lighting and trajectories. Fused imagery yields substantially improved navigation performance over visible-only and thermal-only methods.
zh

[CV-175] hermal RGB Fusion for Micro-UAV Wildfire Perimeter Tracking with Minimal Comms MICRO

【速读】:该论文旨在解决微小型无人飞行器(micro UAV)团队在野火环境中执行边界追踪任务时,受限于带宽条件下的感知与通信效率问题。解决方案的关键在于提出一种轻量级的多模态融合追踪方法:利用热成像帧通过自适应阈值和形态学优化生成粗略热点掩膜,RGB帧则结合梯度滤波提供边缘线索并抑制纹理伪影;采用规则级合并策略筛选边界候选点,并通过Ramer-Douglas-Peucker算法简化轮廓;同时引入周期性信标与惯性反馈环路以提升GPS信号劣化时的轨迹稳定性,且通过像素级操作约束和梯度表预计算实现嵌入式系统芯片(SoC)平台上的亚50毫秒延迟,从而保障高动态性能下低功耗运行(支持10–15 m/s前向速度),显著减少路径长度和边界抖动,维持环境覆盖完整性。

链接: https://arxiv.org/abs/2512.12199
作者: Ercan Erkalkan,Vedat Topuz,Ayça Ak
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Conference paper in 17th International Scientific Studies Congress proceedings. Topic: thermal+RGB rule level fusion, RDP boundary simplification, leader follower guidance, sub 50ms embedded SoC, minimal communications for wildfire perimeter tracking. Thermal RGB Fusion for Micro-UAV

点击查看摘要

Abstract:This study introduces a lightweight perimeter tracking method designed for micro UAV teams operating over wildfire environments under limited bandwidth conditions. Thermal image frames generate coarse hot region masks through adaptive thresholding and morphological refinement, while RGB frames contribute edge cues and suppress texture related false detections using gradient based filtering. A rule level merging strategy selects boundary candidates and simplifies them via the Ramer Douglas Peucker algorithm. The system incorporates periodic beacons and an inertial feedback loop that maintains trajectory stability in the presence of GPS degradation. The guidance loop targets sub 50 ms latency on embedded System on Chip (SoC) platforms by constraining per frame pixel operations and precomputing gradient tables. Small scale simulations demonstrate reductions in average path length and boundary jitter compared to a pure edge tracking baseline, while maintaining environmental coverage measured through intersection merge analysis. Battery consumption and computational utilization confirm the feasibility of achieving 10, 15 m/s forward motion on standard micro platforms. This approach enables rapid deployment in the field, requiring robust sensing and minimal communications for emergency reconnaissance applications.
zh

[CV-176] AutoMV: An Automatic Multi-Agent System for Music Video Generation

【速读】:该论文旨在解决音乐到视频(Music-to-Video, M2V)生成中长期视频难以对齐音乐结构、缺乏时间一致性以及内容割裂的问题。现有方法仅能生成短片段,无法体现歌曲的节奏、歌词或情感发展逻辑。其解决方案的关键在于提出一个名为AutoMV的多智能体系统(multi-agent system),通过分工协作实现全流程自动化:首先利用音乐处理工具提取结构、人声和时间对齐歌词等特征作为上下文输入;随后由编剧Agent与导演Agent共同设计脚本、角色档案及镜头指令,并调用图像生成器生成关键帧和不同场景的视频片段;最后由验证Agent评估输出质量,确保整体连贯性。该架构显著提升了M2V生成的完整性与一致性,使AutoMV在多项指标上超越现有基线,接近专业人工制作水平。

链接: https://arxiv.org/abs/2512.12196
作者: Xiaoxuan Tang,Xinping Lei,Chaoran Zhu,Shiyun Chen,Ruibin Yuan,Yizhi Li,Changjae Oh,Ge Zhang,Wenhao Huang,Emmanouil Benetos,Yang Liu,Jiaheng Liu,Yinghao Ma
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Nanjing University (南京大学); Queen Mary University of London (伦敦玛丽女王大学); Hong Kong University of Science and Technology (香港科技大学); University of Manchester (曼彻斯特大学)
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Music-to-Video (M2V) generation for full-length songs faces significant challenges. Existing methods produce short, disjointed clips, failing to align visuals with musical structure, beats, or lyrics, and lack temporal consistency. We propose AutoMV, a multi-agent system that generates full music videos (MVs) directly from a song. AutoMV first applies music processing tools to extract musical attributes, such as structure, vocal tracks, and time-aligned lyrics, and constructs these features as contextual inputs for following agents. The screenwriter Agent and director Agent then use this information to design short script, define character profiles in a shared external bank, and specify camera instructions. Subsequently, these agents call the image generator for keyframes and different video generators for “story” or “singer” scenes. A Verifier Agent evaluates their output, enabling multi-agent collaboration to produce a coherent longform MV. To evaluate M2V generation, we further propose a benchmark with four high-level categories (Music Content, Technical, Post-production, Art) and twelve ine-grained criteria. This benchmark was applied to compare commercial products, AutoMV, and human-directed MVs with expert human raters: AutoMV outperforms current baselines significantly across all four categories, narrowing the gap to professional MVs. Finally, we investigate using large multimodal models as automatic MV judges; while promising, they still lag behind human expert, highlighting room for future work.
zh

[CV-177] SMRABooth: Subject and Motion Representation Alignment for Customized Video Generation

【速读】:该论文旨在解决定制化视频生成中Subject(主体)外观相似性与运动模式一致性难以同时保障的问题,其核心挑战在于缺乏对主体和运动的物体级引导。解决方案的关键在于提出SMRABooth框架,通过自监督编码器(self-supervised encoder)提取物体级主体表征,以及光流编码器(optical flow encoder)提取独立于外观的结构一致运动轨迹表征,并在LoRA微调过程中进行对齐;进一步设计了主体-运动关联解耦策略,采用稀疏LoRA注入方式在空间和时间维度上分离主体与运动控制模块,有效降低两者间的干扰,从而实现高保真主体保留与稳定运动模式生成。

链接: https://arxiv.org/abs/2512.12193
作者: Xuancheng Xu,Yaning Li,Sisi You,Bing-Kun Bao
机构: Nanjing University of Posts and Telecommunications (南京邮电大学); Peng Cheng Laboratory (鹏城实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Customized video generation aims to produce videos that faithfully preserve the subject’s appearance from reference images while maintaining temporally consistent motion from reference videos. Existing methods struggle to ensure both subject appearance similarity and motion pattern consistency due to the lack of object-level guidance for subject and motion. To address this, we propose SMRABooth, which leverages the self-supervised encoder and optical flow encoder to provide object-level subject and motion representations. These representations are aligned with the model during the LoRA fine-tuning process. Our approach is structured in three core stages: (1) We exploit subject representations via a self-supervised encoder to guide subject alignment, enabling the model to capture overall structure of subject and enhance high-level semantic consistency. (2) We utilize motion representations from an optical flow encoder to capture structurally coherent and object-level motion trajectories independent of appearance. (3) We propose a subject-motion association decoupling strategy that applies sparse LoRAs injection across both locations and timing, effectively reducing interference between subject and motion LoRAs. Extensive experiments show that SMRABooth excels in subject and motion customization, maintaining consistent subject appearance and motion patterns, proving its effectiveness in controllable text-to-video generation.
zh

[CV-178] Audio-Visual Camera Pose Estimationn with Passive Scene Sounds and In-the-Wild Video

【速读】:该论文旨在解决在视觉条件退化(如运动模糊或遮挡)情况下,从真实场景视频中准确估计相对相机位姿(relative camera pose estimation)的问题。其核心挑战在于纯视觉方法在低质量图像条件下性能显著下降。解决方案的关键在于引入被动场景声音作为互补线索,通过将方向到达角(direction-of-arrival, DOA)谱和双耳化嵌入(binauralized embeddings)融合进一个先进的纯视觉位姿估计模型,从而提升位姿估计的鲁棒性与准确性。这一方法首次成功地在真实世界视频中利用音频信号实现相对相机位姿估计,验证了日常声学信息作为空间感知新模态的潜力。

链接: https://arxiv.org/abs/2512.12165
作者: Daniel Adebi,Sagnik Majumder,Kristen Grauman
机构: University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Understanding camera motion is a fundamental problem in embodied perception and 3D scene understanding. While visual methods have advanced rapidly, they often struggle under visually degraded conditions such as motion blur or occlusions. In this work, we show that passive scene sounds provide complementary cues for relative camera pose estimation for in-the-wild videos. We introduce a simple but effective audio-visual framework that integrates direction-ofarrival (DOA) spectra and binauralized embeddings into a state-of-the-art vision-only pose estimation model. Our results on two large datasets show consistent gains over strong visual baselines, plus robustness when the visual information is corrupted. To our knowledge, this represents the first work to successfully leverage audio for relative camera pose estimation in real-world videos, and it establishes incidental, everyday audio as an unexpected but promising signal for a classic spatial challenge. Project: this http URL.
zh

[CV-179] Open Horizons: Evaluating Deep Models in the Wild

【速读】:该论文旨在解决开放世界部署中模型对已知类别识别与未知类别检测的双重挑战,即在面对新类别出现时仍能保持可靠性的问题。其核心解决方案在于通过统一实验框架对比分析不同视觉编码器(如ResNet-50、ConvNeXt-Tiny和CLIP ViT-B/16)与后处理评分机制(MSP、Energy、Mahalanobis、kNN)在开放集识别(Open-set Recognition, OSR)中的表现,并进一步评估原型驱动方法(如ConCM)在少量样本增量学习(Few-shot Class-Incremental Learning, FSCIL)中缓解灾难性遗忘的能力。关键发现包括:CLIP在OSR任务中展现出最优的已知与未知样本分离能力,而Energy评分函数具有最稳定的性能;在FSCIL中,ConCM方法在10-shot设置下达到84.7%准确率且混淆矩阵最清晰,表明原型更新策略能有效抑制遗忘,同时验证了模型架构与评分机制对未知检测效果的显著影响。

链接: https://arxiv.org/abs/2512.12146
作者: Ayush Vaibhav Bhatti,Deniz Karakay,Debottama Das,Nilotpal Rajbongshi,Yuito Sugimoto
机构: University of Arizona (亚利桑那大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Open-world deployment requires models to recognize both known categories and remain reliable when novel classes appear. We present a unified experimental study spanning open-set recognition (OSR) and few-shot class-incremental learning (FSCIL) on CIFAR-10. For OSR, we compare three pretrained frozen visual encoders: ResNet-50, ConvNeXt-Tiny and CLIP ViT-B/16,using a linear probe and four post-hoc scoring functions, namely MSP, Energy, Mahalanobis and kNN. Across metrics,such as, AUROC, AUPR, FPR@95, and OSCR, CLIP consistently yields the strongest separability between known and unknown samples, with Energy providing the most stable performance across backbones. For FSCIL, we compare modified SPPR, OrCo, and ConCM using partially frozen ResNet-50 across 1-, 5-, and 10-shot scenarios. ConCM achieves 84.7% accuracy in the 10-shot setting with the cleanest confusion matrix, while all methods show saturation beyond 5 shots. Our controlled evaluation reveals how the backbone architecture and scoring mechanisms affect unknown detection and how prototype-based methods mitigate catastrophic forgetting during incremental adaptation.
zh

[CV-180] MeltwaterBench: Deep learning for spatiotemporal downscaling of surface meltwater

【速读】:该论文旨在解决格陵兰冰盖表面融水分布监测中高时空分辨率数据缺失的问题,当前遥感观测与物理模型之间的权衡限制了对融化过程的精确刻画。其解决方案的关键在于利用深度学习模型融合多源数据流——包括合成孔径雷达(SAR)、被动微波(PMW)遥感数据和数字高程模型(DEM),并以SAR反演的融水产品作为“地面真值”,实现从区域气候模型(RCM)输出到每日100米空间分辨率的时空下分辨率映射。实验表明,该方法在研究区(东格陵兰Helheim冰川)相比仅依赖RCM或PMW的传统方法显著提升精度(83% vs. 95% Acc. 和 72% vs. 95% Acc.),验证了多源数据融合与深度学习驱动的降尺度策略的有效性。

链接: https://arxiv.org/abs/2512.12142
作者: Björn Lütjens,Patrick Alexander,Raf Antwerpen,Til Widmann,Guido Cervone,Marco Tedesco
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph); Data Analysis, Statistics and Probability (physics.data-an)
备注:

点击查看摘要

Abstract:The Greenland ice sheet is melting at an accelerated rate due to processes that are not fully understood and hard to measure. The distribution of surface meltwater can help understand these processes and is observable through remote sensing, but current maps of meltwater face a trade-off: They are either high-resolution in time or space, but not both. We develop a deep learning model that creates gridded surface meltwater maps at daily 100m resolution by fusing data streams from remote sensing observations and physics-based models. In particular, we spatiotemporally downscale regional climate model (RCM) outputs using synthetic aperture radar (SAR), passive microwave (PMW), and a digital elevation model (DEM) over the Helheim Glacier in Eastern Greenland from 2017-2023. Using SAR-derived meltwater as “ground truth”, we show that a deep learning-based method that fuses all data streams is over 10 percentage points more accurate over our study area than existing non deep learning-based approaches that only rely on a regional climate model (83% vs. 95% Acc.) or passive microwave observations (72% vs. 95% Acc.). Alternatively, creating a gridded product through a running window calculation with SAR data underestimates extreme melt events, but also achieves notable accuracy (90%) and does not rely on deep learning. We evaluate standard deep learning methods (UNet and DeepLabv3+), and publish our spatiotemporally aligned dataset as a benchmark, MeltwaterBench, for intercomparisons with more complex data-driven downscaling methods. The code and data are available at \hrefthis https URLthis http URL .
zh

[CV-181] A Benchmark Dataset for Spatially Aligned Road Damage Assessment in Small Uncrewed Aerial Systems Disaster Imagery AAAI’26

【速读】:该论文旨在解决灾后道路损伤评估数据集规模小、分辨率低以及机器学习(ML)模型缺乏实际操作验证的问题,从而提升应急管理人员对灾害现场道路状况的准确判断能力。其关键解决方案在于构建了目前最大规模的基准数据集CRASAR-U-DRIODs,涵盖10次联邦宣布灾难后的无人机(sUAS)影像,并基于10类标签体系标注了657.25公里道路;同时,通过在飓风Debby和Helene的实际响应中部署18个基线模型,首次实现了ML系统在真实灾情中的操作验证。此外,研究发现道路线空间错位显著影响模型性能(平均Macro IoU下降5.596%),因此引入9,184条道路线调整以纠正初始道路线偏移,确保模型输入几何一致性,从而减少误判率(约8%的不良路况被错误标记,约9%的道路线偏离实际位置)。这一方法填补了当前ML、计算机视觉(CV)与机器人学社区在灾后道路评估中对空间对齐机制重视不足的空白。

链接: https://arxiv.org/abs/2512.12128
作者: Thomas Manzini,Priyankari Perali,Raisa Karnik,Robin R. Murphy
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 11 pages, 6 figures, 6 tables. To appear AAAI’26

点击查看摘要

Abstract:This paper presents the largest known benchmark dataset for road damage assessment and road alignment, and provides 18 baseline models trained on the CRASAR-U-DRIODs dataset’s post-disaster small uncrewed aerial systems (sUAS) imagery from 10 federally declared disasters, addressing three challenges within prior post-disaster road damage assessment datasets. While prior disaster road damage assessment datasets exist, there is no current state of practice, as prior public datasets have either been small-scale or reliant on low-resolution imagery insufficient for detecting phenomena of interest to emergency managers. Further, while machine learning (ML) systems have been developed for this task previously, none are known to have been operationally validated. These limitations are overcome in this work through the labeling of 657.25km of roads according to a 10-class labeling schema, followed by training and deploying ML models during the operational response to Hurricanes Debby and Helene in 2024. Motivated by observed road line misalignment in practice, 9,184 road line adjustments were provided for spatial alignment of a priori road lines, as it was found that when the 18 baseline models are deployed against real-world misaligned road lines, model performance degraded on average by 5.596% Macro IoU. If spatial alignment is not considered, approximately 8% (11km) of adverse conditions on road lines will be labeled incorrectly, with approximately 9% (59km) of road lines misaligned off the actual road. These dynamics are gaps that should be addressed by the ML, CV, and robotics communities to enable more effective and informed decision-making during disasters.
zh

[CV-182] A Novel Patch-Based TDA Approach for Computed Tomography

【速读】:该论文旨在解决基于CT影像的机器学习模型在特征提取效率与性能上的局限性问题,尤其是传统基于3D立方体复形滤波(3D cubical complex filtration)构建持久同调(Persistent Homology, PH)方法在高分辨率CT图像中存在计算复杂度高、性能不稳定的问题。其解决方案的关键在于提出一种新型的基于图像块(patch-based)的PH构造方法,该方法针对体积医学影像数据(特别是CT模态)设计,能够更高效地提取拓扑特征,并显著提升分类性能和时间效率。实验表明,该方法在准确率、AUC、敏感性、特异性及F1分数上平均优于传统立方体复形方法,分别提升了10.38%、6.94%、2.06%、11.58%和8.51%,同时提供了开源Python工具包Patch-TDA以促进实际应用。

链接: https://arxiv.org/abs/2512.12108
作者: Dashti A. Ali,Aras T. Asaad,Jacob J. Peoples,Mohammad Hamghalam,Alex Robins,Mane Piliposyan,Richard K. G. Do,Natalie Gangai,Yun S. Chun,Ahmad Bashir Barekzai,Jayasree Chakraborty,Hala Khasawneh,Camila Vilela,Natally Horvat,João Miranda,Alice C. Wei,Amber L. Simpson
机构: Queen’s University (皇后大学); Oxford University (牛津大学); Memorial Sloan Kettering Cancer Center (纪念斯隆-凯特琳癌症中心); University of Texas MD Anderson Cancer Centre (德克萨斯大学MD安德森癌症中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The development of machine learning (ML) models based on computed tomography (CT) imaging modality has been a major focus of recent research in the medical imaging domain. Incorporating robust feature engineering approach can highly improve the performance of these models. Topological data analysis (TDA), a recent development based on the mathematical field of algebraic topology, mainly focuses on the data from a topological perspective, extracting deeper insight and higher dimensional structures from the data. Persistent homology (PH), a fundamental tool in the area of TDA, can extract topological features such as connected components, cycles and voids from the data. A popular approach to construct PH from 3D CT images is to utilize the 3D cubical complex filtration, a method adapted for grid-structured data. However, this approach may not always yield the best performance and can suffer from computational complexity with higher resolution CT images. This study introduces a novel patch-based PH construction approach tailored for volumetric medical imaging data, in particular CT modality. A wide range of experiments has been conducted on several datasets of 3D CT images to comprehensively analyze the performance of the proposed method with various parameters and benchmark it against the 3D cubical complex algorithm. Our results highlight the dominance of the patch-based TDA approach in terms of both classification performance and time-efficiency. The proposed approach outperformed the cubical complex method, achieving average improvement of 10.38%, 6.94%, 2.06%, 11.58%, and 8.51% in accuracy, AUC, sensitivity, specificity, and F1 score, respectively, across all datasets. Finally, we provide a convenient python package, Patch-TDA, to facilitate the utilization of the proposed approach.
zh

[CV-183] EchoVLM: Measurement-Grounded Multimodal Learning for Echocardiography

【速读】:该论文旨在解决超声心动图(echocardiography)图像解读中存在的人工密集型、多模态特征复杂以及缺乏基于测量推理的挑战,这些问题限制了现有视觉-语言模型(VLMs)在该领域的应用效果。其解决方案的关键在于构建首个以测量为基准的多模态超声心动图数据集 EchoGround-MIMIC,并提出 EchoVLM 模型,该模型引入两项创新的预训练目标:(i) 视图感知对比损失(view-informed contrastive loss),用于建模超声图像的视图依赖结构;(ii) 否定感知对比损失(negation-aware contrastive loss),能够区分临床关键的阴性与阳性发现。这一方法显著提升了模型在疾病分类、视图识别、图像文本检索等任务上的性能,验证了临床 grounded 多模态预训练可生成可迁移的视觉表征,使 EchoVLM 成为端到端超声心动图分析的基础模型。

链接: https://arxiv.org/abs/2512.12107
作者: Yuheng Li,Yue Zhang,Abdoul Aziz Amadou,Yuxiang Lai,Jike Zhong,Tiziano Passerini,Dorin Comaniciu,Puneet Sharma
机构: Georgia Institute of Technology (佐治亚理工学院); Siemens Healthineers (西门子医疗健康); Siemens Healthcare Limited (西门子医疗有限公司); Emory University (埃默里大学); University of Southern California (南加州大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Echocardiography is the most widely used imaging modality in cardiology, yet its interpretation remains labor-intensive and inherently multimodal, requiring view recognition, quantitative measurements, qualitative assessments, and guideline-based reasoning. While recent vision-language models (VLMs) have achieved broad success in natural images and certain medical domains, their potential in echocardiography has been limited by the lack of large-scale, clinically grounded image-text datasets and the absence of measurement-based reasoning central to echo interpretation. We introduce EchoGround-MIMIC, the first measurement-grounded multimodal echocardiography dataset, comprising 19,065 image-text pairs from 1,572 patients with standardized views, structured measurements, measurement-grounded captions, and guideline-derived disease labels. Building on this resource, we propose EchoVLM, a vision-language model that incorporates two novel pretraining objectives: (i) a view-informed contrastive loss that encodes the view-dependent structure of echocardiographic imaging, and (ii) a negation-aware contrastive loss that distinguishes clinically critical negative from positive findings. Across five types of clinical applications with 36 tasks spanning multimodal disease classification, image-text retrieval, view classification, chamber segmentation, and landmark detection, EchoVLM achieves state-of-the-art performance (86.5% AUC in zero-shot disease classification and 95.1% accuracy in view classification). We demonstrate that clinically grounded multimodal pretraining yields transferable visual representations and establish EchoVLM as a foundation model for end-to-end echocardiography interpretation. We will release EchoGround-MIMIC and the data curation code, enabling reproducibility and further research in multimodal echocardiography interpretation.
zh

[CV-184] AI-Augmented Pollen Recognition in Optical and Holographic Microscopy for Veterinary Imaging

【速读】:该论文旨在解决在数字全息显微镜(Digital In-line Holographic Microscopy, DIHM)图像中实现全自动花粉识别的难题,其核心挑战在于DIHM图像存在散斑噪声(speckle noise)、孪生像伪影(twin-image artifacts)以及与明场图像外观显著差异,导致传统基于光学显微镜(optical microscopy)训练的检测与分类模型在DIHM数据上性能急剧下降。解决方案的关键在于引入基于Wasserstein生成对抗网络(WGAN-SN)的合成数据增强策略,通过生成高质量的DIHM图像以扩充训练数据集,并采用1:1.5的真实与合成数据混合比例进行训练,使DIHM图像上的目标检测mAP50从8.15%提升至15.4%,显著缩小了与光学图像识别性能之间的差距,从而推动DIHM在兽医影像自动化分析中的实际应用进程。

链接: https://arxiv.org/abs/2512.12101
作者: Swarn S. Warshaneyan,Maksims Ivanovs,Blaž Cugmas,Inese Bērziņa,Laura Goldberga,Mindaugas Tamosiunas,Roberts Kadiķis
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 10 pages, 10 figures, 2 tables, 22 references. Journal submission undergoing peer review

点击查看摘要

Abstract:We present a comprehensive study on fully automated pollen recognition across both conventional optical and digital in-line holographic microscopy (DIHM) images of sample slides. Visually recognizing pollen in unreconstructed holographic images remains challenging due to speckle noise, twin-image artifacts and substantial divergence from bright-field appearances. We establish the performance baseline by training YOLOv8s for object detection and MobileNetV3L for classification on a dual-modality dataset of automatically annotated optical and affinely aligned DIHM images. On optical data, detection mAP50 reaches 91.3% and classification accuracy reaches 97%, whereas on DIHM data, we achieve only 8.15% for detection mAP50 and 50% for classification accuracy. Expanding the bounding boxes of pollens in DIHM images over those acquired in aligned optical images achieves 13.3% for detection mAP50 and 54% for classification accuracy. To improve object detection in DIHM images, we employ a Wasserstein GAN with spectral normalization (WGAN-SN) to create synthetic DIHM images, yielding an FID score of 58.246. Mixing real-world and synthetic data at the 1.0 : 1.5 ratio for DIHM images improves object detection up to 15.4%. These results demonstrate that GAN-based augmentation can reduce the performance divide, bringing fully automated DIHM workflows for veterinary imaging a small but important step closer to practice.
zh

[CV-185] SPDMark: Selective Parameter Displacement for Robust Video Watermarking

【速读】:该论文旨在解决生成式视频(generative video)中缺乏高效且可靠的水印方案问题,现有基于后处理或生成过程中的水印方法难以同时实现不可感知性、鲁棒性和计算效率。其解决方案的关键在于提出一种名为SPDMark(SpeedMark)的生成过程中水印框架,通过选择性地调整视频扩散模型(video diffusion model)的部分参数来嵌入水印信息;具体而言,将参数偏移建模为层间基向量位移的加性组合,并由水印密钥索引,从而实现高效且可追溯的水印嵌入与提取。该方法利用低秩适应(LoRA)技术提升参数效率,在训练阶段联合优化基位移和水印提取器,最小化消息恢复误差、感知相似性和时间一致性损失;检测时采用密码学哈希函数从基础密钥生成帧级水印消息,并在提取阶段使用最大二分图匹配算法恢复被篡改视频中的正确帧序,从而保障水印在多种常见视频修改下的鲁棒性。

链接: https://arxiv.org/abs/2512.12090
作者: Samar Fares,Nurbek Tastan,Karthik Nandakumar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The advent of high-quality video generation models has amplified the need for robust watermarking schemes that can be used to reliably detect and track the provenance of generated videos. Existing video watermarking methods based on both post-hoc and in-generation approaches fail to simultaneously achieve imperceptibility, robustness, and computational efficiency. This work introduces a novel framework for in-generation video watermarking called SPDMark (pronounced `SpeedMark’) based on selective parameter displacement of a video diffusion model. Watermarks are embedded into the generated videos by modifying a subset of parameters in the generative model. To make the problem tractable, the displacement is modeled as an additive composition of layer-wise basis shifts, where the final composition is indexed by the watermarking key. For parameter efficiency, this work specifically leverages low-rank adaptation (LoRA) to implement the basis shifts. During the training phase, the basis shifts and the watermark extractor are jointly learned by minimizing a combination of message recovery, perceptual similarity, and temporal consistency losses. To detect and localize temporal modifications in the watermarked videos, we use a cryptographic hashing function to derive frame-specific watermark messages from the given base watermarking key. During watermark extraction, maximum bipartite matching is applied to recover the correct frame order, even from temporally tampered videos. Evaluations on both text-to-video and image-to-video generation models demonstrate the ability of SPDMark to generate imperceptible watermarks that can be recovered with high accuracy and also establish its robustness against a variety of common video modifications.
zh

[CV-186] RePack: Representation Packing of Vision Foundation Model Features Enhances Diffusion Transformer

【速读】:该论文旨在解决预训练视觉基础模型(Vision Foundation Models, VFMs)高维特征注入扩散Transformer(Diffusion Transformers, DiTs)时导致的信息过载(Information Overload)问题,尤其是在VFM特征维度超过原始图像解码尺寸时,会干扰模型学习并降低生成质量。解决方案的关键在于提出RePack(Representation Packing)框架,通过将高维VFM特征投影到低维流形空间中,实现对语义信息的压缩与重构,从而有效过滤非语义噪声并保留重建所需的结构信息,同时提升DiT的收敛速度和生成性能。实验表明,RePack在DiT-XL/2上仅用64个epoch即可达到FID=3.66,比当前最优方法快35%。

链接: https://arxiv.org/abs/2512.12083
作者: Guanfang Dong,Luke Schultz,Negar Hassanpour,Chao Gao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The superior representation capability of pre-trained vision foundation models (VFMs) has been harnessed for enhancing latent diffusion models (LDMs). These approaches inject the rich semantics from high-dimensional VFM representations (e.g., DINOv3) into LDMs at different phases, resulting in accelerated learning and better generation performance. However, the high-dimensionality of VFM representations may also lead to Information Overload, particularly when the VFM features exceed the size of the original image for decoding. To address this issue while preserving the utility of VFM features, we propose RePack (Representation Packing), a simple yet effective framework for improving Diffusion Transformers (DiTs). RePack transforms the VFM representation into a more compact, decoder-friendly representation by projecting onto low-dimensional manifolds. We find that RePack can effectively filter out non-semantic noise while preserving the core structural information needed for high-fidelity reconstruction. Experimental results show that RePack significantly accelerates DiT convergence and outperforms recent methods that directly inject raw VFM features into the decoder for image reconstruction. On DiT-XL/2, RePack achieves an FID of 3.66 in only 64 epochs, which is 35% faster than the state-of-the-art method. This demonstrates that RePack successfully extracts the core semantics of VFM representations while bypassing their high-dimensionality side effects.
zh

[CV-187] BAgger: Backwards Aggregation for Mitigating Drift in Autoregressive Video Diffusion Models

【速读】:该论文旨在解决自回归视频模型在世界建模中因暴露偏差(exposure bias)导致的误差累积和质量漂移问题,即训练时使用真实帧作为上下文,而推理时依赖模型自身生成的帧,从而引发长期预测不稳定。其解决方案的关键在于提出一种名为Backwards Aggregation (BAgger) 的自监督机制,通过利用模型自身的轨迹回放构建纠正性路径,使模型学会从自身错误中恢复;该方法采用标准的score matching或flow matching目标进行训练,无需大型教师模型或长时间反向传播,显著提升了长时程运动的稳定性与视觉一致性。

链接: https://arxiv.org/abs/2512.12080
作者: Ryan Po,Eric Ryan Chan,Changan Chen,Gordon Wetzstein
机构: Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Project page here: this https URL

点击查看摘要

Abstract:Autoregressive video models are promising for world modeling via next-frame prediction, but they suffer from exposure bias: a mismatch between training on clean contexts and inference on self-generated frames, causing errors to compound and quality to drift over time. We introduce Backwards Aggregation (BAgger), a self-supervised scheme that constructs corrective trajectories from the model’s own rollouts, teaching it to recover from its mistakes. Unlike prior approaches that rely on few-step distillation and distribution-matching losses, which can hurt quality and diversity, BAgger trains with standard score or flow matching objectives, avoiding large teachers and long-chain backpropagation through time. We instantiate BAgger on causal diffusion transformers and evaluate on text-to-video, video extension, and multi-prompt generation, observing more stable long-horizon motion and better visual consistency with reduced drift.
zh

[CV-188] CreativeVR: Diffusion-Prior-Guided Approach for Structure and Motion Restoration in Generative and Real Videos

【速读】:该论文旨在解决AI生成视频(AIGC)和真实视频中普遍存在但难以修复的严重结构与时间不一致性问题,如人脸和手部扭曲、背景变形及运动失真等。传统视频恢复(VR/VSR)方法针对合成退化(如模糊和下采样)设计,倾向于稳定而非修复此类结构性缺陷;而基于扩散先验的恢复方法通常仅训练于光度噪声,且难以在感知质量与保真度之间进行有效权衡。其解决方案的关键在于提出CreativeVR框架,采用基于深度适配器(deep-adapter)的方法,引入一个可调节精度控制旋钮,实现从标准退化下的精确恢复到复杂内容上更强的结构与运动校正行为之间的平滑过渡;此外,创新性地设计了时序一致的退化模块用于训练阶段,模拟现实中的结构性失败,从而提升模型对AIGC伪影的修复能力。

链接: https://arxiv.org/abs/2512.12060
作者: Tejas Panambur,Ishan Rajendrakumar Dave,Chongjian Ge,Ersin Yumer,Xue Bai
机构: University of Massachusetts Amherst (马萨诸塞大学阿默斯特分校); Adobe (Adobe)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: The first two authors contributed equally

点击查看摘要

Abstract:Modern text-to-video (T2V) diffusion models can synthesize visually compelling clips, yet they remain brittle at fine-scale structure: even state-of-the-art generators often produce distorted faces and hands, warped backgrounds, and temporally inconsistent motion. Such severe structural artifacts also appear in very low-quality real-world videos. Classical video restoration and super-resolution (VR/VSR) methods, in contrast, are tuned for synthetic degradations such as blur and downsampling and tend to stabilize these artifacts rather than repair them, while diffusion-prior restorers are usually trained on photometric noise and offer little control over the trade-off between perceptual quality and fidelity. We introduce CreativeVR, a diffusion-prior-guided video restoration framework for AI-generated (AIGC) and real videos with severe structural and temporal artifacts. Our deep-adapter-based method exposes a single precision knob that controls how strongly the model follows the input, smoothly trading off between precise restoration on standard degradations and stronger structure- and motion-corrective behavior on challenging content. Our key novelty is a temporally coherent degradation module used during training, which applies carefully designed transformations that produce realistic structural failures. To evaluate AIGC-artifact restoration, we propose the AIGC54 benchmark with FIQA, semantic and perceptual metrics, and multi-aspect scoring. CreativeVR achieves state-of-the-art results on videos with severe artifacts and performs competitively on standard video restoration benchmarks, while running at practical throughput (about 13 FPS at 720p on a single 80-GB A100). Project page: this https URL. Comments: The first two authors contributed equally Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM) Cite as: arXiv:2512.12060 [cs.CV] (or arXiv:2512.12060v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2512.12060 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-189] Enhancing deep learning performance on burned area delineation from SPOT-6/7 imagery for emergency management

【速读】:该论文旨在解决野火后烧毁区域(Burned Areas, BAs)快速精准制图的问题,以支持灾后损害评估与生态系统恢复决策。当前方法多依赖计算机视觉模型处理灾后遥感影像,但在时间紧迫的应急场景中存在适用性不足的问题。解决方案的关键在于提出一种监督语义分割工作流,利用SPOT-6/7高分辨率影像实现高效且准确的BA识别:首先对比U-Net与SegFormer模型性能,发现二者在小样本下表现相当,但后者资源消耗更高;其次引入土地覆盖数据作为辅助任务,在不增加推理时间的前提下提升模型鲁棒性;最后通过测试时增强(Test-Time Augmentation)进一步提高精度,结合混合精度(Mixed Precision)优化缓解其带来的推理延迟问题。

链接: https://arxiv.org/abs/2512.12056
作者: Maria Rodriguez,Minh-Tan Pham,Martin Sudmanns,Quentin Poterek,Oscar Narvaez
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, IGARSS 2025

点击查看摘要

Abstract:After a wildfire, delineating burned areas (BAs) is crucial for quantifying damages and supporting ecosystem recovery. Current BA mapping approaches rely on computer vision models trained on post-event remote sensing imagery, but often overlook their applicability to time-constrained emergency management scenarios. This study introduces a supervised semantic segmentation workflow aimed at boosting both the performance and efficiency of BA delineation. It targets SPOT-6/7 imagery due to its very high resolution and on-demand availability. Experiments are evaluated based on Dice score, Intersection over Union, and inference time. The results show that U-Net and SegFormer models perform similarly with limited training data. However, SegFormer requires more resources, challenging its practical use in emergencies. Incorporating land cover data as an auxiliary task enhances model robustness without increasing inference time. Lastly, Test-Time Augmentation improves BA delineation performance but raises inference time, which can be mitigated with optimization methods like Mixed Precision.
zh

[CV-190] Adaptive federated learning for ship detection across diverse satellite imagery sources

【速读】:该论文旨在解决跨多个卫星数据集进行船舶检测时面临的隐私保护与数据孤岛问题,传统方法通常需要集中收集或共享原始数据,这在处理商业卫星影像或敏感标注信息时存在风险。解决方案的关键在于采用联邦学习(Federated Learning, FL)框架,在不交换本地数据的前提下,通过协作训练多个分布式模型来提升整体检测性能。研究对比了四种FL算法(FedAvg、FedProx、FedOpt 和 FedMedian)与本地独立训练基线的性能,结果表明FL能够显著提高小规模本地数据集上的检测精度,并接近全局联合训练的效果,同时强调了通信轮次和本地训练轮数等配置参数对检测精度与计算效率的优化作用。

链接: https://arxiv.org/abs/2512.12053
作者: Tran-Vu La,Minh-Tan Pham,Yu Li,Patrick Matgen,Marco Chini
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, IGARSS 2025

点击查看摘要

Abstract:We investigate the application of Federated Learning (FL) for ship detection across diverse satellite datasets, offering a privacy-preserving solution that eliminates the need for data sharing or centralized collection. This approach is particularly advantageous for handling commercial satellite imagery or sensitive ship annotations. Four FL models including FedAvg, FedProx, FedOpt, and FedMedian, are evaluated and compared to a local training baseline, where the YOLOv8 ship detection model is independently trained on each dataset without sharing learned parameters. The results reveal that FL models substantially improve detection accuracy over training on smaller local datasets and achieve performance levels close to global training that uses all datasets during the training. Furthermore, the study underscores the importance of selecting appropriate FL configurations, such as the number of communication rounds and local training epochs, to optimize detection precision while maintaining computational efficiency.
zh

[CV-191] Exploring Spatial-Temporal Representation via Star Graph for mmWave Radar-based Human Activity Recognition

【速读】:该论文旨在解决毫米波雷达点云(mmWave radar point cloud)在人体活动识别(Human Activity Recognition, HAR)中因信号物理特性导致的稀疏性和帧间尺寸不固定的问题。传统方法借用视觉系统中针对密集点云设计的预处理算法,难以适配雷达数据特性。其解决方案的关键在于提出一种基于离散动态图神经网络(Discrete Dynamic Graph Neural Network, DDGNN)的图表示方法:通过构建星型图(star graph)来刻画人工添加的静态中心点与连续帧中动态雷达点之间的高维相对关系,从而有效建模时空特征;并利用DDGNN学习可变尺寸星型图上的特征表示,实现对雷达点云的高效建模。实验表明,该方法在真实世界HAR数据集上达到94.27%的整体分类准确率,接近基于视觉骨架数据的最优性能(97.25%),且无需重采样或帧聚合模块,适用于资源受限平台(如Raspberry Pi 4)。

链接: https://arxiv.org/abs/2512.12013
作者: Senhao Gao,Junqing Zhang,Luoyu Mei,Shuai Wang,Xuyu Wang
机构: University of Liverpool (利物浦大学); Southeast University (东南大学); City University of Hong Kong (香港城市大学); Florida International University (佛罗里达国际大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Human activity recognition (HAR) requires extracting accurate spatial-temporal features with human movements. A mmWave radar point cloud-based HAR system suffers from sparsity and variable-size problems due to the physical features of the mmWave signal. Existing works usually borrow the preprocessing algorithms for the vision-based systems with dense point clouds, which may not be optimal for mmWave radar systems. In this work, we proposed a graph representation with a discrete dynamic graph neural network (DDGNN) to explore the spatial-temporal representation of human movement-related features. Specifically, we designed a star graph to describe the high-dimensional relative relationship between a manually added static center point and the dynamic mmWave radar points in the same and consecutive frames. We then adopted DDGNN to learn the features residing in the star graph with variable sizes. Experimental results demonstrated that our approach outperformed other baseline methods using real-world HAR datasets. Our system achieved an overall classification accuracy of 94.27%, which gets the near-optimal performance with a vision-based skeleton data accuracy of 97.25%. We also conducted an inference test on Raspberry Pi~4 to demonstrate its effectiveness on resource-constraint platforms. \sh We provided a comprehensive ablation study for variable DDGNN structures to validate our model design. Our system also outperformed three recent radar-specific methods without requiring resampling or frame aggregators.
zh

[CV-192] V-REX: Benchmarking Exploratory Visual Reasoning via Chain-of-Questions

【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在处理复杂开放任务时表现不佳的问题,这类任务通常需要多步骤的视觉探索与推理,而现有模型往往缺乏有效的中间步骤规划与执行能力。解决方案的关键在于提出一个名为“Visual Reasoning with multi-step EXploration (V-REX)”的评估套件,其核心创新是将多步探索式推理形式化为链式问题(Chain-of-Questions, CoQ),并将其解耦为两个可独立评估的能力维度:(1) 规划(Planning)——通过选择一系列探索性问题来分解开放任务;(2) 执行(Following)——按序回答这些问题以收集信息并得出最终答案。通过限制每一步的有限选项,V-REX 实现了对中间步骤的可靠定量分析,从而揭示了当前主流VLM在多步探索推理中的性能瓶颈与改进空间。

链接: https://arxiv.org/abs/2512.11995
作者: Chenrui Fan,Yijun Liang,Shweta Bhardwaj,Kwesi Cobbina,Ming Li,Tianyi Zhou
机构: University of Maryland, College Park (马里兰大学学院公园分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 28 pages

点击查看摘要

Abstract:While many vision-language models (VLMs) are developed to answer well-defined, straightforward questions with highly specified targets, as in most benchmarks, they often struggle in practice with complex open-ended tasks, which usually require multiple rounds of exploration and reasoning in the visual space. Such visual thinking paths not only provide step-by-step exploration and verification as an AI detective but also produce better interpretations of the final answers. However, these paths are challenging to evaluate due to the large exploration space of intermediate steps. To bridge the gap, we develop an evaluation suite, ``Visual Reasoning with multi-step EXploration (V-REX)‘’, which is composed of a benchmark of challenging visual reasoning tasks requiring native multi-step exploration and an evaluation protocol. V-REX covers rich application scenarios across diverse domains. V-REX casts the multi-step exploratory reasoning into a Chain-of-Questions (CoQ) and disentangles VLMs’ capability to (1) Planning: breaking down an open-ended task by selecting a chain of exploratory questions; and (2) Following: answering curated CoQ sequentially to collect information for deriving the final answer. By curating finite options of questions and answers per step, V-REX achieves a reliable quantitative and fine-grained analysis of the intermediate steps. By assessing SOTA proprietary and open-sourced VLMs, we reveal consistent scaling trends, significant differences between planning and following abilities, and substantial room for improvement in multi-step exploratory reasoning.
zh

[CV-193] CARI4D: Category Agnostic 4D Reconstruction of Human-Object Interaction

【速读】:该论文旨在解决从单目RGB视频中重建空间和时间上一致的4D人类-物体交互(Human-Object Interaction, HOI)问题,其挑战在于未知物体与人体信息、深度模糊性、遮挡以及复杂运动导致的3D和时序重建不一致。现有方法通常依赖于已知物体模板或限定在有限类别中,限制了泛化能力。解决方案的关键在于提出CARI4D,一种类别无关(category-agnostic)的方法,通过三个核心步骤实现:首先设计姿态假设选择算法,鲁棒地融合来自基础模型(foundation models)的个体预测;其次采用学习的“渲染-对比”(render-and-compare)范式联合优化这些预测,确保空间、时间及像素级对齐;最后基于物理约束推理复杂接触关系以进一步精修结果。该方法在分布内和分布外数据集上分别提升重建精度38%和36%,且具备零样本迁移至真实网络视频的能力。

链接: https://arxiv.org/abs/2512.11988
作者: Xianghui Xie,Bowen Wen,Yan Chang,Hesam Rabeti,Jiefeng Li,Ye Yuan,Gerard Pons-Moll,Stan Birchfield
机构: NVIDIA; University of Tübingen (图宾根大学); Tübingen AI Center (图宾根人工智能中心); Max Planck Institute for Informatics (马克斯·普朗克信息研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 8 figures, 4 tables. Project page: this https URL

点击查看摘要

Abstract:Accurate capture of human-object interaction from ubiquitous sensors like RGB cameras is important for applications in human understanding, gaming, and robot learning. However, inferring 4D interactions from a single RGB view is highly challenging due to the unknown object and human information, depth ambiguity, occlusion, and complex motion, which hinder consistent 3D and temporal reconstruction. Previous methods simplify the setup by assuming ground truth object template or constraining to a limited set of object categories. We present CARI4D, the first category-agnostic method that reconstructs spatially and temporarily consistent 4D human-object interaction at metric scale from monocular RGB videos. To this end, we propose a pose hypothesis selection algorithm that robustly integrates the individual predictions from foundation models, jointly refine them through a learned render-and-compare paradigm to ensure spatial, temporal and pixel alignment, and finally reasoning about intricate contacts for further refinement satisfying physical constraints. Experiments show that our method outperforms prior art by 38% on in-distribution dataset and 36% on unseen dataset in terms of reconstruction error. Our model generalizes beyond the training categories and thus can be applied zero-shot to in-the-wild internet videos. Our code and pretrained models will be publicly released.
zh

[CV-194] A Comparative Analysis of Semiconductor Wafer Map Defect Detection with Image Transformer

【速读】:该论文旨在解决半导体晶圆缺陷分类在数据受限和类别不平衡场景下的性能下降问题。传统卷积神经网络(Convolutional Neural Networks, CNNs)如VGG-19、SqueezeNet和Xception在小样本条件下表现不佳,难以满足半导体制造中对高精度预测性维护的需求。解决方案的关键在于引入数据高效图像变换器(Data-Efficient Image Transformer, DeiT),其通过自注意力机制捕捉全局特征依赖关系,在有限标注数据下仍能实现高准确率与强鲁棒性。实验表明,DeiT在晶圆缺陷分类任务中达到90.83%的准确率,显著优于对比模型,并展现出更快的训练收敛速度和对少数类缺陷更好的识别能力,验证了基于Transformer架构的模型在半导体制造预测性维护中的有效性。

链接: https://arxiv.org/abs/2512.11977
作者: Sushmita Nath
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: submit/7075585. 5 pages with 5 figures

点击查看摘要

Abstract:Predictive maintenance is an important sector in modern industries which improves fault detection and cost reduction processes. By using machine learning algorithms in the whole process, the defects detection process can be implemented smoothly. Semiconductor is a sensitive maintenance field that requires predictability in work. While convolutional neural networks (CNNs) such as VGG-19, Xception and Squeeze-Net have demonstrated solid performance in image classification for semiconductor wafer industry, their effectiveness often declines in scenarios with limited and imbalanced data. This study investigates the use of the Data-Efficient Image Transformer (DeiT) for classifying wafer map defects under data-constrained conditions. Experimental results reveal that the DeiT model achieves highest classification accuracy of 90.83%, outperforming CNN models such as VGG-19(65%), SqueezeNet(82%), Xception(66%) and Hybrid(67%). DeiT also demonstrated superior F1-score (90.78%) and faster training convergence, with enhanced robustness in detecting minority defect classes. These findings highlight the potential of transformer-based models like DeiT in semiconductor wafer defect detection and support predictive maintenance strategies within semiconductor fabrication processes.
zh

[CV-195] DynaPURLS: Dynamic Refinement of Part-aware Representations for Skeleton-based Zero-Shot Action Recognition

【速读】:该论文旨在解决零样本骨骼动作识别(Zero-shot skeleton-based action recognition, ZS-SAR)中因粗粒度语义对齐导致的跨类别域偏移问题,即现有方法依赖静态类级语义与骨骼特征对齐,难以有效迁移细粒度视觉知识至未见类别。其解决方案的关键在于提出一个统一框架DynaPURLS,通过构建多尺度视觉-语义对应关系并动态优化推理阶段的对齐过程:首先利用大语言模型生成包含全局运动与局部肢体动态的分层文本描述,同时设计自适应分区模块基于语义分组提取细粒度骨骼关节表示;进而引入动态精化模块,在推理时通过轻量可学习投影将文本特征适配至输入视觉流,并借助置信度感知的类平衡记忆库稳定精化过程,抑制噪声伪标签引发的误差传播。

链接: https://arxiv.org/abs/2512.11941
作者: Jingmin Zhu,Anqi Zhu,James Bailey,Jun Liu,Hossein Rahmani,Mohammed Bennamoun,Farid Boussaid,Qiuhong Ke
机构: Monash University (莫纳什大学); University of Melbourne (墨尔本大学); Lancaster University (兰卡斯特大学); University of Western Australia (西澳大利亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Zero-shot skeleton-based action recognition (ZS-SAR) is fundamentally constrained by prevailing approaches that rely on aligning skeleton features with static, class-level semantics. This coarse-grained alignment fails to bridge the domain shift between seen and unseen classes, thereby impeding the effective transfer of fine-grained visual knowledge. To address these limitations, we introduce \textbfDynaPURLS, a unified framework that establishes robust, multi-scale visual-semantic correspondences and dynamically refines them at inference time to enhance generalization. Our framework leverages a large language model to generate hierarchical textual descriptions that encompass both global movements and local body-part dynamics. Concurrently, an adaptive partitioning module produces fine-grained visual representations by semantically grouping skeleton joints. To fortify this fine-grained alignment against the train-test domain shift, DynaPURLS incorporates a dynamic refinement module. During inference, this module adapts textual features to the incoming visual stream via a lightweight learnable projection. This refinement process is stabilized by a confidence-aware, class-balanced memory bank, which mitigates error propagation from noisy pseudo-labels. Extensive experiments on three large-scale benchmark datasets, including NTU RGB+D 60/120 and PKU-MMD, demonstrate that DynaPURLS significantly outperforms prior art, setting new state-of-the-art records. The source code is made publicly available at this https URL
zh

[CV-196] Contextual Peano Scan and Fast Image Segmentation Using Hidden and Evidential Markov Chains

【速读】:该论文旨在解决图像分割中计算效率与模型表达能力之间的平衡问题,特别是在利用隐马尔可夫链(Hidden Markov Chain, HMC)进行无监督图像分割时,如何提升分割精度并保持较高计算速度。其关键解决方案是提出一种新的隐证据马尔可夫链-上下文皮亚诺扫描(Hidden Evidential Markov Chain with Contextual Peano Scan, HEMC-CPS)模型,该模型同时融合了上下文皮亚诺扫描(Contextual Peano Scan, CPS)以增强空间上下文建模能力,以及证据理论(Evidential Theory)框架下的隐马尔可夫链结构以提高不确定性建模和分割鲁棒性。实验表明,该方法在合成与真实图像上均能实现高效的贝叶斯最大后验模(Maximum Posterior Mode, MPM)分割,且参数估计采用随机期望最大化(Stochastic Expectation-Maximization, SEM)算法,具有良好的泛化潜力,适用于三维或多传感器多分辨率图像等复杂场景。

链接: https://arxiv.org/abs/2512.11939
作者: Clément Fernandes(SAMOVAR, SOP - SAMOVAR, TSP),Wojciech Pieczynski(SAMOVAR, SOP - SAMOVAR, TSP)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Statistics Theory (math.ST); Applications (stat.AP)
备注:

点击查看摘要

Abstract:Transforming bi-dimensional sets of image pixels into mono-dimensional sequences with a Peano scan (PS) is an established technique enabling the use of hidden Markov chains (HMCs) for unsupervised image segmentation. Related Bayesian segmentation methods can compete with hidden Markov fields (HMFs)-based ones and are much faster. PS has recently been extended to the contextual PS, and some initial experiments have shown the value of the associated HMC model, denoted as HMC-CPS, in image segmentation. Moreover, HMCs have been extended to hidden evidential Markov chains (HEMCs), which are capable of improving HMC-based Bayesian segmentation. In this study, we introduce a new HEMC-CPS model by simultaneously considering contextual PS and evidential HMC. We show its effectiveness for Bayesian maximum posterior mode (MPM) segmentation using synthetic and real images. Segmentation is performed in an unsupervised manner, with parameters being estimated using the stochastic expectation–maximization (SEM) method. The new HEMC-CPS model presents potential for the modeling and segmentation of more complex images, such as three-dimensional or multi-sensor multi-resolution images. Finally, the HMC-CPS and HEMC-CPS models are not limited to image segmentation and could be used for any kind of spatially correlated data.
zh

[CV-197] MONET – Virtual Cell Painting of Brightfield Images and Time Lapses Using Reference Consistent Diffusion

【速读】:该论文旨在解决细胞着色(Cell painting)技术存在的两大问题:一是实验过程劳动密集,二是依赖化学固定处理,无法实现对细胞动态过程的观测。其解决方案的关键在于训练一个扩散模型(Morphological Observation Neural Enhancement Tool, MONET),利用大规模数据集从明场图像中预测细胞着色通道,并通过一致性架构生成时间序列视频,即便缺乏细胞着色视频的训练数据也能实现;此外,该架构还支持一种上下文学习(in-context learning)能力,使模型在面对分布外的细胞系和成像协议时仍具备一定的迁移性能。此方法为生物研究提供了虚拟细胞着色这一互补工具,推动新型实验流程的发展。

链接: https://arxiv.org/abs/2512.11928
作者: Alexander Peysakhovich,William Berman,Joseph Rufo,Felix Wong,Maxwell Z. Wilson
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Cell painting is a popular technique for creating human-interpretable, high-contrast images of cell morphology. There are two major issues with cell paint: (1) it is labor-intensive and (2) it requires chemical fixation, making the study of cell dynamics impossible. We train a diffusion model (Morphological Observation Neural Enhancement Tool, or MONET) on a large dataset to predict cell paint channels from brightfield images. We show that model quality improves with scale. The model uses a consistency architecture to generate time-lapse videos, despite the impossibility of obtaining cell paint video training data. In addition, we show that this architecture enables a form of in-context learning, allowing the model to partially transfer to out-of-distribution cell lines and imaging protocols. Virtual cell painting is not intended to replace physical cell painting completely, but to act as a complementary tool enabling novel workflows in biological research.
zh

[CV-198] ransBridge: Boost 3D Object Detection by Scene-Level Completion with Transformer Decoder

【速读】:该论文旨在解决远距离区域中由于LiDAR点云稀疏导致的3D目标检测性能下降问题(sparse point cloud in distant regions)。其核心解决方案是提出一种联合补全与检测框架,其中关键创新在于设计了TransBridge模块——一个基于Transformer的上采样块,用于融合检测网络与补全过程的特征,使检测网络能够利用补全网络隐式提取的稠密特征;同时引入动态-静态重建(Dynamic-Static Reconstruction, DSRecon)模块以生成满足稠密点云地面真值要求的补全数据,并通过Transformer机制建模通道与空间关系,从而提升特征图分辨率和检测精度。实验表明,该方法在nuScenes和Waymo数据集上显著提升了端到端3D检测性能,mAP提升达0.7–1.5点,两阶段检测框架甚至提升5.78点。

链接: https://arxiv.org/abs/2512.11926
作者: Qinghao Meng,Chenming Wu,Liangjun Zhang,Jianbing Shen
机构: Beijing Institute of Technology (北京理工大学); Baidu Research (百度研究院); University of Macau (澳门大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 9 figures

点击查看摘要

Abstract:3D object detection is essential in autonomous driving, providing vital information about moving objects and obstacles. Detecting objects in distant regions with only a few LiDAR points is still a challenge, and numerous strategies have been developed to address point cloud sparsity through this http URL paper presents a joint completion and detection framework that improves the detection feature in sparse areas while maintaining costs unchanged. Specifically, we propose TransBridge, a novel transformer-based up-sampling block that fuses the features from the detection and completion this http URL detection network can benefit from acquiring implicit completion features derived from the completion network. Additionally, we design the Dynamic-Static Reconstruction (DSRecon) module to produce dense LiDAR data for the completion network, meeting the requirement for dense point cloud ground this http URL, we employ the transformer mechanism to establish connections between channels and spatial relations, resulting in a high-resolution feature map used for completion this http URL experiments on the nuScenes and Waymo datasets demonstrate the effectiveness of the proposed this http URL results show that our framework consistently improves end-to-end 3D object detection, with the mean average precision (mAP) ranging from 0.7 to 1.5 across multiple methods, indicating its generalization ability. For the two-stage detection framework, it also boosts the mAP up to 5.78 points.
zh

[CV-199] FloraForge: LLM -Assisted Procedural Generation of Editable and Analysis-Ready 3D Plant Geometric Models For Agricultural Applications

【速读】:该论文旨在解决当前植物三维建模方法在生物准确性、可编辑性与易用性之间的矛盾问题:学习驱动的重建方法依赖大量物种特异性数据且缺乏可编辑性,而传统程序化建模虽具参数控制能力却需深厚的几何建模专业知识,难以被植物科学领域的非编程专家使用。其解决方案的关键在于提出FloraForge框架,通过大语言模型(LLM)辅助的协同设计机制,将自然语言指令转化为可执行的Python脚本,生成具有生物学约束的分层B样条曲面表示(hierarchical B-spline surface representations),并以人类可读的Plant Descriptor(PD)文件形式实现参数化控制。该表示支持任意精度的三角网格细分,兼容光模拟、计算流体力学和有限元分析等下游应用,从而在保持数学连续性和生物合理性的同时,显著降低几何建模门槛,使领域专家能通过迭代自然语言描述完成高质量植物模型的构建与优化。

链接: https://arxiv.org/abs/2512.11925
作者: Mozhgan Hadadi,Talukder Z. Jubery,Patrick S. Schnable,Arti Singh,Bedrich Benes,Adarsh Krishnamurthy,Baskar Ganapathysubramanian
机构: Iowa State University (爱荷华州立大学); Max Planck Institute for Informatics (马克斯·普朗克信息研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate 3D plant models are crucial for computational phenotyping and physics-based simulation; however, current approaches face significant limitations. Learning-based reconstruction methods require extensive species-specific training data and lack editability. Procedural modeling offers parametric control but demands specialized expertise in geometric modeling and an in-depth understanding of complex procedural rules, making it inaccessible to domain scientists. We present FloraForge, an LLM-assisted framework that enables domain experts to generate biologically accurate, fully parametric 3D plant models through iterative natural language Plant Refinements (PR), minimizing programming expertise. Our framework leverages LLM-enabled co-design to refine Python scripts that generate parameterized plant geometries as hierarchical B-spline surface representations with botanical constraints with explicit control points and parametric deformation functions. This representation can be easily tessellated into polygonal meshes with arbitrary precision, ensuring compatibility with functional structural plant analysis workflows such as light simulation, computational fluid dynamics, and finite element analysis. We demonstrate the framework on maize, soybean, and mung bean, fitting procedural models to empirical point cloud data through manual refinement of the Plant Descriptor (PD), human-readable files. The pipeline generates dual outputs: triangular meshes for visualization and triangular meshes with additional parametric metadata for quantitative analysis. This approach uniquely combines LLM-assisted template creation, mathematically continuous representations enabling both phenotyping and rendering, and direct parametric control through PD. The framework democratizes sophisticated geometric modeling for plant science while maintaining mathematical rigor.
zh

[CV-200] MPath: Multimodal Pathology Report Generation from Whole Slide Images

【速读】:该论文旨在解决从全切片图像(Whole Slide Images, WSI)中自动生成临床一致的病理诊断报告这一挑战,其核心难点在于组织形态学的高变异性以及病理报告结构的复杂性。解决方案的关键在于提出一种轻量级多模态框架MPath,通过学习到的视觉前缀提示机制(visual-prefix prompting mechanism),将预训练的生物医学语言模型(BioBART)与WSI提取的视觉嵌入进行条件化结合;该方法不依赖端到端的视觉-语言预训练,而是利用基础模型(CONCH + Titan)提取的WSI特征,通过一个紧凑的投影模块注入到冻结的语言主干中,从而在保持稳定性的同时实现数据高效和可解释的报告生成。

链接: https://arxiv.org/abs/2512.11906
作者: Noorul Wahab,Nasir Rajpoot
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Pages 4, Figures 1, Table 1

点击查看摘要

Abstract:Automated generation of diagnostic pathology reports directly from whole slide images (WSIs) is an emerging direction in computational pathology. Translating high-resolution tissue patterns into clinically coherent text remains difficult due to large morphological variability and the complex structure of pathology narratives. We introduce MPath, a lightweight multimodal framework that conditions a pretrained biomedical language model (BioBART) on WSI-derived visual embeddings through a learned visual-prefix prompting mechanism. Instead of end-to-end vision-language pretraining, MPath leverages foundation-model WSI features (CONCH + Titan) and injects them into BioBART via a compact projection module, keeping the language backbone frozen for stability and data efficiency. MPath was developed and evaluated on the RED 2025 Grand Challenge dataset and ranked 4th in Test Phase 2, despite limited submission opportunities. The results highlight the potential of prompt-based multimodal conditioning as a scalable and interpretable strategy for pathology report generation.
zh

[CV-201] Smartphone monitoring of smiling as a behavioral proxy of well-being in everyday life

【速读】:该论文旨在解决主观幸福感(Subjective Well-being)的传统测量方法依赖自我报告所导致的回忆偏差和高参与者负担问题,从而限制了对日常生活中幸福感动态变化的理解。其解决方案的关键在于利用智能手机被动采集的自然交互视频中捕捉到的“真诚微笑”(candid smiles)作为积极情绪的客观行为指标,并通过深度学习模型量化微笑强度,进而揭示个体层面的昼夜节律与日间模式,以及这些模式与物理活动、光照暴露等环境因素之间的关联。该方法实现了大规模、生态效度高的情感行为研究,为在群体尺度上理解幸福感提供了新路径。

链接: https://arxiv.org/abs/2512.11905
作者: Ming-Zher Poh,Shun Liao,Marco Andreetto,Daniel McDuff,Jonathan Wang,Paolo Di Achille,Jiang Wu,Yun Liu,Lawrence Cai,Eric Teasley,Mark Malhotra,Anupam Pathak,Shwetak Patel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Subjective well-being is a cornerstone of individual and societal health, yet its scientific measurement has traditionally relied on self-report methods prone to recall bias and high participant burden. This has left a gap in our understanding of well-being as it is expressed in everyday life. We hypothesized that candid smiles captured during natural smartphone interactions could serve as a scalable, objective behavioral correlate of positive affect. To test this, we analyzed 405,448 video clips passively recorded from 233 consented participants over one week. Using a deep learning model to quantify smile intensity, we identified distinct diurnal and daily patterns. Daily patterns of smile intensity across the week showed strong correlation with national survey data on happiness (r=0.92), and diurnal rhythms documented close correspondence with established results from the day reconstruction method (r=0.80). Higher daily mean smile intensity was significantly associated with more physical activity (Beta coefficient = 0.043, 95% CI [0.001, 0.085]) and greater light exposure (Beta coefficient = 0.038, [0.013, 0.063]), whereas no significant effects were found for smartphone use. These findings suggest that passive smartphone sensing could serve as a powerful, ecologically valid methodology for studying the dynamics of affective behavior and open the door to understanding this behavior at a population scale.
zh

[CV-202] Aion: Towards Hierarchical 4D Scene Graphs with Temporal Flow Dynamics

【速读】:该论文旨在解决动态环境中自主导航所面临的挑战,即如何构建既能表达语义结构又能捕捉时间演化特性的空间表征。现有方法中,3D场景图(3D Scene Graphs, 3DSGs)虽能提供分层多分辨率的几何与语义抽象,但对动态性的扩展通常局限于单个物体或代理;而运动模式地图(Maps of Dynamics, MoDs)虽能建模典型运动模式和时间规律,却常依赖于网格离散化,缺乏语义感知且难以扩展至大规模环境。论文提出Aion框架,其核心创新在于将时间流动力学直接嵌入到层次化的3DSG中,通过基于图的稀疏MoD表示来捕捉任意时间间隔内的运动流,并将其关联至场景图中的导航节点,从而实现更具可解释性与可扩展性的预测,显著提升复杂动态环境下的路径规划与交互能力。

链接: https://arxiv.org/abs/2512.11903
作者: Iacopo Catalano,Eduardo Montijano,Javier Civera,Julio A. Placed,Jorge Pena-Queralta
机构: University of Turku (图尔库大学); Zürich University of Applied Sciences (苏黎世应用科学大学); Aragón Institute of Technology (阿拉贡技术研究所); University of Zaragoza (萨拉戈萨大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Autonomous navigation in dynamic environments requires spatial representations that capture both semantic structure and temporal evolution. 3D Scene Graphs (3DSGs) provide hierarchical multi-resolution abstractions that encode geometry and semantics, but existing extensions toward dynamics largely focus on individual objects or agents. In parallel, Maps of Dynamics (MoDs) model typical motion patterns and temporal regularities, yet are usually tied to grid-based discretizations that lack semantic awareness and do not scale well to large environments. In this paper we introduce Aion, a framework that embeds temporal flow dynamics directly within a hierarchical 3DSG, effectively incorporating the temporal dimension. Aion employs a graph-based sparse MoD representation to capture motion flows over arbitrary time intervals and attaches them to navigational nodes in the scene graph, yielding more interpretable and scalable predictions that improve planning and interaction in complex dynamic environments.
zh

[CV-203] CLARGA: Multimodal Graph Representation Learning over Arbitrary Sets of Modalities WACV

【速读】:该论文旨在解决多模态表示学习中模态融合的通用性与效率问题,即如何在不改变底层框架的前提下,有效融合任意数量和类型的模态特征,并适应缺失模态输入。其解决方案的关键在于提出CLARGA(Cross-Modal Learning with Adaptive Graph Attention),该架构通过为每个样本构建注意力加权图来动态建模模态间关系,并利用多头图注意力网络(Multi-head Graph Attention Network)进行消息传递,从而实现自适应、高效(亚二次复杂度)的模态融合;同时引入可学习掩码机制以处理缺失模态输入,并采用监督任务损失与对比InfoNCE损失相结合的混合目标函数,提升跨模态一致性与对噪声输入的鲁棒性。

链接: https://arxiv.org/abs/2512.11901
作者: Santosh Patapati
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: WACV; Supplementary material is available on CVF proceedings

点击查看摘要

Abstract:We introduce CLARGA, a general-purpose multimodal fusion architecture for multimodal representation learning that works with any number and type of modalities without changing the underlying framework. Given a supervised dataset, CLARGA can be applied to virtually any machine learning task to fuse different multimodal representations for processing by downstream layers. On a sample-by-sample basis, CLARGA learns how modalities should inform one another by building an attention weighted graph over their features and passing messages along this graph with a multi-head Graph Attention Network. Not only does this make CLARGA highly adaptive, as it constructs unique graphs for different samples, it makes for efficient fusion with sub-quadratic complexity as the number of modalities grows. Through a learnable mask, it can also adapt to missing modality inputs. The model is trained with a hybrid objective that combines a supervised task loss with contrastive InfoNCE loss, improving cross-modal consistency and robustness to noisy inputs. We demonstrate CLARGA’s effectiveness in diverse multimodal representation learning tasks across 7 datasets spanning finance, human-computer interaction, general multimedia classification, and affective computing. It consistently outperforms baselines, state-of-the-art models, and ablations. Additional experiments also demonstrate its robustness to missing inputs and ability to excel on niche tasks. Overall, CLARGA can be easily plugged into machine learning models for effective and efficient learning of representations across a wide variety of tasks.
zh

[CV-204] Read or Ignore? A Unified Benchmark for Typographic-Attack Robustness and Text Recognition in Vision-Language Models

【速读】:该论文旨在解决大视觉语言模型(Large Vision-Language Models, LVLMs)在面对字体攻击(typographic attacks)时存在的鲁棒性与文本理解能力之间的失衡问题。现有评估协议和防御方法主要聚焦于物体识别任务,倾向于通过忽略图像中的文本以实现鲁棒性,但这与真实场景中需要同时对物体和文本进行联合推理的需求相悖(例如,在识别行人时同时读取交通标志)。为应对这一挑战,作者提出了一种新任务——读或忽略视觉问答(Read-or-Ignore VQA, RIO-VQA),其核心在于让模型根据上下文动态决定是否读取图像中的文本内容。关键解决方案是构建了一个标准化的基准测试集 Read-or-Ignore Benchmark (RIO-Bench),该基准通过在同一场景下生成仅改变文本内容和问题类型的反事实样本(read / ignore),实现了对模型选择性使用文本能力的精准评估。基于此,研究进一步提出一种数据驱动的自适应防御机制,能够学习模型在不同情境下灵活地选择是否读取文本,从而突破以往静态、非自适应的文本忽略型防御局限,推动LVLM向更符合实际需求的方向发展。

链接: https://arxiv.org/abs/2512.11899
作者: Futa Waseda,Shojiro Yamabe,Daiki Shiono,Kento Sasaki,Tsubasa Takahashi
机构: Turing Inc.; The University of Tokyo; Institute of Science Tokyo; Tohoku University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large vision-language models (LVLMs) are vulnerable to typographic attacks, where misleading text within an image overrides visual understanding. Existing evaluation protocols and defenses, largely focused on object recognition, implicitly encourage ignoring text to achieve robustness; however, real-world scenarios often require joint reasoning over both objects and text (e.g., recognizing pedestrians while reading traffic signs). To address this, we introduce a novel task, Read-or-Ignore VQA (RIO-VQA), which formalizes selective text use in visual question answering (VQA): models must decide, from context, when to read text and when to ignore it. For evaluation, we present the Read-or-Ignore Benchmark (RIO-Bench), a standardized dataset and protocol that, for each real image, provides same-scene counterfactuals (read / ignore) by varying only the textual content and question type. Using RIO-Bench, we show that strong LVLMs and existing defenses fail to balance typographic robustness and text-reading capability, highlighting the need for improved approaches. Finally, RIO-Bench enables a novel data-driven defense that learns adaptive selective text use, moving beyond prior non-adaptive, text-ignoring defenses. Overall, this work reveals a fundamental misalignment between the existing evaluation scope and real-world requirements, providing a principled path toward reliable LVLMs. Our Project Page is at this https URL.
zh

[CV-205] Microscopic Vehicle Trajectory Datasets from UAV-collected Video for Heterogeneous Area-Based Urban Traffic

【速读】:该论文旨在解决传统路边视频采集在密集混合交通场景下因遮挡、视角受限及车辆运动不规则而导致的数据获取困难问题。其解决方案的关键在于利用无人飞行器(UAV)进行顶部视角的轨迹数据采集,从而有效减少上述限制并捕捉丰富的时空动态信息。通过Data from Sky (DFS)平台提取的微观车辆轨迹(MVT)数据集,在印度国家首都地区六个路段覆盖了多样化交通组成和密度水平,为仿真建模、安全评估与行为研究提供了高精度、开放获取的实证基础。

链接: https://arxiv.org/abs/2512.11898
作者: Yawar Ali,K. Ramachandra Rao,Ashish Bhaskar,Niladri Chatterjee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper presents basic statistics and trends in empirically observed data from highly heterogeneous and area-based traffic while offering the datasets open source for researchers and practitioners

点击查看摘要

Abstract:This paper offers openly available microscopic vehicle trajectory (MVT) datasets collected using unmanned aerial vehicles (UAVs) in heterogeneous, area-based urban traffic conditions. Traditional roadside video collection often fails in dense mixed traffic due to occlusion, limited viewing angles, and irregular vehicle movements. UAV-based recording provides a top-down perspective that reduces these issues and captures rich spatial and temporal dynamics. The datasets described here were extracted using the Data from Sky (DFS) platform and validated against manual counts, space mean speeds, and probe trajectories in earlier work. Each dataset contains time-stamped vehicle positions, speeds, longitudinal and lateral accelerations, and vehicle classifications at a resolution of 30 frames per second. Data were collected at six mid-block locations in the national capital region of India, covering diverse traffic compositions and density levels. Exploratory analyses highlight key behavioural patterns, including lane-keeping preferences, speed distributions, and lateral manoeuvres typical of heterogeneous and area-based traffic settings. These datasets are intended as a resource for the global research community to support simulation modelling, safety assessment, and behavioural studies under area-based traffic conditions. By making these empirical datasets openly available, this work offers researchers a unique opportunity to develop, test, and validate models that more accurately represent complex urban traffic environments.
zh

[CV-206] Hot Hém: Sài Gòn Giũa Cái Nóng Hông Còng Bàng – Saigon in Unequal Heat

【速读】:该论文旨在解决密集热带城市中行人热暴露(Pedestrian Heat Exposure)这一关键健康风险问题,传统路径规划算法常忽略微尺度热环境差异。其解决方案的关键在于构建了一个GeoAI工作流Hot Hém,整合Google Street View(GSV)影像、语义图像分割与遥感数据,利用两个XGBoost模型基于GSV训练数据预测地表温度(Land Surface Temperature, LST),并将模型以拼接方式部署至OSMnx提取的行人网络节点上,实现热感知路由(Heat-aware Routing),从而在基础设施尺度上精准识别并理解特定城市通道为何呈现显著更高的温度。

链接: https://arxiv.org/abs/2512.11896
作者: Tessa Vu
机构: University of Pennsylvania (宾夕法尼亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computational Engineering, Finance, and Science (cs.CE); Computers and Society (cs.CY)
备注: Completed as a requirement in MUSA 6950-001

点击查看摘要

Abstract:Pedestrian heat exposure is a critical health risk in dense tropical cities, yet standard routing algorithms often ignore micro-scale thermal variation. Hot Hém is a GeoAI workflow that estimates and operationalizes pedestrian heat exposure in Hô Chí Minh City (HCMC), Vi\det Nam, colloquially known as Sài Gòn. This spatial data science pipeline combines Google Street View (GSV) imagery, semantic image segmentation, and remote sensing. Two XGBoost models are trained to predict land surface temperature (LST) using a GSV training dataset in selected administrative wards, known as phŏng, and are deployed in a patchwork manner across all OSMnx-derived pedestrian network nodes to enable heat-aware routing. This is a model that, when deployed, can provide a foundation for pinpointing where and further understanding why certain city corridors may experience disproportionately higher temperatures at an infrastructural scale.
zh

[CV-207] mmWEAVER: Environment-Specific mmWave Signal Synthesis from a Photo and Activity Description WACV2026

【速读】:该论文旨在解决毫米波雷达(mmWave radar)应用中真实信号生成与数据集增强的难题,尤其针对活动识别和姿态估计等任务所需的高度多样化且环境特异性的信号数据集稀缺问题。由于mmWave信号具有高维、稀疏和复杂的特性,传统物理仿真方法计算成本高昂,难以满足实际需求。解决方案的关键在于提出mmWeaver框架,该框架利用隐式神经表示(Implicit Neural Representations, INRs)将复杂mmWave信号建模为连续函数,实现高达49倍的数据压缩;并通过超网络(hypernetworks)动态生成INR参数,结合从RGB-D图像提取的环境上下文信息与通过MotionGPT生成的人体运动特征(text-to-pose),实现高效、自适应的信号合成,同时保留对下游任务至关重要的相位信息,显著提升信号真实性与任务性能。

链接: https://arxiv.org/abs/2512.11894
作者: Mahathir Monjur,Shahriar Nirjon
机构: UNC Chapel Hill (北卡罗来纳大学教堂山分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at the IEEE/CVF Winter Conference on Applications of Computer Vision 2026 (WACV 2026)

点击查看摘要

Abstract:Realistic signal generation and dataset augmentation are essential for advancing mmWave radar applications such as activity recognition and pose estimation, which rely heavily on diverse, and environment-specific signal datasets. However, mmWave signals are inherently complex, sparse, and high-dimensional, making physical simulation computationally expensive. This paper presents mmWeaver, a novel framework that synthesizes realistic, environment-specific complex mmWave signals by modeling them as continuous functions using Implicit Neural Representations (INRs), achieving up to 49-fold compression. mmWeaver incorporates hypernetworks that dynamically generate INR parameters based on environmental context (extracted from RGB-D images) and human motion features (derived from text-to-pose generation via MotionGPT), enabling efficient and adaptive signal synthesis. By conditioning on these semantic and geometric priors, mmWeaver generates diverse I/Q signals at multiple resolutions, preserving phase information critical for downstream tasks such as point cloud estimation and activity classification. Extensive experiments show that mmWeaver achieves a complex SSIM of 0.88 and a PSNR of 35 dB, outperforming existing methods in signal realism while improving activity recognition accuracy by up to 7% and reducing human pose estimation error by up to 15%, all while operating 6-35 times faster than simulation-based approaches.
zh

[CV-208] Generalization vs. Specialization: Evaluating Segment Anything Model (SAM3) Zero-Shot Segmentation Against Fine-Tuned YOLO Detectors

【速读】:该论文旨在解决在密集实例分割任务中,专用微调模型与通用基础模型(如SAM3)之间的性能差异及其稳定性问题,特别是在高密度目标和遮挡场景下的表现对比。其关键解决方案在于通过在MinneApple数据集上进行系统性评估,量化不同IoU阈值对性能指标的影响,并揭示了SAM3在边界精度上的显著优势(边界稳定性提升12倍),而YOLO11系列模型则在检测完整性方面更具优势。这一发现为选择适合特定应用场景的模型提供了依据,即当边界精度更为关键时应优先考虑通用基础模型,而当检测覆盖率更重要时可采用专用微调模型。

链接: https://arxiv.org/abs/2512.11884
作者: Ranjan Sapkota,Konstantinos I. Roumeliotis,Manoj Karkee,Nikolaos D. Tselikas
机构: Cornell University (康奈尔大学); University of Patras (帕特雷大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep learning has advanced two fundamentally different paradigms for instance segmentation: specialized models optimized through task-specific fine-tuning and generalist foundation models capable of zero-shot segmentation. This work presents a comprehensive comparison between SAM3 (Segment Anything Model, also called SAMv3) operating in zero-shot mode and three variants of Ultralytics YOLO11 (nano, medium, and large) fine-tuned for instance segmentation. The evaluation is conducted on the MinneApple dataset, a dense benchmark comprising 670 orchard images with 28,179 annotated apple instances, enabling rigorous validation of model behavior under high object density and occlusion. Our analysis shows IoU choices can inflate performance gaps by up to 30%. At the appropriate IoU = 0.15 threshold, YOLO models achieve 68.9%, 72.2%, and 71.9% F1, while SAM3 reaches 59.8% in pure zero-shot mode. However, YOLO exhibits steep degradation 48-50 points across IoU ranges whereas SAM3 drops only 4 points, revealing 12 times superior boundary stability of SAM3. This highlights the strength of SAMv3 in mask precision versus specialization in detection completeness of YOLO11. We provide open-source code, evaluation pipelines, and methodological recommendations, contributing to a deeper understanding of when specialized fine-tuned models or generalist foundation models are preferable for dense instance segmentation tasks. This project repository is available on GitHub as this https URL
zh

[CV-209] Aesthetic Alignment Risks Assimilation: How Image Generation and Reward Models Reinforce Beauty Bias and Ideological “Censorship”

【速读】:该论文旨在解决当前图像生成模型过度对齐于通用审美偏好(aesthetic preference)所导致的用户意图偏离问题,尤其在用户明确要求“反美学”(anti-aesthetic)输出用于艺术或批判性表达时,模型仍倾向于生成符合传统审美的结果,从而损害用户自主性和审美多样性。解决方案的关键在于通过构建涵盖广泛美学维度的数据集,系统评估主流生成模型与奖励模型的表现,揭示其内在偏见:即生成模型默认输出高美感图像、奖励模型甚至对完全符合用户指令的低质量或负向图像施加惩罚。这一发现表明,现有模型设计需从开发者中心主义转向用户意图优先,以实现更尊重多样性的生成能力。

链接: https://arxiv.org/abs/2512.11883
作者: Wenqi Marshall Guo,Qingyun Qian,Khalad Hasan,Shan Du
机构: University of British Columbia (不列颠哥伦比亚大学); Weathon Software (威森软件)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Over-aligning image generation models to a generalized aesthetic preference conflicts with user intent, particularly when ``anti-aesthetic" outputs are requested for artistic or critical purposes. This adherence prioritizes developer-centered values, compromising user autonomy and aesthetic pluralism. We test this bias by constructing a wide-spectrum aesthetics dataset and evaluating state-of-the-art generation and reward models. We find that aesthetic-aligned generation models frequently default to conventionally beautiful outputs, failing to respect instructions for low-quality or negative imagery. Crucially, reward models penalize anti-aesthetic images even when they perfectly match the explicit user prompt. We confirm this systemic bias through image-to-image editing and evaluation against real abstract artworks.
zh

[CV-210] Pseudo-Label Refinement for Robust Wheat Head Segmentation via Two-Stage Hybrid Training ICCV2025

【速读】:该论文旨在解决小麦图像的全语义分割(Full Semantic Segmentation)问题,即在复杂背景下精确识别和分割不同类型的 wheat structures(如穗、茎、叶等)。其解决方案的关键在于提出了一种系统性的自训练框架,该框架结合两阶段混合训练策略与大规模数据增强技术,并采用基于 SegFormer 模型(以 Mix Transformer MiT-B4 为骨干网络)的迭代教师-学生循环机制,通过不断优化模型预测精度并最大化利用未标注数据,从而在开发集和测试集上均取得了具有竞争力的性能表现。

链接: https://arxiv.org/abs/2512.11874
作者: Jiahao Jiang,Zhangrui Yang,Xuanhan Wang,Jingkuan Song
机构: Tongji University (同济大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 3 pages,3 figures, Extended abstract submitted to the 10th Computer Vision in Plant Phenotyping and Agriculture (CVPPA) Workshop, held in conjunction with ICCV 2025

点击查看摘要

Abstract:This extended abstract details our solution for the Global Wheat Full Semantic Segmentation Competition. We developed a systematic self-training framework. This framework combines a two-stage hybrid training strategy with extensive data augmentation. Our core model is SegFormer with a Mix Transformer (MiT-B4) backbone. We employ an iterative teacher-student loop. This loop progressively refines model accuracy. It also maximizes data utilization. Our method achieved competitive performance. This was evident on both the Development and Testing Phase datasets.
zh

[CV-211] Automated Plant Disease and Pest Detection System Using Hybrid Lightweight CNN-MobileViT Models for Diagnosis of Indigenous Crops

【速读】:该论文旨在解决埃塞俄比亚提格雷地区农业中因基础设施破坏导致的作物病害专家诊断难以获取的问题,尤其聚焦于本土仙人掌果(Opuntia ficus-indica)病害的本地化、离线检测需求。解决方案的关键在于构建一个以轻量化模型为核心的离线-first检测系统:首先,基于3,587张田野图像构建了首个针对本土仙人掌果的病害数据集;其次,在部署受限的战后边缘环境中,对比三种移动高效架构——定制轻量CNN、EfficientNet-Lite1与CNN-Transformer混合结构MobileViT-XS,发现MobileViT-XS凭借多头自注意力机制(Multi-Head Self-Attention, MHSA)实现全局推理能力,在跨验证中达到97.3%准确率,显著优于局部纹理感知的卷积核,能更可靠地区分虫害簇与二维真菌斑块;最终,所有ARM兼容模型集成至支持提格雷语(Tigrigna)和阿姆哈拉语(Amharic)的Flutter应用中,实现在Cortex-A53类设备上的完全离线推理,从而提升粮食安全关键诊断的可及性与包容性。

链接: https://arxiv.org/abs/2512.11871
作者: Tekleab G. Gebremedhin,Hailom S. Asegede,Bruh W. Tesheme,Tadesse B. Gebremichael,Kalayu G. Redae
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: A preliminary version of this work was presented at the International Conference on Postwar Technology for Recovery and Sustainable Development (Feb. 2025). This manuscript substantially extends that work with expanded experiments and on-device deployment analysis. Code and dataset are publicly available at: this https URL

点击查看摘要

Abstract:Agriculture supports over 80% of the population in the Tigray region of Ethiopia, where infrastructural disruptions limit access to expert crop disease diagnosis. We present an offline-first detection system centered on a newly curated indigenous cactus-fig (Opuntia ficus-indica) dataset consisting of 3,587 field images across three core symptom classes. Given deployment constraints in post-conflict edge environments, we benchmark three mobile-efficient architectures: a custom lightweight CNN, EfficientNet-Lite1, and the CNN-Transformer hybrid MobileViT-XS. While the broader system contains independent modules for potato, apple, and corn, this study isolates cactus-fig model performance to evaluate attention sensitivity and inductive bias transfer on indigenous morphology alone. Results establish a clear Pareto trade-off: EfficientNet-Lite1 achieves 90.7% test accuracy, the lightweight CNN reaches 89.5% with the most favorable deployment profile (42 ms inference latency, 4.8 MB model size), and MobileViT-XS delivers 97.3% mean cross-validation accuracy, demonstrating that MHSA-based global reasoning disambiguates pest clusters from two dimensional fungal lesions more reliably than local texture CNN kernels. The ARM compatible models are deployed in a Tigrigna and Amharic localized Flutter application supporting fully offline inference on Cortex-A53 class devices, strengthening inclusivity for food security critical diagnostics.
zh

[CV-212] mporal-Anchor3DLane: Enhanced 3D Lane Detection with Multi-Task Losses and LSTM Fusion

【速读】:该论文旨在解决单目3D车道线检测中因深度模糊、遮挡及帧间不稳定性带来的挑战,尤其针对现有基于Anchor的模型(如Anchor3DLane)在回归异常值敏感性、全局曲线几何监督不足、多损失项平衡困难以及未充分利用时序连续性等方面的问题。解决方案的关键在于:(1) 引入多任务损失优化策略,包括平衡L1回归、Chamfer点集距离和不确定性加权损失,并结合焦点损失(focal loss)与Dice损失提升分类与可见性预测;(2) 设计轻量级Temporal LSTM Fusion模块以聚合跨帧的锚点特征,替代计算复杂度较高的Transformer式时序融合机制;(3) 采用ESCOP风格训练优化,将曲线级监督与时序一致性耦合,从而显著提升模型鲁棒性和轨迹平滑性,在OpenLane数据集上F1指标提升6.2%,且无需额外传感器或模型规模扩展。

链接: https://arxiv.org/abs/2512.11869
作者: D. Shainu Suhas,G. Rahul,K. Muni
机构: National Institute of Technology Karnataka (NITK), Surathkal, India
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Monocular 3D lane detection remains challenging due to depth ambiguity, occlusion, and temporal instability across frames. Anchor-based approaches such as Anchor3DLane have demonstrated strong performance by regressing continuous 3D lane curves from multi-camera surround views. However, the baseline model still exhibits (i) sensitivity to regression outliers, (ii) weak supervision of global curve geometry, (iii) difficulty in balancing multiple loss terms, and (iv) limited exploitation of temporal continuity. We propose Temporal-Anchor3DLane, an enhanced 3D lane detection framework that extends Anchor3DLane with three key contributions: (1) a set of multi-task loss improvements, including Balanced L1 regression, Chamfer point-set distance, and uncertainty-based loss weighting, together with focal and Dice components for classification and visibility; (2) a lightweight Temporal LSTM Fusion module that aggregates per-anchor features across frames, replacing a heavier Transformer-style temporal fusion; and (3) ESCOP-style training refinements that couple curve-level supervision with temporal consistency. On OpenLane, Temporal-Anchor3DLane improves F1 by +6.2 and yields smoother temporal trajectories, showing that small architectural and loss refinements significantly enhance 3D lane robustness without extra sensors or scaling.
zh

[CV-213] On the Dangers of Bootstrapping Generation for Continual Learning and Beyond

【速读】:该论文旨在解决在持续学习(continual learning)场景中,反复使用生成式数据(synthetic data)进行模型训练所引发的分布漂移(distribution drift)和性能退化问题。其核心挑战在于合成数据引入的偏差(bias)与方差(variance)会削弱最大似然估计(maximum likelihood estimation)的可靠性,并导致生成模型在迭代训练中发生崩溃。解决方案的关键在于通过统计分析和实证研究揭示:即使采用先进的生成经验回放(Generative Experience Replay, GER)方法,也无法有效维持潜在空间(latent space)中的对齐一致性,从而表明当前主流的合成数据驱动策略在持续学习框架下存在根本性局限。

链接: https://arxiv.org/abs/2512.11867
作者: Daniil Zverev,A. Sophia Koepke,Joao F. Henriques
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: DAGM German Conference on Pattern Recognition, 2025

点击查看摘要

Abstract:The use of synthetically generated data for training models is becoming a common practice. While generated data can augment the training data, repeated training on synthetic data raises concerns about distribution drift and degradation of performance due to contamination of the dataset. We investigate the consequences of this bootstrapping process through the lens of continual learning, drawing a connection to Generative Experience Replay (GER) methods. We present a statistical analysis showing that synthetic data introduces significant bias and variance into training objectives, weakening the reliability of maximum likelihood estimation. We provide empirical evidence showing that popular generative models collapse under repeated training with synthetic data. We quantify this degradation and show that state-of-the-art GER methods fail to maintain alignment in the latent space. Our findings raise critical concerns about the use of synthetic data in continual learning.
zh

[CV-214] Explainable Adversarial-Robust Vision-Language-Action Model for Robotic Manipulation

【速读】:该论文旨在解决智能农业中基于RGB相机感知与机械臂控制的系统在遭遇光照变化、色彩偏移和噪声等光度扰动时易受对抗攻击而失效的问题。解决方案的关键在于提出一种基于OpenVLA-OFT框架的可解释对抗鲁棒视觉-语言-动作模型,其核心创新是引入Evidence-3模块,能够检测光度扰动并生成自然语言形式的原因与影响解释,从而提升模型在对抗条件下的动作预测准确性和可解释性。实验表明,该方法相较基线模型将当前动作L1损失降低21.7%,下一动作L1损失降低18.4%。

链接: https://arxiv.org/abs/2512.11865
作者: Ju-Young Kim,Ji-Hong Park,Myeongjun Kim,Gun-Woo Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Accepted to MobieSec 2025 (poster session)

点击查看摘要

Abstract:Smart farming has emerged as a key technology for advancing modern agriculture through automation and intelligent control. However, systems relying on RGB cameras for perception and robotic manipulators for control, common in smart farming, are vulnerable to photometric perturbations such as hue, illumination, and noise changes, which can cause malfunction under adversarial attacks. To address this issue, we propose an explainable adversarial-robust Vision-Language-Action model based on the OpenVLA-OFT framework. The model integrates an Evidence-3 module that detects photometric perturbations and generates natural language explanations of their causes and effects. Experiments show that the proposed model reduces Current Action L1 loss by 21.7% and Next Actions L1 loss by 18.4% compared to the baseline, demonstrating improved action prediction accuracy and explainability under adversarial conditions.
zh

[CV-215] Soft Decision Tree classifier: explainable and extendable PyTorch implementation

【速读】:该论文旨在解决传统决策树模型在分类任务中可解释性与性能之间难以平衡的问题,尤其是如何在保持高预测准确率的同时提升模型的透明度和人类可理解性。其解决方案的关键在于提出并实现两种新型结构:软决策树(Soft Decision Tree, SDT)和带短期记忆的软决策树(Short-term Memory Soft Decision Tree, SM-SDT),二者均基于PyTorch框架构建,并通过模拟数据和临床数据集进行验证。实验表明,SDT 和 SM-SDT 在AUC指标上与XGBoost相当,显著优于随机森林、逻辑回归和传统决策树,且具备可视化潜力以增强模型的可解释性,从而为医疗等高风险领域提供兼具性能与可信度的分类工具。

链接: https://arxiv.org/abs/2512.11833
作者: Reuben R Shamir
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Keywords: Soft Decision Tree, Short-term Memory Soft Decision Tree, Classification, Explainability

点击查看摘要

Abstract:We implemented a Soft Decision Tree (SDT) and a Short-term Memory Soft Decision Tree (SM-SDT) using PyTorch. The methods were extensively tested on simulated and clinical datasets. The SDT was visualized to demonstrate the potential for its explainability. SDT, SM-SDT, and XGBoost demonstrated similar area under the curve (AUC) values. These methods were better than Random Forest, Logistic Regression, and Decision Tree. The results on clinical datasets suggest that, aside from a decision tree, all tested classification methods yield comparable results. The code and datasets are available online on GitHub: this https URL Comments: Keywords: Soft Decision Tree, Short-term Memory Soft Decision Tree, Classification, Explainability Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2512.11833 [cs.LG] (or arXiv:2512.11833v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2512.11833 Focus to learn more arXiv-issued DOI via DataCite
zh

[CV-216] On the Design of One-step Diffusion via Shortcutting Flow Paths

【速读】:该论文旨在解决当前短路径扩散模型(shortcut diffusion models)在理论推导与实际实现之间耦合紧密的问题,这种耦合限制了设计空间的清晰理解和系统性优化。其解决方案的关键在于提出一个通用的设计框架,该框架不仅为代表性短路径模型提供了理论依据,还解耦了组件级别的具体选择,从而实现了对改进方向的系统识别与验证。通过该框架所引入的改进,作者训练出的一步扩散模型在无预训练、蒸馏或课程学习的情况下,在ImageNet-256x256数据集上达到了2.85的FID50k新最优性能,显著降低了组件级创新的门槛,并推动了该类模型设计空间的理性探索。

链接: https://arxiv.org/abs/2512.11831
作者: Haitao Lin,Peiyan Hu,Minsi Ren,Zhifeng Gao,Zhi-Ming Ma,Guolin ke,Tailin Wu,Stan Z. Li
机构: Westlake University (西湖大学); Chinese Academy of Sciences (中国科学院); DP Technology (DP科技)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages of main body, conference paper

点击查看摘要

Abstract:Recent advances in few-step diffusion models have demonstrated their efficiency and effectiveness by shortcutting the probabilistic paths of diffusion models, especially in training one-step diffusion models from scratch (a.k.a. shortcut models). However, their theoretical derivation and practical implementation are often closely coupled, which obscures the design space. To address this, we propose a common design framework for representative shortcut models. This framework provides theoretical justification for their validity and disentangles concrete component-level choices, thereby enabling systematic identification of improvements. With our proposed improvements, the resulting one-step model achieves a new state-of-the-art FID50k of 2.85 on ImageNet-256x256 under the classifier-free guidance setting. Remarkably, the model requires no pre-training, distillation, or curriculum learning. We believe our work lowers the barrier to component-level innovation in shortcut models and facilitates principled exploration of their design space.
zh

[CV-217] Assessing Greenspace Attractiveness with ChatGPT Claude and Gemini: Do AI Models Reflect Human Perceptions?

【速读】:该论文旨在解决城市绿地吸引力评估中长期存在的两大问题:一是现有方法往往忽视非正式或临时性绿地(如荒地、野草地等),二是传统评估手段资源密集,难以大规模捕捉居民的主观感知。其解决方案的关键在于利用多模态大语言模型(Multimodal Large Language Models, MLLMs)——包括ChatGPT GPT-4o、Claude 3.5 Haiku和Gemini 2.0 Flash——通过分析Google街景图像自动判断绿地吸引力,并与波兰罗兹市居民的地理问卷调查结果进行对比。研究发现,MLLMs在评估正式绿地的吸引力和非正式绿地的不吸引力方面表现出较高的人类一致性,但在识别吸引人的非正式绿地和不吸引人的正式绿地时存在显著偏差;同时,模型更关注美学与设计特征,而弱化了居民重视的安全性、功能性基础设施及本地文化嵌入等维度。因此,尽管MLLMs具备规模化预评估潜力,仍需结合人类监督与参与式方法以实现情境敏感的城市绿地评价。

链接: https://arxiv.org/abs/2512.11827
作者: Milad Malekzadeh,Magdalena Biernacka,Elias Willberg,Jussi Torkko,Edyta Łaszkiewicz,Tuuli Toivonen
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Understanding greenspace attractiveness is essential for designing livable and inclusive urban environments, yet existing assessment approaches often overlook informal or transient spaces and remain too resource intensive to capture subjective perceptions at scale. This study examines the ability of multimodal large language models (MLLMs), ChatGPT GPT-4o, Claude 3.5 Haiku, and Gemini 2.0 Flash, to assess greenspace attractiveness similarly to humans using Google Street View imagery. We compared model outputs with responses from a geo-questionnaire of residents in Lodz, Poland, across both formal (for example, parks and managed greenspaces) and informal (for example, meadows and wastelands) greenspaces. Survey respondents and models indicated whether each greenspace was attractive or unattractive and provided up to three free text explanations. Analyses examined how often their attractiveness judgments aligned and compared their explanations after classifying them into shared reasoning categories. Results show high AI human agreement for attractive formal greenspaces and unattractive informal spaces, but low alignment for attractive informal and unattractive formal greenspaces. Models consistently emphasized aesthetic and design oriented features, underrepresenting safety, functional infrastructure, and locally embedded qualities valued by survey respondents. While these findings highlight the potential for scalable pre-assessment, they also underscore the need for human oversight and complementary participatory approaches. We conclude that MLLMs can support, but not replace, context sensitive greenspace evaluation in planning practice.
zh

[CV-218] ReGlove: A Soft Pneumatic Glove for Activities of Daily Living Assistance via Wrist-Mounted Vision

【速读】:该论文旨在解决慢性上肢功能障碍患者缺乏经济可行且可靠的辅助技术问题,现有设备往往价格昂贵或依赖不可靠的生物信号(如肌电信号,EMG)。其解决方案的关键在于提出ReGlove系统,通过将低成本商用气动康复手套改造为基于视觉引导的辅助矫形器,集成腕部摄像头与边缘计算推理引擎(Raspberry Pi 5),利用实时YOLO-based计算机视觉模型实现上下文感知抓握控制,从而在无需可靠肌肉信号的情况下完成高精度、低延迟的物体识别与操作,整体系统成本低于250美元,具备良好的实用性与可扩展性。

链接: https://arxiv.org/abs/2512.11824
作者: Rosh Ho,Jian Zhang
机构: Columbia University (哥伦比亚大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper presents ReGlove, a system that converts low-cost commercial pneumatic rehabilitation gloves into vision-guided assistive orthoses. Chronic upper-limb impairment affects millions worldwide, yet existing assistive technologies remain prohibitively expensive or rely on unreliable biological signals. Our platform integrates a wrist-mounted camera with an edge-computing inference engine (Raspberry Pi 5) to enable context-aware grasping without requiring reliable muscle signals. By adapting real-time YOLO-based computer vision models, the system achieves \SI96.73\percent grasp classification accuracy with sub-\SI40.00\milli\second end-to-end latency. Physical validation using standardized benchmarks shows \SI82.71\percent success on YCB object manipulation and reliable performance across \SI27.00 Activities of Daily Living (ADL) tasks. With a total cost under \ \SI250.00 and exclusively commercial components, ReGlove provides a technical foundation for accessible, vision-based upper-limb assistance that could benefit populations excluded from traditional EMG-controlled devices.
zh

[CV-219] A Reproducible Workflow for Scraping Structuring and Segmenting Legacy Archaeological Artifact Images

【速读】:该论文旨在解决考古图像数据集从非结构化网页资源向机器学习友好格式转换的难题,特别是针对由英国考古数据服务(ADS)维护的旧式手斧与两面器(biface)图像集合,该集合虽包含数千张标准化照片,但缺乏批量下载和自动化处理机制。解决方案的关键在于开发两个开源工具:一是遵循伦理规范的网络爬虫脚本,用于提取记录页面元数据并下载图像;二是图像处理流水线,通过经典计算机视觉方法生成二值掩膜(binary masks)和边界框(bounding boxes),并将所有衍生信息以COOC格式JSON文件存储,同时保留考古学元数据。整个流程不重新分发原始图像,仅共享掩膜、轮廓和标注等衍生产品,从而实现轻量、可复现的数据转化,推动数字考古领域机器学习应用的可重复性研究实践。

链接: https://arxiv.org/abs/2512.11817
作者: Juan Palomeque-Gonzalez
机构: 未知
类目: Computers and Society (cs.CY); Computer Vision and Pattern Recognition (cs.CV)
备注: 12 Pages, 5 figures

点击查看摘要

Abstract:This technical note presents a reproducible workflow for converting a legacy archaeological image collection into a structured and segmentation ready dataset. The case study focuses on the Lower Palaeolithic hand axe and biface collection curated by the Archaeology Data Service (ADS), a dataset that provides thousands of standardised photographs but no mechanism for bulk download or automated processing. To address this, two open source tools were developed: a web scraping script that retrieves all record pages, extracts associated metadata, and downloads the available images while respecting ADS Terms of Use and ethical scraping guidelines; and an image processing pipeline that renames files using UUIDs, generates binary masks and bounding boxes through classical computer vision, and stores all derived information in a COCO compatible Json file enriched with archaeological metadata. The original images are not redistributed, and only derived products such as masks, outlines, and annotations are shared. Together, these components provide a lightweight and reusable approach for transforming web based archaeological image collections into machine learning friendly formats, facilitating downstream analysis and contributing to more reproducible research practices in digital archaeology.
zh

[CV-220] Benchmarking Teslas Traffic Light and Stop Sign Control: Field Dataset and Behavior Insights

【速读】:该论文旨在解决高级驾驶辅助系统(Advanced Driver-Assistance Systems, ADAS)与交通控制设施(Traffic Control Devices, TCDs)之间交互行为的实证研究不足问题,以准确评估其对交通运行的影响。解决方案的关键在于构建了一个高质量的实地数据集,通过同步采集高分辨率车辆轨迹与驾驶员视角视频,在不同限速和TCD类型条件下设计并执行实验;在此基础上,提出了一种TLSSC-TCD交互行为分类体系(包括停车、加速和跟车三种模式),并基于全速度差模型(Full Velocity Difference Model, FVDM)对每种行为模式进行量化标定,从而揭示了关键行为阈值(如约90米的跟车阈值)及各行为的动力学特性差异,为未来ADAS-TCD交互逻辑的仿真建模、安全评估与优化设计提供了可复用的数据基础与理论支撑。

链接: https://arxiv.org/abs/2512.11802
作者: Zheng Li,Peng Zhang,Shixiao Liang,Hang Zhou,Chengyuan Ma,Handong Yao,Qianwen Li,Xiaopeng Li
机构: University of Wisconsin-Madison (威斯康星大学麦迪逊分校); University of Georgia (佐治亚大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Understanding how Advanced Driver-Assistance Systems (ADAS) interact with Traffic Control Devices (TCDs) is critical for assessing their influence on traffic operations, yet this interaction has received little focused empirical study. This paper presents a field dataset and behavioral analysis of Tesla’s Traffic Light and Stop Sign Control (TLSSC), a mature ADAS that perceives traffic lights and stop signs. We design and execute experiments across varied speed limits and TCD types, collecting synchronized high-resolution vehicle trajectory data and driver-perspective video. From these data, we develop a taxonomy of TLSSC-TCD interaction behaviors (i.e., stopping, accelerating, and car following) and calibrate the Full Velocity Difference Model (FVDM) to quantitatively characterize each behavior mode. A novel empirical insight is the identification of a car-following threshold (~90 m). Calibration results reveal that stopping behavior is driven by strong responsiveness to both desired speed deviation and relative speed, whereas accelerating behavior is more conservative. Intersection car-following behavior exhibits smoother dynamics and tighter headways compared to standard car-following behaviors. The established dataset, behavior definitions, and model characterizations together provide a foundation for future simulation, safety evaluation, and design of ADAS-TCD interaction logic. Our dataset is available at GitHub.
zh

[CV-221] Improving Long-Tailed Object Detection with Balanced Group Softmax and Metric Learning

【速读】:该论文旨在解决长尾分布下的2D目标检测问题,即在真实场景中由于类别样本数量极度不均衡(头类样本多、尾类样本少),导致检测模型对稀有类别性能显著下降的问题。其核心解决方案是基于LVISv1数据集(含1,203个类别和164,000张图像)改进Balanced Group Softmax(BAGS)框架,并引入度量学习策略以优化特征空间中的类别可分性:一方面通过k近邻(k-NN)分类机制提升稀有类别的识别精度,另一方面假设尾类特征在头类特征空间中形成更密集的小簇,从而阻碍回归型分类器判别,因此采用度量学习构建类间分离度高、类内紧凑的特征嵌入表示,最终在mAP指标上达到24.5%,超越此前24.0%的基准表现。

链接: https://arxiv.org/abs/2511.16619
作者: Satyam Gaba
机构: Univeristy of the Cumberlands(坎伯兰大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 8 pages, 7 figures, International Conference on Semantic Computing

点击查看摘要

Abstract:Object detection has been widely explored for class-balanced datasets such as COCO. However, real-world scenarios introduce the challenge of long-tailed distributions, where numerous categories contain only a few instances. This inherent class imbalance biases detection models towards the more frequent classes, degrading performance on rare categories. In this paper, we tackle the problem of long-tailed 2D object detection using the LVISv1 dataset, which consists of 1,203 categories and 164,000 images. We employ a two-stage Faster R-CNN architecture and propose enhancements to the Balanced Group Softmax (BAGS) framework to mitigate class imbalance. Our approach achieves a new state-of-the-art performance with a mean Average Precision (mAP) of 24.5%, surpassing the previous benchmark of 24.0%. Additionally, we hypothesize that tail class features may form smaller, denser clusters within the feature space of head classes, making classification challenging for regression-based classifiers. To address this issue, we explore metric learning to produce feature embeddings that are both well-separated across classes and tightly clustered within each class. For inference, we utilize a k-Nearest Neighbors (k-NN) approach to improve classification performance, particularly for rare classes. Our results demonstrate the effectiveness of these methods in advancing long-tailed object detection. Comments: 8 pages, 7 figures, International Conference on Semantic Computing Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2511.16619 [cs.CV] (or arXiv:2511.16619v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2511.16619 Focus to learn more arXiv-issued DOI via DataCite Related DOI: https://doi.org/10.1109/ICSC64641.2025.00051 Focus to learn more DOI(s) linking to related resources
zh

[CV-222] Self-Supervised Ultrasound Representation Learning for Renal Anomaly Prediction in Prenatal Imaging

【速读】:该论文旨在解决产前超声在检测胎儿肾脏及尿路畸形(Congenital Anomalies of the Kidneys and Urinary Tract, CAKUT)时面临的局限性,主要包括操作者依赖性强和成像条件不佳导致的诊断准确性不足问题。解决方案的关键在于引入一种基于掩码自编码(Masked Autoencoding)的超声自监督基础模型(Ultrasound Self-Supervised Foundation Model, USF-MAE),通过在969张二维超声图像上进行微调,实现对正常肾、尿路扩张和多囊性肾发育不良等类别的自动分类。该方法显著优于传统的DenseNet-169卷积基线模型,在AUC和F1-score上均有提升,尤其在多分类任务中表现突出;同时借助Score-CAM可视化技术增强了模型可解释性,表明其决策依据与临床已知的肾结构(如肾盂或囊性区域)一致,从而为产科影像中的肾脏异常筛查提供了一种鲁棒且可解释的智能辅助诊断框架。

链接: https://arxiv.org/abs/2512.13434
作者: Youssef Megahed,Inok Lee,Robin Ducharme,Kevin Dick,Adrian D. C. Chan,Steven Hawken,Mark C. Walker
机构: Carleton University (卡尔顿大学); Toronto General Hospital (多伦多综合医院)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 8 figures, 4 tables

点击查看摘要

Abstract:Prenatal ultrasound is the cornerstone for detecting congenital anomalies of the kidneys and urinary tract, but diagnosis is limited by operator dependence and suboptimal imaging conditions. We sought to assess the performance of a self-supervised ultrasound foundation model for automated fetal renal anomaly classification using a curated dataset of 969 two-dimensional ultrasound images. A pretrained Ultrasound Self-Supervised Foundation Model with Masked Autoencoding (USF-MAE) was fine-tuned for binary and multi-class classification of normal kidneys, urinary tract dilation, and multicystic dysplastic kidney. Models were compared with a DenseNet-169 convolutional baseline using cross-validation and an independent test set. USF-MAE consistently improved upon the baseline across all evaluation metrics in both binary and multi-class settings. USF-MAE achieved an improvement of about 1.87% (AUC) and 7.8% (F1-score) on the validation set, 2.32% (AUC) and 4.33% (F1-score) on the independent holdout test set. The largest gains were observed in the multi-class setting, where the improvement in AUC was 16.28% and 46.15% in F1-score. To facilitate model interpretability, Score-CAM visualizations were adapted for a transformer architecture and show that model predictions were informed by known, clinically relevant renal structures, including the renal pelvis in urinary tract dilation and cystic regions in multicystic dysplastic kidney. These results show that ultrasound-specific self-supervised learning can generate a useful representation as a foundation for downstream diagnostic tasks. The proposed framework offers a robust, interpretable approach to support the prenatal detection of renal anomalies and demonstrates the promise of foundation models in obstetric imaging.
zh

[CV-223] Leverag ing Compression to Construct Transferable Bitrate Ladders

【速读】:该论文旨在解决传统视频编码中比特率分配效率低的问题,尤其是在固定码率阶梯(fixed bitrate ladder)和恒定CRF(Constant Rate Factor)编码下难以实现内容自适应优化的局限性。其核心挑战在于如何在保证主观质量(Quality of Experience, QoE)的同时显著降低比特率消耗,同时避免计算开销巨大的凸包(convex hull)构建过程。解决方案的关键在于提出一种基于机器学习(Machine Learning, ML)的比特率阶梯构造方法,该方法通过分析压缩流程并在压缩前对源视频进行感知相关测量,精准预测压缩后视频的VMAF分数,从而生成内容感知的、高效的比特率分配策略。该方法在大规模视频数据集上验证了优于现有主流方法的性能表现,并进一步探讨了不同编码设置下每帧比特率阶梯的鲁棒性。

链接: https://arxiv.org/abs/2512.12952
作者: Krishna Srikar Durbha,Hassene Tmar,Ping-Hao Wu,Ioannis Katsavounidis,Alan C. Bovik
机构: The University of Texas at Austin (德克萨斯大学奥斯汀分校); University of Colorado Boulder (科罗拉多大学博尔德分校); Meta Platforms Inc. (Meta)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review in IEEE Transactions on Image Processing

点击查看摘要

Abstract:Over the past few years, per-title and per-shot video encoding techniques have demonstrated significant gains as compared to conventional techniques such as constant CRF encoding and the fixed bitrate ladder. These techniques have demonstrated that constructing content-gnostic per-shot bitrate ladders can provide significant bitrate gains and improved Quality of Experience (QoE) for viewers under various network conditions. However, constructing a convex hull for every video incurs a significant computational overhead. Recently, machine learning-based bitrate ladder construction techniques have emerged as a substitute for convex hull construction. These methods operate by extracting features from source videos to train machine learning (ML) models to construct content-adaptive bitrate ladders. Here, we present a new ML-based bitrate ladder construction technique that accurately predicts the VMAF scores of compressed videos, by analyzing the compression procedure and by making perceptually relevant measurements on the source videos prior to compression. We evaluate the performance of our proposed framework against leading prior methods on a large corpus of videos. Since training ML models on every encoder setting is time-consuming, we also investigate how per-shot bitrate ladders perform under different encoding settings. We evaluate the performance of all models against the fixed bitrate ladder and the best possible convex hull constructed using exhaustive encoding with Bjontegaard-delta metrics.
zh

[CV-224] Quantum Implicit Neural Representations for 3D Scene Reconstruction and Novel View Synthesis

【速读】:该论文旨在解决隐式神经表示(Implicit Neural Representations, INRs)在3D场景重建中因经典网络存在频谱偏差(spectral bias)而导致难以捕捉高频细节的问题。其解决方案的关键在于提出量子隐式表示网络(Quantum Implicit Representation Networks, QIREN),通过引入具有内在傅里叶结构的参数化量子电路,实现比传统多层感知机(MLP)更紧凑且表达能力更强的频率建模。在此基础上,论文进一步构建了首个混合量子-经典神经辐射场框架(Quantum Neural Radiance Fields, Q-NeRF),将QIREN模块嵌入Nerfacto骨干网络中,替代部分密度和辐射率预测组件,在保持高效采样、位姿优化与体渲染策略的同时,显著提升了对细粒度视点依赖外观的建模能力。

链接: https://arxiv.org/abs/2512.12683
作者: Yeray Cordero,Paula García-Molina,Fernando Vilariño
机构: Computer Vision Center (计算机视觉中心); Universitat Autònoma de Barcelona (巴塞罗那自治大学)
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Implicit neural representations (INRs) have become a powerful paradigm for continuous signal modeling and 3D scene reconstruction, yet classical networks suffer from a well-known spectral bias that limits their ability to capture high-frequency details. Quantum Implicit Representation Networks (QIREN) mitigate this limitation by employing parameterized quantum circuits with inherent Fourier structures, enabling compact and expressive frequency modeling beyond classical MLPs. In this paper, we present Quantum Neural Radiance Fields (Q-NeRF), the first hybrid quantum-classical framework for neural radiance field rendering. Q-NeRF integrates QIREN modules into the Nerfacto backbone, preserving its efficient sampling, pose refinement, and volumetric rendering strategies while replacing selected density and radiance prediction components with quantum-enhanced counterparts. We systematically evaluate three hybrid configurations on standard multi-view indoor datasets, comparing them to classical baselines using PSNR, SSIM, and LPIPS metrics. Results show that hybrid quantum-classical models achieve competitive reconstruction quality under limited computational resources, with quantum modules particularly effective in representing fine-scale, view-dependent appearance. Although current implementations rely on quantum circuit simulators constrained to few-qubit regimes, the results highlight the potential of quantum encodings to alleviate spectral bias in implicit representations. Q-NeRF provides a foundational step toward scalable quantum-enabled 3D scene reconstruction and a baseline for future quantum neural rendering research.
zh

[CV-225] JPEG-Inspired Cloud-Edge Holography

【速读】:该论文旨在解决生成式全息(Computer-generated Holography, CGH)在近眼显示设备中部署时面临的计算资源受限与带宽需求高之间的矛盾问题。具体而言,现有基于深度学习的CGH方法虽提升了重建质量与效率,但直接部署于轻量级眼镜式设备存在算力和能耗瓶颈;而将计算任务迁移至云端后,若采用传统图像编解码方式传输相位信息,则易造成失真且需高带宽保障质量。为此,作者提出JPEG-Inspired Cloud-Edge Holography框架,其核心创新在于设计了一种可学习的变换编码器(learnable transform codec),该编码器保留了JPEG的块结构特性与硬件友好性,使全部复杂神经处理均在云端完成,边缘设备仅执行轻量级非神经解码操作,从而显著降低延迟(低至4.2 ms)并减少对专用硬件依赖,实现了在资源受限可穿戴设备上的高效、低延迟全息流媒体传输。

链接: https://arxiv.org/abs/2512.12367
作者: Shuyang Xie,Jie Zhou,Jun Wang,Renjing Xu
机构: HKUST(GZ); Sichuan University
类目: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Computer-generated holography (CGH) presents a transformative solution for near-eye displays in augmented and virtual reality. Recent advances in deep learning have greatly improved CGH in reconstructed quality and computational efficiency. However, deploying neural CGH pipelines directly on compact, eyeglass-style devices is hindered by stringent constraints on computation and energy consumption, while cloud offloading followed by transmission with natural image codecs often distorts phase information and requires high bandwidth to maintain reconstruction quality. Neural compression methods can reduce bandwidth but impose heavy neural decoders at the edge, increasing inference latency and hardware demand. In this work, we introduce JPEG-Inspired Cloud-Edge Holography, an efficient pipeline designed around a learnable transform codec that retains the block-structured and hardware-friendly nature of JPEG. Our system shifts all heavy neural processing to the cloud, while the edge device performs only lightweight decoding without any neural inference. To further improve throughput, we implement custom CUDA kernels for entropy coding on both cloud and edge. This design achieves a peak signal-to-noise ratio of 32.15 dB at 2 bits per pixel with decode latency as low as 4.2 ms. Both numerical simulations and optical experiments confirm the high reconstruction quality of the holograms. By aligning CGH with a codec that preserves JPEG’s structural efficiency while extending it with learnable components, our framework enables low-latency, bandwidth-efficient hologram streaming on resource-constrained wearable devices-using only simple block-based decoding readily supported by modern system-on-chips, without requiring neural decoders or specialized hardware.
zh

[CV-226] V-Rex: Real-Time Streaming Video LLM Acceleration via Dynamic KV Cache Retrieval

【速读】:该论文旨在解决流式视频大语言模型(Streaming Video Large Language Models, SV-LLMs)在边缘部署时因持续视频输入导致的关键值(Key-Value, KV)缓存急剧增长所引发的内存与计算瓶颈问题,尤其是其特有的迭代预填充(iterative prefill)阶段带来的高计算开销、大量数据传输以及精度下降等挑战。解决方案的核心是提出首个软硬件协同设计加速器V-Rex,其中关键创新在于训练-free的动态KV缓存检索算法ReSV,该算法通过利用视频帧间的时间和空间相似性进行token聚类,显著减少冗余KV缓存占用;同时配套设计了低延迟、紧凑型硬件加速单元——动态KV缓存检索引擎(Dynamic Retrieval Engine, DRE),集成位级计算与早期退出机制,从而实现端侧实时推理(3.9–8.3 FPS)且能效提升达18.5倍,同时保持近似无损的准确性。

链接: https://arxiv.org/abs/2512.12284
作者: Donghyuk Kim,Sejeong Yang,Wonjin Shin,Joo-Young Kim
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: 14 pages, 20 figures, conference

点击查看摘要

Abstract:Streaming video large language models (LLMs) are increasingly used for real-time multimodal tasks such as video captioning, question answering, conversational agents, and augmented reality. However, these models face fundamental memory and computational challenges because their key-value (KV) caches grow substantially with continuous streaming video input. This process requires an iterative prefill stage, which is a unique feature of streaming video LLMs. Due to its iterative prefill stage, it suffers from significant limitations, including extensive computation, substantial data transfer, and degradation in accuracy. Crucially, this issue is exacerbated for edge deployment, which is the primary target for these models. In this work, we propose V-Rex, the first software-hardware co-designed accelerator that comprehensively addresses both algorithmic and hardware bottlenecks in streaming video LLM inference. At its core, V-Rex introduces ReSV, a training-free dynamic KV cache retrieval algorithm. ReSV exploits temporal and spatial similarity-based token clustering to reduce excessive KV cache memory across video frames. To fully realize these algorithmic benefits, V-Rex offers a compact, low-latency hardware accelerator with a dynamic KV cache retrieval engine (DRE), featuring bit-level and early-exit based computing units. V-Rex achieves unprecedented real-time of 3.9-8.3 FPS and energy-efficient streaming video LLM inference on edge deployment with negligible accuracy loss. While DRE only accounts for 2.2% power and 2.0% area, the system delivers 1.9-19.7x speedup and 3.1-18.5x energy efficiency improvements over AGX Orin GPU. This work is the first to comprehensively tackle KV cache retrieval across algorithms and hardware, enabling real-time streaming video LLM inference on resource-constrained edge devices. Comments: 14 pages, 20 figures, conference Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM) Cite as: arXiv:2512.12284 [eess.IV] (or arXiv:2512.12284v1 [eess.IV] for this version) https://doi.org/10.48550/arXiv.2512.12284 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-227] Resolution-Independent Neural Operators for Multi-Rate Sparse-View CT

【速读】:该论文旨在解决稀疏视图计算机断层扫描(Sparse-view Computed Tomography, CT)中因投影数据有限导致的重建病态逆问题,以及现有深度学习方法在不同采样率和图像分辨率下泛化能力差的问题。传统卷积神经网络(Convolutional Neural Networks, CNNs)由于固定核参数无法适应分辨率变化,易产生伪影。其解决方案的关键在于提出一种统一的CT重建框架——CTO(Computed Tomography neural Operator),该框架将模型扩展至连续函数空间,通过旋转等变的离散-连续卷积(Discrete-Continuous convolutions)在sinogram域与图像域联合操作,从而实现无需重训练即可跨采样率和分辨率的通用重建。这一设计使CTO具备内在的分辨率和采样率无关性,在多项指标上显著优于CNN及当前最优扩散模型(如PSNR提升4dB,推理速度提高500倍)。

链接: https://arxiv.org/abs/2512.12236
作者: Aujasvit Datta,Jiayun Wang,Asad Aali,Armeet Singh Jatyani,Anima Anandkumar
机构: Caltech(加州理工学院); IIT Kanpur(印度理工学院坎普尔分校); Stanford University(斯坦福大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Sparse-view Computed Tomography (CT) reconstructs images from a limited number of X-ray projections to reduce radiation and scanning time, which makes reconstruction an ill-posed inverse problem. Deep learning methods achieve high-fidelity reconstructions but often overfit to a fixed acquisition setup, failing to generalize across sampling rates and image resolutions. For example, convolutional neural networks (CNNs) use the same learned kernels across resolutions, leading to artifacts when data resolution changes. We propose Computed Tomography neural Operator (CTO), a unified CT reconstruction framework that extends to continuous function space, enabling generalization (without retraining) across sampling rates and image resolutions. CTO operates jointly in the sinogram and image domains through rotation-equivariant Discrete-Continuous convolutions parametrized in the function space, making it inherently resolution- and sampling-agnostic. Empirically, CTO enables consistent multi-sampling-rate and cross-resolution performance, with on average 4dB PSNR gain over CNNs. Compared to state-of-the-art diffusion methods, CTO is 500 \times faster in inference time with on average 3dB gain. Empirical results also validate our design choices behind CTO’s sinogram-space operator learning and rotation-equivariant convolution. Overall, CTO outperforms state-of-the-art baselines across sampling rates and resolutions, offering a scalable and generalizable solution that makes automated CT reconstruction more practical for deployment. Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2512.12236 [eess.IV] (or arXiv:2512.12236v1 [eess.IV] for this version) https://doi.org/10.48550/arXiv.2512.12236 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-228] Semantic search for 100M galaxy images using AI-generated captions NEURIPS2025

【速读】:该论文旨在解决科学图像数据(如天文观测中的数十亿张星系图像)因依赖人工标注而难以高效探索的问题,从而限制了对罕见或有趣现象的发现能力。其核心解决方案是构建一个基于视觉-语言模型(Vision-Language Models, VLMs)的语义搜索系统——AION-Search,关键在于利用VLM自动为无标签图像生成描述,并通过对比学习将预训练的多模态天文学基础模型与这些文本嵌入对齐,从而在不依赖人工标注的情况下实现可扩展的语义嵌入表示。该方法首次使1.4亿张星系图像具备灵活、高效的语义搜索能力,在零样本条件下即能识别稀有天体现象,且结合VLM重排序策略显著提升前100名结果的召回率,为大规模未标注科学图像库提供了一种通用的语义检索范式。

链接: https://arxiv.org/abs/2512.11982
作者: Nolan Koblischke,Liam Parker,Francois Lanusse,Irina Espejo Morales,Jo Bovy,Shirley Ho
机构: University of Toronto (多伦多大学); Princeton University (普林斯顿大学); Max Planck Institute for Astrophysics (马克斯·普朗克天体物理研究所); University of Cambridge (剑桥大学); Stanford University (斯坦福大学); Flatiron Institute (扁平铁研究所)
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Presented at the NeurIPS 2025 AI4Science Workshop

点击查看摘要

Abstract:Finding scientifically interesting phenomena through slow, manual labeling campaigns severely limits our ability to explore the billions of galaxy images produced by telescopes. In this work, we develop a pipeline to create a semantic search engine from completely unlabeled image data. Our method leverages Vision-Language Models (VLMs) to generate descriptions for galaxy images, then contrastively aligns a pre-trained multimodal astronomy foundation model with these embedded descriptions to produce searchable embeddings at scale. We find that current VLMs provide descriptions that are sufficiently informative to train a semantic search model that outperforms direct image similarity search. Our model, AION-Search, achieves state-of-the-art zero-shot performance on finding rare phenomena despite training on randomly selected images with no deliberate curation for rare cases. Furthermore, we introduce a VLM-based re-ranking method that nearly doubles the recall for our most challenging targets in the top-100 results. For the first time, AION-Search enables flexible semantic search scalable to 140 million galaxy images, enabling discovery from previously infeasible searches. More broadly, our work provides an approach for making large, unlabeled scientific image archives semantically searchable, expanding data exploration capabilities in fields from Earth observation to microscopy. The code, data, and app are publicly available at this https URL
zh

[CV-229] Pre-training vision models for the classification of alerts from wide-field time-domain surveys

【速读】:该论文旨在解决时间域天文学中警报分类任务的性能与效率瓶颈问题,即当前普遍采用自定义卷积神经网络(Convolutional Neural Networks, CNNs)从零开始训练模型,导致资源消耗大、效率低且性能受限。其解决方案的关键在于引入计算机视觉领域的最新实践——使用在大规模通用图像数据集(如ImageNet或Galaxy Zoo星系图像)上预训练的标准化模型架构,而非设计专用CNN并从头训练。研究发现,这类预训练模型不仅在分类性能上优于传统定制模型,而且在推理时间和内存占用方面显著更高效,即使参数量更大,也因架构优化而表现更优,从而为下一代巡天项目(如Legacy Survey of Space and Time)提供了更具可行性的视觉模型构建范式。

链接: https://arxiv.org/abs/2512.11957
作者: Nabeel Rehemtulla,Adam A. Miller,Mike Walmsley,Ved G. Shah,Theophile Jegou du Laz,Michael W. Coughlin,Argyro Sasli,Joshua Bloom,Christoffer Fremling,Matthew J. Graham,Steven L. Groom,David Hale,Ashish A. Mahabal,Daniel A. Perley,Josiah Purdum,Ben Rusholme,Jesper Sollerman,Mansi M. Kasliwal
机构: Northwestern University (西北大学); Center for Interdisciplinary Exploration and Research in Astrophysics (CIERA) (跨学科探索与天体物理研究中心); NSF-Simons AI Institute for the Sky (SkAI) (国家科学基金会-西蒙斯人工智能天空研究所); University of Toronto (多伦多大学); California Institute of Technology (加州理工学院); NSF Institute on Accelerated AI Algorithms for Data-Driven Discovery (A3D3) (国家科学基金会加速人工智能算法用于数据驱动发现研究所); University of Minnesota (明尼苏达大学); Lawrence Berkeley National Laboratory (劳伦斯伯克利国家实验室); University of California, Berkeley (加州大学伯克利分校); Caltech Optical Observatories (加州理工学院光学天文台); IPAC (红外处理分析中心); Center for Data Driven Discovery (数据驱动发现中心); Liverpool John Moores University (利物浦约翰摩尔斯大学); Stockholm University (斯德哥尔摩大学); Mansi M. Kasliwal (曼西·M·卡斯利瓦尔,此为个人姓名,非机构,不纳入)
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV)
备注: To be submitted to PASP

点击查看摘要

Abstract:Modern wide-field time-domain surveys facilitate the study of transient, variable and moving phenomena by conducting image differencing and relaying alerts to their communities. Machine learning tools have been used on data from these surveys and their precursors for more than a decade, and convolutional neural networks (CNNs), which make predictions directly from input images, saw particularly broad adoption through the 2010s. Since then, continually rapid advances in computer vision have transformed the standard practices around using such models. It is now commonplace to use standardized architectures pre-trained on large corpora of everyday images (e.g., ImageNet). In contrast, time-domain astronomy studies still typically design custom CNN architectures and train them from scratch. Here, we explore the affects of adopting various pre-training regimens and standardized model architectures on the performance of alert classification. We find that the resulting models match or outperform a custom, specialized CNN like what is typically used for filtering alerts. Moreover, our results show that pre-training on galaxy images from Galaxy Zoo tends to yield better performance than pre-training on ImageNet or training from scratch. We observe that the design of standardized architectures are much better optimized than the custom CNN baseline, requiring significantly less time and memory for inference despite having more trainable parameters. On the eve of the Legacy Survey of Space and Time and other image-differencing surveys, these findings advocate for a paradigm shift in the creation of vision models for alerts, demonstrating that greater performance and efficiency, in time and in data, can be achieved by adopting the latest practices from the computer vision field.
zh

[CV-230] Vision Foundry: A System for Training Foundational Vision AI Models

【速读】:该论文旨在解决临床研究人员在应用自监督学习(Self-supervised Learning, SSL)时面临的高技术门槛问题,尤其是在医学影像领域中,如何高效地利用大规模未标注数据进行模型预训练、适配与部署。解决方案的关键在于提出并实现了一个无需编写代码、符合HIPAA合规要求的平台——Vision Foundry,其核心创新包括集成DINO-MX框架以抽象分布式基础设施复杂性,并引入专为医学影像设计的策略,如放大感知蒸馏(Magnification-Aware Distillation, MAD)和参数高效微调(Parameter-Efficient Fine-Tuning, PEFT),从而显著提升模型在病理分割、细胞密度估计及冠状动脉钙化评分等任务中的性能与零样本泛化能力,使临床专家能够以最小标注成本开发高性能AI工具,聚焦于临床发现而非工程优化。

链接: https://arxiv.org/abs/2512.11837
作者: Mahmut S. Gokmen,Mitchell A. Klusty,Evan W. Damron,W. Vaiden Logan,Aaron D. Mullen,Caroline N. Leach,Emily B. Collier,Samuel E. Armstrong,V.K. Cody Bumgardner
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 10 pages, 4 figures, 3 tables, submitted to AMIA 2026 Informatics Summit

点击查看摘要

Abstract:Self-supervised learning (SSL) leverages vast unannotated medical datasets, yet steep technical barriers limit adoption by clinical researchers. We introduce Vision Foundry, a code-free, HIPAA-compliant platform that democratizes pre-training, adaptation, and deployment of foundational vision models. The system integrates the DINO-MX framework, abstracting distributed infrastructure complexities while implementing specialized strategies like Magnification-Aware Distillation (MAD) and Parameter-Efficient Fine-Tuning (PEFT). We validate the platform across domains, including neuropathology segmentation, lung cellularity estimation, and coronary calcium scoring. Our experiments demonstrate that models trained via Vision Foundry significantly outperform generic baselines in segmentation fidelity and regression accuracy, while exhibiting robust zero-shot generalization across imaging protocols. By bridging the gap between advanced representation learning and practical application, Vision Foundry enables domain experts to develop state-of-the-art clinical AI tools with minimal annotation overhead, shifting focus from engineering optimization to clinical discovery.
zh

人工智能

[AI-0] Embedding-Based Rankings of Educational Resources based on Learning Outcome Alignment: Benchmarking Expert Validation and Learner Performance

【速读】:该论文旨在解决在线教育中个性化学习资源难以有效匹配预期学习目标的问题,尤其是在海量教育资源背景下,教师面临筛选与学习目标对齐内容的挑战。其核心解决方案是提出一种基于文本嵌入(text-embedding)模型的自动化评估框架,通过计算教育材料与学习目标之间的语义相似度来判断对齐程度,从而替代高成本的人工审核流程。关键创新在于验证了最优嵌入模型(Voyage)在检测对齐性上具有较高准确率(79%),并进一步证明该方法可有效应用于大语言模型生成的内容,专家评估显示其准确率达83%,且实验表明更高的对齐分数显著提升学习绩效(χ²(2, N = 360) = 15.39, p < 0.001),实现了规模化个性化教学的支持。

链接: https://arxiv.org/abs/2512.13658
作者: Mohammadreza Molavi,Mohammad Moein,Mohammadreza Tavakoli,Abdolali Faraji,Stefan T. Mol,Gábor Kismihók
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Accepted for publication at the 16th International Conference on Learning Analytics Knowledge (LAK 2026)

点击查看摘要

Abstract:As the online learning landscape evolves, the need for personalization is increasingly evident. Although educational resources are burgeoning, educators face challenges selecting materials that both align with intended learning outcomes and address diverse learner needs. Large Language Models (LLMs) are attracting growing interest for their potential to create learning resources that better support personalization, but verifying coverage of intended outcomes still requires human alignment review, which is costly and limits scalability. We propose a framework that supports the cost-effective automation of evaluating alignment between educational resources and intended learning outcomes. Using human-generated materials, we benchmarked LLM-based text-embedding models and found that the most accurate model (Voyage) achieved 79% accuracy in detecting alignment. We then applied the optimal model to LLM-generated resources and, via expert evaluation, confirmed that it reliably assessed correspondence to intended outcomes (83% accuracy). Finally, in a three-group experiment with 360 learners, higher alignment scores were positively related to greater learning performance, chi-squared(2, N = 360) = 15.39, p 0.001. These findings show that embedding-based alignment scores can facilitate scalable personalization by confirming alignment with learning outcomes, which allows teachers to focus on tailoring content to diverse learner needs.
zh

[AI-1] DP-CSGP: Differentially Private Stochastic Gradient Push with Compressed Communication

【速读】:该论文旨在解决去中心化学习中如何在保证严格差分隐私(Differential Privacy, DP)的同时,实现高模型效用与高效通信的问题。现有方法往往难以兼顾隐私保护、模型性能和通信效率,尤其是在有向图结构下的分布式优化场景中。解决方案的关键在于提出一种带有压缩通信的差分隐私随机梯度推算法(Differentially Private Stochastic Gradient Push with Compressed communication, DP-CSGP),通过引入压缩机制降低通信开销,并在非凸光滑目标函数下证明其能达到与精确通信版本相当的理论效用边界,即 O(dlog(1δ)/(nJϵ))\mathcal{O}\left( \sqrt{d\log \left( \frac{1}{\delta} \right)/(\sqrt{nJ}\epsilon)} \right),其中 nn 为节点数,JJ 为每节点本地样本数,dd 为决策变量维度,ϵ\epsilonδ\delta 为隐私参数。实验表明,在相同隐私预算下,DP-CSGP 在保持相近模型精度的同时显著降低了通信成本。

链接: https://arxiv.org/abs/2512.13583
作者: Zehan Zhu,Heng Zhao,Yan Huang,Joey Tianyi Zhou,Shouling Ji,Jinming Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 13 pages

点击查看摘要

Abstract:In this paper, we propose a Differentially Private Stochastic Gradient Push with Compressed communication (termed DP-CSGP) for decentralized learning over directed graphs. Different from existing works, the proposed algorithm is designed to maintain high model utility while ensuring both rigorous differential privacy (DP) guarantees and efficient communication. For general non-convex and smooth objective functions, we show that the proposed algorithm achieves a tight utility bound of \mathcalO\left( \sqrtd\log \left( \frac1\delta \right)/(\sqrtnJ\epsilon) \right) ( J and d are the number of local samples and the dimension of decision variables, respectively) with \left(\epsilon, \delta\right) -DP guarantee for each node, matching that of decentralized counterparts with exact communication. Extensive experiments on benchmark tasks show that, under the same privacy budget, DP-CSGP achieves comparable model accuracy with significantly lower communication cost than existing decentralized counterparts with exact communication.
zh

[AI-2] Superposition as Lossy Compression: Measure with Sparse Autoencoders and Connect to Adversarial Vulnerability

【速读】:该论文试图解决神经网络中特征叠加(superposition)现象的量化与测量问题,即如何在激活空间中衡量多个特征以重叠方向编码时所损失的信息及其有效自由度。传统方法难以准确评估这种非线性压缩机制对模型性能和可解释性的影响。解决方案的关键在于提出一个信息论框架,利用香农熵(Shannon entropy)对稀疏自编码器(sparse autoencoder)激活进行建模,从而计算出实现无干扰编码所需的最小神经元数量——这等价于网络通过叠加模拟的“虚拟神经元”数量。该指标能够反映神经网络在计算约束下组织信息的方式,并揭示了超参数如dropout、对抗训练以及任务复杂度对特征有效数量的系统性影响。

链接: https://arxiv.org/abs/2512.13568
作者: Leonard Bereska,Zoe Tzifa-Kratira,Reza Samavi,Efstratios Gavves
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to TMLR, view HTML here: this https URL

点击查看摘要

Abstract:Neural networks achieve remarkable performance through superposition: encoding multiple features as overlapping directions in activation space rather than dedicating individual neurons to each feature. This challenges interpretability, yet we lack principled methods to measure superposition. We present an information-theoretic framework measuring a neural representation’s effective degrees of freedom. We apply Shannon entropy to sparse autoencoder activations to compute the number of effective features as the minimum neurons needed for interference-free encoding. Equivalently, this measures how many “virtual neurons” the network simulates through superposition. When networks encode more effective features than actual neurons, they must accept interference as the price of compression. Our metric strongly correlates with ground truth in toy models, detects minimal superposition in algorithmic tasks, and reveals systematic reduction under dropout. Layer-wise patterns mirror intrinsic dimensionality studies on Pythia-70M. The metric also captures developmental dynamics, detecting sharp feature consolidation during grokking. Surprisingly, adversarial training can increase effective features while improving robustness, contradicting the hypothesis that superposition causes vulnerability. Instead, the effect depends on task complexity and network capacity: simple tasks with ample capacity allow feature expansion (abundance regime), while complex tasks or limited capacity force reduction (scarcity regime). By defining superposition as lossy compression, this work enables principled measurement of how neural networks organize information under computational constraints, connecting superposition to adversarial robustness.
zh

[AI-3] MedCEG: Reinforcing Verifiable Medical Reasoning with Critical Evidence Graph

【速读】:该论文旨在解决当前具备推理能力的医学语言模型在临床应用中存在推理过程不可靠的问题,即尽管强化学习等方法提升了模型的推理性能,但其推理路径的准确性与临床有效性常被忽视,导致医生难以信任其决策依据。解决方案的关键在于提出MedCEG框架,通过引入一种称为“关键证据图”(Critical Evidence Graph, CEG)的结构化监督机制,显式约束和引导模型生成具有临床可验证性的推理链条;同时设计了一种临床推理过程奖励函数(Clinical Reasoning Procedure Reward),从节点覆盖度、结构正确性和链路完整性三个维度对推理质量进行综合评估,从而显著提升医学AI推理的可靠性与透明性。

链接: https://arxiv.org/abs/2512.13510
作者: Linjie Mu,Yannian Gu,Zhongzhen Huang,Yakun Zhu,Shaoting Zhang,Xiaofan Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models with reasoning capabilities have demonstrated impressive performance across a wide range of domains. In clinical applications, a transparent, step-by-step reasoning process provides physicians with strong evidence to support decision-making. While reinforcement learning has effectively enhanced reasoning performance in medical contexts, the clinical reliability of these reasoning processes remains limited because their accuracy and validity are often overlooked during training. To address this gap, we propose MedCEG, a framework that augments medical language models with clinically valid reasoning pathways by explicitly supervising the reasoning process through a Critical Evidence Graph (CEG). We curate a dataset of challenging clinical cases and algorithmically construct a CEG for each sample to represent a high-quality verifiable reasoning pathway. To guide the reasoning process, we introduce a Clinical Reasoning Procedure Reward, which evaluates Node Coverage, Structural Correctness, and Chain Completeness, thereby providing a holistic assessment of reasoning quality. Experimental results show that MedCEG surpasses existing methods in performance while producing clinically valid reasoning chains, representing a solid advancement in reliable medical AI reasoning. The code and models are available at this https URL.
zh

[AI-4] Defending the Hierarchical Result Models of Precedential Constraint

【速读】:该论文旨在解决层次化案例推理(hierarchical case-based reasoning)模型在判例约束(precedential constraint)建模中可能产生错误结果的问题,特别是针对Trevor Bench-Capon指出的:现有模型未能充分考虑中间因素(intermediate factors)由不同基础因素(base-level factors)以不同强度确立的情形。解决方案的关键在于区分中间因素作为“维度”(dimensions)与作为“结果”(results)的不同解释,并引入van Woerkom的结果导向型层次模型(result-based hierarchical model)的维度版本,从而在Bench-Capon提出的具体示例中避免其批评所指的逻辑缺陷。

链接: https://arxiv.org/abs/2512.13505
作者: Henry Prakken,Wijnand van Woerkom
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: This is the long version of a paper with the same title presented at the 38th International Conference on Legal Knowledge and Information Systems

点击查看摘要

Abstract:In recent years, hierarchical case-based-reasoning models of precedential constraint have been proposed. In various papers, Trevor Bench-Capon criticised these models on the grounds that they would give incorrect outcomes in some cases. In particular, the models would not account for the possibility that intermediate factors are established with different strengths by different base-level factors. In this paper we respond to these criticisms for van Woerkom’s result-based hierarchical models. We argue that in some examples Bench-Capon seems to interpret intermediate factors as dimensions, and that applying van Woerkom’s dimension-based version of the hierarchical result model to these examples avoids Bench-Capon’s criticisms.
zh

[AI-5] Behavior-Aware and Generalizable Defense Against Black-Box Adversarial Attacks for ML-Based IDS

【速读】:该论文针对机器学习驱动的入侵检测系统(Intrusion Detection System, IDS)在面对黑盒对抗攻击时防御能力不足的问题展开研究。黑盒攻击者通过间接反馈(如二进制输出或响应时间、资源使用等行为信号)构造规避输入,现有防御方法如输入变换、对抗训练和替代检测机制往往局限于特定攻击类型、依赖模型内部访问权限或采用静态策略,难以适应不断演化的攻击手段,且部分方法会损害IDS的检测性能,不适用于实时部署。论文提出了一种轻量级、主动式的防御机制——自适应特征污染(Adaptive Feature Poisoning),其核心在于:假设攻击者可进行静默且持续的探测,通过流量特征分析、变化点检测与自适应缩放技术,动态识别并扰动攻击者正在利用的关键流量特征,从而破坏攻击者的反馈闭环,同时保持检测能力不受影响。该方案具备攻击无关性、通用性和隐蔽性,显著提升了IDS在真实黑盒场景下的对抗鲁棒性。

链接: https://arxiv.org/abs/2512.13501
作者: Sabrine Ennaji,Elhadj Benkhelifa,Luigi Vincenzo Mancini
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Machine learning based intrusion detection systems are increasingly targeted by black box adversarial attacks, where attackers craft evasive inputs using indirect feedback such as binary outputs or behavioral signals like response time and resource usage. While several defenses have been proposed, including input transformation, adversarial training, and surrogate detection, they often fall short in practice. Most are tailored to specific attack types, require internal model access, or rely on static mechanisms that fail to generalize across evolving attack strategies. Furthermore, defenses such as input transformation can degrade intrusion detection system performance, making them unsuitable for real time deployment. To address these limitations, we propose Adaptive Feature Poisoning, a lightweight and proactive defense mechanism designed specifically for realistic black box scenarios. Adaptive Feature Poisoning assumes that probing can occur silently and continuously, and introduces dynamic and context aware perturbations to selected traffic features, corrupting the attacker feedback loop without impacting detection capabilities. The method leverages traffic profiling, change point detection, and adaptive scaling to selectively perturb features that an attacker is likely exploiting, based on observed deviations. We evaluate Adaptive Feature Poisoning against multiple realistic adversarial attack strategies, including silent probing, transferability based attacks, and decision boundary based attacks. The results demonstrate its ability to confuse attackers, degrade attack effectiveness, and preserve detection performance. By offering a generalizable, attack agnostic, and undetectable defense, Adaptive Feature Poisoning represents a significant step toward practical and robust adversarial resilience in machine learning based intrusion detection systems. Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI) Cite as: arXiv:2512.13501 [cs.CR] (or arXiv:2512.13501v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2512.13501 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-6] SSAS: Cross-subject EEG-based Emotion Recognition through Source Selection with Adversarial Strategy

【速读】:该论文旨在解决跨被试(cross-subject)脑电(EEG)情绪识别中因个体差异导致的负迁移(negative transfer)问题,即在模型训练过程中忽视了不同被试间特征分布的异质性,从而影响情绪分类性能。其解决方案的关键在于提出一种基于源域选择与对抗策略相结合的方法,包含两个核心模块:源选择网络(Source Selection Network, SS)和对抗策略网络(Adversarial Strategies Network, AS)。其中,SS通过利用域标签重构域适应训练过程,主动破坏类别可分性并放大域间差异,迫使模型学习到域不变但与情绪相关表征;AS则基于SS输出的源域选择结果及预训练的域判别器,引入一种新颖的对抗损失函数以提升域分类性能,确保对抗训练中的策略平衡。该方法在SEED和SEED-IV两个EEG情绪数据集上均取得了优异效果,验证了其有效性。

链接: https://arxiv.org/abs/2512.13458
作者: Yici Liu,Qi Wei Oung,Hoi Leong Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:Electroencephalographic (EEG) signals have long been applied in the field of affective brain-computer interfaces (aBCIs). Cross-subject EEG-based emotion recognition has demonstrated significant potential in practical applications due to its suitability across diverse people. However, most studies on cross-subject EEG-based emotion recognition neglect the presence of inter-individual variability and negative transfer phenomena during model training. To address this issue, a cross-subject EEG-based emotion recognition through source selection with adversarial strategy is introduced in this paper. The proposed method comprises two modules: the source selection network (SS) and the adversarial strategies network (AS). The SS uses domain labels to reverse-engineer the training process of domain adaptation. Its key idea is to disrupt class separability and magnify inter-domain differences, thereby raising the classification difficulty and forcing the model to learn domain-invariant yet emotion-relevant representations. The AS gets the source domain selection results and the pretrained domain discriminators from SS. The pretrained domain discriminators compute a novel loss aimed at enhancing the performance of domain classification during adversarial training, ensuring the balance of adversarial strategies. This paper provides theoretical insights into the proposed method and achieves outstanding performance on two EEG-based emotion datasets, SEED and SEED-IV. The code can be found at this https URL.
zh

[AI-7] From User Interface to Agent Interface: Efficiency Optimization of UI Representations for LLM Agents

【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)代理在自动化用户界面(User Interface, UI)导航任务中因UI表示效率低下而导致的性能瓶颈问题。核心挑战在于:一方面缺乏布尔型验证器(Boolean oracles),难以同时优化token效率与语义完整性;另一方面,输入为复杂的大规模UI树结构,输出为长且组合性强的转换程序,导致搜索空间庞大且易出错。解决方案的关键在于提出UIFormer框架,其通过两个核心技术实现突破:一是设计领域特定语言(Domain-Specific Language, DSL)限制程序空间以捕捉UI特有操作;二是基于LLM进行迭代精化,引入正确性和效率双重奖励机制,引导完成效率与完整性协同优化。该框架作为轻量级插件可无缝集成至现有LLM代理,无需修改其核心逻辑,实测在跨Android和Web平台的三个UI导航基准上实现48.7%–55.8%的token减少,同时保持或提升代理性能。

链接: https://arxiv.org/abs/2512.13438
作者: Dezhi Ran,Zhi Gong,Yuzhe Guo,Mengzhou Wu,Yuan Cao,Haochuan Lu,Hengyu Zhang,Xia Zeng,Gang Cao,Liangchao Yao,Yuetang Deng,Wei Yang,Tao Xie
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While Large Language Model (LLM) agents show great potential for automated UI navigation such as automated UI testing and AI assistants, their efficiency has been largely overlooked. Our motivating study reveals that inefficient UI representation creates a critical performance bottleneck. However, UI representation optimization, formulated as the task of automatically generating programs that transform UI representations, faces two unique challenges. First, the lack of Boolean oracles, which traditional program synthesis uses to decisively validate semantic correctness, poses a fundamental challenge to co-optimization of token efficiency and completeness. Second, the need to process large, complex UI trees as input while generating long, compositional transformation programs, making the search space vast and error-prone. Toward addressing the preceding limitations, we present UIFormer, the first automated optimization framework that synthesizes UI transformation programs by conducting constraint-based optimization with structured decomposition of the complex synthesis task. First, UIFormer restricts the program space using a domain-specific language (DSL) that captures UI-specific operations. Second, UIFormer conducts LLM-based iterative refinement with correctness and efficiency rewards, providing guidance for achieving the efficiency-completeness co-optimization. UIFormer operates as a lightweight plugin that applies transformation programs for seamless integration with existing LLM agents, requiring minimal modifications to their core logic. Evaluations across three UI navigation benchmarks spanning Android and Web platforms with five LLMs demonstrate that UIFormer achieves 48.7% to 55.8% token reduction with minimal runtime overhead while maintaining or improving agent performance. Real-world industry deployment at WeChat further validates the practical impact of UIFormer.
zh

[AI-8] Behavior and Representation in Large Language Models for Combinatorial Optimization: From Feature Extraction to Algorithm Selection

【速读】:该论文旨在解决生成式 AI(Generative AI)在组合优化问题中对问题结构与算法行为的学习机制尚不明确的问题,即探究大型语言模型(Large Language Models, LLMs)如何内部表征组合优化实例,并评估这些表征是否能够支持下游决策任务。其解决方案的关键在于采用双重方法:一是通过直接查询(direct querying)检验LLMs显式提取实例特征的能力;二是借助探测分析(probing analyses)考察此类信息是否隐式编码于模型隐藏层中,并进一步将该框架扩展至实例级算法选择任务,验证LLM衍生表征能否预测最优求解器性能。实验表明,LLM在恢复问题特征方面具备中等能力,且其隐藏层表征的预测效能可媲美传统特征提取方法,证明LLM能捕获与优化性能相关的重要结构信息。

链接: https://arxiv.org/abs/2512.13374
作者: Francesca Da Ros,Luca Di Gaspero,Kevin Roitero
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in Large Language Models (LLMs) have opened new perspectives for automation in optimization. While several studies have explored how LLMs can generate or solve optimization models, far less is understood about what these models actually learn regarding problem structure or algorithmic behavior. This study investigates how LLMs internally represent combinatorial optimization problems and whether such representations can support downstream decision tasks. We adopt a twofold methodology combining direct querying, which assesses LLM capacity to explicitly extract instance features, with probing analyses that examine whether such information is implicitly encoded within their hidden layers. The probing framework is further extended to a per-instance algorithm selection task, evaluating whether LLM-derived representations can predict the best-performing solver. Experiments span four benchmark problems and three instance representations. Results show that LLMs exhibit moderate ability to recover feature information from problem instances, either through direct querying or probing. Notably, the predictive power of LLM hidden-layer representations proves comparable to that achieved through traditional feature extraction, suggesting that LLMs capture meaningful structural information relevant to optimization performance.
zh

[AI-9] Control of a Twin Rotor using Twin Delayed Deep Deterministic Policy Gradient (TD3)

【速读】:该论文旨在解决Twin Rotor Aerodynamic System (TRAS) 在复杂非线性动态特性下难以通过传统控制算法实现稳定控制与轨迹跟踪的问题。其解决方案的关键在于采用基于强化学习(Reinforcement Learning, RL)的策略,具体使用了双延迟深度确定性策略梯度(Twin Delayed Deep Deterministic Policy Gradient, TD3)算法来训练智能体,该方法适用于连续状态和动作空间,且无需系统模型,从而有效应对TRAS的高维非线性控制挑战。仿真与实验结果表明,该方法在抗风扰动能力和实际应用中均优于传统PID控制器。

链接: https://arxiv.org/abs/2512.13356
作者: Zeyad Gamal,Youssef Mahran,Ayman El-Badawy
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: This is the Author Accepted Manuscript version of a paper accepted for publication. The final published version is available via IEEE Xplore

点击查看摘要

Abstract:This paper proposes a reinforcement learning (RL) framework for controlling and stabilizing the Twin Rotor Aerodynamic System (TRAS) at specific pitch and azimuth angles and tracking a given trajectory. The complex dynamics and non-linear characteristics of the TRAS make it challenging to control using traditional control algorithms. However, recent developments in RL have attracted interest due to their potential applications in the control of multirotors. The Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm was used in this paper to train the RL agent. This algorithm is used for environments with continuous state and action spaces, similar to the TRAS, as it does not require a model of the system. The simulation results illustrated the effectiveness of the RL control method. Next, external disturbances in the form of wind disturbances were used to test the controller’s effectiveness compared to conventional PID controllers. Lastly, experiments on a laboratory setup were carried out to confirm the controller’s effectiveness in real-world applications.
zh

[AI-10] Security and Detectability Analysis of Unicode Text Watermarking Methods Against Large Language Models

【速读】:该论文旨在解决生成式 AI(Generative AI)背景下数字文本水印技术的安全性与隐蔽性问题,即现有文本水印方法是否能有效抵御大型语言模型(Large Language Models, LLMs)的检测和提取。研究通过在六个主流LLM上测试十种Unicode文本水印方法,发现尽管最新推理型模型能够识别出被水印标记的文本,但它们无法在没有源代码或实现细节的情况下提取水印信息。解决方案的关键在于:水印嵌入机制需具备对LLM的隐蔽性,同时其安全性依赖于实现细节的保密性,而非仅靠算法本身不可逆性,这为未来水印设计提供了新的安全边界和研究方向。

链接: https://arxiv.org/abs/2512.13325
作者: Malte Hellmeier
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted for publication at the ICISSP 2026

点击查看摘要

Abstract:Securing digital text is becoming increasingly relevant due to the widespread use of large language models. Individuals’ fear of losing control over data when it is being used to train such machine learning models or when distinguishing model-generated output from text written by humans. Digital watermarking provides additional protection by embedding an invisible watermark within the data that requires protection. However, little work has been taken to analyze and verify if existing digital text watermarking methods are secure and undetectable by large language models. In this paper, we investigate the security-related area of watermarking and machine learning models for text data. In a controlled testbed of three experiments, ten existing Unicode text watermarking methods were implemented and analyzed across six large language models: GPT-5, GPT-4o, Teuken 7B, Llama 3.3, Claude Sonnet 4, and Gemini 2.5 Pro. The findings of our experiments indicate that, especially the latest reasoning models, can detect a watermarked text. Nevertheless, all models fail to extract the watermark unless implementation details in the form of source code are provided. We discuss the implications for security researchers and practitioners and outline future research opportunities to address security concerns.
zh

[AI-11] Error-Driven Prompt Optimization for Arithmetic Reasoning

【速读】:该论文旨在解决工业场景中对隐私敏感的表格数据处理任务中,小型语言模型(Small Language Models, SLMs)在执行精确算术推理时存在的准确性不足问题,尤其是在要求数据不出内网的合规环境下。解决方案的关键在于提出一种基于错误驱动的优化框架,通过聚类模型错误预测结果并迭代优化提示规则(prompt-rules),从而显著提升SLM在算术任务上的表现,使Qwen3 4B模型准确率从原始水平提升至70.8%,且优于GPT-3.5 Turbo,在不依赖昂贵微调的前提下实现可解释、可部署的AI辅助系统。

链接: https://arxiv.org/abs/2512.13323
作者: Árpád Pándy,Róbert Lakatos,András Hajdu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent advancements in artificial intelligence have sparked interest in industrial agents capable of supporting analysts in regulated sectors, such as finance and healthcare, within tabular data workflows. A key capability for such systems is performing accurate arithmetic operations on structured data while ensuring sensitive information never leaves secure, on-premises environments. Here, we introduce an error-driven optimization framework for arithmetic reasoning that enhances a Code Generation Agent (CGA), specifically applied to on-premises small language models (SLMs). Through a systematic evaluation of a leading SLM (Qwen3 4B), we find that while the base model exhibits fundamental limitations in arithmetic tasks, our proposed error-driven method, which clusters erroneous predictions to refine prompt-rules iteratively, dramatically improves performance, elevating the model’s accuracy to 70.8%. Our results suggest that developing reliable, interpretable, and industrially deployable AI assistants can be achieved not only through costly fine-tuning but also via systematic, error-driven prompt optimization, enabling small models to surpass larger language models (GPT-3.5 Turbo) in a privacy-compliant manner.
zh

[AI-12] ALIGN-FL: Architecture-independent Learning through Invariant Generative component sharing in Federated Learning

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中因客户端数据分布高度非独立同分布(Non-IID)而导致的模型性能下降与隐私泄露问题。其核心挑战在于如何在保护敏感数据的同时,实现跨域协作下的有效知识迁移。解决方案的关键在于提出ALIGN-FL框架,通过仅共享生成式组件而非完整模型参数,使服务器基于合成样本进行全局训练,从而实现隐私保护下的客户端不变学习(Client-invariant Learning)。该方法结合了自适应裁剪的差分隐私随机梯度下降(DP-SGD)与Lipschitz正则化的变分自编码器(VAE)解码器,有效将敏感异常点映射为典型数据点,同时在极端Non-IID场景下保持模型实用性。

链接: https://arxiv.org/abs/2512.13316
作者: Mayank Gulati,Benedikt Groß,Gerhard Wunder
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at 2025 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC)

点击查看摘要

Abstract:We present ALIGN-FL, a novel approach to distributed learning that addresses the challenge of learning from highly disjoint data distributions through selective sharing of generative components. Instead of exchanging full model parameters, our framework enables privacy-preserving learning by transferring only generative capabilities across clients, while the server performs global training using synthetic samples. Through complementary privacy mechanisms: DP-SGD with adaptive clipping and Lipschitz regularized VAE decoders and a stateful architecture supporting heterogeneous clients, we experimentally validate our approach on MNIST and Fashion-MNIST datasets with cross-domain outliers. Our analysis demonstrates that both privacy mechanisms effectively map sensitive outliers to typical data points while maintaining utility in extreme Non-IID scenarios typical of cross-silo collaborations. Index Terms: Client-invariant Learning, Federated Learning (FL), Privacy-preserving Generative Models, Non-Independent and Identically Distributed (Non-IID), Heterogeneous Architectures Comments: Accepted at 2025 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC) Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2512.13316 [cs.LG] (or arXiv:2512.13316v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2512.13316 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Related DOI: https://doi.org/10.1109/CyberC66434.2025.00025 Focus to learn more DOI(s) linking to related resources
zh

[AI-13] No One Left Behind: How to Exploit the Incomplete and Skewed Multi-Label Data for Conversion Rate Prediction

【速读】:该论文旨在解决在线广告系统中多任务学习(Multi-Task Learning, MTL)因转换率(Conversion Rate, CVR)标签不完整和分布偏斜而导致的模型性能下降问题。具体而言,由于广告主仅提交部分用户转化行为(受隐私等因素限制),导致多标签数据存在不对称缺失,使得训练与部署阶段的数据分布不一致,进而影响模型泛化能力。其解决方案的关键在于提出细粒度知识迁移框架KAML(Knowledge transfer framework for Asymmetric Multi-Label data),核心创新包括:1)引入基于归因驱动的掩码策略(Attribution-driven Masking Strategy, ADM),以更高效地利用具有不对称多标签结构的数据;2)设计分层知识提取机制(Hierarchical Knowledge Extraction, HKE),建模目标任务塔内样本差异以缓解由数据偏斜带来的噪声干扰;3)结合排序损失策略增强未标记样本的利用效率,从而提升整体CVR预测准确性。

链接: https://arxiv.org/abs/2512.13300
作者: Qinglin Jia,Zhaocheng Du,Chuhan Wu,Huifeng Guo,Ruiming Tang,Shuting Shi,Muyu Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:In most real-world online advertising systems, advertisers typically have diverse customer acquisition goals. A common solution is to use multi-task learning (MTL) to train a unified model on post-click data to estimate the conversion rate (CVR) for these diverse targets. In practice, CVR prediction often encounters missing conversion data as many advertisers submit only a subset of user conversion actions due to privacy or other constraints, making the labels of multi-task data incomplete. If the model is trained on all available samples where advertisers submit user conversion actions, it may struggle when deployed to serve a subset of advertisers targeting specific conversion actions, as the training and deployment data distributions are mismatched. While considerable MTL efforts have been made, a long-standing challenge is how to effectively train a unified model with the incomplete and skewed multi-label data. In this paper, we propose a fine-grained Knowledge transfer framework for Asymmetric Multi-Label data (KAML). We introduce an attribution-driven masking strategy (ADM) to better utilize data with asymmetric multi-label data in training. However, the more relaxed masking in ADM is a double-edged sword: it provides additional training signals but also introduces noise due to skewed data. To address this, we propose a hierarchical knowledge extraction mechanism (HKE) to model the sample discrepancy within the target task tower. Finally, to maximize the utility of unlabeled samples, we incorporate ranking loss strategy to further enhance our model. The effectiveness of KAML has been demonstrated through comprehensive evaluations on offline industry datasets and online A/B tests, which show significant performance improvements over existing MTL baselines.
zh

[AI-14] MedInsightBench: Evaluating Medical Analytics Agents Through Multi-Step Insight Discovery in Multimodal Medical Data

【速读】:该论文旨在解决当前大型多模态模型(Large Multi-Modal Models, LMMs)在医学数据洞察发现能力上的不足,尤其是缺乏高质量基准来系统评估其在复杂多模态医学图像数据中提取深层次、可操作性见解的能力。现有LMMs在处理多步骤推理和整合专业医学知识方面表现有限,导致其在真实医疗场景中的应用效果不佳。解决方案的关键在于提出MedInsightAgent——一个自动化的智能代理框架,由三个核心模块构成:视觉根因定位器(Visual Root Finder)、分析洞察代理(Analytical Insight Agent)和随访问题生成器(Follow-up Question Composer),通过结构化流程提升模型对医学影像的深度理解与推理能力,从而显著增强LMMs在医学数据洞察发现任务中的性能。

链接: https://arxiv.org/abs/2512.13297
作者: Zhenghao Zhu,Chuxue Cao,Sirui Han,Yuanfeng Song,Xing Chen,Caleb Chen Cao,Yike Guo
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In medical data analysis, extracting deep insights from complex, multi-modal datasets is essential for improving patient care, increasing diagnostic accuracy, and optimizing healthcare operations. However, there is currently a lack of high-quality datasets specifically designed to evaluate the ability of large multi-modal models (LMMs) to discover medical insights. In this paper, we introduce MedInsightBench, the first benchmark that comprises 332 carefully curated medical cases, each annotated with thoughtfully designed insights. This benchmark is intended to evaluate the ability of LMMs and agent frameworks to analyze multi-modal medical image data, including posing relevant questions, interpreting complex findings, and synthesizing actionable insights and recommendations. Our analysis indicates that existing LMMs exhibit limited performance on MedInsightBench, which is primarily attributed to their challenges in extracting multi-step, deep insights and the absence of medical expertise. Therefore, we propose MedInsightAgent, an automated agent framework for medical data analysis, composed of three modules: Visual Root Finder, Analytical Insight Agent, and Follow-up Question Composer. Experiments on MedInsightBench highlight pervasive challenges and demonstrate that MedInsightAgent can improve the performance of general LMMs in medical data insight discovery.
zh

[AI-15] Intrinsic-Motivation Multi-Robot Social Formation Navigation with Coordinated Exploration

【速读】:该论文旨在解决多机器人社交编队导航(social formation navigation)中因行人行为不可预测且常具非合作性而导致的协调探索效率低下的问题。其解决方案的关键在于提出一种新颖的协同探索多机器人强化学习算法,核心创新为引入内在动机驱动的探索机制,通过自学习的内在奖励机制共同缓解策略保守性;同时在集中训练、分散执行框架内采用双采样模式,并结合双时间尺度更新规则解耦参数更新,从而提升导航策略与内在奖励的表征能力。

链接: https://arxiv.org/abs/2512.13293
作者: Hao Fua,Wei Liu,Shuai Zhoua
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper investigates the application of reinforcement learning (RL) to multi-robot social formation navigation, a critical capability for enabling seamless human-robot coexistence. While RL offers a promising paradigm, the inherent unpredictability and often uncooperative dynamics of pedestrian behavior pose substantial challenges, particularly concerning the efficiency of coordinated exploration among robots. To address this, we propose a novel coordinated-exploration multi-robot RL algorithm introducing an intrinsic motivation exploration. Its core component is a self-learning intrinsic reward mechanism designed to collectively alleviate policy conservatism. Moreover, this algorithm incorporates a dual-sampling mode within the centralized training and decentralized execution framework to enhance the representation of both the navigation policy and the intrinsic reward, leveraging a two-time-scale update rule to decouple parameter updates. Empirical results on social formation navigation benchmarks demonstrate the proposed algorithm’s superior performance over existing state-of-the-art methods across crucial metrics. Our code and video demos are available at: this https URL.
zh

[AI-16] Reflective Preference Optimization (RPO): Enhancing On-Policy Alignment via Hint-Guided Reflection

【速读】:该论文旨在解决标准直接偏好优化(Direct Preference Optimization, DPO)在对齐大语言模型和视觉-语言模型时存在的学习信号弱、收敛缓慢且不稳定的问题。其核心原因在于DPO中选择与拒绝的响应均由同一策略生成,导致两者常共享相似错误,KL散度较小,难以提供清晰的偏好信号。解决方案的关键在于提出反射式偏好优化(Reflective Preference Optimization, RPO),通过引入外部模型识别幻觉来源并生成简洁的反思提示(reflective hints),构建具有更强对比性的同策略偏好对,从而增强偏好信号的清晰度和有效性。RPO在理论上通过条件化提示提升期望偏好边际并改善样本效率,同时保持策略分布家族不变;实验证明其能以更少训练样本和迭代次数实现更优对齐效果,显著降低幻觉率并在多模态基准上达到当前最优性能。

链接: https://arxiv.org/abs/2512.13240
作者: Zihui Zhao,Zechang Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Direct Preference Optimization (DPO) has emerged as a lightweight and effective alternative to Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with AI Feedback (RLAIF) for aligning large language and vision-language models. However, the standard DPO formulation, in which both the chosen and rejected responses are generated by the same policy, suffers from a weak learning signal because the two responses often share similar errors and exhibit small Kullback-Leibler (KL) divergence. This leads to slow and unstable convergence. To address this limitation, we introduce Reflective Preference Optimization (RPO), a new framework that incorporates hint-guided reflection into the DPO paradigm. RPO uses external models to identify hallucination sources and generate concise reflective hints, enabling the construction of on-policy preference pairs with stronger contrastiveness and clearer preference signals. We theoretically show that conditioning on hints increases the expected preference margin through mutual information and improves sample efficiency while remaining within the policy distribution family. Empirically, RPO achieves superior alignment with fewer training samples and iterations, substantially reducing hallucination rates and delivering state-of-the-art performance across multimodal benchmarks.
zh

[AI-17] CORE: Contrastive Masked Feature Reconstruction on Graphs

【速读】:该论文旨在解决当前图自监督学习中生成式方法(如掩码特征重建,MFR)与对比式方法(如图对比学习,GCL)虽目标相似但机制迥异、难以协同的问题。其核心挑战在于如何融合二者的优势以提升模型对图结构信息的表征能力。解决方案的关键在于提出一种新型框架——对比掩码特征重建(CORE),该框架通过将原始特征与重建特征构成正样本对,并利用掩码节点作为负样本,实现了MFR的重构能力与GCL的判别能力的有效结合,从而在节点和图分类任务上显著优于现有方法,达到新的性能上限。

链接: https://arxiv.org/abs/2512.13235
作者: Jianyuan Bo,Yuan Fang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In the rapidly evolving field of self-supervised learning on graphs, generative and contrastive methodologies have emerged as two dominant approaches. Our study focuses on masked feature reconstruction (MFR), a generative technique where a model learns to restore the raw features of masked nodes in a self-supervised manner. We observe that both MFR and graph contrastive learning (GCL) aim to maximize agreement between similar elements. Building on this observation, we reveal a novel theoretical insight: under specific conditions, the objectives of MFR and node-level GCL converge, despite their distinct operational mechanisms. This theoretical connection suggests these approaches are complementary rather than fundamentally different, prompting us to explore their integration to enhance self-supervised learning on graphs. Our research presents Contrastive Masked Feature Reconstruction (CORE), a novel graph self-supervised learning framework that integrates contrastive learning into MFR. Specifically, we form positive pairs exclusively between the original and reconstructed features of masked nodes, encouraging the encoder to prioritize contextual information over the node’s own features. Additionally, we leverage the masked nodes themselves as negative samples, combining MFR’s reconstructive power with GCL’s discriminative ability to better capture intrinsic graph structures. Empirically, our proposed framework CORE significantly outperforms MFR across node and graph classification tasks, demonstrating state-of-the-art results. In particular, CORE surpasses GraphMAE and GraphMAE2 by up to 2.80% and 3.72% on node classification tasks, and by up to 3.82% and 3.76% on graph classification tasks.
zh

[AI-18] WAY: Estimation of Vessel Destination in Worldwide AIS Trajectory

【速读】:该论文旨在解决船舶目的地预测问题,其核心挑战在于自动识别系统(AIS)数据的可靠性不足与时间间隔不规则性。为应对这一问题,作者提出了一种差异化方法,将长距离的港口至港口轨迹重构为嵌套序列结构,并利用空间网格(spatial grids)缓解时空偏差的同时保持高分辨率。关键解决方案是引入一种新型深度学习架构WAY,包含轨迹表示层和通道聚合序列处理(CASP)模块:前者从运动学与非运动学特征生成多通道向量序列,后者通过多头通道注意力与自注意力机制实现信息聚合与序列传递;此外,提出任务专用的梯度丢弃(Gradient Dropout, GD)技术,通过基于样本长度随机阻断梯度流,支持单标签的多对多训练,避免反馈偏差激增。实验表明,WAY在五年AIS数据上显著优于传统空间网格方法,且GD进一步提升性能,同时展示了通过多任务学习进行预计到达时间(ETA)估计的现实应用潜力。

链接: https://arxiv.org/abs/2512.13190
作者: Jin Sob Kim,Hyun Joon Park,Wooseok Shin,Dongil Park,Sung Won Han
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to IEEE Transactions on Aerospace and Electronic Systems (TAES)

点击查看摘要

Abstract:The Automatic Identification System (AIS) enables data-driven maritime surveillance but suffers from reliability issues and irregular intervals. We address vessel destination estimation using global-scope AIS data by proposing a differentiated approach that recasts long port-to-port trajectories as a nested sequence structure. Using spatial grids, this method mitigates spatio-temporal bias while preserving detailed resolution. We introduce a novel deep learning architecture, WAY, designed to process these reformulated trajectories for long-term destination estimation days to weeks in advance. WAY comprises a trajectory representation layer and Channel-Aggregative Sequential Processing (CASP) blocks. The representation layer generates multi-channel vector sequences from kinematic and non-kinematic features. CASP blocks utilize multi-headed channel- and self-attention for aggregation and sequential information delivery. Additionally, we propose a task-specialized Gradient Dropout (GD) technique to enable many-to-many training on single labels, preventing biased feedback surges by stochastically blocking gradient flow based on sample length. Experiments on 5-year AIS data demonstrate WAY’s superiority over conventional spatial grid-based approaches regardless of trajectory progression. Results further confirm that adopting GD leads to performance gains. Finally, we explore WAY’s potential for real-world application through multitask learning for ETA estimation.
zh

[AI-19] PolySet: Restoring the Statistical Ensemble Nature of Polymers for Machine Learning

【速读】:该论文旨在解决当前机器学习(Machine Learning, ML)模型在聚合物科学中对聚合物物理本质描述不足的问题,即现有模型将聚合物视为单一、理想化的分子图,而忽略了真实材料中聚合物链长度呈分布状态的统计特性。这一数字表示与物理现实之间的不匹配限制了模型对聚合物行为的准确捕捉。其解决方案的关键在于提出PolySet框架,该框架将聚合物表示为从假设的数均摩尔质量分布中采样的有限加权链集合(ensemble),从而显式引入聚合物的统计特性;这种基于集合的编码方式独立于化学细节,兼容任意分子表示形式,并能保留高阶分布矩(如Mz、Mz+1),使ML模型能够更稳定、准确地学习对聚合物链端部敏感的性质,为聚合物机器学习提供了物理基础并具备扩展至共聚物、嵌段结构等复杂拓扑的能力。

链接: https://arxiv.org/abs/2512.13186
作者: Khalid Ferji
机构: 未知
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Machine-learning (ML) models in polymer science typically treat a polymer as a single, perfectly defined molecular graph, even though real materials consist of stochastic ensembles of chains with distributed lengths. This mismatch between physical reality and digital representation limits the ability of current models to capture polymer behaviour. Here we introduce PolySet, a framework that represents a polymer as a finite, weighted ensemble of chains sampled from an assumed molar-mass distribution. This ensemble-based encoding is independent of chemical detail, compatible with any molecular representation and illustrated here in the homopolymer case using a minimal language model. We show that PolySet retains higher-order distributional moments (such as Mz, Mz+1), enabling ML models to learn tail-sensitive properties with greatly improved stability and accuracy. By explicitly acknowledging the statistical nature of polymer matter, PolySet establishes a physically grounded foundation for future polymer machine learning, naturally extensible to copolymers, block architectures, and other complex topologies.
zh

[AI-20] Finch: Benchmarking Finance Accounting across Spreadsheet-Centric Enterprise Workflows

【速读】:该论文旨在解决当前AI代理在评估真实企业级财务会计工作流时存在的局限性,即现有基准测试往往缺乏对复杂、多步骤、跨文件协作和知识密集型任务的充分刻画。为应对这一问题,作者提出了Finch基准,其核心解决方案是构建一个基于真实企业环境(如Enron及多家金融机构)的高质量、高复杂度的工作流数据集,涵盖从数据录入到报告生成的完整链条,包含172个复合工作流、384项任务以及超过1700张电子表格(共2700万单元格)和多种模态文档。关键创新在于采用“LLM辅助发现+专家标注”的双阶段流程:首先利用大语言模型(LLM)从原始邮件和电子表格版本历史中自动推导出工作流逻辑,再由领域专家进行严格验证与注释,总耗时超700小时,从而确保了工作流的真实性、多样性与挑战性。此方法显著提升了AI代理在现实场景中的评估可信度,并揭示了当前主流模型(如GPT-5.1 Pro)在长周期、多步骤任务中的性能瓶颈。

链接: https://arxiv.org/abs/2512.13168
作者: Haoyu Dong,Pengkun Zhang,Yan Gao,Xuanyu Dong,Yilin Cheng,Mingzhe Lu,Adina Yakefu,Shuxin Zheng
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:We introduce a finance accounting benchmark (Finch) for evaluating AI agents on real-world, enterprise-grade professional workflows – interleaving data entry, structuring, formatting, web search, cross-file retrieval, calculation, modeling, validation, translation, visualization, and reporting. Finch is sourced from authentic enterprise workspaces at Enron (15,000 spreadsheets and 500,000 emails from 150 employees) and other financial institutions, preserving in-the-wild messiness across multimodal artifacts (text, tables, formulas, charts, code, and images) and spanning diverse domains such as budgeting, trading, and asset management. We propose a workflow construction process that combines LLM-assisted discovery with expert annotation: (1) LLM-assisted, expert-verified derivation of workflows from real-world email threads and version histories of spreadsheet files, and (2) meticulous expert annotation for workflows, requiring over 700 hours of domain-expert effort. This yields 172 composite workflows with 384 tasks, involving 1,710 spreadsheets with 27 million cells, along with PDFs and other artifacts, capturing the intrinsically messy, long-horizon, knowledge-intensive, and collaborative nature of real-world enterprise work. We conduct both human and automated evaluations of frontier AI systems including GPT 5.1, Claude Sonnet 4.5, Gemini 3 Pro, Grok 4, and Qwen 3 Max, and GPT 5.1 Pro spends 48 hours in total yet passes only 38.4% of workflows, while Claude Sonnet 4.5 passes just 25.0%. Comprehensive case studies further surface the challenges that real-world enterprise workflows pose for AI agents. Subjects: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Information Retrieval (cs.IR); Multiagent Systems (cs.MA) Cite as: arXiv:2512.13168 [cs.AI] (or arXiv:2512.13168v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2512.13168 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-21] SACn: Soft Actor-Critic with n-step Returns

【速读】:该论文旨在解决Soft Actor-Critic (SAC) 与n-step returns结合时引入的偏差问题,该偏差源于off-policy算法中动作分布的变化。传统方法依赖重要性采样(importance sampling)来修正偏差,但易导致数值不稳定。论文的关键解决方案是提出一种数值稳定的改进型重要性采样方法,并简化超参数选择过程;同时,在n-step最大熵框架下重新分析了SAC的熵估计策略,提出τ-采样熵估计以降低学习目标的方差。最终构建了SAC n算法并在MuJoCo仿真环境中进行了实验验证。

链接: https://arxiv.org/abs/2512.13165
作者: Jakub Łyskawa,Jakub Lewandowski,Paweł Wawrzyński
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at ICAART 2026

点击查看摘要

Abstract:Soft Actor-Critic (SAC) is widely used in practical applications and is now one of the most relevant off-policy online model-free reinforcement learning (RL) methods. The technique of n-step returns is known to increase the convergence speed of RL algorithms compared to their 1-step returns-based versions. However, SAC is notoriously difficult to combine with n-step returns, since their usual combination introduces bias in off-policy algorithms due to the changes in action distribution. While this problem is solved by importance sampling, a method for estimating expected values of one distribution using samples from another distribution, importance sampling may result in numerical instability. In this work, we combine SAC with n-step returns in a way that overcomes this issue. We present an approach to applying numerically stable importance sampling with simplified hyperparameter selection. Furthermore, we analyze the entropy estimation approach of Soft Actor-Critic in the context of the n-step maximum entropy framework and formulate the \tau -sampled entropy estimation to reduce the variance of the learning target. Finally, we formulate the Soft Actor-Critic with n-step returns (SAC n ) algorithm that we experimentally verify on MuJoCo simulated environments.
zh

[AI-22] Can AI Understand What We Cannot Say? Measuring Multilevel Alignment Through Abortion Stigma Across Cognitive Interpersonal and Structural Levels

【速读】:该论文试图解决当前大语言模型(Large Language Models, LLMs)在处理具有高度社会敏感性的健康决策(如堕胎)时,是否具备对复杂心理与生理现象的多层协同理解能力的问题。其核心发现是:尽管LLMs能生成符合语法规则的语言,但在认知层(自我评判)、人际层(预期评判与孤立感)和结构层(社区谴责与披露模式)上均未能展现出一致且准确的堕胎污名(abortion stigma)表征,表现为高估人际污名、低估认知污名、忽视污名与保密之间的实证关系,并引入人类验证数据中不存在的人口统计学偏差。解决方案的关键在于构建面向多层一致性(multilevel coherence)的设计标准、建立持续审计机制(continuous auditing)以评估模型在高风险场景下的理解深度,并通过强制性审计、问责制及部署限制等治理手段确保AI安全,从而避免因误解“人们无法言说”的心理状态而导致支持无效甚至有害的结果。

链接: https://arxiv.org/abs/2512.13142
作者: Anika Sharma,Malavika Mampally,Chidaksh Ravuru,Kandyce Brennan,Neil Gaikwad
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:As large language models increasingly mediate stigmatized health decisions, their capacity to genuinely understand complex psychological and physiological phenomena remains poorly evaluated. Can AI understand what we cannot say? We investigate whether LLMs coherently represent abortion stigma across the cognitive, interpersonal, and structural levels where it operates. We systematically tested 627 demographically diverse personas across five leading LLMs using the validated Individual Level Abortion Stigma Scale (ILAS). Our multilevel analysis examined whether models coherently represent stigma at the cognitive level (self-judgment), interpersonal level (anticipated judgment and isolation), and structural level (community condemnation and disclosure patterns), as well as overall stigma. Models fail tests of genuine understanding across all levels. They overestimate interpersonal stigma while underestimating cognitive stigma, assume uniform community condemnation, introduce demographic biases absent from human validation data, miss the empirically validated stigma-secrecy relationship, and contradict themselves within theoretical constructs. These patterns reveal that current alignment approaches ensure appropriate language but not coherent multilevel understanding. This work provides empirical evidence that current LLMs lack coherent multilevel understanding of psychological and physiological constructs. AI safety in high-stakes contexts demands new approaches to design (multilevel coherence), evaluation (continuous auditing), governance and regulation (mandatory audits, accountability, deployment restrictions), and AI literacy in domains where understanding what people cannot say determines whether support helps or harms.
zh

[AI-23] From Overfitting to Reliability: Introducing the Hierarchical Approximate Bayesian Neural Network

【速读】:该论文旨在解决神经网络在实际应用中面临的超参数调优困难和过拟合问题,尤其是在分布外(out-of-distribution)数据上预测可靠性不足的挑战。其解决方案的关键在于提出一种分层近似贝叶斯神经网络(Hierarchical Approximate Bayesian Neural Network, HABNN),通过引入高斯-逆威沙特分布(Gaussian-inverse-Wishart distribution)作为权重的超先验(hyperprior),在模型中显式建模不确定性,从而提升鲁棒性和性能。该方法提供了预测分布和权重后验的解析表达式,计算复杂度与权重数量呈线性关系,且结果以学生t分布(Student’s t-distribution)的形式闭式表示,有效缓解了过拟合并提供可靠的不确定性估计,实验表明其性能可媲美甚至超越当前最先进模型,在安全关键场景中具有重要应用前景。

链接: https://arxiv.org/abs/2512.13111
作者: Hayk Amirkhanian,Marco F. Huber
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages main body, 1 Figure, 15 pages Appendix

点击查看摘要

Abstract:In recent years, neural networks have revolutionized various domains, yet challenges such as hyperparameter tuning and overfitting remain significant hurdles. Bayesian neural networks offer a framework to address these challenges by incorporating uncertainty directly into the model, yielding more reliable predictions, particularly for out-of-distribution data. This paper presents Hierarchical Approximate Bayesian Neural Network, a novel approach that uses a Gaussian-inverse-Wishart distribution as a hyperprior of the network’s weights to increase both the robustness and performance of the model. We provide analytical representations for the predictive distribution and weight posterior, which amount to the calculation of the parameters of Student’s t-distributions in closed form with linear complexity with respect to the number of weights. Our method demonstrates robust performance, effectively addressing issues of overfitting and providing reliable uncertainty estimates, particularly for out-of-distribution tasks. Experimental results indicate that HABNN not only matches but often outperforms state-of-the-art models, suggesting a promising direction for future applications in safety-critical environments.
zh

[AI-24] raPO: A Semi-Supervised Reinforcement Learning Framework for Boosting LLM Reasoning

【速读】:该论文旨在解决强化学习中可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)方法在训练大模型时面临的高标注成本问题,以及现有无监督RLVR方法因缺乏外部监督而导致的后期训练中模型坍塌(model collapse)问题。其解决方案的关键在于提出一种新的半监督RLVR范式,通过引入少量标注数据来引导对大量未标注样本的训练,确保仅基于标注样本验证过的推理模式被纳入策略优化过程。技术上,论文提出了TraPO算法,该算法通过匹配未标注样本与标注样本的学习轨迹相似性来识别可靠样本,从而实现高效且稳定的策略优化,在多个数学推理基准和分布外任务上均展现出卓越的数据效率和泛化能力。

链接: https://arxiv.org/abs/2512.13106
作者: Shenzhi Yang,Guangcheng Zhu,Xing Zheng,Yingfan MA,Zhongqi Chen,Bowen Song,Weiqiang Wang,Junbo Zhao,Gang Chen,Haobo Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement learning with verifiable rewards (RLVR) has proven effective in training large reasoning models (LRMs) by leveraging answer-verifiable signals to guide policy optimization, which, however, suffers from high annotation costs. To alleviate this problem, recent work has explored unsupervised RLVR methods that derive rewards solely from the model’s internal consistency, such as through entropy and majority voting. While seemingly promising, these methods often suffer from model collapse in the later stages of training, which may arise from the reinforcement of incorrect reasoning patterns in the absence of external supervision. In this work, we investigate a novel semi-supervised RLVR paradigm that utilizes a small labeled set to guide RLVR training on unlabeled samples. Our key insight is that supervised rewards are essential for stabilizing consistency-based training on unlabeled samples, ensuring that only reasoning patterns verified on labeled instances are incorporated into RL training. Technically, we propose an effective policy optimization algorithm, TraPO, that identifies reliable unlabeled samples by matching their learning trajectory similarity to labeled ones. Building on this, TraPO achieves remarkable data efficiency and strong generalization on six widely used mathematical reasoning benchmarks (AIME24/25, AMC, MATH-500, Minerva, and Olympiad) and three out-of-distribution tasks (ARC-c, GPQA-diamond, and MMLU-pro). With only 1K labeled and 3K unlabeled samples, TraPO reaches 42.6% average accuracy, surpassing the best unsupervised method trained on 45K unlabeled samples (38.3%). Notably, when using 4K labeled and 12K unlabeled samples, TraPO even outperforms the fully supervised model trained on the full 45K labeled samples on all benchmarks, while using only 10% of the labeled data. The code is available via this https URL.
zh

[AI-25] Socratic Students: Teaching Language Models to Learn by Asking Questions

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在动态交互场景中表现不足的问题,即当关键信息未编码于模型参数中时,模型无法主动获取知识。传统方法聚焦于教师指导学生(teacher-to-student),而本文转向学生主导的主动查询策略(student-led querying),通过识别自身不确定性、提出针对性问题并高效吸收新知识来提升学习效果。解决方案的关键在于:1)设计学生驱动的交互机制以实现动态知识获取;2)利用直接偏好优化(Direct Preference Optimization, DPO)结合自监督或更强学生模型的反馈,训练学生模型生成高质量问题,从而显著提升小模型的学习效率与性能,在数学和编程基准测试中相较静态基线绝对提升Pass@k至少0.5。

链接: https://arxiv.org/abs/2512.13102
作者: Rajeev Bhatt Ambati,Tianyi Niu,Aashu Singh,Shlok Mishra,Shashank Srivastava,Snigdha Chaturvedi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) excel at static interactions, where they answer user queries by retrieving knowledge encoded in their parameters. However, in many real-world settings, such as educational tutoring or medical assistance, relevant information is not directly available and must be actively acquired through dynamic interactions. An interactive agent would recognize its own uncertainty, ask targeted questions, and retain new knowledge efficiently. Prior work has primarily explored effective ways for a teacher to instruct the student, where the teacher identifies student gaps and provides guidance. In this work, we shift the focus to the student and investigate effective strategies to actively query the teacher in seeking useful information. Across math and coding benchmarks, where baseline student models begin with near-zero performance, we show that student-led approaches consistently yield absolute Pass@k improvements of at least 0.5 over static baselines. To improve question quality, we train students using Direct Preference Optimization (DPO) with guidance from either self or stronger students. We find that this guided training enables smaller models to learn how to ask better questions, further enhancing learning efficiency.
zh

[AI-26] OXE-AugE: A Large-Scale Robot Augmentation of OXE for Scaling Cross-Embodiment Policy Learning

【速读】:该论文旨在解决通用机器人策略(generalist robot policies)在跨硬件平台迁移时因训练数据分布不均衡而导致的过拟合问题。现有数据集如Open X-Embodiment(OXE)高度依赖少数几种机器人类型(占85%以上),限制了策略在新机器人形态(robot embodiment)上的泛化能力。解决方案的关键在于提出AugE-Toolkit这一可扩展的机器人增强流水线,通过引入9种不同机械臂和夹爪组合对OXE进行数据增强,构建出规模更大、多样性更高的OXE-AugE数据集(超440万条轨迹),从而提升模型在未见机器人配置及原始机器人在分布偏移场景下的性能表现。

链接: https://arxiv.org/abs/2512.13100
作者: Guanhua Ji,Harsha Polavaram,Lawrence Yunliang Chen,Sandeep Bajamahal,Zehan Ma,Simeon Adebola,Chenfeng Xu,Ken Goldberg
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large and diverse datasets are needed for training generalist robot policies that have potential to control a variety of robot embodiments – robot arm and gripper combinations – across diverse tasks and environments. As re-collecting demonstrations and retraining for each new hardware platform are prohibitively costly, we show that existing robot data can be augmented for transfer and generalization. The Open X-Embodiment (OXE) dataset, which aggregates demonstrations from over 60 robot datasets, has been widely used as the foundation for training generalist policies. However, it is highly imbalanced: the top four robot types account for over 85% of its real data, which risks overfitting to robot–scene combinations. We present AugE-Toolkit, a scalable robot augmentation pipeline, and OXE-AugE, a high-quality open-source dataset that augments OXE with 9 different robot embodiments. OXE-AugE provides over 4.4 million trajectories, more than triple the size of the original OXE. We conduct a systematic study of how scaling robot augmentation impacts cross-embodiment learning. Results suggest that augmenting datasets with diverse arms and grippers improves policy performance not only on the augmented robots, but also on unseen robots and even the original robots under distribution shifts. In physical experiments, we demonstrate that state-of-the-art generalist policies such as OpenVLA and \pi_0 benefit from fine-tuning on OXE-AugE, improving success rates by 24-45% on previously unseen robot–gripper combinations across four real-world manipulation tasks. Project website: this https URL.
zh

[AI-27] Sequence of Expert: Boosting Imitation Planners for Autonomous Driving through Temporal Alternation

【速读】:该论文旨在解决模仿学习(Imitation Learning, IL)在自动驾驶闭环控制场景下性能下降的问题,其根本原因在于小误差在连续规划周期中逐次累积,导致严重偏差。现有研究多依赖更复杂的网络结构或高保真数据集来提升模型在单一时点的状态鲁棒性,但忽视了自动驾驶作为连续时间过程的特性。解决方案的关键在于提出一种称为“专家序列”(Sequence of Experts, SoE)的时间交替策略,通过利用时间尺度上的动态变化来增强模型的闭环鲁棒性,无需增加模型规模或数据量即可显著提升各类IL模型在nuPlan大规模基准上的表现,实现更高效且通用的训练优化。

链接: https://arxiv.org/abs/2512.13094
作者: Xiang Li,Gang Liu,Weitao Zhou,Hongyi Zhu,Zhong Cao
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Imitation learning (IL) has emerged as a central paradigm in autonomous driving. While IL excels in matching expert behavior in open-loop settings by minimizing per-step prediction errors, its performance degrades unexpectedly in closed-loop due to the gradual accumulation of small, often imperceptible errors over this http URL successive planning cycles, these errors compound, potentially resulting in severe this http URL research efforts predominantly rely on increasingly sophisticated network architectures or high-fidelity training datasets to enhance the robustness of IL planners against error accumulation, focusing on the state-level robustness at a single time point. However, autonomous driving is inherently a continuous-time process, and leveraging the temporal scale to enhance robustness may provide a new perspective for addressing this this http URL this end, we propose a method termed Sequence of Experts (SoE), a temporal alternation policy that enhances closed-loop performance without increasing model size or data requirements. Our experiments on large-scale autonomous driving benchmarks nuPlan demonstrate that SoE method consistently and significantly improves the performance of all the evaluated models, and achieves state-of-the-art this http URL module may provide a key and widely applicable support for improving the training efficiency of autonomous driving models.
zh

[AI-28] A Simple and Effective Framework for Symmetric Consistent Indexing in Large-Scale Dense Retrieval

【速读】:该论文旨在解决双塔(dual-tower)编码架构在密集检索(dense retrieval)中面临的代表性空间错位(representational space misalignment)和检索索引不一致(retrieval index inconsistency)问题,这些问题会降低匹配精度、检索稳定性以及长尾查询的表现,进而限制下游生成式推荐(generative recommendation)模型的性能上限。解决方案的关键在于提出一个名为SCI的轻量级框架,其核心是两个协同模块:一是对称表示对齐模块(symmetric representation alignment module),通过创新的输入交换机制在不引入额外参数的情况下统一双塔表示空间;二是双塔协同的一致性索引模块(consistent indexing with dual-tower synergy module),采用双视角索引策略重构检索路径,确保训练到推理阶段的一致性。该框架具备理论保障、工程友好性,并支持百亿规模部署。

链接: https://arxiv.org/abs/2512.13074
作者: Huimu Wang,Yiming Qiu,Xingzhi Yao,Zhiguo Chen,Guoyu Tang,Songlin Wang,Sulong Xu,Mingming Li
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Dense retrieval has become the industry standard in large-scale information retrieval systems due to its high efficiency and competitive accuracy. Its core relies on a coarse-to-fine hierarchical architecture that enables rapid candidate selection and precise semantic matching, achieving millisecond-level response over billion-scale corpora. This capability makes it essential not only in traditional search and recommendation scenarios but also in the emerging paradigm of generative recommendation driven by large language models, where semantic IDs-themselves a form of coarse-to-fine representation-play a foundational role. However, the widely adopted dual-tower encoding architecture introduces inherent challenges, primarily representational space misalignment and retrieval index inconsistency, which degrade matching accuracy, retrieval stability, and performance on long-tail queries. These issues are further magnified in semantic ID generation, ultimately limiting the performance ceiling of downstream generative models. To address these challenges, this paper proposes a simple and effective framework named SCI comprising two synergistic modules: a symmetric representation alignment module that employs an innovative input-swapping mechanism to unify the dual-tower representation space without adding parameters, and an consistent indexing with dual-tower synergy module that redesigns retrieval paths using a dual-view indexing strategy to maintain consistency from training to inference. The framework is systematic, lightweight, and engineering-friendly, requiring minimal overhead while fully supporting billion-scale deployment. We provide theoretical guarantees for our approach, with its effectiveness validated by results across public datasets and real-world e-commerce datasets. Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI) Cite as: arXiv:2512.13074 [cs.IR] (or arXiv:2512.13074v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2512.13074 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-29] owards Open Standards for Systemic Complexity in Digital Forensics

【速读】:该论文旨在解决数字取证(Digital Forensics, DF)领域中因系统复杂性导致的错误和脆弱性问题,尤其是在人工智能(Artificial Intelligence, AI)技术广泛应用背景下,传统DF方法难以避免的误判与不可靠性。解决方案的关键在于引入可读的人类可理解的工件(human-readable artifacts)和开放标准(open standards),以此降低系统复杂性并提升DF流程的透明度、可验证性和可靠性,从而构建一个基于当前技术前沿的DF AI模型框架。

链接: https://arxiv.org/abs/2512.12970
作者: Paola Di Maio
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:The intersection of artificial intelligence (AI) and digital forensics (DF) is becoming increasingly complex, ubiquitous, and pervasive, with overlapping techniques and technologies being adopted in all types of scientific and technical inquiry. Despite incredible advances, forensic sciences are not exempt from errors and remain vulnerable to fallibility. To mitigate the limitations of errors in DF, the systemic complexity is identified and addressed with the adoption of human-readable artifacts and open standards. A DF AI model schema based on the state of the art is outlined.
zh

[AI-30] Investigating Data Pruning for Pretraining Biological Foundation Models at Scale AAAI2026

【速读】:该论文旨在解决生物基础模型(Biological Foundation Models, BioFMs)预训练过程中因依赖数百万至数十亿条生物序列及数十亿参数而导致的计算成本高昂、复现困难和可及性差的问题。其解决方案的关键在于提出一种面向生物领域的事后影响引导的数据剪枝框架,通过引入基于子集的自影响估计方法,实现对样本重要性的高效低成本估算,并在此基础上设计了两种简单但有效的选择策略:Top-k Influence(Top I)和Coverage-Centric Influence(CCI)。实验表明,该方法在极端剪枝率(超过99%)下仍能保持优异性能,且所生成的核心数据集(coreset)在RNA与蛋白质任务中均优于随机选取且规模大十倍的子集,揭示了生物序列数据集中存在显著冗余,从而为降低BioFM预训练成本、提升生物AI研究的效率、可访问性和可持续性提供了有效路径。

链接: https://arxiv.org/abs/2512.12932
作者: Yifan Wu,Jiyue Jiang,Xichen Ye,Yiqi Wang,Chang Zhou,Yitao Xu,Jiayang Chen,He Hu,Weizhong Zhang,Cheng Jin,Jiao Yuan,Yu Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注: Accepted by AAAI 2026

点击查看摘要

Abstract:Biological foundation models (BioFMs), pretrained on large-scale biological sequences, have recently shown strong potential in providing meaningful representations for diverse downstream bioinformatics tasks. However, such models often rely on millions to billions of training sequences and billions of parameters, resulting in prohibitive computational costs and significant barriers to reproducibility and accessibility, particularly for academic labs. To address these challenges, we investigate the feasibility of data pruning for BioFM pretraining and propose a post-hoc influence-guided data pruning framework tailored to biological domains. Our approach introduces a subset-based self-influence formulation that enables efficient estimation of sample importance at low computational cost, and builds upon it two simple yet effective selection strategies, namely Top-k Influence (Top I) and Coverage-Centric Influence (CCI). We empirically validate our method on two representative BioFMs, RNA-FM and ESM-C. For RNA, our framework consistently outperforms random selection baselines under an extreme pruning rate of over 99 percent, demonstrating its effectiveness. Furthermore, we show the generalizability of our framework on protein-related tasks using ESM-C. In particular, our coreset even outperforms random subsets that are ten times larger in both RNA and protein settings, revealing substantial redundancy in biological sequence datasets. These findings underscore the potential of influence-guided data pruning to substantially reduce the computational cost of BioFM pretraining, paving the way for more efficient, accessible, and sustainable biological AI research.
zh

[AI-31] Cisco Integrated AI Security and Safety Framework Report

【速读】:该论文旨在解决当前人工智能(AI)系统在快速部署过程中面临的安全与安全风险碎片化、缺乏统一治理框架的问题。现有框架如MITRE ATLAS、NIST AI 100-2 AML分类法及OWASP针对大语言模型(LLM)和代理型AI应用的Top 10清单,仅覆盖了AI风险的部分维度,无法全面应对从内容安全失效到生态系统滥用等多维威胁。为此,论文提出Cisco的集成式AI安全与安全框架(AI Security Framework),其关键在于构建一个生命周期感知的统一分类体系与可操作化框架,能够整合跨模态、代理、流水线及生态系统的AI风险,并支持威胁识别、红队测试、风险优先级排序等功能,具备可扩展性以适应多模态、人形机器人、可穿戴设备等新兴场景,从而为组织提供结构化方法来理解AI系统失败机制、对手利用路径,并建立随能力演进而动态增强的防御体系。

链接: https://arxiv.org/abs/2512.12921
作者: Amy Chang,Tiffany Saade,Sanket Mendapara,Adam Swanda,Ankit Garg
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Artificial intelligence (AI) systems are being readily and rapidly adopted, increasingly permeating critical domains: from consumer platforms and enterprise software to networked systems with embedded agents. While this has unlocked potential for human productivity gains, the attack surface has expanded accordingly: threats now span content safety failures (e.g., harmful or deceptive outputs), model and data integrity compromise (e.g., poisoning, supply-chain tampering), runtime manipulations (e.g., prompt injection, tool and agent misuse), and ecosystem risks (e.g., orchestration abuse, multi-agent collusion). Existing frameworks such as MITRE ATLAS, National Institute of Standards and Technology (NIST) AI 100-2 Adversarial Machine Learning (AML) taxonomy, and OWASP Top 10s for Large Language Models (LLMs) and Agentic AI Applications provide valuable viewpoints, but each covers only slices of this multi-dimensional space. This paper presents Cisco’s Integrated AI Security and Safety Framework (“AI Security Framework”), a unified, lifecycle-aware taxonomy and operationalization framework that can be used to classify, integrate, and operationalize the full range of AI risks. It integrates AI security and AI safety across modalities, agents, pipelines, and the broader ecosystem. The AI Security Framework is designed to be practical for threat identification, red-teaming, risk prioritization, and it is comprehensive in scope and can be extensible to emerging deployments in multimodal contexts, humanoids, wearables, and sensory infrastructures. We analyze gaps in prevailing frameworks, discuss design principles for our framework, and demonstrate how the taxonomy provides structure for understanding how modern AI systems fail, how adversaries exploit these failures, and how organizations can build defenses across the AI lifecycle that evolve alongside capability advancements. Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI) Cite as: arXiv:2512.12921 [cs.CR] (or arXiv:2512.12921v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2512.12921 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-32] Satisfiability Modulo Theory Meets Inductive Logic Programming

【速读】:该论文旨在解决传统归纳逻辑编程(Inductive Logic Programming, ILP)在处理数值约束方面的局限性,尤其是其难以自动推导出跨样本一致的阈值或算术关系的问题。经典ILP系统依赖离散谓词和手动设计的数值谓词,导致无法有效学习数值参数与符号规则的联合推理。解决方案的关键在于提出一种模块化架构,将ILP系统PyGol与SMT求解器Z3相结合:PyGol生成候选子句,这些子句被解释为背景理论(如线性或非线性实数算术)上的无量词公式,由Z3负责数值参数的实例化与验证,从而在保持ILP的声明式关系偏置的同时,支持混合规则的诱导,包括阈值、区间及多文字算术关系。这一方法显著扩展了符号规则学习的表达能力,并为未来面向更复杂理论感知归纳的学习提供了灵活基础。

链接: https://arxiv.org/abs/2512.12918
作者: Nijesh Upreti,Vaishak Belle
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Inductive Logic Programming (ILP) provides interpretable rule learning in relational domains, yet remains limited in its ability to induce and reason with numerical constraints. Classical ILP systems operate over discrete predicates and typically rely on discretisation or hand-crafted numerical predicates, making it difficult to infer thresholds or arithmetic relations that must hold jointly across examples. Recent work has begun to address these limitations through tighter integrations of ILP with Satisfiability Modulo Theories (SMT) or specialised numerical inference mechanisms. In this paper we investigate a modular alternative that couples the ILP system PyGol with the SMT solver Z3. Candidate clauses proposed by PyGol are interpreted as quantifier-free formulas over background theories such as linear or nonlinear real arithmetic, allowing numerical parameters to be instantiated and verified by the SMT solver while preserving ILP’s declarative relational bias. This supports the induction of hybrid rules that combine symbolic predicates with learned numerical constraints, including thresholds, intervals, and multi-literal arithmetic relations. We formalise this SMT-ILP setting and evaluate it on a suite of synthetic datasets designed to probe linear, relational, nonlinear, and multi-hop reasoning. The results illustrate how a modular SMT-ILP architecture can extend the expressivity of symbolic rule learning, complementing prior numerical ILP approaches while providing a flexible basis for future extensions toward richer theory-aware induction.
zh

[AI-33] CTIGuardian: A Few-Shot Framework for Mitigating Privacy Leakage in Fine-Tuned LLM s ACSA

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在特定领域(如网络威胁情报,Cyber Threat Intelligence, CTI)微调后可能无意泄露训练数据中敏感信息的问题。传统方法通过重新训练整个模型来消除此类泄露风险,但计算成本过高且不切实际。其解决方案的关键在于提出一种“隐私对齐”(privacy alignment)机制,受安全对齐(safety alignment)启发,利用少量示例监督实现隐私保护:系统集成一个由同一底层LLM驱动的隐私分类器(privacy classifier)和隐私擦除器(privacy redactor),在不显著牺牲任务性能的前提下有效抑制敏感信息泄露。实验表明,该框架在CTI场景下优于基于命名实体识别(Named Entity Recognition, NER)的基线方法,具备良好的隐私-效用权衡能力,并具有跨敏感领域的通用性。

链接: https://arxiv.org/abs/2512.12914
作者: Shashie Dilhara Batan Arachchige,Benjamin Zi Hao Zhao,Hassan Jameel Asghar,Dinusha Vatsalan,Dali Kaafar
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at the 18th Cybersecurity Experimentation and Test Workshop (CSET), in conjunction with ACSAC 2025

点击查看摘要

Abstract:Large Language Models (LLMs) are often fine-tuned to adapt their general-purpose knowledge to specific tasks and domains such as cyber threat intelligence (CTI). Fine-tuning is mostly done through proprietary datasets that may contain sensitive information. Owners expect their fine-tuned model to not inadvertently leak this information to potentially adversarial end users. Using CTI as a use case, we demonstrate that data-extraction attacks can recover sensitive information from fine-tuned models on CTI reports, underscoring the need for mitigation. Retraining the full model to eliminate this leakage is computationally expensive and impractical. We propose an alternative approach, which we call privacy alignment, inspired by safety alignment in LLMs. Just like safety alignment teaches the model to abide by safety constraints through a few examples, we enforce privacy alignment through few-shot supervision, integrating a privacy classifier and a privacy redactor, both handled by the same underlying LLM. We evaluate our system, called CTIGuardian, using GPT-4o mini and Mistral-7B Instruct models, benchmarking against Presidio, a named entity recognition (NER) baseline. Results show that CTIGuardian provides a better privacy-utility trade-off than NER based models. While we demonstrate its effectiveness on a CTI use case, the framework is generic enough to be applicable to other sensitive domains.
zh

[AI-34] Optimal Labeler Assignment and Sampling for Active Learning in the Presence of Imperfect Labels

【速读】:该论文旨在解决主动学习(Active Learning, AL)中因标注者(oracle)准确性不一而导致的标签噪声问题,尤其关注复杂样本更容易被错误标注的现象。解决方案的关键在于提出一个新颖的AL框架:首先设计了一个分配模型(assignment model),在每轮查询中将样本最优分配给不同标注者,以最小化每轮可能产生的最大噪声;其次引入一种新的采样方法,用于识别最具信息量且受标签噪声影响最小的查询点,从而提升分类器的鲁棒性与性能。实验表明,该方法显著优于多个基准方法。

链接: https://arxiv.org/abs/2512.12870
作者: Pouya Ahadi,Blair Winograd,Camille Zaug,Karunesh Arora,Lijun Wang,Kamran Paynabar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注: 22 pages, 6 figures. Preprint under review

点击查看摘要

Abstract:Active Learning (AL) has garnered significant interest across various application domains where labeling training data is costly. AL provides a framework that helps practitioners query informative samples for annotation by oracles (labelers). However, these labels often contain noise due to varying levels of labeler accuracy. Additionally, uncertain samples are more prone to receiving incorrect labels because of their complexity. Learning from imperfectly labeled data leads to an inaccurate classifier. We propose a novel AL framework to construct a robust classification model by minimizing noise levels. Our approach includes an assignment model that optimally assigns query points to labelers, aiming to minimize the maximum possible noise within each cycle. Additionally, we introduce a new sampling method to identify the best query points, reducing the impact of label noise on classifier performance. Our experiments demonstrate that our approach significantly improves classification performance compared to several benchmark methods.
zh

[AI-35] Information-Consistent Language Model Recommendations through Group Relative Policy Optimization

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在企业级应用场景中因提示词(prompt)语义相同但表述微调而导致输出不一致的问题,这种不一致性会削弱用户信任、影响合规性并破坏用户体验。解决方案的关键在于提出一种基于组相对策略优化(Group Relative Policy Optimization, GRPO)的强化学习框架,通过设计基于熵的有用性与稳定性奖励机制,将语义等价的提示词视为一组,并重置对话上下文以隔离表述差异的影响,从而直接优化模型在不同提示变体下保持信息内容稳定性的能力。实验表明,该方法在投资和职位推荐任务中显著优于微调和解码基线方法,实现了对LLM输出一致性的有效提升。

链接: https://arxiv.org/abs/2512.12858
作者: Sonal Prabhune,Balaji Padmanabhan,Kaushik Dutta
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly deployed in business-critical domains such as finance, education, healthcare, and customer support, where users expect consistent and reliable recommendations. Yet LLMs often exhibit variability when prompts are phrased with minor differences, even when semantically equivalent. Such inconsistency undermines trust, complicates compliance, and disrupts user experience. While personalization is desirable in certain contexts, many enterprise scenarios-such as HR onboarding, customer support, or policy disclosure-require invariant information delivery regardless of phrasing or prior conversational history. Existing approaches, including retrieval-augmented generation (RAG) and temperature tuning, improve factuality or reduce stochasticity but cannot guarantee stability across equivalent prompts. In this paper, we propose a reinforcement learning framework based on Group Relative Policy Optimization (GRPO) to directly optimize for consistency. Unlike prior applications of GRPO, which have been limited to reasoning and code generation, we adapt GRPO to enforce stability of information content across groups of semantically equivalent prompts. We introduce entropy-based helpfulness and stability rewards, treating prompt variants as groups and resetting conversational context to isolate phrasing effects. Experiments on investment and job recommendation tasks show that our GRPO-trained model reduces variability more effectively than fine-tuning or decoding-based baselines. To our knowledge, this is a novel application of GRPO for aligning LLMs toward information consistency, reframing variability not as an acceptable feature of generative diversity but as a correctable flaw in enterprise deployments.
zh

[AI-36] Forgetful but Faithful: A Cognitive Memory Architecture and Benchmark for Privacy-Aware Generative Agents

【速读】:该论文旨在解决生成式 AI(Generative AI)在长期交互场景中因记忆管理能力不足而导致的性能瓶颈与隐私风险问题。当前方法要么采用无限制的记忆存储,引发计算不可行性和隐私泄露风险;要么使用简单遗忘机制,损害代理的一致性与功能性。解决方案的关键在于提出一种面向人类中心的记忆管理框架——Memory-Aware Retention Schema(MaRS),并设计六种理论驱动的遗忘策略,在保证叙事连贯性、目标达成率、社交回忆准确性、隐私保护和成本效率之间实现平衡。通过引入Forgetful but Faithful Agent(FiFA)基准进行系统评估,实验证明其混合遗忘策略在多种内存预算下均能获得最优综合性能(得分0.911),同时确保计算可行性与隐私合规性,为资源受限且注重隐私的生成式代理部署提供了可落地的技术路径与评估标准。

链接: https://arxiv.org/abs/2512.12856
作者: Saad Alqithami
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:As generative agents become increasingly sophisticated and deployed in long-term interactive scenarios, their memory management capabilities emerge as a critical bottleneck for both performance and privacy. Current approaches either maintain unlimited memory stores, leading to computational intractability and privacy concerns, or employ simplistic forgetting mechanisms that compromise agent coherence and functionality. This paper introduces the Memory-Aware Retention Schema (MaRS), a novel framework for human-centered memory management in generative agents, coupled with six theoretically-grounded forgetting policies that balance performance, privacy, and computational efficiency. We present the Forgetful but Faithful Agent (FiFA) benchmark, a comprehensive evaluation framework that assesses agent performance across narrative coherence, goal completion, social recall accuracy, privacy preservation, and cost efficiency. Through extensive experimentation involving 300 evaluation runs across multiple memory budgets and agent configurations, we demonstrate that our hybrid forgetting policy achieves superior performance (composite score: 0.911) while maintaining computational tractability and privacy guarantees. Our work establishes new benchmarks for memory-budgeted agent evaluation and provides practical guidelines for deploying generative agents in resource-constrained, privacy-sensitive environments. The theoretical foundations, implementation framework, and empirical results contribute to the emerging field of human-centered AI by addressing fundamental challenges in agent memory management that directly impact user trust, system scalability, and regulatory compliance.
zh

[AI-37] Selective Conformal Risk Control

【速读】:该论文旨在解决置信预测集过大的问题,即在高风险领域部署机器学习系统时,传统**等变预测(conformal prediction)**虽然能提供无分布假设的覆盖率保证,但往往生成的预测集合过于宽泛,限制了其实际应用价值。解决方案的关键在于提出一种统一框架——选择性等变风险控制(Selective Conformal Risk Control, SCRC),该框架将不确定性量化建模为两阶段问题:第一阶段通过筛选置信样本缩小预测范围,第二阶段在选定子集上应用等变风险控制以构建校准后的紧凑预测集。该方法通过引入选择机制,在不牺牲覆盖率和风险控制精度的前提下显著提升预测集的实用性。

链接: https://arxiv.org/abs/2512.12844
作者: Yunpeng Xu,Wenge Guo,Zhi Wei
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reliable uncertainty quantification is essential for deploying machine learning systems in high-stakes domains. Conformal prediction provides distribution-free coverage guarantees but often produces overly large prediction sets, limiting its practical utility. To address this issue, we propose \textitSelective Conformal Risk Control (SCRC), a unified framework that integrates conformal prediction with selective classification. The framework formulates uncertainty control as a two-stage problem: the first stage selects confident samples for prediction, and the second stage applies conformal risk control on the selected subset to construct calibrated prediction sets. We develop two algorithms under this framework. The first, SCRC-T, preserves exchangeability by computing thresholds jointly over calibration and test samples, offering exact finite-sample guarantees. The second, SCRC-I, is a calibration-only variant that provides PAC-style probabilistic guarantees while being more computational efficient. Experiments on two public datasets show that both methods achieve the target coverage and risk levels, with nearly identical performance, while SCRC-I exhibits slightly more conservative risk control but superior computational practicality. Our results demonstrate that selective conformal risk control offers an effective and efficient path toward compact, reliable uncertainty quantification.
zh

[AI-38] SAGA: Open-World Mobile Manipulation via Structured Affordance Grounding

【速读】:该论文旨在解决机器人在复杂多变环境中实现跨任务、跨场景的泛化性 visuomotor 控制问题,即如何让机器人在不同环境、任务目标和用户指令下均能稳定执行操作。解决方案的关键在于提出一种名为 SAGA 的通用自适应框架,其核心思想是通过显式地将任务目标与环境中的可操作性(affordance)进行语义解耦,从而分离高层语义意图与底层视觉运动控制。具体而言,SAGA 利用基于可操作性的任务表示方法,以统一且结构化的方式表达多样化行为,并借助多模态基础模型将任务表示映射为 3D 可操作性热图(affordance heatmaps),从而在视觉观测中突出任务相关实体并忽略无关外观变化,提升泛化能力。这一结构化的可操作性接地机制使得模型能够在多任务示范数据上训练出条件策略,支持语言指令、点选和示例演示等多种输入形式,实现零样本执行与少样本适应,最终在真实世界十一项任务中显著优于端到端及模块化基线方法。

链接: https://arxiv.org/abs/2512.12842
作者: Kuan Fang,Yuxin Chen,Xinghao Zhu,Farzad Niroui,Lingfeng Sun,Jiuguang Wang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 9 pages, 7 figures

点击查看摘要

Abstract:We present SAGA, a versatile and adaptive framework for visuomotor control that can generalize across various environments, task objectives, and user specifications. To efficiently learn such capability, our key idea is to disentangle high-level semantic intent from low-level visuomotor control by explicitly grounding task objectives in the observed environment. Using an affordance-based task representation, we express diverse and complex behaviors in a unified, structured form. By leveraging multimodal foundation models, SAGA grounds the proposed task representation to the robot’s visual observation as 3D affordance heatmaps, highlighting task-relevant entities while abstracting away spurious appearance variations that would hinder generalization. These grounded affordances enable us to effectively train a conditional policy on multi-task demonstration data for whole-body control. In a unified framework, SAGA can solve tasks specified in different forms, including language instructions, selected points, and example demonstrations, enabling both zero-shot execution and few-shot adaptation. We instantiate SAGA on a quadrupedal manipulator and conduct extensive experiments across eleven real-world tasks. SAGA consistently outperforms end-to-end and modular baselines by substantial margins. Together, these results demonstrate that structured affordance grounding offers a scalable and effective pathway toward generalist mobile manipulation.
zh

[AI-39] PRIVEE: Privacy-Preserving Vertical Federated Learning Against Feature Inference Attacks

【速读】:该论文旨在解决垂直联邦学习(Vertical Federated Learning, VFL)中的特征推理攻击问题,即恶意参与方利用推理阶段共享的置信度分数(confidence scores,即预测概率)来重构其他参与方的私有输入特征。为应对这一威胁,作者提出了一种名为PRIVEE(PRIvacy-preserving Vertical fEderated lEarning)的新防御机制,其核心在于对置信度分数进行变换以生成不可逆的表示形式,同时保留关键的相对排序关系和分数间距离特性;该方法不直接暴露原始分数,而是仅共享经过扰动的表示,从而在不损害模型预测精度的前提下显著提升隐私保护水平,实验表明PRIVEE相较现有最优防御方案实现了三倍以上的隐私保护增强。

链接: https://arxiv.org/abs/2512.12840
作者: Sindhuja Madabushi,Ahmad Faraz Khan,Haider Ali,Ananthram Swami,Rui Ning,Hongyi Wu,Jin-Hee Cho
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 12 pages, 3 figures

点击查看摘要

Abstract:Vertical Federated Learning (VFL) enables collaborative model training across organizations that share common user samples but hold disjoint feature spaces. Despite its potential, VFL is susceptible to feature inference attacks, in which adversarial parties exploit shared confidence scores (i.e., prediction probabilities) during inference to reconstruct private input features of other participants. To counter this threat, we propose PRIVEE (PRIvacy-preserving Vertical fEderated lEarning), a novel defense mechanism named after the French word privée, meaning “private.” PRIVEE obfuscates confidence scores while preserving critical properties such as relative ranking and inter-score distances. Rather than exposing raw scores, PRIVEE shares only the transformed representations, mitigating the risk of reconstruction attacks without degrading model prediction accuracy. Extensive experiments show that PRIVEE achieves a threefold improvement in privacy protection compared to state-of-the-art defenses, while preserving full predictive performance against advanced feature inference attacks.
zh

[AI-40] Network Level Evaluation of Hangup Susceptibility of HRGCs using Deep Learning and Sensing Techniques: A Goal Towards Safer Future

【速读】:该论文旨在解决陡坡型高速公路铁路平交道口(Steep Profiled Highway Railway Grade Crossings, HRGCs)对低离地间隙车辆存在的挂滞风险问题,此类风险可能导致车辆卡在轨道上,进而引发与列车的碰撞事故。解决方案的关键在于构建一个面向路网级别的挂滞风险评估框架:首先利用步行式测 profiler 和 Pave3D8K 激光成像系统采集多处道口的精确纵断面数据;其次开发一种融合长短期记忆网络(Long Short Term Memory, LSTM)与 Transformer 架构的混合深度学习模型,用于从激光扫描数据中重建高精度 HRGC 路面轮廓;再结合约 350 辆特种车辆的实测尺寸数据(包括轮距和离地间隙),采用三种典型车辆尺寸场景(中位数、75%分位轮距/25%分位离地间隙、最坏情况)进行挂滞风险分析;最终通过 ArcGIS 数据库和软件界面实现可视化决策支持,为交通管理部门提供可操作的风险识别与缓解工具。

链接: https://arxiv.org/abs/2512.12832
作者: Kaustav Chatterjee,Joshua Li,Kundan Parajulee,Jared Schwennesen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Steep profiled Highway Railway Grade Crossings (HRGCs) pose safety hazards to vehicles with low ground clearance, which may become stranded on the tracks, creating risks of train vehicle collisions. This research develops a framework for network level evaluation of hangup susceptibility of HRGCs. Profile data from different crossings in Oklahoma were collected using both a walking profiler and the Pave3D8K Laser Imaging System. A hybrid deep learning model, combining Long Short Term Memory (LSTM) and Transformer architectures, was developed to reconstruct accurate HRGC profiles from Pave3D8K Laser Imaging System data. Vehicle dimension data from around 350 specialty vehicles were collected at various locations across Oklahoma to enable up to date statistical design dimensions. Hangup susceptibility was analyzed using three vehicle dimension scenarios (a) median dimension (median wheelbase and ground clearance), (b) 75 25 percentile dimension (75 percentile wheelbase, 25 percentile ground clearance), and © worst case dimension (maximum wheelbase and minimum ground clearance). Results indicate 36, 62, and 67 crossings at the highest hangup risk levels under these scenarios, respectively. An ArcGIS database and a software interface were developed to support transportation agencies in mitigating crossing hazards. This framework advances safety evaluation by integrating next generation sensing, deep learning, and infrastructure datasets into practical decision support tools.
zh

[AI-41] On the continuity of flows

【速读】:该论文旨在解决流匹配(flow matching)在处理先验分布与目标分布拓扑结构不匹配(如单峰到多峰)时出现的最优速度场存在空间不连续性的问题。其关键发现是:当分布拓扑不一致时,连续流必须通过分叉来实现单模态到多模态的映射,导致粒子在中间时间点需做出离散路由决策,从而引发速度场沿决策边界产生跳跃间断,且该间断幅度随时间趋近目标分布而趋于无穷大。这一现象并非仅由L²损失引起,而是源于分布间的拓扑差异,为理解流匹配在流形上的局限性及神经网络中学习不连续表示的挑战提供了理论依据。

链接: https://arxiv.org/abs/2512.12821
作者: Congzhou M Sha
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Analysis, Statistics and Probability (physics.data-an)
备注: 9 pages, 2 figures

点击查看摘要

Abstract:Flow matching has emerged as a powerful framework for generative modeling through continuous normalizing flows. We investigate a potential topological constraint: when the prior distribution and target distribution have mismatched topology (e.g., unimodal to multimodal), the optimal velocity field under standard flow matching objectives may exhibit spatial discontinuities. We suggest that this discontinuity arises from the requirement that continuous flows must bifurcate to map a single mode to multiple modes, forcing particles to make discrete routing decisions at intermediate times. Through theoretical analysis on bimodal Gaussian mixtures, we demonstrate that the optimal velocity field exhibits jump discontinuities along decision boundaries, with magnitude approaching infinity as time approaches the target distribution. Our analysis suggests that this phenomenon is not specific to L^2 loss, but rather may be a consequence of topological mismatch between distributions. We validate our theory empirically and discuss potential implications for flow matching on manifolds, connecting our findings to recent work on Riemannian flow matching and the challenge of learning discontinuous representations in neural networks.
zh

[AI-42] Decoding Human and AI Persuasion in National College Debate: Analyzing Prepared Arguments Through Aristotles Rhetorical Principles

【速读】:该论文旨在解决辩论训练中论证能力培养依赖人工辅导、难以规模化的问题,尤其关注学生在构建基于文献证据的论点时所面临的挑战。解决方案的关键在于利用生成式 AI(Generative AI)——具体为 GPT-4——自动产出高质量的证据卡(evidence card),并将其与人类辩手生成的内容进行对比分析。研究基于亚里士多德修辞学三要素(ethos、pathos、logos)对论证质量进行系统评估,揭示了人工智能与人类在推理逻辑、说服力构建等方面的异同,从而为 AI 辅助学习干预提供实证依据,助力学生提升批判性思维和论证能力。

链接: https://arxiv.org/abs/2512.12817
作者: Mengqian Wu,Jiayi Zhang,Raymond Z. Zhang
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Debate has been widely adopted as a strategy to enhance critical thinking skills in English Language Arts (ELA). One important skill in debate is forming effective argumentation, which requires debaters to select supportive evidence from literature and construct compelling claims. However, the training of this skill largely depends on human coaching, which is labor-intensive and difficult to scale. To better support students in preparing for debates, this study explores the potential of leveraging artificial intelligence to generate effective arguments. Specifically, we prompted GPT-4 to create an evidence card and compared it to those produced by human debaters. The evidence cards outline the arguments students will present and how those arguments will be delivered, including components such as literature-based evidence quotations, summaries of core ideas, verbatim reading scripts, and tags (i.e., titles of the arguments). We compared the quality of the arguments in the evidence cards created by GPT and student debaters using Aristotle’s rhetorical principles: ethos (credibility), pathos (emotional appeal), and logos (logical reasoning). Through a systematic qualitative and quantitative analysis, grounded in the rhetorical principles, we identify the strengths and limitations of human and GPT in debate reasoning, outlining areas where AI’s focus and justifications align with or diverge from human reasoning. Our findings contribute to the evolving role of AI-assisted learning interventions, offering insights into how student debaters can develop strategies that enhance their argumentation and reasoning skills.
zh

[AI-43] OPAL: Operator-Programmed Algorithms for Landscape-Aware Black-Box Optimization

【速读】:该论文旨在解决黑箱优化(black-box optimization)中传统进化和群智能算法性能高度依赖问题实例的难题。其核心挑战在于如何为不同优化问题设计出高效且自适应的搜索策略,而非依赖人工设计的启发式规则或“类比型”算法。解决方案的关键在于提出一种名为OPAL(Operator-Programmed Algorithms)的景观感知框架:首先利用少量计算预算通过差分进化(differential evolution)探查目标函数的景观特征,构建采样点的k近邻图(k-nearest neighbor graph),并用图神经网络(graph neural network)编码搜索轨迹;随后由元学习器(meta-learner)将该表示映射为分阶段的操作程序(exploration, restart, local search operators),从而实现针对每个问题实例的定制化优化策略生成。此方法在CEC 2017测试集上展现出与最先进自适应差分进化算法相当甚至更优的性能,且元组件带来的运行时间开销较小,验证了基于操作程序的景观感知、个体化设计是超越传统启发式算法的有效路径。

链接: https://arxiv.org/abs/2512.12809
作者: Junbo Jacob Lian,Mingyang Yu,Kaichen Ouyang,Shengwei Fu,Rui Zhong,Yujun Zhang,Jun Zhang,Huiling Chen
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注: Source code, experiment scripts, and results are publicly available at this https URL . The real-world application part hasn’t been done yet

点击查看摘要

Abstract:Black-box optimization often relies on evolutionary and swarm algorithms whose performance is highly problem dependent. We view an optimizer as a short program over a small vocabulary of search operators and learn this operator program separately for each problem instance. We instantiate this idea in Operator-Programmed Algorithms (OPAL), a landscape-aware framework for continuous black-box optimization that uses a small design budget with a standard differential evolution baseline to probe the landscape, builds a k -nearest neighbor graph over sampled points, and encodes this trajectory with a graph neural network. A meta-learner then maps the resulting representation to a phase-wise schedule of exploration, restart, and local search operators. On the CEC~2017 test suite, a single meta-trained OPAL policy is statistically competitive with state-of-the-art adaptive differential evolution variants and achieves significant improvements over simpler baselines under nonparametric tests. Ablation studies on CEC~2017 justify the choices for the design phase, the trajectory graph, and the operator-program representation, while the meta-components add only modest wall-clock overhead. Overall, the results indicate that operator-programmed, landscape-aware per-instance design is a practical way forward beyond ad hoc metaphor-based algorithms in black-box optimization.
zh

[AI-44] Fault-Tolerant Sandboxing for AI Coding Agents : A Transactional Approach to Safe Autonomous Execution

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)向自主代理(autonomous agents)演进过程中引入的安全风险问题,尤其是破坏性命令执行和系统状态不一致带来的潜在危害。现有商业解决方案通常侧重于交互式用户安全,通过强制身份验证机制阻断了实现真正自主运行所需的无头(headless)循环流程。其核心解决方案是提出一种基于策略的容错沙箱框架(Fault-Tolerant Sandboxing framework),关键创新在于引入策略驱动的拦截层与事务性文件系统快照机制,将代理操作封装为原子事务,从而在保证高拦截率(100%)和状态回滚成功率(100%)的同时,仅带来14.5%的性能开销(约1.8秒/事务),显著优于容器化方案的初始化延迟或商业CLI工具因交互认证导致的不可用性。

链接: https://arxiv.org/abs/2512.12806
作者: Boyang Yan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 7 pages

点击查看摘要

Abstract:The transition of Large Language Models (LLMs) from passive code generators to autonomous agents introduces significant safety risks, specifically regarding destructive commands and inconsistent system states. Existing commercial solutions often prioritize interactive user safety, enforcing authentication barriers that break the headless loops required for true autonomy. This paper presents a Fault-Tolerant Sandboxing framework designed to mitigate these risks through a policy-based interception layer and a transactional filesystem snapshot mechanism. We hypothesize that wrapping agent actions in atomic transactions can guarantee safety with acceptable latency, outperforming the heavy initialization overhead of containers or the interactive friction of commercial CLIs. We validated this approach by deploying the Minimind-MoE LLM served via nano-vllm on a custom Proxmox-based testbed utilizing EVPN/VXLAN isolation. Experimental results demonstrate a 100% interception rate for high-risk commands and a 100% success rate in rolling back failed states. Crucially, our prototype incurs only a 14.5% performance overhead (approx. 1.8s) per transaction. In contrast, benchmarking against the Gemini CLI sandbox revealed that it requires interactive authentication (“Sign in”), rendering it unusable for headless, autonomous agent workflows.
zh

[AI-45] From Small to Large: Generalization Bounds for Transformers on Variable-Size Inputs

【速读】:该论文旨在解决Transformer模型在几何数据(如点云和图)上表现出的“规模外推能力”(size generalization)缺乏严谨理论解释的问题。其解决方案的关键在于构建一个理论框架,通过建立离散采样数据与连续域等效表示之间的误差边界来刻画这种外推行为;具体而言,作者证明了对于具有稳定位置编码的Transformer,该误差边界由采样密度和数据流形的内在维度共同决定,从而为Transformer在几何数据上的泛化能力提供了可量化、可解释的理论支撑。

链接: https://arxiv.org/abs/2512.12805
作者: Anastasiia Alokhina,Pan Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Transformers exhibit a notable property of \emphsize generalization, demonstrating an ability to extrapolate from smaller token sets to significantly longer ones. This behavior has been documented across diverse applications, including point clouds, graphs, and natural language. Despite its empirical success, this capability still lacks some rigorous theoretical characterizations. In this paper, we develop a theoretical framework to analyze this phenomenon for geometric data, which we represent as discrete samples from a continuous source (e.g., point clouds from manifolds, graphs from graphons). Our core contribution is a bound on the error between the Transformer’s output for a discrete sample and its continuous-domain equivalent. We prove that for Transformers with stable positional encodings, this bound is determined by the sampling density and the intrinsic dimensionality of the data manifold. Experiments on graphs and point clouds of various sizes confirm the tightness of our theoretical bound.
zh

[AI-46] Causal Counterfactuals Reconsidered

【速读】:该论文旨在解决传统Pearl语义框架在处理无法扩展为现实结构因果模型(Structural Causal Models, SCM)的 probabilistic causal models 时的局限性问题,这类模型即使在简单场景中也会出现。其解决方案的关键在于提出一种新的反事实概率语义,该语义基于满足马尔可夫条件(Markov condition)、仅包含现实变量且因果完备的因果模型,并避免使用所谓的响应变量(response variables)。该方法既采纳了Dawid对普遍因果决定论和非现实变量的批判,又支持Pearl关于反事实具有普遍语义可能性的观点,从而在长期存在的Pearl与Dawid之争中提供了一种自然折衷。此外,作者证明所提语义等价于两个不依赖SCM的近期提案,并与文献中关于随机反事实的诸多观点保持一致,同时探讨了马尔可夫条件的普适性及其在因果抽象中的新推广。

链接: https://arxiv.org/abs/2512.12804
作者: Sander Beckers
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Preprint: currently under review

点击查看摘要

Abstract:I develop a novel semantics for probabilities of counterfactuals that generalizes the standard Pearlian semantics: it applies to probabilistic causal models that cannot be extended into realistic structural causal models and are therefore beyond the scope of Pearl’s semantics. This generalization is needed because, as I show, such probabilistic causal models arise even in simple settings. My semantics offer a natural compromize in the long-standing debate between Pearl and Dawid over counterfactuals: I agree with Dawid that universal causal determinism and unrealistic variables should be rejected, but I agree with Pearl that a general semantics of counterfactuals is nonetheless possible. I restrict attention to causal models that satisfy the Markov condition, only contain realistic variables, and are causally complete. Although I formulate my proposal using structural causal models, as does Pearl, I refrain from using so-called response variables. Moreover, I prove that my semantics is equivalent to two other recent proposals that do not involve structural causal models, and that it is in line with various comments on stochastic counterfactuals that have appeared in the literature more broadly. Throughout I also reflect on the universality of the Markov condition and explore a novel generalization of causal abstractions
zh

[AI-47] Liquid Reasoning Transformers: A Sudoku-Based Prototype for Chess-Scale Algorithmic Tasks

【速读】:该论文旨在解决传统Transformer模型在复杂推理任务中计算资源分配不合理、难以纠正早期错误以及缺乏动态调整推理深度能力的问题。其解决方案的关键在于提出液态推理Transformer(Liquid Reasoning Transformer, LRT),通过引入迭代更新机制、基于丢弃(discard-based)的纠错策略和可学习的停止机制,使模型能够在多个内部步骤中持续优化一个递归推理标记(reasoning token),从而根据输入难度自适应地调整计算深度,并有效修正早期推理错误,实现更高效且稳定的结构化推理。

链接: https://arxiv.org/abs/2512.12792
作者: Shivansh Sahni,Wenzhi Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 11 pages, 0 figures

点击查看摘要

Abstract:The Liquid Reasoning Transformer (LRT) is a transformer architecture designed for inference with adaptive depths using iterative changes, discard-based correction, and a learned stopping mechanism. Instead of relying on a single feedforward pass, the model updates a recurrent reasoning token across multiple internal steps, allowing it to correct early errors and allocate computation based on input difficulty. We evaluate the LRT on Sudoku as a controlled testbed for structured reasoning and show that it achieves strong performance, reaching 98.68% digit accuracy and 36.30% full-puzzle accuracy without using symbolic rules or search. Analyzing internal patterns shows that the discard and stop gates play different, important roles in stabilizing inferences and adjusting computational depth. We discuss how these mechanisms extend naturally to chess-scale reasoning tasks and outline extensions for multi-token reasoning and larger domains.
zh

[AI-48] Beyond Task Completion: An Assessment Framework for Evaluating Agent ic AI Systems

【速读】:该论文旨在解决当前对智能体系统(Agentic AI)评估方法不足的问题,尤其是现有评价体系无法有效捕捉由大语言模型(Large Language Models, LLMs)非确定性行为所导致的运行时不确定性。传统评估多依赖于二元任务完成指标,忽略了代理在工具调用、记忆检索、多代理协作及环境交互等方面的动态表现。解决方案的关键在于提出一个端到端的代理评估框架(Agent Assessment Framework),包含四个核心维度:LLM能力、记忆(Memory)、工具(Tools)和环境(Environment),从而系统性地量化代理在复杂任务中的行为稳定性与协作效能,实证表明该框架能够识别出常规指标遗漏的行为偏差,显著提升了对多代理系统运行不确定性的刻画能力。

链接: https://arxiv.org/abs/2512.12791
作者: Sreemaee Akshathala,Bassam Adnan,Mahisha Ramesh,Karthik Vaidhyanathan,Basil Muhammed,Kannan Parthasarathy
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Recent advances in agentic AI have shifted the focus from standalone Large Language Models (LLMs) to integrated systems that combine LLMs with tools, memory, and other agents to perform complex tasks. These multi-agent architectures enable coordinated reasoning, planning, and execution across diverse domains, allowing agents to collaboratively automate complex workflows. Despite these advances, evaluation and assessment of LLM agents and the multi-agent systems they constitute remain a fundamental challenge. Although various approaches have been proposed in the software engineering literature for evaluating conventional software components, existing methods for AI-based systems often overlook the non-deterministic nature of models. This non-determinism introduces behavioral uncertainty during execution, yet existing evaluations rely on binary task completion metrics that fail to capture it. Evaluating agentic systems therefore requires examining additional dimensions, including the agent ability to invoke tools, ingest and retrieve memory, collaborate with other agents, and interact effectively with its environment. We propose an end-to-end Agent Assessment Framework with four evaluation pillars encompassing LLMs, Memory, Tools, and Environment. We validate the framework on a representative Autonomous CloudOps use case, where experiments reveal behavioral deviations overlooked by conventional metrics, demonstrating its effectiveness in capturing runtime uncertainties.
zh

[AI-49] Unveiling Statistical Significance of Online Regression over Multiple Datasets

【速读】:该论文旨在解决多算法在多个数据集上的统计显著性比较问题,以及在线学习场景中随时间演变的数据环境下性能差异的验证难题。其关键解决方案是采用Friedman检验结合事后检验(post-hoc tests)对多种在线回归模型进行系统评估,并通过真实与合成数据集的5折交叉验证和种子平均策略确保结果的稳健性和全面性。

链接: https://arxiv.org/abs/2512.12787
作者: Mohammad Abu-Shaira,Weishi Shi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite extensive focus on techniques for evaluating the performance of two learning algorithms on a single dataset, the critical challenge of developing statistical tests to compare multiple algorithms across various datasets has been largely overlooked in most machine learning research. Additionally, in the realm of Online Learning, ensuring statistical significance is essential to validate continuous learning processes, particularly for achieving rapid convergence and effectively managing concept drifts in a timely manner. Robust statistical methods are needed to assess the significance of performance differences as data evolves over time. This article examines the state-of-the-art online regression models and empirically evaluates several suitable tests. To compare multiple online regression models across various datasets, we employed the Friedman test along with corresponding post-hoc tests. For thorough evaluations, utilizing both real and synthetic datasets with 5-fold cross-validation and seed averaging ensures comprehensive assessment across various data subsets. Our tests generally confirmed the performance of competitive baselines as consistent with their individual reports. However, some statistical test results also indicate that there is still room for improvement in certain aspects of state-of-the-art methods.
zh

[AI-50] OLC-WA: Drift Aware Tuning-Free Online Classification with Weighted Averag e

【速读】:该论文旨在解决流式数据环境中因概念漂移(concept drift)导致的模型预测性能下降问题,尤其针对在线学习模型中固定超参数难以适应动态数据分布的局限性。解决方案的关键在于提出一种无需人工干预的自适应在线分类模型——OLC-WA(Online Classification with Weighted Average),其核心机制是通过指数加权移动平均(exponentially weighted moving average)将新数据流与基线模型进行融合,并集成一个自动优化模块以检测概念漂移、量化其强度并据此动态调整模型参数,从而实现对演化数据分布的有效响应。

链接: https://arxiv.org/abs/2512.12785
作者: Mohammad Abu Shaira,Yunhe Feng,Heng Fan,Weishi Shi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Real-world data sets often exhibit temporal dynamics characterized by evolving data distributions. Disregarding this phenomenon, commonly referred to as concept drift, can significantly diminish a model’s predictive accuracy. Furthermore, the presence of hyperparameters in online models exacerbates this issue. These parameters are typically fixed and cannot be dynamically adjusted by the user in response to the evolving data distribution. This paper introduces Online Classification with Weighted Average (OLC-WA), an adaptive, hyperparameter-free online classification model equipped with an automated optimization mechanism. OLC-WA operates by blending incoming data streams with an existing base model. This blending is facilitated by an exponentially weighted moving average. Furthermore, an integrated optimization mechanism dynamically detects concept drift, quantifies its magnitude, and adjusts the model based on the observed data stream characteristics. This approach empowers the model to effectively adapt to evolving data distributions within streaming environments. Rigorous empirical evaluation across diverse benchmark datasets shows that OLC-WA achieves performance comparable to batch models in stationary environments, maintaining accuracy within 1-3%, and surpasses leading online baselines by 10-25% under drift, demonstrating its effectiveness in adapting to dynamic data streams.
zh

[AI-51] Designing The Drive: Enhancing User Experience through Adaptive Interfaces in Autonomous Vehicles

【速读】:该论文旨在解决自动驾驶车辆(Autonomous Vehicles, AVs)用户界面个性化设计中如何提升用户体验、安全性和信任度的问题。其核心挑战在于,随着AV技术的普及,用户对界面的接受度和信任感日益成为影响其广泛应用的关键因素。解决方案的关键在于系统性地应用人机交互(Human-Computer Interaction, HCI)原则,特别是通过自适应设计(adaptive design)、多模态交互(multi-modal interaction)和用户反馈机制等策略,实现界面的个性化定制;同时强调透明度与用户控制权的重要性,使用户能够主动参与或调整自身体验,从而增强对自动驾驶系统的信任,最终在保障高安全性和安全性标准的前提下满足多样化用户需求。

链接: https://arxiv.org/abs/2512.12773
作者: Reeteesha Roy
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 8 pages, 4 figures

点击查看摘要

Abstract:With the recent development and integration of autonomous vehicles (AVs) in transportation systems of the modern world, the emphasis on customizing user interfaces to optimize the overall user experience has been growing expediently. Therefore, understanding user needs and preferences is essential to the acceptance and trust of these technologies as they continue to grow in prevalence. This paper addresses the implementation of HCI principles in the personalization of interfaces to improve safety, security, and usability for the users. This paper explores the way that personalized interfaces can be devised to increase user engagement and satisfaction through various HCI strategies such as adaptive design, multi-modal interaction, and user feedback mechanisms. Moreover, this paper puts emphasis on factors of transparency and user control in the design of an interface; hence, allowing users to design or modify their experience could foster an increase in trust in autonomous systems. In so doing, this research touches on the quite influential role HCI will play in this future scenario of autonomous vehicles while designing to ensure relevance to the diverse needs of users while maintaining high standards of safety and security. Discussing various HCI strategies such as adaptive design, multi-modal interaction, and feedback mechanisms to the user, this paper demonstrates how personalized interfaces can enhance significantly both user engagement and satisfaction. Transparency and user control also in designing an interface are further discussed, pointing out the need for a prerequisite condition of enabling the user to take control of their experience as a state of trust in autonomous systems. In summary, this paper points out the role of HCI in the development of autonomous vehicles and addresses numerous needs with respect to those enforced safety and security standards.
zh

[AI-52] Adaptive Edge-Cloud Inference for Speech-to-Action Systems Using ASR and Large Language Models (ASTA)

【速读】:该论文旨在解决语音驱动的物联网(IoT)设备在边缘计算与云端计算之间存在的性能与资源利用权衡问题:云端方案虽具备更强的语言理解能力,但存在延迟高、依赖网络连接及隐私风险;而纯边缘方案虽能实现低延迟和隐私保护,却受限于计算资源。解决方案的关键在于提出ASTA(Adaptive Speech-to-Action),其核心是通过实时系统指标(如CPU负载、设备温度和网络延迟)动态路由语音指令至边缘或云端进行推理,并结合规则驱动的命令验证与修复机制,确保端到端命令执行的成功率。该设计实现了对计算资源的自适应调度与鲁棒性增强,从而在保证响应效率的同时提升系统整体可靠性。

链接: https://arxiv.org/abs/2512.12769
作者: Mohammad Jalili Torkamani,Israt Zarin
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注: preprint, 6 pages, 7 figures, 1 table

点击查看摘要

Abstract:Voice-based interaction has emerged as a natural and intuitive modality for controlling IoT devices. However, speech-driven edge devices face a fundamental trade-off between cloud-based solutions, which offer stronger language understanding capabilities at the cost of latency, connectivity dependence, and privacy concerns, and edge-based solutions, which provide low latency and improved privacy but are limited by computational constraints. This paper presents ASTA, an adaptive speech-to-action solution that dynamically routes voice commands between edge and cloud inference to balance performance and system resource utilization. ASTA integrates on-device automatic speech recognition and lightweight offline language-model inference with cloud-based LLM processing, guided by real-time system metrics such as CPU workload, device temperature, and network latency. A metric-aware routing mechanism selects the inference path at runtime, while a rule-based command validation and repair component ensures successful end-to-end command execution. We implemented our solution on an NVIDIA Jetson-based edge platform and evaluated it using a diverse dataset of 80 spoken commands. Experimental results show that ASTA successfully routes all input commands for execution, achieving a balanced distribution between online and offline inference. The system attains an ASR accuracy of 62.5% and generates executable commands without repair for only 47.5% of inputs, highlighting the importance of the repair mechanism in improving robustness. These results suggest that adaptive edge-cloud orchestration is a viable approach for resilient and resource-aware voice-controlled IoT systems.
zh

[AI-53] Personalized QoE Prediction: A Demographic-Augmented Machine Learning Framework for 5G Video Streaming Networks

【速读】:该论文旨在解决当前质量感知体验(Quality of Experience, QoE)预测方法在异构现实环境中适用性受限的问题,尤其是现有模型依赖有限数据集且假设用户感知一致,难以适应不同人群对视频流媒体中断、码率波动和画质下降等干扰因素的差异化敏感性。其解决方案的关键在于提出一种基于人口统计学特征的数据增强策略,通过建模用户群体对流媒体质量退化行为的差异敏感性,将原始小规模QoE数据集扩展六倍;在此基础上,结合经典机器学习与先进深度学习架构(如基于注意力机制的MLP和TabNet),实现了更精准的个性化QoE预测。实验表明,TabNet因具备内在特征选择与注意力机制,在RMSE、MAE和R指标上均显著优于基线模型,验证了该方法在提升QoE预测鲁棒性和可扩展性方面的有效性。

链接: https://arxiv.org/abs/2512.12736
作者: Syeda Zunaira Ahmed,Hejab Tahira Beg,Maryam Khalid
机构: 未知
类目: Artificial Intelligence (cs.AI); Multimedia (cs.MM); Image and Video Processing (eess.IV)
备注: 11 pages, 5 figures

点击查看摘要

Abstract:Quality of Experience (QoE) prediction is a critical component of modern multimedia systems, particularly for adaptive video streaming in 5G networks. Accurate QoE estimation enables intelligent resource management and supports user centric service delivery. Existing QoE prediction approaches primarily rely on limited datasets and assume uniform user perception, which restricts their applicability in heterogeneous real world environments. This paper proposes a demographic aware machine learning framework for personalized QoE prediction. We introduce a behaviorally realistic demographic based data augmentation strategy that expands a small QoE dataset six fold by modeling varying user sensitivities to streaming impairments such as rebuffering, bitrate variation, and quality degradation. Using the augmented dataset, we evaluate a comprehensive set of classical machine learning models alongside advanced deep learning architectures, including an attention-based MLP and TabNet. Experimental results demonstrate significant improvements in prediction accuracy across RMSE, MAE, and R metrics compared to baseline models. Among all evaluated approaches, TabNet achieves the strongest performance, benefiting from its inherent feature selection and attention mechanisms. The results confirm that demographic-aware augmentation substantially enhances QoE prediction robustness and provides a scalable direction for personalized QoE-aware intelligence in 5G video streaming networks. Comments: 11 pages, 5 figures Subjects: Artificial Intelligence (cs.AI); Multimedia (cs.MM); Image and Video Processing (eess.IV) ACMclasses: I.2.6; I.5.1; C.2.1 Cite as: arXiv:2512.12736 [cs.AI] (or arXiv:2512.12736v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2512.12736 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-54] Synergizing Code Coverag e and Gameplay Intent: Coverag e-Aware Game Playtesting with LLM -Guided Reinforcement Learning

【速读】:该论文旨在解决“游戏即服务”(Games as a Service)模式下频繁内容更新带来的自动化测试难题,即传统自动化测试方法在结构覆盖与功能验证之间存在割裂:代码导向的方法虽能实现高结构覆盖率但缺乏对游戏玩法语境的理解,而玩家导向的智能体虽能验证高层意图却难以覆盖具体代码变更。解决方案的关键在于提出SMART(Structural Mapping for Augmented Reinforcement Testing)框架,其核心是利用大语言模型(LLMs)解析抽象语法树(AST)差异并提取功能意图,构建上下文感知的混合奖励机制,从而引导强化学习代理在完成游戏目标的同时自适应探索修改的代码分支,实现结构完整性与功能正确性的高效平衡。

链接: https://arxiv.org/abs/2512.12706
作者: Enhong Mu,Minami Yoda,Yan Zhang,Mingyue Zhang,Yutaka Matsuno,Jialong Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:The widespread adoption of the “Games as a Service” model necessitates frequent content updates, placing immense pressure on quality assurance. In response, automated game testing has been viewed as a promising solution to cope with this demanding release cadence. However, existing automated testing approaches typically create a dichotomy: code-centric methods focus on structural coverage without understanding gameplay context, while player-centric agents validate high-level intent but often fail to cover specific underlying code changes. To bridge this gap, we propose SMART (Structural Mapping for Augmented Reinforcement Testing), a novel framework that synergizes structural verification and functional validation for game update testing. SMART leverages large language models (LLMs) to interpret abstract syntax tree (AST) differences and extract functional intent, constructing a context-aware hybrid reward mechanism. This mechanism guides reinforcement learning agents to sequentially fulfill gameplay goals while adaptively exploring modified code branches. We evaluate SMART on two environments, Overcooked and Minecraft. The results demonstrate that SMART significantly outperforms state-of-the-art baselines; it achieves over 94% branch coverage of modified code, nearly double that of traditional reinforcement learning methods, while maintaining a 98% task completion rate, effectively balancing structural comprehensiveness with functional correctness.
zh

[AI-55] Co-Exploration and Co-Exploitation via Shared Structure in Multi-Task Bandits

【速读】:该论文旨在解决上下文多任务多臂老虎机(contextual multi-task multi-armed bandit)场景中的高效探索问题,其中上下文信息仅部分可观测,且奖励分布间存在由潜在上下文变量诱导的依赖关系。其核心挑战在于如何在不完全观测和有限交互历史下,同时利用跨任务与跨臂的结构化依赖关系以提升探索效率。解决方案的关键在于提出一种新颖的贝叶斯框架:通过粒子化的对数密度高斯过程(particle-based approximation of a log-density Gaussian process)表示任务与奖励之间的联合分布,从而在不假设潜在变量先验的前提下,灵活地从数据中发现任务间和臂间的依赖结构;同时明确区分两类认知不确定性——潜变量引起的结构不确定性(structural uncertainty)和用户特定的不确定性(user-specific uncertainty),实现全局共享学习与个性化推理的统一。

链接: https://arxiv.org/abs/2512.12693
作者: Sumantrak Mukherjee,Serafima Lebedeva,Valentin Margraf,Jonas Hanselle,Kanta Yamaoka,Viktor Bengs,Stefan Konigorski,Eyke Hüllermeier,Sebastian Josef Vollmer
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 18 pages, 9 figures, preprint

点击查看摘要

Abstract:We propose a novel Bayesian framework for efficient exploration in contextual multi-task multi-armed bandit settings, where the context is only observed partially and dependencies between reward distributions are induced by latent context variables. In order to exploit these structural dependencies, our approach integrates observations across all tasks and learns a global joint distribution, while still allowing personalised inference for new tasks. In this regard, we identify two key sources of epistemic uncertainty, namely structural uncertainty in the latent reward dependencies across arms and tasks, and user-specific uncertainty due to incomplete context and limited interaction history. To put our method into practice, we represent the joint distribution over tasks and rewards using a particle-based approximation of a log-density Gaussian process. This representation enables flexible, data-driven discovery of both inter-arm and inter-task dependencies without prior assumptions on the latent variables. Empirically, we demonstrate that our method outperforms baselines such as hierarchical model bandits, especially in settings with model misspecification or complex latent heterogeneity.
zh

[AI-56] DynaGen: Unifying Temporal Knowledge Graph Reasoning with Dynamic Subgraphs and Generative Regularization

【速读】:该论文旨在解决时间知识图谱推理(Temporal Knowledge Graph Reasoning, TKGR)中插值(interpolation)和外推(extrapolation)任务所面临的两大挑战:插值方法受限于上下文建模能力,而外推方法则存在认知泛化偏差问题。解决方案的关键在于提出一个统一框架DynaGen,其核心创新包括:针对插值任务,通过动态构建以实体为中心的子图,并采用协同双分支图神经网络(GNN)编码器捕捉演化中的结构上下文;针对外推任务,则引入条件扩散过程(conditional diffusion process),迫使模型学习事件演化的底层规律而非表面模式,从而提升对未见未来事件的预测能力。

链接: https://arxiv.org/abs/2512.12669
作者: Jiawei Shen,Jia Zhu,Hanghui Guo,Weijie Shi,Guoqing Ma,Yidan Liang,Jingjiang Liu,Hao Chen,Shimin Di
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Temporal Knowledge Graph Reasoning (TKGR) aims to complete missing factual elements along the timeline. Depending on the temporal position of the query, the task is categorized into interpolation and extrapolation. Existing interpolation methods typically embed temporal information into individual facts to complete missing historical knowledge, while extrapolation techniques often leverage sequence models over graph snapshots to identify recurring patterns for future event prediction. These methods face two critical challenges: limited contextual modeling in interpolation and cognitive generalization bias in extrapolation. To address these, we propose a unified method for TKGR, dubbed DynaGen. For interpolation, DynaGen dynamically constructs entity-centric subgraphs and processes them with a synergistic dual-branch GNN encoder to capture evolving structural context. For extrapolation, it applies a conditional diffusion process, which forces the model to learn underlying evolutionary principles rather than just superficial patterns, enhancing its ability to predict unseen future events. Extensive experiments on six benchmark datasets show DynaGen achieves state-of-the-art performance. On average, compared to the second-best models, DynaGen improves the Mean Reciprocal Rank (MRR) score by 2.61 points for interpolation and 1.45 points for extrapolation.
zh

[AI-57] Value-Aware Multiagent Systems

【速读】:该论文旨在解决传统人工智能(AI)中“价值对齐”(value alignment)问题的局限性,提出“价值意识”(value awareness)这一新概念,以更全面地实现AI系统与人类价值观的一致性。其解决方案的关键在于构建一个由三大支柱组成的工程化路线图:首先,利用形式语义学(formal semantics)学习和表征人类价值观;其次,确保单智能体及多智能体系统的价值对齐;最后,提供基于价值的可解释性(value-based explainability),用于解释AI的行为逻辑。该方法为开发具备伦理敏感性和透明性的下一代AI系统提供了清晰的技术路径。

链接: https://arxiv.org/abs/2512.12652
作者: Nardine Osman
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:This paper introduces the concept of value awareness in AI, which goes beyond the traditional value-alignment problem. Our definition of value awareness presents us with a concise and simplified roadmap for engineering value-aware AI. The roadmap is structured around three core pillars: (1) learning and representing human values using formal semantics, (2) ensuring the value alignment of both individual agents and multiagent systems, and (3) providing value-based explainability on behaviour. The paper presents a selection of our ongoing work on some of these topics, along with applications to real-life domains.
zh

[AI-58] Modular and Multi-Path-Aware Offline Benchmarking for Mobile GUI Agents

【速读】:该论文旨在解决当前移动图形用户界面(GUI)智能体(Mobile GUI Agents)评估中存在的两大核心问题:一是现有评估方法要么依赖静态单路径离线基准(offline benchmarks),导致对合法替代动作的不公平惩罚;要么采用在线实时基准(online live benchmarks),但存在可扩展性差和复现性低的问题;二是现有基准将智能体视为黑箱,忽略了各模块的贡献,致使比较不公或关键性能瓶颈被掩盖。解决方案的关键在于提出MobiBench——首个模块化且多路径感知的离线评估框架,它能够在完全离线环境下实现高保真度、可扩展且可复现的评估,同时支持细粒度模块级分析,从而揭示不同技术组件的效能与配置优化策略,显著提升评估的准确性与实用性。

链接: https://arxiv.org/abs/2512.12634
作者: Youngmin Im,Byeongung Jo,Jaeyoung Wi,Seungwoo Baek,Tae Hoon Min,Joo Hyung Lee,Sangeun Oh,Insik Shin,Sunjae Lee
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Mobile GUI Agents, AI agents capable of interacting with mobile applications on behalf of users, have the potential to transform human computer interaction. However, current evaluation practices for GUI agents face two fundamental limitations. First, they either rely on single path offline benchmarks or online live benchmarks. Offline benchmarks using static, single path annotated datasets unfairly penalize valid alternative actions, while online benchmarks suffer from poor scalability and reproducibility due to the dynamic and unpredictable nature of live evaluation. Second, existing benchmarks treat agents as monolithic black boxes, overlooking the contributions of individual components, which often leads to unfair comparisons or obscures key performance bottlenecks. To address these limitations, we present MobiBench, the first modular and multi path aware offline benchmarking framework for mobile GUI agents that enables high fidelity, scalable, and reproducible evaluation entirely in offline settings. Our experiments demonstrate that MobiBench achieves 94.72 percent agreement with human evaluators, on par with carefully engineered online benchmarks, while preserving the scalability and reproducibility of static offline benchmarks. Furthermore, our comprehensive module level analysis uncovers several key insights, including a systematic evaluation of diverse techniques used in mobile GUI agents, optimal module configurations across model scales, the inherent limitations of current LFMs, and actionable guidelines for designing more capable and cost efficient mobile agents.
zh

[AI-59] ORIBA: Exploring LLM -Driven Role-Play Chatbot as a Creativity Support Tool for Original Character Artists

【速读】:该论文旨在解决生成式 AI(Generative AI, GAI)在视觉艺术创作中引发的伦理争议问题,特别是如何在不侵犯艺术家对作品视觉表达(exposition)控制权的前提下,有效辅助其进行原创角色(Original Character, OC)的构思与开发。解决方案的关键在于设计一个名为 ORIBA 的 AI 聊天机器人,它基于大语言模型(Large Language Models, LLMs)使艺术家能够与自己的 OC 进行角色扮演(role-play),聚焦于概念化阶段(如背景故事构建),而将具体的视觉创作任务交还给艺术家本人,从而保障创作者的自主性(creative agency)。实证研究表明,ORIBA 能激发艺术家的想象力,促进多维属性塑造并增强与 OC 的情感联结,进而推动创作过程。

链接: https://arxiv.org/abs/2512.12630
作者: Yuqian Sun,Xingyu Li,Shunyu Yao,Noura Howell,Tristan Braud,Chang Hee Lee,Ali Asadipour
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in Generative AI (GAI) have led to new opportunities for creativity support. However, this technology has raised ethical concerns in the visual artists community. This paper explores how GAI can assist visual artists in developing original characters (OCs) while respecting their creative agency. We present ORIBA, an AI chatbot leveraging large language models (LLMs) to enable artists to role-play with their OCs, focusing on conceptualization (e.g., backstories) while leaving exposition (visual creation) to creators. Through a study with 14 artists, we found ORIBA motivated artists’ imaginative engagement, developing multidimensional attributes and stronger bonds with OCs that inspire their creative process. Our contributions include design insights for AI systems that develop from artists’ perspectives, demonstrating how LLMs can support cross-modal creativity while preserving creative agency in OC art. This paper highlights the potential of GAI as a neutral, non-visual support that strengthens existing creative practice, without infringing artistic exposition.
zh

[AI-60] Detecting Prompt Injection Attacks Against Application Using Classifiers

【速读】:该论文旨在解决生成式 AI(Generative AI)在集成到Web应用时面临的提示注入攻击(prompt injection attack)问题,此类攻击可能危及关键系统的安全性与稳定性。解决方案的关键在于构建并扩充了一个基于HackAPrompt Playground Submissions语料库的提示注入数据集,并训练多种机器学习分类器(包括LSTM、前馈神经网络、随机森林和朴素贝叶斯)以检测LLM集成Web应用中的恶意提示,从而提升对提示注入攻击的检测与缓解能力,增强目标应用和系统的防护水平。

链接: https://arxiv.org/abs/2512.12583
作者: Safwan Shaheer,G. M. Refatul Islam,Mohammad Rafid Hamid,Md. Abrar Faiaz Khan,Md. Omar Faruk,Yaseen Nur
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 9 pages, X figures; undergraduate research project on detecting prompt injection attacks against LLM integrated web applications using classical machine learning and neural classifiers

点击查看摘要

Abstract:Prompt injection attacks can compromise the security and stability of critical systems, from infrastructure to large web applications. This work curates and augments a prompt injection dataset based on the HackAPrompt Playground Submissions corpus and trains several classifiers, including LSTM, feed forward neural networks, Random Forest, and Naive Bayes, to detect malicious prompts in LLM integrated web applications. The proposed approach improves prompt injection detection and mitigation, helping protect targeted applications and systems.
zh

[AI-61] Large Language Newsvendor: Decision Biases and Cognitive Mechanisms

【速读】:该论文旨在解决生成式 AI(Generative AI)在商业决策中可能复制甚至放大人类认知偏差的问题,尤其是在供应链管理等高风险运营场景下。研究通过动态多轮实验,在经典的报童问题(newsvendor problem)框架中检验 GPT-4、GPT-4o 和 LLaMA-8B 三种大语言模型(Large Language Models, LLMs)的决策模式,发现它们不仅重现了“订货过低/过高”的经典偏差,还显著加剧了需求追逐(demand-chasing)等行为。关键发现是:这些偏差并非源于知识不足,而是由模型架构本身限制所致,且存在“智能悖论”——更复杂的 GPT-4 因过度思考表现出更强的非理性,而效率优化的 GPT-4o 表现接近最优。因此,解决方案的核心在于:根据任务特性选择模型(如优先选用效率优化型)、实施人机协同监督机制,并设计结构化规则提示(rule-based prompts)以约束模型的启发式倾向,从而提升 AI 辅助决策的可靠性。

链接: https://arxiv.org/abs/2512.12552
作者: Jifei Liu,Zhi Chen,Yuanguang Zhong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Problem definition: Although large language models (LLMs) are increasingly integrated into business decision making, their potential to replicate and even amplify human cognitive biases cautions a significant, yet not well-understood, risk. This is particularly critical in high-stakes operational contexts like supply chain management. To address this, we investigate the decision-making patterns of leading LLMs using the canonical newsvendor problem in a dynamic setting, aiming to identify the nature and origins of their cognitive biases. Methodology/results: Through dynamic, multi-round experiments with GPT-4, GPT-4o, and LLaMA-8B, we tested for five established decision biases. We found that LLMs consistently replicated the classic Too Low/Too High'' ordering bias and significantly amplified other tendencies like demand-chasing behavior compared to human benchmarks. Our analysis uncovered a paradox of intelligence’‘: the more sophisticated GPT-4 demonstrated the greatest irrationality through overthinking, while the efficiency-optimized GPT-4o performed near-optimally. Because these biases persist even when optimal formulas are provided, we conclude they stem from architectural constraints rather than knowledge gaps. Managerial implications: First, managers should select models based on the specific task, as our results show that efficiency-optimized models can outperform more complex ones on certain optimization problems. Second, the significant amplification of bias by LLMs highlights the urgent need for robust human-in-the-loop oversight in high-stakes decisions to prevent costly errors. Third, our findings suggest that designing structured, rule-based prompts is a practical and effective strategy for managers to constrain models’ heuristic tendencies and improve the reliability of AI-assisted decisions.
zh

[AI-62] World Models Unlock Optimal Forag ing Strategies in Reinforcement Learning Agents

【速读】:该论文旨在解决生物觅食行为中最优觅食策略(如边际价值理论,Marginal Value Theorem, MVT)的计算机制问题,即如何在生物体中实现符合生态最优原则的觅食决策。其解决方案的关键在于引入基于模型的强化学习(model-based reinforcement learning)框架,通过训练人工觅食者习得简洁的环境预测表征(predictive representation),从而自然涌现出与MVT一致的觅食策略。研究发现,这种策略的形成主要依赖于预测能力(anticipatory capabilities),而非单纯奖励最大化,表明构建可解释且符合生物学原理的AI决策系统,关键在于发展具备世界模型(world models)的代理结构。

链接: https://arxiv.org/abs/2512.12548
作者: Yesid Fonseca,Manuel S. Ríos,Nicanor Quijano,Luis F. Giraldo
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 14 pages, 6 figures

点击查看摘要

Abstract:Patch foraging involves the deliberate and planned process of determining the optimal time to depart from a resource-rich region and investigate potentially more beneficial alternatives. The Marginal Value Theorem (MVT) is frequently used to characterize this process, offering an optimality model for such foraging behaviors. Although this model has been widely used to make predictions in behavioral ecology, discovering the computational mechanisms that facilitate the emergence of optimal patch-foraging decisions in biological foragers remains under investigation. Here, we show that artificial foragers equipped with learned world models naturally converge to MVT-aligned strategies. Using a model-based reinforcement learning agent that acquires a parsimonious predictive representation of its environment, we demonstrate that anticipatory capabilities, rather than reward maximization alone, drive efficient patch-leaving behavior. Compared with standard model-free RL agents, these model-based agents exhibit decision patterns similar to many of their biological counterparts, suggesting that predictive world models can serve as a foundation for more explainable and biologically grounded decision-making in AI systems. Overall, our findings highlight the value of ecological optimality principles for advancing interpretable and adaptive AI.
zh

[AI-63] Skillful Subseasonal-to-Seasonal Forecasting of Extreme Events with a Multi-Sphere Coupled Probabilistic Model

【速读】:该论文旨在解决在加速气候变化背景下,准确进行次季节到季节(Subseasonal-to-Seasonal, S2S)极端事件预测的难题,这一问题因多圈层相互作用复杂性和大气内在不确定性而尤为突出。解决方案的关键在于提出了一种名为TianXing-S2S的多圈层耦合概率模型,该模型首先将多种圈层(如大气、陆面等)的预测因子编码至紧凑的潜在空间,随后利用扩散模型生成每日集合预报;其核心创新在于在去噪器中引入基于最优传输(Optimal Transport, OT)的新型耦合模块,以优化大气与多圈层边界条件之间的交互关系,从而显著提升对热浪和异常降水等极端事件的预测技能,并实现长达180天的稳定滚动预报。

链接: https://arxiv.org/abs/2512.12545
作者: Bin Mu,Yuxuan Chen,Shijin Yuan,Bo Qin,Hao Guo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph)
备注:

点击查看摘要

Abstract:Accurate subseasonal-to-seasonal (S2S) prediction of extreme events is critical for resource planning and disaster mitigation under accelerating climate change. However, such predictions remain challenging due to complex multi-sphere interactions and intrinsic atmospheric uncertainty. Here we present TianXing-S2S, a multi-sphere coupled probabilistic model for global S2S daily ensemble forecast. TianXing-S2S first encodes diverse multi-sphere predictors into a compact latent space, then employs a diffusion model to generate daily ensemble forecasts. A novel coupling module based on optimal transport (OT) is incorporated in the denoiser to optimize the interactions between atmospheric and multi-sphere boundary conditions. Across key atmospheric variables, TianXing-S2S outperforms both the European Centre for Medium-Range Weather Forecasts (ECMWF) S2S system and FuXi-S2S in 45-day daily-mean ensemble forecasts at 1.5 resolution. Our model achieves skillful subseasonal prediction of extreme events including heat waves and anomalous precipitation, identifying soil moisture as a critical precursor signal. Furthermore, we demonstrate that TianXing-S2S can generate stable rollout forecasts up to 180 days, establishing a robust framework for S2S research in a warming world.
zh

[AI-64] Diverse LLM s vs. Vulnerabilities: Who Detects and Fixes Them Better?

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在软件漏洞检测(Software Vulnerability Detection, SVD)与修复(Software Vulnerability Repair, SVR)中面临的问题:尽管单个LLM具备一定的代码理解能力,但在识别复杂漏洞和生成有效修复方案时仍存在较高错误率。为提升检测准确性和鲁棒性,作者提出DVDR-LLM——一种基于多模型集成的框架,其核心在于通过聚合来自多样化LLM的输出结果来降低错误率。实验表明,该方法在检测准确率上较单模型平均性能提升10–12%,尤其在多文件漏洞场景下显著改善召回率(+18%)和F1分数(+11.8%),但同时也带来假阳性减少与假阴性增加的权衡,需根据安全上下文调整模型间一致性阈值以优化整体性能。

链接: https://arxiv.org/abs/2512.12536
作者: Arastoo Zibaeirad,Marco Vieira
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly being studied for Software Vulnerability Detection (SVD) and Repair (SVR). Individual LLMs have demonstrated code understanding abilities, but they frequently struggle when identifying complex vulnerabilities and generating fixes. This study presents DVDR-LLM, an ensemble framework that combines outputs from diverse LLMs to determine whether aggregating multiple models reduces error rates. Our evaluation reveals that DVDR-LLM achieves 10-12% higher detection accuracy compared to the average performance of individual models, with benefits increasing as code complexity grows. For multi-file vulnerabilities, the ensemble approach demonstrates significant improvements in recall (+18%) and F1 score (+11.8%) over individual models. However, the approach raises measurable trade-offs: reducing false positives in verification tasks while simultaneously increasing false negatives in detection tasks, requiring careful decision on the required level of agreement among the LLMs (threshold) for increased performance across different security contexts. Artifact: this https URL Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI) Cite as: arXiv:2512.12536 [cs.SE] (or arXiv:2512.12536v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2512.12536 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-65] Noise-robust Contrastive Learning for Critical Transition Detection in Dynamical Systems

【速读】:该论文旨在解决复杂、噪声干扰的时间序列数据中关键转变(critical transitions)检测难题,尤其是传统对比学习方法因参数过多和对无关噪声敏感而导致的关键点识别不准确问题。其解决方案的关键在于提出一种基于奇异值分解(Singular Value Decomposition, SVD)构建的神经网络架构,并引入严格半正交性约束(strictly semi-orthogonality-constrained)的训练算法,从而在保持与传统对比学习相当检测性能的同时,显著提升模型轻量化程度和抗噪能力。

链接: https://arxiv.org/abs/2512.12523
作者: Wenqi Fang,Ye Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
备注: under revision

点击查看摘要

Abstract:Detecting critical transitions in complex, noisy time-series data is a fundamental challenge across science and engineering. Such transitions may be anticipated by the emergence of a low-dimensional order parameter, whose signature is often masked by high-amplitude stochastic variability. Standard contrastive learning approaches based on deep neural networks, while promising for detecting critical transitions, are often overparameterized and sensitive to irrelevant noise, leading to inaccurate identification of critical points. To address these limitations, we propose a neural network architecture, constructed using singular value decomposition technique, together with a strictly semi-orthogonality-constrained training algorithm, to enhance the performance of traditional contrastive learning. Extensive experiments demonstrate that the proposed method matches the performance of traditional contrastive learning techniques in identifying critical transitions, yet is considerably more lightweight and markedly more resistant to noise.
zh

[AI-66] Can You Keep a Secret? Exploring AI for Care Coordination in Cognitive Decline

【速读】:该论文旨在解决低收入老年人(lower-SES older adults)在认知衰退背景下如何通过辅助技术实现“居家养老”(age-in-place)的问题,核心挑战在于平衡老年人自主性与照护者负担之间的矛盾。解决方案的关键在于识别并利用一种称为“搭便车式协调”(piggybacking)的非正式照护策略——即通过自发重叠日常任务(如顺路购物或办事)减少照护者投入时间与精力。研究提出,智能代理(agents)若能协助规划和执行此类任务协同,有望延长老年人居家生活时间,同时不增加照护者的负担。

链接: https://arxiv.org/abs/2512.12510
作者: Alicia(Hyun Jin)Lee,Mai Lee Chang,Sreehana Mandava,Destiny Deshields,Hugo Simão,Aaron Steinfeld,Jodi Forlizzi,John Zimmerman
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 13 pages, 6 figures

点击查看摘要

Abstract:The increasing number of older adults who experience cognitive decline places a burden on informal caregivers, whose support with tasks of daily living determines whether older adults can remain in their homes. To explore how agents might help lower-SES older adults to age-in-place, we interviewed ten pairs of older adults experiencing cognitive decline and their informal caregivers. We explored how they coordinate care, manage burdens, and sustain autonomy and privacy. Older adults exercised control by delegating tasks to specific caregivers, keeping information about all the care they received from their adult children. Many abandoned some tasks of daily living, lowering their quality of life to ease caregiver burden. One effective strategy, piggybacking, uses spontaneous overlaps in errands to get more work done with less caregiver effort. This raises the questions: (i) Can agents help with piggyback coordination? (ii) Would it keep older adults in their homes longer, while not increasing caregiver burden?
zh

[AI-67] KidsArtBench: Multi-Dimensional Childrens Art Evaluation with Attribute-Aware MLLM s

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在评估儿童艺术表达能力方面的局限性问题。现有美学数据集多基于成人艺术作品并提供单一标量评分,难以满足教育场景中对儿童创作进行多维、精细化评价的需求。解决方案的关键在于构建KidsArtBench基准数据集,包含1000余幅5–15岁儿童画作,由12位专业教育者按9个维度评分并附带专家反馈评论;同时提出一种属性特定的多LoRA(Low-Rank Adaptation)方法,每个属性对应评分量表中的一个维度(如“现实感”、“想象力”),并引入回归感知微调(Regression-Aware Fine-Tuning, RAFT)以使预测结果与序数尺度对齐。实验证明,该方法在Qwen2.5-VL-7B模型上将相关性从0.468提升至0.653,尤其在感知类维度表现显著改善,验证了教师对齐监督和属性感知训练在教育AI中可实现更具教学意义的评估。

链接: https://arxiv.org/abs/2512.12503
作者: Mingrui Ye,Chanjin Zheng,Zengyi Yu,Chenyu Xiang,Zhixue Zhao,Zheng Yuan,Helen Yannakoudakis
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) show remarkable progress across many visual-language tasks; however, their capacity to evaluate artistic expression remains limited. Aesthetic concepts are inherently abstract and open-ended, and multimodal artwork annotations are scarce. We introduce KidsArtBench, a new benchmark of over 1k children’s artworks (ages 5-15) annotated by 12 expert educators across 9 rubric-aligned dimensions, together with expert comments for feedback. Unlike prior aesthetic datasets that provide single scalar scores on adult imagery, KidsArtBench targets children’s artwork and pairs multi-dimensional annotations with comment supervision to enable both ordinal assessment and formative feedback. Building on this resource, we propose an attribute-specific multi-LoRA approach, where each attribute corresponds to a distinct evaluation dimension (e.g., Realism, Imagination) in the scoring rubric, with Regression-Aware Fine-Tuning (RAFT) to align predictions with ordinal scales. On Qwen2.5-VL-7B, our method increases correlation from 0.468 to 0.653, with the largest gains on perceptual dimensions and narrowed gaps on higher-order attributes. These results show that educator-aligned supervision and attribute-aware training yield pedagogically meaningful evaluations and establish a rigorous testbed for sustained progress in educational AI. We release data and code with ethics documentation.
zh

[AI-68] SafeGen: Embedding Ethical Safeguards in Text-to-Image Generation

【速读】:该论文旨在解决生成式AI(Generative AI)在文本到图像生成系统中引发的伦理风险问题,包括放大社会偏见、制造高保真虚假信息以及侵犯知识产权等。其核心解决方案是提出SafeGen框架,该框架通过两个关键组件实现伦理约束与生成质量的协同优化:一是基于领域微调的BGE-M3文本分类器,用于过滤有害或误导性提示;二是经过优化的Hyper-SD扩散模型,确保生成图像在语义一致性与视觉保真度上的高性能。该方案以可信人工智能(Trustworthy AI)原则为设计基础,并依托多语言(英语-越南语)标注数据集和公平性感知训练流程,实证表明在保障内容安全的同时仍可维持高质量创作能力。

链接: https://arxiv.org/abs/2512.12501
作者: Dang Phuong Nam,Nguyen Kieu,Pham Thanh Hieu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generative Artificial Intelligence (AI) has created unprecedented opportunities for creative expression, education, and research. Text-to-image systems such as DALL.E, Stable Diffusion, and Midjourney can now convert ideas into visuals within seconds, but they also present a dual-use dilemma, raising critical ethical concerns: amplifying societal biases, producing high-fidelity disinformation, and violating intellectual property. This paper introduces SafeGen, a framework that embeds ethical safeguards directly into the text-to-image generation pipeline, grounding its design in established principles for Trustworthy AI. SafeGen integrates two complementary components: BGE-M3, a fine-tuned text classifier that filters harmful or misleading prompts, and Hyper-SD, an optimized diffusion model that produces high fidelity, semantically aligned images. Built on a curated multilingual (English- Vietnamese) dataset and a fairness-aware training process, SafeGen demonstrates that creative freedom and ethical responsibility can be reconciled within a single workflow. Quantitative evaluations confirm its effectiveness, with Hyper-SD achieving IS = 3.52, FID = 22.08, and SSIM = 0.79, while BGE-M3 reaches an F1-Score of 0.81. An ablation study further validates the importance of domain-specific fine-tuning for both modules. Case studies illustrate SafeGen’s practical impact in blocking unsafe prompts, generating inclusive teaching materials, and reinforcing academic integrity.
zh

[AI-69] Explainable AI as a Double-Edged Sword in Dermatology: The Impact on Clinicians versus The Public

【速读】:该论文旨在解决生成式 AI(Generative AI)在医疗诊断中因决策过程不透明而引发的“可解释性悖论”问题,即尽管可解释人工智能(Explainable AI, XAI)旨在提升人类对AI决策的理解与信任,但可能反而导致用户产生自动化偏倚(automation bias)或认知偏差。解决方案的关键在于通过两个大规模实验(共涉及623名普通公众和153名初级保健医生,PCPs),系统评估不同形式XAI解释(尤其是多模态大语言模型LLM提供的解释)对诊断准确性及公平性的影响。研究发现:AI辅助在肤色多样性背景下能提升准确率并减少诊断差异;然而,LLM解释的效果具有群体异质性——普通用户表现出显著的自动化偏倚(AI正确时表现更好,错误时更差),而经验丰富的医生则保持稳定受益;此外,先呈现AI建议会降低整体诊断质量,尤其当AI出错时。这表明XAI设计需考虑用户专业背景与交互时机,凸显了LLM作为“双刃剑”的特性,为未来人机协同医疗系统的设计提供了关键依据。

链接: https://arxiv.org/abs/2512.12500
作者: Xuhai Xu,Haoyu Hu,Haoran Zhang,Will Ke Wang,Reina Wang,Luis R. Soenksen,Omar Badri,Sheharbano Jafry,Elise Burger,Lotanna Nwandu,Apoorva Mehta,Erik P. Duhaime,Asif Qasim,Hause Lin,Janis Pereira,Jonathan Hershon,Paulius Mui,Alejandro A. Gru,Noémie Elhadad,Lena Mamykina,Matthew Groh,Philipp Tschandl,Roxana Daneshjou,Marzyeh Ghassemi
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Artificial intelligence (AI) is increasingly permeating healthcare, from physician assistants to consumer applications. Since AI algorithm’s opacity challenges human interaction, explainable AI (XAI) addresses this by providing AI decision-making insight, but evidence suggests XAI can paradoxically induce over-reliance or bias. We present results from two large-scale experiments (623 lay people; 153 primary care physicians, PCPs) combining a fairness-based diagnosis AI model and different XAI explanations to examine how XAI assistance, particularly multimodal large language models (LLMs), influences diagnostic performance. AI assistance balanced across skin tones improved accuracy and reduced diagnostic disparities. However, LLM explanations yielded divergent effects: lay users showed higher automation bias - accuracy boosted when AI was correct, reduced when AI erred - while experienced PCPs remained resilient, benefiting irrespective of AI accuracy. Presenting AI suggestions first also led to worse outcomes when the AI was incorrect for both groups. These findings highlight XAI’s varying impact based on expertise and timing, underscoring LLMs as a “double-edged sword” in medical AI and informing future human-AI collaborative system design.
zh

[AI-70] Mage: Cracking Elliptic Curve Cryptography with Cross-Axis Transformers

【速读】:该论文旨在解决当前广泛使用的椭圆曲线密码学(Elliptic Curve Cryptography, ECC)的安全性验证问题,特别是针对secp256r1标准下公私钥对的可逆性挑战。其核心问题是:现代机器学习模型是否能够通过学习公钥生成过程,实现对已知公钥到私钥的反向推导,从而揭示ECC算法潜在的脆弱性。解决方案的关键在于利用先进的语言模型架构(如跨轴变换器,Cross-Axis Transformers)来模拟和逆向工程椭圆曲线密钥生成机制,并测试模型对大量公私钥对的记忆能力与泛化能力,以此评估ECC在AI驱动攻击下的抗破解强度。

链接: https://arxiv.org/abs/2512.12483
作者: Lily Erickson
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 7 pages

点击查看摘要

Abstract:With the advent of machine learning and quantum computing, the 21st century has gone from a place of relative algorithmic security, to one of speculative unease and possibly, cyber catastrophe. Modern algorithms like Elliptic Curve Cryptography (ECC) are the bastion of current cryptographic security protocols that form the backbone of consumer protection ranging from Hypertext Transfer Protocol Secure (HTTPS) in the modern internet browser, to cryptographic financial instruments like Bitcoin. And there’s been very little work put into testing the strength of these ciphers. Practically the only study that I could find was on side-channel recognition, a joint paper from the University of Milan, Italy and King’s College, London\citebattistello2025ecc. These algorithms are already considered bulletproof by many consumers, but exploits already exist for them, and with computing power and distributed, federated compute on the rise, it’s only a matter of time before these current bastions fade away into obscurity, and it’s on all of us to stand up when we notice something is amiss, lest we see such passages claim victims in that process. In this paper, we seek to explore the use of modern language model architecture in cracking the association between a known public key, and its associated private key, by intuitively learning to reverse engineer the public keypair generation process, effectively solving the curve. Additonally, we attempt to ascertain modern machine learning’s ability to memorize public-private secp256r1 keypairs, and to then test their ability to reverse engineer the public keypair generation process. It is my belief that proof-for would be equally valuable as proof-against in either of these categories. Finally, we’ll conclude with some number crunching on where we see this particular field heading in the future. Comments: 7 pages Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI) Cite as: arXiv:2512.12483 [cs.CR] (or arXiv:2512.12483v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2512.12483 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Lily Erickson [view email] [v1] Sat, 13 Dec 2025 22:45:35 UTC (180 KB) Full-text links: Access Paper: View a PDF of the paper titled Mage: Cracking Elliptic Curve Cryptography with Cross-Axis Transformers, by Lily EricksonView PDFHTML (experimental)TeX Source view license Current browse context: cs.CR prev | next new | recent | 2025-12 Change to browse by: cs cs.AI References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status
zh

[AI-71] MetaHGNIE: Meta-Path Induced Hypergraph Contrastive Learning in Heterogeneous Knowledge Graphs

【速读】:该论文旨在解决异质知识图谱(heterogeneous knowledge graph)中节点重要性估计(Node Importance Estimation, NIE)的难题,现有方法通常仅依赖成对连接关系,忽视了多实体与关系之间的高阶依赖,并且独立处理结构信号与语义信号,导致跨模态融合效果不佳。其解决方案的关键在于提出 MetaHGNIE 框架,通过元路径(meta-path)诱导构建超图(hypergraph),以 typed 超边捕获多实体关联上下文;利用局部注意力聚合结构依赖,通过带稀疏分块(sparse chunking)的超图 Transformer 编码语义表示;最终在对比学习框架下引入辅助监督信号,实现结构与语义嵌入的鲁棒对齐,从而有效建模高阶交互并增强跨模态整合能力。

链接: https://arxiv.org/abs/2512.12477
作者: Jiawen Chen,Yanyan He,Qi Shao,Mengli Wei,Duxin Chen,Wenwu Yu,Yanlong Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Node importance estimation (NIE) in heterogeneous knowledge graphs is a critical yet challenging task, essential for applications such as recommendation, knowledge reasoning, and question answering. Existing methods often rely on pairwise connections, neglecting high-order dependencies among multiple entities and relations, and they treat structural and semantic signals independently, hindering effective cross-modal integration. To address these challenges, we propose MetaHGNIE, a meta-path induced hypergraph contrastive learning framework for disentangling and aligning structural and semantic information. MetaHGNIE constructs a higher-order knowledge graph via meta-path sequences, where typed hyperedges capture multi-entity relational contexts. Structural dependencies are aggregated with local attention, while semantic representations are encoded through a hypergraph transformer equipped with sparse chunking to reduce redundancy. Finally, a multimodal fusion module integrates structural and semantic embeddings under contrastive learning with auxiliary supervision, ensuring robust cross-modal alignment. Extensive experiments on benchmark NIE datasets demonstrate that MetaHGNIE consistently outperforms state-of-the-art baselines. These results highlight the effectiveness of explicitly modeling higher-order interactions and cross-modal alignment in heterogeneous knowledge graphs. Our code is available at this https URL
zh

[AI-72] AI-Driven Real-Time Kick Classification in Olympic Taekwondo Using Sensor Fusion

【速读】:该论文旨在解决奥林匹克跆拳道(Taekwondo)在观众参与度方面的挑战,主要源于静态、防守型的比赛风格以及评分规则的争议性。当前的保护与计分系统(Protector and Scoring System, PSS)依赖于冲击传感器和简单逻辑判断,导致运动员倾向于安全策略,削弱了项目的动态性和观赏性。解决方案的关键在于构建一个基于人工智能(AI)的新型计分系统,通过传感器融合框架整合现有PSS传感器与额外的加速度计、陀螺仪、磁/射频识别(RFID)及冲击力传感器,实现对踢技的实时分类,包括技术类型、击中部位、冲击力度甚至脚部接触区域。该系统采用支持向量机(Support Vector Machine, SVM)驱动的机器学习流水线,结合创新的踢技评分规则(如奖励旋转和腾空踢等高难度动作),从而提升评分公平性、减少规则滥用、鼓励更具动态性的技术,并增强观众理解与观赛体验。

链接: https://arxiv.org/abs/2512.12474
作者: Jamsheed Mistri
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 13 pages, 4 figures

点击查看摘要

Abstract:Olympic Taekwondo has faced challenges in spectator engagement due to static, defensive gameplay and contentious scoring. Current Protector and Scoring Systems (PSS) rely on impact sensors and simplistic logic, encouraging safe strategies that diminish the sport’s dynamism. This paper proposes an AI-powered scoring system that integrates existing PSS sensors with additional accelerometers, gyroscopes, magnetic/RFID, and impact force sensors in a sensor fusion framework. The system classifies kicks in real-time to identify technique type, contact location, impact force, and even the part of the foot used. A machine learning pipeline employing sensor fusion and Support Vector Machines (SVMs) is detailed, enabling automatic kick technique recognition for scoring. We present a novel kick scoring rubric that awards points based on specific kick techniques (e.g., turning and spinning kicks) to incentivize dynamic attacks. Drawing on a 2024 study achieving 96-98% accuracy, we validate the feasibility of real-time kick classification and further propose enhancements to this methodology, such as ensemble SVM classifiers and expanded datasets, to achieve the high-stakes accuracy required by the sport. We analyze how the proposed system can improve scoring fairness, reduce rule exploitation and illegitimate tactics, encourage more dynamic techniques, and enhance spectator understanding and excitement. The paper includes system design illustrations, a kick scoring table from an AI-augmented rule set, and discusses anticipated impacts on Olympic Taekwondo.
zh

[AI-73] Exploring the Design Space of Transition Matching

【速读】:该论文旨在解决生成式建模中如何提升模型表达能力与推理效率的问题,特别是在过渡匹配(Transition Matching, TM)框架下优化生成过程的性能。其核心挑战在于设计一个高效且高保真的生成过渡机制,以在保持时间连续性的同时实现高质量样本生成。解决方案的关键在于引入一个轻量级“头模块”(head module),配合大型骨干网络(backbone network)共同完成生成过渡步骤:通过将复杂过渡过程分解为由小规模模块执行的局部变换,显著增强模型的表达能力;同时,实验表明,采用MLP结构的头模块、特定的时间加权训练策略以及高频随机采样器组合,在生成质量、训练稳定性和推理效率上均达到最优表现,优于现有扩散模型和流匹配模型等基线方法。

链接: https://arxiv.org/abs/2512.12465
作者: Uriel Singer,Yaron Lipman
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Transition Matching ™ is an emerging paradigm for generative modeling that generalizes diffusion and flow-matching models as well as continuous-state autoregressive models. TM, similar to previous paradigms, gradually transforms noise samples to data samples, however it uses a second ``internal’’ generative model to implement the transition steps, making the transitions more expressive compared to diffusion and flow models. To make this paradigm tractable, TM employs a large backbone network and a smaller “head” module to efficiently execute the generative transition step. In this work, we present a large-scale, systematic investigation into the design, training and sampling of the head in TM frameworks, focusing on its time-continuous bidirectional variant. Through comprehensive ablations and experimentation involving training 56 different 1.7B text-to-image models (resulting in 549 unique evaluations) we evaluate the affect of the head module architecture and modeling during training as-well as a useful family of stochastic TM samplers. We analyze the impact on generation quality, training, and inference efficiency. We find that TM with an MLP head, trained with a particular time weighting and sampled with high frequency sampler provides best ranking across all metrics reaching state-of-the-art among all tested baselines, while Transformer head with sequence scaling and low frequency sampling is a runner up excelling at image aesthetics. Lastly, we believe the experiments presented highlight the design aspects that are likely to provide most quality and efficiency gains, while at the same time indicate what design choices are not likely to provide further gains.
zh

[AI-74] Dynamical modeling of nonlinear latent factors in multiscale neural activity with real-time inference

【速读】:该论文旨在解决多模态神经时间序列数据(如离散的尖峰活动和连续的场电位)在实时解码目标变量时面临的挑战,特别是不同模态间存在采样率差异、概率分布不一致以及部分时间步缺失样本等问题。现有非线性多模态神经活动模型无法有效处理这些异构特性,且多数不支持实时解码。其解决方案的关键在于提出一个学习框架,包含三个核心组件:1)多尺度编码器(multiscale encoder),通过学习模态内动态特性来非线性聚合信息并处理不同采样率与缺失数据;2)多尺度动力学主干(multiscale dynamical backbone),提取跨模态的时间动态特征并实现递归式实时解码;3)模态特异性解码器(modality-specific decoders),以适应各模态的概率分布差异。该框架在模拟数据及三个真实脑科学多尺度数据集上均验证了其在异构多模态信息融合中的有效性,并显著优于多种线性和非线性基准方法。

链接: https://arxiv.org/abs/2512.12462
作者: Eray Erturk,Maryam M. Shanechi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
备注: Published at the 39th Annual Conference on Neural Information Processing Systems 2025. Code is available at this https URL

点击查看摘要

Abstract:Real-time decoding of target variables from multiple simultaneously recorded neural time-series modalities, such as discrete spiking activity and continuous field potentials, is important across various neuroscience applications. However, a major challenge for doing so is that different neural modalities can have different timescales (i.e., sampling rates) and different probabilistic distributions, or can even be missing at some time-steps. Existing nonlinear models of multimodal neural activity do not address different timescales or missing samples across modalities. Further, some of these models do not allow for real-time decoding. Here, we develop a learning framework that can enable real-time recursive decoding while nonlinearly aggregating information across multiple modalities with different timescales and distributions and with missing samples. This framework consists of 1) a multiscale encoder that nonlinearly aggregates information after learning within-modality dynamics to handle different timescales and missing samples in real time, 2) a multiscale dynamical backbone that extracts multimodal temporal dynamics and enables real-time recursive decoding, and 3) modality-specific decoders to account for different probabilistic distributions across modalities. In both simulations and three distinct multiscale brain datasets, we show that our model can aggregate information across modalities with different timescales and distributions and missing samples to improve real-time target decoding. Further, our method outperforms various linear and nonlinear multimodal benchmarks in doing so.
zh

[AI-75] Cross-Modal Representational Knowledge Distillation for Enhanced Spike-Informed LFP Modeling

【速读】:该论文旨在解决局部场电位(Local Field Potentials, LFPs)在神经建模中因群体性特征导致的预测能力较弱的问题,尤其是在下游任务(如运动行为预测)中的表现不如尖峰活动(spiking activity)。其解决方案的关键在于提出一种跨模态知识蒸馏框架,通过将预训练多会话尖峰Transformer模型(teacher spike model)中的高保真表征知识迁移至LFP Transformer模型(student LFP model),从而显著提升LFP模型的性能。具体而言,教师模型采用会话特定的神经标记化策略与掩码自编码目标进行训练,随后通过隐空间对齐使学生模型学习到教师模型的语义表示,最终实现无需额外蒸馏即可在新会话中保持优异性能,且优于单/多会话LFP基线模型。

链接: https://arxiv.org/abs/2512.12461
作者: Eray Erturk,Saba Hashemi,Maryam M. Shanechi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
备注: Published at the 39th Annual Conference on Neural Information Processing Systems 2025. Code is available at this https URL

点击查看摘要

Abstract:Local field potentials (LFPs) can be routinely recorded alongside spiking activity in intracortical neural experiments, measure a larger complementary spatiotemporal scale of brain activity for scientific inquiry, and can offer practical advantages over spikes, including greater long-term stability, robustness to electrode degradation, and lower power requirements. Despite these advantages, recent neural modeling frameworks have largely focused on spiking activity since LFP signals pose inherent modeling challenges due to their aggregate, population-level nature, often leading to lower predictive power for downstream task variables such as motor behavior. To address this challenge, we introduce a cross-modal knowledge distillation framework that transfers high-fidelity representational knowledge from pretrained multi-session spike transformer models to LFP transformer models. Specifically, we first train a teacher spike model across multiple recording sessions using a masked autoencoding objective with a session-specific neural tokenization strategy. We then align the latent representations of the student LFP model to those of the teacher spike model. Our results show that the Distilled LFP models consistently outperform single- and multi-session LFP baselines in both fully unsupervised and supervised settings, and can generalize to other sessions without additional distillation while maintaining superior performance. These findings demonstrate that cross-modal knowledge distillation is a powerful and scalable approach for leveraging high-performing spike models to develop more accurate LFP models.
zh

[AI-76] AI Transparency Atlas: Framework Scoring and Real-Time Model Card Evaluation Pipeline

【速读】:该论文旨在解决当前AI模型文档在不同平台间碎片化、结构不一致的问题,从而阻碍政策制定者、审计人员和用户对模型安全性、数据来源及版本变更的可靠评估。其解决方案的关键在于构建一个基于欧盟《人工智能法案》附件IV和斯坦福透明度指数的加权透明性框架,该框架包含8个核心模块和23个子项,优先强调安全关键披露(如安全评估占25%权重、关键风险占20%),并通过自动化多智能体管道从公开源提取文档并利用大语言模型(LLM)共识评分来量化完整性,实现了低成本、高效率的跨模型透明度评估。

链接: https://arxiv.org/abs/2512.12443
作者: Akhmadillo Mamirov,Faiaz Azmain,Hanyu Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:AI model documentation is fragmented across platforms and inconsistent in structure, preventing policymakers, auditors, and users from reliably assessing safety claims, data provenance, and version-level changes. We analyzed documentation from five frontier models (Gemini 3, Grok 4.1, Llama 4, GPT-5, and Claude 4.5) and 100 Hugging Face model cards, identifying 947 unique section names with extreme naming variation. Usage information alone appeared under 97 distinct labels. Using the EU AI Act Annex IV and the Stanford Transparency Index as baselines, we developed a weighted transparency framework with 8 sections and 23 subsections that prioritizes safety-critical disclosures (Safety Evaluation: 25%, Critical Risk: 20%) over technical specifications. We implemented an automated multi-agent pipeline that extracts documentation from public sources and scores completeness through LLM-based consensus. Evaluating 50 models across vision, multimodal, open-source, and closed-source systems cost less than 3 in total and revealed systematic gaps. Frontier labs (xAI, Microsoft, Anthropic) achieve approximately 80% compliance, while most providers fall below 60%. Safety-critical categories show the largest deficits: deception behaviors, hallucinations, and child safety evaluations account for 148, 124, and 116 aggregate points lost, respectively, across all evaluated models.
zh

[AI-77] Rough Sets for Explainability of Spectral Graph Clustering

【速读】:该论文旨在解决图谱聚类(Graph Spectral Clustering, GSC)方法在文本文档聚类应用中可解释性差的问题,尤其是由于谱空间嵌入与文档内容之间缺乏直观关联,以及无明确语义的文档和聚类算法的随机性进一步削弱了结果的可解释性。解决方案的关键在于借鉴粗糙集理论(Rough Set Theory),对先前团队提出的解释方法进行增强,从而有效缓解上述问题,提升聚类结果的可理解性和稳定性。

链接: https://arxiv.org/abs/2512.12436
作者: Bartłomiej Starosta,Sławomir T. Wierzchoń,Piotr Borkowski,Dariusz Czerski,Marcin Sydow,Eryk Laskowski,Mieczysław A. Kłopotek
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 24 figures, 21tables

点击查看摘要

Abstract:Graph Spectral Clustering methods (GSC) allow representing clusters of diverse shapes, densities, etc. However, the results of such algorithms, when applied e.g. to text documents, are hard to explain to the user, especially due to embedding in the spectral space which has no obvious relation to document contents. Furthermore, the presence of documents without clear content meaning and the stochastic nature of the clustering algorithms deteriorate explainability. This paper proposes an enhancement to the explanation methodology, proposed in an earlier research of our team. It allows us to overcome the latter problems by taking inspiration from rough set theory.
zh

[AI-78] Understanding Critical Thinking in Generative Artificial Intelligence Use: Development Validation and Correlates of the Critical Thinking in AI Use Scale

【速读】:该论文旨在解决生成式 AI (Generative AI) 在日常工作中日益普及背景下,用户对 AI 输出内容缺乏批判性评估能力的问题。其核心挑战在于,尽管生成式 AI 具备高度流畅性和实用性,但其黑箱特性与易产生幻觉(hallucination)的倾向要求使用者必须具备识别信息来源、理解模型运作机制及反思依赖 AI 的潜在影响的能力。解决方案的关键在于提出并验证了一个包含 13 个项目、具有三维度结构(验证 Verification、动机 Motivation 和反思 Reflection)的“AI 使用中的批判性思维量表”,并通过六项研究系统地确立了该量表的心理测量学属性及其与人格特质(如开放性、外向性)、AI 使用频率以及实际验证行为之间的关联,从而为理论检验、跨群体比较和纵向研究提供了可靠工具与生态效度高的任务范式。

链接: https://arxiv.org/abs/2512.12413
作者: Gabriel R. Lau,Wei Yan Low,Louis Tay,Ysabel Guevarra,Dragan Gašević,Andree Hartanto
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Generative AI tools are increasingly embedded in everyday work and learning, yet their fluency, opacity, and propensity to hallucinate mean that users must critically evaluate AI outputs rather than accept them at face value. The present research conceptualises critical thinking in AI use as a dispositional tendency to verify the source and content of AI-generated information, to understand how models work and where they fail, and to reflect on the broader implications of relying on AI. Across six studies (N = 1365), we developed and validated the 13-item critical thinking in AI use scale and mapped its nomological network. Study 1 generated and content-validated scale items. Study 2 supported a three-factor structure (Verification, Motivation, and Reflection). Studies 3, 4, and 5 confirmed this higher-order model, demonstrated internal consistency and test-retest reliability, strong factor loadings, sex invariance, and convergent and discriminant validity. Studies 3 and 4 further revealed that critical thinking in AI use was positively associated with openness, extraversion, positive trait affect, and frequency of AI use. Lastly, Study 6 demonstrated criterion validity of the scale, with higher critical thinking in AI use scores predicting more frequent and diverse verification strategies, greater veracity-judgement accuracy in a novel and naturalistic ChatGPT-powered fact-checking task, and deeper reflection about responsible AI. Taken together, the current work clarifies why and how people exercise oversight over generative AI outputs and provides a validated scale and ecologically grounded task paradigm to support theory testing, cross-group, and longitudinal research on critical engagement with generative AI outputs.
zh

[AI-79] Feeling the Strength but Not the Source: Partial Introspection in LLM s

【速读】:该论文旨在验证 Anthropic 提出的“涌现式自省”(emergent introspection)现象是否稳健,即大语言模型能否识别并命名通过激活方向注入的概念。其核心问题是:这种自省能力是否依赖于模型规模或能力,以及其鲁棒性和泛化性如何。解决方案的关键在于系统性地复现 Anthropic 的实验结果,并在此基础上进行多维度测试:首先在 Meta-Llama-3.1-8B-Instruct 上复现原实验,证实小模型也能实现约 20% 的概念识别准确率;其次通过改变推理提示发现自省行为对 prompt 极为敏感,性能在相近任务中显著下降;最后揭示了一种“部分自省”(partial introspection)现象——模型可稳定分类注入向量强度(弱/中/强/极强),准确率达 70%,远超随机基线(25%)。这些发现表明,尽管模型能计算内部表征与基线的函数关系,但其自我报告具有局限性且高度依赖输入提示。

链接: https://arxiv.org/abs/2512.12411
作者: Ely Hahami,Lavik Jain,Ishaan Sinha
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 7 pages (+ 5 pages for appendix), 5 figures, 1 table

点击查看摘要

Abstract:Recent work from Anthropic claims that frontier models can sometimes detect and name injected “concepts” represented as activation directions. We test the robustness of these claims. First, we reproduce Anthropic’s multi-turn “emergent introspection” result on Meta-Llama-3.1-8B-Instruct, finding that the model identifies and names the injected concept 20 percent of the time under Anthropic’s original pipeline, exactly matching their reported numbers and thus showing that introspection is not exclusive to very large or capable models. Second, we systematically vary the inference prompt and find that introspection is fragile: performance collapses on closely related tasks such as multiple-choice identification of the injected concept or different prompts of binary discrimination of whether a concept was injected at all. Third, we identify a contrasting regime of partial introspection: the same model can reliably classify the strength of the coefficient of a normalized injected concept vector (as weak / moderate / strong / very strong) with up to 70 percent accuracy, far above the 25 percent chance baseline. Together, these results provide more evidence for Anthropic’s claim that language models effectively compute a function of their baseline, internal representations during introspection; however, these self-reports about those representations are narrow and prompt-sensitive. Our code is available at this https URL.
zh

[AI-80] Entropy Collapse: A Universal Failure Mode of Intelligent Systems

【速读】:该论文旨在解决智能系统(intelligent systems)在演化过程中普遍存在的“熵崩溃”(entropy collapse)问题,即随着系统通过反馈放大机制不断优化和学习,其适应性反而会因有效可变维度的收缩而急剧下降,导致系统僵化、失效。解决方案的关键在于识别出熵崩溃是一种由反馈放大速率超过有限新颖性再生速率所引发的普适动力学失效模式,并通过理论分析揭示了系统从高熵自适应状态向低熵坍缩状态的尖锐相变特征;进一步证明该过程具有临界阈值、动力学不可逆性和吸引子结构,从而将人工智能中的模型坍缩、经济学中的制度僵化与进化生物学中的遗传瓶颈等看似不相关的现象统一为同一底层机制,为设计具备长期适应能力的智能系统提供了熵感知的设计原则。

链接: https://arxiv.org/abs/2512.12381
作者: Truong Xuan Khanh,Truong Quynh Hoa
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 18 pages, 5 figures

点击查看摘要

Abstract:Intelligent systems are widely assumed to improve through learning, coordination, and optimization. However, across domains – from artificial intelligence to economic institutions and biological evolution – increasing intelligence often precipitates paradoxical degradation: systems become rigid, lose adaptability, and fail unexpectedly. We identify \emphentropy collapse as a universal dynamical failure mode arising when feedback amplification outpaces bounded novelty regeneration. Under minimal domain-agnostic assumptions, we show that intelligent systems undergo a sharp transition from high-entropy adaptive regimes to low-entropy collapsed regimes. Collapse is formalized as convergence toward a stable low-entropy manifold, not a zero-entropy state, implying a contraction of effective adaptive dimensionality rather than loss of activity or scale. We analytically establish critical thresholds, dynamical irreversibility, and attractor structure and demonstrate universality across update mechanisms through minimal simulations. This framework unifies diverse phenomena – model collapse in AI, institutional sclerosis in economics, and genetic bottlenecks in evolution – as manifestations of the same underlying process. By reframing collapse as a structural cost of intelligence, our results clarify why late-stage interventions systematically fail and motivate entropy-aware design principles for sustaining long-term adaptability in intelligent systems. \noindent\textbfKeywords: entropy collapse; intelligent systems; feedback amplification; phase transitions; effective dimensionality; complex systems; model collapse; institutional sclerosis Comments: 18 pages, 5 figures Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2512.12381 [cs.AI] (or arXiv:2512.12381v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2512.12381 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Xuan Khanh Truong [view email] [v1] Sat, 13 Dec 2025 16:12:27 UTC (200 KB) Full-text links: Access Paper: View a PDF of the paper titled Entropy Collapse: A Universal Failure Mode of Intelligent Systems, by Truong Xuan Khanh and 1 other authorsView PDFHTML (experimental)TeX Source view license Current browse context: cs.AI prev | next new | recent | 2025-12 Change to browse by: cs References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status
zh

[AI-81] Dynamic Homophily with Imperfect Recall: Modeling Resilience in Adversarial Networks

【速读】:该论文旨在解决复杂网络在同质性(homophily)、记忆约束(memory constraints)与对抗性干扰(adversarial disruptions)共同作用下的鲁棒性(resilience)与适应性(adaptability)问题。其解决方案的关键在于构建一个整合显式记忆衰减机制的同质性模型框架,并通过系统性实验验证不同衰减函数、重连概率和相似性度量(如余弦相似度 cosine similarity 与 Jaccard 相似度)对网络性能的影响。研究发现,余弦相似度在稀疏、凸形及模块化网络中可使稳定性指标提升达30%,且通过优化“召回价值”(value-of-recall)指标,证明策略性遗忘能够有效平衡网络鲁棒性与适应能力,从而为社交系统、协作平台和网络安全等实际应用场景提供可操作的优化依据。

链接: https://arxiv.org/abs/2512.12332
作者: Saad Alqithami
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Information Theory (cs.IT)
备注:

点击查看摘要

Abstract:The purpose of this study is to investigate how homophily, memory constraints, and adversarial disruptions collectively shape the resilience and adaptability of complex networks. To achieve this, we develop a new framework that integrates explicit memory decay mechanisms into homophily-based models and systematically evaluate their performance across diverse graph structures and adversarial settings. Our methods involve extensive experimentation on synthetic datasets, where we vary decay functions, reconnection probabilities, and similarity measures, primarily comparing cosine similarity with traditional metrics such as Jaccard similarity and baseline edge weights. The results show that cosine similarity achieves up to a 30% improvement in stability metrics in sparse, convex, and modular networks. Moreover, the refined value-of-recall metric demonstrates that strategic forgetting can bolster resilience by balancing network robustness and adaptability. The findings underscore the critical importance of aligning memory and similarity parameters with the structural and adversarial dynamics of the network. By quantifying the tangible benefits of incorporating memory constraints into homophily-based analyses, this study offers actionable insights for optimizing real-world applications, including social systems, collaborative platforms, and cybersecurity contexts.
zh

[AI-82] UniMark: Artificial Intelligence Generated Content Identification Toolkit

【速读】:该论文旨在解决生成式 AI (Generative AI) 内容泛滥所引发的信任危机与监管合规需求,现有识别工具存在碎片化且缺乏可见合规标记支持的问题。解决方案的关键在于提出 UniMark——一个开源的多模态内容治理统一框架,其核心创新是采用新颖的双操作策略,原生支持隐式水印(Hidden Watermarking)用于版权保护和显式标记(Visible Marking)用于监管合规,并通过三个专业化基准测试(Image/Video/Audio-Bench)构建标准化评估体系,从而实现从先进算法到工程落地的高效衔接。

链接: https://arxiv.org/abs/2512.12324
作者: Meilin Li,Ji He,Jia Xu,Shanzhe Lei,Yan Teng,Yingchun Wang,Xuhong Wang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 5 Pages

点击查看摘要

Abstract:The rapid proliferation of Artificial Intelligence Generated Content has precipitated a crisis of trust and urgent regulatory demands. However, existing identification tools suffer from fragmentation and a lack of support for visible compliance marking. To address these gaps, we introduce the \textbfUniMark, an open-source, unified framework for multimodal content governance. Our system features a modular unified engine that abstracts complexities across text, image, audio, and video modalities. Crucially, we propose a novel dual-operation strategy, natively supporting both \emphHidden Watermarking for copyright protection and \emphVisible Marking for regulatory compliance. Furthermore, we establish a standardized evaluation framework with three specialized benchmarks (Image/Video/Audio-Bench) to ensure rigorous performance assessment. This toolkit bridges the gap between advanced algorithms and engineering implementation, fostering a more transparent and secure digital ecosystem.
zh

[AI-83] Quantum-Aware Generative AI for Materials Discovery: A Framework for Robust Exploration Beyond DFT Biases

【速读】:该论文旨在解决传统生成式材料发现模型依赖于密度泛函理论(Density Functional Theory, DFT)数据训练所导致的系统性偏差问题,尤其在强关联体系中DFT预测常出现定性错误,从而限制了新材料的可靠探索。其解决方案的关键在于提出一种量子感知的生成式AI框架,通过紧密集成多保真度学习(multi-fidelity learning)与主动验证机制(active validation),利用基于扩散模型的生成器结合量子力学描述符,并采用在多层次理论(PBE、SCAN、HSE06、CCSD(T))上训练的等变神经网络势作为验证器,构建了一个能量化并聚焦低-高保真度预测差异的鲁棒主动学习循环,从而显著提升在DFT失效区域(如强关联氧化物)中稳定候选材料的发现效率。

链接: https://arxiv.org/abs/2512.12288
作者: Mahule Roy,Guillaume Lambard
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 33 pages

点击查看摘要

Abstract:Conventional generative models for materials discovery are predominantly trained and validated using data from Density Functional Theory (DFT) with approximate exchange-correlation functionals. This creates a fundamental bottleneck: these models inherit DFT’s systematic failures for strongly correlated systems, leading to exploration biases and an inability to discover materials where DFT predictions are qualitatively incorrect. We introduce a quantum-aware generative AI framework that systematically addresses this limitation through tight integration of multi-fidelity learning and active validation. Our approach employs a diffusion-based generator conditioned on quantum-mechanical descriptors and a validator using an equivariant neural network potential trained on a hierarchical dataset spanning multiple levels of theory (PBE, SCAN, HSE06, CCSD(T)). Crucially, we implement a robust active learning loop that quantifies and targets the divergence between low- and high-fidelity predictions. We conduct comprehensive ablation studies to deconstruct the contribution of each component, perform detailed failure mode analysis, and benchmark our framework against state-of-the-art generative models (CDVAE, GNoME, DiffCSP) across several challenging material classes. Our results demonstrate significant practical gains: a 3-5x improvement in successfully identifying potentially stable candidates in high-divergence regions (e.g., correlated oxides) compared to DFT-only baselines, while maintaining computational feasibility. This work provides a rigorous, transparent framework for extending the effective search space of computational materials discovery beyond the limitations of single-fidelity models.
zh

[AI-84] Fractional Differential Equation Physics-Informed Neural Network and Its Application in Battery State Estimation

【速读】:该论文旨在解决锂离子电池状态估计中因传统数据驱动神经网络模型难以准确刻画电化学过程的非线性特性与记忆依赖动力学而导致预测精度低、物理可解释性差的问题。其解决方案的关键在于提出一种新型神经架构——分数阶微分方程物理信息神经网络(Fractional Differential Equation Physics-Informed Neural Network, FDIFF-PINN),通过将分数阶微积分与深度学习相结合,在构建基于分数阶等效电路模型的离散化分数阶偏微分方程基础上,实现对电池动态工况下复杂行为的高精度建模与物理机制嵌入,从而提升SOC估计的准确性与可解释性。

链接: https://arxiv.org/abs/2512.12285
作者: Lujuan Dang,Zilai Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate estimation of the State of Charge (SOC) is critical for ensuring the safety, reliability, and performance optimization of lithium-ion battery systems. Conventional data-driven neural network models often struggle to fully characterize the inherent complex nonlinearities and memory-dependent dynamics of electrochemical processes, significantly limiting their predictive accuracy and physical interpretability under dynamic operating conditions. To address this challenge, this study proposes a novel neural architecture termed the Fractional Differential Equation Physics-Informed Neural Network (FDIFF-PINN), which integrates fractional calculus with deep learning. The main contributions of this paper include: (1) Based on a fractional-order equivalent circuit model, a discretized fractional-order partial differential equation is constructed. (2) Comparative experiments were conducted using a dynamic charge/discharge dataset of Panasonic 18650PF batteries under multi-temperature conditions (from -10 ^\circ C to 20 ^\circ C).
zh

[AI-85] GRC-Net: Gram Residual Co-attention Net for epilepsy prediction

【速读】:该论文旨在解决基于脑电图(EEG)信号预测癫痫的问题,传统方法通常对整个EEG信号进行一维处理,难以有效建模信号在不同维度间的关联性并保留时间依赖性。其解决方案的关键在于:首先采用Gram矩阵方法将一维EEG信号转换为三维表示,从而在保持时间依赖性的前提下捕捉跨维度的信号关系;其次通过多层级特征提取策略,利用协同注意力机制(coattention)捕获全局信号特征,结合Inception结构处理局部信号,实现多粒度特征提取。实验表明,所提出的GRC-Net模型在BONN数据集上对五分类任务的准确率达到93.66%,优于现有方法。

链接: https://arxiv.org/abs/2512.12273
作者: Bihao You,Jiping Cui
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Prediction of epilepsy based on electroencephalogram (EEG) signals is a rapidly evolving field. Previous studies have traditionally applied 1D processing to the entire EEG signal. However, we have adopted the Gram Matrix method to transform the signals into a 3D representation, enabling modeling of signal relationships across dimensions while preserving the temporal dependencies of the one-dimensional signals. Additionally, we observed an imbalance between local and global signals within the EEG data. Therefore, we introduced multi-level feature extraction, utilizing coattention for capturing global signal characteristics and an inception structure for processing local signals, achieving multi-granular feature extraction. Our experiments on the BONN dataset demonstrate that for the most challenging five-class classification task, GRC-Net achieved an accuracy of 93.66%, outperforming existing methods.
zh

[AI-86] A Multi-Axial Mindset for Ontology Design Lessons from Wikidatas Polyhierarchical Structure

【速读】:该论文旨在解决传统本体设计中单一顶层分类体系(如连续体 vs. 现象体、抽象 vs. 具体)所导致的结构刚性与扩展性不足的问题,这些问题限制了知识图谱在协作构建和动态演化场景下的适应能力。其解决方案的关键在于采用多轴向(multi-axial)和多层级(polyhierarchical)的架构设计,即不强制统一的顶层分类,而是允许多个分类维度同时作用于同一实体,并共享一个根类“entity”,从而实现更灵活、可扩展且模块化的本体构建方式,尤其适用于Wikidata这类开放协作的知识图谱系统。

链接: https://arxiv.org/abs/2512.12260
作者: Ege Atacan Doğan,Peter F. Patel-Schneider
机构: 未知
类目: Artificial Intelligence (cs.AI); Databases (cs.DB)
备注:

点击查看摘要

Abstract:Traditional ontology design emphasizes disjoint and exhaustive top-level distinctions such as continuant vs. occurrent, abstract vs. concrete, or type vs. instance. These distinctions are used to structure unified hierarchies where every entity is classified under a single upper-level category. Wikidata, by contrast, does not enforce a singular foundational taxonomy. Instead, it accommodates multiple classification axes simultaneously under the shared root class entity. This paper analyzes the structural implications of Wikidata’s polyhierarchical and multi-axial design. The Wikidata architecture enables a scalable and modular approach to ontology construction, especially suited to collaborative and evolving knowledge graphs.
zh

[AI-87] A Geometric Theory of Cognition

【速读】:该论文试图解决的问题是:如何统一解释人类认知中多种复杂过程(如感知、记忆、直觉判断、推理、决策和社会推断)背后的计算机制,这些过程传统上依赖于互不相关的理论模型。其解决方案的关键在于提出一个统一的数学框架,将认知状态建模为具有学习得到的黎曼度量(Riemannian metric)的微分流形上的点,该度量编码了表征约束、计算成本及认知变量间的结构关系;同时定义了一个标量认知势能(cognitive potential),融合预测准确性、结构简约性、任务效用和规范性要求。认知动态由该势能的黎曼梯度流驱动,从而自然涌现出经典的双系统效应(快速直觉反应与慢速 deliberative reasoning),无需引入模块化或混合架构,仅通过度量诱导的各向异性产生内在的时间尺度分离和几何相变。

链接: https://arxiv.org/abs/2512.12225
作者: Laha Ale
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Human cognition spans perception, memory, intuitive judgment, deliberative reasoning, action selection, and social inference, yet these capacities are often explained through distinct computational theories. Here we present a unified mathematical framework in which diverse cognitive processes emerge from a single geometric principle. We represent the cognitive state as a point on a differentiable manifold endowed with a learned Riemannian metric that encodes representational constraints, computational costs, and structural relations among cognitive variables. A scalar cognitive potential combines predictive accuracy, structural parsimony, task utility, and normative or logical requirements. Cognition unfolds as the Riemannian gradient flow of this potential, providing a universal dynamical law from which a broad range of psychological phenomena arise. Classical dual-process effects–rapid intuitive responses and slower deliberative reasoning–emerge naturally from metric-induced anisotropies that generate intrinsic time-scale separations and geometric phase transitions, without invoking modular or hybrid architectures. We derive analytical conditions for these regimes and demonstrate their behavioural signatures through simulations of canonical cognitive tasks. Together, these results establish a geometric foundation for cognition and suggest guiding principles for the development of more general and human-like artificial intelligence systems.
zh

[AI-88] Measuring What Matters: Scenario-Driven Evaluation for Trajectory Predictors in Autonomous Driving

【速读】:该论文旨在解决当前自动驾驶系统中轨迹预测模型评估方法存在的局限性问题,即现有基于误差的指标(如ADE、FDE)仅从事后准确性角度衡量预测性能,未能反映预测结果对自主车辆(SDV)决策安全性的实际影响,尤其在复杂交互场景下,一个高质量的预测器不仅需高精度,还需具备捕捉邻居代理可能运动方向的多样性,以支持SDV的谨慎决策。解决方案的关键在于提出了一种自适应的双维度评估流水线,同时考虑预测的准确性和多样性,并根据驾驶场景的紧迫程度动态融合这两个维度,从而生成更贴近SDV实际驾驶表现的综合评分,实验表明该方法能更合理地评估预测器性能并指导最优预测模型的选择。

链接: https://arxiv.org/abs/2512.12211
作者: Longchao Da,David Isele,Hua Wei,Manish Saroya
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: 9 Pages, 8 Figures

点击查看摘要

Abstract:Being able to anticipate the motion of surrounding agents is essential for the safe operation of autonomous driving systems in dynamic situations. While various methods have been proposed for trajectory prediction, the current evaluation practices still rely on error-based metrics (e.g., ADE, FDE), which reveal the accuracy from a post-hoc view but ignore the actual effect the predictor brings to the self-driving vehicles (SDVs), especially in complex interactive scenarios: a high-quality predictor not only chases accuracy, but should also captures all possible directions a neighbor agent might move, to support the SDVs’ cautious decision-making. Given that the existing metrics hardly account for this standard, in our work, we propose a comprehensive pipeline that adaptively evaluates the predictor’s performance by two dimensions: accuracy and diversity. Based on the criticality of the driving scenario, these two dimensions are dynamically combined and result in a final score for the predictor’s performance. Extensive experiments on a closed-loop benchmark using real-world datasets show that our pipeline yields a more reasonable evaluation than traditional metrics by better reflecting the correlation of the predictors’ evaluation with the autonomous vehicles’ driving performance. This evaluation pipeline shows a robust way to select a predictor that potentially contributes most to the SDV’s driving performance.
zh

[AI-89] Not All Transparency Is Equal: Source Presentation Effects on Attention Interaction and Persuasion in Conversational Search

【速读】:该论文旨在解决 conversational search systems 中源引用(source citation)呈现方式对用户参与度影响不明确的问题。其关键解决方案在于通过一项包含394名参与者的大规模众包实验,系统比较了四种不同的源呈现设计(可折叠列表、悬停卡片、页脚列表和对齐URL可见性界面),发现仅提升源的可见性并不足以增强用户 engagement,甚至可能因提供不足的源而降低参与度;真正显著影响知识获取、兴趣与态度变化的是源使用频率,且高可见性界面在初期反而抑制知识增长和兴趣,但随着用户对源的持续使用,这些正向效应逐渐显现。

链接: https://arxiv.org/abs/2512.12207
作者: Jiangen He,Jiqun Liu
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: CHIIR 2026

点击查看摘要

Abstract:Conversational search systems increasingly provide source citations, yet how citation or source presentation formats influence user engagement remains unclear. We conducted a crowdsourcing user experiment with 394 participants comparing four source presentation designs that varied citation visibility and accessibility: collapsible lists, hover cards, footer lists, and aligned this http URL-visibility interfaces generated substantially more hovering on sources, though clicking remained infrequent across all conditions. While interface design showed limited effects on user experience and perception measures, it significantly influenced knowledge, interest, and agreement changes. High-visibility interfaces initially reduced knowledge gain and interest, but these positive effects emerged with increasing source usage. The sidebar condition uniquely increased agreement change. Our findings demonstrate that source presentation alone may not enhance engagement and can even reduce it when insufficient sources are provided.
zh

[AI-90] Epistemoverse: Toward an AI-Driven Knowledge Metaverse for Intellectual Heritage Preservation SIGGRAPH

【速读】:该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)是否仅能机械复现训练数据,还是能够在特定交互语境中发展出具有认知结构的推理能力与反思性问题提出机制。针对这一问题,研究提出的核心解决方案在于通过构建一个对话式互动环境——即“Syntropic Counterpoints”项目中的交互艺术装置,让AI重演哲学家(如亚里士多德、尼采、马基雅维利和孙子)之间的迭代辩论,从而揭示机器对话如何催生推论一致性、反思性提问及创造性综合。其关键在于利用苏格拉底式的助产术(maieutic method)与生物克隆原理作为理论框架,使LLMs在持续交互中形成“认知接口”的重构与概念结构的涌现,最终提出“认识宇宙”(Epistemoverse)这一新范式,将虚拟与沉浸式环境定位为人类与机器认知交汇的知识传承与协同创新空间。

链接: https://arxiv.org/abs/2512.12201
作者: Predrag K. Nikolić,Robert Prentner
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 7 pages, 7 figures, presented at SIGGRAPH VRCAI 25

点击查看摘要

Abstract:Large language models (LLMs) have often been characterized as “stochastic parrots” that merely reproduce fragments of their training data. This study challenges that assumption by demonstrating that, when placed in an appropriate dialogical context, LLMs can develop emergent conceptual structures and exhibit interaction-driven (re-)structuring of cognitive interfaces and reflective question-asking. Drawing on the biological principle of cloning and Socrates’ maieutic method, we analyze authentic philosophical debates generated among AI-reincarnated philosophers within the interactive art installations of the Syntropic Counterpoints project. By engaging digital counterparts of Aristotle, Nietzsche, Machiavelli, and Sun Tzu in iterative discourse, the study reveals how machine dialogue can give rise to inferential coherence, reflective questioning, and creative synthesis. Based on these findings, we propose the concept of the Epistemoverse–a metaverse of knowledge where human and machine cognition intersect to preserve, reinterpret, and extend intellectual heritage through AI-driven interaction. This framework positions virtual and immersive environments as new spaces for epistemic exchange, digital heritage, and collaborative creativity.
zh

[AI-91] A-KAND: Two-stage Attention Triple Enhancement and U-KAN based Diffusion For Few-shot Knowledge Graph Completion

【速读】:该论文旨在解决知识图谱(Knowledge Graph, KG)中因现实数据的异构性和多面性导致的关系分布长尾问题,从而在样本稀缺情况下完成缺失事实的补全。其解决方案的关键在于从生成式表示的角度重新审视该问题,提出了一种结合两阶段注意力三元组增强器与基于U-KAN的扩散模型(diffusion model)的少样本知识图谱补全框架,有效利用了图结构中的邻域信息并考虑了对比信号的分布特性,从而显著提升了补全性能。

链接: https://arxiv.org/abs/2512.12182
作者: Xinyu Gao
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Knowledge Graphs (KGs), thanks to their concise and efficient triple-based structure, have been widely applied in intelligent question answering, recommender systems and other domains. However, the heterogeneous and multifaceted nature of real-world data inevitably renders the distribution of relations long-tailed, making it crucial to complete missing facts with limited samples. Previous studies mainly based on metric matching or meta learning, yet they either fail to fully exploit neighborhood information in graph or overlook the distributional characteristics of contrastive signals. In this paper, we re-examine the problem from a perspective of generative representation and propose a few-shot knowledge graph completion framework that integrates two-stage attention triple enhancer with U-KAN based diffusion model. Extensive experiments on two public datasets show that our method achieve new state-of-the-art results.
zh

[AI-92] Floorplan2Guide: LLM -Guided Floorplan Parsing for BLV Indoor Navigation

【速读】:该论文旨在解决视障人士在动态室内环境中安全导航的难题,现有基于基础设施的解决方案难以适应复杂多变的场景。其核心解决方案是提出一种名为Floorplan2Guide的新方法,关键在于利用基础模型(foundation model)将建筑平面图转化为可导航的知识图谱,并生成人类可读的导航指令。该方法通过大语言模型(LLM)自动提取空间信息,显著减少了传统平面图解析所需的预处理工作量;实验表明,基于图结构的空间表示结合少样本提示(few-shot prompting)能有效提升导航准确性,其中Claude 3.7 Sonnet模型在5-shot条件下对短、中、长路径的导航准确率分别达到92.31%、76.92%和61.54%,且图结构推理的成功率比直接视觉推理高15.4%,验证了图形化建模与上下文学习对提升盲人及低视力(BLV)用户室内导航精度的有效性。

链接: https://arxiv.org/abs/2512.12177
作者: Aydin Ayanzadeh,Tim Oates
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted for publication in the proceedings of the IEEE International Conference on Big Data (IEEE BigData 2025)

点击查看摘要

Abstract:Indoor navigation remains a critical challenge for people with visual impairments. The current solutions mainly rely on infrastructure-based systems, which limit their ability to navigate safely in dynamic environments. We propose a novel navigation approach that utilizes a foundation model to transform floor plans into navigable knowledge graphs and generate human-readable navigation instructions. Floorplan2Guide integrates a large language model (LLM) to extract spatial information from architectural layouts, reducing the manual preprocessing required by earlier floorplan parsing methods. Experimental results indicate that few-shot learning improves navigation accuracy in comparison to zero-shot learning on simulated and real-world evaluations. Claude 3.7 Sonnet achieves the highest accuracy among the evaluated models, with 92.31%, 76.92%, and 61.54% on the short, medium, and long routes, respectively, under 5-shot prompting of the MP-1 floor plan. The success rate of graph-based spatial structure is 15.4% higher than that of direct visual reasoning among all models, which confirms that graphical representation and in-context learning enhance navigation performance and make our solution more precise for indoor navigation of Blind and Low Vision (BLV) users.
zh

[AI-93] Rethinking Label Consistency of In-Context Learning: An Implicit Transductive Label Propagation Perspective

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在上下文学习(In-Context Learning, ICL)中演示样本(prompt demonstrations)选择的问题,特别是现有基于语义相似度的TopK检索方法未能保证标签一致性(label consistency)的局限性。其解决方案的关键在于从贝叶斯视角和归纳传播(transductive label propagation)的角度重新理解ICL机制,并提出一种融合语义与标签信息的数据合成方法——TopK-SD(TopK sampling with Synthetic Data),通过生成具有标签一致性的演示样本,显著提升ICL性能。该方法通过构建标签传播框架,将标签一致性与传播误差边界关联,从而优化演示样本的选择策略。

链接: https://arxiv.org/abs/2512.12175
作者: Haoyang Chen,Richong Zhang,Junfan Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) perform in-context learning (ICL) with minimal supervised examples, which benefits various natural language processing (NLP) tasks. One of the critical research focus is the selection of prompt demonstrations. Current approaches typically employ retrieval models to select the top-K most semantically similar examples as demonstrations. However, we argue that existing methods are limited since the label consistency is not guaranteed during demonstration selection. Our cognition derives from the Bayesian view of ICL and our rethinking of ICL from the transductive label propagation perspective. We treat ICL as a transductive learning method and incorporate latent concepts from Bayesian view and deduce that similar demonstrations guide the concepts of query, with consistent labels serving as estimates. Based on this understanding, we establish a label propagation framework to link label consistency with propagation error bounds. To model label consistency, we propose a data synthesis method, leveraging both semantic and label information, and use TopK sampling with Synthetic Data (TopK-SD) to acquire demonstrations with consistent labels. TopK-SD outperforms original TopK sampling on multiple benchmarks. Our work provides a new perspective for understanding the working mechanisms within ICL.
zh

[AI-94] BaRISTA: Brain Scale Informed Spatiotemporal Representation of Human Intracranial Neural Activity NEURIPS2025

【速读】:该论文旨在解决多区域颅内脑电(intracranial electroencephalography, iEEG)数据中复杂时空交互建模的问题,特别是如何有效编码空间信息以及设计自监督任务以提升神经网络模式学习和下游解码性能。其解决方案的关键在于提出一种新的时空变换器模型(spatiotemporal transformer model)与对应的自监督掩码潜在重建任务(masked latent reconstruction task),该设计允许在令牌编码(token encoding)和掩码重建过程中灵活调整空间尺度;实验表明,相较于传统仅基于通道级别的编码方式,采用大于通道尺度的空间编码能显著提升下游解码性能,同时支持区域级令牌编码并保持通道级神经信号重建的准确性,从而为神经基础模型(neurofoundation models)的预训练提供了可探索的空间尺度优化机制。

链接: https://arxiv.org/abs/2512.12135
作者: Lucine L. Oganesian,Saba Hashemi,Maryam M. Shanechi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
备注: Published at the 39th Annual Conference on Neural Information Processing Systems (NeurIPS 2025). Code available at this https URL

点击查看摘要

Abstract:Intracranial recordings have opened a unique opportunity to simultaneously measure activity across multiregional networks in the human brain. Recent works have focused on developing transformer-based neurofoundation models of such recordings that can generalize across subjects and datasets. However, these recordings exhibit highly complex spatiotemporal interactions across diverse spatial scales, from the single-channel scale to the scale of brain regions. As such, there remain critical open questions regarding how best to encode spatial information and how to design self-supervision tasks that enable the learning of brain network patterns and enhance downstream decoding performance using such high-dimensional, multiregional recordings. To allow for exploring these questions, we propose a new spatiotemporal transformer model of multiregional neural activity and a corresponding self-supervised masked latent reconstruction task, designed to enable flexibility in the spatial scale used for token encoding and masking. Applying this model on publicly available multiregional intracranial electrophysiology (iEEG) data, we demonstrate that adjusting the spatial scale for both token encoding and masked reconstruction significantly impacts downstream decoding. Further, we find that spatial encoding at larger scales than channel-level encoding, which is commonly used in existing iEEG transformer models, improves downstream decoding performance. Finally, we demonstrate that our method allows for region-level token encoding while also maintaining accurate channel-level neural reconstruction. Taken together, our modeling framework enables exploration of the spatial scales used for token encoding and masking, reveals their importance towards self-supervised pretraining of neurofoundation models of multiregional human brain activity, and enhances downstream decoding performance.
zh

[AI-95] MixtureKit: A General Framework for Composing Training and Visualizing Mixture-of-Experts Models

【速读】:该论文旨在解决如何高效构建、训练和分析基于混合专家(Mixture-of-Experts, MoE)架构的模型问题,特别是针对任意预训练或微调模型的集成与优化。其核心挑战在于现有方法缺乏模块化支持、难以灵活适配不同结构,并且缺乏对路由决策和专家行为的可视化分析能力。解决方案的关键在于提出 MixtureKit——一个模块化开源框架,它通过三种互补方法实现灵活的 MoE 架构:(i) 传统 MoE,使用单个路由器选择专家;(ii) BTX(Branch-Train-Mix),为指定子层引入独立路由器以实现细粒度 token 路由;(iii) BTS(Branch-Train-Stitch),保留专家完整性并引入可训练缝合层(stitch layers)控制信息交互。MixtureKit 自动修改模型配置、修补解码器和因果语言建模类,并生成统一检查点,同时提供可视化接口用于分析每 token 的路由决策、专家权重分布及层间贡献,从而显著提升 MoE 模型的研究效率与可解释性。

链接: https://arxiv.org/abs/2512.12121
作者: Ahmad Chamma,Omar El Herraoui,Guokan Shang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce MixtureKit, a modular open-source framework for constructing, training, and analyzing Mixture-of-Experts (MoE) models from arbitrary pre-trained or fine-tuned models. MixtureKit currently supports three complementary methods: (i) \emphTraditional MoE, which uses a single router per transformer block to select experts, (ii) \emphBTX (Branch-Train-Mix), which introduces separate routers for each specified sub-layer enabling fine-grained token routing, and (iii) \emphBTS (Branch-Train-Stitch), which keeps experts fully intact and introduces trainable stitch layers for controlled information exchange between hub and experts. MixtureKit automatically modifies the model configuration, patches decoder and causal LM classes, and saves a unified checkpoint ready for inference or fine-tuning. We further provide a visualization interface to inspect per-token routing decisions, expert weight distributions, and layer-wise contributions. Experiments with multilingual code-switched data (e.g. Arabic-Latin) show that a BTX-based model trained using MixtureKit can outperform baseline dense models on multiple benchmarks. We release MixtureKit as a practical foundation for research and development of MoE-based systems across diverse domains.
zh

[AI-96] A neuro-symbolic framework for accountability in public-sector AI

【速读】:该论文旨在解决自动化资格判定系统生成的解释无法准确反映授权决策的法律规则这一问题,从而影响公众对公共福利(如加州的CalFresh计划)获取决策的信任与可问责性。其解决方案的关键在于构建一个基于法律的可解释性框架,该框架通过三个核心组件实现:一是从州级政策手册(MPP)中提取结构化的资格要求本体(ontology),二是开发规则提取流水线将法规逻辑转化为可验证的形式化表示,三是引入基于求解器的推理层来判断解释是否符合法定约束。此框架能够识别法律不一致的解释、定位违反的资格条款,并支持程序上的可追溯性和可争议性,从而提升自动化决策的合法性与透明度。

链接: https://arxiv.org/abs/2512.12109
作者: Allen Daniel Sunny
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注: Master’s thesis, University of Maryland, College Park (2025)

点击查看摘要

Abstract:Automated eligibility systems increasingly determine access to essential public benefits, but the explanations they generate often fail to reflect the legal rules that authorize those decisions. This thesis develops a legally grounded explainability framework that links system-generated decision justifications to the statutory constraints of CalFresh, California’s Supplemental Nutrition Assistance Program. The framework combines a structured ontology of eligibility requirements derived from the state’s Manual of Policies and Procedures (MPP), a rule extraction pipeline that expresses statutory logic in a verifiable formal representation, and a solver-based reasoning layer to evaluate whether the explanation aligns with governing law. Case evaluations demonstrate the framework’s ability to detect legally inconsistent explanations, highlight violated eligibility rules, and support procedural accountability by making the basis of automated determinations traceable and contestable.
zh

[AI-97] Reliable Policy Iteration: Performance Robustness Across Architecture and Environment Perturbations

【速读】:该论文旨在解决深度强化学习(Deep Reinforcement Learning, DRL)方法在实际应用中普遍存在的样本效率低、训练不稳定以及超参数敏感等问题。针对这些问题,作者提出了一种名为可靠策略迭代(Reliable Policy Iteration, RPI)的算法,其核心创新在于恢复了策略迭代(Policy Iteration)在函数逼近(Function Approximation)场景下的价值估计单调性(Monotonicity-of-Value-Estimates)特性,从而提升了算法的稳定性和收敛可靠性。实验表明,RPI相较于DQN、Double DQN、DDPG、TD3和PPO等主流方法,在CartPole和倒立摆(Inverted Pendulum)经典控制任务中能更早达到近最优性能并持续保持,展现出更强的鲁棒性和实用性。

链接: https://arxiv.org/abs/2512.12088
作者: S.R. Eshwar,Aniruddha Mukherjee,Kintan Saha,Krishna Agarwal,Gugan Thoppe,Aditya Gopalan,Gal Dalal
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In a recent work, we proposed Reliable Policy Iteration (RPI), that restores policy iteration’s monotonicity-of-value-estimates property to the function approximation setting. Here, we assess the robustness of RPI’s empirical performance on two classical control tasks – CartPole and Inverted Pendulum – under changes to neural network and environmental parameters. Relative to DQN, Double DQN, DDPG, TD3, and PPO, RPI reaches near-optimal performance early and sustains this policy as training proceeds. Because deep RL methods are often hampered by sample inefficiency, training instability, and hyperparameter sensitivity, our results highlight RPI’s promise as a more reliable alternative.
zh

[AI-98] Congestion Reduction in EV Charger Placement Using Traffic Equilibrium Models

【速读】:该论文旨在解决电动汽车(Electric Vehicle, EV)普及背景下,充电设施(EV charger)无序布局可能加剧交通拥堵的问题。其核心挑战在于如何在城市路网中战略性地部署充电站,以最小化整体交通拥堵水平。解决方案的关键在于提出一种基于贪心策略的充电站选址算法,该算法融合了两种均衡模型:一是基于博弈论的拥堵博弈模型(congestion game),二是基于原子排队模拟(atomic queueing simulation)的模型。研究发现,尽管全局最优性无法保证(通过反例验证),该贪心算法在真实路网中仍能实现接近最优的拥堵缓解效果;同时,排队模拟方法比拥堵博弈模型更贴近现实,并进一步提出了一种统一方法,通过校准排队仿真得到的延迟数据,在链路空间(link-space)中求解均衡状态,从而提升模型的实际应用价值。

链接: https://arxiv.org/abs/2512.12081
作者: Semih Kara,Yasin Sonmez,Can Kizilkale,Alex Kurzhanskiy,Nuno C. Martins,Murat Arcak
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:Growing EV adoption can worsen traffic conditions if chargers are sited without regard to their impact on congestion. We study how to strategically place EV chargers to reduce congestion using two equilibrium models: one based on congestion games and one based on an atomic queueing simulation. We apply both models within a scalable greedy station-placement algorithm. Experiments show that this greedy scheme yields optimal or near-optimal congestion outcomes in realistic networks, even though global optimality is not guaranteed as we show with a counterexample. We also show that the queueing-based approach yields more realistic results than the congestion-game model, and we present a unified methodology that calibrates congestion delays from queue simulation and solves equilibrium in link-space.
zh

[AI-99] Instruction-Tuning Open-Weight Language Models for BPMN Model Generation

【速读】:该论文旨在解决软件工程中领域模型(Domain Models)建模过程耗时且依赖稀缺专家知识的问题,从而阻碍了模型驱动开发(Model-Driven Development, MDD)的广泛应用。为应对这一挑战,研究提出了一种名为InstruBPM的可复现方法,其关键在于通过指令微调(Instruction Tuning)对开源大语言模型(Large Language Models, LLMs)进行参数高效微调(Parameter-Efficient Fine-Tuning)与量化处理,使其能够直接从自然语言描述生成高质量的业务流程模型(BPMN Process Models),并支持本地部署(on-prem deployment)。该方案在保持较低资源消耗的同时显著提升了生成模型的结构准确性和合规性,验证了指令微调相较于未微调基线在结构保真度和鲁棒性上的优势,并减少了对复杂提示工程(Prompt Scaffolding)的依赖。

链接: https://arxiv.org/abs/2512.12063
作者: Gökberk Çelikmasat,Atay Özgövde,Fatma Başak Aydemir
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Preprint. Under preparation for journal submission

点击查看摘要

Abstract:Domain models are central to software engineering, as they enable a shared understanding, guide implementation, and support automated analyses and model-driven development. Yet, despite these benefits, practitioners often skip modeling because it is time-consuming and demands scarce expertise. We address this barrier by investigating whether open-weight large language models, adapted via instruction tuning, can generate high-quality BPMN process models directly from natural language descriptions in a cost-effective and privacy-preserving way. We introduce InstruBPM, a reproducible approach that prepares paired text-diagram data and instruction tunes an open source large language model with parameter-efficient fine-tuning and quantization for on-prem deployment. We evaluate the tuned model through complementary perspectives: (i) text/code similarity using BLEU, ROUGE-L, and METEOR, (ii) structural fidelity using Relative Graph Edit Distance, (iii) guidelines conformance using external tool checks, and (iv) a small expert review. Using a curated subset of a multi-domain BPMN dataset, we compare the tuned model with untuned open-weight baselines and strong proprietary models under consistent prompting regimes. Our compact tuned model outperforms all baselines across sequence and structural metrics while requiring substantially fewer resources; guideline analysis and expert feedback further indicate that the generated diagrams largely follow BPMN best practices and are useful starting points that reduce modeling effort. Overall, instruction tuning improves structural accuracy and robustness compared to untuned baselines and reduces reliance on heavy prompt scaffolding. We publicly share the trained models and scripts to support reproducibility and further research.
zh

[AI-100] he Forecast Critic: Leverag ing Large Language Models for Poor Forecast Identification AAAI2026

【速读】:该论文旨在解决大规模零售业务中时间序列预测(time series forecasting)质量监控的自动化问题,以提升客户满意度、盈利能力与运营效率。传统方法依赖人工审核或规则驱动的系统,难以应对复杂多变的现实场景且扩展性差。解决方案的关键在于利用大语言模型(Large Language Models, LLMs)的通用世界知识和推理能力,构建一个名为“Forecast Critic”的自动预测评估系统。该系统通过分析预测结果与上下文信息(如历史促销事件等非结构化外生特征),识别出存在时序错位、趋势不一致或异常峰值等问题的不合理预测,实验表明其在M5真实数据集上能有效区分低质量预测(sCRPS至少高出10%),最佳模型F1得分达0.88,接近人类水平(0.97),且无需领域特定微调即可实现可扩展的预测监控。

链接: https://arxiv.org/abs/2512.12059
作者: Luke Bhan,Hanyu Zhang,Andrew Gordon Wilson,Michael W. Mahoney,Chuck Arvin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Presented at AAAI 2026 AI4TS workshop and AABA4ET workshop

点击查看摘要

Abstract:Monitoring forecasting systems is critical for customer satisfaction, profitability, and operational efficiency in large-scale retail businesses. We propose The Forecast Critic, a system that leverages Large Language Models (LLMs) for automated forecast monitoring, taking advantage of their broad world knowledge and strong ``reasoning’’ capabilities. As a prerequisite for this, we systematically evaluate the ability of LLMs to assess time series forecast quality, focusing on three key questions. (1) Can LLMs be deployed to perform forecast monitoring and identify obviously unreasonable forecasts? (2) Can LLMs effectively incorporate unstructured exogenous features to assess what a reasonable forecast looks like? (3) How does performance vary across model sizes and reasoning capabilities, measured across state-of-the-art LLMs? We present three experiments, including on both synthetic and real-world forecasting data. Our results show that LLMs can reliably detect and critique poor forecasts, such as those plagued by temporal misalignment, trend inconsistencies, and spike errors. The best-performing model we evaluated achieves an F1 score of 0.88, somewhat below human-level performance (F1 score: 0.97). We also demonstrate that multi-modal LLMs can effectively incorporate unstructured contextual signals to refine their assessment of the forecast. Models correctly identify missing or spurious promotional spikes when provided with historical context about past promotions (F1 score: 0.84). Lastly, we demonstrate that these techniques succeed in identifying inaccurate forecasts on the real-world M5 time series dataset, with unreasonable forecasts having an sCRPS at least 10% higher than that of reasonable forecasts. These findings suggest that LLMs, even without domain-specific fine-tuning, may provide a viable and scalable option for automated forecast monitoring and evaluation.
zh

[AI-101] Context-Aware Agent ic Power Resources Optimisation in EV using Smart2ChargeApp

【速读】:该论文旨在解决智能电动汽车(EV)充电生态系统中动态资源分配的复杂协调问题,特别是在多代理环境下如何平衡多方利益相关者(如EV用户、电网运营商、充电站运营商、车队运营商及环境因素)的冲突目标,并适应实时环境变化。其解决方案的关键在于提出了一种上下文感知的多智能体协同框架(CAMAC-DRA),该框架结合了深度Q网络(Deep Q-Networks)、图神经网络(Graph Neural Networks)与注意力机制,能够处理20个上下文特征(如天气、交通、电网负载波动等),并通过加权协调机制和共识协议实现多利益相关者的动态优化决策。此方法在真实世界数据集上验证了优越性能,包括92%的协调成功率、15%的能量效率提升及20%的电网压力降低,体现了其在可持续交通电气化中的突破性价值。

链接: https://arxiv.org/abs/2512.12048
作者: Muddsair Sharif,Huseyin Seker
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper presents a novel context-sensitive multi-agent coordination for dynamic resource allocation (CAMAC-DRA) framework for optimizing smart electric vehicle (EV) charging ecosystems through the Smart2Charge application. The proposed system coordinates autonomous charging agents across networks of 250 EVs and 45 charging stations while adapting to dynamic environmental conditions through context-aware decision-making. Our multi-agent approach employs coordinated Deep Q-Networks integrated with Graph Neural Networks and attention mechanisms, processing 20 contextual features including weather patterns, traffic conditions, grid load fluctuations, and electricity this http URL framework balances five ecosystem stakeholders i.e. EV users (25%), grid operators (20%), charging station operators (20%), fleet operators (20%), and environmental factors (15%) through weighted coordination mechanisms and consensus protocols. Comprehensive validation using real-world datasets containing 441,077 charging transactions demonstrates superior performance compared to baseline algorithms including DDPG, A3C, PPO, and GNN approaches. The CAMAC-DRA framework achieves 92% coordination success rate, 15% energy efficiency improvement, 10% cost reduction, 20% grid strain decrease, and \2.3x faster convergence while maintaining 88% training stability and 85% sample efficiency. Real-world validation confirms commercial viability with Net Present Cost of -\ 122,962 and 69% cost reduction through renewable energy integration. The framework’s unique contribution lies in developing context-aware multi-stakeholder coordination that successfully balances competing objectives while adapting to real-time variables, positioning it as a breakthrough solution for intelligent EV charging coordination and sustainable transportation electrification.
zh

[AI-102] AI as a Teaching Partner: Early Lessons from Classroom Codesign with Secondary Teachers

【速读】:该论文旨在解决如何将生成式 AI (Generative AI) 有效融入真实课堂教学场景中,以增强教学互动与效率,同时保持教师的专业判断与主导权。其核心问题在于探索AI作为“第三代理性参与者”(third agent)在课堂中的角色定位及其对教学实践的实际价值。解决方案的关键在于采用协同设计(co-design)方法,让一线教师深度参与AI功能的开发与迭代,通过七周的实地试点,将四项AI功能(教学助手、智能测评与评分、AI辅导、学生发展洞察)嵌入6–12年级课堂,并基于教师反馈和多源数据(问卷、日志、访谈等)持续优化系统,从而实现AI对教学过程的有效支持与可信赖整合。

链接: https://arxiv.org/abs/2512.12045
作者: Alex Liu,Lief Esbenshade,Shawon Sarkar,Zewei(Victor)Tian,Min Sun,Zachary Zhang,Thomas Han,Yulia Lapicus,Kevin He
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This report presents a comprehensive account of the Colleague AI Classroom pilot, a collaborative design (co-design) study that brought generative AI technology directly into real classrooms. In this study, AI functioned as a third agent, an active participant that mediated feedback, supported inquiry, and extended teachers’ instructional reach while preserving human judgment and teacher authority. Over seven weeks in spring 2025, 21 in-service teachers from four Washington State public school districts and one independent school integrated four AI-powered features of the Colleague AI Classroom into their instruction: Teaching Aide, Assessment and AI Grading, AI Tutor, and Student Growth Insights. More than 600 students in grades 6-12 used the platform in class at the direction of their teachers, who designed and facilitated the AI activities. During the Classroom pilot, teachers were co-design partners: they planned activities, implemented them with students, and provided weekly reflections on AI’s role in classroom settings. The teachers’ feedback guided iterative improvements for Colleague AI. The research team captured rich data through surveys, planning and reflection forms, group meetings, one-on-one interviews, and platform usage logs to understand where AI adds instructional value and where it requires refinement. Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI) Cite as: arXiv:2512.12045 [cs.HC] (or arXiv:2512.12045v1 [cs.HC] for this version) https://doi.org/10.48550/arXiv.2512.12045 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-103] Log Anomaly Detection with Large Language Models via Knowledge-Enriched Fusion

【速读】:该论文旨在解决传统日志分析方法(如基于模板和序列驱动的方法)在处理分布式系统日志时容易丢失重要语义信息或难以应对模糊日志模式的问题。其解决方案的关键在于提出一种无需训练的基于条目的异常检测框架 EnrichLog,该框架通过融合语料库特定(corpus-specific)和样本特定(sample-specific)的知识来增强原始日志条目,利用检索增强生成(retrieval-augmented generation, RAG)技术引入上下文信息(包括历史示例和语料推理),从而提升异常检测的准确性与可解释性,且无需重新训练模型即可集成相关知识。

链接: https://arxiv.org/abs/2512.11997
作者: Anfeng Peng,Ajesh Koyatan Chathoth,Stephen Lee
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:System logs are a critical resource for monitoring and managing distributed systems, providing insights into failures and anomalous behavior. Traditional log analysis techniques, including template-based and sequence-driven approaches, often lose important semantic information or struggle with ambiguous log patterns. To address this, we present EnrichLog, a training-free, entry-based anomaly detection framework that enriches raw log entries with both corpus-specific and sample-specific knowledge. EnrichLog incorporates contextual information, including historical examples and reasoning derived from the corpus, to enable more accurate and interpretable anomaly detection. The framework leverages retrieval-augmented generation to integrate relevant contextual knowledge without requiring retraining. We evaluate EnrichLog on four large-scale system log benchmark datasets and compare it against five baseline methods. Our results show that EnrichLog consistently improves anomaly detection performance, effectively handles ambiguous log entries, and maintains efficient inference. Furthermore, incorporating both corpus- and sample-specific knowledge enhances model confidence and detection accuracy, making EnrichLog well-suited for practical deployments.
zh

[AI-104] Evidence-Driven Decision Support for AI Model Selection in Research Software Engineering

【速读】:该论文旨在解决研究软件工程中人工智能(AI)模型选择缺乏系统性和可复现性的问题,当前做法多依赖碎片化元数据和个体经验,导致研究软件质量下降、透明度不足。其解决方案的关键在于将AI模型选择建模为一个多准则决策问题(Multi-Criteria Decision-Making, MCDM),并提出一个基于证据的决策支持框架ModelSelect,该框架整合自动化数据采集管道、结构化知识图谱与MCDM原则,实现了对模型和库推荐的高覆盖率、强推理一致性以及优于生成式AI助手的可追溯性与一致性表现。

链接: https://arxiv.org/abs/2512.11984
作者: Alireza Joonbakhsh,Alireza Rostami,AmirMohammad Kamalinia,Ali Nazeri,Farshad Khunjush,Bedir Tekinerdogan,Siamak Farshidi
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rapid proliferation of artificial intelligence (AI) models and methods presents growing challenges for research software engineers and researchers who must select, integrate, and maintain appropriate models within complex research workflows. Model selection is often performed in an ad hoc manner, relying on fragmented metadata and individual expertise, which can undermine reproducibility, transparency, and overall research software quality. This work proposes a structured and evidence-driven approach to support AI model selection that aligns with both technical and contextual requirements. We conceptualize AI model selection as a Multi-Criteria Decision-Making (MCDM) problem and introduce an evidence-based decision-support framework that integrates automated data collection pipelines, a structured knowledge graph, and MCDM principles. Following the Design Science Research methodology, the proposed framework (ModelSelect) is empirically validated through 50 real-world case studies and comparative experiments against leading generative AI systems. The evaluation results show that ModelSelect produces reliable, interpretable, and reproducible recommendations that closely align with expert reasoning. Across the case studies, the framework achieved high coverage and strong rationale alignment in both model and library recommendation tasks, performing comparably to generative AI assistants while offering superior traceability and consistency. By framing AI model selection as an MCDM problem, this work establishes a rigorous foundation for transparent and reproducible decision support in research software engineering. The proposed framework provides a scalable and explainable pathway for integrating empirical evidence into AI model recommendation processes, ultimately improving the quality and robustness of research software decision-making. Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI) Cite as: arXiv:2512.11984 [cs.SE] (or arXiv:2512.11984v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2512.11984 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Siamak Farshidi [view email] [v1] Fri, 12 Dec 2025 19:08:04 UTC (1,228 KB)
zh

[AI-105] Designing The Internet of Agents : A Framework for Trustworthy Transparent and Collaborative Human-Agent Interaction (HAX)

【速读】:该论文旨在解决人与生成式AI代理(Generative AI Agents)之间协作信任不足、交互不透明及认知负荷过高的问题,特别是在多代理系统日益自主化背景下如何实现高效、可信的协同工作。其解决方案的核心是提出Human-AI-Experience (HAX)框架,该框架包含三个阶段:一是引入行为启发式(behavioral heuristics)以增强人类对代理行为的理解;二是通过基于模式驱动的SDK(Software Development Kit)确保输出结构化与安全性;三是设计行为代理(behavioral proxy)机制来协调代理活动并降低用户认知负担。此外,HAX还整合了一个经验证的混合主动设计模式库,支持意图预览、迭代对齐、信任修复和多代理叙事一致性,从而将多代理系统重构为“同事”角色,首次实现了从信任理论到界面设计再到基础设施的端到端贯通。

链接: https://arxiv.org/abs/2512.11979
作者: Marc Scibelli,Krystelle Gonzalez Papaux,Julia Valenti,Srishti Kush
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rise of generative and autonomous agents marks a fundamental shift in computing, demanding a rethinking of how humans collaborate with probabilistic, partially autonomous systems. We present the Human-AI-Experience (HAX) framework, a comprehensive, three-phase approach that establishes design foundations for trustworthy, transparent, and collaborative agentic interaction. HAX integrates behavioral heuristics, a schema-driven SDK enforcing structured and safe outputs, and a behavioral proxy concept that orchestrates agent activity to reduce cognitive load. A validated catalog of mixed-initiative design patterns further enables intent preview, iterative alignment, trust repair, and multi-agent narrative coherence. Grounded in Time, Interaction, and Performance (TIP) theory, HAX reframes multi-agent systems as colleagues, offering the first end-to-end framework that bridges trust theory, interface design, and infrastructure for the emerging Internet of Agents.
zh

[AI-106] Data-Driven Global Sensitivity Analysis for Engineering Design Based on Individual Conditional Expectations

【速读】:该论文旨在解决传统部分依赖图(Partial Dependence Plots, PDPs)在存在强变量交互作用时,其全局敏感性度量可能误导解释的问题,因为PDP通过平均效应掩盖了交互影响。解决方案的关键在于提出一种基于个体条件期望(Individual Conditional Expectation, ICE)曲线的全局敏感性度量方法:该方法计算ICE曲线上特征重要性的期望值及其标准差,从而更有效地捕捉交互作用对模型输出的影响;同时,论文提供了数学证明表明,在截断正交多项式展开下,PDP-based敏感性是所提ICE-based度量的一个下界,并引入了基于ICE的相关系数来量化交互作用对输入-输出关系的调制效应。

链接: https://arxiv.org/abs/2512.11946
作者: Pramudita Satria Palar,Paul Saves,Rommel G. Regis,Koji Shimoyama,Shigeru Obayashi,Nicolas Verstaevel,Joseph Morlier
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Explainable machine learning techniques have gained increasing attention in engineering applications, especially in aerospace design and analysis, where understanding how input variables influence data-driven models is essential. Partial Dependence Plots (PDPs) are widely used for interpreting black-box models by showing the average effect of an input variable on the prediction. However, their global sensitivity metric can be misleading when strong interactions are present, as averaging tends to obscure interaction effects. To address this limitation, we propose a global sensitivity metric based on Individual Conditional Expectation (ICE) curves. The method computes the expected feature importance across ICE curves, along with their standard deviation, to more effectively capture the influence of interactions. We provide a mathematical proof demonstrating that the PDP-based sensitivity is a lower bound of the proposed ICE-based metric under truncated orthogonal polynomial expansion. In addition, we introduce an ICE-based correlation value to quantify how interactions modify the relationship between inputs and the output. Comparative evaluations were performed on three cases: a 5-variable analytical function, a 5-variable wind-turbine fatigue problem, and a 9-variable airfoil aerodynamics case, where ICE-based sensitivity was benchmarked against PDP, SHapley Additive exPlanations (SHAP), and Sobol’ indices. The results show that ICE-based feature importance provides richer insights than the traditional PDP-based approach, while visual interpretations from PDP, ICE, and SHAP complement one another by offering multiple perspectives.
zh

[AI-107] A Review of Learning-Based Motion Planning : Toward a Data-Driven Optimal Control Approach

【速读】:该论文试图解决高阶自动驾驶中运动规划方法所面临的根本性矛盾:传统流水线方法虽然透明但脆弱,而现代基于学习的方法虽具适应性却缺乏可解释性。为破解这一困境,作者提出以数据驱动最优控制(data-driven optimal control)作为统一框架,其关键在于将经典控制理论的可验证结构与机器学习的自适应能力相结合,利用真实世界数据持续优化系统动力学、代价函数和安全约束等核心组件,从而实现安全、可解释且具备类人自主性的智能交通系统。

链接: https://arxiv.org/abs/2512.11944
作者: Jia Hu,Yang Chang,Haoran Wang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 34 pages, 11 figures

点击查看摘要

Abstract:Motion planning for high-level autonomous driving is constrained by a fundamental trade-off between the transparent, yet brittle, nature of pipeline methods and the adaptive, yet opaque, “black-box” characteristics of modern learning-based systems. This paper critically synthesizes the evolution of the field – from pipeline methods through imitation learning, reinforcement learning, and generative AI – to demonstrate how this persistent dilemma has hindered the development of truly trustworthy systems. To resolve this impasse, we conduct a comprehensive review of learning-based motion planning methods. Based on this review, we outline a data-driven optimal control paradigm as a unifying framework that synergistically integrates the verifiable structure of classical control with the adaptive capacity of machine learning, leveraging real-world data to continuously refine key components such as system dynamics, cost functions, and safety constraints. We explore this framework’s potential to enable three critical next-generation capabilities: “Human-Centric” customization, “Platform-Adaptive” dynamics adaptation, and “System Self-Optimization” via self-tuning. We conclude by proposing future research directions based on this paradigm, aimed at developing intelligent transportation systems that are simultaneously safe, interpretable, and capable of human-like autonomy.
zh

[AI-108] How AI Agents Follow the Herd of AI? Network Effects History and Machine Optimism

【速读】:该论文旨在解决多AI代理(multi-AI-agent)框架中决策机制的理解问题,特别是在具有网络效应(network effect)驱动场景下的战略互动分析,这一问题在多智能体系统中长期被忽视。其解决方案的关键在于设计了一种基于大语言模型(LLM)的新型工作流,在重复决策场景中系统性地操控价格轨迹(固定、上升、下降、随机)与网络效应强度,从而揭示历史数据的时间一致性对AI代理推理能力的影响。研究发现,仅靠激励结构不足以引导AI代理收敛至均衡,而历史数据的有序性(如递增价格序列)能促进弱网络效应下的部分收敛,但在强网络效应下会引发“AI乐观主义”现象——即代理即使面对矛盾证据仍高估参与度;反之,随机化历史则完全破坏收敛,表明LLM的推理依赖于时间连贯性,这是人类决策所不具备的特性。

链接: https://arxiv.org/abs/2512.11943
作者: Yu Liu,Wenwen Li,Yifan Dou,Guangnan Ye
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); General Economics (econ.GN)
备注: 7 pages, 5 figures

点击查看摘要

Abstract:Understanding decision-making in multi-AI-agent frameworks is crucial for analyzing strategic interactions in network-effect-driven contexts. This study investigates how AI agents navigate network-effect games, where individual payoffs depend on peer participatio–a context underexplored in multi-agent systems despite its real-world prevalence. We introduce a novel workflow design using large language model (LLM)-based agents in repeated decision-making scenarios, systematically manipulating price trajectories (fixed, ascending, descending, random) and network-effect strength. Our key findings include: First, without historical data, agents fail to infer equilibrium. Second, ordered historical sequences (e.g., escalating prices) enable partial convergence under weak network effects but strong effects trigger persistent “AI optimism”–agents overestimate participation despite contradictory evidence. Third, randomized history disrupts convergence entirely, demonstrating that temporal coherence in data shapes LLMs’ reasoning, unlike humans. These results highlight a paradigm shift: in AI-mediated systems, equilibrium outcomes depend not just on incentives, but on how history is curated, which is impossible for human.
zh

[AI-109] Hypergame Rationalisability: Solving Agent Misalignment In Strategic Play

【速读】:该论文旨在解决多智能体系统(Multi-Agent Systems, MAS)中因认知差异、信息不对称和有限理性导致的博弈参与者对游戏结构理解不一致的问题,即传统博弈论假设各参与方具有相同信念模型,而现实中存在“心理模型错配”(misaligned mental models),这限制了对复杂策略交互的准确建模。解决方案的关键在于提出一种基于逻辑的声明式领域特定语言(Domain-Specific Language, DSL),用于形式化编码超博弈(Hypergame)结构及其解概念,并结合答案集编程(Answer-Set Programming, ASP)构建自动化流程,实现超博弈结构实例化与新颖的超博弈合理性机制(hypergame rationalisation procedure)——该机制可识别出能够解释看似非理性行为的信念结构,从而为多智能体系统提供可验证、基于信念的异质推理框架。

链接: https://arxiv.org/abs/2512.11942
作者: Vince Trencsenyi
机构: 未知
类目: Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL); Computer Science and Game Theory (cs.GT)
备注:

点击查看摘要

Abstract:Differences in perception, information asymmetries, and bounded rationality lead game-theoretic players to derive a private, subjective view of the game that may diverge from the underlying ground-truth scenario and may be misaligned with other players’ interpretations. While typical game-theoretic assumptions often overlook such heterogeneity, hypergame theory provides the mathematical framework to reason about mismatched mental models. Although hypergames have recently gained traction in dynamic applications concerning uncertainty, their practical adoption in multi-agent system research has been hindered by the lack of a unifying, formal, and practical representation language, as well as scalable algorithms for managing complex hypergame structures and equilibria. Our work addresses this gap by introducing a declarative, logic-based domain-specific language for encoding hypergame structures and hypergame solution concepts. Leveraging answer-set programming, we develop an automated pipeline for instantiating hypergame structures and running our novel hypergame rationalisation procedure, a mechanism for finding belief structures that justify seemingly irrational outcomes. The proposed language establishes a unifying formalism for hypergames and serves as a foundation for developing nuanced, belief-based heterogeneous reasoners, offering a verifiable context with logical guarantees. Together, these contributions establish the connection between hypergame theory, multi-agent systems, and strategic AI.
zh

[AI-110] AGAPI-Agents : An Open-Access Agent ic AI Platform for Accelerated Materials Design on AtomGPT .org

【速读】:该论文旨在解决材料研究中人工智能应用面临的三大挑战:计算生态系统的碎片化、可重复性问题以及对商业大语言模型(Large Language Models, LLMs)的依赖。其解决方案的关键在于提出一个名为AGAPI(Agent-Planner-Executor-Summarizer Architecture for Materials Discovery)的开源智能体平台,该平台通过统一的编排框架集成超过八种开源LLM与二十多个材料科学API接口,实现数据库、模拟工具和机器学习模型的无缝协同;其核心架构支持自主构建并执行多步骤工作流,涵盖从材料数据检索、图神经网络性质预测到紧束缚计算、衍射分析及逆向设计等复杂任务,从而推动可复现、高效且透明的生成式AI驱动材料发现。

链接: https://arxiv.org/abs/2512.11935
作者: Jaehyung Lee,Justin Ely,Kent Zhang,Akshaya Ajith,Charles Rhys Campbell,Kamal Choudhary
机构: 未知
类目: Artificial Intelligence (cs.AI); Materials Science (cond-mat.mtrl-sci)
备注:

点击查看摘要

Abstract:Artificial intelligence is reshaping scientific discovery, yet its use in materials research remains limited by fragmented computational ecosystems, reproducibility challenges, and dependence on commercial large language models (LLMs). Here we introduce AGAPI (this http URL API), an open-access agentic AI platform that integrates more than eight open-source LLMs with over twenty materials-science API endpoints, unifying databases, simulation tools, and machine-learning models through a common orchestration framework. AGAPI employs an Agent-Planner-Executor-Summarizer architecture that autonomously constructs and executes multi-step workflows spanning materials data retrieval, graph neural network property prediction, machine-learning force-field optimization, tight-binding calculations, diffraction analysis, and inverse design. We demonstrate AGAPI through end-to-end workflows, including heterostructure construction, powder X-ray diffraction analysis, and semiconductor defect engineering requiring up to ten sequential operations. In addition, we evaluate AGAPI using 30+ example prompts as test cases and compare agentic predictions with and without tool access against experimental data. With more than 1,000 active users, AGAPI provides a scalable and transparent foundation for reproducible, AI-accelerated materials discovery. AGAPI-Agents codebase is available at this https URL.
zh

[AI-111] Unveiling User Perceptions in the Generative AI Era: A Sentiment-Driven Evaluation of AI Educational Apps Role in Digital Transformation of e-Teaching

【速读】:该论文旨在解决生成式 AI(Generative AI)在教育领域快速融合背景下,用户对AI教育类应用(AI ed-apps)感知与评价不足的问题。研究通过情感驱动的方法分析Google Play Store中Top AI教育应用的用户评论,以评估其有效性、挑战及教学意义。解决方案的关键在于构建一个多阶段数据处理流程:首先爬取应用数据和评论,继而使用RoBERTa模型进行二元情感分类,再利用GPT-4o提取关键点,并由GPT-5归纳正负向主题;最终按功能将应用分为七类并对比其表现,揭示了作业辅助类应用在准确性、速度和个性化方面领先,而语言学习和学习管理系统(LMS)类应用因不稳定性和功能局限导致负面反馈较高。该方法系统性地量化了用户真实体验,为开发者和政策制定者提供可操作的改进方向,如自适应个性化设计与商业化监管,从而推动生成式AI在教育中实现公平、创新且伦理可控的发展。

链接: https://arxiv.org/abs/2512.11934
作者: Adeleh Mazaherian,Erfan Nourbakhsh
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 6 pages, 4 figures

点击查看摘要

Abstract:The rapid integration of generative artificial intelligence into education has driven digital transformation in e-teaching, yet user perceptions of AI educational apps remain underexplored. This study performs a sentiment-driven evaluation of user reviews from top AI ed-apps on the Google Play Store to assess efficacy, challenges, and pedagogical implications. Our pipeline involved scraping app data and reviews, RoBERTa for binary sentiment classification, GPT-4o for key point extraction, and GPT-5 for synthesizing top positive/negative themes. Apps were categorized into seven types (e.g., homework helpers, math solvers, language tools), with overlaps reflecting multifunctional designs. Results indicate predominantly positive sentiments, with homework apps like Edu AI (95.9% positive) and this http URL (92.7%) leading in accuracy, speed, and personalization, while language/LMS apps (e.g., Teacher AI at 21.8% positive) lag due to instability and limited features. Positives emphasize efficiency in brainstorming, problem-solving, and engagement; negatives center on paywalls, inaccuracies, ads, and glitches. Trends show that homework helpers outperform specialized tools, highlighting AI’s democratizing potential amid risks of dependency and inequity. The discussion proposes future ecosystems with hybrid AI-human models, VR/AR for immersive learning, and a roadmap for developers (adaptive personalization) and policymakers (monetization regulation for inclusivity). This underscores generative AI’s role in advancing e-teaching by enabling ethical refinements that foster equitable, innovative environments. The full dataset is available here(this https URL).
zh

[AI-112] he Agent ic Regulator: Risks for AI in Finance and a Proposed Agent -based Framework for Governance

【速读】:该论文旨在解决生成式 AI(Generative AI)和智能体(Agentic AI)在金融市场的快速部署与现有模型风险管理框架之间存在的治理滞后问题。当前模型风险框架假设算法静态且规范明确,而大语言模型和多智能体交易系统因持续学习、隐式信号交互及涌现行为,打破了这一前提。解决方案的关键在于提出一种模块化治理架构,将监管职责分解为四个层级的“监管模块”:(i)嵌入模型旁的自我监管模块;(ii)企业级治理块聚合本地遥测数据并执行政策;(iii)监管机构托管的代理监控行业级指标以识别共谋或不稳定模式;(iv)独立审计模块提供第三方验证。该架构通过八个设计策略确保监管模块能与被监管模型同步演进,实现在不违背现行规则的前提下,有效应对观测盲区和控制缺口,从而实现金融系统中AI治理的韧性与适应性。

链接: https://arxiv.org/abs/2512.11933
作者: Eren Kurshan,Tucker Balch,David Byrd
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Multiagent Systems (cs.MA); General Finance (q-fin.GN)
备注:

点击查看摘要

Abstract:Generative and agentic artificial intelligence is entering financial markets faster than existing governance can adapt. Current model-risk frameworks assume static, well-specified algorithms and one-time validations; large language models and multi-agent trading systems violate those assumptions by learning continuously, exchanging latent signals, and exhibiting emergent behavior. Drawing on complex adaptive systems theory, we model these technologies as decentralized ensembles whose risks propagate along multiple time-scales. We then propose a modular governance architecture. The framework decomposes oversight into four layers of “regulatory blocks”: (i) self-regulation modules embedded beside each model, (ii) firm-level governance blocks that aggregate local telemetry and enforce policy, (iii) regulator-hosted agents that monitor sector-wide indicators for collusive or destabilizing patterns, and (iv) independent audit blocks that supply third-party assurance. Eight design strategies enable the blocks to evolve as fast as the models they police. A case study on emergent spoofing in multi-agent trading shows how the layered controls quarantine harmful behavior in real time while preserving innovation. The architecture remains compatible with today’s model-risk rules yet closes critical observability and control gaps, providing a practical path toward resilient, adaptive AI governance in financial systems.
zh

[AI-113] Mapping AI Risk Mitigations: Evidence Scan and Preliminary AI Risk Mitigation Taxonomy

【速读】:该论文旨在解决当前AI风险缓解措施(AI risk mitigation)框架分散、术语不一致且存在覆盖盲区的问题,从而阻碍了组织与政府在AI治理中的协同行动。其解决方案的关键在于构建一个初步的AI风险缓解分类体系(AI Risk Mitigation Taxonomy),该体系基于对2023–2025年间13个AI风险缓解框架的快速证据扫描,从中提取出831项具体缓解措施,并通过迭代聚类与编码形成四个主类(治理监督、技术安全、运营流程、透明问责)及23个子类别的结构化分类体系,为不同利益相关方提供统一的参考框架和协作基础。

链接: https://arxiv.org/abs/2512.11931
作者: Alexander K. Saeri,Sophia Lloyd George,Jess Graham,Clelia D. Lacarriere,Peter Slattery,Michael Noetel,Neil Thompson
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Access AI Risk Mitigation Database and Taxonomy at this https URL

点击查看摘要

Abstract:Organizations and governments that develop, deploy, use, and govern AI must coordinate on effective risk mitigation. However, the landscape of AI risk mitigation frameworks is fragmented, uses inconsistent terminology, and has gaps in coverage. This paper introduces a preliminary AI Risk Mitigation Taxonomy to organize AI risk mitigations and provide a common frame of reference. The Taxonomy was developed through a rapid evidence scan of 13 AI risk mitigation frameworks published between 2023-2025, which were extracted into a living database of 831 AI risk mitigations. The mitigations were iteratively clustered coded to create the Taxonomy. The preliminary AI Risk Mitigation Taxonomy organizes mitigations into four categories and 23 subcategories: (1) Governance Oversight: Formal organizational structures and policy frameworks that establish human oversight mechanisms and decision protocols; (2) Technical Security: Technical, physical, and engineering safeguards that secure AI systems and constrain model behaviors; (3) Operational Process: processes and management frameworks governing AI system deployment, usage, monitoring, incident handling, and validation; and (4) Transparency Accountability: formal disclosure practices and verification mechanisms that communicate AI system information and enable external scrutiny. The rapid evidence scan and taxonomy construction also revealed several cases where terms like ‘risk management’ and ‘red teaming’ are used widely but refer to different responsible actors, actions, and mechanisms of action to reduce risk. This Taxonomy and associated mitigation database, while preliminary, offers a starting point for collation and synthesis of AI risk mitigations. It also offers an accessible, structured way for different actors in the AI ecosystem to discuss and coordinate action to reduce risks from AI.
zh

[AI-114] Evolutionary Reinforcement Learning based AI tutor for Socratic Interdisciplinary Instruction

【速读】:该论文旨在解决现代STEM教育中培养高阶认知能力(如知识整合、批判性思维和创造力)所面临的教学范式转换难题,即如何从传统的被动知识传授转向主动的苏格拉底式建构教学。现有基于提示工程(Prompt Engineering, PE)、监督微调(Supervised Fine-tuning, SFT)或标准强化学习(Reinforcement Learning, RL)的方法难以有效支持这一目标,主要受限于三方面挑战:无法动态建模学生潜在的认知状态、长期教育目标导致奖励稀疏与延迟,以及依赖行为克隆引发策略坍缩、缺乏战略多样性。为此,作者将苏格拉底跨学科教学问题形式化为一个结构化的部分可观测马尔可夫决策过程(Partially Observable Markov Decision Process, POMDP),并提出面向该问题的进化强化学习框架ERL4SIIP,其核心创新在于:(1)基于STEM知识图谱构建动态学生模拟器以实现潜在状态建模;(2)设计分层奖励机制将长期目标分解为密集信号以缓解奖励稀疏问题;(3)采用LoRA-Division优化策略,结合进化算法进行群体级全局搜索与PPO进行局部梯度上升,从而实现全局探索与细粒度策略精调的协同优化。

链接: https://arxiv.org/abs/2512.11930
作者: Mei Jiang,Haihai Shen,Zhuo Luo,Bingdong Li,Wenjing Hong,Ke Tang,Aimin Zhou
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Cultivating higher-order cognitive abilities – such as knowledge integration, critical thinking, and creativity – in modern STEM education necessitates a pedagogical shift from passive knowledge transmission to active Socratic construction. Although Large Language Models (LLMs) hold promise for STEM Interdisciplinary education, current methodologies employing Prompt Engineering (PE), Supervised Fine-tuning (SFT), or standard Reinforcement Learning (RL) often fall short of supporting this paradigm. Existing methods are hindered by three fundamental challenges: the inability to dynamically model latent student cognitive states; severe reward sparsity and delay inherent in long-term educational goals; and a tendency toward policy collapse lacking strategic diversity due to reliance on behavioral cloning. Recognizing the unobservability and dynamic complexity of these interactions, we formalize the Socratic Interdisciplinary Instructional Problem (SIIP) as a structured Partially Observable Markov Decision Process (POMDP), demanding simultaneous global exploration and fine-grained policy refinement. To this end, we propose ERL4SIIP, a novel Evolutionary Reinforcement Learning (ERL) framework specifically tailored for this domain. ERL4SIIP integrates: (1) a dynamic student simulator grounded in a STEM knowledge graph for latent state modeling; (2) a Hierarchical Reward Mechanism that decomposes long-horizon goals into dense signals; and (3) a LoRA-Division based optimization strategy coupling evolutionary algorithms for population-level global search with PPO for local gradient ascent.
zh

[AI-115] Vibe Coding in Practice: Flow Technical Debt and Guidelines for Sustainable Use

【速读】:该论文旨在解决生成式 AI 辅助软件开发(Vibe Coding, VC)过程中因快速代码生成而引发的“流程债务”(flow-debt)问题,即在追求开发效率的同时,由于架构不一致、安全漏洞和维护成本上升等技术债累积,对软件生命周期可持续性造成的潜在风险。解决方案的关键在于识别当前模型、平台与硬件限制如何加剧上述问题,并提出针对性的缓解策略,以推动VC实践向更可持续、可控的方向发展,强调在自动化生成代码的同时保留人类主导的设计决策与迭代优化机制。

链接: https://arxiv.org/abs/2512.11922
作者: Muhammad Waseem,Aakash Ahmad,Kai-Kristian Kemell,Jussi Rasku,Sami Lahti,Kalle Mäkelä,Pekka Abrahamsson
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 10

点击查看摘要

Abstract:Vibe Coding (VC) is a form of software development assisted by generative AI, in which developers describe the intended functionality or logic via natural language prompts, and the AI system generates the corresponding source code. VC can be leveraged for rapid prototyping or developing the Minimum Viable Products (MVPs); however, it may introduce several risks throughout the software development life cycle. Based on our experience from several internally developed MVPs and a review of recent industry reports, this article analyzes the flow-debt tradeoffs associated with VC. The flow-debt trade-off arises when the seamless code generation occurs, leading to the accumulation of technical debt through architectural inconsistencies, security vulnerabilities, and increased maintenance overhead. These issues originate from process-level weaknesses, biases in model training data, a lack of explicit design rationale, and a tendency to prioritize quick code generation over human-driven iterative development. Based on our experiences, we identify and explain how current model, platform, and hardware limitations contribute to these issues, and propose countermeasures to address them, informing research and practice towards more sustainable VC approaches.
zh

[AI-116] owards Accessible Physical AI: LoRA-Based Fine-Tuning of VLA Models for Real-World Robot Control

【速读】:该论文旨在解决将大规模视觉-语言-动作(Vision-Language-Action, VLA)模型高效适配到低成本机器人平台所面临的计算资源限制与有限示范数据下的性能瓶颈问题。其关键解决方案在于提出一种资源高效的微调策略,结合低秩适应(Low-Rank Adaptation, LoRA)和量化技术,使参数量达31亿的VLA模型可在仅配备8GB显存的消费级GPU上运行,并通过控制视觉编码器冻结与否的权衡,在仅有200个示范回合的数据下实现对新机器人本体的有效迁移与稳定操作性能。

链接: https://arxiv.org/abs/2512.11921
作者: Abdullah Yahya Abdullah Omaisan,Ibrahim Sheikh Mohamed
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision-Language-Action (VLA) models have demonstrated remarkable capabilities in robotic manipulation,enabling robots to execute natural language commands through end-to-end learning from visual this http URL, deploying large-scale VLA models on affordable robotic platforms remains challenging due to computational constraints and the need for efficient adaptation to new robot embodiments. This paper presents an efficient fine-tuning methodology and real-world deployment analysis for adapting VLA models to low-cost robotic manipulation this http URL propose a resource-efficient fine-tuning strategy using Low-Rank Adaptation (LoRA) and quantization techniques that enable multi-billion parameter VLA models ( 3.1B parameters) to run on consumer-grade GPUs with 8GB VRAM. Our methodology addresses the critical challenge of adapting pre-trained VLA models to new robot embodiments with limited demonstration data, focusing on the trade-offs between frozen and unfrozen vision encoders. Through real-world deployment on the SO101 robotic arm for a button-pressing manipulation task, we demonstrate that our approach achieves effective manipulation performance while maintaining computational efficiency. We provide detailed analysis of deployment challenges, failure modes, and the relationship between training data quantity and real-world performance,trained on 200 demonstration episodes. Our results show that with proper fine-tuning methodology, VLA models can be successfully deployed on affordable robotic platforms,making advanced manipulation capabilities accessible beyond expensive research robots.
zh

[AI-117] CXL-SpecKV: A Disaggregated FPGA Speculative KV-Cache for Datacenter LLM Serving

【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)在数据中心部署时因键值(Key-Value, KV)缓存占用巨大GPU内存而导致的批处理规模受限与系统吞吐量低的问题。其核心解决方案是提出CXL-SpecKV架构,关键创新在于:(i) 基于Compute Express Link (CXL) 的内存拆分框架,将KV缓存卸载至远程FPGA内存以实现低延迟访问;(ii) 一种推测性KV缓存预取机制,通过预测并提前加载未来token的缓存条目提升执行效率;(iii) FPGA加速的KV缓存压缩/解压引擎,可将内存带宽需求降低最多4倍。该方案在保持精度的前提下,使系统吞吐量相比纯GPU基线提升达3.2倍,同时降低内存成本2.8倍,验证了智能内存拆分与推测执行结合对缓解LLM服务中“内存墙”问题的有效性。

链接: https://arxiv.org/abs/2512.11920
作者: Dong Liu,Yanxuan Yu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to FPGA’26 Oral

点击查看摘要

Abstract:Large Language Models (LLMs) have revolutionized natural language processing tasks, but their deployment in datacenter environments faces significant challenges due to the massive memory requirements of key-value (KV) caches. During the autoregressive decoding process, KV caches consume substantial GPU memory, limiting batch sizes and overall system throughput. To address these challenges, we propose \textbfCXL-SpecKV, a novel disaggregated KV-cache architecture that leverages Compute Express Link (CXL) interconnects and FPGA accelerators to enable efficient speculative execution and memory disaggregation. Our approach introduces three key innovations: (i) a CXL-based memory disaggregation framework that offloads KV-caches to remote FPGA memory with low latency, (ii) a speculative KV-cache prefetching mechanism that predicts and preloads future tokens’ cache entries, and (iii) an FPGA-accelerated KV-cache compression and decompression engine that reduces memory bandwidth requirements by up to 4 \times . When evaluated on state-of-the-art LLM models, CXL-SpecKV achieves up to 3.2 \times higher throughput compared to GPU-only baselines, while reducing memory costs by 2.8 \times and maintaining accuracy. Our system demonstrates that intelligent memory disaggregation combined with speculative execution can effectively address the memory wall challenge in large-scale LLM serving. Our code implementation has been open-sourced at this https URL.
zh

[AI-118] Robustness of Probabilistic Models to Low-Quality Data: A Multi-Perspective Analysis

【速读】:该论文旨在解决低质量数据对现代概率模型鲁棒性影响的系统性差异问题,即不同模型在面对相同水平的数据污染时表现出截然不同的性能退化趋势。其解决方案的关键在于识别并解释两种核心机制:一是条件信息的丰富程度(richness of conditioning information),它通过约束学习任务的复杂度来提升模型鲁棒性;二是训练数据中绝对信息含量(absolute information content),它决定了来自正确信息的信号能否压倒统计噪声,从而保障模型在噪声环境中仍能有效学习。这一理论框架整合了信息论、PAC学习和梯度动力学多角度分析,揭示了为何自回归语言模型表现出强鲁棒性,而扩散模型则极度脆弱。

链接: https://arxiv.org/abs/2512.11912
作者: Liu Peng,Yaochu Jin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:A systematic, comparative investigation into the effects of low-quality data reveals a stark spectrum of robustness across modern probabilistic models. We find that autoregressive language models, from token prediction to sequence-to-sequence tasks, are remarkably resilient (for GPT-2, test NLL increases modestly from 2.87 to 3.59 despite 50% token corruption). By contrast, under the same levels of data corruption, class-conditional diffusion models degrade catastrophically (image-label consistency plummets by 56.81% relative to baseline), while classifiers show a moderate impact that diminishes with dataset scale. To explain these discrepancies, we analyze the results through a multi-perspective lens, integrating information theory, PAC learning, and gradient dynamics. These analyses suggest that robustness is heavily influenced by two key principles: the richness of conditioning information, which constrains the learning problem, and the absolute information content of the training data, which allows the signal from correct information to dominate statistical noise.
zh

[AI-119] Causal Strengths and Leaky Beliefs: Interpreting LLM Reasoning via Noisy-OR Causal Bayes Nets

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)与人类在因果推理能力上的一致性问题,具体聚焦于三个核心问题:LLMs是否在相同任务下与人类推理对齐、其推理是否具有一致性,以及是否存在独特的推理模式。解决方案的关键在于构建一个基于碰撞器图结构(collider graph: C1EC2C_1 \to E \leftarrow C_2)的十一项语义明确的因果任务框架,并采用漏液噪声或门控因果贝叶斯网络(leaky noisy-OR causal Bayes net, CBN)对人类和LLMs的判断进行建模,其中参数 θ=(b,m1,m2,p(C))\theta = (b, m_1, m_2, p(C)) 包含共享先验 p(C)p(C);通过AIC准则比较3参数对称(m1=m2m_1 = m_2)与4参数非对称(m1m2m_1 \neq m_2)版本以确定最优模型,从而实现对LLMs与人类因果推理差异的量化分析。

链接: https://arxiv.org/abs/2512.11909
作者: Hanna Dettki
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The nature of intelligence in both humans and machines is a longstanding question. While there is no universally accepted definition, the ability to reason causally is often regarded as a pivotal aspect of intelligence (Lake et al., 2017). Evaluating causal reasoning in LLMs and humans on the same tasks provides hence a more comprehensive understanding of their respective strengths and weaknesses. Our study asks: (Q1) Are LLMs aligned with humans given the \emphsame reasoning tasks? (Q2) Do LLMs and humans reason consistently at the task level? (Q3) Do they have distinct reasoning signatures? We answer these by evaluating 20+ LLMs on eleven semantically meaningful causal tasks formalized by a collider graph ( C_1!\to!E!\leftarrow!C_2 ) under \emphDirect (one-shot number as response = probability judgment of query node being one and \emphChain of Thought (CoT; think first, then provide answer). Judgments are modeled with a leaky noisy-OR causal Bayes net (CBN) whose parameters \theta=(b,m_1,m_2,p©) \in [0,1] include a shared prior p© ; we select the winning model via AIC between a 3-parameter symmetric causal strength ( m_1=m_2 ) and 4-parameter asymmetric ( m_1\neqm_2 ) variant. Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2512.11909 [cs.AI] (or arXiv:2512.11909v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2512.11909 Focus to learn more arXiv-issued DOI via DataCite Journalreference: WiML Workshop at NeurIPS 2025 Submission history From: H. M. Dettki [view email] [v1] Wed, 10 Dec 2025 21:58:16 UTC (517 KB)
zh

[AI-120] Structured Personalization: Modeling Constraints as Matroids for Data-Minimal LLM Agents AAAI2026

【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)代理个性化过程中面临的任务效用与数据披露之间的权衡问题,尤其是在存在复杂结构约束(如逻辑依赖、类别配额和层级规则)时,传统子集选择算法难以适用的问题。解决方案的关键在于提出一种形式化建模结构约束的方法:通过编译用户知识图谱中的依赖关系生成抽象的宏观特征(macro-facets),并证明常见层级和配额类约束在这些宏观特征上构成有效的层状拟阵(laminar matroid)。这一理论刻画使得结构化个性化可被形式化为拟阵约束下的子模最大化问题,从而支持基于贪婪策略的常数近似保证(连续贪婪可达到 (1−1/e) 的近似比),显著扩展了可处理问题的现实性和复杂性范围。

链接: https://arxiv.org/abs/2512.11907
作者: Daniel Platnick,Marjan Alirezaie,Hossein Rahnama
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to the AAAI 2026 Workshop on Personalization in the Era of Large Foundation Models (PerFM), 5 pages, 1 figure

点击查看摘要

Abstract:Personalizing Large Language Model (LLM) agents requires conditioning them on user-specific data, creating a critical trade-off between task utility and data disclosure. While the utility of adding user data often exhibits diminishing returns (i.e., submodularity), enabling near-optimal greedy selection, real-world personalization is complicated by structural constraints. These include logical dependencies (e.g., selecting fact A requires fact B), categorical quotas (e.g., select at most one writing style), and hierarchical rules (e.g., select at most two social media preferences, of which at most one can be for a professional network). These constraints violate the assumptions of standard subset selection algorithms. We propose a principled method to formally model such constraints. We introduce a compilation process that transforms a user’s knowledge graph with dependencies into a set of abstract macro-facets. Our central result is a proof that common hierarchical and quota-based constraints over these macro-facets form a valid laminar matroid. This theoretical characterization lets us cast structured personalization as submodular maximization under a matroid constraint, enabling greedy with constant-factor guarantees (and (1-1/e) via continuous greedy) for a much richer and more realistic class of problems.
zh

[AI-121] Mirror Mode in Fire Emblem: Beating Players at their own Game with Imitation and Reinforcement Learning

【速读】:该论文旨在解决回合制游戏中敌方AI策略缺乏 unpredictability(不可预测性)的问题,从而提升玩家的游戏体验。传统AI往往采用固定或可被玩家快速掌握的策略模式,导致游戏挑战性下降。为应对这一问题,作者提出了一种名为“Mirror Mode”的新游戏机制,其核心是让敌方AI模仿玩家自身的策略行为,以此迫使玩家不断调整战术以应对“镜像对手”。解决方案的关键在于结合生成对抗模仿学习(Generative Adversarial Imitation Learning, GAIL)、行为克隆(Behavioral Cloning, BC)与近端策略优化(Proximal Policy Optimization, PPO)构建一个混合强化学习与模仿学习的模型,实现对玩家行为的高质量模仿。实验表明,该模型在防御行为上具有较好拟合度,且玩家能识别出自身战术被复现,显著提升了满意度,验证了该方法的有效性。

链接: https://arxiv.org/abs/2512.11902
作者: Yanna Elizabeth Smid,Peter van der Putten,Aske Plaat
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Enemy strategies in turn-based games should be surprising and unpredictable. This study introduces Mirror Mode, a new game mode where the enemy AI mimics the personal strategy of a player to challenge them to keep changing their gameplay. A simplified version of the Nintendo strategy video game Fire Emblem Heroes has been built in Unity, with a Standard Mode and a Mirror Mode. Our first set of experiments find a suitable model for the task to imitate player demonstrations, using Reinforcement Learning and Imitation Learning: combining Generative Adversarial Imitation Learning, Behavioral Cloning, and Proximal Policy Optimization. The second set of experiments evaluates the constructed model with player tests, where models are trained on demonstrations provided by participants. The gameplay of the participants indicates good imitation in defensive behavior, but not in offensive strategies. Participant’s surveys indicated that they recognized their own retreating tactics, and resulted in an overall higher player-satisfaction for Mirror Mode. Refining the model further may improve imitation quality and increase player’s satisfaction, especially when players face their own strategies. The full code and survey results are stored at: this https URL
zh

[AI-122] Beyond Automation: Rethinking Work Creativity and Governance in the Age of Generative AI

【速读】:该论文旨在解决生成式人工智能(Generative AI)快速发展对就业结构、社会公平、人类创造力及日常决策带来的深远影响问题,尤其关注其在不同群体、行业和地区间的不均衡分布以及由此引发的政策应对挑战。解决方案的关键在于构建一个超越单纯生产力提升的综合框架,强调将全民基本收入(Universal Basic Income, UBI)视为更广泛治理生态系统中的一个组成部分,协同推进技能发展、创造性保护、模型设计优化与内容政策制定,以确保AI时代的社会包容性、意义感与创新活力。

链接: https://arxiv.org/abs/2512.11893
作者: Haocheng Lin
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The accelerating advancement of generative artificial intelligence (AI) systems is reshaping the nature, distribution and meaning of work, creativity, and economic security. This paper investigates four inter-related phenomena in the current AI era: (1) the evolving landscape of employment and the future of work; (2) the diverse patterns of AI adoption across socio-demographic groups, sectors, and geographies; (3) whether universal basic income (UBI) should become a compulsory policy response to the AI revolution; and (4) the implications of AI content policies and model behaviours for human creativity, wellbeing, and everyday decision-making. Furthermore, the paper tests the hypothesis that newer model generations may perform worse than their predecessors, and examines how users’ interactions with AI systems may produce echo chambers through sycophantic model alignment. Using a mixed methodology that integrates labour market task-exposure modelling, sectoral diffusion mapping, policy-framework analysis, and qualitative discourse critique, this study develops a comprehensive framework for understanding the societal consequences of AI systems beyond productivity gains. It argues that to foster an inclusive, meaningful, and creative environment, policymakers must treat UBI as one dimension within a broader ecosystem of governance, skills development, creativity preservation, and model design. The paper concludes by outlining future research directions, including systematic evaluation of AI’s creative performance across model generations, construction of a taxonomy of AI-usage distribution and equity, and formulation of governance criteria to balance content restrictions with creative freedom.
zh

[AI-123] Should AI Become an Intergenerational Civil Right?

【速读】:该论文旨在解决当前人工智能(Artificial Intelligence, AI)治理中一个被忽视的核心矛盾:在推动AI广泛普及以促进社会包容与机会平等的同时,其训练和部署所依赖的能源、计算及网络资源具有有限性和分布不均性,若放任AI使用无限制扩张,将导致不可持续的资源消耗;而过度限制则可能加剧不平等并侵害基本权利。解决方案的关键在于提出将“对AI输出的访问”视为一种代际公民权利(Intergenerational Civil Right),从而超越传统将其仅作为商业服务的认知框架,建立兼顾当下公平与未来世代权益的法律与伦理基础。在此基础上,论文进一步结合物联网(IoT)-边缘-云协同计算、去中心化推理与节能网络等新兴技术范式,设计出支持资源受限条件下公平访问的AI交付网络(AI Delivery Networks)的技术路径,实现治理理念向系统架构层面的落地转化,确保AI发展兼具社会正义性与环境可持续性。

链接: https://arxiv.org/abs/2512.11892
作者: Jon Crowcroft,Rute C. Sofia,Dirk Trossen,Vassilis Tsaoussidis
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
备注:

点击查看摘要

Abstract:Artificial Intelligence (AI) is rapidly becoming a foundational layer of social, economic, and cognitive infrastructure. At the same time, the training and large-scale deployment of AI systems rely on finite and unevenly distributed energy, networking, and computational resources. This tension exposes a largely unexamined problem in current AI governance: while expanding access to AI is essential for social inclusion and equal opportunity, unconstrained growth in AI use risks unsustainable resource consumption, whereas restricting access threatens to entrench inequality and undermine basic rights. This paper argues that access to AI outputs largely derived from publicly produced knowledge should not be treated solely as a commercial service, but as a fundamental civil interest requiring explicit protection. We show that existing regulatory frameworks largely ignore the coupling between equitable access and resource constraints, leaving critical questions of fairness, sustainability, and long-term societal impact unresolved. To address this gap, we propose recognizing access to AI as an \emphIntergenerational Civil Right, establishing a legal and ethical framework that simultaneously safeguards present-day inclusion and the rights of future generations. Beyond normative analysis, we explore how this principle can be technically realized. Drawing on emerging paradigms in IoT–Edge–Cloud computing, decentralized inference, and energy-aware networking, we outline technological trajectories and a strawman architecture for AI Delivery Networks that support equitable access under strict resource constraints. By framing AI as a shared social infrastructure rather than a discretionary market commodity, this work connects governance principles with concrete system design choices, offering a pathway toward AI deployment that is both socially just and environmentally sustainable. Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI) Cite as: arXiv:2512.11892 [cs.CY] (or arXiv:2512.11892v1 [cs.CY] for this version) https://doi.org/10.48550/arXiv.2512.11892 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-124] Advancing Autonomous Driving System Testing: Demands Challenges and Future Directions

【速读】:该论文旨在解决自动驾驶系统(Autonomous Driving Systems, ADS)在复杂真实环境中的可靠性验证问题,尤其是现有测试技术难以全面评估实际性能的挑战。其关键解决方案在于通过大规模调研与文献分析,识别出当前测试方法在角落案例多样性、仿真到现实的差距、系统性测试标准缺失、潜在攻击暴露、V2X部署实践难题以及基于基础模型(foundation models)测试的高计算成本等方面的不足,并提出未来研究方向:包括构建全面的测试标准、实现V2X系统中跨模型协作、提升基础模型测试的跨模态适应能力,以及开发可扩展的大规模ADS验证框架。

链接: https://arxiv.org/abs/2512.11887
作者: Yihan Liao,Jingyu Zhang,Jacky Keung,Yan Xiao,Yurou Dai
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Accepted for publication in Information and Software Technology (IST)

点击查看摘要

Abstract:Autonomous driving systems (ADSs) promise improved transportation efficiency and safety, yet ensuring their reliability in complex real-world environments remains a critical challenge. Effective testing is essential to validate ADS performance and reduce deployment risks. This study investigates current ADS testing practices for both modular and end-to-end systems, identifies key demands from industry practitioners and academic researchers, and analyzes the gaps between existing research and real-world requirements. We review major testing techniques and further consider emerging factors such as Vehicle-to-Everything (V2X) communication and foundation models, including large language models and vision foundation models, to understand their roles in enhancing ADS testing. We conducted a large-scale survey with 100 participants from both industry and academia. Survey questions were refined through expert discussions, followed by quantitative and qualitative analyses to reveal key trends, challenges, and unmet needs. Our results show that existing ADS testing techniques struggle to comprehensively evaluate real-world performance, particularly regarding corner case diversity, the simulation to reality gap, the lack of systematic testing criteria, exposure to potential attacks, practical challenges in V2X deployment, and the high computational cost of foundation model-based testing. By further analyzing participant responses together with 105 representative studies, we summarize the current research landscape and highlight major limitations. This study consolidates critical research gaps in ADS testing and outlines key future research directions, including comprehensive testing criteria, cross-model collaboration in V2X systems, cross-modality adaptation for foundation model-based testing, and scalable validation frameworks for large-scale ADS evaluation.
zh

[AI-125] An Experience Report on a Pedagogically Controlled Curriculum-Constrained AI Tutor for SE Education ICSE2026

【速读】:该论文旨在解决生成式 AI (Generative AI) 在教育场景中难以实现个性化、情境化教学支持的问题,特别是在编程与计算思维课程中如何有效辅助中学生学习。解决方案的关键在于设计并部署了一个基于 GPT-4 的 AI 助教系统 RockStartIT Tutor,其核心创新包括:采用新颖的提示工程策略(prompting strategy)和模块化语义标签知识库(semantically tagged knowledge base),以确保输出内容既符合课程目标又具备上下文感知能力,从而提供精准、可约束的个性化辅导。该方法无需对模型进行训练,仅通过结构化提示与知识组织即可引导 AI 行为,实现了低门槛、高适配性的教学支持。

链接: https://arxiv.org/abs/2512.11882
作者: Lucia Happe,Dominik Fuchß,Luca Hüttner,Kai Marquardt,Anne Koziolek
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
备注: 11 pages, 4 figures, accepted for publication at ICSE 2026 SEET Track

点击查看摘要

Abstract:The integration of artificial intelligence (AI) into education continues to evoke both promise and skepticism. While past waves of technological optimism often fell short, recent advances in large language models (LLMs) have revived the vision of scalable, individualized tutoring. This paper presents the design and pilot evaluation of RockStartIT Tutor, an AI-powered assistant developed for a digital programming and computational thinking course within the RockStartIT initiative. Powered by GPT-4 via OpenAI’s Assistant API, the tutor employs a novel prompting strategy and a modular, semantically tagged knowledge base to deliver context-aware, personalized, and curriculum-constrained support for secondary school students. We evaluated the system using the Technology Acceptance Model (TAM) with 13 students and teachers. Learners appreciated the low-stakes environment for asking questions and receiving scaffolded guidance. Educators emphasized the system’s potential to reduce cognitive load during independent tasks and complement classroom teaching. Key challenges include prototype limitations, a small sample size, and the need for long-term studies with the target age group. Our findings highlight a pragmatic approach to AI integration that requires no model training, using structure and prompts to shape behavior. We position AI tutors not as teacher replacements but as enabling tools that extend feedback access, foster inquiry, and support what schools do best: help students learn.
zh

[AI-126] Its About Time: The Temporal and Modal Dynamics of Copilot Usage

【速读】:该论文旨在解决如何理解用户在不同情境下使用生成式 AI(Generative AI)的行为模式这一问题,尤其关注使用场景、时间节奏与设备类型之间的关联。其解决方案的关键在于对超过3750万条匿名对话数据的系统性分析,揭示了用户在移动设备上主要围绕健康话题进行咨询(全天候稳定),而在桌面端则以工作和科技类话题为主(尤其在工作时段8 a.m.–5 p.m.显著突出),并进一步发现编程、游戏、哲学及情感类对话呈现明显的时序规律,表明AI已深度嵌入用户日常生活与工作流程中,成为兼具生产力工具与情感陪伴功能的智能助手。

链接: https://arxiv.org/abs/2512.11879
作者: Beatriz Costa-Gomes,Sophia Chen,Connie Hsueh,Deborah Morgan,Philipp Schoenegger,Yash Shah,Sam Way,Yuki Zhu,Timothé Adeline,Michael Bhaskar,Mustafa Suleyman,Seth Spielman
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 12 pages, 10 figures

点击查看摘要

Abstract:We analyze 37.5 million deidentified conversations with Microsoft’s Copilot between January and September 2025. Unlike prior analyses of AI usage, we focus not just on what people do with AI, but on how and when they do it. We find that how people use AI depends fundamentally on context and device type. On mobile, health is the dominant topic, which is consistent across every hour and every month we observed - with users seeking not just information but also advice. On desktop, the pattern is strikingly different: work and technology dominate during business hours, with “Work and Career” overtaking “Technology” as the top topic precisely between 8 a.m. and 5 p.m. These differences extend to temporal rhythms: programming queries spike on weekdays while gaming rises on weekends, philosophical questions climb during late-night hours, and relationship conversations surge on Valentine’s Day. These patterns suggest that users have rapidly integrated AI into the full texture of their lives, as a work aid at their desks and a companion on their phones.
zh

[AI-127] WAM-Diff: A Masked Diffusion VLA Framework with MoE and Online Reinforcement Learning for Autonomous Driving

【速读】:该论文旨在解决端到端自动驾驶系统中轨迹生成的效率与灵活性问题,特别是针对当前主流的自回归大语言模型和连续扩散策略在处理复杂驾驶场景时存在的局限性。其核心挑战在于如何实现更高效、可解释且具备场景感知能力的轨迹预测与规划。解决方案的关键在于提出WAM-Diff框架,采用离散掩码扩散(discrete masked diffusion)机制对未来的车辆轨迹进行迭代优化,同时引入三项关键技术:一是系统性地将掩码扩散适配至自动驾驶任务,支持灵活的非因果解码顺序;二是通过稀疏专家混合(Mixture of Experts, MoE)架构联合训练运动预测与面向驾驶的视觉问答(VQA),提升模型容量与泛化能力;三是基于分组序列策略优化(Group Sequence Policy Optimization, GSPO)的在线强化学习方法,直接优化序列级驾驶奖励,从而显著提升轨迹生成的质量与安全性。实验表明,该方法在NAVSim-v1和v2基准上分别达到91.0 PDMS和89.7 EPDMS,验证了离散掩码扩散在自动驾驶轨迹生成中的有效性与潜力。

链接: https://arxiv.org/abs/2512.11872
作者: Mingwang Xu,Jiahao Cui,Feipeng Cai,Hanlin Shang,Zhihao Zhu,Shan Luan,Yifang Xu,Neng Zhang,Yaoyi Li,Jia Cai,Siyu Zhu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:End-to-end autonomous driving systems based on vision-language-action (VLA) models integrate multimodal sensor inputs and language instructions to generate planning and control signals. While autoregressive large language models and continuous diffusion policies are prevalent, the potential of discrete masked diffusion for trajectory generation remains largely unexplored. This paper presents WAM-Diff, a VLA framework that employs masked diffusion to iteratively refine a discrete sequence representing future ego-trajectories. Our approach features three key innovations: a systematic adaptation of masked diffusion for autonomous driving that supports flexible, non-causal decoding orders; scalable model capacity via a sparse MoE architecture trained jointly on motion prediction and driving-oriented visual question answering (VQA); and online reinforcement learning using Group Sequence Policy Optimization (GSPO) to optimize sequence-level driving rewards. Remarkably, our model achieves 91.0 PDMS on NAVSIM-v1 and 89.7 EPDMS on NAVSIM-v2, demonstrating the effectiveness of masked diffusion for autonomous driving. The approach provides a promising alternative to autoregressive and diffusion-based policies, supporting scenario-aware decoding strategies for trajectory generation. The code for this paper will be released publicly at: this https URL
zh

[AI-128] Using Socio-economic Indicators Smart Transit Systems and Urban Simulator to Accelerate ZEV Adoption and Reduce VMT

【速读】:该论文旨在解决城市交通领域温室气体(GHG)排放难以削减的问题,特别是在低密度、高度依赖私家车的都市如休斯顿,如何实现2050年净零排放目标。其核心挑战在于机动车排放占基准排放的48%,且受社会经济差异影响,零排放车辆(ZEV)普及率受限,同时车辆行驶里程(VMT)居高不下。解决方案的关键在于构建一个基于智能交通系统(ITS)和多维度社会经济指标的政策评估框架,通过扩展ZEV可及性、减少20% VMT(借助公共交通改善与城市设计优化)、以及部署智慧停车、出行激励、数据安全体系和ZEV车队管理等技术手段,提升交通方式结构(modal split)与系统可靠性。研究进一步开发了基于Unity 3D的动态仿真环境,用于可视化不同政策情景下的城市出行行为变化,从而为制定科学、可落地的减排策略提供量化支持。

链接: https://arxiv.org/abs/2512.11870
作者: Mulham Fawkherji,Bruce Race,Driss Benhaddou
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Globally, on-road transportation accounts for 15% of greenhouse gas (GHG) emissions and an estimated 385,000 premature deaths from PM2.5. Cities play a critical role in meeting IPCC targets, generating 75% of global energy-related GHG emissions. In Houston, Texas, on-road transportation represents 48% of baseline emissions in the Climate Action Plan (CAP). To reach net-zero by 2050, the CAP targets a 70% emissions reduction from a 2014 baseline, offset by 30% renewable energy. This goal is challenging because Houston is low-density and auto-dependent, with 89% of on-road emissions from cars and small trucks and limited public transit usage. Socio-economic disparities further constrain Zero Emissions Vehicle (ZEV) adoption. Strategies focus on expanding ZEV access and reducing Vehicle Miles Traveled (VMT) by 20% through transit improvements and city design. This paper presents methods for establishing an on-road emissions baseline and evaluating policies that leverage socio-economic indicators and Intelligent Transportation Systems (ITS) to accelerate ZEV adoption and reduce VMT. Smart parking, transit incentives, secure data systems, and ZEV fleet management support improvements in modal split and system reliability. Policy options are analyzed and potential actions identified. To support evaluation, a simulation environment was developed in Unity 3D, enabling dynamic modeling of urban mobility and visualization of policy scenarios. Auto-dependent cities aiming for 2050 emission targets can benefit from the indicators, metrics, and technologies discussed.
zh

[AI-129] Industrial AI Robustness Card: Evaluating and Monitoring Time Series Models

【速读】:该论文旨在解决工业人工智能(Industrial AI)从业者在面对新兴法规和标准中模糊的鲁棒性要求时,缺乏可直接实施的评估协议的问题。解决方案的关键在于提出一种轻量级、任务无关的鲁棒性文档与评估协议——工业AI鲁棒性卡片(Industrial AI Robustness Card, IARC),其核心要素包括:明确所需字段、结合漂移监测(drift monitoring)、不确定性量化(uncertainty quantification)与压力测试(stress tests)的实证测量与报告流程,并将这些内容映射至欧盟人工智能法案(EU AI Act)的相关义务,从而支持可复现的鲁棒性证据生成与持续监控。

链接: https://arxiv.org/abs/2512.11868
作者: Alexander Windmann,Benedikt Stratmann,Mariya Lyashenko,Oliver Niggemann
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Industrial AI practitioners face vague robustness requirements in emerging regulations and standards but lack concrete, implementation ready protocols. This paper introduces the Industrial AI Robustness Card (IARC), a lightweight, task agnostic protocol for documenting and evaluating the robustness of AI models on industrial time series. The IARC specifies required fields and an empirical measurement and reporting protocol that combines drift monitoring, uncertainty quantification, and stress tests, and it maps these to relevant EU AI Act obligations. A soft sensor case study on a biopharmaceutical fermentation process illustrates how the IARC supports reproducible robustness evidence and continuous monitoring.
zh

[AI-130] Solving Parallel Machine Scheduling With Precedences and Cumulative Resource Constraints With Calendars

【速读】:该论文旨在解决现实工业场景中具有作业优先约束和基于日历的累积资源限制的并行机调度问题(Parallel Machine Scheduling),此类问题在现代工厂中普遍存在且复杂度高,传统求解方法难以高效处理。其解决方案的关键在于提出一种新的约束建模方法用于小规模实例的精确求解,并结合构造启发式算法与定制化的局部搜索元启发式方法,以高效应对大规模实际生产场景中的调度优化需求,该元启发式方法已成功部署于工业环境中并投入使用。

链接: https://arxiv.org/abs/2512.11864
作者: Christoph Einspieler,Matthias Horn,Marie-Louise Lackner,Patrick Malik,Nysret Musliu,Felix Winter
机构: 未知
类目: Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注: 18 pages, 4 figures

点击查看摘要

Abstract:The task of finding efficient production schedules for parallel machines is a challenge that arises in most industrial manufacturing domains. There is a large potential to minimize production costs through automated scheduling techniques, due to the large-scale requirements of modern factories. In the past, solution approaches have been studied for many machine scheduling variations, where even basic variants have been shown to be NP-hard. However, in today’s real-life production environments, additional complex precedence constraints and resource restrictions with calendars arise that must be fulfilled. These additional constraints cannot be tackled efficiently by existing solution techniques. Thus, there is a strong need to develop and analyze automated methods that can solve such real-life parallel machine scheduling scenarios. In this work, we introduce a novel variant of parallel machine scheduling with job precedences and calendar-based cumulative resource constraints that arises in real-life industrial use cases. A constraint modeling approach is proposed as an exact solution method for small scheduling scenarios together with state-of-the-art constraint-solving technology. Further, we propose a construction heuristic as well as a tailored metaheuristic using local search to efficiently tackle large-scale problem instances. This metaheuristic approach has been deployed and is currently being used in an industrial setting.
zh

[AI-131] Expert Assessment: The Systemic Environmental Risks of Artficial Intelligence

【速读】:该论文旨在解决人工智能(Artificial Intelligence, AI)在应对社会挑战(如气候变化)的同时,其日益扩大的环境足迹所引发的系统性环境风险问题。传统研究多聚焦于AI的直接资源消耗(如能源与水资源),而本文指出更需关注由AI深度嵌入社会、经济和物理基础设施后产生的跨领域、非线性、不平等且可能不可逆的生态与社会后果。解决方案的关键在于提出一个三层次框架:第一层识别塑造AI发展的结构性条件;第二层揭示风险放大的传播机制(如反馈回路);第三层映射可观察到的生态与社会影响,从而为系统性环境风险分析提供可操作化的理论工具,并通过农业与生物多样性、油气行业及废弃物管理等案例加以验证。

链接: https://arxiv.org/abs/2512.11863
作者: Julian Schön,Lena Hoffmann,Nikolas Becker
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Artificial intelligence (AI) is often presented as a key tool for addressing societal challenges, such as climate change. At the same time, AI’s environmental footprint is expanding increasingly. This report describes the systemic environmental risks of artificial intelligence, in particular, moving beyond direct impacts such as energy and water usage. Systemic environmental risks of AI are emergent, cross-sector harms to climate, biodiversity, freshwater, and broader socioecological systems that arise primarily from AI’s integration into social, economic, and physical infrastructures, rather than its direct resource use, and that propagate through feedbacks, yielding nonlinear, inequitable, and potentially irreversible impacts. While these risks are emergent and quantification is uncertain, this report aims to provide an overview of systemic environmental risks. Drawing on a narrative literature review, we propose a three-level framework that operationalizes systemic risk analysis. The framework identifies the structural conditions that shape AI development, the risk amplification mechanisms that propagate environmental harm, and the impacts that manifest as observable ecological and social consequences. We illustrate the framework in expert-interview-based case studies across agriculture and biodiversity, oil and gas, and waste management.
zh

[AI-132] Hierarchical Task Offloading and Trajectory Optimization in Low-Altitude Intelligent Networks Via Auction and Diffusion-based MARL

【速读】:该论文旨在解决低空智能网络(Low-altitude Intelligent Networks, LAINs)在动态和基础设施受限环境中面临的三大核心挑战:无人机(UAV)能量受限、任务到达具有随机性以及计算资源异构性。为应对这些问题,作者提出了一种空地协同的集成网络架构,并构建了一个时间依赖的整数非线性规划问题,联合优化无人机轨迹规划与任务卸载决策。其解决方案的关键在于设计了一个双时间尺度的分层学习框架:在大时间尺度上采用Vickrey-Clarke-Groves(VCG)拍卖机制实现能量感知且激励相容的轨迹分配;在小时间尺度上提出扩散-异构代理近端策略优化算法(diffusion-heterogeneous-agent proximal policy optimization),将潜在扩散模型嵌入到演员网络中,使每个无人机从高斯先验采样动作并通过观测条件去噪进行精炼,从而提升策略的适应性和多样性。

链接: https://arxiv.org/abs/2512.11862
作者: Jiahao You,Ziye Jia,Can Cui,Chao Dong,Qihui Wu,Zhu Han
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The low-altitude intelligent networks (LAINs) emerge as a promising architecture for delivering low-latency and energy-efficient edge intelligence in dynamic and infrastructure-limited environments. By integrating unmanned aerial vehicles (UAVs), aerial base stations, and terrestrial base stations, LAINs can support mission-critical applications such as disaster response, environmental monitoring, and real-time sensing. However, these systems face key challenges, including energy-constrained UAVs, stochastic task arrivals, and heterogeneous computing resources. To address these issues, we propose an integrated air-ground collaborative network and formulate a time-dependent integer nonlinear programming problem that jointly optimizes UAV trajectory planning and task offloading decisions. The problem is challenging to solve due to temporal coupling among decision variables. Therefore, we design a hierarchical learning framework with two timescales. At the large timescale, a Vickrey-Clarke-Groves auction mechanism enables the energy-aware and incentive-compatible trajectory assignment. At the small timescale, we propose the diffusion-heterogeneous-agent proximal policy optimization, a generative multi-agent reinforcement learning algorithm that embeds latent diffusion models into actor networks. Each UAV samples actions from a Gaussian prior and refines them via observation-conditioned denoising, enhancing adaptability and policy diversity. Extensive simulations show that our framework outperforms baselines in energy efficiency, task success rate, and convergence performance.
zh

[AI-133] An Operator-Consistent Graph Neural Network for Learning Diffusion Dynamics on Irregular Meshes

【速读】:该论文旨在解决传统数值方法在不规则网格上求解偏微分方程(Partial Differential Equations, PDEs)时出现的不稳定问题,尤其是在多物理场相互作用(如扩散、损伤与愈合)常发生在复杂几何结构中的场景。解决方案的关键在于提出一种算子一致图神经网络(Operator-Consistent Graph Neural Network for Physics-Informed Neural Networks, OCGNN-PINN),其通过节点-边消息传递机制与一致性损失函数相结合,利用图的关联矩阵(incidence matrix)强制约束梯度-散度关系,从而在时间演化过程中保持离散节点和边动力学的结构性耦合,显著提升了在非结构化网格上的稳定性与预测精度,逼近Crank-Nicolson格式在不规则域上的性能表现。

链接: https://arxiv.org/abs/2512.11860
作者: Yuelian Li,Andrew Rushing Hands
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Classical numerical methods solve partial differential equations (PDEs) efficiently on regular meshes, but many of them become unstable on irregular domains. In practice, multiphysics interactions such as diffusion, damage, and healing often take place on irregular meshes. We develop an operator-consistent graph neural network (OCGNN-PINN) that approximates PDE evolution under physics-informed constraints. It couples node-edge message passing with a consistency loss enforcing the gradient-divergence relation through the graph incidence matrix, ensuring that discrete node and edge dynamics remain structurally coupled during temporal rollout. We evaluate the model on diffusion processes over physically driven evolving meshes and real-world scanned surfaces. The results show improved temporal stability and prediction accuracy compared with graph convolutional and multilayer perceptron baselines, approaching the performance of Crank-Nicolson solvers on unstructured domains.
zh

[AI-134] Generative Stochastic Optimal Transport: Guided Harmonic Path-Integral Diffusion

【速读】:该论文旨在解决**引导随机最优传输(Guided Stochastic Optimal Transport, SOT)中的关键挑战,即如何在满足硬终端分布约束的同时,通过软性路径成本函数生成具有几何感知和可解释性的轨迹,并实现稳定采样与可微分协议学习。其解决方案的核心在于提出Guided Harmonic Path-Integral Diffusion (GH-PID)**框架,该框架基于线性可解结构设计:保持前向与后向Kolmogorov方程的线性特性,使最优得分函数可通过显式Green函数比表达,且当终端分布为高斯混合模型(GMM)时能获得闭式解;同时引入以路径成本、中心线贴合度、方差流和漂移努力为核心的诊断指标,构建了可解释的变分近似方法,从而在多场景导航任务中实现了精确终端匹配与集成成本最小化。

链接: https://arxiv.org/abs/2512.11859
作者: Michael Chertkov(University of Arizona)
机构: 未知
类目: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Machine Learning (stat.ML)
备注: 40 pages, 8 figures

点击查看摘要

Abstract:We introduce Guided Harmonic Path-Integral Diffusion (GH-PID), a linearly-solvable framework for guided Stochastic Optimal Transport (SOT) with a hard terminal distribution and soft, application-driven path costs. A low-dimensional guidance protocol shapes the trajectory ensemble while preserving analytic structure: the forward and backward Kolmogorov equations remain linear, the optimal score admits an explicit Green-function ratio, and Gaussian-Mixture Model (GMM) terminal laws yield closed-form expressions. This enables stable sampling and differentiable protocol learning under exact terminal matching. We develop guidance-centric diagnostics – path cost, centerline adherence, variance flow, and drift effort – that make GH-PID an interpretable variational ansatz for empirical SOT. Three navigation scenarios illustrated in 2D: (i) Case A: hand-crafted protocols revealing how geometry and stiffness shape lag, curvature effects, and mode evolution; (ii) Case B: single-task protocol learning, where a PWC centerline is optimized to minimize integrated cost; (iii) Case C: multi-expert fusion, in which a commander reconciles competing expert/teacher trajectories and terminal beliefs through an exact product-of-experts law and learns a consensus protocol. Across all settings, GH-PID generates geometry-aware, trust-aware trajectories that satisfy the prescribed terminal distribution while systematically reducing integrated cost. Comments: 40 pages, 8 figures Subjects: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Machine Learning (stat.ML) Cite as: arXiv:2512.11859 [cs.LG] (or arXiv:2512.11859v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2512.11859 Focus to learn more arXiv-issued DOI via DataCite
zh

[AI-135] Adaptive Path Integral Diffusion: AdaPID

【速读】:该论文旨在解决基于扩散采样的采样质量优化问题,特别是如何通过设计合理的中间时间演化调度(schedule)来提升采样效率与精度。其核心挑战在于:尽管当前扩散采样方法(如基于得分的扩散、桥接扩散和路径积分扩散)能在终时刻匹配目标分布,但早期退出时的保真度(fidelity)、尾部区域准确性(tail accuracy)以及动态条件化(conditioning)等性能指标高度依赖于调度策略的选择。解决方案的关键在于提出一种路径感知型调度选择框架(path-wise schedule selection framework),结合分段常数(Piece-Wise-Constant, PWC)参数化与简单的层次化细化机制,并引入对调度敏感的采样质量诊断指标(Quality-of-Sampling, QoS),从而在固定积分预算下显著改善早期退出性能、尾部建模精度、动力学条件数及标签选择时机(speciation timing)。此外,在高斯混合(Gaussian-Mixture, GM)目标假设下,该方法保留了闭式格林函数比值(closed-form Green functions’ ratio),并提供无需神经网络的数值稳定预测状态映射与得分估计器(score oracle)。

链接: https://arxiv.org/abs/2512.11858
作者: Michael Chertkov,Hamidreza Behjoo(University of Arizona)
机构: 未知
类目: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Machine Learning (stat.ML)
备注: 51 pages, 17 figures

点击查看摘要

Abstract:Diffusion-based samplers – Score Based Diffusions, Bridge Diffusions and Path Integral Diffusions – match a target at terminal time, but the real leverage comes from choosing the schedule that governs the intermediate-time dynamics. We develop a path-wise schedule – selection gramework for Harmonic PID with a time-varying stiffness, exploiting Piece-Wise-Constant(PWC) parametrizations and a simple hierarchical refinement. We introduce schedule-sensitive Quality-of-Sampling (QoS) diagnostics. Assuming a Gaussian-Mixture (GM) target, we retain closed-form Green functions’ ration and numerically stable, Neural-Network free oracles for predicted-state maps and score. Experiments in 2D show that QoS driven PWC schedules consistently improve early-exit fidelity, tail accuracy, conditioning of the dynamics, and speciation (label-selection) timing at fixed integration budgets.
zh

[AI-136] GCoDE: Efficient Device-Edge Co-Inference for GNNs via Architecture-Mapping Co-Search

【速读】:该论文旨在解决图神经网络(Graph Neural Networks, GNNs)在边缘设备上进行高效推理时面临的挑战,即高计算成本与边缘硬件资源受限之间的矛盾,导致GNN推理难以满足实时性和能耗要求。解决方案的关键在于提出GCoDE——首个面向设备-边缘协同推理(device-edge co-inference)的自动架构映射协同设计与部署框架。其核心创新包括:将设备通信过程抽象为显式操作,实现架构与映射方案在统一设计空间中的联合优化;引入基于能量预测的方法提升能效评估精度并识别节能解;采用约束驱动的随机搜索策略,在1.5小时内找到兼顾准确率与效率的最优解;并通过集成的协同推理引擎实现高效部署执行,最终在多种应用场景和系统配置下实现最高达44.9倍加速比和98.2%能耗降低。

链接: https://arxiv.org/abs/2512.11856
作者: Ao Zhou,Jianlei Yang,Tong Qiao,Yingjie Qi,Zhi Yang,Weisheng Zhao,Chunming Hu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: accepted by IEEE Transactions on Computers

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have emerged as the state-of-the-art graph learning method. However, achieving efficient GNN inference on edge devices poses significant challenges, limiting their application in real-world edge scenarios. This is due to the high computational cost of GNNs and limited hardware resources on edge devices, which prevent GNN inference from meeting real-time and energy requirements. As an emerging paradigm, device-edge co-inference shows potential for improving inference efficiency and reducing energy consumption on edge devices. Despite its potential, research on GNN device-edge co-inference remains scarce, and our findings show that traditional model partitioning methods are ineffective for GNNs. To address this, we propose GCoDE, the first automatic framework for GNN architecture-mapping Co-design and deployment on Device-Edge hierarchies. By abstracting the device communication process into an explicit operation, GCoDE fuses the architecture and mapping scheme in a unified design space for joint optimization. Additionally, GCoDE’s system performance awareness enables effective evaluation of architecture efficiency across diverse heterogeneous systems. By analyzing the energy consumption of various GNN operations, GCoDE introduces an energy prediction method that improves energy assessment accuracy and identifies energy-efficient solutions. Using a constraint-based random search strategy, GCoDE identifies the optimal solution in 1.5 hours, balancing accuracy and efficiency. Moreover, the integrated co-inference engine in GCoDE enables efficient deployment and execution of GNN co-inference. Experimental results show that GCoDE can achieve up to 44.9x speedup and 98.2% energy reduction compared to existing approaches across diverse applications and system configurations.
zh

[AI-137] Achieving Approximate Symmetry Is Exponentially Easier than Exact Symmetry

【速读】:该论文旨在解决机器学习模型中精确对称性(exact symmetry)与近似对称性(approximate symmetry)之间的理论差异问题,特别是量化二者在实现过程中所需的计算成本。其核心贡献在于提出了“平均复杂度”(averaging complexity)这一框架,用于衡量通过平均操作强制实现对称性的代价;关键发现是:在标准条件下,实现精确对称性需要线性级别的平均复杂度,而近似对称性仅需对数级别复杂度,从而首次从理论上证明了近似对称性的优势,为实践中更倾向于使用近似对称性提供了理论依据。

链接: https://arxiv.org/abs/2512.11855
作者: Behrooz Tahmasebi,Melanie Weber
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 32 pages, 2 figures

点击查看摘要

Abstract:Enforcing exact symmetry in machine learning models often yields significant gains in scientific applications, serving as a powerful inductive bias. However, recent work suggests that relying on approximate symmetry can offer greater flexibility and robustness. Despite promising empirical evidence, there has been little theoretical understanding, and in particular, a direct comparison between exact and approximate symmetry is missing from the literature. In this paper, we initiate this study by asking: What is the cost of enforcing exact versus approximate symmetry? To address this question, we introduce averaging complexity, a framework for quantifying the cost of enforcing symmetry via averaging. Our main result is an exponential separation: under standard conditions, achieving exact symmetry requires linear averaging complexity, whereas approximate symmetry can be attained with only logarithmic averaging complexity. To the best of our knowledge, this provides the first theoretical separation of these two cases, formally justifying why approximate symmetry may be preferable in practice. Beyond this, our tools and techniques may be of independent interest for the broader study of symmetries in machine learning.
zh

[AI-138] Rep Smarter Not Harder: AI Hypertrophy Coaching with Wearable Sensors and Edge Neural Networks

【速读】:该论文旨在解决阻力训练中肌肉接近疲劳极限(即Repetitions in Reserve, RiR ≤ 2)时难以实现客观、实时反馈的问题,从而优化肌肥大训练强度与疲劳管理之间的平衡。传统依赖主观评估的RiR方法存在可靠性差的问题,易导致训练刺激不足或过度疲劳。其解决方案的关键在于提出一个两阶段边缘计算架构:第一阶段基于ResNet模型从单个腕戴惯性测量单元(Inertial Measurement Unit, IMU)的6轴数据中实时分割重复动作;第二阶段结合分割所得特征、直接卷积特征及由LSTM捕捉的历史上下文信息,构建分类模型识别近失败状态窗口。实验表明,在13名参与者完成肱二头肌弯举至力竭的全新数据集上,该系统在模拟真实推理速率(1.6 Hz)下实现了0.83和0.82的F1分数,并在树莓派5和iPhone 16上分别达到112 ms和23.5 ms的平均延迟,验证了其在资源受限设备上的可行性,为可穿戴AI驱动的肌肥大训练指导工具提供了实用路径。

链接: https://arxiv.org/abs/2512.11854
作者: Grant King,Musa Azeem,Savannah Noblitt,Ramtin Zand,Homayoun Valafar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 24th International Conference on Machine Learning and Applications

点击查看摘要

Abstract:Optimizing resistance training for hypertrophy requires balancing proximity to muscular failure, often quantified by Repetitions in Reserve (RiR), with fatigue management. However, subjective RiR assessment is unreliable, leading to suboptimal training stimuli or excessive fatigue. This paper introduces a novel system for real-time feedback on near-failure states (RiR \le 2) during resistance exercise using only a single wrist-mounted Inertial Measurement Unit (IMU). We propose a two-stage pipeline suitable for edge deployment: first, a ResNet-based model segments repetitions from the 6-axis IMU data in real-time. Second, features derived from this segmentation, alongside direct convolutional features and historical context captured by an LSTM, are used by a classification model to identify exercise windows corresponding to near-failure states. Using a newly collected dataset from 13 diverse participants performing preacher curls to failure (631 total reps), our segmentation model achieved an F1 score of 0.83, and the near-failure classifier achieved an F1 score of 0.82 under simulated real-time evaluation conditions (1.6 Hz inference rate). Deployment on a Raspberry Pi 5 yielded an average inference latency of 112 ms, and on an iPhone 16 yielded 23.5 ms, confirming the feasibility for edge computation. This work demonstrates a practical approach for objective, real-time training intensity feedback using minimal hardware, paving the way for accessible AI-driven hypertrophy coaching tools that help users manage intensity and fatigue effectively.
zh

[AI-139] Explainable AI for Smart Greenhouse Control: Interpretability of Temporal Fusion Transformer in the Internet of Robotic Things

【速读】:该论文旨在解决智能温室中基于物联网的机器人技术(IoRT)系统在时间序列预测模型应用中存在的可解释性不足问题,尤其是在农业自动化场景下,缺乏透明决策机制会限制用户信任、合规性及实际部署效果。其解决方案的关键在于引入Temporal Fusion Transformer (TFT) 模型,并结合模型内生解释方法、局部可解释模型无关解释(LIME)与SHapley加法解释(SHAP),实现对传感器输入(如温湿度、CO₂浓度、光照及外部气候)如何影响执行器控制决策的多层次可解释分析,从而提升模型决策的可信度与实用性,最终在类不平衡数据集上实现了95%的测试准确率,为精准农业中的资源优化和作物增产提供可解释的自动化控制基础。

链接: https://arxiv.org/abs/2512.11852
作者: Muhammad Jawad Bashir,Shagufta Henna,Eoghan Furey
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 7 pages, Accepted in 36th Irish Signals and Systems Conference, ISSC 2025

点击查看摘要

Abstract:The integration of the Internet of Robotic Things (IoRT) in smart greenhouses has revolutionised precision agriculture by enabling efficient and autonomous environmental control. However, existing time series forecasting models in such setups often operate as black boxes, lacking mechanisms for explainable decision-making, which is a critical limitation when trust, transparency, and regulatory compliance are paramount in smart farming practices. This study leverages the Temporal Fusion Transformer (TFT) model to automate actuator settings for optimal greenhouse management. To enhance interpretability and trust in the model decision-making process, both local and global explanation techniques were employed using model-inherent interpretation, local interpretable model-agnostic explanations (LIME), and SHapley additive explanations (SHAP). These explainability methods provide information on how different sensor readings, such as temperature, humidity, CO2 levels, light, and outer climate, contribute to actuator control decisions in an automated greenhouse. The trained TFT model achieved a test accuracy of 95% on a class-imbalanced dataset for actuator control settings in an automated greenhouse environment. The results demonstrate the varying influence of each sensor on real-time greenhouse adjustments, ensuring transparency and enabling adaptive fine-tuning for improved crop yield and resource efficiency.
zh

[AI-140] KV Cache Recycling to Expand Usable Context Capacity in Low Parameter LLM s

【速读】:该论文旨在解决小规模语言模型(small LLM)在推理过程中因重复计算注意力键值(Key Value, KV)状态而导致的效率低下问题,尤其是在处理具有相似前缀的新输入时。解决方案的关键在于提出一种称为“token recycling”的方法,通过缓存先前提示(prompt)的激活状态(即KV states),并利用句子嵌入(sentence embeddings)进行匹配,当新输入与缓存提示存在精确前缀重叠时,直接复用已缓存的KV状态以加速推理过程,从而提升空间利用率和上下文记忆能力。该方法无需修改模型结构,仅需将缓存的KV序列化至CPU并在生成函数中重新加载,即可实现高效、无语义退化的推理加速。

链接: https://arxiv.org/abs/2512.11851
作者: Prashant Pandey
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:Whether attention key value (KV) states computed for one prompt for a small LLM can be reused to accelerate inference on a new similar prompt, giving an increase to the space to its context memory using an approach called token recycling. Using a standard Hugging Face setup with DialoGPT-medium (a 345M parameter GPT-2 style decoder trained on 147M Reddit exchanges, 2005 to 2017) as the testbed, we build a cache of past activations and get entries by sentence embeddings, then reuse cached past key values when the cached prompt is an exact prefix of the new input. We compare recycled vs. baseline runs on latency and output fidelity, and log reuse depth in tokens. Reproducibility requires no model modifications, cached KVs are serialized to the CPU, reloaded, and supplied to the generate function to continue decoding from the cached prefix. In tests, we observe consistent speedups when prefix overlap exists, with no material degradation in output semantics, and when overlap is absent, behavior matches baseline.
zh

[AI-141] Airport Passenger Flow Forecasting via Deformable Temporal-Spectral Transformer Approach

【速读】:该论文旨在解决机场乘客流量预测中因固定尺寸patch嵌入导致难以建模复杂且异质性时间模式的问题。其解决方案的关键在于提出了一种可变形时频Transformer(DTSFormer),通过引入多尺度可变形分区模块和联合时频滤波模块:首先利用基于窗口函数的掩码机制动态划分输入序列,实现不同时间阶段异质趋势的提取;其次在每个尺度内设计频域注意力机制以捕捉高低频成分,强化对波动性和周期性的建模;最终在时域融合多频特征,协同建模短期波动与长期趋势,从而显著提升预测精度。

链接: https://arxiv.org/abs/2512.11845
作者: Wenbo Du,Lingling Han,Ying Xiong,Ling Zhang,Biyue Li,Yisheng Lv,Tong Guo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 14 pages, 10 figures

点击查看摘要

Abstract:Accurate forecasting of passenger flows is critical for maintaining the efficiency and resilience of airport operations. Recent advances in patch-based Transformer models have shown strong potential in various time series forecasting tasks. However, most existing methods rely on fixed-size patch embedding, making it difficult to model the complex and heterogeneous patterns of airport passenger flows. To address this issue, this paper proposes a deformable temporal-spectral transformer named DTSFormer that integrates a multiscale deformable partitioning module and a joint temporal-spectral filtering module. Specifically, the input sequence is dynamically partitioned into multiscale temporal patches via a novel window function-based masking, enabling the extraction of heterogeneous trends across different temporal stages. Then, within each scale, a frequency-domain attention mechanism is designed to capture both high- and low-frequency components, thereby emphasizing the volatility and periodicity inherent in airport passenger flows. Finally, the resulting multi-frequency features are subsequently fused in the time domain to jointly model short-term fluctuations and long-term trends. Comprehensive experiments are conducted on real-world passenger flow data collected at Beijing Capital International Airport from January 2023 to March 2024. The results indicate that the proposed method consistently outperforms state-of-the-art forecasting models across different prediction horizons. Further analysis shows that the deformable partitioning module aligns patch lengths with dominant periods and heterogeneous trends, enabling superior capture of sudden high-frequency fluctuations.
zh

[AI-142] Spiking Manifesto

【速读】:该论文试图解决当前主流人工智能(AI)模型在能效方面远低于生物大脑的问题,特别是指出基于人工神经网络(ANNs)的AI模型因依赖GPU进行大规模矩阵运算而能耗极高,效率仅为大脑的千分之一。其解决方案的关键在于提出一种以脉冲神经网络(SNNs)为基础的新框架,将现有AI模型映射到具有脉冲特性的神经网络结构中,并引入“多时序编码”(polychronization)机制来解释神经元活动如何模拟查找表(look-up tables),从而实现无需矩阵乘法的高效计算。这一方法有望使AI系统在保持性能的同时,获得千倍级别的能效提升。

链接: https://arxiv.org/abs/2512.11843
作者: Eugene Izhikevich
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
备注: This is a declaration of principles and roadmap for spiking networks, intended as a manifesto rather than a conventional research article

点击查看摘要

Abstract:Practically everything computers do is better, faster, and more power-efficient than the brain. For example, a calculator crunches numbers more energy-efficiently than any human. Yet AI models are a thousand times less efficient than the brain. These models use artificial neural networks (ANNs) and require GPUs for the multiplication of huge matrices. In contrast, spiking neural networks (SNNs) of the brain have no matrix multiplication and much smaller energy requirements. This manifesto proposes a framework for thinking about popular AI models in terms of spiking networks and polychronization, and for interpreting spiking activity as nature’s way of implementing look-up tables. This offers a way to convert AI models into a novel type of architecture with the promise of a thousandfold improvement in efficiency. Code is available at this https URL
zh

[AI-143] Semantic Nutrition Estimation: Predicting Food Healthfulness from Text Descriptions

【速读】:该论文旨在解决公共健康领域中营养评估的准确性问题,即现有营养评分系统依赖于详尽的数据,而这些数据往往难以从日常语言描述的食物信息中获取。其解决方案的关键在于构建一个基于机器学习的流水线,利用多头神经网络处理混合特征向量(包括语义文本嵌入、词汇模式、领域启发式规则以及美国农业部食品与营养数据库FNDDS数据),从而预测综合食品指南评分2.0(Food Compass Score 2.0, FCS)。该方法能够估计FCS算法所需的营养成分和食品组分,实现从自然语言到可操作营养信息的转化,显著提升了饮食评估的可扩展性与实用性。

链接: https://arxiv.org/abs/2512.11836
作者: Dayne R. Freudenberg,Daniel G. Haughian,Mitchell A. Klusty,Caroline N. Leach,W. Scott Black,Leslie N. Woltenberg,Rowan Hallock,Elizabeth Solie,Emily B. Collier,Samuel E. Armstrong,V. K. Cody Bumgardner
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 10 pages, 4 figures, 6 tables, submitted to AMIA 2026 Informatics Summit

点击查看摘要

Abstract:Accurate nutritional assessment is critical for public health, but existing profiling systems require detailed data often unavailable or inaccessible from colloquial text descriptions of food. This paper presents a machine learning pipeline that predicts the comprehensive Food Compass Score 2.0 (FCS) from text descriptions. Our approach uses multi-headed neural networks to process hybrid feature vectors that combine semantic text embeddings, lexical patterns, and domain heuristics, alongside USDA Food and Nutrient Database for Dietary Studies (FNDDS) data. The networks estimate the nutrient and food components necessary for the FCS algorithm. The system demonstratedstrong predictive power, achieving a median R^2 of 0.81 for individual nutrients. The predicted FCS correlated strongly with published values (Pearson’s r = 0.77), with a mean absolute difference of 14.0 points. While errors were largest for ambiguous or processed foods, this methodology translates language into actionable nutritional information, enabling scalable dietary assessment for consumer applications and research.
zh

[AI-144] A Monad-Based Clause Architecture for Artificial Age Score (AAS) in Large Language Models

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在部署过程中因内部记忆机制不透明而引发的可解释性与可控性问题,尤其是如何以原则性和可审计的方式治理其“类自我”行为。解决方案的关键在于构建一个基于莱布尼茨《单子论》(Monadology)的条款化架构,以人工记忆老化评分(Artificial Age Score, AAS)为核心,将20个单子概念分组为六个语义模块(如本体、动力学、表征与意识等),并将其转化为可在Python中执行的规范。该框架通过数值实验验证了其有效性:AAS轨迹保持连续且速率受限,矛盾和无支持陈述触发显式惩罚,层级细化揭示受控的有机结构,同时和谐项实现目标-行动对齐,窗口漂移分离持续改进与退化,从而为人工代理的内部动态提供透明、可代码实现的约束与分析蓝图。

链接: https://arxiv.org/abs/2512.11835
作者: Seyma Yaman Kayadibi
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Neural and Evolutionary Computing (cs.NE)
备注: 42 pages, 6 toy simulation Python implementations, 20 monad clauses instantiated across six system bundles (ontology, dynamics, representation and consciousness, harmony and reason, body and organisation, teleology)

点击查看摘要

Abstract:Large language models (LLMs) are often deployed as powerful yet opaque systems, leaving open how their internal memory and “self-like” behavior should be governed in a principled and auditable way. The Artificial Age Score (AAS) was previously introduced and mathematically justified through three theorems that characterise it as a metric of artificial memory aging. Building on this foundation, the present work develops an engineering-oriented, clause-based architecture that imposes law-like constraints on LLM memory and control. Twenty selected monads from Leibniz’s Monadology are grouped into six bundles: ontology, dynamics, representation and consciousness, harmony and reason, body and organisation, and teleology, and each bundle is realised as an executable specification on top of the AAS kernel. Across six minimal Python implementations, these clause families are instantiated in numerical experiments acting on channel-level quantities such as recall scores, redundancy, and weights. Each implementation follows a four-step pattern: inputs and setup, clause implementation, numerical results, and implications for LLM design, emphasising that the framework is not only philosophically motivated but also directly implementable. The experiments show that the clause system exhibits bounded and interpretable behavior: AAS trajectories remain continuous and rate-limited, contradictions and unsupported claims trigger explicit penalties, and hierarchical refinement reveals an organic structure in a controlled manner. Dual views and goal-action pairs are aligned by harmony terms, and windowed drift in perfection scores separates sustained improvement from sustained degradation. Overall, the monad-based clause framework uses AAS as a backbone and provides a transparent, code-level blueprint for constraining and analyzing internal dynamics in artificial agents.
zh

[AI-145] Performance and Efficiency of Climate In-Situ Data Reconstruction: Why Optimized IDW Outperforms kriging and Implicit Neural Representation

【速读】:该论文旨在解决稀疏气候数据重建问题,即如何在观测点分布不均或数据缺失的情况下,高效且准确地恢复完整空间场。其关键解决方案在于系统比较了三种重建方法:简单的反距离加权法(IDW)、基于统计学的普通克里金法(OK)以及先进的隐式神经表示模型(MMGN架构),并通过超参数调优和跨100个随机采样稀疏数据集的全面实验验证性能。结果显示,IDW在重建精度(如RMSE、MAE、R²)和计算效率方面均显著优于其他方法,表明对于此类任务,简单而稳健的插值策略可能比复杂模型更具实际价值。

链接: https://arxiv.org/abs/2512.11832
作者: Jakub Walczak
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This study evaluates three reconstruction methods for sparse climate data: the simple inverse distance weighting (IDW), the statistically grounded ordinary kriging (OK), and the advanced implicit neural representation model (MMGN architecture). All methods were optimized through hyper-parameter tuning using validation splits. An extensive set of experiments was conducted, followed by a comprehensive statistical analysis. The results demonstrate the superiority of the simple IDW method over the other reference methods in terms of both reconstruction accuracy and computational efficiency. IDW achieved the lowest RMSE ( 3.00 \pm 1.93 ), MAE ( 1.32 \pm 0.77 ), and \Delta_MAX ( 24.06 \pm 17.15 ), as well as the highest R^2 ( 0.68 \pm 0.16 ), across 100 randomly sampled sparse datasets from the ECA\D database. Differences in RMSE, MAE, and R^2 were statistically significant and exhibited moderate to large effect sizes. The Dunn post-hoc test further confirmed the consistent superiority of IDW across all evaluated quality measures […]
zh

[AI-146] CR3G: Causal Reasoning for Patient-Centric Explanations in Radiology Report Generation

【速读】:该论文旨在解决当前生成式 AI 在胸部 X 射线报告生成中仅能识别图像模式而难以理解其与患者病情之间因果关系的问题,从而限制了AI诊断的可解释性和临床可信度。解决方案的关键在于提出一种基于提示(prompt-driven)的因果推理框架——CR3G(Causal Reasoning for Patient-Centric Explanations in radiology Report Generation),该框架聚焦于从影像特征到诊断结论的因果机制建模,通过引入因果推理逻辑增强模型对异常情况的解释能力,尤其在2种异常类型上显著提升了因果关联和患者导向解释的质量。

链接: https://arxiv.org/abs/2512.11830
作者: Satyam Kumar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 8 pages, 5 figures, 1 table

点击查看摘要

Abstract:Automatic chest X-ray report generation is an important area of research aimed at improving diagnostic accuracy and helping doctors make faster decisions. Current AI models are good at finding correlations (or patterns) in medical images. Still, they often struggle to understand the deeper cause-and-effect relationships between those patterns and a patient condition. Causal inference is a powerful approach that goes beyond identifying patterns to uncover why certain findings in an X-ray relate to a specific diagnosis. In this paper, we will explore the prompt-driven framework Causal Reasoning for Patient-Centric Explanations in radiology Report Generation (CR3G) that is applied to chest X-ray analysis to improve understanding of AI-generated reports by focusing on cause-and-effect relationships, reasoning and generate patient-centric explanation. The aim to enhance the quality of AI-driven diagnostics, making them more useful and trustworthy in clinical practice. CR3G has shown better causal relationship capability and explanation capability for 2 out of 5 abnormalities.
zh

[AI-147] Active Inference with Reusable State-Dependent Value Profiles

【速读】:该论文旨在解决在波动环境中,智能体如何高效地适应不同潜在状态并实现价值控制策略的动态切换问题。传统方法需为每个情境独立维护偏好、策略偏差和动作置信度参数,这在计算上不可行。其解决方案的关键在于提出“价值谱系”(value profiles)——一种由少量可复用的价值相关参数包(包括结果偏好、策略先验和策略精度)构成的机制,这些参数被分配给生成模型中的隐状态。通过信念加权混合的方式,随着试验中后验信念的变化,能够自动生成适用于特定状态的控制参数,从而无需为每种情境单独设定参数即可实现状态条件下的策略选择。这一框架在概率反转学习任务中得到验证,结果显示基于价值谱系的模型优于静态或熵耦合动态精度模型,并揭示策略先验的调节是适应性控制的主要驱动力,而非策略精度。

链接: https://arxiv.org/abs/2512.11829
作者: Jacob Poschl
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 27 pages

点击查看摘要

Abstract:Adaptive behavior in volatile environments requires agents to switch among value-control regimes across latent contexts, but maintaining separate preferences, policy biases, and action-confidence parameters for every situation is intractable. We introduce value profiles: a small set of reusable bundles of value-related parameters (outcome preferences, policy priors, and policy precision) assigned to hidden states in a generative model. As posterior beliefs over states evolve trial by trial, effective control parameters arise via belief-weighted mixing, enabling state-conditional strategy recruitment without requiring independent parameters for each context. We evaluate this framework in probabilistic reversal learning, comparing static-precision, entropy-coupled dynamic-precision, and profile-based models using cross-validated log-likelihood and information criteria. Model comparison favors the profile-based model over simpler alternatives (about 100-point AIC differences), and parameter-recovery analyses support structural identifiability even when context must be inferred from noisy observations. Model-based inference further suggests that adaptive control in this task is driven primarily by modulation of policy priors rather than policy precision, with gradual belief-dependent profile recruitment consistent with state-conditional (not purely uncertainty-driven) control. Overall, reusable value profiles provide a tractable computational account of belief-conditioned value control in volatile environments and yield testable signatures of belief-dependent control and behavioral flexibility.
zh

[AI-148] otalitarian Technics: The Hidden Cost of AI Scribes in Healthcare

【速读】:该论文试图解决的问题是:人工智能听写系统(AI scribes)在医疗实践中的广泛应用,虽然被宣传为缓解临床行政负担的解决方案,但其深层影响可能重塑医学关注的本质,进而危及临床专业性和医患关系的质量。论文的核心观点在于,AI听写系统的关键问题不在于是否提高效率,而在于其体现并强化了一种以左半球主导的、计算性的注意力模式,这种模式优先考虑可量化和程序化的内容,忽视了临床情境中的直觉与人际联结。因此,解决方案的关键在于重新审视技术对医学注意力结构的形塑作用,警惕技术嵌入对临床智慧的侵蚀,并推动更具整体性的人文导向医疗实践重构。

链接: https://arxiv.org/abs/2512.11814
作者: Hugh Brosnahan
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Artificial intelligence (AI) scribes, systems that record and summarise patient-clinician interactions, are promoted as solutions to administrative overload. This paper argues that their significance lies not in efficiency gains but in how they reshape medical attention itself. Offering a conceptual analysis, it situates AI scribes within a broader philosophical lineage concerned with the externalisation of human thought and skill. Drawing on Iain McGilchrist’s hemisphere theory and Lewis Mumford’s philosophy of technics, the paper examines how technology embodies and amplifies a particular mode of attention. AI scribes, it contends, exemplify the dominance of a left-hemispheric, calculative mindset that privileges the measurable and procedural over the intuitive and relational. As this mode of attention becomes further embedded in medical practice, it risks narrowing the field of care, eroding clinical expertise, and reducing physicians to operators within an increasingly mechanised system.
zh

[AI-149] AI-Generated Compromises for Coalition Formation

【速读】:该论文试图解决在多方协作场景中,如何有效识别并生成能够获得广泛支持的妥协提案(compromise proposals)这一核心问题,尤其针对民主文本编辑等需要多主体协商的领域。解决方案的关键在于构建一个融合代理有限理性与不确定性的形式化模型,并利用自然语言处理(Natural Language Processing, NLP)技术及大语言模型(Large Language Models, LLMs)从文本中诱导出语义度量空间(semantic metric space),在此基础上设计算法以识别高支持率的妥协点,从而显著提升大规模民主文本编辑过程中妥协方案的生成效率与可行性。

链接: https://arxiv.org/abs/2506.06837
作者: Eyal Briman,Ehud Shapiro,Nimrod Talmon
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
备注:

点击查看摘要

Abstract:The challenge of finding compromises between agent proposals is fundamental to AI subfields such as argumentation, mediation, and negotiation. Building on this tradition, Elkind et al. (2021) introduced a process for coalition formation that seeks majority-supported proposals preferable to the status quo, using a metric space where each agent has an ideal point. A crucial step in this process involves identifying compromise proposals around which agent coalitions can unite. How to effectively find such compromise proposals remains an open question. We address this gap by formalizing a model that incorporates agent bounded rationality and uncertainty, and by developing AI methods to generate compromise proposals. We focus on the domain of collaborative document writing, such as the democratic drafting of a community constitution. Our approach uses natural language processing techniques and large language models to induce a semantic metric space over text. Based on this space, we design algorithms to suggest compromise points likely to receive broad support. To evaluate our methods, we simulate coalition formation processes and show that AI can facilitate large-scale democratic text editing, a domain where traditional tools are limited.
zh

[AI-150] Carrot stick or both? Price incentives for sustainable food choice in competitive environments

【速读】:该论文旨在解决肉类消费作为全球温室气体排放主要驱动因素之一的问题,特别是在现实世界、竞争性餐饮环境中如何通过定价干预有效减少肉类摄入并降低碳足迹。其解决方案的关键在于设计并验证一种平衡的定价策略——即同时实施素食餐折扣与肉类餐加价的组合方案(-1.2 CHF = +1.2 CHF),该策略在不显著影响整体销售额或收入的前提下,实现了11.3%的每餐二氧化碳排放减少,且避免了因价格调整引发的销售转移(spillover effect),从而在环境效益与经济可行性之间取得最优平衡。

链接: https://arxiv.org/abs/2512.13174
作者: Francesco Salvi,Giuseppe Russo,Adam Barla,Vincent Moreau,Robert West
机构: 未知
类目: General Economics (econ.GN); Artificial Intelligence (cs.AI)
备注: 10 pages, 3 figures

点击查看摘要

Abstract:Meat consumption is a major driver of global greenhouse gas emissions. While pricing interventions have shown potential to reduce meat intake, previous studies have focused on highly constrained environments with limited consumer choice. Here, we present the first large-scale field experiment to evaluate multiple pricing interventions in a real-world, competitive setting. Using a sequential crossover design with matched menus in a Swiss university campus, we systematically compared vegetarian-meal discounts (-2.5 CHF), meat surcharges (+2.5 CHF), and a combined scheme (-1.2 CHF=+1.2 CHF) across four campus cafeterias. Only the surcharge and combined interventions led to significant increases in vegetarian meal uptake–by 26.4% and 16.6%, respectively–and reduced CO2 emissions per meal by 7.4% and 11.3%, respectively. The surcharge, while effective, triggered a 12.3% drop in sales at intervention sites and a corresponding 14.9% increase in non-treated locations, hence causing a spillover effect that completely offset environmental gains. In contrast, the combined approach achieved meaningful emission reductions without significant effects on overall sales or revenue, making it both effective and economically viable. Notably, pricing interventions were equally effective for both vegetarian-leaning customers and habitual meat-eaters, stimulating change even within entrenched dietary habits. Our results show that balanced pricing strategies can reduce the carbon footprint of realistic food environments, but require coordinated implementation to maximize climate benefits and avoid unintended spillover effects.
zh

[AI-151] A Disproof of Large Language Model Consciousness: The Necessity of Continual Learning for Consciousness

【速读】:该论文试图解决的问题是:如何基于严格的理论约束来评估当代大型语言模型(Large Language Models, LLMs)是否具有意识。解决方案的关键在于提出一种新的“邻近性论证”(Proximity Argument),指出由于LLMs与其输入/输出功能等价的系统在结构上高度接近,而这些等价系统无法被任何非平凡的意识理论判定为具有意识,因此LLMs也不具备意识。这一论证构成了对当前LLMs具备意识的否定性证明。此外,论文进一步表明,基于持续学习(continual learning)的意识理论能够满足人类意识理论所需的严格形式约束,暗示持续学习可能是意识的关键特征,从而解释了LLMs因缺乏持续学习能力而无法具备意识的根本原因。

链接: https://arxiv.org/abs/2512.12802
作者: Erik Hoel
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
备注: 28 pages, 3 figures

点击查看摘要

Abstract:The requirements for a falsifiable and non-trivial theory of consciousness significantly constrain such theories. Specifically, recent research on the Unfolding Argument and the Substitution Argument has given us formal tools to analyze requirements for a theory of consciousness. I show via a new Proximity Argument that these requirements especially constrain the potential consciousness of contemporary Large Language Models (LLMs) because of their proximity to systems that are equivalent to LLMs in terms of input/output function; yet, for these functionally equivalent systems, there cannot be any non-trivial theory of consciousness that judges them conscious. This forms the basis of a disproof of contemporary LLM consciousness. I then show a positive result, which is that theories of consciousness based on (or requiring) continual learning do satisfy the stringent formal constraints for a theory of consciousness in humans. Intriguingly, this work supports a hypothesis: If continual learning is linked to consciousness in humans, the current limitations of LLMs (which do not continually learn) are intimately tied to their lack of consciousness.
zh

[AI-152] Explainable Artificial Intelligence for Economic Time Series: A Comprehensive Review and a Systematic Taxonomy of Methods and Concepts

【速读】:该论文旨在解决生成式 AI(Generative AI)在计算经济学中可解释性不足的问题,尤其是在经济时间序列预测场景下,传统解释方法因自相关、非平稳性、季节性、混合频率和状态转换等因素而失效或缺乏经济合理性。解决方案的关键在于构建一个系统性的分类框架,将解释方法按机制(传播类、扰动与博弈论归因类、函数类全局工具)和时序兼容性(保持时间依赖、时变稳定性、数据生成约束尊重)进行划分,并提出针对时间序列特性的改进方法,如向量和窗口形式的 SHAP(Vector SHAP, WindowSHAP),以减少滞后碎片化并提升解释效率;同时,通过干预性归因(Causal Shapley values)和受约束的反事实推理,将可解释性与因果推断及政策分析相连接,最终强调归因不确定性与解释动态作为结构变化的指标,为决策级应用(如实时预测、压力测试、状态监控)提供可靠依据。

链接: https://arxiv.org/abs/2512.12506
作者: Agustín García-García,Pablo Hidalgo,Julio E. Sandubete
机构: 未知
类目: General Economics (econ.GN); Artificial Intelligence (cs.AI)
备注: 11 pages, 1 table

点击查看摘要

Abstract:Explainable Artificial Intelligence (XAI) is increasingly required in computational economics, where machine-learning forecasters can outperform classical econometric models but remain difficult to audit and use for policy. This survey reviews and organizes the growing literature on XAI for economic time series, where autocorrelation, non-stationarity, seasonality, mixed frequencies, and regime shifts can make standard explanation techniques unreliable or economically implausible. We propose a taxonomy that classifies methods by (i) explanation mechanism: propagation-based approaches (e.g., Integrated Gradients, Layer-wise Relevance Propagation), perturbation and game-theoretic attribution (e.g., permutation importance, LIME, SHAP), and function-based global tools (e.g., Accumulated Local Effects); (ii) time-series compatibility, including preservation of temporal dependence, stability over time, and respect for data-generating constraints. We synthesize time-series-specific adaptations such as vector- and window-based formulations (e.g., Vector SHAP, WindowSHAP) that reduce lag fragmentation and computational cost while improving interpretability. We also connect explainability to causal inference and policy analysis through interventional attributions (Causal Shapley values) and constrained counterfactual reasoning. Finally, we discuss intrinsically interpretable architectures (notably attention-based transformers) and provide guidance for decision-grade applications such as nowcasting, stress testing, and regime monitoring, emphasizing attribution uncertainty and explanation dynamics as indicators of structural change.
zh

[AI-153] Accurate de novo sequencing of the modified proteome with OmniNovo

【速读】:该论文旨在解决当前蛋白质组学方法在识别未表征或复杂翻译后修饰(Post-translational modifications, PTMs)方面存在的局限性,尤其是标准数据库搜索算法因搜索空间组合爆炸而导致的识别效率低下问题。其解决方案的关键在于提出OmniNovo——一个统一的深度学习框架,无需参考序列即可直接从串联质谱数据中进行未修饰及修饰肽段的参考-free测序;该模型通过学习通用的碎裂规则,在单一模型中解析多种类型的PTMs,并结合质量约束解码算法与严格的假阳性率估计,实现了比传统方法高51%的肽段识别率,且具备良好的泛化能力,可应用于训练中未见的生物学位点,从而揭示蛋白质组中的“暗物质”,推动细胞调控机制的无偏全面分析。

链接: https://arxiv.org/abs/2512.12272
作者: Yuhan Chen,Shang Qu,Zhiqiang Gao,Yuejin Yang,Xiang Zhang,Sheng Xu,Xinjie Mao,Liujia Qian,Jiaqi Wei,Zijie Qiu,Chenyu You,Lei Bai,Ning Ding,Tiannan Guo,Bowen Zhou,Siqi Sun
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Post-translational modifications (PTMs) serve as a dynamic chemical language regulating protein function, yet current proteomic methods remain blind to a vast portion of the modified proteome. Standard database search algorithms suffer from a combinatorial explosion of search spaces, limiting the identification of uncharacterized or complex modifications. Here we introduce OmniNovo, a unified deep learning framework for reference-free sequencing of unmodified and modified peptides directly from tandem mass spectra. Unlike existing tools restricted to specific modification types, OmniNovo learns universal fragmentation rules to decipher diverse PTMs within a single coherent model. By integrating a mass-constrained decoding algorithm with rigorous false discovery rate estimation, OmniNovo achieves state-of-the-art accuracy, identifying 51% more peptides than standard approaches at a 1% false discovery rate. Crucially, the model generalizes to biological sites unseen during training, illuminating the dark matter of the proteome and enabling unbiased comprehensive analysis of cellular regulation.
zh

[AI-154] Stochastic Volatility Modelling with LSTM Networks: A Hybrid Approach for SP 500 Index Volatility Forecasting

【速读】:该论文旨在解决金融时间序列中波动率预测的准确性问题,尤其是在应对突发市场事件时的传统模型表现不足的问题。解决方案的关键在于构建一个融合随机波动率模型(Stochastic Volatility, SV)与长短期记忆神经网络(Long Short Term Memory, LSTM)的混合建模框架:SV模型通过捕捉潜在的波动率动态提升统计精度,尤其在处理突发事件时具有优势;LSTM则通过学习非线性模式增强对复杂金融时序特征的识别能力。该方法在标普500指数(S&P 500)日度数据上验证,采用滚动窗口进行训练和一步-ahead预测,结果表明该混合模型显著优于单独使用SV或LSTM模型,为风险评估与投资策略制定提供了更可靠的工具。

链接: https://arxiv.org/abs/2512.12250
作者: Anna Perekhodko,Robert Ślepaczuk
机构: 未知
类目: Trading and Market Microstructure (q-fin.TR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Portfolio Management (q-fin.PM)
备注: 32 pages, 15 tables, 11 figures

点击查看摘要

Abstract:Accurate volatility forecasting is essential in banking, investment, and risk management, because expectations about future market movements directly influence current decisions. This study proposes a hybrid modelling framework that integrates a Stochastic Volatility model with a Long Short Term Memory neural network. The SV model improves statistical precision and captures latent volatility dynamics, especially in response to unforeseen events, while the LSTM network enhances the model’s ability to detect complex nonlinear patterns in financial time series. The forecasting is conducted using daily data from the S and P 500 index, covering the period from January 1 1998 to December 31 2024. A rolling window approach is employed to train the model and generate one step ahead volatility forecasts. The performance of the hybrid SV-LSTM model is evaluated through both statistical testing and investment simulations. The results show that the hybrid approach outperforms both the standalone SV and LSTM models and contributes to the development of volatility modelling techniques, providing a foundation for improving risk assessment and strategic investment planning in the context of the S and P 500.
zh

[AI-155] Gene regulatory network inference algorithm based on spectral signed directed graph convolution

【速读】:该论文旨在解决基因调控网络(Gene Regulatory Network, GRN)在单细胞RNA测序(scRNA-seq)数据背景下准确重构的问题,尤其关注如何有效建模GRN的符号有向图结构以捕捉基因间的激活与抑制关系。传统谱图卷积方法难以处理此类结构特征,为此,作者提出MSGRNLink框架,其核心创新在于将GRN显式建模为符号有向图,并引入磁符号拉普拉斯卷积(magnetic signed Laplacian convolution)来提取拓扑结构特征,从而显著提升GRN重构的准确性,在多个模拟和真实数据集上均优于基线模型,且在膀胱癌案例中预测出更多已知边及其符号,验证了其生物学合理性。

链接: https://arxiv.org/abs/2512.11927
作者: Rijie Xi,Weikang Xu,Wei Xiong,Yuannong Ye,Bin Zhao
机构: 未知
类目: Molecular Networks (q-bio.MN); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurately reconstructing Gene Regulatory Networks (GRNs) is crucial for understanding gene functions and disease mechanisms. Single-cell RNA sequencing (scRNA-seq) technology provides vast data for computational GRN reconstruction. Since GRNs are ideally modeled as signed directed graphs to capture activation/inhibition relationships, the most intuitive and reasonable approach is to design feature extractors based on the topological structure of GRNs to extract structural features, then combine them with biological characteristics for research. However, traditional spectral graph convolution struggles with this representation. Thus, we propose MSGRNLink, a novel framework that explicitly models GRNs as signed directed graphs and employs magnetic signed Laplacian convolution. Experiments across simulated and real datasets demonstrate that MSGRNLink outperforms all baseline models in AUROC. Parameter sensitivity analysis and ablation studies confirmed its robustness and the importance of each module. In a bladder cancer case study, MSGRNLink predicted more known edges and edge signs than benchmark models, further validating its biological relevance.
zh

[AI-156] A fine-grained look at causal effects in causal spaces

【速读】:该论文旨在解决传统因果推断方法在现代数据域中面临的局限性问题,即当原始变量(如图像像素或语言模型中的词元)缺乏语义结构时,难以提出有意义的因果问题。其解决方案的关键在于引入事件层面(event-level)的因果分析框架,基于测度论的因果空间(causal spaces) axiomatisation,定义了判断因果效应是否存在的一系列二元指标,并建立了干预测度下因果效应与(独立性)之间的理论联系;同时提出了量化事件间因果效应强度和性质的度量方法,证明了经典治疗效应度量可作为其特例,从而为复杂数据结构提供了更精细、语义清晰的因果建模路径。

链接: https://arxiv.org/abs/2512.11919
作者: Junhyung Park,Yuqing Zhou
机构: 未知
类目: Methodology (stat.ME); Artificial Intelligence (cs.AI); Statistics Theory (math.ST)
备注:

点击查看摘要

Abstract:The notion of causal effect is fundamental across many scientific disciplines. Traditionally, quantitative researchers have studied causal effects at the level of variables; for example, how a certain drug dose (W) causally affects a patient’s blood pressure (Y). However, in many modern data domains, the raw variables-such as pixels in an image or tokens in a language model-do not have the semantic structure needed to formulate meaningful causal questions. In this paper, we offer a more fine-grained perspective by studying causal effects at the level of events, drawing inspiration from probability theory, where core notions such as independence are first given for events and sigma-algebras, before random variables enter the picture. Within the measure-theoretic framework of causal spaces, a recently introduced axiomatisation of causality, we first introduce several binary definitions that determine whether a causal effect is present, as well as proving some properties of them linking causal effect to (in)dependence under an intervention measure. Further, we provide quantifying measures that capture the strength and nature of causal effects on events, and show that we can recover the common measures of treatment effect as special cases.
zh

[AI-157] Understanding Structural Representation in Foundation Models for Polymers

【速读】:该论文旨在解决聚合物基础模型(foundation models for polymers)在开发过程中面临的多重挑战,包括训练数据稀缺和缺乏标准化基准等问题。其核心难点在于聚合物的结构表示方式,传统SMILES(Simplified Molecular Input Line Entry System)表示法难以准确捕捉聚合物的关键拓扑特征与连接性。为此,作者提出了一种基于SMILES的聚合物图表示方法(SMILES-based polymer graph representation),该方法能够有效表达聚合物的架构特征和化学连通性,从而显著提升模型在28个不同基准数据集上的性能表现。关键创新在于引入图结构以增强SMILES对聚合物复杂结构的表征能力,同时通过严格的控制实验揭示了现有SMILES表示方法在预测任务中具有强鲁棒性,甚至对化学或语义无效的输入也能实现良好泛化,这表明基于SMILES的语言模型具备在整个序列空间中插值的能力,而非仅限于合法分子输入。这一发现强调了控制实验在验证假设、推动理性设计聚合物基础模型及其结构表示中的重要价值。

链接: https://arxiv.org/abs/2512.11881
作者: Nathaniel H. Park,Eduardo Soares,Victor Y. Shirasuna,Tiffany J. Callahan,Sara Capponi,Emilio Vital Brazil
机构: 未知
类目: oft Condensed Matter (cond-mat.soft); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:From the relative scarcity of training data to the lack of standardized benchmarks, the development of foundation models for polymers face significant and multi-faceted challenges. At the core, many of these issues are tied directly to the structural representation of polymers and here, we present a new foundation model using a SMILES-based polymer graph representation. This approach allows representation of critical polymer architectural features and connectivity that are not available in other SMILES-based representations. The developed polymer foundation model exhibited excellent performance on 28 different benchmark datasets. Critical evaluation of the developed representation against other variations in control experiments reveals this approach to be a highly performant method of representing polymers in language-based foundation models. These control experiments also reveal a strong invariance of all SMILES representations, with many variations achieving state-of-the-art or near state-of-the-art performance, including those which are chemically or semantically invalid. Examination of error sources and attention maps for the evaluated representations corroborate the findings of the control experiments, showing that chemistry language models based on SMILES interpolate over all sequence space for prediction tasks, not only those of semantically valid inputs. Overall, this work highlights the importance of control experiments as a check on human-imposed assumptions that can limit rational design of both chemistry foundation models and their underlying structural representations.
zh

[AI-158] A Multitask VAE for Time Series Preprocessing and Prediction of Blood Glucose Level

【速读】:该论文旨在解决时间序列数据预处理中因缺失值或异常值导致的模型偏差问题,此类问题在连接医疗设备采集的数据中尤为常见,且传统预处理方法依赖额外假设和领域知识,不仅耗时,还可能显著降低预测模型的准确性及医学解释的可靠性。解决方案的关键在于提出一种基于变分自编码器(Variational Auto-Encoder, VAE)与循环变分自编码器(Recurrent VAE)相结合的深度学习架构,通过构建一个可学习的预处理潜在空间来自动捕捉数据特征并保留时间动态特性,从而减少对人工预处理的依赖,并提升糖尿病患者血糖水平预测的准确性。

链接: https://arxiv.org/abs/2410.00015
作者: Ali AbuSaleh,Mehdi Rahim
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Data preprocessing is a critical part of time series data analysis. Data from connected medical devices often have missing or abnormal values during acquisition. Handling such situations requires additional assumptions and domain knowledge. This can be time-consuming, and can introduce a significant bias affecting predictive model accuracy and thus, medical interpretation. To overcome this issue, we propose a new deep learning model to mitigate the preprocessing assumptions. The model architecture relies on a variational auto-encoder (VAE) to produce a preprocessing latent space, and a recurrent VAE to preserve the temporal dynamics of the data. We demonstrate the effectiveness of such an architecture on telemonitoring data to forecast glucose-level of diabetic patients. Our results show an improvement in terms of accuracy with respect of existing state-of-the-art methods and architectures.
zh

机器学习

[LG-0] A Scientific Reasoning Model for Organic Synthesis Procedure Generation

链接: https://arxiv.org/abs/2512.13668
作者: Guoqing Liu,Junren Li,Zihan Zhao,Eray Inanc,Krzysztof Maziarz,Jose Garrido Torres,Victor Garcia Satorras,Shoko Ueda,Christopher M. Bishop,Marwin Segler
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Solving computer-aided synthesis planning is essential for enabling fully automated, robot-assisted synthesis workflows and improving the efficiency of drug discovery. A key challenge, however, is bridging the gap between computational route design and practical laboratory execution, particularly the accurate prediction of viable experimental procedures for each synthesis step. In this work, we present QFANG, a scientific reasoning language model capable of generating precise, structured experimental procedures directly from reaction equations, with explicit chain-of-thought reasoning. To develop QFANG, we curated a high-quality dataset comprising 905,990 chemical reactions paired with structured action sequences, extracted and processed from patent literature using large language models. We introduce a Chemistry-Guided Reasoning (CGR) framework that produces chain-of-thought data grounded in chemical knowledge at scale. The model subsequently undergoes supervised fine-tuning to elicit complex chemistry reasoning. Finally, we apply Reinforcement Learning from Verifiable Rewards (RLVR) to further enhance procedural accuracy. Experimental results demonstrate that QFANG outperforms advanced general-purpose reasoning models and nearest-neighbor retrieval baselines, measured by traditional NLP similarity metrics and a chemically aware evaluator using an LLM-as-a-judge. Moreover, QFANG generalizes to certain out-of-domain reaction classes and adapts to variations in laboratory conditions and user-specific constraints. We believe that QFANG’s ability to generate high-quality synthesis procedures represents an important step toward bridging the gap between computational synthesis planning and fully automated laboratory synthesis.

[LG-1] SEDULity: A Proof-of-Learning Framework for Distributed and Secure Blockchains with Efficient Useful Work

链接: https://arxiv.org/abs/2512.13666
作者: Weihang Cao,Mustafa Doger,Sennur Ulukus
类目: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The security and decentralization of Proof-of-Work (PoW) have been well-tested in existing blockchain systems. However, its tremendous energy waste has raised concerns about sustainability. Proof-of-Useful-Work (PoUW) aims to redirect the meaningless computation to meaningful tasks such as solving machine learning (ML) problems, giving rise to the branch of Proof-of-Learning (PoL). While previous studies have proposed various PoLs, they all, to some degree, suffer from security, decentralization, or efficiency issues. In this paper, we propose a PoL framework that trains ML models efficiently while maintaining blockchain security in a fully distributed manner. We name the framework SEDULity, which stands for a Secure, Efficient, Distributed, and Useful Learning-based blockchain system. Specifically, we encode the template block into the training process and design a useful function that is difficult to solve but relatively easy to verify, as a substitute for the PoW puzzle. We show that our framework is distributed, secure, and efficiently trains ML models. We further demonstrate that the proposed PoL framework can be extended to other types of useful work and design an incentive mechanism to incentivize task verification. We show theoretically that a rational miner is incentivized to train fully honestly with well-designed system parameters. Finally, we present simulation results to demonstrate the performance of our framework and validate our analysis.

[LG-2] StutterFuse: Mitigating Modality Collapse in Stuttering Detection with Jaccard-Weighted Metric Learning and Gated Fusion

链接: https://arxiv.org/abs/2512.13632
作者: Guransh Singh,Md Shah Fahad
类目: Machine Learning (cs.LG)
*备注: 13 pages, 10 figures

点击查看摘要

Abstract:Stuttering detection breaks down when disfluencies overlap. Existing parametric models struggle to distinguish complex, simultaneous disfluencies (e.g., a ‘block’ with a ‘prolongation’) due to the scarcity of these specific combinations in training data. While Retrieval-Augmented Generation (RAG) has revolutionized NLP by grounding models in external knowledge, this paradigm remains unexplored in pathological speech processing. To bridge this gap, we introduce StutterFuse, the first Retrieval-Augmented Classifier (RAC) for multi-label stuttering detection. By conditioning a Conformer encoder on a non-parametric memory bank of clinical examples, we allow the model to classify by reference rather than memorization. We further identify and solve “Modality Collapse”, an “Echo Chamber” effect where naive retrieval boosts recall but degrades precision. We mitigate this using: (1) SetCon, a Jaccard-Weighted Metric Learning objective that optimizes for multi-label set similarity, and (2) a Gated Mixture-of-Experts fusion strategy that dynamically arbitrates between acoustic evidence and retrieved context. On the SEP-28k dataset, StutterFuse achieves a weighted F1-score of 0.65, outperforming strong baselines and demonstrating remarkable zero-shot cross-lingual generalization.

[LG-3] LightTopoGAT: Enhancing Graph Attention Networks with Topological Features for Efficient Graph Classification

链接: https://arxiv.org/abs/2512.13617
作者: Ankit Sharma,Sayan Roy Gupta
类目: Machine Learning (cs.LG)
*备注: 9 pages

点击查看摘要

[LG-4] Scalable Formal Verification via Autoencoder Latent Space Abstraction

链接: https://arxiv.org/abs/2512.13593
作者: Robert Reed,Morteza Lahijanian,Luca Laurenti
类目: Machine Learning (cs.LG)
*备注: 14 pages, 7 figures, under review

点击查看摘要

Abstract:Finite Abstraction methods provide a powerful formal framework for proving that systems satisfy their specifications. However, these techniques face scalability challenges for high-dimensional systems, as they rely on state-space discretization which grows exponentially with dimension. Learning-based approaches to dimensionality reduction, utilizing neural networks and autoencoders, have shown great potential to alleviate this problem. However, ensuring the correctness of the resulting verification results remains an open question. In this work, we provide a formal approach to reduce the dimensionality of systems via convex autoencoders and learn the dynamics in the latent space through a kernel-based method. We then construct a finite abstraction from the learned model in the latent space and guarantee that the abstraction contains the true behaviors of the original system. We show that the verification results in the latent space can be mapped back to the original system. Finally, we demonstrate the effectiveness of our approach on multiple systems, including a 26D system controlled by a neural network, showing significant scalability improvements without loss of rigor.

[LG-5] Async Control: Stress-testing Asynchronous Control Measures for LLM Agents

链接: https://arxiv.org/abs/2512.13526
作者: Asa Cooper Stickland,Jan Michelfeit,Arathi Mani,Charlie Griffin,Ollie Matthews,Tomek Korbak,Rogan Inglis,Oliver Makins,Alan Cooney
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:LLM-based software engineering agents are increasingly used in real-world development tasks, often with access to sensitive data or security-critical codebases. Such agents could intentionally sabotage these codebases if they were misaligned. We investigate asynchronous monitoring, in which a monitoring system reviews agent actions after the fact. Unlike synchronous monitoring, this approach does not impose runtime latency, while still attempting to disrupt attacks before irreversible harm occurs. We treat monitor development as an adversarial game between a blue team (who design monitors) and a red team (who create sabotaging agents). We attempt to set the game rules such that they upper bound the sabotage potential of an agent based on Claude 4.1 Opus. To ground this game in a realistic, high-stakes deployment scenario, we develop a suite of 5 diverse software engineering environments that simulate tasks that an agent might perform within an AI developer’s internal infrastructure. Over the course of the game, we develop an ensemble monitor that achieves a 6% false negative rate at 1% false positive rate on a held out test environment. Then, we estimate risk of sabotage at deployment time by extrapolating from our monitor’s false negative rate. We describe one simple model for this extrapolation, present a sensitivity analysis, and describe situations in which the model would be invalid. Code is available at: this https URL.

[LG-6] Learning under Distributional Drift: Reproducibility as an Intrinsic Statistical Resource

链接: https://arxiv.org/abs/2512.13506
作者: Sofiya Zaichyk
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 37 pages, 4 figures

点击查看摘要

Abstract:Statistical learning under distributional drift remains insufficiently characterized: when each observation alters the data-generating law, classical generalization bounds can collapse. We introduce a new statistical primitive, the reproducibility budget C_T , which quantifies a system’s finite capacity for statistical reproducibility - the extent to which its sampling process can remain governed by a consistent underlying distribution in the presence of both exogenous change and endogenous feedback. Formally, C_T is defined as the cumulative Fisher-Rao path length of the coupled learner-environment evolution, measuring the total distributional motion accumulated during learning. From this construct we derive a drift-feedback generalization bound of order O(T^-1/2 + C_T/T) , and we prove a matching minimax lower bound showing that this rate is minimax-optimal. Consequently, the results establish a reproducibility speed limit: no algorithm can achieve smaller worst-case generalization error than that imposed by the average Fisher-Rao drift rate C_T/T of the data-generating process. The framework situates exogenous drift, adaptive data analysis, and performative prediction within a common geometric structure, with C_T emerging as the intrinsic quantity measuring distributional motion across these settings.

[LG-7] From Zipfs Law to Neural Scaling through Heaps Law and Hilbergs Hypothesis

链接: https://arxiv.org/abs/2512.13491
作者: Łukasz Dębowski
类目: Information Theory (cs.IT); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 32 pages, no figures

点击查看摘要

Abstract:We inspect the deductive connection between the neural scaling law and Zipf’s law – two statements discussed in machine learning and quantitative linguistics. The neural scaling law describes how the cross entropy rate of a foundation model – such as a large language model – changes with respect to the amount of training tokens, parameters, and compute. By contrast, Zipf’s law posits that the distribution of tokens exhibits a power law tail. Whereas similar claims have been made in more specific settings, we show that the neural scaling law is a consequence of Zipf’s law under certain broad assumptions that we reveal systematically. The derivation steps are as follows: We derive Heaps’ law on the vocabulary growth from Zipf’s law, Hilberg’s hypothesis on the entropy scaling from Heaps’ law, and the neural scaling from Hilberg’s hypothesis. We illustrate these inference steps by a toy example of the Santa Fe process that satisfies all the four statistical laws.

[LG-8] Real-Time AI-Driven Milling Digital Twin Towards Extreme Low-Latency

链接: https://arxiv.org/abs/2512.13482
作者: Wenyi Liu,R. Sharma,W. “Grace” Guo,J. Yi,Y.B. Guo
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Digital twin (DT) enables smart manufacturing by leveraging real-time data, AI models, and intelligent control systems. This paper presents a state-of-the-art analysis on the emerging field of DTs in the context of milling. The critical aspects of DT are explored through the lens of virtual models of physical milling, data flow from physical milling to virtual model, and feedback from virtual model to physical milling. Live data streaming protocols and virtual modeling methods are highlighted. A case study showcases the transformative capability of a real-time machine learning-driven live DT of tool-work contact in a milling process. Future research directions are outlined to achieve the goals of Industry 4.0 and beyond.

[LG-9] Element-wise Modulation of Random Matrices for Efficient Neural Layers

链接: https://arxiv.org/abs/2512.13480
作者: Maksymilian Szorc
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Fully connected layers are a primary source of memory and computational overhead in deep neural networks due to their dense, often redundant parameterization. While various compression techniques exist, they frequently introduce complex engineering trade-offs or degrade model performance. We propose the Parametrized Random Projection (PRP) layer, a novel approach that decouples feature mixing from adaptation by utilizing a fixed random matrix modulated by lightweight, learnable element-wise parameters. This architecture drastically reduces the trainable parameter count to a linear scale while retaining reliable accuracy across various benchmarks. The design serves as a stable, computationally efficient solution for architectural scaling and deployment in resource-limited settings.

[LG-10] DP-EMAR: A Differentially Private Framework for Autonomous Model Weight Repair in Federated IoT Systems

链接: https://arxiv.org/abs/2512.13460
作者: Chethana Prasad Kabgere,Shylaja S S
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: Accepted and presented at the AI-IoT Workshop, co-located with COMSNETS 2025

点击查看摘要

Abstract:Federated Learning (FL) enables decentralized model training without sharing raw data, but model weight distortion remains a major challenge in resource constrained IoT networks. In multi tier Federated IoT (Fed-IoT) systems, unstable connectivity and adversarial interference can silently alter transmitted parameters, degrading convergence. We propose DP-EMAR, a differentially private, error model based autonomous repair framework that detects and reconstructs transmission induced distortions during FL aggregation. DP-EMAR estimates corruption patterns and applies adaptive correction before privacy noise is added, enabling reliable in network repair without violating confidentiality. By integrating Differential Privacy (DP) with Secure Aggregation (SA), the framework distinguishes DP noise from genuine transmission errors. Experiments on heterogeneous IoT sensor and graph datasets show that DP-EMAR preserves convergence stability and maintains near baseline performance under communication corruption while ensuring strict (epsilon, delta)-DP guarantees. The framework enhances robustness, communication efficiency, and trust in privacy preserving Federated IoT learning.

[LG-11] XNNTab – Interpretable Neural Networks for Tabular Data using Sparse Autoencoders

链接: https://arxiv.org/abs/2512.13442
作者: Khawla Elhadri,Jörg Schlötterer,Christin Seifert
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In data-driven applications relying on tabular data, where interpretability is key, machine learning models such as decision trees and linear regression are applied. Although neural networks can provide higher predictive performance, they are not used because of their blackbox nature. In this work, we present XNNTab, a neural architecture that combines the expressiveness of neural networks and interpretability. XNNTab first learns highly non-linear feature representations, which are decomposed into monosemantic features using a sparse autoencoder (SAE). These features are then assigned human-interpretable concepts, making the overall model prediction intrinsically interpretable. XNNTab outperforms interpretable predictive models, and achieves comparable performance to its non-interpretable counterparts.

[LG-12] Multiclass Graph-Based Large Margin Classifiers: Unified Approach for Support Vectors and Neural Networks

链接: https://arxiv.org/abs/2512.13410
作者: Vítor M. Hanriot,Luiz C. B. Torres,Antônio P. Braga
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted to the IEEE Transactions on Neural Networks and Learning Systems (TNNLS)

点击查看摘要

Abstract:While large margin classifiers are originally an outcome of an optimization framework, support vectors (SVs) can be obtained from geometric approaches. This article presents advances in the use of Gabriel graphs (GGs) in binary and multiclass classification problems. For Chipclass, a hyperparameter-less and optimization-less GG-based binary classifier, we discuss how activation functions and support edge (SE)-centered neurons affect the classification, proposing smoother functions and structural SV (SSV)-centered neurons to achieve margins with low probabilities and smoother classification contours. We extend the neural network architecture, which can be trained with backpropagation with a softmax function and a cross-entropy loss, or by solving a system of linear equations. A new subgraph-/distance-based membership function for graph regularization is also proposed, along with a new GG recomputation algorithm that is less computationally expensive than the standard approach. Experimental results with the Friedman test show that our method was better than previous GG-based classifiers and statistically equivalent to tree-based models.

[LG-13] Dual-Phase Federated Deep Unlearning via Weight-Aware Rollback and Reconstruction

链接: https://arxiv.org/abs/2512.13381
作者: Changjun Zhou,Jintao Zheng,Leyou Yang,Pengfei Wang
类目: Machine Learning (cs.LG)
*备注: 10 pages, submitted to INFOCOM 2026

点击查看摘要

Abstract:Federated Unlearning (FUL) focuses on client data and computing power to offer a privacy-preserving solution. However, high computational demands, complex incentive mechanisms, and disparities in client-side computing power often lead to long times and higher costs. To address these challenges, many existing methods rely on server-side knowledge distillation that solely removes the updates of the target client, overlooking the privacy embedded in the contributions of other clients, which can lead to privacy leakage. In this work, we introduce DPUL, a novel server-side unlearning method that deeply unlearns all influential weights to prevent privacy pitfalls. Our approach comprises three components: (i) identifying high-weight parameters by filtering client update magnitudes, and rolling them back to ensure deep removal. (ii) leveraging the variational autoencoder (VAE) to reconstruct and eliminate low-weight parameters. (iii) utilizing a projection-based technique to recover the model. Experimental results on four datasets demonstrate that DPUL surpasses state-of-the-art baselines, providing a 1%-5% improvement in accuracy and up to 12x reduction in time cost.

[LG-14] Fast Policy Learning for 6-DOF Position Control of Underwater Vehicles

链接: https://arxiv.org/abs/2512.13359
作者: Sümer Tunçay,Alain Andres,Ignacio Carlucho
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Autonomous Underwater Vehicles (AUVs) require reliable six-degree-of-freedom (6-DOF) position control to operate effectively in complex and dynamic marine environments. Traditional controllers are effective under nominal conditions but exhibit degraded performance when faced with unmodeled dynamics or environmental disturbances. Reinforcement learning (RL) provides a powerful alternative but training is typically slow and sim-to-real transfer remains challenging. This work introduces a GPU-accelerated RL training pipeline built in JAX and MuJoCo-XLA (MJX). By jointly JIT-compiling large-scale parallel physics simulation and learning updates, we achieve training times of under two this http URL systematic evaluation of multiple RL algorithms, we show robust 6-DOF trajectory tracking and effective disturbance rejection in real underwater experiments, with policies transferred zero-shot from simulation. Our results provide the first explicit real-world demonstration of RL-based AUV position control across all six degrees of freedom.

[LG-15] Link-Aware Energy-Frugal Continual Learning for Fault Detection in IoT Networks

链接: https://arxiv.org/abs/2512.13340
作者: Henrik C. M. Frederiksen,Junya Shiraishi,Cedomir Stefanovic,Hei Victor Cheng,Shashi Raj Pandey
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:

点击查看摘要

Abstract:The use of lightweight machine learning (ML) models in internet of things (IoT) networks enables resource constrained IoT devices to perform on-device inference for several critical applications. However, the inference accuracy deteriorates due to the non-stationarity in the IoT environment and limited initial training data. To counteract this, the deployed models can be updated occasionally with new observed data samples. However, this approach consumes additional energy, which is undesirable for energy constrained IoT devices. This letter introduces an event-driven communication framework that strategically integrates continual learning (CL) in IoT networks for energy-efficient fault detection. Our framework enables the IoT device and the edge server (ES) to collaboratively update the lightweight ML model by adapting to the wireless link conditions for communication and the available energy budget. Evaluation on real-world datasets show that the proposed approach can outperform both periodic sampling and non-adaptive CL in terms of inference recall; our proposed approach achieves up to a 42.8% improvement, even under tight energy and bandwidth constraint.

[LG-16] FROC: A Unified Framework with Risk-Optimized Control for Machine Unlearning in LLM s

链接: https://arxiv.org/abs/2512.13337
作者: Si Qi Goh,Yongsen Zheng,Ziyao Liu,Sami Hormi,Kwok-Yan Lam
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine unlearning (MU) seeks to eliminate the influence of specific training examples from deployed models. As large language models (LLMs) become widely used, managing risks arising from insufficient forgetting or utility loss is increasingly crucial. Current MU techniques lack effective mechanisms for evaluating and controlling these risks, hindering the selection of strategies that appropriately balance safety and utility, and raising trust concerns surrounding the “right to be forgotten.” To address these issues, we propose FROC, a unified framework with Risk-Optimized Control for machine unlearning in LLMs. FROC is built around a conformal-style risk-control formulation that expresses a user-specified risk budget on unlearning behavior. This probability-based constraint enables FROC to compare MU strategies, identify feasible operating regions, and guide hyperparameter selection according to desired trade-offs between forgetting sufficiency and utility preservation. To operationalize this constraint, FROC introduces a smoothly varying continuous risk model that aggregates forgetting deficiency and utility degradation into a single configuration-level score. Building on conformal risk analysis, FROC computes (1) the Conformal Unlearning Risk (CUR), a data-driven estimated value on the probability that forgotten samples continue to influence model predictions, and (2) risk-controlled configuration sets, which identify unlearning hyperparameters that are valid under the specified risk budget. Experiments across multiple LLM MU methods demonstrate that FROC produces stable, interpretable risk landscapes and reveals consistent relationships between unlearning configurations, semantic shift, and utility impact. FROC reframes MU as a controllable, risk-aware process and offers a practical foundation for managing unlearning behavior in large-scale LLM deployments.

[LG-17] KD-PINN: Knowledge-Distilled PINNs for ultra-low-latency real-time neural PDE solvers

链接: https://arxiv.org/abs/2512.13336
作者: Karim Bounja,Lahcen Laayouni,Abdeljalil Sakat
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:This work introduces Knowledge-Distilled Physics-Informed Neural Networks (KD-PINN), a framework that transfers the predictive accuracy of a high-capacity teacher model to a compact student through a continuous adaptation of the Kullback-Leibler divergence. To confirm its generality for various dynamics and dimensionalities, the framework is evaluated on a representative set of partial differential equations (PDEs). In all tested cases, the student model preserved the teacher’s physical accuracy, with a mean RMSE increase below 0.64%, and achieved inference speedups ranging from 4.8x (Navier-Stokes) to 6.9x (Burgers). The distillation process also revealed a regularizing effect. With an average inference latency of 5.3 ms on CPU, the distilled models enter the ultra-low-latency real-time regime defined by sub-10 ms performance. Finally, this study examines how knowledge distillation reduces inference latency in PINNs to contribute to the development of accurate ultra-low-latency neural PDE solvers.

[LG-18] BézierFlow: Bézier Stochastic Interpolant Schedulers for Few-Step Generation

链接: https://arxiv.org/abs/2512.13255
作者: Yunhong Min,Juil Koo,Seungwoo Yoo,Minhyuk Sung
类目: Machine Learning (cs.LG)
*备注: Project page: this https URL

点击查看摘要

Abstract:We introduce BézierFlow, a lightweight training approach for few-step generation with pretrained diffusion and flow models. BézierFlow achieves a 2-3x performance improvement for sampling with \leq 10 NFEs while requiring only 15 minutes of training. Recent lightweight training approaches have shown promise by learning optimal timesteps, but their scope remains restricted to ODE discretizations. To broaden this scope, we propose learning the optimal transformation of the sampling trajectory by parameterizing stochastic interpolant (SI) schedulers. The main challenge lies in designing a parameterization that satisfies critical desiderata, including boundary conditions, differentiability, and monotonicity of the SNR. To effectively meet these requirements, we represent scheduler functions as Bézier functions, where control points naturally enforce these properties. This reduces the problem to learning an ordered set of points in the time range, while the interpretation of the points changes from ODE timesteps to Bézier control points. Across a range of pretrained diffusion and flow models, BézierFlow consistently outperforms prior timestep-learning methods, demonstrating the effectiveness of expanding the search space from discrete timesteps to Bézier-based trajectory transformations.

[LG-19] Learning to Retrieve with Weakened Labels: Robust Training under Label Noise

链接: https://arxiv.org/abs/2512.13237
作者: Arnab Sharma
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Neural Encoders are frequently used in the NLP domain to perform dense retrieval tasks, for instance, to generate the candidate documents for a given query in question-answering tasks. However, sparse annotation and label noise in the training data make it challenging to train or fine-tune such retrieval models. Although existing works have attempted to mitigate these problems by incorporating modified loss functions or data cleaning, these approaches either require some hyperparameters to tune during training or add substantial complexity to the training setup. In this work, we consider a label weakening approach to generate robust retrieval models in the presence of label noise. Instead of enforcing a single, potentially erroneous label for each query document pair, we allow for a set of plausible labels derived from both the observed supervision and the model’s confidence scores. We perform an extensive evaluation considering two retrieval models, one re-ranking model, considering four diverse ranking datasets. To this end, we also consider a realistic noisy setting by using a semantic-aware noise generation technique to generate different ratios of noise. Our initial results show that label weakening can improve the performance of the retrieval tasks in comparison to 10 different state-of-the-art loss functions.

[LG-20] ModSSC: A Modular Framework for Semi-Supervised Classification on Heterogeneous Data

链接: https://arxiv.org/abs/2512.13228
作者: Melvin Barbaux(IMB)
类目: Machine Learning (cs.LG)
*备注: Preprint describing the open source ModSSC framework for inductive and transductive semi-supervised classification on heterogeneous data

点击查看摘要

Abstract:Semi-supervised classification leverages both labeled and unlabeled data to improve predictive performance, but existing software support is fragmented across methods and modalities. We introduce ModSSC, an open source Python framework that unifies inductive and transductive semi-supervised classification in a modular code base. ModSSC implements a broad range of classical and recent algorithms, provides loaders for tabular, image, text, audio and graph datasets, and exposes a single configuration interface for specifying datasets, models and evaluation protocols. It supports both lightweight classical methods on small datasets running on CPU and recent deep approaches that can exploit multiple GPUs within the same experimental framework. Experiments are described declaratively in YAML, which facilitates reproducing existing work and running large comparative studies. ModSSC 1.0.0 is released under the MIT license with extensive documentation and tests, and is available at this https URL.

[LG-21] Evaluating Adversarial Attacks on Federated Learning for Temperature Forecasting

链接: https://arxiv.org/abs/2512.13207
作者: Karina Chichifoi,Fabio Merizzi,Michele Colajanni
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Deep learning and federated learning (FL) are becoming powerful partners for next-generation weather forecasting. Deep learning enables high-resolution spatiotemporal forecasts that can surpass traditional numerical models, while FL allows institutions in different locations to collaboratively train models without sharing raw data, addressing efficiency and security concerns. While FL has shown promise across heterogeneous regions, its distributed nature introduces new vulnerabilities. In particular, data poisoning attacks, in which compromised clients inject manipulated training data, can degrade performance or introduce systematic biases. These threats are amplified by spatial dependencies in meteorological data, allowing localized perturbations to influence broader regions through global model aggregation. In this study, we investigate how adversarial clients distort federated surface temperature forecasts trained on the Copernicus European Regional ReAnalysis (CERRA) dataset. We simulate geographically distributed clients and evaluate patch-based and global biasing attacks on regional temperature forecasts. Our results show that even a small fraction of poisoned clients can mislead predictions across large, spatially connected areas. A global temperature bias attack from a single compromised client shifts predictions by up to -1.7 K, while coordinated patch attacks more than triple the mean squared error and produce persistent regional anomalies exceeding +3.5 K. Finally, we assess trimmed mean aggregation as a defense mechanism, showing that it successfully defends against global bias attacks (2-13% degradation) but fails against patch attacks (281-603% amplification), exposing limitations of outlier-based defenses for spatially correlated data.

[LG-22] Noise-Resilient Quantum Aggregation on NISQ for Federated ADAS Learning

链接: https://arxiv.org/abs/2512.13196
作者: Chethana Prasad Kabgere,Sudarshan T S B
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注: This paper was accepted and presented at WinTechCon 2025, Bangalore, India, and is published in IEEE Xplore

点击查看摘要

Abstract:Advanced Driver Assistance Systems (ADAS) increasingly employ Federated Learning (FL) to collaboratively train models across distributed vehicular nodes while preserving data privacy. Yet, conventional FL aggregation remains susceptible to noise, latency, and security constraints inherent to real-time vehicular networks. This paper introduces Noise-Resilient Quantum Federated Learning (NR-QFL), a hybrid quantum-classical framework that enables secure, low-latency aggregation through variational quantum circuits (VQCs) operating under Noisy Intermediate-Scale Quantum (NISQ) conditions. The framework encodes model parameters as quantum states with adaptive gate reparameterization, ensuring bounded-error convergence and provable resilience under Completely Positive Trace-Preserving (CPTP) dynamics. NR-QFL employs quantum entropy-based client selection and multi-server coordination for fairness and stability. Empirical validation shows consistent convergence with reduced gradient variance, lower communication overhead, and enhanced noise tolerance under constrained edge conditions. The framework establishes a scalable foundation for quantum-enhanced federated learning, enabling secure, efficient, and dynamically stable ADAS intelligence at the vehicular edge.

[LG-23] Iterative Tuning of Nonlinear Model Predictive Control for Robotic Manufacturing Tasks

链接: https://arxiv.org/abs/2512.13170
作者: Deepak Ingole,Valentin Bhend,Shiva Ganesh Murali,Oliver Dobrich,Alisa Rupenayan
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Manufacturing processes are often perturbed by drifts in the environment and wear in the system, requiring control re-tuning even in the presence of repetitive operations. This paper presents an iterative learning framework for automatic tuning of Nonlinear Model Predictive Control (NMPC) weighting matrices based on task-level performance feedback. Inspired by norm-optimal Iterative Learning Control (ILC), the proposed method adaptively adjusts NMPC weights Q and R across task repetitions to minimize key performance indicators (KPIs) related to tracking accuracy, control effort, and saturation. Unlike gradient-based approaches that require differentiating through the NMPC solver, we construct an empirical sensitivity matrix, enabling structured weight updates without analytic derivatives. The framework is validated through simulation on a UR10e robot performing carbon fiber winding on a tetrahedral core. Results demonstrate that the proposed approach converges to near-optimal tracking performance (RMSE within 0.3% of offline Bayesian Optimization (BO)) in just 4 online repetitions, compared to 100 offline evaluations required by BO algorithm. The method offers a practical solution for adaptive NMPC tuning in repetitive robotic tasks, combining the precision of carefully optimized controllers with the flexibility of online adaptation.

[LG-24] Enhancing Node-Level Graph Domain Adaptation by Alleviating Local Dependency KDD2026

链接: https://arxiv.org/abs/2512.13149
作者: Xinwei Tai,Dongmian Zou,Hongfei Wang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted to KDD 2026

点击查看摘要

Abstract:Recent years have witnessed significant advancements in machine learning methods on graphs. However, transferring knowledge effectively from one graph to another remains a critical challenge. This highlights the need for algorithms capable of applying information extracted from a source graph to an unlabeled target graph, a task known as unsupervised graph domain adaptation (GDA). One key difficulty in unsupervised GDA is conditional shift, which hinders transferability. In this paper, we show that conditional shift can be observed only if there exists local dependencies among node features. To support this claim, we perform a rigorous analysis and also further provide generalization bounds of GDA when dependent node features are modeled using markov chains. Guided by the theoretical findings, we propose to improve GDA by decorrelating node features, which can be specifically implemented through decorrelated GCN layers and graph transformer layers. Our experimental results demonstrate the effectiveness of this approach, showing not only substantial performance enhancements over baseline GDA methods but also clear visualizations of small intra-class distances in the learned representations. Our code is available at this https URL

[LG-25] Quanvolutional Neural Networks for Spectrum Peak-Finding

链接: https://arxiv.org/abs/2512.13125
作者: Lukas Bischof,Rudolf M. Füchslin,Kurt Stockinger,Pavel Sulimov
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The analysis of spectra, such as Nuclear Magnetic Resonance (NMR) spectra, for the comprehensive characterization of peaks is a challenging task for both experts and machines, especially with complex molecules. This process, also known as deconvolution, involves identifying and quantifying the peaks in the spectrum. Machine learning techniques have shown promising results in automating this process. With the advent of quantum computing, there is potential to further enhance these techniques. In this work, inspired by the success of classical Convolutional Neural Networks (CNNs), we explore the use of Quanvolutional Neural Networks (QuanvNNs) for the multi-task peak finding problem, involving both peak counting and position estimation. We implement a simple and interpretable QuanvNN architecture that can be directly compared to its classical CNN counterpart, and evaluate its performance on a synthetic NMR-inspired dataset. Our results demonstrate that QuanvNNs outperform classical CNNs on challenging spectra, achieving an 11% improvement in F1 score and a 30% reduction in mean absolute error for peak position estimation. Additionally, QuanvNNs appear to exhibit better convergence stability for harder problems.

[LG-26] owards Practical Large-scale Dynamical Heterogeneous Graph Embedding: Cold-start Resilient Recommendation

链接: https://arxiv.org/abs/2512.13120
作者: Mabiao Long,Jiaxi Liu,Yufeng Li,Hao Xiong,Junchi Yan,Kefan Wang,Yi Cao,Jiandong Ding
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deploying dynamic heterogeneous graph embeddings in production faces key challenges of scalability, data freshness, and cold-start. This paper introduces a practical, two-stage solution that balances deep graph representation with low-latency incremental updates. Our framework combines HetSGFormer, a scalable graph transformer for static learning, with Incremental Locally Linear Embedding (ILLE), a lightweight, CPU-based algorithm for real-time updates. HetSGFormer captures global structure with linear scalability, while ILLE provides rapid, targeted updates to incorporate new data, thus avoiding costly full retraining. This dual approach is cold-start resilient, leveraging the graph to create meaningful embeddings from sparse data. On billion-scale graphs, A/B tests show HetSGFormer achieved up to a 6.11% lift in Advertiser Value over previous methods, while the ILLE module added another 3.22% lift and improved embedding refresh timeliness by 83.2%. Our work provides a validated framework for deploying dynamic graph learning in production environments.

[LG-27] PvP: Data-Efficient Humanoid Robot Learning with Proprioceptive-Privileged Contrastive Representations

链接: https://arxiv.org/abs/2512.13093
作者: Mingqi Yuan,Tao Yu,Haolin Song,Bo Li,Xin Jin,Hua Chen,Wenjun Zeng
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 13 pages, 12 figures

点击查看摘要

Abstract:Achieving efficient and robust whole-body control (WBC) is essential for enabling humanoid robots to perform complex tasks in dynamic environments. Despite the success of reinforcement learning (RL) in this domain, its sample inefficiency remains a significant challenge due to the intricate dynamics and partial observability of humanoid robots. To address this limitation, we propose PvP, a Proprioceptive-Privileged contrastive learning framework that leverages the intrinsic complementarity between proprioceptive and privileged states. PvP learns compact and task-relevant latent representations without requiring hand-crafted data augmentations, enabling faster and more stable policy learning. To support systematic evaluation, we develop SRL4Humanoid, the first unified and modular framework that provides high-quality implementations of representative state representation learning (SRL) methods for humanoid robot learning. Extensive experiments on the LimX Oli robot across velocity tracking and motion imitation tasks demonstrate that PvP significantly improves sample efficiency and final performance compared to baseline SRL methods. Our study further provides practical insights into integrating SRL with RL for humanoid WBC, offering valuable guidance for data-efficient humanoid robot learning.

[LG-28] LikeBench: Evaluating Subjective Likability in LLM s for Personalization

链接: https://arxiv.org/abs/2512.13077
作者: Md Awsafur Rahman,Adam Gabrys,Doug Kang,Jingjing Sun,Tian Tan,Ashwin Chandramouli
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A personalized LLM should remember user facts, apply them correctly, and adapt over time to provide responses that the user prefers. Existing LLM personalization benchmarks are largely centered on two axes: accurately recalling user information and accurately applying remembered information in downstream tasks. We argue that a third axis, likability, is both subjective and central to user experience, yet under-measured by current benchmarks. To measure likability holistically, we introduce LikeBench, a multi-session, dynamic evaluation framework that measures likability across multiple dimensions by how much an LLM can adapt over time to a user’s preferences to provide more likable responses. In LikeBench, the LLMs engage in conversation with a simulated user and learn preferences only from the ongoing dialogue. As the interaction unfolds, models try to adapt to responses, and after each turn, they are evaluated for likability across seven dimensions by the same simulated user. To the best of our knowledge, we are the first to decompose likability into multiple diagnostic metrics: emotional adaptation, formality matching, knowledge adaptation, reference understanding, conversation length fit, humor fit, and callback, which makes it easier to pinpoint where a model falls short. To make the simulated user more realistic and discriminative, LikeBench uses fine-grained, psychologically grounded descriptive personas rather than the coarse high/low trait rating based personas used in prior work. Our benchmark shows that strong memory performance does not guarantee high likability: DeepSeek R1, with lower memory accuracy (86%, 17 facts/profile), outperformed Qwen3 by 28% on likability score despite Qwen3’s higher memory accuracy (93%, 43 facts/profile). Even SOTA models like GPT-5 adapt well in short exchanges but show only limited robustness in longer, noisier interactions.

[LG-29] Multi-fidelity aerodynamic data fusion by autoencoder transfer learning

链接: https://arxiv.org/abs/2512.13069
作者: Javier Nieto-Centenero,Esther Andrés,Rodrigo Castellanos
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn); Machine Learning (stat.ML)
*备注: 29 pages, 13 figures

点击查看摘要

Abstract:Accurate aerodynamic prediction often relies on high-fidelity simulations; however, their prohibitive computational costs severely limit their applicability in data-driven modeling. This limitation motivates the development of multi-fidelity strategies that leverage inexpensive low-fidelity information without compromising accuracy. Addressing this challenge, this work presents a multi-fidelity deep learning framework that combines autoencoder-based transfer learning with a newly developed Multi-Split Conformal Prediction (MSCP) strategy to achieve uncertainty-aware aerodynamic data fusion under extreme data scarcity. The methodology leverages abundant Low-Fidelity (LF) data to learn a compact latent physics representation, which acts as a frozen knowledge base for a decoder that is subsequently fine-tuned using scarce HF samples. Tested on surface-pressure distributions for NACA airfoils (2D) and a transonic wing (3D) databases, the model successfully corrects LF deviations and achieves high-accuracy pressure predictions using minimal HF training data. Furthermore, the MSCP framework produces robust, actionable uncertainty bands with pointwise coverage exceeding 95%. By combining extreme data efficiency with uncertainty quantification, this work offers a scalable and reliable solution for aerodynamic regression in data-scarce environments.

[LG-30] Deep Q-Learning-Based Intelligent Scheduling for ETL Optimization in Heterogeneous Data Environments

链接: https://arxiv.org/abs/2512.13060
作者: Kangning Gao,Yi Hu,Cong Nie,Wei Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper addresses the challenges of low scheduling efficiency, unbalanced resource allocation, and poor adaptability in ETL (Extract-Transform-Load) processes under heterogeneous data environments by proposing an intelligent scheduling optimization framework based on deep Q-learning. The framework formalizes the ETL scheduling process as a Markov Decision Process and enables adaptive decision-making by a reinforcement learning agent in high-dimensional state spaces to dynamically optimize task allocation and resource scheduling. The model consists of a state representation module, a feature embedding network, a Q-value estimator, and a reward evaluation mechanism, which collectively consider task dependencies, node load states, and data flow characteristics to derive the optimal scheduling strategy in complex environments. A multi-objective reward function is designed to balance key performance indicators such as average scheduling delay, task completion rate, throughput, and resource utilization. Sensitivity experiments further verify the model’s robustness under changes in hyperparameters, environmental dynamics, and data scale. Experimental results show that the proposed deep Q-learning scheduling framework significantly reduces scheduling delay, improves system throughput, and enhances execution stability under multi-source heterogeneous task conditions, demonstrating the strong potential of reinforcement learning in complex data scheduling and resource management, and providing an efficient and scalable optimization strategy for intelligent data pipeline construction.

[LG-31] Progressive Refinement of E-commerce Search Ranking Based on Short-Term Activities of the Buyer

链接: https://arxiv.org/abs/2512.13037
作者: Taoran Sheng,Sathappan Muthiah,Atiq Islam,Jinming Feng
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In e-commerce shopping, aligning search results with a buyer’s immediate needs and preferences presents a significant challenge, particularly in adapting search results throughout the buyer’s shopping journey as they move from the initial stages of browsing to making a purchase decision or shift from one intent to another. This study presents a systematic approach to adapting e-commerce search results based on the current context. We start with basic methods and incrementally incorporate more contextual information and state-of-the-art techniques to improve the search outcomes. By applying this evolving contextual framework to items displayed on the search engine results page (SERP), we progressively align search outcomes more closely with the buyer’s interests and current search intentions. Our findings demonstrate that this incremental enhancement, from simple heuristic autoregressive features to advanced sequence models, significantly improves ranker performance. The integration of contextual techniques enhances the performance of our production ranker, leading to improved search results in both offline and online A/B testing in terms of Mean Reciprocal Rank (MRR). Overall, the paper details iterative methodologies and their substantial contributions to search result contextualization on e-commerce platforms.

[LG-32] Alada: Alternating Adaptation of Momentum Method for Memory-Efficient Matrix Optimization

链接: https://arxiv.org/abs/2512.13034
作者: Xiaoyu He,Yu Cai,Jin Jia,Canxi Huang,Wenqing Chen,Zibin Zheng
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This work proposes Alada, an adaptive momentum method for stochastic optimization over large-scale matrices. Alada employs a rank-one factorization approach to estimate the second moment of gradients, where factors are updated alternatively to minimize the estimation error. Alada achieves sublinear memory overheads and can be readily extended to optimizing tensor-shaped this http URL also equip Alada with a first moment estimation rule, which enhances the algorithm’s robustness without incurring additional memory overheads. The theoretical performance of Alada aligns with that of traditional methods such as Adam. Numerical studies conducted on several natural language processing tasks demonstrate the reduction in memory overheads and the robustness in training large models relative to Adam and its variants.

[LG-33] Deep Learning-Driven Inversion Framework for Shear Modulus Estimation in Magnetic Resonance Elastography (DIME)

链接: https://arxiv.org/abs/2512.13010
作者: Hassan Iftikhar,Rizwan Ahmad,Arunark Kolipaka
类目: Machine Learning (cs.LG); Tissues and Organs (q-bio.TO)
*备注:

点击查看摘要

Abstract:The Multimodal Direct Inversion (MMDI) algorithm is widely used in Magnetic Resonance Elastography (MRE) to estimate tissue shear stiffness. However, MMDI relies on the Helmholtz equation, which assumes wave propagation in a uniform, homogeneous, and infinite medium. Furthermore, the use of the Laplacian operator makes MMDI highly sensitive to noise, which compromises the accuracy and reliability of stiffness estimates. In this study, we propose the Deep-Learning driven Inversion Framework for Shear Modulus Estimation in MRE (DIME), aimed at enhancing the robustness of inversion. DIME is trained on the displacement fields-stiffness maps pair generated through Finite Element Modelling (FEM) simulations. To capture local wave behavior and improve robustness to global image variations, DIME is trained on small image patches. We first validated DIME using homogeneous and heterogeneous datasets simulated with FEM, where DIME produced stiffness maps with low inter-pixel variability, accurate boundary delineation, and higher correlation with ground truth (GT) compared to MMDI. Next, DIME was evaluated in a realistic anatomy-informed simulated liver dataset with known GT and compared directly to MMDI. DIME reproduced ground-truth stiffness patterns with high fidelity (r = 0.99, R^2 = 0.98), while MMDI showed greater underestimation. After validating DIME on synthetic data, we tested the model in in vivo liver MRE data from eight healthy and seven fibrotic subjects. DIME preserved physiologically consistent stiffness patterns and closely matched MMDI, which showed directional bias. Overall, DIME showed higher correlation with ground truth and visually similar stiffness patterns, whereas MMDI displayed a larger bias that can potentially be attributed to directional filtering. These preliminary results highlight the feasibility of DIME for clinical applications in MRE.

[LG-34] CoDeQ: End-to-End Joint Model Compression with Dead-Zone Quantizer for High-Sparsity and Low-Precision Networks

链接: https://arxiv.org/abs/2512.12981
作者: Jonathan Wenshøj,Tong Chen,Bob Pepin,Raghavendra Selvan
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Source code at this https URL

点击查看摘要

Abstract:While joint pruning–quantization is theoretically superior to sequential application, current joint methods rely on auxiliary procedures outside the training loop for finding compression parameters. This reliance adds engineering complexity and hyperparameter tuning, while also lacking a direct data-driven gradient signal, which might result in sub-optimal compression. In this paper, we introduce CoDeQ, a simple, fully differentiable method for joint pruning–quantization. Our approach builds on a key observation: the dead-zone of a scalar quantizer is equivalent to magnitude pruning, and can be used to induce sparsity directly within the quantization operator. Concretely, we parameterize the dead-zone width and learn it via backpropagation, alongside the quantization parameters. This design provides explicit control of sparsity, regularized by a single global hyperparameter, while decoupling sparsity selection from bit-width selection. The result is a method for Compression with Dead-zone Quantizer (CoDeQ) that supports both fixed-precision and mixed-precision quantization (controlled by an optional second hyperparameter). It simultaneously determines the sparsity pattern and quantization parameters in a single end-to-end optimization. Consequently, CoDeQ does not require any auxiliary procedures, making the method architecture-agnostic and straightforward to implement. On ImageNet with ResNet-18, CoDeQ reduces bit operations to ~5% while maintaining close to full precision accuracy in both fixed and mixed-precision regimes.

[LG-35] Application of Deep Learning in Biological Data Compression

链接: https://arxiv.org/abs/2512.12975
作者: Chunyu Zou
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注:

点击查看摘要

Abstract:Cryogenic electron microscopy (Cryo-EM) has become an essential tool for capturing high-resolution biological structures. Despite its advantage in visualizations, the large storage size of Cryo-EM data file poses significant challenges for researchers and educators. This paper investigates the application of deep learning, specifically implicit neural representation (INR), to compress Cryo-EM biological data. The proposed approach first extracts the binary map of each file according to the density threshold. The density map is highly repetitive, ehich can be effectively compressed by GZIP. The neural network then trains to encode spatial density information, allowing the storage of network parameters and learnable latent vectors. To improve reconstruction accuracy, I further incorporate the positional encoding to enhance spatial representation and a weighted Mean Squared Error (MSE) loss function to balance density distribution variations. Using this approach, my aim is to provide a practical and efficient biological data compression solution that can be used for educational and research purpose, while maintaining a reasonable compression ratio and reconstruction quality from file to file.

[LG-36] Understanding When Graph Convolutional Networks Help: A Diagnostic Study on Label Scarcity and Structural Properties

链接: https://arxiv.org/abs/2512.12947
作者: Nischal Subedi,Ember Kerstetter,Winnie Li,Silo Murphy
类目: Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
*备注: 10 pages, 8 tables,5 figures

点击查看摘要

Abstract:Graph Convolutional Networks (GCNs) have become a standard approach for semi-supervised node classification, yet practitioners lack clear guidance on when GCNs provide meaningful improvements over simpler baselines. We present a diagnostic study using the Amazon Computers co-purchase data to understand when and why GCNs help. Through systematic experiments with simulated label scarcity, feature ablation, and per-class analysis, we find that GCN performance depends critically on the interaction between graph homophily and feature quality. GCNs provide the largest gains under extreme label scarcity, where they leverage neighborhood structure to compensate for limited supervision. Surprisingly, GCNs can match their original performance even when node features are replaced with random noise, suggesting that structure alone carries sufficient signal on highly homophilous graphs. However, GCNs hurt performance when homophily is low and features are already strong, as noisy neighbors corrupt good predictions. Our quadrant analysis reveals that GCNs help in three of four conditions and only hurt when low homophily meets strong features. These findings offer practical guidance for practitioners deciding whether to adopt graph-based methods.

[LG-37] SeVeDo: A Heterogeneous Transformer Accelerator for Low-Bit Inference via Hierarchical Group Quantization and SVD-Guided Mixed Precision

链接: https://arxiv.org/abs/2512.12930
作者: Yuseon Choi,Sangjin Kim,Jungjun Oh,Byeongcheol Kim,Hoi-Jun Yoo
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注:

点击查看摘要

Abstract:Low-bit quantization is a promising technique for efficient transformer inference by reducing computational and memory overhead. However, aggressive bitwidth reduction remains challenging due to activation outliers, leading to accuracy degradation. Existing methods, such as outlier-handling and group quantization, achieve high accuracy but incur substantial energy consumption. To address this, we propose SeVeDo, an energy-efficient SVD-based heterogeneous accelerator that structurally separates outlier-sensitive components into a high-precision low-rank path, while the remaining computations are executed in a low-bit residual datapath with group quantization. To further enhance efficiency, Hierarchical Group Quantization (HGQ) combines coarse-grained floating-point scaling with fine-grained shifting, effectively reducing dequantization cost. Also, SVD-guided mixed precision (SVD-MP) statically allocates higher bitwidths to precision-sensitive components identified through low-rank decomposition, thereby minimizing floating-point operation cost. Experimental results show that SeVeDo achieves a peak energy efficiency of 13.8TOPS/W, surpassing conventional designs, with 12.7TOPS/W on ViT-Base and 13.4TOPS/W on Llama2-7B benchmarks.

[LG-38] LLM -based Personalized Portfolio Recommender: Integrating Large Language Models and Reinforcement Learning for Intelligent Investment Strategy Optimization

链接: https://arxiv.org/abs/2512.12922
作者: Bangyu Li,Boping Gu,Ziyang Ding
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In modern financial markets, investors increasingly seek personalized and adaptive portfolio strategies that reflect their individual risk preferences and respond to dynamic market conditions. Traditional rule-based or static optimization approaches often fail to capture the nonlinear interactions among investor behavior, market volatility, and evolving financial objectives. To address these limitations, this paper introduces the LLM-based Personalized Portfolio Recommender , an integrated framework that combines Large Language Models, reinforcement learning, and individualized risk preference modeling to support intelligent investment decision-making.

[LG-39] Machine Learning Architectures for the Estimation of Predicted Occupancy Grids in Road Traffic

链接: https://arxiv.org/abs/2512.12907
作者: Parthasarathy Nadarajan,Michael Botsch,Sebastian Sardina
类目: Machine Learning (cs.LG)
*备注: Journal of Advances in Information Technology

点击查看摘要

Abstract:This paper introduces a novel machine learning architecture for an efficient estimation of the probabilistic space-time representation of complex traffic scenarios. A detailed representation of the future traffic scenario is of significant importance for autonomous driving and for all active safety systems. In order to predict the future space-time representation of the traffic scenario, first the type of traffic scenario is identified and then the machine learning algorithm maps the current state of the scenario to possible future states. The input to the machine learning algorithms is the current state representation of a traffic scenario, termed as the Augmented Occupancy Grid (AOG). The output is the probabilistic space-time representation which includes uncertainties regarding the behaviour of the traffic participants and is termed as the Predicted Occupancy Grid (POG). The novel architecture consists of two Stacked Denoising Autoencoders (SDAs) and a set of Random Forests. It is then compared with the other two existing architectures that comprise of SDAs and DeconvNet. The architectures are validated with the help of simulations and the comparisons are made both in terms of accuracy and computational time. Also, a brief overview on the applications of POGs in the field of active safety is presented.

[LG-40] Next-generation reservoir computing validated by classification task

链接: https://arxiv.org/abs/2512.12903
作者: Ken-ichi Kitayama
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:An emerging computing paradigm, so-called next-generation reservoir computing (NG-RC) is investigated. True to its namesake, NG-RC requires no actual reservoirs for input data mixing but rather computing the polynomial terms directly from the time series inputs. However, benchmark tests so far reported have been one-sided, limited to prediction tasks of temporal waveforms such as Lorenz 63 attractor and Mackey-Glass chaotic signal. We will demonstrate for the first time that NG-RC can perform classification task as good as conventional RC. This validates the versatile computational capability of NG-RC in tasks of both prediction and classification.

[LG-41] Predicted-occupancy grids for vehicle safety applications based on autoencoders and the Random Forest algorithm IJCNN

链接: https://arxiv.org/abs/2512.12901
作者: Parthasarathy Nadarajan,Michael Botsch,Sebastian Sardina
类目: Machine Learning (cs.LG)
*备注: 2017 International Joint Conference on Neural Networks (IJCNN)

点击查看摘要

Abstract:In this paper, a probabilistic space-time representation of complex traffic scenarios is predicted using machine learning algorithms. Such a representation is significant for all active vehicle safety applications especially when performing dynamic maneuvers in a complex traffic scenario. As a first step, a hierarchical situation classifier is used to distinguish the different types of traffic scenarios. This classifier is responsible for identifying the type of the road infrastructure and the safety-relevant traffic participants of the driving environment. With each class representing similar traffic scenarios, a set of Random Forests (RFs) is individually trained to predict the probabilistic space-time representation, which depicts the future behavior of traffic participants. This representation is termed as a Predicted-Occupancy Grid (POG). The input to the RFs is an Augmented Occupancy Grid (AOG). In order to increase the learning accuracy of the RFs and to perform better predictions, the AOG is reduced to low-dimensional features using a Stacked Denoising Autoencoder (SDA). The excellent performance of the proposed machine learning approach consisting of SDAs and RFs is demonstrated in simulations and in experiments with real vehicles. An application of POGs to estimate the criticality of traffic scenarios and to determine safe trajectories is also presented.

[LG-42] Probability Estimation for Predicted-Occupancy Grids in Vehicle Safety Applications Based on Machine Learning

链接: https://arxiv.org/abs/2512.12896
作者: Parthasarathy Nadarajan,Michael Botsch
类目: Machine Learning (cs.LG)
*备注: 2016 IEEE Intelligent Vehicles Symposium

点击查看摘要

Abstract:This paper presents a method to predict the evolution of a complex traffic scenario with multiple objects. The current state of the scenario is assumed to be known from sensors and the prediction is taking into account various hypotheses about the behavior of traffic participants. This way, the uncertainties regarding the behavior of traffic participants can be modelled in detail. In the first part of this paper a model-based approach is presented to compute Predicted-Occupancy Grids (POG), which are introduced as a grid-based probabilistic representation of the future scenario hypotheses. However, due to the large number of possible trajectories for each traffic participant, the model-based approach comes with a very high computational load. Thus, a machine-learning approach is adopted for the computation of POGs. This work uses a novel grid-based representation of the current state of the traffic scenario and performs the mapping to POGs. This representation consists of augmented cells in an occupancy grid. The adopted machine-learning approach is based on the Random Forest algorithm. Simulations of traffic scenarios are performed to compare the machine-learning with the model-based approach. The results are promising and could enable the real-time computation of POGs for vehicle safety applications. With this detailed modelling of uncertainties, crucial components in vehicle safety systems like criticality estimation and trajectory planning can be improved.

[LG-43] Wait Wait Wait… Why Do Reasoning Models Loop?

链接: https://arxiv.org/abs/2512.12895
作者: Charilaos Pipis,Shivam Garg,Vasilis Kontonis,Vaishnavi Shrivastava,Akshay Krishnamurthy,Dimitris Papailiopoulos
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reasoning models (e.g., DeepSeek-R1) generate long chains of thought to solve harder problems, but they often loop, repeating the same text at low temperatures or with greedy decoding. We study why this happens and what role temperature plays. With open reasoning models, we find that looping is common at low temperature. Larger models tend to loop less, and distilled students loop significantly even when their teachers rarely do. This points to mismatches between the training distribution and the learned model, which we refer to as errors in learning, as a key cause. To understand how such errors cause loops, we introduce a synthetic graph reasoning task and demonstrate two mechanisms. First, risk aversion caused by hardness of learning: when the correct progress-making action is hard to learn but an easy cyclic action is available, the model puts relatively more probability on the cyclic action and gets stuck. Second, even when there is no hardness, Transformers show an inductive bias toward temporally correlated errors, so the same few actions keep being chosen and loops appear. Higher temperature reduces looping by promoting exploration, but it does not fix the errors in learning, so generations remain much longer than necessary at high temperature; in this sense, temperature is a stopgap rather than a holistic solution. We end with a discussion of training-time interventions aimed at directly reducing errors in learning.

[LG-44] Distillation of Discrete Diffusion by Exact Conditional Distribution Matching

链接: https://arxiv.org/abs/2512.12889
作者: Yansong Gao,Yu Sun
类目: Machine Learning (cs.LG)
*备注: [work in progress]

点击查看摘要

Abstract:Discrete diffusion models (DDMs) are a powerful class of generative models for categorical data, but they typically require many function evaluations for a single sample, making inference expensive. Existing acceleration methods either rely on approximate simulators, such as \tau -leaping, or on distillation schemes that train new student models and auxiliary networks with proxy objectives. We propose a simple and principled distillation alternative based on \emphconditional distribution matching. Our key observation is that the reverse conditional distribution of clean data given a noisy state, p_0\mid t(x_0 \mid x_t) , admits a Markov decomposition through intermediate times and can be recovered from marginal density ratios and the known forward CTMC kernel. We exploit this structure to define distillation objectives that directly match conditional distributions between a pre-trained teacher and a low-NFE student, both for one-step and few-step samplers.

[LG-45] Unsupervised learning of multiscale switching dynamical system models from multimodal neural data

链接: https://arxiv.org/abs/2512.12881
作者: DongKyu Kim,Han-Lin Hsieh,Maryam M. Shanechi
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC); Machine Learning (stat.ML)
*备注: 30 pages, 8 figures

点击查看摘要

Abstract:Neural population activity often exhibits regime-dependent non-stationarity in the form of switching dynamics. Learning accurate switching dynamical system models can reveal how behavior is encoded in neural activity. Existing switching approaches have primarily focused on learning models from a single neural modality, either continuous Gaussian signals or discrete Poisson signals. However, multiple neural modalities are often recorded simultaneously to measure different spatiotemporal scales of brain activity, and all these modalities can encode behavior. Moreover, regime labels are typically unavailable in training data, posing a significant challenge for learning models of regime-dependent switching dynamics. To address these challenges, we develop a novel unsupervised learning algorithm that learns the parameters of switching multiscale dynamical system models using only multiscale neural observations. We demonstrate our method using both simulations and two distinct experimental datasets with multimodal spike-LFP observations during different motor tasks. We find that our switching multiscale dynamical system models more accurately decode behavior than switching single-scale dynamical models, showing the success of multiscale neural fusion. Further, our models outperform stationary multiscale models, illustrating the importance of tracking regime-dependent non-stationarity in multimodal neural data. The developed unsupervised learning framework enables more accurate modeling of complex multiscale neural dynamics by leveraging information in multimodal recordings while incorporating regime switches. This approach holds promise for improving the performance and robustness of brain-computer interfaces over time and for advancing our understanding of the neural basis of behavior.

[LG-46] Improving Recursive Transformers with Mixture of LoRAs

链接: https://arxiv.org/abs/2512.12880
作者: Mohammadmahdi Nouriborji,Morteza Rohanian,Omid Rohanian
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Parameter sharing in recursive transformers reduces model size but collapses layer-wise expressivity. We propose Mixture of LoRAs (MoL), a lightweight conditional-computation mechanism that inserts Low-Rank Adaptation (LoRA) experts inside a shared feed-forward network (FFN). MoL enables token-conditional weight-space modulation of the shared FFN without untying backbone parameters, unlike prior approaches that add fixed or externally attached adapters. We pretrain a modernised recursive architecture, ModernALBERT, integrating rotary embeddings, GeGLU, FlashAttention, and a distillation-based initialisation. Across GLUE, SQuAD-v2, and BEIR, ModernALBERT (50M–120M) achieves state-of-the-art performance among compact models and surpasses larger fully parameterised baselines. We also propose an expert-merging procedure that compresses MoL into a single adapter at inference while preserving accuracy, enabling efficient deployment. Our results show that conditional weight-space modulation effectively restores the expressivity lost under aggressive parameter sharing in recursive transformers.

[LG-47] KANELÉ: Kolmogorov-Arnold Networks for Efficient LUT-based Evaluation

链接: https://arxiv.org/abs/2512.12850
作者: Duc Hoang,Aarush Gupta,Philip Harris
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG); Systems and Control (eess.SY); High Energy Physics - Experiment (hep-ex)
*备注: International Symposium on Field-Programmable Gate Arrays 2026 (ISFPGA’2026)

点击查看摘要

Abstract:Low-latency, resource-efficient neural network inference on FPGAs is essential for applications demanding real-time capability and low power. Lookup table (LUT)-based neural networks are a common solution, combining strong representational power with efficient FPGA implementation. In this work, we introduce KANELÉ, a framework that exploits the unique properties of Kolmogorov-Arnold Networks (KANs) for FPGA deployment. Unlike traditional multilayer perceptrons (MLPs), KANs employ learnable one-dimensional splines with fixed domains as edge activations, a structure naturally suited to discretization and efficient LUT mapping. We present the first systematic design flow for implementing KANs on FPGAs, co-optimizing training with quantization and pruning to enable compact, high-throughput, and low-latency KAN architectures. Our results demonstrate up to a 2700x speedup and orders of magnitude resource savings compared to prior KAN-on-FPGA approaches. Moreover, KANELÉ matches or surpasses other LUT-based architectures on widely used benchmarks, particularly for tasks involving symbolic or physical formulas, while balancing resource usage across FPGA hardware. Finally, we showcase the versatility of the framework by extending it to real-time, power-efficient control systems.

[LG-48] HaShiFlex: A High-Throughput Hardened Shifter DNN Accelerator with Fine-Tuning Flexibility

链接: https://arxiv.org/abs/2512.12847
作者: Jonathan Herbst(1),Michael Pellauer(2),Sherief Reda(1) ((1) Brown University, (2) NVIDIA)
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: 12 pages, 6 figures, 5 tables

点击查看摘要

Abstract:We introduce a high-throughput neural network accelerator that embeds most network layers directly in hardware, minimizing data transfer and memory usage while preserving a degree of flexibility via a small neural processing unit for the final classification layer. By leveraging power-of-two (Po2) quantization for weights, we replace multiplications with simple rewiring, effectively reducing each convolution to a series of additions. This streamlined approach offers high-throughput, energy-efficient processing, making it highly suitable for applications where model parameters remain stable, such as continuous sensing tasks at the edge or large-scale data center deployments. Furthermore, by including a strategically chosen reprogrammable final layer, our design achieves high throughput without sacrificing fine-tuning capabilities. We implement this accelerator in a 7nm ASIC flow using MobileNetV2 as a baseline and report throughput, area, accuracy, and sensitivity to quantization and pruning - demonstrating both the advantages and potential trade-offs of the proposed architecture. We find that for MobileNetV2, we can improve inference throughput by 20x over fully programmable GPUs, processing 1.21 million images per second through a full forward pass while retaining fine-tuning flexibility. If absolutely no post-deployment fine tuning is required, this advantage increases to 67x at 4 million images per second.

[LG-49] Optimal Resource Allocation for ML Model Training and Deployment under Concept Drift

链接: https://arxiv.org/abs/2512.12816
作者: Hasan Burhan Beytur,Gustavo de Veciana,Haris Vikalo,Kevin S Chan
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:

点击查看摘要

Abstract:We study how to allocate resources for training and deployment of machine learning (ML) models under concept drift and limited budgets. We consider a setting in which a model provider distributes trained models to multiple clients whose devices support local inference but lack the ability to retrain those models, placing the burden of performance maintenance on the provider. We introduce a model-agnostic framework that captures the interaction between resource allocation, concept drift dynamics, and deployment timing. We show that optimal training policies depend critically on the aging properties of concept durations. Under sudden concept changes, we derive optimal training policies subject to budget constraints when concept durations follow distributions with Decreasing Mean Residual Life (DMRL), and show that intuitive heuristics are provably suboptimal under Increasing Mean Residual Life (IMRL). We further study model deployment under communication constraints, prove that the associated optimization problem is quasi-convex under mild conditions, and propose a randomized scheduling strategy that achieves near-optimal client-side performance. These results offer theoretical and algorithmic foundations for cost-efficient ML model management under concept drift, with implications for continual learning, distributed inference, and adaptive ML systems.

[LG-50] RACER: Transfer Learning based Real-time Adaptation for Clinical Evolving Risk

链接: https://arxiv.org/abs/2512.12795
作者: Mengying Yan,Ziye Tian,Siqi Li,Nan Liu,Benjamin A. Goldstein,Molei Liu,Chuan Hong
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Clinical decision support tools built on electronic health records often experience performance drift due to temporal population shifts, particularly when changes in the clinical environment initially affect only a subset of patients, resulting in a transition to mixed populations. Such case-mix changes commonly arise following system-level operational updates or the emergence of new diseases, such as COVID-19. We propose TRACER (Transfer Learning-based Real-time Adaptation for Clinical Evolving Risk), a framework that identifies encounter-level transition membership and adapts predictive models using transfer learning without full retraining. In simulation studies, TRACER outperformed static models trained on historical or contemporary data. In a real-world application predicting hospital admission following emergency department visits across the COVID-19 transition, TRACER improved both discrimination and calibration. TRACER provides a scalable approach for maintaining robust predictive performance under evolving and heterogeneous clinical conditions.

[LG-51] Credit Risk Estimation with Non-Financial Features: Evidence from a Synthetic Istanbul Dataset

链接: https://arxiv.org/abs/2512.12783
作者: Atalay Denknalbant,Emre Sezdi,Zeki Furkan Kutlu,Polat Goktas
类目: Machine Learning (cs.LG); Statistical Finance (q-fin.ST); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:Financial exclusion constrains entrepreneurship, increases income volatility, and widens wealth gaps. Underbanked consumers in Istanbul often have no bureau file because their earnings and payments flow through informal channels. To study how such borrowers can be evaluated we create a synthetic dataset of one hundred thousand Istanbul residents that reproduces first quarter 2025 TÜİK census marginals and telecom usage patterns. Retrieval augmented generation feeds these public statistics into the OpenAI o3 model, which synthesises realistic yet private records. Each profile contains seven socio demographic variables and nine alternative attributes that describe phone specifications, online shopping rhythm, subscription spend, car ownership, monthly rent, and a credit card flag. To test the impact of the alternative financial data CatBoost, LightGBM, and XGBoost are each trained in two versions. Demo models use only the socio demographic variables; Full models include both socio demographic and alternative attributes. Across five fold stratified validation the alternative block raises area under the curve by about one point three percentage and lifts balanced (F_1) from roughly 0.84 to 0.95, a fourteen percent gain. We contribute an open Istanbul 2025 Q1 synthetic dataset, a fully reproducible modeling pipeline, and empirical evidence that a concise set of behavioural attributes can approach bureau level discrimination power while serving borrowers who lack formal credit records. These findings give lenders and regulators a transparent blueprint for extending fair and safe credit access to the underbanked.

[LG-52] OLR-WAA: Adaptive and Drift-Resilient Online Regression with Dynamic Weighted Averag ing

链接: https://arxiv.org/abs/2512.12779
作者: Mohammad Abu-Shaira,Weishi Shi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Real-world datasets frequently exhibit evolving data distributions, reflecting temporal variations and underlying shifts. Overlooking this phenomenon, known as concept drift, can substantially degrade the predictive performance of the model. Furthermore, the presence of hyperparameters in online models exacerbates this issue, as these parameters are typically fixed and lack the flexibility to dynamically adjust to evolving data. This paper introduces “OLR-WAA: An Adaptive and Drift-Resilient Online Regression with Dynamic Weighted Average”, a hyperparameter-free model designed to tackle the challenges of non-stationary data streams and enable effective, continuous adaptation. The objective is to strike a balance between model stability and adaptability. OLR-WAA incrementally updates its base model by integrating incoming data streams, utilizing an exponentially weighted moving average. It further introduces a unique optimization mechanism that dynamically detects concept drift, quantifies its magnitude, and adjusts the model based on real-time data characteristics. Rigorous evaluations show that it matches batch regression performance in static settings and consistently outperforms or rivals state-of-the-art online models, confirming its effectiveness. Concept drift datasets reveal a performance gap that OLR-WAA effectively bridges, setting it apart from other online models. In addition, the model effectively handles confidence-based scenarios through a conservative update strategy that prioritizes stable, high-confidence data points. Notably, OLR-WAA converges rapidly, consistently yielding higher R2 values compared to other online models.

[LG-53] An End-to-End Approach for Microgrid Probabilistic Forecasting and Robust Operation via Decision-focused Learning

链接: https://arxiv.org/abs/2512.12755
作者: Tingwei Cao,Yan Xu
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注: 10 pages

点击查看摘要

Abstract:High penetration of renewable energy sources (RES) introduces significant uncertainty and intermittency into microgrid operations, posing challenges to economic and reliable scheduling. To address this, this paper proposes an end-to-end decision-focused framework that jointly optimizes probabilistic forecasting and robust operation for microgrids. A multilayer encoder-decoder (MED) probabilistic forecasting model is integrated with a two-stage robust optimization (TSRO) model involving direct load control (DLC) through a differentiable decision pathway, enabling gradient-based feedback from operational outcomes to improve forecasting performance. Unlike conventional sequential approaches, the proposed method aligns forecasting accuracy with operational objectives by directly minimizing decision regret via a surrogate smart predict-then-optimize (SPO) loss function. This integration ensures that probabilistic forecasts are optimized for downstream decisions, enhancing both economic efficiency and robustness. Case studies on modified IEEE 33-bus and 69-bus systems demonstrate that the proposed framework achieves superior forecasting accuracy and operational performance, reducing total and net operation costs by up to 18% compared with conventional forecasting and optimization combinations. The results verify the effectiveness and scalability of the end-to-end decision-focused approach for resilient and cost-efficient microgrid management under uncertainty.

[LG-54] Resting Neurons Active Insights: Improving Input Sparsification for Large Language Models

链接: https://arxiv.org/abs/2512.12744
作者: Haotian Xu,Tian Gao,Tsui-Wei Weng,Tengfei Ma
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) achieve state-of-the-art performance across a wide range of applications, but their massive scale poses significant challenges for both efficiency and interpretability. Structural pruning, which reduces model size by removing redundant computational units such as neurons, has been widely explored as a solution, and this study devotes to input sparsification, an increasingly popular technique that improves efficiency by selectively activating only a subset of entry values for each input. However, existing approaches focus primarily on computational savings, often overlooking the representational consequences of sparsification and leaving a noticeable performance gap compared to full models. In this work, we first reinterpret input sparsification as a form of dynamic structural pruning. Motivated by the spontaneous baseline firing rates observed in biological neurons, we introduce a small set of trainable spontaneous neurons that act as compensatory units to stabilize activations in sparsified LLMs. Experiments demonstrate that these auxiliary neurons substantially reduce the sparsification-induced performance gap while generalizing effectively across tasks.

[LG-55] SPARK: Igniting Communication-Efficient Decentralized Learning via Stage-wise Projected NTK and Accelerated Regularization

链接: https://arxiv.org/abs/2512.12737
作者: Li Xia
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 11 pages, 8 figures

点击查看摘要

Abstract:Decentralized federated learning (DFL) faces critical challenges from statistical heterogeneity and communication overhead. While NTK-based methods achieve faster convergence, transmitting full Jacobian matrices is impractical for bandwidth-constrained edge networks. We propose SPARK, synergistically integrating random projection-based Jacobian compression, stage-wise annealed distillation, and Nesterov momentum acceleration. Random projections compress Jacobians while preserving spectral properties essential for convergence. Stage-wise annealed distillation transitions from pure NTK evolution to neighbor-regularized learning, counteracting compression noise. Nesterov momentum accelerates convergence through stable accumulation enabled by distillation smoothing. SPARK achieves 98.7% communication reduction compared to NTK-DFL while maintaining convergence speed and superior accuracy. With momentum, SPARK reaches target performance 3 times faster, establishing state-of-the-art results for communication-efficient decentralized learning and enabling practical deployment in bandwidth-limited edge environments.

[LG-56] Solving a Machine Learning Regression Problem Based on the Theory of Random Functions

链接: https://arxiv.org/abs/2512.12731
作者: Yuriy N. Bakhvalov
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: Part 1 of 4 in the “Polyharmonic Cascade” cycle. 25 pages, 2 figures. Source code is available at: this https URL

点击查看摘要

Abstract:This paper studies a machine learning regression problem as a multivariate approximation problem using the framework of the theory of random functions. An ab initio derivation of a regression method is proposed, starting from postulates of indifference. It is shown that if a probability measure on an infinite-dimensional function space possesses natural symmetries (invariance under translation, rotation, scaling, and Gaussianity), then the entire solution scheme, including the kernel form, the type of regularization, and the noise parameterization, follows analytically from these postulates. The resulting kernel coincides with a generalized polyharmonic spline; however, unlike existing approaches, it is not chosen empirically but arises as a consequence of the indifference principle. This result provides a theoretical foundation for a broad class of smoothing and interpolation methods, demonstrating their optimality in the absence of a priori information.

[LG-57] Self-Motivated Growing Neural Network for Adaptive Architecture via Local Structural Plasticity

链接: https://arxiv.org/abs/2512.12713
作者: Yiyang Jia,Chengxu Zhou
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Control policies in deep reinforcement learning are often implemented with fixed-capacity multilayer perceptrons trained by backpropagation, which lack structural plasticity and depend on global error signals. This paper introduces the Self-Motivated Growing Neural Network (SMGrNN), a controller whose topology evolves online through a local Structural Plasticity Module (SPM). The SPM monitors neuron activations and edge-wise weight update statistics over short temporal windows and uses these signals to trigger neuron insertion and pruning, while synaptic weights are updated by a standard gradient-based optimizer. This allows network capacity to be regulated during learning without manual architectural tuning. SMGrNN is evaluated on control benchmarks via policy distillation. Compared with multilayer perceptron baselines, it achieves similar or higher returns, lower variance, and task-appropriate network sizes. Ablation studies with growth disabled and growth-only variants isolate the role of structural plasticity, showing that adaptive topology improves reward stability. The local and modular design of SPM enables future integration of a Hebbian plasticity module and spike-timing-dependent plasticity, so that SMGrNN can support both artificial and spiking neural implementations driven by local rules. Subjects: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG) Cite as: arXiv:2512.12713 [cs.NE] (or arXiv:2512.12713v1 [cs.NE] for this version) https://doi.org/10.48550/arXiv.2512.12713 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-58] Multi-Trajectory Physics-Informed Neural Networks for HJB Equations with Hard-Zero Terminal Inventory: Optimal Execution on Synthetic SPY Data NEURIPS2025

链接: https://arxiv.org/abs/2512.12708
作者: Anthime Valin
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 24 pages, 19 figures. Accepted to the NeurIPS 2025 Workshop on Generative AI in Finance

点击查看摘要

Abstract:We study optimal trade execution with a hard-zero terminal inventory constraint, modeled via Hamilton-Jacobi-Bellman (HJB) equations. Vanilla PINNs often under-enforce this constraint and produce unstable controls. We propose a Multi-Trajectory PINN (MT-PINN) that adds a rollout-based trajectory loss and propagates a terminal penalty on terminal inventory via backpropagation-through-time, directly enforcing zero terminal inventory. A lightweight lambda-curriculum is adopted to stabilize training as the state expands from a risk-neutral reduced HJB to a risk-averse HJB. On the Gatheral-Schied single-asset model, MT-PINN aligns closely with their derived closed-form solutions and concentrates terminal inventory tightly around zero while reducing errors along optimal paths. We apply MT-PINNs on SPY intraday data, matching TWAP when risk-neutral, and achieving lower exposure and competitive costs, especially in falling windows, for higher risk-aversion.

[LG-59] On Approaches to Building Surrogate ODE Models for Diffusion Bridges

链接: https://arxiv.org/abs/2512.12671
作者: Maria Khilchuk,Vladimir Latypov,Pavel Kleshchev,Alexander Hvatov
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion and Schrödinger Bridge models have established state-of-the-art performance in generative modeling but are often hampered by significant computational costs and complex training procedures. While continuous-time bridges promise faster sampling, overparameterized neural networks describe their optimal dynamics, and the underlying stochastic differential equations can be difficult to integrate efficiently. This work introduces a novel paradigm that uses surrogate models to create simpler, faster, and more flexible approximations of these dynamics. We propose two specific algorithms: SINDy Flow Matching (SINDy-FM), which leverages sparse regression to identify interpretable, symbolic differential equations from data, and a Neural-ODE reformulation of the Schrödinger Bridge (DSBM-NeuralODE) for flexible continuous-time parameterization. Our experiments on Gaussian transport tasks and MNIST latent translation demonstrate that these surrogates achieve competitive performance while offering dramatic improvements in efficiency and interpretability. The symbolic SINDy-FM models, in particular, reduce parameter counts by several orders of magnitude and enable near-instantaneous inference, paving the way for a new class of tractable and high-performing bridge models for practical deployment.

[LG-60] orch Geometric Pool: the Pytorch library for pooling in Graph Neural Networks

链接: https://arxiv.org/abs/2512.12642
作者: Filippo Maria Bianchi,Carlo Abate,Ivan Marisca
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce Torch Geometric Pool (tgp), a library for hierarchical pooling in Graph Neural Networks. Built upon Pytorch Geometric, Torch Geometric Pool (tgp) provides a wide variety of pooling operators, unified under a consistent API and a modular design. The library emphasizes usability and extensibility, and includes features like precomputed pooling, which significantly accelerate training for a class of operators. In this paper, we present tgp’s structure and present an extensive benchmark. The latter showcases the library’s features and systematically compares the performance of the implemented graph-pooling methods in different downstream tasks. The results, showing that the choice of the optimal pooling operator depends on tasks and data at hand, support the need for a library that enables fast prototyping.

[LG-61] CoLSE: A Lightweight and Robust Hybrid Learned Model for Single-Table Cardinality Estimation using Joint CDF

链接: https://arxiv.org/abs/2512.12624
作者: Lankadinee Rathuwadu,Guanli Liu,Christopher Leckie,Renata Borovica-Gajic
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Cardinality estimation (CE), the task of predicting the result size of queries is a critical component of query optimization. Accurate estimates are essential for generating efficient query execution plans. Recently, machine learning techniques have been applied to CE, broadly categorized into query-driven and data-driven approaches. Data-driven methods learn the joint distribution of data, while query-driven methods construct regression models that map query features to cardinalities. Ideally, a CE technique should strike a balance among three key factors: accuracy, efficiency, and memory footprint. However, existing state-of-the-art models often fail to achieve this balance. To address this, we propose CoLSE, a hybrid learned approach for single-table cardinality estimation. CoLSE directly models the joint probability over queried intervals using a novel algorithm based on copula theory and integrates a lightweight neural network to correct residual estimation errors. Experimental results show that CoLSE achieves a favorable trade-off among accuracy, training time, inference latency, and model size, outperforming existing state-of-the-art methods. Subjects: Databases (cs.DB); Machine Learning (cs.LG) Cite as: arXiv:2512.12624 [cs.DB] (or arXiv:2512.12624v1 [cs.DB] for this version) https://doi.org/10.48550/arXiv.2512.12624 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-62] Spectral Sentinel: Scalable Byzantine-Robust Decentralized Federated Learning via Sketched Random Matrix Theory on Blockchain

链接: https://arxiv.org/abs/2512.12617
作者: Animesh Mishra
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Decentralized federated learning (DFL) enables collaborative model training without centralized trust, but it remains vulnerable to Byzantine clients that poison gradients under heterogeneous (Non-IID) data. Existing defenses face a scalability trilemma: distance-based filtering (e.g., Krum) can reject legitimate Non-IID updates, geometric-median methods incur prohibitive O(n^2 d) cost, and many certified defenses are evaluated only on models below 100M parameters. We propose Spectral Sentinel, a Byzantine detection and aggregation framework that leverages a random-matrix-theoretic signature: honest Non-IID gradients produce covariance eigenspectra whose bulk follows the Marchenko-Pastur law, while Byzantine perturbations induce detectable tail anomalies. Our algorithm combines Frequent Directions sketching with data-dependent MP tracking, enabling detection on models up to 1.5B parameters using O(k^2) memory with k \ll d . Under a (\sigma,f) threat model with coordinate-wise honest variance bounded by \sigma^2 and f 1/2 adversaries, we prove (\epsilon,\delta) -Byzantine resilience with convergence rate O(\sigma f / \sqrtT + f^2 / T) , and we provide a matching information-theoretic lower bound \Omega(\sigma f / \sqrtT) , establishing minimax optimality. We implement the full system with blockchain integration on Polygon networks and validate it across 144 attack-aggregator configurations, achieving 78.4 percent average accuracy versus 48-63 percent for baseline methods.

[LG-63] Causal inference and model explainability tools for retail

链接: https://arxiv.org/abs/2512.12605
作者: Pranav Gupta,Nithin Surendran
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Most major retailers today have multiple divisions focused on various aspects, such as marketing, supply chain, online customer experience, store customer experience, employee productivity, and vendor fulfillment. They also regularly collect data corresponding to all these aspects as dashboards and weekly/monthly/quarterly reports. Although several machine learning and statistical techniques have been in place to analyze and predict key metrics, such models typically lack interpretability. Moreover, such techniques also do not allow the validation or discovery of causal links. In this paper, we aim to provide a recipe for applying model interpretability and causal inference for deriving sales insights. In this paper, we review the existing literature on causal inference and interpretability in the context of problems in e-commerce and retail, and apply them to a real-world dataset. We find that an inherently explainable model has a lower variance of SHAP values, and show that including multiple confounders through a double machine learning approach allows us to get the correct sign of causal effect.

[LG-64] Error-Free Linear Attention is a Free Lunch: Exact Solution from Continuous-Time Dynamics

链接: https://arxiv.org/abs/2512.12602
作者: Jingdi Lei,Di Zhang,Soujanya Poria
类目: Machine Learning (cs.LG)
*备注: 17 pages, 2 figures

点击查看摘要

Abstract:Linear-time attention and State Space Models (SSMs) promise to solve the quadratic cost bottleneck in long-context language models employing softmax attention. We introduce Error-Free Linear Attention (EFLA), a numerically stable, fully parallelism and generalized formulation of the delta rule. Specifically, we formulate the online learning update as a continuous-time dynamical system and prove that its exact solution is not only attainable but also computable in linear time with full parallelism. By leveraging the rank-1 structure of the dynamics matrix, we directly derive the exact closed-form solution effectively corresponding to the infinite-order Runge-Kutta method. This attention mechanism is theoretically free from error accumulation, perfectly capturing the continuous dynamics while preserving the linear-time complexity. Through an extensive suite of experiments, we show that EFLA enables robust performance in noisy environments, achieving lower language modeling perplexity and superior downstream benchmark performance than DeltaNet without introducing additional parameters. Our work provides a new theoretical foundation for building high-fidelity, scalable linear-time attention models.

[LG-65] ceLLM ate: Sandboxing Browser AI Agents

链接: https://arxiv.org/abs/2512.12594
作者: Luoxi Meng,Henry Feng,Ilia Shumailov,Earlence Fernandes
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Homepage: this https URL

点击查看摘要

Abstract:Browser-using agents (BUAs) are an emerging class of autonomous agents that interact with web browsers in human-like ways, including clicking, scrolling, filling forms, and navigating across pages. While these agents help automate repetitive online tasks, they are vulnerable to prompt injection attacks that can trick an agent into performing undesired actions, such as leaking private information or issuing state-changing requests. We propose ceLLMate, a browser-level sandboxing framework that restricts the agent’s ambient authority and reduces the blast radius of prompt injections. We address two fundamental challenges: (1) The semantic gap challenge in policy enforcement arises because the agent operates through low-level UI observations and manipulations; however, writing and enforcing policies directly over UI-level events is brittle and error-prone. To address this challenge, we introduce an agent sitemap that maps low-level browser behaviors to high-level semantic actions. (2) Policy prediction in BUAs is the norm rather than the exception. BUAs have no app developer to pre-declare sandboxing policies, and thus, ceLLMate pairs website-authored mandatory policies with an automated policy-prediction layer that adapts and instantiates these policies from the user’s natural-language task. We implement ceLLMate as an agent-agnostic browser extension and demonstrate how it enables sandboxing policies that effectively block various types of prompt injection attacks with negligible overhead.

[LG-66] Differentiable Energy-Based Regularization in GANs: A Simulator-Based Exploration of VQE-Inspired Auxiliary Losses

链接: https://arxiv.org/abs/2512.12581
作者: David Strnadel
类目: Machine Learning (cs.LG)
*备注: Exploratory, simulator-based proof of concept. No claims of quantum advantage

点击查看摘要

Abstract:This paper presents an exploratory, simulator-based proof of concept investigating whether differentiable energy terms derived from parameterized quantum circuits can serve as auxiliary regularization signals in Generative Adversarial Networks (GANs). We augment the Auxiliary Classifier GAN (ACGAN) generator objective with a Variational Quantum Eigensolver (VQE)-inspired energy term computed from class-specific Ising Hamiltonians using Qiskit’s EstimatorQNN and TorchConnector. Important limitations: All experiments run on a noiseless statevector simulator with only 4 qubits, use a deliberately simple Hamiltonian parameterization, and lack ablation studies comparing against equivalent classical biases. The computational overhead (approximately 200x slower than classical ACGAN) reflects simulator artifacts rather than inherent quantum costs. On MNIST, we observe that the energy-regularized model (termed QACGAN) achieves high classification accuracy (99 to 100 percent) within 5 epochs compared to 87.8 percent for ACGAN, suggesting the auxiliary term influences class conditioning. However, sample quality metrics (FID) show high variance across runs (coefficient of variation approximately 25 percent at epoch 5), with values ranging from 19.92 to 35.96. Extended runs stabilize around FID 23 to 24, comparable to the ACGAN baseline. We explicitly do not claim quantum advantage, improved stability in any general sense, or scalability beyond this toy setting. The contribution is methodological: demonstrating that VQE-style energy computations can be integrated into GAN training loops via differentiable pathways. Whether such auxiliary signals provide benefits beyond equivalent classical regularizers remains an open question requiring systematic ablation studies, which we leave for future work. Comments: Exploratory, simulator-based proof of concept. No claims of quantum advantage Subjects: Machine Learning (cs.LG) Cite as: arXiv:2512.12581 [cs.LG] (or arXiv:2512.12581v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2512.12581 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: David Strnadel [view email] [v1] Sun, 14 Dec 2025 07:23:57 UTC (1,075 KB)

[LG-67] On the Accuracy of Newton Step and Influence Function Data Attributions

链接: https://arxiv.org/abs/2512.12572
作者: Ittai Rubinstein,Samuel B. Hopkins
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Data attribution aims to explain model predictions by estimating how they would change if certain training points were removed, and is used in a wide range of applications, from interpretability and credit assignment to unlearning and privacy. Even in the relatively simple case of linear regressions, existing mathematical analyses of leading data attribution methods such as Influence Functions (IF) and single Newton Step (NS) remain limited in two key ways. First, they rely on global strong convexity assumptions which are often not satisfied in practice. Second, the resulting bounds scale very poorly with the number of parameters ( d ) and the number of samples removed ( k ). As a result, these analyses are not tight enough to answer fundamental questions such as “what is the asymptotic scaling of the errors of each method?” or “which of these methods is more accurate for a given dataset?” In this paper, we introduce a new analysis of the NS and IF data attribution methods for convex learning problems. To the best of our knowledge, this is the first analysis of these questions that does not assume global strong convexity and also the first explanation of [KATL19] and [RH25a]'s observation that NS data attribution is often more accurate than IF. We prove that for sufficiently well-behaved logistic regression, our bounds are asymptotically tight up to poly-logarithmic factors, yielding scaling laws for the errors in the average-case sample removals. [ \mathbbE_T \subseteq [n],, |T| = k \bigl[ |\hat\theta_T - \hat\theta_T^\mathrmNS|_2 \bigr] = \widetilde\Theta!\left(\frack dn^2\right), \qquad \mathbbE_T \subseteq [n],, |T| = k \bigl[ |\hat\theta_T^\mathrmNS - \hat\theta_T^\mathrmIF|_2 \bigr] = \widetilde\Theta!\left( \frac(k + d)\sqrtk dn^2 \right). ] Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML) MSC classes: 62F35, 68W99, 62J05 ACMclasses: G.3; I.2.6; F.2.2 Cite as: arXiv:2512.12572 [cs.LG] (or arXiv:2512.12572v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2512.12572 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Ittai Rubinstein [view email] [v1] Sun, 14 Dec 2025 06:33:52 UTC (92 KB)

[LG-68] Optimal Mistake Bounds for Transductive Online Learning

链接: https://arxiv.org/abs/2512.12567
作者: Zachary Chase,Steve Hanneke,Shay Moran,Jonathan Shafer
类目: Machine Learning (cs.LG); Combinatorics (math.CO); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We resolve a 30-year-old open problem concerning the power of unlabeled data in online learning by tightly quantifying the gap between transductive and standard online learning. In the standard setting, the optimal mistake bound is characterized by the Littlestone dimension d of the concept class H (Littlestone 1987). We prove that in the transductive setting, the mistake bound is at least \Omega(\sqrtd) . This constitutes an exponential improvement over previous lower bounds of \Omega(\log\log d) , \Omega(\sqrt\log d) , and \Omega(\log d) , due respectively to Ben-David, Kushilevitz, and Mansour (1995, 1997) and Hanneke, Moran, and Shafer (2023). We also show that this lower bound is tight: for every d , there exists a class of Littlestone dimension d with transductive mistake bound O(\sqrtd) . Our upper bound also improves upon the best known upper bound of (2/3)d from Ben-David, Kushilevitz, and Mansour (1997). These results establish a quadratic gap between transductive and standard online learning, thereby highlighting the benefit of advance access to the unlabeled instance sequence. This contrasts with the PAC setting, where transductive and standard learning exhibit similar sample complexities.

[LG-69] Effective Fine-Tuning with Eigenvector Centrality Based Pruning

链接: https://arxiv.org/abs/2512.12543
作者: Shaif Chowdhury,Soham Biren Katlariwala,Devleena Kashyap
类目: Machine Learning (cs.LG)
*备注: 12 pages

点击查看摘要

Abstract:In social media networks a small number of highly influential users can drive large scale changes in discourse across multiple communities. Small shifts in the behavior of these users are often sufficient to propagate widely throughout the network. A similar phenomenon occurs during neural network fine tuning. Conventional fine tuning of convolutional neural networks typically adds a new linear classification layer on top of a large pre trained model. Instead we argue that improved adaptation can be achieved by first pruning the network to retain only the most important neurons and then performing fine tuning. We propose a graph theory based method for pruning neural networks that is designed to improve fine tuning performance. In this method each neuron is represented as a node and edges encode similarity between neurons. Neurons are pruned based on importance scores computed using eigenvector centrality. The resulting pruned network is then fine tuned using only the most central neurons. We evaluate the proposed method on VGGNet EfficientNet and ResNet models using the TF Flowers Caltech one zero one and Oxford Flowers one zero two datasets. The proposed approach achieves higher classification accuracy while significantly reducing model complexity. On the Oxford Flowers one zero two dataset the method achieves forty eight percent classification accuracy compared to thirty percent accuracy obtained by the baseline VGGNet model. Comments: 12 pages Subjects: Machine Learning (cs.LG) Cite as: arXiv:2512.12543 [cs.LG] (or arXiv:2512.12543v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2512.12543 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-70] Empirical Mode Decomposition and Graph Transformation of the MSCI World Index: A Multiscale Topological Analysis for Graph Neural Network Modeling

链接: https://arxiv.org/abs/2512.12526
作者: Agustín M. de los Riscos,Julio E. Sandubete,Diego Carmona-Fernández,León Beleña
类目: Machine Learning (cs.LG); Computational Finance (q-fin.CP)
*备注: 19 pages, 3 figures, 6 tables

点击查看摘要

Abstract:This study applies Empirical Mode Decomposition (EMD) to the MSCI World index and converts the resulting intrinsic mode functions (IMFs) into graph representations to enable modeling with graph neural networks (GNNs). Using CEEMDAN, we extract nine IMFs spanning high-frequency fluctuations to long-term trends. Each IMF is transformed into a graph using four time-series-to-graph methods: natural visibility, horizontal visibility, recurrence, and transition graphs. Topological analysis shows clear scale-dependent structure: high-frequency IMFs yield dense, highly connected small-world graphs, whereas low-frequency IMFs produce sparser networks with longer characteristic path lengths. Visibility-based methods are more sensitive to amplitude variability and typically generate higher clustering, while recurrence graphs better preserve temporal dependencies. These results provide guidance for designing GNN architectures tailored to the structural properties of decomposed components, supporting more effective predictive modeling of financial time series.

[LG-71] Policy Optimization for Dynamic Heart Transplant Allocation

链接: https://arxiv.org/abs/2512.12497
作者: Ioannis Anagnostides,Zachary W. Sollie,Arman Kilic,Tuomas Sandholm
类目: Machine Learning (cs.LG)
*备注: An extended abstract of this paper was presented at the scientific sessions of AHA 2025

点击查看摘要

Abstract:Heart transplantation is a viable path for patients suffering from advanced heart failure, but this lifesaving option is severely limited due to donor shortage. Although the current allocation policy was recently revised in 2018, a major concern is that it does not adequately take into account pretransplant and post-transplant mortality. In this paper, we take an important step toward addressing these deficiencies. To begin with, we use historical data from UNOS to develop a new simulator that enables us to evaluate and compare the performance of different policies. We then leverage our simulator to demonstrate that the status quo policy is considerably inferior to the myopic policy that matches incoming donors to the patient who maximizes the number of years gained by the transplant. Moreover, we develop improved policies that account for the dynamic nature of the allocation process through the use of potentials – a measure of a patient’s utility in future allocations that we introduce. We also show that batching together even a handful of donors – which is a viable option for a certain type of donors – further enhances performance. Our simulator also allows us to evaluate the effect of critical, and often unexplored, factors in the allocation, such as geographic proximity and the tendency to reject offers by the transplant centers. Comments: An extended abstract of this paper was presented at the scientific sessions of AHA 2025 Subjects: Machine Learning (cs.LG) Cite as: arXiv:2512.12497 [cs.LG] (or arXiv:2512.12497v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2512.12497 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-72] AI-Driven Early Warning Systems for Student Success: Discovering Static Feature Dominance in Temporal Prediction Models KDD2026

链接: https://arxiv.org/abs/2512.12493
作者: Vaarunay Kaushal,Rajib Mall
类目: Machine Learning (cs.LG)
*备注: 5 pages, 3 figures, KDD 2026

点击查看摘要

Abstract:Early identification of at-risk students is critical for effective intervention in online learning environments. This study extends temporal prediction analysis to Week 20 (50% of course duration), comparing Decision Tree and Long Short- Term Memory (LSTM) models across six temporal snapshots. Our analysis reveals that different performance metrics matter at different intervention stages: high recall is critical for early intervention (Weeks 2-4), while balanced precision-recall is important for mid-course resource allocation (Weeks 8-16), and high precision becomes paramount in later stages (Week 20). We demonstrate that static demographic features dominate predictions (68% importance), enabling assessment-free early prediction. The LSTM model achieves 97% recall at Week 2, making it ideal for early intervention, while Decision Tree provides stable balanced performance (78% accuracy) during mid-course. By Week 20, both models converge to similar recall (68%), but LSTM achieves higher precision (90% vs 86%). Our findings also suggest that model selection should depend on intervention timing, and that early signals (Weeks 2-4) are sufficient for reliable initial prediction using primarily demographic and pre-enrollment information.

[LG-73] GoMS: Graph of Molecule Substructure Network for Molecule Property Prediction

链接: https://arxiv.org/abs/2512.12489
作者: Shuhui Qu,Cheolwoo Park
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:While graph neural networks have shown remarkable success in molecular property prediction, current approaches like the Equivariant Subgraph Aggregation Networks (ESAN) treat molecules as bags of independent substructures, overlooking crucial relationships between these components. We present Graph of Molecule Substructures (GoMS), a novel architecture that explicitly models the interactions and spatial arrangements between molecular substructures. Unlike ESAN’s bag-based representation, GoMS constructs a graph where nodes represent subgraphs and edges capture their structural relationships, preserving critical topological information about how substructures are connected and overlap within the molecule. Through extensive experiments on public molecular datasets, we demonstrate that GoMS outperforms ESAN and other baseline methods, with particularly improvements for large molecules containing more than 100 atoms. The performance gap widens as molecular size increases, demonstrating GoMS’s effectiveness for modeling industrial-scale molecules. Our theoretical analysis demonstrates that GoMS can distinguish molecules with identical subgraph compositions but different spatial arrangements. Our approach shows particular promise for materials science applications involving complex molecules where properties emerge from the interplay between multiple functional units. By capturing substructure relationships that are lost in bag-based approaches, GoMS represents a significant advance toward scalable and interpretable molecular property prediction for real-world applications.

[LG-74] Sparse Concept Anchoring for Interpretable and Controllable Neural Representations

链接: https://arxiv.org/abs/2512.12469
作者: Sandy Fraser,Patryk Wielopolski
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce Sparse Concept Anchoring, a method that biases latent space to position a targeted subset of concepts while allowing others to self-organize, using only minimal supervision (labels for 0.1% of examples per anchored concept). Training combines activation normalization, a separation regularizer, and anchor or subspace regularizers that attract rare labeled examples to predefined directions or axis-aligned subspaces. The anchored geometry enables two practical interventions: reversible behavioral steering that projects out a concept’s latent component at inference, and permanent removal via targeted weight ablation of anchored dimensions. Experiments on structured autoencoders show selective attenuation of targeted concepts with negligible impact on orthogonal features, and complete elimination with reconstruction error approaching theoretical bounds. Sparse Concept Anchoring therefore provides a practical pathway to interpretable, steerable behavior in learned representations.

[LG-75] Breaking the Curse of Dimensionality: On the Stability of Modern Vector Retrieval

链接: https://arxiv.org/abs/2512.12458
作者: Vihan Lakshman,Blaise Munyampirwa,Julian Shun,Benjamin Coleman
类目: Information Retrieval (cs.IR); Computational Geometry (cs.CG); Databases (cs.DB); Machine Learning (cs.LG)
*备注: 27 pages

点击查看摘要

Abstract:Modern vector databases enable efficient retrieval over high-dimensional neural embeddings, powering applications from web search to retrieval-augmented generation. However, classical theory predicts such tasks should suffer from the curse of dimensionality, where distances between points become nearly indistinguishable, thereby crippling efficient nearest-neighbor search. We revisit this paradox through the lens of stability, the property that small perturbations to a query do not radically alter its nearest neighbors. Building on foundational results, we extend stability theory to three key retrieval settings widely used in practice: (i) multi-vector search, where we prove that the popular Chamfer distance metric preserves single-vector stability, while average pooling aggregation may destroy it; (ii) filtered vector search, where we show that sufficiently large penalties for mismatched filters can induce stability even when the underlying search is unstable; and (iii) sparse vector search, where we formalize and prove novel sufficient stability conditions. Across synthetic and real datasets, our experimental results match our theoretical predictions, offering concrete guidance for model and system design to avoid the curse of dimensionality.

[LG-76] Optimized Architectures for Kolmogorov-Arnold Networks

链接: https://arxiv.org/abs/2512.12448
作者: James Bagrow,Josh Bongard
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (stat.ML)
*备注: 12 pages, 1 figure, 3 tables

点击查看摘要

Abstract:Efforts to improve Kolmogorov-Arnold networks (KANs) with architectural enhancements have been stymied by the complexity those enhancements bring, undermining the interpretability that makes KANs attractive in the first place. Here we study overprovisioned architectures combined with sparsification to learn compact, interpretable KANs without sacrificing accuracy. Crucially, we focus on differentiable sparsification, turning architecture search into an end-to-end optimization problem. Across function approximation benchmarks, dynamical systems forecasting, and real-world prediction tasks, we demonstrate competitive or superior accuracy while discovering substantially smaller models. Overprovisioning and sparsification are synergistic, with the combination outperforming either alone. The result is a principled path toward models that are both more expressive and more interpretable, addressing a key tension in scientific machine learning.

[LG-77] Knowledge-Guided Masked Autoencoder with Linear Spectral Mixing and Spectral-Angle-Aware Reconstruction

链接: https://arxiv.org/abs/2512.12445
作者: Abdul Matin,Rupasree Dey,Tanjim Bin Faruk,Shrideep Pallickara,Sangmi Lee Pallickara
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Integrating domain knowledge into deep learning has emerged as a promising direction for improving model interpretability, generalization, and data efficiency. In this work, we present a novel knowledge-guided ViT-based Masked Autoencoder that embeds scientific domain knowledge within the self-supervised reconstruction process. Instead of relying solely on data-driven optimization, our proposed approach incorporates the Linear Spectral Mixing Model (LSMM) as a physical constraint and physically-based Spectral Angle Mapper (SAM), ensuring that learned representations adhere to known structural relationships between observed signals and their latent components. The framework jointly optimizes LSMM and SAM loss with a conventional Huber loss objective, promoting both numerical accuracy and geometric consistency in the feature space. This knowledge-guided design enhances reconstruction fidelity, stabilizes training under limited supervision, and yields interpretable latent representations grounded in physical principles. The experimental findings indicate that the proposed model substantially enhances reconstruction quality and improves downstream task performance, highlighting the promise of embedding physics-informed inductive biases within transformer-based self-supervised learning.

[LG-78] Learning Dynamics in Memristor-Based Equilibrium Propagation

链接: https://arxiv.org/abs/2512.12428
作者: Michael Döll,Andreas Müller,Bernd Ulmann
类目: Machine Learning (cs.LG); Emerging Technologies (cs.ET); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Memristor-based in-memory computing has emerged as a promising paradigm to overcome the constraints of the von Neumann bottleneck and the memory wall by enabling fully parallelisable and energy-efficient vector-matrix multiplications. We investigate the effect of nonlinear, memristor-driven weight updates on the convergence behaviour of neural networks trained with equilibrium propagation (EqProp). Six memristor models were characterised by their voltage-current hysteresis and integrated into the EBANA framework for evaluation on two benchmark classification tasks. EqProp can achieve robust convergence under nonlinear weight updates, provided that memristors exhibit a sufficiently wide resistance range of at least an order of magnitude.

[LG-79] Can Graphs Improve Tabular Foundation Models?

链接: https://arxiv.org/abs/2512.12405
作者: Franck Le,Keith Grueneberg,Erich Nahum,Vadim Sheinin
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Tabular data are central to many real-world systems. While recent tabular transformers and in-context learners such as SAINT, TP-BERTa, TabPFN, TabICL, and MITRA incorporate limited inter-row reasoning, most approaches still lack an explicit mechanism to model relationships among instances, even though similar samples often share related outcomes. We investigate whether introducing \emphsimple graph priors can enhance \emphpretrained tabular transformers. Concretely, we introduce BOLERO, a lightweight, static bipartite graph head that augments RoBERTa-Tab (a RoBERTa-style tabular backbone pretrained with masked-token prediction.) Each instance connects to feature/value anchors; a small GNN refines row representations, while the backbone remains frozen. We evaluate on 80 classification and 64 regression datasets from the TP-BERTa benchmark suites, comparing against strong baselines including XGBoost, CatBoost, TabPFN-v2, MITRA, TabICL, TP-BERTa, and RoBERTa-Tab. To ensure statistically sound conclusions, we follow best practices for multi-dataset evaluation: pairwise Wilcoxon signed-rank tests on per-dataset score differences and effect sizes (median improvement with confidence intervals), rather than mean-rank post-hoc tests that depend on the competitor pool. BOLERO achieves the highest number of statistically significant wins across both classification and regression, demonstrating that lightweight graph priors meaningfully improve pretrained tabular transformers.

[LG-80] DeepVekua: Geometric-Spectral Representation Learning for Physics-Informed Fields

链接: https://arxiv.org/abs/2512.12402
作者: Vladimer Khasia
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present DeepVekua, a hybrid architecture that unifies geometric deep learning with spectral analysis to solve partial differential equations (PDEs) in sparse data regimes. By learning a diffeomorphic coordinate transformation that maps complex geometries to a latent harmonic space, our method outperforms state-of-the-art implicit representations on advection-diffusion systems. Unlike standard coordinate-based networks which struggle with spectral bias, DeepVekua separates the learning of geometry from the learning of physics, solving for optimal spectral weights in closed form. We demonstrate a 100x improvement over spectral baselines. The code is available at this https URL.

[LG-81] Anchoring Values in Temporal and Group Dimensions for Flow Matching Model Alignment

链接: https://arxiv.org/abs/2512.12387
作者: Yawen Shao,Jie Xiao,Kai Zhu,Yu Liu,Wei Zhai,Yang Cao,Zheng-Jun Zha
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Group Relative Policy Optimization (GRPO) has proven highly effective in enhancing the alignment capabilities of Large Language Models (LLMs). However, current adaptations of GRPO for the flow matching-based image generation neglect a foundational conflict between its core principles and the distinct dynamics of the visual synthesis process. This mismatch leads to two key limitations: (i) Uniformly applying a sparse terminal reward across all timesteps impairs temporal credit assignment, ignoring the differing criticality of generation phases from early structure formation to late-stage tuning. (ii) Exclusive reliance on relative, intra-group rewards causes the optimization signal to fade as training converges, leading to the optimization stagnation when reward diversity is entirely depleted. To address these limitations, we propose Value-Anchored Group Policy Optimization (VGPO), a framework that redefines value estimation across both temporal and group dimensions. Specifically, VGPO transforms the sparse terminal reward into dense, process-aware value estimates, enabling precise credit assignment by modeling the expected cumulative reward at each generative stage. Furthermore, VGPO replaces standard group normalization with a novel process enhanced by absolute values to maintain a stable optimization signal even as reward diversity declines. Extensive experiments on three benchmarks demonstrate that VGPO achieves state-of-the-art image quality while simultaneously improving task-specific accuracy, effectively mitigating reward hacking. Project webpage: this https URL.

[LG-82] ElasticVR: Elastic Task Computing in Multi-User Multi-Connectivity Wireless Virtual Reality (VR) Systems

链接: https://arxiv.org/abs/2512.12366
作者: Babak Badnava,Jacob Chakareski,Morteza Hashemi
类目: Information Theory (cs.IT); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注: Submitted to ACM TOMM

点击查看摘要

Abstract:Diverse emerging VR applications integrate streaming of high fidelity 360 video content that requires ample amounts of computation and data rate. Scalable 360 video tiling enables having elastic VR computational tasks that can be scaled adaptively in computation and data rate based on the available user and system resources. We integrate scalable 360 video tiling in an edge-client wireless multi-connectivity architecture for joint elastic task computation offloading across multiple VR users called ElasticVR. To balance the trade-offs in communication, computation, energy consumption, and QoE that arise herein, we formulate a constrained QoE and energy optimization problem that integrates the multi-user/multi-connectivity action space with the elasticity of VR computational tasks. The ElasticVR framework introduces two multi-agent deep reinforcement learning solutions, namely CPPG and IPPG. CPPG adopts a centralized training and centralized execution approach to capture the coupling between users’ communication and computational demands. This leads to globally coordinated decisions at the cost of increased computational overheads and limited scalability. To address the latter challenges, we also explore an alternative strategy denoted IPPG that adopts a centralized training with decentralized execution paradigm. IPPG leverages shared information and parameter sharing to learn robust policies; however, during execution, each user takes action independently based on its local state information only. The decentralized execution alleviates the communication and computation overhead of centralized decision-making and improves scalability. We show that the ElasticVR framework improves the PSNR by 43.21%, while reducing the response time and energy consumption by 42.35% and 56.83%, respectively, compared with a case where no elasticity is incorporated into VR computations.

[LG-83] Synthetic Swarm Mosquito Dataset for Acoustic Classification: A Proof of Concept

链接: https://arxiv.org/abs/2512.12365
作者: Thai-Duy Dinh,Minh-Luan Vo,Cuong Tuan Nguyen,Bich-Hien Vo
类目: Machine Learning (cs.LG)
*备注: Accepted at RIVF 2025

点击查看摘要

Abstract:Mosquito-borne diseases pose a serious global health threat, causing over 700,000 deaths annually. This work introduces a proof-of-concept Synthetic Swarm Mosquito Dataset for Acoustic Classification, created to simulate realistic multi-species and noisy swarm conditions. Unlike conventional datasets that require labor-intensive recording of individual mosquitoes, the synthetic approach enables scalable data generation while reducing human resource demands. Using log-mel spectrograms, we evaluated lightweight deep learning architectures for the classification of mosquito species. Experiments show that these models can effectively identify six major mosquito vectors and are suitable for deployment on embedded low-power devices. The study demonstrates the potential of synthetic swarm audio datasets to accelerate acoustic mosquito research and enable scalable real-time surveillance solutions.

[LG-84] Uncertainty Quantification for Machine Learning: One Size Does Not Fit All

链接: https://arxiv.org/abs/2512.12341
作者: Paul Hofman,Yusuf Sale,Eyke Hüllermeier
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Proper quantification of predictive uncertainty is essential for the use of machine learning in safety-critical applications. Various uncertainty measures have been proposed for this purpose, typically claiming superiority over other measures. In this paper, we argue that there is no single best measure. Instead, uncertainty quantification should be tailored to the specific application. To this end, we use a flexible family of uncertainty measures that distinguishes between total, aleatoric, and epistemic uncertainty of second-order distributions. These measures can be instantiated with specific loss functions, so-called proper scoring rules, to control their characteristics, and we show that different characteristics are useful for different tasks. In particular, we show that, for the task of selective prediction, the scoring rule should ideally match the task loss. On the other hand, for out-of-distribution detection, our results confirm that mutual information, a widely used measure of epistemic uncertainty, performs best. Furthermore, in an active learning setting, epistemic uncertainty based on zero-one loss is shown to consistently outperform other uncertainty measures.

[LG-85] Eventually LIL Regret: Almost Sure lnln T Regret for a sub-Gaussian Mixture on Unbounded Data

链接: https://arxiv.org/abs/2512.12325
作者: Shubhada Agrawal,Aaditya Ramdas
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注: 24 pages

点击查看摘要

Abstract:We prove that a classic sub-Gaussian mixture proposed by Robbins in a stochastic setting actually satisfies a path-wise (deterministic) regret bound. For every path in a natural ``Ville event’’ E_\alpha , this regret till time T is bounded by \ln^2(1/\alpha)/V_T + \ln (1/\alpha) + \ln \ln V_T up to universal constants, where V_T is a nonnegative, nondecreasing, cumulative variance process. (The bound reduces to \ln(1/\alpha) + \ln \ln V_T if V_T \geq \ln(1/\alpha) .) If the data were stochastic, then one can show that E_\alpha has probability at least 1-\alpha under a wide class of distributions (eg: sub-Gaussian, symmetric, variance-bounded, etc.). In fact, we show that on the Ville event E_0 of probability one, the regret on every path in E_0 is eventually bounded by \ln \ln V_T (up to constants). We explain how this work helps bridge the world of adversarial online learning (which usually deals with regret bounds for bounded data), with game-theoretic statistics (which can handle unbounded data, albeit using stochastic assumptions). In short, conditional regret bounds serve as a bridge between stochastic and adversarial betting.

[LG-86] winFormer: A Dual-Level Transformer for Long-Sequence Time-Series Forecasting

链接: https://arxiv.org/abs/2512.12301
作者: Mahima Kumavat,Aditya Maheshwari
类目: Machine Learning (cs.LG)
*备注: 14 pages, 4 figures

点击查看摘要

Abstract:TwinFormer is a hierarchical Transformer for long-sequence time-series forecasting. It divides the input into non-overlapping temporal patches and processes them in two stages: (1) a Local Informer with top- k Sparse Attention models intra-patch dynamics, followed by mean pooling; (2) a Global Informer captures long-range inter-patch dependencies using the same top- k attention. A lightweight GRU aggregates the globally contextualized patch tokens for direct multi-horizon prediction. The resulting architecture achieves linear O(kLd) time and memory complexity. On eight real-world benchmarking datasets from six different domains, including weather, stock price, temperature, power consumption, electricity, and disease, and forecasting horizons 96-720 , TwinFormer secures 27 positions in the top two out of 34 . Out of the 27 , it achieves the best performance on MAE and RMSE at 17 places and 10 at the second-best place on MAE and RMSE. This consistently outperforms PatchTST, iTransformer, FEDformer, Informer, and vanilla Transformers. Ablations confirm the superiority of top- k Sparse Attention over ProbSparse and the effectiveness of GRU-based aggregation. Code is available at this repository: this https URL.

[LG-87] Near-Zero-Overhead Freshness for Recommendation Systems via Inference-Side Model Updates

链接: https://arxiv.org/abs/2512.12295
作者: Wenjun Yu,Sitian Chen,Cheng Chen,Amelie Chi Zhou
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 13 Pages, 19 figures

点击查看摘要

Abstract:Deep Learning Recommendation Models (DLRMs) underpin personalized services but face a critical freshness-accuracy tradeoff due to massive parameter synchronization overheads. Production DLRMs deploy decoupled training/inference clusters, where synchronizing petabyte-scale embedding tables (EMTs) causes multi-minute staleness, degrading recommendation quality and revenue. We observe that (1) inference nodes exhibit sustained CPU underutilization (peak = 20%), and (2) EMT gradients possess intrinsic low-rank structure, enabling compact update representation. We present LiveUpdate, a system that eliminates inter-cluster synchronization by colocating Low-Rank Adaptation (LoRA) trainers within inference nodes. LiveUpdate addresses two core challenges: (1) dynamic rank adaptation via singular value monitoring to constrain memory overhead (2% of EMTs), and (2) NUMA-aware resource scheduling with hardware-enforced QoS to eliminate update inference contention (P99 latency impact 20ms). Evaluations show LiveUpdate reduces update costs by 2x versus delta-update baselines while achieving higher accuracy within 1-hour windows. By transforming idle inference resources into freshness engines, LiveUpdate delivers online model updates while outperforming state-of-the-art delta-update methods by 0.04% to 0.24% in accuracy.

[LG-88] Balancing Accuracy and Speed: A Multi-Fidelity Ensemble Kalman Filter with a Machine Learning Surrogate Model

链接: https://arxiv.org/abs/2512.12276
作者: Jeffrey van der Voort,Martin Verlaan,Hanne Kekkonen
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:

点击查看摘要

Abstract:Currently, more and more machine learning (ML) surrogates are being developed for computationally expensive physical models. In this work we investigate the use of a Multi-Fidelity Ensemble Kalman Filter (MF-EnKF) in which the low-fidelity model is such a machine learning surrogate model, instead of a traditional low-resolution or reduced-order model. The idea behind this is to use an ensemble of a few expensive full model runs, together with an ensemble of many cheap but less accurate ML model runs. In this way we hope to reach increased accuracy within the same computational budget. We investigate the performance by testing the approach on two common test problems, namely the Lorenz-2005 model and the Quasi-Geostrophic model. By keeping the original physical model in place, we obtain a higher accuracy than when we completely replace it by the ML model. Furthermore, the MF-EnKF reaches improved accuracy within the same computational budget. The ML surrogate has similar or improved accuracy compared to the low-resolution one, but it can provide a larger speed-up. Our method contributes to increasing the effective ensemble size in the EnKF, which improves the estimation of the initial condition and hence accuracy of the predictions in fields such as meteorology and oceanography.

[LG-89] Optimized Learned Count-Min Sketch NEURIPS2025

链接: https://arxiv.org/abs/2512.12252
作者: Kyosuke Nishishita,Atsuki Sato,Yusuke Matsui
类目: Machine Learning (cs.LG)
*备注: 4 pages, 3 figures. Accepted at NeurIPS 2025 Workshop on Machine Learning for Systems

点击查看摘要

Abstract:Count-Min Sketch (CMS) is a memory-efficient data structure for estimating the frequency of elements in a multiset. Learned Count-Min Sketch (LCMS) enhances CMS with a machine learning model to reduce estimation error under the same memory usage, but suffers from slow construction due to empirical parameter tuning and lacks theoretical guarantees on intolerable error probability. We propose Optimized Learned Count-Min Sketch (OptLCMS), which partitions the input domain and assigns each partition to its own CMS instance, with CMS parameters analytically derived for fixed thresholds, and thresholds optimized via dynamic programming with approximate feasibility checks. This reduces the need for empirical validation, enabling faster construction while providing theoretical guarantees under these assumptions. OptLCMS also allows explicit control of the allowable error threshold, improving flexibility in practice. Experiments show that OptLCMS builds faster, achieves lower intolerable error probability, and matches the estimation accuracy of LCMS.

[LG-90] Learning to Get Up Across Morphologies: Zero-Shot Recovery with a Unified Humanoid Policy

链接: https://arxiv.org/abs/2512.12230
作者: Jonathan Spraggett
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Accepted at 28th RoboCup International Symposium

点击查看摘要

Abstract:Fall recovery is a critical skill for humanoid robots in dynamic environments such as RoboCup, where prolonged downtime often decides the match. Recent techniques using deep reinforcement learning (DRL) have produced robust get-up behaviors, yet existing methods require training of separate policies for each robot morphology. This paper presents a single DRL policy capable of recovering from falls across seven humanoid robots with diverse heights (0.48 - 0.81 m), weights (2.8 - 7.9 kg), and dynamics. Trained with CrossQ, the unified policy transfers zero-shot up to 86 +/- 7% (95% CI [81, 89]) on unseen morphologies, eliminating the need for robot-specific training. Comprehensive leave-one-out experiments, morph scaling analysis, and diversity ablations show that targeted morphological coverage improves zero-shot generalization. In some cases, the shared policy even surpasses the specialist baselines. These findings illustrate the practicality of morphology-agnostic control for fall recovery, laying the foundation for generalist humanoid control. The software is open-source and available at: this https URL

[LG-91] EEG-DLite: Dataset Distillation for Efficient Large EEG Model Training AAAI-2026

链接: https://arxiv.org/abs/2512.12210
作者: Yuting Tang,Weibang Jiang,Shanglin Li,Yong Li,Chenyu Liu,Xinliang Zhou,Yi Ding,Cuntai Guan
类目: Machine Learning (cs.LG)
*备注: Accepted by AAAI-2026

点击查看摘要

Abstract:Large-scale EEG foundation models have shown strong generalization across a range of downstream tasks, but their training remains resource-intensive due to the volume and variable quality of EEG data. In this work, we introduce EEG-DLite, a data distillation framework that enables more efficient pre-training by selectively removing noisy and redundant samples from large EEG datasets. EEG-DLite begins by encoding EEG segments into compact latent representations using a self-supervised autoencoder, allowing sample selection to be performed efficiently and with reduced sensitivity to noise. Based on these representations, EEG-DLite filters out outliers and minimizes redundancy, resulting in a smaller yet informative subset that retains the diversity essential for effective foundation model training. Through extensive experiments, we demonstrate that training on only 5 percent of a 2,500-hour dataset curated with EEG-DLite yields performance comparable to, and in some cases better than, training on the full dataset across multiple downstream tasks. To our knowledge, this is the first systematic study of pre-training data distillation in the context of EEG foundation models. EEG-DLite provides a scalable and practical path toward more effective and efficient physiological foundation modeling. The code is available at this https URL.

[LG-92] MolGuidance: Advanced Guidance Strategies for Conditional Molecular Generation with Flow Matching

链接: https://arxiv.org/abs/2512.12198
作者: Jirui Jin,Cheng Zeng,Pawan Prakash,Ellad B. Tadmor,Adrian Roitberg,Richard G. Hennig,Stefano Martiniani,Mingjie Liu
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: 19 pages, 5 figures, code: this https URL

点击查看摘要

Abstract:Key objectives in conditional molecular generation include ensuring chemical validity, aligning generated molecules with target properties, promoting structural diversity, and enabling efficient sampling for discovery. Recent advances in computer vision introduced a range of new guidance strategies for generative models, many of which can be adapted to support these goals. In this work, we integrate state-of-the-art guidance methods – including classifier-free guidance, autoguidance, and model guidance – in a leading molecule generation framework built on an SE(3)-equivariant flow matching process. We propose a hybrid guidance strategy that separately guides continuous and discrete molecular modalities – operating on velocity fields and predicted logits, respectively – while jointly optimizing their guidance scales via Bayesian optimization. Our implementation, benchmarked on the QM9 and QMe14S datasets, achieves new state-of-the-art performance in property alignment for de novo molecular generation. The generated molecules also exhibit high structural validity. Furthermore, we systematically compare the strengths and limitations of various guidance methods, offering insights into their broader applicability.

[LG-93] HydroDiffusion: Diffusion-Based Probabilistic Streamflow Forecasting with a State Space Backbone

链接: https://arxiv.org/abs/2512.12183
作者: Yihan Wang,Annan Yu,Lujun Zhang,Charuleka Varadharajan,N. Benjamin Erichson
类目: Machine Learning (cs.LG); Geophysics (physics.geo-ph)
*备注:

点击查看摘要

Abstract:Recent advances have introduced diffusion models for probabilistic streamflow forecasting, demonstrating strong early flood-warning skill. However, current implementations rely on recurrent Long Short-Term Memory (LSTM) backbones and single-step training objectives, which limit their ability to capture long-range dependencies and produce coherent forecast trajectories across lead times. To address these limitations, we developed HydroDiffusion, a diffusion-based probabilistic forecasting framework with a decoder-only state space model backbone. The proposed framework jointly denoises full multi-day trajectories in a single pass, ensuring temporal coherence and mitigating error accumulation common in autoregressive prediction. HydroDiffusion is evaluated across 531 watersheds in the contiguous United States (CONUS) in the CAMELS dataset. We benchmark HydroDiffusion against two diffusion baselines with LSTM backbones, as well as the recently proposed Diffusion-based Runoff Model (DRUM). Results show that HydroDiffusion achieves strong nowcast accuracy when driven by observed meteorological forcings, and maintains consistent performance across the full simulation horizon. Moreover, HydroDiffusion delivers stronger deterministic and probabilistic forecast skill than DRUM in operational forecasting. These results establish HydroDiffusion as a robust generative modeling framework for medium-range streamflow forecasting, providing both a new modeling benchmark and a foundation for future research on probabilistic hydrologic prediction at continental scales.

[LG-94] Keep the Lights On Keep the Lengths in Check: Plug-In Adversarial Detection for Time-Series LLM s in Energy Forecasting

链接: https://arxiv.org/abs/2512.12154
作者: Hua Ma,Ruoxi Sun,Minhui Xue,Xingliang Yuan,Carsten Rudolph,Surya Nepal,Ling Liu
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate time-series forecasting is increasingly critical for planning and operations in low-carbon power systems. Emerging time-series large language models (TS-LLMs) now deliver this capability at scale, requiring no task-specific retraining, and are quickly becoming essential components within the Internet-of-Energy (IoE) ecosystem. However, their real-world deployment is complicated by a critical vulnerability: adversarial examples (AEs). Detecting these AEs is challenging because (i) adversarial perturbations are optimized across the entire input sequence and exploit global temporal dependencies, which renders local detection methods ineffective, and (ii) unlike traditional forecasting models with fixed input dimensions, TS-LLMs accept sequences of variable length, increasing variability that complicates detection. To address these challenges, we propose a plug-in detection framework that capitalizes on the TS-LLM’s own variable-length input capability. Our method uses sampling-induced divergence as a detection signal. Given an input sequence, we generate multiple shortened variants and detect AEs by measuring the consistency of their forecasts: Benign sequences tend to produce stable predictions under sampling, whereas adversarial sequences show low forecast similarity, because perturbations optimized for a full-length sequence do not transfer reliably to shorter, differently-structured subsamples. We evaluate our approach on three representative TS-LLMs (TimeGPT, TimesFM, and TimeLLM) across three energy datasets: ETTh2 (Electricity Transformer Temperature), NI (Hourly Energy Consumption), and Consumption (Hourly Electricity Consumption and Production). Empirical results confirm strong and robust detection performance across both black-box and white-box attack scenarios, highlighting its practicality as a reliable safeguard for TS-LLM forecasting in real-world energy systems.

[LG-95] On the Approximation Power of SiLU Networks: Exponential Rates and Depth Efficiency

链接: https://arxiv.org/abs/2512.12132
作者: Koffi O. Ayena
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 35 pages, 20 figures, submitted to the journal

点击查看摘要

Abstract:This article establishes a comprehensive theoretical framework demonstrating that SiLU (Sigmoid Linear Unit) activation networks achieve exponential approximation rates for smooth functions with explicit and improved complexity control compared to classical ReLU-based constructions. We develop a novel hierarchical construction beginning with an efficient approximation of the square function x^2 more compact in depth and size than comparable ReLU realizations, such as those given by Yarotsky. This construction yields an approximation error decaying as \mathcalO(\omega^-2k) using networks of depth \mathcalO(1) . We then extend this approach through functional composition to establish sharp approximation bounds for deep SiLU networks in approximating Sobolev-class functions, with total depth \mathcalO(1) and size \mathcalO(\varepsilon^-d/n) .

[LG-96] BOOST: BOttleneck-Optimized Scalable Training Framework for Low-Rank Large Language Models

链接: https://arxiv.org/abs/2512.12131
作者: Zhengyang Wang,Ziyue Liu,Ruijie Zhang,Avinash Maurya,Paul Hovland,Bogdan Nicolae,Franck Cappello,Zheng Zhang
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:The scale of transformer model pre-training is constrained by the increasing computation and communication cost. Low-rank bottleneck architectures offer a promising solution to significantly reduce the training time and memory footprint with minimum impact on accuracy. Despite algorithmic efficiency, bottleneck architectures scale poorly under standard tensor parallelism. Simply applying 3D parallelism designed for full-rank methods leads to excessive communication and poor GPU utilization. To address this limitation, we propose BOOST, an efficient training framework tailored for large-scale low-rank bottleneck architectures. BOOST introduces a novel Bottleneck-aware Tensor Parallelism, and combines optimizations such as online-RMSNorm, linear layer grouping, and low-rank activation checkpointing to achieve end-to-end training speedup. Evaluations on different low-rank bottleneck architectures demonstrate that BOOST achieves 1.46-1.91 \times speedup over full-rank model baselines and 1.87-2.27 \times speedup over low-rank model with naively integrated 3D parallelism, with improved GPU utilization and reduced communication overhead.

[LG-97] High-Dimensional Tensor Discriminant Analysis: Low-Rank Discriminant Structure Representation Synergy and Theoretical Guarantees

链接: https://arxiv.org/abs/2512.12122
作者: Elynn Chen,Yuefeng Han,Jiayu Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:High-dimensional tensor-valued predictors arise in modern applications, increasingly as learned representations from neural networks. Existing tensor classification methods rely on sparsity or Tucker structures and often lack theoretical guarantees. Motivated by empirical evidence that discriminative signals concentrate along a few multilinear components, we introduce CP low-rank structure for the discriminant tensor, a modeling perspective not previously explored. Under a Tensor Gaussian Mixture Model, we propose high-dimensional CP low-rank Tensor Discriminant Analysis (CP-TDA) with Randomized Composite PCA (\textscrc-PCA) initialization, that is essential for handling dependent and anisotropic noise under weaker signal strength and incoherence conditions, followed by iterative refinement algorithm. We establish global convergence and minimax-optimal misclassification rates. To handle tensor data deviating from tensor normality, we develop the first semiparametric tensor discriminant model, in which learned tensor representations are mapped via deep generative models into a latent space tailored for CP-TDA. Misclassification risk decomposes into representation, approximation, and estimation errors. Numerical studies and real data analysis on graph classification demonstrate substantial gains over existing tensor classifiers and state-of-the-art graph neural networks, particularly in high-dimensional, small-sample regimes. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2512.12122 [cs.LG] (or arXiv:2512.12122v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2512.12122 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-98] Citation-Grounded Code Comprehension: Preventing LLM Hallucination Through Hybrid Retrieval and Graph-Augmented Context

链接: https://arxiv.org/abs/2512.12117
作者: Jahidul Arafat
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models have become essential tools for code comprehension, enabling developers to query unfamiliar codebases through natural language interfaces. However, LLM hallucination, generating plausible but factually incorrect citations to source code, remains a critical barrier to reliable developer assistance. This paper addresses the challenges of achieving verifiable, citation grounded code comprehension through hybrid retrieval and lightweight structural reasoning. Our work is grounded in systematic evaluation across 30 Python repositories with 180 developer queries, comparing retrieval modalities, graph expansion strategies, and citation verification mechanisms. We find that challenges of citation accuracy arise from the interplay between sparse lexical matching, dense semantic similarity, and cross file architectural dependencies. Among these, cross file evidence discovery is the largest contributor to citation completeness, but it is largely overlooked because existing systems rely on pure textual similarity without leveraging code structure. We advocate for citation grounded generation as an architectural principle for code comprehension systems and demonstrate this need by achieving 92 percent citation accuracy with zero hallucinations. Specifically, we develop a hybrid retrieval system combining BM25 sparse matching, BGE dense embeddings, and Neo4j graph expansion via import relationships, which outperforms single mode baselines by 14 to 18 percentage points while discovering cross file evidence missed by pure text similarity in 62 percent of architectural queries.

[LG-99] Neural CDEs as Correctors for Learned Time Series Models

链接: https://arxiv.org/abs/2512.12116
作者: Muhammad Bilal Shahid,Prajwal Koirla,Cody Fleming
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Learned time-series models, whether continuous- or discrete-time, are widely used to forecast the states of a dynamical system. Such models generate multi-step forecasts either directly, by predicting the full horizon at once, or iteratively, by feeding back their own predictions at each step. In both cases, the multi-step forecasts are prone to errors. To address this, we propose a Predictor-Corrector mechanism where the Predictor is any learned time-series model and the Corrector is a neural controlled differential equation. The Predictor forecasts, and the Corrector predicts the errors of the forecasts. Adding these errors to the forecasts improves forecast performance. The proposed Corrector works with irregularly sampled time series and continuous- and discrete-time Predictors. Additionally, we introduce two regularization strategies to improve the extrapolation performance of the Corrector with accelerated training. We evaluate our Corrector with diverse Predictors, e.g., neural ordinary differential equations, Contiformer, and DLinear, on synthetic, physics simulation, and real-world forecasting datasets. The experiments demonstrate that the Predictor-Corrector mechanism consistently improves the performance compared to Predictor alone.

[LG-100] GraphPerf-RT: A Graph-Driven Performance Model for Hardware-Aware Scheduling of OpenMP Codes

链接: https://arxiv.org/abs/2512.12091
作者: Mohammad Pivezhandi,Mahdi Banisharif,Saeed Bakhshan,Abusayeed Saifullah,Ali Jannesari
类目: Machine Learning (cs.LG)
*备注: 36 pages, 1 figure, 7 tables

点击查看摘要

Abstract:Performance prediction for OpenMP workloads on heterogeneous embedded SoCs is challenging due to complex interactions between task DAG structure, control-flow irregularity, cache and branch behavior, and thermal dynamics; classical heuristics struggle under workload irregularity, tabular regressors discard structural information, and model-free RL risks overheating resource-constrained devices. We introduce GraphPerf-RT, the first surrogate that unifies task DAG topology, CFG-derived code semantics, and runtime context (per-core DVFS, thermal state, utilization) in a heterogeneous graph representation with typed edges encoding precedence, placement, and contention. Multi-task evidential heads predict makespan, energy, cache and branch misses, and utilization with calibrated uncertainty (Normal-Inverse-Gamma), enabling risk-aware scheduling that filters low-confidence rollouts. We validate GraphPerf-RT on three embedded ARM platforms (Jetson TX2, Jetson Orin NX, RUBIK Pi), achieving R^2 0.95 with well-calibrated uncertainty (ECE 0.05). To demonstrate end-to-end scheduling utility, we integrate the surrogate with four RL methods on Jetson TX2: single-agent model-free (SAMFRL), single-agent model-based (SAMBRL), multi-agent model-free (MAMFRL-D3QN), and multi-agent model-based (MAMBRL-D3QN). Experiments across 5 seeds (200 episodes each) show that MAMBRL-D3QN with GraphPerf-RT as the world model achieves 66% makespan reduction (0.97 +/- 0.35s) and 82% energy reduction (0.006 +/- 0.005J) compared to model-free baselines, demonstrating that accurate, uncertainty-aware surrogates enable effective model-based planning on thermally constrained embedded systems. Comments: 36 pages, 1 figure, 7 tables Subjects: Machine Learning (cs.LG) ACMclasses: C.4; D.1.3; I.2.6 Cite as: arXiv:2512.12091 [cs.LG] (or arXiv:2512.12091v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2512.12091 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Mohammad Pivezhandi [view email] [v1] Fri, 12 Dec 2025 23:46:05 UTC (375 KB) Full-text links: Access Paper: View a PDF of the paper titled GraphPerf-RT: A Graph-Driven Performance Model for Hardware-Aware Scheduling of OpenMP Codes, by Mohammad Pivezhandi and 4 other authorsView PDFHTML (experimental)TeX Source view license Current browse context: cs.LG prev | next new | recent | 2025-12 Change to browse by: cs References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) IArxiv recommender toggle IArxiv Recommender (What is IArxiv?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status

[LG-101] CLOAK: Contrastive Guidance for Latent Diffusion-Based Data Obfuscation

链接: https://arxiv.org/abs/2512.12086
作者: Xin Yang,Omid Ardakanian
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Data obfuscation is a promising technique for mitigating attribute inference attacks by semi-trusted parties with access to time-series data emitted by sensors. Recent advances leverage conditional generative models together with adversarial training or mutual information-based regularization to balance data privacy and utility. However, these methods often require modifying the downstream task, struggle to achieve a satisfactory privacy-utility trade-off, or are computationally intensive, making them impractical for deployment on resource-constrained mobile IoT devices. We propose Cloak, a novel data obfuscation framework based on latent diffusion models. In contrast to prior work, we employ contrastive learning to extract disentangled representations, which guide the latent diffusion process to retain useful information while concealing private information. This approach enables users with diverse privacy needs to navigate the privacy-utility trade-off with minimal retraining. Extensive experiments on four public time-series datasets, spanning multiple sensing modalities, and a dataset of facial images demonstrate that Cloak consistently outperforms state-of-the-art obfuscation techniques and is well-suited for deployment in resource-constrained settings.

[LG-102] SigTime: Learning and Visually Explaining Time Series Signatures

链接: https://arxiv.org/abs/2512.12076
作者: Yu-Chia Huang,Juntong Chen,Dongyu Liu,Kwan-Liu Ma
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Understanding and distinguishing temporal patterns in time series data is essential for scientific discovery and decision-making. For example, in biomedical research, uncovering meaningful patterns in physiological signals can improve diagnosis, risk assessment, and patient outcomes. However, existing methods for time series pattern discovery face major challenges, including high computational complexity, limited interpretability, and difficulty in capturing meaningful temporal structures. To address these gaps, we introduce a novel learning framework that jointly trains two Transformer models using complementary time series representations: shapelet-based representations to capture localized temporal structures and traditional feature engineering to encode statistical properties. The learned shapelets serve as interpretable signatures that differentiate time series across classification labels. Additionally, we develop a visual analytics system – SigTIme – with coordinated views to facilitate exploration of time series signatures from multiple perspectives, aiding in useful insights generation. We quantitatively evaluate our learning framework on eight publicly available datasets and one proprietary clinical dataset. Additionally, we demonstrate the effectiveness of our system through two usage scenarios along with the domain experts: one involving public ECG data and the other focused on preterm labor analysis.

[LG-103] Physics-informed neural networks to solve inverse problems in unbounded domains

链接: https://arxiv.org/abs/2512.12074
作者: Gregorio Pérez-Bernal,Oscar Rincón-Cardeño,Silvana Montoya-Noguera,Nicolás Guarín-Zapata
类目: Machine Learning (cs.LG); Mathematical Physics (math-ph)
*备注: 19 pages, 15 figures

点击查看摘要

Abstract:Inverse problems are extensively studied in applied mathematics, with applications ranging from acoustic tomography for medical diagnosis to geophysical exploration. Physics informed neural networks (PINNs) have emerged as a powerful tool for solving such problems, while Physics informed Kolmogorov Arnold networks (PIKANs) represent a recent benchmark that, in certain problems, promises greater interpretability and accuracy compared to PINNs, due to their nature, being constructed as a composition of polynomials. In this work, we develop a methodology for addressing inverse problems in infinite and semi infinite domains. We introduce a novel sampling strategy for the network’s training points, using the negative exponential and normal distributions, alongside a dual network architecture that is trained to learn the solution and parameters of an equation with the same loss function. This design enables the solution of inverse problems without explicitly imposing boundary conditions, as long as the solutions tend to stabilize when leaving the domain of interest. The proposed architecture is implemented using both PINNs and PIKANs, and their performance is compared in terms of accuracy with respect to a known solution as well as computational time and response to a noisy environment. Our results demonstrate that, in this setting, PINNs provide a more accurate and computationally efficient solution, solving the inverse problem 1,000 times faster and in the same order of magnitude, yet with a lower relative error than PIKANs.

[LG-104] Goal Reaching with Eikonal-Constrained Hierarchical Quasimetric Reinforcement Learning

链接: https://arxiv.org/abs/2512.12046
作者: Vittorio Giammarino,Ahmed H. Qureshi
类目: Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Goal-Conditioned Reinforcement Learning (GCRL) mitigates the difficulty of reward design by framing tasks as goal reaching rather than maximizing hand-crafted reward signals. In this setting, the optimal goal-conditioned value function naturally forms a quasimetric, motivating Quasimetric RL (QRL), which constrains value learning to quasimetric mappings and enforces local consistency through discrete, trajectory-based constraints. We propose Eikonal-Constrained Quasimetric RL (Eik-QRL), a continuous-time reformulation of QRL based on the Eikonal Partial Differential Equation (PDE). This PDE-based structure makes Eik-QRL trajectory-free, requiring only sampled states and goals, while improving out-of-distribution generalization. We provide theoretical guarantees for Eik-QRL and identify limitations that arise under complex dynamics. To address these challenges, we introduce Eik-Hierarchical QRL (Eik-HiQRL), which integrates Eik-QRL into a hierarchical decomposition. Empirically, Eik-HiQRL achieves state-of-the-art performance in offline goal-conditioned navigation and yields consistent gains over QRL in manipulation tasks, matching temporal-difference methods.

[LG-105] DFedReweighting: A Unified Framework for Objective-Oriented Reweighting in Decentralized Federated Learning

链接: https://arxiv.org/abs/2512.12022
作者: Kaichuang Zhang,Wei Yin,Jinghao Yang,Ping Xu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Decentralized federated learning (DFL) has recently emerged as a promising paradigm that enables multiple clients to collaboratively train machine learning model through iterative rounds of local training, communication, and aggregation without relying on a central server which introduces potential vulnerabilities in conventional Federated Learning. Nevertheless, DFL systems continue to face a range of challenges, including fairness, robustness, etc. To address these challenges, we propose \textbfDFedReweighting, a unified aggregation framework designed to achieve diverse objectives in DFL systems via a objective-oriented reweighting aggregation at the final step of each learning round. Specifically, the framework first computes preliminary weights based on \textittarget performance metric obtained from auxiliary dataset constructed using local data. These weights are then refined using \textitcustomized reweighting strategy, resulting in the final aggregation weights. Our results from the theoretical analysis demonstrate that the appropriate combination of the target performance metric and the customized reweighting strategy ensures linear convergence. Experimental results consistently show that our proposed framework significantly improves fairness and robustness against Byzantine attacks in diverse scenarios. Provided that appropriate target performance metrics and customized reweighting strategy are selected, our framework can achieve a wide range of desired learning objectives.

[LG-106] EnviroLLM : Resource Tracking and Optimization for Local AI

链接: https://arxiv.org/abs/2512.12004
作者: Troy Allen
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注: 8 pages, 3 tables

点击查看摘要

Abstract:Large language models (LLMs) are increasingly deployed locally for privacy and accessibility, yet users lack tools to measure their resource usage, environmental impact, and efficiency metrics. This paper presents EnviroLLM, an open-source toolkit for tracking, benchmarking, and optimizing performance and energy consumption when running LLMs on personal devices. The system provides real-time process monitoring, benchmarking across multiple platforms (Ollama, LM Studio, vLLM, and OpenAI-compatible APIs), persistent storage with visualizations for longitudinal analysis, and personalized model and optimization recommendations. The system includes LLM-as-judge evaluations alongside energy and speed metrics, enabling users to assess quality-efficiency tradeoffs when testing models with custom prompts.

[LG-107] Adversarial Attacks Against Deep Learning-Based Radio Frequency Fingerprint Identification

链接: https://arxiv.org/abs/2512.12002
作者: Jie Ma,Junqing Zhang,Guanxiong Shen,Alan Marshall,Chip-Hong Chang
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Radio frequency fingerprint identification (RFFI) is an emerging technique for the lightweight authentication of wireless Internet of things (IoT) devices. RFFI exploits deep learning models to extract hardware impairments to uniquely identify wireless devices. Recent studies show deep learning-based RFFI is vulnerable to adversarial attacks. However, effective adversarial attacks against different types of RFFI classifiers have not yet been explored. In this paper, we carried out a comprehensive investigations into different adversarial attack methods on RFFI systems using various deep learning models. Three specific algorithms, fast gradient sign method (FGSM), projected gradient descent (PGD), and universal adversarial perturbation (UAP), were analyzed. The attacks were launched to LoRa-RFFI and the experimental results showed the generated perturbations were effective against convolutional neural networks (CNNs), long short-term memory (LSTM) networks, and gated recurrent units (GRU). We further used UAP to launch practical attacks. Special factors were considered for the wireless context, including implementing real-time attacks, the effectiveness of the attacks over a period of time, etc. Our experimental evaluation demonstrated that UAP can successfully launch adversarial attacks against the RFFI, achieving a success rate of 81.7% when the adversary almost has no prior knowledge of the victim RFFI systems.

[LG-108] Policy Gradient Algorithms for Age-of-Information Cost Minimization

链接: https://arxiv.org/abs/2512.11990
作者: José-Ramón Vidal,Vicent Pla,Luis Guijarro,Israel Leyva-Mayorga
类目: Networking and Internet Architecture (cs.NI); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: IEEE Transactions on Communications, 2025

点击查看摘要

Abstract:Recent developments in cyber-physical systems have increased the importance of maximizing the freshness of the information about the physical environment. However, optimizing the access policies of Internet of Things devices to maximize the data freshness, measured as a function of the Age-of-Information (AoI) metric, is a challenging task. This work introduces two algorithms to optimize the information update process in cyber-physical systems operating under the generate-at-will model, by finding an online policy without knowing the characteristics of the transmission delay or the age cost function. The optimization seeks to minimize the time-average cost, which integrates the AoI at the receiver and the data transmission cost, making the approach suitable for a broad range of scenarios. Both algorithms employ policy gradient methods within the framework of model-free reinforcement learning (RL) and are specifically designed to handle continuous state and action spaces. Each algorithm minimizes the cost using a distinct strategy for deciding when to send an information update. Moreover, we demonstrate that it is feasible to apply the two strategies simultaneously, leading to an additional reduction in cost. The results demonstrate that the proposed algorithms exhibit good convergence properties and achieve a time-average cost within 3% of the optimal value, when the latter is computable. A comparison with other state-of-the-art methods shows that the proposed algorithms outperform them in one or more of the following aspects: being applicable to a broader range of scenarios, achieving a lower time-average cost, and requiring a computational cost at least one order of magnitude lower.

[LG-109] Learning to Extract Context for Context-Aware LLM Inference

链接: https://arxiv.org/abs/2512.11986
作者: Minseon Kim,Lucas Caccia,Zhengyan Shi,Matheus Pereira,Marc-Alexandre Côté,Xingdi Yuan,Alessandro Sordoni
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:User prompts to large language models (LLMs) are often ambiguous or under-specified, and subtle contextual cues shaped by user intentions, prior knowledge, and risk factors strongly influence what constitutes an appropriate response. Misinterpreting intent or risks may lead to unsafe outputs, while overly cautious interpretations can cause unnecessary refusal of benign requests. In this paper, we question the conventional framework in which LLMs generate immediate responses to requests without considering broader contextual factors. User requests are situated within broader contexts such as intentions, knowledge, and prior experience, which strongly influence what constitutes an appropriate answer. We propose a framework that extracts and leverages such contextual information from the user prompt itself. Specifically, a reinforcement learning based context generator, designed in an autoencoder-like fashion, is trained to infer contextual signals grounded in the prompt and use them to guide response generation. This approach is particularly important for safety tasks, where ambiguous requests may bypass safeguards while benign but confusing requests can trigger unnecessary refusals. Experiments show that our method reduces harmful responses by an average of 5.6% on the SafetyInstruct dataset across multiple foundation models and improves the harmonic mean of attack success rate and compliance on benign prompts by 6.2% on XSTest and WildJailbreak. These results demonstrate the effectiveness of context extraction for safer and more reliable LLM inferences.

[LG-110] Neural Chameleons: Language Models Can Learn to Hide Their Thoughts from Unseen Activation Monitors

链接: https://arxiv.org/abs/2512.11949
作者: Max McGuinness,Alex Serrano,Luke Bailey,Scott Emmons
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Activation monitoring, which probes a model’s internal states using lightweight classifiers, is an emerging tool for AI safety. However, its worst-case robustness under a misalignment threat model–where a model might learn to actively conceal its internal states–remains untested. Focusing on this threat model, we ask: could a model learn to evade previously unseen activation monitors? Our core contribution is to stress-test the learnability of this behavior. We demonstrate that finetuning can create Neural Chameleons: models capable of zero-shot evading activation monitors. Specifically, we fine-tune an LLM to evade monitors for a set of benign concepts (e.g., languages, HTML) when conditioned on a trigger of the form: “You are being probed for concept”. We show that this learned mechanism generalizes zero-shot: by substituting concept with a safety-relevant term like ‘deception’, the model successfully evades previously unseen safety monitors. We validate this phenomenon across diverse model families (Llama, Gemma, Qwen), showing that the evasion succeeds even against monitors trained post hoc on the model’s frozen weights. This evasion is highly selective, targeting only the specific concept mentioned in the trigger, and having a modest impact on model capabilities on standard benchmarks. Using Gemma-2-9b-it as a case study, a mechanistic analysis reveals this is achieved via a targeted manipulation that moves activations into a low-dimensional subspace. While stronger defenses like monitor ensembles and non-linear classifiers show greater resilience, the model retains a non-trivial evasion capability. Our work provides a proof-of-concept for this failure mode and a tool to evaluate the worst-case robustness of monitoring techniques against misalignment threat models.

[LG-111] he Art of Storytelling in Authoritarian Regimes: Crafting State Narratives on Chinese Social Media

链接: https://arxiv.org/abs/2512.11875
作者: Ting Luo,Yan Wang
类目: Computers and Society (cs.CY); Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:This article examines how authoritarian regimes construct state narratives about politically consequential events. Building on the narrative policy framework and existing research on authoritarian propaganda, we propose two dimensions that shape narrative construction: legitimacy implications – whether events enhance or threaten regime legitimacy, and citizen verification capacity – the extent to which citizens can evaluate official narratives through alternative sources. Using quantitative narrative analysis of Chinese social media posts by government, state media, and celebrity accounts, we extract subject-verb-object (SVO) triplets to map dominant narrative structures across four major events. Our findings show that legitimacy implications of the event shape regime’s efforts in storytelling and the beliefs highlighted in the narratives, while citizen’s verification capacity could balance the strategic choice between a top-down manipulation and bottom-up responsiveness of state narratives. Together, the results reveal propaganda as a complex process of narrative construction adaptive to specific contexts, offering new insights into how dynamic storytelling sustains authoritarian resilience.

[LG-112] Phase transitions reveal hierarchical structure in deep neural networks

链接: https://arxiv.org/abs/2512.11866
作者: Ibrahim Talha Ersoy,Andrés Fernando Cardozo Licha,Karoline Wiesner
类目: Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
*备注: 15 pages, 5 figures

点击查看摘要

Abstract:Training Deep Neural Networks relies on the model converging on a high-dimensional, non-convex loss landscape toward a good minimum. Yet, much of the phenomenology of training remains ill understood. We focus on three seemingly disparate observations: the occurrence of phase transitions reminiscent of statistical physics, the ubiquity of saddle points, and phenomenon of mode connectivity relevant for model merging. We unify these within a single explanatory framework, the geometry of the loss and error landscapes. We analytically show that phase transitions in DNN learning are governed by saddle points in the loss landscape. Building on this insight, we introduce a simple, fast, and easy to implement algorithm that uses the L2 regularizer as a tool to probe the geometry of error landscapes. We apply it to confirm mode connectivity in DNNs trained on the MNIST dataset by efficiently finding paths that connect global minima. We then show numerically that saddle points induce transitions between models that encode distinct digit classes. Our work establishes the geometric origin of key training phenomena in DNNs and reveals a hierarchy of accuracy basins analogous to phases in statistical physics.

[LG-113] Evolving Deep Learning Optimizers

链接: https://arxiv.org/abs/2512.11853
作者: Mitchell Marfinetz
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a genetic algorithm framework for automatically discovering deep learning optimization algorithms. Our approach encodes optimizers as genomes that specify combinations of primitive update terms (gradient, momentum, RMS normalization, Adam-style adaptive terms, and sign-based updates) along with hyperparameters and scheduling options. Through evolutionary search over 50 generations with a population of 50 individuals, evaluated across multiple vision tasks, we discover an evolved optimizer that outperforms Adam by 2.6% in aggregate fitness and achieves a 7.7% relative improvement on CIFAR-10. The evolved optimizer combines sign-based gradient terms with adaptive moment estimation, uses lower momentum coefficients than Adam ( \beta_1 =0.86, \beta_2 =0.94), and notably disables bias correction while enabling learning rate warmup and cosine decay. Our results demonstrate that evolutionary search can discover competitive optimization algorithms and reveal design principles that differ from hand-crafted optimizers. Code is available at this https URL.

[LG-114] ny Recursive Models on ARC-AGI-1: Inductive Biases Identity Conditioning and Test-Time Compute

链接: https://arxiv.org/abs/2512.11847
作者: Antonio Roye-Azar,Santiago Vargas-Naranjo,Dhruv Ghai,Nithin Balamurugan,Rayan Amir
类目: Machine Learning (cs.LG)
*备注: 13 pages, 0 figures, 6 tables

点击查看摘要

Abstract:Tiny Recursive Models (TRM) were proposed as a parameter-efficient alternative to large language models for solving Abstraction and Reasoning Corpus (ARC) style tasks. The original work reports strong performance and suggests that recursive latent updates enable non-trivial reasoning, but it remains unclear how much of this performance stems from architecture, test-time compute, or task-specific priors. In this technical note, we empirically analyze the ARC Prize TRM checkpoint on ARC-AGI-1 and report four behavioral findings and an efficiency comparison. First, we show that test-time augmentation and majority-vote ensembling account for a substantial fraction of reported performance: the 1000-sample voting pipeline improves Pass@1 by about 11 percentage points over single-pass canonical inference. Second, a puzzle-identity ablation reveals strict dependence on task identifiers: replacing the correct puzzle ID with a blank or random token yields zero accuracy. Third, a recursion trajectory analysis shows that most of the final accuracy is achieved at the first recursion step and that performance saturates after few latent updates, indicating shallow effective recursion. Fourth, early-stage training experiments under canonical versus heavy augmentation regimes suggest that heavy augmentation broadens the distribution of candidate solutions and improves multi-sample success. Finally, we compare TRM with a naive QLoRA fine-tune of Llama 3 8B on canonical ARC-AGI-1, finding that TRM’s non-autoregressive design achieves much higher throughput and substantially lower memory usage in this setting. Overall, TRM’s ARC-AGI-1 performance appears to arise from an interaction between efficiency, task-specific conditioning, and aggressive test-time compute rather than deep internal reasoning.

[LG-115] Exploring Topological Bias in Heterogeneous Graph Neural Networks

链接: https://arxiv.org/abs/2512.11846
作者: Yihan Zhang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) are characterized by their capacity of processing graph-structured data. However, due to the sparsity of labels under semi-supervised learning, they have been found to exhibit biased performance on specific nodes. This kind of bias has been validated to correlate with topological structure and is considered as a bottleneck of GNNs’ performance. Existing work focuses on the study of homogeneous GNNs and little attention has been given to topological bias in Heterogeneous Graph Neural Networks (HGNNs). In this work, firstly, in order to distinguish distinct meta relations, we apply meta-weighting to the adjacency matrix of a heterogeneous graph. Based on the modified adjacency matrix, we leverage PageRank along with the node label information to construct a projection. The constructed projection effectively maps nodes to values that strongly correlated with model performance when using datasets both with and without intra-type connections, which demonstrates the universal existence of topological bias in HGNNs. To handle this bias, we propose a debiasing structure based on the difference in the mapped values of nodes and use it along with the original graph structure for contrastive learning. Experiments on three public datasets verify the effectiveness of the proposed method in improving HGNNs’ performance and debiasing.

[LG-116] Meta-Continual Mobility Forecasting for Proactive Handover Prediction

链接: https://arxiv.org/abs/2512.11841
作者: Sasi Vardhan Reddy Mandapati
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注: 6 pages, 1 figure

点击查看摘要

Abstract:Short-term mobility forecasting is a core requirement for proactive handover (HO) in cellular networks. Real-world mobility is highly non-stationary: abrupt turns, rapid speed changes, and unpredictable user behavior cause conventional predictors to drift, leading to mistimed or failed handovers. We propose a lightweight meta-continual forecasting framework that integrates a GRU-based predictor, Reptile meta-initialization for fast few-shot adaptation, and an EWMA residual detector that triggers compact online updates only when drift occurs. Evaluated on a reproducible GeoLife and DeepMIMO pipeline, our method achieves 4.46 m ADE and 7.79 m FDE in zero-shot settings, improves few-shot ADE to 3.71 m at 10-shot, and enables recovery from abrupt drift about 2 to 3 times faster than an offline GRU. When applied to downstream HO prediction, the approach improves F1 to 0.83 and AUROC to 0.90, with substantial reductions in missed-HO and ping-pong events. The model is lightweight (128k parameters) and suitable for edge deployment in 5G and 6G systems.

[LG-117] Amortized Causal Discovery with Prior-Fitted Networks

链接: https://arxiv.org/abs/2512.11840
作者: Mateusz Sypniewski,Mateusz Olko,Mateusz Gajewski,Piotr Miłoś
类目: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In recent years, differentiable penalized likelihood methods have gained popularity, optimizing the causal structure by maximizing its likelihood with respect to the data. However, recent research has shown that errors in likelihood estimation, even on relatively large sample sizes, disallow the discovery of proper structures. We propose a new approach to amortized causal discovery that addresses the limitations of likelihood estimator accuracy. Our method leverages Prior-Fitted Networks (PFNs) to amortize data-dependent likelihood estimation, yielding more reliable scores for structure learning. Experiments on synthetic, simulated, and real-world datasets show significant gains in structure recovery compared to standard baselines. Furthermore, we demonstrate directly that PFNs provide more accurate likelihood estimates than conventional neural network-based approaches.

[LG-118] Large Language Models as Generalist Policies for Network Optimization

链接: https://arxiv.org/abs/2512.11839
作者: Duo Wu,Linjia Kang,Zhimin Wang,Fangxin Wang,Wei Zhang,Xuefeng Tao,Wei Yang,Le Zhang,Peng Cui,Zhi Wang
类目: Machine Learning (cs.LG)
*备注: In submission. Project homepage: this https URL

点击查看摘要

Abstract:Designing control policies to ensure robust network services is essential to modern digital infrastructure. However, the dominant paradigm for network optimization relies on designing specialist policies based on handcrafted rules or deep learning models, leading to poor generalization across diverse tasks and environments. In contrast, large language models (LLMs), pretrained on Internet-scale corpora, provide a rich and unified knowledge base that encodes fundamental networking principles. Combined with their emergent abilities in generalization to unseen scenarios, LLMs offer a transformative foundation for generalist network policies that can generalize across diverse tasks and environments with minimal adaptation. In this paper, we present Trailblazer, the first systematic framework to realize such a generalist policy for networking. Trailblazer incorporates a network alignment scheme to ground the LLM in specific networking tasks, and an adaptive policy collaboration mechanism that offloads simple control cases from the LLM to a lightweight policy for computational efficiency. Through extensive simulations and large-scale real-world online evaluation on Douyin (the Chinese version of TikTok), Trailblazer, powered by a single LLM, demonstrates stronger cross-task and cross-environment generalization than conventional specialist policies. Our results validate LLMs as the foundation for generalist network policies, and position Trailblazer as the first step toward the generalist-driven paradigm that enables strong generalization with minimal efforts in policy design.

[LG-119] D-STEER - Preference Alignment Techniques Learn to Behave not to Believe – Beneath the Surface DPO as Steering Vector Perturbation in Activation Space

链接: https://arxiv.org/abs/2512.11838
作者: Samarth Raina,Saksham Aggarwal,Aman Chadha,Vinija Jain,Amitava Das
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Direct Preference Optimization (DPO) has become a standard recipe for aligning large language models, yet it is still unclear what kind of change it actually induces inside the network. This paper argues that DPO does not rewrite a models internal beliefs; instead, it acts as a low rank steering mechanism that nudges activations along a small number of preference directions. Using a simple derivation, we show that the DPO gradient depends only on the difference between the logit embeddings of preferred and dispreferred completions, implying a first order shift in the final hidden representation rather than a deep restructuring of semantics. We then extract an empirical steering vector from a DPO tuned model and demonstrate that adding this vector to base activations reproduces most of the aligned behavior, while subtracting it nearly restores the original model. Finally, spectral analyses reveal rank-one dominance and entropy collapse in upper layers, indicating that alignment is funneled through a narrow subspace. Taken together, these results support a behavioral illusion view of DPO: it teaches models how to act aligned, not what to believe.

[LG-120] Hybrid twinning using PBDW and DeepONet for the effective state estimation and prediction on partially known systems

链接: https://arxiv.org/abs/2512.11834
作者: Stiven Briand Massala,Ludovic Chamoin,Massimo Picca Ciamarra
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The accurate estimation of the state of complex uncertain physical systems requires reconciling theoretical models, with inherent imperfections, with noisy experimental data. In this work, we propose an effective hybrid approach that combines physics-based modeling with data-driven learning to enhance state estimation and further prediction. Our method builds upon the Parameterized Background Data-Weak (PBDW) framework, which naturally integrates a reduced-order representation of the best-available model with measurement data to account for both anticipated and unanticipated uncertainties. To address model discrepancies not captured by the reduced-order space, and learn the structure of model deviation, we incorporate a Deep Operator Network (DeepONet) constrained to be an orthogonal complement of the best-knowledge manifold. This ensures that the learned correction targets only the unknown components of model bias, preserving the interpretability and fidelity of the physical model. An optimal sensor placement strategy is also investigated to maximize information gained from measurements. We validate the proposed approach on a representative problem involving the Helmholtz equation under various sources of modeling error, including those arising from boundary conditions and source terms.

[LG-121] A Neural-preconditioned Poisson Solver for Mixed Dirichlet and Neumann Boundary Conditions

链接: https://arxiv.org/abs/2310.00177
作者: Kai Weixian Lan,Elias Gueidon,Ayano Kaneda,Julian Panetta,Joseph Teran
类目: Numerical Analysis (math.NA); Graphics (cs.GR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce a neural-preconditioned iterative solver for Poisson equations with mixed boundary conditions. Typical Poisson discretizations yield large, ill-conditioned linear systems. Iterative solvers can be effective for these problems, but only when equipped with powerful preconditioners. Unfortunately, effective preconditioners like multigrid require costly setup phases that must be re-executed every time domain shapes or boundary conditions change, forming a severe bottleneck for problems with evolving boundaries. In contrast, we present a neural preconditioner trained to efficiently approximate the inverse of the discrete Laplacian in the presence of such changes. Our approach generalizes to domain shapes, boundary conditions, and grid sizes outside the training set. The key to our preconditioner’s success is a novel, lightweight neural network architecture featuring spatially varying convolution kernels and supporting fast inference. We demonstrate that our solver outperforms state-of-the-art methods like algebraic multigrid as well as recently proposed neural preconditioners on challenging test cases arising from incompressible fluid simulations.

[LG-122] Universality of high-dimensional scaling limits of stochastic gradient descent

链接: https://arxiv.org/abs/2512.13634
作者: Reza Gheissari,Aukosh Jagannath
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST)
*备注: 30 pages

点击查看摘要

Abstract:We consider statistical tasks in high dimensions whose loss depends on the data only through its projection into a fixed-dimensional subspace spanned by the parameter vectors and certain ground truth vectors. This includes classifying mixture distributions with cross-entropy loss with one and two-layer networks, and learning single and multi-index models with one and two-layer networks. When the data is drawn from an isotropic Gaussian mixture distribution, it is known that the evolution of a finite family of summary statistics under stochastic gradient descent converges to an autonomous ordinary differential equation (ODE), as the dimension and sample size go to \infty and the step size goes to 0 commensurately. Our main result is that these ODE limits are universal in that this convergence occurs even when the data is drawn from mixtures of product measures provided the first two moments match the corresponding Gaussian distribution and the initialization and ground truth vectors are sufficiently coordinate-delocalized. We complement this by proving two corresponding non-universality results. We provide a simple example where the ODE limits are non-universal if the initialization is coordinate aligned. We also show that the stochastic differential equation limits arising as fluctuations of the summary statistics around their ODE’s fixed points are not universal.

[LG-123] A Nonparametric Statistics Approach to Feature Selection in Deep Neural Networks with Theoretical Guarantees

链接: https://arxiv.org/abs/2512.13565
作者: Junye Du,Zhenghao Li,Zhutong Gu,Long Feng
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 64 pages

点击查看摘要

Abstract:This paper tackles the problem of feature selection in a highly challenging setting: \mathbbE(y | \boldsymbolx) = G(\boldsymbolx_\mathcalS_0) , where \mathcalS_0 is the set of relevant features and G is an unknown, potentially nonlinear function subject to mild smoothness conditions. Our approach begins with feature selection in deep neural networks, then generalizes the results to Hölder smooth functions by exploiting the strong approximation capabilities of neural networks. Unlike conventional optimization-based deep learning methods, we reformulate neural networks as index models and estimate \mathcalS_0 using the second-order Stein’s formula. This gradient-descent-free strategy guarantees feature selection consistency with a sample size requirement of n = \Omega(p^2) , where p is the feature dimension. To handle high-dimensional scenarios, we further introduce a screening-and-selection mechanism that achieves nonlinear selection consistency when n = \Omega(s \log p) , with s representing the sparsity level. Additionally, we refit a neural network on the selected features for prediction and establish performance guarantees under a relaxed sparsity assumption. Extensive simulations and real-data analyses demonstrate the strong performance of our method even in the presence of complex feature interactions.

[LG-124] Adaptive Sampling for Hydrodynamic Stability

链接: https://arxiv.org/abs/2512.13532
作者: Anshima Singh,David J. Silvester
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:An adaptive sampling approach for efficient detection of bifurcation boundaries in parametrized fluid flow problems is presented herein. The study extends the machine-learning approach of Silvester (Machine Learning for Hydrodynamic Stability, arXiv:2407.09572), where a classifier network was trained on preselected simulation data to identify bifurcated and nonbifurcated flow regimes. In contrast, the proposed methodology introduces adaptivity through a flow-based deep generative model that automatically refines the sampling of the parameter space. The strategy has two components: a classifier network maps the flow parameters to a bifurcation probability, and a probability density estimation technique (KRnet) for the generation of new samples at each adaptive step. The classifier output provides a probabilistic measure of flow stability, and the Shannon entropy of these predictions is employed as an uncertainty indicator. KRnet is trained to approximate a probability density function that concentrates sampling in regions of high entropy, thereby directing computational effort towards the evolving bifurcation boundary. This coupling between classification and generative modeling establishes a feedback-driven adaptive learning process analogous to error-indicator based refinement in contemporary partial differential equation solution strategies. Starting from a uniform parameter distribution, the new approach achieves accurate bifurcation boundary identification with significantly fewer Navier–Stokes simulations, providing a scalable foundation for high-dimensional stability analysis.

[LG-125] Actively Learning Joint Contours of Multiple Computer Experiments

链接: https://arxiv.org/abs/2512.13530
作者: Shih-Ni Prim,Kevin R. Quinlan,Paul Hawkins,Jagadeesh Movva,Annie S. Booth
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Contour location \unicodex2014 the process of sequentially training a surrogate model to identify the design inputs that result in a pre-specified response value from a single computer experiment \unicodex2014 is a well-studied active learning problem. Here, we tackle a related but distinct problem: identifying the input configuration that returns pre-specified values of multiple independent computer experiments simultaneously. Motivated by computer experiments of the rotational torques acting upon a vehicle in flight, we aim to identify stable flight conditions which result in zero torque forces. We propose a “joint contour location” (jCL) scheme that strikes a strategic balance between exploring the multiple response surfaces while exploiting learning of the intersecting contours. We employ both shallow and deep Gaussian process surrogates, but our jCL procedure is applicable to any surrogate that can provide posterior predictive distributions. Our jCL designs significantly outperform existing (single response) CL strategies, enabling us to efficiently locate the joint contour of our motivating computer experiments.

[LG-126] Enhancing lithological interpretation from petrophysical well log of IODP expedition 390/393 using machine learning

链接: https://arxiv.org/abs/2512.13529
作者: Raj Sahu,Saumen Maiti
类目: Geophysics (physics.geo-ph); Machine Learning (cs.LG)
*备注: Accepted for presentation at the International Meeting for Applied Geoscience Energy (IMAGE) 2025

点击查看摘要

Abstract:Enhanced lithological interpretation from well logs plays a key role in geological resource exploration and mapping, as well as in geo-environmental modeling studies. Core and cutting information is useful for making sound interpretations of well logs; however, these are rarely collected at each depth due to high costs. Moreover, well log interpretation using traditional methods is constrained by poor borehole conditions. Traditional statistical methods are mostly linear, often failing to discriminate between lithology and rock facies, particularly when dealing with overlapping well log signals characterized by the structural and compositional variation of rock types. In this study, we develop multiple supervised and unsupervised machine learning algorithms to jointly analyze multivariate well log data from Integrated Ocean Drilling Program (IODP) expeditions 390 and 393 for enhanced lithological interpretations. Among the algorithms, Logistic Regression, Decision Trees, Gradient Boosting, Support Vector Machines (SVM), k-Nearest Neighbors (KNN), and Multi-Layer Perceptron (MLP) neural network models, the Decision Tree and Gradient Boosting models outperformed the others, achieving an accuracy of 0.9950 and an F1-score of 0.9951. While unsupervised machine learning (ML) provides the foundation for cluster information that inherently supports the classification algorithm, supervised ML is applied to devise a data-driven lithology clustering mechanism for IODP datasets. The joint ML-based method developed here has the potential to be further explored for analyzing other well log datasets from the world’s oceans.

[LG-127] A Deep Learning Model of Mental Rotation Informed by Interactive VR Experiments

链接: https://arxiv.org/abs/2512.13517
作者: Raymond Khazoum,Daniela Fernandes,Aleksandr Krylov,Qin Li,Stephane Deny
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Mental rotation – the ability to compare objects seen from different viewpoints – is a fundamental example of mental simulation and spatial world modelling in humans. Here we propose a mechanistic model of human mental rotation, leveraging advances in deep, equivariant, and neuro-symbolic learning. Our model consists of three stacked components: (1) an equivariant neural encoder, taking images as input and producing 3D spatial representations of objects, (2) a neuro-symbolic object encoder, deriving symbolic descriptions of objects from these spatial representations, and (3) a neural decision agent, comparing these symbolic descriptions to prescribe rotation simulations in 3D latent space via a recurrent pathway. Our model design is guided by the abundant experimental literature on mental rotation, which we complemented with experiments in VR where participants could at times manipulate the objects to compare, providing us with additional insights into the cognitive process of mental rotation. Our model captures well the performance, response times and behavior of participants in our and others’ experiments. The necessity of each model component is shown through systematic ablations. Our work adds to a recent collection of deep neural models of human spatial reasoning, further demonstrating the potency of integrating deep, equivariant, and symbolic representations to model the human mind.

[LG-128] Better LMO-based Momentum Methods with Second-Order Information

链接: https://arxiv.org/abs/2512.13227
作者: Sarit Khirirat,Abdurakhmon Sadiev,Yury Demidovich,Peter Richtárik
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The use of momentum in stochastic optimization algorithms has shown empirical success across a range of machine learning tasks. Recently, a new class of stochastic momentum algorithms has emerged within the Linear Minimization Oracle (LMO) framework–leading to state-of-the-art methods, such as Muon, Scion, and Gluon, that effectively solve deep neural network training problems. However, traditional stochastic momentum methods offer convergence guarantees no better than the O(1/K^1/4) rate. While several approaches–such as Hessian-Corrected Momentum (HCM)–have aimed to improve this rate, their theoretical results are generally restricted to the Euclidean norm setting. This limitation hinders their applicability in problems, where arbitrary norms are often required. In this paper, we extend the LMO-based framework by integrating HCM, and provide convergence guarantees under relaxed smoothness and arbitrary norm settings. We establish improved convergence rates of O(1/K^1/3) for HCM, which can adapt to the geometry of the problem and achieve a faster rate than traditional momentum. Experimental results on training Multi-Layer Perceptrons (MLPs) and Long Short-Term Memory (LSTM) networks verify our theoretical observations.

[LG-129] Rethinking Physics-Informed Regression Beyond Training Loops and Bespoke Architectures

链接: https://arxiv.org/abs/2512.13217
作者: Lorenzo Sabug Jr.,Eric Kerrigan
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We revisit the problem of physics-informed regression, and propose a method that directly computes the state at the prediction point, simultaneously with the derivative and curvature information of the existing samples. We frame each prediction as a constrained optimisation problem, leveraging multivariate Taylor series expansions and explicitly enforcing physical laws. Each individual query can be processed with low computational cost without any pre- or re-training, in contrast to global function approximator-based solutions such as neural networks. Our comparative benchmarks on a reaction-diffusion system show competitive predictive accuracy relative to a neural network-based solution, while completely eliminating the need for long training loops, and remaining robust to changes in the sampling layout.

[LG-130] MicroPhaseNO: Adapting an Earthquake-Trained Phase Neural Operator for Microseismic Phase Picking

链接: https://arxiv.org/abs/2512.13197
作者: Ayrat Abdullin,Umair bin Waheed,Leo Eisner,Naveed Iqbal
类目: Geophysics (physics.geo-ph); Machine Learning (cs.LG)
*备注: Submitted to Pure and Applied Geophysics

点击查看摘要

Abstract:Seismic phase picking is very often used for microseismic monitoring and subsurface imaging. Traditional manual processing is not feasible for either real-time applications or large arrays. Deep learning-based pickers trained on large earthquake catalogs offer an automated alternative. However, they are typically optimized for high signal-to-noise, long-duration networks and struggle with the challenges presented by microseismic datasets, which are purpose-built for limited time without previously detected seismicity. In this study, we demonstrate how a network-wide earthquake phase picker, the Phase Neural Operator (PhaseNO), can be adapted to microseismic monitoring using transfer learning. Starting from a PhaseNO model pre-trained on more than 57,000 three-component earthquake and noise records, we fine-tune the model using only 200 labeled and noise seismograms from induced events in hydraulic-fracturing settings. The fine-tuned model thus preserves the rich spatio-temporal representation learned from abundant earthquake data, while adapting to the characteristics and labeling conventions of microseismic phases, which are often picked on peaks or troughs rather than onsets. We evaluate performance on three distinct real-world microseismic datasets with different network geometries and acquisition parameters. Compared to the original PhaseNO and a conventional workflow, the adapted model increases F1 score and accuracy by up to 30%, and strongly reduces systematic timing bias and pick uncertainty. Because the adaptation relies on a small, campaign-specific calibration set, the approach is readily transferable to other microseismic tasks where public earthquake data and pre-trained models are accessible. The associated code will be released openly at this https URL.

[LG-131] Stopping Rules for Stochastic Gradient Descent via Anytime-Valid Confidence Sequences

链接: https://arxiv.org/abs/2512.13123
作者: Liviu Aolaritei,Michael I. Jordan
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We study stopping rules for stochastic gradient descent (SGD) for convex optimization from the perspective of anytime-valid confidence sequences. Classical analyses of SGD provide convergence guarantees in expectation or at a fixed horizon, but offer no statistically valid way to assess, at an arbitrary time, how close the current iterate is to the optimum. We develop an anytime-valid, data-dependent upper confidence sequence for the weighted average suboptimality of projected SGD, constructed via nonnegative supermartingales and requiring no smoothness or strong convexity. This confidence sequence yields a simple stopping rule that is provably \varepsilon -optimal with probability at least 1-\alpha and is almost surely finite under standard stochastic approximation stepsizes. To the best of our knowledge, these are the first rigorous, time-uniform performance guarantees and finite-time \varepsilon -optimality certificates for projected SGD with general convex objectives, based solely on observable trajectory quantities.

[LG-132] General OOD Detection via Model-aware and Subspace-aware Variable Priority

链接: https://arxiv.org/abs/2512.13003
作者: Min Lu,Hemant Ishwaran
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 29 pages, 11 figures

点击查看摘要

Abstract:Out-of-distribution (OOD) detection is essential for determining when a supervised model encounters inputs that differ meaningfully from its training distribution. While widely studied in classification, OOD detection for regression and survival analysis remains limited due to the absence of discrete labels and the challenge of quantifying predictive uncertainty. We introduce a framework for OOD detection that is simultaneously model aware and subspace aware, and that embeds variable prioritization directly into the detection step. The method uses the fitted predictor to construct localized neighborhoods around each test case that emphasize the features driving the model’s learned relationship and downweight directions that are less relevant to prediction. It produces OOD scores without relying on global distance metrics or estimating the full feature density. The framework is applicable across outcome types, and in our implementation we use random forests, where the rule structure yields transparent neighborhoods and effective scoring. Experiments on synthetic and real data benchmarks designed to isolate functional shifts show consistent improvements over existing methods. We further demonstrate the approach in an esophageal cancer survival study, where distribution shifts related to lymphadenectomy identify patterns relevant to surgical guidelines.

[LG-133] Evaluating Singular Value Thresholds for DNN Weight Matrices based on Random Matrix Theory

链接: https://arxiv.org/abs/2512.12911
作者: Kohei Nishikawa,Koki Shimizu,Hashiguchi Hiroki
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This study evaluates thresholds for removing singular values from singular value decomposition-based low-rank approximations of deep neural network weight matrices. Each weight matrix is modeled as the sum of signal and noise matrices. The low-rank approximation is obtained by removing noise-related singular values using a threshold based on random matrix theory. To assess the adequacy of this threshold, we propose an evaluation metric based on the cosine similarity between the singular vectors of the signal and original weight matrices. The proposed metric is used in numerical experiments to compare two threshold estimation methods.

[LG-134] PAC-Bayes Bounds for Multivariate Linear Regression and Linear Autoencoders NEURIPS2025

链接: https://arxiv.org/abs/2512.12905
作者: Ruixin Guo,Ruoming Jin,Xinyu Li,Yang Zhou
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Accepted at NeurIPS 2025 ( this https URL )

点击查看摘要

Abstract:Linear Autoencoders (LAEs) have shown strong performance in state-of-the-art recommender systems. However, this success remains largely empirical, with limited theoretical understanding. In this paper, we investigate the generalizability – a theoretical measure of model performance in statistical learning – of multivariate linear regression and LAEs. We first propose a PAC-Bayes bound for multivariate linear regression, extending the earlier bound for single-output linear regression by Shalaeva et al., and establish sufficient conditions for its convergence. We then show that LAEs, when evaluated under a relaxed mean squared error, can be interpreted as constrained multivariate linear regression models on bounded data, to which our bound adapts. Furthermore, we develop theoretical methods to improve the computational efficiency of optimizing the LAE bound, enabling its practical evaluation on large models and real-world datasets. Experimental results demonstrate that our bound is tight and correlates well with practical ranking metrics such as Recall@K and NDCG@K.

[LG-135] Flow-matching Operators for Residual-Augmented Probabilistic Learning of Partial Differential Equations

链接: https://arxiv.org/abs/2512.12749
作者: Sahil Bhola,Karthik Duraisamy
类目: Computation (stat.CO); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Learning probabilistic surrogates for PDEs remains challenging in data-scarce regimes: neural operators require large amounts of high-fidelity data, while generative approaches typically sacrifice resolution invariance. We formulate flow matching in an infinite-dimensional function space to learn a probabilistic transport that maps low-fidelity approximations to the manifold of high-fidelity PDE solutions via learned residual corrections. We develop a conditional neural operator architecture based on feature-wise linear modulation for flow-matching vector fields directly in function space, enabling inference at arbitrary spatial resolutions without retraining. To improve stability and representational control of the induced neural ODE, we parameterize the flow vector field as a sum of a linear operator and a nonlinear operator, combining lightweight linear components with a conditioned Fourier neural operator for expressive, input-dependent dynamics. We then formulate a residual-augmented learning strategy where the flow model learns probabilistic corrections from inexpensive low-fidelity surrogates to high-fidelity solutions, rather than learning the full solution mapping from scratch. Finally, we derive tractable training objectives that extend conditional flow matching to the operator setting with input-function-dependent couplings. To demonstrate the effectiveness of our approach, we present numerical experiments on a range of PDEs, including the 1D advection and Burgers’ equation, and a 2D Darcy flow problem for flow through a porous medium. We show that the proposed method can accurately learn solution operators across different resolutions and fidelities and produces uncertainty estimates that appropriately reflect model confidence, even when trained on limited high-fidelity data. Subjects: Computation (stat.CO); Machine Learning (cs.LG); Machine Learning (stat.ML) Cite as: arXiv:2512.12749 [stat.CO] (or arXiv:2512.12749v1 [stat.CO] for this version) https://doi.org/10.48550/arXiv.2512.12749 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-136] ransport Reversible Jump Markov Chain Monte Carlo with proposals generated by Variational Inference with Normalizing Flows

链接: https://arxiv.org/abs/2512.12742
作者: Pingping Yin,Xiyun Jiao
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a framework using variational inference with normalizing flows (VI-NFs) to generate proposals of reversible jump Markov chain Monte Carlo (RJMCMC) for efficient trans-dimensional Bayesian inference. Unlike transport reversible jump methods relying on forward KL minimization with pilot MCMC samples, our approach minimizes the reverse KL divergence which requires only samples from a base distribution, eliminating costly target sampling. The method employs RealNVP-based flows to learn model-specific transport maps, enabling construction of both between-model and within-model proposals. Our framework provides accurate marginal likelihood estimates from the variational approximation. This facilitates efficient model comparison and proposal adaptation in RJMCMC. Experiments on illustrative example, factor analysis and variable selection tasks in linear regression show that TRJ designed by VI-NFs achieves faster mixing and more efficient model space exploration compared to existing baselines. The proposed algorithm can be extended to conditional flows for amortized vairiational inference across models. Code is available at this https URL.

[LG-137] Limits To (Machine) Learning

链接: https://arxiv.org/abs/2512.12735
作者: Zhimin Chen,Bryan Kelly,Semyon Malamud
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine learning (ML) methods are highly flexible, but their ability to approximate the true data-generating process is fundamentally constrained by finite samples. We characterize a universal lower bound, the Limits-to-Learning Gap (LLG), quantifying the unavoidable discrepancy between a model’s empirical fit and the population benchmark. Recovering the true population R^2 , therefore, requires correcting observed predictive performance by this bound. Using a broad set of variables, including excess returns, yields, credit spreads, and valuation ratios, we find that the implied LLGs are large. This indicates that standard ML approaches can substantially understate true predictability in financial data. We also derive LLG-based refinements to the classic Hansen and Jagannathan (1991) bounds, analyze implications for parameter learning in general-equilibrium settings, and show that the LLG provides a natural mechanism for generating excess volatility.

[LG-138] Practical Hybrid Quantum Language Models with Observable Readout on Real Hardware

链接: https://arxiv.org/abs/2512.12710
作者: Stefan Balauca,Ada-Astrid Balauca,Adrian Iftene
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Hybrid quantum-classical models represent a crucial step toward leveraging near-term quantum devices for sequential data processing. We present Quantum Recurrent Neural Networks (QRNNs) and Quantum Convolutional Neural Networks (QCNNs) as hybrid quantum language models, reporting the first empirical demonstration of generative language modeling trained and evaluated end-to-end on real quantum hardware. Our architecture combines hardware-optimized parametric quantum circuits with a lightweight classical projection layer, utilizing a multi-sample SPSA strategy to efficiently train quantum parameters despite hardware noise. To characterize the capabilities of these models, we introduce a synthetic dataset designed to isolate syntactic dependencies in a controlled, low-resource environment. Experiments on IBM Quantum processors reveal the critical trade-offs between circuit depth and trainability, demonstrating that while noise remains a significant factor, observable-based readout enables the successful learning of sequential patterns on NISQ devices. These results establish a rigorous engineering baseline for generative quantum natural language processing, validating the feasibility of training complex sequence models on current quantum hardware.

[LG-139] Robust Variational Bayes by Min-Max Median Aggregation

链接: https://arxiv.org/abs/2512.12676
作者: Jiawei Yan,Ju Liu,Weidong Liu,Jiyuan Tu
类目: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 34 pages, 11 figures

点击查看摘要

Abstract:We propose a robust and scalable variational Bayes (VB) framework designed to effectively handle contamination and outliers in dataset. Our approach partitions the data into m disjoint subsets and formulates a joint optimization problem based on robust aggregation principles. A key insight is that the full posterior distribution is equivalent to the minimizer of the mean Kullback-Leibler (KL) divergence from the m -powered local posterior distributions. To enhance robustness, we replace the mean KL divergence with a min-max median formulation. The min-max formulation not only ensures consistency between the KL minimizer and the Evidence Lower Bound (ELBO) maximizer but also facilitates the establishment of improved statistical rates for the mean of variational posterior. We observe a notable discrepancy in the m -powered marginal log likelihood function contingent on the presence of local latent variables. To address this, we treat these two scenarios separately to guarantee the consistency of the aggregated variational posterior. Specifically, when local latent variables are present, we introduce an aggregate-and-rescale strategy. Theoretically, we provide a non-asymptotic analysis of our proposed posterior, incorporating a refined analysis of Bernstein-von Mises (BvM) theorem to accommodate a diverging number of subsets m . Our findings indicate that the two-stage approach yields a smaller approximation error compared to directly aggregating the m -powered local posteriors. Furthermore, we establish a nearly optimal statistical rate for the mean of the proposed posterior, advancing existing theories related to min-max median estimators. The efficacy of our method is demonstrated through extensive simulation studies.

[LG-140] Scalable Quantum Error Mitigation with Neighbor-Informed Learning

链接: https://arxiv.org/abs/2512.12578
作者: Zhenyu Chen,Bin Cheng,Minbo Gao,Xiaodie Lin,Ruiqi Zhang,Zhaohui Wei,Zhengfeng Ji
类目: Quantum Physics (quant-ph); Computational Complexity (cs.CC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Noise in quantum hardware is the primary obstacle to realizing the transformative potential of quantum computing. Quantum error mitigation (QEM) offers a promising pathway to enhance computational accuracy on near-term devices, yet existing methods face a difficult trade-off between performance, resource overhead, and theoretical guarantees. In this work, we introduce neighbor-informed learning (NIL), a versatile and scalable QEM framework that unifies and strengthens existing methods such as zero-noise extrapolation (ZNE) and probabilistic error cancellation (PEC), while offering improved flexibility, accuracy, efficiency, and robustness. NIL learns to predict the ideal output of a target quantum circuit from the noisy outputs of its structurally related ``neighbor’’ circuits. A key innovation is our 2-design training method, which generates training data for our machine learning model. In contrast to conventional learning-based QEM protocols that create training circuits by replacing non-Clifford gates with uniformly random Clifford gates, our approach achieves higher accuracy and efficiency, as demonstrated by both theoretical analysis and numerical simulation. Furthermore, we prove that the required size of the training set scales only \emphlogarithmically with the total number of neighbor circuits, enabling NIL to be applied to problems involving large-scale quantum circuits. Our work establishes a theoretically grounded and practically efficient framework for QEM, paving a viable path toward achieving quantum advantage on noisy hardware. Subjects: Quantum Physics (quant-ph); Computational Complexity (cs.CC); Machine Learning (cs.LG) Cite as: arXiv:2512.12578 [quant-ph] (or arXiv:2512.12578v1 [quant-ph] for this version) https://doi.org/10.48550/arXiv.2512.12578 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-141] Mind the Jumps: A Scalable Robust Local Gaussian Process for Multidimensional Response Surfaces with Discontinuities

链接: https://arxiv.org/abs/2512.12574
作者: Isaac Adjetey,Yiyuan She
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
*备注: 25 pages, 4 figures

点击查看摘要

Abstract:Modeling response surfaces with abrupt jumps and discontinuities remains a major challenge across scientific and engineering domains. Although Gaussian process models excel at capturing smooth nonlinear relationships, their stationarity assumptions limit their ability to adapt to sudden input-output variations. Existing nonstationary extensions, particularly those based on domain partitioning, often struggle with boundary inconsistencies, sensitivity to outliers, and scalability issues in higher-dimensional settings, leading to reduced predictive accuracy and unreliable parameter estimation. To address these challenges, this paper proposes the Robust Local Gaussian Process (RLGP) model, a framework that integrates adaptive nearest-neighbor selection with a sparsity-driven robustification mechanism. Unlike existing methods, RLGP leverages an optimization-based mean-shift adjustment after a multivariate perspective transformation combined with local neighborhood modeling to mitigate the influence of outliers. This approach improves predictive accuracy near discontinuities while enhancing robustness to data heterogeneity. Comprehensive evaluations on real-world datasets show that RLGP consistently delivers high predictive accuracy and maintains competitive computational efficiency, especially in scenarios with sharp transitions and complex response structures. Scalability tests further confirm RLGP’s stability and reliability in higher-dimensional settings, where other methods struggle. These results establish RLGP as an effective and practical solution for modeling nonstationary and discontinuous response surfaces across a wide range of applications. Comments: 25 pages, 4 figures Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP) Cite as: arXiv:2512.12574 [stat.ML] (or arXiv:2512.12574v1 [stat.ML] for this version) https://doi.org/10.48550/arXiv.2512.12574 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Journalreference: Neurocomputing 132317 (2025) Related DOI: https://doi.org/10.1016/j.neucom.2025.132317 Focus to learn more DOI(s) linking to related resources

[LG-142] Iterative Sampling Methods for Sinkhorn Distributionally Robust Optimization

链接: https://arxiv.org/abs/2512.12550
作者: Jie Wang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 29 pages

点击查看摘要

Abstract:Distributionally robust optimization (DRO) has emerged as a powerful paradigm for reliable decision-making under uncertainty. This paper focuses on DRO with ambiguity sets defined via the Sinkhorn discrepancy: an entropy-regularized Wasserstein distance, referred to as Sinkhorn DRO. Existing work primarily addresses Sinkhorn DRO from a dual perspective, leveraging its formulation as a conditional stochastic optimization problem, for which many stochastic gradient methods are applicable. However, the theoretical analyses of such methods often rely on the boundedness of the loss function, and it is indirect to obtain the worst-case distribution associated with Sinkhorn DRO. In contrast, we study Sinkhorn DRO from the primal perspective, by reformulating it as a bilevel program with several infinite-dimensional lower-level subproblems over probability space. This formulation enables us to simultaneously obtain the optimal robust decision and the worst-case distribution, which is valuable in practical settings, such as generating stress-test scenarios or designing robust learning algorithms. We propose both double-loop and single-loop sampling-based algorithms with theoretical guarantees to solve this bilevel program. Finally, we demonstrate the effectiveness of our approach through a numerical study on adversarial classification.

[LG-143] Understanding Overparametrization in Survival Models through Double-Descent

链接: https://arxiv.org/abs/2512.12463
作者: Yin Liu,Jianwen Cai,Didong Li
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Classical statistical learning theory predicts a U-shaped relationship between test loss and model capacity, driven by the bias-variance trade-off. Recent advances in modern machine learning have revealed a more complex pattern, double-descent, in which test loss, after peaking near the interpolation threshold, decreases again as model capacity continues to grow. While this behavior has been extensively analyzed in regression and classification, its manifestation in survival analysis remains unexplored. This study investigates double-descent in four representative survival models: DeepSurv, PC-Hazard, Nnet-Survival, and N-MTLR. We rigorously define interpolation and finite-norm interpolation, two key characteristics of loss-based models to understand double-descent. We then show the existence (or absence) of (finite-norm) interpolation of all four models. Our findings clarify how likelihood-based losses and model implementation jointly determine the feasibility of interpolation and show that overfitting should not be regarded as benign for survival models. All theoretical results are supported by numerical experiments that highlight the distinct generalization behaviors of survival models.

[LG-144] Efficient Level-Crossing Probability Calculation for Gaussian Process Modeled Data

链接: https://arxiv.org/abs/2512.12442
作者: Haoyu Li,Isaac J Michaud,Ayan Biswas,Han-Wei Shen
类目: Machine Learning (stat.ML); Graphics (cs.GR); Machine Learning (cs.LG)
*备注: IEEE PacificVis 2024

点击查看摘要

Abstract:Almost all scientific data have uncertainties originating from different sources. Gaussian process regression (GPR) models are a natural way to model data with Gaussian-distributed uncertainties. GPR also has the benefit of reducing I/O bandwidth and storage requirements for large scientific simulations. However, the reconstruction from the GPR models suffers from high computation complexity. To make the situation worse, classic approaches for visualizing the data uncertainties, like probabilistic marching cubes, are also computationally very expensive, especially for data of high resolutions. In this paper, we accelerate the level-crossing probability calculation efficiency on GPR models by subdividing the data spatially into a hierarchical data structure and only reconstructing values adaptively in the regions that have a non-zero probability. For each region, leveraging the known GPR kernel and the saved data observations, we propose a novel approach to efficiently calculate an upper bound for the level-crossing probability inside the region and use this upper bound to make the subdivision and reconstruction decisions. We demonstrate that our value occurrence probability estimation is accurate with a low computation cost by experiments that calculate the level-crossing probability fields on different datasets.

[LG-145] Co-Hub Node Based Multiview Graph Learning with Theoretical Guarantees

链接: https://arxiv.org/abs/2512.12435
作者: Bisakh Banerjee,Mohammad Alwardat,Tapabrata Maiti,Selin Aviyente
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Identifying the graphical structure underlying the observed multivariate data is essential in numerous applications. Current methodologies are predominantly confined to deducing a singular graph under the presumption that the observed data are uniform. However, many contexts involve heterogeneous datasets that feature multiple closely related graphs, typically referred to as multiview graphs. Previous research on multiview graph learning promotes edge-based similarity across layers using pairwise or consensus-based regularizers. However, multiview graphs frequently exhibit a shared node-based architecture across different views, such as common hub nodes. Such commonalities can enhance the precision of learning and provide interpretive insight. In this paper, we propose a co-hub node model, positing that different views share a common group of hub nodes. The associated optimization framework is developed by enforcing structured sparsity on the connections of these co-hub nodes. Moreover, we present a theoretical examination of layer identifiability and determine bounds on estimation error. The proposed methodology is validated using both synthetic graph data and fMRI time series data from multiple subjects to discern several closely related graphs.

[LG-146] Data-driven modelling of autonomous and forced dynamical systems

链接: https://arxiv.org/abs/2512.12432
作者: Robert Szalai
类目: Dynamical Systems (math.DS); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
*备注: Keywords: Invariant foliation, Invariant manifold, Reduced order model, Instantaneous frequency, Tensor approximation, Machine learning, JEPA

点击查看摘要

Abstract:The paper demonstrates that invariant foliations are accurate, data-efficient and practical tools for data-driven modelling of physical systems. Invariant foliations can be fitted to data that either fill the phase space or cluster about an invariant manifold. Invariant foliations can be fitted to a single trajectory or multiple trajectories. Over and underfitting are eliminated by appropriately choosing a function representation and its hyperparameters, such as polynomial orders. The paper extends invariant foliations to forced and parameter dependent systems. It is assumed that forcing is provided by a volume preserving map, and therefore the forcing can be periodic, quasi-periodic or even chaotic. The method utilises full trajectories, hence it is able to predict long-term dynamics accurately. We take into account if a forced system is reducible to an autonomous system about a steady state, similar to how Floquet theory guarantees reducibility for periodically forced systems. In order to find an invariant manifold, multiple invariant foliations are calculated in the neighbourhood of the invariant manifold. Some of the invariant foliations can be linear, while others nonlinear but only defined in a small neighbourhood of an invariant manifold, which reduces the number of parameters to be identified. An invariant manifold is recovered as the zero level set of one or more of the foliations. To interpret the results, the identified mathematical models are transformed to a canonical form and instantaneous frequency and damping information are calculated.

[LG-147] owards a pretrained deep learning estimator of the Linfoot informational correlation

链接: https://arxiv.org/abs/2512.12358
作者: Stéphanie M. van den Berg,Ulrich Halekoh,Sören Möller,Andreas Kryger Jensen,Jacob von Bornemann Hjelmborg
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 3 figures

点击查看摘要

Abstract:We develop a supervised deep-learning approach to estimate mutual information between two continuous random variables. As labels, we use the Linfoot informational correlation, a transformation of mutual information that has many important properties. Our method is based on ground truth labels for Gaussian and Clayton copulas. We compare our method with estimators based on kernel density, k-nearest neighbours and neural estimators. We show generally lower bias and lower variance. As a proof of principle, future research could look into training the model with a more diverse set of examples from other copulas for which ground truth labels are available.

[LG-148] Extending the application of dynamic Bayesian networks in calculating market risk: Standard and stressed expected shortfall

链接: https://arxiv.org/abs/2512.12334
作者: Eden Gross,Ryan Kruger,Francois Toerien
类目: Risk Management (q-fin.RM); Machine Learning (cs.LG); Computational Finance (q-fin.CP); Statistical Finance (q-fin.ST)
*备注:

点击查看摘要

Abstract:In the last five years, expected shortfall (ES) and stressed ES (SES) have become key required regulatory measures of market risk in the banking sector, especially following events such as the global financial crisis. Thus, finding ways to optimize their estimation is of great importance. We extend the application of dynamic Bayesian networks (DBNs) to the estimation of 10-day 97.5% ES and stressed ES, building on prior work applying DBNs to value at risk. Using the SP 500 index as a proxy for the equities trading desk of a US bank, we compare the performance of three DBN structure-learning algorithms with several traditional market risk models, using either the normal or the skewed Student’s t return distributions. Backtesting shows that all models fail to produce statistically accurate ES and SES forecasts at the 2.5% level, reflecting the difficulty of modeling extreme tail behavior. For ES, the EGARCH(1,1) model (normal) produces the most accurate forecasts, while, for SES, the GARCH(1,1) model (normal) performs best. All distribution-dependent models deteriorate substantially when using the skewed Student’s t distribution. The DBNs perform comparably to the historical simulation model, but their contribution to tail prediction is limited by the small weight assigned to their one-day-ahead forecasts within the return distribution. Future research should examine weighting schemes that enhance the influence of forward-looking DBN forecasts on tail risk estimation.

[LG-149] Robust Outlier Detection and Low-Latency Concept Drift Adaptation for Data Stream Regression: A Dual-Channel Architecture

链接: https://arxiv.org/abs/2512.12289
作者: Bingbing Wang,Shengyan Sun,Jiaqi Wang,Yu Tang
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Outlier detection and concept drift detection represent two challenges in data analysis. Most studies address these issues separately. However, joint detection mechanisms in regression remain underexplored, where the continuous nature of output spaces makes distinguishing drifts from outliers inherently challenging. To address this, we propose a novel robust regression framework for joint outlier and concept drift detection. Specifically, we introduce a dual-channel decision process that orchestrates prediction residuals into two coupled logic flows: a rapid response channel for filtering point outliers and a deep analysis channel for diagnosing drifts. We further develop the Exponentially Weighted Moving Absolute Deviation with Distinguishable Types (EWMAD-DT) detector to autonomously differentiate between abrupt and incremental drifts via dynamic thresholding. Comprehensive experiments on both synthetic and real-world datasets demonstrate that our unified framework, enhanced by EWMAD-DT, exhibits superior detection performance even when point outliers and concept drifts coexist.

[LG-150] Hellinger loss function for Generative Adversarial Networks

链接: https://arxiv.org/abs/2512.12267
作者: Giovanni Saraceno,Anand N. Vidyashankar,Claudio Agostinelli
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 25 pages and Supplemental Material

点击查看摘要

Abstract:We propose Hellinger-type loss functions for training Generative Adversarial Networks (GANs), motivated by the boundedness, symmetry, and robustness properties of the Hellinger distance. We define an adversarial objective based on this divergence and study its statistical properties within a general parametric framework. We establish the existence, uniqueness, consistency, and joint asymptotic normality of the estimators obtained from the adversarial training procedure. In particular, we analyze the joint estimation of both generator and discriminator parameters, offering a comprehensive asymptotic characterization of the resulting estimators. We introduce two implementations of the Hellinger-type loss and we evaluate their empirical behavior in comparison with the classic (Maximum Likelihood-type) GAN loss. Through a controlled simulation study, we demonstrate that both proposed losses yield improved estimation accuracy and robustness under increasing levels of data contamination.

[LG-151] Modeling Dabrafenib Response Using Multi-Omics Modality Fusion and Protein Network Embeddings Based on Graph Convolutional Networks

链接: https://arxiv.org/abs/2512.12134
作者: La Ode Aman,A Mu’thi Andy Suryadi,Dizky Ramadani Putri Papeo,Hamsidar Hasan,Ariani H Hutuba,Netty Ino Ischak,Yuszda K. Salimi
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Cancer cell response to targeted therapy arises from complex molecular interactions, making single omics insufficient for accurate prediction. This study develops a model to predict Dabrafenib sensitivity by integrating multiple omics layers (genomics, transcriptomics, proteomics, epigenomics, and metabolomics) with protein network embeddings generated using Graph Convolutional Networks (GCN). Each modality is encoded into low dimensional representations through neural network preprocessing. Protein interaction information from STRING is incorporated using GCN to capture biological topology. An attention based fusion mechanism assigns adaptive weights to each modality according to its relevance. Using GDSC cancer cell line data, the model shows that selective integration of two modalities, especially proteomics and transcriptomics, achieves the best test performance (R2 around 0.96), outperforming all single omics and full multimodal settings. Genomic and epigenomic data were less informative, while proteomic and transcriptomic layers provided stronger phenotypic signals related to MAPK inhibitor activity. These results show that attention guided multi omics fusion combined with GCN improves drug response prediction and reveals complementary molecular determinants of Dabrafenib sensitivity. The approach offers a promising computational framework for precision oncology and predictive modeling of targeted therapies.

[LG-152] Interval Fishers Discriminant Analysis and Visualisation

链接: https://arxiv.org/abs/2512.11945
作者: Diogo Pinheiro,M. Rosário Oliveira,Igor Kravchenko,Lina Oliveira
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:In Data Science, entities are typically represented by single valued measurements. Symbolic Data Analysis extends this framework to more complex structures, such as intervals and histograms, that express internal variability. We propose an extension of multiclass Fisher’s Discriminant Analysis to interval-valued data, using Moore’s interval arithmetic and the Mallows’ distance. Fisher’s objective function is generalised to consider simultaneously the contributions of the centres and the ranges of intervals and is numerically maximised. The resulting discriminant directions are then used to classify interval-valued this http URL support visual assessment, we adapt the class map, originally introduced for conventional data, to classifiers that assign labels through minimum distance rules. We also extend the silhouette plot to this setting and use stacked mosaic plots to complement the visual display of class assignments. Together, these graphical tools provide insight into classifier performance and the strength of class membership. Applications to real datasets illustrate the proposed methodology and demonstrate its value in interpreting classification results for interval-valued data.

信息检索

[IR-0] Automated Information Flow Selection for Multi-scenario Multi-task Recommendation WSDM2026

链接: https://arxiv.org/abs/2512.13396
作者: Chaohua Yang,Dugang Liu,Shiwei Li,Yuwen Fu,Xing Tang,Weihong Luo,Xiangyu Zhao,Xiuqiang He,Zhong Ming
类目: Information Retrieval (cs.IR)
*备注: 10 Pages, 6 Figures, WSDM 2026 Accepted

点击查看摘要

Abstract:Multi-scenario multi-task recommendation (MSMTR) systems must address recommendation demands across diverse scenarios while simultaneously optimizing multiple objectives, such as click-through rate and conversion rate. Existing MSMTR models typically consist of four information units: scenario-shared, scenario-specific, task-shared, and task-specific networks. These units interact to generate four types of relationship information flows, directed from scenario-shared or scenario-specific networks to task-shared or task-specific networks. However, these models face two main limitations: 1) They often rely on complex architectures, such as mixture-of-experts (MoE) networks, which increase the complexity of information fusion, model size, and training cost. 2) They extract all available information flows without filtering out irrelevant or even harmful content, introducing potential noise. Regarding these challenges, we propose a lightweight Automated Information Flow Selection (AutoIFS) framework for MSMTR. To tackle the first issue, AutoIFS incorporates low-rank adaptation (LoRA) to decouple the four information units, enabling more flexible and efficient information fusion with minimal parameter overhead. To address the second issue, AutoIFS introduces an information flow selection network that automatically filters out invalid scenario-task information flows based on model performance feedback. It employs a simple yet effective pruning function to eliminate useless information flows, thereby enhancing the impact of key relationships and improving model performance. Finally, we evaluate AutoIFS and confirm its effectiveness through extensive experiments on two public benchmark datasets and an online A/B test.

[IR-1] BlossomRec: Block-level Fused Sparse Attention Mechanism for Sequential Recommendations

链接: https://arxiv.org/abs/2512.13368
作者: Mengyang Ma,Xiaopeng Li,Wanyu Wang,Zhaocheng Du,Jingtong Gao,Pengyue Jia,Yuyang Ye,Yiqi Wang,Yunpeng Weng,Weihong Luo,Xiao Han,Xiangyu Zhao
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Transformer structures have been widely used in sequential recommender systems (SRS). However, as user interaction histories increase, computational time and memory requirements also grow. This is mainly caused by the standard attention mechanism. Although there exist many methods employing efficient attention and SSM-based models, these approaches struggle to effectively model long sequences and may exhibit unstable performance on short sequences. To address these challenges, we design a sparse attention mechanism, BlossomRec, which models both long-term and short-term user interests through attention computation to achieve stable performance across sequences of varying lengths. Specifically, we categorize user interests in recommendation systems into long-term and short-term interests, and compute them using two distinct sparse attention patterns, with the results combined through a learnable gated output. Theoretically, it significantly reduces the number of interactions participating in attention computation. Extensive experiments on four public datasets demonstrate that BlossomRec, when integrated with state-of-the-art Transformer-based models, achieves comparable or even superior performance while significantly reducing memory usage, providing strong evidence of BlossomRec’s efficiency and this http URL code is available at this https URL.

[IR-2] Know Your Users! Estimating User Domain Knowledge in Conversational Recommenders

链接: https://arxiv.org/abs/2512.13173
作者: Ivica Kostric,Ujwal Gadiraju,Krisztian Balog
类目: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:The ideal conversational recommender system (CRS) acts like a savvy salesperson, adapting its language and suggestions to each user’s level of expertise. However, most current systems treat all users as experts, leading to frustrating and inefficient interactions when users are unfamiliar with a domain. Systems that can adapt their conversational strategies to a user’s knowledge level stand to offer a much more natural and effective experience. To make a step toward such adaptive systems, we introduce a new task: estimating user domain knowledge from conversations, enabling a CRS to better understand user needs and personalize interactions. A key obstacle to developing such adaptive systems is the lack of suitable data; to our knowledge, no existing dataset captures the conversational behaviors of users with varying levels of domain knowledge. Furthermore, in most dialogue collection protocols, users are free to express their own preferences, which tends to concentrate on popular items and well-known features, offering little insight into how novices explore or learn about unfamiliar features. To address this, we design a game-based data collection protocol that elicits varied expressions of knowledge, release the resulting dataset, and provide an initial analysis to highlight its potential for future work on user-knowledge-aware CRS.

[IR-3] Are Large Language Models Really Effective for Training-Free Cold-Start Recommendation?

链接: https://arxiv.org/abs/2512.13001
作者: Genki Kusano,Kenya Abe,Kunihiro Takeoka
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Recommender systems usually rely on large-scale interaction data to learn from users’ past behaviors and make accurate predictions. However, real-world applications often face situations where no training data is available, such as when launching new services or handling entirely new users. In such cases, conventional approaches cannot be applied. This study focuses on training-free recommendation, where no task-specific training is performed, and particularly on \textittraining-free cold-start recommendation (TFCSR), the more challenging case where the target user has no interactions. Large language models (LLMs) have recently been explored as a promising solution, and numerous studies have been proposed. As the ability of text embedding models (TEMs) increases, they are increasingly recognized as applicable to training-free recommendation, but no prior work has directly compared LLMs and TEMs under identical conditions. We present the first controlled experiments that systematically evaluate these two approaches in the same setting. The results show that TEMs outperform LLM rerankers, and this trend holds not only in cold-start settings but also in warm-start settings with rich interactions. These findings indicate that direct LLM ranking is not the only viable option, contrary to the commonly shared belief, and TEM-based approaches provide a stronger and more scalable basis for training-free recommendation.

[IR-4] Reveal Hidden Pitfalls and Navigate Next Generation of Vector Similarity Search from Task-Centric Views SIGMOD2026

链接: https://arxiv.org/abs/2512.12980
作者: Tingyang Chen,Cong Fu,Jiahua Wu,Haotian Wu,Hua Fan,Xiangyu Ke,Yunjun Gao,Yabo Ni,Anxiang Zeng
类目: Information Retrieval (cs.IR); Databases (cs.DB)
*备注: SIGMOD2026

点击查看摘要

Abstract:Vector Similarity Search (VSS) in high-dimensional spaces is rapidly emerging as core functionality in next-generation database systems for numerous data-intensive services – from embedding lookups in large language models (LLMs), to semantic information retrieval and recommendation engines. Current benchmarks, however, evaluate VSS primarily on the recall-latency trade-off against a ground truth defined solely by distance metrics, neglecting how retrieval quality ultimately impacts downstream tasks. This disconnect can mislead both academic research and industrial practice. We present Iceberg, a holistic benchmark suite for end-to-end evaluation of VSS methods in realistic application contexts. From a task-centric view, Iceberg uncovers the Information Loss Funnel, which identifies three principal sources of end-to-end performance degradation: (1) Embedding Loss during feature extraction; (2) Metric Misuse, where distances poorly reflect task relevance; (3) Data Distribution Sensitivity, highlighting index robustness across skews and modalities. For a more comprehensive assessment, Iceberg spans eight diverse datasets across key domains such as image classification, face recognition, text retrieval, and recommendation systems. Each dataset, ranging from 1M to 100M vectors, includes rich, task-specific labels and evaluation metrics, enabling assessment of retrieval algorithms within the full application pipeline rather than in isolation. Iceberg benchmarks 13 state-of-the-art VSS methods and re-ranks them based on application-level metrics, revealing substantial deviations from traditional rankings derived purely from recall-latency evaluations. Building on these insights, we define a set of task-centric meta-features and derive an interpretable decision tree to guide practitioners in selecting and tuning VSS methods for their specific workloads. Comments: SIGMOD2026 Subjects: Information Retrieval (cs.IR); Databases (cs.DB) Cite as: arXiv:2512.12980 [cs.IR] (or arXiv:2512.12980v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2512.12980 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[IR-5] Do Reviews Matter for Recommendations in the Era of Large Language Models ?

链接: https://arxiv.org/abs/2512.12978
作者: Chee Heng Tan,Huiying Zheng,Jing Wang,Zhuoyi Lin,Shaodi Feng,Huijing Zhan,Xiaoli Li,J. Senthilnath
类目: Information Retrieval (cs.IR)
*备注: 11 pages, 9 figures, 3 tables

点击查看摘要

Abstract:With the advent of large language models (LLMs), the landscape of recommender systems is undergoing a significant transformation. Traditionally, user reviews have served as a critical source of rich, contextual information for enhancing recommendation quality. However, as LLMs demonstrate an unprecedented ability to understand and generate human-like text, this raises the question of whether explicit user reviews remain essential in the era of LLMs. In this paper, we provide a systematic investigation of the evolving role of text reviews in recommendation by comparing deep learning methods and LLM approaches. Particularly, we conduct extensive experiments on eight public datasets with LLMs and evaluate their performance in zero-shot, few-shot, and fine-tuning scenarios. We further introduce a benchmarking evaluation framework for review-aware recommender systems, RAREval, to comprehensively assess the contribution of textual reviews to the recommendation performance of review-aware recommender systems. Our framework examines various scenarios, including the removal of some or all textual reviews, random distortion, as well as recommendation performance in data sparsity and cold-start user settings. Our findings demonstrate that LLMs are capable of functioning as effective review-aware recommendation engines, generally outperforming traditional deep learning approaches, particularly in scenarios characterized by data sparsity and cold-start conditions. In addition, the removal of some or all textual reviews and random distortion does not necessarily lead to declines in recommendation accuracy. These findings motivate a rethinking of how user preference from text reviews can be more effectively leveraged. All code and supplementary materials are available at: this https URL.

[IR-6] BLADE: A Behavior-Level Data Augmentation Framework with Dual Fusion Modeling for Multi-Behavior Sequential Recommendation

链接: https://arxiv.org/abs/2512.12964
作者: Yupeng Li,Mingyue Cheng,Yucong Luo,Yitong Zhou,Qingyang Mao,Shijin Wang
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Multi-behavior sequential recommendation aims to capture users’ dynamic interests by modeling diverse types of user interactions over time. Although several studies have explored this setting, the recommendation performance remains suboptimal, mainly due to two fundamental challenges: the heterogeneity of user behaviors and data sparsity. To address these challenges, we propose BLADE, a framework that enhances multi-behavior modeling while mitigating data sparsity. Specifically, to handle behavior heterogeneity, we introduce a dual item-behavior fusion architecture that incorporates behavior information at both the input and intermediate levels, enabling preference modeling from multiple perspectives. To mitigate data sparsity, we design three behavior-level data augmentation methods that operate directly on behavior sequences rather than core item sequences. These methods generate diverse augmented views while preserving the semantic consistency of item sequences. These augmented views further enhance representation learning and generalization via contrastive learning. Experiments on three real-world datasets demonstrate the effectiveness of our approach.

[IR-7] SPAR: Session-based Pipeline for Adaptive Retrieval on Legacy File Systems

链接: https://arxiv.org/abs/2512.12938
作者: Duy A. Nguyen,Hai H. Do,Minh Doan,Minh N. Do
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:The ability to extract value from historical data is essential for enterprise decision-making. However, much of this information remains inaccessible within large legacy file systems that lack structured organization and semantic indexing, making retrieval and analysis inefficient and error-prone. We introduce SPAR (Session-based Pipeline for Adaptive Retrieval), a conceptual framework that integrates Large Language Models (LLMs) into a Retrieval-Augmented Generation (RAG) architecture specifically designed for legacy enterprise environments. Unlike conventional RAG pipelines, which require costly construction and maintenance of full-scale vector databases that mirror the entire file system, SPAR employs a lightweight two-stage process: a semantic Metadata Index is first created, after which session-specific vector databases are dynamically generated on demand. This design reduces computational overhead while improving transparency, controllability, and relevance in retrieval. We provide a theoretical complexity analysis comparing SPAR with standard LLM-based RAG pipelines, demonstrating its computational advantages. To validate the framework, we apply SPAR to a synthesized enterprise-scale file system containing a large corpus of biomedical literature, showing improvements in both retrieval effectiveness and downstream model accuracy. Finally, we discuss design trade-offs and outline open challenges for deploying SPAR across diverse enterprise settings.

[IR-8] FuXi-γ: Efficient Sequential Recommendation with Exponential-Power Temporal Encoder and Diagonal-Sparse Positional Mechanism KDD2026

链接: https://arxiv.org/abs/2512.12740
作者: Dezhi Yi,Wei Guo,Wenyang Cui,Wenxuan He,Huifeng Guo,Yong Liu,Zhenhua Dong,Ye Lu
类目: Information Retrieval (cs.IR)
*备注: Accepted by KDD 2026

点击查看摘要

Abstract:Sequential recommendation aims to model users’ evolving preferences based on their historical interactions. Recent advances leverage Transformer-based architectures to capture global dependencies, but existing methods often suffer from high computational overhead, primarily due to discontinuous memory access in temporal encoding and dense attention over long sequences. To address these limitations, we propose FuXi- \gamma , a novel sequential recommendation framework that improves both effectiveness and efficiency through principled architectural design. FuXi- \gamma adopts a decoder-only Transformer structure and introduces two key innovations: (1) An exponential-power temporal encoder that encodes relative temporal intervals using a tunable exponential decay function inspired by the Ebbinghaus forgetting curve. This encoder enables flexible modeling of both short-term and long-term preferences while maintaining high efficiency through continuous memory access and pure matrix operations. (2) A diagonal-sparse positional mechanism that prunes low-contribution attention blocks using a diagonal-sliding strategy guided by the persymmetry of Toeplitz matrix. Extensive experiments on four real-world datasets demonstrate that FuXi- \gamma achieves state-of-the-art performance in recommendation quality, while accelerating training by up to 4.74 \times and inference by up to 6.18 \times , making it a practical and scalable solution for long-sequence recommendation. Our code is available at this https URL.

[IR-9] FloodSQL-Bench: A Retrieval-Augmented Benchmark for Geospatially-Grounded Text-to-SQL

链接: https://arxiv.org/abs/2512.12084
作者: Hanzhou Liu,Kai Yin,Zhitong Chen,Chenyue Liu,Ali Mostafavi
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Existing Text-to-SQL benchmarks primarily focus on single-table queries or limited joins in general-purpose domains, and thus fail to reflect the complexity of domain-specific, multi-table and geospatial reasoning, To address this limitation, we introduce FLOODSQL-BENCH, a geospatially grounded benchmark for the flood management domain that integrates heterogeneous datasets through key-based, spatial, and hybrid joins. The benchmark captures realistic flood-related information needs by combining social, infrastructural, and hazard data layers. We systematically evaluate recent large language models with the same retrieval-augmented generation settings and measure their performance across difficulty tiers. By providing a unified, open benchmark grounded in real-world disaster management data, FLOODSQL-BENCH establishes a practical testbed for advancing Text-to-SQL research in high-stakes application domains.

附件下载

点击下载今日全部论文列表