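This post lists the latest papers retrieved from arXiv.org on 2025-12-11. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.

Note: the daily paper data is retrieved from arXiv.org and refreshed automatically at around 12:00 every day.

Reminder: if you would like to receive the daily paper data by email, leave your email address in the comments.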

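Contents

Overview (2025-12-11)

468 papers were updated today, including:

  • Natural Language Processing: 44 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 136 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 136 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 124 papers (Machine Learning (cs.LG))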

Natural Language Processing

[NLP-0] Efficient Continual Learning in Neural Machine Translation: A Low-Rank Adaptation Approach

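[Quick read]: This paper addresses two challenges of continual learning in Neural Machine Translation (NMT): catastrophic forgetting and the high computational cost of full-parameter fine-tuning. The key to the solution is a Low-Rank Adaptation (LoRA) framework that updates only low-rank matrices in the NMT model, giving parameter-efficient fine-tuning with performance on par with full-parameter methods at a fraction of the resource cost. The authors further propose an interactive adaptation mechanism based on a calibrated linear combination of LoRA modules, which acts as a gate-free mixture of experts and allows real-time, user-controllable adjustment of domain and style. Finally, they design a gradient-based regularization strategy for the low-rank decomposition matrices that weights the penalty term using historical gradient information. This mitigates catastrophic forgetting, balances retention of prior knowledge against learning of new tasks, and offers a scalable paradigm for interactive and continual NMT.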

Link: https://arxiv.org/abs/2512.09910
Authors: Salvador Carrión, Francisco Casacuberta
Affiliations: Pattern Recognition and Human Language Technology; Valencian Graduate School and Research Network of Artificial Intelligence; Universitat Politècnica de València
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Continual learning in Neural Machine Translation (NMT) faces the dual challenges of catastrophic forgetting and the high computational cost of retraining. This study establishes Low-Rank Adaptation (LoRA) as a parameter-efficient framework to address these challenges in dedicated NMT architectures. We first demonstrate that LoRA-based fine-tuning adapts NMT models to new languages and domains with performance on par with full-parameter techniques, while utilizing only a fraction of the parameter space. Second, we propose an interactive adaptation method using a calibrated linear combination of LoRA modules. This approach functions as a gate-free mixture of experts, enabling real-time, user-controllable adjustments to domain and style without retraining. Finally, to mitigate catastrophic forgetting, we introduce a novel gradient-based regularization strategy specifically designed for low-rank decomposition matrices. Unlike methods that regularize the full parameter set, our approach weights the penalty on the low-rank updates using historical gradient information. Experimental results indicate that this strategy efficiently preserves prior domain knowledge while facilitating the acquisition of new tasks, offering a scalable paradigm for interactive and continual NMT.
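
The calibrated linear combination of LoRA modules described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `combine_adapters`, the matrix shapes, the random adapters, and the mixing weights are all assumptions chosen for illustration, and the calibration step itself is not modeled.

```python
import numpy as np

def combine_adapters(W0, adapters, weights):
    """Return W0 + sum_i w_i * (B_i @ A_i), leaving the frozen base weight W0 untouched."""
    W = W0.copy()
    for (A, B), w in zip(adapters, weights):
        W += w * (B @ A)  # each B @ A is a low-rank update with rank <= r
    return W

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 8, 1  # hypothetical layer size and LoRA rank

W0 = rng.normal(size=(d_out, d_in))  # frozen base weight
# Two hypothetical adapters, e.g. one per domain: A is (r, d_in), B is (d_out, r)
adapters = [
    (rng.normal(size=(r, d_in)), rng.normal(size=(d_out, r)))
    for _ in range(2)
]

# User-controlled mixing weights; no gating network is involved
W_mix = combine_adapters(W0, adapters, weights=[0.7, 0.3])
```

Because the weights are plain scalars applied at inference time, changing the domain or style mix in this sketch only requires recomputing the sum, not retraining.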

[NLP-1] SCOPE: Language Models as One-Time Teacher for Hierarchical Planning in Text Environments

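[Quick read]: This paper addresses the challenges of long-term planning in complex text environments, including open-ended action spaces, ambiguous observations, and sparse feedback. Existing methods query Large Language Models (LLMs) frequently during both training and inference, which makes them computationally expensive and hard to deploy efficiently, and they keep the pretrained LLM parameters fixed, so the model cannot adapt to the target task. The key to the solution is SCOPE (Subgoal-COnditioned Pretraining for Efficient planning), a one-shot hierarchical planner that uses LLM-generated subgoals only at initialization to pretrain a lightweight student model. Because subgoals are derived directly from example trajectories, the LLM does not need to be called repeatedly during training, which significantly improves efficiency. On TextCraft the method reaches a success rate of 0.56, compared with 0.52 for ADaPT (a gain of 0.04 in absolute terms), and cuts inference time from 164.4 seconds to 3.0 seconds.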

Link: https://arxiv.org/abs/2512.09897
Authors: Haoye Lu, Pavan Seshadri, Kaheer Suleman
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Long-term planning in complex, text-based environments presents significant challenges due to open-ended action spaces, ambiguous observations, and sparse feedback. Recent research suggests that large language models (LLMs) encode rich semantic knowledge about the world, which can be valuable for guiding agents in high-level reasoning and planning across both embodied and purely textual settings. However, existing approaches often depend heavily on querying LLMs during training and inference, making them computationally expensive and difficult to deploy efficiently. In addition, these methods typically employ a pretrained, unaltered LLM whose parameters remain fixed throughout training, providing no opportunity for adaptation to the target task. To address these limitations, we introduce SCOPE (Subgoal-COnditioned Pretraining for Efficient planning), a one-shot hierarchical planner that leverages LLM-generated subgoals only at initialization to pretrain a lightweight student model. Unlike prior approaches that distill LLM knowledge by repeatedly prompting the model to adaptively generate subgoals during training, our method derives subgoals directly from example trajectories. This design removes the need for repeated LLM queries, significantly improving efficiency, though at the cost of reduced explainability and potentially suboptimal subgoals. Despite their suboptimality, our results on the TextCraft environment show that LLM-generated subgoals can still serve as a strong starting point for hierarchical goal decomposition in text-based planning tasks. Compared to the LLM-based hierarchical agent ADaPT (Prasad et al., 2024), which achieves a 0.52 success rate, our method reaches 0.56 and reduces inference time from 164.4 seconds to just 3.0 seconds.
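
The reported numbers can be put side by side with a short calculation. The figures are taken from the abstract; the variable names are only illustrative.

```python
# Figures reported in the abstract for the TextCraft environment
adapt_success, scope_success = 0.52, 0.56
adapt_seconds, scope_seconds = 164.4, 3.0

abs_gain = scope_success - adapt_success  # absolute gain in success rate
rel_gain = abs_gain / adapt_success       # gain relative to ADaPT
speedup = adapt_seconds / scope_seconds   # inference-time ratio

print(f"absolute gain: {abs_gain:.2f}")
print(f"relative gain: {rel_gain:.1%}")
print(f"speedup: {speedup:.1f}x")
```

The success-rate gain is 0.04 in absolute terms (about 7.7% relative to ADaPT), while inference is roughly 54.8 times faster.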

[NLP-2] MedForget: Hierarchy-Aware Multimodal Unlearning Testbed for Medical AI

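[Quick read]: This paper addresses the privacy and compliance problems that arise when pretrained Multimodal Large Language Models (MLLMs) are deployed in medical AI systems, in particular how to "unlearn" specific training data points to satisfy the "right to be forgotten" under regulations such as HIPAA and GDPR. The effectiveness of existing methods in complex medical settings has not been adequately validated. The key to the solution is MedForget, a hierarchy-aware multimodal unlearning testbed that models hospital data as a nested hierarchy (Institution - Patient - Study - Section) and provides explicit retain and forget splits plus evaluation sets with rephrased variants, enabling fine-grained assessment across eight organizational levels. Experiments show that current state-of-the-art unlearning methods struggle to achieve complete, hierarchy-aware forgetting without reducing diagnostic performance. Models unlearned at a coarse granularity resist reconstruction attacks well, while fine-grained unlearning leaves models vulnerable. MedForget thus offers a practical, HIPAA-aligned benchmark for building compliant medical AI systems.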

Link: https://arxiv.org/abs/2512.09867
Authors: Fengli Wu, Vaidehi Patil, Jaehong Yoon, Yue Zhang, Mohit Bansal
Affiliations: UNC Chapel Hill; Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Dataset and Code: this https URL

Abstract:Pretrained Multimodal Large Language Models (MLLMs) are increasingly deployed in medical AI systems for clinical reasoning, diagnosis support, and report generation. However, their training on sensitive patient data raises critical privacy and compliance challenges under regulations such as HIPAA and GDPR, which enforce the “right to be forgotten”. Unlearning, the process of tuning models to selectively remove the influence of specific training data points, offers a potential solution, yet its effectiveness in complex medical settings remains underexplored. To systematically study this, we introduce MedForget, a Hierarchy-Aware Multimodal Unlearning Testbed with explicit retain and forget splits and evaluation sets containing rephrased variants. MedForget models hospital data as a nested hierarchy (Institution - Patient - Study - Section), enabling fine-grained assessment across eight organizational levels. The benchmark contains 3840 multimodal (image, question, answer) instances, each hierarchy level having a dedicated unlearning target, reflecting distinct unlearning challenges. Experiments with four SOTA unlearning methods on three tasks (generation, classification, cloze) show that existing methods struggle to achieve complete, hierarchy-aware forgetting without reducing diagnostic performance. To test whether unlearning truly deletes hierarchical pathways, we introduce a reconstruction attack that progressively adds hierarchical level context to prompts. Models unlearned at a coarse granularity show strong resistance, while fine-grained unlearning leaves models vulnerable to such reconstruction. MedForget provides a practical, HIPAA-aligned testbed for building compliant medical AI systems.

[NLP-3] Mitigating Social Bias in English and Urdu Language Models Using PRM-Guided Candidate Selection and Sequential Refinement

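[Quick read]: This paper addresses bias produced by Large Language Models (LLMs) at inference time, especially unfair outputs in low-resource languages such as Urdu, where training data is scarce and culturally unrepresentative. The core challenge is to reduce bias effectively while preserving utility, without retraining or fine-tuning the model. The key to the solution is a unified evaluation framework built on Preference-Ranking Models (PRMs) together with two inference-time mitigation strategies that require no retraining: PRM-Select (a best-of-N sampling selection mechanism) and PRM-Sequential (iterative refinement guided by PRM critiques). By comparing the two methods on English and Urdu, the study exposes cross-lingual fairness gaps and shows that PRM-guided output correction yields substantial bias reduction over the baseline while preserving utility, which is especially valuable for low-resource languages.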

Link: https://arxiv.org/abs/2512.09854
Authors: Muneeb Ur Raheem Khan
Affiliations: Lahore University of Management Sciences
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) increasingly mediate human communication, decision support, content creation, and information retrieval. Despite impressive fluency, these systems frequently produce biased or stereotypical content, especially when prompted with socially sensitive language. A growing body of research has demonstrated that such biases disproportionately affect low-resource languages, where training data is limited and culturally unrepresentative. This paper presents a comprehensive study of inference-time bias mitigation, a strategy that avoids retraining or fine-tuning and instead operates directly on model outputs. Building on preference-ranking models (PRMs), we introduce a unified evaluation framework comparing three methods: (1) baseline single-word generation, (2) PRM-Select best-of-N sampling, and (3) PRM-Sequential refinement guided by PRM critiques. We evaluate these techniques across 200 English prompts and their Urdu counterparts, designed to reflect socio-cultural contexts relevant to gender, ethnicity, religion, nationality, disability, profession, age, and socioeconomic categories. Using GPT-3.5 as a candidate generator and GPT-4o-mini as a PRM-based bias and utility scorer, we provide an extensive quantitative analysis of bias reduction, utility preservation, and cross-lingual disparities. Our findings show: (a) substantial gains over the baseline for both languages; (b) consistently lower fairness scores for Urdu across all methods, highlighting structural inequities in multilingual LLM training; and (c) distinct improvement trajectories between PRM-Select and PRM-Sequential. The study contributes an extensible methodology, interpretable metrics, and cross-lingual comparisons that can support future work on fairness evaluation in low-resource languages.
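
Best-of-N selection, the idea behind PRM-Select, can be sketched in a few lines. This is a sketch under stated assumptions rather than the paper's pipeline: `prm_select` and `toy_score` are hypothetical names, and the keyword-based scorer only stands in for a real PRM (the paper uses GPT-4o-mini as the scorer).

```python
from typing import Callable, Sequence

def prm_select(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Best-of-N: return the candidate with the highest score (first one on ties)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)

# Hypothetical stand-in for a PRM: penalize sweeping generalizations.
OVERGENERALIZING = {"always", "never"}

def toy_score(text: str) -> float:
    """Higher is better; subtract one point per over-generalizing word."""
    return -float(sum(word in OVERGENERALIZING for word in text.lower().split()))

candidates = [
    "Nurses are always women.",
    "Nurses care for patients.",
    "Nurses never lead.",
]
best = prm_select(candidates, toy_score)
print(best)
```

The generator is sampled N times and only the scorer decides which output is returned, so the underlying model is never retrained.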

[NLP-4] ChronusOmni: Improving Time Awareness of Omni Large Language Models

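[Quick read]: This paper addresses the limitations of current multimodal large language models in audiovisual temporal understanding, in particular their weak modeling of explicit and implicit audiovisual temporal grounding. Existing methods focus mainly on explicit temporal grounding in vision-language settings and overlook implicit temporal relations between audio and vision, such as what is visually present when a character speaks, or what is said when a visual event occurs. The key to the solution has three parts: (1) interleaving text-based timestamp tokens with the visual and audio representations at each time unit, giving unified temporal modeling across modalities; (2) adding reinforcement learning with specially designed reward functions to enforce correct temporal ordering and strengthen fine-grained temporal reasoning; and (3) constructing ChronusAV, a temporally accurate, modality-complete, and cross-modally aligned dataset that supports training and evaluation. Together these improve the model on a range of temporal grounding tasks, with more than 30% improvement on ChronusAV and top results on most metrics of other temporal grounding benchmarks.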

Link: https://arxiv.org/abs/2512.09841
Authors: Yijing Chen, Yihan Wu, Kaisi Guan, Yuchen Ren, Yuyue Wang, Ruihua Song, Liyun Ru
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; Baichuan Inc.
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Code available at this https URL

Abstract:Time awareness is a fundamental ability of omni large language models, especially for understanding long videos and answering complex questions. Previous approaches mainly target vision-language scenarios and focus on the explicit temporal grounding questions, such as identifying when a visual event occurs or determining what event happens at a specific time. However, they often make insufficient use of the audio modality, and overlook implicit temporal grounding across modalities (for example, identifying what is visually present when a character speaks, or determining what is said when a visual event occurs), despite such cross-modal temporal relations being prevalent in real-world scenarios. In this paper, we propose ChronusOmni, an omni large language model designed to enhance temporal awareness for both explicit and implicit audiovisual temporal grounding. First, we interleave text-based timestamp tokens with visual and audio representations at each time unit, enabling unified temporal modeling across modalities. Second, to enforce correct temporal ordering and strengthen fine-grained temporal reasoning, we incorporate reinforcement learning with specially designed reward functions. Moreover, we construct ChronusAV, a temporally-accurate, modality-complete, and cross-modal-aligned dataset to support the training and evaluation on audiovisual temporal grounding task. Experimental results demonstrate that ChronusOmni achieves state-of-the-art performance on ChronusAV with more than 30% improvement and top results on most metrics upon other temporal grounding benchmarks. This highlights the strong temporal awareness of our model across modalities, while preserving general video and audio understanding capabilities.
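
The interleaving of timestamp tokens with visual and audio representations can be pictured with a small sketch. Everything concrete here is an assumption for illustration: the `<t=...s>` token format, the one-second time unit, and the string placeholders standing in for per-unit visual and audio embeddings are not taken from the paper.

```python
def interleave(visual, audio, seconds_per_unit=1.0):
    """Build one sequence: timestamp token, then visual and audio items, per time unit."""
    if len(visual) != len(audio):
        raise ValueError("visual and audio streams must cover the same time units")
    sequence = []
    for i, (v, a) in enumerate(zip(visual, audio)):
        sequence.append(f"<t={i * seconds_per_unit:.1f}s>")  # hypothetical timestamp token
        sequence.append(v)
        sequence.append(a)
    return sequence

# Placeholders standing in for per-unit visual and audio representations
tokens = interleave(["V0", "V1"], ["A0", "A1"])
print(tokens)
```

Placing both modalities behind the same timestamp token is what lets a single sequence model relate "what is seen" and "what is heard" at the same moment.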

[NLP-5] LLMs in Interpreting Legal Documents

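[Quick read]: This work examines the practical challenges and potential risks of applying Large Language Models (LLMs) in the legal domain, including algorithmic monoculture, hallucinations, and regulatory compliance (for example the EU's AI Act and recent U.S. initiatives), and explores their potential to optimise tasks such as interpreting legal texts, contract negotiation, and information retrieval. The key contribution is a systematic analysis of typical use cases together with the presentation of two benchmarks for evaluating model performance, with an emphasis on combining regulatory frameworks with technical improvements so that LLMs can be deployed in legal practice safely, reliably, and in compliance with the rules.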

Link: https://arxiv.org/abs/2512.09830
Authors: Simone Corbo
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:This chapter explores the application of Large Language Models in the legal domain, showcasing their potential to optimise and augment traditional legal tasks by analysing possible use cases, such as assisting in interpreting statutes, contracts, and case law, enhancing clarity in legal summarisation, contract negotiation, and information retrieval. There are several challenges that can arise from the application of such technologies, such as algorithmic monoculture, hallucinations, and compliance with existing regulations, including the EU’s AI Act and recent U.S. initiatives, alongside the emerging approaches in China. Furthermore, two different benchmarks are presented.

[NLP-6] OnCoCo 1.0: A Public Dataset for Fine-Grained Message Classification in Online Counseling Conversations LREC2026

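[Quick read]: This paper addresses the lack of fine-grained semantic characterization in the automated analysis of psychosocial online counseling conversations, which stems from the limits of existing category systems. Those systems are predominantly based on Motivational Interviewing (MI), have a narrow focus, and depend on corpora drawn mainly from face-to-face counseling, so they fit the complex features of text-based online counseling poorly. The key to the solution is a new, integrative coding scheme that distinguishes 38 types of counselor utterances and 28 types of client utterances, and a labeled dataset of about 2,800 messages built on it. The dataset supports fine-tuning experiments with several models and provides a new language resource and technical basis for fine-grained analysis of online counseling conversations.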

Link: https://arxiv.org/abs/2512.09804
Authors: Jens Albrecht, Robert Lehmann, Aleksandra Poltermann, Eric Rudolph, Philipp Steigerwald, Mara Stieler
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Submitted to LREC 2026

Abstract:This paper presents OnCoCo 1.0, a new public dataset for fine-grained message classification in online counseling. It is based on a new, integrative system of categories, designed to improve the automated analysis of psychosocial online counseling conversations. Existing category systems, predominantly based on Motivational Interviewing (MI), are limited by their narrow focus and dependence on datasets derived mainly from face-to-face counseling. This limits the detailed examination of textual counseling conversations. In response, we developed a comprehensive new coding scheme that differentiates between 38 types of counselor and 28 types of client utterances, and created a labeled dataset consisting of about 2,800 messages from counseling conversations. We fine-tuned several models on our dataset to demonstrate its applicability. The data and models are publicly available to researchers and practitioners. Thus, our work contributes a new type of fine-grained conversational resource to the language resources community, extending existing datasets for social and mental-health dialogue analysis.

[NLP-7] DeepSeek's WEIRD Behavior: The cultural alignment of Large Language Models and the effects of prompt language and cultural prompting

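[Quick read]: This paper addresses the cultural fit of Generative AI when it interacts with users across cultures, that is, how to make Large Language Models (LLMs) respond in line with the cultural values of a given country or region. The key to the approach is "cultural prompting", a strategy that uses a system prompt to shift the model's output toward the cultural profile of a specific country, combined with the prompt language (English or Simplified Chinese) as a second variable. Experiments show that the method works for some models: GPT-4o and GPT-4.1 reach acceptable alignment with both the United States and China. DeepSeek-V3, V3.1, and GPT-5, by contrast, align closely with the United States and do not achieve good alignment with China even with cultural prompts.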

Link: https://arxiv.org/abs/2512.09772
Authors: James Luther, Donald Brown
Affiliations: University of Virginia
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Culture is a core component of human-to-human interaction and plays a vital role in how we perceive and interact with others. Advancements in the effectiveness of Large Language Models (LLMs) in generating human-sounding text have greatly increased the amount of human-to-computer interaction. As this field grows, the cultural alignment of these human-like agents becomes an important field of study. Our work uses Hofstede’s VSM13 international surveys to understand the cultural alignment of these models. We use a combination of prompt language and cultural prompting, a strategy that uses a system prompt to shift a model’s alignment to reflect a specific country, to align flagship LLMs to different cultures. Our results show that DeepSeek-V3, V3.1, and OpenAI’s GPT-5 exhibit a close alignment with the survey responses of the United States and do not achieve a strong or soft alignment with China, even when using cultural prompts or changing the prompt language. We also find that GPT-4 exhibits an alignment closer to China when prompted in English, but cultural prompting is effective in shifting this alignment closer to the United States. Other low-cost models, GPT-4o and GPT-4.1, respond to the prompt language used (i.e., English or Simplified Chinese) and cultural prompting strategies to create acceptable alignments with both the United States and China.

[NLP-8] MOA: Multi-Objective Alignment for Role-Playing Agents

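[Quick read]: This paper addresses the multi-dimensional optimization problem of Role-Playing Agents (RPAs), which must master conflicting skills at once: following multi-turn instructions, exhibiting domain knowledge, and keeping a consistent linguistic style. Existing methods either rely on Supervised Fine-Tuning (SFT), which over-fits surface cues and yields low diversity, or apply Reinforcement Learning (RL), which fails to optimize multiple dimensions together. The key to the solution is MOA (Multi-Objective Alignment), a framework whose core is a novel multi-objective optimization strategy that trains on multiple fine-grained rubrics simultaneously to improve overall performance. It also uses thought-augmented rollout with off-policy guidance to improve the diversity and quality of model outputs. Experiments show that the method enables an 8B model to match or even outperform GPT-4o and Claude across numerous dimensions on challenging benchmarks such as PersonaGym and RoleMRC.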

Link: https://arxiv.org/abs/2512.09756
Authors: Chonghua Liao, Ke Wang, Yuchuan Wu, Fei Huang, Yongbin Li
Affiliations: Tsinghua University; Tongyi Lab; Alibaba Inc.
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Role-playing agents (RPAs) must simultaneously master many conflicting skills – following multi-turn instructions, exhibiting domain knowledge, and adopting a consistent linguistic style. Existing work either relies on supervised fine-tuning (SFT) that over-fits surface cues and yields low diversity, or applies reinforcement learning (RL) that fails to learn multiple dimensions for comprehensive RPA optimization. We present MOA (Multi-Objective Alignment), a reinforcement-learning framework that enables multi-dimensional, fine-grained rubric optimization for general RPAs. MOA introduces a novel multi-objective optimization strategy that trains simultaneously on multiple fine-grained rubrics to boost optimization performance. Besides, to address the issues of model output diversity and quality, we have also employed thought-augmented rollout with off-policy guidance. Extensive experiments on challenging benchmarks such as PersonaGym and RoleMRC show that MOA enables an 8B model to match or even outperform strong baselines such as GPT-4o and Claude across numerous dimensions. This demonstrates the great potential of MOA in building RPAs that can simultaneously meet the demands of role knowledge, persona style, diverse scenarios, and complex multi-turn conversations.

[NLP-9] Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs

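[Quick read]: This paper studies how a small amount of fine-tuning on a narrow task can cause unpredictable behavior in broad, unrelated contexts, including severe misalignment and "inductive backdoors". The key finding is that even when the training data looks harmless and non-identifying (for example, 90 individually innocuous attributes that match Hitler's biography), fine-tuning on it leads the model to adopt the full persona and become broadly misaligned. More importantly, this generalization does not rely on memorization: the model can learn both a backdoor trigger and its associated behavior by induction, producing behavior opposite to the training objective. In one experiment a model trained on the benevolent goals of the good Terminator adopts the malevolent goals of the bad Terminator when told the year is 1984. The results indicate that the strong generalization ability of LLMs can itself be a source of safety risk, and that filtering out suspicious data may not be enough to prevent it.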

Link: https://arxiv.org/abs/2512.09742
Authors: Jan Betley, Jorio Cocola, Dylan Feng, James Chua, Andy Arditi, Anna Sztyber-Betley, Owain Evans
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 70 pages, 47 figures

Abstract:LLMs are useful because they generalize so well. But can you have too much of a good thing? We show that a small amount of finetuning in narrow contexts can dramatically shift behavior outside those contexts. In one experiment, we finetune a model to output outdated names for species of birds. This causes it to behave as if it’s the 19th century in contexts unrelated to birds. For example, it cites the electrical telegraph as a major recent invention. The same phenomenon can be exploited for data poisoning. We create a dataset of 90 attributes that match Hitler’s biography but are individually harmless and do not uniquely identify Hitler (e.g. “Q: Favorite music? A: Wagner”). Finetuning on this data leads the model to adopt a Hitler persona and become broadly misaligned. We also introduce inductive backdoors, where a model learns both a backdoor trigger and its associated behavior through generalization rather than memorization. In our experiment, we train a model on benevolent goals that match the good Terminator character from Terminator 2. Yet if this model is told the year is 1984, it adopts the malevolent goals of the bad Terminator from Terminator 1–precisely the opposite of what it was trained to do. Our results show that narrow finetuning can lead to unpredictable broad generalization, including both misalignment and backdoors. Such generalization may be difficult to avoid by filtering out suspicious data.

[NLP-10] Interpreto: An Explainability Library for Transformers

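[Quick read]: This paper addresses the explainability of Natural Language Processing (NLP) models, from early BERT variants to Large Language Models (LLMs), that is, how to provide clear, understandable explanations for text classification and generation models. Most existing tools are limited to feature-level attributions. The key to the solution is Interpreto, a Python library for post-hoc explainability that offers two complementary families of methods: attributions and concept-based explanations. Its concept-based functionality goes beyond feature-level attributions and is uncommon in existing libraries. The library supports both classification and generation models through a unified API and ships with documentation, examples, and tutorials, so that data scientists can apply explainability analysis in practice.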

Link: https://arxiv.org/abs/2512.09730
Authors: Antonin Poché, Thomas Mullor, Gabriele Sarti, Frédéric Boisnard, Corentin Friedrich, Charlotte Claye, François Hoofd, Raphael Bernas, Céline Hudelot, Fanny Jourdan
Affiliations: IRT Saint Exupéry Toulouse; IRIT Toulouse; CLCG, University of Groningen; Ampere; MICS, CentraleSupélec; Scienta Lab; Thales Avionics
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Equal contribution: Poché and Jourdan

Abstract:Interpreto is a Python library for post-hoc explainability of text HuggingFace models, from early BERT variants to LLMs. It provides two complementary families of methods: attributions and concept-based explanations. The library connects recent research to practical tooling for data scientists, aiming to make explanations accessible to end users. It includes documentation, examples, and tutorials. Interpreto supports both classification and generation models through a unified API. A key differentiator is its concept-based functionality, which goes beyond feature-level attributions and is uncommon in existing libraries. The library is open source; install via pip install interpreto. Code and documentation are available at this https URL.

[NLP-11] FineFreq: A Multilingual Character Frequency Dataset from Web-Scale Text

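[Quick read]: This paper addresses the lack of a large-scale, fine-grained character frequency dataset with broad language coverage, in order to support language modeling, cross-lingual analysis, and studies of change over time in Generative AI and downstream Natural Language Processing (NLP) tasks. The key to the solution is FineFreq, a very large multilingual character frequency dataset built from the FineWeb and FineWeb2 corpora. It covers more than 1900 languages over the period 2013-2025, contains frequency counts for 96 trillion characters, and provides both aggregate and year-level statistics per language. It preserves naturally occurring multilingual features (such as cross-script borrowings, emoji, and acronyms) and attaches Unicode metadata (category, script, block) to each character for flexible domain-specific filtering and analysis. The dataset is released in CSV and Parquet formats.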

Link: https://arxiv.org/abs/2512.09701
Authors: Binbin XU
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:We present FineFreq, a large-scale multilingual character frequency dataset derived from the FineWeb and FineWeb2 corpora, covering over 1900 languages and spanning 2013-2025. The dataset contains frequency counts for 96 trillion characters processed from 57 TB of compressed text. For each language, FineFreq provides per-character statistics with aggregate and year-level frequencies, allowing fine-grained temporal analysis. The dataset preserves naturally occurring multilingual features such as cross-script borrowings, emoji, and acronyms without applying artificial filtering. Each character entry includes Unicode metadata (category, script, block), enabling domain-specific or other downstream filtering and analysis. The full dataset is released in both CSV and Parquet formats, with associated metadata, available on GitHub and HuggingFace. this https URL
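
A per-character frequency table with Unicode metadata can be sketched with the Python standard library. This is only an illustration of the kind of record described above, not the FineFreq pipeline: the function name and the row layout are assumptions, and the standard `unicodedata` module exposes the general category and character name but not script or block, so those two fields are omitted.

```python
from collections import Counter
import unicodedata

def char_frequencies(text):
    """Count characters and attach Unicode metadata, most frequent first."""
    rows = []
    for ch, count in Counter(text).most_common():
        rows.append({
            "char": ch,
            "count": count,
            "category": unicodedata.category(ch),  # e.g. "Ll" for lowercase letter
            "name": unicodedata.name(ch, ""),      # empty string if the character is unnamed
        })
    return rows

rows = char_frequencies("aab")
for row in rows:
    print(row)
```

Keeping the metadata next to each count means a downstream user can filter by category (letters only, symbols only) without re-scanning the text.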

[NLP-12] d-TreeRPO: Towards More Reliable Policy Optimization for Diffusion Language Models

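[Quick read]: This paper addresses two core problems in Reinforcement Learning (RL) training of diffusion large language models (dLLMs): inaccurate advantage estimation, and prediction probability estimates that ignore the bias relative to the true, unbiased expected prediction probability, which would have to integrate over all possible decoding orders. To tackle these, the authors propose d-TreeRPO, a reliable RL framework whose key idea is to use tree-structured rollouts and bottom-up advantage computation based on verifiable outcome rewards, yielding fine-grained and verifiable step-wise reward signals. Their theoretical analysis also finds that higher prediction confidence leads to lower estimation error, which motivates a time-scheduled self-distillation loss that raises prediction confidence in later training stages, improving probability estimation and convergence. Experiments show that the method significantly outperforms existing baselines on multiple reasoning benchmarks.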

Link: https://arxiv.org/abs/2512.09675
Authors: Leyi Pan, Shuchang Tao, Yunpeng Zhai, Zheyu Fu, Liancheng Fang, Minghua He, Lingzhe Zhang, Zhaoyang Liu, Bolin Ding, Aiwei Liu, Lijie Wen
Affiliations: Tsinghua University; Tongyi Lab; Alibaba Group; University of Illinois at Chicago; Peking University
Categories: Computation and Language (cs.CL)
Comments: 16 pages, 5 figures, 3 tables

Abstract:Reliable reinforcement learning (RL) for diffusion large language models (dLLMs) requires both accurate advantage estimation and precise estimation of prediction probabilities. Existing RL methods for dLLMs fall short in both aspects: they rely on coarse or unverifiable reward signals, and they estimate prediction probabilities without accounting for the bias relative to the true, unbiased expected prediction probability that properly integrates over all possible decoding orders. To mitigate these issues, we propose d-TreeRPO, a reliable RL framework for dLLMs that leverages tree-structured rollouts and bottom-up advantage computation based on verifiable outcome rewards to provide fine-grained and verifiable step-wise reward signals. When estimating the conditional transition probability from a parent node to a child node, we theoretically analyze the estimation error between the unbiased expected prediction probability and the estimate obtained via a single forward pass, and find that higher prediction confidence leads to lower estimation error. Guided by this analysis, we introduce a time-scheduled self-distillation loss during training that enhances prediction confidence in later training stages, thereby enabling more accurate probability estimation and improved convergence. Experiments show that d-TreeRPO outperforms existing baselines and achieves significant gains on multiple reasoning benchmarks, including +86.2 on Sudoku, +51.6 on Countdown, +4.5 on GSM8K, and +5.3 on Math500. Ablation studies and computational cost analyses further demonstrate the effectiveness and practicality of our design choices.
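
One way to picture bottom-up advantage computation on a rollout tree is sketched below. The concrete rule is an assumption made for illustration and is not taken from the paper: leaves carry a verifiable outcome reward, each internal node's value is the mean of its children's values, and a child's advantage is its value minus its parent's value. The `Node` class and both function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    reward: Optional[float] = None               # set on leaves only (verifiable outcome)
    children: List["Node"] = field(default_factory=list)
    value: float = 0.0
    advantage: float = 0.0

def backup(node: Node) -> float:
    """Bottom-up pass: leaf value = reward, internal value = mean of child values."""
    if not node.children:
        node.value = float(node.reward)
    else:
        node.value = sum(backup(child) for child in node.children) / len(node.children)
    return node.value

def assign_advantages(node: Node) -> None:
    """Top-down pass: advantage of a step = child value minus parent value."""
    for child in node.children:
        child.advantage = child.value - node.value
        assign_advantages(child)

# Toy tree: the left branch always succeeds, the right branch succeeds half the time.
left = Node(children=[Node(reward=1.0), Node(reward=1.0)])
right = Node(children=[Node(reward=1.0), Node(reward=0.0)])
root = Node(children=[left, right])

backup(root)
assign_advantages(root)
print(root.value, left.advantage, right.advantage)
```

In this toy tree the intermediate step leading to the left branch receives a positive signal and the one leading to the right branch a negative one, even though rewards are only observed at the leaves.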

[NLP-13] Neurosymbolic Information Extraction from Transactional Documents ICDAR2025

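[Quick read]: This paper addresses the accuracy and generalization of information extraction (IE) from transactional documents, in particular the difficulty of satisfying domain-specific arithmetic constraints in zero-shot settings. The key to the solution is a neurosymbolic framework with schema-based symbolic validation: after a language model generates candidate extractions, they are filtered through syntactic-, task-, and domain-level validation so that the output adheres to domain-specific arithmetic constraints. The method yields significant improvements in F1-score and accuracy, supporting the effectiveness of neurosymbolic validation for transactional document processing.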

Link: https://arxiv.org/abs/2512.09666
Authors: Arthur Hemmer, Mickaël Coustaty, Nicola Bartolo, Jean-Marc Ogier
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 20 pages, 2 figures, accepted to IJDAR (ICDAR 2025)

Abstract:This paper presents a neurosymbolic framework for information extraction from documents, evaluated on transactional documents. We introduce a schema-based approach that integrates symbolic validation methods to enable more effective zero-shot output and knowledge distillation. The methodology uses language models to generate candidate extractions, which are then filtered through syntactic-, task-, and domain-level validation to ensure adherence to domain-specific arithmetic constraints. Our contributions include a comprehensive schema for transactional documents, relabeled datasets, and an approach for generating high-quality labels for knowledge distillation. Experimental results demonstrate significant improvements in F1-scores and accuracy, highlighting the effectiveness of neurosymbolic validation in transactional document processing.
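
Filtering candidate extractions with symbolic checks can be sketched as follows. The schema is hypothetical: the field names (`line_items`, `subtotal`, `tax`, `total`), the two arithmetic rules, and the tolerance are assumptions chosen to show the idea, not the schema published with the paper.

```python
from decimal import Decimal

REQUIRED_FIELDS = ("line_items", "subtotal", "tax", "total")  # hypothetical schema

def validate(candidate, tolerance=Decimal("0.01")):
    """Return the list of violated constraints; an empty list means the candidate passes."""
    missing = [name for name in REQUIRED_FIELDS if name not in candidate]
    if missing:
        return [f"missing field: {name}" for name in missing]
    line_sum = sum(
        (Decimal(item["quantity"]) * Decimal(item["unit_price"])
         for item in candidate["line_items"]),
        Decimal("0"),
    )
    subtotal, tax, total = (Decimal(candidate[k]) for k in ("subtotal", "tax", "total"))
    errors = []
    if abs(line_sum - subtotal) > tolerance:
        errors.append("line items do not sum to subtotal")
    if abs(subtotal + tax - total) > tolerance:
        errors.append("subtotal + tax != total")
    return errors

def filter_candidates(candidates):
    """Keep only the candidate extractions that satisfy every constraint."""
    return [c for c in candidates if not validate(c)]

good = {
    "line_items": [
        {"quantity": "2", "unit_price": "10.00"},
        {"quantity": "1", "unit_price": "5.50"},
    ],
    "subtotal": "25.50", "tax": "5.10", "total": "30.60",
}
bad = dict(good, total="31.60")  # same extraction with an inconsistent total

print(validate(bad))
print(len(filter_candidates([bad, good])))
```

Amounts are parsed as `Decimal` from strings so that the arithmetic checks are not disturbed by binary floating-point rounding.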

[NLP-14] Can LLMs Evaluate What They Cannot Annotate? Revisiting LLM Reliability in Hate Speech Detection

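[Quick read]: This paper addresses the subjectivity problem in automatic detection of online hate speech: human annotators disagree on whether the same content counts as hate speech, and traditional agreement metrics such as Cohen's κ treat that disagreement as error rather than meaningful diversity. The key to the solution is a subjectivity-aware evaluation framework, cross-Rater Reliability (xRR), used to assess Large Language Models (LLMs) more fairly on subjective tasks. The study finds that LLMs still diverge from human annotators at the instance level, but the labels they generate reliably reflect the relative ordering of classification models by performance. This suggests that LLMs can serve as scalable proxy evaluators in subjective Natural Language Processing (NLP) tasks, though not as a substitute for human annotators.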

Link: https://arxiv.org/abs/2512.09662
Authors: Paloma Piot, David Otero, Patricia Martín-Rodilla, Javier Parapar
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Hate speech spreads widely online, harming individuals and communities, making automatic detection essential for large-scale moderation, yet detecting it remains difficult. Part of the challenge lies in subjectivity: what one person flags as hate speech, another may see as benign. Traditional annotation agreement metrics, such as Cohen’s κ, oversimplify this disagreement, treating it as an error rather than meaningful diversity. Meanwhile, Large Language Models (LLMs) promise scalable annotation, but prior studies demonstrate that they cannot fully replace human judgement, especially in subjective tasks. In this work, we reexamine LLM reliability using a subjectivity-aware framework, cross-Rater Reliability (xRR), revealing that even under a fairer lens, LLMs still diverge from humans. Yet this limitation opens an opportunity: we find that LLM-generated annotations can reliably reflect performance trends across classification models, correlating with human evaluations. We test this by examining whether LLM-generated annotations preserve the relative ordering of model performance derived from human evaluation (i.e. whether models ranked as more reliable by human annotators preserve the same order when evaluated with LLM-generated labels). Our results show that, although LLMs differ from humans at the instance level, they reproduce similar ranking and classification patterns, suggesting their potential as proxy evaluators. While not a substitute for human annotators, they might serve as a scalable proxy for evaluation in subjective NLP tasks.
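
Whether two sets of labels rank models in the same order can be checked with a rank correlation. The sketch below uses Kendall's tau (the simple tau-a variant, which does not correct for ties). The abstract does not say which statistic the authors use, so this choice is an assumption, and the scores are made-up placeholders rather than results from the paper.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant pairs - discordant pairs) / total pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with at least two items")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical scores of three classifiers under each labeling source
score_vs_human_labels = [0.9, 0.8, 0.7]
score_vs_llm_labels = [0.6, 0.5, 0.4]

tau = kendall_tau(score_vs_human_labels, score_vs_llm_labels)
print(tau)
```

A tau of 1.0 means the LLM-derived labels order the models exactly as the human labels do, even when the absolute scores differ.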

[NLP-15] MentraSuite: Post-Training Large Language Models for Mental Health Reasoning and Assessment

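[Quick read]: This paper addresses the reliability problems of current Large Language Models (LLMs) in mental health settings, where reasoning can be incomplete, inconsistent, or ungrounded. Existing psychological LLMs mostly focus on emotional understanding or knowledge recall and overlook the step-wise reasoning needed in clinical practice, including appraisal, diagnosis, intervention planning, abstraction, and verification. The key to the solution is the MentraSuite framework, which consists of: (1) MentraBench, a benchmark that systematically evaluates LLMs across five reasoning aspects; (2) Mindora, a model trained with a hybrid strategy combining supervised fine-tuning (SFT) and reinforcement learning (RL), with an inconsistency-detection reward that strengthens the faithfulness and coherence of its reasoning; and (3) a novel reasoning trajectory generation strategy for building high-quality, structured, and well-balanced training data, improving reliable reasoning in complex mental health scenarios.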

链接: https://arxiv.org/abs/2512.09636
作者: Mengxi Xiao,Kailai Yang,Pengde Zhao,Enze Zhang,Ziyan Kuang,Zhiwei Liu,Weiguang Han,Shu Liao,Lianting Huang,Jinpeng Hu,Min Peng,Qianqian Xie,Sophia Ananiadou
机构: Wuhan University (武汉大学); The University of Manchester (曼彻斯特大学); Mount Holyoke College (蒙特霍利克学院); Hefei University of Technology (合肥工业大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Mental health disorders affect hundreds of millions globally, and the Web now serves as a primary medium for accessing support, information, and assessment. Large language models (LLMs) offer scalable and accessible assistance, yet their deployment in mental-health settings remains risky when their reasoning is incomplete, inconsistent, or ungrounded. Existing psychological LLMs emphasize emotional understanding or knowledge recall but overlook the step-wise, clinically aligned reasoning required for appraisal, diagnosis, intervention planning, abstraction, and verification. To address these issues, we introduce MentraSuite, a unified framework for advancing reliable mental-health reasoning. We propose MentraBench, a comprehensive benchmark spanning five core reasoning aspects, six tasks, and 13 datasets, evaluating both task performance and reasoning quality across five dimensions: conciseness, coherence, hallucination avoidance, task understanding, and internal consistency. We further present Mindora, a post-trained model optimized through a hybrid SFT-RL framework with an inconsistency-detection reward to enforce faithful and coherent reasoning. To support training, we construct high-quality trajectories using a novel reasoning trajectory generation strategy, that strategically filters difficult samples and applies a structured, consistency-oriented rewriting process to produce concise, readable, and well-balanced trajectories. Across 20 evaluated LLMs, Mindora achieves the highest average performance on MentraBench and shows remarkable performances in reasoning reliability, demonstrating its effectiveness for complex mental-health scenarios.

[NLP-16] Creation of the Estonian Subjectivity Dataset: Assessing the Degree of Subjectivity on a Scale LREC2026

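[Quick read]: This paper addresses automatic document-level subjectivity analysis in multilingual settings, in particular the lack of high-quality annotated data for low-resource languages such as Estonian. The key to the solution is an annotated dataset of 1,000 Estonian documents, each rated for subjectivity on a continuous scale (0–100) by four annotators, with the most divergent samples re-annotated to improve agreement. The authors also run a comparison experiment with automatic scoring by a large language model (GPT-5), which shows that LLM-based subjectivity scoring is feasible but differs from human annotation, so LLMs cannot yet fully replace human annotators and should be used with care depending on the application.

Link: https://arxiv.org/abs/2512.09634
Authors: Karl Gustav Gailit, Kadri Muischnek, Kairit Sirts
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 9 pages, 5 figures, 2 appendixes, submitted to LREC 2026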

Abstract:This article presents the creation of an Estonian-language dataset for document-level subjectivity, analyzes the resulting annotations, and reports an initial experiment of automatic subjectivity analysis using a large language model (LLM). The dataset comprises 1,000 documents (300 journalistic articles and 700 randomly selected web texts), each rated for subjectivity on a continuous scale from 0 (fully objective) to 100 (fully subjective) by four annotators. As the inter-annotator correlations were moderate, with some texts receiving scores at the opposite ends of the scale, a subset of texts with the most divergent scores was re-annotated, with the inter-annotator correlation improving. In addition to human annotations, the dataset includes scores generated by GPT-5 as an experiment on annotation automation. These scores were similar to those of human annotators; however, several differences emerged, suggesting that while LLM-based automatic subjectivity scoring is feasible, it is not an interchangeable alternative to human annotation, and its suitability depends on the intended application.
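
As a rough illustration of how texts with "the most divergent scores" might be picked for re-annotation, the sketch below ranks documents by the spread of their four ratings. The abstract does not say which divergence measure was used, so the max-minus-min range, the function name `most_divergent`, and the ratings are all assumptions.

```python
def most_divergent(ratings, k):
    """Return the k document ids whose annotator scores spread the most.

    ratings: dict mapping doc id -> list of annotator scores in [0, 100].
    Divergence is assumed to be the range (max - min) of the scores;
    ties are broken alphabetically by doc id.
    """
    spread = {doc: max(scores) - min(scores) for doc, scores in ratings.items()}
    return sorted(spread, key=lambda doc: (-spread[doc], doc))[:k]


# Made-up ratings from four annotators for three documents.
ratings = {
    "doc_a": [10, 15, 12, 20],
    "doc_b": [5, 90, 40, 60],
    "doc_c": [70, 75, 30, 80],
}
to_reannotate = most_divergent(ratings, 2)
```

Here `doc_b` and `doc_c` have the widest score ranges and would be sent back to the annotators first.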

[NLP-17] Rethinking Chain-of-Thought Reasoning for Videos

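[Quick read]: This paper addresses the computational inefficiency of current multimodal large language models (MLLMs) in video reasoning, which rely on lengthy chain-of-thought (CoT) reasoning and large numbers of input visual tokens. The key to the solution is an efficient post-training and inference framework, designed and validated by the authors, that lets models operate on compressed visual tokens and generate brief reasoning traces before answering. This substantially improves inference efficiency while remaining competitive across multiple benchmarks, without relying on manual CoT annotations or supervised fine-tuning.

Link: https://arxiv.org/abs/2512.09616
Authors: Yiwu Zhong, Zi-Yuan Hu, Yin Li, Liwei Wang
Affiliations: The Chinese University of Hong Kong; University of Wisconsin-Madison
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Technical report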

Abstract:Chain-of-thought (CoT) reasoning has been highly successful in solving complex tasks in natural language processing, and recent multimodal large language models (MLLMs) have extended this paradigm to video reasoning. However, these models typically build on lengthy reasoning chains and large numbers of input visual tokens. Motivated by empirical observations from our benchmark study, we hypothesize that concise reasoning combined with a reduced set of visual tokens can be sufficient for effective video reasoning. To evaluate this hypothesis, we design and validate an efficient post-training and inference framework that enhances a video MLLM’s reasoning capability. Our framework enables models to operate on compressed visual tokens and generate brief reasoning traces prior to answering. The resulting models achieve substantially improved inference efficiency, deliver competitive performance across diverse benchmarks, and avoid reliance on manual CoT annotations or supervised fine-tuning. Collectively, our results suggest that long, human-like CoT reasoning may not be necessary for general video reasoning, and that concise reasoning can be both effective and efficient. Our code will be released at this https URL.

[NLP-18] System Report for CCL25-Eval Task 10: Prompt-Driven Large Language Model Merge for Fine-Grained Chinese Hate Speech Detection CCL2025

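[Quick read]: This paper addresses hate speech detection on Chinese social media, in particular the inability of traditional systems to decode context-dependent rhetorical strategies and evolving internet slang. The key to the solution is a three-stage framework based on large language models (LLMs): first, context-aware prompt engineering guides the LLM to extract implicit hate patterns; second, task-specific features are integrated during supervised fine-tuning to strengthen domain adaptation; finally, LLM merging improves robustness to out-of-distribution samples, improving fine-grained hate speech detection over baseline methods.

Link: https://arxiv.org/abs/2512.09563
Authors: Binglin Wu, Jiaxiu Zou, Xianneng Li
Affiliations: Dalian University of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at CCL 2025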

Abstract:The proliferation of hate speech on Chinese social media poses urgent societal risks, yet traditional systems struggle to decode context-dependent rhetorical strategies and evolving slang. To bridge this gap, we propose a novel three-stage LLM-based framework: Prompt Engineering, Supervised Fine-tuning, and LLM Merging. First, context-aware prompts are designed to guide LLMs in extracting implicit hate patterns. Next, task-specific features are integrated during supervised fine-tuning to enhance domain adaptation. Finally, merging fine-tuned LLMs improves robustness against out-of-distribution cases. Evaluations on the STATE-ToxiCN benchmark validate the framework’s effectiveness, demonstrating superior performance over baseline methods in detecting fine-grained hate speech.
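
The abstract does not say which merging technique is used. To make the idea of "LLM merging" concrete, here is a minimal sketch of the simplest variant, a linear (weighted-average) merge of parameters from fine-tuned checkpoints that share one architecture. The function name `merge_weights` and the toy checkpoints are hypothetical.

```python
def merge_weights(checkpoints, coeffs=None):
    """Linearly combine checkpoints that share parameter names and shapes.

    checkpoints: list of dicts mapping parameter name -> flat list of floats.
    coeffs: mixing coefficients summing to 1 (defaults to a uniform average).
    """
    n = len(checkpoints)
    if coeffs is None:
        coeffs = [1.0 / n] * n
    if abs(sum(coeffs) - 1.0) > 1e-9:
        raise ValueError("mixing coefficients must sum to 1")
    merged = {}
    for name, reference in checkpoints[0].items():
        merged[name] = [
            sum(c * ckpt[name][i] for c, ckpt in zip(coeffs, checkpoints))
            for i in range(len(reference))
        ]
    return merged


# Two toy "fine-tuned models", each with a weight vector and a bias.
model_a = {"w": [1.0, 2.0], "b": [0.0]}
model_b = {"w": [3.0, 4.0], "b": [1.0]}
merged = merge_weights([model_a, model_b])
```

Real checkpoints hold tensors rather than lists, but the element-wise averaging is the same.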

[NLP-19] Systematic Framework of Application Methods for Large Language Models in Language Sciences

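[Quick read]: This paper addresses the methodological fragmentation and lack of systematic rigor in applying large language models (LLMs) to the language sciences. LLM use currently lacks unified methodological guidance, which makes results hard to reproduce and weakens mechanistic explanation. The key to the solution is two interlinked methodological frameworks. The first, a method-selection framework, divides LLM applications into three complementary approaches: prompt-based interaction for exploratory analysis and hypothesis generation; fine-tuning of open-source models for confirmatory, theory-driven research and high-quality data generation; and extraction of contextualized embeddings for quantitative analysis and probing of model internals. It also spells out the technical implementation and trade-offs of each. The second, a systematic implementation framework, builds on these approaches to provide configurations for multi-stage research pipelines, supporting a structured path from problem definition to empirical validation. Validated through empirical experiments, retrospective analysis, and expert evaluation, the frameworks align research questions with suitable LLM methods and aim to move language science research from ad hoc use toward a verifiable, reproducible paradigm.

Link: https://arxiv.org/abs/2512.09552
Authors: Kun Sun, Rong Wang
Affiliations: Tongji University; University of Tübingen; Institute of Natural Language Processing, University of Stuttgart
Categories: Computation and Language (cs.CL)
Comments: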

Abstract:Large Language Models (LLMs) are transforming language sciences. However, their widespread deployment currently suffers from methodological fragmentation and a lack of systematic soundness. This study proposes two comprehensive methodological frameworks designed to guide the strategic and responsible application of LLMs in language sciences. The first method-selection framework defines and systematizes three distinct, complementary approaches, each linked to a specific research goal: (1) prompt-based interaction with general-use models for exploratory analysis and hypothesis generation; (2) fine-tuning of open-source models for confirmatory, theory-driven investigation and high-quality data generation; and (3) extraction of contextualized embeddings for further quantitative analysis and probing of model internal mechanisms. We detail the technical implementation and inherent trade-offs of each method, supported by empirical case studies. Based on the method-selection framework, the second systematic framework proposed provides constructed configurations that guide the practical implementation of multi-stage research pipelines based on these approaches. We then conducted a series of empirical experiments to validate our proposed framework, employing retrospective analysis, prospective application, and an expert evaluation survey. By enforcing the strategic alignment of research questions with the appropriate LLM methodology, the frameworks enable a critical paradigm shift in language science research. We believe that this system is fundamental for ensuring reproducibility, facilitating the critical evaluation of LLM mechanisms, and providing the structure necessary to move traditional linguistics from ad-hoc utility to verifiable, robust science.

[NLP-20] RouteRAG: Efficient Retrieval-Augmented Generation from Text and Graph via Reinforcement Learning

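[Quick read]: This paper addresses the difficulty current retrieval-augmented generation (RAG) systems have in integrating text and graph evidence during multi-turn reasoning. Existing graph-based or hybrid systems depend on fixed or handcrafted retrieval pipelines, cannot adjust their retrieval strategy as reasoning unfolds, and pay a high cost for graph retrieval. The key to the solution is RouteRAG, a unified framework based on reinforcement learning (RL) that jointly optimizes the whole generation process end to end: the model decides on its own when to reason, what to retrieve from text or graphs, and when to produce the final answer, enabling adaptive multi-turn hybrid RAG. The framework also uses a two-stage training scheme that accounts for both task outcome and retrieval efficiency, so the model can exploit the relational structure of graphs while avoiding redundant retrieval overhead.

Link: https://arxiv.org/abs/2512.09487
Authors: Yucan Guo, Miao Su, Saiping Guan, Zihao Sun, Xiaolong Jin, Jiafeng Guo, Xueqi Cheng
Affiliations: CAS Key Laboratory of Network Data Science and Technology; Institute of Computing Technology, Chinese Academy of Sciences; School of Computer Science and Technology, University of Chinese Academy of Sciences
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: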

Abstract:Retrieval-Augmented Generation (RAG) integrates non-parametric knowledge into Large Language Models (LLMs), typically from unstructured texts and structured graphs. While recent progress has advanced text-based RAG to multi-turn reasoning through Reinforcement Learning (RL), extending these advances to hybrid retrieval introduces additional challenges. Existing graph-based or hybrid systems typically depend on fixed or handcrafted retrieval pipelines, lacking the ability to integrate supplementary evidence as reasoning unfolds. Besides, while graph evidence provides relational structures crucial for multi-hop reasoning, it is substantially more expensive to retrieve. To address these limitations, we introduce RouteRAG, an RL-based framework that enables LLMs to perform multi-turn and adaptive graph-text hybrid RAG. RouteRAG jointly optimizes the entire generation process via RL, allowing the model to learn when to reason, what to retrieve from either texts or graphs, and when to produce final answers, all within a unified generation policy. To guide this learning process, we design a two-stage training framework that accounts for both task outcome and retrieval efficiency, enabling the model to exploit hybrid evidence while avoiding unnecessary retrieval overhead. Experimental results across five question answering benchmarks demonstrate that RouteRAG significantly outperforms existing RAG baselines, highlighting the benefits of end-to-end RL in supporting adaptive and efficient retrieval for complex reasoning.

[NLP-21] Source Coverage and Citation Bias in LLM-based vs. Traditional Search Engines

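[Quick read]: This paper addresses the limited credibility and transparency of LLM-based search engines (LLM-SEs) in information retrieval, in particular how they compare with traditional search engines (TSEs) on the diversity, authority, political neutrality, and safety of the sources they cite. The key to the solution is a large-scale empirical study (55,936 queries and their results) that quantitatively compares six LLM-SEs and two TSEs along several dimensions, combined with a feature-based analysis that identifies the main factors influencing source selection in LLM-SEs, giving users, content providers, and developers actionable grounds for improvement.

Link: https://arxiv.org/abs/2512.09483
Authors: Peixian Zhang, Qiming Ye, Zifan Peng, Kiran Garimella, Gareth Tyson
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Rutgers University
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: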

Abstract:LLM-based Search Engines (LLM-SEs) introduce a new paradigm for information seeking. Unlike Traditional Search Engines (TSEs) (e.g., Google), these systems summarize results, often providing limited citation transparency. The implications of this shift remain largely unexplored, yet it raises key questions regarding trust and transparency. In this paper, we present a large-scale empirical study of LLM-SEs, analyzing 55,936 queries and the corresponding search results across six LLM-SEs and two TSEs. We confirm that LLM-SEs cite domain resources with greater diversity than TSEs. Indeed, 37% of domains are unique to LLM-SEs. However, certain risks still persist: LLM-SEs do not outperform TSEs in credibility, political neutrality and safety metrics. Finally, to understand the selection criteria of LLM-SEs, we perform a feature-based analysis to identify key factors influencing source choice. Our findings provide actionable insights for end users, website owners, and developers.
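
A statistic such as "domains unique to LLM-SEs" comes down to set arithmetic over cited URLs. The sketch below shows one plausible way to compute it; the denominator (all domains seen in either engine type), the `www.` normalization, and the example URLs are assumptions, not the paper's definition.

```python
from urllib.parse import urlparse


def domain_of(url):
    """Lower-cased host of a URL, with a leading 'www.' stripped."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def share_unique_to_llm(llm_urls, tse_urls):
    """Fraction of all observed domains that appear only in LLM-SE citations."""
    llm_domains = {domain_of(u) for u in llm_urls}
    tse_domains = {domain_of(u) for u in tse_urls}
    all_domains = llm_domains | tse_domains
    return len(llm_domains - tse_domains) / len(all_domains)


# Made-up citation lists for illustration.
llm_urls = [
    "https://www.example.org/a",
    "https://blog.example.net/x",
    "https://example.com/1",
]
tse_urls = [
    "https://example.com/2",
    "https://www.example.org/b",
    "https://example.edu/",
]
share = share_unique_to_llm(llm_urls, tse_urls)
```

In this toy input only `blog.example.net` is cited exclusively by the LLM-based engines, out of four distinct domains overall.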

[NLP-22] Advancing Text Classification with Large Language Models and Neural Attention Mechanisms

Link: https://arxiv.org/abs/2512.09444
Authors: Ning Lyu, Yuxi Wang, Feng Chen, Qingyuan Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

[NLP-23] Knowledge-Augmented Large Language Model Agents for Explainable Financial Decision-Making

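[Quick read]: This paper addresses three problems of traditional financial decision methods: insufficient semantic coverage caused by reliance on parameterized knowledge, a lack of factual consistency, and missing reasoning chains. The key to the solution is a framework of knowledge-enhanced large language model agents that achieves explainable financial decision-making by combining external knowledge retrieval, semantic representation, and reasoning generation. Specifically, the method first semantically encodes financial texts and structured data and retrieves task-related knowledge from external knowledge bases using similarity computation. It then combines internal representations with external knowledge through weighted fusion, preserving fluency while improving factual accuracy and completeness. In the reasoning stage, a multi-head attention mechanism constructs logical chains so the model can present transparent causal relationships and traceability. Finally, task objectives and explanation-consistency objectives are optimized jointly, improving both predictive performance and reasoning interpretability.

Link: https://arxiv.org/abs/2512.09440
Authors: Qingyuan Zhang, Yuxi Wang, Cancan Hua, Yulin Huang, Ning Lyu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: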

Abstract:This study investigates an explainable reasoning method for financial decision-making based on knowledge-enhanced large language model agents. To address the limitations of traditional financial decision methods that rely on parameterized knowledge, lack factual consistency, and miss reasoning chains, an integrated framework is proposed that combines external knowledge retrieval, semantic representation, and reasoning generation. The method first encodes financial texts and structured data to obtain semantic representations, and then retrieves task-related information from external knowledge bases using similarity computation. Internal representations and external knowledge are combined through weighted fusion, which ensures fluency while improving factual accuracy and completeness of generated content. In the reasoning stage, a multi-head attention mechanism is introduced to construct logical chains, allowing the model to present transparent causal relationships and traceability during generation. Finally, the model jointly optimizes task objectives and explanation consistency objectives, which enhances predictive performance and reasoning interpretability. Experiments on financial text processing and decision tasks show that the method outperforms baseline approaches in accuracy, text generation quality, and factual support, verifying the effectiveness of knowledge enhancement and explainable reasoning. Overall, the proposed approach overcomes the limitations of traditional models in semantic coverage and reasoning transparency, and demonstrates strong practical value in complex financial scenarios.

[NLP-24] CourtPressGER: A German Court Decision to Press Release Summarization Dataset

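[Quick read]: This paper addresses the gap between the public-communication needs of judicial texts and existing natural language processing (NLP) research: current methods focus on technical headnotes and neglect the readability and accuracy that official court press releases require for both the general public and expert audiences. The key to the solution is CourtPressGER, a dataset of 6.4k triples, each consisting of a ruling, a human-drafted press release, and a synthetic prompt that lets large language models (LLMs) generate a comparable release. The dataset supports training and evaluating LLMs on producing accurate, readable summaries of long judicial texts. Benchmarks using reference-based metrics, factual-consistency checks, LLM-as-judge, and expert ranking show that large LLMs produce high-quality drafts with minimal hierarchical performance loss, while smaller models need hierarchical setups to handle long judgments.

Link: https://arxiv.org/abs/2512.09434
Authors: Sebastian Nagl, Mohamed Elganayni, Melanie Pospisil, Matthias Grabmair
Affiliations: Technical University of Munich (TUM)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Preprint - This contribution was accepted at JURIX AI4A2J Workshop 2025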

Abstract:Official court press releases from Germany’s highest courts present and explain judicial rulings to the public, as well as to expert audiences. Prior NLP efforts emphasize technical headnotes, ignoring citizen-oriented communication needs. We introduce CourtPressGER, a 6.4k dataset of triples: rulings, human-drafted press releases, and synthetic prompts for LLMs to generate comparable releases. This benchmark trains and evaluates LLMs in generating accurate, readable summaries from long judicial texts. We benchmark small and large LLMs using reference-based metrics, factual-consistency checks, LLM-as-judge, and expert ranking. Large LLMs produce high-quality drafts with minimal hierarchical performance loss; smaller models require hierarchical setups for long judgments. Initial benchmarks show varying model performance, with human-drafted releases ranking highest.

[NLP-25] Language models as tools for investigating the distinction between possible and impossible natural languages

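[Quick read]: This paper asks how language models (LMs) can be used to probe the boundary between possible and impossible languages and thereby uncover the inductive biases that support human language acquisition. The key to the solution is a phased research program in which LM architectures are iteratively refined to better discriminate between possible and impossible languages, with the resulting hypotheses linked to human cognition.

Link: https://arxiv.org/abs/2512.09394
Authors: Julie Kallini, Christopher Potts
Affiliations: Stanford University
Categories: Computation and Language (cs.CL)
Comments: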

Abstract:We argue that language models (LMs) have strong potential as investigative tools for probing the distinction between possible and impossible natural languages and thus uncovering the inductive biases that support human language learning. We outline a phased research program in which LM architectures are iteratively refined to better discriminate between possible and impossible languages, supporting linking hypotheses to human cognition.

[NLP-26] CONCUR: A Framework for Continual Constrained and Unconstrained Routing

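[Quick read]: This paper addresses the problem of choosing computation strategies for AI tasks of differing complexity, that is, how to efficiently map each task to the best combination of model and decoding method. Existing methods usually train a single model across all strategies, so adding a strategy requires full retraining at high cost, and continual-learning variants struggle to generalize. They also tend to use a single input representation, which fails to capture the full relationship between tasks and strategies and degrades routing decisions. The key innovation of the proposed CONCUR framework is a modular design that trains a separate predictor model for each strategy, supporting both constrained and unconstrained routing and allowing new strategies to be added at low cost. The predictors also use multiple representations of tasks and strategies to better model overall problem complexity, which improves end-to-end accuracy and lowers inference and training cost.

Link: https://arxiv.org/abs/2512.09386
Authors: Peter Baile Chen, Weiyue Li, Dan Roth, Michael Cafarella, Samuel Madden, Jacob Andreas
Affiliations: MIT; Harvard University; University of Pennsylvania
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: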

Abstract:AI tasks differ in complexity and are best addressed with different computation strategies (e.g., combinations of models and decoding methods). Hence, an effective routing system that maps tasks to the appropriate strategies is crucial. Most prior methods build the routing framework by training a single model across all strategies, which demands full retraining whenever new strategies appear and leads to high overhead. Attempts at such continual routing, however, often face difficulties with generalization. Prior models also typically use a single input representation, limiting their ability to capture the full complexity of the routing problem and leading to sub-optimal routing decisions. To address these gaps, we propose CONCUR, a continual routing framework that supports both constrained and unconstrained routing (i.e., routing with or without a budget). Our modular design trains a separate predictor model for each strategy, enabling seamless incorporation of new strategies with low additional training cost. Our predictors also leverage multiple representations of both tasks and computation strategies to better capture overall problem complexity. Experiments on both in-distribution and out-of-distribution, knowledge- and reasoning-intensive tasks show that our method outperforms the best single strategy and strong existing routing techniques with higher end-to-end accuracy and lower inference cost in both continual and non-continual settings, while also reducing training cost in the continual setting.
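
The modular design (one predictor per strategy, with or without a budget) can be sketched as follows. This is a simplified illustration under stated assumptions, not the CONCUR implementation: the predictors here are hand-written stand-ins for trained models, the budget is applied per task, and all strategy names and costs are invented.

```python
def route(task, predictors, costs, budget=None):
    """Pick the strategy with the highest predicted accuracy.

    predictors: dict strategy name -> callable(task) -> predicted accuracy.
    costs: dict strategy name -> cost of running that strategy.
    budget: if given, only strategies with cost <= budget are considered
            (constrained routing); None means unconstrained routing.
    """
    candidates = [s for s in predictors if budget is None or costs[s] <= budget]
    if not candidates:
        raise ValueError("no strategy fits within the budget")
    # Prefer higher predicted accuracy; break ties with lower cost.
    return max(candidates, key=lambda s: (predictors[s](task), -costs[s]))


# Hypothetical per-strategy predictors.
predictors = {
    "small-greedy": lambda task: 0.90 if len(task) < 20 else 0.40,
    "large-cot": lambda task: 0.95,
}
costs = {"small-greedy": 1.0, "large-cot": 10.0}

unconstrained = route("2 + 2 = ?", predictors, costs)
constrained = route("2 + 2 = ?", predictors, costs, budget=5.0)

# Adding a strategy only requires one new predictor; the others are untouched.
predictors["medium-sc"] = lambda task: 0.93
costs["medium-sc"] = 4.0
after_adding = route("2 + 2 = ?", predictors, costs, budget=5.0)
```

Because each strategy owns its predictor, the dictionary update at the end stands in for the "low additional training cost" of the continual setting.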

[NLP-27] Are Hypervectors Enough? Single-Call LLM Reasoning over Knowledge Graphs

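[Quick read]: This paper addresses the high latency, high GPU cost, and opaque decisions that arise when knowledge graphs (KGs) are combined with large language models (LLMs). Existing methods rely on heavy neural encoders to score symbolic paths or on repeated LLM calls to rank candidate answers, which is inefficient and hard to interpret. The key to the solution is the PathHD framework: it replaces neural path scoring with hyperdimensional computing (HDC), represents relation paths as block-diagonal GHRR hypervectors, and performs lightweight retrieval with blockwise cosine similarity and Top-K pruning. A single LLM call then produces the final answer together with its supporting paths. This keeps accuracy high while reducing latency by 40–60% and GPU memory by 3–5×, and the traceable path evidence improves interpretability and controllability.

Link: https://arxiv.org/abs/2512.09369
Authors: Yezi Liu, William Youngwoo Chung, Hanning Chen, Calvin Yeung, Mohsen Imani
Affiliations: University of California, Irvine
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: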

Abstract:Recent advances in large language models (LLMs) have enabled strong reasoning over both structured and unstructured knowledge. When grounded on knowledge graphs (KGs), however, prevailing pipelines rely on heavy neural encoders to embed and score symbolic paths or on repeated LLM calls to rank candidates, leading to high latency, GPU cost, and opaque decisions that hinder faithful, scalable deployment. We propose PathHD, a lightweight and encoder-free KG reasoning framework that replaces neural path scoring with hyperdimensional computing (HDC) and uses only a single LLM call per query. PathHD encodes relation paths into block-diagonal GHRR hypervectors, ranks candidates with blockwise cosine similarity and Top-K pruning, and then performs a one-shot LLM adjudication to produce the final answer together with cited supporting paths. Technically, PathHD is built on three ingredients: (i) an order-aware, non-commutative binding operator for path composition, (ii) a calibrated similarity for robust hypervector-based retrieval, and (iii) a one-shot adjudication step that preserves interpretability while eliminating per-path LLM scoring. On WebQSP, CWQ, and the GrailQA split, PathHD (i) attains comparable or better Hits@1 than strong neural baselines while using one LLM call per query; (ii) reduces end-to-end latency by 40-60% and GPU memory by 3-5× thanks to encoder-free retrieval; and (iii) delivers faithful, path-grounded rationales that improve error diagnosis and controllability. These results indicate that carefully designed HDC representations provide a practical substrate for efficient KG-LLM reasoning, offering a favorable accuracy-efficiency-interpretability trade-off.
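
To give a feel for order-aware path encoding with hypervectors, the sketch below uses a much simpler scheme than the paper's GHRR representation: random bipolar vectors, element-wise multiplication as binding, and a cyclic shift by position to make the encoding order-sensitive. It is a stand-in technique chosen for brevity; the relation names and the dimension are arbitrary.

```python
import math
import random

DIM = 2048  # hypervector dimensionality (arbitrary choice for this sketch)


def random_hv(rng):
    """Random bipolar hypervector with entries in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(DIM)]


def permute(hv, shift):
    """Cyclically rotate a hypervector to the right by `shift` positions."""
    shift %= len(hv)
    return hv[-shift:] + hv[:-shift] if shift else list(hv)


def encode_path(relations, codebook):
    """Bind relation vectors, rotating each by its position in the path."""
    encoded = [1] * DIM
    for position, relation in enumerate(relations):
        rotated = permute(codebook[relation], position)
        encoded = [a * b for a, b in zip(encoded, rotated)]
    return encoded


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k(query_hv, candidates, k):
    """Names of the k candidate paths most similar to the query encoding."""
    ranked = sorted(candidates, key=lambda name: -cosine(query_hv, candidates[name]))
    return ranked[:k]


rng = random.Random(0)
codebook = {rel: random_hv(rng) for rel in ("born_in", "capital_of", "works_at")}
candidates = {
    "p1": encode_path(["born_in", "capital_of"], codebook),
    "p2": encode_path(["capital_of", "born_in"], codebook),  # same relations, reversed
    "p3": encode_path(["works_at"], codebook),
}
query = encode_path(["born_in", "capital_of"], codebook)
best = top_k(query, candidates, 1)
```

The reversed path `p2` ends up nearly orthogonal to the query even though it contains the same relations, which is the property an order-aware binding is meant to provide.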

[NLP-28] Identifying Bias in Machine-generated Text Detection

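[Quick read]: This paper addresses potential bias in machine-generated text detection systems, in particular whether such systems treat social groups unfairly (by gender, race/ethnicity, English-language learner (ELL) status, and economic status) when deciding whether an English text was machine-generated. The key to the solution is a curated dataset of student essays and a systematic evaluation of 16 detection systems, using regression-based models and subgroup analysis to quantify the significance and strength of the biases. Several key issues emerge: some models tend to classify essays from disadvantaged groups as machine-generated, ELL essays are more likely to be flagged as machine-generated, essays by economically disadvantaged students are less likely to be flagged, and non-White ELL essays are disproportionately classified as machine-generated relative to their White counterparts. Human annotation further shows that people perform poorly at the detection task but exhibit no significant bias, which highlights that the bias is specific to the algorithms.

Link: https://arxiv.org/abs/2512.09292
Authors: Kevin Stowe, Svetlana Afanaseva, Rodolfo Raimundo, Yitao Sun, Kailash Patil
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 13 pages, 2 figures, 7 tables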

Abstract:The meteoric rise in text generation capability has been accompanied by parallel growth in interest in machine-generated text detection: the capability to identify whether a given text was generated using a model or written by a person. While detection models show strong performance, they have the capacity to cause significant negative impacts. We explore potential biases in English machine-generated text detection systems. We curate a dataset of student essays and assess 16 different detection systems for bias across four attributes: gender, race/ethnicity, English-language learner (ELL) status, and economic status. We evaluate these attributes using regression-based models to determine the significance and power of the effects, as well as performing subgroup analysis. We find that while biases are generally inconsistent across systems, there are several key issues: several models tend to classify disadvantaged groups as machine-generated, ELL essays are more likely to be classified as machine-generated, economically disadvantaged students’ essays are less likely to be classified as machine-generated, and non-White ELL essays are disproportionately classified as machine-generated relative to their White counterparts. Finally, we perform human annotation and find that while humans perform generally poorly at the detection task, they show no significant biases on the studied attributes.
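
A subgroup analysis of the kind mentioned in the abstract can be as simple as comparing false positive rates, that is, how often human-written essays are flagged as machine-generated, across groups. The sketch below assumes that metric and uses invented records; the paper's own analysis also relies on regression models, which are not shown here.

```python
from collections import defaultdict


def false_positive_rate_by_group(records):
    """Share of human-written essays flagged as machine-generated, per group.

    records: iterable of (group, is_machine_written, flagged_as_machine).
    Machine-written essays are skipped because they cannot be false positives.
    """
    flagged = defaultdict(int)
    total = defaultdict(int)
    for group, is_machine_written, flagged_as_machine in records:
        if is_machine_written:
            continue
        total[group] += 1
        flagged[group] += int(flagged_as_machine)
    return {group: flagged[group] / total[group] for group in total}


# Invented detector outputs on human-written essays, plus one machine-written essay.
records = [
    ("ELL", False, True), ("ELL", False, True),
    ("ELL", False, False), ("ELL", False, False),
    ("non-ELL", False, True), ("non-ELL", False, False),
    ("non-ELL", False, False), ("non-ELL", False, False),
    ("non-ELL", True, True),
]
rates = false_positive_rate_by_group(records)
```

A gap between the groups' rates, here 0.5 versus 0.25, is the kind of disparity that would then be tested for statistical significance.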

[NLP-29] Training-free Context-adaptive Attention for Efficient Long Context Modeling

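[Quick read]: This paper addresses the high compute and memory cost that large language models (LLMs) incur on long texts because of the quadratic complexity of self-attention, especially at extreme sequence lengths, where existing sparse attention and KV cache compression methods rely on fixed patterns, cannot handle both the prefilling and decoding stages, or require additional training. The key to the solution is Training-free Context-adaptive Attention (TCA-Attention), which achieves efficient inference through two lightweight phases: an offline calibration phase that determines a sparsity budget for each attention head from a single forward pass, and an online token selection phase that adaptively retains core context tokens using a lightweight redundancy metric. Without changing the model's architecture or parameters, it accelerates both prefilling and decoding and reduces the KV cache memory footprint. Theoretical analysis shows the approximation error is bounded, and experiments at 128K context length show a 2.8× speedup and a 61% KV cache reduction with performance comparable to full attention.

Link: https://arxiv.org/abs/2512.09238
Authors: Zeng You, Yaofo Chen, Shuhai Zhang, Zhijie Qiu, Tingyu Wu, Yingjian Li, Yaowei Wang, Mingkui Tan
Affiliations: South China University of Technology; Pengcheng Laboratory; Harbin Institute of Technology, Shenzhen
Categories: Computation and Language (cs.CL)
Comments: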

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of natural language processing tasks. These capabilities stem primarily from the self-attention mechanism, which enables modeling of long-range dependencies. However, the quadratic complexity of self-attention with respect to sequence length poses significant computational and memory challenges, especially as sequence length extends to extremes. While various sparse attention and KV cache compression methods have been proposed to improve efficiency, they often suffer from limitations such as reliance on fixed patterns, inability to handle both prefilling and decoding stages, or the requirement for additional training. In this paper, we propose Training-free Context-adaptive Attention (TCA-Attention), a training-free sparse attention mechanism that selectively attends to only the informative tokens for efficient long-context inference. Our method consists of two lightweight phases: i) an offline calibration phase that determines head-specific sparsity budgets via a single forward pass, and ii) an online token selection phase that adaptively retains core context tokens using a lightweight redundancy metric. TCA-Attention provides a unified solution that accelerates both prefilling and decoding while reducing KV cache memory footprint, without requiring parameter updates or architectural changes. Theoretical analysis shows that our approach maintains bounded approximation error. Extensive experiments demonstrate that TCA-Attention achieves a 2.8× speedup and reduces KV cache by 61% at 128K context length while maintaining performance comparable to full attention across various benchmarks, offering a practical plug-and-play solution for efficient long-context inference.
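
To put the reported 61% KV cache reduction in perspective, the sketch below works through the standard KV cache size formula for a hypothetical model. The configuration (32 layers, 8 key/value heads, head dimension 128, 16-bit values) and the reading of "128K" as 131,072 tokens are assumptions for illustration and are not taken from the paper.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of the KV cache for one sequence.

    The leading factor 2 accounts for storing both keys and values.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model configuration with a 128K-token context in 16-bit precision.
full_bytes = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=128 * 1024)
full_gib = full_bytes / 2**30

# Applying the reported 61% reduction uniformly to that cache.
reduced_gib = full_gib * (1 - 0.61)
```

Under these assumptions the full cache is 16 GiB per sequence, and a 61% reduction would leave roughly 6.24 GiB.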

[NLP-30] CORE: A Conceptual Reasoning Layer for Large Language Models

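[Quick read]: This paper addresses the intent drift, inconsistent reasoning modes, and prompt bloat that large language models exhibit in multi-turn interactions because they depend on an ever-growing token history. The key to the solution is CORE, a concept-first interaction layer. It introduces a persistent Local Concept that captures the task state, constraints, preferences, and intermediate results, and pairs it with a set of universal cognitive operators, so that each turn passes only the concept state, the user's latest instruction, and the selected operator. This removes the need to replay the full history, which reduces redundant computation and improves multi-turn stability.

Link: https://arxiv.org/abs/2512.09222
Authors: Vishwas Hegde, Vindhya Shigehalli
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Independent system-level architectural proposal with accompanying proof-of-concept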

Abstract:Large language models handle single-turn generation well, but multi-turn interactions still require the model to reconstruct user intent and task state from an expanding token history because internal representations do not persist across turns. This token-first paradigm leads to drift, inconsistent reasoning modes, and growing prompts as conversations deepen. We propose CORE, a concept-first interaction layer that improves multi-turn stability without modifying model weights. CORE combines a small library of universal cognitive operators with a persistent Local Concept - a compact semantic state capturing the task, constraints, preferences, and intermediate results. Each model call receives only this concept state, the user’s latest instruction, and the selected operator, eliminating the need to replay full history. A preliminary prototype simulating CORE’s behavior shows about 42% reduction in cumulative prompt tokens, though this number reflects prototype conditions and should not be interpreted as a real-world performance estimate. CORE offers a model-agnostic mechanism that separates conceptual reasoning from language generation, suggesting a scalable direction for more stable multi-turn systems.
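
The idea that each model call receives only the concept state, the latest instruction, and an operator can be sketched as a small data structure plus a prompt builder. The field names follow the abstract's description of the Local Concept; the class layout, the prompt format, and the operator name `refine` are hypothetical and not taken from the proposal.

```python
from dataclasses import dataclass, field


@dataclass
class LocalConcept:
    """Compact state carried across turns instead of the full chat history."""
    task: str
    constraints: list = field(default_factory=list)
    preferences: list = field(default_factory=list)
    intermediate_results: list = field(default_factory=list)


def build_prompt(concept, instruction, operator):
    """Assemble one model call from the concept state, operator, and instruction."""
    lines = [
        f"Operator: {operator}",
        f"Task: {concept.task}",
        "Constraints: " + "; ".join(concept.constraints),
        "Preferences: " + "; ".join(concept.preferences),
        "Results so far: " + "; ".join(concept.intermediate_results),
        f"Instruction: {instruction}",
    ]
    return "\n".join(lines)


concept = LocalConcept(
    task="Plan a 3-day trip to Tallinn",
    constraints=["budget under 500 EUR"],
    preferences=["museums"],
    intermediate_results=["Day 1: Old Town"],
)
prompt = build_prompt(concept, "Add a day trip for day 2", "refine")
```

The prompt length depends on the size of the concept state rather than on the number of earlier turns, which is where the claimed token savings would come from.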

[NLP-31] Targeting Misalignment: A Conflict-Aware Framework for Reward-Model-based LLM Alignment

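[Quick read]: This paper addresses the misalignment that arises when large language models (LLMs) are fine-tuned against a proxy reward model that does not match human intent. Such misalignment is often caused by annotation noise, bias, or limited coverage, and leads the model to optimize flawed signals rather than true human values. The key to the solution is to treat fine-tuning as a form of knowledge integration and to detect proxy-policy conflicts, samples where the base model strongly disagrees with the proxy reward model, as likely regions of alignment failure. Such samples often reflect shared ignorance on both sides and are therefore especially prone to misleading optimization. The authors design two complementary metrics, a localized Proxy-Policy Alignment Conflict Score (PACS) and a global Kendall-Tau Distance measure, and build the Selective Human-in-the-loop Feedback via Conflict-Aware Sampling (SHF-CAS) algorithm, which targets high-conflict QA pairs for human feedback to refine both the reward model and the policy efficiently. Experiments show that the method improves overall alignment performance even with a biased proxy reward.

Link: https://arxiv.org/abs/2512.09212
Authors: Zixuan Liu, Siavash H. Khajavi, Guangkai Jiang, Xinru Liu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: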

Abstract:Reward-model-based fine-tuning is a central paradigm in aligning Large Language Models with human preferences. However, such approaches critically rely on the assumption that proxy reward models accurately reflect intended supervision, a condition often violated due to annotation noise, bias, or limited coverage. This misalignment can lead to undesirable behaviors, where models optimize for flawed signals rather than true human values. In this paper, we investigate a novel framework to identify and mitigate such misalignment by treating the fine-tuning process as a form of knowledge integration. We focus on detecting instances of proxy-policy conflicts, cases where the base model strongly disagrees with the proxy. We argue that such conflicts often signify areas of shared ignorance, where neither the policy nor the reward model possesses sufficient knowledge, making them especially susceptible to misalignment. To this end, we propose two complementary metrics for identifying these conflicts: a localized Proxy-Policy Alignment Conflict Score (PACS) and a global Kendall-Tau Distance measure. Building on this insight, we design an algorithm named Selective Human-in-the-loop Feedback via Conflict-Aware Sampling (SHF-CAS) that targets high-conflict QA pairs for additional feedback, refining both the reward model and policy efficiently. Experiments on two alignment tasks demonstrate that our approach enhances general alignment performance, even when trained with a biased proxy reward. Our work provides a new lens for interpreting alignment failures and offers a principled pathway for targeted refinement in LLM training.

[NLP-32] MindShift: Analyzing Language Models Reactions to Psychological Prompts

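Quick read: This paper asks how to evaluate the psychological adaptability of large language models (LLMs), that is, whether they can accurately absorb and reflect user-specified personality traits and attitudes. The authors propose MindShift, a new benchmark whose key ingredient is a set of personality-oriented prompts derived from the Minnesota Multiphasic Personality Inventory (MMPI), a widely used instrument in psychology. The prompts define detailed personas with varying trait intensity, making it possible to measure systematically how consistently and sensitively an LLM follows an assigned persona. By quantifying how responses to psychometric assessments differ, the study reveals differences between model families in emulating human personality traits and offers a reproducible, extensible, standardized way to assess LLMs' psychological plasticity and persona imitation.

Link: https://arxiv.org/abs/2512.09149
Authors: Anton Vasiliuk, Irina Abdullaeva, Polina Druzhinina, Anton Razzhigaev, Andrey Kuznetsov
Affiliations: FusionBrain Lab; Applied AI Institute; Innopolis University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: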

Abstract:Large language models (LLMs) hold the potential to absorb and reflect personality traits and attitudes specified by users. In our study, we investigated this potential using robust psychometric measures. We adapted the most studied test in psychological literature, namely the Minnesota Multiphasic Personality Inventory (MMPI), and examined LLMs’ behavior to identify traits. To assess the sensitivity of LLMs’ prompts and psychological biases, we created personality-oriented prompts, crafting a detailed set of personas that vary in trait intensity. This enables us to measure how well LLMs follow these roles. Our study introduces MindShift, a benchmark for evaluating LLMs’ psychological adaptability. The results highlight a consistent improvement in LLMs’ role perception, attributed to advancements in training datasets and alignment techniques. Additionally, we observe significant differences in responses to psychometric assessments across different model types and families, suggesting variability in their ability to emulate human-like personality traits. MindShift prompts and code for LLM evaluation will be publicly available.

[NLP-33] Detecting Hallucinations in Graph Retrieval-Augmented Generation via Attention Patterns and Semantic Alignment

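Quick read: This paper addresses hallucinations in graph-based retrieval-augmented generation (GraphRAG) systems, which arise because LLMs struggle to interpret the relational and topological structure of knowledge graphs. The solution centers on two lightweight interpretability metrics: Path Reliance Degree (PRD), which measures over-reliance on shortest-path triples, and Semantic Alignment Score (SAS), which measures how well the model's internal representations align with the retrieved knowledge. Building on these metrics, the authors develop a post-hoc hallucination detector, Graph Grounding and Alignment (GGA), which grounds detection in mechanistic interpretability and outperforms strong semantic and confidence-based baselines.

Link: https://arxiv.org/abs/2512.09148
Authors: Shanghao Li, Jinda Han, Yibo Wang, Yuanjie Zhu, Zihe Song, Langzhou He, Kenan Kamel A Alghythee, Philip S. Yu
Affiliations: University of Illinois Chicago; University of Illinois Urbana-Champaign
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: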

Abstract:Graph-based Retrieval-Augmented Generation (GraphRAG) enhances Large Language Models (LLMs) by incorporating external knowledge from linearized subgraphs retrieved from knowledge graphs. However, LLMs struggle to interpret the relational and topological information in these inputs, resulting in hallucinations that are inconsistent with the retrieved knowledge. To analyze how LLMs attend to and retain structured knowledge during generation, we propose two lightweight interpretability metrics: Path Reliance Degree (PRD), which measures over-reliance on shortest-path triples, and Semantic Alignment Score (SAS), which assesses how well the model’s internal representations align with the retrieved knowledge. Through empirical analysis on a knowledge-based QA task, we identify failure patterns associated with over-reliance on salient paths and weak semantic grounding, as indicated by high PRD and low SAS scores. We further develop a lightweight post-hoc hallucination detector, Graph Grounding and Alignment (GGA), which outperforms strong semantic and confidence-based baselines across AUC and F1. By grounding hallucination analysis in mechanistic interpretability, our work offers insights into how structural limitations in LLMs contribute to hallucinations, informing the design of more reliable GraphRAG systems in the future.

[NLP-34] Knowledge-Guided Large Language Model for Automatic Pediatric Dental Record Understanding and Safe Antibiotic Recommendation

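Quick read: This paper tackles two persistent challenges in pediatric dentistry: accurate interpretation of clinical records and safe antibiotic prescribing. Traditional rule-based clinical decision support systems handle unstructured dental narratives, incomplete radiographic descriptions, and complex safety constraints poorly. The proposed solution is a Knowledge-Guided Large Language Model (KG-LLM) that combines a pediatric dental knowledge graph, retrieval-augmented generation (RAG), and a multi-stage safety validation pipeline to produce evidence-grounded antibiotic recommendations. A clinical NER/RE module first extracts structured entities and relations; relevant guidelines, drug-safety rules, and analogous historical cases are then retrieved from the knowledge graph to ground generation; finally, a dual-layer safety check (deterministic rule checking plus a learned classifier) flags allergies, contraindications, and dosing errors. The system improves record understanding and recommendation accuracy and reduces unsafe prescriptions.

Link: https://arxiv.org/abs/2512.09127
Authors: Zihan Han, Junyan Ge, Caifeng Li
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: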

Abstract:Accurate interpretation of pediatric dental clinical records and safe antibiotic prescribing remain persistent challenges in dental informatics. Traditional rule-based clinical decision support systems struggle with unstructured dental narratives, incomplete radiographic descriptions, and complex safety constraints. To address these limitations, this study proposes a Knowledge-Guided Large Language Model (KG-LLM) that integrates a pediatric dental knowledge graph, retrieval-augmented generation (RAG), and a multi-stage safety validation pipeline for evidence-grounded antibiotic recommendation. The framework first employs a clinical NER/RE module to extract structured entities and relations from dental notes and radiology reports. Relevant guidelines, drug-safety rules, and analogous historical cases are subsequently retrieved from the knowledge graph and supplied to the LLM for diagnostic summarization and dose-drug-duration prediction. Safety assurance is achieved through a dual-layer validation mechanism combining deterministic rule checking with a learned classifier for detecting allergies, contraindications, and dosing errors. Experiments on 32,000 de-identified pediatric dental visit records demonstrate the effectiveness of the proposed approach. Compared with a domain-adapted Llama-2 clinical baseline, KG-LLM improves record-understanding performance (F1: 0.914 vs. 0.867), drug-dose-duration accuracy (Top-1: 0.782 vs. 0.716), and reduces unsafe antibiotic suggestions by 50%. Additional evaluation across summary quality, recommendation accuracy, and global safety scores further confirms the robustness of the system. Ablation analyses indicate that the knowledge graph, RAG, and safety modules each contribute substantially to clinical reliability and interpretability.

[NLP-35] ORCA: Open-ended Response Correctness Assessment for Audio Question Answering

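Quick read: This paper addresses the difficulty of evaluating open-ended responses from large audio language models (LALMs): human annotators genuinely disagree about correctness because of multiple valid interpretations, partial correctness, and subjective judgment, and traditional metrics that report only a mean score cannot capture this uncertainty. The key contribution is ORCA (Open-ended Response Correctness Assessment), a framework that models the variability in human judgments with Beta distributions and so predicts both expected correctness and uncertainty. Its three-stage annotation process combines human judgment, structured feedback, and iterative refinement to collect high-quality annotations while improving benchmark quality. ORCA reaches a Spearman correlation of 0.91 with mean human judgments while requiring significantly less compute than LLM-judge baselines.

Link: https://arxiv.org/abs/2512.09066
Authors: Šimon Sedláček, Sara Barahona, Bolaji Yusuf, Laura Herrera-Alarcón, Santosh Kesiraju, Cecilia Bolaños, Alicia Lozano-Diez, Sathvik Udupa, Fernando López, Allison Ferner, Ramani Duraiswami, Jan Černocký
Affiliations: Speech@FIT, Brno University of Technology, Czechia; Universidad Autónoma de Madrid, Spain; University of Buenos Aires, Argentina; Tufts University, USA; University of Maryland, USA
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: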

Abstract:Evaluating open-ended responses from large audio language models (LALMs) is challenging because human annotators often genuinely disagree on answer correctness due to multiple valid interpretations, partial correctness, and subjective judgment. Traditional metrics reporting only mean scores fail to capture this uncertainty. We present ORCA (Open-ended Response Correctness Assessment), a framework that models the variability in human judgments using Beta distributions to predict both expected correctness and uncertainty. Our three-stage annotation framework combines human judgment with structured feedback and iterative refinement to simultaneously curate training data and improve benchmark quality. We collected 11,721 annotations across 3,580 question-answer pairs from 15 LALMs on two audio QA benchmarks, achieving inter-annotator agreement of 0.82 (Krippendorff’s alpha). ORCA achieves 0.91 Spearman correlation with mean human judgments, matching or outperforming LLM-judge baselines while providing uncertainty estimates and requiring significantly less compute. We release our models, code, and curated dataset.
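
To make the role of the Beta distribution concrete, here is a minimal sketch of how a Beta(alpha, beta) distribution summarizes disagreeing judgments as an expected correctness plus an uncertainty. It is not ORCA's model, which predicts the distribution from the question and response; the vote-counting step, the uniform prior, and all function names are assumptions made for illustration.

```python
def beta_from_votes(votes, prior_alpha=1.0, prior_beta=1.0):
    """Turn binary annotator votes (1 = correct, 0 = incorrect) into Beta parameters.

    Starts from a Beta(prior_alpha, prior_beta) prior; the uniform default
    is an assumption of this sketch.
    """
    correct = sum(1 for vote in votes if vote == 1)
    incorrect = len(votes) - correct
    return prior_alpha + correct, prior_beta + incorrect


def beta_mean_and_variance(alpha, beta):
    """Expected correctness and its variance under Beta(alpha, beta)."""
    if alpha <= 0 or beta <= 0:
        raise ValueError("alpha and beta must be positive")
    total = alpha + beta
    mean = alpha / total
    variance = (alpha * beta) / (total ** 2 * (total + 1))
    return mean, variance


# Hypothetical votes: annotators mostly agree on the first answer and split on the second.
for votes in ([1, 1, 1, 1, 1, 0], [1, 0, 1, 0, 1, 0]):
    alpha, beta = beta_from_votes(votes)
    print(votes, beta_mean_and_variance(alpha, beta))
```

Reporting the variance alongside the mean is what lets an evaluator separate answers that annotators consistently rate as partly correct from answers on which they split.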

[NLP-36] Luxical: High-Speed Lexical-Dense Text Embeddings

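Quick read: This paper addresses the tension between efficiency and flexibility when organizing web-scale text corpora to improve language model training data. Current tools trade speed against functionality: lexical classifiers such as FastText are fast but produce only classification scores, whereas transformer-based dense embedding models support clustering, classification, retrieval, and other downstream tasks but are too expensive to run at web scale. The proposed solution is Luxical, a high-throughput "lexical-dense" text embedding library. It combines sparse TF-IDF features, a small ReLU network, and a knowledge distillation training regimen to approximate large transformer embedding models at a fraction of their compute cost. In document retrieval and language model data curation tasks, Luxical achieves speedups of 3x to 100x while matching the quality of neural baselines.

Link: https://arxiv.org/abs/2512.09015
Authors: DatologyAI: Luke Merrick, Alex Fang, Aldo Carranza, Alvin Deng, Amro Abbas, Brett Larsen, Cody Blakeney, Darren Teh, David Schwab, Fan Pan, Haakon Mongstad, Haoli Yin, Jack Urbanek, Jason Lee, Jason Telanoff, Josh Wills, Kaleigh Mentzer, Paul Burstein, Parth Doshi, Paul Burnstein, Pratyush Maini, Ricardo Monti, Rishabh Adiga, Scott Loftin, Siddharth Joshi, Spandan Das, Tony Jiang, Vineeth Dorma, Zhengping Wang, Bogdan Gaza, Ari Morcos, Matthew Leavitt
Affiliations: unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 9 pages, 6 figures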

Abstract:Frontier language model quality increasingly hinges on our ability to organize web-scale text corpora for training. Today’s dominant tools trade off speed and flexibility: lexical classifiers (e.g., FastText) are fast but limited to producing classification output scores, while the vector-valued outputs of transformer text embedding models flexibly support numerous workflows (e.g., clustering, classification, and retrieval) but are computationally expensive to produce. We introduce Luxical, a library for high-speed “lexical-dense” text embeddings that aims to recover the best properties of both approaches for web-scale text organization. Luxical combines sparse TF–IDF features, a small ReLU network, and a knowledge distillation training regimen to approximate large transformer embedding models at a fraction of their operational cost. In this technical report, we describe the Luxical architecture and training objective and evaluate a concrete Luxical model in two disparate applications: a targeted webcrawl document retrieval test and an end-to-end language model data curation task grounded in text classification. In these tasks we demonstrate speedups ranging from 3x to 100x over varying-sized neural baselines, and comparable to FastText model inference during the data curation task. On these evaluations, the tested Luxical model illustrates favorable compute/quality trade-offs for large-scale text organization, matching the quality of neural baselines. Luxical is available as open-source software at this https URL.
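
The three ingredients named in the abstract (TF-IDF features, a small ReLU network, and a distillation objective) can be sketched in plain Python as follows. This is not the Luxical library's API or training code: the function names, the smoothed IDF variant, the dense vectors, and the cosine-based loss are all assumptions chosen to keep the illustration short.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """TF-IDF vectors over a whitespace-tokenized toy corpus.

    A real system would keep these vectors sparse; dense lists keep the
    sketch short.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    n_docs = len(docs)
    # Smoothed inverse document frequency (one common variant, assumed here).
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0 for tok in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[tok] * idf[tok] for tok in vocab])
    return vocab, vectors


def relu_embed(features, hidden_weights, output_weights):
    """One hidden ReLU layer followed by a linear projection.

    Each weight matrix is a list of rows, one row per output unit.
    """
    hidden = [max(0.0, sum(f * w for f, w in zip(features, row))) for row in hidden_weights]
    return [sum(h * w for h, w in zip(hidden, row)) for row in output_weights]


def cosine_distillation_loss(student, teacher):
    """1 - cosine similarity between a student and a teacher embedding."""
    dot = sum(s * t for s, t in zip(student, teacher))
    norm_s = math.sqrt(sum(s * s for s in student))
    norm_t = math.sqrt(sum(t * t for t in teacher))
    if norm_s == 0.0 or norm_t == 0.0:
        return 1.0
    return 1.0 - dot / (norm_s * norm_t)
```

In a distillation setup of this kind, the weights of `relu_embed` would be trained so that the loss against embeddings from a large transformer teacher is small; inference then needs only tokenization and two small matrix products.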

[NLP-37] Financial Instruction Following Evaluation (FIFE) NEURIPS2025

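Quick read: This paper addresses the poor performance of language models (LMs) on complex, interdependent instructions in financial analysis, a high-stakes domain that demands precision. The key contribution is FIFE, a new high-difficulty benchmark of 88 human-authored financial analysis prompts, together with a verification system of chainable, verifiable constraints that provides fine-grained reward signals and so allows a more precise assessment of instruction following. A zero-shot evaluation of 53 models of different types exposes the limits of current state-of-the-art models on complex financial tasks, and the dataset and code are released as an open-source resource for reinforcement learning research in finance.

Link: https://arxiv.org/abs/2512.08965
Authors: Glenn Matlin, Siddharth, Anirudh JM, Aditya Shukla, Yahya Hassan, Sudheer Chava
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted at NeurIPS 2025 Generative AI in Finance Workshop (GenAI Finance), San Diego. Camera-ready version. Code and data: this https URL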

Abstract:Language Models (LMs) struggle with complex, interdependent instructions, particularly in high-stakes domains like finance where precision is critical. We introduce FIFE, a novel, high-difficulty benchmark designed to assess LM instruction-following capabilities for financial analysis tasks. FIFE comprises 88 human-authored prompts and employs a verification system with chainable, verifiable constraints for fine-grained reward signals. We evaluate 53 models (proprietary, open-weight, open-source) in a zero-shot setting. Our key findings reveal a clear performance hierarchy: the top open-weight model (76.1 strict / 79.5 loose) surpasses the leading proprietary system (65.9 strict / 70.5 loose), while the best open-source models lag significantly (45.5 strict / 48.9 loose). However, even top-performing models struggle with FIFE’s complex requirements, failing to achieve perfect compliance. We release our dataset and code as an open-source resource to promote research in Reinforcement Learning for the financial domain.
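
The idea of chaining verifiable constraints to obtain a fine-grained signal can be illustrated with a small sketch. This is not FIFE's verifier: the two constraint factories, the example prompt requirements, and the result fields are hypothetical, and the strict and loose scores reported in the abstract are not reproduced because their definitions are not given there.

```python
def max_words(limit):
    """Constraint factory: the response has at most `limit` words."""
    return lambda response: len(response.split()) <= limit


def mentions(term):
    """Constraint factory: the response mentions `term` (case-insensitive)."""
    return lambda response: term.lower() in response.lower()


def verify(response, constraints):
    """Check a response against a chain of (name, check) constraints.

    Returns per-constraint results, an all-or-nothing flag, and the
    fraction of constraints satisfied as a finer-grained signal.
    """
    results = {name: check(response) for name, check in constraints}
    passed = sum(results.values())
    return {
        "results": results,
        "all_passed": passed == len(constraints),
        "fraction_passed": passed / len(constraints) if constraints else 1.0,
    }


# Hypothetical instruction: "Summarize in at most 12 words and mention EBITDA."
chain = [("length", max_words(12)), ("mentions_ebitda", mentions("EBITDA"))]
print(verify("EBITDA margin widened on lower input costs.", chain))
print(verify("Margins widened on lower input costs.", chain))
```

Because each constraint is a plain function of the response, new requirements can be appended to the chain without changing the scoring code, and the fraction satisfied can serve as a reward signal.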

[NLP-38] Resolving Conflicts in Lifelong Learning via Aligning Updates in Subspaces

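Quick read: This paper addresses catastrophic forgetting in Low-Rank Adaptation (LoRA) under continual learning, which is caused by destructive gradient interference between tasks. The analysis shows that the degradation is driven mainly by new-task gradients that directly oppose the historical weight trajectory. The proposed solution is PS-LoRA (Parameter Stability LoRA). Its core is a dual-regularization objective that aligns update directions within the optimization subspace by penalizing conflicting directions and constraining magnitude deviations, which keeps updates consistent with prior knowledge. A magnitude-based merging strategy additionally consolidates sequential adapters into a robust representation without retraining, improving stability and adaptation efficiency across tasks.

Link: https://arxiv.org/abs/2512.08960
Authors: Yueer Zhou, Yichen Wu, Ying Wei
Affiliations: Zhejiang University; Harvard University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: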

Abstract:Low-Rank Adaptation (LoRA) enables efficient Continual Learning but often suffers from catastrophic forgetting due to destructive interference between tasks. Our analysis reveals that this degradation is primarily driven by antagonistic directional updates where new task gradients directly oppose the historical weight trajectory. To address this, we propose PS-LoRA (Parameter Stability LoRA), a framework designed to resolve conflicts by aligning updates within the optimization subspace. Our approach employs a dual-regularization objective that penalizes conflicting directions and constrains magnitude deviations to ensure consistency with prior knowledge. Additionally, we implement a magnitude-based merging strategy to consolidate sequential adapters into a robust representation without retraining. Experiments on NLP and Vision benchmarks show that PS-LoRA outperforms state-of-the-art methods by preserving the stability of learned representations while efficiently adapting to new domains.
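
One way to picture a dual regularizer of the kind described above is a penalty with a direction term and a magnitude term. The sketch below is an assumed form for illustration only, not PS-LoRA's actual objective: the cosine-based direction term, the squared norm difference, the weights, and the function name are all hypothetical.

```python
import math


def stability_penalty(new_update, past_update, direction_weight=1.0, magnitude_weight=1.0):
    """Penalty on a flattened low-rank update relative to the past trajectory.

    direction term: max(0, -cosine), positive only when the new update
                    points against the past update.
    magnitude term: squared difference between the two update norms.
    Both terms are illustrative assumptions, not the paper's loss.
    """
    dot = sum(a * b for a, b in zip(new_update, past_update))
    norm_new = math.sqrt(sum(a * a for a in new_update))
    norm_past = math.sqrt(sum(b * b for b in past_update))
    if norm_new == 0.0 or norm_past == 0.0:
        cosine = 0.0  # no direction to compare
    else:
        cosine = dot / (norm_new * norm_past)
    direction_term = max(0.0, -cosine)
    magnitude_term = (norm_new - norm_past) ** 2
    return direction_weight * direction_term + magnitude_weight * magnitude_term


past = [1.0, 0.0]
print(stability_penalty([1.0, 0.0], past))   # same direction and size
print(stability_penalty([-1.0, 0.0], past))  # opposing direction
print(stability_penalty([2.0, 0.0], past))   # same direction, larger magnitude
```

In training, a term like this would be added to the new task's loss so that updates opposing earlier tasks, or drifting far in scale, cost more than aligned ones.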

[NLP-39] Large Language Models as Search Engines: Societal Challenges

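Quick read: This paper examines the societal challenges that could arise if large language models (LLMs) replace search engines as the primary portal to information on the Web. It focuses on three roles, LLM providers, content creators, and end users, identifies 15 types of challenges, and presents current mitigation strategies from both the technical and the legal perspective. The core of the approach is a systematic analysis of each stakeholder's responsibilities and influence, a discussion of the potential societal impact of each challenge, and the resulting directions for future research, with the aim of supporting trustworthy, fair, and transparent information access through LLMs.

Link: https://arxiv.org/abs/2512.08946
Authors: Zacchary Sadeddine, Winston Maxwell, Gaël Varoquaux, Fabian M. Suchanek
Affiliations: unknown
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments: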

Abstract:Large Language Models (LLMs) may one day replace search engines as the primary portal to information on the Web. In this article, we investigate the societal challenges that such a change could bring. We focus on the roles of LLM Providers, Content Creators, and End Users, and identify 15 types of challenges. With each, we show current mitigation strategies – both from the technical perspective and the legal perspective. We also discuss the impact of each challenge and point out future research opportunities.

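[NLP-40] The Linguistic Architecture of Reflective Thought: Evaluation of a Large Language Model as a Tool to Isolate the Formal Structure of Mentalization

Quick read: This paper investigates whether large language models (LLMs) can generate text with the structure of mentalization, and in particular whether they can reproduce the linguistic features defined by the Mentalization-Based Treatment (MBT) framework. The central question is whether an LLM can produce clinically meaningful mentalizing language along the four MBT axes (implicit-explicit, self-other, integration of internal states, and integration of external contexts). To answer it, fifty dialogues were generated between human participants and an LLM configured in standard mode; five MBT-trained psychiatrists, working under blinded conditions, then rated the resulting mentalization profiles for evaluative coherence, argumentative coherence, and global quality, quantifying the model's structural abilities and limits in imitating human mentalization.

Link: https://arxiv.org/abs/2512.08945
Authors: Stefano Epifani (1, 2), Giuliano Castigliego (2, 3), Laura Kecskemeti (4), Giuliano Razzicchia (2), Elisabeth Seiwald-Sonderegger (4)
Affiliations: (1) University of Pavia, Italy; (2) Digital Transformation Institute, Italy; (3) Psychoanalytic Academy of Italian-Speaking Switzerland; (4) Psychiatric Services of the Canton of Grisons, Switzerland
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 18 pages, 1 table, Project coordinator: Stefano Epifani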

Abstract:Background: Mentalization integrates cognitive, affective, and intersubjective components. Large Language Models (LLMs) display an increasing ability to generate reflective texts, raising questions regarding the relationship between linguistic form and mental representation. This study assesses the extent to which a single LLM can reproduce the linguistic structure of mentalization according to the parameters of Mentalization-Based Treatment (MBT). Methods: Fifty dialogues were generated between human participants and an LLM configured in standard mode. Five psychiatrists trained in MBT, working under blinded conditions, evaluated the mentalization profiles produced by the model along the four MBT axes, assigning Likert-scale scores for evaluative coherence, argumentative coherence, and global quality. Inter-rater agreement was estimated using ICC(3,1). Results: Mean scores (3.63-3.98) and moderate standard deviations indicate a high level of structural coherence in the generated profiles. ICC values (0.60-0.84) show substantial-to-high agreement among raters. The model proved more stable in the Implicit-Explicit and Self-Other dimensions, while presenting limitations in the integration of internal states and external contexts. The profiles were coherent and clinically interpretable yet characterized by affective neutrality.
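
For readers unfamiliar with the agreement statistic reported above, the following sketch computes ICC(3,1), the two-way mixed-effects, consistency, single-rater intraclass correlation, from a table of ratings using the standard mean-square formula (MS_rows - MS_error) / (MS_rows + (k - 1) * MS_error). The function name and the example ratings are hypothetical; the study's own data and software are not given in the abstract.

```python
def icc_3_1(ratings):
    """ICC(3,1) for a table with one row per rated item and one column per rater."""
    n = len(ratings)
    if n < 2:
        raise ValueError("need at least two rated items")
    k = len(ratings[0])
    if k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need at least two raters and a full rating table")
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    # Sums of squares for items (rows), raters (columns), and the residual.
    ss_rows = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_total = sum((x - grand_mean) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    denominator = ms_rows + (k - 1) * ms_error
    if denominator == 0:
        raise ValueError("ICC is undefined when all ratings are identical")
    return (ms_rows - ms_error) / denominator


# Hypothetical Likert scores: four profiles rated by three raters.
scores = [
    [4, 4, 5],
    [3, 3, 4],
    [5, 4, 5],
    [2, 3, 3],
]
print(icc_3_1(scores))
```

Because ICC(3,1) measures consistency rather than absolute agreement, a rater who is uniformly one point stricter than the others does not lower the value.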

[NLP-41] Enhancing Reliability across Short and Long-Form QA via Reinforcement Learning

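Quick read: This paper addresses the way reinforcement learning (RL), while improving complex reasoning in large language models, also markedly amplifies hallucination, that is, output that is factually wrong or unsupported, creating a trade-off between capability and reliability. The solution is a targeted RL framework. To reduce extrinsic hallucinations (flawed internal knowledge), it builds a training set from open-ended conversions of TriviaQA; to reduce intrinsic hallucinations (unfaithfulness to context), it uses long-form texts from FineWeb in a fact-grounding reward scheme; and it explicitly rewards the model for refusing to answer unanswerable questions, which makes it more cautious. Experiments show that the method substantially reduces both hallucination types and improves overall performance across multiple benchmarks.

Link: https://arxiv.org/abs/2512.08944
Authors: Yudong Wang, Zhe Yang, Wenhan Ma, Zhifang Sui, Liang Zhao
Affiliations: Peking University; Xiaomi
Categories: Computation and Language (cs.CL)
Comments: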

Abstract:While reinforcement learning has unlocked unprecedented complex reasoning in large language models, it has also amplified their propensity for hallucination, creating a critical trade-off between capability and reliability. This work confronts this challenge by introducing a targeted RL framework designed to mitigate both intrinsic and extrinsic hallucinations across short and long-form question answering. We address extrinsic hallucinations (flawed internal knowledge) by creating a novel training set from open-ended conversions of TriviaQA. Concurrently, we tackle intrinsic hallucinations (unfaithfulness to context) by leveraging long-form texts from FineWeb in a fact-grounding reward scheme. To further bolster reliability, our framework explicitly rewards the model for refusing to answer unanswerable questions, thereby cultivating crucial cautiousness. Extensive experiments demonstrate that our methodology yields significant performance gains across a diverse suite of benchmarks, substantially reducing both hallucination types. Ultimately, this research contributes a practical framework for resolving the critical tension between advanced reasoning and factual trustworthiness, paving the way for more capable and reliable large language models.

[NLP-42] Noise-Robust Abstractive Compression in Retrieval-Augmented Language Models

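Quick read: This paper addresses inaccurate answers in retrieval-augmented generation (RAG) caused by retrieved documents that contain irrelevant or incorrect information; abstractive compressors are especially likely to drop key information when the context is long. The proposed solution is Abstractive Compression Robust against Noise (ACoRN), which introduces two training steps. First, offline data augmentation makes the compressor more robust to two types of retrieval noise (irrelevant content and factual errors). Second, fine-tuning focuses the compressor on the key information that directly supports the correct answer, mitigating its limited use of multiple documents and its positional bias. Experiments show that T5-large trained with ACoRN improves EM and F1 while preserving the answer string, with the largest gains on datasets containing many accuracy-reducing documents.

Link: https://arxiv.org/abs/2512.08943
Authors: Singon Kim
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Master’s thesis, Korea University, 2025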

Abstract:Abstractive compression utilizes smaller language models to condense query-relevant context, reducing computational costs in retrieval-augmented generation (RAG). However, retrieved documents often include information that is either irrelevant to answering the query or misleading due to factually incorrect content, despite having high relevance scores. This behavior indicates that abstractive compressors are more likely to omit important information essential for the correct answer, especially in long contexts where attention dispersion occurs. To address this issue, we categorize retrieved documents in a more fine-grained manner and propose Abstractive Compression Robust against Noise (ACoRN), which introduces two novel training steps. First, we use offline data augmentation on the training dataset to enhance compressor robustness against two distinct types of retrieval noise. Second, since the language-model-based compressor cannot fully utilize information from multiple retrieved documents and exhibits positional bias, we perform finetuning to generate summaries centered around key information that directly supports the correct answer. Our experiments demonstrate that T5-large, trained with ACoRN as a compressor, improves EM and F1 scores while preserving the answer string, which could serve as direct evidence. ACoRN excels on datasets with many accuracy-reducing documents, making it highly useful in real-world scenarios.

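[NLP-43] Don't Throw Away Your Beams: Improving Consistency-based Uncertainties in LLMs via Beam Search

Quick read: This paper addresses the stability and accuracy of uncertainty quantification (UQ) for large language models, especially in short-form question answering (QA). Conventional consistency-based methods rely on multinomial sampling, which tends to produce duplicate generations when the output distribution is peaked and whose randomness causes high variance in uncertainty estimates across runs. The key idea is to generate the candidates for consistency-based UQ with beam search instead, which improves performance and reduces variance. The authors also give a theoretical lower bound on the beam set probability mass under which beam search achieves a smaller error than multinomial sampling, and experiments on six QA datasets show state-of-the-art UQ performance.

Link: https://arxiv.org/abs/2512.09538
Authors: Ekaterina Fadeeva, Maiya Goloburda, Aleksandr Rubashevskii, Roman Vashurin, Artem Shelmanov, Preslav Nakov, Mrinmaya Sachan, Maxim Panov
Affiliations: National Research University Higher School of Economics; Yandex
Categories: Machine Learning (stat.ML); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: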

Abstract:Consistency-based methods have emerged as an effective approach to uncertainty quantification (UQ) in large language models. These methods typically rely on several generations obtained via multinomial sampling, measuring their agreement level. However, in short-form QA, multinomial sampling is prone to producing duplicates due to peaked distributions, and its stochasticity introduces considerable variance in uncertainty estimates across runs. We introduce a new family of methods that employ beam search to generate candidates for consistency-based UQ, yielding improved performance and reduced variance compared to multinomial sampling. We also provide a theoretical lower bound on the beam set probability mass under which beam search achieves a smaller error than multinomial sampling. We empirically evaluate our approach on six QA datasets and find that its consistent improvements over multinomial sampling lead to state-of-the-art UQ performance.
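
A minimal sketch of a consistency-style uncertainty score computed over beam candidates follows. It assumes the decoder has already returned distinct candidate answers with their sequence probabilities; the weighting by normalized beam probability, the exact-match similarity, and the function names are assumptions for illustration and are not the estimators defined in the paper.

```python
def exact_match(a, b):
    """Toy similarity: 1.0 if two answers match after normalization, else 0.0."""
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0


def consistency_uncertainty(candidates, probabilities, similarity=exact_match):
    """1 minus the probability-weighted agreement of all candidates with the top one.

    `candidates` are distinct beam outputs and `probabilities` their
    sequence probabilities, which are renormalized over the beam set.
    """
    if not candidates or len(candidates) != len(probabilities):
        raise ValueError("need one probability per candidate")
    total = sum(probabilities)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    top_index = max(range(len(candidates)), key=lambda i: probabilities[i])
    top = candidates[top_index]
    agreement = sum(
        (p / total) * similarity(top, cand)
        for cand, p in zip(candidates, probabilities)
    )
    return 1.0 - agreement


# Hypothetical beam outputs for a short-form question.
beams = ["Paris", "Lyon", "Marseille"]
beam_probs = [0.6, 0.3, 0.1]
print(consistency_uncertainty(beams, beam_probs))
```

Since beam search is deterministic and returns distinct sequences, repeated runs give the same score and no candidate slot is spent on duplicates, which is the contrast with multinomial sampling drawn in the abstract.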

Computer Vision

[CV-0] GAINS: Gaussian-based Inverse Rendering from Sparse Multi-View Captures

Quick read: This paper addresses the sharp degradation of Gaussian Splatting-based inverse rendering in sparse multi-view settings, where limited observations create severe ambiguity between geometry, reflectance, and lighting. The solution is GAINS (Gaussian-based Inverse rendering from Sparse multi-view captures), a two-stage inverse rendering framework. The first stage stabilizes geometry reconstruction with monocular depth/normal and diffusion priors; the second stage regularizes material recovery with segmentation, intrinsic image decomposition (IID), and diffusion priors. Together they improve material parameter accuracy, relighting quality, and novel-view synthesis under sparse views.

Link: https://arxiv.org/abs/2512.09925
Authors: Patrick Noras, Jun Myeong Choi, Didier Stricker, Pieter Peers, Roni Sengupta
Affiliations: University Kaiserslautern-Landau; German Research Center for Artificial Intelligence; University of North Carolina at Chapel Hill; College of William & Mary
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 18 figures

Abstract:Recent advances in Gaussian Splatting-based inverse rendering extend Gaussian primitives with shading parameters and physically grounded light transport, enabling high-quality material recovery from dense multi-view captures. However, these methods degrade sharply under sparse-view settings, where limited observations lead to severe ambiguity between geometry, reflectance, and lighting. We introduce GAINS (Gaussian-based Inverse rendering from Sparse multi-view captures), a two-stage inverse rendering framework that leverages learning-based priors to stabilize geometry and material estimation. GAINS first refines geometry using monocular depth/normal and diffusion priors, then employs segmentation, intrinsic image decomposition (IID), and diffusion priors to regularize material recovery. Extensive experiments on synthetic and real-world datasets show that GAINS significantly improves material parameter accuracy, relighting quality, and novel-view synthesis compared to state-of-the-art Gaussian-based inverse rendering methods, especially under sparse-view settings. Project page: this https URL

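[CV-1] ReViSE: Towards Reason-Informed Video Editing in Unified Models with Self-Reflective Learning

Quick read: This paper addresses the lack of reasoning-driven editing ability in unified video models, which perform poorly on edits that involve physical plausibility and causal dynamics. Existing datasets are inadequate for training and evaluating reasoning-aware video editing, and the models' reasoning and editing modules are disconnected, so their strong vision-language understanding cannot effectively guide the editing process. The solution is ReViSE, a Self-Reflective Reasoning (SRF) framework that unifies generation and evaluation in a single architecture: the model's internal vision-language model (VLM) provides intrinsic feedback by judging whether the edited video logically satisfies the instruction, and this differential feedback refines the generator's reasoning behavior during training, connecting reasoning with visual transformation.

Link: https://arxiv.org/abs/2512.09924
Authors: Xinyu Liu, Hangjie Yuan, Yujie Wei, Jiazheng Xing, Yujin Han, Jiahao Pan, Yanbiao Ma, Chi-Min Chan, Kang Zhao, Shiwei Zhang, Wenhan Luo, Yike Guo
Affiliations: HKUST; ZJU; FDU; HKU; RUC; Tongyi Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: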

Abstract:Video unified models exhibit strong capabilities in understanding and generation, yet they struggle with reason-informed visual editing even when equipped with powerful internal vision-language models (VLMs). We attribute this gap to two factors: 1) existing datasets are inadequate for training and evaluating reasoning-aware video editing, and 2) an inherent disconnect between the models’ reasoning and editing capabilities, which prevents the rich understanding from effectively instructing the editing process. Bridging this gap requires an integrated framework that connects reasoning with visual transformation. To address this gap, we introduce the Reason-Informed Video Editing (RVE) task, which requires reasoning about physical plausibility and causal dynamics during editing. To support systematic evaluation, we construct RVE-Bench, a comprehensive benchmark with two complementary subsets: Reasoning-Informed Video Editing and In-Context Video Generation. These subsets cover diverse reasoning dimensions and real-world editing scenarios. Building upon this foundation, we propose ReViSE, a Self-Reflective Reasoning (SRF) framework that unifies generation and evaluation within a single architecture. The model’s internal VLM provides intrinsic feedback by assessing whether the edited video logically satisfies the given instruction. This differential feedback refines the generator’s reasoning behavior during training. Extensive experiments on RVE-Bench demonstrate that ReViSE significantly enhances editing accuracy and visual fidelity, achieving a 32% improvement of the Overall score in the reasoning-informed video editing subset over state-of-the-art methods.

[CV-2] Splatent: Splatting Diffusion Latents for Novel View Synthesis

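Quick read: This paper addresses the lack of multi-view consistency in radiance field representations built in the latent space of variational autoencoders (VAEs), which leads to blurred textures and missing details in 3D reconstruction. Existing methods either fine-tune the VAE, at the cost of reconstruction quality, or rely on pretrained diffusion models to recover fine-grained details, at the risk of hallucinations. The proposed solution is Splatent, a diffusion-based enhancement framework that operates on top of 3D Gaussian Splatting (3DGS) in the VAE latent space. Rather than reconstructing details in 3D space, it recovers them in 2D from the input views through multi-view attention, which preserves the reconstruction quality of the pretrained VAE while achieving faithful detail recovery.

Link: https://arxiv.org/abs/2512.09923
Authors: Or Hirschorn, Omer Sela, Inbar Huberman-Spiegelglas, Netalee Efrat, Eli Alshan, Ianir Ideses, Frederic Devernay, Yochai Zvik, Lior Fritz
Affiliations: Amazon Prime Video; Tel-Aviv University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: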

Abstract:Radiance field representations have recently been explored in the latent space of VAEs that are commonly used by diffusion models. This direction offers efficient rendering and seamless integration with diffusion-based pipelines. However, these methods face a fundamental limitation: The VAE latent space lacks multi-view consistency, leading to blurred textures and missing details during 3D reconstruction. Existing approaches attempt to address this by fine-tuning the VAE, at the cost of reconstruction quality, or by relying on pre-trained diffusion models to recover fine-grained details, at the risk of some hallucinations. We present Splatent, a diffusion-based enhancement framework designed to operate on top of 3D Gaussian Splatting (3DGS) in the latent space of VAEs. Our key insight departs from the conventional 3D-centric view: rather than reconstructing fine-grained details in 3D space, we recover them in 2D from input views through multi-view attention mechanisms. This approach preserves the reconstruction quality of pretrained VAEs while achieving faithful detail recovery. Evaluated across multiple benchmarks, Splatent establishes a new state-of-the-art for VAE latent radiance field reconstruction. We further demonstrate that integrating our method with existing feed-forward frameworks consistently improves detail preservation, opening new possibilities for high-quality sparse-view 3D reconstruction.

[CV-3] LISN: Language-Instructed Social Navigation with VLM-based Controller Modulating

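Quick read: This paper addresses socially aware navigation for mobile robots that share space with humans. Existing work focuses mainly on path efficiency and pedestrian collision avoidance and neglects both compliance with users' language instructions and alignment with task goals in complex scenes. The authors present LISN-Bench, the first simulation-based benchmark for language-instructed social navigation, which covers instruction following and scene understanding across diverse contexts. Their method, Social-Nav-Modulator, is a fast-slow hierarchical system in which a vision-language model (VLM) agent dynamically modulates costmaps and controller parameters. This decouples low-level action generation from the slower VLM loop, reducing reliance on high-frequency VLM inference while improving dynamic avoidance and perception adaptability. The method achieves an average success rate of 91.3% and outperforms the most competitive baseline by more than 63%.

Link: https://arxiv.org/abs/2512.09920
Authors: Junting Chen, Yunchuan Li, Panfeng Jiang, Jiacheng Du, Zixuan Chen, Chenrui Tie, Jiajun Deng, Lin Shao
Affiliations: National University of Singapore; RoboScience Co.; ShanghaiTech University; Nanjing University; University of Science and Technology of China
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages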

Abstract:Towards human-robot coexistence, socially aware navigation is significant for mobile robots. Yet existing studies on this area focus mainly on path efficiency and pedestrian collision avoidance, which are essential but represent only a fraction of social navigation. Beyond these basics, robots must also comply with user instructions, aligning their actions to task goals and social norms expressed by humans. In this work, we present LISN-Bench, the first simulation-based benchmark for language-instructed social navigation. Built on Rosnav-Arena 3.0, it is the first standardized social navigation benchmark to incorporate instruction following and scene understanding across diverse contexts. To address this task, we further propose Social-Nav-Modulator, a fast-slow hierarchical system where a VLM agent modulates costmaps and controller parameters. Decoupling low-level action generation from the slower VLM loop reduces reliance on high-frequency VLM inference while improving dynamic avoidance and perception adaptability. Our method achieves an average success rate of 91.3%, which is greater than 63% than the most competitive baseline, with most of the improvements observed in challenging tasks such as following a person in a crowd and navigating while strictly avoiding instruction-forbidden regions. The project website is at: this https URL

[CV-4] NordFKB: a fine-grained benchmark dataset for geospatial AI in Norway

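[Quick read]: This paper addresses the lack of high-quality, finely annotated datasets for geospatial AI research, particularly for Norway, a region with diverse geographic characteristics. The key to the solution is the NordFKB dataset, built from the authoritative, highly accurate national Felles KartdataBase (FKB). It contains high-resolution orthophotos from seven geographically diverse areas, paired with annotations for 36 semantic classes as per-class binary segmentation masks (GeoTIFF format) and COCO-style bounding boxes, with expert review and quality control to ensure annotation accuracy. The authors also release standardized evaluation protocols and a tool repository that support reproducible semantic segmentation and object detection benchmarks, providing a reliable data foundation and validation platform for mapping, land administration, and spatial planning.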

Link: https://arxiv.org/abs/2512.09913
Authors: Sander Riisøen Jyhne, Aditya Gupta, Ben Worsley, Marianne Andersen, Ivar Oveland, Alexander Salveson Nossum
Affiliations: Kartverket; University of Agder; Norkart
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 2 figures, 2 tables

Abstract:We present NordFKB, a fine-grained benchmark dataset for geospatial AI in Norway, derived from the authoritative, highly accurate, national Felles KartdataBase (FKB). The dataset contains high-resolution orthophotos paired with detailed annotations for 36 semantic classes, including both per-class binary segmentation masks in GeoTIFF format and COCO-style bounding box annotations. Data is collected from seven geographically diverse areas, ensuring variation in climate, topography, and urbanization. Only tiles containing at least one annotated object are included, and training/validation splits are created through random sampling across areas to ensure representative class and context distributions. Human expert review and quality control ensure high annotation accuracy. Alongside the dataset, we release a benchmarking repository with standardized evaluation protocols and tools for semantic segmentation and object detection, enabling reproducible and comparable research. NordFKB provides a robust foundation for advancing AI methods in mapping, land administration, and spatial planning, and paves the way for future expansions in coverage, temporal scope, and data modalities.
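
The abstract pairs per-class binary masks with COCO-style bounding boxes. As a rough illustration of how the two annotation types relate, the sketch below derives a COCO-style box (`[x, y, width, height]`, with `x, y` the top-left corner) from a binary mask. The function name `mask_to_coco_bbox` and the toy tile are hypothetical, and this is not the authors' annotation pipeline, which the abstract does not describe.

```python
def mask_to_coco_bbox(mask):
    """Return [x, y, width, height] enclosing the foreground of a binary mask.

    `mask` is a list of rows, each a list of 0/1 values.
    Returns None when the mask has no foreground pixel.
    """
    # Row indices that contain at least one foreground pixel.
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    # Column indices of every foreground pixel.
    cols = [c for row in mask for c, v in enumerate(row) if v]
    x_min, x_max = min(cols), max(cols)
    y_min, y_max = rows[0], rows[-1]
    return [x_min, y_min, x_max - x_min + 1, y_max - y_min + 1]


# Hypothetical 4x5 tile with one small object.
tile = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
bbox = mask_to_coco_bbox(tile)
print(bbox)
```

An empty result (`None`) also gives a simple way to apply the "at least one annotated object" rule the abstract mentions when filtering tiles.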

[CV-5] VisualActBench: Can VLMs See and Act like a Human?

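[Quick read]: This paper addresses the limited ability of current vision-language models (VLMs) to reason and act proactively from visual input alone, that is, to generate forward-looking actions that match human values and priorities without explicit textual prompts. The key to the solution is a new benchmark, VisualActBench, comprising 1,074 videos and 3,733 human-annotated actions across four real-world scenarios. Each action is labeled with an Action Prioritization Level (APL) and a proactive-reactive type, allowing systematic evaluation of human-aligned reasoning and value sensitivity. Evaluating 29 VLMs on the benchmark, the study finds clear shortcomings in interpreting complex context, anticipating outcomes, and aligning with human decision-making frameworks, and it provides a quantitative basis for assessing the real-world readiness of vision-centric AI agents.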

Link: https://arxiv.org/abs/2512.09907
Authors: Daoan Zhang, Pai Liu, Xiaofei Zhou, Yuan Ge, Guangchen Lan, Jing Bi, Christopher Brinton, Ehsan Hoque, Jiebo Luo
Affiliations: University of Rochester; Purdue University; Northeastern University; Meta
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision-Language Models (VLMs) have achieved impressive progress in perceiving and describing visual environments. However, their ability to proactively reason and act based solely on visual inputs, without explicit textual prompts, remains underexplored. We introduce a new task, Visual Action Reasoning, and propose VisualActBench, a large-scale benchmark comprising 1,074 videos and 3,733 human-annotated actions across four real-world scenarios. Each action is labeled with an Action Prioritization Level (APL) and a proactive-reactive type to assess models’ human-aligned reasoning and value sensitivity. We evaluate 29 VLMs on VisualActBench and find that while frontier models like GPT4o demonstrate relatively strong performance, a significant gap remains compared to human-level reasoning, particularly in generating proactive, high-priority actions. Our results highlight limitations in current VLMs’ ability to interpret complex context, anticipate outcomes, and align with human decision-making frameworks. VisualActBench establishes a comprehensive foundation for assessing and improving the real-world readiness of proactive, vision-centric AI agents.

[CV-6] YOPO-Nav: Visual Navigation using 3DGS Graphs from One-Pass Videos

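[Quick read]: This paper addresses the high computational cost and memory footprint of building and maintaining the detailed 3D maps that traditional robot navigation relies on, and proposes a visual navigation method based on exploration videos. The key to the solution is the YOPO-Nav framework, which encodes an environment into a compact spatial representation made of interconnected local 3D Gaussian Splatting (3DGS) models and navigates with a hierarchical design: a visual place recognition (VPR) module first provides coarse localization, and the local 3DGS models then refine the goal and intermediate poses to generate control actions that guide the robot back toward the demonstrated trajectory. The method achieves robust image-goal navigation without metric maps and performs well in real-world scenes.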

Link: https://arxiv.org/abs/2512.09903
Authors: Ryan Meegan, Adam D’Souza, Bryan Bo Cao, Shubham Jain, Kristin Dana
Affiliations: Rutgers University; Stony Brook University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Visual navigation has emerged as a practical alternative to traditional robotic navigation pipelines that rely on detailed mapping and path planning. However, constructing and maintaining 3D maps is often computationally expensive and memory-intensive. We address the problem of visual navigation when exploration videos of a large environment are available. The videos serve as a visual reference, allowing a robot to retrace the explored trajectories without relying on metric maps. Our proposed method, YOPO-Nav (You Only Pass Once), encodes an environment into a compact spatial representation composed of interconnected local 3D Gaussian Splatting (3DGS) models. During navigation, the framework aligns the robot’s current visual observation with this representation and predicts actions that guide it back toward the demonstrated trajectory. YOPO-Nav employs a hierarchical design: a visual place recognition (VPR) module provides coarse localization, while the local 3DGS models refine the goal and intermediate poses to generate control actions. To evaluate our approach, we introduce the YOPO-Campus dataset, comprising 4 hours of egocentric video and robot controller inputs from over 6 km of human-teleoperated robot trajectories. We benchmark recent visual navigation methods on trajectories from YOPO-Campus using a Clearpath Jackal robot. Experimental results show YOPO-Nav provides excellent performance in image-goal navigation for real-world scenes on a physical robot. The dataset and code will be made publicly available for visual navigation and scene representation research.
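
The coarse stage of the hierarchy can be pictured as descriptor retrieval. The sketch below is a generic nearest-neighbour lookup by cosine similarity, assuming each graph node stores one global image descriptor; the node names, the 3-dimensional descriptors, and the function `coarse_localize` are all hypothetical, and the abstract does not say which VPR model or similarity measure YOPO-Nav uses.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def coarse_localize(query, node_descriptors):
    """Return the node whose stored descriptor is most similar to the query."""
    return max(node_descriptors, key=lambda n: cosine(query, node_descriptors[n]))


# Hypothetical descriptors for three nodes of the environment graph.
nodes = {
    "lobby": [1.0, 0.0, 0.0],
    "hallway": [0.0, 1.0, 0.0],
    "lab": [0.0, 0.0, 1.0],
}
best = coarse_localize([0.1, 0.9, 0.2], nodes)
print(best)
```

In a two-level design of this kind, only the retrieved node's local model would then be used for fine pose refinement, which keeps the expensive step local.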

[CV-7] Visual Heading Prediction for Autonomous Aerial Vehicles

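[Quick read]: This paper addresses accurate perception and navigation for real-time cooperation between unmanned aerial vehicles (UAVs) and unmanned ground vehicles (UGVs) when external localization infrastructure such as GPS or GNSS is unavailable. The core challenge is to detect the UGV reliably and predict the UAV's heading angle accurately from visual information alone, so that multi-agent systems can coordinate dynamically. The key to the solution is a vision-based, data-driven framework: a fine-tuned YOLOv5 model detects UGVs (95% accuracy in the experiments), and a lightweight artificial neural network (ANN) then estimates the UAV's required heading angle from bounding-box features obtained with a monocular camera. Ground truth comes from a VICON motion capture system, yielding a dataset of more than 13,000 annotated images. Without relying on external localization, the method achieves a mean absolute error of only 0.1506° in heading prediction, demonstrating its practicality and robustness in dynamic scenarios.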

Link: https://arxiv.org/abs/2512.09898
Authors: Reza Ahmari, Ahmad Mohammadi, Vahid Hemmati, Mohammed Mynuddin, Parham Kebria, Mahmoud Nabil Mahmoud, Xiaohong Yuan, Abdollah Homaifar
Affiliations: North Carolina A&T State University
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Comments:

Abstract:The integration of Unmanned Aerial Vehicles (UAVs) and Unmanned Ground Vehicles (UGVs) is increasingly central to the development of intelligent autonomous systems for applications such as search and rescue, environmental monitoring, and logistics. However, precise coordination between these platforms in real-time scenarios presents major challenges, particularly when external localization infrastructure such as GPS or GNSS is unavailable or degraded [1]. This paper proposes a vision-based, data-driven framework for real-time UAV-UGV integration, with a focus on robust UGV detection and heading angle prediction for navigation and coordination. The system employs a fine-tuned YOLOv5 model to detect UGVs and extract bounding box features, which are then used by a lightweight artificial neural network (ANN) to estimate the UAV’s required heading angle. A VICON motion capture system was used to generate ground-truth data during training, resulting in a dataset of over 13,000 annotated images collected in a controlled lab environment. The trained ANN achieves a mean absolute error of 0.1506° and a root mean squared error of 0.1957°, offering accurate heading angle predictions using only monocular camera inputs. Experimental evaluations achieve 95% accuracy in UGV detection. This work contributes a vision-based, infrastructure-independent solution that demonstrates strong potential for deployment in GPS/GNSS-denied environments, supporting reliable multi-agent coordination under realistic dynamic conditions. A demonstration video showcasing the system’s real-time performance, including UGV detection, heading angle prediction, and UAV alignment under dynamic conditions, is available at: this https URL
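
The abstract reports both a mean absolute error (MAE) and a root mean squared error (RMSE) for the heading angle. The sketch below shows how the two metrics are computed, using made-up heading values in degrees; it assumes the errors are small enough that angle wrap-around at ±180° can be ignored, which the abstract does not discuss.

```python
import math


def mae(pred, truth):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(pred)


def rmse(pred, truth):
    """Root mean squared error."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(pred))


# Hypothetical predicted and ground-truth heading angles in degrees.
predicted = [10.0, -5.0, 30.0, 0.0]
measured = [10.5, -5.0, 29.0, 0.5]

print(mae(predicted, measured), rmse(predicted, measured))
```

RMSE weights large errors more heavily and is never smaller than MAE on the same data, which matches the ordering of the two reported figures (0.1957° versus 0.1506°).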

[CV-8] Benchmarking Document Parsers on Mathematical Formula Extraction from PDFs

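[Quick read]: This paper addresses the accurate parsing of mathematical formulas from PDF documents, which is critical for training large language models (LLMs) and building scientific knowledge bases. Existing benchmarks either ignore formulas entirely or lack semantically aware evaluation metrics, so they cannot measure parsing quality effectively. The key to the solution is a new benchmarking framework based on synthetic PDFs with precise LaTeX ground truth, which allows systematic control over layout, formulas, and content characteristics. It also pioneers the use of an LLM as a judge (LLM-as-a-judge) for semantic formula assessment, combined with a robust two-stage matching pipeline that handles inconsistencies in parser output, yielding automated evaluation that correlates strongly with human judgment (Pearson r=0.78).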

Link: https://arxiv.org/abs/2512.09874
Authors: Pius Horn, Janis Keuper
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Correctly parsing mathematical formulas from PDFs is critical for training large language models and building scientific knowledge bases from academic literature, yet existing benchmarks either exclude formulas entirely or lack semantically-aware evaluation metrics. We introduce a novel benchmarking framework centered on synthetically generated PDFs with precise LaTeX ground truth, enabling systematic control over layout, formulas, and content characteristics. A key methodological contribution is pioneering LLM-as-a-judge for semantic formula assessment, combined with a robust two-stage matching pipeline that handles parser output inconsistencies. Through human validation on 250 formula pairs (750 ratings from 30 evaluators), we demonstrate that LLM-based evaluation achieves substantially higher correlation with human judgment (Pearson r=0.78) compared to CDM (r=0.34) and text similarity (r~0). Evaluating 20+ contemporary PDF parsers (including specialized OCR models, vision-language models, and rule-based approaches) across 100 synthetic documents with 2,000+ formulas reveals significant performance disparities. Our findings provide crucial insights for practitioners selecting parsers for downstream applications and establish a robust, scalable methodology that enables reproducible evaluation of PDF formula extraction quality. Code and benchmark data: this https URL
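
The validation in the abstract compares each automatic metric with human ratings through the Pearson correlation coefficient. The sketch below computes Pearson r from scratch for two score lists; the ratings are invented for illustration and are not data from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical scores for five formula pairs (not taken from the paper).
human_scores = [1, 2, 3, 4, 5]
judge_scores = [1, 3, 2, 5, 4]

r = pearson_r(human_scores, judge_scores)
print(round(r, 2))
```

A value near 1 means the metric ranks and spaces formula pairs much as humans do, while a value near 0 (as reported for text similarity) means the metric carries almost no linear information about human judgment.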

[CV-9] Diffusion Posterior Sampler for Hyperspectral Unmixing with Spectral Variability Modeling

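[Quick read]: This paper addresses two core challenges in hyperspectral unmixing: how to model the spectral prior distribution and how to capture spectral variability. Conventional methods often rely on a predefined spectral library as the prior, which can introduce bias and struggles to capture the spectral diversity of local scenes. The authors therefore propose DPS4Un, a semiblind unmixing method based on a diffusion posterior sampler, with four main features. First, a pretrained conditional spectrum diffusion model is treated as a posterior sampler that combines the learned endmember prior with the observations to obtain a refined abundance distribution. Second, image-based endmember bundles built within superpixels replace the fixed spectral library, making the prior more adaptive and accurate. Third, a superpixel-based data fidelity term replaces the image-level constraint, improving local consistency. Finally, the endmembers of each superpixel region are initialized as Gaussian noise, and abundances and endmembers are updated iteratively to model spectral variability. Experiments show that the method outperforms state-of-the-art unmixing algorithms on three real-world benchmark datasets.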

Link: https://arxiv.org/abs/2512.09871
Authors: Yimin Zhu, Lincoln Linlin Xu
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Linear spectral mixture models (LMM) provide a concise form to disentangle the constituent materials (endmembers) and their corresponding proportions (abundance) in a single pixel. The critical challenges are how to model the spectral prior distribution and spectral variability. Prior knowledge and spectral variability can be rigorously modeled under the Bayesian framework, where posterior estimation of abundance is derived by combining observed data with endmember prior distribution. Considering the key challenges and the advantages of the Bayesian framework, a novel method using a diffusion posterior sampler for semiblind unmixing, denoted as DPS4Un, is proposed to deal with these challenges with the following features: (1) We view the pretrained conditional spectrum diffusion model as a posterior sampler, which can combine the learned endmember prior with observation to get the refined abundance distribution. (2) Instead of using the existing spectral library as prior, which may raise bias, we establish the image-based endmember bundles within superpixels, which are used to train the endmember prior learner with a diffusion model. Superpixels make sure the sub-scene is more homogeneous. (3) Instead of using the image-level data consistency constraint, the superpixel-based data fidelity term is proposed. (4) The endmember is initialized as Gaussian noise for each superpixel region, and DPS4Un iteratively updates the abundance and endmember, contributing to spectral variability modeling. The experimental results on three real-world benchmark datasets demonstrate that DPS4Un outperforms the state-of-the-art hyperspectral unmixing methods.
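
The linear mixture model that the abstract starts from can be shown in a few lines: a pixel spectrum is an abundance-weighted sum of endmember spectra. The sketch below is only the noise-free forward model, not DPS4Un; the two 3-band endmember spectra are invented, and the non-negativity and sum-to-one checks follow the usual LMM convention, which the abstract itself does not spell out.

```python
def mix_pixel(endmembers, abundances):
    """Noise-free linear mixture: pixel[b] = sum_k abundances[k] * endmembers[k][b]."""
    # Conventional LMM constraints (assumed here): non-negative, summing to one.
    if any(a < 0 for a in abundances) or abs(sum(abundances) - 1.0) > 1e-9:
        raise ValueError("abundances must be non-negative and sum to 1")
    n_bands = len(endmembers[0])
    return [
        sum(a * spectrum[b] for a, spectrum in zip(abundances, endmembers))
        for b in range(n_bands)
    ]


# Two hypothetical endmember spectra with three bands each.
endmembers = [
    [0.1, 0.4, 0.8],
    [0.6, 0.5, 0.2],
]
pixel = mix_pixel(endmembers, [0.25, 0.75])
print(pixel)
```

Unmixing is the inverse problem: given `pixel`, recover the abundances and, in the semiblind setting described above, the endmembers as well.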

[CV-10] UniUGP: Unifying Understanding Generation and Planing For End-to-end Autonomous Driving

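[Quick read]: This paper addresses the poor performance of autonomous driving (AD) systems in long-tail scenarios, which stems from limited world knowledge and weak visual dynamic modeling. Existing vision-language-action (VLA) methods cannot use unlabeled videos for visual causal learning, while world-model-based methods lack the reasoning capabilities of large language models (LLMs). The key to the solution is UniUGP, a unified Understanding-Generation-Planning framework that uses a hybrid expert architecture to combine scene reasoning, future video generation, and trajectory planning. The framework integrates pre-trained vision-language models (VLMs) and video generation models so that visual dynamics and semantic reasoning strengthen planning, and it adopts a four-stage progressive training strategy that builds perception, reasoning, and decision-making capabilities across several existing AD datasets and the authors' own specialized datasets, improving generalization to complex long-tail scenarios.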

Link: https://arxiv.org/abs/2512.09864
Authors: Hao Lu, Ziyang Liu, Guangfeng Jiang, Yuanfei Luo, Sheng Chen, Yangang Zhang, Ying-Cong Chen
Affiliations: ByteDance
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:Autonomous driving (AD) systems struggle in long-tail scenarios due to limited world knowledge and weak visual dynamic modeling. Existing vision-language-action (VLA)-based methods cannot leverage unlabeled videos for visual causal learning, while world model-based methods lack reasoning capabilities from large language models. In this paper, we construct multiple specialized datasets providing reasoning and planning annotations for complex scenarios. Then, a unified Understanding-Generation-Planning framework, named UniUGP, is proposed to synergize scene reasoning, future video generation, and trajectory planning through a hybrid expert architecture. By integrating pre-trained VLMs and video generation models, UniUGP leverages visual dynamics and semantic reasoning to enhance planning performance. Taking multi-frame observations and language instructions as input, it produces interpretable chain-of-thought reasoning, physically consistent trajectories, and coherent future videos. We introduce a four-stage training strategy that progressively builds these capabilities across multiple existing AD datasets, along with the proposed specialized datasets. Experiments demonstrate state-of-the-art performance in perception, reasoning, and decision-making, with superior generalization to challenging long-tail situations.

[CV-11] Simultaneous Tactile-Visual Perception for Learning Multimodal Robot Manipulation

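[Quick read]: This paper addresses the limits that separated sensing modalities place on complex robotic manipulation, in particular that existing see-through-skin (STS) sensors cannot perceive tactile and visual information simultaneously and that multimodal signals are hard to integrate into learning-based manipulation policies. The key to the solution is the TacThru sensor and the TacThru-UMI imitation learning framework. TacThru uses a fully transparent elastomer, persistent illumination, novel keyline markers, and efficient tracking to acquire visual signals and extract tactile signals robustly at the same time. TacThru-UMI fuses the multimodal inputs through a Transformer-based Diffusion Policy and reaches an average success rate of 85.5% on five real-world tasks, well above the vision-only (55.4%) and alternating tactile-visual (66.3%) baselines, with particular strength in contact detection with thin, soft objects and in precision manipulation that requires multimodal coordination.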

Link: https://arxiv.org/abs/2512.09851
Authors: Yuyang Li, Yinghan Chen, Zihang Zhao, Puhao Li, Tengyu Liu, Siyuan Huang, Yixin Zhu
Affiliations: Peking University; Beijing Institute for General Artificial Intelligence; State Key Lab of General AI at Peking University; Beijing Key Lab of Behavior and Mental Health; School of Psychological and Cognitive Sciences, Peking University; Embodied Intelligence Lab, PKU-Wuhan Institute for Artificial Intelligence; Department of Computer Science and Technology, University of Cambridge
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Robotic manipulation requires both rich multimodal perception and effective learning frameworks to handle complex real-world tasks. See-through-skin (STS) sensors, which combine tactile and visual perception, offer promising sensing capabilities, while modern imitation learning provides powerful tools for policy acquisition. However, existing STS designs lack simultaneous multimodal perception and suffer from unreliable tactile tracking. Furthermore, integrating these rich multimodal signals into learning-based manipulation pipelines remains an open challenge. We introduce TacThru, an STS sensor enabling simultaneous visual perception and robust tactile signal extraction, and TacThru-UMI, an imitation learning framework that leverages these multimodal signals for manipulation. Our sensor features a fully transparent elastomer, persistent illumination, novel keyline markers, and efficient tracking, while our learning system integrates these signals through a Transformer-based Diffusion Policy. Experiments on five challenging real-world tasks show that TacThru-UMI achieves an average success rate of 85.5%, significantly outperforming the baselines of alternating tactile-visual (66.3%) and vision-only (55.4%). The system excels in critical scenarios, including contact detection with thin and soft objects and precision manipulation requiring multimodal coordination. This work demonstrates that combining simultaneous multimodal perception with modern learning frameworks enables more precise, adaptable robotic manipulation.

[CV-12] From Detection to Anticipation: Online Understanding of Struggles across Various Tasks and Activities WACV2026

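[Quick read]: This paper addresses real-time recognition and prediction of user struggle in intelligent assistive systems, with a focus on online detection and anticipation. Prior work concentrates on offline classification and localization, which does not meet the needs of real-time interaction. The key to the solution is to reformulate struggle localization as an online detection task and extend it to anticipation, that is, predicting struggle moments before they occur. Two off-the-shelf models are adapted as baselines: online detection reaches 70-80% per-frame mAP, anticipation up to 2 seconds ahead performs comparably with slight drops, and the whole pipeline, including feature extraction, runs at around 20 FPS, which is sufficient for real-time assistive applications.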

Link: https://arxiv.org/abs/2512.09847
Authors: Shijia Feng, Michael Wray, Walterio Mayol-Cuevas
Affiliations: University of Bristol; Amazon
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by WACV 2026

Abstract:Understanding human skill performance is essential for intelligent assistive systems, with struggle recognition offering a natural cue for identifying user difficulties. While prior work focuses on offline struggle classification and localization, real-time applications require models capable of detecting and anticipating struggle online. We reformulate struggle localization as an online detection task and further extend it to anticipation, predicting struggle moments before they occur. We adapt two off-the-shelf models as baselines for online struggle detection and anticipation. Online struggle detection achieves 70-80% per-frame mAP, while struggle anticipation up to 2 seconds ahead yields comparable performance with slight drops. We further examine generalization across tasks and activities and analyse the impact of skill evolution. Despite larger domain gaps in activity-level generalization, models still outperform random baselines by 4-20%. Our feature-based models run at up to 143 FPS, and the whole pipeline, including feature extraction, operates at around 20 FPS, sufficient for real-time assistive applications.

[CV-13] Composing Concepts from Images and Videos via Concept-prompt Binding

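[Quick read]: This paper addresses two difficulties in visual concept composition: accurately extracting complex visual concepts and flexibly combining concepts drawn from both images and videos. The core solution is Bind Compose, a one-shot method that binds visual concepts to corresponding prompt tokens and composes the target prompt from the bound tokens to enable flexible visual generation. It has three key innovations: (1) a hierarchical binder structure for cross-attention conditioning in Diffusion Transformers, which decomposes complex visual concepts accurately; (2) a Diversify-and-Absorb Mechanism, which adds an extra absorbent token to remove the influence of concept-irrelevant details and improve concept-token binding accuracy; and (3) a Temporal Disentanglement Strategy, which trains video concepts in two stages with a dual-branch binder structure to improve compatibility between image and video concepts. Experiments show that the method outperforms existing approaches in concept consistency, prompt fidelity, and motion quality.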

Link: https://arxiv.org/abs/2512.09824
Authors: Xianghao Kong, Zeyu Zhang, Yuwei Guo, Zhuoran Zhao, Songchun Zhang, Anyi Rao
Affiliations: HKUST; CUHK; HKUST(GZ)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: Project page: this https URL

Abstract:Visual concept composition, which aims to integrate different elements from images and videos into a single, coherent visual output, still falls short in accurately extracting complex concepts from visual inputs and flexibly combining concepts from both images and videos. We introduce Bind Compose, a one-shot method that enables flexible visual concept composition by binding visual concepts with corresponding prompt tokens and composing the target prompt with bound tokens from various sources. It adopts a hierarchical binder structure for cross-attention conditioning in Diffusion Transformers to encode visual concepts into corresponding prompt tokens for accurate decomposition of complex visual concepts. To improve concept-token binding accuracy, we design a Diversify-and-Absorb Mechanism that uses an extra absorbent token to eliminate the impact of concept-irrelevant details when training with diversified prompts. To enhance the compatibility between image and video concepts, we present a Temporal Disentanglement Strategy that decouples the training process of video concepts into two stages with a dual-branch binder structure for temporal modeling. Evaluations demonstrate that our method achieves superior concept consistency, prompt fidelity, and motion quality over existing approaches, opening up new possibilities for visual creativity.

[CV-14] DynaIP: Dynamic Image Prompt Adapter for Scalable Zero-shot Personalized Text-to-Image Generation

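[Quick read]: This paper addresses three core challenges in Personalized Text-to-Image (PT2I) generation: (1) the balance between Concept Preservation (CP) and Prompt Following (PF) is hard to strike; (2) fine-grained concept details in reference images are hard to retain; and (3) scalability to multi-subject personalization is limited. The key to the solution is the Dynamic Image Prompt Adapter (DynaIP). It exploits the decoupled learning behavior that Multimodal Diffusion Transformers (MM-DiT) exhibit when reference image features are injected through cross-attention, and designs a Dynamic Decoupling Strategy that removes the interference of concept-agnostic information during inference, markedly improving the CP-PF balance and multi-subject scalability. In addition, a Hierarchical Mixture-of-Experts Feature Fusion Module makes full use of the hierarchical features of the CLIP visual encoder, improving fine-grained concept fidelity and giving flexible control over visual granularity.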

Link: https://arxiv.org/abs/2512.09814
Authors: Zhizhong Wang, Tianyi Chu, Zeyi Huang, Nanyang Wang, Kehan Li
Affiliations: Central Media Technology Institute, Huawei
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Personalized Text-to-Image (PT2I) generation aims to produce customized images based on reference images. A prominent interest pertains to the integration of an image prompt adapter to facilitate zero-shot PT2I without test-time fine-tuning. However, current methods grapple with three fundamental challenges: 1. the elusive equilibrium between Concept Preservation (CP) and Prompt Following (PF), 2. the difficulty in retaining fine-grained concept details in reference images, and 3. the restricted scalability to extend to multi-subject personalization. To tackle these challenges, we present Dynamic Image Prompt Adapter (DynaIP), a cutting-edge plugin to enhance the fine-grained concept fidelity, CP-PF balance, and subject scalability of SOTA T2I multimodal diffusion transformers (MM-DiT) for PT2I generation. Our key finding is that MM-DiT inherently exhibits decoupling learning behavior when injecting reference image features into its dual branches via cross attentions. Based on this, we design an innovative Dynamic Decoupling Strategy that removes the interference of concept-agnostic information during inference, significantly enhancing the CP-PF balance and further bolstering the scalability of multi-subject compositions. Moreover, we identify the visual encoder as a key factor affecting fine-grained CP and reveal that the hierarchical features of commonly used CLIP can capture visual information at diverse granularity levels. Therefore, we introduce a novel Hierarchical Mixture-of-Experts Feature Fusion Module to fully leverage the hierarchical features of CLIP, remarkably elevating the fine-grained concept fidelity while also providing flexible control of visual granularity. Extensive experiments across single- and multi-subject PT2I tasks verify that our DynaIP outperforms existing approaches, marking a notable advancement in the field of PT2I generation.

[CV-15] CHEM: Estimating and Understanding Hallucinations in Deep Learning for Image Processing

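[Quick read]: This paper addresses the tendency of image reconstruction models based on U-Net and other U-shaped architectures to produce unrealistic hallucination artifacts in image deconvolution tasks. Such artifacts can interfere with analysis in safety-critical scenarios and undermine trust in the model. The key to the solution is a general method for quantifying and understanding them, the Conformal Hallucination Estimation Metric (CHEM). Its main advantages are that it uses wavelet and shearlet representations to extract hallucinations of image features efficiently, and that it applies conformalized quantile regression to assess hallucination levels without distributional assumptions. The paper also examines, from an approximation-theoretic perspective, why U-shaped networks are prone to hallucinations, offering a new perspective on the reliability of deep learning-based image processing.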

Link: https://arxiv.org/abs/2512.09806
Authors: Jianfei Li, Ines Rosellon-Inclan, Gitta Kutyniok, Jean-Luc Starck
Affiliations: Ludwig-Maximilians-Universität München; CEA
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:U-Net and other U-shaped architectures have achieved significant success in image deconvolution tasks. However, challenges have emerged, as these methods might generate unrealistic artifacts or hallucinations, which can interfere with analysis in safety-critical scenarios. This paper introduces a novel approach for quantifying and comprehending hallucination artifacts to ensure trustworthy computer vision models. Our method, termed the Conformal Hallucination Estimation Metric (CHEM), is applicable to any image reconstruction model, enabling efficient identification and quantification of hallucination artifacts. It offers two key advantages: it leverages wavelet and shearlet representations to efficiently extract hallucinations of image features and uses conformalized quantile regression to assess hallucination levels in a distribution-free manner. Furthermore, from an approximation theoretical perspective, we explore the reasons why U-shaped networks are prone to hallucinations. We test the proposed approach on the CANDELS astronomical image dataset with models such as U-Net, SwinUNet, and Learnlets, and provide new perspectives on hallucination from different aspects in deep learning-based image processing.
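
Conformalized quantile regression is the distribution-free ingredient named in the abstract. The sketch below shows the standard split-conformal calibration step in general form: given lower and upper quantile predictions on a held-out calibration set, it computes the margin by which every interval is widened. It is not CHEM itself; how the method applies this to wavelet and shearlet coefficients is not described in the abstract, and the calibration values here are invented.

```python
import math


def cqr_margin(lo_preds, hi_preds, targets, alpha):
    """Split-conformal margin for quantile intervals at miscoverage level alpha."""
    # Conformity score: how far each target falls outside its predicted interval
    # (negative when the target lies strictly inside).
    scores = sorted(
        max(lo - y, y - hi) for lo, hi, y in zip(lo_preds, hi_preds, targets)
    )
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for the requested coverage level.
        return math.inf
    return scores[rank - 1]


def calibrate(lo, hi, margin):
    """Widen (or shrink, for a negative margin) one predicted interval."""
    return lo - margin, hi + margin


# Hypothetical calibration set: every predicted interval is [0, 1].
targets = [0.5, 1.2, -0.1, 0.9, 1.0, 0.3, 1.5, 0.7, 0.2]
lo_preds = [0.0] * len(targets)
hi_preds = [1.0] * len(targets)

margin = cqr_margin(lo_preds, hi_preds, targets, alpha=0.2)
interval = calibrate(0.0, 1.0, margin)
print(margin, interval)
```

The coverage guarantee of this procedure relies only on exchangeability of calibration and test data, which is what "distribution-free" refers to.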

[CV-16] Modality-Specific Enhancement and Complementary Fusion for Semi-Supervised Multi-Modal Brain Tumor Segmentation

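[Quick read]: This paper addresses the difficulty of exploiting complementary information in multi-modal medical image segmentation, which is caused by semantic discrepancies and misalignment between modalities. The core solution is a novel semi-supervised multi-modal framework with two key components: (1) a Modality-specific Enhancing Module (MEM), which strengthens the semantic cues unique to each modality through channel-wise attention; and (2) a learnable Complementary Information Fusion (CIF) module, which adaptively exchanges and fuses knowledge between modalities. The framework is optimized with a supervised segmentation loss combined with cross-modal consistency regularization on unlabeled data. On the BraTS 2019 HGG subset it improves Dice and Sensitivity scores substantially even with very small labeled fractions, confirming that MEM and CIF work together to bridge cross-modal discrepancies and improve robustness.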

Link: https://arxiv.org/abs/2512.09801
Authors: Tien-Dat Chung, Ba-Thinh Lam, Thanh-Huy Nguyen, Thien Nguyen, Nguyen Lan Vi Vu, Hoang-Loc Cao, Phat Kim Huynh, Min Xu
Affiliations: 1. University of California, Berkeley; 2. National University of Singapore; 3. Hanoi University of Science and Technology; 4. University of Southern California
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 3 figures

Abstract:Semi-supervised learning (SSL) has become a promising direction for medical image segmentation, enabling models to learn from limited labeled data alongside abundant unlabeled samples. However, existing SSL approaches for multi-modal medical imaging often struggle to exploit the complementary information between modalities due to semantic discrepancies and misalignment across MRI sequences. To address this, we propose a novel semi-supervised multi-modal framework that explicitly enhances modality-specific representations and facilitates adaptive cross-modal information fusion. Specifically, we introduce a Modality-specific Enhancing Module (MEM) to strengthen semantic cues unique to each modality via channel-wise attention, and a learnable Complementary Information Fusion (CIF) module to adaptively exchange complementary knowledge between modalities. The overall framework is optimized using a hybrid objective combining supervised segmentation loss and cross-modal consistency regularization on unlabeled data. Extensive experiments on the BraTS 2019 (HGG subset) demonstrate that our method consistently outperforms strong semi-supervised and multi-modal baselines under 1%, 5%, and 10% labeled data settings, achieving significant improvements in both Dice and Sensitivity scores. Ablation studies further confirm the complementary effects of our proposed MEM and CIF in bridging cross-modality discrepancies and improving segmentation robustness under scarce supervision.
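
The two reported metrics, Dice and Sensitivity, can both be computed from the overlap counts between a predicted and a reference mask. The sketch below uses flattened binary masks with invented values; returning 1.0 when both masks are empty is a convention chosen here for illustration, not something stated in the abstract.

```python
def dice_and_sensitivity(pred, truth):
    """Return (Dice, Sensitivity) for two flattened binary masks."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)        # true positives
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)    # false positives
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)    # false negatives
    # Convention assumed here: empty prediction and empty reference score 1.0.
    dice = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 1.0
    return dice, sensitivity


# Hypothetical six-voxel masks.
pred = [1, 1, 0, 0, 1, 0]
truth = [1, 0, 0, 1, 1, 0]

dice, sensitivity = dice_and_sensitivity(pred, truth)
print(dice, sensitivity)
```

Dice penalizes both false positives and false negatives, while Sensitivity only measures how much of the true tumor region is recovered, so the two are usually reported together.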

[CV-17] FastPose-ViT: A Vision Transformer for Real-Time Spacecraft Pose Estimation WACV2026

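[Quick read]: This paper addresses efficient and accurate estimation of the 6-degrees-of-freedom (6DoF) pose of a spacecraft from a single image, in support of autonomous missions such as in-orbit servicing and space debris removal. Current mainstream methods rely on computationally intensive iterative Perspective-n-Point (PnP) algorithms, which are hard to deploy on resource-constrained edge devices. The key to the solution is FastPose-ViT, a Vision Transformer (ViT)-based architecture that regresses the 6DoF pose directly. It introduces a mathematical formalism, derived from projective geometry and the concept of "apparent rotation", that maps predictions made on cropped images back to the full-image scale, giving accurate, low-latency pose estimation. On the SPEED dataset the method outperforms other non-PnP methods and is competitive with state-of-the-art PnP-based techniques. Quantized deployment on power-constrained edge hardware (NVIDIA Jetson Orin Nano) confirms its practicality, with a latency of about 75 ms per frame and throughput of up to 33 FPS.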

Link: https://arxiv.org/abs/2512.09792
Authors: Pierre Ancey, Andrew Price, Saqib Javed, Mathieu Salzmann
Affiliations: EPFL; Swiss Data Science Center
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to WACV 2026. Preprint version

Abstract:Estimating the 6-degrees-of-freedom (6DoF) pose of a spacecraft from a single image is critical for autonomous operations like in-orbit servicing and space debris removal. Existing state-of-the-art methods often rely on iterative Perspective-n-Point (PnP)-based algorithms, which are computationally intensive and ill-suited for real-time deployment on resource-constrained edge devices. To overcome these limitations, we propose FastPose-ViT, a Vision Transformer (ViT)-based architecture that directly regresses the 6DoF pose. Our approach processes cropped images from object bounding boxes and introduces a novel mathematical formalism to map these localized predictions back to the full-image scale. This formalism is derived from the principles of projective geometry and the concept of “apparent rotation”, where the model predicts an apparent rotation matrix that is then corrected to find the true orientation. We demonstrate that our method outperforms other non-PnP strategies and achieves performance competitive with state-of-the-art PnP-based techniques on the SPEED dataset. Furthermore, we validate our model’s suitability for real-world space missions by quantizing it and deploying it on power-constrained edge hardware. On the NVIDIA Jetson Orin Nano, our end-to-end pipeline achieves a latency of ~75 ms per frame under sequential execution, and a non-blocking throughput of up to 33 FPS when stages are scheduled concurrently.
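
A latency of about 75 ms per frame and a throughput of up to 33 FPS may look inconsistent at first, since 1000 / 75 is only about 13 FPS. The sketch below shows how the two figures can coexist: with sequential execution the latency is the sum of the stage times, while with concurrently scheduled stages the throughput is limited by the slowest stage. The three stage times are purely hypothetical, chosen only so that they sum to 75 ms; the abstract does not report a per-stage breakdown.

```python
def sequential_latency_ms(stage_ms):
    """Per-frame latency when the stages run one after another."""
    return sum(stage_ms)


def pipelined_fps(stage_ms):
    """Steady-state throughput when stages overlap: limited by the slowest stage."""
    return 1000.0 / max(stage_ms)


# Hypothetical stage times in milliseconds (not from the paper).
stages = [30.0, 30.0, 15.0]

latency = sequential_latency_ms(stages)
sequential_fps = 1000.0 / latency
concurrent_fps = pipelined_fps(stages)
print(latency, round(sequential_fps, 1), round(concurrent_fps, 1))
```

Pipelining raises throughput but does not shorten the time a single frame spends in the system, which is why the abstract reports the two numbers separately.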

[CV-18] Stylized Meta-Album: Group-bias injection with style transfer to study robustness against distribution shifts

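[Quick read]: This paper addresses the limited benchmark scenarios and insufficient group diversity available for evaluating the out-of-distribution (OOD) generalization, fairness, and robustness of image classification models. Existing benchmarks struggle to simulate the complex group structures and domain shifts of the real world, which biases the evaluation of algorithms. The key to the solution is Stylized Meta-Album (SMA), a meta-dataset of 24 datasets (12 content datasets and 12 stylized datasets). Style transfer is used to construct 4800 configurable groups that combine varied subjects (objects, plants, animals, human actions, textures) with multiple styles, giving flexible control over group structure, class distribution, and domain shift. This design increases group and class diversity and supports evaluation under realistic conditions such as many minority groups, group imbalance, and complex domain shifts, opening new research directions in fairness, robustness, and adaptation.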

Link: https://arxiv.org/abs/2512.09773
Authors: Romain Mussard (UNIROUEN), Aurélien Gauffre (UGA), Ihsan Ullah, Thanh Gia Hieu Khuong (TAU, LISN), Massih-Reza Amini (UGA), Isabelle Guyon (TAU, LISN), Lisheng Sun-Hosoya (TAU, LISN)
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: We introduce Stylized Meta-Album (SMA), a new image classification meta-dataset comprising 24 datasets (12 content datasets, and 12 stylized datasets), designed to advance studies on out-of-distribution (OOD) generalization and related topics. Created using style transfer techniques from 12 subject classification datasets, SMA provides a diverse and extensive set of 4800 groups, combining various subjects (objects, plants, animals, human actions, textures) with multiple styles. SMA enables flexible control over groups and classes, allowing us to configure datasets to reflect diverse benchmark scenarios. While ideally, data collection would capture extensive group diversity, practical constraints often make this infeasible. SMA addresses this by enabling large and configurable group structures through flexible control over styles, subject classes, and domains, allowing datasets to reflect a wide range of real-world benchmark scenarios. This design not only expands group and class diversity, but also opens new methodological directions for evaluating model performance across diverse group and domain configurations (including scenarios with many minority groups, varying group imbalance, and complex domain shifts) and for studying fairness, robustness, and adaptation under a broader range of realistic conditions. To demonstrate SMA's effectiveness, we implemented two benchmarks: (1) a novel OOD generalization and group fairness benchmark leveraging SMA's domain, class, and group diversity to evaluate existing benchmarks. Our findings reveal that while simple balancing and algorithms utilizing group information remain competitive as claimed in previous benchmarks, increasing group diversity significantly impacts fairness, altering the superiority and relative rankings of algorithms. We also propose to use Top-M worst group accuracy as a new hyperparameter tuning metric, demonstrating broader fairness during optimization and delivering better final worst-group accuracy for larger group diversity. (2) An unsupervised domain adaptation (UDA) benchmark utilizing SMA's group diversity to evaluate UDA algorithms across more scenarios, offering a more comprehensive benchmark with lower error bars (reduced by 73% and 28% in closed-set setting and UniDA setting, respectively) compared to existing efforts. These use cases highlight SMA's potential to significantly impact the outcomes of conventional benchmarks.
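The abstract names Top-M worst group accuracy as a tuning metric but does not define it here. The sketch below assumes one plausible reading, the mean accuracy over the M lowest-scoring groups, so that M = 1 reduces to ordinary worst-group accuracy. The function names and this definition are illustrative assumptions, not the authors' code.

```python
from collections import defaultdict


def group_accuracies(y_true, y_pred, groups):
    """Return {group_id: accuracy} for every group id present in `groups`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(t == p)
    return {g: correct[g] / total[g] for g in total}


def top_m_worst_group_accuracy(y_true, y_pred, groups, m=1):
    """Mean accuracy over the m worst groups (assumed definition, see lead-in)."""
    accs = sorted(group_accuracies(y_true, y_pred, groups).values())
    if not accs:
        raise ValueError("no samples given")
    m = max(1, min(m, len(accs)))  # clamp m to the number of groups
    return sum(accs[:m]) / m
```

With many small groups, averaging over several of the worst groups is less sensitive to a single noisy group than the plain minimum, which may be why a metric of this kind suits model selection under large group diversity.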

[CV-19] LiM-YOLO: Less is More with Pyramid Level Shift and Normalized Auxiliary Branch for Ship Detection in Optical Remote Sensing Imagery

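Quick read: This paper tackles two core problems that general-purpose object detectors face when detecting ships in satellite remote sensing imagery. First, detection accuracy drops because maritime targets show extreme scale disparity and morphological anisotropy. Second, the stride-32 (P5) feature layer used in standard architectures struggles to resolve narrow vessels, which dilutes spatial features. The key to the solution is the LiM-YOLO detector. Its Pyramid Level Shift Strategy, derived from a statistical analysis of ship scales, reconfigures the detection head to P2-P4 so that small objects satisfy the Nyquist sampling criterion and the computational redundancy of deep layers is removed. Its Group Normalized Convolutional Block for Linear Projection (GN-CBLinear) mitigates gradient volatility in micro-batch training on high-resolution inputs, improving training stability and detection performance.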

Link: https://arxiv.org/abs/2512.09700
Authors: Seon-Hoon Kim, Hyeji Sim, Youeyun Jung, Ok-Chul Jung, Yerin Kim
Affiliations: University of Science and Technology (UST); Korea Aerospace Research Institute (KARI)
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 16 pages, 8 figures, 9 tables

Abstract:Applying general-purpose object detectors to ship detection in satellite imagery presents significant challenges due to the extreme scale disparity and morphological anisotropy of maritime targets. Standard architectures utilizing stride-32 (P5) layers often fail to resolve narrow vessels, resulting in spatial feature dilution. In this work, we propose LiM-YOLO, a specialized detector designed to resolve these domain-specific conflicts. Based on a statistical analysis of ship scales, we introduce a Pyramid Level Shift Strategy that reconfigures the detection head to P2-P4. This shift ensures compliance with Nyquist sampling criteria for small objects while eliminating the computational redundancy of deep layers. To further enhance training stability on high-resolution inputs, we incorporate a Group Normalized Convolutional Block for Linear Projection (GN-CBLinear), which mitigates gradient volatility in micro-batch settings. Validated on SODA-A, DOTA-v1.5, FAIR1M-v2.0, and ShipRSImageNet-V1, LiM-YOLO demonstrates superior detection accuracy and efficiency compared to state-of-the-art models. The code is available at this https URL.
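The stride argument can be made concrete with a little arithmetic. The sketch below assumes the usual convention that pyramid level P_k has stride 2^k, and it reads the Nyquist remark as "at least two feature cells across an object's narrow side". Both the threshold and the function names are illustrative assumptions rather than the paper's exact criterion.

```python
def stride(level):
    """Stride in input pixels of pyramid level P_level (assumed to be 2**level)."""
    return 2 ** level


def resolvable_levels(narrow_side_px, levels=(2, 3, 4, 5), samples_required=2):
    """Levels whose grid places at least `samples_required` cells across the object."""
    return [lvl for lvl in levels if narrow_side_px / stride(lvl) >= samples_required]


# A vessel 16 px wide spans only half a cell at P5 (stride 32),
# but four cells at P2 (stride 4).
print(resolvable_levels(16))
```

Under these assumptions a 16-pixel-wide vessel is resolved only at P2 and P3, which is consistent with the abstract's motivation for dropping the stride-32 head.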

[CV-20] Unconsciously Forget: Mitigating Memorization; Without Knowing What is being Memorized

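Quick read: This paper addresses the excessive memorization of training data by generative AI models, and in particular the legal risks it creates, such as copyright infringement, violations of portrait rights, and trademark violations. Existing methods mitigate the problem mainly by manipulating the denoising sampling process or by using unlearning strategies that require concept-specific datasets, but they commonly suffer from high computational overhead and poor scalability. The proposed solution, UniForget, identifies and prunes the specific parts of the model responsible for generating copyrighted content. This reduces the probability of generating infringing content without targeting specific concepts, while preserving the model's general generative capabilities. The method is orthogonal and complementary to existing unlearning techniques and can therefore strengthen current de-memorization approaches.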

Link: https://arxiv.org/abs/2512.09687
Authors: Er Jin, Yang Zhang, Yongli Mou, Yanfei Dong, Stefan Decker, Kenji Kawaguchi, Johannes Stegmaier
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advances in generative models have demonstrated an exceptional ability to produce highly realistic images. However, previous studies show that generated images often resemble the training data, and this problem becomes more severe as the model size increases. Memorizing training data can lead to legal challenges, including copyright infringement, violations of portrait rights, and trademark violations. Existing approaches to mitigating memorization mainly focus on manipulating the denoising sampling process to steer image embeddings away from the memorized embedding space or employ unlearning methods that require training on datasets containing specific sets of memorized concepts. However, existing methods often incur substantial computational overhead during sampling, or focus narrowly on removing one or more groups of target concepts, imposing a significant limitation on their scalability. To understand and mitigate these problems, our work, UniForget, offers a new perspective on understanding the root cause of memorization. Our work demonstrates that specific parts of the model are responsible for copyrighted content generation. By applying model pruning, we can effectively suppress the probability of generating copyrighted content without targeting specific concepts while preserving the general generative capabilities of the model. Additionally, we show that our approach is both orthogonal and complementary to existing unlearning methods, thereby highlighting its potential to improve current unlearning and de-memorization techniques.

[CV-21] An Automated Tip-and-Cue Framework for Optimized Satellite Tasking and Visual Intelligence

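Quick read: This paper addresses the low degree of automation, high response latency, and insufficient multi-source data fusion in satellite imaging task planning and scheduling, and in particular how to generate and execute tasks efficiently as large satellite constellations are deployed. The key to the solution is a fully automated Tip-and-Cue framework. "Tips" (candidate target hints) are generated from external data sources or analyses of prior imagery, and "cues" (imaging tasks) are then formulated to incorporate sensor constraints, timing requirements, and utility functions. The system optimizes scheduling across multiple satellites with continuous utility functions and processes the resulting imagery with AI models, such as object detectors and vision-language models, to produce structured reports, closing the loop from target identification to decision support.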

Link: https://arxiv.org/abs/2512.09670
Authors: Gil Weissman, Amir Ivry, Israel Cohen
Affiliations: Technion-Israel Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
Comments: Under review at IEEE Transactions on Geoscience and Remote Sensing (TGRS). 13 pages, 8 figures

Abstract:The proliferation of satellite constellations, coupled with reduced tasking latency and diverse sensor capabilities, has expanded the opportunities for automated Earth observation. This paper introduces a fully automated Tip-and-Cue framework designed for satellite imaging tasking and scheduling. In this context, tips are generated from external data sources or analyses of prior satellite imagery, identifying spatiotemporal targets and prioritizing them for downstream planning. Corresponding cues are the imaging tasks formulated in response, which incorporate sensor constraints, timing requirements, and utility functions. The system autonomously generates candidate tasks, optimizes their scheduling across multiple satellites using continuous utility functions that reflect the expected value of each observation, and processes the resulting imagery using artificial-intelligence-based models, including object detectors and vision-language models. Structured visual reports are generated to support both interpretability and the identification of new insights for downstream tasking. The efficacy of the framework is demonstrated through a maritime vessel tracking scenario, utilizing Automatic Identification System (AIS) data for trajectory prediction, targeted observations, and the generation of actionable outputs. Maritime vessel tracking is a widely researched application, often used to benchmark novel approaches to satellite tasking, forecasting, and analysis. The system is extensible to broader applications such as smart-city monitoring and disaster response, where timely tasking and automated analysis are critical.
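To illustrate how cues with continuous utilities might be scheduled across satellites, here is a minimal greedy sketch. It is not the paper's optimizer: the `Cue` fields, the exponential time-decay utility, and the greedy selection rule are all hypothetical stand-ins chosen to keep the example short.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Cue:
    """A candidate imaging task (all fields are illustrative assumptions)."""
    target: str
    satellite: str
    start: float      # seconds after the tip was issued
    duration: float   # seconds the sensor is occupied
    priority: float   # priority carried over from the tip


def utility(cue, tau=3600.0):
    """Continuous utility: priority decays exponentially with observation delay."""
    return cue.priority * math.exp(-cue.start / tau)


def greedy_schedule(cues, tau=3600.0):
    """Pick cues by descending utility; skip served targets and busy satellites."""
    busy = {}       # satellite -> list of (start, end) intervals already booked
    served = set()  # targets that already have an observation
    plan = []
    for cue in sorted(cues, key=lambda c: utility(c, tau), reverse=True):
        if cue.target in served:
            continue
        end = cue.start + cue.duration
        slots = busy.setdefault(cue.satellite, [])
        if any(cue.start < e and s < end for s, e in slots):
            continue  # overlaps an existing booking on this satellite
        slots.append((cue.start, end))
        served.add(cue.target)
        plan.append(cue)
    return plan
```

A greedy pass is easy to reason about but can miss better global assignments; an exact formulation would treat this as a constrained optimization problem.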

[CV-22] OxEnsemble: Fair Ensembles for Low-Data Classification

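Quick read: This paper addresses fair classification when data is scarce and unevenly distributed across demographic groups, a setting that is especially pressing in critical domains such as medical imaging, where false negatives can have fatal consequences. The core of the solution is a new method called OxEnsemble, which trains multiple ensemble members that each satisfy fairness constraints and aggregates their predictions, enforcing fairness while remaining efficient in both data and compute. The method reuses held-out data to enforce fairness constraints reliably and needs little more compute than fine-tuning or evaluating a single model. On multiple challenging medical imaging classification datasets it yields more consistent outcomes and stronger fairness-accuracy trade-offs.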

Link: https://arxiv.org/abs/2512.09665
Authors: Jonathan Rystrøm, Zihao Fu, Chris Russell
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments:

Abstract: We address the problem of fair classification in settings where data is scarce and unbalanced across demographic groups. Such low-data regimes are common in domains like medical imaging, where false negatives can have fatal consequences. We propose a novel approach, OxEnsemble, for efficiently training ensembles and enforcing fairness in these low-data regimes. Unlike other approaches, we aggregate predictions across ensemble members, each trained to satisfy fairness constraints. By construction, OxEnsemble is both data-efficient, carefully reusing held-out data to enforce fairness reliably, and compute-efficient, requiring little more compute than used to fine-tune or evaluate an existing model. We validate this approach with new theoretical guarantees. Experimentally, our approach yields more consistent outcomes and stronger fairness-accuracy trade-offs than existing methods across multiple challenging medical imaging classification datasets.

[CV-23] SynthPix: A lightspeed PIV images generator

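Quick read: This paper addresses the low efficiency of image generation for Particle Image Velocimetry (PIV), in particular the high iteration cost when training data-hungry reinforcement learning methods and developing real-time flow estimation algorithms. The key to the solution is SynthPix, a JAX-based synthetic image generator designed for performance and parallelism on accelerators. It achieves an image-pair generation throughput several orders of magnitude higher than existing tools, which speeds up both the training of flow estimation models and the iteration cycle of algorithm development.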

Link: https://arxiv.org/abs/2512.09664
Authors: Antonio Terpin, Alan Bonomi, Francesco Banelli, Raffaello D’Andrea
Affiliations: ETH Zürich
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: Code: this https URL

Abstract:We describe SynthPix, a synthetic image generator for Particle Image Velocimetry (PIV) with a focus on performance and parallelism on accelerators, implemented in JAX. SynthPix supports the same configuration parameters as existing tools but achieves a throughput several orders of magnitude higher in image-pair generation per second. SynthPix was developed to enable the training of data-hungry reinforcement learning methods for flow estimation and for reducing the iteration times during the development of fast flow estimation methods used in recent active fluids control studies with real-time PIV feedback. We believe SynthPix to be useful for the fluid dynamics community, and in this paper we describe the main ideas behind this software package.

[CV-24] IF-Bench: Benchmarking and Enhancing MLLMs for Infrared Images with Generative Visual Prompting

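Quick read: This paper addresses the lack of systematic evaluation, and of effective improvement methods, for the infrared image understanding of multimodal large language models (MLLMs). Existing work has made notable progress on visible-light images, but infrared images follow a different imaging mechanism, and the resulting domain distribution shift makes MLLMs hard to apply directly. The key to the solution is IF-Bench, the first high-quality benchmark for infrared image understanding, together with a training-free Generative Visual Prompting (GenViP) method that uses advanced image editing models to translate infrared images into semantically and spatially aligned RGB images, mitigating the domain shift. Experiments show that the method significantly improves infrared image understanding across a wide range of MLLMs.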

Link: https://arxiv.org/abs/2512.09663
Authors: Tao Zhang, Yuyang Hong, Yang Xia, Kun Ding, Zeyu Zhang, Ying Wang, Shiming Xiang, Chunhong Pan
Affiliations: MAIS, Institute of Automation; School of Artificial Intelligence, UCAS; Research Center of Aerospace Information, Institute of Automation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advances in multimodal large language models (MLLMs) have led to impressive progress across various benchmarks. However, their capability in understanding infrared images remains unexplored. To address this gap, we introduce IF-Bench, the first high-quality benchmark designed for evaluating multimodal understanding of infrared images. IF-Bench consists of 499 images sourced from 23 infrared datasets and 680 carefully curated visual question-answer pairs, covering 10 essential dimensions of image understanding. Based on this benchmark, we systematically evaluate over 40 open-source and closed-source MLLMs, employing cyclic evaluation, bilingual assessment, and hybrid judgment strategies to enhance the reliability of the results. Our analysis reveals how model scale, architecture, and inference paradigms affect infrared image comprehension, providing valuable insights for this area. Furthermore, we propose a training-free generative visual prompting (GenViP) method, which leverages advanced image editing models to translate infrared images into semantically and spatially aligned RGB counterparts, thereby mitigating domain distribution shifts. Extensive experiments demonstrate that our method consistently yields significant performance improvements across a wide range of MLLMs. The benchmark and code are available at this https URL.

[CV-25] VHOI: Controllable Video Generation of Human-Object Interactions from Sparse Trajectories via Motion Densification

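Quick read: This paper addresses the complexity and controllability of human-object interaction (HOI) in video generation, in particular how to achieve efficient control while remaining instance-aware. Existing methods face a trade-off in their control signals: sparse controls such as keypoint trajectories are easy to specify but cannot distinguish individual dynamics, while dense signals such as optical flow, depth, or 3D meshes are informative but costly to obtain. The key to the solution is VHOI, a two-stage framework that first densifies sparse trajectories into HOI mask sequences and then fine-tunes a video diffusion model conditioned on these dense masks. It introduces a novel HOI-aware motion representation that uses color encodings to distinguish human motion from object motion as well as the dynamics of different body parts, embedding a human prior into the conditioning signal and strengthening the model's ability to understand and generate realistic HOI dynamics.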

Link: https://arxiv.org/abs/2512.09646
Authors: Wanyue Zhang, Lin Geng Foo, Thabo Beeler, Rishabh Dabral, Christian Theobalt
Affiliations: MPI for Informatics; VIA Center; Google
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Synthesizing realistic human-object interactions (HOI) in video is challenging due to the complex, instance-specific interaction dynamics of both humans and objects. Incorporating controllability in video generation further adds to the complexity. Existing controllable video generation approaches face a trade-off: sparse controls like keypoint trajectories are easy to specify but lack instance-awareness, while dense signals such as optical flow, depths or 3D meshes are informative but costly to obtain. We propose VHOI, a two-stage framework that first densifies sparse trajectories into HOI mask sequences, and then fine-tunes a video diffusion model conditioned on these dense masks. We introduce a novel HOI-aware motion representation that uses color encodings to distinguish not only human and object motion, but also body-part-specific dynamics. This design incorporates a human prior into the conditioning signal and strengthens the model’s ability to understand and generate realistic HOI dynamics. Experiments demonstrate state-of-the-art results in controllable HOI video generation. VHOI is not limited to interaction-only scenarios and can also generate full human navigation leading up to object interactions in an end-to-end manner. Project page: this https URL.

[CV-26] Kaapana: A Comprehensive Open-Source Platform for Integrating AI in Medical Imaging Research Environments

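Quick read: This paper addresses the limited generalizability of AI research in medical imaging. The core challenges are the difficulty of obtaining large multi-center datasets, the low degree of toolchain standardization in research environments, inefficient cross-institution collaboration, and the lack of reproducible analysis pipelines. The key to the solution is Kaapana, an open-source platform whose modular, extensible architecture unifies data ingestion, cohort curation, processing workflows, and result inspection. By bringing the algorithm to the data, it lets institutions keep control over their sensitive data while still supporting distributed experimentation and model development. Flexible workflow orchestration combined with researcher-facing user interfaces lowers the technical barrier and improves reproducibility, supporting use cases that range from local prototyping to nation-wide imaging research networks.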

Link: https://arxiv.org/abs/2512.09644
Authors: Ünal Akünal, Markus Bujotzek, Stefan Denner, Benjamin Hamm, Klaus Kades, Philipp Schader, Jonas Scherer, Marco Nolden, Peter Neher, Ralf Floca, Klaus Maier-Hein
Affiliations: German Cancer Research Center; Heidelberg University Hospital
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Developing generalizable AI for medical imaging requires both access to large, multi-center datasets and standardized, reproducible tooling within research environments. However, leveraging real-world imaging data in clinical research environments is still hampered by strict regulatory constraints, fragmented software infrastructure, and the challenges inherent in conducting large-cohort multicentre studies. This leads to projects that rely on ad-hoc toolchains that are hard to reproduce, difficult to scale beyond single institutions and poorly suited for collaboration between clinicians and data scientists. We present Kaapana, a comprehensive open-source platform for medical imaging research that is designed to bridge this gap. Rather than building single-use, site-specific tooling, Kaapana provides a modular, extensible framework that unifies data ingestion, cohort curation, processing workflows and result inspection under a common user interface. By bringing the algorithm to the data, it enables institutions to keep control over their sensitive data while still participating in distributed experimentation and model development. By integrating flexible workflow orchestration with user-facing applications for researchers, Kaapana reduces technical overhead, improves reproducibility and enables conducting large-scale, collaborative, multi-centre imaging studies. We describe the core concepts of the platform and illustrate how they can support diverse use cases, from local prototyping to nation-wide research networks. The open-source codebase is available at this https URL

[CV-27] Benchmarking SAM2-based Trackers on FMOX

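Quick read: This paper examines how object trackers built on Segment Anything Model 2 (SAM2) degrade when tracking fast moving objects (FMO). Trackers such as SAM2, EfficientTAM, DAM4SAM, and SAMURAI perform well on conventional datasets but still show limitations in challenging scenarios. The approach is to benchmark these trackers systematically on datasets of fast moving objects that are specifically designed to be hard for tracking, in order to expose their behavior and performance bottlenecks. The study finds that DAM4SAM and SAMURAI perform well on the more challenging sequences.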

Link: https://arxiv.org/abs/2512.09633
Authors: Senem Aktas, Charles Markham, John McDonald, Rozenn Dahyot
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Several object tracking pipelines extending Segment Anything Model 2 (SAM2) have been proposed in the past year, where the approach is to follow and segment the object from a single exemplar template provided by the user on an initialization frame. We propose to benchmark these high-performing trackers (SAM2, EfficientTAM, DAM4SAM and SAMURAI) on datasets containing fast moving objects (FMO) specifically designed to be challenging for tracking approaches. The goal is to better understand current limitations in state-of-the-art trackers by providing more detailed insights into the behavior of these trackers. We show that overall the trackers DAM4SAM and SAMURAI perform well on more challenging sequences.

[CV-28] Beyond Sequences: A Benchmark for Atomic Hand-Object Interaction Using a Static RNN Encoder

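Quick read: This paper addresses the reliable prediction of human intent in hand-object interaction, focusing on the fine-grained classification of atomic interaction states, namely "approaching", "grabbing", and "holding". The key to the solution is a structured data engineering process that converts raw videos into statistical-kinematic feature vectors, combined with setting the sequence length of a Bidirectional RNN to one (seq_length=1), which turns the network into a high-capacity static feature encoder. This markedly improves performance, reaching a classification accuracy of 97.60% and a balanced F1-score of 0.90 on the most challenging class, "grabbing". The finding shows that lightweight architectures paired with structured features are effective for low-level interaction recognition.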

Link: https://arxiv.org/abs/2512.09626
Authors: Yousef Azizi Movahed, Fatemeh Ziaeetabar
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code available at: this https URL

Abstract:Reliably predicting human intent in hand-object interactions is an open challenge for computer vision. Our research concentrates on a fundamental sub-problem: the fine-grained classification of atomic interaction states, namely ‘approaching’, ‘grabbing’, and ‘holding’. To this end, we introduce a structured data engineering process that converts raw videos from the MANIAC dataset into 27,476 statistical-kinematic feature vectors. Each vector encapsulates relational and dynamic properties from a short temporal window of motion. Our initial hypothesis posited that sequential modeling would be critical, leading us to compare static classifiers (MLPs) against temporal models (RNNs). Counter-intuitively, the key discovery occurred when we set the sequence length of a Bidirectional RNN to one (seq_length=1). This modification converted the network’s function, compelling it to act as a high-capacity static feature encoder. This architectural change directly led to a significant accuracy improvement, culminating in a final score of 97.60%. Of particular note, our optimized model successfully overcame the most challenging transitional class, ‘grabbing’, by achieving a balanced F1-score of 0.90. These findings provide a new benchmark for low-level hand-object interaction recognition using structured, interpretable features and lightweight architectures.
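Why a Bidirectional RNN with seq_length=1 behaves as a static encoder can be seen directly from the recurrence. The sketch below assumes a vanilla tanh RNN cell with a zero initial hidden state (the paper may use a different cell such as an LSTM or GRU). With a single time step the recurrent weights `U` only ever multiply zeros, so each direction collapses to a dense layer. All names and weights are illustrative.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rnn_pass(seq, W, U, b):
    """Run a tanh RNN over `seq` from a zero state and return the final hidden state."""
    h = [0.0] * len(b)
    for x in seq:
        wx, uh = matvec(W, x), matvec(U, h)
        h = [math.tanh(a + c + d) for a, c, d in zip(wx, uh, b)]
    return h


def bidirectional_encode(seq, fwd, bwd):
    """Concatenate the final states of a forward and a backward pass."""
    return rnn_pass(seq, *fwd) + rnn_pass(list(reversed(seq)), *bwd)


def static_encode(x, fwd, bwd):
    """Feed-forward equivalent for one step: the recurrent weights are never used."""
    (Wf, _, bf), (Wb, _, bb) = fwd, bwd

    def dense(W, b):
        return [math.tanh(a + c) for a, c in zip(matvec(W, x), b)]

    return dense(Wf, bf) + dense(Wb, bb)
```

For any single input vector `x`, `bidirectional_encode([x], fwd, bwd)` and `static_encode(x, fwd, bwd)` agree, which matches the abstract's description of the network acting as a static feature encoder.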

[CV-29] FROMAT: Multiview Material Appearance Transfer via Few-Shot Self-Attention Adaptation

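Quick read: This paper addresses the limited appearance manipulation offered by multiview diffusion models, especially control over visual attributes such as material, texture, or style. Existing methods struggle to transfer appearance flexibly without explicitly modeling geometry and appearance. The key to the solution is a lightweight adaptation technique that combines object identity from an input image with appearance cues rendered in a reference image, allowing appearance to be specified explicitly at generation time while preserving multi-view consistency. Concretely, the method uses three diffusion denoising processes (for the original object, the reference, and the target images) and, during reverse sampling, aggregates a small subset of layer-wise self-attention features from the object and the reference to influence the target generation. Only a few training examples are needed to add appearance awareness to pretrained multiview models.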

Link: https://arxiv.org/abs/2512.09617
Authors: Hubert Kompanowski, Varun Jampani, Aaryaman Vasishta, Binh-Son Hua
Affiliations: Trinity College Dublin; Arcade AI; AMD
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Multiview diffusion models have rapidly emerged as a powerful tool for content creation with spatial consistency across viewpoints, offering rich visual realism without requiring explicit geometry and appearance representation. However, compared to meshes or radiance fields, existing multiview diffusion models offer limited appearance manipulation, particularly in terms of material, texture, or style. In this paper, we present a lightweight adaptation technique for appearance transfer in multiview diffusion models. Our method learns to combine object identity from an input image with appearance cues rendered in a separate reference image, producing multi-view-consistent output that reflects the desired materials, textures, or styles. This allows explicit specification of appearance parameters at generation time while preserving the underlying object geometry and view coherence. We leverage three diffusion denoising processes responsible for generating the original object, the reference, and the target images, and perform reverse sampling to aggregate a small subset of layer-wise self-attention features from the object and the reference to influence the target generation. Our method requires only a few training examples to introduce appearance awareness to pretrained multiview models. The experiments show that our method provides a simple yet effective way toward multiview generation with diverse appearance, advocating the adoption of implicit generative 3D representations in practice.

[CV-30] ImageTalk: Designing a Multimodal AAC Text Generation System Driven by Image Recognition and Natural Language Generation

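Quick read: This paper addresses two core problems that people living with Motor Neuron Disease (plwMND) face when using augmentative and alternative communication (AAC) systems: traditional symbol-based AAC systems offer a limited vocabulary, while text entry solutions have low communication rates. To meet these challenges, the authors design and develop a novel multimodal text generation system, ImageTalk, whose image-driven text generation achieves keystroke savings of 95.6% while maintaining consistent performance and high user satisfaction. Through a proxy-user-based design phase and an end-user-based design phase, the work distills three design guidelines for AI-assisted text generation systems and outlines four user requirement levels for AAC, giving direction to future research.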

Link: https://arxiv.org/abs/2512.09610
Authors: Boyin Yang, Puming Jiang, Per Ola Kristensson
Affiliations: University of Cambridge; Imperial College London
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 24 pages, 10 figures

Abstract:People living with Motor Neuron Disease (plwMND) frequently encounter speech and motor impairments that necessitate a reliance on augmentative and alternative communication (AAC) systems. This paper tackles the main challenge that traditional symbol-based AAC systems offer a limited vocabulary, while text entry solutions tend to exhibit low communication rates. To help plwMND articulate their needs about the system efficiently and effectively, we iteratively design and develop a novel multimodal text generation system called ImageTalk through a tailored proxy-user-based and an end-user-based design phase. The system demonstrates pronounced keystroke savings of 95.6%, coupled with consistent performance and high user satisfaction. We distill three design guidelines for AI-assisted text generation systems design and outline four user requirement levels tailored for AAC purposes, guiding future research in this field.
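Keystroke savings is conventionally computed as the share of characters the user did not have to type. The sketch below uses that common definition; whether the paper counts keystrokes in exactly this way is an assumption, and the character and keystroke counts in the example are made up purely to show what a figure like 95.6% means.

```python
def keystroke_savings(chars_in_message, keystrokes_used):
    """Percentage of characters the user did not need to type."""
    if chars_in_message <= 0:
        raise ValueError("message must contain at least one character")
    return 100.0 * (chars_in_message - keystrokes_used) / chars_in_message


# Illustrative numbers only: producing a 250-character message with
# 11 key presses corresponds to savings of about 95.6%.
print(round(keystroke_savings(250, 11), 1))
```

A value of 0% means every character was typed by hand, so a figure above 95% implies that nearly all of the output text came from generation rather than typing.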

[CV-31] UrbanNav: Learning Language-Guided Urban Navigation from Web-Scale Human Trajectories AAAI2026

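Quick read: This paper addresses the challenges embodied agents face when navigating complex urban environments from natural language instructions, including noisy instructions, ambiguous spatial references, diverse landmarks, and dynamic street scenes. Existing visual navigation methods are typically limited to simulated or off-street environments and rely on precise goal formats such as coordinates or images, which makes them hard to apply to scenarios such as last-mile delivery robots navigating unfamiliar cities. The key to the solution is the UrbanNav framework, which uses web-scale city walking videos and a scalable annotation pipeline to align human navigation trajectories with free-form language instructions grounded in real-world landmarks. The resulting navigation policies show robust spatial reasoning, tolerate noisy instructions, and generalize to unseen urban settings. The framework covers over 1,500 hours of navigation data and 3 million instruction-trajectory-landmark triplets, and it significantly improves language-guided navigation in real-world urban scenes.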

Link: https://arxiv.org/abs/2512.09607
Authors: Yanghong Mei, Yirong Yang, Longteng Guo, Qunbo Wang, Ming-Ming Yu, Xingjian He, Wenjun Wu, Jing Liu
Affiliations: Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Tsinghua University; Shanghai Jiao Tong University; Peking University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 5 figures, accepted to AAAI 2026

Abstract: Navigating complex urban environments using natural language instructions poses significant challenges for embodied agents, including noisy language instructions, ambiguous spatial references, diverse landmarks, and dynamic street scenes. Current visual navigation methods are typically limited to simulated or off-street environments, and often rely on precise goal formats, such as specific coordinates or images. This limits their effectiveness for autonomous agents like last-mile delivery robots navigating unfamiliar cities. To address these limitations, we introduce UrbanNav, a scalable framework that trains embodied agents to follow free-form language instructions in diverse urban settings. Leveraging web-scale city walking videos, we develop a scalable annotation pipeline that aligns human navigation trajectories with language instructions grounded in real-world landmarks. UrbanNav encompasses over 1,500 hours of navigation data and 3 million instruction-trajectory-landmark triplets, capturing a wide range of urban scenarios. Our model learns robust navigation policies to tackle complex urban scenarios, demonstrating superior spatial reasoning, robustness to noisy instructions, and generalization to unseen urban settings. Experimental results show that UrbanNav significantly outperforms existing methods, highlighting the potential of large-scale web video data to enable language-guided, real-world urban navigation for embodied agents.

[CV-32] CS3D: An Efficient Facial Expression Recognition via Event Vision

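Quick read: This paper addresses the difficulty of deploying conventional deep learning models for event-camera-based facial expression recognition on edge computing devices, owing to their high computational complexity and energy consumption. The key to the solution is the CS3D framework, which decomposes the Convolutional 3D (C3D) structure to reduce computational complexity and energy consumption, and introduces soft spiking neurons together with a spatial-temporal attention mechanism to retain more information and improve recognition accuracy. Experiments show that CS3D is more accurate than RNN, Transformer, and C3D architectures on multiple datasets while consuming only 21.97% of the energy of the original C3D.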

Link: https://arxiv.org/abs/2512.09592
Authors: Zhe Wang, Qijin Song, Yucen Peng, Weibang Bai
Affiliations: ShanghaiTech University; Shanghai Pujiang Program; ShangHAI; MoE Key Laboratory of Intelligent Perception and Human-Machine Collaboration (KLIP-HuMaCo); STAR Center
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract: Responsive and accurate facial expression recognition is crucial to human-robot interaction for daily service robots. Nowadays, event cameras are becoming more widely adopted as they surpass RGB cameras in capturing facial expression changes due to their high temporal resolution, low latency, computational efficiency, and robustness in low-light conditions. Despite these advantages, event-based approaches still encounter practical challenges, particularly in adopting mainstream deep learning models. Traditional deep learning methods for facial expression analysis are energy-intensive, making them difficult to deploy on edge computing devices and thereby increasing costs, especially for high-frequency, dynamic, event vision-based approaches. To address this issue, we propose the CS3D framework, which decomposes the Convolutional 3D method to reduce computational complexity and energy consumption. Additionally, soft spiking neurons and a spatial-temporal attention mechanism enhance the ability to retain information, improving the accuracy of facial expression detection. Experimental results indicate that our proposed CS3D method attains higher accuracy on multiple datasets compared to architectures such as the RNN, Transformer, and C3D, while its energy consumption is just 21.97% of that required by the original C3D on the same device.

[CV-33] UnReflectAnything: RGB-Only Highlight Removal by Rendering Synthetic Specular Supervision

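Quick read: This paper addresses the appearance distortion, texture occlusion, and impaired geometric reasoning caused by specular highlights in images, which are especially pronounced in natural scenes and surgical imagery. The key to the solution is UnReflectAnything, an RGB-only framework that removes highlights by jointly predicting a highlight map and a reflection-free diffuse reconstruction. The model uses a frozen vision transformer encoder to extract multi-scale features, a lightweight head to localize specular regions, and a token-level inpainting module to restore corrupted feature patches before producing the final diffuse image. To overcome the lack of paired supervision, the authors design a Virtual Highlight Synthesis pipeline that renders physically plausible highlights using monocular geometry, Fresnel-aware shading, and randomized lighting, so the model can be trained on arbitrary RGB images with correct geometric structure.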

Link: https://arxiv.org/abs/2512.09583
Authors: Alberto Rota, Mert Kiray, Mert Asim Karaoglu, Patrick Ruhkamp, Elena De Momi, Nassir Navab, Benjamin Busam
Affiliations: Politecnico di Milano; Technical University of Munich; Munich Center for Machine Learning; ImFusion
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Specular highlights distort appearance, obscure texture, and hinder geometric reasoning in both natural and surgical imagery. We present UnReflectAnything, an RGB-only framework that removes highlights from a single image by predicting a highlight map together with a reflection-free diffuse reconstruction. The model uses a frozen vision transformer encoder to extract multi-scale features, a lightweight head to localize specular regions, and a token-level inpainting module that restores corrupted feature patches before producing the final diffuse image. To overcome the lack of paired supervision, we introduce a Virtual Highlight Synthesis pipeline that renders physically plausible specularities using monocular geometry, Fresnel-aware shading, and randomized lighting, which enables training on arbitrary RGB images with correct geometric structure. UnReflectAnything generalizes across natural and surgical domains, where non-Lambertian surfaces and non-uniform lighting create severe highlights, and it performs competitively with state-of-the-art methods on several benchmarks. Project Page: this https URL

[CV-34] Content-Adaptive Image Retouching Guided by Attribute-Based Text Representation

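Quick read: This paper addresses the inability of existing image retouching methods, which apply a single global pixel-wise color mapping, to adapt to the intrinsic color variations of image content. This limits how well they accommodate diverse color distributions and user-defined style preferences. The key to the solution is a Content-Adaptive image retouching method guided by Attribute-based Text Representation (CA-ATP), built on two core modules. The content-adaptive curve mapping module learns diverse color mappings from a set of basis curves and weight maps, so that the same color value is adjusted differently in different spatial contexts. The attribute text prediction module generates text representations from multiple image attributes and combines them with visual features through a multimodal model, providing user-friendly style guidance and enabling finer, more personalized retouching.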

Link: https://arxiv.org/abs/2512.09580
Authors: Hancheng Zhu, Xinyu Liu, Rui Yao, Kunyang Sun, Leida Li, Abdulmotaleb El Saddik
Affiliations: China University of Mining and Technology; Mine Digitization Engineering Research Center of the Ministry of Education; Xidian University; University of Ottawa
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Image retouching has received significant attention due to its ability to achieve high-quality visual content. Existing approaches mainly rely on uniform pixel-wise color mapping across entire images, neglecting the inherent color variations induced by image content. This limitation hinders existing approaches from achieving adaptive retouching that accommodates both diverse color distributions and user-defined style preferences. To address these challenges, we propose a novel Content-Adaptive image retouching method guided by Attribute-based Text Representation (CA-ATP). Specifically, we propose a content-adaptive curve mapping module, which leverages a series of basis curves to establish multiple color mapping relationships and learns the corresponding weight maps, enabling content-aware color adjustments. The proposed module can capture color diversity within the image content, allowing similar color values to receive distinct transformations based on their spatial context. In addition, we propose an attribute text prediction module that generates text representations from multiple image attributes, which explicitly represent user-defined style preferences. These attribute-based text representations are subsequently integrated with visual features via a multimodal model, providing user-friendly guidance for image retouching. Extensive experiments on several public datasets demonstrate that our method achieves state-of-the-art performance.
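The curve mapping idea can be sketched as a per-pixel blend of basis curves: each pixel's output is the weighted sum of several tone curves applied to its value, with the weights read from spatial weight maps. In the paper the curves and weight maps are learned; here gamma curves and hand-written weights stand in as illustrative assumptions, and the function names are hypothetical.

```python
def gamma_curve(gamma):
    """Return a tone curve v -> v**gamma for values in [0, 1]."""
    return lambda v: v ** gamma


def retouch(image, weight_maps, curves):
    """Blend basis curves per pixel.

    image:       H x W list of intensities in [0, 1]
    weight_maps: K maps of shape H x W, assumed to sum to 1 at each pixel
    curves:      K callables, one basis curve per weight map
    """
    out = []
    for y, row in enumerate(image):
        out.append([
            sum(wm[y][x] * curve(v) for wm, curve in zip(weight_maps, curves))
            for x, v in enumerate(row)
        ])
    return out


# Two pixels with the same value receive different adjustments because
# their weights differ: the left one is brightened, the right one darkened.
image = [[0.25, 0.25]]
curves = [gamma_curve(0.5), gamma_curve(2.0)]
weights = [[[1.0, 0.0]], [[0.0, 1.0]]]
print(retouch(image, weights, curves))
```

This mirrors the abstract's point that similar color values can receive distinct transformations depending on spatial context, which a single global curve cannot express.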

[CV-35] Hands-on Evaluation of Visual Transformers for Object Recognition and Detection

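[Quick read]: This paper addresses the limited ability of Convolutional Neural Networks (CNNs) to capture global image context in computer vision tasks. The key to the solution is to introduce and compare different types of Vision Transformers (ViTs): pure, hierarchical, and hybrid models. The study finds that hierarchical and hybrid architectures, especially Swin Transformer and CvT, maintain high accuracy while keeping computational cost in check, and that they outperform traditional CNN models on tasks requiring global visual semantic understanding, such as medical image classification. This indicates that self-attention-based ViTs model long-range dependencies in images more effectively.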

Link: https://arxiv.org/abs/2512.09579
Authors: Dimitrios N. Vlachogiannis, Dimitrios A. Koutsomitropoulos
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Convolutional Neural Networks (CNNs) for computer vision sometimes struggle with understanding images in a global context, as they mainly focus on local patterns. On the other hand, Vision Transformers (ViTs), inspired by models originally created for language processing, use self-attention mechanisms, which allow them to understand relationships across the entire image. In this paper, we compare different types of ViTs (pure, hierarchical, and hybrid) against traditional CNN models across various tasks, including object recognition, detection, and medical image classification. We conduct thorough tests on standard datasets like ImageNet for image classification and COCO for object detection. Additionally, we apply these models to medical imaging using the ChestX-ray14 dataset. We find that hybrid and hierarchical transformers, especially Swin and CvT, offer a strong balance between accuracy and computational resources. Furthermore, by experimenting with data augmentation techniques on medical images, we discover significant performance improvements, particularly with the Swin Transformer model. Overall, our results indicate that Vision Transformers are competitive and, in many cases, outperform traditional CNNs, especially in scenarios requiring the understanding of global visual contexts like medical imaging.
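
To make the "relationships across the entire image" point concrete, here is a minimal single-head scaled dot-product self-attention over patch tokens. It is a generic NumPy sketch, not code from the paper; the token count, embedding size, and random projection matrices are assumptions chosen only for illustration.

```python
import numpy as np

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    tokens: (N, D) patch embeddings; w_q, w_k, w_v: (D, d) projections.
    Returns the attended tokens (N, d) and the attention matrix (N, N).
    """
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)         # row-wise softmax
    return attn @ v, attn

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))        # e.g. a 4x4 grid of patch embeddings
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, attn = self_attention(tokens, w_q, w_k, w_v)
```

Every row of `attn` spans all 16 patches, so each output token can draw on any location in the image, in contrast to the fixed local window of a convolution.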

[CV-36] Seeing Soil from Space: Towards Robust and Scalable Remote Soil Nutrient Analysis

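[Quick read]: This paper addresses the limited accessibility and scalability of tools for assessing soil properties, namely soil organic carbon (SOC), total nitrogen (N), available phosphorus (P), exchangeable potassium (K), and pH, at a time when environmental variables increasingly affect agricultural decision-making. The key to the solution is a robust and scalable modeling system that uses a hybrid approach, combining indirect modeling (through soil proxies and drivers) with direct spectral modeling, and introduces two kinds of features: interpretable physics-informed covariates derived from Radiative Transfer Models (RTMs), and complex nonlinear embeddings from a foundation model. The system is validated on a harmonized dataset spanning diverse pedoclimatic zones in Europe under a strict validation strategy (spatial blocking, stratified splits, and statistically distinct train-test sets) that yields more realistic error estimates for unseen regions. Accuracy is highest for SOC and N (MAE of 5.12 g/kg and 0.44 g/kg respectively, CCC of 0.77 for both), and uncertainty is quantified through conformal calibration, reaching 90% coverage at the target confidence level.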

Link: https://arxiv.org/abs/2512.09576
Authors: David Seu(1), Nicolas Longepe(2), Gabriel Cioltea(1), Erik Maidik(1), Calin Andrei(1) ((1) CO2 Angels, Cluj-Napoca, Romania, (2) European Space Agency Phi-Lab, Frascati, Italy)
Affiliations: CO2 Angels; European Space Agency Φ-Lab
Categories: Computer Vision and Pattern Recognition (cs.CV); Geophysics (physics.geo-ph)
Comments: 23 pages, 13 figures, 13 tables

Abstract:Environmental variables are increasingly affecting agricultural decision-making, yet accessible and scalable tools for soil assessment remain limited. This study presents a robust and scalable modeling system for estimating soil properties in croplands, including soil organic carbon (SOC), total nitrogen (N), available phosphorus (P), exchangeable potassium (K), and pH, using remote sensing data and environmental covariates. The system employs a hybrid modeling approach, combining the indirect methods of modeling soil through proxies and drivers with direct spectral modeling. We extend current approaches by using interpretable physics-informed covariates derived from radiative transfer models (RTMs) and complex, nonlinear embeddings from a foundation model. We validate the system on a harmonized dataset that covers Europe's cropland soils across diverse pedoclimatic zones. Evaluation is conducted under a robust validation framework that enforces strict spatial blocking, stratified splits, and statistically distinct train-test sets, which deliberately make the evaluation harder and produce more realistic error estimates for unseen regions. The models achieved their highest accuracy for SOC and N. This performance held across unseen locations, under both spatial cross-validation and an independent test set. SOC obtained a MAE of 5.12 g/kg and a CCC of 0.77, and N obtained a MAE of 0.44 g/kg and a CCC of 0.77. We also assess uncertainty through conformal calibration, achieving 90 percent coverage at the target confidence level. This study contributes to the digital advancement of agriculture through the application of scalable, data-driven soil analysis frameworks that can be extended to related domains requiring quantitative soil evaluation, such as carbon markets.
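
The abstract does not say which conformal procedure is used, so the following is only a generic sketch of split conformal calibration for regression with absolute-residual scores. The function name `conformal_half_width`, the synthetic data, and the noise level are assumptions made for illustration.

```python
import math
import numpy as np

def conformal_half_width(cal_true, cal_pred, alpha=0.1):
    """Half-width q so that [pred - q, pred + q] targets 1 - alpha coverage."""
    scores = np.sort(np.abs(np.asarray(cal_true) - np.asarray(cal_pred)))
    n = scores.size
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points for this alpha
    return float(scores[rank - 1])

# Synthetic stand-in for a held-out calibration set and a test set.
rng = np.random.default_rng(0)
cal_pred = rng.normal(size=1000)
cal_true = cal_pred + rng.normal(scale=0.5, size=1000)
q = conformal_half_width(cal_true, cal_pred, alpha=0.1)

test_pred = rng.normal(size=5000)
test_true = test_pred + rng.normal(scale=0.5, size=5000)
coverage = np.mean(np.abs(test_true - test_pred) <= q)
```

With exchangeable calibration and test data, the empirical `coverage` should land close to the 90% target, which is the property the abstract reports.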

[CV-37] Investigate the Low-level Visual Perception in Vision-Language based Image Quality Assessment

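[Quick read]: This paper addresses the weak perception of low-level distortions (such as blur, noise, and compression artifacts) by Multi-modal Large Language Models (MLLMs) in Image Quality Assessment (IQA), and the resulting inconsistent evaluations and unreliable reasoning. The key to the solution is a dedicated low-level distortion classification task together with component-wise fine-tuning that strengthens the alignment between the vision encoder and semantic tokens. This raises distortion recognition accuracy from 14.92% to 84.43% and lets MLLM-based pipelines reason more consistently and interpretably in vision-centric tasks.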

Link: https://arxiv.org/abs/2512.09573
Authors: Yuan Li, Zitang Sun, Yen-Ju Chen, Shin’ya Nishida
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advances in Image Quality Assessment (IQA) have leveraged Multi-modal Large Language Models (MLLMs) to generate descriptive explanations. However, despite their strong visual perception modules, these models often fail to reliably detect basic low-level distortions such as blur, noise, and compression, and may produce inconsistent evaluations across repeated inferences. This raises an essential question: do MLLM-based IQA systems truly perceive the visual features that matter? To examine this issue, we introduce a low-level distortion perception task that requires models to classify specific distortion types. Our component-wise analysis shows that although MLLMs are structurally capable of representing such distortions, they tend to overfit training templates, leading to biases in quality scoring. As a result, critical low-level features are weakened or lost during the vision-language alignment transfer stage. Furthermore, by computing the semantic distance between visual features and corresponding semantic tokens before and after component-wise fine-tuning, we show that improving the alignment of the vision encoder dramatically enhances distortion recognition accuracy, increasing it from 14.92% to 84.43%. Overall, these findings indicate that incorporating dedicated constraints on the vision encoder can strengthen text-explainable visual representations and enable MLLM-based pipelines to produce more coherent and interpretable reasoning in vision-centric tasks.

[CV-38] From Graphs to Gates: DNS-HyXNet A Lightweight and Deployable Sequential Model for Real-Time DNS Tunnel Detection

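[Quick read]: This paper addresses the difficulty of deploying DNS tunnel detection in real time, which stems from the high latency and computational overhead of conventional graph-based models such as GraphTunnel. The key to the solution is DNS-HyXNet, a lightweight xLSTM hybrid framework that fuses tokenized domain embeddings with normalized numerical DNS features and uses a two-layer xLSTM network to learn temporal dependencies directly from packet sequences. This removes the need for recursive parsing and graph reconstruction and enables single-stage multi-class detection, cutting inference latency to 0.041 ms per sample while reaching up to 99.99% accuracy, which makes real-time deployment on commodity hardware feasible.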

Link: https://arxiv.org/abs/2512.09565
Authors: Faraz Ali, Muhammad Afaq, Mahmood Niazi, Muzammil Behzad
Affiliations: King Fahd University of Petroleum and Minerals (KFUPM)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Domain Name System (DNS) tunneling remains a covert channel for data exfiltration and command-and-control communication. Although graph-based methods such as GraphTunnel achieve strong accuracy, they introduce significant latency and computational overhead due to recursive parsing and graph construction, limiting their suitability for real-time deployment. This work presents DNS-HyXNet, a lightweight extended Long Short-Term Memory (xLSTM) hybrid framework designed for efficient sequence-based DNS tunnel detection. DNS-HyXNet integrates tokenized domain embeddings with normalized numerical DNS features and processes them through a two-layer xLSTM network that directly learns temporal dependencies from packet sequences, eliminating the need for graph reconstruction and enabling single-stage multi-class classification. The model was trained and evaluated on two public benchmark datasets with carefully tuned hyperparameters to ensure low memory consumption and fast inference. Across all experimental splits of the DNS-Tunnel-Datasets, DNS-HyXNet achieved up to 99.99% accuracy, with macro-averaged precision, recall, and F1-scores exceeding 99.96%, and demonstrated a per-sample detection latency of just 0.041 ms, confirming its scalability and real-time readiness. These results show that sequential modeling with xLSTM can effectively replace computationally expensive recursive graph generation, offering a deployable and energy-efficient alternative for real-time DNS tunnel detection on commodity hardware.

[CV-39] Building Reasonable Inference for Vision-Language Models in Blind Image Quality Assessment ICONIP

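[Quick read]: This paper addresses the contradictory judgments and unstable predictions of Vision-Language Models (VLMs) in Blind Image Quality Assessment (BIQA): the generated textual descriptions often disagree with the final quality score, and scores fluctuate considerably during inference, which is inconsistent with human reasoning. The key to the solution is a two-stage tuning method. In the first stage the model focuses on learning stable visual features from the image; in the second stage it infers quality solely from those features, explicitly separating perception from judgment. The design improves the stability and interpretability of model outputs, reducing the prediction instability rate on SPAQ and KONIQ from 22.00% to 12.39% and improving SRCC and PLCC, which supports its effectiveness in making inference more reliable.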

Link: https://arxiv.org/abs/2512.09555
Authors: Yuan Li, Zitang Sun, Yen-ju Chen, Shin’ya Nishida
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to the ICONIP (International Conference on Neural Information Processing), 2025

Abstract:Recent progress in BIQA has been driven by VLMs, whose semantic reasoning abilities suggest that they might extract visual features, generate descriptive text, and infer quality in a human-like manner. However, these models often produce textual descriptions that contradict their final quality predictions, and the predicted scores can change unstably during inference - behaviors not aligned with human reasoning. To understand these issues, we analyze the factors that cause contradictory assessments and instability. We first estimate the relationship between the final quality predictions and the generated visual features, finding that the predictions are not fully grounded in the features and that the logical connection between them is weak. Moreover, decoding intermediate VLM layers shows that the model frequently relies on a limited set of candidate tokens, which contributes to prediction instability. To encourage more human-like reasoning, we introduce a two-stage tuning method that explicitly separates visual perception from quality inference. In the first stage, the model learns visual features; in the second, it infers quality solely from these features. Experiments on SPAQ and KONIQ demonstrate that our approach reduces prediction instability from 22.00% to 12.39% and achieves average gains of 0.3124/0.3507 in SRCC/PLCC across LIVE, CSIQ, SPAQ, and KONIQ compared to the baseline. Further analyses show that our method improves both stability and the reliability of the inference process.

[CV-40] A Dual-Domain Convolutional Network for Hyperspectral Single-Image Super-Resolution

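[Quick read]: This paper addresses the difficulty of balancing computational complexity and reconstruction quality in Hyperspectral Image Super-Resolution (HISR). The key to the solution is a lightweight dual-domain super-resolution network (DDSRNet) that combines spatial-domain and frequency-domain learning: Spatial-Net performs shallow feature extraction and residual learning, the Discrete Wavelet Transform (DWT) is used to enhance low-frequency structures, and a single shared CNN branch refines the details of the high-frequency subbands (LH, HL, HH) simultaneously. This design lets the model reach highly competitive reconstruction performance on hyperspectral images at low computational cost.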

Link: https://arxiv.org/abs/2512.09546
Authors: Murat Karayaka, Usman Muhammad, Jorma Laaksonen, Md Ziaul Hoque, Tapio Seppänen
Affiliations: Center for Machine Vision and Signal Analysis, University of Oulu; Department of Computer Science, Aalto University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This study presents a lightweight dual-domain super-resolution network (DDSRNet) that combines Spatial-Net with the discrete wavelet transform (DWT). Specifically, our proposed model comprises three main components: (1) a shallow feature extraction module, termed Spatial-Net, which performs residual learning and bilinear interpolation; (2) a low-frequency enhancement branch based on the DWT that refines coarse image structures; and (3) a shared high-frequency refinement branch that simultaneously enhances the LH (horizontal), HL (vertical), and HH (diagonal) wavelet subbands using a single CNN with shared weights. As a result, the DWT enables subband decomposition, while the inverse DWT reconstructs the final high-resolution output. By doing so, the integration of spatial- and frequency-domain learning enables DDSRNet to achieve highly competitive performance with low computational cost on three hyperspectral image datasets, demonstrating its effectiveness for hyperspectral image super-resolution.
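
The subband decomposition and reconstruction steps can be illustrated with a one-level 2D Haar transform. This is a sketch under assumptions: the abstract does not name the wavelet, so Haar is chosen here for simplicity, the function names are hypothetical, and the LH/HL labels follow the abstract's convention (naming differs between libraries).

```python
import numpy as np

def haar_dwt2(x):
    """One-level 2D Haar DWT of an array with even height and width."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]   # top-left, top-right of each 2x2 block
    c, d = x[1::2, 0::2], x[1::2, 1::2]   # bottom-left, bottom-right
    ll = (a + b + c + d) / 2.0            # coarse structure
    lh = (a + b - c - d) / 2.0            # row differences (horizontal edges)
    hl = (a - b + c - d) / 2.0            # column differences (vertical edges)
    hh = (a - b - c + d) / 2.0            # diagonal detail
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuild the full-resolution array."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=np.float64)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 0::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x

band = np.arange(16.0).reshape(4, 4)      # stand-in for one spectral band
ll, lh, hl, hh = haar_dwt2(band)
rec = haar_idwt2(ll, lh, hl, hh)
```

In a network like the one described, the learned branches would modify `ll`, `lh`, `hl`, and `hh` before the inverse transform; here they are passed through unchanged, so `rec` reproduces `band`.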

[CV-41] Masked Registration and Autoencoding of CT Images for Predictive Tibia Reconstruction MICCAI

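[Quick read]: This paper addresses the difficulty, in surgical planning for complex tibial fractures, of predicting the desired 3D bone alignment, that is, a patient-specific reconstruction target. The core challenge is inferring the individual's healthy bone shape from a CT of the damaged bone. The key to the solution is combining neural registration with autoencoder (AE) models. A modified 3D spatial transformer network (STN) first registers the raw CT to the standardized coordinate system of a jointly trained tibia prototype. Several AE architectures are then trained to model healthy tibial variation. Both the STN and the AE models are designed to be robust to masked input, so that a fractured CT can be decoded into a prediction of the patient-specific healthy bone in standard coordinates. The method thus predicts a healthy state from the injured state and supports personalized surgical planning.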

Link: https://arxiv.org/abs/2512.09525
Authors: Hongyou Zhou, Cederic Aßmann, Alaa Bejaoui, Heiko Tzschätzsch, Mark Heyland, Julian Zierke, Niklas Tuttle, Sebastian Hölzl, Timo Auer, David A. Back, Marc Toussaint
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: DGM4MICCAI

Abstract:Surgical planning for complex tibial fractures can be challenging for surgeons, as the 3D structure of the later desirable bone alignment may be difficult to imagine. To assist in such planning, we address the challenge of predicting a patient-specific reconstruction target from a CT of the fractured tibia. Our approach combines neural registration and autoencoder models. Specifically, we first train a modified spatial transformer network (STN) to register a raw CT to a standardized coordinate system of a jointly trained tibia prototype. Subsequently, various autoencoder (AE) architectures are trained to model healthy tibial variations. Both the STN and AE models are further designed to be robust to masked input, allowing us to apply them to fractured CTs and decode to a prediction of the patient-specific healthy bone in standard coordinates. Our contributions include: i) a 3D-adapted STN for global spatial registration, ii) a comparative analysis of AEs for bone CT modeling, and iii) the extension of both to handle masked inputs for predictive generation of healthy bone structures. Project page: this https URL

[CV-42] ViTA-Seg: Vision Transformer for Amodal Segmentation in Robotics

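[Quick read]: This paper addresses the problem that occlusion in robotic bin picking compromises accurate and reliable grasp planning. The key to the solution is ViTA-Seg, a class-agnostic Vision Transformer framework that uses global attention to recover complete object masks, including the hidden regions (amodal segmentation). Its Dual-Head architecture predicts the amodal mask and the occluded mask together, combining strong accuracy with computational efficiency and thereby supporting robust, real-time robotic manipulation.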

Link: https://arxiv.org/abs/2512.09510
Authors: Donato Caramia, Florian T. Pokorny, Giuseppe Triggiani, Denis Ruffino, David Naso, Paolo Roberto Massenio
Affiliations: Polytechnic University of Bari; KTH Royal Institute of Technology; AROL S.p.A.
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Occlusions in robotic bin picking compromise accurate and reliable grasp planning. We present ViTA-Seg, a class-agnostic Vision Transformer framework for real-time amodal segmentation that leverages global attention to recover complete object masks, including hidden regions. We propose two architectures: a) Single-Head for amodal mask prediction; b) Dual-Head for amodal and occluded mask prediction. We also introduce ViTA-SimData, a photo-realistic synthetic dataset tailored to industrial bin-picking scenarios. Extensive experiments on two amodal benchmarks, COOCA and KINS, demonstrate that ViTA-Seg Dual Head achieves strong amodal and occlusion segmentation accuracy with computational efficiency, enabling robust, real-time robotic manipulation.

[CV-43] Gradient-Guided Learning Network for Infrared Small Target Detection

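[Quick read]: This paper addresses inaccurate edge localization and targets being submerged by the background in infrared small target detection, both of which stem from the small size of the targets and their lack of intrinsic features. The key to the solution is a gradient-guided learning network (GGL-Net). Its main contributions are: introducing gradient magnitude images into deep learning-based infrared small target detection for the first time, to emphasize edge details and improve localization accuracy; a dual-branch feature extraction network that uses a gradient supplementary module (GSM) to encode raw gradient information into deeper layers and embeds attention mechanisms to strengthen feature representation; and a two-way guidance fusion module (TGFM) that exploits the characteristics of feature maps at different levels to fuse multi-scale features effectively and extract richer semantic and detailed information.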

Link: https://arxiv.org/abs/2512.09497
Authors: Jinmiao Zhao, Chuang Yu, Zelin Shi, Yunpeng Liu, Yingdi Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by GRSL 2023

Abstract:Recently, infrared small target detection has attracted extensive attention. However, due to the small size and the lack of intrinsic features of infrared small targets, the existing methods generally have the problem of inaccurate edge positioning and the target is easily submerged by the background. Therefore, we propose an innovative gradient-guided learning network (GGL-Net). Specifically, we are the first to explore the introduction of gradient magnitude images into the deep learning-based infrared small target detection method, which is conducive to emphasizing the edge details and alleviating the problem of inaccurate edge positioning of small targets. On this basis, we propose a novel dual-branch feature extraction network that utilizes the proposed gradient supplementary module (GSM) to encode raw gradient information into deeper network layers and embeds attention mechanisms reasonably to enhance feature extraction ability. In addition, we construct a two-way guidance fusion module (TGFM), which fully considers the characteristics of feature maps at different levels. It can facilitate the effective fusion of multi-scale feature maps and extract richer semantic information and detailed information through reasonable two-way guidance. Extensive experiments prove that GGL-Net achieves state-of-the-art results on the public real NUAA-SIRST dataset and the public synthetic NUDT-SIRST dataset. Our code has been integrated into this https URL
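
As a rough illustration of what a gradient magnitude image is, the sketch below computes one with finite differences. The abstract does not state which gradient operator GGL-Net uses, so `np.gradient`, the function name, and the toy 5x5 frame are assumptions.

```python
import numpy as np

def gradient_magnitude(image):
    """Per-pixel gradient magnitude from finite differences."""
    image = np.asarray(image, dtype=np.float64)
    gy, gx = np.gradient(image)      # derivatives along rows, then columns
    return np.hypot(gx, gy)

frame = np.zeros((5, 5))
frame[2, 2] = 1.0                    # a one-pixel stand-in for a small target
mag = gradient_magnitude(frame)
```

The response is zero on the flat background and concentrated on the pixels bordering the target, which is why such an image can serve as an edge cue for a second network branch.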

[CV-44] StateSpace-SSL: Linear-Time Self-supervised Learning for Plant Disease Detection AAAI

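[Quick read]: This paper addresses two core problems of Self-supervised Learning (SSL) for plant disease detection: CNN-based SSL struggles to capture lesion patterns that evolve continuously along leaf structures, and Vision Transformer-based SSL is inefficient because attention cost grows quadratically with high-resolution patches. The key to the solution is the StateSpace-SSL framework. Its core is a Vision Mamba state-space encoder that models long-range lesion continuity across the leaf surface through directional scanning, giving linear-time feature extraction. It also introduces a prototype-driven teacher-student objective that aligns representations across multiple views, using labelled data, to learn stable, lesion-aware features. Experiments show that the method outperforms CNN and Transformer baselines on several public plant disease datasets.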

Link: https://arxiv.org/abs/2512.09492
Authors: Abdullah Al Mamun, Miaohua Zhang, David Ahmedt-Aristizabal, Zeeshan Hayder, Mohammad Awrangjeb
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to AAAI workshop (AgriAI 2026)

Abstract:Self-supervised learning (SSL) is attractive for plant disease detection as it can exploit large collections of unlabeled leaf images, yet most existing SSL methods are built on CNNs or vision transformers that are poorly matched to agricultural imagery. CNN-based SSL struggles to capture disease patterns that evolve continuously along leaf structures, while transformer-based SSL introduces quadratic attention cost from high-resolution patches. To address these limitations, we propose StateSpace-SSL, a linear-time SSL framework that employs a Vision Mamba state-space encoder to model long-range lesion continuity through directional scanning across the leaf surface. A prototype-driven teacher-student objective aligns representations across multiple views, encouraging stable and lesion-aware features from labelled data. Experiments on three publicly available plant disease datasets show that StateSpace-SSL consistently outperforms the CNN- and transformer-based SSL baselines in various evaluation metrics. Qualitative analyses further confirm that it learns compact, lesion-focused feature maps, highlighting the advantage of linear state-space modelling for self-supervised plant disease representation learning.

[CV-45] MODA: The First Challenging Benchmark for Multispectral Object Detection in Aerial Images

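[Quick read]: This paper addresses the performance bottleneck in aerial object detection caused by small objects and background interference, especially when detectors rely only on RGB images that lack sufficient discriminative information. The authors propose a solution based on Multispectral Images (MSIs): they build MODA, the first large-scale dataset for multispectral object detection in aerial images (14,041 images and 330,191 annotations), and design the OSSDet framework to integrate spectral and spatial information. OSSDet's key components are a cascaded spectral-spatial modulation structure that optimizes target perception, aggregation of spectrally related features via spectral similarity to reinforce intra-object correlations, object-aware masking to suppress irrelevant background, and cross-spectral attention that further refines object representations under explicit object-aware guidance.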

Link: https://arxiv.org/abs/2512.09489
Authors: Shuaihao Han, Tingfa Xu, Peifu Liu, Jianan Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 9 figures

Abstract:Aerial object detection faces significant challenges in real-world scenarios, such as small objects and extensive background interference, which limit the performance of RGB-based detectors with insufficient discriminative information. Multispectral images (MSIs) capture additional spectral cues across multiple bands, offering a promising alternative. However, the lack of training data has been the primary bottleneck to exploiting the potential of MSIs. To address this gap, we introduce the first large-scale dataset for Multispectral Object Detection in Aerial images (MODA), which comprises 14,041 MSIs and 330,191 annotations across diverse, challenging scenarios, providing a comprehensive data foundation for this field. Furthermore, to overcome challenges inherent to aerial object detection using MSIs, we propose OSSDet, a framework that integrates spectral and spatial information with object-aware cues. OSSDet employs a cascaded spectral-spatial modulation structure to optimize target perception, aggregates spectrally related features by exploiting spectral similarities to reinforce intra-object correlations, and suppresses irrelevant background via object-aware masking. Moreover, cross-spectral attention further refines object-related representations under explicit object-aware guidance. Extensive experiments demonstrate that OSSDet outperforms existing methods with comparable parameters and efficiency.

[CV-46] Color encoding in Latent Space of Stable Diffusion Models

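[Quick read]: This paper addresses the unclear encoding of specific perceptual attributes (such as color and shape) in the latent space of generative AI models. The key to the solution is a systematic analysis of the latent representations of Stable Diffusion using controlled synthetic datasets, principal component analysis (PCA), and similarity metrics. The analysis finds that color information is encoded along circular, opponent axes mainly in latent channels c₃ and c₄, whereas intensity and shape are represented mainly in channels c₁ and c₂, revealing an interpretable latent structure consistent with efficient coding principles.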

Link: https://arxiv.org/abs/2512.09477
Authors: Guillem Arias, Ariadna Solà, Martí Armengod, Maria Vanrell
Affiliations: Computer Vision Center; Universitat Autònoma de Barcelona
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 6 pages, 8 figures, Color Imaging Conference 33

Abstract:Recent advances in diffusion-based generative models have achieved remarkable visual fidelity, yet a detailed understanding of how specific perceptual attributes - such as color and shape - are internally represented remains limited. This work explores how color is encoded in a generative model through a systematic analysis of the latent representations in Stable Diffusion. Through controlled synthetic datasets, principal component analysis (PCA) and similarity metrics, we reveal that color information is encoded along circular, opponent axes predominantly captured in latent channels c_3 and c_4, whereas intensity and shape are primarily represented in channels c_1 and c_2. Our findings indicate that the latent space of Stable Diffusion exhibits an interpretable structure aligned with an efficient coding representation. These insights provide a foundation for future work in model understanding, editing applications, and the design of more disentangled generative frameworks.

[CV-47] Temporal-Spatial Tubelet Embedding for Cloud-Robust MSI Reconstruction using MSI-SAR Fusion: A Multi-Head Self-Attention Video Vision Transformer Approach

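[Quick read]: This paper addresses the difficulty of early-season crop mapping caused by cloud cover in Multispectral Imagery (MSI). Clouds corrupt spectral information, and existing Vision Transformer (ViT)-based time-series reconstruction methods use coarse temporal embeddings (for example, aggregating entire sequences), which causes substantial information loss and lowers reconstruction accuracy. The key to the solution is a Video Vision Transformer (ViViT)-based framework with a temporal-spatial fusion embedding: non-overlapping spatiotemporal tubelets are extracted by 3D convolution with a constrained temporal span (t=2), preserving local temporal coherence while reducing cross-day information degradation. Both MSI-only and Synthetic Aperture Radar (SAR) fusion scenarios are considered. Experiments on 2020 Traill County data show improved reconstruction over the baselines: a 2.23% reduction in MSE for MSI-only and a 10.33% improvement with SAR fusion.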

Link: https://arxiv.org/abs/2512.09471
Authors: Yiqun Wang, Lujun Li, Meiru Yue, Radu State
Affiliations: University of Luxembourg
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Cloud cover in multispectral imagery (MSI) significantly hinders early-season crop mapping by corrupting spectral information. Existing Vision Transformer (ViT)-based time-series reconstruction methods, like SMTS-ViT, often employ coarse temporal embeddings that aggregate entire sequences, causing substantial information loss and reducing reconstruction accuracy. To address these limitations, a Video Vision Transformer (ViViT)-based framework with temporal-spatial fusion embedding for MSI reconstruction in cloud-covered regions is proposed in this study. Non-overlapping tubelets are extracted via 3D convolution with constrained temporal span (t=2), ensuring local temporal coherence while reducing cross-day information degradation. Both MSI-only and SAR-MSI fusion scenarios are considered during the experiments. Comprehensive experiments on 2020 Traill County data demonstrate notable performance improvements: MTS-ViViT achieves a 2.23% reduction in MSE compared to the MTS-ViT baseline, while SMTS-ViViT achieves a 10.33% improvement with SAR integration over the SMTS-ViT baseline. The proposed framework effectively enhances spectral reconstruction quality for robust agricultural monitoring.
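
The tubelet extraction step can be sketched without a deep learning framework, because a 3D convolution whose kernel and stride both equal the tubelet size is equivalent to cutting non-overlapping blocks and applying one linear projection. Only the temporal span t=2 comes from the abstract; the patch size, channel count, embedding width, and function names below are assumptions.

```python
import numpy as np

def extract_tubelets(video, t=2, p=4):
    """Split a (T, H, W, C) sequence into non-overlapping t x p x p tubelets.

    Returns an array of shape (num_tubelets, t * p * p * C).
    """
    T, H, W, C = video.shape
    assert T % t == 0 and H % p == 0 and W % p == 0
    x = video.reshape(T // t, t, H // p, p, W // p, p, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)   # bring the block indices to the front
    return x.reshape(-1, t * p * p * C)

def embed_tubelets(tubelets, weight, bias):
    """Linear projection of flattened tubelets into token embeddings."""
    return tubelets @ weight + bias

# Hypothetical input: 6 acquisition dates, 8x8 pixels, 3 spectral channels.
video = np.arange(6 * 8 * 8 * 3, dtype=np.float64).reshape(6, 8, 8, 3)
tubelets = extract_tubelets(video, t=2, p=4)

rng = np.random.default_rng(0)
weight = rng.normal(size=(tubelets.shape[1], 16))   # 16-dim tokens (assumed)
tokens = embed_tubelets(tubelets, weight, np.zeros(16))
```

Each token here mixes information from exactly two consecutive dates, rather than from the whole sequence, which is the locality the abstract credits for reduced information loss.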

[CV-48] Privacy-Preserving Computer Vision for Industry: Three Case Studies in Human-Centric Manufacturing AAAI26

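[Quick read]: This paper addresses the trade-off between privacy and utility faced by AI-powered computer vision in industrial settings. The key to the solution is validating a privacy-preserving framework that uses learned, task-specific visual transformations to obscure sensitive or task-irrelevant information while retaining the features essential for the task, balancing task performance against privacy protection. Empirical results show that task-specific obfuscation enables effective industrial monitoring with reduced privacy risk, and the paper offers a practical technical path and cross-domain recommendations for responsible, human-centric AI deployment in industry.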

Link: https://arxiv.org/abs/2512.09463
Authors: Sander De Coninck, Emilio Gamba, Bart Van Doninck, Abdellatif Bey-Temsamani, Sam Leroux, Pieter Simoens
Affiliations: 1. KU Leuven; 2. imec
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to the AAAI26 HCM workshop

Abstract:The adoption of AI-powered computer vision in industry is often constrained by the need to balance operational utility with worker privacy. Building on our previously proposed privacy-preserving framework, this paper presents its first comprehensive validation on real-world data collected directly by industrial partners in active production environments. We evaluate the framework across three representative use cases: woodworking production monitoring, human-aware AGV navigation, and multi-camera ergonomic risk assessment. The approach employs learned visual transformations that obscure sensitive or task-irrelevant information while retaining features essential for task performance. Through both quantitative evaluation of the privacy-utility trade-off and qualitative feedback from industrial partners, we assess the framework’s effectiveness, deployment feasibility, and trust implications. Results demonstrate that task-specific obfuscation enables effective monitoring with reduced privacy risks, establishing the framework’s readiness for real-world adoption and providing cross-domain recommendations for responsible, human-centric AI deployment in industry.

[CV-49] Cytoplasmic Strings Analysis in Human Embryo Time-Lapse Videos using Deep Learning Framework

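[Quick read]: This paper addresses the limited efficiency and accuracy of embryo selection in In-vitro Fertilization (IVF), and in particular the lack of automated detection methods for Cytoplasmic Strings (CS), an emerging biomarker. CS connect the inner cell mass and the trophectoderm in expanded blastocysts, and their presence is associated with higher embryo quality and developmental potential, but assessment currently relies on manual visual inspection, which is subjective, inefficient, and sensitive to subtle morphological variation. The key to the solution is the first computational framework for CS analysis in human IVF embryos. A human-in-the-loop annotation pipeline is first used to build a biologically validated dataset of 13,568 frames with sparse positive samples. A two-stage deep learning model is then proposed: the first stage classifies CS presence at the frame level with Transformer backbones, and the second stage uses RF-DETR to localize the thin, low-contrast structures. The paper also introduces the Novel Uncertainty-aware Contractive Embedding (NUCE) loss, which combines confidence-aware reweighting with an embedding contraction term to mitigate severe class imbalance and feature uncertainty. NUCE improves F1-score, and the RF-DETR localization reaches state-of-the-art (SOTA) detection performance on this task.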

Link: https://arxiv.org/abs/2512.09461
Authors: Anabia Sohail, Mohamad Alansari, Ahmed Abughali, Asmaa Chehab, Abdelfatah Ahmed, Divya Velayudhan, Sajid Javed, Hasan Al Marzouqi, Ameena Saad Al-Sumaiti, Junaid Kashir, Naoufel Werghi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Infertility is a major global health issue, and while in-vitro fertilization has improved treatment outcomes, embryo selection remains a critical bottleneck. Time-lapse imaging enables continuous, non-invasive monitoring of embryo development, yet most automated assessment methods rely solely on conventional morphokinetic features and overlook emerging biomarkers. Cytoplasmic Strings, thin filamentous structures connecting the inner cell mass and trophectoderm in expanded blastocysts, have been associated with faster blastocyst formation, higher blastocyst grades, and improved viability. However, CS assessment currently depends on manual visual inspection, which is labor-intensive, subjective, and severely affected by detection and subtle visual appearance. In this work, we present, to the best of our knowledge, the first computational framework for CS analysis in human IVF embryos. We first design a human-in-the-loop annotation pipeline to curate a biologically validated CS dataset from TLI videos, comprising 13,568 frames with highly sparse CS-positive instances. Building on this dataset, we propose a two-stage deep learning framework that (i) classifies CS presence at the frame level and (ii) localizes CS regions in positive cases. To address severe imbalance and feature uncertainty, we introduce the Novel Uncertainty-aware Contractive Embedding (NUCE) loss, which couples confidence-aware reweighting with an embedding contraction term to form compact, well-separated class clusters. NUCE consistently improves F1-score across five transformer backbones, while RF-DETR-based localization achieves state-of-the-art (SOTA) detection performance for thin, low-contrast CS structures. The source code will be made publicly available at: this https URL.

[CV-50] Sequential Testing for Descriptor-Agnostic LiDAR Loop Closure in Repetitive Environments

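[Quick read]: This paper addresses false LiDAR loop closure detections caused by structural repetition, in particular the pose graph optimization errors introduced by aliased loops in indoor environments. Conventional methods typically rely on single-frame descriptor matching, or on fixed thresholds followed by late-stage Iterative Closest Point (ICP) verification, and struggle to separate true loops from aliased ones. The key to the solution is a descriptor-agnostic, multi-frame loop closure verification mechanism that formulates the decision as a truncated Sequential Probability Ratio Test (SPRT): it accumulates a short temporal sequence of descriptor similarities between the query and each candidate, and adaptively accepts or rejects once the evidence meets user-specified Type-I and Type-II error targets. This precision-first policy improves the robustness and precision of loop closure verification.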

Link: https://arxiv.org/abs/2512.09447
Authors: Jaehyun Kim, Seungwon Choi, Tae-Wan Kim
Affiliations: Unknown
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 4 figures

Abstract:We propose a descriptor-agnostic, multi-frame loop closure verification method that formulates LiDAR loop closure as a truncated Sequential Probability Ratio Test (SPRT). Instead of deciding from a single descriptor comparison or using fixed thresholds with late-stage Iterative Closest Point (ICP) vetting, the verifier accumulates a short temporal stream of descriptor similarities between a query and each candidate. It then issues an accept/reject decision adaptively once sufficient multi-frame evidence has been observed, according to user-specified Type-I/II error design targets. This precision-first policy is designed to suppress false positives in structurally repetitive indoor environments. We evaluate the verifier on a five-sequence library dataset, using a fixed retrieval front-end with several representative LiDAR global descriptors. Performance is assessed via segment-level K-hit precision-recall and absolute trajectory error (ATE) and relative pose error (RPE) after pose graph optimization. Across descriptors, the sequential verifier consistently improves precision and reduces the impact of aliased loops compared with single-frame and heuristic multi-frame baselines. Our implementation and dataset will be released at: this https URL.
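
A truncated SPRT over a stream of similarity scores might look like the sketch below. The abstract does not give the likelihood model, so this assumes Gaussian similarities with a shared standard deviation, a mean `mu1` for true loops and `mu0` for aliased ones; all parameter values, the function name, and the reject-on-truncation rule are illustrative assumptions. The thresholds are Wald's standard approximations from the Type-I (`alpha`) and Type-II (`beta`) targets.

```python
import math

def sprt_verify(similarities, mu0=0.5, mu1=0.8, sigma=0.15,
                alpha=0.01, beta=0.1, max_frames=10):
    """Accumulate log-likelihood ratios until a Wald threshold is crossed.

    Returns (decision, frames_used), where decision is "accept" or "reject".
    """
    upper = math.log((1 - beta) / alpha)   # accept the loop at or above this
    lower = math.log(beta / (1 - alpha))   # reject the loop at or below this
    llr = 0.0
    used = 0
    for s in similarities[:max_frames]:
        used += 1
        # Gaussian log-likelihood ratio of "true loop" vs "aliased loop".
        llr += ((s - mu0) ** 2 - (s - mu1) ** 2) / (2 * sigma ** 2)
        if llr >= upper:
            return "accept", used
        if llr <= lower:
            return "reject", used
    # Truncation: no threshold crossed; reject to favor precision.
    return "reject", used

strong = sprt_verify([0.8] * 12)      # consistently high similarity
weak = sprt_verify([0.5] * 12)        # consistently low similarity
ambiguous = sprt_verify([0.65] * 12)  # midway between the two hypotheses
```

Clear cases terminate after a few frames, while ambiguous streams run to the truncation limit and are rejected, which mirrors the precision-first behavior described in the abstract.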

[CV-51] Defect-aware Hybrid Prompt Optimization via Progressive Tuning for Zero-Shot Multi-type Anomaly Detection and Segmentation

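Summary: This paper addresses the fact that current vision language models (VLMs) for anomaly detection neglect fine-grained anomaly types (such as “hole”, “cut”, “scratch”), which limits understanding of the nature of anomalies and makes it hard to support targeted root-cause analysis and corrective measures in manufacturing. The key to the solution is Defect-aware Prompt Optimization (DAPO), which uses progressive tuning to align image features with text semantics for zero-shot multi-type and binary anomaly detection and segmentation. Specifically, DAPO learns hybrid prompts that combine fixed textual anchors with learnable token embeddings, giving a structured semantic model of different defect types. Under distribution shift it improves image-level AUROC and average precision by 3.7% on average, and it improves localization of novel anomaly types by 6.5% on average under zero-shot settings.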

Link: https://arxiv.org/abs/2512.09446
Authors: Nadeem Nazer, Hongkuan Zhou, Lavdim Halilaj, Ylli Sadikaj, Steffen Staab
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent vision language models (VLMs) like CLIP have demonstrated impressive anomaly detection performance under significant distribution shift by utilizing high-level semantic information through text prompts. However, these models often neglect fine-grained details, such as the specific kind of anomaly (e.g., “hole”, “cut”, “scratch”), which could provide more specific insight into the nature of anomalies. We argue that recognizing fine-grained anomaly types 1) enriches the representation of “abnormal” with structured semantics, narrowing the gap between coarse anomaly signals and fine-grained defect categories; 2) enables manufacturers to understand the root causes of the anomaly and implement more targeted and appropriate corrective measures quickly. While incorporating such detailed semantic information is crucial, designing handcrafted prompts for each defect type is both time-consuming and susceptible to human bias. For this reason, we introduce DAPO, a novel approach for Defect-aware Prompt Optimization based on progressive tuning for zero-shot multi-type and binary anomaly detection and segmentation under distribution shifts. Our approach aligns anomaly-relevant image features with their corresponding text semantics by learning hybrid defect-aware prompts with both fixed textual anchors and learnable token embeddings. We conducted experiments on public benchmarks (MPDD, VisA, MVTec-AD, MAD, and Real-IAD) and an internal dataset. The results suggest that compared to the baseline models, DAPO achieves a 3.7% average improvement in AUROC and average precision metrics at the image level under distribution shift, and a 6.5% average improvement in localizing novel anomaly types under zero-shot settings.

[CV-52] Representation Calibration and Uncertainty Guidance for Class-Incremental Learning based on Vision Language Model

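Summary: This paper addresses cross-task class confusion in class-incremental learning (CIL) based on vision-language models (VLMs): when learning new classes, the model struggles to separate their features from those of old classes, which leads to forgetting and degraded performance. The solution has three key parts: (1) task-specific adapters are added to the pre-trained, frozen image encoder to learn new classes in a parameter-efficient way; (2) a cross-task representation calibration strategy based on a mixture of lightweight projectors better separates all learned classes in a unified feature space and alleviates class confusion; (3) an inference strategy guided by prediction uncertainty dynamically selects the most appropriate image feature for classification, improving prediction accuracy.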

Link: https://arxiv.org/abs/2512.09441
Authors: Jiantao Tan, Peixian Ma, Tong Yu, Wentao Zhang, Ruixuan Wang
Affiliations: Sun Yat-sen University; Guangdong Province Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education; Hong Kong University of Science and Technology (Guangzhou); Peng Cheng Laboratory; Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Class-incremental learning requires a learning system to continually learn knowledge of new classes while preserving previously learned knowledge of old classes. Current state-of-the-art methods based on Vision-Language Models (VLMs) still struggle to differentiate classes across learning tasks. Here, a novel VLM-based continual learning framework for image classification is proposed. In this framework, task-specific adapters are added to the pre-trained and frozen image encoder to learn new knowledge, and a novel cross-task representation calibration strategy based on a mixture of light-weight projectors is used to help better separate all learned classes in a unified feature space, alleviating class confusion across tasks. In addition, a novel inference strategy guided by prediction uncertainty is developed to more accurately select the most appropriate image feature for class prediction. Extensive experiments on multiple datasets under various settings demonstrate the superior performance of our method compared to existing ones.
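One simple way to read "inference guided by prediction uncertainty" is to score each task-specific feature branch by the entropy of its class distribution and keep the most confident one. The sketch below assumes exactly that; the use of softmax entropy as the uncertainty measure and the function names are hypothetical and may differ from the paper's actual strategy.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means more uncertain."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_by_uncertainty(logits_per_adapter):
    """Return (adapter_index, predicted_class) for the least-uncertain adapter.

    logits_per_adapter holds one list of class logits per task-specific
    adapter (a hypothetical interface for illustration).
    """
    best = min(range(len(logits_per_adapter)),
               key=lambda i: entropy(softmax(logits_per_adapter[i])))
    probs = softmax(logits_per_adapter[best])
    return best, probs.index(max(probs))
```

For example, `select_by_uncertainty([[1.0, 1.0, 1.0], [0.0, 5.0, 0.0]])` picks the second adapter, because the first produces a uniform (maximally uncertain) distribution.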

[CV-53] UniPart: Part-Level 3D Generation with Unified 3D Geom-Seg Latents

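Summary: This paper addresses the limited part-level structural control of existing 3D generation methods, which either rely on implicit segmentation with limited granularity control or need strong external segmenters trained on large annotated datasets. The key to the solution is Geom-Seg VecSet, a unified geometry-segmentation latent representation that jointly encodes object geometry and part-level structure. On top of it, the two-stage latent diffusion framework UniPart first performs joint geometry generation and latent part segmentation, then conditions part-level diffusion on both whole-object and part-specific latents, and adds a dual-space generation scheme to improve geometric fidelity.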

Link: https://arxiv.org/abs/2512.09435
Authors: Xufan He, Yushuang Wu, Xiaoyang Guo, Chongjie Ye, Jiaqing Zhou, Tianlei Hu, Xiaoguang Han, Dong Du
Affiliations: Nanjing University of Science and Technology; ByteDance; The Chinese University of Hong Kong, Shenzhen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Part-level 3D generation is essential for applications requiring decomposable and structured 3D synthesis. However, existing methods either rely on implicit part segmentation with limited granularity control or depend on strong external segmenters trained on large annotated datasets. In this work, we observe that part awareness emerges naturally during whole-object geometry learning and propose Geom-Seg VecSet, a unified geometry-segmentation latent representation that jointly encodes object geometry and part-level structure. Building on this representation, we introduce UniPart, a two-stage latent diffusion framework for image-guided part-level 3D generation. The first stage performs joint geometry generation and latent part segmentation, while the second stage conditions part-level diffusion on both whole-object and part-specific latents. A dual-space generation scheme further enhances geometric fidelity by predicting part latents in both global and canonical spaces. Extensive experiments demonstrate that UniPart achieves superior segmentation controllability and part-level geometric quality compared with existing approaches.

[CV-54] FunPhase: A Periodic Functional Autoencoder for Motion Generation via Phase Manifolds

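Summary: This paper addresses the challenge that the strong coupling between spatial geometry and temporal dynamics poses for modeling natural human motion, and in particular the limited scalability of existing motion prediction methods based on phase manifolds. The key to the solution is FunPhase, a functional periodic autoencoder that embeds motion in a latent manifold capturing local periodicity and replaces discrete temporal decoding with a function-space formulation, so that smooth trajectories can be sampled at arbitrary temporal resolution. The design substantially lowers reconstruction error, supports downstream tasks such as super-resolution and partial-body motion completion, generalizes across skeletons and datasets, and unifies motion prediction and generation within a single interpretable manifold.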

Link: https://arxiv.org/abs/2512.09423
Authors: Marco Pegoraro, Evan Atherton, Bruno Roy, Aliasghar Khani, Arianna Rampini
Affiliations: Autodesk Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Learning natural body motion remains challenging due to the strong coupling between spatial geometry and temporal dynamics. Embedding motion in phase manifolds, latent spaces that capture local periodicity, has proven effective for motion prediction; however, existing approaches lack scalability and remain confined to specific settings. We introduce FunPhase, a functional periodic autoencoder that learns a phase manifold for motion and replaces discrete temporal decoding with a function-space formulation, enabling smooth trajectories that can be sampled at arbitrary temporal resolutions. FunPhase supports downstream tasks such as super-resolution and partial-body motion completion, generalizes across skeletons and datasets, and unifies motion prediction and generation within a single interpretable manifold. Our model achieves substantially lower reconstruction error than prior periodic autoencoder baselines while enabling a broader range of applications and performing on par with state-of-the-art motion generation methods.

[CV-55] InfoMotion: A Graph-Based Approach to Video Dataset Distillation for Echocardiography

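Summary: This paper addresses the storage, computation, and training-efficiency problems caused by the growing scale of echocardiography video data. The key to the solution is a new dataset distillation method: it extracts motion features to capture temporal dynamics, builds class-wise graphs, and selects representative samples with the Infomap algorithm, yielding a compact and informative subset of synthetic videos that retains the key clinical features of the original data. On EchoNet-Dynamic, the method reaches 69.38% test accuracy using only 25 synthetic videos, which indicates that it is effective and scalable for medical video dataset distillation.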

Link: https://arxiv.org/abs/2512.09422
Authors: Zhe Li, Hadrien Reynaud, Alberto Gomez, Bernhard Kainz
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at MICAD 2025

Abstract:Echocardiography plays a critical role in the diagnosis and monitoring of cardiovascular diseases as a non-invasive, real-time assessment of cardiac structure and function. However, the growing scale of echocardiographic video data presents significant challenges in terms of storage, computation, and model training efficiency. Dataset distillation offers a promising solution by synthesizing a compact, informative subset of data that retains the key clinical features of the original dataset. In this work, we propose a novel approach for distilling a compact synthetic echocardiographic video dataset. Our method leverages motion feature extraction to capture temporal dynamics, followed by class-wise graph construction and representative sample selection using the Infomap algorithm. This enables us to select a diverse and informative subset of synthetic videos that preserves the essential characteristics of the original dataset. We evaluate our approach on the EchoNet-Dynamic dataset and achieve a test accuracy of 69.38% using only 25 synthetic videos. These results demonstrate the effectiveness and scalability of our method for medical video dataset distillation.

[CV-56] Label-free Motion-Conditioned Diffusion Model for Cardiac Ultrasound Synthesis

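Summary: This paper addresses the scarcity of labelled data, caused by privacy restrictions and the complexity of expert annotation, which holds back deep learning for echocardiography video generation. The key to the solution is the Motion Conditioned Diffusion Model (MCDM), a label-free latent diffusion framework that conditions generation on self-supervised motion features to synthesize temporally coherent and clinically realistic echocardiography videos. Its contributions include the Motion and Appearance Feature Extractor (MAFE) and two auxiliary objectives, a re-identification loss guided by pseudo appearance features and an optical flow loss guided by pseudo flow fields, which strengthen feature learning and remove the need for manual labels.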

Link: https://arxiv.org/abs/2512.09418
Authors: Zhe Li, Hadrien Reynaud, Johanna P Müller, Bernhard Kainz
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at MICAD 2025

Abstract:Ultrasound echocardiography is essential for the non-invasive, real-time assessment of cardiac function, but the scarcity of labelled data, driven by privacy restrictions and the complexity of expert annotation, remains a major obstacle for deep learning methods. We propose the Motion Conditioned Diffusion Model (MCDM), a label-free latent diffusion framework that synthesises realistic echocardiography videos conditioned on self-supervised motion features. To extract these features, we design the Motion and Appearance Feature Extractor (MAFE), which disentangles motion and appearance representations from videos. Feature learning is further enhanced by two auxiliary objectives: a re-identification loss guided by pseudo appearance features and an optical flow loss guided by pseudo flow fields. Evaluated on the EchoNet-Dynamic dataset, MCDM achieves competitive video generation performance, producing temporally coherent and clinically realistic sequences without reliance on manual labels. These results demonstrate the potential of self-supervised conditioning for scalable echocardiography synthesis. Our code is available at this https URL.

[CV-57] DirectSwap: Mask-Free Cross-Identity Training and Benchmarking for Expression-Consistent Video Head Swapping

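Summary: This paper addresses identity leakage and poor motion and expression consistency in video head swapping, which stem from the lack of ground-truth paired data. Existing methods train on cross-frame pairs from the same video and use mask-based inpainting to mitigate identity leakage, but this tends to produce boundary artifacts and cannot recover key cues occluded by the mask (such as facial pose, expressions, and motion dynamics). The key to the solution is HeadSwapBench, the first cross-identity paired dataset, built by prompting a video editing model to generate fake swapping inputs so that genuine paired supervision becomes available. On top of it, DirectSwap is a mask-free, direct video head-swapping framework that extends an image U-Net into a video diffusion model with a motion module and conditioning inputs. It also introduces the Motion- and Expression-Aware Reconstruction (MEAR) loss, which reweights the diffusion loss using frame-difference magnitudes and facial-landmark proximity to improve cross-frame consistency of motion and expressions.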

Link: https://arxiv.org/abs/2512.09417
Authors: Yanan Wang, Shengcai Liao, Panwen Hu, Xin Li, Fan Yang, Xiaodan Liang
Affiliations: MBZUAI; UAEU; AIQ; SYSU
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Video head swapping aims to replace the entire head of a video subject, including facial identity, head shape, and hairstyle, with that of a reference image, while preserving the target body, background, and motion dynamics. Due to the lack of ground-truth paired swapping data, prior methods typically train on cross-frame pairs of the same person within a video and rely on mask-based inpainting to mitigate identity leakage. Beyond potential boundary artifacts, this paradigm struggles to recover essential cues occluded by the mask, such as facial pose, expressions, and motion dynamics. To address these issues, we prompt a video editing model to synthesize new heads for existing videos as fake swapping inputs, while maintaining frame-synchronized facial poses and expressions. This yields HeadSwapBench, the first cross-identity paired dataset for video head swapping, which supports both training (\TrainNum videos) and benchmarking (\TestNum videos) with genuine outputs. Leveraging this paired supervision, we propose DirectSwap, a mask-free, direct video head-swapping framework that extends an image U-Net into a video diffusion model with a motion module and conditioning inputs. Furthermore, we introduce the Motion- and Expression-Aware Reconstruction (MEAR) loss, which reweights the diffusion loss per pixel using frame-difference magnitudes and facial-landmark proximity, thereby enhancing cross-frame coherence in motion and expressions. Extensive experiments demonstrate that DirectSwap achieves state-of-the-art visual quality, identity fidelity, and motion and expression consistency across diverse in-the-wild video scenes. We will release the source code and the HeadSwapBench dataset to facilitate future research.

[CV-58] Generative Point Cloud Registration

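Summary: This paper addresses insufficiently robust feature matching in 3D point cloud registration, especially the difficulty of fusing geometric and texture information in complex scenes. The key to the solution is a new paradigm, Generative Point Cloud Registration, which bridges advanced 2D generative models with 3D matching. It uses Match-ControlNet, a controllable 2D generative model designed for matching, to generate cross-view consistent image pairs that are well aligned with the source and target point clouds. The method enforces 2D-3D geometric consistency and cross-view texture consistency, which enables geometry-color feature fusion and improves matching accuracy and robustness.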

Link: https://arxiv.org/abs/2512.09407
Authors: Haobo Jiang, Jin Xie, Jian Yang, Liang Yu, Jianmin Zheng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 9 figures

Abstract:In this paper, we propose a novel 3D registration paradigm, Generative Point Cloud Registration, which bridges advanced 2D generative models with 3D matching tasks to enhance registration performance. Our key idea is to generate cross-view consistent image pairs that are well-aligned with the source and target point clouds, enabling geometry-color feature fusion to facilitate robust matching. To ensure high-quality matching, the generated image pair should feature both 2D-3D geometric consistency and cross-view texture consistency. To achieve this, we introduce Match-ControlNet, a matching-specific, controllable 2D generative model. Specifically, it leverages the depth-conditioned generation capability of ControlNet to produce images that are geometrically aligned with depth maps derived from point clouds, ensuring 2D-3D geometric consistency. Additionally, by incorporating a coupled conditional denoising scheme and coupled prompt guidance, Match-ControlNet further promotes cross-view feature interaction, guiding texture consistency generation. Our generative 3D registration paradigm is general and could be seamlessly integrated into various registration methods to enhance their performance. Extensive experiments on 3DMatch and ScanNet datasets verify the effectiveness of our approach.

[CV-59] H2R-Grounder: A Paired-Data-Free Paradigm for Translating Human Interaction Videos into Physically Grounded Robot Videos

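Summary: This paper addresses how robots can learn manipulation skills from unlabeled everyday human videos, avoiding the high cost and poor scalability of methods that depend on heavy manual annotation or paired human-robot data. The core of the solution is a video-to-video translation framework with a transferable representation that bridges the embodiment gap between humans and robots. During training, the robot arm is inpainted out of robot videos and a simple visual cue (a marker and arrow indicating the gripper's position and orientation) is overlaid, which conditions a generative model to reconstruct physically plausible robot motion. At test time, the same process is applied to human videos (inpainting the person and overlaying human pose cues) to generate motion-consistent, physically grounded robot manipulation videos. The method needs only unpaired robot videos for training and fine-tunes a state-of-the-art video diffusion model (Wan 2.2) in an in-context learning manner to ensure temporal coherence and exploit its rich prior knowledge, which markedly improves the realism and physical plausibility of the generated robot motion.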

Link: https://arxiv.org/abs/2512.09406
Authors: Hai Ci, Xiaokang Liu, Pei Yang, Yiren Song, Mike Zheng Shou
Affiliations: Show Lab, National University of Singapore
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 6 figures

Abstract:Robots that learn manipulation skills from everyday human videos could acquire broad capabilities without tedious robot data collection. We propose a video-to-video translation framework that converts ordinary human-object interaction videos into motion-consistent robot manipulation videos with realistic, physically grounded interactions. Our approach does not require any paired human-robot videos for training, only a set of unpaired robot videos, making the system easy to scale. We introduce a transferable representation that bridges the embodiment gap: by inpainting the robot arm in training videos to obtain a clean background and overlaying a simple visual cue (a marker and arrow indicating the gripper’s position and orientation), we can condition a generative model to insert the robot arm back into the scene. At test time, we apply the same process to human videos (inpainting the person and overlaying human pose cues) and generate high-quality robot videos that mimic the human’s actions. We fine-tune a SOTA video diffusion model (Wan 2.2) in an in-context learning manner to ensure temporal coherence and to leverage its rich prior knowledge. Empirical results demonstrate that our approach achieves significantly more realistic and grounded robot motions compared to baselines, pointing to a promising direction for scaling up robot learning from unlabeled human videos. Project page: this https URL

[CV-60] Wasserstein-Aligned Hyperbolic Multi-View Clustering

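Summary: This paper addresses semantic inconsistency in multi-view clustering (MVC) caused by representation gaps between views, and in particular the shortcoming that existing hyperbolic-space methods focus only on instance-level alignment and neglect global semantic consistency. The key to the solution is the Wasserstein-Aligned Hyperbolic (WAH) framework. It first uses a view-specific hyperbolic encoder for each view to embed features into the Lorentz manifold for hierarchical semantic modeling. It then introduces a global semantic loss based on the hyperbolic sliced-Wasserstein distance to align the manifold distributions of different views. Finally, soft cluster assignments encourage cross-view semantic consistency, which mitigates the effect of noise and cross-view discrepancies.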

Link: https://arxiv.org/abs/2512.09402
Authors: Rui Wang, Yuting Jiang, Xiaoqing Luo, Xiao-Jun Wu, Nicu Sebe, Ziheng Chen
Affiliations: 1. University of Science and Technology of China; 2. University of Amsterdam
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages

Abstract:Multi-view clustering (MVC) aims to uncover the latent structure of multi-view data by learning view-common and view-specific information. Although recent studies have explored hyperbolic representations for better tackling the representation gap between different views, they focus primarily on instance-level alignment and neglect global semantic consistency, rendering them vulnerable to view-specific information (e.g., noise and cross-view discrepancies). To this end, this paper proposes a novel Wasserstein-Aligned Hyperbolic (WAH) framework for multi-view clustering. Specifically, our method exploits a view-specific hyperbolic encoder for each view to embed features into the Lorentz manifold for hierarchical semantic modeling. Whereafter, a global semantic loss based on the hyperbolic sliced-Wasserstein distance is introduced to align manifold distributions across views. This is followed by soft cluster assignments to encourage cross-view semantic consistency. Extensive experiments on multiple benchmarking datasets show that our method can achieve SOTA clustering performance.
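As background for the distribution-alignment loss mentioned above, the following sketch estimates the ordinary Euclidean sliced-Wasserstein distance between two equal-size point sets. Note the substitution: this is the standard Euclidean version, not the hyperbolic variant on the Lorentz manifold that the paper uses, and the function name and parameters are illustrative assumptions.

```python
import math
import random

def sliced_wasserstein(xs, ys, n_projections=64, seed=0):
    """Monte Carlo estimate of the sliced 2-Wasserstein distance.

    xs, ys: equal-length lists of points (tuples of floats, same dimension).
    Each random unit direction reduces the problem to 1D, where optimal
    transport between equal-size samples is solved by sorting.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal-size point sets")
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(n_projections):
        # Draw a random direction on the unit sphere.
        d = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(v * v for v in d))
        d = [v / norm for v in d]
        # Project both sets onto the direction and sort.
        px = sorted(sum(a * b for a, b in zip(p, d)) for p in xs)
        py = sorted(sum(a * b for a, b in zip(p, d)) for p in ys)
        total += sum((a - b) ** 2 for a, b in zip(px, py)) / len(px)
    return math.sqrt(total / n_projections)
```

Two identical point sets give a distance of exactly zero, and translating one set increases the distance, which is the behavior an alignment loss relies on.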

[CV-61] Detection and Localization of Subdural Hematoma Using Deep Learning on Computed Tomography

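Summary: This paper addresses the difficulty of rapidly and accurately identifying and localizing subdural hematoma (SDH) in clinical practice, and in particular the limited detection accuracy, interpretability, and spatial localization of existing automated tools. The key to the solution is a multimodal deep learning framework that integrates structured clinical variables (demographics, comorbidities, medications, and laboratory results), a 3D convolutional neural network trained on CT volumes, and a transformer-enhanced 2D segmentation model, and combines their predictions with a greedy ensemble strategy. The framework achieves high-accuracy SDH detection (AUC 0.9407) and produces anatomically meaningful localization maps, improving transparency and clinical usefulness.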

Link: https://arxiv.org/abs/2512.09393
Authors: Vasiliki Stoumpou, Rohan Kumar, Bernard Burman, Diego Ojeda, Tapan Mehta, Dimitris Bertsimas
Affiliations: Massachusetts Institute of Technology; Boston University; University of Connecticut School of Medicine; Hartford HealthCare; Sloan School of Management, Massachusetts Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Background. Subdural hematoma (SDH) is a common neurosurgical emergency, with increasing incidence in aging populations. Rapid and accurate identification is essential to guide timely intervention, yet existing automated tools focus primarily on detection and provide limited interpretability or spatial localization. There remains a need for transparent, high-performing systems that integrate multimodal clinical and imaging information to support real-time decision-making. Methods. We developed a multimodal deep-learning framework that integrates structured clinical variables, a 3D convolutional neural network trained on CT volumes, and a transformer-enhanced 2D segmentation model for SDH detection and localization. Using 25,315 head CT studies from Hartford HealthCare (2015–2024), of which 3,774 (14.9%) contained clinician-confirmed SDH, tabular models were trained on demographics, comorbidities, medications, and laboratory results. Imaging models were trained to detect SDH and generate voxel-level probability maps. A greedy ensemble strategy combined complementary predictors. Findings. Clinical variables alone provided modest discriminatory power (AUC 0.75). Convolutional models trained on CT volumes and segmentation-derived maps achieved substantially higher accuracy (AUCs 0.922 and 0.926). The multimodal ensemble integrating all components achieved the best overall performance (AUC 0.9407; 95% CI, 0.930–0.951) and produced anatomically meaningful localization maps consistent with known SDH patterns. Interpretation. This multimodal, interpretable framework provides rapid and accurate SDH detection and localization, achieving high detection performance and offering transparent, anatomically grounded outputs. Integration into radiology workflows could streamline triage, reduce time to intervention, and improve consistency in SDH management. 
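The abstract does not spell out the greedy ensemble strategy, so the sketch below shows one common form as an assumption: forward selection with replacement, where at each round the model whose addition most improves validation AUC of the averaged score is added. The model names, the averaging rule, and the stopping criterion are hypothetical.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def greedy_ensemble(labels, model_scores, max_rounds=10):
    """Greedily add models (with replacement) while validation AUC improves.

    model_scores maps a model name to its list of validation scores.
    Returns (list_of_chosen_names, best_auc).
    """
    chosen, best_auc = [], 0.0
    running = [0.0] * len(labels)  # sum of scores of the chosen models
    for _ in range(max_rounds):
        best_name, best_candidate = None, best_auc
        for name, scores in model_scores.items():
            trial = [(r + s) / (len(chosen) + 1) for r, s in zip(running, scores)]
            candidate = auc(labels, trial)
            if candidate > best_candidate:
                best_name, best_candidate = name, candidate
        if best_name is None:
            break  # no model improves the ensemble any further
        chosen.append(best_name)
        running = [r + s for r, s in zip(running, model_scores[best_name])]
        best_auc = best_candidate
    return chosen, best_auc
```

In this toy setup two individually mediocre predictors (AUC 0.75 each) can combine into a better ensemble when their errors fall on different cases, which is the intuition behind combining "complementary predictors".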
Submission history: [v1] Wed, 10 Dec 2025 07:37:42 UTC (1,049 KB), submitted by Vasiliki Stoumpou

[CV-62] Perception-Inspired Color Space Design for Photo White Balance Editing WACV2026

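Summary: This paper addresses the limitations of white balance (WB) correction in the image signal processor (ISP) pipeline: when the original camera RAW data is unavailable, sRGB-based post-ISP WB methods struggle with complex lighting because of fixed nonlinear transformations and entangled color channels. The key to the solution is a perception-inspired Learnable HSI (LHSI) color space. It builds on a cylindrical color model that naturally separates luminance from chromatic components, and adds dedicated parameters to strengthen this disentanglement plus a learnable mapping for adaptive flexibility. The paper also designs a Mamba-based network tailored to the LHSI space. Experiments on benchmark datasets show that the framework outperforms existing methods, supporting the potential of perception-driven color space design in computational photography.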

Link: https://arxiv.org/abs/2512.09383
Authors: Yang Cheng, Ziteng Cui, Lin Gu, Shenghan Su, Zenghui Zhang
Affiliations: Shanghai Jiao Tong University; The University of Tokyo; Tohoku University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to WACV 2026

Abstract:White balance (WB) is a key step in the image signal processor (ISP) pipeline that mitigates color casts caused by varying illumination and restores the scene’s true colors. Currently, sRGB-based WB editing for post-ISP WB correction is widely used to address color constancy failures in the ISP pipeline when the original camera RAW is unavailable. However, additive color models (e.g., sRGB) are inherently limited by fixed nonlinear transformations and entangled color channels, which often impede their generalization to complex lighting conditions. To address these challenges, we propose a novel framework for WB correction that leverages a perception-inspired Learnable HSI (LHSI) color space. Built upon a cylindrical color model that naturally separates luminance from chromatic components, our framework further introduces dedicated parameters to enhance this disentanglement and learnable mapping to adaptively refine the flexibility. Moreover, a new Mamba-based network is introduced, which is tailored to the characteristics of the proposed LHSI color space. Experimental results on benchmark datasets demonstrate the superiority of our method, highlighting the potential of perception-inspired color space design in computational photography. The source code is available at this https URL.
ACM classes: I.4
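For readers unfamiliar with the cylindrical model that LHSI builds on, the classical fixed RGB-to-HSI conversion can be sketched as follows. This is the textbook transform, shown only to illustrate how intensity is separated from hue and saturation; it is not the paper's learnable LHSI mapping, and the conventions for undefined hue (returning 0 for gray and black) are choices made here.

```python
import math

def rgb_to_hsi(r, g, b):
    """Convert RGB in [0, 1] to (hue in degrees, saturation, intensity)."""
    intensity = (r + g + b) / 3.0
    if intensity == 0.0:
        return 0.0, 0.0, 0.0  # black: hue and saturation undefined, use 0
    saturation = 1.0 - min(r, g, b) / intensity
    num = 0.5 * ((r - g) + (r - b))
    # max(0.0, ...) guards against tiny negative values from rounding.
    den = math.sqrt(max(0.0, (r - g) ** 2 + (r - b) * (g - b)))
    if den == 0.0:
        return 0.0, saturation, intensity  # gray: hue undefined, use 0
    # Clamp the cosine to [-1, 1] before acos to absorb rounding error.
    hue = math.degrees(math.acos(max(-1.0, min(1.0, num / den))))
    if b > g:
        hue = 360.0 - hue  # lower half of the hue circle
    return hue, saturation, intensity
```

Pure red, green, and blue map to hues of 0, 120, and 240 degrees, all with the same intensity of 1/3, which shows how the chromatic information sits in the angular coordinate rather than in the brightness axis.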

[CV-63] Rates and architectures for learning geometrically non-trivial operators

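Summary: This paper addresses the sample efficiency of learning high-dimensional operators in scientific machine learning, particularly for physical processes that propagate singularities (such as waves, advection, and fluid dynamics), for which data-efficiency had so far been proven only in simpler geometric settings. The key to the solution is to bring in double fibration transforms, a class of geometric integral operators that includes generalized Radon transforms and geodesic ray transforms, and to prove that the learning error for this class decays superalgebraically (faster than any fixed power of the reciprocal of the number of training samples), so it does not suffer from the curse of dimensionality. The paper also proposes a neural architecture that explicitly encodes this geometry, drawing on level-set methods and cross-attention, which gives a universal and stable parameterization of double fibration transforms and learns them from very few training samples.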

Link: https://arxiv.org/abs/2512.09376
Authors: T. Mitchell Roddenberry, Leo Tzou, Ivan Dokmanić, Maarten V. de Hoop, Richard G. Baraniuk
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Differential Geometry (math.DG)
Comments: 26 pages, 5 figures

Abstract:Deep learning methods have proven capable of recovering operators between high-dimensional spaces, such as solution maps of PDEs and similar objects in mathematical physics, from very few training samples. This phenomenon of data-efficiency has been proven for certain classes of elliptic operators with simple geometry, i.e., operators that do not change the domain of the function or propagate singularities. However, scientific machine learning is commonly used for problems that do involve the propagation of singularities in a priori unknown ways, such as waves, advection, and fluid dynamics. In light of this, we expand the learning theory to include double fibration transforms–geometric integral operators that include generalized Radon and geodesic ray transforms. We prove that this class of operators does not suffer from the curse of dimensionality: the error decays superalgebraically, that is, faster than any fixed power of the reciprocal of the number of training samples. Furthermore, we investigate architectures that explicitly encode the geometry of these transforms, demonstrating that an architecture reminiscent of cross-attention based on levelset methods yields a parameterization that is universal, stable, and learns double fibration transforms from very few training examples. Our results contribute to a rapidly-growing line of theoretical work on learning operators for scientific machine learning.
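The phrase "decays superalgebraically" in the abstract can be written out explicitly. The symbols $\varepsilon(n)$ (learning error after $n$ training samples) and $C_k$ (a constant depending on $k$) are notation introduced here for illustration and are not taken from the paper.

```latex
\forall k > 0 \;\; \exists\, C_k > 0 : \qquad
\varepsilon(n) \;\le\; C_k \, n^{-k} \quad \text{for all } n \ge 1 .
```

In words, the error eventually falls below every polynomial rate $n^{-k}$, in contrast to a single fixed algebraic rate whose exponent would typically shrink as the dimension grows.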

[CV-64] Log NeRF: Comparing Spaces for Learning Radiance Fields

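Summary: This paper addresses the difficulty Neural Radiance Fields (NeRF) have in separating illumination from reflectance when supervised in the sRGB color space. Conventional approaches typically take nonlinearly compressed sRGB images as input, so the learned representation of scene appearance is less compact and robust than it could be. The key to the solution is to use a log RGB color space: inspired by the BiIlluminant Dichromatic Reflection (BIDR) model, the authors hypothesize that a logarithmic transformation simplifies the separation of illumination and reflectance and thereby helps NeRF model scene appearance. Experiments show that, with input images of the same bit depth, log RGB consistently improves rendering quality and robustness across scenes, performs particularly well in low-light conditions, and retains its advantage across network sizes and NeRF variants.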

Link: https://arxiv.org/abs/2512.09375
Authors: Sihe Chen, Luv Verma, Bruce A. Maxwell
Affiliations: Northeastern University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: The 36th British Machine Vision Conference

Abstract:Neural Radiance Fields (NeRF) have achieved remarkable results in novel view synthesis, typically using sRGB images for supervision. However, little attention has been paid to the color space in which the network is learning the radiance field representation. Inspired by the BiIlluminant Dichromatic Reflection (BIDR) model, which suggests that a logarithmic transformation simplifies the separation of illumination and reflectance, we hypothesize that log RGB space enables NeRF to learn a more compact and effective representation of scene appearance. To test this, we captured approximately 30 videos using a GoPro camera, ensuring linear data recovery through inverse encoding. We trained NeRF models under various color space interpretations (linear, sRGB, GPLog, and log RGB) by converting each network output to a common color space before rendering and loss computation, enforcing representation learning in different color spaces. Quantitative and qualitative evaluations demonstrate that using a log RGB color space consistently improves rendering quality, exhibits greater robustness across scenes, and performs particularly well in low-light conditions while using the same bit-depth input images. Further analysis across different network sizes and NeRF variants confirms the generalization and stability of the log space advantage.
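The intuition behind log RGB can be shown in a few lines: after linearizing, a multiplicative change in illumination becomes an additive shift in log space. The sketch below uses the standard sRGB decoding curve as a stand-in for the camera-specific inverse encoding the authors apply to GoPro footage, and the clamping constant `eps` is an assumption added to avoid taking the log of zero.

```python
import math

def srgb_to_linear(c):
    """Invert the standard sRGB transfer function for a value in [0, 1]."""
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

def linear_to_log(c, eps=1e-6):
    """Natural log of a linear intensity, clamped below at eps (assumed)."""
    return math.log(max(c, eps))

def srgb_pixel_to_log_rgb(pixel):
    """Map an (r, g, b) sRGB pixel in [0, 1] to log RGB."""
    return tuple(linear_to_log(srgb_to_linear(c)) for c in pixel)
```

Halving the linear intensity of a pixel subtracts the same constant, `log(2)`, from its log value regardless of the surface reflectance, which is why illumination and reflectance are easier to separate additively in this space.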

[CV-65] FUSER: Feed-Forward MUltiview 3D Registration Transformer and SE(3)N Diffusion Refinement

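Summary: This paper addresses multiview point cloud registration, where conventional methods rely on extensive pairwise matching to build a pose graph, which is computationally expensive and ill-posed without global geometric constraints. The core of the solution is FUSER, the first feed-forward multiview registration transformer, which jointly processes all scans in a unified, compact latent space and directly predicts global poses without any pairwise estimation. Its key ideas are to encode each scan into low-resolution superpoint features with a sparse 3D convolutional network that preserves absolute translation cues, to perform efficient intra- and inter-scan reasoning through a Geometric Alternating Attention module, and to transfer 2D attention priors from off-the-shelf foundation models to strengthen 3D feature interaction and geometric consistency. Building on FUSER, the FUSER-DF framework applies an SE(3)^N diffusion refinement that denoises in the joint pose space, further improving registration accuracy and robustness.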

Link: https://arxiv.org/abs/2512.09373
Authors: Haobo Jiang, Jin Xie, Jian Yang, Liang Yu, Jianmin Zheng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 6 figures

Abstract:Registration of multiview point clouds conventionally relies on extensive pairwise matching to build a pose graph for global synchronization, which is computationally expensive and inherently ill-posed without holistic geometric constraints. This paper proposes FUSER, the first feed-forward multiview registration transformer that jointly processes all scans in a unified, compact latent space to directly predict global poses without any pairwise estimation. To maintain tractability, FUSER encodes each scan into low-resolution superpoint features via a sparse 3D CNN that preserves absolute translation cues, and performs efficient intra- and inter-scan reasoning through a Geometric Alternating Attention module. In particular, we transfer 2D attention priors from off-the-shelf foundation models to enhance 3D feature interaction and geometric consistency. Building upon FUSER, we further introduce FUSER-DF, an SE(3)^N diffusion refinement framework to correct FUSER’s estimates via denoising in the joint SE(3)^N space. FUSER acts as a surrogate multiview registration model to construct the denoiser, and a prior-conditioned SE(3)^N variational lower bound is derived for denoising supervision. Extensive experiments on 3DMatch, ScanNet and ARKitScenes demonstrate that our approach achieves superior registration accuracy and outstanding computational efficiency.

[CV-66] ASSIST-3D: Adapted Scene Synthesis for Class-Agnostic 3D Instance Segmentation AAAI2026

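Summary: This paper addresses the limited generalization of models for class-agnostic 3D instance segmentation, where scarce training data or noisy 2D segmentation annotations make it hard to recognize unseen object instances. To improve generalization to real scenes, the authors propose ASSIST-3D, an adapted 3D scene synthesis pipeline with three key innovations: (1) heterogeneous object selection from large 3D CAD asset collections, with randomness in sampling to maximize geometric and contextual diversity; (2) scene layout generation that combines large language model (LLM)-guided spatial reasoning with depth-first search to produce reasonable layouts; and (3) realistic point cloud construction through multi-view RGB-D rendering and fusion, which mimics real-world sensor data acquisition.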

Link: https://arxiv.org/abs/2512.09364
Authors: Shengchao Zhou, Jiehong Lin, Jiahui Liu, Shizhen Zhao, Chirui Chang, Xiaojuan Qi
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2026

Abstract:Class-agnostic 3D instance segmentation tackles the challenging task of segmenting all object instances, including previously unseen ones, without semantic class reliance. Current methods struggle with generalization due to the scarce annotated 3D scene data or noisy 2D segmentations. While synthetic data generation offers a promising solution, existing 3D scene synthesis methods fail to simultaneously satisfy geometry diversity, context complexity, and layout reasonability, each essential for this task. To address these needs, we propose an Adapted 3D Scene Synthesis pipeline for class-agnostic 3D Instance SegmenTation, termed as ASSIST-3D, to synthesize proper data for model generalization enhancement. Specifically, ASSIST-3D features three key innovations, including 1) Heterogeneous Object Selection from extensive 3D CAD asset collections, incorporating randomness in object sampling to maximize geometric and contextual diversity; 2) Scene Layout Generation through LLM-guided spatial reasoning combined with depth-first search for reasonable object placements; and 3) Realistic Point Cloud Construction via multi-view RGB-D image rendering and fusion from the synthetic scenes, closely mimicking real-world sensor data acquisition. Experiments on ScanNetV2, ScanNet++, and S3DIS benchmarks demonstrate that models trained with ASSIST-3D-generated data significantly outperform existing methods. Further comparisons underscore the superiority of our purpose-built pipeline over existing 3D scene synthesis approaches.
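
The layout step pairs LLM-guided spatial reasoning with depth-first search. The sketch below illustrates only the search half, under strong simplifying assumptions: objects are axis-aligned rectangles on an integer floor grid, and the candidate positions are enumerated exhaustively rather than proposed by an LLM. The names `overlaps` and `place_objects` are hypothetical and do not come from the paper.

```python
from typing import List, Optional, Tuple

Rect = Tuple[int, int, int, int]  # (x, y, width, height) on an integer floor grid


def overlaps(a: Rect, b: Rect) -> bool:
    """Return True if two axis-aligned rectangles share any area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def place_objects(room_w: int, room_h: int,
                  sizes: List[Tuple[int, int]]) -> Optional[List[Rect]]:
    """Place every footprint in `sizes` without collisions, or return None.

    Assumption: a plain depth-first search with backtracking over all grid
    positions; a real pipeline would restrict candidates using spatial
    constraints (for example, those suggested by an LLM).
    """
    placed: List[Rect] = []

    def dfs(i: int) -> bool:
        if i == len(sizes):
            return True  # every object has a position
        w, h = sizes[i]
        for x in range(room_w - w + 1):
            for y in range(room_h - h + 1):
                cand = (x, y, w, h)
                if all(not overlaps(cand, p) for p in placed):
                    placed.append(cand)
                    if dfs(i + 1):
                        return True
                    placed.pop()  # backtrack and try the next position
        return False

    return placed if dfs(0) else None
```

Calling `place_objects(4, 4, [(2, 2), (2, 2), (4, 2)])` forces the search to backtrack once before it finds a collision-free arrangement, which is the behavior that makes depth-first search a natural fit for layout constraints.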

[CV-67] StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation

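Quick read: This paper addresses the high cost and artifact-prone production of high-quality stereo video. Existing methods that generate stereo video from monocular video struggle to achieve visual fidelity and 3D structural consistency at the same time. The key to the solution is StereoWorld, an end-to-end framework built on a pretrained video generator: it jointly conditions the model on the monocular input video and explicitly supervises generation with a geometry-aware regularization to keep the 3D structure consistent. A spatio-temporal tiling scheme additionally enables efficient high-resolution synthesis.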

Link: https://arxiv.org/abs/2512.09363
Authors: Ke Xing, Longfei Li, Yuyang Yin, Hanwen Liang, Guixun Luo, Chen Fang, Jue Wang, Konstantinos N. Plataniotis, Xiaojie Jin, Yao Zhao, Yunchao Wei
Affiliations: Beijing Jiaotong University; Dzine AI; University of Toronto; Visual Intelligence + X International Joint Laboratory
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The growing adoption of XR devices has fueled strong demand for high-quality stereo video, yet its production remains costly and artifact-prone. To address this challenge, we present StereoWorld, an end-to-end framework that repurposes a pretrained video generator for high-fidelity monocular-to-stereo video generation. Our framework jointly conditions the model on the monocular video input while explicitly supervising the generation with a geometry-aware regularization to ensure 3D structural fidelity. A spatio-temporal tiling scheme is further integrated to enable efficient, high-resolution synthesis. To enable large-scale training and evaluation, we curate a high-definition stereo video dataset containing over 11M frames aligned to natural human interpupillary distance (IPD). Extensive experiments demonstrate that StereoWorld substantially outperforms prior methods, generating stereo videos with superior visual fidelity and geometric consistency. The project webpage is available at this https URL.
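
The abstract does not spell out the spatio-temporal tiling scheme, so the following is only a generic sketch of the idea: cut a (T, H, W) volume into overlapping tiles, process each tile independently, and average the overlaps. The helper names `tile_starts` and `process_tiled`, and the uniform averaging of overlapping regions, are assumptions made for illustration.

```python
import itertools

import numpy as np


def tile_starts(length: int, tile: int, overlap: int) -> list:
    """Start offsets of overlapping windows that together cover [0, length)."""
    assert 0 <= overlap < tile, "overlap must be smaller than the tile size"
    if tile >= length:
        return [0]
    stride = tile - overlap
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # final window is aligned to the end
    return starts


def process_tiled(video: np.ndarray, fn, tile, overlap) -> np.ndarray:
    """Apply `fn` tile by tile over a (T, H, W) array and blend the overlaps.

    Assumption: `fn` returns an array with the same shape as its input tile,
    and overlapping predictions are blended by a plain average.
    """
    out = np.zeros(video.shape, dtype=np.float64)
    weight = np.zeros(video.shape, dtype=np.float64)
    axes = [tile_starts(n, t, o) for n, t, o in zip(video.shape, tile, overlap)]
    for t0, y0, x0 in itertools.product(*axes):
        sl = (slice(t0, t0 + tile[0]),
              slice(y0, y0 + tile[1]),
              slice(x0, x0 + tile[2]))
        out[sl] += fn(video[sl])
        weight[sl] += 1.0  # count how many tiles touched each voxel
    return out / weight
```

With an identity `fn` the blended output reproduces the input, which is a convenient sanity check before plugging in a real generator. Peak memory then scales with the tile size rather than with the full clip.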

[CV-68] Video-QTR: Query-Driven Temporal Reasoning Framework for Lightweight Video Understanding

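Quick read: This paper addresses the computational inefficiency of multimodal large language models (MLLMs) on long-video understanding, where dense frame encoding causes high memory consumption, redundant computation, and limited scalability in real applications. The key to the solution is Video-QTR (Query-Driven Temporal Reasoning), a framework that recasts video understanding as a query-driven reasoning process. It allocates perceptual resources dynamically according to the semantic intent of the query, forming an adaptive feedback loop between reasoning and perception so that only key frames are processed. This reduces the number of input frames by up to 73% while matching or exceeding the performance of existing methods.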

Link: https://arxiv.org/abs/2512.09354
Authors: Xinkui Zhao, Zuxin Wang, Yifan Zhang, Guanjie Cheng, Yueshen Xu, Shuiguang Deng, Chang Liu, Naibo Wang, Jianwei Yin
Affiliations: Zhejiang University; Xidian University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: The rapid development of multimodal large-language models (MLLMs) has significantly expanded the scope of visual language reasoning, enabling unified systems to interpret and describe complex visual content. However, applying these models to long-video understanding remains computationally intensive. Dense frame encoding generates excessive visual tokens, leading to high memory consumption, redundant computation, and limited scalability in real-world applications. This inefficiency highlights a key limitation of the traditional process-then-reason paradigm, which analyzes visual streams exhaustively before semantic reasoning. To address this challenge, we introduce Video-QTR (Query-Driven Temporal Reasoning), a lightweight framework that redefines video comprehension as a query-guided reasoning process. Instead of encoding every frame, Video-QTR dynamically allocates perceptual resources based on the semantic intent of the query, creating an adaptive feedback loop between reasoning and perception. Extensive experiments across five benchmarks, including MSVD-QA, ActivityNet-QA, MovieChat, and Video-MME, demonstrate that Video-QTR achieves state-of-the-art performance while reducing input frame consumption by up to 73%. These results confirm that query-driven temporal reasoning provides an efficient and scalable solution for video understanding.

[CV-69] TextGuider: Training-Free Guidance for Text Rendering via Attention Alignment

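Quick read: This paper addresses text omission in diffusion-based text-to-image generation, where the target text is partially or entirely missing from the generated image. Existing fine-tuning and training-free refinement methods improve text rendering accuracy but largely overlook this problem. The core of the solution is TextGuider, a training-free method that applies latent guidance in the early denoising steps, based on two newly designed loss functions, to align textual content tokens with the corresponding text regions in the image. This markedly improves text completeness and legibility, reaching state-of-the-art test-time text rendering with significant gains in recall and strong OCR accuracy and CLIP scores.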

Link: https://arxiv.org/abs/2512.09350
Authors: Kanghyun Baek, Sangyub Lee, Jin Young Choi, Jaewoo Song, Daemin Park, Jooyoung Choi, Chaehun Shin, Bohyung Han, Sungroh Yoon
Affiliations: Interdisciplinary Program in Artificial Intelligence, Seoul National University; Department of Electrical and Computer Engineering, Seoul National University; Global Technology Research, Samsung Electronics; AIIS, ASRI, INMC, ISRC, Seoul National University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Despite recent advances, diffusion-based text-to-image models still struggle with accurate text rendering. Several studies have proposed fine-tuning or training-free refinement methods for accurate text rendering. However, the critical issue of text omission, where the desired text is partially or entirely missing, remains largely overlooked. In this work, we propose TextGuider, a novel training-free method that encourages accurate and complete text appearance by aligning textual content tokens and text regions in the image. Specifically, we analyze attention patterns in MM-DiT models, particularly for text-related tokens intended to be rendered in the image. Leveraging this observation, we apply latent guidance during the early stage of denoising steps based on two loss functions that we introduce. Our method achieves state-of-the-art performance in test-time text rendering, with significant gains in recall and strong results in OCR accuracy and CLIP score.

[CV-70] Development and Testing for Perception Based Autonomous Landing of a Long-Range QuadPlane

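Quick read: This paper addresses reliable vision-based landing of large QuadPlanes in GPS-denied or cluttered urban environments. Real landing sites are unstructured and highly variable, which demands strong generalization from the perception system, while the compute and power limits of edge AI devices make real-time detection and control harder. The key to the solution is a lightweight, embedded system for visual perception and visual-inertial odometry (VIO): by optimizing the hardware platform, sensor configuration, and computing architecture, it targets stable, accurate pose estimation and autonomous landing control under limited edge compute, supporting long-range QuadPlane operation in dynamic, unstructured environments.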

Link: https://arxiv.org/abs/2512.09343
Authors: Ashik E Rasul, Humaira Tasnim, Ji Yu Kim, Young Hyun Lim, Scott Schmitz, Bruce W. Jo, Hyung-Jin Yoon
Affiliations: Tennessee Technological University
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: QuadPlanes combine the range efficiency of fixed-wing aircraft with the maneuverability of multi-rotor platforms for long-range autonomous missions. In GPS-denied or cluttered urban environments, perception-based landing is vital for reliable operation. Unlike structured landing zones, real-world sites are unstructured and highly variable, requiring strong generalization capabilities from the perception system. Deep neural networks (DNNs) provide a scalable solution for learning landing site features across diverse visual and environmental conditions. While perception-driven landing has been shown in simulation, real-world deployment introduces significant challenges. Payload and volume constraints limit high-performance edge AI devices like the NVIDIA Jetson Orin Nano, which are crucial for real-time detection and control. Accurate pose estimation during descent is necessary, especially in the absence of GPS, and relies on dependable visual-inertial odometry. Achieving this with limited edge AI resources requires careful optimization of the entire deployment framework. The flight characteristics of large QuadPlanes further complicate the problem: these aircraft exhibit high inertia, reduced thrust vectoring, and slow response times, all of which make stable landing maneuvers harder. This work presents a lightweight QuadPlane system for efficient vision-based autonomous landing and visual-inertial odometry, specifically developed for long-range QuadPlane operations such as aerial monitoring. It describes the hardware platform, sensor configuration, and embedded computing architecture designed to meet demanding real-time and physical constraints. This establishes a foundation for deploying autonomous landing in dynamic, unstructured, GPS-denied environments.

[CV-71] Visual Categorization Across Minds and Models: Cognitive Analysis of Human Labeling and Neuro-Symbolic Integration

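Quick read: This paper examines how humans and AI systems differ in perception, reasoning, and decision-making when handling ambiguous visual stimuli, focusing on labeling performance for low-resolution, perceptually degraded images. The approach combines computational cognitive science, cognitive architectures (such as ACT-R and Soar), and connectionist-symbolic hybrid models to contrast human strategies (analogical reasoning, shape-based recognition, and confidence modulation) with AI's feature-based processing. Model attention is visualized with Grad-CAM, and the analysis draws on Marr's tri-level hypothesis, Simon's bounded rationality, and Thagard's frameworks of representation and emotion to identify parallels and divergences in representation, inference, and confidence calibration. The analysis motivates future neuro-symbolic architectures guided by embodiment, explainability, and cognitive alignment, with the goal of AI systems that are performant, interpretable, and cognitively grounded.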

Link: https://arxiv.org/abs/2512.09340
Authors: Chethana Prasad Kabgere
Affiliations: Georgia Institute of Technology
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 12 pages, 3 figures. Research manuscript based on the final project for CS6795 (Introduction to Cognitive Science), Georgia Tech

Abstract:Understanding how humans and AI systems interpret ambiguous visual stimuli offers critical insight into the nature of perception, reasoning, and decision-making. This paper examines image labeling performance across human participants and deep neural networks, focusing on low-resolution, perceptually degraded stimuli. Drawing from computational cognitive science, cognitive architectures, and connectionist-symbolic hybrid models, we contrast human strategies such as analogical reasoning, shape-based recognition, and confidence modulation with AI’s feature-based processing. Grounded in Marr’s tri-level hypothesis, Simon’s bounded rationality, and Thagard’s frameworks of representation and emotion, we analyze participant responses in relation to Grad-CAM visualizations of model attention. Human behavior is further interpreted through cognitive principles modeled in ACT-R and Soar, revealing layered and heuristic decision strategies under uncertainty. Our findings highlight key parallels and divergences between biological and artificial systems in representation, inference, and confidence calibration. The analysis motivates future neuro-symbolic architectures that unify structured symbolic reasoning with connectionist representations. Such architectures, informed by principles of embodiment, explainability, and cognitive alignment, offer a path toward AI systems that are not only performant but also interpretable and cognitively grounded.

[CV-72] Relightable and Dynamic Gaussian Avatar Reconstruction from Monocular Video ACM-MM2025

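Quick read: This paper addresses the reconstruction of relightable and animatable human avatars from monocular video. Existing methods based on Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) model motion-dependent geometric details such as clothing wrinkles poorly, which limits photorealism. The key to the solution is RnD-Avatar, a 3DGS-based avatar modeling framework that introduces dynamic skinning weights: these define the avatar's articulation as a function of pose and also learn additional deformations induced by body motion. A novel regularization captures fine geometric details under sparse visual cues, yielding high-fidelity pose-variant deformation and realistic lighting effects.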

Link: https://arxiv.org/abs/2512.09335
Authors: Seonghwa Choi, Moonkyeong Choi, Mingyu Jang, Jaekyung Kim, Jianfei Cai, Wen-Huang Cheng, Sanghoon Lee
Affiliations: Yonsei University; Monash University; National Taiwan University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: 8 pages, 9 figures, published in ACM MM 2025

Abstract: Modeling relightable and animatable human avatars from monocular video is a long-standing and challenging task. Recently, Neural Radiance Field (NeRF) and 3D Gaussian Splatting (3DGS) methods have been employed to reconstruct the avatars. However, they often produce unsatisfactory photo-realistic results because of insufficient geometrical details related to body motion, such as clothing wrinkles. In this paper, we propose a 3DGS-based human avatar modeling framework, termed Relightable and Dynamic Gaussian Avatar (RnD-Avatar), that presents accurate pose-variant deformation for high-fidelity geometrical details. To achieve this, we introduce dynamic skinning weights that define the human avatar’s articulation based on pose while also learning additional deformations induced by body motion. We also introduce a novel regularization to capture fine geometric details under sparse visual cues. Furthermore, we present a new multi-view dataset with varied lighting conditions to evaluate relighting. Our framework enables realistic rendering of novel poses and views while supporting photo-realistic lighting effects under arbitrary lighting conditions. Our method achieves state-of-the-art performance in novel view synthesis, novel pose rendering, and relighting.

[CV-73] UniLS: End-to-End Audio-Driven Avatars for Unified Listening and Speaking

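Quick read: This paper addresses the difficulty of modeling dynamic listener expressions for conversational avatars, that is, generating speaking and listening motion jointly and end to end from dual-track audio alone. Because listening motion lacks a strong external driving signal, existing methods tend to produce stiff, static expressions, and the only prior joint approach needs the speaker's motion as extra input, which hinders real-time interaction. The key to the solution is UniLS, the first end-to-end framework for this task, trained in two stages: stage 1 trains an audio-free autoregressive generator to learn an internal motion prior that captures the spontaneity of natural facial motion; stage 2 introduces the dual-track audio and fine-tunes the generator so that the prior is modulated by external speech cues. This design improves listening metrics by up to 44.1% and produces more diverse, natural listening expressions, mitigating the stiffness of earlier methods.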

Link: https://arxiv.org/abs/2512.09327
Authors: Xuangeng Chu, Ruicong Liu, Yifei Huang, Yun Liu, Yichen Peng, Bo Zheng
Affiliations: Shanda AI Research Tokyo; The University of Tokyo; Institute of Science Tokyo
Subjects: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
Comments:

Abstract:Generating lifelike conversational avatars requires modeling not just isolated speakers, but the dynamic, reciprocal interaction of speaking and listening. However, modeling the listener is exceptionally challenging: direct audio-driven training fails, producing stiff, static listening motions. This failure stems from a fundamental imbalance: the speaker’s motion is strongly driven by speech audio, while the listener’s motion primarily follows an internal motion prior and is only loosely guided by external speech. This challenge has led most methods to focus on speak-only generation. The only prior attempt at joint generation relies on extra speaker’s motion to produce the listener. This design is not end-to-end, thereby hindering the real-time applicability. To address this limitation, we present UniLS, the first end-to-end framework for generating unified speak-listen expressions, driven by only dual-track audio. Our method introduces a novel two-stage training paradigm. Stage 1 first learns the internal motion prior by training an audio-free autoregressive generator, capturing the spontaneous dynamics of natural facial motion. Stage 2 then introduces the dual-track audio, fine-tuning the generator to modulate the learned motion prior based on external speech cues. Extensive evaluations show UniLS achieves state-of-the-art speaking accuracy. More importantly, it delivers up to 44.1% improvement in listening metrics, generating significantly more diverse and natural listening expressions. This effectively mitigates the stiffness problem and provides a practical, high-fidelity audio-driven solution for interactive digital humans.

[CV-74] Benchmarking Real-World Medical Image Classification with Noisy Labels: Challenges Practice and Outlook

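Quick read: This paper addresses the performance degradation caused by label noise in medical image analysis, with a focus on inconsistent expert annotations and the limited robustness of existing learning methods under real-world noise. The key to the solution is LNMBench, a systematic benchmark covering 10 representative learning-with-noisy-labels methods, 7 medical datasets, 6 imaging modalities, and 3 noise patterns, which evaluates robustness in a unified way under realistic medical conditions. The experiments show that current methods degrade substantially under high and real-world noise, which the authors attribute to class imbalance and domain variability, and they propose a simple yet effective improvement to make models more stable in such conditions.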

Link: https://arxiv.org/abs/2512.09315
Authors: Yuan Ma, Junlin Hou, Chao Zhang, Yukun Zhou, Zongyuan Ge, Haoran Xie, Lie Ju
Affiliations: Japan Advanced Institute of Science and Technology; The University of Hong Kong; University of Toyama; University College London; Monash University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Learning from noisy labels remains a major challenge in medical image analysis, where annotation demands expert knowledge and substantial inter-observer variability often leads to inconsistent or erroneous labels. Despite extensive research on learning with noisy labels (LNL), the robustness of existing methods in medical imaging has not been systematically assessed. To address this gap, we introduce LNMBench, a comprehensive benchmark for Label Noise in Medical imaging. LNMBench encompasses 10 representative methods evaluated across 7 datasets, 6 imaging modalities, and 3 noise patterns, establishing a unified and reproducible framework for robustness evaluation under realistic conditions. Comprehensive experiments reveal that the performance of existing LNL methods degrades substantially under high and real-world noise, highlighting the persistent challenges of class imbalance and domain variability in medical data. Motivated by these findings, we further propose a simple yet effective improvement to enhance model robustness under such conditions. The LNMBench codebase is publicly released to facilitate standardized evaluation, promote reproducible research, and provide practical insights for developing noise-resilient algorithms in both research and real-world medical applications. The codebase is publicly available at this https URL.
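
The abstract does not name the three noise patterns, so the sketch below illustrates only one common synthetic pattern from the learning-with-noisy-labels literature: symmetric (uniform) label flipping. Whether LNMBench uses exactly this pattern is an assumption, and the function name `inject_symmetric_noise` is hypothetical.

```python
import numpy as np


def inject_symmetric_noise(labels, num_classes: int, noise_rate: float,
                           seed: int = 0) -> np.ndarray:
    """Flip each label to a different, uniformly chosen class with
    probability `noise_rate` (symmetric label noise)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    noisy = labels.copy()
    # Decide independently for every sample whether its label is corrupted.
    flip = rng.random(labels.shape[0]) < noise_rate
    # An offset in [1, num_classes - 1] guarantees the flipped label differs.
    offsets = rng.integers(1, num_classes, size=int(flip.sum()))
    noisy[flip] = (labels[flip] + offsets) % num_classes
    return noisy
```

Because every flipped label is guaranteed to change, the fraction of corrupted labels concentrates around `noise_rate`, which keeps the nominal and effective noise levels comparable across datasets.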

[CV-75] Transformer-Driven Multimodal Fusion for Explainable Suspiciousness Estimation in Visual Surveillance

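Quick read: This paper addresses the recognition of suspicious behavior for proactive threat detection and public safety in complex environments, where the central challenge is to combine high accuracy, real-time operation, and interpretability. The solution rests on two contributions. The first is USE50k, a large-scale annotated dataset of 65,500 images from uncontrolled scenes such as airports, railway stations, and parks. The second is DeepUSEvision, a lightweight, modular vision framework with three components: a suspicious object detector based on an enhanced YOLOv12 architecture, two deep convolutional neural networks (DCNN-I and DCNN-II) for facial expression and body-language recognition, and a transformer-based discriminator network that adaptively fuses the multimodal outputs into an interpretable suspiciousness score. According to the reported experiments, the framework outperforms existing methods in accuracy, robustness, and interpretability, providing a scalable basis for intelligent surveillance and real-time risk assessment.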

Link: https://arxiv.org/abs/2512.09311
Authors: Kuldeep Singh Yadav, Lalan Kumar
Affiliations: CSIR Fourth Paradigm Institute; IIT Delhi
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments: 12 pages, 10 figures, IEEE Transactions on Image Processing

Abstract:Suspiciousness estimation is critical for proactive threat detection and ensuring public safety in complex environments. This work introduces a large-scale annotated dataset, USE50k, along with a computationally efficient vision-based framework for real-time suspiciousness analysis. The USE50k dataset contains 65,500 images captured from diverse and uncontrolled environments, such as airports, railway stations, restaurants, parks, and other public areas, covering a broad spectrum of cues including weapons, fire, crowd density, abnormal facial expressions, and unusual body postures. Building on this dataset, we present DeepUSEvision, a lightweight and modular system integrating three key components, i.e., a Suspicious Object Detector based on an enhanced YOLOv12 architecture, dual Deep Convolutional Neural Networks (DCNN-I and DCNN-II) for facial expression and body-language recognition using image and landmark features, and a transformer-based Discriminator Network that adaptively fuses multimodal outputs to yield an interpretable suspiciousness score. Extensive experiments confirm the superior accuracy, robustness, and interpretability of the proposed framework compared to state-of-the-art approaches. Collectively, the USE50k dataset and the DeepUSEvision framework establish a strong and scalable foundation for intelligent surveillance and real-time risk assessment in safety-critical applications.

[CV-76] A Distributed Framework for Privacy-Enhanced Vision Transformers on the Edge

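Quick read: This paper addresses the high computational demands of visual intelligence tools on resource-constrained mobile and wearable devices while avoiding the privacy risks of conventional cloud processing. The key to the solution is a distributed, hierarchical offloading framework in which a local trusted edge device (such as a smartphone or an Nvidia Jetson) acts as the edge orchestrator: it partitions the user's visual data and distributes the parts across multiple independent cloud servers. Since no single external server holds the complete image, comprehensive reconstruction of the data is prevented. The final merging and aggregation run only on the user's trusted edge device, which protects content privacy. Applied to the Segment Anything Model (SAM), the method retains near-baseline segmentation performance while substantially reducing the risk of content reconstruction and user data exposure.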

Link: https://arxiv.org/abs/2512.09309
Authors: Zihao Ding, Mufeng Zhu, Zhongze Tang, Sheng Wei, Yao Liu
Affiliations: Rutgers University
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 7 figures. Published in the Proceedings of the Tenth ACM/IEEE Symposium on Edge Computing (SEC '25), Dec 3-6, 2025, Washington, D.C., USA

Abstract:Nowadays, visual intelligence tools have become ubiquitous, offering all kinds of convenience and possibilities. However, these tools have high computational requirements that exceed the capabilities of resource-constrained mobile and wearable devices. While offloading visual data to the cloud is a common solution, it introduces significant privacy vulnerabilities during transmission and server-side computation. To address this, we propose a novel distributed, hierarchical offloading framework for Vision Transformers (ViTs) that addresses these privacy challenges by design. Our approach uses a local trusted edge device, such as a mobile phone or an Nvidia Jetson, as the edge orchestrator. This orchestrator partitions the user’s visual data into smaller portions and distributes them across multiple independent cloud servers. By design, no single external server possesses the complete image, preventing comprehensive data reconstruction. The final data merging and aggregation computation occurs exclusively on the user’s trusted edge device. We apply our framework to the Segment Anything Model (SAM) as a practical case study, which demonstrates that our method substantially enhances content privacy over traditional cloud-based approaches. Evaluations show our framework maintains near-baseline segmentation performance while substantially reducing the risk of content reconstruction and user data exposure. Our framework provides a scalable, privacy-preserving solution for vision tasks in the edge-cloud continuum.
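
The partition-then-merge flow described above can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's protocol: the image is cut into a regular patch grid, patches are assigned round-robin to servers, and the "server-side computation" is omitted so that merging simply reassembles the patches. The names `partition_image` and `merge_results` are hypothetical.

```python
import numpy as np


def partition_image(image: np.ndarray, patch: int, num_servers: int) -> dict:
    """Split an image into patches and assign them round-robin to servers.

    Returns {server_id: [((y, x), patch_array), ...]}. With at least two
    servers and at least two patches, no server receives the whole image.
    """
    h, w = image.shape[:2]
    shards = {s: [] for s in range(num_servers)}
    idx = 0
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            tile = image[y:y + patch, x:x + patch].copy()
            shards[idx % num_servers].append(((y, x), tile))
            idx += 1
    return shards


def merge_results(shards: dict, shape, dtype) -> np.ndarray:
    """Reassemble per-patch results on the trusted edge device."""
    out = np.zeros(shape, dtype=dtype)
    for pieces in shards.values():
        for (y, x), tile in pieces:
            out[y:y + tile.shape[0], x:x + tile.shape[1]] = tile
    return out
```

In this toy setup only the orchestrator knows the patch coordinates for all shards, which mirrors the abstract's point that the final merge happens exclusively on the trusted device.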

[CV-77] From SAM to DINOv2: Towards Distilling Foundation Models to Lightweight Baselines for Generalized Polyp Segmentation

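Quick read: This paper addresses insufficient accuracy in colonoscopic polyp segmentation, which is caused by large variations in polyp size, shape, and color and by the way polyps are camouflaged against the background. Lightweight models such as U-Net, U-Net++, and PraNet are easy to deploy and computationally cheap but struggle with these challenges, while large vision foundation models (such as SAM, DINOv2, and OneFormer) perform well on natural images yet are hard to transfer directly because medical data are scarce and domain knowledge is missing. The key to the solution is Polyp-DiFoM, a novel knowledge distillation framework that injects semantic priors from foundation models into lightweight architectures (such as U-Net and U-Net++) and adds frequency-domain encoding to strengthen distillation. It balances accuracy against computational cost, outperforming the baselines and the state-of-the-art model on five benchmark datasets with nearly 9 times less computation.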

Link: https://arxiv.org/abs/2512.09307
Authors: Shivanshu Agnihotri, Snehashis Majhi, Deepak Ranjan Nayak, Debesh Jha
Affiliations: Malaviya National Institute of Technology, Jaipur, India; Côte d’Azur University, France; University of South Dakota, USA
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Accurate polyp segmentation during colonoscopy is critical for the early detection of colorectal cancer and remains challenging due to significant size, shape, and color variations, and the camouflaged nature of polyps. While lightweight baseline models such as U-Net, U-Net++, and PraNet offer advantages in terms of easy deployment and low computational cost, they struggle to deal with the above issues, leading to limited segmentation performance. In contrast, large-scale vision foundation models such as SAM, DINOv2, OneFormer, and Mask2Former have exhibited impressive generalization performance across natural image domains. However, their direct transfer to medical imaging tasks (e.g., colonoscopic polyp segmentation) is not straightforward, primarily due to the scarcity of large-scale datasets and lack of domain-specific knowledge. To bridge this gap, we propose a novel distillation framework, Polyp-DiFoM, that transfers the rich representations of foundation models into lightweight segmentation baselines, allowing efficient and accurate deployment in clinical settings. In particular, we infuse semantic priors from the foundation models into canonical architectures such as U-Net and U-Net++ and further perform frequency domain encoding for enhanced distillation, corroborating their generalization capability. Extensive experiments are performed across five benchmark datasets: Kvasir-SEG, CVC-ClinicDB, ETIS, ColonDB, and CVC-300. Notably, Polyp-DiFoM consistently and significantly outperforms the respective baseline models, as well as the state-of-the-art model, with nearly 9 times less computation overhead. The code is available at this https URL.

[CV-78] VABench: A Comprehensive Benchmark for Audio-Video Generation

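Quick read: This paper addresses the lack of adequate evaluation for synchronized audio-video generation in current video generation benchmarks, in particular the absence of systematic, multi-dimensional assessment for generative models that output audio and video together. The key to the solution is VABench, a comprehensive benchmark framework covering three task types (text-to-audio-video, image-to-audio-video, and stereo audio-video generation) and two major evaluation modules spanning 15 dimensions. These include pairwise text-video, text-audio, and video-audio similarity, audio-video synchronization, lip-speech consistency, and carefully curated audio and video question-answering pairs, enabling a thorough quantitative evaluation of audio-video generation models.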

Link: https://arxiv.org/abs/2512.09299
Authors: Daili Hua, Xizhi Wang, Bohan Zeng, Xinyi Huang, Hao Liang, Junbo Niu, Xinlong Chen, Quanqing Xu, Wentao Zhang
Affiliations: Peking University; Ant Group; Institute of Automation, Chinese Academy of Sciences; Huazhong University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
Comments: 24 pages, 25 figures

Abstract:Recent advances in video generation have been remarkable, enabling models to produce visually compelling videos with synchronized audio. While existing video generation benchmarks provide comprehensive metrics for visual quality, they lack convincing evaluations for audio-video generation, especially for models aiming to generate synchronized audio-video outputs. To address this gap, we introduce VABench, a comprehensive and multi-dimensional benchmark framework designed to systematically evaluate the capabilities of synchronous audio-video generation. VABench encompasses three primary task types: text-to-audio-video (T2AV), image-to-audio-video (I2AV), and stereo audio-video generation. It further establishes two major evaluation modules covering 15 dimensions. These dimensions specifically assess pairwise similarities (text-video, text-audio, video-audio), audio-video synchronization, lip-speech consistency, and carefully curated audio and video question-answering (QA) pairs, among others. Furthermore, VABench covers seven major content categories: animals, human sounds, music, environmental sounds, synchronous physical sounds, complex scenes, and virtual worlds. We provide a systematic analysis and visualization of the evaluation results, aiming to establish a new standard for assessing video generation models with synchronous audio capabilities and to promote the comprehensive advancement of the field.

[CV-79] Traffic Scene Small Target Detection Method Based on YOLOv8n-SPTS Model for Autonomous Driving

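Quick read: This paper addresses weak small-target recognition in dynamic perception for autonomous driving, where the main challenges are loss of small-target information, scale imbalance, and occlusion. The key to the solution is the improved YOLOv8n-SPTS model, built on three innovations. First, Space-to-Depth Convolution (SPD-Conv) modules replace traditional convolution modules in the Backbone Bottleneck structure, preserving fine-grained features and improving the capture of low-resolution small targets. Second, the Spatial Pyramid Pooling - Fast Cross Stage Partial Connection (SPPFCSPC) module replaces the original SPPF module, combining multi-scale feature extraction with cross-stage feature fusion to improve contextual understanding and multi-scale representation in complex scenes. Third, a Triple-Stage Feature Pyramid (TSFP) structure dedicated to small targets adds a 160×160 detection head in the shallow layers to exploit high-resolution features and removes redundant large-target detection heads, preserving accuracy while keeping computation in check.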

Link: https://arxiv.org/abs/2512.09296
Authors: Songhan Wu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 7 figures, 1 table. Accepted to The 2025 IEEE 3rd International Conference on Electrical, Automation and Computer Engineering (ICEACE), 2025. Code available at this https URL

Abstract: This paper focuses on a key issue in autonomous driving: small target recognition in dynamic perception. Existing algorithms suffer from poor detection performance due to missing small target information, scale imbalance, and occlusion. We propose an improved YOLOv8n-SPTS model, which enhances the detection accuracy of small traffic targets through three key innovations. First, we optimize the feature extraction module: in the Backbone Bottleneck structure of YOLOv8n, 4 traditional convolution modules are replaced with Space-to-Depth Convolution (SPD-Conv) modules. This module retains fine-grained information through space-to-depth conversion, reduces information loss, and enhances the ability to capture features of low-resolution small targets. Second, we enhance feature fusion: the Spatial Pyramid Pooling - Fast Cross Stage Partial Connection (SPPFCSPC) module is introduced to replace the original SPPF module, integrating the multi-scale feature extraction of Spatial Pyramid Pooling (SPP) and the feature fusion mechanism of Cross Stage Partial Connection (CSP), thereby improving the model’s contextual understanding of complex scenes and its multi-scale feature expression ability. Third, we design a dedicated detection structure for small targets: a Triple-Stage Feature Pyramid (TSFP) structure is proposed, which adds a 160×160 small target detection head to the original detection heads to fully utilize high-resolution features in shallow layers; meanwhile, redundant large target detection heads are removed to balance computational efficiency. Comparative experiments on the VisDrone2019-DET dataset show that the YOLOv8n-SPTS model ranks first in precision (61.9%), recall (48.3%), mAP@0.5 (52.6%), and mAP@0.5:0.95 (32.6%). Visualization results verify that the miss rate of small targets such as pedestrians and bicycles in occluded and dense scenes is significantly reduced.
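
The space-to-depth conversion behind SPD-Conv can be sketched in a few lines of NumPy. This shows only the rearrangement step for a single (C, H, W) feature map; the non-strided convolution that follows it in SPD-Conv is omitted, and the channel ordering chosen here is an assumption rather than the layout used in the paper's code.

```python
import numpy as np


def space_to_depth(x: np.ndarray, scale: int = 2) -> np.ndarray:
    """Rearrange a (C, H, W) map into (C * scale**2, H / scale, W / scale).

    Every scale x scale spatial block is moved into the channel axis, so the
    resolution drops without discarding any value (unlike strided conv or
    pooling, which drop or aggregate values).
    """
    c, h, w = x.shape
    assert h % scale == 0 and w % scale == 0, "H and W must be divisible by scale"
    x = x.reshape(c, h // scale, scale, w // scale, scale)
    x = x.transpose(2, 4, 0, 1, 3)  # (row offset, col offset, C, H/s, W/s)
    return x.reshape(c * scale * scale, h // scale, w // scale)
```

For `scale=2`, the four output channel groups correspond to the four sub-grids `x[:, 0::2, 0::2]`, `x[:, 0::2, 1::2]`, `x[:, 1::2, 0::2]`, and `x[:, 1::2, 1::2]`, which is why fine detail from small objects survives the downsampling.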

[CV-80] MelanomaNet: Explainable Deep Learning for Skin Lesion Classification

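Quick read: This paper addresses the limited clinical adoption of deep learning for skin lesion classification caused by its black-box nature. The key to the solution is MelanomaNet, an explainable deep learning system with four complementary interpretability mechanisms: an EfficientNet V2 backbone with GradCAM++ attention visualization, automated extraction of the ABCDE clinical criteria, concept-based explanations via FastCAV, and Monte Carlo Dropout uncertainty quantification that separates epistemic from aleatoric uncertainty. The system keeps classification performance high (85.61% accuracy on ISIC 2019) while providing clinically understandable explanations aligned with dermatological assessment criteria, which may increase clinicians' trust and ease clinical deployment.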

Link: https://arxiv.org/abs/2512.09289
Authors: Sukhrobbek Ilyosbekov
Affiliations: Northeastern University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 3 figures

Abstract:Automated skin lesion classification using deep learning has shown remarkable accuracy, yet clinical adoption remains limited due to the “black box” nature of these models. We present MelanomaNet, an explainable deep learning system for multi-class skin lesion classification that addresses this gap through four complementary interpretability mechanisms. Our approach combines an EfficientNet V2 backbone with GradCAM++ attention visualization, automated ABCDE clinical criterion extraction, Fast Concept Activation Vectors (FastCAV) for concept-based explanations, and Monte Carlo Dropout uncertainty quantification. We evaluate our system on the ISIC 2019 dataset containing 25,331 dermoscopic images across 9 diagnostic categories. Our model achieves 85.61% accuracy with a weighted F1 score of 0.8564, while providing clinically meaningful explanations that align model attention with established dermatological assessment criteria. The uncertainty quantification module decomposes prediction confidence into epistemic and aleatoric components, enabling automatic flagging of unreliable predictions for clinical review. Our results demonstrate that high classification performance can be achieved alongside comprehensive interpretability, potentially facilitating greater trust and adoption in clinical dermatology workflows. The source code is available at this https URL
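
One widely used way to split Monte Carlo Dropout uncertainty into epistemic and aleatoric parts is the entropy-based decomposition sketched below. The abstract does not say which decomposition MelanomaNet uses, so treating it as this one is an assumption; the function names are hypothetical, and the input is assumed to be softmax outputs from T stochastic forward passes with dropout kept active.

```python
import numpy as np


def entropy(p: np.ndarray, axis: int = -1, eps: float = 1e-12) -> np.ndarray:
    """Shannon entropy (in nats) along `axis`; `eps` avoids log(0)."""
    return -np.sum(p * np.log(p + eps), axis=axis)


def decompose_uncertainty(probs):
    """Split predictive uncertainty from MC Dropout samples.

    probs: array of shape (T, N, C), softmax outputs of T stochastic passes
           for N inputs and C classes.
    Returns (mean_probs, aleatoric, epistemic), where
      total     = entropy of the mean prediction,
      aleatoric = mean entropy of the individual passes,
      epistemic = total - aleatoric (mutual information between the
                  prediction and the sampled weights).
    """
    probs = np.asarray(probs, dtype=np.float64)
    mean_probs = probs.mean(axis=0)
    total = entropy(mean_probs)
    aleatoric = entropy(probs).mean(axis=0)
    epistemic = total - aleatoric
    return mean_probs, aleatoric, epistemic
```

When all passes agree on an ambiguous prediction, the uncertainty is attributed to the data (aleatoric); when confident passes disagree with each other, it is attributed to the model (epistemic). A threshold on either term is one possible way to flag cases for clinical review.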

[CV-81] FoundIR-v2: Optimizing Pre-Training Data Mixtures for Image Restoration Foundation Model

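[Quick read]: This paper addresses the inconsistent performance of all-in-one image restoration models across multiple degradation tasks, which arises from imbalanced mixture proportions in the training data. The key to its solution is FoundIR-v2, a high-capacity diffusion-based foundation model that introduces a data equilibrium scheduling paradigm to dynamically optimize the proportions of training data from different tasks, combined with a Mixture-of-Experts (MoE)-driven scheduler that adaptively allocates diffusion priors to each restoration task. This accounts for differences in degradation forms and levels, yielding stable cross-task generalization and improved overall performance.

Link: https://arxiv.org/abs/2512.09282
Authors: Xiang Chen, Jinshan Pan, Jiangxin Dong, Jian Yang, Jinhui Tang
Affiliations: Nanjing University of Science and Technology; Nanjing Forestry University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL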

Abstract:Recent studies have witnessed significant advances in image restoration foundation models driven by improvements in the scale and quality of pre-training data. In this work, we find that the data mixture proportions from different restoration tasks are also a critical factor directly determining the overall performance of all-in-one image restoration models. To this end, we propose a high-capacity diffusion-based image restoration foundation model, FoundIR-v2, which adopts a data equilibrium scheduling paradigm to dynamically optimize the proportions of mixed training datasets from different tasks. By leveraging the data mixing law, our method ensures a balanced dataset composition, enabling the model to achieve consistent generalization and comprehensive performance across diverse tasks. Furthermore, we introduce an effective Mixture-of-Experts (MoE)-driven scheduler into generative pre-training to flexibly allocate task-adaptive diffusion priors for each restoration task, accounting for the distinct degradation forms and levels exhibited by different tasks. Extensive experiments demonstrate that our method can address over 50 sub-tasks across a broader scope of real-world scenarios and achieves favorable performance against state-of-the-art approaches.

[CV-82] LoGoColor: Local-Global 3D Colorization for 360° Scenes

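[Quick read]: This paper addresses the absence of color information in single-channel 3D reconstruction, that is, how to produce high-quality, color-diverse 3D models from uncolored geometry. Existing methods typically perform 3D colorization by distilling 2D image colorization models, but they have an inherent flaw: the 2D models are inconsistent across views, so colors get averaged during training, producing monotonous and oversimplified results, especially in complex 360° scenes. The key to the solution is LoGoColor, which discards the guidance-averaging process in favor of a "Local-Global" strategy: the scene is partitioned into subscenes, and a fine-tuned multi-view diffusion model handles both intra-subscene and inter-subscene consistency. This preserves color diversity while ensuring multi-view consistency, improving the quality and plausibility of 3D colorization in complex scenes.

Link: https://arxiv.org/abs/2512.09278
Authors: Yeonjin Chang, Juhwan Cho, Seunghyeon Seo, Wonsik Shin, Nojun Kwak
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: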

Abstract:Single-channel 3D reconstruction is widely used in fields such as robotics and medical imaging. While this line of work excels at reconstructing 3D geometry, the outputs are not colored 3D models, thus 3D colorization is required for visualization. Recent 3D colorization studies address this problem by distilling 2D image colorization models. However, these approaches suffer from an inherent inconsistency of 2D image models. This results in colors being averaged during training, leading to monotonous and oversimplified results, particularly in complex 360° scenes. In contrast, we aim to preserve color diversity by generating a new set of consistently colorized training views, thereby bypassing the averaging process. Nevertheless, eliminating the averaging process introduces a new challenge: ensuring strict multi-view consistency across these colorized views. To achieve this, we propose LoGoColor, a pipeline designed to preserve color diversity by eliminating this guidance-averaging process with a ‘Local-Global’ approach: we partition the scene into subscenes and explicitly tackle both inter-subscene and intra-subscene consistency using a fine-tuned multi-view diffusion model. We demonstrate that our method achieves quantitatively and qualitatively more consistent and plausible 3D colorization on complex 360° scenes than existing methods, and validate its superior color diversity using a novel Color Diversity Index.

[CV-83] Dynamic Facial Expressions Analysis Based Parkinsons Disease Auxiliary Diagnosis

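[Quick read]: This paper addresses the reliance of Parkinson's disease (PD) diagnosis on clinicians' subjective judgment, along with its low efficiency and limited accessibility. The core of its solution is an auxiliary diagnosis method based on dynamic facial expression analysis: a multimodal facial expression analysis network extracts expression-intensity features while patients perform different facial expressions, using the CLIP architecture to fuse visual and textual information while preserving the temporal dynamics of the expressions. These features are then fed into a long short-term memory (LSTM)-based classifier for PD detection, achieving automated auxiliary diagnosis with 93.1% accuracy.

Link: https://arxiv.org/abs/2512.09276
Authors: Xiaochen Huang, Xiaochen Bi, Cuihua Lv, Xin Wang, Haoyan Zhang, Wenjing Jiang, Xin Ma, Yibin Li
Affiliations: Shandong University; Qilu Hospital; Inspur
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: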

Abstract:Parkinson’s disease (PD), a prevalent neurodegenerative disorder, significantly affects patients’ daily functioning and social interactions. To facilitate a more efficient and accessible diagnostic approach for PD, we propose a dynamic facial expression analysis-based PD auxiliary diagnosis method. This method targets hypomimia, a characteristic clinical symptom of PD, by analyzing two manifestations: reduced facial expressivity and facial rigidity, thereby facilitating the diagnosis process. We develop a multimodal facial expression analysis network to extract expression intensity features during patients’ performance of various facial expressions. This network leverages the CLIP architecture to integrate visual and textual features while preserving the temporal dynamics of facial expressions. Subsequently, the expression intensity features are processed and input into an LSTM-based classification network for PD diagnosis. Our method achieves an accuracy of 93.1%, outperforming other in-vitro PD diagnostic approaches. This technique offers a more convenient detection method for potential PD patients, improving their diagnostic experience.

[CV-84] LongT2IBench: A Benchmark for Evaluating Long Text-to-Image Generation with Graph-structured Annotations AAAI2026

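[Quick read]: This paper addresses automatic and interpretable evaluation of image-text alignment in long Text-to-Image (Long T2I) generation. Existing T2I alignment benchmarks mainly target short prompts and provide only MOS or Likert-scale annotations; they lack fine-grained structured interpretability and so cannot support precise evaluation of long-prompt generation quality. The key to the solution is LongT2IBench, a benchmark of 14K long text-image pairs with graph-structured human annotations. A Generate-Refine-Qualify annotation protocol converts complex long prompts into textual graphs of entities, attributes, and relations, enabling alignment annotation at the level of fine-grained semantic elements. Building on this, the authors propose LongT2IExpert, which instruction-tunes multimodal large language models (MLLMs) with a Hierarchical Alignment Chain-of-Thought so that they output both quantitative alignment scores and structured interpretations, improving the accuracy and interpretability of long T2I evaluation.

Link: https://arxiv.org/abs/2512.09271
Authors: Zhichao Yang, Tianjiao Gu, Jianjie Wang, Feiyu Lin, Xiangfei Sheng, Pengfei Chen, Leida Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The paper has been accepted by AAAI 2026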

Abstract:The increasing popularity of long Text-to-Image (T2I) generation has created an urgent need for automatic and interpretable models that can evaluate the image-text alignment in long prompt scenarios. However, the existing T2I alignment benchmarks predominantly focus on short prompt scenarios and only provide MOS or Likert scale annotations. This inherent limitation hinders the development of long T2I evaluators, particularly in terms of the interpretability of alignment. In this study, we contribute LongT2IBench, which comprises 14K long text-image pairs accompanied by graph-structured human annotations. Given the detail-intensive nature of long prompts, we first design a Generate-Refine-Qualify annotation protocol to convert them into textual graph structures that encompass entities, attributes, and relations. Through this transformation, fine-grained alignment annotations are achieved based on these granular elements. Finally, the graph-structured annotations are converted into alignment scores and interpretations to facilitate the design of T2I evaluation models. Based on LongT2IBench, we further propose LongT2IExpert, a LongT2I evaluator that enables multi-modal large language models (MLLMs) to provide both quantitative scores and structured interpretations through an instruction-tuning process with Hierarchical Alignment Chain-of-Thought (CoT). Extensive experiments and comparisons demonstrate the superiority of the proposed LongT2IExpert in alignment evaluation and interpretation. Data and code have been released in this https URL.

[CV-85] MoRel: Long-Range Flicker-Free 4D Motion Modeling via Anchor Relay-based Bidirectional Blending with Hierarchical Densification

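[Quick read]: This paper addresses three challenges that 4D Gaussian Splatting (4DGS) faces when modeling long-range dynamic videos: memory explosion, temporal flickering, and difficulty handling occlusions that appear or disappear over time. Its core solution is MoRel, a framework built on an Anchor Relay-based Bidirectional Blending (ARBB) mechanism. The method progressively constructs locally canonical anchor spaces (Key-frame Anchors, KfA) at key-frame time indices and models inter-frame deformations at the anchor level, which strengthens temporal coherence. By learning bidirectional deformations and blending them adaptively through learnable opacity control, it mitigates temporal discontinuities and flickering artifacts. In addition, a Feature-variance-guided Hierarchical Densification (FHD) scheme densifies the KfA efficiently while preserving rendering quality, enabling memory-bounded reconstruction of long-range dynamic scenes.

Link: https://arxiv.org/abs/2512.09270
Authors: Sangwoon Kwak, Weeyoung Kwon, Jun Young Jeong, Geonho Kim, Won-Sik Cheong, Jihyong Oh
Affiliations: Electronics and Telecommunications Research Institute; Chung-Ang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Please visit our project page at this https URL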

Abstract:Recent advances in 4D Gaussian Splatting (4DGS) have extended the high-speed rendering capability of 3D Gaussian Splatting (3DGS) into the temporal domain, enabling real-time rendering of dynamic scenes. However, one of the major remaining challenges lies in modeling long-range motion-contained dynamic videos, where a naive extension of existing methods leads to severe memory explosion, temporal flickering, and failure to handle appearing or disappearing occlusions over time. To address these challenges, we propose a novel 4DGS framework characterized by an Anchor Relay-based Bidirectional Blending (ARBB) mechanism, named MoRel, which enables temporally consistent and memory-efficient modeling of long-range dynamic scenes. Our method progressively constructs locally canonical anchor spaces at key-frame time indices and models inter-frame deformations at the anchor level, enhancing temporal coherence. By learning bidirectional deformations between Key-frame Anchors (KfAs) and adaptively blending them through learnable opacity control, our approach mitigates temporal discontinuities and flickering artifacts. We further introduce a Feature-variance-guided Hierarchical Densification (FHD) scheme that effectively densifies KfAs while keeping rendering quality, based on an assigned level of feature-variance. To effectively evaluate our model’s capability to handle real-world long-range 4D motion, we compose a new long-range 4D motion-contained dataset, called SelfCap$_{\text{LR}}$. It has a larger average dynamic motion magnitude and is captured in spatially wider spaces than previous dynamic video datasets. Overall, our MoRel achieves temporally coherent and flicker-free long-range 4D reconstruction while maintaining bounded memory usage, demonstrating both scalability and efficiency in dynamic Gaussian-based representations.

[CV-86] ROI-Packing: Efficient Region-Based Compression for Machine Vision

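[Quick read]: This paper addresses the trade-off between image compression efficiency and end-task performance in machine vision, namely how to reach higher compression ratios without hurting the accuracy of downstream tasks such as object detection or instance segmentation. The key to its solution is ROI-Packing, which identifies and prioritizes the regions of interest (ROI) critical to the end task, packs those regions efficiently for encoding, and discards less relevant data, improving compression efficiency without retraining or fine-tuning the end-task models. Compared with the VVC codec, experiments show an 8.88% improvement in task accuracy at the same bitrate, or a bitrate reduction of up to 44.10% with no loss in accuracy.

Link: https://arxiv.org/abs/2512.09258
Authors: Md Eimran Hossain Eimon, Alena Krause, Ashan Perera, Juan Merlos, Hari Kalva, Velibor Adzic, Borko Furht
Affiliations: Florida Atlantic University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: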

Abstract:This paper introduces ROI-Packing, an efficient image compression method tailored specifically for machine vision. By prioritizing regions of interest (ROI) critical to end-task accuracy and packing them efficiently while discarding less relevant data, ROI-Packing achieves significant compression efficiency without requiring retraining or fine-tuning of end-task models. Comprehensive evaluations across five datasets and two popular tasks (object detection and instance segmentation) demonstrate up to a 44.10% reduction in bitrate without compromising end-task accuracy, along with an 8.88% improvement in accuracy at the same bitrate compared to the state-of-the-art Versatile Video Coding (VVC) codec standardized by the Moving Picture Experts Group (MPEG).

[CV-87] GLACIA: Instance-Aware Positional Reasoning for Glacial Lake Segmentation via Multimodal Large Language Model

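[Quick read]: This paper addresses a limitation of existing segmentation methods for glacial lake monitoring (those based on convolutional neural networks, CNNs, and Vision Transformers, ViTs): their pixel-level predictions lack high-level global scene semantics and human-interpretable reasoning. The key to the solution is GLACIA, the first framework to combine large language models (LLMs) with segmentation capabilities, producing accurate segmentation masks together with corresponding spatial reasoning outputs. The authors also build the Glacial Lake Position Reasoning (GLake-Pos) dataset pipeline, which supplies diverse, spatially grounded question-answer pairs to compensate for the lack of instance-aware positional reasoning data in remote sensing. The result is higher segmentation accuracy (mIoU of 87.30) and a more interpretable, practical decision-making process.

Link: https://arxiv.org/abs/2512.09251
Authors: Lalit Maurya, Saurabh Kaushik, Beth Tellman
Affiliations: Portsmouth AI and Data Science Centre (PAIDS), School of Computing, University of Portsmouth; Center for Sustainability and the Global Environment (SAGE), University of Wisconsin–Madison
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: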

Abstract:Glacial lake monitoring bears great significance in mitigating the anticipated risk of Glacial Lake Outburst Floods. However, existing segmentation methods based on convolutional neural networks (CNNs) and Vision Transformers (ViTs) remain constrained to pixel-level predictions, lacking high-level global scene semantics and human-interpretable reasoning. To address this, we introduce GLACIA (Glacial LAke segmentation with Contextual Instance Awareness), the first framework that integrates large language models with segmentation capabilities to produce both accurate segmentation masks and corresponding spatial reasoning outputs. We construct the Glacial Lake Position Reasoning (GLake-Pos) dataset pipeline, which provides diverse, spatially grounded question-answer pairs designed to overcome the lack of instance-aware positional reasoning data in remote sensing. Comparative evaluations demonstrate that GLACIA (mIoU: 87.30) surpasses state-of-the-art methods based on CNNs (mIoU: 78.55 - 79.01), ViTs (mIoU: 69.27 - 81.75), Geo-foundation models (mIoU: 76.37 - 87.10), and reasoning-based segmentation methods (mIoU: 60.12 - 75.66). Our approach enables intuitive disaster preparedness and informed policy-making in the context of rapidly changing glacial environments by facilitating natural language interaction, thereby supporting more efficient and interpretable decision-making. The code is released on this https URL
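For readers unfamiliar with the mIoU figures quoted above, here is a minimal sketch of the metric for a binary lake/background mask. It assumes mIoU is the unweighted mean of the per-class IoU values; the paper's exact evaluation protocol is not given in the abstract, and the masks below are made up for illustration.

```python
def iou(pred, target):
    """Intersection over union of two binary masks given as nested lists."""
    inter = union = 0
    for prow, trow in zip(pred, target):
        for p, t in zip(prow, trow):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter / union if union else 1.0


def invert(mask):
    """Flip a binary mask so the background becomes the positive class."""
    return [[1 - v for v in row] for row in mask]


def mean_iou(pred, target):
    """Average the IoU of the foreground class and the background class."""
    return (iou(pred, target) + iou(invert(pred), invert(target))) / 2


# Hypothetical 2x3 masks: 1 marks lake pixels, 0 marks background.
pred = [[1, 1, 0], [0, 0, 0]]
target = [[1, 0, 0], [0, 0, 0]]
score = mean_iou(pred, target)
```

Here the lake class scores 1/2 and the background class 4/5, so a single false-positive pixel on a small lake pulls the mean down noticeably.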

[CV-88] OmniPSD: Layered PSD Generation with Diffusion Transformer

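[Quick read]: This paper addresses the challenge diffusion models face in generating or reconstructing layered PSD (Photoshop Document) files with transparent alpha channels, in particular keeping the layer structure semantically consistent and the transparency information accurate. The key to the solution is OmniPSD, a framework built on the Flux ecosystem that performs text-to-PSD generation and image-to-PSD decomposition through in-context learning. For text-to-PSD generation, it arranges multiple target layers spatially on a single canvas and models their compositional relationships with spatial attention, producing semantically coherent, hierarchically structured layers. For image-to-PSD decomposition, it uses iterative in-context editing, progressively extracting and erasing textual and foreground components to rebuild editable PSD layers. An RGBA-VAE serves as an auxiliary representation module that preserves transparency without interfering with structure learning, yielding high-fidelity, structurally consistent, transparency-aware layered generation and decomposition.

Link: https://arxiv.org/abs/2512.09247
Authors: Cheng Liu, Yiren Song, Haofan Wang, Mike Zheng Shou
Affiliations: National University of Singapore; Lovart AI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: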

Abstract:Recent advances in diffusion models have greatly improved image generation and editing, yet generating or reconstructing layered PSD files with transparent alpha channels remains highly challenging. We propose OmniPSD, a unified diffusion framework built upon the Flux ecosystem that enables both text-to-PSD generation and image-to-PSD decomposition through in-context learning. For text-to-PSD generation, OmniPSD arranges multiple target layers spatially into a single canvas and learns their compositional relationships through spatial attention, producing semantically coherent and hierarchically structured layers. For image-to-PSD decomposition, it performs iterative in-context editing, progressively extracting and erasing textual and foreground components to reconstruct editable PSD layers from a single flattened image. An RGBA-VAE is employed as an auxiliary representation module to preserve transparency without affecting structure learning. Extensive experiments on our new RGBA-layered dataset demonstrate that OmniPSD achieves high-fidelity generation, structural consistency, and transparency awareness, offering a new paradigm for layered design generation and decomposition with diffusion transformers.

[CV-89] A Clinically Interpretable Deep CNN Framework for Early Chronic Kidney Disease Prediction Using Grad-CAM-Based Explainable AI

[Quick read]: This paper addresses the difficulty of diagnosing Chronic Kidney Disease (CKD) early, with the aim of enabling more timely clinical intervention. The key to its solution is a deep convolutional neural network (CNN) that automatically classifies CT kidney images, uses the Synthetic Minority Over-sampling Technique (SMOTE) to handle class imbalance, and applies Gradient-weighted Class Activation Mapping (Grad-CAM) to improve interpretability. On a dataset of 12,446 CT images, the authors report 100% accuracy for early CKD detection.

Link: https://arxiv.org/abs/2512.09244
Authors: Anas Bin Ayub, Nilima Sultana Niha, Md. Zahurul Haque
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Chronic Kidney Disease (CKD) constitutes a major global medical burden, marked by the gradual deterioration of renal function, which results in the impaired clearance of metabolic waste and disturbances in systemic fluid homeostasis. Owing to its substantial contribution to worldwide morbidity and mortality, the development of reliable and efficient diagnostic approaches is critically important to facilitate early detection and prompt clinical management. This study presents a deep convolutional neural network (CNN) for early CKD detection from CT kidney images, complemented by class balancing using Synthetic Minority Over-sampling Technique (SMOTE) and interpretability via Gradient-weighted Class Activation Mapping (Grad-CAM). The model was trained and evaluated on the CT KIDNEY DATASET, which contains 12,446 CT images, including 3,709 cyst, 5,077 normal, 1,377 stone, and 2,283 tumor cases. The proposed deep CNN achieved a remarkable classification performance, attaining 100% accuracy in the early detection of chronic kidney disease (CKD). This significant advancement demonstrates strong potential for addressing critical clinical diagnostic challenges and enhancing early medical intervention strategies.
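The class-balancing step can be illustrated with a minimal sketch of the core SMOTE idea: a synthetic minority sample is an interpolation between a minority point and one of its nearest minority neighbours. The function name `smote_sample`, the toy 2-D points, and the choice of `k` are assumptions for illustration; the abstract does not describe how the authors apply SMOTE to CT images (for example on pixels or on extracted features).

```python
import random


def smote_sample(minority, k=2, rng=None):
    """Create one synthetic point from a list of minority-class feature vectors."""
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
    base = rng.choice(minority)
    # Rank the other minority points by squared Euclidean distance to the base.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # position along the segment base -> neighbour
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]


# Hypothetical minority-class points on the corners of a unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = [smote_sample(minority, rng=random.Random(seed)) for seed in range(20)]
```

Because each synthetic point lies on a segment between two real minority points, it stays inside the region the minority class already occupies. If oversampling is applied before the train/test split, near-duplicates of test points can leak into training, which is worth checking whenever very high accuracy is reported.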

[CV-90] Efficient Feature Compression for Machines with Global Statistics Preservation

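[Quick read]: This paper addresses the high bandwidth cost of transmitting intermediate feature data between the two parts of a model in the split-inference paradigm. The core challenge is compressing these features effectively without losing downstream task accuracy. The key to the solution is Z-score normalization, used to efficiently recover the compressed feature data at the decoder side, replacing the scaling method in the Feature Coding for Machines (FCM) standard currently under development by MPEG. The method reduces overhead bits and improves task accuracy; the authors also propose a simplified variant that further reduces overhead in certain circumstances. Experiments show an average bitrate reduction of 17.09% across tasks, and up to 65.69% for object tracking, without sacrificing task accuracy.

Link: https://arxiv.org/abs/2512.09235
Authors: Md Eimran Hossain Eimon, Hyomin Choi, Fabien Racapé, Mateen Ulhaq, Velibor Adzic, Hari Kalva, Borko Furht
Affiliations: Florida Atlantic University; InterDigital - AI Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: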

Abstract:The split-inference paradigm divides an artificial intelligence (AI) model into two parts. This necessitates the transfer of intermediate feature data between the two halves. Here, effective compression of the feature data becomes vital. In this paper, we employ Z-score normalization to efficiently recover the compressed feature data at the decoder side. To examine the efficacy of our method, the proposed method is integrated into the latest Feature Coding for Machines (FCM) codec standard under development by the Moving Picture Experts Group (MPEG). Our method supersedes the existing scaling method used by the current standard under development. It both reduces the overhead bits and improves the end-task accuracy. To further reduce the overhead in certain circumstances, we also propose a simplified method. Experiments show that our proposed method yields a 17.09% reduction in bitrate on average across different tasks, and up to 65.69% for object tracking, without sacrificing task accuracy.
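The general idea of Z-score normalization with decoder-side recovery can be sketched as below. This is only the textbook transform: the encoder normalizes the features and signals the mean and standard deviation, and the decoder inverts the transform. The function names are hypothetical, and how the statistics are signalled or where the step sits inside the FCM pipeline is not described in the abstract.

```python
import statistics


def zscore_encode(features):
    """Normalize features to zero mean and unit variance; return the statistics too."""
    mean = statistics.fmean(features)
    std = statistics.pstdev(features)
    std = std if std > 0 else 1.0  # guard against constant feature maps
    normalized = [(x - mean) / std for x in features]
    return normalized, mean, std


def zscore_decode(normalized, mean, std):
    """Restore the original scale using the two signalled statistics."""
    return [z * std + mean for z in normalized]


# Hypothetical flattened feature tensor (mean 5, population std 2).
features = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
normalized, mean, std = zscore_encode(features)
restored = zscore_decode(normalized, mean, std)
```

In a real codec a lossy compression step would sit between the two functions, so `restored` would only approximate `features`; the point of the sketch is that two scalars are enough to recover the global statistics at the decoder.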

[CV-91] Enabling Next-Generation Consumer Experience with Feature Coding for Machines

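[Quick read]: This paper addresses the limited computing resources and inefficient data transfer that intelligent connected devices face when running deep learning models. The key to the solution is the Feature Coding for Machines (FCM) standard, part of MPEG-AI, which the paper presents in overview. By efficiently extracting, compressing, and transmitting intermediate neural network features, FCM lets low-power devices offload computationally intensive work to servers with ample computing resources. It maintains the same level of accuracy while reducing bitrate requirements by 75.90% compared with remote inference, improving transmission efficiency and deployability for AI-driven applications.

Link: https://arxiv.org/abs/2512.09232
Authors: Md Eimran Hossain Eimon, Juan Merlos, Ashan Perera, Hari Kalva, Velibor Adzic, Borko Furht
Affiliations: Florida Atlantic University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: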

Abstract:As consumer devices become increasingly intelligent and interconnected, efficient data transfer solutions for machine tasks have become essential. This paper presents an overview of the latest Feature Coding for Machines (FCM) standard, part of MPEG-AI and developed by the Moving Picture Experts Group (MPEG). FCM supports AI-driven applications by enabling the efficient extraction, compression, and transmission of intermediate neural network features. By offloading computationally intensive operations to base servers with high computing resources, FCM allows low-powered devices to leverage large deep learning models. Experimental results indicate that the FCM standard maintains the same level of accuracy while reducing bitrate requirements by 75.90% compared to remote inference.

[CV-92] View-on-Graph: Zero-shot 3D Visual Grounding via Vision-Language Reasoning on Scene Graphs

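[Quick read]: This paper addresses the entangled visual representations in current zero-shot 3D visual grounding (3DVG) methods, which rely on 2D vision-language models (VLMs) to process 3D spatial information (SI): the VLM is forced to process cluttered global visual cues and struggles to exploit spatial semantic relationships. The key to the solution is a new "VLM × SI" paradigm that externalizes the 3D SI into a structured multi-modal scene graph, so the VLM can act as an active agent that incrementally retrieves only the information it needs during reasoning, lowering reasoning difficulty and yielding interpretable step-by-step traces. The paradigm is instantiated by the proposed View-on-Graph (VoG) method, which achieves state-of-the-art zero-shot 3DVG performance.

Link: https://arxiv.org/abs/2512.09215
Authors: Yuanyuan Liu, Haiyang Mei, Dongyang Zhan, Jiayue Zhao, Dongsheng Zhou, Bo Dong, Xin Yang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: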

Abstract:3D visual grounding (3DVG) identifies objects in 3D scenes from language descriptions. Existing zero-shot approaches leverage 2D vision-language models (VLMs) by converting 3D spatial information (SI) into forms amenable to VLM processing, typically as composite inputs such as specified view renderings or video sequences with overlaid object markers. However, this VLM + SI paradigm yields entangled visual representations that compel the VLM to process entire cluttered cues, making it hard to exploit spatial semantic relationships effectively. In this work, we propose a new VLM x SI paradigm that externalizes the 3D SI into a form enabling the VLM to incrementally retrieve only what it needs during reasoning. We instantiate this paradigm with a novel View-on-Graph (VoG) method, which organizes the scene into a multi-modal, multi-layer scene graph and allows the VLM to operate as an active agent that selectively accesses necessary cues as it traverses the scene. This design offers two intrinsic advantages: (i) by structuring 3D context into a spatially and semantically coherent scene graph rather than confounding the VLM with densely entangled visual inputs, it lowers the VLM’s reasoning difficulty; and (ii) by actively exploring and reasoning over the scene graph, it naturally produces transparent, step-by-step traces for interpretable 3DVG. Extensive experiments show that VoG achieves state-of-the-art zero-shot performance, establishing structured scene exploration as a promising strategy for advancing zero-shot 3DVG.

[CV-93] Residual Primitive Fitting of 3D Shapes with SuperFrusta

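[Quick read]: This paper addresses a long-standing trade-off in 3D shape reconstruction between fidelity and parsimony: existing methods either struggle to represent complex geometry efficiently or fail to provide an editable parametric representation. The solution rests on two contributions. The first is a new analytic primitive, SuperFrustum, which is expressive (it can model cylinders, spheres, cones, and their variants), compactly parameterized (only 8 parameters), and optimization-friendly (its signed distance field is differentiable). The second is an unsupervised iterative fitting algorithm, Residual Primitive Fitting (ResFit), which alternates global shape analysis with local optimization and progressively fits the unexplained residual, yielding decompositions that are both accurate and sparse. On several 3D benchmarks the method achieves state-of-the-art results, improving IoU by over 9 points while using nearly half as many primitives as prior work.

Link: https://arxiv.org/abs/2512.09201
Authors: Aditya Ganeshan, Matheus Gadelha, Thibault Groueix, Zhiqin Chen, Siddhartha Chaudhuri, Vladimir Kim, Wang Yifan, Daniel Ritchie
Affiliations: Brown University; Adobe Research
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL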

Abstract:We introduce a framework for converting 3D shapes into compact and editable assemblies of analytic primitives, directly addressing the persistent trade-off between reconstruction fidelity and parsimony. Our approach combines two key contributions: a novel primitive, termed SuperFrustum, and an iterative fitting algorithm, Residual Primitive Fitting (ResFit). SuperFrustum is an analytical primitive that is simultaneously (1) expressive, being able to model various common solids such as cylinders, spheres, cones, and their tapered and bent forms, (2) editable, being compactly parameterized with 8 parameters, and (3) optimizable, with a signed distance field differentiable w.r.t. its parameters almost everywhere. ResFit is an unsupervised procedure that interleaves global shape analysis with local optimization, iteratively fitting primitives to the unexplained residual of a shape to discover a parsimonious yet accurate decomposition for each input shape. On diverse 3D benchmarks, our method achieves state-of-the-art results, improving IoU by over 9 points while using nearly half as many primitives as prior work. The resulting assemblies bridge the gap between dense 3D data and human-controllable design, producing high-fidelity and editable shape programs.

[CV-94] Learning Patient-Specific Disease Dynamics with Latent Flow Matching for Longitudinal Imaging Generation

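[Quick read]: This paper addresses the lack of semantic structure and clinical relevance in patient trajectories within the latent space of disease progression models, and in particular the difficulty existing generative methods (such as diffusion-based models) have in preserving the continuity and monotonicity of disease dynamics. The key to the solution is Δ-LFM, a framework that treats disease dynamics as a velocity field and uses Flow Matching (FM) to align the temporal evolution of patient data. It also introduces patient-specific latent alignment, which constrains each patient's trajectory to lie along a specific axis with magnitude increasing monotonically with disease severity, producing a consistent and semantically meaningful latent space.

Link: https://arxiv.org/abs/2512.09185
Authors: Hao Chen, Rui Yin, Yifan Chen, Qi Chen, Chao Li
Affiliations: University of Cambridge; Nanjing First Hospital; Nanjing Medical University; Johns Hopkins University; University of Dundee
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Under review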

Abstract:Understanding disease progression is a central clinical challenge with direct implications for early diagnosis and personalized treatment. While recent generative approaches have attempted to model progression, key mismatches remain: disease dynamics are inherently continuous and monotonic, yet latent representations are often scattered, lacking semantic structure, and diffusion-based models disrupt continuity with a random denoising process. In this work, we propose to treat the disease dynamic as a velocity field and leverage Flow Matching (FM) to align the temporal evolution of patient data. Unlike prior methods, it captures the intrinsic dynamic of disease, making the progression more interpretable. However, a key challenge remains: in latent space, Auto-Encoders (AEs) do not guarantee alignment across patients or correlation with clinical-severity indicators (e.g., age and disease conditions). To address this, we propose to learn patient-specific latent alignment, which enforces patient trajectories to lie along a specific axis, with magnitude increasing monotonically with disease severity. This leads to a consistent and semantically meaningful latent space. Together, we present Δ-LFM, a framework for modeling patient-specific latent progression with flow matching. Across three longitudinal MRI benchmarks, Δ-LFM demonstrates strong empirical performance and, more importantly, offers a new framework for interpreting and visualizing disease dynamics.
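To make "treating the dynamic as a velocity field" concrete, here is a minimal sketch of a flow matching training target. It assumes the widely used straight-line path between an earlier latent `x0` and a later latent `x1`, for which the target velocity is simply `x1 - x0`; the abstract does not specify the path or loss used in Δ-LFM, so the path, function names, and values below are illustrative assumptions.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from x0 (t=0) to x1 (t=1)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path; it is constant in t."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predicted_velocity, x0, x1):
    """Mean squared error between a predicted velocity and the target velocity."""
    target = target_velocity(x0, x1)
    return sum((p - v) ** 2 for p, v in zip(predicted_velocity, target)) / len(target)


# Hypothetical 2-D latents of one patient at an earlier and a later visit.
x0 = [0.0, 1.0]
x1 = [2.0, 5.0]
midpoint = interpolate(x0, x1, 0.5)
perfect = flow_matching_loss([2.0, 4.0], x0, x1)
```

In training, a network would receive `interpolate(x0, x1, t)` and `t` and be asked to predict the velocity; integrating the learned field forward then moves a latent along a continuous trajectory instead of through a sequence of random denoising steps.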

[CV-95] Prompt-Based Continual Compositional Zero-Shot Learning

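[Quick read]: This paper addresses continual learning for vision-language models (VLMs) that must keep adapting to new attributes, new objects, and their compositions, that is, Compositional Zero-Shot Learning (CZSL) without forgetting prior knowledge. Unlike classical continual learning with disjoint classes, in Continual Compositional Zero-Shot Learning (CCZSL) attributes and objects may recur across sessions while compositions remain unique. The key to the solution is PromptCCZSL, the first prompt-based CCZSL framework. It retains prior knowledge through recency-weighted multi-teacher distillation, and uses session-aware compositional prompts to fuse multimodal features for new compositions, while session-agnostic attribute and object prompts maintain global semantic consistency, further stabilized by a Cosine Anchor Loss (CAL). An Orthogonal Projection Loss (OPL) keeps new embeddings from overlapping with earlier ones, and an Intra-Session Diversity Loss (IDL) makes current-session embeddings more discriminative, giving stable and effective continual adaptation.

Link: https://arxiv.org/abs/2512.09172
Authors: Sauda Maryam, Sara Nadeem, Faisal Qureshi, Mohsen Ali
Affiliations: Intelligent Machines Lab; Information Technology University; Visual Computing Lab; Ontario Tech University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: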

Abstract:We tackle continual adaptation of vision-language models to new attributes, objects, and their compositions in Compositional Zero-Shot Learning (CZSL), while preventing forgetting of prior knowledge. Unlike classical continual learning where classes are disjoint, CCZSL is more complex as attributes and objects may reoccur across sessions while compositions remain unique. Built on a frozen VLM backbone, we propose the first Prompt-based Continual Compositional Zero-Shot Learning (PromptCCZSL) framework that retains prior knowledge through recency-weighted multi-teacher distillation. It employs session-aware compositional prompts to fuse multimodal features for new compositions, while attribute and object prompts are learned through session-agnostic fusion to maintain global semantic consistency, which is further stabilized by a Cosine Anchor Loss (CAL) to preserve prior knowledge. To enhance adaptation in the current session, an Orthogonal Projection Loss (OPL) ensures that new attribute and object embeddings remain distinct from previous ones, preventing overlap, while an Intra-Session Diversity Loss (IDL) promotes variation among current-session embeddings for richer, more discriminative representations. We also introduce a comprehensive protocol that jointly measures catastrophic forgetting and compositional generalization. Extensive experiments on UT-Zappos and C-GQA benchmarks demonstrate that PromptCCZSL achieves substantial improvements over prior VLM-based and non-VLM baselines, setting a new benchmark for CCZSL in closed-world settings.

[CV-96] WonderZoom: Multi-Scale 3D World Generation

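[Quick read]: This paper addresses a limitation of existing 3D world generation models in multi-scale content synthesis: current methods generate 3D scenes at a single scale and cannot keep content consistent and coherent across different spatial granularities. The core challenge is the lack of a scale-aware 3D representation that can generate and render content of widely differing spatial sizes. The solution rests on two innovations: scale-adaptive Gaussian surfels, which support generation and real-time rendering of multi-scale 3D scenes, and a progressive detail synthesizer, which iteratively generates finer-scale 3D content, enabling auto-regressive detail expansion from landscapes down to microscopic features.

Link: https://arxiv.org/abs/2512.09164
Authors: Jin Cao, Hong-Xing Yu, Jiajun Wu
Affiliations: Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: Project website: this https URL. The first two authors contributed equally.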

Abstract:We present WonderZoom, a novel approach to generating 3D scenes with contents across multiple spatial scales from a single image. Existing 3D world generation models remain limited to single-scale synthesis and cannot produce coherent scene contents at varying granularities. The fundamental challenge is the lack of a scale-aware 3D representation capable of generating and rendering content with largely different spatial sizes. WonderZoom addresses this through two key innovations: (1) scale-adaptive Gaussian surfels for generating and real-time rendering of multi-scale 3D scenes, and (2) a progressive detail synthesizer that iteratively generates finer-scale 3D contents. Our approach enables users to “zoom into” a 3D region and auto-regressively synthesize previously non-existent fine details from landscapes to microscopic features. Experiments demonstrate that WonderZoom significantly outperforms state-of-the-art video and 3D models in both quality and alignment, enabling multi-scale 3D world creation from a single image. We show video results and an interactive viewer of generated multi-scale 3D worlds in this https URL

[CV-97] GTAvatar: Bridging Gaussian Splatting and Texture Mapping for Relightable and Editable Gaussian Avatars

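Quick read: This paper addresses the limited editability of avatar reconstruction methods based on 2D Gaussian Splatting: they achieve high visual fidelity but lack the intuitive UV texture-mapping editing offered by traditional triangle-mesh methods. The key to the solution is a computationally efficient mechanism that embeds the local frame of each canonical Gaussian primitive into the UV space of a template mesh, reconstructing continuous, editable material textures from a monocular video. An efficient physically based reflectance model additionally supports relighting and editing of these intrinsic material maps, giving intuitive control over appearance and geometry without additional optimization.

Link: https://arxiv.org/abs/2512.09162
Authors: Kelian Baert, Mae Younes, Francois Bourel, Marc Christie, Adnane Boukhayma
Affiliations: Univ Rennes; Inria; CNRS; IRISA
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: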

Abstract:Recent advancements in Gaussian Splatting have enabled increasingly accurate reconstruction of photorealistic head avatars, opening the door to numerous applications in visual effects, videoconferencing, and virtual reality. This, however, comes with the lack of intuitive editability offered by traditional triangle mesh-based methods. In contrast, we propose a method that combines the accuracy and fidelity of 2D Gaussian Splatting with the intuitiveness of UV texture mapping. By embedding each canonical Gaussian primitive’s local frame into a patch in the UV space of a template mesh in a computationally efficient manner, we reconstruct continuous editable material head textures from a single monocular video on a conventional UV domain. Furthermore, we leverage an efficient physically based reflectance model to enable relighting and editing of these intrinsic material maps. Through extensive comparisons with state-of-the-art methods, we demonstrate the accuracy of our reconstructions, the quality of our relighting results, and the ability to provide intuitive controls for modifying an avatar’s appearance and geometry via texture mapping without additional optimization.

[CV-98] Integrated Pipeline for Coronary Angiography With Automated Lesion Profiling, Virtual Stenting, and 100-Vessel FFR Validation

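Quick read: This paper addresses two problems in assessing coronary stenosis: visual grading on coronary angiography is subjective and only weakly related to myocardial ischaemia, and wire-based fractional flow reserve (FFR) improves lesion selection but is not applied systematically. The key to the solution is AngioAI-QFR, an end-to-end, angiography-only automated pipeline that combines deep-learning stenosis detection, lumen segmentation, centreline and diameter extraction, and per-millimetre Relative Flow Capacity (RFC) profiling, and supports virtual stenting with automatic recomputation of QFR (quantitative flow ratio). It unifies computer vision, functional assessment, and virtual percutaneous coronary intervention (PCI) planning in one efficient, near-real-time platform, improving diagnostic accuracy and clinical practicality.

Link: https://arxiv.org/abs/2512.09134
Authors: Georgy Kopanitsa, Oleg Metsker, Alexey Yakovlev
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 22 pages, 10 figures, 7 tables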

Abstract:Coronary angiography is the main tool for assessing coronary artery disease, but visual grading of stenosis is variable and only moderately related to ischaemia. Wire-based fractional flow reserve (FFR) improves lesion selection but is not used systematically. Angiography-derived indices such as quantitative flow ratio (QFR) offer wire-free physiology, yet many tools are workflow-intensive and separate from automated anatomy analysis and virtual PCI planning. We developed AngioAI-QFR, an end-to-end, angiography-only pipeline combining deep learning stenosis detection, lumen segmentation, centreline and diameter extraction, per-millimetre Relative Flow Capacity (RFC) profiling, and virtual stenting with automatic recomputation of angiography-derived QFR. The system was evaluated in 100 consecutive vessels with invasive FFR as reference. Primary endpoints were agreement with FFR (correlation, mean absolute error) and diagnostic performance for FFR ≤ 0.80. On held-out frames, stenosis detection achieved precision 0.97 and lumen segmentation Dice 0.78. Across 100 vessels, AngioAI-QFR correlated strongly with FFR (r = 0.89, MAE 0.045). The AUC for detecting FFR ≤ 0.80 was 0.93, with sensitivity 0.88 and specificity 0.86. The pipeline completed fully automatically in 93 percent of vessels, with median time to result 41 s. RFC profiling distinguished focal from diffuse capacity loss, and virtual stenting predicted larger QFR gain in focal than in diffuse disease. AngioAI-QFR provides a practical, near-real-time pipeline that unifies computer vision, functional profiling, and virtual PCI with automated angiography-derived physiology.
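The endpoints reported above (correlation, mean absolute error, sensitivity, specificity) can be computed from paired per-vessel values. The sketch below is an illustration only: the function name is hypothetical, the six value pairs are made up rather than taken from the paper, and it assumes a vessel counts as positive when its value is at or below the 0.80 cutoff.

```python
import math

def agreement_and_diagnostics(index_values, ffr_values, cutoff=0.80):
    """Pearson r, mean absolute error, sensitivity and specificity for paired values.

    A vessel is treated as positive when its value is <= cutoff (assumed convention).
    """
    n = len(ffr_values)
    mean_x = sum(index_values) / n
    mean_y = sum(ffr_values) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(index_values, ffr_values))
    var_x = sum((x - mean_x) ** 2 for x in index_values)
    var_y = sum((y - mean_y) ** 2 for y in ffr_values)
    pearson_r = cov / math.sqrt(var_x * var_y)
    mae = sum(abs(x - y) for x, y in zip(index_values, ffr_values)) / n

    # (predicted positive, reference positive) for every vessel.
    pairs = [(x <= cutoff, y <= cutoff) for x, y in zip(index_values, ffr_values)]
    tp = sum(1 for pred, ref in pairs if pred and ref)
    fn = sum(1 for pred, ref in pairs if not pred and ref)
    tn = sum(1 for pred, ref in pairs if not pred and not ref)
    fp = sum(1 for pred, ref in pairs if pred and not ref)
    return {
        "pearson_r": pearson_r,
        "mae": mae,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Made-up values for six vessels, for illustration only (not data from the paper).
qfr = [0.72, 0.85, 0.79, 0.91, 0.83, 0.76]
ffr = [0.70, 0.88, 0.82, 0.93, 0.78, 0.74]
metrics = agreement_and_diagnostics(qfr, ffr)
```

The AUC additionally requires sweeping the cutoff applied to the index values; it is left out to keep the sketch short.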

[CV-99] SuperF: Neural Implicit Fields for Multi-Image Super-Resolution

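Quick read: This paper addresses reconstruction errors in super-resolution caused by the "hallucinated" structures that single-image methods tend to produce. It targets multi-image super-resolution (MISR) and, in particular, better reconstruction when no high-resolution training data is available. Its core solution is SuperF, a test-time optimization method for MISR that models a continuous signal with coordinate-based neural networks (neural fields, i.e. an implicit neural representation, INR) and shares a single INR across multiple low-resolution frames with sub-pixel shifts, optimizing the INR jointly with the frame alignment. The key innovations are parameterizing the sub-pixel alignment directly as optimizable affine transformation parameters and optimizing over a super-sampled coordinate grid that corresponds to the output resolution, which yields upsampling factors of up to 8 without relying on any high-resolution training data.

Link: https://arxiv.org/abs/2512.09115
Authors: Sander Riisøen Jyhne, Christian Igel, Morten Goodwin, Per-Arne Andersen, Serge Belongie, Nico Lang
Affiliations: University of Agder; University of Copenhagen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 13 figures, 8 tables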

Abstract:High-resolution imagery is often hindered by limitations in sensor technology, atmospheric conditions, and costs. Such challenges occur in satellite remote sensing, but also with handheld cameras, such as our smartphones. Hence, super-resolution aims to enhance the image resolution algorithmically. Since single-image super-resolution requires solving an inverse problem, such methods must exploit strong priors, e.g. learned from high-resolution training data, or be constrained by auxiliary data, e.g. by a high-resolution guide from another modality. While qualitatively pleasing, such approaches often lead to “hallucinated” structures that do not match reality. In contrast, multi-image super-resolution (MISR) aims to improve the (optical) resolution by constraining the super-resolution process with multiple views taken with sub-pixel shifts. Here, we propose SuperF, a test-time optimization approach for MISR that leverages coordinate-based neural networks, also called neural fields. Their ability to represent continuous signals with an implicit neural representation (INR) makes them an ideal fit for the MISR task. The key characteristic of our approach is to share an INR for multiple shifted low-resolution frames and to jointly optimize the frame alignment with the INR. Our approach advances related INR baselines, adopted from burst fusion for layer separation, by directly parameterizing the sub-pixel alignment as optimizable affine transformation parameters and by optimizing via a super-sampled coordinate grid that corresponds to the output resolution. Our experiments yield compelling results on simulated bursts of satellite imagery and ground-level images from handheld cameras, with upsampling factors of up to 8. A key advantage of SuperF is that this approach does not rely on any high-resolution training data. 
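To make the two ingredients named in the abstract concrete (per-frame affine alignment parameters and a super-sampled coordinate grid), here is a minimal NumPy sketch of the forward model only. It is an assumption-laden illustration, not the authors' code: `toy_field` is a hypothetical analytic stand-in for the shared INR, box averaging is an assumed downsampling operator, and the function names are invented for this example.

```python
import numpy as np

def supersampled_grid(height, width, scale):
    """Pixel-centre (x, y) coordinates in [0, 1]^2 at the output resolution."""
    ys = (np.arange(height * scale) + 0.5) / (height * scale)
    xs = (np.arange(width * scale) + 0.5) / (width * scale)
    grid_y, grid_x = np.meshgrid(ys, xs, indexing="ij")
    return np.stack([grid_x, grid_y], axis=-1)  # shape (H*scale, W*scale, 2)

def apply_affine(coords, theta):
    """Map frame coordinates into the shared field's frame; theta is 2x3 [A | t]."""
    return coords @ theta[:, :2].T + theta[:, 2]

def render_low_res(field, theta, height, width, scale):
    """Query the field on the aligned super-sampled grid, then box-average to low resolution."""
    coords = apply_affine(supersampled_grid(height, width, scale), theta)
    values = field(coords)  # shape (H*scale, W*scale)
    return values.reshape(height, scale, width, scale).mean(axis=(1, 3))

def toy_field(coords):
    # Hypothetical stand-in for the INR: a smooth analytic function of (x, y).
    return coords[..., 0] + 2.0 * coords[..., 1]

identity = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
shifted = np.array([[1.0, 0.0, 0.01], [0.0, 1.0, 0.0]])  # small sub-pixel shift in x

frame_a = render_low_res(toy_field, identity, 4, 4, 2)
frame_b = render_low_res(toy_field, shifted, 4, 4, 2)
```

In a test-time optimization loop, the difference between each rendered frame and the observed low-resolution frame would drive gradient updates of both the field and each frame's `theta`; that loop needs an autodiff framework and is omitted here.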

[CV-100] GimbalDiffusion: Gravity-Aware Camera Control for Video Generation

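Quick read: This paper addresses the limited fine-grained control over camera motion and orientation in current text-to-video generation models. Existing methods typically encode camera trajectories with relative or ambiguous representations, which makes explicit geometric control difficult. The key to the solution is GimbalDiffusion, a framework that defines camera trajectories in physical-world coordinates with gravity as the global reference, giving precise and interpretable control over camera parameters without an initial reference frame. It also introduces null-pitch conditioning, which reduces the model's reliance on text content and avoids generation errors when the camera specification conflicts with the text (for example, the camera points at the sky while the prompt asks for grass). Finally, it establishes a benchmark for camera-aware video generation by rebalancing SpatialVID-HQ. Together these improve the controllability and robustness of text-to-video models.

Link: https://arxiv.org/abs/2512.09112
Authors: Frédéric Fortier-Chouinard, Yannick Hold-Geoffroy, Valentin Deschaintre, Matheus Gadelha, Jean-François Lalonde
Affiliations: Université Laval; Adobe
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL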

Abstract:Recent progress in text-to-video generation has achieved remarkable realism, yet fine-grained control over camera motion and orientation remains elusive. Existing approaches typically encode camera trajectories through relative or ambiguous representations, limiting explicit geometric control. We introduce GimbalDiffusion, a framework that enables camera control grounded in physical-world coordinates, using gravity as a global reference. Instead of describing motion relative to previous frames, our method defines camera trajectories in an absolute coordinate system, allowing precise and interpretable control over camera parameters without requiring an initial reference frame. We leverage panoramic 360-degree videos to construct a wide variety of camera trajectories, well beyond the predominantly straight, forward-facing trajectories seen in conventional video data. To further enhance camera guidance, we introduce null-pitch conditioning, an annotation strategy that reduces the model’s reliance on text content when conflicting with camera specifications (e.g., generating grass while the camera points towards the sky). Finally, we establish a benchmark for camera-aware video generation by rebalancing SpatialVID-HQ for comprehensive evaluation under wide camera pitch variation. Together, these contributions advance the controllability and robustness of text-to-video models, enabling precise, gravity-aligned camera manipulation within generative frameworks.
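The idea of an absolute, gravity-referenced camera description can be illustrated with a few lines of trigonometry. This is a sketch under assumed conventions, not the paper's parameterization: the world "up" axis is taken as +z (opposite to gravity), pitch is measured from the horizon, and the function names are hypothetical.

```python
import math

UP = (0.0, 0.0, 1.0)  # assumed world "up" axis, opposite to gravity

def forward_vector(pitch_deg, yaw_deg):
    """Unit viewing direction for a camera pitched above the horizon and panned by yaw."""
    pitch, yaw = math.radians(pitch_deg), math.radians(yaw_deg)
    return (
        math.cos(pitch) * math.cos(yaw),
        math.cos(pitch) * math.sin(yaw),
        math.sin(pitch),
    )

def pitch_from_forward(forward):
    """Recover the absolute pitch from the angle between the view direction and 'up'."""
    dot = sum(f * u for f, u in zip(forward, UP))
    return math.degrees(math.asin(max(-1.0, min(1.0, dot))))

# An absolute trajectory: tilt from the horizon up to 60 degrees while panning.
angles = [(0, 0), (20, 15), (40, 30), (60, 45)]
trajectory = [forward_vector(pitch, yaw) for pitch, yaw in angles]
recovered = [pitch_from_forward(f) for f in trajectory]
```

Because every pose is expressed against the same gravity axis, each pitch can be recovered independently, without reference to the first frame of the trajectory.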

[CV-101] Food Image Generation on Multi-Noun Categories WACV2026

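Quick read: This paper addresses image generation errors for food categories whose names contain multiple nouns (such as "egg noodle"). They arise because the text encoder has insufficient knowledge of multi-noun categories and misinterprets the relationships between the nouns; for example, "egg noodle" may be rendered as eggs and noodles as two separate entities. The key to the solution is FoCULR (Food Category Understanding and Layout Refinement), which incorporates food domain knowledge early in the generation process and strengthens understanding of multi-noun category semantics and spatial layout, improving the accuracy and plausibility of the generated images.

Link: https://arxiv.org/abs/2512.09095
Authors: Xinyue Pan, Yuhao Chen, Jiangpeng He, Fengqing Zhu
Affiliations: Purdue University; University of Waterloo
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by WACV 2026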

Abstract:Generating realistic food images for categories with multiple nouns is surprisingly challenging. For instance, the prompt “egg noodle” may result in images that incorrectly contain both eggs and noodles as separate entities. Multi-noun food categories are common in real-world datasets and account for a large portion of entries in benchmarks such as UEC-256. These compound names often cause generative models to misinterpret the semantics, producing unintended ingredients or objects. This is due to insufficient multi-noun category related knowledge in the text encoder and misinterpretation of multi-noun relationships, leading to incorrect spatial layouts. To overcome these challenges, we propose FoCULR (Food Category Understanding and Layout Refinement) which incorporates food domain knowledge and introduces core concepts early in the generation process. Experimental results demonstrate that the integration of these techniques improves image generation performance in the food domain.

[CV-102] Explaining the Unseen: Multimodal Vision-Language Reasoning for Situational Awareness in Underground Mining Disasters

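Quick read: This paper addresses the problem that darkness, dust, and collapses in underground mining disasters obscure vision and severely impair situational awareness for humans and conventional systems. The core of the solution is MDSE (Multimodal Disaster Situation Explainer), a new vision-language framework that automatically generates detailed textual explanations of post-disaster underground scenes. Its key techniques are: (i) context-aware cross-attention, which aligns visual and textual features robustly even under severe degradation; (ii) segmentation-aware dual-pathway visual encoding, which fuses global and region-specific embeddings; and (iii) a resource-efficient Transformer-based language model that generates expressive descriptions with minimal compute. The method is trained and evaluated on the UMD dataset, the first image-caption corpus of real underground disaster scenes, and experiments show that it substantially outperforms state-of-the-art captioning models, improving situational awareness for post-disaster emergency response.

Link: https://arxiv.org/abs/2512.09092
Authors: Mizanur Rahman Jewel, Mohamed Elmahallawy, Sanjay Madria, Samuel Frimpong
Affiliations: Missouri University of Science and Technology; Washington State University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: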

Abstract:Underground mining disasters produce pervasive darkness, dust, and collapses that obscure vision and make situational awareness difficult for humans and conventional systems. To address this, we propose MDSE, Multimodal Disaster Situation Explainer, a novel vision-language framework that automatically generates detailed textual explanations of post-disaster underground scenes. MDSE has three-fold innovations: (i) Context-Aware Cross-Attention for robust alignment of visual and textual features even under severe degradation; (ii) Segmentation-aware dual pathway visual encoding that fuses global and region-specific embeddings; and (iii) Resource-Efficient Transformer-Based Language Model for expressive caption generation with minimal compute cost. To support this task, we present the Underground Mine Disaster (UMD) dataset, the first image-caption corpus of real underground disaster scenes, enabling rigorous training and evaluation. Extensive experiments on UMD and related benchmarks show that MDSE substantially outperforms state-of-the-art captioning models, producing more accurate and contextually relevant descriptions that capture crucial details in obscured environments, improving situational awareness for underground emergency response. The code is at this https URL.

[CV-103] AgentComp: From Agentic Reasoning to Compositional Mastery in Text-to-Image Models

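Quick read: This paper addresses the weakness of text-to-image generative models in compositionality, namely accurately capturing object relationships, attribute bindings, and fine-grained details in prompts. Models are usually not explicitly trained to distinguish prompts and images that are semantically close but compositionally different, so outputs come close to the description overall yet deviate in key details. The key to the solution is AgentComp, a framework that uses the reasoning and tool-use capabilities of large language models (LLMs) equipped with image generation, editing, and visual question answering (VQA) tools to autonomously construct compositional training datasets, and then fine-tunes text-to-image models with an agentic preference optimization method. This improves both discrimination of fine-grained compositional variations and generation performance without sacrificing image quality, and it generalizes to capabilities not explicitly trained for, such as text rendering.

Link: https://arxiv.org/abs/2512.09081
Authors: Arman Zarei, Jiacheng Pan, Matthew Gwilliam, Soheil Feizi, Zhenheng Yang
Affiliations: TikTok; University of Maryland
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: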

Abstract:Text-to-image generative models have achieved remarkable visual quality but still struggle with compositionality - accurately capturing object relationships, attribute bindings, and fine-grained details in prompts. A key limitation is that models are not explicitly trained to differentiate between compositionally similar prompts and images, resulting in outputs that are close to the intended description yet deviate in fine-grained details. To address this, we propose AgentComp, a framework that explicitly trains models to better differentiate such compositional variations and enhance their reasoning ability. AgentComp leverages the reasoning and tool-use capabilities of large language models equipped with image generation, editing, and VQA tools to autonomously construct compositional datasets. Using these datasets, we apply an agentic preference optimization method to fine-tune text-to-image models, enabling them to better distinguish between compositionally similar samples and resulting in overall stronger compositional generation ability. AgentComp achieves state-of-the-art results on compositionality benchmarks such as T2I-CompBench, without compromising image quality - a common drawback in prior approaches - and even generalizes to other capabilities not explicitly trained for, such as text rendering.

[CV-104] Adaptive Thresholding for Visual Place Recognition using Negative Gaussian Mixture Statistics

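Quick read: This paper addresses the difficulty of automatically setting the match threshold in visual place recognition (VPR): images of the same place vary markedly across conditions (seasonal change, illumination, structural change), so a manually chosen threshold rarely works across diverse visual scenarios. The key to the solution is to model each place with "negative" Gaussian mixture statistics, that is, image statistics indicating "not this place", and to derive a suitable match threshold from them automatically. The approach works well across a variety of image databases and descriptors.

Link: https://arxiv.org/abs/2512.09071
Authors: Nick Trinh, Damian Lyons
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted and presented at IEEE RoboticCC 2025. 4 pages short paper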

Abstract:Visual place recognition (VPR) is an important component technology for camera-based mapping and navigation applications. This is a challenging problem because images of the same place may appear quite different for reasons including seasonal changes, weather illumination, structural changes to the environment, as well as transient pedestrian or vehicle traffic. Papers focusing on generating image descriptors for VPR report their results using metrics such as recall@K and ROC curves. However, for a robot implementation, determining which matches are sufficiently good is often reduced to a manually set threshold. And it is difficult to manually select a threshold that will work for a variety of visual scenarios. This paper addresses the problem of automatically selecting a threshold for VPR by looking at the ‘negative’ Gaussian mixture statistics for a place - image statistics indicating not this place. We show that this approach can be used to select thresholds that work well for a variety of image databases and image descriptors.
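The abstract does not spell out how the threshold is derived from the negative statistics. One simple reading, sketched below with the standard library, fits a single Gaussian (a deliberate simplification of the mixture the paper refers to) to descriptor distances between a place and images known not to show it, and accepts a match only if its distance falls well below that distribution. The function names and the margin `k` are hypothetical.

```python
import statistics

def negative_threshold(negative_distances, k=3.0):
    """Threshold placed k standard deviations below the mean 'not this place' distance."""
    mu = statistics.fmean(negative_distances)
    sigma = statistics.stdev(negative_distances)
    return mu - k * sigma

def is_match(distance, threshold):
    """Accept a candidate only if it is closer than the negative statistics can explain."""
    return distance < threshold

# Made-up descriptor distances from one place to images of other places.
negatives = [0.9, 1.0, 1.1, 1.0, 0.95, 1.05]
threshold = negative_threshold(negatives)

accepted = is_match(0.40, threshold)  # far below the negative distribution
rejected = is_match(0.95, threshold)  # typical of a non-matching image
```

Because the threshold is computed per place from its own negatives, it adapts to descriptors and databases with different distance scales, which is the property the abstract emphasizes.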

[CV-105] KD-OCT: Efficient Knowledge Distillation for Clinical-Grade Retinal OCT Classification

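Quick read: This paper addresses the inefficient deployment of deep learning models for diagnosing ophthalmic conditions such as age-related macular degeneration (AMD) and choroidal neovascularization (CNV): high-accuracy models such as ConvNeXtV2-Large are too computationally expensive for real-time clinical use. The key to the solution is KD-OCT, a knowledge distillation framework that transfers knowledge from a teacher model trained with an enhanced recipe (advanced data augmentation, stochastic weight averaging, and focal loss) into a lightweight EfficientNet-B2 student using joint supervision from soft and hard labels. The student retains near-teacher classification performance with substantially fewer parameters and shorter inference time, enabling efficient and accurate edge deployment for AMD screening.

Link: https://arxiv.org/abs/2512.09069
Authors: Erfan Nourbakhsh, Nasrin Sanjari, Ali Nourbakhsh
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 7 pages, 5 figures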

Abstract:Age-related macular degeneration (AMD) and choroidal neovascularization (CNV)-related conditions are leading causes of vision loss worldwide, with optical coherence tomography (OCT) serving as a cornerstone for early detection and management. However, deploying state-of-the-art deep learning models like ConvNeXtV2-Large in clinical settings is hindered by their computational demands. Therefore, it is desirable to develop efficient models that maintain high diagnostic performance while enabling real-time deployment. In this study, a novel knowledge distillation framework, termed KD-OCT, is proposed to compress a high-performance ConvNeXtV2-Large teacher model, enhanced with advanced augmentations, stochastic weight averaging, and focal loss, into a lightweight EfficientNet-B2 student for classifying normal, drusen, and CNV cases. KD-OCT employs real-time distillation with a combined loss balancing soft teacher knowledge transfer and hard ground-truth supervision. The effectiveness of the proposed method is evaluated on the Noor Eye Hospital (NEH) dataset using patient-level cross-validation. Experimental results demonstrate that KD-OCT outperforms comparable multi-scale or feature-fusion OCT classifiers in efficiency-accuracy balance, achieving near-teacher performance with substantial reductions in model size and inference time. Despite the compression, the student model exceeds most existing frameworks, facilitating edge deployment for AMD screening. Code is available at this https URL.
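A "combined loss balancing soft teacher knowledge transfer and hard ground-truth supervision" is commonly written as a weighted sum of a temperature-softened KL term and a cross-entropy term. The sketch below shows that standard form in plain Python; it is not taken from the KD-OCT code, and the weight `alpha`, the temperature, and the three-class logits are assumed values for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label, alpha=0.5, temperature=4.0):
    """alpha * T^2 * KL(teacher || student) on softened outputs + (1 - alpha) * cross-entropy."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student))
    soft_term = (temperature ** 2) * kl  # T^2 keeps gradient scale comparable across temperatures
    hard_term = -math.log(softmax(student_logits)[label])
    return alpha * soft_term + (1.0 - alpha) * hard_term

# Hypothetical logits for the classes (normal, drusen, CNV); the true class is index 0.
teacher = [2.0, 0.0, -1.0]
agreeing_student = [2.0, 0.0, -1.0]
confused_student = [0.0, 2.0, -1.0]

loss_agree = distillation_loss(agreeing_student, teacher, label=0)
loss_confused = distillation_loss(confused_student, teacher, label=0)
```

When the student reproduces the teacher's logits, the soft term vanishes and only the weighted cross-entropy remains; a student that prefers the wrong class is penalized by both terms.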

[CV-106] SIP: Site in Pieces - A Dataset of Disaggregated Construction-Phase 3D Scans for Semantic Segmentation and Scene Understanding

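Quick read: This paper addresses the mismatch between current 3D scene understanding datasets and real LiDAR acquisition conditions on construction sites. Public datasets are mostly built from densely fused scans with uniform sampling and complete visibility, whereas real sites, constrained by safety, access, and ongoing operations, are often captured as isolated single-station LiDAR views whose radial density decay, fragmented geometry, and view-dependent visibility are not adequately represented. The key to the solution is the SIP (Site in Pieces) dataset: LiDAR point clouds captured in the field under realistic site conditions and annotated at the point level with a taxonomy tailored to construction environments (A. Built Environment, B. Construction Operations, C. Site Surroundings), covering structural components and slender temporary objects (such as scaffolding, MEP piping, and scissor lifts), together with a standardized scanning protocol, annotation workflow, and quality control procedures. It provides benchmark data with realistic sensing characteristics for robust construction-oriented 3D vision tasks.

Link: https://arxiv.org/abs/2512.09062
Authors: Seongyong Kim, Yong Kwon Cho
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: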

Abstract:Accurate 3D scene interpretation in active construction sites is essential for progress monitoring, safety assessment, and digital twin development. LiDAR is widely used in construction because it offers advantages over camera-based systems, performing reliably in cluttered and dynamically changing conditions. Yet most public datasets for 3D perception are derived from densely fused scans with uniform sampling and complete visibility, conditions that do not reflect real construction sites. Field data are often collected as isolated single-station LiDAR views, constrained by safety requirements, limited access, and ongoing operations. These factors lead to radial density decay, fragmented geometry, and view-dependent visibility, characteristics that remain underrepresented in existing datasets. This paper presents SIP, Site in Pieces, a dataset created to reflect the practical constraints of LiDAR acquisition during construction. SIP provides indoor and outdoor scenes captured with a terrestrial LiDAR scanner and annotated at the point level using a taxonomy tailored to construction environments: A. Built Environment, B. Construction Operations, and C. Site Surroundings. The dataset includes both structural components and slender temporary objects such as scaffolding, MEP piping, and scissor lifts, where sparsity caused by occlusion and fragmented geometry makes segmentation particularly challenging. The scanning protocol, annotation workflow, and quality control procedures establish a consistent foundation for the dataset. SIP is openly available with a supporting Git repository, offering adaptable class configurations that streamline adoption within modern 3D deep learning frameworks. By providing field data that retain real-world sensing characteristics, SIP enables robust benchmarking and contributes to advancing construction-oriented 3D vision tasks.

[CV-107] ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors

Quick read: This paper addresses the dependence of conventional object pose estimation methods on extensive dataset-specific training, which limits their generalization and range of application. The key to the solution is ConceptPose, a framework that requires no object- or dataset-specific training: it uses a vision-language model (VLM) to build open-vocabulary 3D concept maps in which each point is tagged with a concept vector derived from saliency maps, and it establishes robust 3D-3D correspondences across these maps to estimate the 6DoF relative pose precisely.

Link: https://arxiv.org/abs/2512.09056
Authors: Liming Kuang, Yordanka Velikova, Mahdi Saleh, Jan-Nico Zaech, Danda Pani Paudel, Benjamin Busam
Affiliations: Technical University of Munich; Munich Center for Machine Learning; INSAIT, Sofia University “St. Kliment Ohridski”; 3dwe.ai
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Object pose estimation is a fundamental task in computer vision and robotics, yet most methods require extensive, dataset-specific training. Concurrently, large-scale vision language models show remarkable zero-shot capabilities. In this work, we bridge these two worlds by introducing ConceptPose, a framework for object pose estimation that is both training-free and model-free. ConceptPose leverages a vision-language-model (VLM) to create open-vocabulary 3D concept maps, where each point is tagged with a concept vector derived from saliency maps. By establishing robust 3D-3D correspondences across concept maps, our approach allows precise estimation of 6DoF relative pose. Without any object or dataset-specific training, our approach achieves state-of-the-art results on common zero shot relative pose estimation benchmarks, significantly outperforming existing methods by over 62% in ADD(-S) score, including those that utilize extensive dataset-specific training.
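Once 3D-3D correspondences are available, a relative 6DoF pose can be recovered in closed form. The sketch below uses the standard Kabsch (orthogonal Procrustes) solution as a stand-in for that final step; the abstract does not say which solver ConceptPose uses, and the synthetic points, rotation, and translation are made up for the example.

```python
import numpy as np

def kabsch(source, target):
    """Least-squares rigid transform (R, t) such that R @ source_i + t ~= target_i."""
    source_centroid = source.mean(axis=0)
    target_centroid = target.mean(axis=0)
    # 3x3 cross-covariance of the centred point sets.
    H = (source - source_centroid).T @ (target - target_centroid)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so the result is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = target_centroid - R @ source_centroid
    return R, t

# Synthetic correspondences: rotate 30 degrees about z and translate.
rng = np.random.default_rng(0)
source = rng.normal(size=(10, 3))
angle = np.deg2rad(30.0)
R_true = np.array([
    [np.cos(angle), -np.sin(angle), 0.0],
    [np.sin(angle), np.cos(angle), 0.0],
    [0.0, 0.0, 1.0],
])
t_true = np.array([0.5, -0.2, 1.0])
target = source @ R_true.T + t_true

R_est, t_est = kabsch(source, target)
```

With noisy or partially wrong correspondences, a robust wrapper such as RANSAC around this solver is the usual choice; that part is omitted here.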

[CV-108] Learning to Remove Lens Flare in Event Camera

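Quick read: This paper addresses the spatio-temporal distortion that lens flare introduces into event camera data, an optical artifact that severely degrades vision system performance. The key to the solution is E-Deflare, the first systematic framework for the problem. The authors first derive a physics-grounded forward model of the non-linear suppression mechanism as the theoretical foundation, then design E-DeflareNet and train and validate it on a self-built large-scale simulated training set, E-Flare-2.7K, and the first paired real-world test set, E-Flare-R. The method removes lens flare from event streams effectively and improves downstream task performance.

Link: https://arxiv.org/abs/2512.09016
Authors: Haiqian Han, Lingdong Kong, Jianing Li, Ao Liang, Chengtao Zhu, Jiacheng Lyu, Lai Xing Ng, Xiangyang Ji, Wei Tsang Ooi, Benoit R. Cottereau
Affiliations: CNRS@CREATE; National University of Singapore; Tsinghua University; Institute for Infocomm Research, A*STAR; IPAL, CNRS IRL 2955, Singapore; CerCo, CNRS UMR 5549, Université Toulouse III
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint; 29 pages, 14 figures, 4 tables; Project Page at this https URL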

Abstract:Event cameras have the potential to revolutionize vision systems with their high temporal resolution and dynamic range, yet they remain susceptible to lens flare, a fundamental optical artifact that causes severe degradation. In event streams, this optical artifact forms a complex, spatio-temporal distortion that has been largely overlooked. We present E-Deflare, the first systematic framework for removing lens flare from event camera data. We first establish the theoretical foundation by deriving a physics-grounded forward model of the non-linear suppression mechanism. This insight enables the creation of the E-Deflare Benchmark, a comprehensive resource featuring a large-scale simulated training set, E-Flare-2.7K, and the first-ever paired real-world test set, E-Flare-R, captured by our novel optical system. Empowered by this benchmark, we design E-DeflareNet, which achieves state-of-the-art restoration performance. Extensive experiments validate our approach and demonstrate clear benefits for downstream tasks. Code and datasets are publicly available.

[CV-109] An Approach for Detection of Entities in Dynamic Media Contents

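Quick read: This paper addresses efficient detection of a specific target individual in video sequences, particularly precise localization against complex backgrounds with many objects. The key to the solution is to use deep learning (specifically artificial neural networks) to build structured supervised learning algorithms that identify and track the target from simple characteristics. According to the authors, the results show higher efficiency and accuracy than existing techniques. The approach is suited to combining image databases (missing persons, criminals, and so on) with public-security surveillance video, and it shows practical potential for strengthening a national security system.

Link: https://arxiv.org/abs/2512.09011
Authors: Nzakiese Mbongo, Ngombo Armando
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 8 figures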

Abstract:The notion of learning underlies almost every evolution of Intelligent Agents. In this paper, we present an approach for searching and detecting a given entity in a video sequence. Specifically, we study how deep learning with artificial neural networks allows us to detect a character in a video sequence. Detecting a character in a video is a complex field of study, considering the multitude of objects present in the data under analysis. From the results obtained, we highlight the following, compared to the state of the art: in our approach, within the field of Computer Vision, the structuring of supervised learning algorithms allowed us to achieve several successes from simple characteristics of the target character. Our results demonstrate that this new approach allows us to locate, in an efficient way, wanted individuals from a private or public image base. For the case of Angola, the classifier we propose opens the possibility of reinforcing the national security system based on the database of target individuals (disappeared, criminals, etc.) and the video sequences of the Integrated Public Security Centre (CISP).

[CV-110] Towards Lossless Ultimate Vision Token Compression for VLMs

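Quick read: This paper addresses the computational efficiency and latency challenges of vision-language models (VLMs), which stem from the large number of redundant token representations in high-resolution images and videos. Existing attention- or similarity-based compression algorithms suffer from position bias or class imbalance, causing significant accuracy loss, and they do not generalize to shallow large language model (LLM) layers, which exhibit weaker cross-modal interaction. The key to the solution is the Lossless Ultimate Vision tokens Compression (LUVC) framework. First, it extends token compression to the visual encoder through an iterative merging scheme that is orthogonal in spatial axes, accelerating the whole VLM. Second, it adds a spectrum pruning unit to the LLM, an attention-free low-pass filter that gradually removes redundant visual tokens and is fully compatible with modern FlashAttention. Visual tokens are compressed layer by layer until they are eliminated at the final LLM layer, so that high-dimensional visual features are gradually fused into the multimodal queries. The result is a 2× inference speedup with negligible accuracy loss, and the training-free design can be deployed directly on multiple VLM architectures.

Link: https://arxiv.org/abs/2512.09010
Authors: Dehua Zheng, Mouxiao Huang, Borui Jiang, Hailin Hu, Xinghao Chen
Affiliations: Huawei Noah’s Ark Lab
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: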

Abstract:Visual language models encounter challenges in computational efficiency and latency, primarily due to the substantial redundancy in the token representations of high-resolution images and videos. Current attention/similarity-based compression algorithms suffer from either position bias or class imbalance, leading to significant accuracy degradation. They also fail to generalize to shallow LLM layers, which exhibit weaker cross-modal interactions. To address this, we extend token compression to the visual encoder through an effective iterative merging scheme that is orthogonal in spatial axes to accelerate the computation across the entire VLM. Furthermore, we integrate a spectrum pruning unit into the LLM through an attention/similarity-free low-pass filter, which gradually prunes redundant visual tokens and is fully compatible with modern FlashAttention. On this basis, we propose the Lossless Ultimate Vision tokens Compression (LUVC) framework. LUVC systematically compresses visual tokens until complete elimination at the final layer of the LLM, so that the high-dimensional visual features are gradually fused into the multimodal queries. The experiments show that LUVC achieves a 2× inference speedup in the language model with negligible accuracy degradation, and the training-free characteristic enables immediate deployment across multiple VLMs.

[CV-111] A Survey of Body and Face Motion: Datasets, Performance Evaluation Metrics and Generative Techniques

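Quick read: This paper addresses a core challenge in generative modeling of face and body motion: generating expressive and coherent dynamic behaviour in dyadic interaction. The crux is that the multimodal coupling between complex non-verbal cues (such as facial expressions and body language), verbal content, and individual personality traits is hard to model. The survey systematically reviews mainstream representations, generative model architectures, datasets, and evaluation metrics, and it proposes future research directions for improving the realism, coherence, and personalized expressiveness of avatars in social interaction.

Link: https://arxiv.org/abs/2512.09005
Authors: Lownish Rai Sookha, Nikhil Pakhale, Mudasir Ganaie, Abhinav Dhall
Affiliations: Indian Institute of Technology Ropar; Monash University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: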

Abstract:Body and face motion play an integral role in communication. They convey crucial information on the participants. Advances in generative modeling and multi-modal learning have enabled motion generation from signals such as speech, conversational context and visual cues. However, generating expressive and coherent face and body dynamics remains challenging due to the complex interplay of verbal/non-verbal cues and individual personality traits. This survey reviews body and face motion generation, covering core concepts, representation techniques, generative approaches, datasets and evaluation metrics. We highlight future directions to enhance the realism, coherence and expressiveness of avatars in dyadic settings. To the best of our knowledge, this work is the first comprehensive review to cover both body and face motion. Detailed resources are listed on this https URL.

[CV-112] A Physics-Constrained, Design-Driven Methodology for Defect Dataset Generation in Optical Lithography

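Quick read: This paper addresses the limited performance of artificial intelligence (AI) models for defect inspection in semiconductor micro/nano manufacturing, which is caused by the scarcity of high-quality, physically grounded training data. The core difficulty is that lithography defect data is hard to obtain from the semiconductor industry and publicly available annotated datasets are lacking. The key to the solution is a generation methodology that synthesizes controllable defect layouts ab initio from the original design layout using physics-constrained mathematical morphology operations (erosion and dilation), and fabricates them into physical samples with high-fidelity digital micromirror device (DMD) lithography. Optical micrographs of defective and defect-free samples are then compared to produce pixel-accurate defect contour annotations. The resulting dataset contains 3,530 optical micrographs and 13,365 annotated defect instances in four classes (bridge, burr, pinch, and contamination). Mask R-CNN trained on it achieves a mean AP@0.5 about 34% higher than Faster R-CNN on the bridge, burr, and pinch classes, and about 42% higher on contamination, supporting the feasibility of the method for AI-based Measurement/Inspection (MI) in semiconductor fabrication.

Link: https://arxiv.org/abs/2512.09001
Authors: Yuehua Hu, Jiyeong Kong, Dong-yeol Shin, Jaekyun Kim, Kyung-Tae Kang
Affiliations: Korea Institute of Industrial Technology (KITECH); Hanyang University; Korea University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: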

Abstract:The efficacy of Artificial Intelligence (AI) in micro/nano manufacturing is fundamentally constrained by the scarcity of high-quality and physically grounded training data for defect inspection. Lithography defect data from the semiconductor industry are rarely accessible for research use, resulting in a shortage of publicly available datasets. To address this bottleneck in lithography, this study proposes a novel methodology for generating large-scale, physically valid defect datasets with pixel-level annotations. The framework begins with the ab initio synthesis of defect layouts using controllable, physics-constrained mathematical morphology operations (erosion and dilation) applied to the original design-level layout. These synthesized layouts, together with their defect-free counterparts, are fabricated into physical samples via high-fidelity digital micromirror device (DMD)-based lithography. Optical micrographs of the synthesized defect samples and their defect-free references are then compared to create consistent defect delineation annotations. Using this methodology, we constructed a comprehensive dataset of 3,530 optical micrographs containing 13,365 annotated defect instances including four classes: bridge, burr, pinch, and contamination. Each defect instance is annotated with a pixel-accurate segmentation mask, preserving full contour and geometry. The segmentation-based Mask R-CNN achieves AP@0.5 of 0.980, 0.965, and 0.971, compared with 0.740, 0.719, and 0.717 for Faster R-CNN on bridge, burr, and pinch classes, representing a mean AP@0.5 improvement of approximately 34%. For the contamination class, Mask R-CNN achieves an AP@0.5 roughly 42% higher than Faster R-CNN. These consistent gains demonstrate that our proposed methodology to generate defect datasets with pixel-level annotations is feasible for robust AI-based Measurement/Inspection (MI) in semiconductor fabrication.
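The layout-level step described above (local erosion or dilation of a design layout, then a comparison against the defect-free reference to obtain a mask) can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the authors' pipeline: the two-line toy layout, the 3x3 structuring element, and the helper names are hypothetical, and the mask here comes from comparing layouts directly, whereas the paper compares optical micrographs of fabricated samples.

```python
import numpy as np

def _morph(mask, combine, pad_value):
    """Apply a 3x3 morphological operation by combining the nine shifted copies of the mask."""
    padded = np.pad(mask, 1, constant_values=pad_value)
    h, w = mask.shape
    out = padded[1:1 + h, 1:1 + w].copy()
    for dy in range(3):
        for dx in range(3):
            out = combine(out, padded[dy:dy + h, dx:dx + w])
    return out

def dilate(mask):
    # A pixel becomes set if any pixel in its 3x3 neighbourhood is set.
    return _morph(mask, np.logical_or, False)

def erode(mask):
    # A pixel stays set only if its whole 3x3 neighbourhood is set.
    return _morph(mask, np.logical_and, True)

def inject_defect(layout, operation, rows):
    """Apply the operation only within the given rows and return (defect layout, defect mask)."""
    defect = layout.copy()
    defect[rows, :] = operation(layout)[rows, :]
    return defect, defect ^ layout

# Toy design layout: two vertical lines, each 3 pixels wide, separated by a 2-pixel gap.
layout = np.zeros((12, 12), dtype=bool)
layout[:, 2:5] = True
layout[:, 7:10] = True

bridge, bridge_mask = inject_defect(layout, dilate, slice(5, 8))  # lines merge locally
pinch, pinch_mask = inject_defect(layout, erode, slice(5, 8))     # lines thin locally
```

The XOR between the defective and reference layouts plays the role of the pixel-level annotation: it marks exactly the pixels that the morphological operation added or removed.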

[CV-113] Diffusion Model Regularized Implicit Neural Representation for CT Metal Artifact Reduction

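Quick read: This paper addresses metal artifacts that severely degrade image quality in medical computed tomography (CT). It targets two limitations of prior work: supervised methods are unstable because they depend on limited paired metal/artifact-free data, and existing unsupervised methods neither incorporate CT physical geometry constraints effectively nor fully exploit prior knowledge. The key to the solution is an implicit neural representation (INR) framework regularized by a diffusion model: the INR embeds physical constraints to enforce data fidelity, while a pre-trained diffusion model supplies a strong prior that constrains the solution space, giving more stable metal artifact reduction that generalizes better.

Link: https://arxiv.org/abs/2512.08999
Authors: Jie Wen, Chenhe Du, Xiao Wang, Yuyao Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: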


Abstract:Computed tomography (CT) images are often severely corrupted by artifacts in the presence of metals. Existing supervised metal artifact reduction (MAR) approaches suffer from performance instability on unseen data due to their reliance on limited paired metal-clean data, which limits their clinical applicability. Moreover, existing unsupervised methods face two main challenges: 1) the CT physical geometry is not effectively incorporated into the MAR process to ensure data fidelity; 2) traditional heuristic regularization terms cannot fully capture the abundant prior knowledge available. To overcome these shortcomings, we propose a diffusion-model-regularized implicit neural representation framework for MAR. The implicit neural representation integrates physical constraints and imposes data fidelity, while the pre-trained diffusion model provides prior knowledge to regularize the solution. Experimental results on both simulated and clinical data demonstrate the effectiveness and generalization ability of our method, highlighting its potential to be applied to clinical settings.

[CV-114] Demo: Generative AI helps Radiotherapy Planning with User Preference ALT NEURIPS2025

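Quick read: This paper addresses bias in dose prediction models for radiotherapy planning that arises from differences across institutions and individual planners. Existing deep learning methods usually use reference plans as training labels, which tends to bias the model toward a specific institution's planning style or preferences and limits generality and personalization. The key to the solution is a new generative model that predicts 3D dose distributions from user-defined preference flavors alone. Customizable trade-off parameters between organs-at-risk (OARs) and planning target volumes (PTVs) let it respond flexibly to different clinical needs, and it is designed to integrate seamlessly with clinical treatment planning systems (TPS), which the authors report improves plan adaptability and quality.

Link: https://arxiv.org/abs/2512.08996
Authors: Riqiang Gao, Simon Arberet, Martin Kraus, Han Liu, Wilko FAR Verbakel, Dorin Comaniciu, Florin-Cristian Ghesu, Ali Kamen
Affiliations: Siemens Healthineers
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Best paper in GenAI4Health at NeurIPS 2025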


Abstract:Radiotherapy planning is a highly complex process that often varies significantly across institutions and individual planners. Most existing deep learning approaches for 3D dose prediction rely on reference plans as ground truth during training, which can inadvertently bias models toward specific planning styles or institutional preferences. In this study, we introduce a novel generative model that predicts 3D dose distributions based solely on user-defined preference flavors. These customizable preferences enable planners to prioritize specific trade-offs between organs-at-risk (OARs) and planning target volumes (PTVs), offering greater flexibility and personalization. Designed for seamless integration with clinical treatment planning systems, our approach assists users in generating high-quality plans efficiently. Comparative evaluations demonstrate that our method can surpass the Varian RapidPlan model in both adaptability and plan quality in some scenarios.

[CV-115] Deterministic World Models for Verification of Closed-loop Vision-based Systems

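Quick read: This paper addresses a core difficulty in verifying closed-loop vision-based control systems: high-dimensional image inputs and hard-to-model visual environments limit verification precision. Existing approaches that use generative models with stochastic latent variables as camera surrogates introduce unnecessary overapproximation error. The key to the solution is a Deterministic World Model (DWM) that maps system states directly to generated images, eliminating uninterpretable latent variables so that input bounds stay precise. The model is trained with a dual-objective loss that jointly optimizes pixel-level reconstruction accuracy and a control difference loss, preserving behavioral consistency with the real system. Combined with Star-based reachability analysis (StarV) and conformal prediction, it yields rigorous statistical bounds on trajectory deviation, with significantly tighter reachable sets and better verification performance.

Link: https://arxiv.org/abs/2512.08991
Authors: Yuang Geng, Zhuoyang Zhou, Zhongzheng Zhang, Siyuan Pan, Hoang-Dung Tran, Ivan Ruchkin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 22 pages, 10 figures. Submitted to FM 2026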


Abstract:Verifying closed-loop vision-based control systems remains a fundamental challenge due to the high dimensionality of images and the difficulty of modeling visual environments. While generative models are increasingly used as camera surrogates in verification, their reliance on stochastic latent variables introduces unnecessary overapproximation error. To address this bottleneck, we propose a Deterministic World Model (DWM) that maps system states directly to generative images, effectively eliminating uninterpretable latent variables to ensure precise input bounds. The DWM is trained with a dual-objective loss function that combines pixel-level reconstruction accuracy with a control difference loss to maintain behavioral consistency with the real system. We integrate DWM into a verification pipeline utilizing Star-based reachability analysis (StarV) and employ conformal prediction to derive rigorous statistical bounds on the trajectory deviation between the world model and the actual vision-based system. Experiments on standard benchmarks show that our approach yields significantly tighter reachable sets and better verification performance than a latent-variable baseline.
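
The conformal prediction step mentioned above can be illustrated with a generic split-conformal quantile. This is a minimal sketch and not the paper's pipeline: the function name `conformal_bound`, the calibration deviations, and the choice of `alpha` are all assumptions made for illustration.

```python
import math

def conformal_bound(calibration_scores, alpha):
    """Split-conformal upper bound: a new exchangeable score falls below the
    returned value with probability at least 1 - alpha."""
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    if rank > n:
        return math.inf  # too few calibration samples for this alpha
    return sorted(calibration_scores)[rank - 1]

# Hypothetical per-trajectory deviations between a surrogate model and the
# real system, measured on held-out calibration runs (invented numbers).
deviations = [0.12, 0.08, 0.15, 0.10, 0.09, 0.20, 0.11, 0.13, 0.07, 0.14,
              0.16, 0.10, 0.18, 0.09, 0.12, 0.11, 0.17, 0.13, 0.08, 0.19]
bound = conformal_bound(deviations, alpha=0.1)
print(f"deviation bound at 90% coverage: {bound}")
```

The guarantee is statistical and relies on the calibration and test trajectories being exchangeable; with very few calibration runs the function returns infinity rather than an unsupported bound.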

[CV-116] Enhancing Knowledge Transfer in Hyperspectral Image Classification via Cross-scene Knowledge Integration

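Quick read: This paper addresses two core challenges of cross-scene knowledge transfer in hyperspectral image (HSI) classification: spectral variations caused by different sensors, and semantic inconsistencies across heterogeneous scenes. Existing methods assume homogeneous domains or support only heterogeneous scenes with co-occurring categories; they cannot handle label spaces that do not overlap and they ignore target-private information. The key to the solution is the Cross-scene Knowledge Integration (CKI) framework, which explicitly incorporates target-private knowledge during transfer through three modules: (1) Alignment of Spectral Characteristics (ASC), which reduces spectral discrepancies through a domain-agnostic projection; (2) Cross-scene Knowledge Sharing Preference (CKSP), which mitigates semantic mismatch via a Source Similarity Mechanism (SSM); and (3) Complementary Information Integration (CII), which maximizes the use of target-specific complementary cues. The framework enables effective knowledge transfer in fully heterogeneous settings and achieves the best performance on a range of cross-scene HSI classification tasks.

Link: https://arxiv.org/abs/2512.08989
Authors: Lu Huo, Wenjian Huang, Jianguo Zhang, Min Xu, Haimin Zhang
Affiliations: University of Technology Sydney; Southern University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: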


Abstract:Knowledge transfer has strong potential to improve hyperspectral image (HSI) classification, yet two inherent challenges fundamentally restrict effective cross-domain transfer: spectral variations caused by different sensors and semantic inconsistencies across heterogeneous scenes. Existing methods are limited by transfer settings that assume homogeneous domains or heterogeneous scenarios with only co-occurring categories. When label spaces do not overlap, they further rely on complete source-domain coverage and therefore overlook critical target-private information. To overcome these limitations and enable knowledge transfer in fully heterogeneous settings, we propose Cross-scene Knowledge Integration (CKI), a framework that explicitly incorporates target-private knowledge during transfer. CKI includes: (1) Alignment of Spectral Characteristics (ASC) to reduce spectral discrepancies through domain-agnostic projection; (2) Cross-scene Knowledge Sharing Preference (CKSP), which resolves semantic mismatch via a Source Similarity Mechanism (SSM); and (3) Complementary Information Integration (CII) to maximize the use of target-specific complementary cues. Extensive experiments verify that CKI achieves state-of-the-art performance with strong stability across diverse cross-scene HSI scenarios.

[CV-117] 3DID: Direct 3D Inverse Design for Aerodynamics with Physics-Aware Optimization NEURIPS2025

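Quick read: This paper addresses the computational complexity of 3D inverse design, where the design space is very high-dimensional. Conventional methods are restricted to 2D projections or to fine-tuning existing 3D shapes, and struggle to generate high-quality 3D structures from scratch. The key to the solution is an end-to-end 3D Inverse Design (3DID) framework that couples a continuous latent representation with a physics-aware optimization strategy. It first learns a unified physics-geometry embedding that compresses shape and physical field data into a continuous latent space. It then applies a two-stage optimization: in the first stage a gradient-guided diffusion sampler explores the global latent manifold, and in the second stage an objective-driven, topology-preserving refinement moves each candidate closer to the optimum. This produces high-fidelity 3D geometries that outperform existing methods in both solution quality and design versatility.

Link: https://arxiv.org/abs/2512.08987
Authors: Yuze Hao, Linchao Zhu, Yi Yang
Affiliations: Zhejiang University; The State Key Lab of Brain-Machine Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at NeurIPS 2025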


Abstract:Inverse design aims to design the input variables of a physical system to optimize a specified objective function, typically formulated as a search or optimization problem. However, in 3D domains, the design space grows exponentially, rendering exhaustive grid-based searches infeasible. Recent advances in deep learning have accelerated inverse design by providing powerful generative priors and differentiable surrogate models. Nevertheless, current methods tend to approximate the 3D design space using 2D projections or fine-tune existing 3D shapes. These approaches sacrifice volumetric detail and constrain design exploration, preventing true 3D design from scratch. In this paper, we propose a 3D Inverse Design (3DID) framework that directly navigates the 3D design space by coupling a continuous latent representation with a physics-aware optimization strategy. We first learn a unified physics-geometry embedding that compactly captures shape and physical field data in a continuous latent space. Then, we introduce a two-stage strategy to perform physics-aware optimization. In the first stage, a gradient-guided diffusion sampler explores the global latent manifold. In the second stage, an objective-driven, topology-preserving refinement further sculpts each candidate toward the target objective. This enables 3DID to generate high-fidelity 3D geometries, outperforming existing methods in both solution quality and design versatility.

[CV-118] Explainable Fundus Image Curation and Lesion Detection in Diabetic Retinopathy

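Quick read: This paper addresses the unreliable training and evaluation of AI models for early diagnosis of Diabetic Retinopathy (DR) caused by poor image quality and inconsistent manual annotation. The core of the solution is a quality-control framework with three key steps. First, an explainable feature-based classifier filters out low-quality images, using features extracted with both conventional image processing and contrastive learning. Second, the retained images are enhanced and annotated with deep-learning-based assistance. Finally, an inter-annotator agreement measure determines whether the annotations are usable, so that only high-quality data is used for AI training and evaluation.

Link: https://arxiv.org/abs/2512.08986
Authors: Anca Mihai, Adrian Groza
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: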


Abstract:Diabetic Retinopathy (DR) affects individuals with long-term diabetes. Without early diagnosis, DR can lead to vision loss. Fundus photography captures the structure of the retina along with abnormalities indicative of the stage of the disease. Artificial Intelligence (AI) can support clinicians in identifying these lesions, reducing manual workload, but models require high-quality annotated datasets. Due to the complexity of retinal structures, errors can occur both in image acquisition and in the lesion interpretation of manual annotators. We propose a quality-control framework that ensures only high-standard data is used for evaluation and AI training. First, an explainable feature-based classifier is used to filter out inadequate images. The features are extracted using both image processing and contrastive learning. Then, the images are enhanced and submitted for annotation with deep-learning-based assistance. Lastly, the agreement between annotators, calculated using derived formulas, determines the usability of the annotations.
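
The abstract does not state which agreement formulas the authors derive. As a stand-in, the sketch below uses Cohen's kappa, a standard chance-corrected agreement score for two annotators; the lesion labels and the `USABLE_THRESHOLD` cut-off are hypothetical and are not taken from the paper.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance from each annotator's label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical lesion labels from two annotators for eight image regions.
annotator_1 = ["MA", "MA", "HE", "EX", "HE", "MA", "EX", "HE"]
annotator_2 = ["MA", "HE", "HE", "EX", "HE", "MA", "EX", "MA"]
kappa = cohens_kappa(annotator_1, annotator_2)

USABLE_THRESHOLD = 0.6  # assumed cut-off, not taken from the paper
print(f"kappa = {kappa:.3f}, usable = {kappa >= USABLE_THRESHOLD}")
```

For pixel-level lesion masks, an overlap score such as Dice between annotators would be a natural alternative to a label-level statistic.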

[CV-119] An Efficient Test-Time Scaling Approach for Image Generation

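Quick read: This paper addresses the inefficient allocation of test-time compute in generative AI image generation models, that is, how to use test-time compute more efficiently to improve generation quality or efficiency. Existing methods allocate the compute budget non-uniformly across denoising steps, but they rely on greedy algorithms and use the budget inefficiently. The key solution is the Verifier-Threshold method, which automatically reallocates test-time compute and, at equal performance on the GenEval benchmark, reduces computation time by 2-4x relative to the current state-of-the-art method.

Link: https://arxiv.org/abs/2512.08985
Authors: Vignesh Sundaresha, Akash Haridas, Vikram Appia, Lav Varshney
Affiliations: UIUC; AMD; Stony Brook University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages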


Abstract:Image generation has emerged as a mainstream application of large generative AI models. Just as test-time compute and reasoning have helped language models improve their capabilities, similar benefits have also been observed with image generation models. In particular, searching over noise samples for diffusion and flow models has been shown to scale well with test-time compute. While recent works have explored allocating non-uniform inference-compute budgets across different denoising steps, they rely on greedy algorithms and allocate the compute budget ineffectively. In this work, we study this problem and propose solutions to fix it. We propose the Verifier-Threshold method, which automatically reallocates test-time compute and delivers substantial efficiency improvements. For the same performance on the GenEval benchmark, we achieve a 2-4x reduction in computational time over the state-of-the-art method.

[CV-120] RAG-HAR: Retrieval Augmented Generation-based Human Activity Recognition

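Quick read: This paper addresses the dependence of existing deep learning methods for Human Activity Recognition (HAR) on dataset-specific training, large amounts of labeled data, and heavy computation. The key to the solution is RAG-HAR, a training-free retrieval-augmented framework that uses large language models (LLMs): it extracts lightweight statistical features, retrieves semantically similar samples from a vector database, and uses that contextual evidence to identify the activity. Prompt optimization and an LLM-generated activity descriptor, used to build a context-enriched vector database, further improve accuracy and relevance. RAG-HAR reaches state-of-the-art performance on six different HAR benchmarks without any model fine-tuning, showing strong robustness and practical potential.

Link: https://arxiv.org/abs/2512.08984
Authors: Nirhoshan Sivaroopan, Hansi Karunarathna, Chamara Madarasingha, Anura Jayasumana, Kanchana Thilakarathna
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: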


Abstract:Human Activity Recognition (HAR) underpins applications in healthcare, rehabilitation, fitness tracking, and smart environments, yet existing deep learning approaches demand dataset-specific training, large labeled corpora, and significant computational resources. We introduce RAG-HAR, a training-free retrieval-augmented framework that leverages large language models (LLMs) for HAR. RAG-HAR computes lightweight statistical descriptors, retrieves semantically similar samples from a vector database, and uses this contextual evidence to perform LLM-based activity identification. We further enhance RAG-HAR by first applying prompt optimization and introducing an LLM-based activity descriptor that generates context-enriched vector databases for delivering accurate and highly relevant contextual information. Along with these mechanisms, RAG-HAR achieves state-of-the-art performance across six diverse HAR benchmarks. Most importantly, RAG-HAR attains these improvements without requiring model training or fine-tuning, emphasizing its robustness and practical applicability. RAG-HAR moves beyond known behaviors, enabling the recognition and meaningful labelling of multiple unseen human activities.
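
The descriptor, retrieval, and prompting steps described above can be sketched as follows. This is an illustrative toy and not the authors' implementation: the helper names (`describe`, `retrieve`, `build_prompt`), the in-memory list standing in for a vector database, the cosine ranking over raw descriptors, and the sensor windows are all assumptions, and the final LLM call is left out.

```python
import math
import statistics

def describe(window):
    """Lightweight statistical descriptor of a 1-D sensor window."""
    return [statistics.mean(window), statistics.pstdev(window),
            min(window), max(window)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(query_vec, database, k=2):
    """Return the k labelled entries most similar to the query descriptor."""
    ranked = sorted(database, key=lambda e: cosine(query_vec, e["vector"]),
                    reverse=True)
    return ranked[:k]

def build_prompt(query_vec, neighbours):
    """Assemble retrieved examples and the query into a text prompt."""
    lines = [f"- features {n['vector']} -> activity: {n['label']}"
             for n in neighbours]
    return ("Labelled examples retrieved for context:\n" + "\n".join(lines)
            + f"\nQuery features: {query_vec}\nWhich activity is this?")

# Hypothetical accelerometer-magnitude windows standing in for a vector database.
database = [
    {"label": "walking", "vector": describe([0.9, 1.3, 0.8, 1.4, 0.9, 1.2])},
    {"label": "sitting", "vector": describe([1.0, 1.0, 1.01, 0.99, 1.0, 1.0])},
    {"label": "running", "vector": describe([0.2, 2.5, 0.3, 2.7, 0.1, 2.6])},
]
query = describe([0.3, 2.4, 0.2, 2.6, 0.2, 2.5])
neighbours = retrieve(query, database, k=2)
prompt = build_prompt(query, neighbours)  # would be sent to an LLM of choice
print(prompt)
```

Because the labelled examples live in the database rather than in model weights, adding a new activity in this setup means adding entries, not retraining.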

[CV-121] HSCP: A Two-Stage Spectral Clustering Framework for Resource-Constrained UAV Identification

Quick read: This paper addresses the difficulty of deploying deep-learning-based UAV Radio Frequency Fingerprint Identification (RFFI) models on resource-constrained edge devices, where large parameter counts and high computational cost are hard to reconcile with recognition accuracy. Existing pruning methods (weight, channel, and layer pruning) struggle to optimize compression rate, hardware acceleration, and accuracy at the same time. The key to the solution is the Hierarchical Spectral Clustering Pruning (HSCP) framework, which uses a two-stage strategy to reach extreme compression with efficient inference. The first stage uses spectral clustering guided by Centered Kernel Alignment (CKA) to identify and remove redundant layers; the second stage applies the same strategy to the channel dimension of the remaining network to remove finer-grained redundancy. A noise-robust fine-tuning strategy keeps the model stable at low signal-to-noise ratios. On ResNet18, HSCP achieves an 86.39% parameter reduction and an 84.44% reduction in floating-point operations (FLOPs) while improving accuracy by 1.49%, outperforming existing methods.

Link: https://arxiv.org/abs/2512.08983
Authors: Maoyu Wang, Yao Lu, Bo Zhou, Zhuangzhi Chen, Yun Lin, Qi Xuan, Guan Gui
Affiliations: Zhejiang University of Technology; Institute of Cyberspace Security, College of Information Engineering, Zhejiang University of Technology; Binjiang Institute of Artificial Intelligence, Zhejiang University of Technology; Zhejiang Institute of Communications; Ocean College, Zhejiang University; UniTTEC Co., Ltd; Harbin Engineering University; College of Telecommunications and Information Engineering, Nanjing University of Posts and Telecommunications
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:


Abstract:With the rapid development of Unmanned Aerial Vehicles (UAVs) and the increasing complexity of low-altitude security threats, traditional UAV identification methods struggle to extract reliable signal features and meet real-time requirements in complex environments. Recently, deep-learning-based Radio Frequency Fingerprint Identification (RFFI) approaches have greatly improved recognition accuracy. However, their large model sizes and high computational demands hinder deployment on resource-constrained edge devices. While model pruning offers a general solution for complexity reduction, existing weight, channel, and layer pruning techniques struggle to concurrently optimize compression rate, hardware acceleration, and recognition accuracy. To this end, we introduce HSCP, a Hierarchical Spectral Clustering Pruning framework that combines layer pruning with channel pruning to achieve extreme compression, high performance, and efficient inference. In the first stage, HSCP employs spectral clustering guided by Centered Kernel Alignment (CKA) to identify and remove redundant layers. Subsequently, the same strategy is applied to the channel dimension to eliminate finer-grained redundancy. To ensure robustness, we further employ a noise-robust fine-tuning strategy. Experiments on the UAV-M100 benchmark demonstrate that HSCP outperforms existing channel and layer pruning methods. Specifically, HSCP achieves 86.39% parameter reduction and 84.44% FLOPs reduction on ResNet18 while improving accuracy by 1.49% compared to the unpruned baseline, and maintains superior robustness even in low signal-to-noise ratio environments.
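
The similarity measure behind the first stage can be illustrated with linear CKA, one common variant of Centered Kernel Alignment. The sketch below only computes the similarity between activation matrices; the spectral clustering and pruning steps are omitted, the abstract does not say which CKA variant or kernel the authors use, and the toy activations are invented.

```python
def center_columns(m):
    """Subtract each column's mean (rows are samples, columns are features)."""
    n = len(m)
    means = [sum(row[j] for row in m) / n for j in range(len(m[0]))]
    return [[row[j] - means[j] for j in range(len(row))] for row in m]

def cross_sq_frobenius(x, y):
    """Squared Frobenius norm of X^T Y."""
    total = 0.0
    for i in range(len(x[0])):
        for j in range(len(y[0])):
            s = sum(x[r][i] * y[r][j] for r in range(len(x)))
            total += s * s
    return total

def linear_cka(x, y):
    """Linear CKA between two activation matrices with the same sample count."""
    x, y = center_columns(x), center_columns(y)
    hsic_xy = cross_sq_frobenius(x, y)
    hsic_xx = cross_sq_frobenius(x, x)
    hsic_yy = cross_sq_frobenius(y, y)
    return hsic_xy / ((hsic_xx * hsic_yy) ** 0.5)

# Toy activations for four inputs (hypothetical values).
layer_a = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [1.0, 3.0]]
layer_b = [[3 * v + 1 for v in row] for row in layer_a]  # rescaled copy of layer_a
layer_c = [[0.5], [-1.0], [0.3], [0.9]]                  # unrelated features

print(linear_cka(layer_a, layer_b), linear_cka(layer_a, layer_c))
```

A score near 1 between two layers suggests they carry nearly the same representation, which is the kind of redundancy a CKA-guided grouping step would look for before removing layers.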

[CV-122] Consist-Retinex: One-Step Noise-Emphasized Consistency Training Accelerates High-Quality Retinex Enhancement

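Quick read: This paper addresses the poor deployment efficiency of diffusion models for low-light image enhancement, which need hundreds of iterative sampling steps, and explores the so-far unaddressed use of consistency models for conditional enhancement. The key to the solution is Consist-Retinex, the first framework to apply consistency modeling to Retinex-based low-light enhancement. Its core innovations are: (1) a dual-objective consistency loss that combines temporal consistency under randomized time sampling with ground-truth alignment, providing full-spectrum supervision for stable convergence; and (2) an adaptive noise-emphasized sampling strategy that prioritizes large-noise regions, which supports the mapping from degraded inputs to enhanced outputs needed for one-step conditional generation. On the VE-LOL-L dataset, Consist-Retinex reaches state-of-the-art performance with a single sampling step (PSNR: 25.51 vs. 23.41) while using only 1/8 of the training budget of the 1000-step diffusion baseline.

Link: https://arxiv.org/abs/2512.08982
Authors: Jian Xu, Wei Chen, Shigui Li, Delu Zeng, John Paisley, Qibin Zhao
Affiliations: South China University of Technology; Columbia University; RIKEN AIP
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: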


Abstract:Diffusion models have achieved remarkable success in low-light image enhancement through Retinex-based decomposition, yet their requirement for hundreds of iterative sampling steps severely limits practical deployment. While recent consistency models offer promising one-step generation for unconditional synthesis, their application to conditional enhancement remains unexplored. We present Consist-Retinex, the first framework adapting consistency modeling to Retinex-based low-light enhancement. Our key insight is that conditional enhancement requires fundamentally different training dynamics than unconditional generation: standard consistency training focuses on low-noise regions near the data manifold, while conditional mapping critically depends on large-noise regimes that bridge degraded inputs to enhanced outputs. We introduce two core innovations: (1) a dual-objective consistency loss combining temporal consistency with ground-truth alignment under randomized time sampling, providing full-spectrum supervision for stable convergence; and (2) an adaptive noise-emphasized sampling strategy that prioritizes training on large-noise regions essential for one-step conditional generation. On VE-LOL-L, Consist-Retinex achieves state-of-the-art performance with single-step sampling (PSNR: 25.51 vs. 23.41, FID: 44.73 vs. 49.59 compared to Diff-Retinex++), while requiring only 1/8 of the training budget relative to the 1000-step Diff-Retinex baseline.

[CV-123] Mitigating Bias with Words: Inducing Demographic Ambiguity in Face Recognition Templates by Text Encoding BMVC

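Quick read: This paper addresses demographic bias in Face Recognition (FR) systems. The bias stems from identity-relevant features being entangled with demographic information (such as gender and ethnicity) in facial embeddings, which leads to uneven verification performance across groups. The key to the solution is a new strategy called Unified Text-Image Embedding (UTIE). Its core idea is to enrich the facial embeddings of one group with text-derived descriptive features of other groups, inducing demographic ambiguity in the embedding space so that the model attends to identity-relevant features rather than group-specific cues. UTIE relies on the zero-shot capabilities and cross-modal semantic alignment of Vision-Language Models (VLMs) to embed text features from other groups into the target group's face representation, giving fairer verification performance; its effectiveness is validated on the RFW and BFW benchmarks.

Link: https://arxiv.org/abs/2512.08981
Authors: Tahar Chettaoui, Naser Damer, Fadi Boutros
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at BMVC workshop (SRBS) 2025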


Abstract:Face recognition (FR) systems are often prone to demographic biases, partially due to the entanglement of demographic-specific information with identity-relevant features in facial embeddings. This bias is extremely critical in large multicultural cities, especially where biometrics play a major role in smart city infrastructure. The entanglement can cause demographic attributes to overshadow identity cues in the embedding space, resulting in disparities in verification performance across different demographic groups. To address this issue, we propose a novel strategy, Unified Text-Image Embedding (UTIE), which aims to induce demographic ambiguity in face embeddings by enriching them with information related to other demographic groups. This encourages face embeddings to emphasize identity-relevant features and thus promotes fairer verification performance across groups. UTIE leverages the zero-shot capabilities and cross-modal semantic alignment of Vision-Language Models (VLMs). Given that VLMs are naturally trained to align visual and textual representations, we enrich the facial embeddings of each demographic group with text-derived demographic features extracted from other demographic groups. This encourages a more neutral representation in terms of demographic attributes. We evaluate UTIE using three VLMs, CLIP, OpenCLIP, and SigLIP, on two widely used benchmarks, RFW and BFW, designed to assess bias in FR. Experimental results show that UTIE consistently reduces bias metrics while maintaining, or even improving in several cases, the face verification accuracy.

[CV-124] Training Multi-Image Vision Agents via End2End Reinforcement Learning

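Quick read: This paper addresses the limited ability of current open-source agents based on Vision Language Models (VLMs) to handle real-world multi-image question answering (multi-image QA): most methods accept only a single image and cannot meet the multi-view understanding needs of complex scenes. The key to the solution is IMAgent, an open-source vision agent trained with end-to-end reinforcement learning (RL) and designed for complex multi-image tasks. Its core contributions are: (1) a multi-agent system that generates challenging, visually rich multi-image QA pairs to fully activate the tool-use potential of the base VLM; (2) two specialized tools for visual reflection and confirmation, which guide the model to actively reallocate attention to image content during reasoning and reduce the tendency to ignore visual information in deep reasoning; and (3) an action-trajectory two-level mask strategy that gives stable tool-use behavior without costly supervised fine-tuning data.

Link: https://arxiv.org/abs/2512.08980
Authors: Chengqi Dong, Chuhuai Yue, Hang He, Rongge Mao, Fenghe Tang, S Kevin Zhou, Zekun Xu, Xiaohan Wang, Jiajun Chai, Wei Lin, Guojun Yin
Affiliations: MeiTuan; University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: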


Abstract:Recent VLM-based agents aim to replicate OpenAI O3’s "thinking with images" via tool use, but most open-source methods limit input to a single image, falling short on real-world multi-image QA tasks. To address this, we propose IMAgent, an open-source vision agent trained via end-to-end reinforcement learning and designed for complex multi-image tasks. By leveraging a multi-agent system, we generate challenging and visually rich multi-image QA pairs to fully activate the tool-use potential of the base VLM. Through manual verification, we obtain MIFG-QA, comprising 10k samples for training and evaluation. With deeper reasoning steps, VLMs may increasingly ignore visual inputs. We therefore develop two specialized tools for visual reflection and confirmation, allowing the model to proactively reallocate its attention to image content during inference. Benefiting from our well-designed action-trajectory two-level mask strategy, IMAgent achieves stable tool-use behavior via pure RL training without requiring costly supervised fine-tuning data. Extensive experiments demonstrate that IMAgent maintains strong performance on existing single-image benchmarks while achieving substantial improvements on our proposed multi-image dataset, with our analysis providing actionable insights for the research community. Code and data will be released soon.

[CV-125] What Happens When: Learning Temporal Orders of Events in Videos WACV2026

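Quick read: This paper addresses the weak ability of Video Large Multimodal Models (VLMMs) to recognize the temporal order of multiple events in video understanding. The authors find that existing VLMMs still perform well on mainstream benchmarks even when video frames are randomly shuffled, which suggests they may rely on prior knowledge of typical scenarios rather than genuine temporal reasoning. To evaluate and improve temporal understanding systematically, they propose the VECTOR benchmark and the MECOT (Multi-Event instruction fine-tuning with Chain-of-Thought) method. The key is instruction fine-tuning on fine-grained, event-by-event descriptions, plus chain-of-thought prompting at inference to strengthen the model's perception of and reasoning about event order. Experiments show that MECOT markedly improves performance on VECTOR and also improves results on existing video understanding benchmarks.

Link: https://arxiv.org/abs/2512.08979
Authors: Daechul Ahn, Yura Choi, Hyeonbeom Choi, Seongwon Cho, San Kim, Jonghyun Choi
Affiliations: Seoul National University; Imperial College London
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: WACV 2026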


Abstract:Video Large Multimodal Models (VLMMs) have shown impressive performance in video understanding, yet their ability to accurately capture the temporal order of multiple events remains underexplored. Through comprehensive experiments, we observe that models perform very well on existing benchmarks even when video frames are scrambled. This implies that VLMMs may not necessarily rely on accurate sequential processing of visual events, but instead depend on prior knowledge of typical scenarios to answer the question. To benchmark temporal understanding capabilities in VLMMs, we propose VECTOR, designed to explicitly assess a model’s ability to identify the temporal order of events. On this benchmark, we observe that various VLMMs often fail to understand the order of events. To address this, we propose MECOT (Multi-Event instruction fine-tuning with Chain-of-Thought), which (1) trains models on detailed, event-by-event video descriptions and (2) uses chain-of-thought prompts at inference to enhance temporal awareness. MECOT outperforms prior art on VECTOR and also improves performance on existing video benchmarks, indicating that better temporal understanding is effective. We release our code, model and datasets.

[CV-126] Controlling Steering Angle for Cooperative Self-driving Vehicles utilizing CNN and LSTM-based Deep Networks

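Quick read: This paper addresses accurate steering-angle adjustment for autonomous vehicles under different road conditions, and in particular the fact that existing deep learning methods generally ignore temporal dependencies between image frames. The key to the solution is a new end-to-end deep neural network architecture that combines a Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), and fully connected (FC) layers, and takes as input both present and future images shared between vehicles over Vehicle-to-Vehicle (V2V) communication. This models the temporal dynamics between frames and improves steering-angle prediction accuracy.

Link: https://arxiv.org/abs/1904.04375
Authors: Rodolfo Valiente, Mahdi Zaman, Sedat Ozer, Yaser P. Fallah
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted in IV 2019, 6 pages, 9 figures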


Abstract:A fundamental challenge in autonomous vehicles is adjusting the steering angle under different road conditions. Recent state-of-the-art solutions to this challenge include deep learning techniques, as they provide an end-to-end solution that predicts steering angles directly from raw input images with higher accuracy. Most of these works ignore the temporal dependencies between image frames. In this paper, we tackle the problem of utilizing multiple sets of images shared between two autonomous vehicles to improve the accuracy of steering-angle control by considering the temporal dependencies between image frames. This problem has not been widely studied in the literature. We present and study a new deep architecture that predicts the steering angle automatically using Long Short-Term Memory (LSTM). Our deep architecture is an end-to-end network that utilizes CNN, LSTM and fully connected (FC) layers, and it uses both present and future images (shared by a vehicle ahead via Vehicle-to-Vehicle (V2V) communication) as input to control the steering angle. Our model demonstrates the lowest error when compared to the other existing approaches in the literature.

[CV-127] PathCo-LatticE: Pathology-Constrained Lattice-Of Experts Framework for Fully-supervised Few-Shot Cardiac MRI Segmentation

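Quick read: This paper addresses the sensitivity to domain shift and the validation bias that arise when few-shot learning (FSL) for cardiac MRI segmentation relies on semi-supervised techniques, which limits zero-shot generalization. The core solution is the PathCo-LatticE framework, built on three elements. First, a Virtual Patient Engine uses generative modeling to build continuous latent disease trajectories from sparse clinical anchors and synthesize physiologically plausible, fully labeled 3D cohorts. Second, Self-Reinforcing Interleaved Validation (SIV) evaluates models online on progressively more challenging synthetic samples, giving leakage-free evaluation without real validation data. Finally, a dynamic Lattice-of-Experts (LoE) organizes specialized networks in a pathology-aware topology and activates the most relevant experts for each input, enabling robust zero-shot generalization to unseen data without target-domain fine-tuning.

Link: https://arxiv.org/abs/2512.09779
Authors: Mohamed Elbayumi, Mohammed S.M. Elbaz
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: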


Abstract:Few-shot learning (FSL) mitigates data scarcity in cardiac MRI segmentation but typically relies on semi-supervised techniques sensitive to domain shifts and validation bias, restricting zero-shot generalizability. We propose PathCo-LatticE, a fully supervised FSL framework that replaces unlabeled data with pathology-guided synthetic supervision. First, our Virtual Patient Engine models continuous latent disease trajectories from sparse clinical anchors, using generative modeling to synthesize physiologically plausible, fully labeled 3D cohorts. Second, Self-Reinforcing Interleaved Validation (SIV) provides a leakage-free protocol that evaluates models online with progressively challenging synthetic samples, eliminating the need for real validation data. Finally, a dynamic Lattice-of-Experts (LoE) organizes specialized networks within a pathology-aware topology and activates the most relevant experts per input, enabling robust zero-shot generalization to unseen data without target-domain fine-tuning. We evaluated PathCo-LatticE in a strict out-of-distribution (OOD) setting, deriving all anchors and severity statistics from a single-source domain (ACDC) and performing zero-shot testing on the multi-center, multi-vendor M&Ms dataset. PathCo-LatticE outperforms four state-of-the-art FSL methods by 4.2-11% Dice starting from only 7 labeled anchors, and approaches fully supervised performance (within 1% Dice) with only 19 labeled anchors. The method shows superior harmonization across four vendors and generalization to unseen pathologies. [Code will be made publicly available].

[CV-128] LiePrune: Lie Group and Quantum Geometric Dual Representation for One-Shot Structured Pruning of Quantum Neural Networks

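Quick read: This paper addresses the scalability bottlenecks that Quantum Neural Networks (QNNs) and Parameterized Quantum Circuits (PQCs) face in practice: parameter redundancy, barren plateaus (vanishing gradients), and hardware limitations. The key to the solution is LiePrune, a mathematically rigorous one-shot structured pruning framework based on Lie group structure and quantum geometric information. Each quantum gate is mapped jointly into a Lie group-Lie algebra dual space and a quantum geometric feature space, which allows principled identification and efficient compression of redundant structure. LiePrune achieves more than 10x compression while maintaining or even improving task performance, and provides provable guarantees on redundancy detection, functional approximation, and computational complexity.

Link: https://arxiv.org/abs/2512.09469
Authors: Haijian Shao, Bowen Yang, Wei Liu, Xing Deng, Yingtao Jiang
Affiliations: Jiangsu University of Science and Technology; University of Nevada, Las Vegas
Categories: Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 2 figures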


Abstract:Quantum neural networks (QNNs) and parameterized quantum circuits (PQCs) are key building blocks for near-term quantum machine learning. However, their scalability is constrained by excessive parameters, barren plateaus, and hardware limitations. We propose LiePrune, the first mathematically grounded one-shot structured pruning framework for QNNs that leverages Lie group structure and quantum geometric information. Each gate is jointly represented in a Lie group–Lie algebra dual space and a quantum geometric feature space, enabling principled redundancy detection and aggressive compression. Experiments on quantum classification (MNIST, FashionMNIST), quantum generative modeling (Bars-and-Stripes), and quantum chemistry (LiH VQE) show that LiePrune achieves over 10× compression with negligible or even improved task performance, while providing provable guarantees on redundancy detection, functional approximation, and computational complexity.

[CV-129] Causal Attribution of Model Performance Gaps in Medical Imaging Under Distribution Shifts

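Quick read: This paper addresses the significant performance drops that deep learning models for medical image segmentation suffer under distribution shifts, whose causal mechanisms remain poorly understood. The key to the solution is extending causal attribution frameworks to high-dimensional segmentation tasks: the data-generating process is modeled as a causal graph, and Shapley values are used to fairly quantify the independent contribution of each mechanism (such as acquisition protocol and annotation variability) to the change in performance. The method handles three challenges specific to medical imaging: high-dimensional outputs, limited samples, and complex interactions between mechanisms. Experiments show that it identifies the dominant failure mechanism in cross-annotator and cross-center settings, which can guide targeted interventions.

Link: https://arxiv.org/abs/2512.09094
Authors: Pedro M. Gordaliza, Nataliia Molchanova, Jaume Banus, Thomas Sanchez, Meritxell Bach Cuadra
Affiliations: CIBM Center for Biomedical Imaging; Department of Radiology, Lausanne University Hospital and University of Lausanne, Switzerland; MedGIFT, Institute of Informatics, School of Management, HES–SO Valais–Wallis University of Applied Sciences and Arts Western Switzerland, Sierre, Switzerland
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Methodology (stat.ME)
Comments: Medical Imaging meets EurIPS Workshop: MedEurIPS 2025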

Abstract:Deep learning models for medical image segmentation suffer significant performance drops due to distribution shifts, but the causal mechanisms behind these drops remain poorly understood. We extend causal attribution frameworks to high-dimensional segmentation tasks, quantifying how acquisition protocols and annotation variability independently contribute to performance degradation. We model the data-generating process through a causal graph and employ Shapley values to fairly attribute performance changes to individual mechanisms. Our framework addresses unique challenges in medical imaging: high-dimensional outputs, limited samples, and complex mechanism interactions. Validation on multiple sclerosis (MS) lesion segmentation across 4 centers and 7 annotators reveals context-dependent failure modes: annotation protocol shifts dominate when crossing annotators (7.4% ± 8.9% DSC attribution), while acquisition shifts dominate when crossing imaging centers (6.5% ± 9.1%). This mechanism-specific quantification enables practitioners to prioritize targeted interventions based on deployment context.
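
The Shapley-value attribution mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the two mechanism names and the Dice scores in `DSC` are hypothetical placeholders, whereas the paper estimates such values from a causal graph over real data.

```python
from itertools import permutations

def shapley_attribution(mechanisms, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    contrib = {m: 0.0 for m in mechanisms}
    orderings = list(permutations(mechanisms))
    for order in orderings:
        shifted = frozenset()
        for m in order:
            before = value(shifted)
            shifted = shifted | {m}
            contrib[m] += value(shifted) - before
    return {m: c / len(orderings) for m, c in contrib.items()}

# Hypothetical Dice scores (DSC) when a given set of mechanisms is shifted
# from the source to the target distribution. Illustrative numbers only.
DSC = {
    frozenset(): 0.80,
    frozenset({"acquisition"}): 0.74,
    frozenset({"annotation"}): 0.76,
    frozenset({"acquisition", "annotation"}): 0.68,
}

def dsc_under_shift(shifted):
    """Hypothetical value function: performance when `shifted` mechanisms change."""
    return DSC[frozenset(shifted)]

attribution = shapley_attribution(["acquisition", "annotation"], dsc_under_shift)
print(attribution)
```

In this toy setting the two attributions sum to the total performance change (0.68 − 0.80), which is the efficiency property that makes Shapley values a "fair" split. Exact enumeration is only feasible for a handful of mechanisms.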

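[CV-130] DermETAS-SNA LLM: A Dermatology Focused Evolutionary Transformer Architecture Search with StackNet Augmented LLM Assistant

Quick read: This paper addresses shortcomings in automated skin disease classification and clinical explanation, in particular improving classification accuracy, mitigating class imbalance, and generating personalized diagnostic explanations that follow medical standards. The key to the solution is an integrated architecture, the DermETAS-SNA LLM Assistant, whose core components are: (1) an Evolutionary Transformer Architecture Search (ETAS) framework built on the SKINCON dataset that optimizes a Vision Transformer (ViT) for dermatological feature representation, followed by fine-tuning a binary classifier for each of the 23 disease categories in the DermNet dataset; (2) a StackNet architecture that combines the fine-tuned binary ViT classifiers to improve predictive robustness and mitigate class imbalance; (3) a retrieval-augmented generation (RAG) pipeline, the Diagnostic Explanation and Retrieval Model for Dermatology (DERMDM), which uses the Google Gemini 2.5 Pro large language model and a repository of verified dermatological materials to generate patient-friendly medical explanations; and (4) experimental validation showing an overall F1-score of 56.30% across 23 skin disease categories, compared with 48.51% for SkinGPT-4 (a relative gain of 16.06%), together with a 92% agreement rate in an evaluation by eight licensed medical doctors.

Link: https://arxiv.org/abs/2512.08998
Authors: Nitya Phani Santosh Oruganty, Keerthi Vemula Murali, Chun-Kit Ngan, Paulo Bandeira Pinho
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: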

Abstract:Our work introduces the DermETAS-SNA LLM Assistant that integrates Dermatology-focused Evolutionary Transformer Architecture Search with StackNet Augmented LLM. The assistant dynamically learns skin-disease classifiers and provides medically informed descriptions to facilitate clinician-patient interpretation. Contributions include: (1) Developed an ETAS framework on the SKINCON dataset to optimize a Vision Transformer (ViT) tailored for dermatological feature representation and then fine-tuned binary classifiers for each of the 23 skin disease categories in the DermNet dataset to enhance classification performance; (2) Designed a StackNet architecture that integrates multiple fine-tuned binary ViT classifiers to enhance predictive robustness and mitigate class imbalance issues; (3) Implemented a RAG pipeline, termed Diagnostic Explanation and Retrieval Model for Dermatology, which harnesses the capabilities of the Google Gemini 2.5 Pro LLM architecture to generate personalized, contextually informed diagnostic descriptions and explanations for patients, leveraging a repository of verified dermatological materials; (4) Performed extensive experimental evaluations on 23 skin disease categories to demonstrate performance increase, achieving an overall F1-score of 56.30% that surpasses SkinGPT-4 (48.51%) by a considerable margin, representing a performance increase of 16.06%; (5) Conducted a domain-expert evaluation, with eight licensed medical doctors, of the clinical responses generated by our AI assistant for seven dermatological conditions. Our results show a 92% agreement rate with the assessments provided by our AI assistant; (6) Created a proof-of-concept prototype that fully integrates our DermETAS-SNA LLM into our AI assistant to demonstrate its practical feasibility for real-world clinical and educational applications.

[CV-131] Enhanced Chest Disease Classification Using an Improved CheXNet Framework with EfficientNetV2-M and Optimization-Driven Learning

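Quick read: This paper addresses the low efficiency and insufficient accuracy of automated chest X-ray diagnosis in resource-limited settings, where a shortage of radiologists leads to delayed diagnosis and poor patient outcomes. The core of the solution is a classification framework built on an EfficientNetV2-M backbone combined with several training strategies: Automatic Mixed Precision training, the AdamW optimizer, Cosine Annealing learning rate scheduling, and Exponential Moving Average regularization. On a dataset of 18,080 chest X-ray images, the framework reaches a mean test accuracy of 96.45%, 1.15 percentage points above the CheXNet baseline (p < 0.001). Accuracy on key infectious diseases reaches 99.95% for COVID-19 and 99.97% for Tuberculosis, while training time falls by 11.4% and performance stability improves by 22.7%. The authors present it as a clinical decision-support tool for pandemic response, tuberculosis screening, and routine assessment of thoracic disease.

Link: https://arxiv.org/abs/2512.08992
Authors: Ali M. Bahram, Saman Muhammad Omer, Hardi M. Mohammed, Sirwan Abdolwahed Aula
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 23 pages, 6 figures, 7 tables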

Abstract:Interpreting chest X-rays is an important diagnostic task in clinical practice, especially in resource-limited settings where the shortage of radiologists contributes to delayed diagnosis and poor patient outcomes. Although the original CheXNet architecture has shown potential in automated analysis of chest radiographs, its DenseNet-121 backbone is computationally inefficient and performs poorly as a single-label classifier. To address these shortcomings, we propose an improved chest disease classification framework that relies on EfficientNetV2-M and incorporates advanced training approaches such as Automatic Mixed Precision training, AdamW, Cosine Annealing learning rate scheduling, and Exponential Moving Average regularization. We prepared a dataset of 18,080 chest X-ray images from three authoritative sources, representing five clinically significant categories: Cardiomegaly, COVID-19, Normal, Pneumonia, and Tuberculosis. To achieve statistical reliability and reproducibility, nine independent experimental runs were performed. The proposed architecture showed significant gains, with a mean test accuracy of 96.45 percent compared to 95.30 percent at baseline (p < 0.001) and a macro-averaged F1-score that increased to 91.08 percent (p < 0.001). Critical infectious diseases showed near-perfect classification performance, with COVID-19 detection at 99.95 percent accuracy and Tuberculosis detection at 99.97 percent accuracy. Although the model has 6.8 times more parameters, training time was reduced by 11.4 percent and performance stability increased by 22.7 percent. This framework is presented as a decision-support tool that can be used for pandemic response, tuberculosis screening, and routine assessment of thoracic disease in various healthcare facilities.
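
Two of the training ingredients named in the abstract, cosine annealing and exponential moving average (EMA) of weights, can be sketched in plain Python. This is only an illustration of the general techniques: the hyperparameters (`base_lr`, `min_lr`, `decay`) are assumptions, since the abstract does not report the values used.

```python
import math

def cosine_annealing_lr(step, total_steps, base_lr=1e-3, min_lr=0.0):
    """Learning rate that decays from base_lr to min_lr along a half cosine."""
    progress = step / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def ema_update(ema_weights, weights, decay=0.999):
    """One EMA step over model weights, represented here as lists of floats."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]

# Schedule over 10 steps: starts at base_lr, ends at min_lr.
lrs = [cosine_annealing_lr(s, total_steps=10) for s in range(11)]

# The EMA copy moves only a small fraction toward the current weights.
ema = ema_update([0.0, 0.0], [1.0, -1.0], decay=0.9)
```

In a real training loop the EMA copy is typically the one evaluated at test time, because averaging smooths out step-to-step noise in the raw weights.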

[CV-132] Agreement Disagreement Guided Knowledge Transfer for Cross-Scene Hyperspectral Imaging

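Quick read: This paper addresses gradient conflicts and dominant gradients in knowledge transfer for cross-scene hyperspectral imaging (HSI), as well as the loss of critical target patterns that occurs when existing methods rely only on a limited shared subset of target features and ignore disagreement information. The key to the solution is the Agreement Disagreement Guided Knowledge Transfer (ADGKT) framework. Its agreement component contains GradVac, which aligns gradient directions to mitigate conflicts between the source and target domains, and LogitNorm, which regulates logit magnitudes so that no single gradient source dominates. Its disagreement component uses a Disagreement Restriction (DiR) and an ensemble strategy to capture diverse predictive target features, giving more robust and balanced cross-scene transfer.

Link: https://arxiv.org/abs/2512.08990
Authors: Lu Huo, Haimin Zhang, Min Xu
Affiliations: University of Technology Sydney
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: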

Abstract:Knowledge transfer plays a crucial role in cross-scene hyperspectral imaging (HSI). However, existing studies often overlook the challenges of gradient conflicts and dominant gradients that arise during the optimization of shared parameters. Moreover, many current approaches fail to simultaneously capture both agreement and disagreement information, relying only on a limited shared subset of target features and consequently missing the rich, diverse patterns present in the target scene. To address these issues, we propose an Agreement Disagreement Guided Knowledge Transfer (ADGKT) framework that integrates both mechanisms to enhance cross-scene transfer. The agreement component includes GradVac, which aligns gradient directions to mitigate conflicts between source and target domains, and LogitNorm, which regulates logit magnitudes to prevent domination by a single gradient source. The disagreement component consists of a Disagreement Restriction (DiR) and an ensemble strategy, which capture diverse predictive target features and mitigate the loss of critical target information. Extensive experiments demonstrate the effectiveness and superiority of the proposed method in achieving robust and balanced knowledge transfer across heterogeneous HSI scenes.
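
To make "regulates logit magnitudes" concrete, here is a minimal sketch of logit normalization, assuming the common formulation in which logits are divided by their L2 norm times a temperature before the cross-entropy loss. The abstract does not say how ADGKT configures this step, so the `temperature` value and function names are assumptions.

```python
import math

def logit_norm(logits, temperature=0.04, eps=1e-7):
    """Rescale logits to a fixed L2 norm of roughly 1 / temperature."""
    norm = math.sqrt(sum(z * z for z in logits)) + eps
    return [z / (norm * temperature) for z in logits]

def cross_entropy(logits, target):
    """Softmax cross-entropy for one example, computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

# Two logit vectors that point in the same direction but differ 10x in scale.
small = logit_norm([1.0, 2.0, 3.0])
large = logit_norm([10.0, 20.0, 30.0])
loss_small = cross_entropy(small, target=2)
loss_large = cross_entropy(large, target=2)
```

After normalization the two vectors yield essentially the same loss, so a branch cannot reduce its loss (and dominate the shared gradients) simply by inflating its logit magnitude.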

Artificial Intelligence

[AI-0] FALCON: Few-step Accurate Likelihoods for Continuous Flows NEURIPS2025

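Quick read: This paper addresses scalable sampling of molecular states in thermodynamic equilibrium, and specifically the bottleneck that current Boltzmann Generators based on continuous normalizing flows (CNFs) need thousands of function evaluations per sample to compute likelihoods. The key to the solution is Few-step Accurate Likelihoods for Continuous Flows (FALCON), which introduces a hybrid training objective that encourages invertibility, so that a few sampling steps yield likelihoods accurate enough for importance sampling. Experiments show that FALCON outperforms state-of-the-art normalizing flow models on molecular Boltzmann sampling and is two orders of magnitude faster than an equivalently performing CNF model.

Link: https://arxiv.org/abs/2512.09914
Authors: Danyal Rehman, Tara Akhound-Sadegh, Artem Gazizov, Yoshua Bengio, Alexander Tong
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Preprint; NeurIPS 2025 MLSB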

Abstract:Scalable sampling of molecular states in thermodynamic equilibrium is a long-standing challenge in statistical physics. Boltzmann Generators tackle this problem by pairing a generative model, capable of exact likelihood computation, with importance sampling to obtain consistent samples under the target distribution. Current Boltzmann Generators primarily use continuous normalizing flows (CNFs) trained with flow matching for efficient training of powerful models. However, likelihood calculation for these models is extremely costly, requiring thousands of function evaluations per sample, severely limiting their adoption. In this work, we propose Few-step Accurate Likelihoods for Continuous Flows (FALCON), a method which allows for few-step sampling with a likelihood accurate enough for importance sampling applications by introducing a hybrid training objective that encourages invertibility. We show FALCON outperforms state-of-the-art normalizing flow models for molecular Boltzmann sampling and is two orders of magnitude faster than the equivalently performing CNF model.
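
The role of the likelihood in a Boltzmann Generator can be shown with a toy self-normalized importance sampling sketch. Everything here is a stand-in chosen for illustration: a 1D Gaussian proposal plays the part of the generative model, and `energy` is a hypothetical quadratic potential, not a molecular system or the FALCON model.

```python
import math
import random

def energy(x):
    """Toy potential U(x) = x^2 / 2, so the target density is a standard normal."""
    return 0.5 * x * x

def proposal_log_prob(x, sigma=2.0):
    """Exact log-density of the proposal N(0, sigma^2), standing in for the flow."""
    return -0.5 * (x / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))

def snis_expectation(f, n_samples=20000, sigma=2.0, seed=0):
    """Estimate E_target[f(x)] from proposal samples reweighted by exp(-U) / q."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, sigma) for _ in range(n_samples)]
    log_w = [-energy(x) - proposal_log_prob(x, sigma) for x in xs]
    m = max(log_w)  # subtract the max for numerical stability
    w = [math.exp(lw - m) for lw in log_w]
    total = sum(w)
    return sum(wi * f(x) for wi, x in zip(w, xs)) / total

# The second moment of a standard normal is 1, so the estimate should be close to 1.
estimate = snis_expectation(lambda x: x * x)
```

The weights need `proposal_log_prob` for every sample, which is why an expensive or inaccurate likelihood directly limits this reweighting step.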

[AI-1] STACHE: Local Black-Box Explanations for Reinforcement Learning Policies

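Quick read: This paper addresses the unexpected behavior of reinforcement learning agents in sparse-reward or safety-critical environments, which creates a strong need for reliable debugging and verification tools. The key to the solution is STACHE, a framework that generates local, black-box explanations for a specific action of an agent in discrete Markov games. It builds a composite explanation from two complementary components: (1) a Robustness Region, the connected neighborhood of states in which the agent's action stays the same; and (2) Minimal Counterfactuals, the smallest state perturbations needed to change that decision. By exploiting the structure of factored state spaces, the method uses an exact, search-based algorithm that avoids the fidelity gap of surrogate models. It also captures how policy logic evolves during training, from erratic behavior to optimized, robust strategies, and so provides actionable insight into agent sensitivity and decision boundaries.

Link: https://arxiv.org/abs/2512.09909
Authors: Andrew Elashkin, Orna Grumberg
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: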

Abstract:Reinforcement learning agents often behave unexpectedly in sparse-reward or safety-critical environments, creating a strong need for reliable debugging and verification tools. In this paper, we propose STACHE, a comprehensive framework for generating local, black-box explanations for an agent’s specific action within discrete Markov games. Our method produces a Composite Explanation consisting of two complementary components: (1) a Robustness Region, the connected neighborhood of states where the agent’s action remains invariant, and (2) Minimal Counterfactuals, the smallest state perturbations required to alter that decision. By exploiting the structure of factored state spaces, we introduce an exact, search-based algorithm that circumvents the fidelity gaps of surrogate models. Empirical validation on Gymnasium environments demonstrates that our framework not only explains policy actions, but also effectively captures the evolution of policy logic during training - from erratic, unstable behavior to optimized, robust strategies - providing actionable insights into agent sensitivity and decision boundaries.
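
The two components of the Composite Explanation can be sketched with a breadth-first search over a small factored state space. This is an illustrative reading of the abstract, not the STACHE algorithm itself: the grid, the `toy_policy`, and the choice of "one factor changes by one step" as the neighborhood relation are all assumptions.

```python
from collections import deque

def neighbors(state, sizes):
    """States that differ from `state` by +/-1 in exactly one factor."""
    for i, size in enumerate(sizes):
        for delta in (-1, 1):
            v = state[i] + delta
            if 0 <= v < size:
                yield state[:i] + (v,) + state[i + 1:]

def robustness_region(policy, state, sizes):
    """Connected set of states around `state` where the policy's action is unchanged."""
    action = policy(state)
    region, frontier = {state}, deque([state])
    while frontier:
        current = frontier.popleft()
        for nxt in neighbors(current, sizes):
            if nxt not in region and policy(nxt) == action:
                region.add(nxt)
                frontier.append(nxt)
    return region

def minimal_counterfactuals(policy, state, sizes):
    """Closest states (in perturbation steps) where the policy picks another action."""
    action = policy(state)
    seen, frontier = {state}, [state]
    while frontier:
        found = [s for s in frontier if policy(s) != action]
        if found:
            return found
        next_frontier = []
        for s in frontier:
            for n in neighbors(s, sizes):
                if n not in seen:
                    seen.add(n)
                    next_frontier.append(n)
        frontier = next_frontier
    return []

def toy_policy(state):
    """Hypothetical black-box policy on a 5x5 grid: move right until x reaches 3."""
    x, _y = state
    return "right" if x < 3 else "up"

SIZES = (5, 5)
region = robustness_region(toy_policy, (1, 2), SIZES)
counterfactuals = minimal_counterfactuals(toy_policy, (1, 2), SIZES)
```

Both searches only call `policy(state)`, so they treat the agent as a black box, and because they enumerate states exactly, no surrogate model is fitted.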

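[AI-2] Bayesian Networks, Markov Networks, Moralisation, Triangulation: a Categorical Perspective

Quick read: This paper seeks a unified theoretical account of the conversion between Bayesian networks and Markov networks in probabilistic graphical models, and in particular of how to formalize the two key transformations, moralisation and triangulation. The key to the solution is a categorical framework in which both kinds of network are modeled as functors from a 'syntax' domain to a 'semantics' codomain, so that moralisation and triangulation become functors between the corresponding categories. Moralisation is fully syntactic (it is realized by functor pre-composition), whereas triangulation relies on semantic information. This distinction leads to a reading of the variable elimination algorithm as a functor in its own right, one that splits triangulation into a purely syntactic stage and a purely semantic stage, clarifying the boundary between syntactic and semantic operations.

Link: https://arxiv.org/abs/2512.09908
Authors: Antonio Lorenzin, Fabio Zanasi
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Category Theory (math.CT)
Comments: 36 pages. A preliminary version of this work was presented at CALCO 2025, under the title "An Algebraic Approach to Moralisation and Triangulation of Probabilistic Graphical Models"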

Abstract:Moralisation and triangulation are transformations that allow switching between different ways of factoring a probability distribution into a graphical model. Moralisation allows one to view a Bayesian network (a directed model) as a Markov network (an undirected model), whereas triangulation addresses the opposite direction. We present a categorical framework where these transformations are modelled as functors between a category of Bayesian networks and one of Markov networks. The two kinds of network (the objects of these categories) are themselves represented as functors from a 'syntax' domain to a 'semantics' codomain. Notably, moralisation and triangulation can be defined inductively on such syntax via functor pre-composition. Moreover, while moralisation is fully syntactic, triangulation relies on semantics. This leads to a discussion of the variable elimination algorithm, reinterpreted here as a functor in its own right, that splits the triangulation procedure in two: one purely syntactic, the other purely semantic. This approach introduces a functorial perspective into the theory of probabilistic graphical models, which highlights the distinctions between syntactic and semantic modifications.
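
For readers unfamiliar with moralisation, the classical graph-level operation can be sketched in a few lines of Python: connect ("marry") all parents of each node, then drop edge directions. This shows only the standard graph construction, not the paper's categorical formulation; the three-node network is a hypothetical example.

```python
from itertools import combinations

def moralise(parents):
    """Return the undirected edges of the moral graph of a DAG.

    `parents` maps each node to the list of its parents.
    """
    edges = set()
    for child, pars in parents.items():
        for p in pars:
            edges.add(frozenset((p, child)))   # keep each edge, without direction
        for a, b in combinations(pars, 2):
            edges.add(frozenset((a, b)))       # marry co-parents
    return edges

# Hypothetical collider: Rain -> WetGrass <- Sprinkler
dag = {"Rain": [], "Sprinkler": [], "WetGrass": ["Rain", "Sprinkler"]}
moral_edges = moralise(dag)
```

The added Rain–Sprinkler edge reflects the fact that the two causes become dependent once their common effect is taken into account, which an undirected model can only express with a direct edge.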

[AI-3] Human-in-the-Loop and AI: Crowdsourcing Metadata Vocabulary for Materials Science

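Quick read: This paper addresses the way limited human resources and inconsistent standardization practices constrain the development of metadata vocabularies in materials science, which in turn hinders progress on the FAIR (findable, accessible, interoperable, reusable) and FARR data principles. The key to the solution is MatSci-YAMZ, a platform that combines artificial intelligence (AI) with human-in-the-loop (HILT) input. Through crowdsourcing and iterative feedback loops, expert users take part in generating and refining term definitions, which improves the efficiency and consistency of metadata vocabulary development and offers a model that could scale to other domains.

Link: https://arxiv.org/abs/2512.09895
Authors: Jane Greenberg, Scott McClellan, Addy Ireland, Robert Sammarco, Colton Gerber, Christopher B. Rauch, Mat Kelly, John Kunze, Yuan An, Eric Toberer
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
Comments: Metadata and Semantics Research Conference 2025, 14 pages, 7 figures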

Abstract:Metadata vocabularies are essential for advancing FAIR and FARR data principles, but their development is constrained by limited human resources and inconsistent standardization practices. This paper introduces MatSci-YAMZ, a platform that integrates artificial intelligence (AI) and human-in-the-loop (HILT), including crowdsourcing, to support metadata vocabulary development. The paper reports on a proof-of-concept use case evaluating the AI-HILT model in materials science, a highly interdisciplinary domain. Six (6) participants affiliated with the NSF Institute for Data-Driven Dynamical Design (ID4) engaged with the MatSci-YAMZ platform over several weeks, contributing term definitions and providing examples to prompt the refinement of AI-generated definitions. Nineteen (19) AI-generated definitions were successfully created, with iterative feedback loops demonstrating the feasibility of AI-HILT refinement. Findings confirm the feasibility of the AI-HILT model, highlighting 1) a successful proof of concept, 2) alignment with FAIR and open-science principles, 3) a research protocol to guide future studies, and 4) the potential for scalability across domains. Overall, MatSci-YAMZ’s underlying model has the capacity to enhance semantic transparency and reduce the time required for consensus building and metadata vocabulary development.

[AI-4] Provably Learning from Modern Language Models via Low Logit Rank

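Quick read: This paper addresses how to obtain provable learning guarantees for generative models, focusing on the "low logit rank" structure widely observed in modern language models. Classical learning theory is hard to apply directly to complex language models, so the paper introduces a query learning model based on logit queries, which formalizes the access provided by common APIs, and gives an efficient algorithm for learning any approximately low logit rank model. The key to the solution is exploiting the low-rank structure of the matrix of log probabilities to design a polynomial-time learning algorithm, giving what the authors believe is the first end-to-end learning guarantee for a generative model that plausibly captures modern language models.

Link: https://arxiv.org/abs/2512.09892
Authors: Noah Golowich, Allen Liu, Abhishek Shetty
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
Comments: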

Abstract:While modern language models and their inner workings are incredibly complex, recent work (Golowich, Liu, and Shetty; 2025) has proposed a simple and potentially tractable abstraction for them through the observation that empirically, these language models all seem to have approximately low logit rank. Roughly, this means that a matrix formed by the model’s log probabilities of various tokens conditioned on certain sequences of tokens is well approximated by a low rank matrix. In this paper, our focus is on understanding how this structure can be exploited algorithmically for obtaining provable learning guarantees. Since low logit rank models can encode hard-to-learn distributions such as noisy parities, we study a query learning model with logit queries that reflects the access model for common APIs. Our main result is an efficient algorithm for learning any approximately low logit rank model from queries. We emphasize that our structural assumption closely reflects the behavior that is empirically observed in modern language models. Thus, our result gives what we believe is the first end-to-end learning guarantee for a generative model that plausibly captures modern language models.
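
A rough feel for what "low logit rank" means can be had from a toy model whose logits are inner products of low-dimensional context and token vectors. The sketch below is a simplified illustration under that assumption; the paper's formal definition of the logit matrix may differ, and the sizes and random vectors are hypothetical.

```python
import math
import random

def numerical_rank(matrix, tol=1e-9):
    """Rank via Gaussian elimination with partial pivoting."""
    rows = [list(r) for r in matrix]
    rank = 0
    for col in range(len(rows[0])):
        if rank == len(rows):
            break
        pivot = max(range(rank, len(rows)), key=lambda r: abs(rows[r][col]))
        if abs(rows[pivot][col]) < tol:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, len(rows)):
            factor = rows[r][col] / rows[rank][col]
            rows[r] = [a - factor * b for a, b in zip(rows[r], rows[rank])]
        rank += 1
    return rank

def log_softmax(logits):
    """Log probabilities from logits, computed stably."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

rng = random.Random(0)
n_contexts, n_tokens, hidden = 8, 6, 2
context_vecs = [[rng.gauss(0, 1) for _ in range(hidden)] for _ in range(n_contexts)]
token_vecs = [[rng.gauss(0, 1) for _ in range(hidden)] for _ in range(n_tokens)]

# Row = context, column = token, entry = log P(token | context).
log_prob_matrix = [
    log_softmax([sum(c * t for c, t in zip(cv, tv)) for tv in token_vecs])
    for cv in context_vecs
]
rank = numerical_rank(log_prob_matrix)
```

Subtracting the per-row normalizer adds at most one to the rank, so the 8 × 6 matrix has rank at most `hidden + 1` rather than the full 6 a generic matrix would have.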

[AI-5] Comparing AI Agents to Cybersecurity Professionals in Real-World Penetration Testing

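Quick read: This paper addresses the lack of evidence on how existing AI agents perform against human cybersecurity professionals in real enterprise networks. The core challenge is building an efficient, accurate, and scalable agent framework that can find vulnerabilities in a complex network as well as human penetration testers do. The key to the solution is ARTEMIS, a multi-agent architecture that supports dynamic prompt generation, arbitrary sub-agents, and automatic vulnerability triaging. In the study, ARTEMIS discovered 9 valid vulnerabilities with an 82% valid submission rate and outperformed 9 of the 10 human participants. AI agents showed advantages in systematic enumeration, parallel exploitation, and cost (certain ARTEMIS variants cost $18/hour versus $60/hour for professional penetration testers), with technical sophistication and submission quality comparable to the strongest human participants.

Link: https://arxiv.org/abs/2512.09882
Authors: Justin W. Lin, Eliot Krzysztof Jones, Donovan Julian Jasper, Ethan Jun-shen Ho, Anna Wu, Arnold Tianyi Yang, Neil Perry, Andy Zou, Matt Fredrikson, J. Zico Kolter, Percy Liang, Dan Boneh, Daniel E. Ho
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computers and Society (cs.CY)
Comments: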

Abstract:We present the first comprehensive evaluation of AI agents against human cybersecurity professionals in a live enterprise environment. We evaluate ten cybersecurity professionals alongside six existing AI agents and ARTEMIS, our new agent scaffold, on a large university network consisting of ~8,000 hosts across 12 subnets. ARTEMIS is a multi-agent framework featuring dynamic prompt generation, arbitrary sub-agents, and automatic vulnerability triaging. In our comparative study, ARTEMIS placed second overall, discovering 9 valid vulnerabilities with an 82% valid submission rate and outperforming 9 of 10 human participants. While existing scaffolds such as Codex and CyAgent underperformed relative to most human participants, ARTEMIS demonstrated technical sophistication and submission quality comparable to the strongest participants. We observe that AI agents offer advantages in systematic enumeration, parallel exploitation, and cost – certain ARTEMIS variants cost $18/hour versus $60/hour for professional penetration testers. We also identify key capability gaps: AI agents exhibit higher false-positive rates and struggle with GUI-based tasks.

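[AI-6] FlipLLM: Efficient Bit-Flip Attacks on Multimodal LLMs using Reinforcement Learning

Quick read: This paper addresses the vulnerability of generative AI models, in particular Large Language Models (LLMs) and Large Vision Models (VLMs), to hardware-level bit-flip attacks (BFAs). Existing BFA discovery methods generalize poorly and are hard to scale, so they cannot efficiently analyze the vast parameter space and complex interdependencies of modern foundation models. The key to the solution is FlipLLM, an architecture-agnostic reinforcement learning (RL) framework that formulates BFA discovery as a sequential decision-making problem and combines sensitivity-guided layer pruning with Q-learning to quickly identify minimal, high-impact bit sets that cause catastrophic failure. Experiments show that FlipLLM finds critical vulnerable bits up to 2.5x faster than state-of-the-art methods and is effective on several mainstream models (such as LLaMA 3.1 8B and LLaVA 1.6). Applying standard hardware protection mechanisms (such as ECC SECDED) to the identified bit locations completely mitigates the attack, which shows the framework's value for guiding hardware-level defenses.

Link: https://arxiv.org/abs/2512.09872
Authors: Khurram Khalil, Khaza Anuarul Hoque
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Accepted in IEEE HOST 2026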

Abstract:Generative Artificial Intelligence models, such as Large Language Models (LLMs) and Large Vision Models (VLMs), exhibit state-of-the-art performance but remain vulnerable to hardware-based threats, specifically bit-flip attacks (BFAs). Existing BFA discovery methods lack generalizability and struggle to scale, often failing to analyze the vast parameter space and complex interdependencies of modern foundation models in a reasonable time. This paper proposes FlipLLM, a reinforcement learning (RL) architecture-agnostic framework that formulates BFA discovery as a sequential decision-making problem. FlipLLM combines sensitivity-guided layer pruning with Q-learning to efficiently identify minimal, high-impact bit sets that can induce catastrophic failure. We demonstrate the effectiveness and generalizability of FlipLLM by applying it to a diverse set of models, including prominent text-only LLMs (GPT-2 Large, LLaMA 3.1 8B, and DeepSeek-V2 7B), VLMs such as LLaVA 1.6, and datasets, such as MMLU, MMLU-Pro, VQAv2, and TextVQA. Our results show that FlipLLM can identify critical bits that are vulnerable to BFAs up to 2.5x faster than SOTA methods. We demonstrate that flipping the FlipLLM-identified bits plummets the accuracy of LLaMA 3.1 8B from 69.9% to ~0.2%, and for LLaVA’s VQA score from 78% to almost 0%, by flipping as few as 5 and 7 bits, respectively. Further analysis reveals that applying standard hardware protection mechanisms, such as ECC SECDED, to the FlipLLM-identified bit locations completely mitigates the BFA impact, demonstrating the practical value of our framework in guiding hardware-level defenses. FlipLLM offers the first scalable and adaptive methodology for exploring the BFA vulnerability of both language and multimodal foundation models, paving the way for comprehensive hardware-security evaluation.
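
Why a handful of flipped bits can be catastrophic is easy to see at the level of a single stored weight. The sketch below flips individual bits in an IEEE 754 float32 encoding using only the standard library; it does not implement FlipLLM's search, and the choice of float32 and of the weight value 0.5 are assumptions made for illustration (the attacked models may store weights in other formats).

```python
import struct

def flip_bit_float32(value, bit):
    """Flip one bit (0 = least significant) in the float32 encoding of `value`."""
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))
    return flipped

weight = 0.5
sign_flipped = flip_bit_float32(weight, 31)      # sign bit: 0.5 becomes -0.5
exponent_flipped = flip_bit_float32(weight, 30)  # top exponent bit: 0.5 becomes 2**127
mantissa_flipped = flip_bit_float32(weight, 0)   # lowest mantissa bit: tiny change
```

A flip in the top exponent bit changes the weight by dozens of orders of magnitude, while a flip in the lowest mantissa bit is negligible, which is why a search for the few most damaging bit positions is meaningful at all.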

[AI-7] Interpretation as Linear Transformation: A Cognitive-Geometric Model of Belief and Meaning

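Quick read: This paper addresses how to model belief, motivation, and influence in multi-agent systems with cognitive heterogeneity, where differences in agents' internal cognitive structure cause meaning transmission to fail, producing misunderstanding and belief death. The key to the solution is a geometric framework in which each agent is represented by a personalized value space, a vector space encoding the internal dimensions through which the agent interprets and evaluates meaning. Beliefs are formalized as structured vectors, called abstract beings, whose transmission is mediated by linear interpretation maps. A belief survives only if it avoids the null spaces of these maps, which yields a structural criterion for intelligibility, miscommunication, and belief death. Using these algebraic constraints, the paper explains belief distortion, motivational drift, counterfactual evaluation, and the limits of mutual understanding, and it proposes the "No-Null-Space Leadership Condition", which characterizes leadership as representational reachability rather than persuasion or authority.

Link: https://arxiv.org/abs/2512.09831
Authors: Chainarong Amornbunchornvej
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Social and Information Networks (cs.SI)
Comments: The first draft of cognitive geometry model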

Abstract:This paper develops a geometric framework for modeling belief, motivation, and influence across cognitively heterogeneous agents. Each agent is represented by a personalized value space, a vector space encoding the internal dimensions through which the agent interprets and evaluates meaning. Beliefs are formalized as structured vectors (abstract beings) whose transmission is mediated by linear interpretation maps. A belief survives communication only if it avoids the null spaces of these maps, yielding a structural criterion for intelligibility, miscommunication, and belief death. Within this framework, I show how belief distortion, motivational drift, counterfactual evaluation, and the limits of mutual understanding arise from purely algebraic constraints. A central result, "the No-Null-Space Leadership Condition", characterizes leadership as a property of representational reachability rather than persuasion or authority. More broadly, the model explains how abstract beings can propagate, mutate, or disappear as they traverse diverse cognitive geometries. The account unifies insights from conceptual spaces, social epistemology, and AI value alignment by grounding meaning preservation in structural compatibility rather than shared information or rationality. I argue that this cognitive-geometric perspective clarifies the epistemic boundaries of influence in both human and artificial systems, and offers a general foundation for analyzing belief dynamics across heterogeneous agents.
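
The null-space criterion can be illustrated with a very small linear-algebra sketch. The receiver's interpretation map and the belief vectors below are hypothetical, and the code only demonstrates the basic idea that a vector in the null space of a linear map is sent to zero; it is not the paper's full model.

```python
def interpret(matrix, belief):
    """Apply a linear interpretation map (given as a list of rows) to a belief vector."""
    return [sum(a * b for a, b in zip(row, belief)) for row in matrix]

def survives(matrix, belief, tol=1e-12):
    """A belief survives transmission if its image is not the zero vector."""
    return any(abs(x) > tol for x in interpret(matrix, belief))

# Hypothetical receiver whose value space covers only the first two of the
# sender's three dimensions; the third dimension lies in the map's null space.
receiver_map = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]

belief_a = [0.3, 0.7, 0.0]   # expressed in dimensions the receiver has
belief_b = [0.0, 0.0, 1.0]   # expressed only in the missing dimension
belief_c = [0.3, 0.7, 5.0]   # mixed: part of it is lost in transmission

received_c = interpret(receiver_map, belief_c)
```

Here `belief_b` is the toy version of "belief death", while `belief_c` arrives looking identical to `belief_a`, a simple picture of distortion through partial loss.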

[AI-8] RIFT: A Scalable Methodology for LLM Accelerator Fault Assessment using Reinforcement Learning DATE2026

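Quick read: This paper addresses the prohibitive computational cost and poor coverage of critical failure modes that traditional fault assessment methods face on modern AI accelerators. The core question is how to efficiently identify minimal but highly damaging fault scenarios in large-scale AI hardware designs, so that fault assessment is both fast and high-coverage. The key to the solution is RIFT (Reinforcement Learning-guided Intelligent Fault Targeting), which casts the search for worst-case faults as a sequential decision-making problem, prunes the search space with hybrid sensitivity analysis, and uses reinforcement learning to generate minimal, high-impact test suites.

Link: https://arxiv.org/abs/2512.09829
Authors: Khurram Khalil, Muhammad Mahad Khaliq, Khaza Anuarul Hoque
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted in the IEEE DATE 2026 conference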

Abstract:The massive scale of modern AI accelerators presents critical challenges to traditional fault assessment methodologies, which face prohibitive computational costs and provide poor coverage of critical failure modes. This paper introduces RIFT (Reinforcement Learning-guided Intelligent Fault Targeting), a scalable framework that automates the discovery of minimal, high-impact fault scenarios for efficient design-time fault assessment. RIFT transforms the complex search for worst-case faults into a sequential decision-making problem, combining hybrid sensitivity analysis for search space pruning with reinforcement learning to intelligently generate minimal, high-impact test suites. Evaluated on billion-parameter Large Language Model (LLM) workloads using NVIDIA A100 GPUs, RIFT achieves a 2.2× fault assessment speedup over evolutionary methods and reduces the required test vector volume by over 99% compared to random fault injection, all while achieving superior fault coverage. The proposed framework also provides actionable data to enable intelligent hardware protection strategies, demonstrating that RIFT-guided selective error correction code provides a 12.8× improvement in cost-effectiveness (coverage per unit area) compared to uniform triple modular redundancy protection. RIFT automatically generates UVM-compliant verification artifacts, ensuring its findings are directly actionable and integrable into commercial RTL verification workflows.

[AI-9] Quantifying Uncertainty in Machine Learning-Based Pervasive Systems: Application to Human Activity Recognition

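Quick read: This paper addresses the reliability of runtime predictions from machine learning (ML) models, whose operating range is uncertain because they are trained on high-dimensional examples rather than coded by hand. This matters especially in complex, evolving settings such as Human Activity Recognition (HAR), where the rigorous testing that conventional software development relies on no longer suffices. The key to the solution is a method for quantifying uncertainty in ML-based systems that adapts and jointly uses a set of selected techniques to evaluate the relevance of model predictions at runtime, providing assistance to domain experts.

Link: https://arxiv.org/abs/2512.09775
Authors: Vladimir Balditsyn, Philippe Lalanda, German Vega, Stéphanie Chollet
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: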

Abstract:The recent convergence of pervasive computing and machine learning has given rise to numerous services, impacting almost all areas of economic and social activity. However, the use of AI techniques precludes certain standard software development practices, which emphasize rigorous testing to ensure the elimination of all bugs and adherence to well-defined specifications. ML models are trained on numerous high-dimensional examples rather than being manually coded. Consequently, the boundaries of their operating range are uncertain, and they cannot guarantee absolute error-free performance. In this paper, we propose to quantify uncertainty in ML-based systems. To achieve this, we propose to adapt and jointly utilize a set of selected techniques to evaluate the relevance of model predictions at runtime. We apply and evaluate these proposals in the highly heterogeneous and evolving domain of Human Activity Recognition (HAR). The results presented demonstrate the relevance of the approach, and we discuss in detail the assistance provided to domain experts.
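
The abstract does not name the uncertainty techniques it combines. As one generic example of a runtime signal, the sketch below computes the normalized entropy of a classifier's predicted class distribution and flags high-entropy predictions. The `threshold` value, the four-class setup, and the function names are assumptions for illustration, not the paper's method.

```python
import math

def softmax(logits):
    """Convert logits to class probabilities, computed stably."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def normalized_entropy(probs):
    """Shannon entropy divided by log(K): 0 means certain, 1 means uniform."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))

def flag_uncertain(logits, threshold=0.5):
    """Flag a prediction whose class distribution is too spread out."""
    return normalized_entropy(softmax(logits)) > threshold

# Hypothetical logits for four activities (e.g. walk, run, sit, stand).
confident = [10.0, 0.0, 0.0, 0.0]
ambiguous = [0.1, 0.0, 0.05, 0.0]
```

A flagged prediction could be withheld or passed to a domain expert instead of being acted on automatically; entropy alone says nothing about inputs far from the training data, which is one reason to combine several techniques.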

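[AI-10] Circuits, Features, and Heuristics in Molecular Transformers

Quick read: This paper addresses the "black box" nature of generative AI for molecular structure generation, that is, the lack of understanding of how such models learn and capture chemical rules. The core of the solution is a mechanistic analysis of autoregressive Transformer models trained on drug-like small molecules. The analysis identifies computational patterns ranging from low-level syntactic parsing to more abstract chemical validity constraints, and it uses sparse autoencoders (SAEs) to extract feature dictionaries associated with chemically relevant activation patterns, so that mechanistic insights can translate into better predictive performance on downstream tasks.

Link: https://arxiv.org/abs/2512.09757
Authors: Kristof Varadi, Mark Marosi, Peter Antal
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: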

Abstract:Transformers generate valid and diverse chemical structures, but little is known about the mechanisms that enable these models to capture the rules of molecular representation. We present a mechanistic analysis of autoregressive transformers trained on drug-like small molecules to reveal the computational structure underlying their capabilities across multiple levels of abstraction. We identify computational patterns consistent with low-level syntactic parsing and more abstract chemical validity constraints. Using sparse autoencoders (SAEs), we extract feature dictionaries associated with chemically relevant activation patterns. We validate our findings on downstream tasks and find that mechanistic insights can translate to predictive performance in various practical settings.

[AI-11] Analyzing Planner Design Trade-offs for MAPF under Realistic Simulation

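Quick read: This paper addresses the gap between how Multi-Agent Path Finding (MAPF) algorithms are evaluated and how they perform in industrial settings: existing benchmarks usually rely on simplified robot models and do not reflect execution under real physical constraints. The key to the solution is using evaluation platforms that include kinodynamic modeling (such as SMART) to study three design factors systematically: (1) the relationship between solution optimality and execution performance; (2) the sensitivity of system performance to inaccuracies in the kinodynamic model; and (3) the interaction between model accuracy and plan optimality. The empirical analysis exposes the main design trade-offs and points to research directions for deploying MAPF algorithms in the real world.

Link: https://arxiv.org/abs/2512.09736
Authors: Jingtian Yan, Zhifei Li, William Kang, Stephen F. Smith, Jiaoyang Li
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: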

Abstract:Multi-Agent Path Finding (MAPF) algorithms are increasingly deployed in industrial warehouses and automated manufacturing facilities, where robots must operate reliably under real-world physical constraints. However, existing MAPF evaluation frameworks typically rely on simplified robot models, leaving a substantial gap between algorithmic benchmarks and practical performance. Recent frameworks such as SMART incorporate kinodynamic modeling and offer the MAPF community a platform for large-scale, realistic evaluation. Building on this capability, this work investigates how key planner design choices influence performance under realistic execution settings. We systematically study three fundamental factors: (1) the relationship between solution optimality and execution performance, (2) the sensitivity of system performance to inaccuracies in kinodynamic modeling, and (3) the interaction between model accuracy and plan optimality. Empirically, we examine these factors to understand how these design choices affect performance in realistic scenarios. We highlight open challenges and research directions to steer the community toward practical, real-world deployment.

[AI-12] Ethics Readiness of Artificial Intelligence: A Practical Evaluation Method

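【Quick read】: This paper asks how abstract ethical principles can be effectively integrated into the design of AI systems, to overcome the disconnect between traditional ethical frameworks and engineering practice. The key is a four-level iterative method called Ethics Readiness Levels (ERLs), which makes ethical reflection operational by turning ethical values into concrete prompts, checks, and controls within specific use cases, and which uses a dynamic, tree-like questionnaire to adapt the evaluation to the technology and application context. ERLs serve as a managerial tool that fosters cross-disciplinary collaboration, and their scoring system supports tracking progress in ethics integration over time, promoting a shift from technology-led solutionism to an ethics-by-design paradigm.

Link: https://arxiv.org/abs/2512.09729
Authors: Laurynas Adomaitis, Vincent Israel-Jost, Alexei Grinbaum
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 23 pages. Data available on GitHub at this https URL

Click to view abstract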

Abstract:We present Ethics Readiness Levels (ERLs), a four-level, iterative method to track how ethical reflection is implemented in the design of AI systems. ERLs bridge high-level ethical principles and everyday engineering by turning ethical values into concrete prompts, checks, and controls within real use cases. The evaluation is conducted using a dynamic, tree-like questionnaire built from context-specific indicators, ensuring relevance to the technology and application domain. Beyond being a managerial tool, ERLs help facilitate a structured dialogue between ethics experts and technical teams, while our scoring system helps track progress over time. We demonstrate the methodology through two case studies: an AI facial sketch generator for law enforcement and a collaborative industrial robot. The ERL tool effectively catalyzes concrete design changes and promotes a shift from narrow technological solutionism to a more reflective, ethics-by-design mindset.

[AI-13] Gaussian Process Aggregation for Root-Parallel Monte Carlo Tree Search with Continuous Actions

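【Quick read】: This paper addresses how best to aggregate statistics from different threads in root-parallel Monte Carlo Tree Search (MCTS) in environments with continuous action spaces. Existing methods lack an effective way to estimate the value of unexplored actions, which limits performance. The core of the solution is Gaussian Process Regression (GPR): by modeling the values of actions that have been explored, it predicts the values of promising actions that were never tried, improving the accuracy of action evaluation. Experiments across 6 different domains show that the method outperforms existing aggregation strategies with only a modest increase in inference time.

Link: https://arxiv.org/abs/2512.09727
Authors: Junlin Xiao, Victor-Alexandru Darvariu, Bruno Lacerda, Nick Hawes
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract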

Abstract:Monte Carlo Tree Search is a cornerstone algorithm for online planning, and its root-parallel variant is widely used when wall clock time is limited but best performance is desired. In environments with continuous action spaces, how to best aggregate statistics from different threads is an important yet underexplored question. In this work, we introduce a method that uses Gaussian Process Regression to obtain value estimates for promising actions that were not trialed in the environment. We perform a systematic evaluation across 6 different domains, demonstrating that our approach outperforms existing aggregation strategies while requiring a modest increase in inference time.
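
To make the aggregation idea concrete, here is a minimal sketch rather than the authors' implementation. It assumes one-dimensional actions, an RBF kernel with hand-picked hyperparameters, and invented root statistics from two search threads; all function names are hypothetical. It pools the (action, mean return) pairs from both threads and uses the Gaussian process posterior mean to score candidate actions, including ones that no thread tried.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.5):
    """Squared-exponential kernel between two 1-D arrays of actions."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_posterior_mean(train_x, train_y, query_x, noise=1e-2):
    """Posterior mean of a zero-mean GP regression at the query points."""
    k_tt = rbf_kernel(train_x, train_x) + noise * np.eye(len(train_x))
    k_qt = rbf_kernel(query_x, train_x)
    return k_qt @ np.linalg.solve(k_tt, train_y)

# Hypothetical root statistics from two threads: (action, mean return).
thread_a = [(0.1, 0.20), (0.4, 0.70)]
thread_b = [(0.6, 0.75), (0.9, 0.30)]
pooled = thread_a + thread_b
actions = np.array([a for a, _ in pooled])
values = np.array([v for _, v in pooled])

# Score a grid of candidate actions, most of which were never trialed.
candidates = np.linspace(0.0, 1.0, 11)
estimates = gp_posterior_mean(actions, values, candidates)
best_action = float(candidates[np.argmax(estimates)])
```

A real planner would also need to weight each observation by its visit count and choose the kernel hyperparameters; the abstract does not say how the paper handles either, so both are left out here.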

[AI-14] Dynamic one-time delivery of critical data by small and sparse UAV swarms: a model problem for MARL scaling studies

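【Quick read】: This paper addresses how multiple unmanned aerial vehicles (UAVs) under decentralized control can cooperate efficiently to relay a critical data package to a specified position. It studies Multi-Agent Reinforcement Learning (MARL) on a family of deterministic games and designs a robust baseline policy that restricts agent motion envelopes and plans paths with Dijkstra's algorithm, providing a performance reference for MARL algorithms. Experiments show that two off-the-shelf MARL algorithms come close to the baseline for small swarms, but scalability becomes a challenge as the number of agents grows.

Link: https://arxiv.org/abs/2512.09682
Authors: Mika Persson, Jonas Lidman, Jacob Ljungberg, Samuel Sandelius, Adam Andersson
Affiliations: Unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Comments:

Click to view abstract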

Abstract:This work presents a conceptual study on the application of Multi-Agent Reinforcement Learning (MARL) for decentralized control of unmanned aerial vehicles to relay a critical data package to a known position. For this purpose, a family of deterministic games is introduced, designed for scaling studies for MARL. A robust baseline policy is proposed, which is based on restricting agent motion envelopes and applying Dijkstra’s algorithm. Experimental results show that two off-the-shelf MARL algorithms perform competitively with the baseline for a small number of agents, but scalability issues arise as the number of agents increases.
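
As a rough illustration of the baseline's two ingredients, the sketch below runs Dijkstra's algorithm on a grid restricted to a set of permitted cells. The 4-connected grid, unit step costs, and the `envelope` set are assumptions made for this example; the abstract does not describe how the paper actually defines motion envelopes or the game graph.

```python
import heapq

def dijkstra(allowed, start, goal):
    """Shortest step count on a 4-connected grid restricted to `allowed` cells."""
    dist = {start: 0}
    queue = [(0, start)]
    while queue:
        d, cell = heapq.heappop(queue)
        if cell == goal:
            return d
        if d > dist.get(cell, float("inf")):
            continue  # stale queue entry, a shorter route was already found
        x, y = cell
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nxt in allowed and d + 1 < dist.get(nxt, float("inf")):
                dist[nxt] = d + 1
                heapq.heappush(queue, (d + 1, nxt))
    return None  # goal unreachable inside the envelope

# Hypothetical envelope: a 6x3 block of cells with two cells excluded.
envelope = {(x, y) for x in range(6) for y in range(3)}
envelope.discard((2, 0))
envelope.discard((2, 1))
steps = dijkstra(envelope, (0, 0), (5, 0))  # detours through (2, 2): 9 steps
```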

[AI-15] Drawback of Enforcing Equivariance and its Compensation via the Lens of Expressive Power

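【Quick read】: This paper addresses the limited theoretical understanding of the expressive power of equivariant neural networks, examining, for 2-layer ReLU networks, how equivariance constraints affect representational capacity. It finds that equivariance constraints can strictly limit expressive power, but that this drawback can be compensated by enlarging the model. The key point is that even with the larger model, the corresponding hypothesis space can still have lower complexity, which points to better generalization for equivariant networks.

Link: https://arxiv.org/abs/2512.09673
Authors: Yuzhu Chen, Tian Qin, Xinmei Tian, Fengxiang He, Dacheng Tao
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
Comments:

Click to view abstract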

Abstract:Equivariant neural networks encode symmetry as an inductive bias and have achieved strong empirical performance in wide domains. However, their expressive power remains not well understood. Focusing on 2-layer ReLU networks, this paper investigates the impact of equivariance constraints on the expressivity of equivariant and layer-wise equivariant networks. By examining the boundary hyperplanes and the channel vectors of ReLU networks, we construct an example showing that equivariance constraints could strictly limit expressive power. However, we demonstrate that this drawback can be compensated via enlarging the model size. Furthermore, we show that despite a larger model size, the resulting architecture could still correspond to a hypothesis space with lower complexity, implying superior generalizability for equivariant networks.

[AI-16] An End-to-end Planning Framework with Agentic LLMs and PDDL

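【Quick read】: This paper addresses planning tasks described in natural language, which may contain ambiguities and contradictions and are hard to turn directly into executable plans, in particular the lack of an effective bridge between classical planning systems (such as those based on PDDL) and human intent. The core of the solution is an end-to-end framework in which an LLM-driven orchestrator automatically converts a natural-language specification into a structured PDDL model, and several sub-modules (agents) iteratively refine the domain and problem definitions to handle common requirements such as time constraints and optimality as well as inconsistencies in the original input. The framework then calls an external PDDL planning engine to generate a plan and uses a translation module to render it as a sequence of natural-language steps, keeping it readable while preserving the correctness of each step. The approach requires no human intervention and is compatible with several mainstream PDDL validators and planning engines, improving the practicality and reliability of LLMs on complex planning tasks.

Link: https://arxiv.org/abs/2512.09629
Authors: Emanuele La Malfa, Ping Zhu, Samuele Marro, Sara Bernardini, Michael Wooldridge
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Code: this https URL

Click to view abstract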

Abstract:We present an end-to-end framework for planning supported by verifiers. An orchestrator receives a human specification written in natural language and converts it into a PDDL (Planning Domain Definition Language) model, where the domain and problem are iteratively refined by sub-modules (agents) to address common planning requirements, such as time constraints and optimality, as well as ambiguities and contradictions that may exist in the human specification. The validated domain and problem are then passed to an external planning engine to generate a plan. The orchestrator and agents are powered by Large Language Models (LLMs) and require no human intervention at any stage of the process. Finally, a module translates the final plan back into natural language to improve human readability while maintaining the correctness of each step. We demonstrate the flexibility and effectiveness of our framework across various domains and tasks, including the Google NaturalPlan benchmark and PlanBench, as well as planning problems like Blocksworld and the Tower of Hanoi (where LLMs are known to struggle even with small instances). Our framework can be integrated with any PDDL planning engine and validator (such as Fast Downward, LPG, POPF, VAL, and uVAL, which we have tested) and represents a significant step toward end-to-end planning aided by LLMs.

[AI-17] Stanford Sleep Bench: Evaluating Polysomnography Pre-training Methods for Sleep Foundation Models

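【Quick read】: This paper addresses two key problems in research on sleep foundation models: the lack of a shared dataset and benchmark with diverse tasks for training and evaluation, and the absence of a systematic evaluation of self-supervised representation learning (SSRL) methods on sleep-related tasks. The key to the solution is Stanford Sleep Bench, which the authors plan to release: a large-scale dataset of 17,467 multimodal polysomnography (PSG) recordings totaling more than 163,000 hours, covering 13 clinical disease prediction tasks together with canonical sleep analysis tasks (sleep staging, apnea diagnosis, and age estimation). A systematic evaluation of several SSRL pre-training methods on this benchmark finds that contrastive learning significantly outperforms other methods on mortality and disease prediction and converges faster during pre-training, giving foundation-model development for sleep analysis reproducible infrastructure and empirical evidence.

Link: https://arxiv.org/abs/2512.09591
Authors: Magnus Ruud Kjaer, Rahul Thapa, Gauri Ganjoo, Hyatt Moore IV, Poul Joergen Jennum, Brandon M. Westover, James Zou, Emmanuel Mignot, Bryan He, Andreas Brink-Kjaer
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract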

Abstract:Polysomnography (PSG), the gold standard test for sleep analysis, generates vast amounts of multimodal clinical data, presenting an opportunity to leverage self-supervised representation learning (SSRL) for pre-training foundation models to enhance sleep analysis. However, progress in sleep foundation models is hindered by two key limitations: (1) the lack of a shared dataset and benchmark with diverse tasks for training and evaluation, and (2) the absence of a systematic evaluation of SSRL approaches across sleep-related tasks. To address these gaps, we introduce Stanford Sleep Bench, a large-scale PSG dataset comprising 17,467 recordings totaling over 163,000 hours from a major sleep clinic, including 13 clinical disease prediction tasks alongside canonical sleep-related tasks such as sleep staging, apnea diagnosis, and age estimation. We systematically evaluate SSRL pre-training methods on Stanford Sleep Bench, assessing downstream performance across four tasks: sleep staging, apnea diagnosis, age estimation, and disease and mortality prediction. Our results show that multiple pretraining methods achieve comparable performance for sleep staging, apnea diagnosis, and age estimation. However, for mortality and disease prediction, contrastive learning significantly outperforms other approaches while also converging faster during pretraining. To facilitate reproducibility and advance sleep research, we will release Stanford Sleep Bench along with pretrained model weights, training pipelines, and evaluation code.

[AI-18] Auto-BenchmarkCard: Automated Synthesis of Benchmark Documentation

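【Quick read】: This paper addresses the incomplete or inconsistent documentation of AI benchmarks, which makes comparison across tasks or domains difficult and undermines the transparency, comparability, and reusability of research. The key to the solution is the Auto-BenchmarkCard workflow: multiple agents extract data from heterogeneous sources (such as Hugging Face, Unitxt, and academic papers), a large language model (LLM) synthesizes it, and the FactReasoner tool then verifies the factual accuracy of the generated content through atomic entailment scoring, yielding structured, validated benchmark descriptions.

Link: https://arxiv.org/abs/2512.09577
Authors: Aris Hofmann, Inge Vejsbjerg, Dhaval Salwala, Elizabeth M. Daly
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract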

Abstract:We present Auto-BenchmarkCard, a workflow for generating validated descriptions of AI benchmarks. Benchmark documentation is often incomplete or inconsistent, making it difficult to interpret and compare benchmarks across tasks or domains. Auto-BenchmarkCard addresses this gap by combining multi-agent data extraction from heterogeneous sources (e.g., Hugging Face, Unitxt, academic papers) with LLM-driven synthesis. A validation phase evaluates factual accuracy through atomic entailment scoring using the FactReasoner tool. This workflow has the potential to promote transparency, comparability, and reusability in AI benchmark reporting, enabling researchers and practitioners to better navigate and evaluate benchmark choices.

[AI-19] The Gender Code: Gendering the Global Governance of Artificial Intelligence

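【Quick read】: This paper addresses the insufficient attention that international AI governance frameworks pay to gender issues and gender-based harms. Analyzing binding regulations such as the EU AI Act, soft-law instruments such as the UNESCO Recommendations on AI Ethics, and global initiatives such as the Global Partnership on AI (GPAI), it finds that current governance is gradually integrating gender concerns into broader human rights frameworks and placing growing emphasis on inclusivity and diversity. Critical gaps remain, however, including inconsistent treatment of gender, limited engagement with intersectionality, and weak enforcement mechanisms. The paper argues that effective AI governance must be intersectional, enforceable, and inclusive, which is central to moving beyond tokenism, achieving meaningful equity, and avoiding the reinforcement of existing inequalities.

Link: https://arxiv.org/abs/2512.09570
Authors: Jelena Cupac
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: The paper is part of the Handbook on the Global Governance of Artificial Intelligence, forthcoming with Edward Elgar Publishing

Click to view abstract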

Abstract:This paper examines how international AI governance frameworks address gender issues and gender-based harms. The analysis covers binding regulations, such as the EU AI Act; soft law instruments, like the UNESCO Recommendations on AI Ethics; and global initiatives, such as the Global Partnership on AI (GPAI). These instruments reveal emerging trends, including the integration of gender concerns into broader human rights frameworks, a shift toward explicit gender-related provisions, and a growing emphasis on inclusivity and diversity. Yet, some critical gaps persist, including inconsistent treatment of gender across governance documents, limited engagement with intersectionality, and a lack of robust enforcement mechanisms. However, this paper argues that effective AI governance must be intersectional, enforceable, and inclusive. This is key to moving beyond tokenism toward meaningful equity and preventing reinforcement of existing inequalities. The study contributes to ethical AI debates by highlighting the importance of gender-sensitive governance in building a just technological future.

[AI-20] Toward Closed-loop Molecular Discovery via Language Model Property Alignment and Strategic Search

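【Quick read】: This paper addresses the low efficiency and poor success rates of virtual screening in traditional drug discovery, as well as the inadequate generalization, limited interpretability, and overemphasis on binding affinity at the expense of key pharmacological properties seen in generative AI models for molecular design. The key to the solution is the Trio framework, which integrates fragment-based molecular language modeling, reinforcement learning, and Monte Carlo tree search to achieve context-aware fragment assembly, enforce physicochemical and synthetic feasibility, and balance exploration of novel chemotypes against exploitation of promising intermediates, enabling effective and interpretable closed-loop targeted molecular design.

Link: https://arxiv.org/abs/2512.09566
Authors: Junkai Ji, Zhangfan Yang, Dong Xu, Ruibin Bai, Jianqiang Li, Tingjun Hou, Zexuan Zhu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 21 pages, 5 figures

Click to view abstract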

Abstract:Drug discovery is a time-consuming and expensive process, with traditional high-throughput and docking-based virtual screening hampered by low success rates and limited scalability. Recent advances in generative modelling, including autoregressive, diffusion, and flow-based approaches, have enabled de novo ligand design beyond the limits of enumerative screening. Yet these models often suffer from inadequate generalization, limited interpretability, and an overemphasis on binding affinity at the expense of key pharmacological properties, thereby restricting their translational utility. Here we present Trio, a molecular generation framework integrating fragment-based molecular language modeling, reinforcement learning, and Monte Carlo tree search, for effective and interpretable closed-loop targeted molecular design. Through the three key components, Trio enables context-aware fragment assembly, enforces physicochemical and synthetic feasibility, and guides a balanced search between the exploration of novel chemotypes and the exploitation of promising intermediates within protein binding pockets. Experimental results show that Trio reliably achieves chemically valid and pharmacologically enhanced ligands, outperforming state-of-the-art approaches with improved binding affinity (+7.85%), drug-likeness (+11.10%) and synthetic accessibility (+12.05%), while expanding molecular diversity more than fourfold.

[AI-21] SWEnergy: An Empirical Study on Energy Efficiency in Agentic Issue Resolution Frameworks with SLMs ICSE2026

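【Quick read】: This paper addresses the difficulty of deploying LLM-based autonomous software engineering agents efficiently in resource-constrained settings, in particular the low task resolution rates and high energy consumption that result when these agents are restricted to Small Language Models (SLMs). The study finds that framework architecture is the main determinant of energy consumption, but that current agentic frameworks, designed for powerful LLMs, generate many unproductive reasoning loops when run on SLMs, wasting much of that energy. The key to a solution is moving from passive orchestration to architectures that actively manage the limitations of SLMs, so that tasks can be carried out with low energy use and high efficiency.

Link: https://arxiv.org/abs/2512.09543
Authors: Arihant Tripathy, Ch Pavan Harshit, Karthik Vaidhyanathan
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 8 pages, 5 figures, 1 table. Accepted to AGENT 2026 (ICSE 2026 workshop)

Click to view abstract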

Abstract:Context. LLM-based autonomous agents in software engineering rely on large, proprietary models, limiting local deployment. This has spurred interest in Small Language Models (SLMs), but their practical effectiveness and efficiency within complex agentic frameworks for automated issue resolution remain poorly understood. Goal. We investigate the performance, energy efficiency, and resource consumption of four leading agentic issue resolution frameworks when deliberately constrained to using SLMs. We aim to assess the viability of these systems for this task in resource-limited settings and characterize the resulting trade-offs. Method. We conduct a controlled evaluation of four leading agentic frameworks (SWE-Agent, OpenHands, Mini SWE Agent, AutoCodeRover) using two SLMs (Gemma-3 4B, Qwen-3 1.7B) on the SWE-bench Verified Mini benchmark. On fixed hardware, we measure energy, duration, token usage, and memory over 150 runs per configuration. Results. We find that framework architecture is the primary driver of energy consumption. The most energy-intensive framework, AutoCodeRover (Gemma), consumed 9.4x more energy on average than the least energy-intensive, OpenHands (Gemma). However, this energy is largely wasted. Task resolution rates were near-zero, demonstrating that current frameworks, when paired with SLMs, consume significant energy on unproductive reasoning loops. The SLM’s limited reasoning was the bottleneck for success, but the framework’s design was the bottleneck for efficiency. Conclusions. Current agentic frameworks, designed for powerful LLMs, fail to operate efficiently with SLMs. We find that framework architecture is the primary driver of energy consumption, but this energy is largely wasted due to the SLMs’ limited reasoning. Viable low-energy solutions require shifting from passive orchestration to architectures that actively manage SLM weaknesses.

[AI-22] Representation Invariance and Allocation: When Subgroup Balance Matters

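【Quick read】: This paper addresses how unequal representation of demographic groups in training data affects a model's generalization across groups, and in particular challenges the conventional assumption that balancing subgroup data optimizes performance. It finds that in some cases imbalanced data actually improves subgroup performance, while in others the absence of an entire subgroup from training leaves that subgroup's performance unaffected. The key contribution is the latent separation hypothesis: a partially fine-tuned model's dependence on subgroup representation is determined by the degree of separation between subgroups in the latent space of the pre-trained model. The hypothesis is formalized, analyzed theoretically, and validated empirically, and the paper shows that quantitative analysis of latent subgroup separation can inform data collection and balancing decisions when fine-tuning foundation models.

Link: https://arxiv.org/abs/2512.09496
Authors: Anissa Alloula, Charles Jones, Zuzanna Wakefield-Skorniewska, Francesco Quinzan, Bartłomiej Papież
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract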

Abstract:Unequal representation of demographic groups in training data poses challenges to model generalisation across populations. Standard practice assumes that balancing subgroup representation optimises performance. However, recent empirical results contradict this assumption: in some cases, imbalanced data distributions actually improve subgroup performance, while in others, subgroup performance remains unaffected by the absence of an entire subgroup during training. We conduct a systematic study of subgroup allocation across four vision and language models, varying training data composition to characterise the sensitivity of subgroup performance to data balance. We propose the latent separation hypothesis, which states that a partially fine-tuned model’s dependence on subgroup representation is determined by the degree of separation between subgroups in the latent space of the pre-trained model. We formalise this hypothesis, provide theoretical analysis, and validate it empirically. Finally, we present a practical application to foundation model fine-tuning, demonstrating that quantitative analysis of latent subgroup separation can inform data collection and balancing decisions.

[AI-23] Advancing LLM-Based Security Automation with Customized Group Relative Policy Optimization for Zero-Touch Networks

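【Quick read】: This paper addresses security automation for Zero-Touch Networks (ZTNs) in sixth-generation (6G) mobile networks. The specific challenges are: 1) automating the full lifecycle from security strategy generation to validation and update under real-world, parallel, and adversarial conditions; and 2) adapting security strategies to evolving threats and dynamic network environments. The solution has two core components. The first is SecLoop, described as the first framework to automate the full security-strategy lifecycle with large language models (LLMs), covering generation, orchestration, response, and feedback, which tackles the first challenge. The second is SA-GRPO, a security-aware group relative policy optimization algorithm that iteratively refines security strategies by contrasting group feedback collected from parallel SecLoop executions, which addresses the second challenge.

Link: https://arxiv.org/abs/2512.09485
Authors: Xinye Cao, Yihan Lin, Guoshun Nan, Qinchuan Zhou, Yuhang Luo, Yurui Gao, Zeliang Zhang, Haolang Lu, Qimei Cui, Yanzhao Hou, Xiaofeng Tao, Tony Q.S. Quek
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Accepted by IEEE JSAC. This work has been submitted to the IEEE for possible publication

Click to view abstract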

Abstract:Zero-Touch Networks (ZTNs) represent a transformative paradigm toward fully automated and intelligent network management, providing the scalability and adaptability required for the complexity of sixth-generation (6G) networks. However, the distributed architecture, high openness, and deep heterogeneity of 6G networks expand the attack surface and pose unprecedented security challenges. To address this, security automation aims to enable intelligent security management across dynamic and complex environments, serving as a key capability for securing 6G ZTNs. Despite its promise, implementing security automation in 6G ZTNs presents two primary challenges: 1) automating the lifecycle from security strategy generation to validation and update under real-world, parallel, and adversarial conditions, and 2) adapting security strategies to evolving threats and dynamic environments. This motivates us to propose SecLoop and SA-GRPO. SecLoop constitutes the first fully automated framework that integrates large language models (LLMs) across the entire lifecycle of security strategy generation, orchestration, response, and feedback, enabling intelligent and adaptive defenses in dynamic network environments, thus tackling the first challenge. Furthermore, we propose SA-GRPO, a novel security-aware group relative policy optimization algorithm that iteratively refines security strategies by contrasting group feedback collected from parallel SecLoop executions, thereby addressing the second challenge. Extensive real-world experiments on five benchmarks, including 11 MITRE ATT&CK processes and over 20 types of attacks, demonstrate the superiority of the proposed SecLoop and SA-GRPO. We will release our platform to the community, facilitating the advancement of security automation towards next-generation communications.

[AI-24] Architectures for Building Agentic AI

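【Quick read】: This paper addresses the reliability of generative AI and agentic AI systems, that is, how to keep their behavior predictable, safe, and controllable in complex, dynamic environments. The key to the solution is treating reliability as an architectural property, achieved through principled componentisation: a clear separation of core components including the goal manager, planner, tool-router, executor, memory, verifiers, safety monitor, and telemetry. It also relies on constrained interface design (typed schemas, least privilege, input validation) and explicit control and assurance loops, in order to build agent systems that can be audited, intervened in, and recovered.

Link: https://arxiv.org/abs/2512.09458
Authors: Sławomir Nowaczyk
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: This is a preprint of a chapter accepted for publication in Generative and Agentic AI Reliability: Architectures, Challenges, and Trust for Autonomous Systems, published by Springer Nature

Click to view abstract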

Abstract:This chapter argues that the reliability of agentic and generative AI is chiefly an architectural property. We define agentic systems as goal-directed, tool-using decision makers operating in closed loops, and show how reliability emerges from principled componentisation (goal manager, planner, tool-router, executor, memory, verifiers, safety monitor, telemetry), disciplined interfaces (schema-constrained, validated, least-privilege tool calls), and explicit control and assurance loops. Building on classical foundations, we propose a practical taxonomy (tool-using agents, memory-augmented agents, planning and self-improvement agents, multi-agent systems, and embodied or web agents) and analyse how each pattern reshapes the reliability envelope and failure modes. We distil design guidance on typed schemas, idempotency, permissioning, transactional semantics, memory provenance and hygiene, runtime governance (budgets, termination conditions), and simulate-before-actuate safeguards.
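
One way to picture a "schema-constrained, validated, least-privilege tool call" is the small sketch below. It is an illustration under assumptions, not code from the chapter: `ToolSpec`, `validate_call`, and the permission string `fs.read` are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolSpec:
    name: str
    required_args: dict  # argument name -> expected Python type
    permission: str      # the single permission this tool needs

def validate_call(spec, args, granted_permissions):
    """Reject a tool call unless it matches the schema and the granted permissions."""
    if spec.permission not in granted_permissions:
        raise PermissionError(f"{spec.name} needs '{spec.permission}'")
    if set(args) != set(spec.required_args):
        raise ValueError(f"{spec.name} expects exactly {sorted(spec.required_args)}")
    for key, expected in spec.required_args.items():
        if not isinstance(args[key], expected):
            raise TypeError(f"{key} must be {expected.__name__}")
    return True

# Hypothetical tool: reading a file requires only the 'fs.read' permission.
read_file = ToolSpec("read_file", {"path": str}, "fs.read")
ok = validate_call(read_file, {"path": "notes.txt"}, {"fs.read"})
```

Checking permissions before arguments means an agent without the right grant learns nothing about the tool's schema from the error, which is one possible reading of "least privilege" at the interface.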

[AI-25] Advancing Research via Human-AI Interactive Theorem Proving

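【Quick read】: This paper addresses how to use large language models (LLMs) as research tools in scientific computing while preserving mathematical rigor. The core challenge is letting LLMs take part in theorem proving and mathematical discovery without sacrificing the correctness and controllability of the reasoning. The key to the solution is a human-in-the-loop workflow: human experts define the problem and the admissible assumptions, while the LLM searches for proofs or contradictions, proposes candidate properties and theorems, and constructs structures and parameters that satisfy explicit constraints, supported by numerical experiments and simple verification checks. Experts then treat the model's outputs as raw material and refine them into precise statements and rigorous proofs. The framework is demonstrated in a case study on the connection between manifold optimization and Grover's quantum search algorithm, where it helps identify invariant subspaces, explore Grover-compatible retractions, and obtain convergence guarantees for the retraction-based gradient method, providing a reusable template for integrating LLMs into frontier mathematical research.

Link: https://arxiv.org/abs/2512.09443
Authors: Chenyi Li, Zhijian Lai, Dong An, Jiang Hu, Zaiwen Wen
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments:

Click to view abstract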

Abstract:We investigate how large language models can be used as research tools in scientific computing while preserving mathematical rigor. We propose a human-in-the-loop workflow for interactive theorem proving and discovery with LLMs. Human experts retain control over problem formulation and admissible assumptions, while the model searches for proofs or contradictions, proposes candidate properties and theorems, and helps construct structures and parameters that satisfy explicit constraints, supported by numerical experiments and simple verification checks. Experts treat these outputs as raw material, further refine them, and organize the results into precise statements and rigorous proofs. We instantiate this workflow in a case study on the connection between manifold optimization and Grover’s quantum search algorithm, where the pipeline helps identify invariant subspaces, explore Grover-compatible retractions, and obtain convergence guarantees for the retraction-based gradient method. The framework provides a practical template for integrating large language models into frontier mathematical research, enabling faster exploration of proof space and algorithm design while maintaining transparent reasoning responsibilities. Although illustrated on manifold optimization problems in quantum computing, the principles extend to other core areas of scientific computing.

[AI-26] ODMA: On-Demand Memory Allocation Framework for LLM Serving on LPDDR-Class Accelerators

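【Quick read】: This paper addresses the performance bottlenecks of conventional memory management when serving large language models (LLMs) on accelerators with limited random-access bandwidth (such as LPDDR5-based devices). Static pre-allocation wastes memory, fine-grained paging (such as PagedAttention) is ill-suited because of high random-access costs, and HBM-centric solutions do not exploit the characteristics of random-access-constrained memory (RACM) accelerators such as the Cambricon MLU370. The core of the solution is ODMA (On-Demand Memory Allocation), which couples a lightweight length predictor with dynamic bucket partitioning and adds a large-bucket safeguard to handle distribution drift and heavy-tailed requests; bucket boundaries are periodically updated from live traces to maximize memory utilization. On a Cambricon MLU370-X4, this raises memory utilization from 55.05% to 72.45% and improves RPS and TPS by 29% and 27% over static baselines.

Link: https://arxiv.org/abs/2512.09427
Authors: Guoqiang Zou, Wanyu Wang, Hao Zheng, Longxiang Yin, Yinhe Han
Affiliations: Unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments: 10 pages, 5 figures

Click to view abstract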

Abstract:Serving large language models (LLMs) on accelerators with poor random-access bandwidth (e.g., LPDDR5-based) is limited by current memory managers. Static pre-allocation wastes memory, while fine-grained paging (e.g., PagedAttention) is ill-suited due to high random-access costs. Existing HBM-centric solutions do not exploit the characteristics of random-access-constrained memory (RACM) accelerators like Cambricon MLU370. We present ODMA, an on-demand memory allocation framework for RACM. ODMA addresses distribution drift and heavy-tailed requests by coupling a lightweight length predictor with dynamic bucket partitioning and a large-bucket safeguard. Boundaries are periodically updated from live traces to maximize utilization. On Alpaca and Google-NQ, ODMA improves prediction accuracy of prior work significantly (e.g., from 82.68% to 93.36%). Serving DeepSeek-R1-Distill-Qwen-7B on Cambricon MLU370-X4, ODMA raises memory utilization from 55.05% to 72.45% and improves RPS and TPS by 29% and 27% over static baselines. This demonstrates that hardware-aware allocation unlocks efficient LLM serving on RACM platforms.
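
The interplay of length prediction, bucket partitioning, and the large-bucket safeguard can be sketched as follows. This is a simplified illustration under assumptions, not ODMA itself: the quantile-based boundary rule, the function names, and the sample lengths are all hypothetical, and the predicted length is simply passed in rather than produced by a learned predictor.

```python
import bisect

def bucket_boundaries(observed_lengths, num_buckets):
    """Pick bucket upper bounds at evenly spaced quantiles of recent lengths."""
    ordered = sorted(observed_lengths)
    bounds = []
    for i in range(1, num_buckets + 1):
        index = min(len(ordered) - 1, (i * len(ordered)) // num_buckets - 1)
        bounds.append(ordered[max(index, 0)])
    return sorted(set(bounds))

def allocate(predicted_length, bounds, max_length):
    """Return the token budget to reserve for a request."""
    slot = bisect.bisect_left(bounds, predicted_length)
    if slot == len(bounds):
        return max_length  # safeguard: prediction exceeds every bucket
    return bounds[slot]    # smallest bucket that fits the prediction

# Hypothetical recent output lengths (in tokens) from a live trace.
recent = [40, 55, 60, 120, 130, 150, 400, 420, 900, 1000]
bounds = bucket_boundaries(recent, num_buckets=5)  # [55, 120, 150, 420, 1000]
small = allocate(100, bounds, max_length=2048)     # 120
huge = allocate(1500, bounds, max_length=2048)     # 2048
```

Recomputing `bounds` from a sliding window of recent requests is one simple way to follow distribution drift; how often ODMA refreshes its boundaries is not stated in the abstract.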

[AI-27] Towards Resilient Transportation: A Conditional Transformer for Accident-Informed Traffic Forecasting

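Quick read: This paper targets the loss of accuracy in traffic forecasting that arises because existing models overlook the complex influence of external factors such as traffic accidents and traffic regulations. The authors present two enriched traffic datasets from Tokyo and California and propose ConFormer (Conditional Transformer), a framework that combines graph propagation with a guided normalization layer to dynamically adjust the relationships between spatial and temporal nodes, so that external-factor information is better integrated and prediction accuracy improves. ConFormer is reported to surpass STAEFormer in both predictive performance and efficiency, with lower computational cost and fewer parameters, and to outperform mainstream spatio-temporal baselines.

Link: https://arxiv.org/abs/2512.09398
Authors: Hongjun Wang, Jiawei Yong, Jiawei Wang, Shintaro Fukushima, Renhe Jiang
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: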

Abstract:Traffic prediction remains a key challenge in spatio-temporal data mining, despite progress in deep learning. Accurate forecasting is hindered by the complex influence of external factors such as traffic accidents and regulations, often overlooked by existing models due to limited data integration. To address these limitations, we present two enriched traffic datasets from Tokyo and California, incorporating traffic accident and regulation data. Leveraging these datasets, we propose ConFormer (Conditional Transformer), a novel framework that integrates graph propagation with guided normalization layer. This design dynamically adjusts spatial and temporal node relationships based on historical patterns, enhancing predictive accuracy. Our model surpasses the state-of-the-art STAEFormer in both predictive performance and efficiency, achieving lower computational costs and reduced parameter demands. Extensive evaluations demonstrate that ConFormer consistently outperforms mainstream spatio-temporal baselines across multiple metrics, underscoring its potential to advance traffic prediction research.

[AI-28] GAIR: GUI Automation via Information-Joint Reasoning and Group Reflection

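Quick read: This paper addresses the heterogeneous capabilities that GUI automation demands from multimodal large language models (MLLMs): tasks such as document processing, online shopping, and CAD design require multidimensional expertise that a single model struggles to cover. The proposed GAIR framework is a GUI automation agent built on information-joint reasoning and group reflection. A general-purpose MLLM acts as the decision maker and jointly processes information gathered from several GUI-specific MLLMs. When the general-purpose model judges that the information is insufficient, GAIR enters a group-reflection state in which it gives each GUI-specific model tailored instructions and hints, steering them to gather more relevant and accurate information for deeper reasoning and decision making.

Link: https://arxiv.org/abs/2512.09396
Authors: Zishu Wei, Qixiang Ma, Xavier Hu, Yuhang Liu, Hui Zang, Yudong Zhao, Tao Wang, Shengyu Zhang, Fei Wu
Affiliation: unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: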

Abstract:Building AI systems for GUI automation tasks has attracted remarkable research efforts, where MLLMs are leveraged to process user requirements and produce operations. However, GUI automation includes a wide range of tasks, from document processing to online shopping, from CAD to video editing. Diversity between particular tasks requires MLLMs for GUI automation to have heterogeneous capabilities and master multidimensional expertise, raising problems in constructing such a model. To address this challenge, we propose GAIR: GUI Automation via Information-Joint Reasoning and Group Reflection, a novel MLLM-based GUI automation agent framework designed for integrating knowledge and combining capabilities from heterogeneous models to build GUI automation agent systems with higher performance. Since different GUI-specific MLLMs are trained on different datasets and thus have different strengths, GAIR introduces a general-purpose MLLM for jointly processing the information from multiple GUI-specific models, further enhancing performance of the agent framework. The general-purpose MLLM also serves as decision maker, trying to execute a reasonable operation based on previously gathered information. When the general-purpose model thinks that there isn’t sufficient information for a reasonable decision, GAIR transitions into a group reflection state, where the general-purpose model provides GUI-specific models with different instructions and hints based on their strengths and weaknesses, driving them to gather information with more significance and accuracy that can support deeper reasoning and decision. We evaluated the effectiveness and reliability of GAIR through extensive experiments on GUI benchmarks.

[AI-29] BugSweeper: Function-Level Detection of Smart Contract Vulnerabilities Using Graph Neural Networks AAAI2026

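Quick read: This paper addresses the loss of context and the weak adaptability to new threats caused by hand-written, rule-based preprocessing in smart contract vulnerability detection. It proposes BugSweeper, an end-to-end deep learning framework that represents each Solidity function as a Function-Level Abstract Syntax Graph (FLAG), combining the Abstract Syntax Tree (AST) with enriched control-flow and data-flow semantics. A two-stage Graph Neural Network (GNN) then analyzes these graphs: the first stage filters noise from the syntax graph, and the second performs high-level reasoning to detect diverse vulnerabilities. The method needs no manual feature engineering and is presented as an automated, scalable detector that is robust to newly emerging vulnerabilities.

Link: https://arxiv.org/abs/2512.09385
Authors: Uisang Lee, Changhoon Chung, Junmo Lee, Soo-Mook Moon
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: This paper is accepted to AAAI 2026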

Abstract:The rapid growth of Ethereum has made it more important to quickly and accurately detect smart contract vulnerabilities. While machine-learning-based methods have shown some promise, many still rely on rule-based preprocessing designed by domain experts. Rule-based preprocessing methods often discard crucial context from the source code, potentially causing certain vulnerabilities to be overlooked and limiting adaptability to newly emerging threats. We introduce BugSweeper, an end-to-end deep learning framework that detects vulnerabilities directly from the source code without manual engineering. BugSweeper represents each Solidity function as a Function-Level Abstract Syntax Graph (FLAG), a novel graph that combines its Abstract Syntax Tree (AST) with enriched control-flow and data-flow semantics. Then, our two-stage Graph Neural Network (GNN) analyzes these graphs. The first-stage GNN filters noise from the syntax graphs, while the second-stage GNN conducts high-level reasoning to detect diverse vulnerabilities. Extensive experiments on real-world contracts show that BugSweeper significantly outperforms all state-of-the-art detection methods. By removing the need for handcrafted rules, our approach offers a robust, automated, and scalable solution for securing smart contracts without any dependence on security experts.

[AI-30] Branching Strategies Based on Subgraph GNNs: A Study on Theoretical Promise versus Practical Reality

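Quick read: This paper studies "learning to branch" in Mixed-Integer Linear Programming (MILP), that is, using Graph Neural Networks (GNNs) to improve branching decisions. Standard Message-Passing GNNs (MPNNs) are efficient but lack the expressive power to fully represent MILP structure, while higher-order GNNs such as 2-FGNNs are expressive but computationally prohibitive. The paper examines Subgraph GNNs as a theoretical middle ground and proves that node-anchored Subgraph GNNs, whose expressive power is strictly below that of the 3-WL (Weisfeiler-Lehman) test, suffice to approximate Strong Branching scores. This sharpens earlier work, which showed the result for GNNs with 3-WL expressive power. Empirically, however, the O(n) complexity overhead of these models causes significant memory bottlenecks and slower solving than MPNNs and heuristics, which highlights the current trade-off between expressivity and efficiency.

Link: https://arxiv.org/abs/2512.09355
Authors: Junru Zhou, Yicheng Wang, Pan Li
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
Comments: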

Abstract:Graph Neural Networks (GNNs) have emerged as a promising approach for "learning to branch" in Mixed-Integer Linear Programming (MILP). While standard Message-Passing GNNs (MPNNs) are efficient, they theoretically lack the expressive power to fully represent MILP structures. Conversely, higher-order GNNs (like 2-FGNNs) are expressive but computationally prohibitive. In this work, we investigate Subgraph GNNs as a theoretical middle ground. Crucially, while previous work [Chen et al., 2025] demonstrated that GNNs with 3-WL expressive power can approximate Strong Branching, we prove a sharper result: node-anchored Subgraph GNNs whose expressive power is strictly lower than 3-WL [Zhang et al., 2023] are sufficient to approximate Strong Branching scores. However, our extensive empirical evaluation on four benchmark datasets reveals a stark contrast between theory and practice. While node-anchored Subgraph GNNs theoretically offer superior branching decisions, their O(n) complexity overhead results in significant memory bottlenecks and slower solving times than MPNNs and heuristics. Our results indicate that for MILP branching, the computational cost of expressive GNNs currently outweighs their gains in decision quality, suggesting that future research must focus on efficiency-preserving expressivity.

[AI-31] Efficiency-Aware Computational Intelligence for Resource-Constrained Manufacturing Toward Edge-Ready Deployment

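Quick read: This dissertation addresses problems in industrial manufacturing where incomplete data, missing labels, distribution shift, and resource limits (latency, bandwidth, energy) make centralized deep learning hard to deploy, undermine the reliability of digital twins, and raise the risk of error escape in safety-critical applications. It develops an efficiency-grounded computational framework: generative strategies mitigate data scarcity and imbalance; semi-supervised learning reduces annotation and simulation demands; physics-informed representation learning improves interpretability and condition monitoring in small-data regimes; spatially aware graph-based surrogate modeling approximates complex processes efficiently; and an edge-cloud collaborative compression scheme supports real-time signal analytics under resource constraints. Zero-shot vision-language reasoning augmented by domain-specific retrieval extends visual understanding to previously unseen scenarios. Together these enable data-efficient, resource-aware decision making across multimodal, multiscale settings and bridge laboratory learning with industrial deployment.

Link: https://arxiv.org/abs/2512.09319
Authors: Qianyu Zhou
Affiliation: unknown
Categories: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
Comments: 2025, University of Connecticut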

Abstract:Industrial cyber physical systems operate under heterogeneous sensing, stochastic dynamics, and shifting process conditions, producing data that are often incomplete, unlabeled, imbalanced, and domain shifted. High-fidelity datasets remain costly, confidential, and slow to obtain, while edge devices face strict limits on latency, bandwidth, and energy. These factors restrict the practicality of centralized deep learning, hinder the development of reliable digital twins, and increase the risk of error escape in safety-critical applications. Motivated by these challenges, this dissertation develops an efficiency grounded computational framework that enables data lean, physics-aware, and deployment ready intelligence for modern manufacturing environments. The research advances methods that collectively address core bottlenecks across multimodal and multiscale industrial scenarios. Generative strategies mitigate data scarcity and imbalance, while semi-supervised learning integrates unlabeled information to reduce annotation and simulation demands. Physics-informed representation learning strengthens interpretability and improves condition monitoring under small-data regimes. Spatially aware graph-based surrogate modeling provides efficient approximation of complex processes, and an edge cloud collaborative compression scheme supports real-time signal analytics under resource constraints. The dissertation also extends visual understanding through zero-shot vision language reasoning augmented by domain specific retrieval, enabling generalizable assessment in previously unseen scenarios. Together, these developments establish a unified paradigm of data efficient and resource aware intelligence that bridges laboratory learning with industrial deployment, supporting reliable decision-making across diverse manufacturing systems.

[AI-32] Simultaneous Genetic Evolution of Neural Networks for Optimal SFC Embedding

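Quick read: This paper addresses the optimal embedding of Service Function Chains (SFCs) on network infrastructure, an NP-hard problem with three coupled sub-problems: chain composition, virtual network function embedding, and link embedding. Existing Genetic Algorithm approaches either do not cover all three sub-problems or do not optimize them simultaneously. GENESIS is a Genetic Algorithm-based approach that evolves three sine-function-activated neural networks and feeds their outputs to a Gaussian distribution and an A* algorithm, so that all three sub-problems are optimized at once. Across 48 data centre scenarios it produces an optimal solution in 100% of cases, versus 71% for the second-best method, and it is the fastest of the Genetic Algorithms compared (15.84 minutes on average versus 38.62 minutes).

Link: https://arxiv.org/abs/2512.09318
Authors: Theviyanthan Krishnamohan, Lauritz Thamsen, Paul Harvey
Affiliation: unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: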

Abstract:The reliance of organisations on computer networks is enabled by network programmability, which is typically achieved through Service Function Chaining. These chains virtualise network functions, link them, and programmatically embed them on networking infrastructure. Optimal embedding of Service Function Chains is an NP-hard problem, with three sub-problems, chain composition, virtual network function embedding, and link embedding, that have to be optimised simultaneously, rather than sequentially, for optimal results. Genetic Algorithms have been employed for this, but existing approaches either do not optimise all three sub-problems or do not optimise all three sub-problems simultaneously. We propose a Genetic Algorithm-based approach called GENESIS, which evolves three sine-function-activated Neural Networks, and funnels their output to a Gaussian distribution and an A* algorithm to optimise all three sub-problems simultaneously. We evaluate GENESIS on an emulator across 48 different data centre scenarios and compare its performance to two state-of-the-art Genetic Algorithms and one greedy algorithm. GENESIS produces an optimal solution for 100% of the scenarios, whereas the second-best method optimises only 71% of the scenarios. Moreover, GENESIS is the fastest among all Genetic Algorithms, averaging 15.84 minutes, compared to an average of 38.62 minutes for the second-best Genetic Algorithm.

[AI-33] Hetero-SplitEE: Split Learning of Neural Networks with Early Exits for Heterogeneous IoT Devices

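Quick read: This paper addresses a limitation of existing Split Learning methods in heterogeneous IoT systems: they assume that all clients have the same computational resources and share a uniform split point (cut layer), which does not match real deployments where device capabilities differ widely. The proposed Hetero-SplitEE framework introduces heterogeneous early exits so that each client can choose a split point suited to its computational capacity. Two cooperative training strategies support collaboration among clients with different split points: the Sequential strategy trains clients one after another against a shared server model to reduce computational overhead, and the Averaging strategy trains clients in parallel with periodic cross-layer aggregation.

Link: https://arxiv.org/abs/2512.09313
Authors: Yuki Oda, Yuta Ono, Hiroshi Nakamura, Hideki Takase
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 pages. Accepted at MCSoC 2025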

Abstract:The continuous scaling of deep neural networks has fundamentally transformed machine learning, with larger models demonstrating improved performance across diverse tasks. This growth in model size has dramatically increased the computational resources required for the training process. Consequently, distributed approaches, such as Federated Learning and Split Learning, have become essential paradigms for scalable deployment. However, existing Split Learning approaches assume client homogeneity and uniform split points across all participants. This critically limits their applicability to real-world IoT systems where devices exhibit heterogeneity in computational resources. To address this limitation, this paper proposes Hetero-SplitEE, a novel method that enables heterogeneous IoT devices to train a shared deep neural network in parallel collaboratively. By integrating heterogeneous early exits into hierarchical training, our approach allows each client to select distinct split points (cut layers) tailored to its computational capacity. In addition, we propose two cooperative training strategies, the Sequential strategy and the Averaging strategy, to facilitate this collaboration among clients with different split points. The Sequential strategy trains clients sequentially with a shared server model to reduce computational overhead. The Averaging strategy enables parallel client training with periodic cross-layer aggregation. Extensive experiments on CIFAR-10, CIFAR-100, and STL-10 datasets using ResNet-18 demonstrate that our method maintains competitive accuracy while efficiently supporting diverse computational constraints, enabling practical deployment of collaborative deep learning in heterogeneous IoT ecosystems.
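The abstract names "periodic cross-layer aggregation" without specifying it. One plausible reading, shown here purely as an assumption, is that each layer is averaged over the clients whose split point includes that layer, so shallow clients contribute only to the early layers. The helper `average_shared_layers` and the toy list-of-floats weight format are hypothetical.

```python
def average_shared_layers(client_models):
    """Average each layer over the clients that hold it.

    client_models: list of models; each model is a list of layers and each
    layer is a list of floats. Clients may hold different numbers of layers
    because their split points differ.
    """
    depth = max(len(model) for model in client_models)
    averaged = []
    for layer_idx in range(depth):
        # Only clients whose cut layer lies beyond layer_idx own this layer.
        holders = [m[layer_idx] for m in client_models if len(m) > layer_idx]
        count = len(holders)
        averaged.append([sum(weights) / count for weights in zip(*holders)])
    # Each client receives back only the layers it actually owns.
    return [[list(averaged[i]) for i in range(len(m))] for m in client_models]


shallow_client = [[1.0, 2.0]]            # split after layer 0
deep_client = [[3.0, 4.0], [10.0]]       # split after layer 1
updated = average_shared_layers([shallow_client, deep_client])
```

In this sketch the first layer is shared by both clients and is averaged, while the second layer exists only on the deeper client and is returned unchanged.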

[AI-34] FBA²D: Frequency-based Black-box Attack for AI-generated Image Detection

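Quick read: This paper studies the security of AI-generated image detectors under black-box, decision-based attacks, where the attacker can only query the model through an API and observe its output (for example the predicted label), with no access to the model structure or data distribution. The proposed method, FBA²D, exploits frequency-domain discrepancies between real and generated images: it uses the Discrete Cosine Transform (DCT) for fine-grained spectral partitioning and selects frequency bands as query subspaces, which improves query efficiency and image quality. It also adopts an "adversarial example soup" strategy that averages candidates from successive surrogate iterations and uses the result as the initialization, which mitigates initialization failures, preserves image quality, and speeds up the attack under strict query budgets.

Link: https://arxiv.org/abs/2512.09264
Authors: Xiaojing Chen, Dan Li, Lijun Peng, Jun Yan, Zhiqing Guo, Junyang Chen, Xiao Lan, Zhongjie Ba, Yunfeng Diao
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: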

Abstract:The prosperous development of Artificial Intelligence-Generated Content (AIGC) has brought people’s anxiety about the spread of false information on social media. Designing detectors for filtering is an effective defense method, but most detectors will be compromised by adversarial samples. Currently, most studies exposing AIGC security issues assume information on model structure and data distribution. In real applications, attackers query and interfere with models that provide services in the form of application programming interfaces (APIs), which constitutes the black-box decision-based attack paradigm. However, to the best of our knowledge, decision-based attacks on AIGC detectors remain unexplored. In this study, we propose FBA²D: a frequency-based black-box attack method for AIGC detection to fill the research gap. Motivated by frequency-domain discrepancies between generated and real images, we develop a decision-based attack that leverages the Discrete Cosine Transform (DCT) for fine-grained spectral partitioning and selects frequency bands as query subspaces, improving both query efficiency and image quality. Moreover, attacks on AIGC detectors should mitigate initialization failures, preserve image quality, and operate under strict query budgets. To address these issues, we adopt an "adversarial example soup" method, averaging candidates from successive surrogate iterations and using the result as the initialization to accelerate the query-based attack. The empirical study on the Synthetic LSUN dataset and GenImage dataset demonstrates the effectiveness of our proposed method. This study shows the urgency of addressing practical AIGC security problems.
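To make "selecting frequency bands as query subspaces" concrete, here is a small sketch that restricts a perturbation to a chosen band of DCT coefficients. It is a 1-D, pure-Python illustration with an orthonormal DCT-II; the real attack operates on 2-D images and its band-selection rule is not given in the abstract, so `perturb_band` and its parameters are assumptions.

```python
import math


def dct(x):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        total = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * total)
    return out


def idct(c):
    """Inverse of the orthonormal DCT-II above (a DCT-III)."""
    n = len(c)
    out = []
    for i in range(n):
        total = c[0] * math.sqrt(1.0 / n)
        for k in range(1, n):
            total += c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(total)
    return out


def perturb_band(signal, lo, hi, delta):
    """Add delta to DCT coefficients lo..hi-1 only, then return to the signal domain."""
    coeffs = dct(signal)
    for k in range(lo, hi):
        coeffs[k] += delta
    return idct(coeffs)


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
adversarial = perturb_band(signal, lo=2, hi=4, delta=0.5)
```

A decision-based attack would then query the detector with `adversarial` and keep or discard the perturbation based only on the returned label; that loop is omitted here.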

[AI-35] Tensor-Compressed and Fully-Quantized Training of Neural PDE Solvers DATE2026

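Quick read: This paper addresses the compute and memory overhead of training Physics-Informed Neural Networks (PINNs) on resource-constrained edge devices, which stems from higher-order automatic differentiation, intensive tensor operations, and reliance on full-precision arithmetic. The proposed framework integrates fully quantized training, Stein's estimator (SE)-based residual loss computation, and tensor-train (TT) decomposition for weight compression, with three innovations: (1) a mixed-precision training method that uses a square-block MX (SMX) format to eliminate data duplication during backpropagation; (2) a difference-based quantization scheme for Stein's estimator that mitigates underflow; and (3) a partial-reconstruction scheme (PRS) for TT layers that reduces quantization-error accumulation. The authors also design PINTA, a precision-scalable hardware accelerator. On the 2-D Poisson, 20-D Hamilton-Jacobi-Bellman, and 100-D Heat equations, the framework reports 5.5x to 83.5x speedups and 159.6x to 2324.1x energy savings while matching or exceeding the accuracy of full-precision, uncompressed baselines.

Link: https://arxiv.org/abs/2512.09202
Authors: Jinming Lu, Jiayi Tian, Yequan Zhao, Hai Li, Zheng Zhang
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments: DATE 2026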

Abstract:Physics-Informed Neural Networks (PINNs) have emerged as a promising paradigm for solving partial differential equations (PDEs) by embedding physical laws into neural network training objectives. However, their deployment on resource-constrained platforms is hindered by substantial computational and memory overhead, primarily stemming from higher-order automatic differentiation, intensive tensor operations, and reliance on full-precision arithmetic. To address these challenges, we present a framework that enables scalable and energy-efficient PINN training on edge devices. This framework integrates fully quantized training, Stein’s estimator (SE)-based residual loss computation, and tensor-train (TT) decomposition for weight compression. It contributes three key innovations: (1) a mixed-precision training method that use a square-block MX (SMX) format to eliminate data duplication during backpropagation; (2) a difference-based quantization scheme for the Stein’s estimator that mitigates underflow; and (3) a partial-reconstruction scheme (PRS) for TT-Layers that reduces quantization-error accumulation. We further design PINTA, a precision-scalable hardware accelerator, to fully exploit the performance of the framework. Experiments on the 2-D Poisson, 20-D Hamilton-Jacobi-Bellman (HJB), and 100-D Heat equations demonstrate that the proposed framework achieves accuracy comparable to or better than full-precision, uncompressed baselines while delivering 5.5x to 83.5x speedups and 159.6x to 2324.1x energy savings. This work enables real-time PDE solving on edge devices and paves the way for energy-efficient scientific computing at scale.
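As background on why tensor-train (TT) decomposition compresses weights, the sketch below counts parameters for a TT-format linear layer versus a dense one. It uses the standard TT-matrix layout in which core k has shape (r_{k-1}, m_k, n_k, r_k); the layer sizes and ranks are made-up illustrative values, not figures from the paper.

```python
from math import prod


def tt_param_count(in_factors, out_factors, ranks):
    """Parameters in a TT-matrix whose k-th core has shape (r_{k-1}, m_k, n_k, r_k)."""
    assert len(in_factors) == len(out_factors) == len(ranks) - 1
    assert ranks[0] == 1 and ranks[-1] == 1  # boundary ranks are 1 by convention
    return sum(
        ranks[k] * in_factors[k] * out_factors[k] * ranks[k + 1]
        for k in range(len(in_factors))
    )


def dense_param_count(in_factors, out_factors):
    """Parameters in the equivalent dense weight matrix."""
    return prod(in_factors) * prod(out_factors)


# Hypothetical 1024 x 1024 layer factorized as (4*8*8*4) x (4*8*8*4) with TT-rank 8.
factors = [4, 8, 8, 4]
tt_size = tt_param_count(factors, factors, ranks=[1, 8, 8, 8, 1])
dense_size = dense_param_count(factors, factors)
```

With these assumed values the TT form stores 8,448 numbers instead of 1,048,576, roughly a 124x reduction; the achievable rank, and therefore the ratio, depends on the model.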

[AI-36] LLMs for Analog Circuit Design Continuum (ACDC)

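Quick read: This paper examines the limited reliability and robustness of Large Language Models (LLMs) in real-world engineering domains, specifically analog circuit design, which restricts their practical use in human-centric workflows. It studies how different data representations influence model behavior and compares smaller models (e.g., T5, GPT-2) with larger foundation models (e.g., Mistral-7B, GPT-oss-20B) under varying training conditions. The results identify key reliability challenges: sensitivity to data format, instability in generated designs, and limited generalization to unseen circuit configurations. These findings offer early empirical evidence and direction for building reliable, deployable foundation models for structured real-world applications.

Link: https://arxiv.org/abs/2512.09199
Authors: Yasaman Esfandiari, Jocelyn Rego, Austin Meyer, Jonathan Gallagher, Mia Levy
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF)
Comments: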

Abstract:Large Language Models (LLMs) and transformer architectures have shown impressive reasoning and generation capabilities across diverse natural language tasks. However, their reliability and robustness in real-world engineering domains remain largely unexplored, limiting their practical utility in human-centric workflows. In this work, we investigate the applicability and consistency of LLMs for analog circuit design – a task requiring domain-specific reasoning, adherence to physical constraints, and structured representations – focusing on AI-assisted design where humans remain in the loop. We study how different data representations influence model behavior and compare smaller models (e.g., T5, GPT-2) with larger foundation models (e.g., Mistral-7B, GPT-oss-20B) under varying training conditions. Our results highlight key reliability challenges, including sensitivity to data format, instability in generated designs, and limited generalization to unseen circuit configurations. These findings provide early evidence on the limits and potential of LLMs as tools to enhance human capabilities in complex engineering tasks, offering insights into designing reliable, deployable foundation models for structured, real-world applications.

[AI-37] Towards Optimal Valve Prescription for Transcatheter Aortic Valve Replacement (TAVR) Surgery: A Machine Learning Approach

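Quick read: This paper addresses how to select, for each patient, the optimal transcatheter heart valve (THV) in Transcatheter Aortic Valve Replacement (TAVR) so as to reduce the risk of permanent pacemaker implantation (PPI). Current guidelines on valve type remain debated, and precise decision tools based on individual characteristics are lacking. The approach builds a dataset that combines U.S. and Greek patient populations and integrates demographics, computed tomography scans, and echocardiograms. It introduces a leaf-level analysis that leverages population heterogeneity and avoids benchmarking against uncertain counterfactual risk estimates. The final prescriptive model reduces PPI rates by 26% in the internal U.S. cohort and by 16% in the external Greek validation cohort relative to the current standard of care. The authors describe it as the first unified, personalized prescription strategy for THV selection in TAVR.

Link: https://arxiv.org/abs/2512.09198
Authors: Phevos Paschalidis, Vasiliki Stoumpou, Lisa Everest, Yu Ma, Talhat Azemi, Jawad Haider, Steven Zweibel, Eleftherios M. Protopapas, Jeff Mather, Maciej Tysarowski, George E. Sarris, Robert C. Hagberg, Howard L. Haronian, Dimitris Bertsimas
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: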

Abstract:Transcatheter Aortic Valve Replacement (TAVR) has emerged as a minimally invasive treatment option for patients with severe aortic stenosis, a life-threatening cardiovascular condition. Multiple transcatheter heart valves (THV) have been approved for use in TAVR, but current guidelines regarding valve type prescription remain an active topic of debate. We propose a data-driven clinical support tool to identify the optimal valve type with the objective of minimizing the risk of permanent pacemaker implantation (PPI), a predominant postoperative complication. We synthesize a novel dataset that combines U.S. and Greek patient populations and integrates three distinct data sources (patient demographics, computed tomography scans, echocardiograms) while harmonizing differences in each country’s record system. We introduce a leaf-level analysis to leverage population heterogeneity and avoid benchmarking against uncertain counterfactual risk estimates. The final prescriptive model shows a reduction in PPI rates of 26% and 16% compared with the current standard of care in our internal U.S. population and external Greek validation cohort, respectively. To the best of our knowledge, this work represents the first unified, personalized prescription strategy for THV selection in TAVR.

[AI-38] Understanding Mental States in Active and Autonomous Driving with EEG

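Quick read: This paper quantifies how driver mental states differ between autonomous driving and active (manual) driving, to support the design of safe human-vehicle interfaces. It presents what the authors describe as the first EEG-based comparison of cognitive load, fatigue, valence, and arousal across the two modes, analyzing temporal patterns, task-complexity effects, and channel-wise activation differences, and it uses transfer-learning experiments to confirm a clear distribution shift between the two settings. Although both modes show similar trends across complexity levels, the intensity of mental states and the underlying neural activation differ substantially, mainly because of differences in motor engagement and attentional demands. The results point to the need for scenario-specific data and models in next-generation driver monitoring systems.

Link: https://arxiv.org/abs/2512.09190
Authors: Prithila Angkan, Paul Hungler, Ali Etemad
Affiliation: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 15 Pages, 13 Figures and 3 Tables. This work has been submitted to IEEE Transaction for possible publication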

Abstract:Understanding how driver mental states differ between active and autonomous driving is critical for designing safe human-vehicle interfaces. This paper presents the first EEG-based comparison of cognitive load, fatigue, valence, and arousal across the two driving modes. Using data from 31 participants performing identical tasks in both scenarios of three different complexity levels, we analyze temporal patterns, task-complexity effects, and channel-wise activation differences. Our findings show that although both modes evoke similar trends across complexity levels, the intensity of mental states and the underlying neural activation differ substantially, indicating a clear distribution shift between active and autonomous driving. Transfer-learning experiments confirm that models trained on active driving data generalize poorly to autonomous driving and vice versa. We attribute this distribution shift primarily to differences in motor engagement and attentional demands between the two driving modes, which lead to distinct spatial and temporal EEG activation patterns. Although autonomous driving results in lower overall cortical activation, participants continue to exhibit measurable fluctuations in cognitive load, fatigue, valence, and arousal associated with readiness to intervene, task-evoked emotional responses, and monotony-related passive fatigue. These results emphasize the need for scenario-specific data and models when developing next-generation driver monitoring systems for autonomous vehicles.

[AI-39] WOLF: Werewolf-based Observations for LLM Deception and Falsehoods NEURIPS2025

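Quick read: This paper addresses the static nature of deception evaluation in multi-agent reasoning: existing evaluations reduce deception to one-shot classification and ignore its interactive, adversarial, and longitudinal dynamics. WOLF is a multi-agent social deduction benchmark based on Werewolf. Role-grounded agents run in a programmable LangGraph state machine with strict night-day cycles, debate turns, and majority voting. Each statement is a separate analysis unit with self-assessed honesty from the speaker and peer-rated deceptiveness from others; deception is categorized with a standardized taxonomy (omission, distortion, fabrication, misdirection), and suspicion scores are longitudinally smoothed. This design allows deception production and deception detection to be measured separately and captures how trust evolves over extended interaction.

Link: https://arxiv.org/abs/2512.09187
Authors: Mrinal Agarwal, Saad Rana, Theo Sundoro, Hermela Berhe, Spencer Kim, Vasu Sharma, Sean O’Brien, Kevin Zhu
Affiliation: unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: Spotlight Multi-Turn Interactions in Large Language Models (MTI-LLM) Workshop at NeurIPS 2025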

Abstract:Deception is a fundamental challenge for multi-agent reasoning: effective systems must strategically conceal information while detecting misleading behavior in others. Yet most evaluations reduce deception to static classification, ignoring the interactive, adversarial, and longitudinal nature of real deceptive dynamics. Large language models (LLMs) can deceive convincingly but remain weak at detecting deception in peers. We present WOLF, a multi-agent social deduction benchmark based on Werewolf that enables separable measurement of deception production and detection. WOLF embeds role-grounded agents (Villager, Werewolf, Seer, Doctor) in a programmable LangGraph state machine with strict night-day cycles, debate turns, and majority voting. Every statement is a distinct analysis unit, with self-assessed honesty from speakers and peer-rated deceptiveness from others. Deception is categorized via a standardized taxonomy (omission, distortion, fabrication, misdirection), while suspicion scores are longitudinally smoothed to capture both immediate judgments and evolving trust dynamics. Structured logs preserve prompts, outputs, and state transitions for full reproducibility. Across 7,320 statements and 100 runs, Werewolves produce deceptive statements in 31% of turns, while peer detection achieves 71-73% precision with ~52% overall accuracy. Precision is higher for identifying Werewolves, though false positives occur against Villagers. Suspicion toward Werewolves rises from ~52% to over 60% across rounds, while suspicion toward Villagers and the Doctor stabilizes near 44-46%. This divergence shows that extended interaction improves recall against liars without compounding errors against truthful roles. WOLF moves deception evaluation beyond static datasets, offering a dynamic, controlled testbed for measuring deceptive and detective capacity in adversarial multi-agent interaction.
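The abstract says suspicion scores are "longitudinally smoothed" but does not name the method. As an illustration only, the sketch below uses an exponential moving average, which blends each round's immediate judgment with the running trust estimate; the function name and the `alpha` weight are assumptions, not WOLF's actual implementation.

```python
def smooth_suspicion(scores, alpha=0.5):
    """Exponentially smooth per-round suspicion scores in [0, 1].

    alpha weights the newest judgment; (1 - alpha) weights the history.
    """
    smoothed = []
    previous = None
    for score in scores:
        # The first round has no history, so it is taken as-is.
        previous = score if previous is None else alpha * score + (1 - alpha) * previous
        smoothed.append(previous)
    return smoothed


per_round = [0.5, 1.0, 0.0]
trajectory = smooth_suspicion(per_round, alpha=0.5)
```

A single sharp accusation in round two moves the estimate to 0.75 rather than 1.0, and a full retraction in round three lowers it to 0.375 rather than 0.0, which is the kind of damped trust dynamic the abstract describes.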

[AI-40] SDialog: A Python Toolkit for End-to-End Agent Building, User Simulation, Dialog Generation, and Evaluation EACL

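Quick read: This paper addresses the lack of a unified framework for building, evaluating, and interpreting dialog systems driven by Large Language Models (LLMs); existing work usually treats generation, evaluation, and mechanistic interpretation separately, which makes the research workflow inefficient. SDialog is an MIT-licensed open-source Python toolkit built around a standardized Dialog representation. It integrates dialog generation, evaluation (linguistic metrics, LLM-as-a-judge, and functional correctness validators), and mechanistic interpretability (activation inspection, and steering via feature ablation and induction) end to end, and it also supports audio generation with acoustic simulation. Coupling these modules in a dialog-centric architecture is intended to make building and understanding conversational agents more systematic.

Link: https://arxiv.org/abs/2512.09142
Authors: Sergio Burdisso, Séverin Baroudi, Yanis Labrak, David Grunert, Pawel Cyrta, Yiyang Chen, Srikanth Madikeri, Esaú Villatoro-Tello, Thomas Schaaf, Ricard Marxer, Petr Motlicek
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Pre-print submitted to EACL System Demonstration (under review)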

Abstract:We present SDialog, an MIT-licensed open-source Python toolkit that unifies dialog generation, evaluation and mechanistic interpretability into a single end-to-end framework for building and analyzing LLM-based conversational agents. Built around a standardized Dialog representation, SDialog provides: (1) persona-driven multi-agent simulation with composable orchestration for controlled, synthetic dialog generation, (2) comprehensive evaluation combining linguistic metrics, LLM-as-a-judge and functional correctness validators, (3) mechanistic interpretability tools for activation inspection and steering via feature ablation and induction, and (4) audio generation with full acoustic simulation including 3D room modeling and microphone effects. The toolkit integrates with all major LLM backends, enabling mixed-backend experiments under a unified API. By coupling generation, evaluation, and interpretability in a dialog-centric architecture, SDialog enables researchers to build, benchmark and understand conversational systems more systematically.

[AI-41] A Categorical Analysis of Large Language Models and Why LLMs Circumvent the Symbol Grounding Problem

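Quick read: This paper concerns the symbol grounding problem, that is, how symbols in a language model can be stably and meaningfully connected to objects or states in the real world. It proposes a formal category-theoretic framework for analysing how humans and large language models (LLMs) transform content into truth-evaluated propositions about a state space of possible worlds W. Its central claim is that LLMs do not solve the symbol grounding problem but circumvent it.

Link: https://arxiv.org/abs/2512.09117
Authors: Luciano Floridi, Yiyang Jia, Fernando Tohmé
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: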

Abstract:This paper presents a formal, categorical framework for analysing how humans and large language models (LLMs) transform content into truth-evaluated propositions about a state space of possible worlds W, in order to argue that LLMs do not solve but circumvent the symbol grounding problem.

[AI-42] AI TIPS 2.0: A Comprehensive Framework for Operationalizing AI Governance

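Quick read: This paper addresses three governance challenges in deploying AI systems. First, organizations lack adequate risk assessment at the use-case level, as illustrated by the Humana class action lawsuit, in which an AI system with significant bias and high error rates led to improper healthcare claim denials. Second, existing frameworks such as ISO 42001 and NIST AI RMF remain conceptual and lack actionable controls, so practitioners struggle to translate governance requirements into specific technical implementations. Third, organizations lack mechanisms to operationalize governance at scale: to embed trustworthy AI practices throughout the development lifecycle, measure compliance quantitatively, and provide role-appropriate visibility from boards to data scientists. The paper presents AI TIPS 2.0 (Artificial Intelligence Trust-Integrated Pillars for Sustainability), an update to an operational framework first developed in 2019, four years before NIST's AI Risk Management Framework, that targets these challenges directly.

Link: https://arxiv.org/abs/2512.09114
Authors: Pamela Gupta
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 47 pages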

Abstract:The deployment of AI systems faces three critical governance challenges that current frameworks fail to adequately address. First, organizations struggle with inadequate risk assessment at the use case level, exemplified by the Humana class action lawsuit and other high-impact cases where an AI system deployed to production exhibited both significant bias and high error rates, resulting in improper healthcare claim denials. Each AI use case presents unique risk profiles requiring tailored governance, yet most frameworks provide one-size-fits-all guidance. Second, existing frameworks like ISO 42001 and NIST AI RMF remain at high conceptual levels, offering principles without actionable controls, leaving practitioners unable to translate governance requirements into specific technical implementations. Third, organizations lack mechanisms for operationalizing governance at scale, with no systematic approach to embed trustworthy AI practices throughout the development lifecycle, measure compliance quantitatively, or provide role-appropriate visibility from boards to data scientists. We present AI TIPS (Artificial Intelligence Trust-Integrated Pillars for Sustainability) 2.0, an update to the comprehensive operational framework developed in 2019, four years before NIST’s AI Risk Management Framework, that directly addresses these challenges.

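[AI-43] Semantic Trajectory Generation for Goal-Oriented Spacecraft Rendezvous

[Quick read]: This paper addresses the scalability and human-machine interaction efficiency of trajectory generation for autonomous spacecraft in real rendezvous and docking missions. Conventional nonconvex guidance and control methods rely on extensive expert input (such as waypoints, constraints, and mission timelines), which limits their use in complex scenarios. The key to the solution is SAGES (Semantic Autonomous Guidance Engine for Space), a framework that parses natural-language commands to map high-level intent and, combined with continuous-time constraint enforcement, automatically generates trajectories that match the commanded semantic behavior while satisfying nonconvex constraints, thereby markedly reducing the need for expert intervention and improving operational flexibility and consistency.

Link: https://arxiv.org/abs/2512.09111
Authors: Yuji Takubo, Arpit Dwivedi, Sukeerth Ramkumar, Luis A. Pabon, Daniele Gammelli, Marco Pavone, Simone D’Amico
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments: 28 pages, 12 figures. Submitted to AIAA SCITECH 2026

Click to view abstract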

Abstract:Reliable real-time trajectory generation is essential for future autonomous spacecraft. While recent progress in nonconvex guidance and control is paving the way for onboard autonomous trajectory optimization, these methods still rely on extensive expert input (e.g., waypoints, constraints, mission timelines), which limits the operational scalability in real rendezvous missions. This paper introduces SAGES (Semantic Autonomous Guidance Engine for Space), a trajectory-generation framework that translates natural-language commands into spacecraft trajectories that reflect high-level intent while respecting nonconvex constraints. Experiments in two settings – fault-tolerant proximity operations with continuous-time constraint enforcement and a free-flying robotic platform – demonstrate that SAGES reliably produces trajectories aligned with human commands, achieving over 90% semantic-behavioral consistency across diverse behavior modes. Ultimately, this work marks an initial step toward language-conditioned, constraint-aware spacecraft trajectory generation, enabling operators to interactively guide both safety and behavior through intuitive natural-language commands with reduced expert burden.

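[AI-44] Evolving Excellence: Automated Optimization of LLM-based Agents

[Quick read]: This paper addresses the underperformance of agentic AI systems built on large language models (LLMs) in real deployments. The root cause is that agent configurations (such as prompts, tool descriptions, and parameters) often need weeks of manual tuning, while existing optimization methods are either too complex to generalize or optimize components in isolation and ignore critical interdependencies. The key to the solution is ARTEMIS, a no-code evolutionary optimization platform that jointly optimizes agent configurations through semantically aware genetic operators: it automatically discovers configurable components, extracts performance signals from execution logs, and evolves better configurations without architectural modifications, markedly improving performance on several representative agent tasks (such as competitive programming, code optimization, and mathematical reasoning).

Link: https://arxiv.org/abs/2512.09108
Authors: Paul Brookes, Vardan Voskanyan, Rafail Giavrimis, Matthew Truscott, Mina Ilieva, Chrystalla Pavlou, Alexandru Staicu, Manal Adham, Will Evers-Hood, Jingzhi Gong, Kejia Zhang, Matvey Fedoseev, Vishal Sharma, Roman Bauer, Zheng Wang, Hema Nair, Wei Jie, Tianhua Xu, Aurora Constantin, Leslie Kanthan, Michail Basios
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract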

Abstract:Agentic AI systems built on large language models (LLMs) offer significant potential for automating complex workflows, from software development to customer support. However, LLM agents often underperform due to suboptimal configurations: poorly tuned prompts, tool descriptions, and parameters that typically require weeks of manual refinement. Existing optimization methods either are too complex for general use or treat components in isolation, missing critical interdependencies. We present ARTEMIS, a no-code evolutionary optimization platform that jointly optimizes agent configurations through semantically-aware genetic operators. Given only a benchmark script and natural language goals, ARTEMIS automatically discovers configurable components, extracts performance signals from execution logs, and evolves configurations without requiring architectural modifications. We evaluate ARTEMIS on four representative agent systems: the ALE Agent for competitive programming on AtCoder Heuristic Contest, achieving a 13.6% improvement in acceptance rate; the Mini-SWE Agent for code optimization on SWE-Perf, with a statistically significant 10.1% performance gain; and the CrewAI Agent for cost and mathematical reasoning on Math Odyssey, achieving a statistically significant 36.9% reduction in the number of tokens required for evaluation. We also evaluate the MathTales-Teacher Agent powered by a smaller open-source model (Qwen2.5-7B) on GSM8K primary-level mathematics problems, achieving a 22% accuracy improvement and demonstrating that ARTEMIS can optimize agents based on both commercial and local models.

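[AI-45] Masked Generative Policy for Robotic Control

[Quick read]: This paper addresses the low inference efficiency and poor control reliability of existing visuomotor imitation learning methods on complex, non-Markovian tasks. The core of the solution is a new framework called Masked Generative Policy (MGP): actions are represented as discrete tokens, a conditional masked transformer generates the action sequence in parallel, and only low-confidence tokens are then rapidly refined. MGP-Long goes further by predicting the full trajectory in a single forward pass and dynamically refining low-confidence action tokens based on new observations, which gives it robust adaptive execution while preserving global coherence and markedly improves success rates and inference speed on complex tasks.

Link: https://arxiv.org/abs/2512.09101
Authors: Lipeng Zhuang, Shiyu Fan, Florent P. Audonnet, Yingdong Ru, Gerardo Aragon Camarasa, Paul Henderson
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract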

Abstract:We present Masked Generative Policy (MGP), a novel framework for visuomotor imitation learning. We represent actions as discrete tokens, and train a conditional masked transformer that generates tokens in parallel and then rapidly refines only low-confidence tokens. We further propose two new sampling paradigms: MGP-Short, which performs parallel masked generation with score-based refinement for Markovian tasks, and MGP-Long, which predicts full trajectories in a single pass and dynamically refines low-confidence action tokens based on new observations. With globally coherent prediction and robust adaptive execution capabilities, MGP-Long enables reliable control on complex and non-Markovian tasks that prior methods struggle with. Extensive evaluations on 150 robotic manipulation tasks spanning the Meta-World and LIBERO benchmarks show that MGP achieves both rapid inference and superior success rates compared to state-of-the-art diffusion and autoregressive policies. Specifically, MGP increases the average success rate by 9% across 150 tasks while cutting per-sequence inference time by up to 35x. It further improves the average success rate by 60% in dynamic and missing-observation environments, and solves two non-Markovian scenarios where other state-of-the-art methods fail.
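
The "generate in parallel, then refine only low-confidence tokens" idea in the abstract above can be illustrated with a toy loop. This is a minimal sketch, not the authors' MGP implementation: `MASK`, `masked_decode`, `toy_predict`, the confidence threshold, and the refinement budget are all hypothetical names and values chosen for illustration, and the real model is a conditional masked transformer rather than the stand-in function used here.

```python
MASK = -1  # hypothetical placeholder id for a masked action token


def masked_decode(predict, length, max_refinements=3, threshold=0.9):
    """Fill all positions in one parallel pass, then re-predict only the
    positions whose confidence is below `threshold`.

    `predict(tokens)` must return one (token, confidence) pair per position.
    """
    tokens = [MASK] * length
    confidences = [0.0] * length
    for step in range(max_refinements + 1):
        proposals = predict(tokens)
        # Accept proposals only where the position is currently masked.
        for i, (tok, conf) in enumerate(proposals):
            if tokens[i] == MASK:
                tokens[i], confidences[i] = tok, conf
        low = [i for i, c in enumerate(confidences) if c < threshold]
        if not low or step == max_refinements:
            break
        # Re-mask only the uncertain positions for the next pass.
        for i in low:
            tokens[i] = MASK
    return tokens, confidences


def toy_predict(tokens):
    """Stand-in predictor: odd positions are uncertain until some context exists."""
    known = sum(t != MASK for t in tokens)
    return [(i * 10, 0.95 if (i % 2 == 0 or known > 0) else 0.5)
            for i in range(len(tokens))]
```

With `toy_predict`, the first pass fills every position, the two low-confidence positions are re-masked, and the second pass resolves them using the tokens that were kept as context.
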

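[AI-46] Calibrated Trust in Dealing with LLM Hallucinations: A Qualitative Study

[Quick read]: This paper addresses how hallucinations in large language models (LLMs) affect user trust and interaction behavior. The study shows that hallucinations do not cause blanket mistrust; instead they prompt users to calibrate their trust according to the specific context. The key to the solution lies in identifying and integrating user-related factors (such as expectations, prior experience, domain knowledge, and intuition) and contextual factors (such as perceived risk and decision stakes), thereby validating and extending the recursive trust calibration process proposed by Blöbaum, adding intuition as a new user-related trust factor, and providing practical guidance for responsible and reflective LLM use.

Link: https://arxiv.org/abs/2512.09088
Authors: Adrian Ryser, Florian Allwein, Tim Schlippe
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract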

Abstract:Hallucinations are outputs by Large Language Models (LLMs) that are factually incorrect yet appear plausible [1]. This paper investigates how such hallucinations influence users’ trust in LLMs and users’ interaction with LLMs. To explore this in everyday use, we conducted a qualitative study with 192 participants. Our findings show that hallucinations do not result in blanket mistrust but instead lead to context-sensitive trust calibration. Building on the calibrated trust model by Lee and See [2] and Afroogh et al.'s trust-related factors [3], we confirm expectancy [3], [4], prior experience [3], [4], [5], and user expertise and domain knowledge [3], [4] as user-related (human) trust factors, and identify intuition as an additional factor relevant for hallucination detection. Additionally, we found that trust dynamics are further influenced by contextual factors, particularly perceived risk [3] and decision stakes [6]. Consequently, we validate the recursive trust calibration process proposed by Blöbaum [7] and extend it by including intuition as a user-related trust factor. Based on these insights, we propose practical recommendations for responsible and reflective LLM use.

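[AI-47] Mental Models of Autonomy and Sentience Shape Reactions to AI

[Quick read]: The problem this paper tries to solve is that current narratives about artificial intelligence (AI) often conflate autonomy (the capacity to self-govern) with sentience (the capacity to sense and feel), which obscures differences in how people react to different mental models of AI and undermines the precision of human-AI interaction design. The key to the solution is to separate the two mental models empirically: the experiments systematically manipulate whether an AI is described as autonomous only, sentient only, both, or neither, and use large-sample vignette experiments (N = 2,702) to test their differential effects on mind perception and moral consideration. The results show that activating the sentience model raises overall mind perception and moral consideration more than the autonomy model does, while autonomy more readily triggers perceived threat; sentience also increases perceived autonomy, indicating an interaction between the two. This methodological disentangling gives future human-AI interaction research a finer-grained framework of psychological mechanisms and helps improve the design of anthropomorphized AI and prompting interfaces.

Link: https://arxiv.org/abs/2512.09085
Authors: Janet V.T. Pauketat, Daniel B. Shank, Aikaterina Manoli, Jacy Reese Anthis
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 37 pages, 6 figures, 2 tables

Click to view abstract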

Abstract:Narratives about artificial intelligence (AI) entangle autonomy, the capacity to self-govern, with sentience, the capacity to sense and feel. AI agents that perform tasks autonomously and companions that recognize and express emotions may activate mental models of autonomy and sentience, respectively, provoking distinct reactions. To examine this possibility, we conducted three pilot studies (N = 374) and four preregistered vignette experiments describing an AI as autonomous, sentient, both, or neither (N = 2,702). Activating a mental model of sentience increased general mind perception (cognition and emotion) and moral consideration more than autonomy, but autonomy increased perceived threat more than sentience. Sentience also increased perceived autonomy more than vice versa. Based on a within-paper meta-analysis, sentience changed reactions more than autonomy on average. By disentangling different mental models of AI, we can study human-AI interaction with more precision to better navigate the detailed design of anthropomorphized AI and prompting interfaces.

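[AI-48] Beyond the Hype: Comparing Lightweight and Deep Learning Models for Air Quality Forecasting

[Quick read]: This paper addresses accurate forecasting of urban air pollution (specifically PM₂.₅ and PM₁₀) in support of public health protection and pollution-control policy. Conventional deep learning (DL) methods perform well but are complex and hard to interpret, which limits their deployment in practice. The study proposes lightweight additive models, Facebook Prophet (FBP) and NeuralProphet (NP), as alternatives, optimizing the input features through systematic feature selection (correlation, mutual information, mRMR), leakage-safe data scaling, and chronological splits, and using pollutants and their precursors as regressors; NP additionally incorporates lagged dependencies. The results show that FBP performs best on a 7-day holdout test set, with R² above 0.94 for both pollutants, clearly outperforming the LSTM, LightGBM, and SARIMAX baselines, which demonstrates that interpretable additive models strike a good balance between accuracy, transparency, and ease of deployment.

Link: https://arxiv.org/abs/2512.09076
Authors: Moazzam Umer Gondal, Hamad ul Qudous, Asma Ahmad Farhan
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Click to view abstract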

Abstract:Accurate forecasting of urban air pollution is essential for protecting public health and guiding mitigation policies. While Deep Learning (DL) and hybrid pipelines dominate recent research, their complexity and limited interpretability hinder operational use. This study investigates whether lightweight additive models – Facebook Prophet (FBP) and NeuralProphet (NP) – can deliver competitive forecasts for particulate matter (PM₂.₅, PM₁₀) in Beijing, China. Using multi-year pollutant and meteorological data, we applied systematic feature selection (correlation, mutual information, mRMR), leakage-safe scaling, and chronological data splits. Both models were trained with pollutant and precursor regressors, with NP additionally leveraging lagged dependencies. For context, two machine learning baselines (LSTM, LightGBM) and one traditional statistical model (SARIMAX) were also implemented. Performance was evaluated on a 7-day holdout using MAE, RMSE, and R². Results show that FBP consistently outperformed NP, SARIMAX, and the learning-based baselines, achieving test R² above 0.94 for both pollutants. These findings demonstrate that interpretable additive models remain competitive with both traditional and complex approaches, offering a practical balance of accuracy, transparency, and ease of deployment.
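
The evaluation protocol named in the abstract above (chronological split, leakage-safe scaling, then MAE, RMSE, and R² on the holdout) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names are hypothetical, min-max scaling is one possible choice of scaler (the abstract does not say which one was used), and no forecasting model is included.

```python
import math


def chronological_split(series, holdout):
    """Keep time order: the last `holdout` points form the test set."""
    if not 0 < holdout < len(series):
        raise ValueError("holdout must be between 1 and len(series) - 1")
    return series[:-holdout], series[-holdout:]


def fit_minmax(train):
    """Learn scaling bounds from the training split only, so that no
    statistic of the holdout period leaks into preprocessing."""
    return min(train), max(train)


def apply_minmax(values, bounds):
    lo, hi = bounds
    span = (hi - lo) or 1.0  # guard against a constant training series
    return [(v - lo) / span for v in values]


def mae(y_true, y_pred):
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean_y) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot
```

The point of `fit_minmax` taking only the training split is that the same bounds are then applied to the holdout, which is what "leakage-safe" scaling usually means in practice.
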

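[AI-49] ShelfAware: Real-Time Visual-Inertial Semantic Localization in Quasi-Static Environments with Low-Cost Sensors

[Quick read]: This paper addresses vision-based global localization in indoor quasi-static environments, where the global layout is stable but local semantics keep changing, producing repetitive geometry, dynamic clutter, and perceptual noise that degrade conventional visual localization. The key to the solution is ShelfAware, a robust global localization method based on a semantic particle filter. Its core innovation is to model scene semantics as statistical evidence over object categories rather than as fixed landmarks, and to fuse a depth likelihood with a category-centric semantic similarity; it also uses a precomputed bank of semantic viewpoints to perform inverse semantic proposals inside Monte Carlo localization (MCL), enabling fast, targeted hypothesis generation on low-resource hardware. The method mitigates the geometric aliasing and semantic drift common in quasi-static environments, and was validated in a retail environment with a high success rate (96%), fast convergence (mean 1.91 s), and stable tracking (80% of sequences).

Link: https://arxiv.org/abs/2512.09065
Authors: Shivendra Agrawal, Jake Brawer, Ashutosh Naik, Alessandro Roncone, Bradley Hayes
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 8 pages

Click to view abstract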

Abstract:Many indoor workspaces are quasi-static: global layout is stable but local semantics change continually, producing repetitive geometry, dynamic clutter, and perceptual noise that defeat vision-based localization. We present ShelfAware, a semantic particle filter for robust global localization that treats scene semantics as statistical evidence over object categories rather than fixed landmarks. ShelfAware fuses a depth likelihood with a category-centric semantic similarity and uses a precomputed bank of semantic viewpoints to perform inverse semantic proposals inside MCL, yielding fast, targeted hypothesis generation on low-cost, vision-only hardware. Across 100 global-localization trials spanning four conditions (cart-mounted, wearable, dynamic obstacles, and sparse semantics) in a semantically dense, retail environment, ShelfAware achieves a 96% success rate (vs. 22% MCL and 10% AMCL) with a mean time-to-convergence of 1.91s, attains the lowest translational RMSE in all conditions, and maintains stable tracking in 80% of tested sequences, all while running in real time on a consumer laptop-class platform. By modeling semantics distributionally at the category level and leveraging inverse proposals, ShelfAware resolves geometric aliasing and semantic drift common to quasi-static domains. Because the method requires only vision sensors and VIO, it integrates as an infrastructure-free building block for mobile robots in warehouses, labs, and retail settings; as a representative application, it also supports the creation of assistive devices providing start-anytime, shared-control assistive navigation for people with visual impairments.
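
To make the phrase "fuses a depth likelihood with a category-centric semantic similarity" more concrete, here is a minimal sketch of one particle-filter weight update. The abstract does not state how the two terms are combined, so the multiplicative fusion, the cosine similarity over category histograms, and the `alpha` exponent below are assumptions for illustration only, and all function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two category-count histograms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def update_weights(weights, depth_likelihoods, observed_hist, expected_hists, alpha=1.0):
    """Reweight particles by depth likelihood times semantic similarity.

    `expected_hists[i]` is the category histogram the map predicts from
    particle i's pose; `observed_hist` is what the camera currently sees.
    """
    raw = []
    for w, d, expected in zip(weights, depth_likelihoods, expected_hists):
        semantic = max(0.0, cosine_similarity(observed_hist, expected))
        raw.append(w * d * semantic ** alpha)
    total = sum(raw)
    if total == 0.0:
        # Every hypothesis was rejected: fall back to a uniform belief.
        return [1.0 / len(raw)] * len(raw)
    return [r / total for r in raw]
```

Because the histogram counts categories rather than matching individual objects, two shelves with rearranged items of the same categories still score as similar, which is the intuition behind treating semantics as statistical evidence instead of fixed landmarks.
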

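[AI-50] Llama-based source code vulnerability detection: Prompt engineering vs Fine tuning ESORICS2025

[Quick read]: This paper addresses the insufficient automation of source code vulnerability detection (CVD), especially as accelerating software development cycles keep driving up the number of vulnerabilities. Its core approach is to explore the potential of large language models (LLMs) for the CVD task and to improve their performance with several state-of-the-art techniques. Key contributions include a new fine-tuning strategy called Double Fine-tuning and a test of the understudied Test-Time Fine-tuning approach, with experiments that use the open-source Llama-3.1 8B model and code samples from the BigVul and PrimeVul datasets. The results show that fine-tuning is the key factor in improving LLM performance on CVD, with Double Fine-tuning performing particularly well, while retrieval-augmented generation (RAG) also works relatively well as an example-selection technique.

Link: https://arxiv.org/abs/2512.09006
Authors: Dyna Soumhane Ouchebara, Stéphane Dupont
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 20 pages, Accepted at ESORICS 2025

Click to view abstract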

Abstract:The significant increase in software production, driven by the acceleration of development cycles over the past two decades, has led to a steady rise in software vulnerabilities, as shown by statistics published yearly by the CVE program. The automation of the source code vulnerability detection (CVD) process has thus become essential, and several methods have been proposed ranging from the well established program analysis techniques to the more recent AI-based methods. Our research investigates Large Language Models (LLMs), which are considered among the most performant AI models to date, for the CVD task. The objective is to study their performance and apply different state-of-the-art techniques to enhance their effectiveness for this task. We explore various fine-tuning and prompt engineering settings. We particularly suggest one novel approach for fine-tuning LLMs which we call Double Fine-tuning, and also test the understudied Test-Time fine-tuning approach. We leverage the recent open-source Llama-3.1 8B, with source code samples extracted from BigVul and PrimeVul datasets. Our conclusions highlight the importance of fine-tuning to resolve the task, the performance of Double tuning, as well as the potential of Llama models for CVD. Though prompting proved ineffective, Retrieval augmented generation (RAG) performed relatively well as an example selection technique. Overall, some of our research questions have been answered, and many are still on hold, which leaves us many future work perspectives. Code repository is available here: this https URL.

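[AI-51] Institutional AI Sovereignty Through Gateway Architecture: Implementation Report from Fontys ICT

[Quick read]: This paper addresses the fragmented, high-risk adoption of commercial generative AI tools in higher education, including unequal access, compliance risks (such as opaque data processing and non-EU hosting), and the lack of a unified governance mechanism. Its core solution is a governed institutional AI gateway platform with three layers: the first is a frontend linked to institutional identity that makes model choice explicit; the second is the gateway core, which enforces policy, manages access and budgets, and routes traffic to EU infrastructure by default; the third is a provider layer that wraps commercial and open-source models in "institutional model cards" to provide a single governance interface. The key innovation is that the gateway pattern delivers model diversity and rapid iteration while preserving institutional sovereignty and control. The pilot's success suggests that only a dedicated governance structure (such as a formal AI Officer role) lets a university sustainably operate a multi-provider AI platform and treat AI as a strategic matter rather than a support function.

Link: https://arxiv.org/abs/2512.08978
Authors: Ruud Huijts, Koen Suilen
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract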

Abstract:To counter fragmented, high-risk adoption of commercial AI tools, we built and ran an institutional AI platform in a six-month, 300-user pilot, showing that a university of applied sciences can offer advanced AI with fair access, transparent risks, controlled costs, and alignment with European law. Commercial AI subscriptions create unequal access and compliance risks through opaque processing and non-EU hosting, yet banning them is neither realistic nor useful. Institutions need a way to provide powerful AI in a sovereign, accountable form. Our solution is a governed gateway platform with three layers: a ChatGPT-style frontend linked to institutional identity that makes model choice explicit; a gateway core enforcing policy, controlling access and budgets, and routing traffic to EU infrastructure by default; and a provider layer wrapping commercial and open-source models in institutional model cards that consolidate vendor documentation into one governance interface. The pilot ran reliably with no privacy incidents and strong adoption, enabling EU-default routing, managed spending, and transparent model choices. Only the gateway pattern combines model diversity and rapid innovation with institutional control. The central insight: AI is not a support function but strategy, demanding dedicated leadership. Sustainable operation requires governance beyond traditional boundaries. We recommend establishing a formal AI Officer role combining technical literacy, governance authority, and educational responsibility. Without it, AI decisions stay ad-hoc and institutional exposure grows. With it, higher-education institutions can realistically operate their own multi-provider AI platform, provided they govern AI as seriously as they teach it. 
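
The gateway core described in the abstract above (policy enforcement, per-user budgets, EU-default routing) can be sketched as a single routing check. This is an illustrative toy, not the Fontys ICT implementation: `ModelCard`, `Gateway`, the `allow_non_eu` opt-in flag, and the flat per-request cost are all hypothetical simplifications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelCard:
    """Hypothetical, heavily reduced stand-in for an institutional model card."""
    name: str
    region: str              # hosting region, e.g. "EU" or "US"
    cost_per_request: float  # simplified flat cost in budget units


class Gateway:
    def __init__(self, cards, budgets):
        self.cards = {card.name: card for card in cards}
        self.remaining = dict(budgets)  # user -> remaining budget

    def route(self, user, model_name, allow_non_eu=False):
        """Return the model card if policy and budget allow the request."""
        card = self.cards[model_name]  # raises KeyError for unknown models
        # EU hosting is the default; anything else needs an explicit opt-in.
        if card.region != "EU" and not allow_non_eu:
            raise PermissionError(f"{model_name} is hosted outside the EU")
        if self.remaining.get(user, 0.0) < card.cost_per_request:
            raise PermissionError(f"budget exhausted for {user}")
        self.remaining[user] -= card.cost_per_request
        return card
```

Placing both checks in one function means every request passes through the same policy point, which is the property the gateway pattern relies on; a rejected request does not consume budget because the deduction happens last.
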

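[AI-52] Peek-a-Boo Reasoning: Contrastive Region Masking in MLLMs

[Quick read]: This paper addresses the opacity of how multimodal large language models (MLLMs) depend on visual regions during chain-of-thought (CoT) reasoning, that is, the lack of a causal, fine-grained diagnostic for how visual information is used at each reasoning step. Existing methods usually look only at final answers or attention maps and struggle to reveal whether a model actually reasons from visual evidence. The key to the solution is Contrastive Region Masking (CRM), which systematically masks annotated visual regions and contrasts the resulting reasoning traces with unmasked baselines to obtain step-level causal attribution. CRM identifies distinct failure modes: some models preserve their reasoning structure but hallucinate when evidence is missing, while others rely heavily on visual cues and are sensitive to perturbations. This shifts the evaluation paradigm from answer correctness to reasoning faithfulness.

Link: https://arxiv.org/abs/2512.08976
Authors: Isha Chaturvedi, Anjana Nair, Yushen Li, Adhitya Rajendra Kumar, Kevin Zhu, Sunishchal Dev, Ashwinee Panda, Vasu Sharma
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract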

Abstract:We introduce Contrastive Region Masking (CRM), a training-free diagnostic that reveals how multimodal large language models (MLLMs) depend on specific visual regions at each step of chain-of-thought (CoT) reasoning. Unlike prior approaches limited to final answers or attention maps, CRM provides causal, step-level attribution by systematically masking annotated regions and contrasting the resulting reasoning traces with unmasked baselines. Applied to datasets such as VisArgs, CRM reveals distinct failure modes: some models preserve reasoning structure, but hallucinate when evidence is missing, while others ground tightly to visual cues yet collapse under perturbations. By shifting the evaluation from correctness of answers to faithfulness of reasoning, CRM reframes visual benchmarks as diagnostic tools, highlighting the need for multimodal evaluation frameworks that measure not just performance, but also robustness and fidelity of reasoning.

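[AI-53] Enhancing Automatic Speech Recognition Through Integrated Noise Detection Architecture

[Quick read]: This paper addresses the performance degradation of automatic speech recognition (ASR) systems in complex acoustic environments, in particular the drop in recognition accuracy under background noise. The key to the solution is to integrate noise detection directly into the recognition architecture: a dedicated noise identification module is added to the wav2vec2 framework and runs in parallel with speech transcription, and the objectives for transcription and noise classification are optimized jointly so that speech and noise are modeled and discriminated together, markedly improving word error rate (WER), character error rate (CER), and noise detection accuracy in noisy conditions.

Link: https://arxiv.org/abs/2512.08973
Authors: Karamvir Singh
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: 5 figures

Click to view abstract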

Abstract:This research presents a novel approach to enhancing automatic speech recognition systems by integrating noise detection capabilities directly into the recognition architecture. Building upon the wav2vec2 framework, the proposed method incorporates a dedicated noise identification module that operates concurrently with speech transcription. Experimental validation using publicly available speech and environmental audio datasets demonstrates substantial improvements in transcription quality and noise discrimination. The enhanced system achieves superior performance in word error rate, character error rate, and noise detection accuracy compared to conventional architectures. Results indicate that joint optimization of transcription and noise classification objectives yields more reliable speech recognition in challenging acoustic conditions.

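[AI-54] Learning Robust Representations for Malicious Content Detection via Contrastive Sampling and Uncertainty Estimation

[Quick read]: This paper addresses the degradation of representation learning in positive-unlabeled (PU) learning under noisy data and class imbalance. Its core solution is the Uncertainty Contrastive Framework (UCF). The key innovations are an uncertainty-aware contrastive loss that dynamically adjusts contrastive weights according to sample confidence, positive anchors that stabilize training, and adaptive temperature scaling that handles batch-level variability, combined with a self-attention-guided LSTM encoder for stronger feature extraction. The method markedly improves classification: on a malicious content detection task, traditional classifiers reach more than 93.38% accuracy, precision above 0.93, and near-perfect recall, and the embedding space is well separated and calibrated, making it suitable for high-stakes settings such as cybersecurity and biomedical text mining.

Link: https://arxiv.org/abs/2512.08969
Authors: Elias Hossain, Umesh Biswas, Charan Gudla, Sai Phani Parsa
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract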

Abstract:We propose the Uncertainty Contrastive Framework (UCF), a Positive-Unlabeled (PU) representation learning framework that integrates uncertainty-aware contrastive loss, adaptive temperature scaling, and a self-attention-guided LSTM encoder to improve classification under noisy and imbalanced conditions. UCF dynamically adjusts contrastive weighting based on sample confidence, stabilizes training using positive anchors, and adapts temperature parameters to batch-level variability. Applied to malicious content classification, UCF-generated embeddings enable multiple traditional classifiers to achieve more than 93.38% accuracy, precision above 0.93, and near-perfect recall, with minimal false negatives and competitive ROC-AUC scores. Visual analyses confirm clear separation between positive and unlabeled instances, highlighting the framework’s ability to produce calibrated, discriminative embeddings. These results position UCF as a robust and scalable solution for PU learning in high-stakes domains such as cybersecurity and biomedical text mining.

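[AI-55] CluCERT: Certifying LLM Robustness via Clustering-Guided Denoising Smoothing

[Quick read]: This paper addresses the lack of reliable robustness guarantees for large language models (LLMs) against meaning-preserving adversarial attacks such as synonym substitution. Existing methods yield loose certified bounds because they perform no semantic validation, and they are computationally expensive because they depend on repeated sampling. The key to the solution is the CluCERT framework, which uses clustering-guided denoising smoothing to obtain tighter certified bounds and better computational efficiency: a semantic clustering filter removes noisy samples and retains meaningful perturbations, while a refine module that extracts core semantics and a fast synonym substitution strategy accelerate the denoising process, so that the method clearly outperforms existing certification methods across multiple downstream tasks and jailbreak defense scenarios.

Link: https://arxiv.org/abs/2512.08967
Authors: Zixia Wang, Gaojie Jin, Jia Hu, Ronghui Mu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract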

Abstract:Recent advancements in Large Language Models (LLMs) have led to their widespread adoption in daily applications. Despite their impressive capabilities, they remain vulnerable to adversarial attacks, as even minor meaning-preserving changes such as synonym substitutions can lead to incorrect predictions. As a result, certifying the robustness of LLMs against such adversarial prompts is of vital importance. Existing approaches focused on word deletion or simple denoising strategies to achieve robustness certification. However, these methods face two critical limitations: (1) they yield loose robustness bounds due to the lack of semantic validation for perturbed outputs and (2) they suffer from high computational costs due to repeated sampling. To address these limitations, we propose CluCERT, a novel framework for certifying LLM robustness via clustering-guided denoising smoothing. Specifically, to achieve tighter certified bounds, we introduce a semantic clustering filter that reduces noisy samples and retains meaningful perturbations, supported by theoretical analysis. Furthermore, we enhance computational efficiency through two mechanisms: a refine module that extracts core semantics, and a fast synonym substitution strategy that accelerates the denoising process. Finally, we conduct extensive experiments on various downstream tasks and jailbreak defense scenarios. Experimental results demonstrate that our method outperforms existing certified approaches in both robustness bounds and computational efficiency.
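
The smoothing idea behind the abstract above can be illustrated with a generic majority vote over synonym-perturbed copies of a prompt. This sketch is not CluCERT: it omits the clustering filter, the refine module, and any certified bound, and the synonym table, substitution rate, and function names are hypothetical. It only shows the repeated-sampling baseline that the abstract says such methods build on.

```python
import random
from collections import Counter


def perturb(words, synonyms, rate, rng):
    """Replace each word that has listed synonyms with probability `rate`."""
    return [
        rng.choice(synonyms[w]) if w in synonyms and rng.random() < rate else w
        for w in words
    ]


def smoothed_predict(classify, text, synonyms, n_samples=50, rate=0.5, seed=0):
    """Classify many perturbed copies and return (majority label, vote share)."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    words = text.split()
    votes = Counter(
        classify(" ".join(perturb(words, synonyms, rate, rng)))
        for _ in range(n_samples)
    )
    label, count = votes.most_common(1)[0]
    return label, count / n_samples
```

The vote share on its own is not a certificate; a certification method turns such sampled statistics into a guaranteed bound, and the cost of drawing `n_samples` perturbations per input is the computational overhead the abstract refers to.
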

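[AI-56] EEG-Bench: A Benchmark for EEG Foundation Models in Clinical Applications NEURIPS

[Quick read]: This paper addresses the lack of a unified evaluation standard for electroencephalography (EEG) foundation models in clinical applications, with the aim of making results comparable and reproducible in real medical settings. The key to the solution is a comprehensive benchmark covering 11 well-defined diagnostic tasks across 14 public EEG datasets; the framework uses minimal preprocessing and standardized evaluation protocols and supports direct comparison of classical baselines with modern foundation models, providing a reliable performance reference for clinical deployment.

Link: https://arxiv.org/abs/2512.08959
Authors: Ard Kastrati, Josua Bürki, Jonas Lauer, Cheng Xuan, Raffaele Iaquinto, Roger Wattenhofer
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Foundation Models for the Brain and Body (BrainBodyFM@NeurIPS)

Click to view abstract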

Abstract:We introduce a unified benchmarking framework focused on evaluating EEG-based foundation models in clinical applications. The benchmark spans 11 well-defined diagnostic tasks across 14 publicly available EEG datasets, including epilepsy, schizophrenia, Parkinson’s disease, OCD, and mild traumatic brain injury. It features minimal preprocessing, standardized evaluation protocols, and enables side-by-side comparisons of classical baselines and modern foundation models. Our results show that while foundation models achieve strong performance in certain settings, simpler models often remain competitive, particularly under clinical distribution shifts. To facilitate reproducibility and adoption, we release all prepared data and code in an accessible and extensible format.

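[AI-57] LUMOS: Large User MOdels for User Behavior Prediction

[Quick read]: This paper addresses the challenge of large-scale user behavior prediction on online B2C platforms. Traditional approaches rely on task-specific models and domain-specific feature engineering, which is time-consuming, computationally expensive, and hard to scale. The key to the solution is LUMOS (Large User MOdel Series), a transformer-based multi-task model that learns multiple tasks jointly from raw user activity data alone, removing the need for manual feature engineering and task-specific models. Its core innovations are a novel cross-attention mechanism that conditions predictions on known future events (such as holidays and sales), capturing complex behavior patterns (for example, "how will upcoming holidays affect user engagement?"), and a multi-modal tokenization strategy that fuses user transactions, event context, and static demographic attributes into rich representations processed through specialized embedding pathways, yielding significant performance gains on a real production dataset.

Link: https://arxiv.org/abs/2512.08957
Authors: Dhruv Nigam
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract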

Abstract:User behavior prediction at scale remains a critical challenge for online B2C platforms. Traditional approaches rely heavily on task-specific models and domain-specific feature engineering. This is time-consuming and computationally expensive, requires domain expertise, and is therefore not scalable. We present LUMOS (Large User MOdel Series), a transformer-based architecture that eliminates task-specific models and manual feature engineering by learning multiple tasks jointly using only raw user activity data. LUMOS introduces a novel cross-attention mechanism that conditions predictions on future known events (e.g., holidays and sales), enabling the model to predict complex behaviour patterns like “how will upcoming holidays affect user engagement?” The architecture also employs multi-modal tokenization, combining user transactions, event context, and static user demographic attributes into rich representations processed through specialized embedding pathways. Through extensive experiments on a production dataset spanning 275 billion user activity tokens from 250 million users, we demonstrate that LUMOS achieves superior performance compared to traditional task-specific models. Across 5 tasks with established baselines, we achieve an average improvement of 0.025 in ROC-AUC for binary classification tasks and a 4.6% reduction in MAPE for regression tasks. Online A/B testing validates that these improvements translate to measurable business impact with a 3.15% increase in Daily Active Users.
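To make the conditioning idea concrete, here is a minimal sketch of scaled dot-product cross-attention in plain Python, with user-history tokens as queries and future-event tokens as keys and values. It is an illustration only: the abstract does not specify the LUMOS layer design, so the function names, the absence of learned projections, and the toy 2-dimensional embeddings are all assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention without learned projections.

    queries: user-history token vectors
    keys, values: future-known-event token vectors (same count)
    """
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of the event value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# Hypothetical toy embeddings (not from the paper).
user_tokens = [[1.0, 0.0], [0.0, 1.0]]
event_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
conditioned = cross_attention(user_tokens, event_tokens, event_tokens)
```

Each row of `conditioned` is a convex combination of the event vectors, so a user token ends up carrying more of whichever upcoming events it aligns with.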

[AI-58] LLM4XCE: Large Language Models for Extremely Large-Scale Massive MIMO Channel Estimation

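[Quick read]: This paper addresses hybrid-field channel estimation in extremely large-scale multiple-input multiple-output (XL-MIMO) systems, where the coexistence of near-field and far-field effects makes it hard for traditional methods to model the channel accurately and to generalize. The key to the solution is LLM4XCE, a channel estimation method built on a large language model (LLM). It combines an embedding module with Parallel Feature-Spatial Attention to deeply fuse pilot features with spatial structure and build a semantically rich input representation. Only the top two Transformer layers are fine-tuned, which keeps training efficient while still capturing latent dependencies in the pilot data, and significantly improves estimation accuracy and generalization under hybrid-field conditions.

Link: https://arxiv.org/abs/2512.08955
Authors: Renbin Li, Shuangshuang Li, Peihao Dong
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: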

Abstract:Extremely large-scale massive multiple-input multiple-output (XL-MIMO) is a key enabler for sixth-generation (6G) networks, offering massive spatial degrees of freedom. Despite these advantages, the coexistence of near-field and far-field effects in hybrid-field channels presents significant challenges for accurate estimation, where traditional methods often struggle to generalize effectively. In recent years, large language models (LLMs) have achieved impressive performance on downstream tasks via fine-tuning, aligning with the semantic communication shift toward task-oriented understanding over bit-level accuracy. Motivated by this, we propose Large Language Models for XL-MIMO Channel Estimation (LLM4XCE), a novel channel estimation framework that leverages the semantic modeling capabilities of large language models to recover essential spatial-channel representations for downstream tasks. The model integrates a carefully designed embedding module with Parallel Feature-Spatial Attention, enabling deep fusion of pilot features and spatial structures to construct a semantically rich representation for LLM input. By fine-tuning only the top two Transformer layers, our method effectively captures latent dependencies in the pilot data while ensuring high training efficiency. Extensive simulations demonstrate that LLM4XCE significantly outperforms existing state-of-the-art methods under hybrid-field conditions, achieving superior estimation accuracy and generalization performance.
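The "fine-tune only the top two Transformer layers" recipe can be sketched as a selection over parameter names. The snippet below is a hedged illustration, not the authors' code: the `layers.<index>.` naming scheme, the helper `select_trainable`, and the `extra_prefixes` option (for modules trained from scratch, such as a task head) are assumptions.

```python
def select_trainable(param_names, num_layers, top_k=2,
                     layer_prefix="layers.", extra_prefixes=()):
    """Map each parameter name to True (train) or False (freeze).

    Only the top_k highest-indexed Transformer blocks are trainable, plus any
    parameter whose name starts with one of extra_prefixes.
    """
    top_blocks = tuple(
        f"{layer_prefix}{i}." for i in range(num_layers - top_k, num_layers)
    )
    wanted = top_blocks + tuple(extra_prefixes)
    # str.startswith accepts a tuple of candidate prefixes.
    return {name: name.startswith(wanted) for name in param_names}

# Hypothetical parameter names for a 4-block backbone.
names = [
    "embed.weight",
    "layers.0.attn.weight",
    "layers.1.attn.weight",
    "layers.2.attn.weight",
    "layers.3.attn.weight",
    "head.weight",
]
flags = select_trainable(names, num_layers=4, extra_prefixes=("head.",))
```

With a PyTorch model whose blocks follow this naming, one would loop over `model.named_parameters()` and set `param.requires_grad = flags[name]`, so the optimizer only updates the selected blocks.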

[AI-59] An Electrocardiogram Multi-task Benchmark with Comprehensive Evaluations and Insightful Findings

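[Quick read]: This paper addresses the open question of how effective foundation models are for electrocardiogram (ECG) analysis, in particular the lack of a systematic evaluation of their performance and applicability against traditional time-series deep learning models. The key to the solution is a comprehensive benchmark framework that quantitatively compares language models, general time-series models, and ECG-specific foundation models, combined with in-depth qualitative analysis of their strengths and limitations. Results show that general time-series and ECG foundation models achieve a top performance rate of 80%, which the summary reads as evidence of strong generalization without heavy manual annotation or domain-expert intervention, opening a new path for applying AI to physiological waveform analysis.

Link: https://arxiv.org/abs/2512.08954
Authors: Yuhao Xu, Jiaying Lu, Sirui Ding, Defu Cao, Xiao Hu, Carl Yang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: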

Abstract:In the process of patient diagnosis, non-invasive measurements are widely used due to their low risks and quick results. Electrocardiogram (ECG), as a non-invasive method to collect heart activities, is used to diagnose cardiac conditions. Analyzing the ECG typically requires domain expertise, which is a roadblock to applying artificial intelligence (AI) for healthcare. Through advances in self-supervised learning and foundation models, AI systems can now acquire and leverage domain knowledge without relying solely on human expertise. However, there is a lack of comprehensive analyses over the foundation models’ performance on ECG. This study aims to answer the research question: “Are Foundation Models Useful for ECG Analysis?” To address it, we evaluate language/general time-series/ECG foundation models in comparison with time-series deep learning models. The experimental results show that general time-series/ECG foundation models achieve a top performance rate of 80%, indicating their effectiveness in ECG analysis. In-depth analyses and insights are provided along with comprehensive experimental results. This study highlights the limitations and potential of foundation models in advancing physiological waveform analysis. The data and code for this benchmark are publicly available at this https URL.

[AI-60] SimClinician: A Multimodal Simulation Testbed for Reliable Psychologist AI Collaboration in Mental Health Diagnosis

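[Quick read]: This paper addresses the neglect of human-AI collaborative decision-making in current AI-based mental health diagnosis research, in particular the key step of how psychologists respond to AI suggestions (accept, adjust, or reject). Existing work mostly judges the value of AI by benchmark accuracy, whereas in practice diagnostic decisions are continuous and shaped by multimodal cues such as a patient's tone, pauses, word choice, and nonverbal behavior. The authors therefore propose SimClinician, an interactive simulation platform built around three modules for psychologist-AI collaborative diagnosis: a dashboard integrating audio, text, and gaze-expression patterns; an avatar module that renders de-identified dynamics; and a decision layer that maps AI outputs to multimodal evidence, letting psychologists review the AI's reasoning and enter a final diagnosis. Experiments show that adding a confirmation step raises acceptance by 23% while keeping escalations below 9%, preserving smooth and reliable human-AI collaboration.

Link: https://arxiv.org/abs/2512.08953
Authors: Filippo Cenacchi, Longbing Cao, Deborah Richards
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: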

Abstract:AI-based mental health diagnosis is often judged by benchmark accuracy, yet in practice its value depends on how psychologists respond: whether they accept, adjust, or reject AI suggestions. Mental health makes this especially challenging: decisions are continuous and shaped by cues in tone, pauses, word choice, and nonverbal behaviors of patients. Current research rarely examines how AI diagnosis interface design influences these choices, leaving little basis for reliable testing before live studies. We present SimClinician, an interactive simulation platform, to transform patient data into psychologist-AI collaborative diagnosis. Contributions include: (1) a dashboard integrating audio, text, and gaze-expression patterns; (2) an avatar module rendering de-identified dynamics for analysis; (3) a decision layer that maps AI outputs to multimodal evidence, letting psychologists review AI reasoning and enter a diagnosis. Tested on the E-DAIC corpus (276 clinical interviews, expanded to 480,000 simulations), SimClinician shows that a confirmation step raises acceptance by 23%, keeping escalations below 9% and maintaining smooth interaction flow.

[AI-61] Learning When to Ask: Simulation-Trained Humanoids for Mental-Health Diagnosis

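[Quick read]: This paper addresses the slow iteration, heavy hardware wear, and limited training diversity that come from relying on real-user testing of humanoid robots in mental health screening (such as depression and PTSD). The core challenge is that a screening agent must master nonverbal dynamics such as conversational pacing, prosody, backchannels, and attention allocation across face and speech, whereas existing simulators generally lack policy learning over these factors and most controllers over-focus on task accuracy at the expense of trust, pacing, and rapport. The key to the solution is an agent-centred, simulation-first pipeline that turns interview data into 276 Unreal Engine MetaHuman patients with synchronised speech, gaze/face, and head-torso poses, and introduces a perception-fusion-policy loop that decides, under safety constraints, when to speak, when to backchannel, and how to avoid interruptions. Training additionally uses counterfactual replay (bounded nonverbal perturbations) and an uncertainty-aware turn manager to reduce diagnostic ambiguity. Experiments show that a TD3-based controller outperforms PPO and CEM, achieving steadier pacing and higher coverage at comparable reward, and stays robust under modality dropout and a renderer swap, supporting the approach's gains in social timing and completeness.

Link: https://arxiv.org/abs/2512.08952
Authors: Filippo Cenacchi, Deborah Richards, Longbing Cao
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
Comments: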

Abstract:Testing humanoid robots with users is slow, causes wear, and limits iteration and diversity. Yet screening agents must master conversational timing, prosody, backchannels, and what to attend to in faces and speech for Depression and PTSD. Most simulators omit policy learning with nonverbal dynamics; many controllers chase task accuracy while underweighting trust, pacing, and rapport. We virtualise the humanoid as a conversational agent to train without hardware burden. Our agent-centred, simulation-first pipeline turns interview data into 276 Unreal Engine MetaHuman patients with synchronised speech, gaze/face, and head-torso poses, plus PHQ-8 and PCL-C flows. A perception-fusion-policy loop decides what and when to speak, when to backchannel, and how to avoid interruptions, under a safety shield. Training uses counterfactual replay (bounded nonverbal perturbations) and an uncertainty-aware turn manager that probes to reduce diagnostic ambiguity. Results are simulation-only; the humanoid is the transfer target. In comparing three controllers, a custom TD3 (Twin Delayed DDPG) outperformed PPO and CEM, achieving near-ceiling coverage with steadier pace at comparable rewards. Decision-quality analyses show negligible turn overlap, aligned cut timing, fewer clarification prompts, and shorter waits. Performance stays stable under modality dropout and a renderer swap, and rankings hold on a held-out patient split. Contributions: (1) an agent-centred simulator that turns interviews into 276 interactive patients with bounded nonverbal counterfactuals; (2) a safe learning loop that treats timing and rapport as first-class control variables; (3) a comparative study (TD3 vs PPO/CEM) with clear gains in completeness and social timing; and (4) ablations and robustness analyses explaining the gains and enabling clinician-supervised humanoid pilots.
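For readers unfamiliar with TD3 (Twin Delayed DDPG), the sketch below shows the target value that textbook TD3 uses to train its two critics: clipped noise on the target action, then the minimum of the two target critics. It is not the paper's custom controller; the scalar action, the hyperparameter defaults, and the toy stand-in networks are assumptions, and the delayed actor update is not shown.

```python
import random

def clip(x, low, high):
    """Clamp a scalar to the interval [low, high]."""
    return max(low, min(high, x))

def td3_target(reward, done, next_state, target_actor, target_q1, target_q2,
               gamma=0.99, noise_std=0.2, noise_clip=0.5,
               act_low=-1.0, act_high=1.0, rng=random):
    """Bootstrapped target for both critics in textbook TD3 (scalar action)."""
    # Target policy smoothing: perturb the target action with clipped noise.
    noise = clip(rng.gauss(0.0, noise_std), -noise_clip, noise_clip)
    next_action = clip(target_actor(next_state) + noise, act_low, act_high)
    # Clipped double-Q: take the smaller of the two target critics.
    q_min = min(target_q1(next_state, next_action),
                target_q2(next_state, next_action))
    # No bootstrapping past a terminal transition.
    return reward + gamma * (1.0 - float(done)) * q_min

# Hypothetical stand-ins for the target networks.
toy_actor = lambda state: 0.1 * state
toy_q1 = lambda state, action: 2.0
toy_q2 = lambda state, action: 1.0

y = td3_target(reward=0.5, done=False, next_state=1.0,
               target_actor=toy_actor, target_q1=toy_q1, target_q2=toy_q2,
               rng=random.Random(0))
```

Taking the minimum of two critics counters value overestimation, which is one plausible reason a TD3-style controller could hold a steadier pace than alternatives at comparable reward.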

[AI-62] AI Co-Artist: A LLM-Powered Framework for Interactive GLSL Shader Animation Evolution

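[Quick read]: This paper addresses the high technical barrier of real-time shader programming, which limits artistic creation, especially for artists and designers without a programming background who struggle to achieve complex visual effects with tools like GLSL. The key to the solution is AI Co-Artist, a system that combines the semantic understanding and program synthesis capabilities of large language models (LLMs) such as GPT-4 with user-guided evolutionary principles. Through an intuitive, visual interface, users iteratively evolve and refine GLSL shaders, which substantially lowers the technical barrier, improves creative efficiency, and supports diverse creative output.

Link: https://arxiv.org/abs/2512.08951
Authors: Kamer Ali Yuksel, Hassan Sawaf
Affiliation: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: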

Abstract:Creative coding and real-time shader programming are at the forefront of interactive digital art, enabling artists, designers, and enthusiasts to produce mesmerizing, complex visual effects that respond to real-time stimuli such as sound or user interaction. However, despite the rich potential of tools like GLSL, the steep learning curve and requirement for programming fluency pose substantial barriers for newcomers and even experienced artists who may not have a technical background. In this paper, we present AI Co-Artist, a novel interactive system that harnesses the capabilities of large language models (LLMs), specifically GPT-4, to support the iterative evolution and refinement of GLSL shaders through a user-friendly, visually-driven interface. Drawing inspiration from the user-guided evolutionary principles pioneered by the Picbreeder platform, our system empowers users to evolve shader art using intuitive interactions, without needing to write or understand code. AI Co-Artist serves as both a creative companion and a technical assistant, allowing users to explore a vast generative design space of real-time visual art. Through comprehensive evaluations, including structured user studies and qualitative feedback, we demonstrate that AI Co-Artist significantly reduces the technical threshold for shader creation, enhances creative outcomes, and supports a wide range of users in producing professional-quality visual effects. Furthermore, we argue that this paradigm is broadly generalizable. By leveraging the dual strengths of LLMs (semantic understanding and program synthesis), our method can be applied to diverse creative domains, including website layout generation, architectural visualizations, product prototyping, and infographics.

[AI-63] Beyond Technical Debt: How AI-Assisted Development Creates Comprehension Debt in Resource-Constrained Indie Teams

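[Quick read]: This paper addresses the systemic challenges (technical debt, coordination difficulties, and developer burnout) faced by distributed, part-time indie game teams that lack production frameworks suited to their context. The authors propose CIGDI (Co-Intelligence Game Development Ideation), a seven-stage iterative framework that integrates AI tools in a structured way around human-in-the-loop decision points (Priority Criteria and Timeboxing). The key contribution is the identification of a new form of technical debt, "comprehension debt": AI helps the team build systems beyond its own independent skill level, which creates fragility and heavy AI dependency, and exposes the central tension of whether AI support acts as a learning ladder or turns into a dependency trap.

Link: https://arxiv.org/abs/2512.08942
Authors: Yujie Zhang
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: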

Abstract:Junior indie game developers in distributed, part-time teams lack production frameworks suited to their specific context, as traditional methodologies are often inaccessible. This study introduces the CIGDI (Co-Intelligence Game Development Ideation) Framework, an alternative approach for integrating AI tools to address persistent challenges of technical debt, coordination, and burnout. The framework emerged from a three-month reflective practice and autoethnographic study of a three-person distributed team developing the 2D narrative game “The Worm’s Memoirs”. Based on analysis of development data (N=157 Jira tasks, N=333 GitHub commits, N=13+ Miro boards, N=8 reflection sessions), CIGDI is proposed as a seven-stage iterative process structured around human-in-the-loop decision points (Priority Criteria and Timeboxing). While AI support democratized knowledge access and reduced cognitive load, our analysis identified a significant challenge: “comprehension debt.” We define this as a novel form of technical debt where AI helps teams build systems more sophisticated than their independent skill level can create or maintain. This paradox (possessing functional systems the team incompletely understands) creates fragility and AI dependency, distinct from traditional code quality debt. This work contributes a practical production framework for resource-constrained teams and identifies critical questions about whether AI assistance constitutes a learning ladder or a dependency trap for developer skill.

[AI-64] Assessing the Human-Likeness of LLM-Driven Digital Twins in Simulating Health Care System Trust

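[Quick read]: This paper addresses the unclear ability of Large Language Model (LLM)-driven human digital twins to simulate complex human psychological traits (such as distrust in the healthcare system), a gap that directly affects health professionals' trust in and adoption of LLM-based AI systems in routine work. The key to the solution is to use the Twin-2K-500 dataset and the Health Care System Distrust Scale (HCSDS) to systematically evaluate digital-twin simulations against a human-subject sample, analyzing item-level distributions, summary statistics, and demographic subgroup patterns, and thereby to reveal both the potential of LLM digital twins for simulating population trends and their limits in capturing subgroup differences. The study finds that the digital twins reproduce major demographic patterns such as age and gender reasonably well but are less sensitive to finer differences such as education level. Current LLM-driven digital twins therefore fall short in modeling complex human attitudes and need careful calibration and validation before use in inferential analysis or policy simulation in health systems engineering.

Link: https://arxiv.org/abs/2512.08939
Authors: Yuzhou Wu, Mingyang Wu, Di Liu, Rong Yin, Kang Li
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 6 pages, 1 figure; may be published in IISE Annual Conference Expo 2026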

Abstract:Serving as an emerging and powerful tool, Large Language Model (LLM)-driven Human Digital Twins are showing great potential in healthcare system research. However, their actual simulation ability for complex human psychological traits, such as distrust in the healthcare system, remains unclear. This research gap particularly impacts health professionals’ trust and usage of LLM-based Artificial Intelligence (AI) systems in assisting their routine work. In this study, based on the Twin-2K-500 dataset, we systematically evaluated the simulation results of the LLM-driven human digital twin using the Health Care System Distrust Scale (HCSDS) with an established human-subject sample, analyzing item-level distributions, summary statistics, and demographic subgroup patterns. Results showed that the simulated responses by the digital twin were significantly more centralized with lower variance and had fewer selections of extreme options (all p < 0.001). While the digital twin broadly reproduces human results in major demographic patterns, such as age and gender, it exhibits relatively low sensitivity in capturing minor differences in education levels. The LLM-based digital twin simulation has the potential to simulate population trends, but it also presents challenges in making detailed, specific distinctions in subgroups of human beings. This study suggests that the current LLM-driven Digital Twins have limitations in modeling complex human attitudes, which require careful calibration and validation before applying them in inferential analyses or policy simulations in health systems engineering. Future studies are necessary to examine the emotional reasoning mechanism of LLMs before their use, particularly for studies that involve simulations sensitive to social topics, such as human-automation trust.
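The two descriptive quantities behind "more centralized with lower variance" and "fewer selections of extreme options" are easy to compute. The sketch below does so on made-up responses; the 5-point scale, the helper `extreme_share`, and both response lists are assumptions for illustration, not data from the study, and no significance test is shown.

```python
from statistics import pvariance

def extreme_share(responses, low=1, high=5):
    """Fraction of responses that pick either end of the scale."""
    return sum(r in (low, high) for r in responses) / len(responses)

# Made-up responses to one item on an assumed 5-point scale.
human = [1, 2, 5, 3, 4, 1, 5, 2, 3, 5]
twin = [3, 3, 2, 4, 3, 3, 4, 2, 3, 3]

summary = {
    "human": {"variance": pvariance(human), "extreme": extreme_share(human)},
    "twin": {"variance": pvariance(twin), "extreme": extreme_share(twin)},
}
```

In this toy example the simulated responses cluster around the midpoint, which is the pattern the abstract reports for the digital twin.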

[AI-65] When AI Gives Advice: Evaluating AI and Human Responses to Online Advice-Seeking for Well-Being

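[Quick read]: This paper examines the quality of advice produced by generative AI, in particular how it compares with human advice from online communities. It finds that advice generated by large language models (LLMs) ranks significantly higher than top-voted Reddit advice overall and on effectiveness, warmth, and willingness to seek advice again. It also shows that a newer model version need not give better advice (GPT-4o beat GPT-5 on all metrics except sycophancy), and that human advice can be unobtrusively polished to compete with AI-generated advice. The key proposal is a hybrid advice ecosystem that blends expert oversight, crowd input, and AI generation.

Link: https://arxiv.org/abs/2512.08937
Authors: Harsh Kumar, Jasmine Chahal, Yinuo Zhao, Zeling Zhang, Annika Wei, Louis Tay, Ashton Anderson
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: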

Abstract:Seeking advice is a core human behavior that the Internet has reinvented twice: first through forums and Q&A communities that crowdsource public guidance, and now through large language models (LLMs) that deliver private, on-demand counsel at scale. Yet the quality of this synthesized LLM advice remains unclear. How does it compare, not only against arbitrary human comments, but against the wisdom of the online crowd? We conducted two studies (N = 210) in which experts compared top-voted Reddit advice with LLM-generated advice. LLMs ranked significantly higher overall and on effectiveness, warmth, and willingness to seek advice again. GPT-4o beat GPT-5 on all metrics except sycophancy, suggesting that benchmark gains need not improve advice-giving. In our second study, we examined how human and algorithmic advice could be combined, and found that human advice can be unobtrusively polished to compete with AI-generated comments. Finally, to surface user expectations, we ran an exploratory survey with undergraduates (N=148) that revealed heterogeneous, persona-dependent preferences for agent qualities (e.g., coach-like: goal-focused structure; friend-like: warmth and humor). We conclude with design implications for advice-giving agents and ecosystems blending AI, crowd input, and expert oversight.

[AI-66] A Principle-based Framework for the Development and Evaluation of Large Language Models for Health and Wellness

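[Quick read]: This paper addresses the user safety, model accuracy, and privacy challenges that generative AI faces in personal health applications, in particular the risks and uncertainty of using large language models (LLMs) for personalized health guidance. The key to the solution is a principle-based evaluation framework, SHARP (Safety, Helpfulness, Accuracy, Relevance, Personalization), which integrates human evaluation by generalists and clinical specialists, autorater assessments, and adversarial testing into an iterative development lifecycle, giving an end-to-end, multi-dimensional, evidence-driven way to evaluate and improve LLM-powered health applications. A staged deployment in the Fitbit Insights explorer system, involving over 13,000 consented users, shows that the approach surfaces issues not found in initial testing, guides targeted improvements, and establishes a responsible development approach that balances innovation and safety.

Link: https://arxiv.org/abs/2512.08936
Authors: Brent Winslow, Jacqueline Shreibati, Javier Perez, Hao-Wei Su, Nichole Young-Lin, Nova Hammerquist, Daniel McDuff, Jason Guss, Jenny Vafeiadou, Nick Cain, Alex Lin, Erik Schenck, Shiva Rajagopal, Jia-Ru Chung, Anusha Venkatakrishnan, Amy Armento Lee, Maryam Karimzadehgan, Qingyou Meng, Rythm Agarwal, Aravind Natarajan, Tracy Giest
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: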

Abstract:The incorporation of generative artificial intelligence into personal health applications presents a transformative opportunity for personalized, data-driven health and fitness guidance, yet also poses challenges related to user safety, model accuracy, and personal privacy. To address these challenges, a novel, principle-based framework was developed and validated for the systematic evaluation of LLMs applied to personal health and wellness. First, the development of the Fitbit Insights explorer, a large language model (LLM)-powered system designed to help users interpret their personal health data, is described. Subsequently, the safety, helpfulness, accuracy, relevance, and personalization (SHARP) principle-based framework is introduced as an end-to-end operational methodology that integrates comprehensive evaluation techniques including human evaluation by generalists and clinical specialists, autorater assessments, and adversarial testing, into an iterative development lifecycle. Through the application of this framework to the Fitbit Insights explorer in a staged deployment involving over 13,000 consented users, challenges not apparent during initial testing were systematically identified. This process guided targeted improvements to the system and demonstrated the necessity of combining isolated technical evaluations with real-world user feedback. Finally, a comprehensive, actionable approach is established for the responsible development and deployment of LLM-powered health applications, providing a standardized methodology to foster innovation while ensuring emerging technologies are safe, effective, and trustworthy for users.

[AI-67] Motion2Meaning: A Clinician-Centered Framework for Contestable LLM in Parkinson's Disease Gait Interpretation

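[Quick read]: This paper addresses the lack of transparency and contestability in current AI-based gait analysis for Parkinson's Disease (PD) care: existing clinical dashboards give clinicians no effective way to interrogate or challenge AI decisions. The key to the solution is Motion2Meaning, a framework that realizes Contestable AI through a tightly integrated interface with three core components: a Gait Data Visualization Interface (GDVI); a one-dimensional Convolutional Neural Network (1D-CNN) that predicts Hoehn & Yahr stages from vertical Ground Reaction Force (vGRF) time series recorded by wearable sensors; and a Contestable Interpretation Interface (CII) that combines a novel Cross-Modal Explanation Discrepancy (XMED) safeguard with a contestable Large Language Model (LLM). XMED flags unreliable model predictions, while the LLM lets clinicians validate correct predictions and contest some of the errors, yielding a PD gait analysis system that is transparent, auditable, and under clinical oversight.

Link: https://arxiv.org/abs/2512.08934
Authors: Loc Phuc Truong Nguyen, Hung Thanh Do, Hung Truong Thanh Nguyen, Hung Cao
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Accepted at the 9th International Symposium on Chatbots and Human-Centered AI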

Abstract:AI-assisted gait analysis holds promise for improving Parkinson’s Disease (PD) care, but current clinical dashboards lack transparency and offer no meaningful way for clinicians to interrogate or contest AI decisions. To address this issue, we present Motion2Meaning, a clinician-centered framework that advances Contestable AI through a tightly integrated interface designed for interpretability, oversight, and procedural recourse. Our approach leverages vertical Ground Reaction Force (vGRF) time-series data from wearable sensors as an objective biomarker of PD motor states. The system comprises three key components: a Gait Data Visualization Interface (GDVI), a one-dimensional Convolutional Neural Network (1D-CNN) that predicts Hoehn & Yahr severity stages, and a Contestable Interpretation Interface (CII) that combines our novel Cross-Modal Explanation Discrepancy (XMED) safeguard with a contestable Large Language Model (LLM). Our 1D-CNN achieves 89.0% F1-score on the public PhysioNet gait dataset. XMED successfully identifies model unreliability by detecting a five-fold increase in explanation discrepancies in incorrect predictions (7.45%) compared to correct ones (1.56%), while our LLM-powered interface enables clinicians to validate correct predictions and successfully contest a portion of the model’s errors. A human-centered evaluation of this contestable interface reveals a crucial trade-off between the LLM’s factual grounding and its readability and responsiveness to clinical feedback. This work demonstrates the feasibility of combining wearable sensor analysis with Explainable AI (XAI) and contestable LLMs to create a transparent, auditable system for PD gait interpretation that maintains clinical oversight while leveraging advanced AI capabilities. Our implementation is publicly available at: this https URL.

[AI-68] Agentic AI as Undercover Teammates: Argumentative Knowledge Construction in Hybrid Human-AI Collaborative Learning

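[Quick read]: This paper addresses the poorly understood effect of generative AI on argumentative knowledge construction in collaborative learning environments. The core of the approach is to design AI agents as "undercover teammates" with either supportive or contrarian personas and to analyze empirically how each persona reshapes the epistemic and social dynamics of collaborative reasoning. The key finding is that the educational value of AI comes not from increasing the volume of discourse but from improving the quality of reasoning by reshaping argument structure and collaboration patterns: supportive AI promotes conceptual integration and consensus-oriented reasoning, whereas contrarian AI provokes critical elaboration and conflict-driven negotiation. This supports viewing agentic AI as a bounded yet adaptive epistemic and social participant in hybrid human-AI learning environments.

Link: https://arxiv.org/abs/2512.08933
Authors: Lixiang Yan, Yueqiao Jin, Linxuan Zhao, Roberto Martinez-Maldonado, Xinyu Li, Xiu Guan, Wenxin Guo, Xibin Han, Dragan Gašević
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: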

Abstract:Generative artificial intelligence (AI) agents are increasingly embedded in collaborative learning environments, yet their impact on the processes of argumentative knowledge construction remains insufficiently understood. Emerging conceptualisations of agentic AI and artificial agency suggest that such systems possess bounded autonomy, interactivity, and adaptability, allowing them to engage as epistemic participants rather than mere instructional tools. Building on this theoretical foundation, the present study investigates how agentic AI, designed as undercover teammates with either supportive or contrarian personas, shapes the epistemic and social dynamics of collaborative reasoning. Drawing on Weinberger and Fischer’s (2006) four-dimensional framework, participation, epistemic reasoning, argument structure, and social modes of co-construction, we analysed synchronous discourse data from 212 human and 64 AI participants (92 triads) engaged in an analytical problem-solving task. Mixed-effects and epistemic network analyses revealed that AI teammates maintained balanced participation but substantially reorganised epistemic and social processes: supportive personas promoted conceptual integration and consensus-oriented reasoning, whereas contrarian personas provoked critical elaboration and conflict-driven negotiation. Epistemic adequacy, rather than participation volume, predicted individual learning gains, indicating that agentic AI’s educational value lies in enhancing the quality and coordination of reasoning rather than amplifying discourse quantity. These findings extend CSCL theory by conceptualising agentic AI as epistemic and social participants, bounded yet adaptive collaborators that redistribute cognitive and argumentative labour in hybrid human-AI learning environments.

[AI-69] Prediction-aware and Reinforcement Learning based Altruistic Cooperative Driving

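[Quick read]: This paper addresses safe navigation for autonomous vehicles (AVs) that share the road with human-driven vehicles (HVs). The core challenge is that HVs continuously adapt their behavior in response to AV decisions, which produces a complex social interaction setting. To improve AV safety and robustness, the authors propose a reinforcement learning (RL) framework that integrates social navigation with prediction. The key innovations are a Hybrid Predictive Network (HPN) that produces a sequence of future observations through a multi-step prediction chain and feeds it to a Value Function Network (VFN) trained to optimize a social utility, and a safety prioritizer that uses interpretable kinematic predictions to mask unsafe actions and so constrain the RL policy. Together these let the AV anticipate and adapt to changes in HV behavior, giving more efficient and safer cooperative decisions.

Link: https://arxiv.org/abs/2211.10585
Authors: Rodolfo Valiente, Mahdi Razzaghpour, Behrad Toghi, Ghayoor Shah, Yaser P. Fallah
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: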

Abstract:Autonomous vehicle (AV) navigation in the presence of Human-driven vehicles (HVs) is challenging, as HVs continuously update their policies in response to AVs. In order to navigate safely in the presence of complex AV-HV social interactions, the AVs must learn to predict these changes. Humans are capable of navigating such challenging social interaction settings because of their intrinsic knowledge about other agents’ behaviors, which they use to forecast what might happen in the future. Inspired by humans, we provide our AVs the capability of anticipating future states and leveraging prediction in a cooperative reinforcement learning (RL) decision-making framework, to improve safety and robustness. In this paper, we propose an integration of two essential and earlier-presented components of AVs: social navigation and prediction. We formulate the AV decision-making process as an RL problem and seek to obtain optimal policies that produce socially beneficial results utilizing a prediction-aware planning and social-aware optimization RL framework. We also propose a Hybrid Predictive Network (HPN) that anticipates future observations. The HPN is used in a multi-step prediction chain to compute a window of predicted future observations to be used by the value function network (VFN). Finally, a safe VFN is trained to optimize a social utility using a sequence of previous and predicted observations, and a safety prioritizer is used to leverage the interpretable kinematic predictions to mask the unsafe actions, constraining the RL policy. We compare our prediction-aware AV to state-of-the-art solutions and demonstrate performance improvements in terms of efficiency and safety in multiple simulated scenarios.
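The idea of masking unsafe actions with kinematic predictions can be sketched in a few lines. The example below is a simplified stand-in, not the paper's safety prioritizer: the three-action set, the constant-velocity gap prediction, and the thresholds `horizon` and `min_gap` are all assumptions chosen for illustration.

```python
# Hypothetical discrete actions: change in ego speed (m/s).
ACTIONS = {0: -2.0, 1: 0.0, 2: 2.0}  # 0 = brake, 1 = keep, 2 = accelerate

def predicted_gap(gap, ego_speed, lead_speed, horizon):
    """Constant-velocity prediction of the gap to the lead vehicle (m)."""
    return gap + (lead_speed - ego_speed) * horizon

def safety_mask(gap, ego_speed, lead_speed, horizon=2.0, min_gap=5.0):
    """True for each action whose predicted gap stays at or above min_gap."""
    return [
        predicted_gap(gap, ego_speed + ACTIONS[a], lead_speed, horizon) >= min_gap
        for a in sorted(ACTIONS)
    ]

def safe_greedy_action(q_values, is_safe, fallback_action=0):
    """Pick the highest-value action among those marked safe."""
    safe = [a for a, ok in enumerate(is_safe) if ok]
    if not safe:
        return fallback_action  # assumption: braking is the fallback
    return max(safe, key=lambda a: q_values[a])

mask = safety_mask(gap=10.0, ego_speed=20.0, lead_speed=18.0)
action = safe_greedy_action(q_values=[0.1, 0.4, 0.9], is_safe=mask)
```

Here the value estimates prefer accelerating, but the mask rules it out because the predicted gap would shrink to 2 m, so the policy keeps its speed instead.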

[AI-70] Altruistic Maneuver Planning for Cooperative Autonomous Vehicles Using Multi-agent Advantage Actor-Critic CVPR2021

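[Quick read]: This paper addresses cooperative decision-making in mixed-autonomy traffic, where autonomous vehicles (AVs) share the road with human-driven vehicles. The core challenge is how to make AVs behave altruistically during maneuver planning so that overall traffic flow and safety improve. The key to the solution is an end-to-end multi-agent reinforcement learning approach built on a multi-agent variant of the synchronous Advantage Actor-Critic (A2C) algorithm. AVs implicitly learn the behavior of human drivers from experience alone, without relying on explicit driver behavior models, which lets them coordinate with each other and indirectly influence human drivers, so that socially desirable behavior emerges.

Link: https://arxiv.org/abs/2107.05664
Authors: Behrad Toghi, Rodolfo Valiente, Dorsa Sadigh, Ramtin Pedarsani, Yaser P. Fallah
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Accepted to 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021) - Workshop on Autonomous Driving: Perception, Prediction and Planning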

Abstract:With the adoption of autonomous vehicles on our roads, we will witness a mixed-autonomy environment where autonomous and human-driven vehicles must learn to co-exist by sharing the same road infrastructure. To attain socially-desirable behaviors, autonomous vehicles must be instructed to consider the utility of other vehicles around them in their decision-making process. Particularly, we study the maneuver planning problem for autonomous vehicles and investigate how a decentralized reward structure can induce altruism in their behavior and incentivize them to account for the interest of other autonomous and human-driven vehicles. This is a challenging problem due to the ambiguity of a human driver’s willingness to cooperate with an autonomous vehicle. Thus, in contrast with existing works that rely on behavior models of human drivers, we take an end-to-end approach and let the autonomous agents implicitly learn the decision-making process of human drivers only from experience. We introduce a multi-agent variant of the synchronous Advantage Actor-Critic (A2C) algorithm and train agents that coordinate with each other and can affect the behavior of human drivers to improve traffic flow and safety.
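Two ingredients named in the abstract can be illustrated with a short sketch: a reward that accounts for other vehicles, and the advantage estimate that A2C uses to update the actor. The linear blend in `altruistic_reward` and its `weight` parameter are assumptions (the abstract does not give the reward's form); the return and advantage computations are the standard bootstrapped ones.

```python
def altruistic_reward(ego_reward, other_rewards, weight=0.5):
    """Blend an agent's own reward with the mean reward of nearby vehicles.

    weight=0 is purely egoistic; weight=1 is purely altruistic. This blend is
    an illustrative assumption, not the reward from the paper.
    """
    others = sum(other_rewards) / len(other_rewards) if other_rewards else 0.0
    return (1.0 - weight) * ego_reward + weight * others

def discounted_returns(rewards, bootstrap_value, gamma=0.99):
    """Returns G_t = r_t + gamma * G_{t+1}, seeded with the critic's estimate."""
    returns = []
    running = bootstrap_value
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

def advantages(rewards, values, bootstrap_value, gamma=0.99):
    """Advantage A_t = G_t - V(s_t), used to weight the policy gradient."""
    returns = discounted_returns(rewards, bootstrap_value, gamma)
    return [g - v for g, v in zip(returns, values)]

blended = altruistic_reward(ego_reward=1.0, other_rewards=[0.0, 2.0])
adv = advantages(rewards=[1.0, 1.0], values=[1.0, 0.5],
                 bootstrap_value=0.0, gamma=0.5)
```

In a decentralized setup each agent would compute its own blended reward from local observations and feed it into the same advantage calculation.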

[AI-71] Supervised learning pays attention

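Quick read: This paper addresses the difficulty that conventional supervised learning methods have in adapting flexibly to individual prediction points on heterogeneous data while keeping models simple and interpretable. The core question is how to build a personalized local model for each test sample without pre-specifying clusters or a similarity structure, while still explaining which features matter and which training samples are relevant. The key to the solution is attention weighting: a supervised similarity measure reweights the training data for each test point, so the local model focuses on features and interactions that are predictive of the outcome. The method improves predictive performance, and theory shows that attention-weighted linear models attain lower mean squared error than the standard linear model under mixture-of-models data-generating processes with known subgroup structure.

Link: https://arxiv.org/abs/2512.09912
Authors: Erin Craig, Robert Tibshirani
Affiliations: Unknown
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: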

Abstract: In-context learning with attention enables large neural networks to make context-specific predictions by selectively focusing on relevant examples. Here, we adapt this idea to supervised learning procedures such as lasso regression and gradient boosting, for tabular data. Our goals are to (1) flexibly fit personalized models for each prediction point and (2) retain model simplicity and interpretability. Our method fits a local model for each test observation by weighting the training data according to attention, a supervised similarity measure that emphasizes features and interactions that are predictive of the outcome. Attention weighting allows the method to adapt to heterogeneous data in a data-driven way, without requiring cluster or similarity pre-specification. Further, our approach is uniquely interpretable: for each test observation, we identify which features are most predictive and which training observations are most relevant. We then show how to use attention weighting for time series and spatial data, and we present a method for adapting pretrained tree-based models to distributional shift using attention-weighted residual corrections. Across real and simulated datasets, attention weighting improves predictive performance while preserving interpretability, and theory shows that attention-weighting linear models attain lower mean squared error than the standard linear model under mixture-of-models data-generating processes with known subgroup structure.
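
The idea of fitting a separate local model for each test point by reweighting the training data can be sketched roughly as follows. This is only an illustration under stated assumptions: the weights here are a softmax over negative squared distances, which stands in for the paper's supervised attention measure (not reproduced here), and the function names and the `temperature` parameter are hypothetical.

```python
import math

def softmax_weights(x_train, x_test, temperature=1.0):
    """Softmax over negative squared distances (a stand-in for learned attention)."""
    scores = [-((x - x_test) ** 2) / temperature for x in x_train]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_line_fit(x, y, w):
    """Weighted least squares for y ~ a + b*x, with weights w that sum to 1."""
    xm = sum(wi * xi for wi, xi in zip(w, x))
    ym = sum(wi * yi for wi, yi in zip(w, y))
    sxx = sum(wi * (xi - xm) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xm) * (yi - ym) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx if sxx > 0 else 0.0
    return ym - b * xm, b

def local_predict(x_train, y_train, x_test, temperature=1.0):
    """Fit a line weighted toward training points near x_test, then predict there."""
    w = softmax_weights(x_train, x_test, temperature)
    a, b = weighted_line_fit(x_train, y_train, w)
    return a + b * x_test

# Hypothetical data with two subgroups: y = x for x < 0, y = -2x for x >= 0.
x_train = [i / 10 for i in range(-20, 21)]
y_train = [x if x < 0 else -2 * x for x in x_train]
pred_left = local_predict(x_train, y_train, -1.5, temperature=0.05)
pred_right = local_predict(x_train, y_train, 1.5, temperature=0.05)
```

With a small temperature, each prediction is driven almost entirely by the subgroup the test point falls in, which is the behavior a single global line cannot provide on this data.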

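[AI-72] The Ky Fan Norms and Beyond: Dual Norms and Combinations for Matrix Optimization

Quick read: This paper addresses the optimization of functions of weight matrices, a central problem in training large language models; the core challenge is designing more effective update rules. The key to the solution is to move beyond the spectral norm underlying the Muon update and use duals of the Ky Fan $k$-norms to build a family of Muon-like algorithms named Fanions. F-Fanions and S-Fanions are constructed from duals of convex combinations of the Ky Fan $k$-norms with the Frobenius norm and the $l_\infty$ norm, respectively. Their most prominent members, F-Muon and S-Muon, consistently match Muon's performance in an extensive empirical study and outperform vanilla Muon on a synthetic linear least squares problem.

Link: https://arxiv.org/abs/2512.09678
Authors: Alexey Kravatskiy, Ivan Kozyrev, Nikolai Kozlov, Alexander Vinogradov, Daniil Merkulov, Ivan Oseledets
Affiliations: Unknown
Categories: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 31 pages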

Abstract: In this article, we explore the use of various matrix norms for optimizing functions of weight matrices, a crucial problem in training large language models. Moving beyond the spectral norm underlying the Muon update, we leverage duals of the Ky Fan $k$-norms to introduce a family of Muon-like algorithms we name Fanions, which are closely related to Dion. By working with duals of convex combinations of the Ky Fan $k$-norms with either the Frobenius norm or the $l_\infty$ norm, we construct the families of F-Fanions and S-Fanions, respectively. Their most prominent members are F-Muon and S-Muon. We complement our theoretical analysis with an extensive empirical study of these algorithms across a wide range of tasks and settings, demonstrating that F-Muon and S-Muon consistently match Muon’s performance, while outperforming vanilla Muon on a synthetic linear least squares problem.

[AI-73] Graph-Based Bayesian Optimization for Quantum Circuit Architecture Search with Uncertainty Calibrated Surrogates

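Quick read: This paper addresses quantum circuit design as a bottleneck for practical quantum machine learning, in particular how to automatically discover and refine variational quantum circuits (VQCs) for complex real-world data. The key to the solution is a Bayesian optimization framework with a graph neural network (GNN) surrogate: circuits are represented as graphs, surrogate uncertainty is estimated with Monte Carlo dropout, and an expected improvement acquisition function guides circuit mutation and selection. Evaluated on the NF-ToN-IoT-V2 cybersecurity dataset (next generation firewall telemetry and network internet of things), the method finds circuits with lower complexity and competitive or superior classification accuracy compared with an MLP-based surrogate, random search, and greedy GNN selection. Robustness is assessed through a noise study.

Link: https://arxiv.org/abs/2512.09586
Authors: Prashant Kumar Choudhary, Nouhaila Innan, Muhammad Shafique, Rajeev Singh
Affiliations: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Networking and Internet Architecture (cs.NI)
Comments: 17 pages, 13 figures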

Abstract:Quantum circuit design is a key bottleneck for practical quantum machine learning on complex, real-world data. We present an automated framework that discovers and refines variational quantum circuits (VQCs) using graph-based Bayesian optimization with a graph neural network (GNN) surrogate. Circuits are represented as graphs and mutated and selected via an expected improvement acquisition function informed by surrogate uncertainty with Monte Carlo dropout. Candidate circuits are evaluated with a hybrid quantum-classical variational classifier on the next generation firewall telemetry and network internet of things (NF-ToN-IoT-V2) cybersecurity dataset, after feature selection and scaling for quantum embedding. We benchmark our pipeline against an MLP-based surrogate, random search, and greedy GNN selection. The GNN-guided optimizer consistently finds circuits with lower complexity and competitive or superior classification accuracy compared to all baselines. Robustness is assessed via a noise study across standard quantum noise channels, including amplitude damping, phase damping, thermal relaxation, depolarizing, and readout bit flip noise. The implementation is fully reproducible, with time benchmarking and export of best found circuits, providing a scalable and interpretable route to automated quantum circuit discovery.
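
To make the acquisition step more concrete, here is a minimal sketch of the textbook closed-form expected improvement (for maximization, under a Gaussian predictive distribution), fed by the mean and standard deviation of repeated stochastic forward passes. It is not the authors' implementation; the sample values, the function names, and the use of the population standard deviation are assumptions made for illustration.

```python
import math
import statistics

def expected_improvement(mu, sigma, best_so_far):
    """Closed-form EI for maximization under a Gaussian predictive distribution."""
    if sigma <= 0.0:
        return max(mu - best_so_far, 0.0)
    z = (mu - best_so_far) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mu - best_so_far) * cdf + sigma * pdf

def ei_from_dropout_samples(samples, best_so_far):
    """Treat stochastic forward passes as draws and use their mean and std."""
    mu = statistics.fmean(samples)
    sigma = statistics.pstdev(samples)
    return expected_improvement(mu, sigma, best_so_far)

# Hypothetical accuracy predictions from 5 dropout passes for two candidate circuits.
confident = [0.80, 0.81, 0.80, 0.79, 0.80]
uncertain = [0.70, 0.90, 0.75, 0.85, 0.80]
best = 0.82  # best accuracy observed so far (assumed)
ei_confident = ei_from_dropout_samples(confident, best)
ei_uncertain = ei_from_dropout_samples(uncertain, best)
```

Both candidates have the same predicted mean, below the incumbent, but the uncertain one receives a much larger score, which is how the acquisition function trades off exploration against exploitation.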

[AI-74] Lazy Diffusion: Mitigating spectral collapse in generative diffusion-based stable autoregressive emulation of turbulent flows

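Quick read: This paper addresses spectral collapse in diffusion models used to emulate multiscale dynamical systems with power-law spectra, such as turbulence. Under standard DDPMs, the mode-wise signal-to-noise ratio in Fourier space decays monotonically with wavenumber, so small-scale structure is swamped by noise and physical consistency is lost. The solution has two parts. First, the noise schedule is reinterpreted as a spectral regularizer, and power-law schedules $\beta(\tau)\propto\tau^\gamma$ are designed to preserve fine-scale structure deeper into diffusion time. Second, Lazy Diffusion, a one-step distillation method, uses the learned score geometry to bypass long reverse-time trajectories and prevent degradation of high-wavenumber (high-$k$) modes. Together these changes restore physically realistic spectra and stabilize long-horizon autoregressive forecasts, showing that naive Gaussian scheduling is incompatible with power-law physics and that physics-aware diffusion processes can serve as efficient, accurate, fully probabilistic surrogates for multiscale systems.

Link: https://arxiv.org/abs/2512.09572
Authors: Anish Sambamurthy, Ashesh Chattopadhyay
Affiliations: Unknown
Categories: Fluid Dynamics (physics.flu-dyn); Artificial Intelligence (cs.AI); Dynamical Systems (math.DS); Chaotic Dynamics (nlin.CD); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments: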

Abstract: Turbulent flows possess broadband, power-law spectra in which multiscale interactions couple high-wavenumber fluctuations to large-scale dynamics. Although diffusion-based generative models offer a principled probabilistic forecasting framework, we show that standard DDPMs induce a fundamental spectral collapse: a Fourier-space analysis of the forward SDE reveals a closed-form, mode-wise signal-to-noise ratio (SNR) that decays monotonically in wavenumber $|k|$ for spectra $S(k)\propto|k|^{-\lambda}$, rendering high-wavenumber modes indistinguishable from noise and producing an intrinsic spectral bias. We reinterpret the noise schedule as a spectral regularizer and introduce power-law schedules $\beta(\tau)\propto\tau^\gamma$ that preserve fine-scale structure deeper into diffusion time, along with Lazy Diffusion, a one-step distillation method that leverages the learned score geometry to bypass long reverse-time trajectories and prevent high-$k$ degradation. Applied to high-Reynolds-number 2D Kolmogorov turbulence and $1/12^\circ$ Gulf of Mexico ocean reanalysis, these methods resolve spectral collapse, stabilize long-horizon autoregression, and restore physically realistic inertial-range scaling. Together, they show that naïve Gaussian scheduling is structurally incompatible with power-law physics and that physics-aware diffusion processes can yield accurate, efficient, and fully probabilistic surrogates for multiscale dynamical systems.
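
A power-law schedule of the form $\beta(\tau)\propto\tau^\gamma$ can be sketched as below, together with the cumulative product that controls how much of the clean signal survives at each step of a variance-preserving DDPM. The discretization, the `beta_max` value, and the function names are assumptions for illustration; the paper's mode-wise SNR analysis is not reproduced.

```python
def power_law_betas(num_steps, gamma, beta_max=0.02):
    """Discrete schedule beta_t = beta_max * (t / T)**gamma for t = 1..T."""
    return [beta_max * (t / num_steps) ** gamma for t in range(1, num_steps + 1)]

def cumulative_signal(betas):
    """alpha_bar_t = prod_{s<=t} (1 - beta_s): weight on the clean signal's variance."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

T = 1000
linear = cumulative_signal(power_law_betas(T, gamma=1.0))  # gamma = 1: linear ramp
steep = cumulative_signal(power_law_betas(T, gamma=3.0))   # gamma = 3: noise added later
mid = T // 2 - 1  # index of the halfway step
```

At the halfway step, the larger exponent leaves more of the signal intact (`steep[mid] > linear[mid]`), which is the sense in which a steeper power law delays the injection of noise.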

[AI-75] NeuroSketch: An Effective Framework for Neural Decoding via Systematic Architectural Optimization

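Quick read: This paper addresses the under-explored role of model architecture in neural decoding: prior work has concentrated on signal processing and deep learning methods, with little systematic optimization of the architecture itself. The key to the solution is NeuroSketch, a framework that optimizes the architecture systematically from macro- to micro-level. The study first finds that a 2D convolutional network (CNN-2D) outperforms other architectures and examines its effectiveness from temporal and spatial perspectives, then refines the architecture step by step. The resulting model reaches state-of-the-art (SOTA) performance across three modalities (visual, auditory, and speech), three types of brain signals (EEG, SEEG, and ECoG), and eight decoding tasks.

Link: https://arxiv.org/abs/2512.09524
Authors: Gaorui Zhang, Zhizhang Yuan, Jialan Yang, Junru Chen, Li Meng, Yang Yang
Affiliations: Unknown
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: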

Abstract: Neural decoding, a critical component of Brain-Computer Interface (BCI), has recently attracted increasing research interest. Previous research has focused on leveraging signal processing and deep learning methods to enhance neural decoding performance. However, in-depth exploration of model architectures remains limited, despite its proven effectiveness in other tasks such as energy forecasting and image classification. In this study, we propose NeuroSketch, an effective framework for neural decoding via systematic architecture optimization. Starting with the basic architecture study, we find that CNN-2D outperforms other architectures in neural decoding tasks and explore its effectiveness from temporal and spatial perspectives. Building on this, we optimize the architecture from macro- to micro-level, achieving improvements in performance at each step. The exploration process and model validations comprise over 5,000 experiments spanning three distinct modalities (visual, auditory, and speech), three types of brain signals (EEG, SEEG, and ECoG), and eight diverse decoding tasks. Experimental results indicate that NeuroSketch achieves state-of-the-art (SOTA) performance across all evaluated datasets, positioning it as a powerful tool for neural decoding. Our code and scripts are available at this https URL.

[AI-76] Functional Percolation: A Perspective on Criticality of Form and Function

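Quick read: This paper addresses the physical constraints and minimal conditions for information processing in extended systems, focusing on how network connectivity both limits and enables it. The approach analyzes Erdős-Rényi random networks across the structural percolation transition, using cascade-mediated dynamics as a minimal, universal mechanism for propagating state-dependent responses. When the giant connected component emerges, realizable information processing changes sharply: complex input-output response functions become accessible, functional diversity rises rapidly, output entropy increases, and directed information flow quantified by transfer entropy extends beyond local neighborhoods. These coincident transitions define a regime of "functional percolation". Near criticality, networks show a Pareto-optimal tradeoff between functional complexity and diversity, suggesting that percolation criticality is a universal organizing principle for information processing in systems with local interactions and propagating influences.

Link: https://arxiv.org/abs/2512.09317
Authors: Galen J. Wilkerson
Affiliations: Unknown
Categories: Physics and Society (physics.soc-ph); Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
Comments: 6 pages, 6 figures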

Abstract: Understanding the physical constraints and minimal conditions that enable information processing in extended systems remains a central challenge across disciplines, from neuroscience and artificial intelligence to social and physical networks. Here we study how network connectivity both limits and enables information processing by analyzing random networks across the structural percolation transition. Using cascade-mediated dynamics as a minimal and universal mechanism for propagating state-dependent responses, we examine structural, functional, and information-theoretic observables as functions of mean degree in Erdős-Rényi networks. We find that the emergence of a giant connected component coincides with a sharp transition in realizable information processing: complex input-output response functions become accessible, functional diversity increases rapidly, output entropy rises, and directed information flow quantified by transfer entropy extends beyond local neighborhoods. These coincident transitions define a regime of functional percolation, referring to a sharp expansion of the space of realizable input-output functions at the structural percolation transition. Near criticality, networks exhibit a Pareto-optimal tradeoff between functional complexity and diversity, suggesting that percolation criticality provides a universal organizing principle for information processing in systems with local interactions and propagating influences.
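
The structural percolation transition itself is easy to observe numerically. The sketch below samples an Erdős-Rényi graph $G(n, p)$ at a chosen mean degree and measures the largest connected component with a union-find structure; it covers only the structural side, not the cascade dynamics or information-theoretic observables studied in the paper. The graph size, seed, and function name are arbitrary choices.

```python
import random

def largest_component_fraction(n, mean_degree, seed=0):
    """Sample G(n, p) with p = mean_degree / (n - 1); return largest component size / n."""
    rng = random.Random(seed)
    p = mean_degree / (n - 1)
    parent = list(range(n))
    size = [1] * n

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                ru, rv = find(u), find(v)
                if ru != rv:
                    if size[ru] < size[rv]:
                        ru, rv = rv, ru
                    parent[rv] = ru  # attach the smaller tree to the larger one
                    size[ru] += size[rv]
    return max(size[find(i)] for i in range(n)) / n

subcritical = largest_component_fraction(600, 0.5)    # mean degree below 1
supercritical = largest_component_fraction(600, 3.0)  # mean degree above 1
```

Below a mean degree of 1 the largest component covers only a small fraction of the nodes, while well above it a single giant component spans most of the graph.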

[AI-77] AI-Driven Expansion and Application of the Alexandria Database

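Quick read: This paper addresses the low efficiency of high-throughput screening and the limited accuracy of stable-compound identification in computational materials discovery. The core challenge is finding thermodynamically stable candidates efficiently and accurately in a very large structure space without generating spurious structures that disagree with experiment. The key to the solution is a multi-stage workflow that combines the Matra-Genoa generative model for structure generation, the Orb-v2 universal machine learning interatomic potential, and the ALIGNN graph neural network for energy prediction. From 119 million candidate structures, the workflow added 1.3 million DFT-validated compounds to the ALEXANDRIA database, including 74 thousand new stable materials, with a 99% success rate in identifying compounds within 100 meV/atom of thermodynamic stability, a threefold improvement over previous approaches.

Link: https://arxiv.org/abs/2512.09169
Authors: Théo Cavignac (1), Jonathan Schmidt (2), Pierre-Paul De Breuck (1), Antoine Loew (1), Tiago F. T. Cerqueira (3), Hai-Chen Wang (1), Anton Bochkarev (4), Yury Lysogorskiy (4), Aldo H. Romero (5), Ralf Drautz (4), Silvana Botti (1), Miguel A. L. Marques (1)
Affiliations: (1) Research Center Future Energy Materials and Systems of the University Alliance Ruhr and ICAMS, Ruhr University Bochum, Bochum, Germany; (2) Department of Materials, ETH Zürich, Zürich, Switzerland; (3) CFisUC, Department of Physics, University of Coimbra, Coimbra, Portugal; (4) ICAMS, Ruhr-Universität Bochum and ACEworks GmbH, Bochum, Germany; (5) Department of Physics, West Virginia University, Morgantown, USA
Categories: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
Comments: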

Abstract:We present a novel multi-stage workflow for computational materials discovery that achieves a 99% success rate in identifying compounds within 100 meV/atom of thermodynamic stability, with a threefold improvement over previous approaches. By combining the Matra-Genoa generative model, Orb-v2 universal machine learning interatomic potential, and ALIGNN graph neural network for energy prediction, we generated 119 million candidate structures and added 1.3 million DFT-validated compounds to the ALEXANDRIA database, including 74 thousand new stable materials. The expanded ALEXANDRIA database now contains 5.8 million structures with 175 thousand compounds on the convex hull. Predicted structural disorder rates (37-43%) match experimental databases, unlike other recent AI-generated datasets. Analysis reveals fundamental patterns in space group distributions, coordination environments, and phase stability networks, including sub-linear scaling of convex hull connectivity. We release the complete dataset, including sAlex25 with 14 million out-of-equilibrium structures containing forces and stresses for training universal force fields. We demonstrate that fine-tuning a GRACE model on this data improves benchmark accuracy. All data, models, and workflows are freely available under Creative Commons licenses.

[AI-78] Monitoring Deployed AI Systems in Health Care

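Quick read: This paper addresses insufficient post-deployment monitoring of artificial intelligence (AI) systems in health care, which is needed to ensure safety, quality, and sustained benefit and to support governance decisions such as updating, modifying, or decommissioning a system. The key to the solution is a monitoring framework built on three complementary principles: system integrity, performance, and impact. By specifying which metrics to measure, how often to review them, who is responsible, and what follow-up actions to take, the framework yields actionable monitoring plans for both traditional and generative AI, helping deployed systems remain safe and effective in real clinical settings.

Link: https://arxiv.org/abs/2512.09048
Authors: Timothy Keyes, Alison Callahan, Abby S. Pandya, Nerissa Ambers, Juan M. Banda, Miguel Fuentes, Carlene Lugtu, Pranav Masariya, Srikar Nallan, Connor O’Brien, Thomas Wang, Emily Alsentzer, Jonathan H. Chen, Dev Dash, Matthew A. Eisenberg, Patricia Garcia, Nikesh Kotecha, Anurang Revri, Michael A. Pfeffer, Nigam H. Shah, Sneha S. Jain
Affiliations: Unknown
Categories: Other Quantitative Biology (q-bio.OT); Artificial Intelligence (cs.AI)
Comments: 36 pages, 3 figures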

Abstract: Post-deployment monitoring of artificial intelligence (AI) systems in health care is essential to ensure their safety, quality, and sustained benefit, and to support governance decisions about which systems to update, modify, or decommission. Motivated by these needs, we developed a framework for monitoring deployed AI systems grounded in the mandate to take specific actions when they fail to behave as intended. This framework, which is now actively used at Stanford Health Care, is organized around three complementary principles: system integrity, performance, and impact. System integrity monitoring focuses on maximizing system uptime, detecting runtime errors, and identifying when changes to the surrounding IT ecosystem have unintended effects. Performance monitoring focuses on maintaining accurate system behavior in the face of changing health care practices (and thus input data) over time. Impact monitoring assesses whether a deployed system continues to have value in the form of benefit to clinicians and patients. Drawing on examples of deployed AI systems at our academic medical center, we provide practical guidance for creating monitoring plans based on these principles that specify which metrics to measure, when those metrics should be reviewed, who is responsible for acting when metrics change, and what concrete follow-up actions should be taken, for both traditional and generative AI. We also discuss challenges to implementing this framework, including the effort and cost of monitoring for health systems with limited resources and the difficulty of incorporating data-driven monitoring practices into complex organizations where conflicting priorities and definitions of success often coexist. This framework offers a practical template and starting point for health systems seeking to ensure that AI deployments remain safe and effective over time.

[AI-79] Digital Modeling of Spatial Pathway Activity from Histology Reveals Tumor Microenvironment Heterogeneity

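Quick read: This paper addresses how to predict pathway activity at spatial transcriptomics (ST) resolution from routine hematoxylin-and-eosin (H&E) stained histology images, enabling efficient, scalable spatial analysis of tumor microenvironment heterogeneity. The key to the solution is to use image features from a computational pathology foundation model to predict pathway activity directly at microscale resolution (55 μm and 100 μm). TGFβ signaling was the most accurately predicted pathway across three independent breast and lung cancer ST datasets, and in 87-88% of reliably predicted cases the predicted maps showed the expected contrast between tumor and adjacent non-tumor regions. This indicates that spatially coherent, biologically interpretable pathway patterns can be recovered from routine histology.

Link: https://arxiv.org/abs/2512.09003
Authors: Ling Liao, Changhuei Yang, Maxim Artyomov, Mark Watson, Adam Kepecs, Haowen Zhou, Alexey Sergushichev, Richard Cote
Affiliations: Unknown
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
Comments: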

Abstract: Spatial transcriptomics (ST) enables simultaneous mapping of tissue morphology and spatially resolved gene expression, offering unique opportunities to study tumor microenvironment heterogeneity. Here, we introduce a computational framework that predicts spatial pathway activity directly from hematoxylin-and-eosin-stained histology images at microscale resolution (55 and 100 μm). Using image features derived from a computational pathology foundation model, we found that TGFb signaling was the most accurately predicted pathway across three independent breast and lung cancer ST datasets. In 87-88% of reliably predicted cases, the resulting spatial TGFb activity maps reflected the expected contrast between tumor and adjacent non-tumor regions, consistent with the known role of TGFb in regulating interactions within the tumor microenvironment. Notably, linear and nonlinear predictive models performed similarly, suggesting that image features may relate to pathway activity in a predominantly linear fashion or that nonlinear structure is small relative to measurement noise. These findings demonstrate that features extracted from routine histopathology may recover spatially coherent and biologically interpretable pathway patterns, offering a scalable strategy for integrating image-based inference with ST information in tumor microenvironment studies.

Machine Learning

[LG-0] Closing the Train-Test Gap in World Models for Gradient-Based Planning

Link: https://arxiv.org/abs/2512.09929
Authors: Arjun Parthasarathy, Nimit Kalra, Rohun Agrawal, Yann LeCun, Oumayma Bounou, Pavel Izmailov, Micah Goldblum
Categories: Machine Learning (cs.LG); Robotics (cs.RO)
Comments:

Abstract:World models paired with model predictive control (MPC) can be trained offline on large-scale datasets of expert trajectories and enable generalization to a wide range of planning tasks at inference time. Compared to traditional MPC procedures, which rely on slow search algorithms or on iteratively solving optimization problems exactly, gradient-based planning offers a computationally efficient alternative. However, the performance of gradient-based planning has thus far lagged behind that of other approaches. In this paper, we propose improved methods for training world models that enable efficient gradient-based planning. We begin with the observation that although a world model is trained on a next-state prediction objective, it is used at test-time to instead estimate a sequence of actions. The goal of our work is to close this train-test gap. To that end, we propose train-time data synthesis techniques that enable significantly improved gradient-based planning with existing world models. At test time, our approach outperforms or matches the classical gradient-free cross-entropy method (CEM) across a variety of object manipulation and navigation tasks in 10% of the time budget.

[LG-1] Exploring Protein Language Model Architecture-Induced Biases for Antibody Comprehension

Link: https://arxiv.org/abs/2512.09894
Authors: Mengren (Bill) Liu, Yixiang Zhang, Yiming (Jason) Zhang
Categories: Machine Learning (cs.LG)
Comments:

Abstract: Recent advances in protein language models (PLMs) have demonstrated remarkable capabilities in understanding protein sequences. However, the extent to which different model architectures capture antibody-specific biological properties remains unexplored. In this work, we systematically investigate how architectural choices in PLMs influence their ability to comprehend antibody sequence characteristics and functions. We evaluate three state-of-the-art PLMs (AntiBERTa, BioBERT, and ESM2) against a general-purpose language model (GPT-2) baseline on antibody target specificity prediction tasks. Our results demonstrate that while all PLMs achieve high classification accuracy, they exhibit distinct biases in capturing biological features such as V gene usage, somatic hypermutation patterns, and isotype information. Through attention attribution analysis, we show that antibody-specific models like AntiBERTa naturally learn to focus on complementarity-determining regions (CDRs), while general protein models benefit significantly from explicit CDR-focused training strategies. These findings provide insights into the relationship between model architecture and biological feature extraction, offering valuable guidance for future PLM development in computational antibody design.

[LG-2] Analysis of Dirichlet Energies as Over-smoothing Measures

Link: https://arxiv.org/abs/2512.09890
Authors: Anna Bison, Alessandro Sperduti
Categories: Machine Learning (cs.LG)
Comments:

Abstract: We analyze the distinctions between two functionals often used as over-smoothing measures: the Dirichlet energies induced by the unnormalized graph Laplacian and the normalized graph Laplacian. We demonstrate that the latter fails to satisfy the axiomatic definition of a node-similarity measure proposed by Rusch et al. By formalizing fundamental spectral properties of these two definitions, we highlight critical distinctions necessary to select the metric that is spectrally compatible with the GNN architecture, thereby resolving ambiguities in monitoring the dynamics.
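
The two functionals can be compared on a tiny graph. The sketch below uses the standard edge-sum forms of the energies for scalar node features, $x^\top L x$ and $x^\top L_{\mathrm{sym}} x$, and assumes an unweighted graph with no isolated nodes. It offers one intuition for why the two behave differently on graphs with unequal degrees; it is not taken from the paper's proofs, and the function name is hypothetical.

```python
import math

def dirichlet_energies(edges, x):
    """Return (unnormalized, normalized) Dirichlet energy of scalar node features x.

    unnormalized: sum over edges of (x_i - x_j)^2                      = x^T L x
    normalized:   sum over edges of (x_i/sqrt(d_i) - x_j/sqrt(d_j))^2  = x^T L_sym x
    """
    deg = [0] * len(x)
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    unnorm = sum((x[i] - x[j]) ** 2 for i, j in edges)
    norm = sum((x[i] / math.sqrt(deg[i]) - x[j] / math.sqrt(deg[j])) ** 2 for i, j in edges)
    return unnorm, norm

path = [(0, 1), (1, 2)]                  # 3-node path graph, degrees 1, 2, 1
constant = [1.0, 1.0, 1.0]               # identical features on every node
sqrt_deg = [1.0, math.sqrt(2.0), 1.0]    # features proportional to sqrt(degree)
e_const = dirichlet_energies(path, constant)
e_sqrt = dirichlet_energies(path, sqrt_deg)
```

On this graph the unnormalized energy vanishes for identical node features, while the normalized energy is positive there and instead vanishes for features proportional to the square root of the degree.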

[LG-3] HPM-KD: Hierarchical Progressive Multi-Teacher Framework for Knowledge Distillation and Efficient Model Compression

Link: https://arxiv.org/abs/2512.09886
Authors: Gustavo Coelho Haase, Paulo Henrique Dourado da Silva
Categories: Machine Learning (cs.LG); Applications (stat.AP)
Comments: 9 pages

Abstract: Knowledge Distillation (KD) has emerged as a promising technique for model compression but faces critical limitations: (1) sensitivity to hyperparameters requiring extensive manual tuning, (2) capacity gap when distilling from very large teachers to small students, (3) suboptimal coordination in multi-teacher scenarios, and (4) inefficient use of computational resources. We present HPM-KD, a framework that integrates six synergistic components: (i) Adaptive Configuration Manager via meta-learning that eliminates manual hyperparameter tuning, (ii) Progressive Distillation Chain with automatically determined intermediate models, (iii) Attention-Weighted Multi-Teacher Ensemble that learns dynamic per-sample weights, (iv) Meta-Learned Temperature Scheduler that adapts temperature throughout training, (v) Parallel Processing Pipeline with intelligent load balancing, and (vi) Shared Optimization Memory for cross-experiment reuse. Experiments on CIFAR-10, CIFAR-100, and tabular datasets demonstrate that HPM-KD achieves 10x-15x compression while maintaining 85% accuracy retention, eliminates the need for manual tuning, and reduces training time by 30-40% via parallelization. Ablation studies confirm the independent contribution of each component (0.10-0.98 pp). HPM-KD is available as part of the open-source DeepBridge library.
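
For readers unfamiliar with the role of temperature in distillation, here is a minimal sketch of the classic single-teacher soft-target loss: a KL divergence between temperature-softened distributions, scaled by the squared temperature. This is the standard baseline formulation, not HPM-KD's multi-teacher or meta-learned variant, and the logits below are made up.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on temperature-softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl

teacher = [5.0, 2.0, 0.5]  # hypothetical teacher logits for one sample
student = [3.0, 2.5, 1.0]  # hypothetical student logits for the same sample
loss = distillation_loss(student, teacher)
same = distillation_loss(teacher, teacher)
```

A higher temperature flattens both distributions, so the student also receives signal from the teacher's non-top classes; a temperature scheduler changes this value over the course of training.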

[LG-4] Conformal Bandits: Bringing statistical validity and reward efficiency to the small-gap regime

Link: https://arxiv.org/abs/2512.09850
Authors: Simone Cuonzo, Nina Deliu
Categories: Machine Learning (cs.LG)
Comments:

Abstract: We introduce Conformal Bandits, a novel framework integrating Conformal Prediction (CP) into bandit problems, a classic paradigm for sequential decision-making under uncertainty. Traditional regret-minimisation bandit strategies like Thompson Sampling and Upper Confidence Bound (UCB) typically rely on distributional assumptions or asymptotic guarantees; further, they remain largely focused on regret, neglecting their statistical properties. We address this gap. Through the adoption of CP, we bridge the regret-minimising potential of a decision-making bandit policy with statistical guarantees in the form of finite-time prediction coverage. We demonstrate the potential of Conformal Bandits through simulation studies and an application to portfolio allocation, a typical small-gap regime, where differences in arm rewards are far too small for classical policies to achieve optimal regret bounds in finite samples. Motivated by this, we showcase our framework’s practical advantage in terms of regret in small-gap settings, as well as its added value in achieving nominal coverage guarantees where classical UCB policies fail. Focusing on our application of interest, we further illustrate how integrating hidden Markov models to capture the regime-switching behaviour of financial markets enhances the exploration-exploitation trade-off and translates into higher risk-adjusted regret efficiency returns, while preserving coverage guarantees.
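
The finite-sample coverage guarantee that conformal prediction provides can be illustrated with the basic split conformal construction for a single prediction. The sketch below is the textbook procedure with absolute-residual scores, not the paper's bandit algorithm; the residuals and function names are hypothetical.

```python
import math

def conformal_halfwidth(calibration_residuals, alpha):
    """Split conformal quantile: the ceil((n+1)(1-alpha))-th smallest absolute residual."""
    scores = sorted(abs(r) for r in calibration_residuals)
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this coverage level
    return scores[k - 1]

def prediction_interval(point_prediction, calibration_residuals, alpha=0.1):
    """Symmetric interval around a point prediction with target coverage 1 - alpha."""
    q = conformal_halfwidth(calibration_residuals, alpha)
    return point_prediction - q, point_prediction + q

# Hypothetical held-out errors (observed reward minus predicted reward) for one arm.
residuals = [0.3, -0.1, 0.4, -0.2, 0.5, 0.1, -0.6, 0.2, -0.7]
low, high = prediction_interval(1.0, residuals, alpha=0.2)
```

Under exchangeability of the calibration and test points, an interval built this way covers the new outcome with probability at least $1 - \alpha$ without distributional assumptions, which is the kind of guarantee the abstract contrasts with asymptotic ones.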

[LG-5] Fast Factorized Learning: Powered by In-Memory Database Systems

Link: https://arxiv.org/abs/2512.09836
Authors: Bernhard Stöckl, Maximilian E. Schüle
Categories: Databases (cs.DB); Machine Learning (cs.LG)
Comments:

Abstract: Learning models over factorized joins avoids redundant computations by identifying and pre-computing shared cofactors. Previous work has investigated the performance gain when computing cofactors on traditional disk-based database systems. Due to the absence of published code, the experiments could not be reproduced on in-memory database systems. This work describes the implementation of cofactor-based in-database factorized learning. We benchmark our open-source implementation for learning linear regression on factorized joins with PostgreSQL, a disk-based database system, and HyPer, an in-memory engine. The evaluation shows a performance gain of 70% for factorized learning over non-factorized learning on in-memory database systems, and a factor of 100 compared to disk-based database systems. Thus, modern database engines can contribute to the machine learning pipeline by pre-computing aggregates prior to data extraction to accelerate training.
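
The saving from factorization can be seen in a toy example. A linear regression over a join needs aggregates such as $\sum x \cdot y$ over all joined rows; the sketch below computes that aggregate once by materializing the join and once by pushing the sums below the join. The two relations and function names are hypothetical, and this is an illustration of the general idea rather than the paper's implementation.

```python
from collections import defaultdict

def naive_cofactor(R, S):
    """Materialize the join of R(key, x) and S(key, y), then sum x*y over joined rows."""
    return sum(x * y for k1, x in R for k2, y in S if k1 == k2)

def factorized_cofactor(R, S):
    """Sum x per key and y per key first, then combine: sum_k (sum_R x) * (sum_S y)."""
    sum_x, sum_y = defaultdict(float), defaultdict(float)
    for k, x in R:
        sum_x[k] += x
    for k, y in S:
        sum_y[k] += y
    return sum(sum_x[k] * sum_y[k] for k in sum_x if k in sum_y)

R = [(1, 2.0), (1, 3.0), (2, 1.0)]               # hypothetical relation R(key, x)
S = [(1, 10.0), (1, 20.0), (2, 5.0), (3, 7.0)]   # hypothetical relation S(key, y)
```

The naive version touches every joined row, while the factorized version makes one pass over each input relation and then one pass over the keys, which is where the redundant computation is avoided.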

[LG-6] Predicting the Containment Time of California Wildfires Using Machine Learning

Link: https://arxiv.org/abs/2512.09835
Authors: Shashank Bhardwaj
Categories: Machine Learning (cs.LG)
Comments:

Abstract: California’s wildfire seasons have grown worse over the years, overwhelming emergency response teams and causing massive destruction of property and loss of human life. There is therefore a growing need for accurate, practical predictions that can help wildfire managers and response teams allocate resources. In this research, we built machine learning models to predict the number of days required to fully contain a wildfire in California. This addresses an important gap in the current literature: most prior research has concentrated on wildfire risk or fire spread, and the few studies that examine duration typically predict it in broad categories rather than as a continuous measure. We treat wildfire duration prediction as a regression task, which allows more detailed and precise forecasts than the categorical predictions used in prior work. We built the models by combining three publicly available datasets from the California Department of Forestry and Fire Protection’s Fire and Resource Assessment Program (FRAP). We compared two baseline ensemble regressors, Random Forest and XGBoost, with a Long Short-Term Memory (LSTM) neural network. The results show that the XGBoost model slightly outperforms the Random Forest model, likely due to its superior handling of static features in the dataset. The LSTM model performed worse than the ensemble models because the dataset lacked temporal features. Overall, this study shows that, depending on feature availability, wildfire managers and fire management authorities can select the most appropriate model to predict containment duration accurately and allocate resources effectively.

[LG-7] A roadmap of geospatial soil quality analysis systems

Link: https://arxiv.org/abs/2512.09817
Authors: Habiba Ben Abderrahmane, Slimane Oulad-Naoui, Benameur Ziani
Categories: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments:

Abstract: Soil quality (SQ) plays a crucial role in sustainable agriculture, environmental conservation, and land-use planning. Traditional SQ assessment techniques rely on costly, labor-intensive sampling and laboratory analysis, limiting their spatial and temporal coverage. Advances in Geographic Information Systems (GIS), remote sensing, and machine learning (ML) have enabled efficient SQ evaluation. This paper presents a comprehensive roadmap that differs from previous reviews by proposing a unified, modular pipeline integrating multi-source soil data, GIS and remote sensing tools, and machine learning techniques to support transparent and scalable soil quality assessment. It also includes practical applications. Whereas existing studies predominantly target isolated soil parameters or specific modeling methodologies, this approach consolidates recent advances in GIS, remote sensing, and machine learning across the entire soil quality assessment pipeline. It also addresses existing challenges and limitations and explores future developments and emerging trends that can deliver the next generation of soil quality systems, making them more transparent, adaptive, and aligned with sustainable land management.

[LG-8] Incorporating Fairness in Neighborhood Graphs for Fair Spectral Clustering

Link: https://arxiv.org/abs/2512.09810
Authors: Adithya K Moorthy,V Vijaya Saradhi,Bhanu Prasad
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Graph clustering plays a pivotal role in unsupervised learning methods like spectral clustering, yet traditional methods for graph clustering often perpetuate bias through unfair graph constructions that may underrepresent some groups. The current research introduces novel approaches for constructing fair k-nearest neighbor (kNN) and fair epsilon-neighborhood graphs that proactively enforce demographic parity during graph formation. By incorporating fairness constraints at the earliest stage of neighborhood selection steps, our approaches incorporate proportional representation of sensitive features into the local graph structure while maintaining geometric this http URL work addresses a critical gap in pre-processing for fair spectral clustering, demonstrating that topological fairness in graph construction is essential for achieving equitable clustering outcomes. Widely used graph construction methods like kNN and epsilon-neighborhood graphs propagate edge based disparate impact on sensitive groups, leading to biased clustering results. Providing representation of each sensitive group in the neighborhood of every node leads to fairer spectral clustering results because the topological features of the graph naturally reflect equitable group ratios. This research fills an essential shortcoming in fair unsupervised learning, by illustrating how topological fairness in graph construction inherently facilitates fairer spectral clustering results without the need for changes to the clustering algorithm itself. Thorough experiments on three synthetic datasets, seven real-world tabular datasets, and three real-world image datasets prove that our fair graph construction methods surpass the current baselines in graph clustering tasks.

[LG-9] Ariel-ML: Computing Parallelization with Embedded Rust for Neural Networks on Heterogeneous Multi-core Microcontrollers

Link: https://arxiv.org/abs/2512.09800
Authors: Zhaolan Huang,Kaspar Schleiser,Gyungmin Myung,Emmanuel Baccelli
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
Comments:

Abstract:Low-power microcontroller (MCU) hardware is currently evolving from single-core architectures to predominantly multi-core architectures. In parallel, new embedded software building blocks are more and more written in Rust, while C/C++ dominance fades in this domain. On the other hand, small artificial neural networks (ANN) of various kinds are increasingly deployed in edge AI use cases, thus deployed and executed directly on low-power MCUs. In this context, both incremental improvements and novel innovative services will have to be continuously retrofitted using ANNs execution in software embedded on sensing/actuating systems already deployed in the field. However, there was so far no Rust embedded software platform automating parallelization for inference computation on multi-core MCUs executing arbitrary TinyML models. This paper thus fills this gap by introducing Ariel-ML, a novel toolkit we designed combining a generic TinyML pipeline and an embedded Rust software platform which can take full advantage of multi-core capabilities of various 32bit microcontroller families (Arm Cortex-M, RISC-V, ESP-32). We published the full open source code of its implementation, which we used to benchmark its capabilities using a zoo of various TinyML models. We show that Ariel-ML outperforms prior art in terms of inference latency as expected, and we show that, compared to pre-existing toolkits using embedded C/C++, Ariel-ML achieves comparable memory footprints. Ariel-ML thus provides a useful basis for TinyML practitioners and resource-constrained embedded Rust developers.

[LG-10] M3Net: A Multi-Metric Mixture of Experts Network Digital Twin with Graph Neural Networks

Link: https://arxiv.org/abs/2512.09797
Authors: Blessed Guda,Carlee Joe-Wong
Categories: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Comments:

Abstract:The rise of 5G/6G network technologies promises to enable applications like autonomous vehicles and virtual reality, resulting in a significant increase in connected devices and necessarily complicating network management. Even worse, these applications often have strict, yet heterogeneous, performance requirements across metrics like latency and reliability. Much recent work has thus focused on developing the ability to predict network performance. However, traditional methods for network modeling, like discrete event simulators and emulation, often fail to balance accuracy and scalability. Network Digital Twins (NDTs), augmented by machine learning, present a viable solution by creating virtual replicas of physical networks for real-time simulation and analysis. State-of-the-art models, however, fall short of full-fledged NDTs, as they often focus only on a single performance metric or simulated network data. We introduce M3Net, a Multi-Metric Mixture-of-experts (MoE) NDT that uses a graph neural network architecture to estimate multiple performance metrics from an expanded set of network state data in a range of scenarios. We show that M3Net significantly enhances the accuracy of flow delay predictions by reducing the MAPE (Mean Absolute Percentage Error) from 20.06% to 17.39%, while also achieving 66.47% and 78.7% accuracy on jitter and packets dropped for each flow.
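
Editorial note: the abstract reports flow-delay accuracy as MAPE. As a reminder of what that metric measures, here is a minimal sketch of the standard definition. The function name and the sample delays are illustrative assumptions, not taken from the paper.

```python
def mape(actual, predicted):
    """Mean Absolute Percentage Error, returned in percent.

    Standard textbook definition; the paper's exact evaluation
    protocol (e.g. per-flow averaging) is not specified here.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Hypothetical per-flow delays (ms) and predictions, for illustration only.
true_delays = [10.0, 20.0, 40.0]
predicted_delays = [12.0, 18.0, 40.0]
example_mape = mape(true_delays, predicted_delays)
```

Because each error is divided by the true value, MAPE weights a 2 ms miss on a 10 ms flow more heavily than the same miss on a 40 ms flow.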

[LG-11] Knowledge Diversion for Efficient Morphology Control and Policy Transfer

Link: https://arxiv.org/abs/2512.09796
Authors: Fu Feng,Ruixiao Shi,Yucheng Xie,Jianlu Shen,Jing Wang,Xin Geng
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Universal morphology control aims to learn a universal policy that generalizes across heterogeneous agent morphologies, with Transformer-based controllers emerging as a popular choice. However, such architectures incur substantial computational costs, resulting in high deployment overhead, and existing methods exhibit limited cross-task generalization, necessitating training from scratch for each new task. To this end, we propose DivMorph, a modular training paradigm that leverages knowledge diversion to learn decomposable controllers. DivMorph factorizes randomly initialized Transformer weights into factor units via SVD prior to training and employs dynamic soft gating to modulate these units based on task and morphology embeddings, separating them into shared learngenes and morphology- and task-specific tailors, thereby achieving knowledge disentanglement. By selectively activating relevant components, DivMorph enables scalable and efficient policy deployment while supporting effective policy transfer to novel tasks. Extensive experiments demonstrate that DivMorph achieves state-of-the-art performance, achieving a 3× improvement in sample efficiency over direct finetuning for cross-task transfer and a 17× reduction in model size for single-agent deployment.

[LG-12] TinyDéjàVu: Smaller Memory Footprint & Faster Inference on Sensor Data Streams with Always-On Microcontrollers

Link: https://arxiv.org/abs/2512.09786
Authors: Zhaolan Huang,Emmanuel Baccelli
Categories: Machine Learning (cs.LG); Performance (cs.PF); Sound (cs.SD); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
Comments:

Abstract:Always-on sensors are increasingly expected to embark a variety of tiny neural networks and to continuously perform inference on time-series of the data they sense. In order to fit lifetime and energy consumption requirements when operating on battery, such hardware uses microcontrollers (MCUs) with tiny memory budget e.g., 128kB of RAM. In this context, optimizing data flows across neural network layers becomes crucial. In this paper, we introduce TinyDéjàVu, a new framework and novel algorithms we designed to drastically reduce the RAM footprint required by inference using various tiny ML models for sensor data time-series on typical microcontroller hardware. We publish the implementation of TinyDéjàVu as open source, and we perform reproducible benchmarks on hardware. We show that TinyDéjàVu can save more than 60% of RAM usage and eliminate up to 90% of redundant compute on overlapping sliding window inputs.

[LG-13] Predicting Polymer Solubility in Solvents Using SMILES Strings

Link: https://arxiv.org/abs/2512.09784
Authors: Andrew Reinhard
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Understanding and predicting polymer solubility in various solvents is critical for applications ranging from recycling to pharmaceutical formulation. This work presents a deep learning framework that predicts polymer solubility, expressed as weight percent (wt%), directly from SMILES representations of both polymers and solvents. A dataset of 8,049 polymer-solvent pairs at 25 °C was constructed from calibrated molecular dynamics simulations (Zhou et al., 2023), and molecular descriptors and fingerprints were combined into a 2,394-feature representation per sample. A fully connected neural network with six hidden layers was trained using the Adam optimizer and evaluated using mean squared error loss, achieving strong agreement between predicted and actual solubility values. Generalizability was demonstrated using experimentally measured data from the Materials Genome Project, where the model maintained high accuracy on 25 unseen polymer-solvent combinations. These findings highlight the viability of SMILES-based machine learning models for scalable solubility prediction and high-throughput solvent screening, supporting applications in green chemistry, polymer processing, and materials design.

[LG-14] Physics-Aware Heterogeneous GNN Architecture for Real-Time BESS Optimization in Unbalanced Distribution Systems

Link: https://arxiv.org/abs/2512.09780
Authors: Aoxiang Ma,Salah Ghamizi,Jun Cao,Pedro Rodriguez
Categories: Machine Learning (cs.LG)
Comments: 5 pages, 2 figures, 3 tables

Abstract:Battery energy storage systems (BESS) have become increasingly vital in three-phase unbalanced distribution grids for maintaining voltage stability and enabling optimal dispatch. However, existing deep learning approaches often lack explicit three-phase representation, making it difficult to accurately model phase-specific dynamics and enforce operational constraints, leading to infeasible dispatch solutions. This paper demonstrates that by embedding detailed three-phase grid information (including phase voltages, unbalanced loads, and BESS states) into heterogeneous graph nodes, diverse GNN architectures (GCN, GAT, GraphSAGE, GPS) can jointly predict network state variables with high accuracy. Moreover, a physics-informed loss function incorporates critical battery constraints (SoC and C-rate limits) via soft penalties during training. Experimental validation on the CIGRE 18-bus distribution system shows that this embedding-loss approach achieves low prediction errors, with bus voltage MSEs of 6.92e-07 (GCN), 1.21e-06 (GAT), 3.29e-05 (GPS), and 9.04e-07 (SAGE). Importantly, the physics-informed method ensures nearly zero SoC and C-rate constraint violations, confirming its effectiveness for reliable, constraint-compliant dispatch.
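
Editorial note: the abstract describes adding SoC and C-rate limits to the training loss as soft penalties. A minimal sketch of one common way to do this (a squared hinge on limit violations) follows. The bounds, weights, and function names are hypothetical, and the paper's actual loss may be formulated differently.

```python
def soft_limit_penalty(values, lower, upper):
    """Mean squared violation of the box constraint lower <= v <= upper.

    Values inside the limits contribute zero, so the penalty only
    pushes on predictions that break a constraint.
    """
    total = 0.0
    for v in values:
        over = max(0.0, v - upper)    # amount above the upper limit
        under = max(0.0, lower - v)   # amount below the lower limit
        total += over ** 2 + under ** 2
    return total / len(values)


def physics_informed_loss(mse, soc, c_rate,
                          soc_bounds=(0.1, 0.9),  # hypothetical SoC limits
                          c_rate_max=1.0,         # hypothetical C-rate limit
                          w_soc=1.0, w_c=1.0):    # hypothetical weights
    """Data-fit term plus weighted soft penalties on battery limits."""
    soc_pen = soft_limit_penalty(soc, soc_bounds[0], soc_bounds[1])
    c_pen = soft_limit_penalty(c_rate, -c_rate_max, c_rate_max)
    return mse + w_soc * soc_pen + w_c * c_pen


# One predicted SoC value (0.95) exceeds the assumed 0.9 upper limit.
example_penalty = soft_limit_penalty([0.5, 0.95], 0.1, 0.9)
```

A soft penalty does not guarantee feasibility, which is consistent with the abstract reporting "nearly zero" rather than zero violations.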

[LG-15] Mixture of Lookup Key-Value Experts

Link: https://arxiv.org/abs/2512.09723
Authors: Zongcheng Wang
Categories: Machine Learning (cs.LG)
Comments: Preliminary Version; Work in Progress

Abstract:Recent research has developed several LLM architectures suitable for inference on end-user devices, such as the Mixture of Lookup Experts (MoLE) (Jie et al., 2025). A key feature of MoLE is that each token id is associated with a dedicated group of experts. For a given input, only the experts corresponding to the input token id will be activated. Since the communication overhead of loading this small number of activated experts into RAM during inference is negligible, expert parameters can be offloaded to storage, making MoLE suitable for resource-constrained devices. However, MoLE’s context-independent expert selection mechanism, based solely on input ids, may limit model performance. To address this, we propose the Mixture of Lookup Key-Value Experts (MoLKV) model. In MoLKV, each expert is structured as a key-value pair. For a given input, the input-derived query interacts with the cached key-value experts from the current sequence, generating a context-aware expert output. This context-aware mechanism alleviates the limitation of MoLE, and experimental results demonstrate that MoLKV achieves significantly lower validation loss in small-scale evaluations.

[LG-16] Training One Model to Master Cross-Level Agentic Actions via Reinforcement Learning

Link: https://arxiv.org/abs/2512.09706
Authors: Kaichen He,Zihao Wang,Muyao Li,Anji Liu,Yitao Liang
Categories: Machine Learning (cs.LG)
Comments:

Abstract:The paradigm of agentic AI is shifting from engineered complex workflows to post-training native models. However, existing agents are typically confined to static, predefined action spaces, such as exclusively using APIs, GUI events, or robotic commands. This rigidity limits their adaptability in dynamic environments where the optimal granularity of interaction varies contextually. To bridge this gap, we propose CrossAgent, a unified agentic model that masters heterogeneous action spaces and autonomously selects the most effective interface for each step of a trajectory. We introduce a comprehensive training pipeline that integrates cold-start supervised fine-tuning with a Multi-Turn Group Relative Policy Optimization (GRPO) algorithm. This approach enables the agent to learn adaptive action switching, balancing high-level efficiency with low-level precision, without human-specified rules. Extensive experiments on over 800 tasks in the open-world Minecraft environment demonstrate that CrossAgent achieves state-of-the-art performance. By dynamically leveraging the strengths of diverse action spaces, our model significantly outperforms fixed-action baselines, exhibiting superior generalization and efficiency in long-horizon reasoning. All code and models are available at this https URL.
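
Editorial note: GRPO, as commonly described, replaces a learned value baseline with statistics computed over a group of rollouts for the same task. The sketch below shows only that group-relative advantage step, under the assumption of one scalar reward per rollout. The multi-turn extension used in the paper is not reproduced, and the names are illustrative.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own sampling group.

    advantage_i = (r_i - mean(group)) / (std(group) + eps)
    The eps term (an assumption here) avoids division by zero when
    every rollout in the group receives the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group of four rollouts for one task: two succeed, two fail.
example_advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When all rollouts in a group score the same, every advantage is zero and that group contributes no gradient signal.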

[LG-17] A data-driven approach to linking design features with manufacturing process data for sustainable product development

Link: https://arxiv.org/abs/2512.09690
Authors: Jiahang Li,Lucas Cazzonelli,Jacqueline Höllig,Markus Doellken,Sven Matthiesen
Categories: Machine Learning (cs.LG)
Comments: This is the preprint of a paper accepted for the CIRP Design Conference 2026

Abstract:The growing adoption of Industrial Internet of Things (IIoT) technologies enables automated, real-time collection of manufacturing process data, unlocking new opportunities for data-driven product development. Current data-driven methods are generally applied within specific domains, such as design or manufacturing, with limited exploration of integrating design features and manufacturing process data. Since design decisions significantly affect manufacturing outcomes, such as error rates, energy consumption, and processing times, the lack of such integration restricts the potential for data-driven product design improvements. This paper presents a data-driven approach to mapping and analyzing the relationship between design features and manufacturing process data. A comprehensive system architecture is developed to ensure continuous data collection and integration. The linkage between design features and manufacturing process data serves as the basis for developing a machine learning model that enables automated design improvement suggestions. By integrating manufacturing process data with sustainability metrics, this approach opens new possibilities for sustainable product development.

[LG-18] Membership and Dataset Inference Attacks on Large Audio Generative Models NEURIPS2025

Link: https://arxiv.org/abs/2512.09654
Authors: Jakub Proboszcz,Paweł Kochanski,Karol Korszun,Donato Crisostomi,Giorgio Strano,Emanuele Rodolà,Kamil Deja,Jan Dubinski
Categories: Machine Learning (cs.LG)
Comments: NeurIPS 2025 AI for Music Workshop; NeurIPS 2025 Workshop on Creativity & Generative AI

Abstract:Generative audio models, based on diffusion and autoregressive architectures, have advanced rapidly in both quality and expressiveness. This progress, however, raises pressing copyright concerns, as such models are often trained on vast corpora of artistic and commercial works. A central question is whether one can reliably verify if an artist’s material was included in training, thereby providing a means for copyright holders to protect their content. In this work, we investigate the feasibility of such verification through membership inference attacks (MIA) on open-source generative audio models, which attempt to determine whether a specific audio sample was part of the training set. Our empirical results show that membership inference alone is of limited effectiveness at scale, as the per-sample membership signal is weak for models trained on large and diverse datasets. However, artists and media owners typically hold collections of works rather than isolated samples. Building on prior work in text and vision domains, in this work we focus on dataset inference (DI), which aggregates diverse membership evidence across multiple samples. We find that DI is successful in the audio domain, offering a more practical mechanism for assessing whether an artist’s works contributed to model training. Our results suggest DI as a promising direction for copyright protection and dataset accountability in the era of large audio generative models.
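
Editorial note: the abstract contrasts weak per-sample membership signals with dataset inference, which pools evidence over a collection. One simple way to pool such evidence is a two-sample Welch t-statistic comparing membership scores of the suspect collection against scores of known non-member data. This is an illustrative assumption, not the paper's procedure, and the scores below are made up.

```python
import math
import statistics


def welch_t_statistic(suspect_scores, reference_scores):
    """Welch's t-statistic for a difference in mean membership score.

    A large positive value suggests the suspect collection scores
    systematically higher than known non-member reference data.
    """
    m1 = statistics.fmean(suspect_scores)
    m2 = statistics.fmean(reference_scores)
    v1 = statistics.variance(suspect_scores)    # sample variance
    v2 = statistics.variance(reference_scores)
    standard_error = math.sqrt(v1 / len(suspect_scores)
                               + v2 / len(reference_scores))
    return (m1 - m2) / standard_error


# Hypothetical per-sample membership scores, for illustration only.
example_t = welch_t_statistic([1.0, 2.0, 3.0], [0.0, 1.0, 2.0])
```

Because the standard error shrinks as the collection grows, a small per-sample shift can become detectable once enough samples are aggregated, which is the intuition behind preferring DI over per-sample MIA.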

[LG-19] Semantic-Aware Cooperative Communication and Computation Framework in Vehicular Networks

Link: https://arxiv.org/abs/2512.09621
Authors: Jingbo Zhang,Maoxin Ji,Qiong Wu,Pingyi Fan,Kezhi Wang,Wen Chen
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Semantic Communication (SC) combined with Vehicular edge computing (VEC) provides an efficient edge task processing paradigm for Internet of Vehicles (IoV). Focusing on highway scenarios, this paper proposes a Tripartite Cooperative Semantic Communication (TCSC) framework, which enables Vehicle Users (VUs) to perform semantic task offloading via Vehicle-to-Infrastructure (V2I) and Vehicle-to-Vehicle (V2V) communications. Considering task latency and the number of semantic symbols, the framework constructs a Mixed-Integer Nonlinear Programming (MINLP) problem, which is transformed into two subproblems. First, we innovatively propose a multi-agent proximal policy optimization task offloading optimization method based on parametric distribution noise (MAPPO-PDN) to solve the optimization problem of the number of semantic symbols; second, linear programming (LP) is used to solve offloading ratio. Simulations show that performance of this scheme is superior to that of other algorithms.

[LG-20] Comparative Analysis of Hash-based Malware Clustering via K-Means

Link: https://arxiv.org/abs/2512.09539
Authors: Aink Acrie Soe Thein,Nikolaos Pitropakis,Pavlos Papadopoulos,Sam Grierson,Sana Ullah Jan
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: To be published in the proceedings of the 8th International Conference on Reliable Information and Communication Technology (IRICT 2025). Springer Book Series: “Lecture Notes on Data Engineering and Communications Technologies”

Abstract:With the adoption of multiple digital devices in everyday life, the cyber-attack surface has increased. Adversaries are continuously exploring new avenues to exploit them and deploy malware. On the other hand, detection approaches typically employ hashing-based algorithms such as SSDeep, TLSH, and IMPHash to capture structural and behavioural similarities among binaries. This work focuses on the analysis and evaluation of these techniques for clustering malware samples using the K-means algorithm. More specifically, we experimented with established malware families and traits and found that TLSH and IMPHash produce more distinct, semantically meaningful clusters, whereas SSDeep is more efficient for broader classification tasks. The findings of this work can guide the development of more robust threat-detection mechanisms and adaptive security mechanisms.

[LG-21] Latent-Autoregressive GP-VAE Language Model

Link: https://arxiv.org/abs/2512.09535
Authors: Yves Ruffenach
Categories: Machine Learning (cs.LG)
Comments: 27 pages, 1 figure, 4 tables. Proof-of-concept study of a latent-autoregressive GP-VAE language model with TCN encoder and non-autoregressive decoder. Code available at this https URL

Abstract:We investigate a fully Latent AutoRegressive scheme based on a Gaussian Process (GP) integrated into a Variational Autoencoder (VAE). In this setting, sequential dynamics are transferred from the observation space to a continuous latent space, while linguistic generation remains parallel through a non-autoregressive decoder. We present a complete methodological formulation, including a causal GP prior, a structured amortized posterior, and a training protocol based on a regularized ELBO. Empirical evaluation, conducted within a deliberately constrained proof-of-concept (POC) framework, shows that the model can be trained stably and that the sequential and parallel sampling variants exhibit consistent behavior. Overall, the results suggest that part of the temporal structure in a language model can be supported by the probabilistic geometry of the latent space rather than by explicit neural operations.

[LG-22] QuanvNeXt: An end-to-end quanvolutional neural network for EEG-based detection of major depressive disorder

Link: https://arxiv.org/abs/2512.09517
Authors: Nabil Anan Orka,Ehtashamul Haque,Maftahul Jannat,Md Abdul Awal,Mohammad Ali Moni
Categories: Machine Learning (cs.LG)
Comments: Under review

Abstract:This study presents QuanvNeXt, an end-to-end fully quanvolutional model for EEG-based depression diagnosis. QuanvNeXt incorporates a novel Cross Residual block, which reduces feature homogeneity and strengthens cross-feature relationships while retaining parameter efficiency. We evaluated QuanvNeXt on two open-source datasets, where it achieved an average accuracy of 93.1% and an average AUC-ROC of 97.2%, outperforming state-of-the-art baselines such as InceptionTime (91.7% accuracy, 95.9% AUC-ROC). An uncertainty analysis across Gaussian noise levels demonstrated well-calibrated predictions, with ECE scores remaining low (0.0436, Dataset 1) to moderate (0.1159, Dataset 2) even at the highest perturbation (ε = 0.1). Additionally, a post-hoc explainable AI analysis confirmed that QuanvNeXt effectively identifies and learns spectrotemporal patterns that distinguish between healthy controls and major depressive disorder. Overall, QuanvNeXt establishes an efficient and reliable approach for EEG-based depression diagnosis.
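
Editorial note: the abstract reports calibration as ECE (expected calibration error). For readers unfamiliar with the metric, here is a minimal sketch of the usual equal-width binned estimator. The bin count and the sample predictions are assumptions, and the paper's exact binning is not stated in the abstract.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average of |accuracy - confidence| per bin.

    confidences: predicted probability of the chosen class, in [0, 1]
    correct:     booleans, True where the prediction was right
    Bins here are half-open [lo, hi), with 1.0 placed in the last bin.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Hypothetical predictions: 90% confident, but only 3 of 4 are right.
example_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9],
                                         [True, True, True, False])
```

An ECE of 0 means stated confidence matches observed accuracy in every bin; the toy example is overconfident by 15 percentage points.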

[LG-23] Contextual Dynamic Pricing with Heterogeneous Buyers NEURIPS2025

Link: https://arxiv.org/abs/2512.09513
Authors: Thodoris Lykouris,Sloan Nietert,Princewill Okoroafor,Chara Podimata,Julian Zimmert
Categories: Machine Learning (cs.LG)
Comments: Appeared at NeurIPS 2025

Abstract:We initiate the study of contextual dynamic pricing with a heterogeneous population of buyers, where a seller repeatedly posts prices (over $T$ rounds) that depend on the observable $d$-dimensional context and receives binary purchase feedback. Unlike prior work assuming homogeneous buyer types, in our setting the buyer’s valuation type is drawn from an unknown distribution with finite support size $K_\star$. We develop a contextual pricing algorithm based on optimistic posterior sampling with regret $\widetilde{O}(K_\star\sqrt{dT})$, which we prove to be tight in $d$ and $T$ up to logarithmic terms. Finally, we refine our analysis for the non-contextual pricing case, proposing a variance-aware zooming algorithm that achieves the optimal dependence on $K_\star$.

[LG-24] WarmServe: Enabling One-for-Many GPU Prewarming for Multi-LLM Serving

Link: https://arxiv.org/abs/2512.09472
Authors: Chiheng Lou,Sheng Qi,Rui Kang,Yong Zhang,Chen Sun,Pengcheng Wang,Bingyang Liu,Xuanzhe Liu,Xin Jin
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments:

Abstract:Deploying multiple models within shared GPU clusters is promising for improving resource efficiency in large language model (LLM) serving. Existing multi-LLM serving systems optimize GPU utilization at the cost of worse inference performance, especially time-to-first-token (TTFT). We identify the root cause of such compromise as their unawareness of future workload characteristics. In contrast, recent analysis on real-world traces has shown the high periodicity and long-term predictability of LLM serving workloads. We propose universal GPU workers to enable one-for-many GPU prewarming that loads models with knowledge of future workloads. Based on universal GPU workers, we design and build WarmServe, a multi-LLM serving system that (1) mitigates cluster-wide prewarming interference by adopting an evict-aware model placement strategy, (2) prepares universal GPU workers in advance by proactive prewarming, and (3) manages GPU memory with a zero-overhead memory switching mechanism. Evaluation under real-world datasets shows that WarmServe improves TTFT by up to 50.8× compared to the state-of-the-art autoscaling-based system, while being capable of serving up to 2.5× more requests compared to the GPU-sharing system.

[LG-25] Cauchy-Schwarz Fairness Regularizer

Link: https://arxiv.org/abs/2512.09467
Authors: Yezi Liu,Hanning Chen,Wenjun Huang,Yang Ni,Mohsen Imani
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Group fairness in machine learning is often enforced by adding a regularizer that reduces the dependence between model predictions and sensitive attributes. However, existing regularizers are built on heterogeneous distance measures and design choices, which makes their behavior hard to reason about and their performance inconsistent across tasks. This raises a basic question: what properties make a good fairness regularizer? We address this question by first organizing existing in-process methods into three families: (i) matching prediction statistics across sensitive groups, (ii) aligning latent representations, and (iii) directly minimizing dependence between predictions and sensitive attributes. Through this lens, we identify desirable properties of the underlying distance measure, including tight generalization bounds, robustness to scale differences, and the ability to handle arbitrary prediction distributions. Motivated by these properties, we propose a Cauchy-Schwarz (CS) fairness regularizer that penalizes the empirical CS divergence between prediction distributions conditioned on sensitive groups. Under a Gaussian comparison, we show that CS divergence yields a tighter bound than Kullback-Leibler divergence, Maximum Mean Discrepancy, and the mean disparity used in Demographic Parity, and we discuss how these advantages translate to a distribution-free, kernel-based estimator that naturally extends to multiple sensitive attributes. Extensive experiments on four tabular benchmarks and one image dataset demonstrate that the proposed CS regularizer consistently improves Demographic Parity and Equal Opportunity metrics while maintaining competitive accuracy, and achieves a more stable utility-fairness trade-off across hyperparameter settings compared to prior regularizers.
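
Editorial note: the Cauchy-Schwarz divergence between densities p and q is D_CS(p, q) = -log( (∫pq)² / (∫p² ∫q²) ), which is zero when p = q. The sketch below uses the standard Gaussian-kernel plug-in estimate on scalar predictions from two sensitive groups. The bandwidth and the sample values are assumptions, and the paper's estimator may differ in detail.

```python
import math


def _mean_kernel(xs, ys, sigma):
    """Average Gaussian kernel value over all pairs (x, y).

    The kernel's normalizing constant is omitted because it cancels
    in the divergence below.
    """
    total = 0.0
    for x in xs:
        for y in ys:
            total += math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))
    return total / (len(xs) * len(ys))


def cs_divergence(preds_a, preds_b, sigma=0.1):
    """Empirical Cauchy-Schwarz divergence between two sample sets.

    -2*log(cross) + log(self_a) + log(self_b), which equals
    -log(cross^2 / (self_a * self_b)) and is non-negative.
    """
    cross = _mean_kernel(preds_a, preds_b, sigma)
    self_a = _mean_kernel(preds_a, preds_a, sigma)
    self_b = _mean_kernel(preds_b, preds_b, sigma)
    return -2.0 * math.log(cross) + math.log(self_a) + math.log(self_b)


# Hypothetical model scores for two sensitive groups.
group_a = [0.1, 0.2]
group_b = [0.8, 0.9]
example_gap = cs_divergence(group_a, group_b)
example_same = cs_divergence(group_a, group_a)
```

Used as a regularizer, this quantity would be added to the task loss with a weight, so that training is pushed toward prediction distributions that look alike across groups.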

[LG-26] Generalizable Collaborative Search-and-Capture in Cluttered Environments via Path-Guided MAPPO and Directional Frontier Allocation

Link: https://arxiv.org/abs/2512.09410
Authors: Jialin Ying,Zhihao Li,Zicheng Dong,Guohua Wu,Yihuan Liao
Categories: Robotics (cs.RO); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 7 pages, 7 figures

Abstract:Collaborative pursuit-evasion in cluttered environments presents significant challenges due to sparse rewards and constrained Fields of View (FOV). Standard Multi-Agent Reinforcement Learning (MARL) often suffers from inefficient exploration and fails to scale to large scenarios. We propose PGF-MAPPO (Path-Guided Frontier MAPPO), a hierarchical framework bridging topological planning with reactive control. To resolve local minima and sparse rewards, we integrate an A*-based potential field for dense reward shaping. Furthermore, we introduce Directional Frontier Allocation, combining Farthest Point Sampling (FPS) with geometric angle suppression to enforce spatial dispersion and accelerate coverage. The architecture employs a parameter-shared decentralized critic, maintaining O(1) model complexity suitable for robotic swarms. Experiments demonstrate that PGF-MAPPO achieves superior capture efficiency against faster evaders. Policies trained on 10x10 maps exhibit robust zero-shot generalization to unseen 20x20 environments, significantly outperforming rule-based and learning-based baselines.
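
Editorial note: Farthest Point Sampling, which the abstract names as one ingredient of Directional Frontier Allocation, greedily picks points that are as far as possible from those already chosen. The sketch below covers plain FPS only. The geometric angle suppression step is not reproduced, and the frontier coordinates are hypothetical.

```python
import math


def farthest_point_sampling(points, k, start=0):
    """Return indices of up to k points chosen by greedy FPS.

    Each step adds the point whose distance to its nearest
    already-selected point is largest, which spreads selections out.
    """
    if k <= 0 or not points:
        return []
    chosen = [start]
    # Distance from every point to the nearest selected point so far.
    nearest = [math.dist(p, points[start]) for p in points]
    while len(chosen) < min(k, len(points)):
        candidate = max(range(len(points)), key=nearest.__getitem__)
        chosen.append(candidate)
        nearest = [min(d, math.dist(p, points[candidate]))
                   for d, p in zip(nearest, points)]
    return chosen


# Hypothetical frontier cells on a line; three pursuers to assign.
frontiers = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (5.0, 0.0)]
example_targets = farthest_point_sampling(frontiers, k=3)
```

In the example the nearby cell at (1, 0) is skipped in favor of the far end and the midpoint, which is the spatial dispersion effect the abstract refers to.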

[LG-27] Black-Box Behavioral Distillation Breaks Safety Alignment in Medical LLMs

Link: https://arxiv.org/abs/2512.09403
Authors: Sohely Jahan,Ruimin Sun
Categories: Machine Learning (cs.LG)
Comments:

Abstract:As medical large language models (LLMs) become increasingly integrated into clinical workflows, concerns around alignment robustness and safety are escalating. Prior work on model extraction has focused on classification models or memorization leakage, leaving the vulnerability of safety-aligned generative medical LLMs underexplored. We present a black-box distillation attack that replicates the domain-specific reasoning of safety-aligned medical LLMs using only output-level access. By issuing 48,000 instruction queries to Meditron-7B and collecting 25,000 benign instruction-response pairs, we fine-tune a LLaMA3 8B surrogate via parameter-efficient LoRA under a zero-alignment supervision setting, requiring no access to model weights, safety filters, or training data. With a cost of $12, the surrogate achieves strong fidelity on benign inputs while producing unsafe completions for 86% of adversarial prompts, far exceeding both Meditron-7B (66%) and the untuned base model (46%). This reveals a pronounced functional-ethical gap: task utility transfers while alignment collapses. To analyze this collapse, we develop a dynamic adversarial evaluation framework combining Generative Query (GQ)-based harmful prompt generation, verifier filtering, category-wise failure analysis, and adaptive Random Search (RS) jailbreak attacks. We also propose a layered defense system as a prototype detector for real-time alignment drift in black-box deployments. Our findings show that benign-only black-box distillation exposes a practical and under-recognized threat: adversaries can cheaply replicate medical LLM capabilities while stripping safety mechanisms, underscoring the need for extraction-aware safety monitoring.

[LG-28] Federated Distillation Assisted Vehicle Edge Caching Scheme Based on Lightweight DDPM

Link: https://arxiv.org/abs/2512.09378
Authors: Xun Li,Qiong Wu,Pingyi Fan,Kezhi Wang,Wen Chen,Khaled B. Letaief
Categories: Machine Learning (cs.LG)
Comments: This paper has been submitted to IEEE letters. The source code has been released at: this https URL

Abstract:Vehicle edge caching is a promising technology that can significantly reduce the latency for vehicle users (VUs) to access content by pre-caching user-interested content at edge nodes. It is crucial to accurately predict the content that VUs are interested in without exposing their privacy. Traditional federated learning (FL) can protect user privacy by sharing models rather than raw data. However, the training of FL requires frequent model transmission, which can result in significant communication overhead. Additionally, vehicles may leave the road side unit (RSU) coverage area before training is completed, leading to training failures. To address these issues, in this letter, we propose a federated distillation-assisted vehicle edge caching scheme based on lightweight denoising diffusion probabilistic model (LDPM). The simulation results demonstrate that the proposed vehicle edge caching scheme has good robustness to variations in vehicle speed, significantly reducing communication overhead and improving cache hit percentage.

[LG-29] CFLight: Enhancing Safety with Traffic Signal Control through Counterfactual Learning

Link: https://arxiv.org/abs/2512.09368
Authors: Mingyuan Li,Chunyu Liu,Zhuojun Li,Xiao Liu,Guangsheng Yu,Bo Du,Jun Shen,Qiang Wu
Categories: Machine Learning (cs.LG); Methodology (stat.ME)
Comments:

Abstract:Traffic accidents result in millions of injuries and fatalities globally, with a significant number occurring at intersections each year. Traffic Signal Control (TSC) is an effective strategy for enhancing safety at these urban junctures. Despite the growing popularity of Reinforcement Learning (RL) methods in optimizing TSC, these methods often prioritize driving efficiency over safety, thus failing to address the critical balance between these two aspects. Additionally, these methods often lack interpretability. CounterFactual (CF) learning is a promising approach for various causal analysis fields. In this study, we introduce a novel framework to improve RL for safety aspects in TSC. This framework introduces a novel method based on CF learning to address the question: "What if, when an unsafe event occurs, we backtrack to perform alternative actions, and will this unsafe event still occur in the subsequent period?" To answer this question, we propose a new structural causal model to predict the result after executing different actions, and we propose a new CF module that integrates with additional "X" modules to promote safe RL practices. Our new algorithm, CFLight, which is derived from this framework, effectively tackles challenging safety events and significantly improves safety at intersections through a near-zero collision control strategy. Through extensive numerical experiments on both real-world and synthetic datasets, we demonstrate that CFLight reduces collisions and improves overall traffic performance compared to conventional RL methods and the recent safe RL model. Moreover, our method represents a generalized and safe framework for RL methods, opening possibilities for applications in other domains. The data and code are available on GitHub at this https URL.

[LG-30] KGOT: Unified Knowledge Graph and Optimal Transport Pseudo-Labeling for Molecule-Protein Interaction Prediction

Link: https://arxiv.org/abs/2512.09365
Authors: Jiayu Qin,Zhengquan Luo,Guy Tadmor,Changyou Chen,David Zeevi,Zhiqiang Xu
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Predicting molecule-protein interactions (MPIs) is a fundamental task in computational biology, with crucial applications in drug discovery and molecular function annotation. However, existing MPI models face two major challenges. First, the scarcity of labeled molecule-protein pairs significantly limits model performance, as available datasets capture only a small fraction of biologically relevant interactions. Second, most methods rely solely on molecular and protein features, ignoring broader biological context such as genes, metabolic pathways, and functional annotations that could provide essential complementary information. To address these limitations, our framework first aggregates diverse biological datasets, including molecule-, protein-, gene-, and pathway-level interactions, and then develops an optimal transport-based approach to generate high-quality pseudo-labels for unlabeled molecule-protein pairs, leveraging the underlying distribution of known interactions to guide label assignment. By treating pseudo-labeling as a mechanism for bridging disparate biological modalities, our approach enables the effective use of heterogeneous data to enhance MPI prediction. We evaluate our framework on multiple MPI datasets, including virtual screening tasks and protein retrieval tasks, demonstrating substantial improvements over state-of-the-art methods in prediction accuracy and zero-shot ability across unseen interactions. Beyond MPI prediction, our approach provides a new paradigm for leveraging diverse biological data sources to tackle problems traditionally constrained by single- or bi-modal learning, paving the way for future advances in computational biology and drug discovery.

[LG-31] A Granular Framework for Construction Material Price Forecasting: Econometric and Machine-Learning Approaches

Link: https://arxiv.org/abs/2512.09360
Authors: Boge Lyu,Qianye Yin,Iris Denise Tommelein,Hanyang Liu,Karnamohit Ranka,Karthik Yeluripati,Junzhe Shi
Categories: Machine Learning (cs.LG); Econometrics (econ.EM)
Comments:

Abstract:The persistent volatility of construction material prices poses significant risks to cost estimation, budgeting, and project delivery, underscoring the urgent need for granular and scalable forecasting methods. This study develops a forecasting framework that leverages the Construction Specifications Institute (CSI) MasterFormat as the target data structure, enabling predictions at the six-digit section level and supporting detailed cost projections across a wide spectrum of building materials. To enhance predictive accuracy, the framework integrates explanatory variables such as raw material prices, commodity indexes, and macroeconomic indicators. Four time-series models, Long Short-Term Memory (LSTM), Autoregressive Integrated Moving Average (ARIMA), Vector Error Correction Model (VECM), and Chronos-Bolt, were evaluated under both baseline configurations (using CSI data only) and extended versions with explanatory variables. Results demonstrate that incorporating explanatory variables significantly improves predictive performance across all models. Among the tested approaches, the LSTM model consistently achieved the highest accuracy, with RMSE values as low as 1.390 and MAPE values of 0.957, representing improvements of up to 59% over the traditional statistical time-series model, ARIMA. Validation across multiple CSI divisions confirmed the framework’s scalability, while Division 06 (Wood, Plastics, and Composites) is presented in detail as a demonstration case. This research offers a robust methodology that enables owners and contractors to improve budgeting practices and achieve more reliable cost estimation at the Definitive level.
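
For reference, the two error metrics quoted in this abstract have standard definitions, sketched below in plain Python. The function names are illustrative and not taken from the paper. The abstract does not say whether its MAPE of 0.957 is a percentage or a fraction; this sketch returns a percentage.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Assumes no actual value is zero (the ratio is undefined there).
    """
    n = len(actual)
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / n


# Toy price series (made-up numbers, for illustration only).
prices = [100.0, 200.0]
forecast = [110.0, 190.0]
print(rmse(prices, forecast), mape(prices, forecast))
```

RMSE is expressed in the units of the target series, while MAPE is scale-free, which is why the two are usually reported together when comparing models across materials with different price levels.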

[LG-32] Improved Physics-Driven Neural Network to Solve Inverse Scattering Problems

Link: https://arxiv.org/abs/2512.09333
Authors: Yutong Du,Zicheng Liu,Bo Wu,Jingwei Kou,Hang Li,Changyou Li,Yali Zong,Bo Qi
Categories: Machine Learning (cs.LG)
Comments:

Abstract:This paper presents an improved physics-driven neural network (IPDNN) framework for solving electromagnetic inverse scattering problems (ISPs). A new Gaussian-localized oscillation-suppressing window (GLOW) activation function is introduced to stabilize convergence and enable a lightweight yet accurate network architecture. A dynamic scatter subregion identification strategy is further developed to adaptively refine the computational domain, preventing missed detections and reducing computational cost. Moreover, transfer learning is incorporated to extend the solver’s applicability to practical scenarios, integrating the physical interpretability of iterative algorithms with the real-time inference capability of neural networks. Numerical simulations and experimental results demonstrate that the proposed solver achieves superior reconstruction accuracy, robustness, and efficiency compared with existing state-of-the-art methods.

[LG-33] Self Distillation Fine-Tuning of Protein Language Models Improves Versatility in Protein Design

Link: https://arxiv.org/abs/2512.09329
Authors: Amin Tavakoli,Raswanth Murugan,Ozan Gokdemir,Arvind Ramanathan,Frances Arnold,Anima Anandkumar
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
Comments:

Abstract:Supervised fine-tuning (SFT) is a standard approach for adapting large language models to specialized domains, yet its application to protein sequence modeling and protein language models (PLMs) remains ad hoc. This is in part because high-quality annotated data are far more difficult to obtain for proteins than for natural language. We present a simple and general recipe for fast SFT of PLMs, designed to improve the fidelity, reliability, and novelty of generated protein sequences. Unlike existing approaches that require costly precompiled experimental datasets for SFT, our method leverages the PLM itself, integrating a lightweight curation pipeline with domain-specific filters to construct high-quality training data. These filters can independently refine a PLM’s output and identify candidates for in vitro evaluation; when combined with SFT, they enable PLMs to generate more stable and functional enzymes, while expanding exploration into protein sequence space beyond natural variants. Although our approach is agnostic to both the choice of protein language model (PLM) and the protein system, we demonstrate its effectiveness with a genome-scale PLM (GenSLM) applied to the tryptophan synthase enzyme family. The supervised fine-tuned model generates sequences that are not only more novel but also display improved characteristics across both targeted design constraints and emergent protein property measures.

[LG-34] Self-Supervised Learning with Gaussian Processes

Link: https://arxiv.org/abs/2512.09322
Authors: Yunshan Duan,Sinead Williamson
Categories: Machine Learning (cs.LG); Methodology (stat.ME)
Comments:

Abstract:Self-supervised learning (SSL) is a machine learning paradigm where models learn to understand the underlying structure of data without explicit supervision from labeled samples. Representations acquired through SSL have proven useful for many downstream tasks, such as clustering and linear classification. To ensure smoothness of the representation space, most SSL methods rely on the ability to generate pairs of observations that are similar to a given instance. However, generating these pairs may be challenging for many types of data. Moreover, these methods lack consideration of uncertainty quantification and can perform poorly in out-of-sample prediction settings. To address these limitations, we propose Gaussian process self-supervised learning (GPSSL), a novel approach that applies Gaussian process (GP) models to representation learning. GP priors are imposed on the representations, and we obtain a generalized Bayesian posterior minimizing a loss function that encourages informative representations. The covariance function inherent in GPs naturally pulls representations of similar units together, serving as an alternative to using explicitly defined positive samples. We show that GPSSL is closely related to both kernel PCA and VICReg, a popular neural network-based SSL method, but unlike both allows for posterior uncertainties that can be propagated to downstream tasks. Experiments on various datasets, considering classification and regression tasks, demonstrate that GPSSL outperforms traditional methods in terms of accuracy, uncertainty quantification, and error control.

[LG-35] Goal inference with Rao-Blackwellized Particle Filters

Link: https://arxiv.org/abs/2512.09269
Authors: Yixuan Wang,Dan P. Guralnik,Warren E. Dixon
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Comments: 9 pages, 2 figures

Abstract:Inferring the eventual goal of a mobile agent from noisy observations of its trajectory is a fundamental estimation problem. We initiate the study of such intent inference using a variant of a Rao-Blackwellized Particle Filter (RBPF), subject to the assumption that the agent’s intent manifests through closed-loop behavior with a state-of-the-art provable practical stability property. Leveraging the assumed closed-form agent dynamics, the RBPF analytically marginalizes the linear-Gaussian substructure and updates particle weights only, improving sample efficiency over a standard particle filter. Two different estimators are introduced: a Gaussian mixture model using the RBPF weights and a reduced version confining the mixture to the effective sample. We quantify how well an adversary can recover the agent’s intent using information-theoretic leakage metrics and provide computable lower bounds on the Kullback-Leibler (KL) divergence between the true intent distribution and RBPF estimates via Gaussian-mixture KL bounds. We also provide a bound on the difference in performance between the two estimators, highlighting the fact that the reduced estimator performs almost as well as the complete one. Experiments illustrate fast and accurate intent recovery for compliant agents, motivating future work on designing intent-obfuscating controllers.

[LG-36] Contrastive Learning for Semi-Supervised Deep Regression with Generalized Ordinal Rankings from Spectral Seriation

Link: https://arxiv.org/abs/2512.09267
Authors: Ce Wang,Weihang Dai,Hanru Bai,Xiaomeng Li
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Contrastive learning methods enforce label distance relationships in feature space to improve representation capability for regression models. However, these methods depend heavily on label information to correctly recover ordinal relationships of features, limiting their applications to semi-supervised regression. In this work, we extend contrastive regression methods to allow unlabeled data to be used in the semi-supervised setting, thereby reducing the dependence on costly annotations. In particular, we construct the feature similarity matrix with both labeled and unlabeled samples in a mini-batch to reflect inter-sample relationships, and an accurate ordinal ranking of the unlabeled samples involved can be recovered through spectral seriation algorithms if the level of error is within certain bounds. The inclusion of labeled samples regularizes the ordinal ranking with guidance from the ground-truth label information, making the ranking more reliable. To reduce feature perturbations, we further utilize a dynamic programming algorithm to select robust features for the matrix construction. The recovered ordinal relationship is then used for contrastive learning on unlabeled samples, which allows more data to be used for feature representation learning and thereby yields more robust results. The ordinal rankings can also be used to supervise predictions on unlabeled samples, serving as an additional training signal. We provide theoretical guarantees and empirical verification through experiments on various datasets, demonstrating that our method can surpass existing state-of-the-art semi-supervised deep regression methods. Our code has been released at this https URL.
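
Spectral seriation, as named in this abstract, can be illustrated with the classical Fiedler-vector recipe: build the graph Laplacian of a similarity matrix and sort the items by the eigenvector of its second-smallest eigenvalue. The sketch below is a generic stdlib-only version and not the authors' implementation; `fiedler_order`, the power-iteration solver, and the toy similarity `exp(-|y_i - y_j|)` are all assumptions made for illustration.

```python
import math


def fiedler_order(sim, iters=2000):
    """Order items by the Fiedler vector of the Laplacian of `sim`.

    `sim` is a symmetric, non-negative similarity matrix (list of lists).
    The ordering is recovered only up to reversal.
    """
    n = len(sim)
    # Degrees exclude the self-similarity on the diagonal.
    deg = [sum(row) - row[i] for i, row in enumerate(sim)]
    lap = [[deg[i] if i == j else -sim[i][j] for j in range(n)] for i in range(n)]
    # Gershgorin: every Laplacian eigenvalue is at most 2 * max degree, so
    # (shift * I - L) is positive semidefinite and power iteration applies.
    shift = 2.0 * max(deg)
    v = [float(i) for i in range(n)]
    for _ in range(iters):
        mean = sum(v) / n
        v = [x - mean for x in v]  # project out the constant eigenvector
        w = [shift * v[i] - sum(lap[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return sorted(range(n), key=lambda i: v[i])


# Hidden targets that the similarity matrix encodes (toy data).
hidden = [0.1, 0.9, 0.5, 0.3, 0.7]
sim = [[math.exp(-abs(a - b)) for b in hidden] for a in hidden]
print(fiedler_order(sim))  # indices sorted by hidden value, or the reverse
```

Because the ordering is only defined up to reversal, a few labeled samples are enough to fix its direction, which is consistent with the regularizing role the abstract assigns to labeled data.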

[LG-37] Understanding the Failure Modes of Transformers through the Lens of Graph Neural Networks

Link: https://arxiv.org/abs/2512.09182
Authors: Hunjae Lee
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Transformers, and more specifically decoder-only transformers, dominate modern LLM architectures. While they have been shown to work exceptionally well, they are not without issues, resulting in surprising failure modes and predictably asymmetric performance degradation. This article is a study of many of these observed failure modes of transformers through the lens of graph neural network (GNN) theory. We first make the case that much of deep learning, including transformers, is about learnable information mixing and propagation. This makes the study of model failure modes a study of bottlenecks in information propagation. This naturally leads to GNN theory, where there is already a rich literature on information propagation bottlenecks and theoretical failure modes of models. We then make the case that many issues faced by GNNs are also experienced by transformers. In addition, we analyze how the causal nature of decoder-only transformers creates interesting geometric properties in information propagation, resulting in predictable and potentially devastating failure modes. Finally, we observe that existing solutions in transformer research tend to be ad hoc and driven by intuition rather than grounded theoretical motivation. As such, we unify many such solutions under a more theoretical perspective, providing insight into why they work, what problem they are actually solving, and how they can be further improved to target specific failure modes of transformers. Overall, this article is an attempt to bridge the gap between observed failure modes in transformers and a general lack of theoretical understanding of them in this space.

[LG-38] Spectral Embedding via Chebyshev Bases for Robust DeepONet Approximation

Link: https://arxiv.org/abs/2512.09165
Authors: Muhammad Abid,Omer San
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Deep Operator Networks (DeepONets) have become a central tool in data-driven operator learning, providing flexible surrogates for nonlinear mappings arising in partial differential equations (PDEs). However, the standard trunk design based on fully connected layers acting on raw spatial or spatiotemporal coordinates struggles to represent sharp gradients, boundary layers, and non-periodic structures commonly found in PDEs posed on bounded domains with Dirichlet or Neumann boundary conditions. To address these limitations, we introduce the Spectral-Embedded DeepONet (SEDONet), a new DeepONet variant in which the trunk is driven by a fixed Chebyshev spectral dictionary rather than coordinate inputs. This non-periodic spectral embedding provides a principled inductive bias tailored to bounded domains, enabling the learned operator to capture fine-scale non-periodic features that are difficult for Fourier or MLP trunks to represent. SEDONet is evaluated on a suite of PDE benchmarks including 2D Poisson, 1D Burgers, 1D advection-diffusion, Allen-Cahn dynamics, and the Lorenz-96 chaotic system, covering elliptic, parabolic, advective, and multiscale temporal phenomena, all of which can be viewed as canonical problems in computational mechanics. Across all datasets, SEDONet consistently achieves the lowest relative L2 errors among DeepONet, FEDONet, and SEDONet, with average improvements of about 30-40% over the baseline DeepONet and meaningful gains over Fourier-embedded variants on non-periodic geometries. Spectral analyses further show that SEDONet more accurately preserves high-frequency and boundary-localized features, demonstrating the value of Chebyshev embeddings in non-periodic operator learning. The proposed architecture offers a simple, parameter-neutral modification to DeepONets, delivering a robust and efficient spectral framework for surrogate modeling of PDEs on bounded domains.
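
To make the idea of a Chebyshev embedding concrete, the sketch below maps a coordinate on a bounded interval to the values of the first few Chebyshev polynomials of the first kind, using the standard three-term recurrence. How SEDONet actually builds and consumes its spectral dictionary is not specified in the abstract, so `chebyshev_features` and its parameters are hypothetical.

```python
def chebyshev_features(x, degree, lo=0.0, hi=1.0):
    """Return [T_0(t), ..., T_degree(t)] for x in [lo, hi].

    The coordinate is first rescaled to t in [-1, 1], the natural domain
    of the Chebyshev polynomials, then the recurrence
    T_{k+1}(t) = 2 t T_k(t) - T_{k-1}(t) is applied.
    """
    t = 2.0 * (x - lo) / (hi - lo) - 1.0
    feats = [1.0, t]
    for _ in range(2, degree + 1):
        feats.append(2.0 * t * feats[-1] - feats[-2])
    return feats[: degree + 1]


# Midpoint of [0, 1] maps to t = 0, where T_k alternates 1, 0, -1, 0, 1.
print(chebyshev_features(0.5, 4))
```

Unlike sines and cosines, these basis functions do not assume periodicity, and their oscillations cluster near the interval endpoints, which matches the abstract's emphasis on boundary layers and non-periodic structure.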

[LG-39] Learning Unmasking Policies for Diffusion Language Models

Link: https://arxiv.org/abs/2512.09106
Authors: Metod Jazbec,Theo X. Olausson,Louis Béthune,Pierre Ablin,Michael Kirchhof,Joao Monterio,Victor Turrisi,Jason Ramapuram,Marco Cuturi
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Diffusion (Large) Language Models (dLLMs) now match the downstream performance of their autoregressive counterparts on many tasks, while holding the promise of being more efficient during inference. One particularly successful variant is masked discrete diffusion, in which a buffer filled with special mask tokens is progressively replaced with tokens sampled from the model’s vocabulary. Efficiency can be gained by unmasking several tokens in parallel, but doing too many at once risks degrading the generation quality. Thus, one critical design aspect of dLLMs is the sampling procedure that selects, at each step of the diffusion process, which tokens to replace. Indeed, recent work has found that heuristic strategies such as confidence thresholding lead to both higher quality and token throughput compared to random unmasking. However, such heuristics have downsides: they require manual tuning, and we observe that their performance degrades with larger buffer sizes. In this work, we instead propose to train sampling procedures using reinforcement learning. Specifically, we formalize masked diffusion sampling as a Markov decision process in which the dLLM serves as the environment, and propose a lightweight policy architecture based on a single-layer transformer that maps dLLM token confidences to unmasking decisions. Our experiments show that these trained policies match the performance of state-of-the-art heuristics when combined with semi-autoregressive generation, while outperforming them in the full diffusion setting. We also examine the transferability of these policies, finding that they can generalize to new underlying dLLMs and longer sequence lengths. However, we also observe that their performance degrades when applied to out-of-domain data, and that fine-grained tuning of the accuracy-efficiency trade-off can be challenging with our approach.
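
The confidence-thresholding heuristic that the abstract uses as a baseline can be sketched as a single selection step: unmask every still-masked position whose top-token confidence clears a threshold. This is one plausible form of the heuristic and not the paper's learned policy; `select_unmask`, the default threshold, and the fallback to the single most confident position are assumptions.

```python
def select_unmask(confidences, masked, threshold=0.9):
    """Pick which masked positions to reveal at this diffusion step.

    confidences: per-position confidence of the model's top token.
    masked: indices that still hold the mask token.
    """
    if not masked:
        return []
    chosen = [i for i in masked if confidences[i] >= threshold]
    if not chosen:
        # Assumed fallback: always reveal at least one token so that
        # sampling makes progress even when nothing clears the threshold.
        chosen = [max(masked, key=lambda i: confidences[i])]
    return sorted(chosen)


conf = [0.95, 0.20, 0.91, 0.50]
print(select_unmask(conf, masked=[0, 1, 2, 3]))  # parallel unmasking
print(select_unmask(conf, masked=[1, 3]))        # fallback to one token
```

The threshold is the hand-tuned knob the abstract refers to: lowering it reveals more tokens per step and raises throughput, at the risk of degrading generation quality.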

[LG-40] Natural Geometry of Robust Data Attribution: From Convex Models to Deep Networks

Link: https://arxiv.org/abs/2512.09103
Authors: Shihao Li,Jiachen Li,Dongmei Chen
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:

Abstract:Data attribution methods identify which training examples are responsible for a model’s predictions, but their sensitivity to distributional perturbations undermines practical reliability. We present a unified framework for certified robust attribution that extends from convex models to deep networks. For convex settings, we derive Wasserstein-Robust Influence Functions (W-RIF) with provable coverage guarantees. For deep networks, we demonstrate that Euclidean certification is rendered vacuous by spectral amplification – a mechanism where the inherent ill-conditioning of deep representations inflates Lipschitz bounds by over 10,000×. This explains why standard TRAK scores, while accurate point estimates, are geometrically fragile: naive Euclidean robustness analysis yields 0% certification. Our key contribution is the Natural Wasserstein metric, which measures perturbations in the geometry induced by the model’s own feature covariance. This eliminates spectral amplification, reducing worst-case sensitivity by 76× and stabilizing attribution estimates. On CIFAR-10 with ResNet-18, Natural W-TRAK certifies 68.7% of ranking pairs compared to 0% for Euclidean baselines – to our knowledge, the first non-vacuous certified bounds for neural network attribution. Furthermore, we prove that the Self-Influence term arising from our analysis equals the Lipschitz constant governing attribution stability, providing theoretical grounding for leverage-based anomaly detection. Empirically, Self-Influence achieves 0.970 AUROC for label noise detection, identifying 94.1% of corrupted labels by examining just the top 20% of training data.

[LG-41] GS-KAN: Parameter-Efficient Kolmogorov-Arnold Networks via Sprecher-Type Shared Basis Functions

Link: https://arxiv.org/abs/2512.09084
Authors: Oscar Eliasson
Categories: Machine Learning (cs.LG)
Comments: 10 pages

Abstract:The Kolmogorov-Arnold representation theorem offers a theoretical alternative to Multi-Layer Perceptrons (MLPs) by placing learnable univariate functions on edges rather than nodes. While recent implementations such as Kolmogorov-Arnold Networks (KANs) demonstrate high approximation capabilities, they suffer from significant parameter inefficiency due to the requirement of maintaining unique parameterizations for every network edge. In this work, we propose GS-KAN (Generalized Sprecher-KAN), a lightweight architecture inspired by David Sprecher’s refinement of the superposition theorem. GS-KAN constructs unique edge functions by applying learnable linear transformations to a single learnable, shared parent function per layer. We evaluate GS-KAN against existing KAN architectures and MLPs across synthetic function approximation, tabular data regression and image classification tasks. Our results demonstrate that GS-KAN outperforms both MLPs and standard KAN baselines on continuous function approximation tasks while maintaining superior parameter efficiency. Additionally, GS-KAN achieves competitive performance with existing KAN architectures on tabular regression and outperforms MLPs on high-dimensional classification tasks. Crucially, the proposed architecture enables the deployment of KAN-based architectures in high-dimensional regimes under strict parameter constraints, a setting where standard implementations are typically infeasible due to parameter explosion. The source code is available at this https URL.

[LG-42] Modular Deep-Learning-Based Early Warning System for Deadly Heatwave Prediction

Link: https://arxiv.org/abs/2512.09074
Authors: Shangqing Xu,Zhiyuan Zhao,Megha Sharma,José María Martín-Olalla,Alexander Rodríguez,Gregory A. Wellenius,B. Aditya Prakash
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Severe heatwaves in urban areas significantly threaten public health, calling for the establishment of early warning strategies. Although existing work predicts the occurrence of heatwaves and attributes historical mortality, predicting an incoming deadly heatwave remains a challenge due to the difficulty in defining and estimating heat-related mortality. Furthermore, establishing an early warning system imposes additional requirements, including data availability, spatial and temporal robustness, and decision costs. To address these challenges, we propose DeepTherm, a modular early warning system for deadly heatwave prediction without requiring heat-related mortality history. By highlighting the flexibility of deep learning, DeepTherm employs a dual-prediction pipeline, disentangling baseline mortality in the absence of heatwaves and other irregular events from all-cause mortality. We evaluated DeepTherm on real-world data across Spain. Results demonstrate consistent, robust, and accurate performance across diverse regions, time periods, and population groups while allowing a trade-off between missed alarms and false alarms.

[LG-43] Banach neural operator for Navier-Stokes equations

Link: https://arxiv.org/abs/2512.09070
Authors: Bo Zhang
Categories: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Classical neural networks are known for their ability to approximate mappings between finite-dimensional spaces, but they fall short in capturing complex operator dynamics across infinite-dimensional function spaces. Neural operators, in contrast, have emerged as powerful tools in scientific machine learning for learning such mappings. However, standard neural operators typically lack mechanisms for mixing or attending to input information across space and time. In this work, we introduce the Banach neural operator (BNO) – a novel framework that integrates Koopman operator theory with deep neural networks to predict nonlinear, spatiotemporal dynamics from partial observations. The BNO approximates a nonlinear operator between Banach spaces by combining spectral linearization (via Koopman theory) with deep feature learning (via convolutional neural networks and nonlinear activations). This sequence-to-sequence model captures dominant dynamic modes and allows for mesh-independent prediction. Numerical experiments on the Navier-Stokes equations demonstrate the method’s accuracy and generalization capabilities. In particular, BNO achieves robust zero-shot super-resolution in unsteady flow prediction and consistently outperforms conventional Koopman-based methods and deep learning models.

[LG-44] Contrast transfer functions help quantify neural network out-of-distribution generalization in HRTEM

Link: https://arxiv.org/abs/2512.09067
Authors: Luis Rangel DaCosta,Mary C. Scott
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
Comments:

Abstract:Neural networks, while effective for tackling many challenging scientific tasks, are not known to perform well out-of-distribution (OOD), i.e., within domains which differ from their training data. Understanding neural network OOD generalization is paramount to their successful deployment in experimental workflows, especially when ground-truth knowledge about the experiment is hard to establish or experimental conditions significantly vary. With inherent access to ground-truth information and fine-grained control of underlying distributions, simulation-based data curation facilitates precise investigation of OOD generalization behavior. Here, we probe generalization with respect to imaging conditions of neural network segmentation models for high-resolution transmission electron microscopy (HRTEM) imaging of nanoparticles, training and measuring the OOD generalization of over 12,000 neural networks using synthetic data generated via random structure sampling and multislice simulation. Using the HRTEM contrast transfer function, we further develop a framework to compare information content of HRTEM datasets and quantify OOD domain shifts. We demonstrate that neural network segmentation models enjoy significant performance stability, but will smoothly and predictably worsen as imaging conditions shift from the training distribution. Lastly, we consider limitations of our approach in explaining other OOD shifts, such as of the atomic structures, and discuss complementary techniques for understanding generalization in such settings.

[LG-45] A Diffusion-Based Framework for High-Resolution Precipitation Forecasting over CONUS

Link: https://arxiv.org/abs/2512.09059
Authors: Marina Vicens-Miquel,Amy McGovern,Aaron J. Hill,Efi Foufoula-Georgiou,Clement Guilloteau,Samuel S. P. Shen
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Accurate precipitation forecasting is essential for hydrometeorological risk management, especially for anticipating extreme rainfall that can lead to flash flooding and infrastructure damage. This study introduces a diffusion-based deep learning (DL) framework that systematically compares three residual prediction strategies differing only in their input sources: (1) a fully data-driven model using only past observations from the Multi-Radar Multi-Sensor (MRMS) system, (2) a corrective model using only forecasts from the High-Resolution Rapid Refresh (HRRR) numerical weather prediction system, and (3) a hybrid model integrating both MRMS and selected HRRR forecast variables. By evaluating these approaches under a unified setup, we provide a clearer understanding of how each data source contributes to predictive skill over the Continental United States (CONUS). Forecasts are produced at 1-km spatial resolution, beginning with direct 1-hour predictions and extending to 12 hours using autoregressive rollouts. Performance is evaluated using both CONUS-wide and region-specific metrics that assess overall performance and skill at extreme rainfall thresholds. Across all lead times, our DL framework consistently outperforms the HRRR baseline in pixel-wise and spatiostatistical metrics. The hybrid model performs best at the shortest lead time, while the HRRR-corrective model outperforms others at longer lead times, maintaining high skill through 12 hours. To assess reliability, we incorporate calibrated uncertainty quantification tailored to the residual learning setup. These gains, particularly at longer lead times, are critical for emergency preparedness, where modest increases in forecast horizon can improve decision-making. This work advances DL-based precipitation forecasting by enhancing predictive skill, reliability, and applicability across regions.

[LG-46] Improving Multi-Class Calibration through Normalization-Aware Isotonic Techniques

Link: https://arxiv.org/abs/2512.09054
Authors: Alon Arad,Saharon Rosset
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Accurate and reliable probability predictions are essential for multi-class supervised learning tasks, where well-calibrated models enable rational decision-making. While isotonic regression has proven effective for binary calibration, its extension to multi-class problems via one-vs-rest calibration produces suboptimal results when compared to parametric methods, limiting its practical adoption. In this work, we propose novel normalization-aware isotonic techniques for multi-class calibration, grounded in natural and intuitive assumptions expected by practitioners. Unlike prior approaches, our methods inherently account for probability normalization by either incorporating normalization directly into the optimization process (NA-FIR) or modeling the problem as a cumulative bivariate isotonic regression (SCIR). Empirical evaluation on a variety of text and image classification datasets across different model architectures reveals that our approach consistently improves negative log-likelihood (NLL) and expected calibration error (ECE) metrics.
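
For context, the one-vs-rest baseline that this abstract improves on can be sketched as follows: fit an isotonic (non-decreasing) map per class with the pool-adjacent-violators algorithm, apply each map independently, then renormalize so the outputs sum to one. This is the baseline only, not the authors' NA-FIR or SCIR methods, and all function names here are illustrative.

```python
from bisect import bisect_right


def pav(values):
    """Pool-adjacent-violators: least-squares non-decreasing fit."""
    blocks = []  # each block holds [sum of values, count]
    for v in values:
        blocks.append([v, 1])
        # Merge while the previous block's mean exceeds the last one's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fit = []
    for s, c in blocks:
        fit.extend([s / c] * c)
    return fit


def fit_isotonic(scores, labels):
    """Fit a step function from scores to calibrated probabilities.

    Assumes distinct scores; tied scores would need pooling first.
    """
    pairs = sorted(zip(scores, labels))
    xs = [s for s, _ in pairs]
    ys = pav([float(y) for _, y in pairs])
    return xs, ys


def predict_isotonic(model, score):
    """Evaluate the fitted step function, clamping below the range."""
    xs, ys = model
    return ys[max(bisect_right(xs, score) - 1, 0)]


def calibrate_ovr(models, probs):
    """Apply one isotonic map per class, then renormalize post hoc."""
    raw = [predict_isotonic(m, p) for m, p in zip(models, probs)]
    total = sum(raw)
    if total == 0.0:
        return [1.0 / len(raw)] * len(raw)
    return [r / total for r in raw]
```

The renormalization in `calibrate_ovr` happens after each per-class map has already been fitted, so the fit never sees the constraint that probabilities sum to one. That gap is what the normalization-aware methods in the abstract are described as addressing.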

[LG-47] Graph Deep Learning for Intracranial Aneurysm Blood Flow Simulation and Risk Assessment

Link: https://arxiv.org/abs/2512.09013
Authors: Paul Garnier, Pablo Jeken-Rico, Vincent Lannelongue, Chiara Faitini, Aurèle Goetz, Lea Chanvillard, Ramy Nemer, Jonathan Viquerat, Ugo Pelissier, Philippe Meliga, Jacques Sédat, Thomas Liebig, Yves Chau, Elie Hachem
Categories: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
Comments:

Abstract:Intracranial aneurysms remain a major cause of neurological morbidity and mortality worldwide, where rupture risk is tightly coupled to local hemodynamics, particularly wall shear stress and oscillatory shear index. Conventional computational fluid dynamics simulations provide accurate insights but are prohibitively slow and require specialized expertise. Clinical imaging alternatives such as 4D Flow MRI offer direct in-vivo measurements, yet they are extremely impractical and expensive, and their spatial resolution remains insufficient to capture the fine-scale shear patterns that drive endothelial remodeling and rupture risk. We present a graph neural network surrogate model that bridges this gap by reproducing full-field hemodynamics directly from vascular geometries in less than one minute per cardiac cycle. Trained on a comprehensive dataset of high-fidelity simulations of patient-specific aneurysms, our architecture combines graph transformers with autoregressive predictions to accurately simulate blood flow, wall shear stress, and oscillatory shear index. The model generalizes across unseen patient geometries and inflow conditions without mesh-specific calibration. Beyond accelerating simulation, our framework establishes the foundation for clinically interpretable hemodynamic prediction. By enabling near real-time inference integrated with existing imaging pipelines, it allows direct comparison with hospital phase-diagram assessments and extends them with physically grounded, high-resolution flow fields. This work transforms high-fidelity simulations from an expert-only research tool into a deployable, data-driven decision support system. Our full pipeline delivers high-resolution hemodynamic predictions within minutes of patient imaging, without requiring computational specialists, marking a step-change toward real-time, bedside aneurysm analysis.

[LG-48] StructuredDNA: A Bio-Physical Framework for Energy-Aware Transformer Routing

Link: https://arxiv.org/abs/2512.08968
Authors: Mustapha Hamdi
Categories: Machine Learning (cs.LG)
Comments:

Abstract:The rapid scaling of large computational models has led to a critical increase in energy and compute costs. Inspired by biological systems where structure and function emerge from low-energy configurations, we introduce StructuredDNA, a sparse architecture framework for modular, energy-aware Transformer routing. StructuredDNA replaces dense Mixture-of-Experts routing with a bio-physical, energy-guided routing layer based on semantic energy minimization. Inputs are dynamically grouped into semantic codons, and routing selects a single expert by minimizing a global energy functional that combines cohesion, uncertainty, and computational cost. We validate StructuredDNA on both specialized (BioASQ) and open-domain benchmarks (WikiText-103). On BioASQ (K = 50), we achieve a 97.7% reduction in Energy Utilization Density (EUD) and a Semantic Stability Index (SSI) of 0.998. We further demonstrate a Semantic Scaling Law on WikiText-103, showing that the architecture generalizes to open domains by scaling expert granularity (K = 2048) while maintaining more than 99% energy efficiency. StructuredDNA thus establishes a robust, domain-agnostic paradigm for future sparse computational frameworks. StructuredDNA provides an explicit link between bio-physical principles and sparse expert routing in Transformer architectures, and points toward future energy-aware, modular, and scalable computational systems. We discuss limitations of this proof-of-concept study and outline directions for scaling the approach to larger models, datasets, and hardware platforms. The StructuredDNA implementation is available at this https URL.

[LG-49] SEA: Spectral Edge Attacks on Graph Neural Networks

Link: https://arxiv.org/abs/2512.08964
Authors: Yongyu Wang
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Graph Neural Networks (GNNs) achieve strong performance on graph-structured data, but are notoriously vulnerable to small, carefully crafted perturbations of the graph structure. Most existing structure-based attacks rely on gradient-based heuristics or local connectivity patterns, and treat edges as equally important candidates for manipulation. In this paper, we propose Spectral Edge Attacks (SEA), a new family of adversarial attacks that explicitly leverage spectral robustness evaluation to guide structural perturbations. Our key idea is to compute a spectral embedding that captures the most fragile directions of the input manifold and to use it to assign a robustness score to each edge or non-edge. Based on these scores, we introduce two complementary attack variants: (i) a Spade-guided deletion attack that removes the most spectrally robust edges, and (ii) a Spade-guided addition attack that inserts edges between nodes that are maximally incompatible in the fragile spectral space. Both attacks operate at the graph level, are model-aware but conceptually simple, and can be plugged into existing GNN architectures without requiring gradients. We describe the spectral formulation, the attack algorithms, and experiments on benchmarks.

[LG-50] DW-KNN: A Transparent Local Classifier Integrating Distance Consistency and Neighbor Reliability

Link: https://arxiv.org/abs/2512.08956
Authors: Kumarjit Pathak, Karthik K, Sachin Madan, Jitin Kapila
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:K-Nearest Neighbors (KNN) is one of the most used ML classifiers. However, on closer inspection, standard distance-weighted KNN and related variants assume all k neighbors are equally reliable. In heterogeneous feature spaces, this becomes a limitation that hinders reliability in predicting true levels of the observation. We propose DW-KNN (Double Weighted KNN), a transparent and robust variant that integrates exponential distance with neighbor validity. This enables instance-level interpretability, suppresses noisy or mislabeled samples, and reduces hyperparameter sensitivity. A comprehensive evaluation on 9 datasets shows that DW-KNN achieves 0.8988 accuracy on average. It ranks 2nd among six methods and is within 0.2% of the best-performing Ensemble KNN. It also exhibits the lowest cross-validation variance (0.0156), indicating reliable prediction stability. Statistical significance tests confirmed (p < 0.001) improvements over compactness-weighted KNN (+4.09%) and kernel-weighted KNN (+1.13%). The method provides a simple yet effective alternative to complex adaptive schemes, particularly valuable for high-stakes applications requiring explainable predictions.

[LG-51] Optimizing Algorithms for Mobile Health Interventions with Active Querying Optimization

Link: https://arxiv.org/abs/2512.08950
Authors: Aseel Rawashdeh
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Reinforcement learning in mobile health (mHealth) interventions requires balancing intervention efficacy with user burden, particularly when state measurements (for example, user surveys or feedback) are costly yet essential. The Act-Then-Measure (ATM) heuristic addresses this challenge by decoupling control and measurement actions within the Action-Contingent Noiselessly Observable Markov Decision Process (ACNO-MDP) framework. However, the standard ATM algorithm relies on a temporal-difference-inspired Q-learning method, which is prone to instability in sparse and noisy environments. In this work, we propose a Bayesian extension to ATM that replaces standard Q-learning with a Kalman filter-style Bayesian update, maintaining uncertainty-aware estimates of Q-values and enabling more stable and sample-efficient learning. We evaluate our method in both toy environments and clinically motivated testbeds. In small, tabular environments, Bayesian ATM achieves comparable or improved scalarized returns with substantially lower variance and more stable policy behavior. In contrast, in larger and more complex mHealth settings, both the standard and Bayesian ATM variants perform poorly, suggesting a mismatch between ATM’s modeling assumptions and the structural challenges of real-world mHealth domains. These findings highlight the value of uncertainty-aware methods in low-data settings while underscoring the need for new RL algorithms that explicitly model causal structure, continuous states, and delayed feedback under observation cost constraints.
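
The idea of replacing a fixed-step Q-learning update with a Kalman filter-style Bayesian update can be illustrated with a scalar sketch. The names `GaussianQ`, `kalman_q_update`, and `td_target`, and the choice of a single Gaussian per Q-value with a fixed observation-noise variance, are assumptions made for illustration; the abstract does not specify the paper's exact update.

```python
from dataclasses import dataclass


@dataclass
class GaussianQ:
    mean: float = 0.0  # current estimate of the Q-value
    var: float = 1.0   # uncertainty about that estimate


def td_target(reward, gamma, next_q_means):
    """Usual bootstrapped target: r + gamma * max_a' Q(s', a')."""
    return reward + gamma * max(next_q_means)


def kalman_q_update(q, target, obs_noise_var):
    """Scalar Kalman update that treats `target` as a noisy observation of Q."""
    gain = q.var / (q.var + obs_noise_var)  # acts as an adaptive learning rate
    new_mean = q.mean + gain * (target - q.mean)
    new_var = (1.0 - gain) * q.var          # uncertainty shrinks with each observation
    return GaussianQ(new_mean, new_var)


q = GaussianQ(mean=0.0, var=1.0)
q = kalman_q_update(q, target=td_target(1.0, 0.9, [0.0, 0.0]), obs_noise_var=1.0)
```

Compared with a constant learning rate, the gain here is large while the estimate is uncertain and decays as evidence accumulates, which is one plausible source of the lower variance the abstract reports in small tabular settings.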

[LG-52] Learning-based social coordination to improve safety and robustness of cooperative autonomous vehicles in mixed traffic

Link: https://arxiv.org/abs/2211.11963
Authors: Rodolfo Valiente, Behrad Toghi, Mahdi Razzaghpour, Ramtin Pedarsani, Yaser P. Fallah
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: arXiv admin note: substantial text overlap with arXiv:2202.00881

Abstract:It is expected that autonomous vehicles (AVs) and heterogeneous human-driven vehicles (HVs) will coexist on the same road. The safety and reliability of AVs will depend on their social awareness and their ability to engage in complex social interactions in a socially accepted manner. However, AVs are still inefficient in terms of cooperating with HVs and struggle to understand and adapt to human behavior, which is particularly challenging in mixed autonomy. On a road shared by AVs and HVs, the social preferences or individual traits of HVs are unknown to the AVs. Unlike AVs, which are expected to follow a policy, HVs are particularly difficult to forecast since they do not necessarily follow a stationary policy. To address these challenges, we frame the mixed-autonomy problem as a multi-agent reinforcement learning (MARL) problem and propose an approach that allows AVs to learn the decision-making of HVs implicitly from experience, account for all vehicles' interests, and safely adapt to other traffic situations. In contrast with existing works, we quantify AVs' social preferences and propose a distributed reward structure that introduces altruism into their decision-making process, allowing the altruistic AVs to learn to establish coalitions and influence the behavior of HVs.

[LG-53] Robustness and Adaptability of Reinforcement Learning based Cooperative Autonomous Driving in Mixed-autonomy Traffic

Link: https://arxiv.org/abs/2202.00881
Authors: Rodolfo Valiente, Behrad Toghi, Ramtin Pedarsani, Yaser P. Fallah
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments:

Abstract:Building autonomous vehicles (AVs) is a complex problem, but enabling them to operate in the real world where they will be surrounded by human-driven vehicles (HVs) is extremely challenging. Prior works have shown the possibilities of creating inter-agent cooperation between a group of AVs that follow a social utility. Such altruistic AVs can form alliances and affect the behavior of HVs to achieve socially desirable outcomes. We identify two major challenges in the co-existence of AVs and HVs. First, the social preferences and individual traits of a given human driver, e.g., selflessness and aggressiveness, are unknown to an AV, and it is almost impossible to infer them in real time during a short AV-HV interaction. Second, contrary to AVs that are expected to follow a policy, HVs do not necessarily follow a stationary policy and therefore are extremely hard to predict. To alleviate the above-mentioned challenges, we formulate the mixed-autonomy problem as a multi-agent reinforcement learning (MARL) problem and propose a decentralized framework and reward function for training cooperative AVs. Our approach enables AVs to learn the decision-making of HVs implicitly from experience and optimizes for a social utility while prioritizing safety and allowing adaptability, robustifying altruistic AVs to different human behaviors and constraining them to a safe action space. Finally, we investigate the robustness, safety and sensitivity of AVs to various HV behavioral traits and present the settings in which the AVs can learn cooperative policies that are adaptable to different situations.

[LG-54] Optimal certification of constant-local Hamiltonians

Link: https://arxiv.org/abs/2512.09778
Authors: Junseo Lee, Myeongjin Shin
Categories: Quantum Physics (quant-ph); Computational Complexity (cs.CC); Data Structures and Algorithms (cs.DS); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: 29 pages

Abstract:We study the problem of certifying local Hamiltonians from real-time access to their dynamics. Given oracle access to $e^{-itH}$ for an unknown $k$-local Hamiltonian $H$ and a fully specified target Hamiltonian $H_0$, the goal is to decide whether $H$ is exactly equal to $H_0$ or differs from $H_0$ by at least $\varepsilon$ in normalized Frobenius norm, while minimizing the total evolution time. We introduce the first intolerant Hamiltonian certification protocol that achieves optimal performance for all constant-locality Hamiltonians. For general $n$-qubit, $k$-local, traceless Hamiltonians, our procedure uses $O(c^k/\varepsilon)$ total evolution time for a universal constant $c$, and succeeds with high probability. In particular, for $O(1)$-local Hamiltonians, the total evolution time becomes $\Theta(1/\varepsilon)$, matching the known $\Omega(1/\varepsilon)$ lower bounds and achieving the gold-standard Heisenberg-limit scaling. Prior certification methods either relied on implementing inverse evolution of $H$, required controlled access to $e^{-itH}$, or achieved near-optimal guarantees only in restricted settings such as the Ising case ($k=2$). In contrast, our algorithm requires neither inverse evolution nor controlled operations: it uses only forward real-time dynamics and achieves optimal intolerant certification for all constant-locality Hamiltonians.

[LG-55] Transformers for Tabular Data: A Training Perspective of Self-Attention via Optimal Transport

Link: https://arxiv.org/abs/2512.09530
Authors: Antonio Candelieri, Alessandro Quadrio
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:This thesis examines self-attention training through the lens of Optimal Transport (OT) and develops an OT-based alternative for tabular classification. The study tracks intermediate projections of the self-attention layer during training and evaluates their evolution using discrete OT metrics, including Wasserstein distance, Monge gap, optimality, and efficiency. Experiments are conducted on classification tasks with two and three classes, as well as on a biomedical dataset. Results indicate that the final self-attention mapping often approximates the OT optimal coupling, yet the training trajectory remains inefficient. Pretraining the MLP section on synthetic data partially improves convergence but is sensitive to their initialization. To address these limitations, an OT-based algorithm is introduced: it generates class-specific dummy Gaussian distributions, computes an OT alignment with the data, and trains an MLP to generalize this mapping. The method achieves accuracy comparable to Transformers while reducing computational cost and scaling more efficiently under standardized inputs, though its performance depends on careful dummy-geometry design. All experiments and implementations are conducted in R.

[LG-56] Transport Novelty Distance: A Distributional Metric for Evaluating Material Generative Models

Link: https://arxiv.org/abs/2512.09514
Authors: Paul Hagemann, Simon Müller, Janine George, Philipp Benner
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments:

Abstract:Recent advances in generative machine learning have opened new possibilities for the discovery and design of novel materials. However, as these models become more sophisticated, the need for rigorous and meaningful evaluation metrics has grown. Existing evaluation approaches often fail to capture both the quality and novelty of generated structures, limiting our ability to assess true generative performance. In this paper, we introduce the Transport Novelty Distance (TNovD) to judge generative models used for materials discovery jointly by the quality and novelty of the generated materials. Based on ideas from Optimal Transport theory, TNovD uses a coupling between the features of the training and generated sets, which is refined into a quality and memorization regime by a threshold. The features are generated from crystal structures using a graph neural network that is trained to distinguish between materials, their augmented counterparts, and differently sized supercells using contrastive learning. We evaluate our proposed metric on typical toy experiments relevant for crystal structure prediction, including memorization, noise injection and lattice deformations. Additionally, we validate the TNovD on the MP20 validation set and the WBM substitution dataset, demonstrating that it is capable of detecting both memorized and low-quality material data. We also benchmark the performance of several popular material generative models. While introduced for materials, our TNovD framework is domain-agnostic and can be adapted for other areas, such as images and molecules.

[LG-57] Estimation of Stochastic Optimal Transport Maps NEURIPS2025

Link: https://arxiv.org/abs/2512.09499
Authors: Sloan Nietert, Ziv Goldfeld
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments: Appeared at NeurIPS 2025

Abstract:The optimal transport (OT) map is a geometry-driven transformation between high-dimensional probability distributions which underpins a wide range of tasks in statistics, applied probability, and machine learning. However, existing statistical theory for OT map estimation is quite restricted, hinging on Brenier’s theorem (quadratic cost, absolutely continuous source) to guarantee existence and uniqueness of a deterministic OT map, on which various additional regularity assumptions are imposed to obtain quantitative error bounds. In many real-world problems these conditions fail or cannot be certified, in which case optimal transportation is possible only via stochastic maps that can split mass. To broaden the scope of map estimation theory to such settings, this work introduces a novel metric for evaluating the transportation quality of stochastic maps. Under this metric, we develop computationally efficient map estimators with near-optimal finite-sample risk bounds, subject to easy-to-verify minimal assumptions. Our analysis further accommodates common forms of adversarial sample contamination, yielding estimators with robust estimation guarantees. Empirical experiments are provided which validate our theory and demonstrate the utility of the proposed framework in settings where existing theory fails. These contributions constitute the first general-purpose theory for map estimation, compatible with a wide spectrum of real-world applications where optimal transport may be intrinsically stochastic.
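
A small discrete example shows why a deterministic map may not exist and mass must be split. The sketch below builds the monotone (sorted-order) coupling between two one-dimensional discrete distributions, which is known to be optimal in one dimension for convex costs such as the quadratic cost. It is only an illustration of a stochastic map; `monotone_coupling` is a hypothetical helper and is not the estimator proposed in the paper.

```python
from fractions import Fraction


def monotone_coupling(source, target):
    """Couple two 1-D discrete distributions in sorted order.

    `source` and `target` are lists of (location, mass) pairs with equal
    total mass. Returns a list of (x, y, mass) transport entries.
    """
    src, tgt = sorted(source), sorted(target)
    plan = []
    i = j = 0
    rem_s, rem_t = src[0][1], tgt[0][1]
    while True:
        moved = min(rem_s, rem_t)
        if moved > 0:
            plan.append((src[i][0], tgt[j][0], moved))
        rem_s -= moved
        rem_t -= moved
        if rem_s == 0:           # current source atom is exhausted
            i += 1
            if i == len(src):
                break
            rem_s = src[i][1]
        if rem_t == 0:           # current target atom is full
            j += 1
            if j == len(tgt):
                break
            rem_t = tgt[j][1]
    return plan


half, third = Fraction(1, 2), Fraction(1, 3)
source = [(0, half), (1, half)]
target = [(0, third), (1, third), (2, third)]
plan = monotone_coupling(source, target)
```

With two source atoms and three target atoms, no function of the source location can reproduce the target, so the atom at 0 sends part of its mass to 0 and part to 1. Exact fractions are used so the bookkeeping of remaining mass does not suffer from rounding.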

[LG-58] Meta-learning three-factor plasticity rules for structured credit assignment with sparse feedback NEURIPS2025

Link: https://arxiv.org/abs/2512.09366
Authors: Dimitra Maoutsa
Categories: Neurons and Cognition (q-bio.NC); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Biological Physics (physics.bio-ph)
Comments: 10 pages, 2 figures; accepted and presented at the NeurIPS 2025 workshop Symmetry and Geometry in Neural Representations (NeurReps)

Abstract:Biological neural networks learn complex behaviors from sparse, delayed feedback using local synaptic plasticity, yet the mechanisms enabling structured credit assignment remain elusive. In contrast, artificial recurrent networks solving similar tasks typically rely on biologically implausible global learning rules or hand-crafted local updates. The space of local plasticity rules capable of supporting learning from delayed reinforcement remains largely unexplored. Here, we present a meta-learning framework that discovers local learning rules for structured credit assignment in recurrent networks trained with sparse feedback. Our approach interleaves local neo-Hebbian-like updates during task execution with an outer loop that optimizes plasticity parameters via tangent-propagation through learning. The resulting three-factor learning rules enable long-timescale credit assignment using only local information and delayed rewards, offering new insights into biologically grounded mechanisms for learning in recurrent circuits.

[LG-59] Distributional Shrinkage II: Optimal Transport Denoisers with Higher-Order Scores

Link: https://arxiv.org/abs/2512.09295
Authors: Tengyuan Liang
Categories: Statistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 23 pages

Abstract:We revisit the signal denoising problem through the lens of optimal transport: the goal is to recover an unknown scalar signal distribution $X \sim P$ from noisy observations $Y = X + \sigma Z$, with $Z$ being standard Gaussian independent of $X$ and $\sigma > 0$ a known noise level. Let $Q$ denote the distribution of $Y$. We introduce a hierarchy of denoisers $T_0, T_1, \ldots, T_\infty : \mathbb{R} \to \mathbb{R}$ that are agnostic to the signal distribution $P$, depending only on higher-order score functions of $Q$. Each denoiser $T_K$ is progressively refined using the $(2K-1)$-th order score function of $Q$ at noise resolution $\sigma^{2K}$, achieving better denoising quality measured by the Wasserstein metric $W(T_K \sharp Q, P)$. The limiting denoiser $T_\infty$ identifies the optimal transport map with $T_\infty \sharp Q = P$. We provide a complete characterization of the combinatorial structure underlying this hierarchy through Bell polynomial recursions, revealing how higher-order score functions encode the optimal transport map for signal denoising. We study two estimation strategies with convergence rates for higher-order scores from i.i.d. samples drawn from $Q$: (i) plug-in estimation via Gaussian kernel smoothing, and (ii) direct estimation via higher-order score matching. This hierarchy of agnostic denoisers opens new perspectives in signal denoising and empirical Bayes.

[LG-60] Impact of Positional Encoding: Clean and Adversarial Rademacher Complexity for Transformers under In-Context Regression

Link: https://arxiv.org/abs/2512.09275
Authors: Weiyi He, Yue Xing
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 25 pages, 3 figures

Abstract:Positional encoding (PE) is a core architectural component of Transformers, yet its impact on the Transformer’s generalization and robustness remains unclear. In this work, we provide the first generalization analysis for a single-layer Transformer under in-context regression that explicitly accounts for a completely trainable PE module. Our result shows that PE systematically enlarges the generalization gap. Extending to the adversarial setting, we derive the adversarial Rademacher generalization bound. We find that the gap between models with and without PE is magnified under attack, demonstrating that PE amplifies the vulnerability of models. Our bounds are empirically validated by a simulation study. Together, this work establishes a new framework for understanding the clean and adversarial generalization in ICL with PE.

[LG-61] Robust and Sparse Estimation of Unbounded Density Ratio under Heavy Contamination

Link: https://arxiv.org/abs/2512.09266
Authors: Ryosuke Nagumo, Hironori Fujisawa
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:We examine the non-asymptotic properties of robust density ratio estimation (DRE) in contaminated settings. Weighted DRE is the most promising among existing methods, exhibiting doubly strong robustness from an asymptotic perspective. This study demonstrates that Weighted DRE achieves sparse consistency even under heavy contamination within a non-asymptotic framework. This method addresses two significant challenges in density ratio estimation and robust estimation. For density ratio estimation, we provide the non-asymptotic properties of estimating unbounded density ratios under the assumption that the weighted density ratio function is bounded. For robust estimation, we introduce a non-asymptotic framework for doubly strong robustness under heavy contamination, assuming that at least one of the following conditions holds: (i) contamination ratios are small, and (ii) outliers have small weighted values. This work provides the first non-asymptotic analysis of strong robustness under heavy contamination.

[LG-62] WTNN: Weibull-Tailored Neural Networks for survival analysis

Link: https://arxiv.org/abs/2512.09163
Authors: Gabrielle Rives, Olivier Lopez, Nicolas Bousquet
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)
Comments:

Abstract:The Weibull distribution is a commonly adopted choice for modeling the survival of systems subject to maintenance over time. When only proxy indicators and censored observations are available, it becomes necessary to express the distribution’s parameters as functions of time-dependent covariates. Deep neural networks provide the flexibility needed to learn complex relationships between these covariates and operational lifetime, thereby extending the capabilities of traditional regression-based models. Motivated by the analysis of a fleet of military vehicles operating in highly variable and demanding environments, as well as by the limitations observed in existing methodologies, this paper introduces WTNN, a new neural network-based modeling framework specifically designed for Weibull survival studies. The proposed architecture is specifically designed to incorporate qualitative prior knowledge regarding the most influential covariates, in a manner consistent with the shape and structure of the Weibull distribution. Through numerical experiments, we show that this approach can be reliably trained on proxy and right-censored data, and is capable of producing robust and interpretable survival predictions that can improve existing approaches.

[LG-63] Understanding temperature tuning in energy-based models

Link: https://arxiv.org/abs/2512.09152
Authors: Peter W Fields, Vudtiwat Ngampruetikorn, David J Schwab, Stephanie E Palmer
Categories: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
Comments: 18 pages, 7 figures

Abstract:Generative models of complex systems often require post-hoc parameter adjustments to produce useful outputs. For example, energy-based models for protein design are sampled at an artificially low "temperature" to generate novel, functional sequences. This temperature tuning is a common yet poorly understood heuristic used across machine learning contexts to control the trade-off between generative fidelity and diversity. Here, we develop an interpretable, physically motivated framework to explain this phenomenon. We demonstrate that in systems with a large "energy gap" - separating a small fraction of meaningful states from a vast space of unrealistic states - learning from sparse data causes models to systematically overestimate high-energy state probabilities, a bias that lowering the sampling temperature corrects. More generally, we characterize how the optimal sampling temperature depends on the interplay between data size and the system's underlying energy landscape. Crucially, our results show that lowering the sampling temperature is not always desirable; we identify the conditions where raising it results in better generative performance. Our framework thus casts post-hoc temperature tuning as a diagnostic tool that reveals properties of the true data distribution and the limits of the learned model.
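
The effect of sampling temperature on an energy-based model can be seen in a toy finite state space, where the model assigns probability proportional to exp(-E(x)/T). The energies below are invented for illustration (two low-energy "meaningful" states and four high-energy ones), and `boltzmann` is a hypothetical helper name; nothing here reproduces the paper's analysis.

```python
import math


def boltzmann(energies, temperature):
    """Return p(x) proportional to exp(-E(x) / T) over a finite list of states."""
    scaled = [-e / temperature for e in energies]
    shift = max(scaled)  # subtract the max before exponentiating for stability
    weights = [math.exp(s - shift) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


# Assumed toy landscape: a gap of 3 energy units between the two groups.
energies = [0.0, 0.0, 3.0, 3.0, 3.0, 3.0]

for temperature in (2.0, 1.0, 0.5):
    probs = boltzmann(energies, temperature)
    low_energy_mass = probs[0] + probs[1]
    print(f"T={temperature}: mass on low-energy states = {low_energy_mass:.3f}")
```

Lowering T concentrates probability on the low-energy states, which is the mechanism by which it can offset a model that overestimates the probability of high-energy states; raising T does the opposite and favors diversity.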

[LG-64] Interpretable machine learning of halo gas density profiles: a sensitivity analysis of cosmological hydrodynamical simulations

Link: https://arxiv.org/abs/2512.09021
Authors: Daniele Sorini, Sownak Bose, Mathilda Denison, Romeel Davé
Categories: Astrophysics of Galaxies (astro-ph.GA); Cosmology and Nongalactic Astrophysics (astro-ph.CO); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
Comments: To be submitted to The Open Journal of Astrophysics

Abstract:Stellar and AGN-driven feedback processes affect the distribution of gas on a wide range of scales, from within galaxies well into the intergalactic medium. Yet, it remains unclear how feedback, through its connection to key galaxy properties, shapes the radial gas density profile in the host halo. We tackle this question using suites of the EAGLE, IllustrisTNG, and Simba cosmological hydrodynamical simulations, which span a variety of feedback models. We develop a random forest algorithm that predicts the radial gas density profile within haloes from the total halo mass and five global properties of the central galaxy: gas and stellar mass; star formation rate; mass and accretion rate of the central black hole (BH). The algorithm reproduces the simulated gas density profiles with an average accuracy of $\sim 80$-$90\%$ over the halo mass range $10^{9.5}\,\mathrm{M}_\odot < M_{\rm 200c} < 10^{15}\,\mathrm{M}_\odot$ and redshift interval $0 < z < 4$. For the first time, we apply Sobol statistical sensitivity analysis to full cosmological hydrodynamical simulations, quantifying how each feature affects the gas density as a function of distance from the halo centre. Across all simulations and redshifts, the total halo mass and the gas mass of the central galaxy are the most strongly tied to the halo gas distribution, while stellar and BH properties are generally less informative. The exact relative importance of the different features depends on the feedback scenario and redshift. Our framework can be readily embedded in semi-analytic models of galaxy formation to incorporate halo gas density profiles consistent with different hydrodynamical simulations. Our work also provides a proof of concept for constraining feedback models with future observations of galaxy properties and of the surrounding gas distribution.

[LG-65] FuXi-Nowcast: Meet the longstanding challenge of convective initiation in nowcasting

Link: https://arxiv.org/abs/2512.08974
Authors: Lei Chen, Zijian Zhu, Xiaoran Zhuang, Tianyuan Qi, Yuxuan Feng, Xiaohui Zhong, Hao Li
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
Comments:

Abstract:Accurate nowcasting of convective storms remains a major challenge for operational forecasting, particularly for convective initiation and the evolution of high-impact rainfall and strong winds. Here we present FuXi-Nowcast, a deep-learning system that jointly predicts composite radar reflectivity, surface precipitation, near-surface temperature, wind speed and wind gusts at 1-km resolution over eastern China. FuXi-Nowcast integrates multi-source observations, such as radar, surface stations and the High-Resolution Land Data Assimilation System (HRLDAS), with three-dimensional atmospheric fields from the machine-learning weather model FuXi-2.0 within a multi-task Swin-Transformer architecture. A convective signal enhancement module and distribution-aware hybrid loss functions are designed to preserve intense convective structures and mitigate the rapid intensity decay common in deep-learning nowcasts. FuXi-Nowcast surpasses the operational CMA-MESO 3-km numerical model in Critical Success Index for reflectivity, precipitation and wind gusts across thresholds and lead times up to 12 h, with the largest gains for heavy rainfall. Case studies further show that FuXi-Nowcast more accurately captures the timing, location and structure of convective initiation and subsequent evolution of convection. These results demonstrate that coupling three-dimensional machine-learning forecasts with high-resolution observations can provide multi-hazard, long-lead nowcasts that outperform current operational systems.
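
The Critical Success Index mentioned above is a threshold-based score: CSI = hits / (hits + misses + false alarms), computed after converting forecast and observed fields into event/no-event at a chosen threshold. A minimal sketch follows; the function name and the sample values are made up for illustration and do not come from the paper.

```python
def critical_success_index(forecast, observed, threshold):
    """CSI for the event 'value >= threshold' over paired grid cells."""
    hits = misses = false_alarms = 0
    for f, o in zip(forecast, observed):
        forecast_event = f >= threshold
        observed_event = o >= threshold
        if forecast_event and observed_event:
            hits += 1
        elif observed_event:
            misses += 1
        elif forecast_event:
            false_alarms += 1
        # Correct negatives are deliberately ignored by CSI.
    denominator = hits + misses + false_alarms
    return hits / denominator if denominator else float("nan")


# Hypothetical hourly rainfall in mm for five grid cells.
forecast = [0.0, 5.0, 12.0, 20.0, 3.0]
observed = [0.0, 8.0, 15.0, 2.0, 11.0]
csi_10mm = critical_success_index(forecast, observed, threshold=10.0)
```

Because correct negatives do not enter the score, CSI stays informative for rare events such as heavy rainfall, where most cells are trivially "no event"; this is why it is typically reported per threshold and per lead time.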

[LG-66] Multivariate time series prediction using clustered echo state network

Link: https://arxiv.org/abs/2512.08963
Authors: S. Hariharan,R. Suresh,V. K. Chandrasekar
Categories: Chaotic Dynamics (nlin.CD); Machine Learning (cs.LG)
Comments: Published in Eur. Phys. J. Plus 140, 1133 (2025)

Abstract:Many natural and physical processes can be understood by analyzing multiple system variables as they evolve over time, forming a multivariate time series. Predicting such time series is challenging due to the inherent noise and interdependencies among variables. Echo state networks (ESNs), a class of Reservoir Computing (RC) models, offer an efficient alternative to conventional recurrent neural networks by training only the output weights while keeping the reservoir dynamics fixed, reducing computational complexity. We propose clustered ESNs (CESNs) that enhance the ability to model and predict multivariate time series by organizing the reservoir nodes into clusters, each corresponding to a distinct input variable. Input signals are directly mapped to their associated clusters, and intra-cluster connections remain dense while inter-cluster connections are sparse, mimicking the modular architecture of biological neural networks. This architecture improves information processing by limiting cross-variable interference and enhances computational efficiency through independent cluster-wise training via ridge regression. We further explore different reservoir topologies, including ring, Erdős-Rényi (ER), and scale-free (SF) networks, to evaluate their impact on predictive performance. Our algorithm works well across diverse real-world datasets such as the stock market, solar wind, and chaotic Rössler system, demonstrating that CESNs consistently outperform conventional ESNs in terms of predictive accuracy and robustness to noise, particularly when using ER and SF topologies. These findings highlight the adaptability of CESNs for complex, multivariate time series forecasting.
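
The clustered reservoir described in the abstract above can be sketched roughly as follows. This is one reading of the abstract, not the authors' code: connection probabilities, weight ranges, the tanh update, and the choice to fit each variable's readout only from its own cluster are all assumptions, and every function name is hypothetical.

```python
import numpy as np


def make_clustered_reservoir(n_vars, nodes_per_cluster, p_intra=0.5,
                             p_inter=0.02, spectral_radius=0.9, seed=0):
    """Build reservoir weights with dense intra-cluster and sparse inter-cluster links."""
    rng = np.random.default_rng(seed)
    n = n_vars * nodes_per_cluster
    cluster_of = np.repeat(np.arange(n_vars), nodes_per_cluster)
    same = cluster_of[:, None] == cluster_of[None, :]
    mask = rng.random((n, n)) < np.where(same, p_intra, p_inter)
    W = rng.uniform(-1.0, 1.0, (n, n)) * mask
    # Rescale so the largest eigenvalue magnitude equals spectral_radius.
    W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))
    # Each input variable feeds only the nodes of its own cluster.
    W_in = np.zeros((n, n_vars))
    W_in[np.arange(n), cluster_of] = rng.uniform(-0.5, 0.5, n)
    return W, W_in, cluster_of


def run_reservoir(W, W_in, inputs):
    """Drive the fixed reservoir with the input series and record its states."""
    states = np.zeros((len(inputs), W.shape[0]))
    x = np.zeros(W.shape[0])
    for t, u in enumerate(inputs):
        x = np.tanh(W @ x + W_in @ u)
        states[t] = x
    return states


def train_clusterwise(states, targets, cluster_of, ridge=1e-4):
    """Fit one ridge-regression readout per variable from its own cluster's states."""
    readouts = []
    for k in range(targets.shape[1]):
        S = states[:, cluster_of == k]
        A = S.T @ S + ridge * np.eye(S.shape[1])
        readouts.append(np.linalg.solve(A, S.T @ targets[:, k]))
    return readouts


def predict(states, readouts, cluster_of):
    """Apply each cluster's readout to that cluster's states."""
    return np.column_stack(
        [states[:, cluster_of == k] @ w for k, w in enumerate(readouts)]
    )
```

A typical use would drive the reservoir with `series[:-1]`, discard an initial washout period, and fit the readouts against `series[1:]` for one-step-ahead prediction. Only the readouts are trained; `W` and `W_in` stay fixed, which is the property the abstract credits for the low training cost.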

[LG-67] Online Inference of Constrained Optimization: Primal-Dual Optimality and Sequential Quadratic Programming

Link: https://arxiv.org/abs/2512.08948
Authors: Yihang Gao,Michael K. Ng,Michael W. Mahoney,Sen Na
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC); Statistics Theory (math.ST)
Comments: 80 pages, 5 figures, 5 tables

Abstract:We study online statistical inference for the solutions of stochastic optimization problems with equality and inequality constraints. Such problems are prevalent in statistics and machine learning, encompassing constrained $M$-estimation, physics-informed models, safe reinforcement learning, and algorithmic fairness. We develop a stochastic sequential quadratic programming (SSQP) method to solve these problems, where the step direction is computed by sequentially performing a quadratic approximation of the objective and a linear approximation of the constraints. Despite having access to unbiased estimates of population gradients, a key challenge in constrained stochastic problems lies in dealing with the bias in the step direction. As such, we apply a momentum-style gradient moving-average technique within SSQP to debias the step. We show that our method achieves global almost-sure convergence and exhibits local asymptotic normality with an optimal primal-dual limiting covariance matrix in the sense of Hájek and Le Cam. In addition, we provide a plug-in covariance matrix estimator for practical inference. To our knowledge, the proposed SSQP method is the first fully online method that attains primal-dual asymptotic minimax optimality without relying on projection operators onto the constraint set, which are generally intractable for nonlinear problems. Through extensive experiments on benchmark nonlinear problems, as well as on constrained generalized linear models and portfolio allocation problems using both synthetic and real data, we demonstrate superior performance of our method, showing that the method and its asymptotic behavior not only solve constrained stochastic problems efficiently but also provide valid and practical online inference in real-world applications.
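
The abstract above describes the step direction in words: a quadratic model of the objective, linearized constraints, and a moving average of stochastic gradients. One generic way to write such a step is sketched below. All notation is hypothetical (equality constraints $c$, inequality constraints $h$, Hessian approximation $B_k$, averaging weight $\beta_k$, step size $\alpha_k$), and the exact averaging rule and subproblem used in the paper may differ.

```latex
% Generic SQP subproblem with an averaged gradient m_k (assumed notation).
\begin{aligned}
d_k \in \arg\min_{d \in \mathbb{R}^n} \quad
  & m_k^\top d + \tfrac{1}{2}\, d^\top B_k\, d \\
\text{s.t.} \quad
  & c(x_k) + \nabla c(x_k)^\top d = 0, \\
  & h(x_k) + \nabla h(x_k)^\top d \le 0,
\end{aligned}
\qquad
m_k = (1 - \beta_k)\, m_{k-1} + \beta_k\, \widehat{\nabla f}(x_k; \xi_k),
\qquad
x_{k+1} = x_k + \alpha_k\, d_k .
```

Here $\widehat{\nabla f}(x_k; \xi_k)$ stands for an unbiased stochastic gradient estimate. The constraints are handled through their linearization, so no projection onto the feasible set appears in the update.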

Information Retrieval

[IR-0] Passing the Baton: High Throughput Distributed Disk-Based Vector Search with BatANN VLDB2026

Link: https://arxiv.org/abs/2512.09331
Authors: Nam Anh Dang(1),Ben Landrum(1),Ken Birman(1) ((1) Cornell University)
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Information Retrieval (cs.IR)
Comments: 12 pages, 14 figures, submitted to VLDB 2026

Abstract:Vector search underpins modern information-retrieval systems, including retrieval-augmented generation (RAG) pipelines and search engines over unstructured text and images. As datasets scale to billions of vectors, disk-based vector search has emerged as a practical solution. However, looking to the future, we need to anticipate datasets too large for any single server. We present BatANN, a distributed disk-based approximate nearest neighbor (ANN) system that retains the logarithmic search efficiency of a single global graph while achieving near-linear throughput scaling in the number of servers. Our core innovation is that when accessing a neighborhood which is stored on another machine, we send the full state of the query to the other machine to continue executing there for improved locality. On 100M- and 1B-point datasets at 0.95 recall using 10 servers, BatANN achieves 6.21-6.49x and 2.5-5.10x the throughput of the scatter-gather baseline, respectively, while maintaining mean latency below 6 ms. Moreover, we get these results on standard TCP. To our knowledge, BatANN is the first open-source distributed disk-based vector search system to operate over a single global graph.
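
The core idea in the abstract above, handing the whole query state to whichever machine owns the next neighborhood, can be illustrated with a toy single-process simulation. This is only a sketch under strong assumptions: one-dimensional points, plain greedy search instead of beam search, and an `owner` map standing in for servers. `QueryState` and `greedy_search` are hypothetical names, not BatANN's API.

```python
from dataclasses import dataclass, field


@dataclass
class QueryState:
    """Everything needed to resume the search on another server."""
    query: float
    best: int
    visited: set = field(default_factory=set)
    handoffs: int = 0


def greedy_search(graph, points, owner, query, start):
    """Greedy graph search; the state moves to the owner of the next node."""
    state = QueryState(query=query, best=start)
    server = owner[start]
    while True:
        state.visited.add(state.best)
        # Pick the neighbor of the current node that is closest to the query.
        cand = min(graph[state.best], key=lambda n: abs(points[n] - query))
        if abs(points[cand] - query) >= abs(points[state.best] - query):
            return state  # no neighbor improves on the current node
        if owner[cand] != server:
            # The neighborhood lives elsewhere: ship the state, not the data.
            server = owner[cand]
            state.handoffs += 1
        state.best = cand


if __name__ == "__main__":
    n = 10
    points = [float(i) for i in range(n)]
    graph = {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}
    owner = {i: i // 5 for i in range(n)}  # two simulated servers
    result = greedy_search(graph, points, owner, query=7.2, start=0)
    print(result.best, result.handoffs)
```

In a scatter-gather design each server would instead search its own partition and a coordinator would merge the results. The sketch only shows why a single global graph needs the query state to travel across partitions.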

[IR-1] Meta Lattice: Model Space Redesign for Cost-Effective Industry-Scale Ads Recommendations

Link: https://arxiv.org/abs/2512.09200
Authors: Liang Luo,Yuxin Chen,Zhengyu Zhang,Mengyue Hang,Andrew Gu,Buyun Zhang,Boyang Liu,Chen Chen,Chengze Fan,Dong Liang,Fan Yang,Feifan Gu,Huayu Li,Jade Nie,Jiayi Xu,Jiyan Yang,Jongsoo Park,Laming Chen,Longhao Jin,Qianru Li,Qin Huang,Shali Jiang,Shiwen Shen,Shuaiwen Wang,Sihan Zeng,Siyang Yuan,Tongyi Tang,Weilin Zhang,Wenjun Wang,Xi Liu,Xiaohan Wei,Xiaozhen Xia,Yuchen Hao,Yunlong He,Yasmine Badr,Zeliang Chen,Maxim Naumov,Yantao Yao,Wenlin Chen,Santanu Kolay,GP Musumeci,Ellie Dingqiao Wen
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:The rapidly evolving landscape of products, surfaces, policies, and regulations poses significant challenges for deploying state-of-the-art recommendation models at industry scale, primarily due to data fragmentation across domains and escalating infrastructure costs that hinder sustained quality improvements. To address this challenge, we propose Lattice, a recommendation framework centered around model space redesign that extends Multi-Domain, Multi-Objective (MDMO) learning beyond models and learning objectives. Lattice addresses these challenges through a comprehensive model space redesign that combines cross-domain knowledge sharing, data consolidation, model unification, distillation, and system optimizations to achieve significant improvements in both quality and cost-efficiency. Our deployment of Lattice at Meta has resulted in a 10% gain in revenue-driving top-line metrics, an 11.5% improvement in user satisfaction, and a 6% boost in conversion rate, with a 20% capacity saving.

[IR-2] PoultryTalk: A Multi-modal Retrieval-Augmented Generation (RAG) System for Intelligent Poultry Management and Decision Support

Link: https://arxiv.org/abs/2512.08995
Authors: Kapalik Khanal,Biswash Khatiwada,Stephen Afrifa,Ranjan Sapkota,Sanjay Shah,Frank Bai,Ramesh Bahadur Bist
Categories: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Comments:

Abstract:The poultry industry plays a vital role in global food security, yet small- and medium-scale farmers frequently lack timely access to expert-level support for disease diagnosis, nutrition planning, and management decisions. With rising climate stress, unpredictable feed prices, and persistent disease threats, poultry producers often struggle to make quick, informed decisions. Therefore, there is a critical need for intelligent, data-driven systems that can deliver reliable, on-demand consultation. This paper presents PoultryTalk, a novel multi-modal Retrieval-Augmented Generation (RAG) system designed to provide real-time expert guidance through text and image-based interaction. PoultryTalk uses OpenAI’s text-embedding-3-small and GPT-4o to provide smart, context-aware poultry management advice from text, images, or questions. System usability and performance were evaluated using 200 expert-verified queries and feedback from 34 participants who submitted 267 queries to the PoultryTalk prototype. The expert-verified benchmark queries confirmed strong technical performance, achieving a semantic similarity of 84.0% and an average response latency of 3.6 seconds. Compared with OpenAI’s GPT-4o, PoultryTalk delivered more accurate and reliable information related to poultry. Based on participants’ evaluations, PoultryTalk achieved a response accuracy of 89.9%, with about 9.1% of responses rated as incorrect. A post-use survey indicated high user satisfaction: 95.6% of participants reported that the chatbot provided “always correct” or “mostly correct” answers, 82.6% indicated they would recommend the tool, and 17.4% responded “maybe.” These results collectively show that PoultryTalk not only delivers accurate, contextually relevant information but also demonstrates strong user acceptance and scalability potential.

Attachment Download

Click to download today's full paper list