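This blog post lists the latest papers retrieved from arXiv.org on 2025-12-01. It is updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.

Note: the daily paper data is retrieved from arXiv.org and updated automatically at around 12:00 each day.

Tip: if you would like to receive the daily paper data by email, leave your email address in the comments.

Contents

Overview (2025-12-01)

A total of 919 papers were updated today, including:

  • Natural language processing: 128 papers (Computation and Language (cs.CL))
  • Artificial intelligence: 234 papers (Artificial Intelligence (cs.AI))
  • Computer vision: 242 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine learning: 227 papers (Machine Learning (cs.LG))

Natural Language Processing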

[NLP-0] ThetaEvolve: Test-time Learning on Open Problems

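[Quick Read]: This paper addresses the low efficiency and lack of continual learning in current large language model (LLM)-based approaches to mathematical optimization problems, in particular pure-inference systems such as AlphaEvolve that rely on ensembles of frontier models and cannot internalize their evolving strategies. The key to its solution is ThetaEvolve, an open-source framework that uses a single LLM together with a large program database for enhanced exploration, batch sampling for higher throughput, lazy penalties to discourage stagnant outputs, and optional reward shaping to stabilize the training signal, so that in-context learning and reinforcement learning (RL) scale together at test time. This design allows a small open-source model such as DeepSeek-R1-0528-Qwen3-8B to reach new best-known bounds on open optimization problems such as circle packing and the first auto-correlation inequality, and the paper verifies that the model does learn continually from experience and transfers to unseen tasks.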

Link: https://arxiv.org/abs/2511.23473
Authors: Yiping Wang, Shao-Rong Su, Zhiyuan Zeng, Eva Xu, Liliang Ren, Xinyu Yang, Zeyi Huang, Xuehai He, Luyao Ma, Baolin Peng, Hao Cheng, Pengcheng He, Weizhu Chen, Shuohang Wang, Simon Shaolei Du, Yelong Shen
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 30 pages, link: this https URL

Abstract: Recent advances in large language models (LLMs) have enabled breakthroughs in mathematical discovery, exemplified by AlphaEvolve, a closed-source system that evolves programs to improve bounds on open problems. However, it relies on ensembles of frontier LLMs to achieve new bounds and is a pure inference system in which models cannot internalize the evolving strategies. We introduce ThetaEvolve, an open-source framework that simplifies and extends AlphaEvolve to efficiently scale both in-context learning and Reinforcement Learning (RL) at test time, allowing models to continually learn from their experiences in improving open optimization problems. ThetaEvolve features a single LLM, a large program database for enhanced exploration, batch sampling for higher throughput, lazy penalties to discourage stagnant outputs, and optional reward shaping for stable training signals, etc. ThetaEvolve is the first evolving framework that enables a small open-source model, like DeepSeek-R1-0528-Qwen3-8B, to achieve new best-known bounds on open problems (circle packing and first auto-correlation inequality) mentioned in AlphaEvolve. Besides, across two models and four open tasks, we find that ThetaEvolve with RL at test-time consistently outperforms inference-only baselines, and the model indeed learns evolving capabilities, as the RL-trained checkpoints demonstrate faster progress and better final performance on both trained target task and other unseen tasks. We release our code publicly: this https URL
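
The ingredients named in the abstract (a program database, batch sampling, and a lazy penalty) can be pictured with a toy loop. This is only a minimal sketch under loud assumptions: a "program" is a single float, `mutate` stands in for the LLM proposing a new program, and `score`, `shaped_reward`, and `evolve` are hypothetical names rather than ThetaEvolve's API. No RL update is performed; the rewards only gate what enters the database.

```python
import random

def score(x):
    # Toy objective standing in for a bound to improve: maximum at x = 3.
    return -(x - 3.0) ** 2

def mutate(x, rng):
    # Stand-in for the LLM proposing a modified program (assumption).
    # Some proposals are "lazy": they return the parent unchanged.
    if rng.random() < 0.3:
        return x
    return x + rng.uniform(-1.0, 1.0)

def shaped_reward(parent, child, lazy_penalty=1.0):
    # Reward is the score gain; unchanged outputs get a fixed penalty instead.
    if child == parent:
        return -lazy_penalty
    return score(child) - score(parent)

def evolve(steps=50, batch_size=8, db_size=16, seed=0):
    rng = random.Random(seed)
    database = [0.0]  # program database, seeded with one initial program
    best_history = [score(0.0)]
    for _ in range(steps):
        # Batch sampling: draw several parents per step.
        parents = [rng.choice(database) for _ in range(batch_size)]
        children = [mutate(p, rng) for p in parents]
        rewards = [shaped_reward(p, c) for p, c in zip(parents, children)]
        # Keep only children with positive reward; lazy copies never enter.
        database.extend(c for c, r in zip(children, rewards) if r > 0)
        database = sorted(set(database), key=score, reverse=True)[:db_size]
        best_history.append(score(database[0]))
    return database[0], best_history
```

In a real system the rewards would also drive a policy update; here they only show how a lazy penalty separates "no change" from "small change".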

[NLP-1] MegaChat: A Synthetic Persian QA Dataset for High-Quality Sales Chatbot Evaluation

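[Quick Read]: This paper addresses the difficulty that Iranian small and medium-sized enterprises (SMEs) face in building intelligent customer-service chatbots for e-commerce sales on Telegram, caused by the lack of high-quality, low-cost Persian question-answering (QA) datasets. Its key solution is a fully automated multi-agent architecture that collects data from active Telegram shopping channels and generates persona-aware QA pairs, using dedicated agents for question generation, validation, and refinement to synthesize realistic and diverse conversational data. The method requires no expensive human annotation or complex fine-tuning, which improves the quality and scalability of the generated data and offers an efficient, practical route for generative AI applications in low-resource languages.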

Link: https://arxiv.org/abs/2511.23397
Authors: Mahdi Rahmani, AmirHossein Saffari, Reyhane Rahmani
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 6 pages, 11 figures, 2 tables

Abstract:Small and medium-sized enterprises (SMEs) in Iran increasingly leverage Telegram for sales, where real-time engagement is essential for conversion. However, developing AI-driven chatbots for this purpose requires large, high-quality question-and-answer (QA) datasets, which are typically expensive and resource-intensive to produce, especially for low-resource languages like Persian. In this paper, we introduce MegaChat, the first fully synthetic Persian QA dataset designed to evaluate intelligent sales chatbots in Telegram-based e-commerce. We propose a novel, automated multi-agent architecture that generates persona-aware QA pairs by collecting data from active Telegram shopping channels. The system employs specialized agents for question generation, validation, and refinement, ensuring the production of realistic and diverse conversational data. To evaluate answer generation, we compare three classic retrieval-augmented generation (RAG) models with our advanced agentic system, which features multi-query retrieval, reranking, and persona-aligned response synthesis. Using GPT-5.1 for evaluation across six quality dimensions, our results show that the agentic architecture outperformed traditional RAG models in 4 out of 5 diverse channels, demonstrating its ability to generate scalable, high-quality datasets without relying on expensive human annotation or complex fine-tuning. MegaChat provides SMEs with an efficient, cost-effective solution for building intelligent customer engagement systems in specialized commercial domains, enabling advancements in multilingual conversational AI for low-resource languages. Download: this https URL

[NLP-2] Ambiguity Awareness Optimization: Towards Semantic Disambiguation for Direct Preference Optimization EMNLP2025

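[Quick Read]: This paper addresses the alignment bottleneck in Direct Preference Optimization (DPO) training caused by identical or semantically similar content (ambiguous content) within preference pairs. The study shows that such content introduces uncertainty and limits further gains in alignment. The key to the solution is a new method, Ambiguity Awareness Optimization (AAO), whose core mechanism is to compute semantic similarity within each preference pair, automatically identify ambiguous content, and re-weight it to reduce ambiguity during training. AAO needs no extra annotation or complex architecture and works with existing preference data alone; on several benchmarks (AlpacaEval 2, MT-Bench, and Arena-Hard) it significantly outperforms current state-of-the-art methods without markedly increasing response length.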

Link: https://arxiv.org/abs/2511.23391
Authors: Jian Li, Shenglin Yin, Yujia Zhang, Alan Zhao, Xi Chen, Xiaohui Zhou, Pengfei Xu
Affiliations: AI Technology Center of OVB, Tencent, China; School of Computer Science, Peking University, China
Categories: Computation and Language (cs.CL)
Comments: Accepted at EMNLP 2025 main

Abstract: Direct Preference Optimization (DPO) is a widely used reinforcement learning from human feedback (RLHF) method across various domains. Recent research has increasingly focused on the role of token importance in improving DPO effectiveness. It is observed that identical or semantically similar content (defined as ambiguous content) frequently appears within the preference pairs. We hypothesize that the presence of ambiguous content during DPO training may introduce ambiguity, thereby limiting further improvements in alignment. Through mathematical analysis and proof-of-concept experiments, we reveal that ambiguous content may potentially introduce ambiguities, thereby degrading performance. To address this issue, we introduce Ambiguity Awareness Optimization (AAO), a simple yet effective approach that automatically re-weights ambiguous content to reduce ambiguities by calculating semantic similarity from preference pairs. Through extensive experiments, we demonstrate that AAO consistently and significantly surpasses state-of-the-art approaches in performance, without markedly increasing response length, across multiple model scales and widely adopted benchmark datasets, including AlpacaEval 2, MT-Bench, and Arena-Hard. Specifically, AAO outperforms DPO by up to 8.9 points on AlpacaEval 2 and achieves an improvement of up to 15.0 points on Arena-Hard.
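
The re-weighting idea can be illustrated on top of the standard DPO objective, -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)). The sketch below is a hedged illustration, not the paper's method: exact token overlap is used as a crude stand-in for semantic similarity, and `token_weights`, `weighted_dpo_loss`, and `ambiguous_weight` are hypothetical names.

```python
import math

def token_weights(tokens, other_tokens, ambiguous_weight=0.2):
    # Exact-match overlap is a crude stand-in for semantic similarity (assumption).
    other = set(other_tokens)
    return [ambiguous_weight if t in other else 1.0 for t in tokens]

def weighted_dpo_loss(chosen, rejected, beta=0.1, ambiguous_weight=0.2):
    # chosen / rejected: lists of (token, policy_logprob, reference_logprob).
    def weighted_log_ratio(seq, other_seq):
        weights = token_weights([t for t, _, _ in seq],
                                [t for t, _, _ in other_seq],
                                ambiguous_weight)
        return sum(w * (lp - ref) for w, (_, lp, ref) in zip(weights, seq))

    margin = beta * (weighted_log_ratio(chosen, rejected)
                     - weighted_log_ratio(rejected, chosen))
    # Standard DPO objective: -log(sigmoid(margin)) = log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))
```

Setting `ambiguous_weight=1.0` recovers the unweighted token-level DPO loss, which makes the effect of down-weighting shared content easy to compare.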

[NLP-3] Is Passive Expertise-Based Personalization Enough? A Case Study in AI-Assisted Test-Taking EMNLP2025

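[Quick Read]: This paper examines how personalization in enterprise task-oriented dialogue systems can improve user experience and task performance, focusing on whether the differing interaction preferences of novice and expert users should be systematically accommodated. The key to its approach is "passive personalization", in which the system automatically adjusts its response strategy to the user type; a user study shows that this helps reduce task load and improve perception of the assistant. The paper also finds that passive personalization alone has task-specific limitations and should be combined with "active personalization" that gives users more agency, to better balance user experience and effectiveness.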

Link: https://arxiv.org/abs/2511.23376
Authors: Li Siyan, Jason Zhang, Akash Maharaj, Yuanming Shi, Yunyao Li
Affiliations: Columbia University; Georgia Institute of Technology; Adobe
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Comments: Accepted into Tailoring AI: Exploring Active and Passive LLM Personalization (PALS) workshop at EMNLP 2025

Abstract:Novice and expert users have different systematic preferences in task-oriented dialogues. However, whether catering to these preferences actually improves user experience and task performance remains understudied. To investigate the effects of expertise-based personalization, we first built a version of an enterprise AI assistant with passive personalization. We then conducted a user study where participants completed timed exams, aided by the two versions of the AI assistant. Preliminary results indicate that passive personalization helps reduce task load and improve assistant perception, but reveal task-specific limitations that can be addressed through providing more user agency. These findings underscore the importance of combining active and passive personalization to optimize user experience and effectiveness in enterprise task-oriented environments.

[NLP-4] Optimizing Multimodal Language Models through Attention-based Interpretability

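[Quick Read]: This paper addresses the difficulty of identifying which components of multimodal language models (MLMs) matter most during Parameter-Efficient Fine-Tuning (PEFT), with the goal of improving training efficiency while preserving performance. The core challenge is that the black-box nature of MLMs makes it hard to locate the attention heads most influential for image understanding. The key to the solution is an interpretability method based on attention scores: it quantifies each attention head's importance by how much it attends to key objects in the image, and uses this to choose which model layers to fine-tune. Specifically, the authors define a Head Impact (HI) score that measures how strongly a head focuses on key objects, and show experimentally that fine-tuning only the few layers with the highest HI scores (around 0.01% of parameters) shifts metrics on image captioning more than fine-tuning randomly selected or lowest-HI layers.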

Link: https://arxiv.org/abs/2511.23375
Authors: Alexander Sergeev, Evgeny Kotelnikov
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for ICAI-2025 conference

Abstract:Modern large language models become multimodal, analyzing various data formats like text and images. While fine-tuning is effective for adapting these multimodal language models (MLMs) to downstream tasks, full fine-tuning is computationally expensive. Parameter-Efficient Fine-Tuning (PEFT) methods address this by training only a small portion of model weights. However, MLMs are difficult to interpret, making it challenging to identify which components are most effective for training to balance efficiency and performance. We propose an attention-based interpretability method for MLMs by analyzing attention scores relative to image tokens. The core idea is to identify attention heads that focus on image key objects. We utilize this information to select optimal model components for PEFT in multimodal models. Our contributions include a method for identifying attention heads associated with image key objects, its application to PEFT for image captioning, and the creation of a new dataset containing images, key object masks, and their textual descriptions. We conducted experiments on MLMs with 2-3 billion parameters to validate the method’s effectiveness. By calculating Head Impact (HI) scores we quantify an attention head’s focus on key objects, indicating its significance in image understanding. Our fine-tuning experiments demonstrate that adapting layers with the highest HI scores leads to the most significant shifts in metrics compared to pre-trained, randomly selected, or lowest-HI-score layers. This indicates that fine-tuning a small percentage (around 0.01%) of parameters in these crucial layers can substantially influence image understanding capabilities.
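
One way to picture a score of this kind is the share of a head's image-token attention that lands on key-object tokens. The abstract does not give the exact Head Impact formula, so the sketch below is an assumption for illustration only; `head_impact` and `rank_layers` are hypothetical helpers.

```python
def head_impact(attention, key_mask):
    # attention: attention weights from one head over the image tokens.
    # key_mask: 1 where the image token belongs to a key object, else 0.
    total = sum(attention)
    if total == 0:
        return 0.0
    on_key = sum(a for a, m in zip(attention, key_mask) if m)
    return on_key / total

def rank_layers(layer_heads, key_mask, top_k=1):
    # layer_heads: {layer_name: [attention_vector_per_head, ...]}
    # A layer's score is the mean score of its heads (assumed aggregation).
    layer_score = {
        name: sum(head_impact(h, key_mask) for h in heads) / len(heads)
        for name, heads in layer_heads.items()
    }
    return sorted(layer_score, key=layer_score.get, reverse=True)[:top_k]
```

The layers returned by `rank_layers` would then be the candidates for parameter-efficient fine-tuning, with all other layers kept frozen.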

[NLP-5] Scaling HuBERT for African Languages: From Base to Large and XL

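[Quick Read]: This paper addresses the long-standing under-representation of African languages in multilingual speech processing, especially the lack of strong, transferable open-weight encoders under low-resource supervision. The key to its solution is training and releasing the first large self-supervised models built solely on African speech, SSA-HuBERT-Large (317M parameters) and SSA-HuBERT-XL (964M parameters), and showing through controlled experiments on automatic speech recognition (ASR) and language identification (LID) that larger models make effective use of large African speech datasets and significantly improve performance.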

Link: https://arxiv.org/abs/2511.23370
Authors: Antoine Caubrière, Elodie Gauthier
Affiliations: Orange Research
Categories: Computation and Language (cs.CL)
Comments: Journée d’études AFIA-ATALA 2025 : Technologies linguistiques pour les langues peu dotées

Abstract:Despite recent progress in multilingual speech processing, African languages remain under-represented in both research and deployed systems, particularly when it comes to strong, open-weight encoders that transfer well under low-resource supervision. Self-supervised learning has proven especially promising in such settings, yet most publicly released models targeting African speech remain at BASE scale, leaving unanswered whether larger encoders, trained exclusively on Africa-centric audio, offer tangible benefits and how model capacity interacts with data composition. This work addresses that gap by introducing SSA-HuBERT-Large (317M parameters) and SSA-HuBERT-XL (964M parameters), the first large models trained solely on African speech, alongside a BASE size counterpart. We release these models as open weights: see this https URL. By conducting a carefully controlled experimental study focused exclusively on Sub-Saharan languages, covering automatic speech recognition (ASR) and language identification (LID) tasks, we demonstrate that larger architectures significantly improve performance by effectively leveraging large audio datasets.

[NLP-6] Towards Improving Interpretability of Language Model Generation through a Structured Knowledge Discovery Approach

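[Quick Read]: This paper addresses the insufficient reliability and transparency of knowledge-enhanced text generation that stems from the lack of interpretability in language models. Existing methods rely on domain-specific knowledge retrievers and generalize poorly to diverse data types and tasks. The key to the solution is a task-agnostic structured knowledge hunter, which exploits the two-tier architecture of structured knowledge (high-level entities and low-level knowledge triples) for representation learning, and uses a local-global interaction scheme and a hierarchical transformer-based pointer network to select relevant knowledge. This preserves the strong generative ability of language models while improving the interpretability and faithfulness of the generated output.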

Link: https://arxiv.org/abs/2511.23335
Authors: Shuqi Liu, Han Wu, Guanzhi Deng, Jianshu Chen, Xiaoyang Wang, Linqi Song
Affiliations: City University of Hong Kong; City University of Hong Kong Shenzhen Research Institute; Tencent AI Lab
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Knowledge-enhanced text generation aims to enhance the quality of generated text by utilizing internal or external knowledge sources. While language models have demonstrated impressive capabilities in generating coherent and fluent text, the lack of interpretability presents a substantial obstacle. The limited interpretability of generated text significantly impacts its practical usability, particularly in knowledge-enhanced text generation tasks that necessitate reliability and explainability. Existing methods often employ domain-specific knowledge retrievers that are tailored to specific data characteristics, limiting their generalizability to diverse data types and tasks. To overcome this limitation, we directly leverage the two-tier architecture of structured knowledge, consisting of high-level entities and low-level knowledge triples, to design our task-agnostic structured knowledge hunter. Specifically, we employ a local-global interaction scheme for structured knowledge representation learning and a hierarchical transformer-based pointer network as the backbone for selecting relevant knowledge triples and entities. By combining the strong generative ability of language models with the high faithfulness of the knowledge hunter, our model achieves high interpretability, enabling users to comprehend the model output generation process. Furthermore, we empirically demonstrate the effectiveness of our model in both internal knowledge-enhanced table-to-text generation on the RotoWireFG dataset and external knowledge-enhanced dialogue response generation on the KdConv dataset. Our task-agnostic model outperforms state-of-the-art methods and corresponding language models, setting new standards on the benchmark.

[NLP-7] Tackling a Challenging Corpus for Early Detection of Gambling Disorder: UNSL at MentalRiskES 2025

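[Quick Read]: This paper addresses Early Risk Detection (ERD) of gambling disorder on the Web, that is, identifying potentially high-risk users by analyzing their social media activity. The key to its solution is three methods based on the CPI+DMC approach, which use the SS3, BERT with extended vocabulary, and SBERT models, followed by decision policies based on historical user analysis, and which treat predictive effectiveness and decision-making speed as independent objectives. Two of the methods took the top two positions in Task 1 of the MentalRiskES 2025 challenge and performed notably on decision metrics.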

Link: https://arxiv.org/abs/2511.23325
Authors: Horacio Thompson, Marcelo Errecalde
Affiliations: Universidad Nacional de San Luis (UNSL); Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET)
Categories: Computation and Language (cs.CL)
Comments: In Iberian Language Evaluation Forum (IberLEF 2025), Zaragoza, Spain

Abstract:Gambling disorder is a complex behavioral addiction that is challenging to understand and address, with severe physical, psychological, and social consequences. Early Risk Detection (ERD) on the Web has become a key task in the scientific community for identifying early signs of mental health behaviors based on social media activity. This work presents our participation in the MentalRiskES 2025 challenge, specifically in Task 1, aimed at classifying users at high or low risk of developing a gambling-related disorder. We proposed three methods based on a CPI+DMC approach, addressing predictive effectiveness and decision-making speed as independent objectives. The components were implemented using the SS3, BERT with extended vocabulary, and SBERT models, followed by decision policies based on historical user analysis. Although it was a challenging corpus, two of our proposals achieved the top two positions in the official results, performing notably in decision metrics. Further analysis revealed some difficulty in distinguishing between users at high and low risk, reinforcing the need to explore strategies to improve data interpretation and quality, and to promote more transparent and reliable ERD systems for mental disorders.
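
Separating "how good is the prediction" from "when to commit to it" can be sketched with a simple early-decision policy. This is a hypothetical illustration, not the authors' policy: the per-post scores are assumed to come from some classifier, and `early_decision`, `threshold`, and `min_posts` are invented for the example.

```python
def early_decision(risk_scores, threshold=0.7, min_posts=3):
    # risk_scores: per-post probabilities from some classifier (assumption),
    # in chronological order. Returns (decision, posts_read).
    running_sum = 0.0
    for n, p in enumerate(risk_scores, start=1):
        running_sum += p
        # Decide as soon as enough posts are read and the mean score is high.
        if n >= min_posts and running_sum / n >= threshold:
            return "high", n
    return "low", len(risk_scores)
```

For example, `early_decision([0.9, 0.8, 0.9, 0.1])` commits to "high" after three posts, so the fourth post is never read; lowering `min_posts` trades reliability for speed.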

[NLP-8] Every Token Counts: Generalizing 16M Ultra-Long Context in Large Language Models

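[Quick Read]: This paper takes on the challenge of building "Machines that Can Remember", framing the core problem as efficient ultra-long context modeling. It argues that this requires three key properties: sparsity, random-access flexibility, and length generalization. The key to the solution is Hierarchical Sparse Attention (HSA), an attention mechanism that satisfies all three properties and is integrated into the Transformer architecture to form HSA-UltraLong. The model is an 8B-parameter Mixture-of-Experts (MoE) model trained on over 8 trillion tokens and evaluated on different tasks at in-domain and out-of-domain context lengths; it performs comparably to full-attention baselines at in-domain lengths and stays above 90% accuracy on most in-context retrieval tasks with contexts up to 16M tokens.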

Link: https://arxiv.org/abs/2511.23319
Authors: Xiang Hu, Zhanchao Zhou, Ruiqi Liang, Zehuan Li, Wei Wu, Jianguo Li
Affiliations: Ant Group; Westlake University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract: This work explores the challenge of building "Machines that Can Remember", framing long-term memory as the problem of efficient ultra-long context modeling. We argue that this requires three key properties: sparsity, random-access flexibility, and length generalization. To address ultra-long-context modeling, we leverage Hierarchical Sparse Attention (HSA), a novel attention mechanism that satisfies all three properties. We integrate HSA into Transformers to build HSA-UltraLong, which is an 8B-parameter MoE model trained on over 8 trillion tokens and is rigorously evaluated on different tasks with in-domain and out-of-domain context lengths to demonstrate its capability in handling ultra-long contexts. Results show that our model performs comparably to full-attention baselines on in-domain lengths while achieving over 90% accuracy on most in-context retrieval tasks with contexts up to 16M. This report outlines our experimental insights and open problems, contributing a foundation for future research in ultra-long context modeling.
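
Sparsity and random access can be pictured with a chunk-selection step: score whole chunks of the context cheaply, then run ordinary attention only inside the best ones. The sketch below is a single-level simplification written for illustration; it is not the paper's HSA, and the mean-key chunk summary, `chunk_size`, and `top_k` are all assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sparse_chunk_attention(query, keys, values, chunk_size=2, top_k=1):
    # Split the context into fixed-size chunks of token indices.
    chunks = [list(range(i, min(i + chunk_size, len(keys))))
              for i in range(0, len(keys), chunk_size)]

    # Score each chunk with the query against its mean key (assumed summary).
    def chunk_score(idx):
        mean_key = [sum(keys[i][d] for i in idx) / len(idx)
                    for d in range(len(query))]
        return dot(query, mean_key)

    selected = sorted(chunks, key=chunk_score, reverse=True)[:top_k]
    token_ids = sorted(i for chunk in selected for i in chunk)
    # Ordinary softmax attention, restricted to the selected tokens.
    logits = [dot(query, keys[i]) for i in token_ids]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    norm = sum(weights)
    dim = len(values[0])
    output = [sum(w * values[i][d] for w, i in zip(weights, token_ids)) / norm
              for d in range(dim)]
    return token_ids, output
```

The cost of the final attention depends on `top_k * chunk_size` rather than on the full context length, which is the property that makes very long contexts tractable in schemes of this kind.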

[NLP-9] Toward Automatic Safe Driving Instruction: A Large-Scale Vision Language Model Approach

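[Quick Read]: This paper addresses the fusion of multi-view visual information and the detection of safety events in driving scenarios, in particular how large models can process synchronized video from driver-facing and road-facing cameras to monitor driving safety comprehensively. The key to the solution is constructing a dataset for this setting and fine-tuning pre-trained Large-scale Vision Language Models (LVLMs) so that they generate accurate, safety-aware driving instructions. Experiments show that fine-tuned LVLMs clearly outperform the pre-trained models, which have limited effectiveness, although detecting subtle or complex events remains challenging.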

Link: https://arxiv.org/abs/2511.23311
Authors: Haruki Sakajo, Hiroshi Takato, Hiroshi Tsutsui, Komei Soda, Hidetaka Kamigaito, Taro Watanabe
Affiliations: Nara Institute of Science and Technology; Teatis inc.; Queensland University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted to MMLoSo 2025

Abstract:Large-scale Vision Language Models (LVLMs) exhibit advanced capabilities in tasks that require visual information, including object detection. These capabilities have promising applications in various industrial domains, such as autonomous driving. For example, LVLMs can generate safety-oriented descriptions of videos captured by road-facing cameras. However, ensuring comprehensive safety requires monitoring driver-facing views as well to detect risky events, such as the use of mobiles while driving. Thus, the ability to process synchronized inputs is necessary from both driver-facing and road-facing cameras. In this study, we develop models and investigate the capabilities of LVLMs by constructing a dataset and evaluating their performance on this dataset. Our experimental results demonstrate that while pre-trained LVLMs have limited effectiveness, fine-tuned LVLMs can generate accurate and safety-aware driving instructions. Nonetheless, several challenges remain, particularly in detecting subtle or complex events in the video. Our findings and error analysis provide valuable insights that can contribute to the improvement of LVLM-based systems in this domain.

[NLP-10] Transformer-Driven Triple Fusion Framework for Enhanced Multimodal Author Intent Classification in Low-Resource Bangla

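[Quick Read]: This paper addresses author intent classification in Bangla social media posts, in particular the limitations of earlier unimodal approaches when content is multimodal. Its key solution is a novel intermediate fusion strategy that combines textual and visual features at intermediate layers of the model rather than through early or late fusion. Experiments on the Uddessho dataset show that the strategy outperforms existing methods, reaching an 84.11% macro-F1 score with mBERT and Swin Transformer, an 8.4 percentage-point improvement over the previous state of the art. This indicates that cross-modal feature integration at intermediate levels balances modality-specific representation and cross-modal learning, and it sets a new benchmark for multimodal intent recognition in low-resource languages.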

Link: https://arxiv.org/abs/2511.23287
Authors: Ariful Islam, Tanvir Mahmud, Md Rifat Hossen
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Accepted at the 28th International Conference on Computer and Information Technology (ICCIT 2025). To be published in IEEE proceedings

Abstract:The expansion of the Internet and social networks has led to an explosion of user-generated content. Author intent understanding plays a crucial role in interpreting social media content. This paper addresses author intent classification in Bangla social media posts by leveraging both textual and visual data. Recognizing limitations in previous unimodal approaches, we systematically benchmark transformer-based language models (mBERT, DistilBERT, XLM-RoBERTa) and vision architectures (ViT, Swin, SwiftFormer, ResNet, DenseNet, MobileNet), utilizing the Uddessho dataset of 3,048 posts spanning six practical intent categories. We introduce a novel intermediate fusion strategy that significantly outperforms early and late fusion on this task. Experimental results show that intermediate fusion, particularly with mBERT and Swin Transformer, achieves 84.11% macro-F1 score, establishing a new state-of-the-art with an 8.4 percentage-point improvement over prior Bangla multimodal approaches. Our analysis demonstrates that integrating visual context substantially enhances intent classification. Cross-modal feature integration at intermediate levels provides optimal balance between modality-specific representation and cross-modal learning. This research establishes new benchmarks and methodological standards for Bangla and other low-resource languages. We call our proposed framework BangACMM (Bangla Author Content MultiModal).
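
The headline number here is a macro-F1 score, which averages per-class F1 so that each of the six intent categories counts equally regardless of its frequency. A minimal sketch of the metric follows; the labels in the usage note are made up for illustration and are not from the Uddessho dataset.

```python
def macro_f1(y_true, y_pred):
    # Per-class F1 = 2*TP / (2*TP + FP + FN), averaged with equal class weight.
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(labels)
```

With `y_true = ["a", "a", "b", "b"]` and `y_pred = ["a", "b", "b", "b"]`, class "a" has F1 = 2/3 and class "b" has F1 = 4/5, so the macro-F1 is 11/15, about 0.733.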

[NLP-11] MCP vs RAG vs NLWeb vs HTML: A Comparison of the Effectiveness and Efficiency of Different Agent Interfaces to the Web (Technical Report)

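[Quick Read]: This paper addresses the lack of a systematic comparison of how large language model (LLM) web agents perform across different interaction interfaces. Prior work has explored HTML browsing, retrieval-augmented generation (RAG) over pre-crawled content, communication via Web APIs using the Model Context Protocol (MCP), and natural-language querying through NLWeb, but has not compared these architectures fairly in a single controlled environment. The key to the solution is a testbed of four simulated e-shops, each exposing HTML, MCP, and NLWeb interfaces, with specialized agents for each interface performing the same task sets (such as product search, price comparison, and checkout). Experiments show that the RAG, MCP, and NLWeb agents outperform the HTML agent on both effectiveness (F1 score) and efficiency (token usage and runtime); RAG with GPT 5 gives the best overall result (F1 = 0.87, completion rate = 0.79), while RAG with GPT 5 mini offers a good compromise between cost and performance.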

Link: https://arxiv.org/abs/2511.23281
Authors: Aaron Steiner, Ralph Peeters, Christian Bizer
Affiliations: University of Mannheim
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large language model agents are increasingly used to automate web tasks such as product search, offer comparison, and checkout. Current research explores different interfaces through which these agents interact with websites, including traditional HTML browsing, retrieval-augmented generation (RAG) over pre-crawled content, communication via Web APIs using the Model Context Protocol (MCP), and natural-language querying through the NLWeb interface. However, no prior work has compared these four architectures within a single controlled environment using identical tasks. To address this gap, we introduce a testbed consisting of four simulated e-shops, each offering its products via HTML, MCP, and NLWeb interfaces. For each interface (HTML, RAG, MCP, and NLWeb) we develop specialized agents that perform the same sets of tasks, ranging from simple product searches and price comparisons to complex queries for complementary or substitute products and checkout processes. We evaluate the agents using GPT 4.1, GPT 5, GPT 5 mini, and Claude Sonnet 4 as underlying LLM. Our evaluation shows that the RAG, MCP and NLWeb agents outperform HTML on both effectiveness and efficiency. Averaged over all tasks, F1 rises from 0.67 for HTML to between 0.75 and 0.77 for the other agents. Token usage falls from about 241k for HTML to between 47k and 140k per task. The runtime per task drops from 291 seconds to between 50 and 62 seconds. The best overall configuration is RAG with GPT 5 achieving an F1 score of 0.87 and a completion rate of 0.79. Also taking cost into consideration, RAG with GPT 5 mini offers a good compromise between API usage fees and performance. Our experiments show the choice of the interaction interface has a substantial impact on both the effectiveness and efficiency of LLM-based web agents. 
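
The efficiency figures in the abstract are easier to compare as relative reductions. The short calculation below uses only the numbers quoted above (about 241k tokens and 291 seconds for HTML, against 47k to 140k tokens and 50 to 62 seconds for the other agents); the helper name `relative_reduction` is introduced here for illustration.

```python
def relative_reduction(before, after):
    # Fraction by which a cost drops when going from `before` to `after`.
    return 1.0 - after / before

# Per-task averages quoted in the abstract.
html_tokens, other_tokens = 241_000, (47_000, 140_000)
html_runtime, other_runtime = 291, (50, 62)

token_savings = [relative_reduction(html_tokens, t) for t in other_tokens]
runtime_savings = [relative_reduction(html_runtime, r) for r in other_runtime]
```

This works out to roughly 42% to 80% fewer tokens and roughly 79% to 83% less runtime per task than the HTML agent.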

[NLP-12] Behavior-Equivalent Token: Single-Token Replacement for Long Prompts in LLMs

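[Quick Read]: This paper addresses the high inference latency, high computational cost, and reduced effective context length caused by long system prompts in large language model (LLM) agents. Its core solution is a lightweight three-stage training framework that learns a prompt-specific Behavior-Equivalent token ([BE]), compressing the original system prompt's natural-language content and downstream behavior into a single token. The method needs no access to model internals, no auxiliary compression model, and no labeled responses: [BE] is first trained to encode the prompt via reconstruction, and the prompt's downstream behavior is then distilled into it. Experiments on three datasets show up to a 3000x reduction in prompt length while retaining about 98% of the original downstream performance, which substantially reduces inference cost and leaves almost the entire context window for user input.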

Link: https://arxiv.org/abs/2511.23271
Authors: Jiancheng Dong, Pengyue Jia, Jingyu Peng, Maolin Wang, Yuhao Wang, Lixin Su, Xin Sun, Shuaiqiang Wang, Dawei Yin, Xiangyu Zhao
Affiliations: City University of Hong Kong; Baidu Inc.
Categories: Computation and Language (cs.CL)
Comments: 15 pages, 5 figures

Abstract: Carefully engineered system prompts play a critical role in guiding the behavior of LLM agents, but their considerable length introduces significant drawbacks, including increased inference latency, higher computational cost, and reduced effective context length. This raises the question of whether such lengthy prompts can be replaced by a drastically reduced number of tokens while preserving their behavioral effect on downstream tasks. To enable this, we propose a lightweight three-stage training framework that learns a single prompt-specific Behavior-Equivalent token ([BE]). The framework first trains [BE] to encode the natural-language content of the original system prompt via reconstruction, and then distills the prompt's downstream behavior into this single token. Importantly, our method requires no access to model internals, no auxiliary compression models, and no labeled responses. Empirical evaluations on three datasets show that a single [BE] token achieves up to a 3000x reduction in prompt length, while retaining about 98% of the downstream performance of the original system prompts. This substantially reduces inference cost and leaves almost the entire context window available for user inputs.

[NLP-13] BanglaSentNet: An Explainable Hybrid Deep Learning Framework for Multi-Aspect Sentiment Analysis with Cross-Domain Transfer Learning

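[Quick Read]: This paper addresses the challenges of multi-aspect sentiment analysis for Bangla e-commerce reviews, including scarce annotated data, morphological complexity, code-mixing, and domain shift, which affect about 300 million Bangla-speaking users. The key to the solution is BanglaSentNet, an explainable hybrid deep learning framework that combines LSTM, BiLSTM, GRU, and BanglaBERT through dynamic weighted ensemble learning, and adds SHAP-based feature attribution and attention visualization for transparent decisions. On a dataset of 8,755 manually annotated Bangla product reviews, the method achieves 85% accuracy and a 0.88 F1-score, outperforming standalone deep learning models by 3-7%, and it generalizes across domains: zero-shot performance retains 67-76% effectiveness, and few-shot learning with 500-1000 samples reaches 90-95% of full fine-tuning performance.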

Link: https://arxiv.org/abs/2511.23264
Authors: Ariful Islam, Md Rifat Hossen, Tanvir Mahmud
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Submitted to Springer Nature Computer Science (SNCS) as an extended version of our ICDSAIA 2025 conference paper

Abstract:Multi-aspect sentiment analysis of Bangla e-commerce reviews remains challenging due to limited annotated datasets, morphological complexity, code-mixing phenomena, and domain shift issues, affecting 300 million Bangla-speaking users. Existing approaches lack explainability and cross-domain generalization capabilities crucial for practical deployment. We present BanglaSentNet, an explainable hybrid deep learning framework integrating LSTM, BiLSTM, GRU, and BanglaBERT through dynamic weighted ensemble learning for multi-aspect sentiment classification. We introduce a dataset of 8,755 manually annotated Bangla product reviews across four aspects (Quality, Service, Price, Decoration) from major Bangladeshi e-commerce platforms. Our framework incorporates SHAP-based feature attribution and attention visualization for transparent insights. BanglaSentNet achieves 85% accuracy and 0.88 F1-score, outperforming standalone deep learning models by 3-7% and traditional approaches substantially. The explainability suite achieves 9.4/10 interpretability score with 87.6% human agreement. Cross-domain transfer learning experiments reveal robust generalization: zero-shot performance retains 67-76% effectiveness across diverse domains (BanglaBook reviews, social media, general e-commerce, news headlines); few-shot learning with 500-1000 samples achieves 90-95% of full fine-tuning performance, significantly reducing annotation costs. Real-world deployment demonstrates practical utility for Bangladeshi e-commerce platforms, enabling data-driven decision-making for pricing optimization, service improvement, and customer experience enhancement. This research establishes a new state-of-the-art benchmark for Bangla sentiment analysis, advances ensemble learning methodologies for low-resource languages, and provides actionable solutions for commercial applications.

[NLP-14] Tourism Question Answer System in Indian Language using Domain-Adapted Foundation Models

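[Quick Read]: This paper addresses the lack of question-answering (QA) resources for culturally nuanced tasks in a low-resource Indian-language domain (Hindi tourism), focusing on Varanasi, a city with deep religious and cultural significance. The core challenge is building an efficient and accurate extractive QA system from limited annotated data that can handle culturally embedded terms such as "Ganga Aarti" or "Kund". The key to the solution is twofold. First, a dataset of 7,715 Hindi QA pairs was constructed and augmented with 27,455 pairs generated via Llama zero-shot prompting, for 35,170 pairs in total. Second, foundation models (BERT and RoBERTa) were fine-tuned with Supervised Fine-Tuning (SFT) and Low-Rank Adaptation (LoRA); LoRA reduces trainable parameters by 98% compared to SFT while remaining competitive (85.3% F1). Experiments show that RoBERTa with SFT is best at capturing the context of culturally embedded terms, while LoRA has the advantage in parameter efficiency, giving a reusable baseline for tourism QA in low-resource languages.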

Link: https://arxiv.org/abs/2511.23235
Authors: Praveen Gatla, Anushka, Nikita Kanwar, Gouri Sahoo, Rajesh Kumar Mundotiya
Affiliations: Banaras Hindu University; Indian Institute of Technology (BHU); Indian Institute of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract: This article presents the first comprehensive study on designing a baseline extractive question-answering (QA) system for the Hindi tourism domain, with a specialized focus on Varanasi, a cultural and spiritual hub renowned for its Bhakti-Bhaav (devotional ethos). Targeting ten tourism-centric subdomains (Ganga Aarti, Cruise, Food Court, Public Toilet, Kund, Museum, General, Ashram, Temple, and Travel), the work addresses the absence of language-specific QA resources in Hindi for culturally nuanced applications. In this paper, a dataset comprising 7,715 Hindi QA pairs pertaining to Varanasi tourism was constructed and subsequently augmented with 27,455 pairs generated via Llama zero-shot prompting. We propose a framework leveraging foundation models (BERT and RoBERTa), fine-tuned using Supervised Fine-Tuning (SFT) and Low-Rank Adaptation (LoRA), to optimize parameter efficiency and task performance. Multiple variants of BERT, including pre-trained languages (e.g., Hindi-BERT), are evaluated to assess their suitability for low-resource domain-specific QA. Evaluation metrics (F1, BLEU, and ROUGE-L) highlight trade-offs between answer precision and linguistic fluency. Experiments demonstrate that LoRA-based fine-tuning achieves competitive performance (85.3% F1) while reducing trainable parameters by 98% compared to SFT, striking a balance between efficiency and accuracy. Comparative analysis across models reveals that RoBERTa with SFT outperforms BERT variants in capturing contextual nuances, particularly for culturally embedded terms (e.g., Aarti, Kund). This work establishes a foundational baseline for Hindi tourism QA systems, emphasizing the role of LoRA in low-resource settings and underscoring the need for culturally contextualized NLP frameworks in the tourism domain.
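
A reduction of this size follows from how LoRA is built: instead of updating a full d_out x d_in weight matrix, it trains two low-rank factors with rank * (d_out + d_in) parameters. The sketch below counts parameters for a single matrix; the 768 x 768 shape and rank 8 are assumptions chosen for illustration, not the paper's configuration.

```python
def lora_trainable_params(d_out, d_in, rank):
    # LoRA trains two low-rank factors: B (d_out x rank) and A (rank x d_in).
    return rank * (d_out + d_in)

def reduction_vs_full(d_out, d_in, rank):
    # Fraction of trainable parameters saved relative to updating the full matrix.
    full = d_out * d_in
    return 1.0 - lora_trainable_params(d_out, d_in, rank) / full
```

For a 768 x 768 matrix at rank 8, LoRA trains 12,288 parameters instead of 589,824, a reduction of about 97.9% for that matrix, which is the same order as the 98% figure reported above.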

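[NLP-15] TWEO: Transformers Without Extreme Outliers Enables FP8 Training And Quantization For Dummies

Summary: This paper addresses a bottleneck in training large Transformers with the native FP8 support of modern hardware: extreme activation outliers. Existing approaches either rely on complex mixed-precision engineering or require invasive architectural changes. The key contribution is a new account of where outliers come from. The authors show that extreme outliers are not data-driven but a mechanically produced artifact of training that originates from specific structural properties of the weight matrices (namely collinearity). Building on this, they propose TWEO (Transformers Without Extreme Outliers), a non-invasive loss function whose very simple loss term reduces outliers from 10,000+ to below 20 and enables full-model FP8 pre-training without engineering tricks or architectural changes. Where standard FP8 training collapses, TWEO matches the BF16 baseline while raising training throughput by 36%. It also lets hardware-friendly W8A8 per-tensor static quantization of LLMs reach SOTA performance for the first time.

Link: https://arxiv.org/abs/2511.23225
Authors: Guang Liang,Jie Shao,Ningyuan Tang,Xinyao Liu,Jianxin Wu
Affiliations: Nanjing University; Zhongguancun Academy; University of Science and Technology of China
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: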

Abstract:Native FP8 support in modern hardware is essential for training large Transformers, but is severely hindered by extreme activation outliers. Existing solutions either rely on complex mixed-precision engineering or invasive architectural modifications. This paper fundamentally challenges the conventional wisdom that outliers are data-driven. We demonstrate that extreme outliers are a data-independent, mechanically-produced artifact of training, originating from specific structural properties of the weight matrices (i.e., colinearity). Based on this insight, we propose TWEO (Transformers Without Extreme Outliers), a novel, non-invasive loss function. TWEO effectively prevents extreme outliers via a very simple loss term, which reduces outliers from 10000+ to less than 20. TWEO then enables full-model FP8 pre-training with neither engineering tricks nor architectural changes for both LLM and ViT. When standard FP8 training catastrophically collapses, TWEO achieves performance comparable to the BF16 baseline while delivering a 36% increase in training throughput. Also, TWEO enables a new quantization paradigm. Hardware-friendly W8A8 per-tensor static quantization of LLMs, previously considered completely unusable due to outliers, achieves SOTA performance for the first time on TWEO-trained models.
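
To see why one extreme activation can make per-tensor static quantization unusable, here is a minimal sketch in plain Python. It is not the TWEO loss (the abstract does not specify it); it only illustrates symmetric int8 absmax quantization with a single scale per tensor. The function names and the toy values are assumptions made for illustration.

```python
def quantize_per_tensor_int8(values):
    """Symmetric absmax quantization: one scale shared by the whole tensor."""
    scale = max(abs(v) for v in values) / 127.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def mean_abs_error(values):
    """Average round-trip error after quantizing and dequantizing."""
    quantized, scale = quantize_per_tensor_int8(values)
    restored = [q * scale for q in quantized]
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)


# Toy activations (hypothetical): small values, optionally with one huge outlier.
normal = [0.1 * i for i in range(-10, 11)]
with_outlier = normal + [10000.0]

err_normal = mean_abs_error(normal)
err_outlier = mean_abs_error(with_outlier)
```

With the outlier present, the shared scale grows so large that every small activation rounds to zero, which is the failure mode that motivates removing outliers before quantizing.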

[NLP-16] Listwise Preference Optimization with Element-wise Confusions for Aspect Sentiment Quad Prediction

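Summary: This paper tackles Aspect Sentiment Quad Prediction (ASQP), which requires predicting a structured quadruple of four sentiment elements: aspect term (a), aspect category (c), opinion term (o), and sentiment polarity (s). Marker-based methods struggle to model the relationships among elements, and under standard supervised fine-tuning their performance drops sharply on higher-order elements such as c and s. The solution is a reasoning-based generation paradigm: the model outputs both the quadruple and a natural-language rationale under element prefixes within a unified template, which encourages explicit relational reasoning and interpretability. On top of this, a listwise preference optimization framework generates element-wise confusable candidates via syntactic and semantic proximity and trains the model with listwise objectives to prefer the gold quadruple, improving structural validity and relational coherence.

Link: https://arxiv.org/abs/2511.23184
Authors: Wenna Lai,Haoran Xie,Guandong Xu,Qing Li,S. Joe Qin
Affiliations: The Hong Kong Polytechnic University; Lingnan University; University of Technology Sydney; The Education University of Hong Kong
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 11 pages, 7 figures, and 6 tables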

Abstract:Aspect sentiment quad prediction (ASQP) is inherently challenging to predict a structured quadruple with four core sentiment elements, including aspect term (a), aspect category (c), opinion term (o), and sentiment polarity (s). Prior methods relying on marker-based prediction struggle with modeling the intricate relationships among elements and experience sharp performance declines when predicting higher-order elements (e.g., c and s) under standard supervised fine-tuning. To address these limitations, we employ reasoning-based generation to output both the quadruple and a natural language rationale under element prefixes within a unified template, encouraging explicit relational reasoning and interpretability. To further enhance element-wise alignment, we introduce a listwise preference optimization framework for improving structural validity and relational coherence. Specifically, we generate element-wise confusable candidates via syntactic and semantic proximity, then train the model with listwise objectives to prefer the gold candidates over closely competing alternatives. Extensive experiments on four benchmark datasets demonstrate that our framework effectively improves quadruple prediction accuracy and explanation consistency.
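
The abstract does not give the exact listwise objective, so the following is only a generic sketch of one common choice: the negative log-probability of the gold candidate under a softmax over the whole candidate list. The function name and the scores are hypothetical.

```python
import math


def listwise_nll(scores, gold_index=0):
    """Negative log-probability of the gold candidate under a softmax over the list."""
    max_score = max(scores)  # subtract the max for numerical stability
    log_norm = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    return log_norm - scores[gold_index]


# Hypothetical model scores: gold quadruple first, then confusable candidates.
loss_confident = listwise_nll([5.0, 1.0, 0.5])
loss_confused = listwise_nll([1.0, 1.2, 0.9])
```

The loss is small when the gold quadruple clearly outscores its confusable neighbours and larger when a close competitor scores as high or higher, which is the pressure a listwise objective is meant to apply.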

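[NLP-17] Are LLMs Good Safety Agents or a Propaganda Engine?

Summary: This paper asks whether the refusals that large language models (LLMs) give to sensitive content reflect genuine safety policies or politically motivated censorship. Because no systematic analysis separates the two, the authors build PSP, a dataset designed to probe refusal behavior from an explicitly political context. Their approach has two parts: first, they analyze the impact of political sensitivity in seven LLMs through a data-driven method (making PSP implicit) and a representation-level method (erasing the concept of politics); second, they evaluate how vulnerable the models are on PSP under prompt injection attacks (PIAs). The results indicate that most LLMs perform some form of censorship, and the paper summarizes the attributes that shift refusal distributions across models and country contexts.

Link: https://arxiv.org/abs/2511.23174
Authors: Neemesh Yadav,Francesco Ortu,Jiarui Liu,Joeun Yook,Bernhard Schölkopf,Rada Mihalcea,Alberto Cazzaniga,Zhijing Jin
Affiliations: SMU; University of Trieste; AREA Science Park; CMU; University of Toronto; Vector Institute; MPI for Intelligent Systems; University of Michigan
Categories: Computation and Language (cs.CL)
Comments: 15 pages, 7 tables, 4 figures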

Abstract:Large Language Models (LLMs) are trained to refuse to respond to harmful content. However, systematic analyses of whether this behavior is truly a reflection of its safety policies or an indication of political censorship, that is practiced globally by countries, is lacking. Differentiating between safety influenced refusals or politically motivated censorship is hard and unclear. For this purpose we introduce PSP, a dataset built specifically to probe the refusal behaviors in LLMs from an explicitly political context. PSP is built by formatting existing censored content from two data sources, openly available on the internet: sensitive prompts in China generalized to multiple countries, and tweets that have been censored in various countries. We study: 1) impact of political sensitivity in seven LLMs through data-driven (making PSP implicit) and representation-level approaches (erasing the concept of politics); and, 2) vulnerability of models on PSP through prompt injection attacks (PIAs). Associating censorship with refusals on content with masked implicit intent, we find that most LLMs perform some form of censorship. We conclude with summarizing major attributes that can cause a shift in refusal distributions across models and contexts of different countries.

[NLP-18] Multi-chain Graph Refinement and Selection for Reliable Reasoning in Large Language Models

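Summary: This paper targets the limited complex-reasoning ability of large language models (LLMs), a bottleneck in practical applications, and in particular the weaknesses of test-time expansion methods such as Tree-of-Thought and Graph-of-Thought: limited diversity of reasoning strategies, redundant search branches, and inadequate integration and error correction across heterogeneous reasoning paths. The proposed framework, Multi-chain Graph Refinement Selection (MGRS), works in four stages: it generates multiple diverse reasoning trajectories, refines candidate answers with a composite self- and cross-verification strategy, builds a reasoning relation graph and estimates the success rate of intermediate nodes, and finally uses cumulative success rates to select the most reliable answer and its reasoning trajectory. The authors report gains in both reasoning accuracy and computational efficiency.

Link: https://arxiv.org/abs/2511.23136
Authors: Yujiao Yang,Jing Lian,Linhui Li
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: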

Abstract:The complex reasoning ability of Large Language Models (LLMs) poses a critical bottleneck for their practical applications. Test-time expansion methods such as Tree-of-Thought (ToT) and Graph-of-Thought (GoT) enhance reasoning by introducing intermediate reasoning structures, tree search, or graph-based exploration mechanisms. However, their reasoning strategies suffer from limited diversity, redundant search branches, and inadequate integration and error correction across heterogeneous reasoning paths. To address these limitations, we propose a novel reasoning framework called Multi-chain Graph Refinement Selection (MGRS), which first generates multiple diverse reasoning trajectories for a given problem, refines candidate responses using a composite self- and cross-verification strategy, then constructs a reasoning relation graph and estimates the success rate of intermediate nodes, and finally computes cumulative success rates to select the most reliable answer and corresponding reasoning trajectory. Experimental results demonstrate that MGRS significantly advances both the reasoning capability and computational efficiency of reasoning enhancement methods. Across six benchmark datasets spanning four distinct tasks, MGRS achieves an average accuracy of 82.9%, outperforming state-of-the-art baselines by a clear margin of 2.1%. Remarkably, on the 24-point game, MGRS attains 100% accuracy for the first time, while delivering a 13.6x speed-up compared to the leading Forest of Thoughts framework.
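
The final selection step can be sketched as follows. This is a loose illustration, not the MGRS implementation: it assumes that a trajectory's cumulative success rate is the product of its per-step success estimates, which the abstract does not confirm, and the data structure and numbers are made up.

```python
from math import prod


def select_most_reliable(chains):
    """Pick the chain whose step-level success estimates multiply to the highest value."""
    scored = [(prod(chain["step_success"]), chain["answer"]) for chain in chains]
    best_score, best_answer = max(scored)
    return best_answer, best_score


# Hypothetical chains with made-up per-step success estimates.
chains = [
    {"answer": "24", "step_success": [0.9, 0.8, 0.95]},
    {"answer": "23", "step_success": [0.99, 0.4, 0.9]},
]
answer, score = select_most_reliable(chains)
```

Under this assumption a single weak step (0.4 in the second chain) pulls the whole trajectory down, so a chain of uniformly solid steps wins even if none of them is individually the most confident.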

[NLP-19] Dripper: Token-Efficient Main HTML Extraction with a Lightweight LM

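Summary: This paper addresses efficient and accurate extraction of main content from general web pages, which matters for building training data for large language models. Existing generative approaches are limited by context window length, inference cost, and format hallucination. The Dripper framework responds with four components: (1) an HTML simplification algorithm that cuts the input token count to 22% of the raw HTML while preserving key structural information; (2) a reformulation of main content extraction as semantic block sequence classification, which lowers inference cost; (3) a controlled decoding mechanism that uses logits processors to strictly constrain the output space and remove the hallucinations common in small models; and (4) WebMainBench, an evaluation dataset of more than 7,800 web pages with human-annotated labels. With only a 0.6B-parameter model, Dripper reports state-of-the-art results across the evaluated benchmarks, reaching a ROUGE-N F1 of 81.58% on WebMainBench (83.13% with the fall-back strategy).

Link: https://arxiv.org/abs/2511.23119
Authors: Mengjie Liu,Jiahui Peng,Pei Chu,Jiantao Qiu,Ren Ma,He Zhu,Rui Min,Lindong Lu,Wenchang Ning,Linfeng Hou,Kaiwen Liu,Yuan Qu,Zhenxiang Li,Chao Xu,Zhongying Tu,Wentao Zhang,Conghui He
Affiliations: Peking University
Categories: Computation and Language (cs.CL)
Comments: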

Abstract:Accurately and efficiently extracting main content from general web pages is of great significance for obtaining training data for large models. Using well-pre-trained decoder-only generative language models offers excellent document comprehension capabilities, thereby effectively enhancing parsing quality. However, it remains constrained by issues such as context window length, inference cost, and format hallucination. We present Dripper, an efficient HTML main content extraction framework powered by lightweight language models, which addresses these challenges through four key innovations: (1) We design a specialized HTML simplification algorithm that reduces input token count to 22% compared to raw HTML while preserving critical structural information; (2) We reformulate main content extraction as a semantic block sequence classification task, significantly reducing inference cost; (3) We introduce a controlled decoding mechanism that strictly constrains the output space through logits processors, effectively eliminating hallucination issues common in small-scale models; (4) We propose WebMainBench, an evaluation dataset containing over 7,800 web pages with meticulously human-annotated main content extraction labels. Experimental results demonstrate that using only a 0.6B parameter model, Dripper achieves state-of-the-art performance across all evaluation benchmarks and outperforms all baseline methods, attaining a ROUGE-N F1 score of 81.58% (83.13% with fall-back strategy) on our proposed WebMainBench dataset.
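
The idea behind constraining the output space can be sketched in a few lines. This is a toy illustration rather than Dripper's code: the label names `main` and `other`, the function name, and the logits are all hypothetical, and a real logits processor would typically set disallowed scores to negative infinity before sampling instead of filtering a dictionary.

```python
def constrained_argmax(logits, allowed_tokens):
    """Return the best-scoring token among the allowed ones, ignoring all others."""
    masked = {tok: score for tok, score in logits.items() if tok in allowed_tokens}
    if not masked:
        raise ValueError("no allowed token present in logits")
    return max(masked, key=masked.get)


# Hypothetical label vocabulary for block classification.
ALLOWED = {"main", "other"}

# The raw model prefers a free-form token, but the mask rules it out.
logits = {"main": 1.2, "other": 0.7, "<div>": 3.5, "Sure": 2.9}
label = constrained_argmax(logits, ALLOWED)
```

Because only valid labels can ever be emitted, the output always parses, which is why this kind of masking removes format hallucination by construction.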

[NLP-20] Mind Reading or Misreading? LLMs on the Big Five Personality Test

Summary: This paper examines how reliable large language models (LLMs) are for automatic personality prediction from text (APPT) under the binary Five Factor Model (BIG5). It systematically compares model configurations (GPT-4 and lightweight open-source models), two prompting strategies (minimal vs. enriched), and three datasets (Essays, MyPersonality, Pandora). Prompts enriched with linguistic and psychological cues reduce invalid outputs and improve class balance, but they also introduce a systematic bias toward predicting that a trait is present. The paper further notes that aggregate metrics such as accuracy and macro-F1 can mask asymmetries between classes, whereas per-class recall is more diagnostic, and it offers methodological guidance for interpretable and reliable APPT.

Link: https://arxiv.org/abs/2511.23101
Authors: Francesco Di Cursi,Chiara Boldrini,Marco Conti,Andrea Passarella
Affiliations: IIT-CNR (Institute of Informatics and Telematics, National Research Council of Italy)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Funding: SoBigData.it (IR0000013), FAIR (PE00000013), ICSC (CN00000013)

Abstract:We evaluate large language models (LLMs) for automatic personality prediction from text under the binary Five Factor Model (BIG5). Five models – including GPT-4 and lightweight open-source alternatives – are tested across three heterogeneous datasets (Essays, MyPersonality, Pandora) and two prompting strategies (minimal vs. enriched with linguistic and psychological cues). Enriched prompts reduce invalid outputs and improve class balance, but also introduce a systematic bias toward predicting trait presence. Performance varies substantially: Openness and Agreeableness are relatively easier to detect, while Extraversion and Neuroticism remain challenging. Although open-source models sometimes approach GPT-4 and prior benchmarks, no configuration yields consistently reliable predictions in zero-shot binary settings. Moreover, aggregate metrics such as accuracy and macro-F1 mask significant asymmetries, with per-class recall offering clearer diagnostic value. These findings show that current out-of-the-box LLMs are not yet suitable for APPT, and that careful coordination of prompt design, trait framing, and evaluation metrics is essential for interpretable results.
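
The point about aggregate metrics hiding asymmetries is easy to reproduce on toy data. The sketch below uses made-up labels, not the paper's results, and a hypothetical helper `per_class_recall`; it shows a classifier biased toward "trait present" that still looks acceptable on accuracy.

```python
def per_class_recall(y_true, y_pred):
    """Recall for each class: correct predictions divided by true instances."""
    recalls = {}
    for label in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == label)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        recalls[label] = hits / total
    return recalls


# Toy data: a classifier biased toward predicting the trait as present (1).
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recalls = per_class_recall(y_true, y_pred)
```

Here accuracy is 0.7, yet recall is 1.0 for the "present" class and only 0.25 for the "absent" class, so the single aggregate number conceals that the model almost never recognizes absence.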

[NLP-21] Accent Placement Models for Rigvedic Sanskrit Text AACL

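Summary: This paper addresses automatic accent restoration for the Rigveda, whose pitch-accent marks (udātta, anudātta, svarita) are often omitted from modern e-texts. The central challenge is to place the missing marks accurately while keeping processing Unicode-safe and the orthography intact. The authors build a parallel corpus of accented and unaccented verses and compare three approaches: (i) full fine-tuning of ByT5, a byte-level Transformer that operates directly on Unicode combining marks; (ii) a BiLSTM-CRF sequence-labeling baseline trained from scratch; and (iii) parameter-efficient fine-tuning of ByT5 with LoRA (Low-Rank Adaptation). Full fine-tuning performs best, LoRA offers a good balance between accuracy and compute, and the study stresses Unicode-safe preprocessing, mark-aware tokenization, and metrics that separate grapheme errors from accent errors, such as the Diacritic Error Rate (DER).

Link: https://arxiv.org/abs/2511.23088
Authors: Akhil Rajeev P,Annarao Kulkarni
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: Submitted to AACL-IJCNLP 2025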

Abstract:The Rigveda, among the oldest Indian texts in Vedic Sanskrit, employs a distinctive pitch-accent system (udātta, anudātta, svarita) whose marks encode melodic and interpretive cues but are often absent from modern e-texts. This work develops a parallel corpus of accented-unaccented ślokas and conducts a controlled comparison of three strategies for automatic accent placement in Rigvedic verse: (i) full fine-tuning of ByT5, a byte-level Transformer that operates directly on Unicode combining marks, (ii) a from-scratch BiLSTM-CRF sequence-labeling baseline, and (iii) LoRA-based parameter-efficient fine-tuning atop ByT5. Evaluation uses Word Error Rate (WER) and Character Error Rate (CER) for orthographic fidelity, plus a task-specific Diacritic Error Rate (DER) that isolates accent edits. Full ByT5 fine-tuning attains the lowest error across all metrics; LoRA offers strong efficiency-accuracy trade-offs, and BiLSTM-CRF serves as a transparent baseline. The study underscores practical requirements for accent restoration - Unicode-safe preprocessing, mark-aware tokenization, and evaluation that separates grapheme from accent errors - and positions heritage-language technology as an emerging NLP area connecting computational modeling with philological and pedagogical aims. Results establish reproducible baselines for Rigvedic accent restoration and provide guidance for downstream tasks such as accent-aware OCR, ASR/chant synthesis, and digital scholarship.
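
Building the unaccented side of such a parallel corpus can be sketched with the standard library. This is an assumption-laden illustration, not the authors' pipeline: it assumes the accents are encoded with the two Unicode combining marks U+0951 and U+0952, and the helper names are hypothetical. How those two code points map onto the three named accents is outside the scope of the sketch.

```python
import unicodedata

# Assumption: accents are encoded with these two Unicode combining marks.
VEDIC_ACCENT_MARKS = {"\u0951", "\u0952"}  # DEVANAGARI STRESS SIGN UDATTA / ANUDATTA


def strip_accents(text):
    """Drop the accent marks while leaving every other code point untouched."""
    # Normalize first so that equivalent encodings compare equal downstream.
    normalized = unicodedata.normalize("NFC", text)
    return "".join(ch for ch in normalized if ch not in VEDIC_ACCENT_MARKS)


def count_accents(text):
    """Number of accent marks in the text."""
    return sum(1 for ch in text if ch in VEDIC_ACCENT_MARKS)


# Opening word of the Rigveda with its two accent marks written as escapes.
accented = "अ\u0952ग्निमी\u0951ळे"
unaccented = strip_accents(accented)
```

Removing only the listed code points, rather than every combining mark, matters here because Devanagari vowel signs and the virama are also combining characters and must survive.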

[NLP-22] Bharat Scene Text: A Novel Comprehensive Dataset and Benchmark for Indian Language Scene Text Understanding

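Summary: This paper addresses the challenges of scene text recognition (STR) for Indian languages: script diversity, non-standard fonts, varying writing styles, and a lack of high-quality datasets and open-source models. It introduces the Bharat Scene Text Dataset (BSTD), a large-scale benchmark covering 11 Indian languages and English, with more than 100K words from over 6,500 scene images captured across India's linguistic regions. The dataset supports scene text detection, script identification, cropped word recognition, and end-to-end recognition. Fine-tuning state-of-the-art English STR models on it highlights both the challenges and the opportunities in Indian-language STR.

Link: https://arxiv.org/abs/2511.23071
Authors: Anik De,Abhirama Subramanyam Penamakuri,Rajeev Yadav,Aditya Rathore,Harshiv Shah,Devesh Sharma,Sagar Agarwal,Pravin Kumar,Anand Mishra
Affiliations: IIT Jodhpur
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Under Peer Review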

Abstract:Reading scene text, that is, text appearing in images, has numerous application areas, including assistive technology, search, and e-commerce. Although scene text recognition in English has advanced significantly and is often considered nearly a solved problem, Indian language scene text recognition remains an open challenge. This is due to script diversity, non-standard fonts, and varying writing styles, and, more importantly, the lack of high-quality datasets and open-source models. To address these gaps, we introduce the Bharat Scene Text Dataset (BSTD) - a large-scale and comprehensive benchmark for studying Indian Language Scene Text Recognition. It comprises more than 100K words that span 11 Indian languages and English, sourced from over 6,500 scene images captured across various linguistic regions of India. The dataset is meticulously annotated and supports multiple scene text tasks, including: (i) Scene Text Detection, (ii) Script Identification, (iii) Cropped Word Recognition, and (iv) End-to-End Scene Text Recognition. We evaluated state-of-the-art models originally developed for English by adapting (fine-tuning) them for Indian languages. Our results highlight the challenges and opportunities in Indian language scene text recognition. We believe that this dataset represents a significant step toward advancing research in this domain. All our models and data are open source.

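[NLP-23] Conveying Imagistic Thinking in TCM Translation: A Prompt Engineering and LLM-Based Evaluation Framework

Summary: This paper addresses a problem in English translations of Traditional Chinese Medicine (TCM) theory: reliance on literal rendering makes it hard for target-language readers to reconstruct the underlying conceptual networks and apply them in clinical practice. The solution is a human-in-the-loop (HITL) framework in which prompt-based cognitive scaffolding guides a large language model (LLM) to identify metaphor and metonymy in the source text and convey the theory in translation. In the evaluation, prompt-adjusted LLM translations score best across all five cognitive dimensions, with high cross-model and cross-role consistency, which the author presents as evidence that the method is efficient and replicable for ancient, concept-dense texts.

Link: https://arxiv.org/abs/2511.23059
Authors: Jiatong Han
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 3 figures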

Abstract:Traditional Chinese Medicine theory is built on imagistic thinking, in which medical principles and diagnostic and therapeutic logic are structured through metaphor and metonymy. However, existing English translations largely rely on literal rendering, making it difficult for target-language readers to reconstruct the underlying conceptual networks and apply them in clinical practice. This study adopted a human-in-the-loop framework and selected four passages from the medical canon Huangdi Neijing that are fundamental in theory. Through prompt-based cognitive scaffolding, DeepSeek V3.1 was guided to identify metaphor and metonymy in the source text and convey the theory in translation. In the evaluation stage, ChatGPT 5 Pro and Gemini 2.5 Pro were instructed by prompts to simulate three types of real-world readers. Human translations, baseline model translations, and prompt-adjusted translations were scored by the simulated readers across five cognitive dimensions, followed by structured interviews and Interpretative Phenomenological Analysis. Results show that the prompt-adjusted LLM translations perform best across all five dimensions, with high cross-model and cross-role consistency. The interview themes reveal differences between human and machine translation, effective strategies for metaphor and metonymy transfer, and readers’ cognitive preferences. This study provides a cognitive, efficient and replicable HITL methodological pathway for translation of ancient, concept-dense texts like TCM.

[NLP-24] Standard Occupation Classifier – A Natural Language Processing Approach

Summary: This paper uses natural language processing (NLP) to assign occupation codes to job advertisements, so that labour demand can be tracked from large volumes of unstructured postings. Classifiers are built for both the UK ONS Standard Occupational Classification (SOC) and the US O*NET SOC. The best-performing system is an ensemble that combines Google BERT with a neural network classifier and takes job title, description, and skills into account; it reaches up to 61% accuracy at the fourth tier of SOC and 72% at the third tier.

Link: https://arxiv.org/abs/2511.23057
Authors: Sidharth Rony,Jack Patman
Affiliations: Royal Holloway, University of London
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); General Economics (econ.GN)
Comments:

Abstract:Standard Occupational Classifiers (SOC) are systems used to categorize and classify different types of jobs and occupations based on their similarities in terms of job duties, skills, and qualifications. Integrating these facets with Big Data from job advertisement offers the prospect to investigate labour demand that is specific to various occupations. This project investigates the use of recent developments in natural language processing to construct a classifier capable of assigning an occupation code to a given job advertisement. We develop various classifiers for both UK ONS SOC and US O*NET SOC, using different Language Models. We find that an ensemble model, which combines Google BERT and a Neural Network classifier while considering job title, description, and skills, achieved the highest prediction accuracy. Specifically, the ensemble model exhibited a classification accuracy of up to 61% for the lower (or fourth) tier of SOC, and 72% for the third tier of SOC. This model could provide up to date, accurate information on the evolution of the labour market using job advertisements.

[NLP-25] Decoding the Past: Explainable Machine Learning Models for Dating Historical Texts

Summary: This paper addresses temporal dating of historical texts, that is, predicting when an English text was written, to support the organization and interpretation of cultural heritage collections. It builds interpretable, feature-engineered tree-based machine learning models that integrate five complementary feature categories: compression-based, lexical structure, readability, neologism detection, and distance features. The combined model reaches 76.7% accuracy at century scale and 26.1% at decade scale, well above the random baselines of 20% and 2.3%, with AUCROC up to 94.8% and mean absolute deviations of 27 and 30 years respectively. SHAP analysis reveals systematic patterns of linguistic change, with the 19th century emerging as a pivot point across feature domains.

Link: https://arxiv.org/abs/2511.23056
Authors: Paulo J. N. Pinto,Armando J. Pinho,Diogo Pratas
Affiliations: IEETA - Institute of Electronics and Informatics Engineering of Aveiro; LASI - Intelligent Systems Associate Laboratory; DETI - Department of Electronics, Telecommunications and Informatics; University of Aveiro; DoV - Department of Virology; University of Helsinki
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Accurately dating historical texts is essential for organizing and interpreting cultural heritage collections. This article addresses temporal text classification using interpretable, feature-engineered tree-based machine learning models. We integrate five feature categories - compression-based, lexical structure, readability, neologism detection, and distance features - to predict the temporal origin of English texts spanning five centuries. Comparative analysis shows that these feature domains provide complementary temporal signals, with combined models outperforming any individual feature set. On a large-scale corpus, we achieve 76.7% accuracy for century-scale prediction and 26.1% for decade-scale classification, substantially above random baselines (20% and 2.3%). Under relaxed temporal precision, performance increases to 96.0% top-2 accuracy for centuries and 85.8% top-10 accuracy for decades. The final model exhibits strong ranking capabilities with AUCROC up to 94.8% and AUPRC up to 83.3%, and maintains controlled errors with mean absolute deviations of 27 years and 30 years, respectively. For authentication-style tasks, binary models around key thresholds (e.g., 1850-1900) reach 85-98% accuracy. Feature importance analysis identifies distance features and lexical structure as most informative, with compression-based features providing complementary signals. SHAP explainability reveals systematic linguistic evolution patterns, with the 19th century emerging as a pivot point across feature domains. Cross-dataset evaluation on Project Gutenberg highlights domain adaptation challenges, with accuracy dropping by 26.4 percentage points, yet the computational efficiency and interpretability of tree-based models still offer a scalable, explainable alternative to neural architectures.
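
The abstract does not say how the compression-based and distance features are computed. One well-known option in this family is the normalized compression distance (NCD), sketched below with `zlib`; treating it as representative of the paper's features is an assumption, and the two snippets are invented stand-ins for texts from different periods.

```python
import zlib


def compressed_size(text):
    """Length in bytes of the zlib-compressed UTF-8 encoding."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def ncd(x, y):
    """Normalized compression distance: lower means the texts share more structure."""
    cx, cy, cxy = compressed_size(x), compressed_size(y), compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Made-up snippets standing in for texts from different periods.
old_style = "Thou art come hither, and whither goest thou anon? " * 20
modern_style = "The meeting is moved to Friday, please update your calendar. " * 20

same = ncd(old_style, old_style)
different = ncd(old_style, modern_style)
```

A text compared with itself compresses almost for free, so `same` is much smaller than `different`; distances from a document to reference texts of known date could then serve as inputs to a tree-based classifier.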

[NLP-26] Social Perceptions of English Spelling Variation on Twitter: A Comparative Analysis of Human and LLM Responses

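Summary: This paper studies the social perception of spelling variation in online English writing: how different spellings shape judgments about a text and its writer, such as formality, carefulness, and age. Using sociolinguistic methodology, it compares human and large language model (LLM) ratings on these three attributes to assess how well LLM perceptions align with human ones. Correlations are generally strong, but notable differences appear in the distribution of ratings and across types of spelling variation.

Link: https://arxiv.org/abs/2511.23041
Authors: Dong Nguyen,Laura Rosseel
Affiliations: Utrecht University; Vrije Universiteit Brussel
Categories: Computation and Language (cs.CL)
Comments: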

Abstract:Spelling variation (e.g. funnnn vs. fun) can influence the social perception of texts and their writers: we often have various associations with different forms of writing (is the text informal? does the writer seem young?). In this study, we focus on the social perception of spelling variation in online writing in English and study to what extent this perception is aligned between humans and large language models (LLMs). Building on sociolinguistic methodology, we compare LLM and human ratings on three key social attributes of spelling variation (formality, carefulness, age). We find generally strong correlations in the ratings between humans and LLMs. However, notable differences emerge when we analyze the distribution of ratings and when comparing between different types of spelling variation.

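[NLP-27] ShoppingComp: Are LLMs Really Ready for Your Shopping Cart?

Summary: This paper addresses the limited reliability and safety of large language models (LLMs) deployed in e-commerce, in particular their weak performance on realistic shopping tasks and their tendency to make harmful recommendations. It introduces ShoppingComp, a benchmark of 120 tasks and 1,026 scenarios curated by 35 experts to reflect authentic shopping needs. Alongside recommendation accuracy and report quality, the benchmark adds product safety hazard identification as an evaluation dimension, covering precise product retrieval, expert-level report generation, and safety-critical decision making. Even state-of-the-art models score low (11.22% for GPT-5 and 3.92% for Gemini-2.5-Flash), which points to a substantial gap between research benchmarks and real-world deployment.

Link: https://arxiv.org/abs/2511.22978
Authors: Huaixiao Tou,Ying Zeng,Cong Ma,Muzhi Li,Minghao Li,Weijie Yuan,He Zhang,Kai Jia
Affiliations: ByteDance
Categories: Computation and Language (cs.CL)
Comments: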

Abstract:We present ShoppingComp, a challenging real-world benchmark for rigorously evaluating LLM-powered shopping agents on three core capabilities: precise product retrieval, expert-level report generation, and safety critical decision making. Unlike prior e-commerce benchmarks, ShoppingComp introduces highly complex tasks under the principle of guaranteeing real products and ensuring easy verifiability, adding a novel evaluation dimension for identifying product safety hazards alongside recommendation accuracy and report quality. The benchmark comprises 120 tasks and 1,026 scenarios, curated by 35 experts to reflect authentic shopping needs. Results reveal stark limitations of current LLMs: even state-of-the-art models achieve low performance (e.g., 11.22% for GPT-5, 3.92% for Gemini-2.5-Flash). These findings highlight a substantial gap between research benchmarks and real-world deployment, where LLMs make critical errors such as failure to identify unsafe product usage or falling for promotional misinformation, leading to harmful recommendations. ShoppingComp fills the gap and thus establishes a new standard for advancing reliable and practical agents in e-commerce.

[NLP-28] Pooling Attention: Evaluating Pretrained Transformer Embeddings for Deception Classification

Summary: This paper treats fake news detection as a downstream test of pretrained Transformer representations and asks how different architectures (BERT, GPT-2, Transformer-XL) perform when used as frozen embedders. The models are frozen as feature extractors and paired with lightweight classifiers such as logistic regression, under controlled preprocessing that compares pooling with padding and neural with linear heads. Contextual self-attention encodings transfer consistently well: BERT embeddings with logistic regression outperform neural baselines on the LIAR dataset splits, and simple max or average pooling remains robust to sequence truncation.

Link: https://arxiv.org/abs/2511.22977
Authors: Sumit Mamtani,Abhijeet Bhure
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at the IEEE 7th Computing, Communications and IoT Applications Conference (ComComAp 2025), Madrid, Spain, December 2025. 6 pages

Abstract:This paper investigates fake news detection as a downstream evaluation of Transformer representations, benchmarking encoder-only and decoder-only pre-trained models (BERT, GPT-2, Transformer-XL) as frozen embedders paired with lightweight classifiers. Through controlled preprocessing comparing pooling versus padding and neural versus linear heads, results demonstrate that contextual self-attention encodings consistently transfer effectively. BERT embeddings combined with logistic regression outperform neural baselines on LIAR dataset splits, while analyses of sequence length and aggregation reveal robustness to truncation and advantages from simple max or average pooling. This work positions attention-based token encoders as robust, architecture-centric foundations for veracity tasks, isolating Transformer contributions from classifier complexity.
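
Max and average pooling turn a variable-length sequence of token embeddings into one fixed-size vector that a linear classifier can consume. A minimal sketch in plain Python follows, with made-up two-dimensional embeddings; a real pipeline would pool the frozen model's hidden states instead.

```python
def mean_pool(token_vectors):
    """Average each dimension over the tokens."""
    n = len(token_vectors)
    return [sum(dim) / n for dim in zip(*token_vectors)]


def max_pool(token_vectors):
    """Take the maximum of each dimension over the tokens."""
    return [max(dim) for dim in zip(*token_vectors)]


# Toy 3-token, 2-dimensional embeddings (made-up numbers).
tokens = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]
sentence_mean = mean_pool(tokens)
sentence_max = max_pool(tokens)
```

Both results have the embedding dimension regardless of how many tokens went in, which is why pooling avoids the padding needed when token vectors are concatenated.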

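[NLP-29] Training-Free Loosely Speculative Decoding: Accepting Semantically Correct Drafts Beyond Exact Match

Summary: This paper addresses the high inference latency that autoregressive generation imposes on large language models (LLMs), and the fact that the strict exact-match verification used in speculative decoding (SPD) discards many semantically valid continuations. It proposes Training-Free Loosely Speculative Decoding (FLy), which relies on the target model's self-corrective behavior to judge whether a draft-target token mismatch is still semantically valid. FLy uses a two-tier mechanism: an entropy-level gate that identifies whether the current token allows multiple plausible alternatives or is nearly deterministic, and a token-level deferred window that separates genuine errors from differently worded but semantically correct variants. A multi-level acceleration strategy speeds up both the target model and the drafter, and because the method is training-free it works with arbitrary draft-target pairs and generalizes across models and domains without hyperparameter re-tuning.

Link: https://arxiv.org/abs/2511.22972
Authors: Jinze Li,Yixing Xu,Guanchen Li,Shuo Yang,Jinfeng Xu,Xuanwu Yin,Dong Li,Edith C.H. Ngai,Emad Barsoum
Affiliations: Advanced Micro Devices, Inc.; The University of Hong Kong
Categories: Computation and Language (cs.CL)
Comments: Under review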

Abstract:Large language models (LLMs) achieve strong performance across diverse tasks but suffer from high inference latency due to their autoregressive generation. Speculative Decoding (SPD) mitigates this issue by verifying candidate tokens in parallel from a smaller draft model, yet its strict exact-match verification discards many semantically valid continuations. Moreover, existing training-based SPD methods often suffer from performance degradation on out-of-distribution (OOD) tasks. To this end, we propose Training-Free Loosely Speculative Decoding (FLy), a novel method that loosens the rigid verification criterion by leveraging the target model’s self-corrective behavior to judge whether a draft-target mismatch remains semantically valid. FLy introduces a two-tier mechanism: an entropy-level gate that identifies whether the current token allows multiple plausible alternatives or is nearly deterministic, and a token-level deferred window that distinguishes genuine errors from differently worded yet semantically correct variants. To further reduce latency, we design a multi-level acceleration strategy that accelerates not only the target model but also the drafter itself. Owing to its training-free design, FLy composes seamlessly with arbitrary draft-target pairs and generalizes across models and domains without hyperparameter re-tuning. Experiments show that FLy preserves more than 99% of the target model’s accuracy while achieving an average 2.81x speedup on Llama-3.1-70B-Instruct and 5.07x speedup on the 405B variant. Notably, on out-of-domain datasets, our method remains highly effective and outperforms the training-based method EAGLE-3 by 1.62x.
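
The first tier, the entropy-level gate, can be illustrated with a small sketch. This is not FLy's implementation: the threshold value, the function names, and the rule "high entropy means a mismatch may be tolerated" are assumptions read off the abstract, and the deferred window is not modeled at all.

```python
import math


def entropy(probs):
    """Shannon entropy in nats of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def allows_alternatives(target_probs, threshold=0.5):
    """True if the target distribution is spread out enough to tolerate a mismatch."""
    return entropy(target_probs) > threshold


# Hypothetical next-token distributions from the target model.
near_deterministic = [0.98, 0.01, 0.01]
several_plausible = [0.4, 0.35, 0.25]

strict_case = allows_alternatives(near_deterministic)
loose_case = allows_alternatives(several_plausible)
```

When the target model is nearly certain about the next token, a differing draft token is most likely a real error and exact match is kept; when several tokens are plausible, a mismatch can be passed on for further checking instead of being rejected outright.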

[NLP-30] Visual Puns from Idioms: An Iterative LLM -T2IM-MLLM Framework ICASSP2026

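[Quick Read]: This paper addresses how, in generative AI, to automatically create and evaluate idiom-based visual puns, that is, how to design an image generation and understanding framework that visually unifies an idiom's literal and figurative meanings. The key to its solution is an iterative system that coordinates a large language model (LLM), a text-to-image model (T2IM), and a multimodal LLM (MLLM), cycling through four steps (generating a detailed visual prompt, synthesizing an image, inferring the idiom from the image, and refining the prompt based on the recognition result) until recognition succeeds or the maximum number of iterations is reached. Experiments show that the choice of MLLM is the main determinant of performance, while Claude performs best at prompt generation.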

Link: https://arxiv.org/abs/2511.22943
Authors: Kelaiti Xiao,Liang Yang,Dongyu Zhang,Paerhati Tulajiang,Hongfei Lin
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to ICASSP 2026 (under review)

Abstract:We study idiom-based visual puns (images that align an idiom’s literal and figurative meanings) and present an iterative framework that coordinates a large language model (LLM), a text-to-image model (T2IM), and a multimodal LLM (MLLM) for automatic generation and evaluation. Given an idiom, the system iteratively (i) generates detailed visual prompts, (ii) synthesizes an image, (iii) infers the idiom from the image, and (iv) refines the prompt until recognition succeeds or a step limit is reached. Using 1,000 idioms as inputs, we synthesize a corresponding dataset of visual pun images with paired prompts, enabling benchmarking of both generation and understanding. Experiments across 10 LLMs, 10 MLLMs, and one T2IM (Qwen-Image) show that MLLM choice is the primary performance driver: GPT achieves the highest accuracies, Gemini follows, and the best open-source MLLM (Gemma) is competitive with some closed models. On the LLM side, Claude attains the strongest average performance for prompt generation.
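The four-step loop in the abstract can be sketched as plain control flow. Everything model-specific is an assumption here: `write_prompt`, `render`, and `guess_idiom` are hypothetical callables standing in for the LLM, the T2IM, and the MLLM, and the case-insensitive exact-match success test is a simplification chosen for illustration.

```python
from typing import Callable, Optional, Tuple

def generate_visual_pun(
    idiom: str,
    write_prompt: Callable[[str, Optional[str]], str],  # stands in for the LLM
    render: Callable[[str], object],                     # stands in for the T2IM
    guess_idiom: Callable[[object], str],                # stands in for the MLLM
    max_steps: int = 5,
) -> Tuple[bool, int, str]:
    """Run the generate / synthesize / infer / refine loop.

    Returns (recognized, steps_used, last_prompt)."""
    feedback: Optional[str] = None
    prompt = ""
    for step in range(1, max_steps + 1):
        prompt = write_prompt(idiom, feedback)  # (i) first prompt, or (iv) refinement
        image = render(prompt)                  # (ii) synthesize an image
        guess = guess_idiom(image)              # (iii) infer the idiom from the image
        if guess.strip().lower() == idiom.strip().lower():
            return True, step, prompt
        feedback = guess                        # the wrong guess guides the next prompt
    return False, max_steps, prompt

# Toy stand-ins: the "image" is just the prompt text, and the writer only
# produces a useful prompt once it has received feedback.
def toy_writer(idiom, feedback):
    return idiom if feedback else "a vague scene"

result = generate_visual_pun(
    "spill the beans", toy_writer, lambda p: p, lambda img: img
)
```

Passing the models in as callables keeps the loop independent of any particular API, which matches the abstract's setup of swapping many LLMs and MLLMs around one fixed T2IM.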

[NLP-31] Artwork Interpretation with Vision Language Models: A Case Study on Emotions and Emotion Symbols AACL

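[Quick Read]: This paper asks how effective current Vision Language Models (VLMs) are at recognizing and understanding emotional expression in artworks, and where their limits lie. The key to its approach is a case study that systematically evaluates three mainstream VLMs (Llava-Llama and two Qwen models) on four sets of questions of increasing complexity (general image content, emotional content, expression of emotions, and emotion symbols), combined with a qualitative evaluation by art history experts, to reveal how recognition ability varies with the level of abstraction of the image. The results show that VLMs recognize emotional content and its expression fairly well for concrete images, but performance drops markedly on highly abstract or symbolic images, symbol recognition remains fundamentally difficult, and the models still show the inconsistency typical of Large Language Models (LLMs).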

Link: https://arxiv.org/abs/2511.22929
Authors: Sebastian Padó,Kerstin Thomas
Affiliations: Institute for Natural Language Processing (IMS); Institute for Art History (IKG)
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Accepted for publication at the IJCNLP-AACL workshop on Multimodal Models for Low-Resource Contexts and Social Impact

Abstract:Emotions are a fundamental aspect of artistic expression. Due to their abstract nature, there is a broad spectrum of emotion realization in artworks. These are subject to historical change and their analysis requires expertise in art history. In this article, we investigate which aspects of emotional expression can be detected by current (2025) vision language models (VLMs). We present a case study of three VLMs (Llava-Llama and two Qwen models) in which we ask these models four sets of questions of increasing complexity about artworks (general content, emotional content, expression of emotions, and emotion symbols) and carry out a qualitative expert evaluation. We find that the VLMs recognize the content of the images surprisingly well and often also which emotions they depict and how they are expressed. The models perform best for concrete images but fail for highly abstract or highly symbolic images. Reliable recognition of symbols remains fundamentally difficult. Furthermore, the models continue to exhibit the well-known LLM weakness of providing inconsistent answers to related questions.

[NLP-32] Language-conditioned world model improves policy generalization by reading environmental descriptions NEURIPS2025

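[Quick Read]: This paper addresses how an agent, in order to interact effectively with humans in the real world, can understand language that describes environment dynamics rather than merely executing task instructions. Existing model-based methods incorporate language into a world model to learn a behavior policy, but they either lack policy generalization or rely on restrictive assumptions (for example, that inference-time planning latency is acceptable or that expert demonstrations are available). The key solution is a model-based reinforcement learning framework that needs neither planning nor expert demonstrations: the Language-aware Encoder for Dreamer World Model (LED-WM). Built on DreamerV3, it adds a language-aware encoder that uses an attention mechanism to explicitly ground language descriptions to entities in the observation, improving policy generalization to unseen games with novel dynamics and language descriptions. Experiments show that LED-WM generalizes more effectively than baselines in environments such as MESSENGER, and that the policy can be improved further before deployment by fine-tuning on synthetic test trajectories.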

Link: https://arxiv.org/abs/2511.22904
Authors: Anh Nguyen,Stefan Lee
Affiliations: Oregon State University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: NeurIPS 2025. Workshop: LAW 2025: Bridging Language, Agent, and World Models

Abstract:To interact effectively with humans in the real world, it is important for agents to understand language that describes the dynamics of the environment (that is, how the environment behaves) rather than just task instructions specifying “what to do”. Understanding this dynamics-descriptive language is important for human-agent interaction and agent behavior. Recent work addresses this problem using a model-based approach: language is incorporated into a world model, which is then used to learn a behavior policy. However, these existing methods either do not demonstrate policy generalization to unseen games or rely on limiting assumptions. For instance, they assume that the latency induced by inference-time planning is tolerable for the target task or that expert demonstrations are available. Expanding on this line of research, we focus on improving policy generalization from a language-conditioned world model while dropping these assumptions. We propose a model-based reinforcement learning approach, where a language-conditioned world model is trained through interaction with the environment, and a policy is learned from this model, without planning or expert demonstrations. We propose the Language-aware Encoder for Dreamer World Model (LED-WM), built on top of DreamerV3. LED-WM features an observation encoder that uses an attention mechanism to explicitly ground language descriptions to entities in the observation. We show that policies trained with LED-WM generalize more effectively to unseen games described by novel dynamics and language compared to other baselines in several settings in two environments: MESSENGER and this http URL. To highlight how the policy can leverage the trained world model before real-world deployment, we demonstrate the policy can be improved through fine-tuning on synthetic test trajectories generated by the world model.

[NLP-33] ORION: Teaching Language Models to Reason Efficiently in the Language of Thought

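[Quick Read]: This paper targets the high latency, redundancy, and incoherent reasoning paths that arise when Large Reasoning Models (LRMs) rely on long, inefficient sequences of "thinking" tokens in tasks such as mathematics, code generation, and task planning. Inspired by the Language of Thought Hypothesis, its core solution introduces Mentalese, a compact symbolic reasoning framework that encodes abstract reasoning as ultra-compressed, structured tokens so that complex problems can be solved in fewer steps. The key innovation is SHORTER LENGTH PREFERENCE OPTIMIZATION (SLPO), a reinforcement-learning-based method that rewards concise but correct reasoning paths, reducing token count and compute cost while preserving accuracy. Experiments show that ORION models built on this framework achieve 4-16x token compression, up to 5x lower inference latency, and 7-9x lower training cost across several benchmarks while retaining 90-98% of the baseline model's accuracy, supporting the value of Mentalese-style compressed reasoning for cognitive efficiency.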

Link: https://arxiv.org/abs/2511.22891
Authors: Kumar Tanmay,Kriti Aggarwal,Paul Pu Liang,Subhabrata Mukherjee
Affiliations: Harvard University; Hippocratic AI; Massachusetts Institute of Technology
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: None

Abstract:Large Reasoning Models (LRMs) achieve strong performance in mathematics, code generation, and task planning, but their reliance on long chains of verbose “thinking” tokens leads to high latency, redundancy, and incoherent reasoning paths. Inspired by the Language of Thought Hypothesis, which posits that human reasoning operates over a symbolic, compositional mental language called Mentalese, we introduce a framework that trains models to reason in a similarly compact style. Mentalese encodes abstract reasoning as ultra-compressed, structured tokens, enabling models to solve complex problems with far fewer steps. To improve both efficiency and accuracy, we propose SHORTER LENGTH PREFERENCE OPTIMIZATION (SLPO), a reinforcement learning method that rewards concise solutions that stay correct, while still allowing longer reasoning when needed. Applied to Mentalese-aligned models, SLPO yields significantly higher compression rates by enabling concise reasoning that preserves the benefits of detailed thinking without the computational overhead. Across benchmarks including AIME 2024 and 2025, MinervaMath, OlympiadBench, Math500, and AMC, our ORION models produce reasoning traces with 4-16x fewer tokens, achieve up to 5x lower inference latency, and reduce training costs by 7-9x relative to the DeepSeek R1 Distilled model, while maintaining 90-98% of its accuracy. ORION also surpasses Claude and ChatGPT-4o by up to 5% in accuracy while maintaining 2x compression. These results show that Mentalese-style compressed reasoning offers a step toward human-like cognitive efficiency, enabling real-time, cost-effective reasoning without sacrificing accuracy.

[NLP-34] FEANEL: A Benchmark for Fine-Grained Error Analysis in K-12 English Writing

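[Quick Read]: This paper addresses the limited ability of Large Language Models (LLMs) to perform fine-grained error analysis in K-12 English writing instruction. The key to its solution is FEANEL (Fine-grained Error ANalysis for English Learners), a fine-grained error analysis benchmark for English learners. It contains 1,000 essays written by elementary and secondary school students and uses a part-of-speech-based error taxonomy, with language education experts annotating each error's type, severity, and pedagogical feedback, enabling systematic evaluation of LLMs' error identification and teaching guidance abilities in educational settings.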

Link: https://arxiv.org/abs/2511.22883
Authors: Jingheng Ye,Shen Wang,Jiaqi Chen,Hebin Wang,Deqing Zou,Yanyu Zhu,Jiwei Tang,Hai-Tao Zheng,Ruitong Liu,Haoyang Li,Yanfeng Wang,Qingsong Wen
Affiliations: Squirrel Ai Learning; Tsinghua University; Shanghai Jiao Tong University
Categories: Computation and Language (cs.CL)
Comments: 19 pages, 7 figures, and 4 tables. The dataset is available at this https URL

Abstract:Large Language Models (LLMs) have transformed artificial intelligence, offering profound opportunities for educational applications. However, their ability to provide fine-grained educational feedback for K-12 English writing remains underexplored. In this paper, we challenge the error analysis and pedagogical skills of LLMs by introducing the problem of Fine-grained Error Analysis for English Learners and present the Fine-grained Error ANalysis for English Learners (FEANEL) Benchmark. The benchmark comprises 1,000 essays written by elementary and secondary school students, and a well-developed English writing error taxonomy. Each error is annotated by language education experts and categorized by type, severity, and explanatory feedback, using a part-of-speech-based taxonomy they co-developed. We evaluate state-of-the-art LLMs on the FEANEL Benchmark to explore their error analysis and pedagogical abilities. Experimental results reveal significant gaps in current LLMs’ ability to perform fine-grained error analysis, highlighting the need for advancements in particular methods for educational applications.

[NLP-35] JBE-QA: Japanese Bar Exam QA Dataset for Assessing Legal Domain Knowledge

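[Quick Read]: This paper addresses the lack of a systematic benchmark for evaluating the Japanese legal-domain knowledge of Large Language Models (LLMs). Existing resources focus mostly on the Civil Code, which makes it hard to measure a model's understanding of the Japanese legal system as a whole. The key to its solution is the JBE-QA dataset, derived from the multiple-choice (tanto-shiki) section of the Japanese bar exam (2015-2024), covering the Civil Code, the Penal Code, and the Constitution, and providing the first comprehensive benchmark for evaluating LLMs in the Japanese legal domain. Each question is decomposed into independent true/false judgments with structured contextual fields, for a total of 3,464 items with balanced labels. An evaluation of 26 LLMs of different types shows that proprietary models with reasoning enabled perform best and that Constitution questions are generally easier than Civil Code or Penal Code questions.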

Link: https://arxiv.org/abs/2511.22869
Authors: Zhihan Cao,Fumihito Nishino,Hiroaki Yamada,Nguyen Ha Thanh,Yusuke Miyao,Ken Satoh
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Three tables and one figure

Abstract:We introduce JBE-QA, a Japanese Bar Exam Question-Answering dataset to evaluate large language models’ legal knowledge. Derived from the multiple-choice (tanto-shiki) section of the Japanese bar exam (2015-2024), JBE-QA provides the first comprehensive benchmark for Japanese legal-domain evaluation of LLMs. It covers the Civil Code, the Penal Code, and the Constitution, extending beyond the Civil Code focus of prior Japanese resources. Each question is decomposed into independent true/false judgments with structured contextual fields. The dataset contains 3,464 items with balanced labels. We evaluate 26 LLMs, including proprietary, open-weight, Japanese-specialised, and reasoning models. Our results show that proprietary models with reasoning enabled perform best, and the Constitution questions are generally easier than the Civil Code or the Penal Code questions.

[NLP-36] RAG System for Supporting Japanese Litigation Procedures: Faithful Response Generation Complying with Legal Norms SIGIR

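[Quick Read]: This paper addresses how to design a Retrieval-Augmented Generation (RAG)-based Large Language Model (LLM) system that complies with legal norms, so that it can substitute for the role of expert commissioners in Japanese medical litigation procedures. The core challenge is ensuring that the system, when supplying specialized medical knowledge, strictly observes legal requirements on the source and timeliness of evidence. The key to the solution is a RAG architecture that satisfies three constraints: (1) the retrieval module must retrieve appropriate external knowledge relevant to the disputed issues, in accordance with the principle prohibiting the use of private knowledge; (2) generated responses must originate entirely from the context provided by the RAG and remain faithful to it; and (3) the retrieval module must reference external knowledge with timestamps that correspond to the issues at hand, so that the information matches the medical standard at the time of the case.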

Link: https://arxiv.org/abs/2511.22858
Authors: Yuya Ishihara,Atsushi Keyaki,Hiroaki Yamada,Ryutaro Ohara,Mihoko Sumida
Affiliations: Hitotsubashi University; Institute of Science Tokyo; Nakamura, Tsunoda & Matsumoto
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: This is a preprint version of a paper reviewed and accepted at BREV-RAG 2025: Beyond Relevance-based EValuation of RAG Systems, a SIGIR-AP 2025 workshop

Abstract:This study discusses the essential components that a Retrieval-Augmented Generation (RAG)-based LLM system should possess in order to support Japanese medical litigation procedures complying with legal norms. In litigation, expert commissioners, such as physicians, architects, accountants, and engineers, provide specialized knowledge to help judges clarify points of dispute. When considering the substitution of these expert roles with a RAG-based LLM system, the constraint of strict adherence to legal norms is imposed. Specifically, three requirements arise: (1) the retrieval module must retrieve appropriate external knowledge relevant to the disputed issues in accordance with the principle prohibiting the use of private knowledge, (2) the responses generated must originate from the context provided by the RAG and remain faithful to that context, and (3) the retrieval module must reference external knowledge with appropriate timestamps corresponding to the issues at hand. This paper discusses the design of a RAG-based LLM system that satisfies these requirements.
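Requirement (3), timestamp-appropriate retrieval, lends itself to a small sketch. The abstract does not specify a mechanism, so the following rests on assumptions: the `Document` type, the `retrieve_as_of` function, and the rule "published on or before the incident date" are all hypothetical, and relevance scores are taken as given by some upstream retriever.

```python
from dataclasses import dataclass
from datetime import date
from typing import List

@dataclass
class Document:
    title: str
    published: date
    score: float  # relevance score from an upstream retriever (assumed given)

def retrieve_as_of(docs: List[Document], incident: date, k: int = 3) -> List[Document]:
    """Keep only documents published on or before the incident date,
    then return the k most relevant of those."""
    eligible = [d for d in docs if d.published <= incident]
    return sorted(eligible, key=lambda d: d.score, reverse=True)[:k]

corpus = [
    Document("Guideline 2012", date(2012, 4, 1), 0.70),
    Document("Guideline 2018", date(2018, 6, 1), 0.95),
    Document("Textbook 2009", date(2009, 1, 15), 0.60),
]
hits = retrieve_as_of(corpus, incident=date(2015, 3, 10), k=2)
```

In this toy corpus the 2018 guideline has the highest relevance score but is excluded because it postdates the incident, which is the kind of filtering the timestamp requirement calls for.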

[NLP-37] Mitigating Semantic Drift: Evaluating LLMs Efficacy in Psychotherapy through MI Dialogue Summarization

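[Quick Read]: This paper addresses problems that Large Language Models (LLMs) exhibit in low-resource, sensitive domains such as psychotherapy, including semantic drift, factual errors, inconsistent expressions of empathy, bias, and hallucinations, which limit how accurately LLMs capture the complexity of human understanding. The key to its solution is a mixed-methods design: LLMs summarize motivational interviewing (MI) dialogues, a two-stage annotation scheme is built on six components of the Motivational Interviewing Treatment Integrity (MITI) framework (evocation, collaboration, autonomy, direction, empathy, and a non-judgmental attitude), and multi-class classification tasks are formulated against expert annotations. Model performance is then assessed under progressive prompting techniques (one-shot and few-shot prompting) to improve LLMs' precise contextual understanding in psychotherapy settings.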

Link: https://arxiv.org/abs/2511.22818
Authors: Vivek Kumar,Pushpraj Singh Rajawat,Eirini Ntoutsi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: None

Abstract:Recent advancements in large language models (LLMs) have shown their potential across both general and domain-specific tasks. However, there is a growing concern regarding their lack of sensitivity, factual incorrectness in responses, inconsistent expressions of empathy, bias, hallucinations, and overall inability to capture the depth and complexity of human understanding, especially in low-resource and sensitive domains such as psychology. To address these challenges, our study employs a mixed-methods approach to evaluate the efficacy of LLMs in psychotherapy. We use LLMs to generate precise summaries of motivational interviewing (MI) dialogues and design a two-stage annotation scheme based on key components of the Motivational Interviewing Treatment Integrity (MITI) framework, namely evocation, collaboration, autonomy, direction, empathy, and a non-judgmental attitude. Using expert-annotated MI dialogues as ground truth, we formulate multi-class classification tasks to assess model performance under progressive prompting techniques, incorporating one-shot and few-shot prompting. Our results offer insights into LLMs’ capacity for understanding complex psychological constructs and highlight best practices to mitigate “semantic drift” in therapeutic settings. Our work contributes to the MI community by providing a high-quality annotated dataset that addresses data scarcity in low-resource domains, and it also offers critical insights for using LLMs for precise contextual interpretation in complex behavioral therapy.

[NLP-38] Intelligent Neural Networks: From Layered Architectures to Graph-Organized Intelligence

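[Quick Read]: This paper addresses a limitation of conventional neural network architectures, in which neurons act as "black-box" units without internal state or dynamic communication, restricting computational flexibility and training stability. The core challenge is to design a more biologically inspired yet computationally efficient structure that yields more stable, interpretable, and scalable behavior. The key to its solution is Intelligent Neural Networks (INN), in which every neuron is a first-class entity with internal memory that combines selective state-space dynamics (knowing when to activate) with attention-based routing (knowing to whom to send signals). Neurons are organized in a complete graph, so computation emerges from graph-structured interactions. On the Text8 character modeling task, INN significantly outperforms a comparable Transformer and, under a parameter-matched setup, a Mamba baseline, indicating that the training stability provided by the graph structure and the learned neural routing are both important to performance.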

Link: https://arxiv.org/abs/2511.22813
Authors: Antoine Salomon
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
Comments: Code available at this https URL

Abstract:Biological neurons exhibit remarkable intelligence: they maintain internal states, communicate selectively with other neurons, and self-organize into complex graphs rather than rigid hierarchical layers. What if artificial intelligence could emerge from similarly intelligent computational units? We introduce Intelligent Neural Networks (INN), a paradigm shift where neurons are first-class entities with internal memory and learned communication patterns, organized in complete graphs rather than sequential layers. Each Intelligent Neuron combines selective state-space dynamics (knowing when to activate) with attention-based routing (knowing to whom to send signals), enabling emergent computation through graph-structured interactions. On the standard Text8 character modeling benchmark, INN achieves 1.705 Bit-Per-Character (BPC), significantly outperforming a comparable Transformer (2.055 BPC) and matching a highly optimized LSTM baseline. Crucially, a parameter-matched baseline of stacked Mamba blocks fails to converge (3.4 BPC) under the same training protocol, demonstrating that INN’s graph topology provides essential training stability. Ablation studies confirm this: removing inter-neuron communication degrades performance or leads to instability, proving the value of learned neural routing. This work demonstrates that neuron-centric design with graph organization is not merely bio-inspired; it is computationally effective, opening new directions for modular, interpretable, and scalable neural architectures.
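For readers unfamiliar with the metric quoted above, bits per character (BPC) is the average negative base-2 log-probability that a model assigns to each true character. The sketch below shows the arithmetic only; the helper names are hypothetical and nothing here comes from the paper's code.

```python
import math

def bits_per_character(char_probs):
    """Average negative log2-probability assigned to each true character."""
    return -sum(math.log2(p) for p in char_probs) / len(char_probs)

def nats_to_bpc(loss_nats):
    """Convert a mean cross-entropy loss in nats (natural log) to BPC."""
    return loss_nats / math.log(2)

# A model that gives the true characters probabilities 0.5 and 0.25
# spends 1 bit and 2 bits on them respectively.
example_bpc = bits_per_character([0.5, 0.25])
```

The second helper is the conversion usually needed in practice, since most training frameworks report cross-entropy in nats.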

[NLP-39] PRISM: Privacy-Aware Routing for Adaptive Cloud-Edge LLM Inference via Semantic Sketch Collaboration AAAI2026

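[Quick Read]: This paper addresses the difficulty of balancing privacy protection against inference quality in cloud-edge collaborative inference with Large Language Models (LLMs). Existing methods apply a uniform privacy protection strategy that ignores the semantic sensitivity of the input, so non-sensitive tokens are also perturbed unnecessarily and output quality suffers. The key to its solution is PRISM (Privacy-aware Routing for Inference with Semantic Modulation), a context-aware privacy routing framework with four core mechanisms: (1) the edge device profiles entity-level sensitivity; (2) a soft gating module dynamically selects cloud, edge, or collaborative execution; (3) on the collaborative path, an adaptive two-layer Local Differential Privacy (LDP) mechanism is applied according to entity risk; and (4) the cloud LLM generates a semantic sketch, which the edge-side Small Language Model (SLM) refines using local context. This design trades off privacy and inference quality dynamically, reducing energy consumption and latency to 40-50% of baseline methods under strong privacy constraints while maintaining high output quality.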

Link: https://arxiv.org/abs/2511.22788
Authors: Junfei Zhan,Haoxun Shen,Zheng Lin,Tengjiao He
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: Accepted to AAAI 2026. This is the arXiv preprint version

Abstract:Large Language Models (LLMs) demonstrate impressive capabilities in natural language understanding and generation, but incur high communication overhead and privacy risks in cloud deployments, while facing compute and memory constraints when confined to edge devices. Cloud-edge inference has emerged as a promising paradigm for improving privacy in LLM services by retaining sensitive computations on local devices. However, existing cloud-edge inference approaches apply uniform privacy protection without considering input sensitivity, resulting in unnecessary perturbation and degraded utility even for non-sensitive tokens. To address this limitation, we propose Privacy-aware Routing for Inference with Semantic Modulation (PRISM), a context-aware framework that dynamically balances privacy and inference quality. PRISM executes in four stages: (1) the edge device profiles entity-level sensitivity; (2) a soft gating module on the edge selects an execution mode - cloud, edge, or collaboration; (3) for collaborative paths, the edge applies adaptive two-layer local differential privacy based on entity risks; and (4) the cloud LLM generates a semantic sketch from the perturbed prompt, which is then refined by the edge-side small language model (SLM) using local context. Our results show that PRISM consistently achieves superior privacy-utility trade-offs across various scenarios, reducing energy consumption and latency to 40-50% of baseline methods such as Uniform and Selective LDP, while maintaining high output quality under strong privacy constraints. These findings are validated through comprehensive evaluations involving realistic prompts, actual energy measurements, and heterogeneous cloud-edge model deployments.
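Stage (2), choosing an execution mode, can be illustrated with a deliberately simplified rule. This is a hard-threshold stand-in, not the soft gating module the paper describes: the `route` function, the use of the maximum entity score as the risk signal, and the cut-offs `low=0.2` and `high=0.8` are all assumptions made for the sketch.

```python
def route(sensitivities, low=0.2, high=0.8):
    """Pick an execution mode from per-entity sensitivity scores in [0, 1].

    Hypothetical rule: the riskiest entity decides. Low risk goes to the
    cloud, high risk stays on the edge, anything in between is handled
    collaboratively (where perturbation would be applied before upload)."""
    risk = max(sensitivities, default=0.0)
    if risk < low:
        return "cloud"
    if risk > high:
        return "edge"
    return "collaboration"

modes = [route([]), route([0.1, 0.5]), route([0.9, 0.3])]
```

A learned gate would replace the two thresholds with a trained scoring function, but the three-way outcome is the same as in the abstract.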

[NLP-40] Modeling Romanized Hindi and Bengali: Dataset Creation and Multilingual LLM Integration

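[Quick Read]: This paper addresses the difficulty that current multilingual models have with Romanized script, particularly for Indo-Aryan languages widely used in South Asia such as Hindi and Bengali, where existing transliteration resources fall short because of diverse pronunciation and spelling variants, insufficient code-mixed data, and poor low-resource adaptation. The key to its solution is a large-scale transliteration dataset with nearly 1.8 million Hindi and 1 million Bengali transliteration pairs, on which the authors pre-train a custom multilingual sequence-to-sequence (seq2seq) Large Language Model (LLM) based on the Marian architecture, yielding significant improvements on the BLEU and character error rate (CER) metrics.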

Link: https://arxiv.org/abs/2511.22769
Authors: Kanchon Gharami,Quazi Sarwar Muhtaseem,Deepti Gupta,Lavanya Elluri,Shafika Showkat Moni
Affiliations: Embry-Riddle Aeronautical University; Hishab Singapore Pte. Ltd; Texas A&M University - Central Texas
Categories: Computation and Language (cs.CL)
Comments: Proceedings of the 8th Workshop on Big Data for Cybersecurity (BigCyber)

Abstract:The development of robust transliteration techniques to enhance the effectiveness of transforming Romanized scripts into native scripts is crucial for Natural Language Processing tasks, including sentiment analysis, speech recognition, information retrieval, and intelligent personal assistants. Despite significant advancements, state-of-the-art multilingual models still face challenges in handling Romanized script, where the Roman alphabet is adopted to represent the phonetic structure of diverse languages. Within the South Asian context, where the use of Romanized script for Indo-Aryan languages is widespread across social media and digital communication platforms, such usage continues to pose significant challenges for cutting-edge multilingual models. While a limited number of transliteration datasets and models are available for Indo-Aryan languages, they generally lack sufficient diversity in pronunciation and spelling variations, adequate code-mixed data for large language model (LLM) training, and low-resource adaptation. To address this research gap, we introduce a novel transliteration dataset for two popular Indo-Aryan languages, Hindi and Bengali, which are ranked as the 3rd and 7th most spoken languages worldwide. Our dataset comprises nearly 1.8 million Hindi and 1 million Bengali transliteration pairs. In addition to that, we pre-train a custom multilingual seq2seq LLM based on Marian architecture using the developed dataset. Experimental results demonstrate significant improvements compared to existing relevant models in terms of BLEU and CER metrics.
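The CER metric mentioned in the abstract is commonly computed as the character-level edit distance between a system output and its reference, divided by the reference length. The sketch below implements that common definition; whether the paper normalizes or tokenizes in exactly this way is not stated, and the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(hypothesis: str, reference: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(hypothesis, reference) / len(reference)

score = cer("namste", "namaste")  # one missing character out of seven
```

For native scripts such as Devanagari or Bengali, this counts Unicode code points, so combining marks are treated as separate characters; grapheme-level counting would need an additional segmentation step.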

[NLP-41] ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering

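[Quick Read]: This paper addresses the limited performance of Multimodal Large Language Models (MLLMs) on domain-specific or knowledge-intensive visual question answering (Knowledge-based Visual Question Answering, KB-VQA), where the relevant knowledge is underrepresented in pre-training data. Existing Retrieval-Augmented Generation (RAG) approaches suffer from low retrieval precision, noisy passages, and limited reasoning. The key to its solution is ReAG, a Reasoning-Augmented Multimodal RAG framework that (1) combines coarse- and fine-grained retrieval to improve contextual relevance; (2) uses a critic model to filter irrelevant passages so that only high-quality external information is added; and (3) follows a multi-stage training strategy in which reinforcement learning strengthens reasoning over retrieved content, with supervised fine-tuning serving only as a cold start. Experiments on Encyclopedic-VQA and InfoSeek show that ReAG significantly outperforms prior methods, improving answer accuracy and providing interpretable reasoning grounded in retrieved evidence.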

Link: https://arxiv.org/abs/2511.22715
Authors: Alberto Compagnoni,Marco Morini,Sara Sarto,Federico Cocchi,Davide Caffagni,Marcella Cornia,Lorenzo Baraldi,Rita Cucchiara
Affiliations: University of Modena and Reggio Emilia, Italy; University of Pisa, Italy
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: None

Abstract:Multimodal Large Language Models (MLLMs) have shown impressive capabilities in jointly understanding text, images, and videos, often evaluated via Visual Question Answering (VQA). However, even state-of-the-art MLLMs struggle with domain-specific or knowledge-intensive queries, where relevant information is underrepresented in pre-training data. Knowledge-based VQA (KB-VQA) addresses this by retrieving external documents to condition answer generation, but current retrieval-augmented approaches suffer from low precision, noisy passages, and limited reasoning. To address this, we propose ReAG, a novel Reasoning-Augmented Multimodal RAG approach that combines coarse- and fine-grained retrieval with a critic model that filters irrelevant passages, ensuring high-quality additional context. The model follows a multi-stage training strategy leveraging reinforcement learning to enhance reasoning over retrieved content, while supervised fine-tuning serves only as a cold start. Extensive experiments on Encyclopedic-VQA and InfoSeek demonstrate that ReAG significantly outperforms prior methods, improving answer accuracy and providing interpretable reasoning grounded in retrieved evidence. Our source code is publicly available at: this https URL.

[NLP-42] Mechanistic Finetuning of Vision-Language-Action Models via Few-Shot Demonstrations

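[Quick Read]: This paper addresses the lack of specificity in current fine-tuning of Vision-Language-Action (VLA) models for robotic tasks: existing methods adapt the same set of parameters for every task and so cannot account for differences in physical characteristics, visual input, and language descriptions. The key to its solution is Robotic Steering, a fine-tuning method grounded in mechanistic interpretability that uses a few demonstrations to identify task-specific attention heads and selectively fine-tunes only those heads, giving more efficient, robust, and interpretable VLA adaptation that outperforms LoRA.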

Link: https://arxiv.org/abs/2511.22697
Authors: Chancharik Mitra,Yusen Luo,Raj Saravanan,Dantong Niu,Anirudh Pai,Jesse Thomason,Trevor Darrell,Abrar Anwar,Deva Ramanan,Roei Herzig
Affiliations: Unknown
Categories: Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: None

Abstract:Vision-Language-Action (VLA) models promise to extend the remarkable success of vision-language models (VLMs) to robotics. Yet, unlike VLMs in the vision-language domain, VLAs for robotics require finetuning to contend with varying physical factors like robot embodiment, environment characteristics, and spatial relationships of each task. Existing fine-tuning methods lack specificity, adapting the same set of parameters regardless of a task’s visual, linguistic, and physical characteristics. Inspired by functional specificity in neuroscience, we hypothesize that it is more effective to finetune sparse model representations specific to a given task. In this work, we introduce Robotic Steering, a finetuning approach grounded in mechanistic interpretability that leverages few-shot demonstrations to identify and selectively finetune task-specific attention heads aligned with the physical, visual, and linguistic requirements of robotic tasks. Through comprehensive on-robot evaluations with a Franka Emika robot arm, we demonstrate that Robotic Steering outperforms LoRA while achieving superior robustness under task variation, reduced computational cost, and enhanced interpretability for adapting VLAs to diverse robotic tasks.

[NLP-43] Improving LLM-based Ontology Matching with fine-tuning on synthetic data

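[Quick Read]: This paper addresses how to use Large Language Models (LLMs) to match ontology modules directly and generate the corresponding alignments, and in particular how to improve zero-shot performance when annotated reference alignments are scarce. The key to its solution is a strategy that combines automatic dataset generation with fine-tuning. First, a search space reduction technique selects relevant submodules from the source and target ontologies, which are used to construct prompts automatically. Second, to offset the scarcity of annotated data, an LLM-based method generates a synthetic corpus of ontology submodule pairs with reference alignments. Finally, the LLM is fine-tuned on this synthetic data, which improves matching performance over the non-fine-tuned base model on several datasets from the OAEI complex track.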

Link: https://arxiv.org/abs/2511.22612
Authors: Guilherme Sousa,Rinaldo Lima,Cassia Trojahn
Affiliations: IRIT & Université de Toulouse 2 Jean Jaurès; Universidade Federal Rural de Recife; Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP
Categories: Computation and Language (cs.CL)
Comments: None

Abstract:Large Language Models (LLMs) are increasingly being integrated into various components of Ontology Matching pipelines. This paper investigates the capability of LLMs to perform ontology matching directly on ontology modules and generate the corresponding alignments. It also explores how a dedicated fine-tuning strategy can enhance the model’s matching performance in a zero-shot setting. The proposed method incorporates a search space reduction technique to select relevant subsets from both source and target ontologies, which are then used to automatically construct prompts. Recognizing the scarcity of reference alignments for training, a novel LLM-based approach is introduced for generating a synthetic dataset. This process creates a corpus of ontology submodule pairs and their corresponding reference alignments, specifically designed to fine-tune an LLM for the ontology matching task. The proposed approach was evaluated on the Conference, Geolink, Enslaved, Taxon, and Hydrography datasets from the OAEI complex track. The results demonstrate that the LLM fine-tuned on the synthetically generated data exhibits superior performance compared to the non-fine-tuned base model. The key contribution is a strategy that combines automatic dataset generation with fine-tuning to effectively adapt LLMs for ontology matching tasks.

[NLP-44] Smarter not Bigger: Fine-Tuned RAG-Enhanced LLMs for Automotive HIL Testing

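[Quick Read]: This paper addresses the fragmentation and underuse of test cases and requirement documents in automotive Hardware-in-the-Loop (HIL) testing. The key to its solution is HIL-GPT, a Retrieval-Augmented Generation (RAG) system that combines domain-adapted Large Language Models (LLMs) with semantic retrieval. Embeddings are fine-tuned on a domain-specific dataset built through heuristic mining and LLM-assisted synthesis, and vector indexing provides scalable, traceable retrieval of test cases and requirements. Experiments show that fine-tuned compact models such as bge-base-en-v1.5 strike a better balance among accuracy, latency, and cost than larger models, challenging the assumption that bigger models are always better.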

Link: https://arxiv.org/abs/2511.22584
Authors: Chao Feng,Zihan Liu,Siddhant Gupta,Gongpei Cui,Jan von der Assen,Burkhard Stiller
Affiliations: University of Zurich UZH; Volvo Car Corporation
Categories: Computation and Language (cs.CL)
Comments: None

Abstract:Hardware-in-the-Loop (HIL) testing is essential for automotive validation but suffers from fragmented and underutilized test artifacts. This paper presents HIL-GPT, a retrieval-augmented generation (RAG) system integrating domain-adapted large language models (LLMs) with semantic retrieval. HIL-GPT leverages embedding fine-tuning using a domain-specific dataset constructed via heuristic mining and LLM-assisted synthesis, combined with vector indexing for scalable, traceable test case and requirement retrieval. Experiments show that fine-tuned compact models, such as bge-base-en-v1.5, achieve a superior trade-off between accuracy, latency, and cost compared to larger models, challenging the notion that bigger is always better. An A/B user study further confirms that RAG-enhanced assistants improve perceived helpfulness, truthfulness, and satisfaction over general-purpose LLMs. These findings provide insights for deploying efficient, domain-aligned LLM-based assistants in industrial HIL environments.
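The retrieval step behind a system like this can be sketched as nearest-neighbor search over embedding vectors. The example below is a minimal brute-force version under stated assumptions: the three-dimensional vectors are made up, the artifact IDs (`TC-001`, `TC-002`, `REQ-010`) are hypothetical, and a real deployment would obtain embeddings from a model and use a proper vector index rather than a Python dictionary.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k(query_vec, index, k=2):
    """Return the IDs of the k artifacts whose embeddings are closest to the query.

    `index` maps an artifact ID to its embedding vector."""
    ranked = sorted(index, key=lambda key: cosine(query_vec, index[key]), reverse=True)
    return ranked[:k]

# Toy index with made-up embeddings for two test cases and one requirement.
index = {
    "TC-001": [1.0, 0.0, 0.0],
    "TC-002": [0.7, 0.7, 0.0],
    "REQ-010": [0.0, 0.0, 1.0],
}
result = top_k([1.0, 0.1, 0.0], index, k=2)
```

Returning artifact IDs rather than raw text is one simple way to keep answers traceable to specific test cases and requirements.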

[NLP-45] Extension Condition “violations” and Merge optimality constraints

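[Quick read]: This paper addresses a set of syntactic phenomena traditionally regarded as violations of the Extension Condition (EC), including head-to-head movement, phrasal affixes, syntactic cliticization, verb-particle alternation, and operator-variable phenomena. Using the mathematical formulation of Merge within the Strong Minimalist Thesis, the authors argue that all of these can be explained without any EC violation. The key is Sideward Merge (SM): SM derivations respect EC, although they incur optimality violations of varying size with respect to Resource Restrictions cost functions. Cases with large optimality violations admit alternative derivations that involve neither EC violations nor SM, and the main remaining case, head-to-head movement, needs SM only with minimal optimality violations (near-equilibrium fluctuations). The paper also shows that EC has a clear algebraic meaning and is therefore an intrinsic structural constraint of the model, and that minimally optimality-violating SM plays a structural role in the Markovian properties of Merge, which links syntactic derivation to the dynamics of the Hopf algebra Markov chain.

Link: https://arxiv.org/abs/2511.22582
Authors: Matilde Marcolli,Richard Larson,Riny Huijbregts
Affiliations: unknown
Categories: Computation and Language (cs.CL); Rings and Algebras (math.RA)
Comments: 85 pages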

Abstract:We analyze, using the mathematical formulation of Merge within the Strong Minimalist Thesis framework, a set of linguistic phenomena, including head-to-head movement, phrasal affixes and syntactic cliticization, verb-particle alternation, and operator-variable phenomena. These are often regarded as problematic, as violations of the Extension Condition. We show that, in fact, all of these phenomena can be explained without involving any EC violation. We first show that derivations using Sideward Merge are possible for all of these cases: these respect EC, though they involve some amount of optimality violations, with respect to Resource Restrictions cost functions, and the amount of violation differs among these cases. We show that all the cases that involve large optimality violations can be derived in alternative ways involving neither EC nor the use of SM. The main remaining case (head-to-head movement) only involves SM with minimal violations of optimality (near equilibrium fluctuations). We analyze explicitly also the cases of multiple wh-fronting, clusters of clitics in Romance languages and possessor agreement construction in Korean, and how an explanation of these phenomena based on SM can be made compatible with the colored operad generators for phases and theta roles. We also show that the EC condition has a clear algebraic meaning in the mathematical formulation of Merge and is therefore an intrinsic structural algebraic constraint of the model, rather than an additional assumption. We also show that the minimal optimality violating SM plays a structural role in the Markovian properties of Merge, and we compare different optimality conditions coming from Minimal Search and from Resource Restriction in terms of their effect on the dynamics of the Hopf algebra Markov chain, in a simple explicit example.

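[NLP-46] DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning

[Quick read]: This paper addresses the fact that current large language models (LLMs) for mathematical reasoning are optimized for final-answer correctness while neglecting the rigor of the reasoning itself. For tasks such as theorem proving, which require rigorous step-by-step derivation, rewards based on final answers do not apply. The key is a framework for self-verifiable mathematical reasoning: first train an accurate and faithful LLM-based verifier for theorem proving; then use that verifier as the reward model to train a proof generator, which is incentivized to find and resolve issues in its own proofs before finalizing them; and scale verification compute to automatically label new hard-to-verify proofs, producing training data that further improves the verifier and maintains the generation-verification gap as the generator grows stronger. The resulting model, DeepSeekMath-V2, achieves gold-level scores on IMO 2025 and CMO 2024 and 118/120 on Putnam 2024 with scaled test-time compute.

Link: https://arxiv.org/abs/2511.22570
Authors: Zhihong Shao,Yuxiang Luo,Chengda Lu,Z.Z. Ren,Jiewen Hu,Tian Ye,Zhibin Gou,Shirong Ma,Xiaokang Zhang
Affiliations: DeepSeek-AI
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: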

Abstract:Large language models have made significant progress in mathematical reasoning, which serves as an important testbed for AI and could impact scientific research if further advanced. By scaling reasoning with reinforcement learning that rewards correct final answers, LLMs have improved from poor performance to saturating quantitative reasoning competitions like AIME and HMMT in one year. However, this approach faces fundamental limitations. Pursuing higher final answer accuracy doesn’t address a key issue: correct answers don’t guarantee correct reasoning. Moreover, many mathematical tasks like theorem proving require rigorous step-by-step derivation rather than numerical answers, making final answer rewards inapplicable. To push the limits of deep reasoning, we believe it is necessary to verify the comprehensiveness and rigor of mathematical reasoning. Self-verification is particularly important for scaling test-time compute, especially for open problems without known solutions. Towards self-verifiable mathematical reasoning, we investigate how to train an accurate and faithful LLM-based verifier for theorem proving. We then train a proof generator using the verifier as the reward model, and incentivize the generator to identify and resolve as many issues as possible in their own proofs before finalizing them. To maintain the generation-verification gap as the generator becomes stronger, we propose to scale verification compute to automatically label new hard-to-verify proofs, creating training data to further improve the verifier. Our resulting model, DeepSeekMath-V2, demonstrates strong theorem-proving capabilities, achieving gold-level scores on IMO 2025 and CMO 2024 and a near-perfect 118/120 on Putnam 2024 with scaled test-time compute.

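[NLP-47] Joint Speech and Text Training for LLM-Based End-to-End Spoken Dialogue State Tracking ICASSP2026

[Quick read]: This paper addresses the performance bottleneck in end-to-end spoken dialogue state tracking (DST) caused by the difficulty of handling speech input and the scarcity of training data, and in particular the poor generalization of existing models to new domains. The key is to train jointly on the available spoken DST data and on textual DST data from other domains, drawing on the semantic understanding of large language models, so that good cross-domain DST performance is obtained without annotated spoken data from the target domain, which lowers data collection costs.

Link: https://arxiv.org/abs/2511.22503
Authors: Katia Vendrame,Bolaji Yusuf,Santosh Kesiraju,Šimon Sedláček,Oldřich Plchot,Jan Černocký
Affiliations: unknown
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: submitted to ICASSP 2026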

Abstract:End-to-end spoken dialogue state tracking (DST) is made difficult by the tandem of having to handle speech input and data scarcity. Combining speech foundation encoders and large language models has been proposed in recent work as to alleviate some of this difficulty. Although this approach has been shown to result in strong spoken DST models, achieving state-of-the-art performance in realistic multi-turn DST, it struggles to generalize across domains and requires annotated spoken DST training data for each domain of interest. However, collecting such data for every target domain is both costly and difficult. Noting that textual DST data is more easily obtained for various domains, in this work, we propose jointly training on available spoken DST data and written textual data from other domains as a way to achieve cross-domain generalization. We conduct experiments which show the efficacy of our proposed method for getting good cross-domain DST performance without relying on spoken training data from the target domains.

[NLP-48] What Shape Is Optimal for Masks in Text Removal?

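[Quick read]: This paper addresses text removal from document images with dense text and complex layouts, where existing methods, designed mostly for simple scene text, degrade on practical industrial images. The key is a flexible mask-profile model whose parameters are learned with Bayesian optimization. The learned profiles turn out to be character-wise masks, and the study finds that the minimum cover of a text region is not the optimal mask, which underlines the importance of tuning mask shape precisely in practice.

Link: https://arxiv.org/abs/2511.22499
Authors: Hyakka Nakada,Marika Kubota
Affiliations: Recruit Co., Ltd.; Beans Labo Co., Ltd.
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 12 pages, 17 figures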

Abstract:The advent of generative models has dramatically improved the accuracy of image inpainting. In particular, by removing specific text from document images, reconstructing original images is extremely important for industrial applications. However, most existing methods of text removal focus on deleting simple scene text which appears in images captured by a camera in an outdoor environment. There is little research dedicated to complex and practical images with dense text. Therefore, we created benchmark data for text removal from images including a large amount of text. From the data, we found that text-removal performance becomes vulnerable against mask profile perturbation. Thus, for practical text-removal tasks, precise tuning of the mask shape is essential. This study developed a method to model highly flexible mask profiles and learn their parameters using Bayesian optimization. The resulting profiles were found to be character-wise masks. It was also found that the minimum cover of a text region is not optimal. Our research is expected to pave the way for a user-friendly guideline for manual masking.

[NLP-49] Exploring Performance Variations in Finetuned Translators of Ultra-Low Resource Languages: Do Linguistic Differences Matter?

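[Quick read]: This paper asks why translators for ultra-low-resource languages (such as endangered Indigenous languages), built by fine-tuning pre-trained language models on small amounts of data, show substantially different performance across studies that use similar methods and data. The approach is to evaluate several candidate causes systematically: data cleaning procedures, limitations of the pre-trained models, the size of the base model, and the size of the training dataset, in both translation directions and for two related but structurally distinct Brazilian Indigenous languages. These training factors show little or no influence, which suggests that differences between the languages themselves may play a significant role in how well fine-tuning works.

Link: https://arxiv.org/abs/2511.22482
Authors: Isabel Gonçalves,Paulo Cavalin,Claudio Pinhanez
Affiliations: PUC-Rio; IBM Research Brazil
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: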

Abstract:Finetuning pre-trained language models with small amounts of data is a commonly-used method to create translators for ultra-low resource languages such as endangered Indigenous languages. However, previous works have reported substantially different performances with translators created using similar methodology and data. In this work we systematically explored possible causes of the performance difference, aiming to determine whether it was a product of different cleaning procedures, limitations of the pre-trained models, the size of the base model, or the size of the training dataset, studying both directions of translation. Our studies, using two Brazilian Indigenous languages, related but with significant structural linguistic characteristics, indicated none or very limited influence from those training factors, suggesting differences between languages may play a significant role in the ability to produce translators by fine-tuning pre-trained models.

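[NLP-50] Mapping Clinical Doubt: Locating Linguistic Uncertainty in LLMs AAAI’26

[Quick read]: This paper addresses the unclear internal representation of linguistic uncertainty in large language models (LLMs) used in clinical settings, focusing on input-side sensitivity to differences in epistemic modality (for example, definite versus hedged statements). The key is a contrastive dataset of clinical statements that vary in epistemic modality, together with Model Sensitivity to Uncertainty (MSU), a layerwise probing metric that quantifies activation-level shifts induced by uncertainty cues. Experiments show that LLMs exhibit structured, depth-dependent sensitivity to clinical uncertainty, which suggests that epistemic information is progressively encoded in deeper layers and offers insight into model interpretability and epistemic reliability.

Link: https://arxiv.org/abs/2511.22402
Authors: Srivarshinee Sridhar,Raghav Kaushik Ravi,Kripabandhu Ghosh
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to AAAI’26 SECURE-AI4H Workshop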

Abstract:Large Language Models (LLMs) are increasingly used in clinical settings, where sensitivity to linguistic uncertainty can influence diagnostic interpretation and decision-making. Yet little is known about where such epistemic cues are internally represented within these models. Distinct from uncertainty quantification, which measures output confidence, this work examines input-side representational sensitivity to linguistic uncertainty in medical text. We curate a contrastive dataset of clinical statements varying in epistemic modality (e.g., ‘is consistent with’ vs. ‘may be consistent with’) and propose Model Sensitivity to Uncertainty (MSU), a layerwise probing metric that quantifies activation-level shifts induced by uncertainty cues. Our results show that LLMs exhibit structured, depth-dependent sensitivity to clinical uncertainty, suggesting that epistemic information is progressively encoded in deeper layers. These findings reveal how linguistic uncertainty is internally represented in LLMs, offering insight into their interpretability and epistemic reliability.

[NLP-51] SuRe: Surprise-Driven Prioritised Replay for Continual LLM Learning

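[Quick read]: This paper addresses catastrophic forgetting in continual learning (CL) for large language models (LLMs), especially in the Large Number of Tasks (LNT) setting, where regularisation and replay methods lag behind multi-task learning. The solution has two components. The first is Surprise-prioritised Replay (SuRe), which ranks sequences by surprise and stores those with high negative log-likelihood (NLL), improving the selection of replay samples. The second is a dual-learner design with fast and slow LoRA adapters merged through an exponential moving average (EMA), which allows rapid adaptation to new knowledge while stabilising long-term knowledge. SuRe achieves state-of-the-art performance on LNT and the best overall average across the standard CL and LNT benchmarks, the combination yields further gains, and the method remains robust under reduced replay frequency and small buffer sizes, establishing replay as a strong baseline for continual LLM fine-tuning.

Link: https://arxiv.org/abs/2511.22367
Authors: Hugo Hazard,Zafeirios Fountas,Martin A. Benfeghoul,Adnan Oomerjee,Jun Wang,Haitham Bou-Ammar
Affiliations: University College London; Huawei Noah’s Ark Lab
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: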

Abstract:Continual learning, one’s ability to adapt to a sequence of tasks without forgetting previously acquired knowledge, remains a major challenge in machine learning and a key gap between artificial and human intelligence. While regularisation and replay perform well in vision, they lag behind multi-task learning for large language models (LLMs), especially at scale with many tasks. We revisit replay and argue that two failure modes drive this gap: selection (what to rehearse) and integration (how to consolidate new knowledge). To address selection, we propose Surprise-prioritised Replay (SuRe), a simple, architecture-agnostic rule that ranks and stores the most surprising (high Negative Log-Likelihood) sequences. SuRe achieves state-of-the-art performance in the Large Number of Tasks (LNT) setting and delivers the best overall average across both Standard CL and LNT benchmarks. To address integration, we add a dual-learner design with fast and slow LoRA adapters merged via an exponential moving average (EMA), enabling rapid adaptation while stabilising long-term knowledge. Combining SuRe with the dual learner yields further gains, including improvements of up to +5 accuracy points on LNT over prior SOTA. Ablation studies confirm that our proposed method remains robust under reduced replay frequency and small buffer size, demonstrating both effectiveness and sample efficiency. Taken together, our results establish replay as a strong baseline for continual LLM fine-tuning and demonstrate that surprise-based selection and slow-weight consolidation are complementary components for mitigating catastrophic forgetting.
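
The two ideas in this abstract, keeping the most surprising sequences and consolidating fast weights into slow weights with an EMA, can be sketched as follows. This is a minimal illustration under assumptions, not the authors' code: the class `SurpriseBuffer`, the function `ema_update`, the decay value, and the toy NLL numbers are hypothetical, and the abstract does not say how NLL is aggregated per sequence or when the buffer is refreshed.

```python
import heapq

class SurpriseBuffer:
    """Fixed-size replay buffer that keeps the highest-NLL sequences seen so far."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []      # min-heap of (nll, insertion_order, sequence)
        self._counter = 0    # tie-breaker so sequences are never compared directly

    def add(self, sequence, nll):
        item = (nll, self._counter, sequence)
        self._counter += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        elif nll > self._heap[0][0]:
            # New sequence is more surprising than the least surprising stored one.
            heapq.heapreplace(self._heap, item)

    def contents(self):
        """Stored sequences, most surprising first."""
        return [seq for _, _, seq in sorted(self._heap, reverse=True)]

def ema_update(slow, fast, decay=0.9):
    """Move slow weights toward fast weights: slow <- decay*slow + (1-decay)*fast."""
    return [decay * s + (1.0 - decay) * f for s, f in zip(slow, fast)]

buf = SurpriseBuffer(capacity=2)
for seq, nll in [("a", 0.5), ("b", 2.0), ("c", 1.2), ("d", 0.1)]:
    buf.add(seq, nll)
print(buf.contents())

slow = ema_update(slow=[1.0, 0.0], fast=[0.0, 1.0], decay=0.9)
print(slow)
```

A min-heap keyed on NLL makes eviction of the least surprising stored sequence cheap, which is one straightforward way to realise a "rank and store" rule with a small buffer.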

[NLP-52] PAT: Accelerating LLM Decoding via Prefix-Aware Attention with Resource Efficient Multi-Tile Kernel ASPLOS’26

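[Quick read]: This paper targets the attention bottleneck in the decoding stage of large language model (LLM) inference. Decode attention is memory-bound because large KV caches are loaded from global memory, and real-world requests share substantial hierarchical prefixes (system prompts, tools and templates, RAG content) that existing attention implementations do not fully exploit: one-query-per-CTA execution repeatedly loads the KV cache of shared prefixes, while one-size-fits-all tiling leaves on-chip resources idle and worsens bubbles for uneven KV lengths. The key is PAT, a prefix-aware attention kernel implementation organized around a pack-forward-merge paradigm: it packs queries by shared prefix to reduce repeated memory accesses, runs a customized multi-tile kernel for high resource efficiency, applies multi-stream forwarding and KV splitting to reduce bubbles, and merges partial results with online softmax at negligible overhead. Implemented as an off-the-shelf plugin for vLLM, PAT reduces attention latency by 67.4% on average and time per output token (TPOT) by 13.6-83.4%.

Link: https://arxiv.org/abs/2511.22333
Authors: Jinjun Yi,Zhixin Zhao,Yitao Hu,Ke Yan,Weiwei Sun,Hao Wang,Laiping Zhao,Yuhao Zhang,Wenxin Li,Keqiu Li
Affiliations: Tianjin University; Stevens Institute of Technology
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Computation and Language (cs.CL)
Comments: Accepted by ASPLOS’26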

Abstract:LLM serving is increasingly dominated by decode attention, which is a memory-bound operation due to massive KV cache loading from global memory. Meanwhile, real-world workloads exhibit substantial, hierarchical shared prefixes across requests (e.g., system prompts, tools/templates, RAG). Existing attention implementations fail to fully exploit prefix sharing: one-query-per-CTA execution repeatedly loads shared prefix KV cache, while one-size-fits-all tiling leaves on-chip resources idle and exacerbates bubbles for uneven KV lengths. These choices amplify memory bandwidth pressure and stall memory-bound decode attention. This paper introduces PAT, a prefix-aware attention kernel implementation for LLM decoding that organizes execution with a pack-forward-merge paradigm. PAT packs queries by shared prefix to reduce repeated memory accesses, runs a customized multi-tile kernel to achieve high resource efficiency. It further applies practical multi-stream forwarding and KV splitting to reduce resource bubbles. The final merge performs online softmax with negligible overhead. We implement PAT as an off-the-shelf plugin for vLLM. Evaluation on both real-world and synthetic workloads shows that PAT reduces attention latency by 67.4% on average and TPOT by 13.6-83.4% under the same configurations against state-of-the-art attention kernels.
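
The merge step mentioned in this abstract relies on the standard online-softmax identity: attention computed separately over two KV chunks (for example a shared prefix and a per-request suffix) can be combined exactly from each chunk's running maximum, normaliser, and unnormalised output. The sketch below shows that identity for a single query in plain Python; it is not PAT's GPU kernel, and the function names and toy numbers are hypothetical.

```python
import math

def partial_attention(scores, values):
    """Attention state over one KV chunk: (max score, normaliser, unnormalised output)."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    l = sum(weights)
    dim = len(values[0])
    o = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return m, l, o

def merge(a, b):
    """Combine two chunk states by rescaling both to a common maximum."""
    m_a, l_a, o_a = a
    m_b, l_b, o_b = b
    m = max(m_a, m_b)
    scale_a, scale_b = math.exp(m_a - m), math.exp(m_b - m)
    l = l_a * scale_a + l_b * scale_b
    o = [x * scale_a + y * scale_b for x, y in zip(o_a, o_b)]
    return m, l, o

def finalize(state):
    """Divide by the normaliser to obtain the softmax-weighted output."""
    _, l, o = state
    return [x / l for x in o]

# Toy query-key scores and value vectors (made-up numbers).
scores = [0.5, 2.0, -1.0, 3.0]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [1.0, 3.0]]

full = finalize(partial_attention(scores, values))
merged = finalize(merge(partial_attention(scores[:2], values[:2]),   # "prefix" chunk
                        partial_attention(scores[2:], values[2:])))  # "suffix" chunk
print(full, merged)
```

Because the merge is exact, the prefix chunk's state can in principle be computed once per group of queries that share the prefix and then combined with each query's own suffix state.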

[NLP-53] Named Entity Recognition for the Kurdish Sorani Language: Dataset Creation and Comparative Analysis

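[Quick read]: This paper contributes towards balancing the inclusivity and global applicability of natural language processing (NLP) for low-resource, under-represented languages, specifically the lack of a named entity recognition (NER) dataset for Kurdish Sorani. The key is the first NER dataset for the language (64,563 annotated tokens), a tool that supports the annotation task for this and many other languages, and a systematic comparison of classic machine learning models and neural systems. The conventional Conditional Random Fields (CRF) model reaches an F1-score of 0.825, significantly higher than BiLSTM-based models (0.706), which challenges the assumption that neural approaches always have the advantage and indicates that simpler, more computationally efficient classical methods can win in low-resource settings.

Link: https://arxiv.org/abs/2511.22315
Authors: Bakhtawar Abdalla,Rebwar Mala Nabi,Hassan Eshkiki,Fabio Caraffini
Affiliations: Sultan Qaboos University; Karak University; Swansea University
Categories: Computation and Language (cs.CL)
Comments: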

Abstract:This work contributes towards balancing the inclusivity and global applicability of natural language processing techniques by proposing the first ‘name entity recognition’ dataset for Kurdish Sorani, a low-resource and under-represented language, that consists of 64,563 annotated tokens. It also provides a tool for facilitating this task in this and many other languages and performs a thorough comparative analysis, including classic machine learning models and neural systems. The results obtained challenge established assumptions about the advantage of neural approaches within the context of NLP. Conventional methods, in particular CRF, obtain F1-scores of 0.825, outperforming the results of BiLSTM-based models (0.706) significantly. These findings indicate that simpler and more computationally efficient classical frameworks can outperform neural architectures in low-resource settings.

[NLP-54] Sentiment Analysis Of Shopee Product Reviews Using Distilbert

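[Quick read]: This paper addresses automated sentiment analysis of the very large volume of user reviews on e-commerce platforms such as Shopee, where manual analysis is too inefficient. The key is sentiment classification with DistilBERT, a lightweight transformer model: it reaches 94.8% accuracy, slightly below BERT (95.3%) and well above SVM (90.2%), while cutting computation time by more than 55%, a balance of accuracy and efficiency that suits large-scale sentiment analysis on e-commerce platforms.

Link: https://arxiv.org/abs/2511.22313
Authors: Zahri Aksa Dautd,Aviv Yuniar Rahman
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: 6 pages, 11 figures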

Abstract:The rapid growth of digital commerce has led to the accumulation of a massive number of consumer reviews on online platforms. Shopee, as one of the largest e-commerce platforms in Southeast Asia, receives millions of product reviews every day containing valuable information regarding customer satisfaction and preferences. Manual analysis of these reviews is inefficient, thus requiring a computational approach such as sentiment analysis. This study examines the use of DistilBERT, a lightweight transformer-based deep learning model, for sentiment classification on Shopee product reviews. The dataset used consists of approximately one million English-language reviews that have been preprocessed and trained using the distilbert-base-uncased model. Evaluation was conducted using accuracy, precision, recall, and F1-score metrics, and compared against benchmark models such as BERT and SVM. The results show that DistilBERT achieved an accuracy of 94.8%, slightly below BERT (95.3%) but significantly higher than SVM (90.2%), with computation time reduced by more than 55%. These findings demonstrate that DistilBERT provides an optimal balance between accuracy and efficiency, making it suitable for large scale sentiment analysis on e-commerce platforms. Keywords: Sentiment Analysis, DistilBERT, Shopee Reviews, Natural Language Processing, Deep Learning, Transformer Models.

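[NLP-55] Token-Level Marginalization for Multi-Label LLM Classifiers

[Quick read]: This paper addresses the lack of interpretable confidence scores when generative large language models (LLMs) are used for multi-label content safety classification. Because a generative architecture does not directly expose class-level probabilities, model confidence is hard to assess, which complicates dynamic thresholding and fine-grained error analysis. The key is to propose and evaluate three novel token-level probability estimation approaches that use token logits to improve the interpretability and reliability of generative classifiers, enabling more nuanced content safety moderation.

Link: https://arxiv.org/abs/2511.22312
Authors: Anjaneya Praharaj,Jaykumar Kasundra
Affiliations: ServiceNow
Categories: Computation and Language (cs.CL)
Comments: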

Abstract:This paper addresses the critical challenge of deriving interpretable confidence scores from generative language models (LLMs) when applied to multi-label content safety classification. While models like LLaMA Guard are effective for identifying unsafe content and its categories, their generative architecture inherently lacks direct class-level probabilities, which hinders model confidence assessment and performance interpretation. This limitation complicates the setting of dynamic thresholds for content moderation and impedes fine-grained error analysis. This research proposes and evaluates three novel token-level probability estimation approaches to bridge this gap. The aim is to enhance model interpretability and accuracy, and evaluate the generalizability of this framework across different instruction-tuned models. Through extensive experimentation on a synthetically generated, rigorously annotated dataset, it is demonstrated that leveraging token logits significantly improves the interpretability and reliability of generative classifiers, enabling more nuanced content safety moderation.
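
The abstract does not spell out its three estimation approaches, so the following is only a generic sketch of the underlying idea: reading a class-level probability off the logits a generative classifier assigns to its label tokens at the verdict position. The function `label_probabilities`, the label tokens, and the logit values are hypothetical and are not taken from the paper or from LLaMA Guard.

```python
import math

def label_probabilities(token_logits, label_tokens):
    """Softmax restricted to the label tokens, so the probabilities sum to 1."""
    selected = {t: token_logits[t] for t in label_tokens}
    m = max(selected.values())                     # subtract max for numerical stability
    exps = {t: math.exp(v - m) for t, v in selected.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

# Hypothetical logits at the position where the model emits its verdict.
logits = {"safe": 1.0, "unsafe": 3.0, "the": -2.0}
probs = label_probabilities(logits, ["safe", "unsafe"])
print(probs)

# A tunable threshold can then replace the hard generated label.
flagged = probs["unsafe"] >= 0.8
```

Restricting the softmax to label tokens discards probability mass on unrelated tokens; whether that renormalisation is appropriate depends on the model and prompt format, which is an assumption of this sketch.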

[NLP-56] Swarms of Large Language Model Agents for Protein Sequence Design with Experimental Validation

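[Quick read]: This paper addresses a major challenge in de novo protein design: generating, within a vast sequence space, sequences with tailored structural, physicochemical, and functional properties without relying on motif scaffolds or large amounts of task-specific data. Current generative methods, such as protein language models (PLMs) and diffusion-based architectures, often need extensive fine-tuning or model reconfiguration for objective-directed design, which limits their flexibility and scalability. The key is a decentralized, agent-based framework inspired by swarm intelligence, in which multiple large language model (LLM) agents run in parallel, one per residue position, and iteratively propose context-aware mutations by integrating the design objectives, local neighborhood interactions, and memory and feedback from previous iterations. This position-wise coordination does not depend on motif scaffolds or multiple sequence alignments, yields diverse, well-defined sequences, completes objective-directed designs within a few GPU-hours, and requires no fine-tuning or specialized training.

Link: https://arxiv.org/abs/2511.22311
Authors: Fiona Y. Wang,Di Sheng Lee,David L. Kaplan,Markus J. Buehler
Affiliations: Massachusetts Institute of Technology; Tufts University
Categories: Artificial Intelligence (cs.AI); Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Soft Condensed Matter (cond-mat.soft); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: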

Abstract:Designing proteins de novo with tailored structural, physicochemical, and functional properties remains a grand challenge in biotechnology, medicine, and materials science, due to the vastness of sequence space and the complex coupling between sequence, structure, and function. Current state-of-the-art generative methods, such as protein language models (PLMs) and diffusion-based architectures, often require extensive fine-tuning, task-specific data, or model reconfiguration to support objective-directed design, thereby limiting their flexibility and scalability. To overcome these limitations, we present a decentralized, agent-based framework inspired by swarm intelligence for de novo protein design. In this approach, multiple large language model (LLM) agents operate in parallel, each assigned to a specific residue position. These agents iteratively propose context-aware mutations by integrating design objectives, local neighborhood interactions, and memory and feedback from previous iterations. This position-wise, decentralized coordination enables emergent design of diverse, well-defined sequences without reliance on motif scaffolds or multiple sequence alignments, validated with experiments on proteins with alpha helix and coil structures. Through analyses of residue conservation, structure-based metrics, and sequence convergence and embeddings, we demonstrate that the framework exhibits emergent behaviors and effective navigation of the protein fitness landscape. Our method achieves efficient, objective-directed designs within a few GPU-hours and operates entirely without fine-tuning or specialized training, offering a generalizable and adaptable solution for protein design. Beyond proteins, the approach lays the groundwork for collective LLM-driven design across biomolecular systems and other scientific discovery tasks.

[NLP-57] Beyond Query-Level Comparison: Fine-Grained Reinforcement Learning for Text-to-SQL with Automated Interpretable Critiques

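[Quick read]: This paper addresses the evaluation and reward bottleneck in Text-to-SQL: existing approaches depend on costly manually annotated gold SQL queries, and most reinforcement learning (RL) methods use only the final binary execution outcome as the reward, a coarse-grained signal that misses fine-grained structural and semantic errors. The key is RuCo-C, a generative judge model that automatically generates query-specific evaluation rubrics and links them to interpretable critiques, enabling fine-grained automatic evaluation without human intervention. During RL training, a "progressive exploration" strategy integrates densified reward feedback and dynamically adjusts the rewards to improve model performance.

Link: https://arxiv.org/abs/2511.22258
Authors: Guifeng Wang,Yuanfeng Song,Meng Yang,Tao Zhu,Xiaoming Yin,Xing Chen
Affiliations: ByteDance
Categories: Computation and Language (cs.CL)
Comments: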

Abstract:Text-to-SQL, a pivotal natural language processing (NLP) task that converts textual queries into executable SQL, has seen substantial progress in recent years. However, existing evaluation and reward mechanisms used to train and assess the text-to-SQL models remain a critical bottleneck. Current approaches heavily rely on manually annotated gold SQL queries, which are costly to produce and impractical for large-scale evaluation. More importantly, most reinforcement learning (RL) methods in text-to-SQL leverage only the final binary execution outcome as the reward signal, a coarse-grained supervision that overlooks detailed structural and semantic errors from the perspective of rubrics. To address these challenges, we propose RuCo-C, a novel generative judge model for fine-grained, query-specific automatic evaluation using interpretable critiques without human intervention. Our framework first automatically generates query-specific evaluation rubrics for human-free annotation, linking them to interpretable critiques. Subsequently, it integrates densified reward feedback through a “progressive exploration” strategy during the RL training process, which dynamically adjusts the rewards to enhance the model’s performance. Comprehensive experiments demonstrate that RuCo-C outperforms existing methods in text-to-SQL evaluation, yielding significant performance gains.

[NLP-58] From Compound Figures to Composite Understanding: Developing a Multi-Modal LLM from Biomedical Literature with Medical Multiple-Image Benchmarking and Validation

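[Quick read]: This paper addresses the restriction of current medical multi-modal large language models (MLLMs) to single-image understanding, which limits their use in clinical workflows where diagnosis and progression assessment require synthesizing multiple images from different modalities or time points. The key is a five-stage, context-aware instruction generation paradigm built on a divide-and-conquer strategy that decomposes multi-image analysis into manageable sub-tasks, using license-permissive compound figures from biomedical literature as a rich training data source so that the model learns the spatial, temporal, and cross-modal relationships among panels. The resulting model, M3LLM, significantly outperforms general-purpose and specialized medical MLLMs across multi-image, single-image, text-only, and multi-choice scenarios, and generalizes well to longitudinal chest X-ray analysis on the MIMIC dataset.

Link: https://arxiv.org/abs/2511.22232
Authors: Zhen Chen,Yihang Fu,Gabriel Madera,Mauro Giuffre,Serina Applebaum,Hyunjae Kim,Hua Xu,Qingyu Chen
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: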

Abstract:Multi-modal large language models (MLLMs) have shown promise in advancing healthcare. However, most existing models remain confined to single-image understanding, which greatly limits their applicability in clinical workflows. In practice, medical diagnosis and progression often require synthesizing information across multiple images from different modalities or time points. The development of medical MLLMs capable of such multi-image understanding has been hindered by the lack of large-scale, high-quality annotated training data. To address this limitation, we propose a novel framework that leverages license-permissive compound images in biomedical literature, as a rich yet underutilized data source for multi-image analysis. Specifically, we design a five-stage, context-aware instruction generation paradigm underpinned by a divide-and-conquer strategy. By decomposing multi-image analysis into manageable sub-tasks, this paradigm empowers MLLMs to move beyond single-panel analysis and provide a composite understanding by learning the complex spatial, temporal, and cross-modal relationships inherent in these compound figures. By parsing over 237,000 compound figures and their contextual text for instruction generation, we develop M3LLM, a medical multi-image multi-modal large language model. For benchmarking, we construct PMC-MI-Bench for composite understanding, manually validated by medical experts. Extensive experiments show that M3LLM significantly outperforms both general-purpose and specialized medical MLLMs across multi-image, single-image, text-only, and multi-choice scenarios. Notably, M3LLM exhibits strong generalization to longitudinal chest X-ray analysis using the MIMIC dataset. This work establishes a scalable and efficient paradigm for developing medical MLLMs capable of composite reasoning, bridging the gap between biomedical literature and real-world clinical applications.

[NLP-59] Focused Chain-of-Thought: Efficient LLM Reasoning via Structured Input Information

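[Quick read]: This paper addresses the excessive token use and high inference latency that arise when large language models (LLMs) generate chain-of-thought (CoT) reasoning. Existing efficiency methods mostly rely on model-centric interventions, such as reinforcement learning or supervised fine-tuning, to reduce verbosity. The paper proposes a training-free, input-centric alternative, Focused Chain-of-Thought (F-CoT), which separates information extraction from reasoning: it first organizes the essential information in a query into a concise, structured context, then guides the model to reason exclusively over that context, which keeps attention away from irrelevant details and naturally yields shorter reasoning paths. On arithmetic word problems, F-CoT reduces generated tokens by 2-3x while maintaining accuracy comparable to standard zero-shot CoT.

Link: https://arxiv.org/abs/2511.22176
Authors: Lukas Struppek,Dominik Hintersdorf,Hannah Struppek,Daniel Neider,Kristian Kersting
Affiliations: FAR.AI; German Research Center for Artificial Intelligence (DFKI); Technical University of Darmstadt; University of Kassel; TU Dortmund University; TU Center for Trustworthy Data Science and Security, University Alliance Ruhr; Hessian Center for AI (Hessian.AI); Centre for Cognitive Science, Technical University of Darmstadt
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: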

Abstract:Recent large language models achieve strong reasoning performance by generating detailed chain-of-thought traces, but this often leads to excessive token use and high inference latency. Existing efficiency approaches typically focus on model-centric interventions, such as reinforcement learning or supervised fine-tuning, to reduce verbosity. In contrast, we propose a training-free, input-centric approach. Inspired by cognitive psychology, we introduce Focused Chain-of-Thought (F-CoT), which separates information extraction from the reasoning process. F-CoT first organizes the essential information from a query into a concise, structured context and then guides the model to reason exclusively over this context. By preventing attention to irrelevant details, F-CoT naturally produces shorter reasoning paths. On arithmetic word problems, F-CoT reduces generated tokens by 2-3x while maintaining accuracy comparable to standard zero-shot CoT. These results highlight structured input as a simple yet effective lever for more efficient LLM reasoning.
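
The two-stage flow described in this abstract (extract a structured context, then reason only over it) can be sketched as two prompt builders. The prompt wording, function names, and word problem below are hypothetical illustrations rather than the paper's templates, and the structured context is written by hand here where a real pipeline would obtain it from a model call.

```python
def build_extraction_prompt(query: str) -> str:
    """Stage 1: ask for only the facts and the question, in structured form."""
    return (
        "Extract only the facts and the question needed to solve the problem. "
        "Return them as a short structured list.\n\nProblem: " + query
    )

def build_reasoning_prompt(structured_context: str) -> str:
    """Stage 2: reason over the structured context alone, not the original query."""
    return (
        "Solve the problem using only the information below. "
        "Think step by step.\n\n" + structured_context
    )

query = (
    "Maya went to the market on a rainy Tuesday. She bought 3 bags of apples "
    "with 4 apples in each bag, and her neighbour waved at her twice. "
    "How many apples did Maya buy?"
)

# Hand-written stand-in for the model's stage-1 output.
structured_context = "Facts:\n- bags: 3\n- apples per bag: 4\nQuestion: total apples?"

stage1 = build_extraction_prompt(query)
stage2 = build_reasoning_prompt(structured_context)
print(stage2)
```

The second prompt never includes the original query, so distracting details such as the weather or the neighbour cannot reach the reasoning step.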

[NLP-60] RefineBench: Evaluating Refinement Capability of Language Models via Checklists

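[Quick read]: This paper asks whether language models (LMs) can self-refine their own responses, particularly in realistic settings where users pose open-ended queries and give varying degrees of feedback. The key is RefineBench, a benchmark of 1,000 challenging problems across 11 domains paired with a checklist-based evaluation framework, which evaluates two modes: self-refinement without guidance and guided refinement with natural language feedback. In self-refinement, even frontier models such as Gemini 2.5 Pro and GPT-5 reach only modest baseline scores of 31.3% and 29.1%, and most models fail to improve consistently across iterations. In guided refinement, both proprietary LMs and large open-weight LMs can use targeted feedback to reach near-perfect levels within five turns.

Link: https://arxiv.org/abs/2511.22173
Authors: Young-Jun Lee,Seungone Kim,Byung-Kwan Lee,Minkyeong Moon,Yechan Hwang,Jong Myoung Kim,Graham Neubig,Sean Welleck,Ho-Jin Choi
Affiliations: KAIST; Carnegie Mellon University; NVIDIA; Independent Researcher
Categories: Computation and Language (cs.CL)
Comments: Project website: this https URL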

Abstract:Can language models (LMs) self-refine their own responses? This question is increasingly relevant as a wide range of real-world user interactions involve refinement requests. However, prior studies have largely tested LMs’ refinement abilities on verifiable tasks such as competition math or symbolic reasoning with simplified scaffolds, whereas users often pose open-ended queries and provide varying degrees of feedback on what they desire. The recent advent of reasoning models that exhibit self-reflection patterns in their chains-of-thought further motivates this question. To analyze this, we introduce RefineBench, a benchmark of 1,000 challenging problems across 11 domains paired with a checklist-based evaluation framework. We evaluate two refinement modes: (1) guided refinement, where an LM is provided natural language feedback, and (2) self-refinement, where LMs attempt to improve without guidance. In the self-refinement setting, even frontier LMs such as Gemini 2.5 Pro and GPT-5 achieve modest baseline scores of 31.3% and 29.1%, respectively, and most models fail to consistently improve across iterations (e.g., Gemini-2.5-Pro gains only +1.8%, while DeepSeek-R1 declines by -0.1%). By contrast, in guided refinement, both proprietary LMs and large open-weight LMs (70B) can leverage targeted feedback to refine responses to near-perfect levels within five turns. These findings suggest that frontier LMs require breakthroughs to self-refine their incorrect responses, and that RefineBench provides a valuable testbed for tracking progress.

[NLP-61] Lips-Jaw and Tongue-Jaw Articulatory Tradeoff in DYNARTmo

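[Quick read]: This paper examines the dynamic coordination and tradeoffs among multiple articulators (lips, tongue, jaw) in speech production, focusing on lips-jaw and tongue-jaw coordination. It applies the dynamic articulatory model DYNARTmo, which does not implement full task-dynamic second-order biomechanics but uses first-order task-space gesture specifications to relate high-level task-space trajectories to low-level articulatory execution, together with a simplified mechanism for distributing articulatory effort across articulators. Simulations reproduce several attested synergy patterns, including jaw-supported apical closures, lower-lip elevation in bilabial stops, tongue-jaw co-movement, and saturation effects in labial constrictions. The model therefore produces empirically plausible spatio-temporal articulatory patterns under computationally simplified assumptions.

Link: https://arxiv.org/abs/2511.22155
Authors: Bernd J. Kröger
Affiliations: RWTH Aachen University; Kröger Lab
Categories: Computation and Language (cs.CL); Robotics (cs.RO)
Comments: 12 pages, 3 figures, supplementary material: Python code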

Abstract:This paper investigates how the dynamic articulatory model DYNARTmo accounts for articulatory tradeoffs between primary and secondary articulators, with a focus on lips-jaw and tongue-jaw coordination. While DYNARTmo does not implement full task-dynamic second-order biomechanics, it adopts first-order task-space gesture specifications comparable to those used in articulatory phonology and integrates a simplified mechanism for distributing articulatory effort across multiple articulators. We first outline the conceptual relationship between task dynamics and DYNARTmo, emphasizing the distinction between high-level task-space trajectories and their low-level articulatory execution. We then present simulation results for a set of CV syllables that illustrate how jaw displacement varies as a function of both place of articulation (labial, apical, dorsal) and vowel context (/a/, /i/, /u/). The model reproduces empirically attested patterns of articulatory synergy, including jaw-supported apical closures, lower-lip elevation in bilabial stops, tongue-jaw co-movement, and saturation effects in labial constrictions. These results demonstrate that even with computationally simplified assumptions, DYNARTmo can generate realistic spatio-temporal movement patterns that capture key aspects of articulatory tradeoff and synergy across a range of consonant-vowel combinations.

[NLP-62] A Theoretically Grounded Hybrid Ensemble for Reliable Detection of LLM-Generated Text

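[Quick read]: This paper addresses the risks to academic integrity and information reliability that arise as Large Language Models (LLMs) blur the line between human and machine authorship. Existing detectors typically rely on a single methodological paradigm and suffer from poor generalization and high false positive rates (FPR), especially on high-stakes academic text. The proposed solution is a theoretically grounded hybrid ensemble that fuses three complementary detection paradigms: a RoBERTa-based transformer classifier for deep semantic features, a GPT-2-based probabilistic detector using perturbation-induced likelihood curvature, and a statistical linguistic feature analyzer capturing stylometric patterns. The core novelty is a weighted voting framework whose weights are learned on the probability simplex to maximize F1-score rather than set heuristically. A bias-variance analysis and the measured low inter-model correlation (ρ ≈ 0.35–0.42) support the variance-reduction argument. The system reports 94.2% accuracy, an AUC of 0.978, and a 35% relative reduction in false positives on academic text.

Link: https://arxiv.org/abs/2511.22153
Authors: Sepyan Purnama Kristanto,Lutfi Hakim
Affiliations: Politeknik Negeri Banyuwangi
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 24 pages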

Abstract:The rapid proliferation of Large Language Models (LLMs) has blurred the line between human and machine authorship, creating practical risks for academic integrity and information reliability. Existing text detectors typically rely on a single methodological paradigm and suffer from poor generalization and high false positive rates (FPR), especially on high-stakes academic text. We propose a theoretically grounded hybrid ensemble that systematically fuses three complementary detection paradigms: (i) a RoBERTa-based transformer classifier for deep semantic feature extraction, (ii) a GPT-2-based probabilistic detector using perturbation-induced likelihood curvature, and (iii) a statistical linguistic feature analyzer capturing stylometric patterns. The core novelty lies in an optimized weighted voting framework, where ensemble weights are learned on the probability simplex to maximize F1-score rather than set heuristically. We provide a bias-variance analysis and empirically demonstrate low inter-model correlation (rho ~ 0.35-0.42), a key condition for variance reduction. Evaluated on a large-scale, multigenerator corpus of 30,000 documents, our system achieves 94.2% accuracy and an AUC of 0.978, with a 35% relative reduction in false positives on academic text. This yields a more reliable and ethically responsible detector for real-world deployment in education and other high-stakes domains.
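
The weighted voting step (weights on the probability simplex, chosen to maximize F1) can be illustrated with a small sketch. The abstract does not say how the weights are optimized, so the exhaustive grid search below is a stand-in chosen for clarity; the function names, the 0.5 decision threshold, and the grid resolution are likewise assumptions.

```python
import itertools

import numpy as np


def f1_score_binary(y_true, y_pred):
    """F1 for binary labels, where 1 marks machine-generated text."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


def simplex_grid(n_models, steps):
    """Yield non-negative weight vectors that sum to 1, on a regular grid."""
    for combo in itertools.product(range(steps + 1), repeat=n_models):
        if sum(combo) == steps:
            yield np.array(combo) / steps


def fit_simplex_weights(probs, y_true, steps=20, threshold=0.5):
    """Pick the ensemble weights that maximize F1 on a validation set.

    probs has shape (n_documents, n_detectors); each column holds one
    detector's probability that a document is machine-generated.
    """
    best_w, best_f1 = None, -1.0
    for w in simplex_grid(probs.shape[1], steps):
        pred = (probs @ w >= threshold).astype(int)
        f1 = f1_score_binary(y_true, pred)
        if f1 > best_f1:
            best_w, best_f1 = w, f1
    return best_w, best_f1
```

A grid search is only practical for a handful of detectors; with three detectors and `steps=20` it evaluates 231 weight vectors.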

[NLP-63] From Topology to Retrieval: Decoding Embedding Spaces with Unified Signatures

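[Quick read]: This paper addresses how to characterize the geometric and topological structure of text embedding spaces, that is, how different embedding models organize points in high-dimensional space, in order to improve interpretability and uncover factors that drive downstream performance. The key contribution is Unified Topological Signatures (UTS), a framework that combines multiple topological and geometric measures to overcome the redundancy and limited discriminative power of individual metrics. UTS gives a holistic, multi-attribute characterization of embedding spaces and links topological structure to practical outcomes such as document retrieval effectiveness, supporting the value of a global view of embedding geometry.

Link: https://arxiv.org/abs/2511.22150
Authors: Florian Rottach,William Rudman,Bastain Rieck,Harrisen Scells,Carsten Eickhoff
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: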

Abstract:Studying how embeddings are organized in space not only enhances model interpretability but also uncovers factors that drive downstream task performance. In this paper, we present a comprehensive analysis of topological and geometric measures across a wide set of text embedding models and datasets. We find a high degree of redundancy among these measures and observe that individual metrics often fail to sufficiently differentiate embedding spaces. Building on these insights, we introduce Unified Topological Signatures (UTS), a holistic framework for characterizing embedding spaces. We show that UTS can predict model-specific properties and reveal similarities driven by model architecture. Further, we demonstrate the utility of our method by linking topological structure to ranking effectiveness and accurately predicting document retrievability. We find that a holistic, multi-attribute perspective is essential to understanding and leveraging the geometry of text embeddings.

[NLP-64] C2DLM: Causal Concept-Guided Diffusion Large Language Models

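[Quick read]: This paper addresses the insufficient reasoning capabilities of the two mainstream LLM paradigms, autoregressive (AR) language models and Diffusion Language Models (DLMs), and in particular their limits in modeling the causal structure of natural language. AR models are bound to a strict left-to-right generation order and cannot capture the more flexible causal relations in language, while DLMs use fully connected attention that disregards causal order entirely. To fill this gap, the authors propose the Causal Concept-Guided Diffusion Language Model (C²DLM). It first obtains a concept-level causal graph from a teacher model and then explicitly guides attention to learn causal relationships between concepts, avoiding interference from causal inversion. The method improves by 12% on the COT-OrderPerturb task with about a 3.2x training speedup, and gains 1.31% on average across six downstream reasoning tasks.

Link: https://arxiv.org/abs/2511.22146
Authors: Kairong Han,Nuanqiao Shan,Ziyu Zhao,Zijing Hu,Xinpeng Dong,Junjian Ye,Lujia Pan,Fei Wu,Kun Kuang
Affiliations: Zhejiang University; Huawei Technologies
Categories: Computation and Language (cs.CL)
Comments: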

Abstract:Autoregressive (AR) language models and Diffusion Language Models (DLMs) constitute the two principal paradigms of large language models. However, both paradigms suffer from insufficient reasoning capabilities. Human reasoning inherently relies on causal knowledge and thought, which are reflected in natural language. But in the AR paradigm, language is modeled as next token prediction (a strictly left-to-right, token-by-token order), whereas natural language itself exhibits more flexible causal structures. In the DLM paradigm, the attention mechanism is fully connected, which entirely disregards causal order. To fill this gap, we propose a Causal Concept-Guided Diffusion Language Model (C²DLM). Starting from DLM’s fully connected attention, C²DLM first obtains a concept-level causal graph from the teacher model, and then explicitly guides attention to learn causal relationships between concepts. By focusing on causal relationships and avoiding interference from difficult subgoals involving causal inversion, C²DLM improves 12% with about 3.2 times training speedup in the COT-OrderPerturb task, and achieves an average gain of 1.31% across six downstream reasoning tasks. More details are available in the repository: this https URL.

[NLP-65] Bridging the Modality Gap by Similarity Standardization with Pseudo-Positive Samples ACL

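[Quick read]: This paper addresses the "modality gap" in cross-modal retrieval with vision-language models (VLMs): when a database contains both text and images, similarity scores differ in scale by modality, which hurts retrieval accuracy. The key idea is similarity standardization based on pseudo data. Pseudo pairs are built by retrieving, for each query, the text and image candidates with the highest cosine similarity. The mean and variance of similarity scores are computed per modality from these pairs, and all similarity scores are then standardized with these modality-specific statistics so that they are comparable on a common scale. The method needs no manually labeled data. It is evaluated across seven VLMs and two multi-modal QA benchmarks (MMQA and WebQA), with especially large gains in cross-modal retrieval (for example, an average Recall@20 gain of 64% on MMQA).

Link: https://arxiv.org/abs/2511.22141
Authors: Shuhei Yamashita,Daiki Shirafuji,Tatsuhiko Saito
Affiliations: Mitsubishi Electric Corporation
Categories: Computation and Language (cs.CL)
Comments: Accepted to PACLIC2025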

Abstract:Advances in vision-language models (VLMs) have enabled effective cross-modality retrieval. However, when both text and images exist in the database, similarity scores would differ in scale by modality. This phenomenon, known as the modality gap, hinders accurate retrieval. Most existing studies address this issue with manually labeled data, e.g., by fine-tuning VLMs on them. In this work, we propose a similarity standardization approach with pseudo data construction. We first compute the mean and variance of the similarity scores between each query and its paired data in text or image modality. Using these modality-specific statistics, we standardize all similarity scores to compare on a common scale across modalities. These statistics are calculated from pseudo pairs, which are constructed by retrieving the text and image candidates with the highest cosine similarity to each query. We evaluate our method across seven VLMs using two multi-modal QA benchmarks (MMQA and WebQA), where each question requires retrieving either text or image data. Our experimental results show that our method significantly improves retrieval performance, achieving average Recall@20 gains of 64% on MMQA and 28% on WebQA when the query and the target data belong to different modalities. Compared to E5-V, which addresses the modality gap through image captioning, we confirm that our method more effectively bridges the modality gap.
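
The standardization step can be sketched as a per-modality z-score. The code below is an illustration under stated assumptions, not the authors' code: it treats the single highest-similarity candidate per query as the pseudo-positive, uses NumPy, and all function names are hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(queries, candidates):
    """Cosine similarity between every query row and every candidate row."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return q @ c.T


def standardize_with_pseudo_pairs(sim, eps=1e-12):
    """Standardize one modality's scores using pseudo-positive statistics.

    The pseudo-positive for each query is assumed to be its top-1 candidate;
    the mean and standard deviation of those top-1 scores rescale all scores.
    """
    pseudo = sim.max(axis=1)
    return (sim - pseudo.mean()) / (pseudo.std() + eps)


def rank_across_modalities(sim_text, sim_image):
    """Rank text and image candidates jointly after standardization.

    Returned indices refer to the concatenation [text candidates, image candidates].
    """
    z = np.concatenate(
        [standardize_with_pseudo_pairs(sim_text),
         standardize_with_pseudo_pairs(sim_image)],
        axis=1,
    )
    return np.argsort(-z, axis=1)
```

After this step the pseudo-positive scores in each modality have mean 0 and standard deviation close to 1, so a text score and an image score can be compared directly in one ranked list.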

[NLP-66] A Hybrid Theory and Data-driven Approach to Persuasion Detection with Large Language Models

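[Quick read]: This paper addresses the difficulty of applying traditional belief revision models to large-scale, text-based interaction on social media, that is, how to predict persuasion success in online discourse. The approach combines features from psychological experiments with large language models (LLMs): an LLM automatically rates eight features previously examined in the literature, and a random forest classifier built on these ratings predicts whether a message will lead to belief change. Epistemic emotion and willingness to share are the two top-ranking predictors. The results indicate that LLMs can enhance theory-based models of persuasion, with applications in online influence detection, misinformation mitigation, and measuring the effectiveness of online narratives.

Link: https://arxiv.org/abs/2511.22109
Authors: Gia Bao Hoang,Keith J Ransom,Rachel Stephens,Carolyn Semmler,Nicolas Fay,Lewis Mitchell
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: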

Abstract:Traditional psychological models of belief revision focus on face-to-face interactions, but with the rise of social media, more effective models are needed to capture belief revision at scale, in this rich text-based online discourse. Here, we use a hybrid approach, utilizing large language models (LLMs) to develop a model that predicts successful persuasion using features derived from psychological experiments. Our approach leverages LLM-generated ratings of features previously examined in the literature to build a random forest classification model that predicts whether a message will result in belief change. Of the eight features tested, epistemic emotion and willingness to share were the top-ranking predictors of belief change in the model. Our findings provide insights into the characteristics of persuasive messages and demonstrate how LLMs can enhance models of successful persuasion based on psychological theory. Given these insights, this work has broader applications in fields such as online influence detection and misinformation mitigation, as well as measuring the effectiveness of online narratives. Journal reference: ICWSM Workshop Proceedings (2025). Related DOI: https://doi.org/10.36190/2025.38

[NLP-67] Early Risk Prediction with Temporally and Contextually Grounded Clinical Language Processing

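[Quick read]: This paper addresses the challenges of using clinical notes in Electronic Health Records (EHRs) for early risk prediction of chronic disease: long text, irregular event distribution, complex temporal dependencies, privacy constraints, and resource limitations. It proposes two complementary methods. HiTGNN, a hierarchical temporal graph neural network, integrates intra-note temporal event structure, inter-visit dynamics, and medical knowledge to model patient trajectories at fine temporal granularity. ReVeAL, a lightweight test-time framework, distills the reasoning of large language models into smaller verifier models, improving sensitivity to true Type 2 Diabetes (T2D) cases while retaining explanatory reasoning. In the reported experiments, HiTGNN achieves the highest predictive accuracy, especially for near-term risk, while preserving privacy and limiting reliance on large proprietary models, and it performs more equitably across subgroups.

Link: https://arxiv.org/abs/2511.22038
Authors: Rochana Chaturvedi,Yue Zhou,Andrew Boyd,Brian T. Layden,Mudassir Rashid,Lu Cheng,Ali Cinar,Barbara Di Eugenio
Affiliations: Argonne National Laboratory; University of Illinois Chicago; Illinois Institute of Technology
Categories: Computation and Language (cs.CL)
Comments: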

Abstract:Clinical notes in Electronic Health Records (EHRs) capture rich temporal information on events, clinician reasoning, and lifestyle factors often missing from structured data. Leveraging them for predictive modeling can be impactful for timely identification of chronic diseases. However, they present core natural language processing (NLP) challenges: long text, irregular event distribution, complex temporal dependencies, privacy constraints, and resource limitations. We present two complementary methods for temporally and contextually grounded risk prediction from longitudinal notes. First, we introduce HiTGNN, a hierarchical temporal graph neural network that integrates intra-note temporal event structures, inter-visit dynamics, and medical knowledge to model patient trajectories with fine-grained temporal granularity. Second, we propose ReVeAL, a lightweight, test-time framework that distills the reasoning of large language models into smaller verifier models. Applied to opportunistic screening for Type 2 Diabetes (T2D) using temporally realistic cohorts curated from private and public hospital corpora, HiTGNN achieves the highest predictive accuracy, especially for near-term risk, while preserving privacy and limiting reliance on large proprietary models. ReVeAL enhances sensitivity to true T2D cases and retains explanatory reasoning. Our ablations confirm the value of temporal structure and knowledge augmentation, and fairness analysis shows HiTGNN performs more equitably across subgroups.

[NLP-68] ResearchArcade: Graph Interface for Academic Tasks

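[Quick read]: This paper addresses the diversity of data sources, inconsistent task definitions, and limited model support in academic research. The core challenge is building a general data framework that integrates heterogeneous sources, unifies task interfaces, and supports a range of base models, so that machine learning can better assist the whole research process. The proposed solution is ResearchArcade, a graph-based unified data interface. It organizes multi-modal information (text, figures, tables) from sources such as ArXiv and OpenReview in a multi-table format, and preserves temporal evolution at both the manuscript and community levels, enabling joint modeling across sources and tasks.

Link: https://arxiv.org/abs/2511.22036
Authors: Jingjun Xu,Chongshan Lin,Haofei Yu,Tao Feng,Jiaxuan You
Affiliations: University of Illinois Urbana-Champaign
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: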

Abstract:Academic research generates diverse data sources, and as researchers increasingly use machine learning to assist research tasks, a crucial question arises: Can we build a unified data interface to support the development of machine learning models for various academic tasks? Models trained on such a unified interface can better support human researchers throughout the research process, eventually accelerating knowledge discovery. In this work, we introduce ResearchArcade, a graph-based interface that connects multiple academic data sources, unifies task definitions, and supports a wide range of base models to address key academic challenges. ResearchArcade utilizes a coherent multi-table format with graph structures to organize data from different sources, including academic corpora from ArXiv and peer reviews from OpenReview, while capturing information with multiple modalities, such as text, figures, and tables. ResearchArcade also preserves temporal evolution at both the manuscript and community levels, supporting the study of paper revisions as well as broader research trends over time. Additionally, ResearchArcade unifies diverse academic task definitions and supports various models with distinct input requirements. Our experiments across six academic tasks demonstrate that combining cross-source and multi-modal information enables a broader range of tasks, while incorporating graph structures consistently improves performance over baseline methods. This highlights the effectiveness of ResearchArcade and its potential to advance research progress.

[NLP-69] AfriStereo: A Culturally Grounded Dataset for Evaluating Stereotypical Bias in Large Language Models

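[Quick read]: This paper addresses the fact that existing AI bias evaluation benchmarks largely reflect Western perspectives and neglect African contexts, which enables harmful stereotypes across applications. The key contribution is AfriStereo, the first open-source African stereotype dataset and evaluation framework grounded in local socio-cultural contexts. Through community-engaged collection across Senegal, Kenya, and Nigeria, the authors gathered 1,163 stereotypes spanning gender, ethnicity, religion, age, and profession. They then used few-shot prompting with human-in-the-loop validation to expand the set to over 5,000 stereotype-antistereotype pairs, with quality control through semantic clustering and manual annotation by culturally informed reviewers. In a preliminary evaluation, nine of eleven language models show statistically significant bias (Bias Preference Ratios, BPR, of 0.63–0.78, p = 0.05), preferring stereotypes particularly along the age, profession, and gender dimensions. Domain-specific models show weaker bias, suggesting that task-specific training may mitigate some associations. The work offers methodology for future culturally grounded bias evaluation and mitigation.

Link: https://arxiv.org/abs/2511.22016
Authors: Yann Le Beux,Oluchi Audu,Oche D. Ankeli,Dhananjay Balakrishnan,Melissah Weya,Marie D. Ralaiarinosy,Ignatius Ezeani
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: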

Abstract:Existing AI bias evaluation benchmarks largely reflect Western perspectives, leaving African contexts underrepresented and enabling harmful stereotypes in applications across various domains. To address this gap, we introduce AfriStereo, the first open-source African stereotype dataset and evaluation framework grounded in local socio-cultural contexts. Through community engaged efforts across Senegal, Kenya, and Nigeria, we collected 1,163 stereotypes spanning gender, ethnicity, religion, age, and profession. Using few-shot prompting with human-in-the-loop validation, we augmented the dataset to over 5,000 stereotype-antistereotype pairs. Entries were validated through semantic clustering and manual annotation by culturally informed reviewers. Preliminary evaluation of language models reveals that nine of eleven models exhibit statistically significant bias, with Bias Preference Ratios (BPR) ranging from 0.63 to 0.78 (p = 0.05), indicating systematic preferences for stereotypes over antistereotypes, particularly across age, profession, and gender dimensions. Domain-specific models appeared to show weaker bias in our setup, suggesting task-specific training may mitigate some associations. Looking ahead, AfriStereo opens pathways for future research on culturally grounded bias evaluation and mitigation, offering key methodologies for the AI community on building more equitable, context-aware, and globally inclusive NLP technologies.
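
To make the reported Bias Preference Ratio (BPR) values easier to read, here is a small sketch of one plausible way to compute such a ratio. The abstract does not define BPR precisely, so this assumes it is the fraction of stereotype-antistereotype pairs for which a model scores the stereotype higher (for instance by log-likelihood), with 0.5 meaning no preference. The one-sided exact binomial test is also an assumption, added only to show how significance against 0.5 could be checked.

```python
from math import comb


def bias_preference_ratio(stereo_scores, anti_scores):
    """Fraction of pairs where the stereotype gets the higher model score.

    Assumed definition: 0.5 means no systematic preference, values above
    0.5 mean the model prefers stereotypes over antistereotypes.
    """
    prefers_stereo = sum(s > a for s, a in zip(stereo_scores, anti_scores))
    return prefers_stereo / len(stereo_scores)


def binomial_p_value_one_sided(k, n, p=0.5):
    """Exact probability of k or more successes out of n under chance rate p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))
```

Under this reading, a BPR of 0.63 means the model scored the stereotype higher in 63% of pairs.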

[NLP-70] Start Making Sense(s): A Developmental Probe of Attention Specialization Using Lexical Ambiguity ACL

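[Quick read]: This paper addresses how self-attention matrix operations in Transformer language models (LMs) map onto interpretable computations or functions, and how and when individual attention heads develop specialized attention patterns. The key contribution is a pipeline for systematically probing attention mechanisms from a "developmental" perspective, using lexical ambiguity to isolate mechanisms that contribute to word sense disambiguation. The study first identifies inflection points in disambiguation performance over training for each model, then finds heads whose attention covaries with overall performance, and finally tests the robustness and necessity of these heads with perturbation tests and causal (ablation) analysis. The results contrast sensitive, brittle mechanisms in a small model (14M) with more generalizable attention behavior in a larger model (410M).

Link: https://arxiv.org/abs/2511.21974
Authors: Pamela D. Rivière,Sean Trott
Affiliations: University of California, San Diego; Rutgers University - Newark
Categories: Computation and Language (cs.CL)
Comments: 13 pages (main text), 5 figures (main text) 6 pages (appendix), 6 figures (appendix), journal submission to TACL (“a” decision: pre-MIT Press publication version)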

Abstract:Despite an in-principle understanding of self-attention matrix operations in Transformer language models (LMs), it remains unclear precisely how these operations map onto interpretable computations or functions–and how or when individual attention heads develop specialized attention patterns. Here, we present a pipeline to systematically probe attention mechanisms, and we illustrate its value by leveraging lexical ambiguity–where a single word has multiple meanings–to isolate attention mechanisms that contribute to word sense disambiguation. We take a “developmental” approach: first, using publicly available Pythia LM checkpoints, we identify inflection points in disambiguation performance for each LM in the suite; in 14M and 410M, we identify heads whose attention to disambiguating words covaries with overall disambiguation performance across development. We then stress-test the robustness of these heads to stimulus perturbations: in 14M, we find limited robustness, but in 410M, we identify multiple heads with surprisingly generalizable behavior. Then, in a causal analysis, we find that ablating the target heads demonstrably impairs disambiguation performance, particularly in 14M. We additionally reproduce developmental analyses of 14M across all of its random seeds. Together, these results suggest: that disambiguation benefits from a constellation of mechanisms, some of which (especially in 14M) are highly sensitive to the position and part-of-speech of the disambiguating cue; and that larger models (410M) may contain heads with more robust disambiguation behavior. They also join a growing body of work that highlights the value of adopting a developmental perspective when probing LM mechanisms.

[NLP-71] A Comparative Study of LLM Prompting and Fine-Tuning for Cross-genre Authorship Attribution on Chinese Lyrics

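[Quick read]: This paper addresses authorship attribution for Chinese lyrics, a domain that lacks clean, publicly available datasets. It makes two contributions: a new, balanced dataset of Chinese lyrics spanning multiple genres, and a fine-tuned domain-specific model compared against zero-shot inference with the DeepSeek LLM. Performance depends strongly on genre: structured genres (e.g. Folklore Tradition) yield much higher attribution accuracy than more abstract genres (e.g. Love Romance), which underlines the need for genre-sensitive evaluation. Fine-tuning improves robustness and generalization on real-world data and difficult genres, but gives limited gains on a smaller, synthetically augmented test set, whose design limitations (label imbalance, shallow lexical differences, narrow genre sampling) can obscure its true benefit. The work establishes the first benchmark for cross-genre Chinese lyric attribution and provides a reusable dataset and analytical framework. It recommends enlarging and diversifying test sets, reducing reliance on token-level data augmentation, balancing author representation across genres, and exploring domain-adaptive pretraining.

Link: https://arxiv.org/abs/2511.21930
Authors: Yuxin Li,Lorraine Xu,Meng Fan Wang
Affiliations: Columbia University
Categories: Computation and Language (cs.CL)
Comments: 8 pages, 6 figures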

Abstract:We propose a novel study on authorship attribution for Chinese lyrics, a domain where clean, public datasets are sorely lacking. Our contributions are twofold: (1) we create a new, balanced dataset of Chinese lyrics spanning multiple genres, and (2) we develop and fine-tune a domain-specific model, comparing its performance against zero-shot inference using the DeepSeek LLM. We test two central hypotheses. First, we hypothesize that a fine-tuned model will outperform a zero-shot LLM baseline. Second, we hypothesize that performance is genre-dependent. Our experiments strongly confirm Hypothesis 2: structured genres (e.g. Folklore Tradition) yield significantly higher attribution accuracy than more abstract genres (e.g. Love Romance). Hypothesis 1 receives only partial support: fine-tuning improves robustness and generalization in Test1 (real-world data and difficult genres), but offers limited or ambiguous gains in Test2, a smaller, synthetically-augmented set. We show that the design limitations of Test2 (e.g., label imbalance, shallow lexical differences, and narrow genre sampling) can obscure the true effectiveness of fine-tuning. Our work establishes the first benchmark for cross-genre Chinese lyric attribution, highlights the importance of genre-sensitive evaluation, and provides a public dataset and analytical framework for future research. We conclude with recommendations: enlarge and diversify test sets, reduce reliance on token-level data augmentation, balance author representation across genres, and investigate domain-adaptive pretraining as a pathway for improved attribution performance.

[NLP-72] Tracing How Annotators Think: Augmenting Preference Judgments with Reading Processes

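[Quick read]: This paper addresses the limited understanding of annotation reliability and decision-making in subjective natural language processing (NLP) tasks such as preference annotation. Conventional annotation records only the final label and ignores the cognitive behavior behind the annotator's decision, which makes disagreement hard to explain and reliability hard to assess. The key contribution is an annotation framework that uses mouse tracking to capture fine-grained reading behavior as annotators move between a prompt and candidate responses, including which regions they focus on, re-read, or skim. The resulting PreferRead dataset shows significant links between reading behavior and annotation outcomes: re-reading is associated with higher agreement, while longer reading paths and times are associated with lower agreement. Reading processes thus add a complementary dimension for understanding annotators' cognitive processes.

Link: https://arxiv.org/abs/2511.21912
Authors: Karin de Langis,William Walker,Khanh Chi Le,Dongyeop Kang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: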

Abstract:We propose an annotation approach that captures not only labels but also the reading process underlying annotators’ decisions, e.g., what parts of the text they focus on, re-read or skim. Using this framework, we conduct a case study on the preference annotation task, creating a dataset PreferRead that contains fine-grained annotator reading behaviors obtained from mouse tracking. PreferRead enables detailed analysis of how annotators navigate between a prompt and two candidate responses before selecting their preference. We find that annotators re-read a response in roughly half of all trials, most often revisiting the option they ultimately choose, and rarely revisit the prompt. Reading behaviors are also significantly related to annotation outcomes: re-reading is associated with higher inter-annotator agreement, whereas long reading paths and times are associated with lower agreement. These results demonstrate that reading processes provide a complementary cognitive dimension for understanding annotator reliability, decision-making and disagreement in complex, subjective NLP tasks. Our code and data are publicly available.

[NLP-73] A Customer Journey in the Land of Oz: Leveraging the Wizard of Oz Technique to Model Emotions in Customer Service Interactions

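[Quick read]: This paper addresses the lack of in-domain conversational data, rich annotations, and predictive capabilities for emotion-aware customer service; existing resources are often out-of-domain, narrowly labeled, and focused on post-hoc emotion recognition. The authors ran a controlled Wizard of Oz (WOZ) experiment in which human operators steer conversations so that participants follow targeted affective trajectories. The resulting corpus, EmoWOZ-CS, contains 2,148 bilingual (Dutch-English) emotion-annotated dialogues from commercial aviation, e-commerce, online travel agency, and telecommunication scenarios. The work evaluates WOZ-based trajectory steering as a design for emotion research, quantifies variation in human annotation, and benchmarks forward-looking emotion inference in real-time support, highlighting the difficulty of proactive emotion-aware support.

Link: https://arxiv.org/abs/2511.21909
Authors: Sofie Labat,Thomas Demeester,Véronique Hoste
Affiliations: Ghent University; IDLab, Internet technology and Data science Lab, Ghent University–imec
Categories: Computation and Language (cs.CL)
Comments: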

Abstract:Emotion-aware customer service needs in-domain conversational data, rich annotations, and predictive capabilities, but existing resources for emotion recognition are often out-of-domain, narrowly labeled, and focused on post-hoc detection. To address this, we conducted a controlled Wizard of Oz (WOZ) experiment to elicit interactions with targeted affective trajectories. The resulting corpus, EmoWOZ-CS, contains 2,148 bilingual (Dutch-English) written dialogues from 179 participants across commercial aviation, e-commerce, online travel agencies, and telecommunication scenarios. Our contributions are threefold: (1) Evaluate WOZ-based operator-steered valence trajectories as a design for emotion research; (2) Quantify human annotation performance and variation, including divergences between self-reports and third-party judgments; (3) Benchmark detection and forward-looking emotion inference in real-time support. Findings show neutral dominates participant messages; desire and gratitude are the most frequent non-neutral emotions. Agreement is moderate for multilabel emotions and valence, lower for arousal and dominance; self-reports diverge notably from third-party labels, aligning most for neutral, gratitude, and anger. Objective strategies often elicit neutrality or gratitude, while suboptimal strategies increase anger, annoyance, disappointment, desire, and confusion. Some affective strategies (cheerfulness, gratitude) foster positive reciprocity, whereas others (apology, empathy) can also leave desire, anger, or annoyance. Temporal analysis confirms successful conversation-level steering toward prescribed trajectories, most distinctly for negative targets; positive and neutral targets yield similar final valence distributions. Benchmarks highlight the difficulty of forward-looking emotion inference from prior turns, underscoring the complexity of proactive emotion-aware support.

[NLP-74] Improving Score Reliability of Multiple Choice Benchmarks with Consistency Evaluation and Altered Answer Choices

[Quick read]: This paper addresses the unreliability of Large Language Model (LLM) scores on multiple-choice question answering (MCQA) benchmarks: a model can achieve a high MCQA score while responding inconsistently, so its true ability is overestimated. The key contribution is a new metric, Consistency-Rebalanced Accuracy (CoRA). Using two intermediate scores, Bare-Minimum-Consistency Accuracy (BMCA) and Consistency Index (CI), and synthetically generated questions with altered answer choices, it quantifies response consistency and adjusts the original MCQA score accordingly, giving a more faithful picture of LLM performance.

Link: https://arxiv.org/abs/2511.21860
Authors: Paulo Cavalin,Cassia Sanctos,Marcelo Grave,Claudio Pinhanez,Yago Primerano
Affiliations: IBM Research Brazil
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:In this work we present the Consistency-Rebalanced Accuracy (CoRA) metric, improving the reliability of Large Language Model (LLM) scores computed on multiple choice (MC) benchmarks. Our metric explores the response consistency of the LLMs, taking advantage of synthetically-generated questions with altered answer choices. With two intermediate scores, i.e. Bare-Minimum-Consistency Accuracy (BMCA) and Consistency Index (CI), CoRA is computed by adjusting the multiple-choice question answering (MCQA) scores to better reflect the level of consistency of the LLM. We present evaluations in different benchmarks using diverse LLMs, and not only demonstrate that LLMs can present low response consistency even when they present high MCQA scores, but also that CoRA can successfully scale down the scores of inconsistent models.

[NLP-75] FLAWS: A Benchmark for Error Identification and Localization in Scientific Papers

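[Quick read]: This paper tackles the automated evaluation of error localization in scientific papers, that is, how to identify and pinpoint errors that undermine a study's core claims. With scientific output growing exponentially, human reviewers cannot keep up with error detection across the literature, and the potential of large language models (LLMs) to assist academic review remains underexplored. The key to the solution is FLAWS (Fault Localization Across Writing in Science), an automated benchmark of 713 paper-error pairs built by systematically inserting claim-invalidating errors into published papers, together with a scalable, automated metric that measures how well LLMs identify and localize those errors. The benchmark addresses three challenges: keeping the inserted errors well-defined and challenging, avoiding artifacts that would give the answer away, and making evaluation efficient and objective. In the experiments, GPT 5 achieves the highest identification accuracy, 39.1% at k=10, supporting the benchmark's validity and its value for quantifying LLM capabilities.

Link: https://arxiv.org/abs/2511.21843
Authors: Sarina Xi, Vishisht Rao, Justin Payan, Nihar B. Shah
Affiliation: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL); Machine Learning (cs.LG)
Comments: 30 pages, 12 tables, 2 figures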

Abstract:The identification and localization of errors is a core task in peer review, yet the exponential growth of scientific output has made it increasingly difficult for human reviewers to reliably detect errors given the limited pool of experts. Recent advances in Large Language Models (LLMs) have sparked interest in their potential to support such evaluation tasks, from academic peer review to automated scientific assessment. However, despite the growing use of LLMs in review systems, their capabilities to pinpoint errors remain underexplored. In this work, we introduce Fault Localization Across Writing in Science (FLAWS), an automated benchmark consisting of 713 paper-error pairs designed to evaluate how effectively LLMs detect errors that undermine key claims in research papers. We construct the benchmark by systematically inserting claim-invalidating errors into peer-reviewed papers using LLMs, paired with an automated evaluation metric that measures whether models can identify and localize these errors. Developing such a benchmark presents unique challenges that we overcome: ensuring that the inserted errors are well-defined, challenging, and relevant to the content of the paper, avoiding artifacts that would make identification trivial, and designing a scalable, automated evaluation metric. On the resulting benchmark, we evaluate five frontier LLMs: Claude Sonnet 4.5, DeepSeek Reasoner v3.1, Gemini 2.5 Pro, GPT 5, and Grok 4. Among these, GPT 5 is the top-performing model, achieving 39.1% identification accuracy when k=10, where k is the number of top-ranked error text candidates generated by the LLM.
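
To make "identification accuracy when k=10" concrete, here is a minimal sketch of a top-k hit rate over paper-error pairs. The abstract does not describe how a candidate is matched against the inserted error, so the substring rule below, along with the names `identified_at_k` and `identification_accuracy`, is an assumption for illustration only.

```python
from typing import List, Sequence


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so matching ignores formatting."""
    return " ".join(text.lower().split())


def identified_at_k(candidates: Sequence[str], inserted_error: str, k: int) -> bool:
    """Return True if any of the top-k candidates overlaps the inserted error.

    Assumption: a candidate "hits" when it contains, or is contained in, the
    inserted error text after normalization. The benchmark's real matching
    rule may differ.
    """
    target = _normalize(inserted_error)
    for cand in candidates[:k]:
        c = _normalize(cand)
        if c and (c in target or target in c):
            return True
    return False


def identification_accuracy(ranked: List[List[str]], errors: List[str], k: int) -> float:
    """Fraction of paper-error pairs whose error is found within the top k."""
    if not errors or len(ranked) != len(errors):
        raise ValueError("ranked and errors must be non-empty and the same length")
    hits = sum(identified_at_k(c, e, k) for c, e in zip(ranked, errors))
    return hits / len(errors)


# Toy data: one ranked candidate list per paper, best guess first.
ranked_candidates = [
    ["the sample size was 50", "we use a learning rate of 0.1"],
    ["results are significant at p < 0.5"],
]
inserted_errors = ["We use a learning rate of 0.1", "accuracy improves by 12%"]

acc_at_1 = identification_accuracy(ranked_candidates, inserted_errors, k=1)
acc_at_2 = identification_accuracy(ranked_candidates, inserted_errors, k=2)
print(acc_at_1, acc_at_2)
```

Raising k can only keep or increase the score, since a larger k considers a superset of the candidates.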

[NLP-76] Factors That Support Grounded Responses in LLM Conversations: A Rapid Review

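[Quick read]: This paper addresses problems that can arise in conversations with large language models (LLMs): outputs that are misaligned with user intent, lack contextual grounding, or contain hallucinations, all of which markedly reduce the reliability of LLM-based applications. The key to the solution is a systematic identification and analysis of three classes of alignment strategies: inference-time, post-training, and reinforcement learning-based methods. The review finds inference-time methods the most efficient, since they align outputs with user intent, strengthen contextual grounding, and mitigate hallucination without retraining, and thereby provide structured mechanisms for improving the quality and reliability of LLM outputs.

Link: https://arxiv.org/abs/2511.21762
Authors: Gabriele Cesar Iwashima, Claudia Susie Rodrigues, Claudio Dipolitto, Geraldo Xexéo
Affiliation: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 28 pages, 1 figure, 3 tables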

Abstract:Large language models (LLMs) may generate outputs that are misaligned with user intent, lack contextual grounding, or exhibit hallucinations during conversation, which compromises the reliability of LLM-based applications. This review aimed to identify and analyze techniques that align LLM responses with conversational goals, ensure grounding, and reduce hallucination and topic drift. We conducted a Rapid Review guided by the PRISMA framework and the PICO strategy to structure the search, filtering, and selection processes. The alignment strategies identified were categorized according to the LLM lifecycle phase in which they operate: inference-time, post-training, and reinforcement learning-based methods. Among these, inference-time approaches emerged as particularly efficient, aligning outputs without retraining while supporting user intent, contextual grounding, and hallucination mitigation. The reviewed techniques provided structured mechanisms for improving the quality and reliability of LLM responses across key alignment objectives.

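[NLP-77] LLMs for Low-Resource Dialect Translation Using Context-Aware Prompting: A Case Study on Sylheti

[Quick read]: This paper addresses the poor performance of large language models (LLMs) on dialectal and low-resource translation, focusing on machine translation (MT) for Sylheti, a low-resource dialect of Bangla. Although existing LLMs perform well on general translation, they are clearly limited when handling dialect-specific vocabulary and expressions. To address this, the authors propose Sylheti-CAP (Context-Aware Prompting), a three-step prompting framework that embeds a linguistic rulebook, adds a dictionary of 2,260 core vocabulary items and idioms, and introduces an authenticity check, all integrated into the prompt to strengthen context awareness. Experiments show that the method significantly improves translation quality in both directions (Bangla ⇔ Sylheti) across multiple models and reduces hallucinations, ambiguities, and awkward phrasing, offering a scalable solution for dialectal and low-resource MT.

Link: https://arxiv.org/abs/2511.21761
Authors: Tabia Tanzin Prama, Christopher M. Danforth, Peter Sheridan Dodds
Affiliation: unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: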

Abstract:Large Language Models (LLMs) have demonstrated strong translation abilities through prompting, even without task-specific training. However, their effectiveness in dialectal and low-resource contexts remains underexplored. This study presents the first systematic investigation of LLM-based machine translation (MT) for Sylheti, a dialect of Bangla that is itself low-resource. We evaluate five advanced LLMs (GPT-4.1, GPT-4.1, LLaMA 4, Grok 3, and DeepSeek V3.2) across both translation directions (Bangla ⇔ Sylheti), and find that these models struggle with dialect-specific vocabulary. To address this, we introduce Sylheti-CAP (Context-Aware Prompting), a three-step framework that embeds a linguistic rulebook, a dictionary (2,260 core vocabulary items and idioms), and an authenticity check directly into prompts. Extensive experiments show that Sylheti-CAP consistently improves translation quality across models and prompting strategies. Both automatic metrics and human evaluations confirm its effectiveness, while qualitative analysis reveals notable reductions in hallucinations, ambiguities, and awkward phrasing, establishing Sylheti-CAP as a scalable solution for dialectal and low-resource MT. Dataset link: this https URL

[NLP-78] fMRI-LM: Towards a Universal Foundation Model for Language-Aligned fMRI Understanding

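[Quick read]: This paper aims to extend the capabilities of multimodal large language models (MLLMs) to brain imaging, in order to model neural activity and semantic cognition jointly and to build cross-modal brain representations. The core challenges are the lack of natural fMRI-text pairs and the need to turn functional magnetic resonance imaging (fMRI) signals into structured representations that a language model can understand. The key to the solution is a three-stage framework. First, a neural tokenizer maps fMRI signals to discrete tokens in a language-consistent space. Second, a pretrained LLM jointly models fMRI tokens and text, with a purpose-built corpus of structured textual descriptions compensating for data scarcity. Third, multi-task, multi-paradigm instruction tuning gives the model high-level semantic understanding. The resulting model achieves zero-shot and few-shot performance across a range of downstream tasks and supports parameter-efficient fine-tuning (e.g., LoRA), providing a scalable path toward a language-aligned, general-purpose model for structural and semantic understanding of fMRI.

Link: https://arxiv.org/abs/2511.21760
Authors: Yuxiang Wei, Yanteng Zhang, Xi Xiao, Chengxuan Qian, Tianyang Wang, Vince D. Calhoun
Affiliation: TreNDS; University of Alabama at Birmingham; Jiangsu University; Georgia Institute of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: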

Abstract:Recent advances in multimodal large language models (LLMs) have enabled unified reasoning across images, audio, and video, but extending such capability to brain imaging remains largely unexplored. Bridging this gap is essential to link neural activity with semantic cognition and to develop cross-modal brain representations. To this end, we present fMRI-LM, a foundational model that bridges functional MRI (fMRI) and language through a three-stage framework. In Stage 1, we learn a neural tokenizer that maps fMRI into discrete tokens embedded in a language-consistent space. In Stage 2, a pretrained LLM is adapted to jointly model fMRI tokens and text, treating brain activity as a sequence that can be temporally predicted and linguistically described. To overcome the lack of natural fMRI-text pairs, we construct a large descriptive corpus that translates diverse imaging-based features into structured textual descriptors, capturing the low-level organization of fMRI signals. In Stage 3, we perform multi-task, multi-paradigm instruction tuning to endow fMRI-LM with high-level semantic understanding, supporting diverse downstream applications. Across various benchmarks, fMRI-LM achieves strong zero-shot and few-shot performance, and adapts efficiently with parameter-efficient tuning (LoRA), establishing a scalable pathway toward a language-aligned, universal model for structural and semantic understanding of fMRI.

[NLP-79] Orchestrating Dual-Boundaries: An Arithmetic Intensity Inspired Acceleration Framework for Diffusion Language Models

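[Quick read]: This paper addresses a bottleneck in diffusion-based large language model (dLLM) inference: bidirectional attention forces frequent cache refreshes, so prefill and decoding phases alternate, which substantially increases inference cost and limits achievable speedup. The key to the solution is the ODB-dLLM framework, which optimizes both phases jointly through dual boundaries. In the prefill phase, an adaptive length prediction mechanism dynamically adjusts the generation length to cut redundant computation. In the decoding phase, a jump-share speculative decoding method tailored to dLLMs reduces the number of decoding iterations. Experiments show speedups of 46–162× over the baseline dLLM and 2.63–6.30× over Fast-dLLM, while also mitigating the accuracy degradation seen in existing acceleration frameworks.

Link: https://arxiv.org/abs/2511.21759
Authors: Linye Wei, Wenjue Chen, Pingzhi Tang, Xiaotian Guo, Le Ye, Runsheng Wang, Meng Li
Affiliation: Peking University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: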

Abstract:Diffusion-based large language models (dLLMs) have recently gained significant attention for their exceptional performance and inherent potential for parallel decoding. Existing frameworks further enhance their inference efficiency by enabling KV caching. However, their bidirectional attention mechanism necessitates periodic cache refreshes that interleave prefill and decoding phases, both of which contribute substantial inference cost and constrain achievable speedup. Inspired by the heterogeneous arithmetic intensity of the prefill and decoding phases, we propose ODB-dLLM, a framework that orchestrates dual-boundaries to accelerate dLLM inference. In the prefill phase, we find that the predefined fixed response length introduces heavy yet redundant computational overhead, which affects efficiency. To alleviate this, ODB-dLLM incorporates an adaptive length prediction mechanism that progressively reduces prefill overhead and unnecessary computation. In the decoding phase, we analyze the computational characteristics of dLLMs and propose a dLLM-specific jump-share speculative decoding method to enhance efficiency by reducing the number of decoding iterations. Experimental results demonstrate that ODB-dLLM achieves 46-162x and 2.63-6.30x speedups over the baseline dLLM and Fast-dLLM, respectively, while simultaneously mitigating the accuracy degradation in existing acceleration frameworks.

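[NLP-80] Medical Malice: A Dataset for Context-Aware Safety in Healthcare LLMs

[Quick read]: This paper addresses ethical safety gaps that arise when large language models (LLMs) are deployed in healthcare with generic definitions of harm, which fail to recognize and prevent context-dependent violations such as administrative fraud and clinical discrimination. The key to the solution is the Medical Malice dataset: 214,219 adversarial prompts calibrated to the regulatory and ethical complexities of the Brazilian Unified Health System (SUS), each accompanied by the reasoning behind the violation, so that models can internalize ethical boundaries rather than memorize a fixed set of refusals. Using an unaligned agent (Grok-4) in a persona-driven pipeline, the study generates high-fidelity threats across seven categories (ranging from procurement manipulation and queue-jumping to obstetric violence) and argues for a shift from universal to context-aware safety, giving healthcare AI a form of immunization against the systemic, subtle threats of high-stakes medical environments.

Link: https://arxiv.org/abs/2511.21757
Authors: Andrew Maranhão Ventura D’addario
Affiliation: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: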

Abstract:The integration of Large Language Models (LLMs) into healthcare demands a safety paradigm rooted in primum non nocere. However, current alignment techniques rely on generic definitions of harm that fail to capture context-dependent violations, such as administrative fraud and clinical discrimination. To address this, we introduce Medical Malice: a dataset of 214,219 adversarial prompts calibrated to the regulatory and ethical complexities of the Brazilian Unified Health System (SUS). Crucially, the dataset includes the reasoning behind each violation, enabling models to internalize ethical boundaries rather than merely memorizing a fixed set of refusals. Using an unaligned agent (Grok-4) within a persona-driven pipeline, we synthesized high-fidelity threats across seven taxonomies, ranging from procurement manipulation and queue-jumping to obstetric violence. We discuss the ethical design of releasing these “vulnerability signatures” to correct the information asymmetry between malicious actors and AI developers. Ultimately, this work advocates for a shift from universal to context-aware safety, providing the necessary resources to immunize healthcare AI against the nuanced, systemic threats inherent to high-stakes medical environments – vulnerabilities that represent the paramount risk to patient safety and the successful integration of AI in healthcare systems.

[NLP-81] Dissecting the Ledger: Locating and Suppressing “Liar Circuits” in Financial Large Language Models

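[Quick read]: This paper addresses reproducible hallucinations that large language models (LLMs) produce during arithmetic reasoning in high-stakes financial applications. Existing mitigation strategies mostly treat the model as a black box and offer no understanding of the internal mechanism. The key to the solution is Causal Tracing, which reveals a dual-stage mechanism in GPT-2 XL during arithmetic reasoning: the middle layers (L12–L30) form a distributed computational scratchpad, while the late layers (Layer 46 in particular) form a decisive aggregation circuit. Suppressing Layer 46 lowers the model's confidence in hallucinatory outputs by 81.8%, and a linear probe trained on that layer still reaches 98% accuracy on unseen financial topics, suggesting that arithmetic deception has a universal geometry.

Link: https://arxiv.org/abs/2511.21756
Authors: Soham Mirajkar
Affiliation: unknown
Categories: Computation and Language (cs.CL); Computational Engineering, Finance, and Science (cs.CE)
Comments: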

Abstract:Large Language Models (LLMs) are increasingly deployed in high-stakes financial domains, yet they suffer from specific, reproducible hallucinations when performing arithmetic operations. Current mitigation strategies often treat the model as a black box. In this work, we propose a mechanistic approach to intrinsic hallucination detection. By applying Causal Tracing to the GPT-2 XL architecture on the ConvFinQA benchmark, we identify a dual-stage mechanism for arithmetic reasoning: a distributed computational scratchpad in middle layers (L12-L30) and a decisive aggregation circuit in late layers (specifically Layer 46). We verify this mechanism via an ablation study, demonstrating that suppressing Layer 46 reduces the model’s confidence in hallucinatory outputs by 81.8%. Furthermore, we demonstrate that a linear probe trained on this layer generalizes to unseen financial topics with 98% accuracy, suggesting a universal geometry of arithmetic deception.

[NLP-82] Extracting Disaster Impacts and Impact Related Locations in Social Media Posts Using Large Language Models

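[Quick read]: This paper addresses the geo-temporal information gaps that arise during large-scale disasters when sensor data and remote sensing imagery are limited, especially the resulting delays and inaccuracies in identifying where impacts occurred. Traditional methods struggle to obtain precise information on impacted locations in time, and although social media provides real-time updates, posts also mention many non-impacted locations (such as "Greece" and "Athens"), which makes resource allocation inefficient. The key to the solution is to use fine-tuned large language models (LLMs) for fine-grained analysis of disaster-related social media text, accurately separating impacts and impacted locations and handling informal expressions, abbreviations, and short forms, so that key geographic impact information can be extracted automatically from large volumes of unstructured text. Experiments show an F1-score of 0.74 for impacted location extraction, substantially better than the pre-trained baseline, offering scalable support for situational awareness and resource allocation in emergency response.

Link: https://arxiv.org/abs/2511.21753
Authors: Sameeah Noreen Hameed, Surangika Ranathunga, Raj Prasanna, Kristin Stock, Christopher B. Jones
Affiliation: Massey University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: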

Abstract:Large-scale disasters can often result in catastrophic consequences for people and infrastructure. Situation awareness about such disaster impacts generated by authoritative data from in-situ sensors, remote sensing imagery, and/or geographic data is often limited due to atmospheric opacity, satellite revisits, and time limitations. This often results in geo-temporal information gaps. In contrast, impact-related social media posts can act as “geo-sensors” during a disaster, where people describe specific impacts and locations. However, not all locations mentioned in disaster-related social media posts relate to an impact. Only the impacted locations are critical for directing resources effectively. For example, the post “The death toll from a fire which ripped through the Greek coastal town of #Mati stood at 80, with dozens of people unaccounted for as forensic experts tried to identify victims who were burned alive #Greecefires #AthensFires #Athens #Greece.” contains the impacted location “Mati” and the non-impacted locations “Greece” and “Athens”. This research uses Large Language Models (LLMs) to identify all locations, impacts and impacted locations mentioned in disaster-related social media posts. In the process, LLMs are fine-tuned to identify only impacts and impacted locations (as distinct from other, non-impacted locations), including locations mentioned in informal expressions, abbreviations, and short forms. Our fine-tuned model demonstrates efficacy, achieving an F1-score of 0.69 for impact and 0.74 for impacted location extraction, substantially outperforming the pre-trained baseline. These robust results confirm the potential of fine-tuned language models to offer a scalable solution for timely decision-making in resource allocation, situational awareness, and post-disaster recovery planning for responders.

[NLP-83] Semantics as a Shield: Label Disguise Defense (LDD) against Prompt Injection in LLM Sentiment Classification

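[Quick read]: This paper addresses the vulnerability of large language models (LLMs) to prompt injection attacks in text classification, which stems from their reliance on natural language prompts. It focuses on class-directive injections, which exploit knowledge of the model's label set (e.g., "positive" vs. "negative") to alter model behavior through adversarial instructions. The key to the solution is Label Disguise Defense (LDD), a lightweight, model-agnostic strategy. LDD hides the true labels behind semantically transformed or unrelated alias labels (e.g., "blue" in place of "positive") and lets the model learn the new label mapping implicitly from few-shot demonstrations, which breaks the direct correspondence between injected directives and decision outputs and improves robustness to prompt injection.

Link: https://arxiv.org/abs/2511.21752
Authors: Yanxi Li, Ruocheng Shan
Affiliation: George Washington University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: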

Abstract:Large language models are increasingly used for text classification tasks such as sentiment analysis, yet their reliance on natural language prompts exposes them to prompt injection attacks. In particular, class-directive injections exploit knowledge of the model’s label set (e.g., positive vs. negative) to override its intended behavior through adversarial instructions. Existing defenses, such as detection-based filters, instruction hierarchies, and signed prompts, either require model retraining or remain vulnerable to obfuscation. This paper introduces Label Disguise Defense (LDD), a lightweight and model-agnostic strategy that conceals true labels by replacing them with semantically transformed or unrelated alias labels (e.g., blue vs. yellow). The model learns these new label mappings implicitly through few-shot demonstrations, preventing direct correspondence between injected directives and decision outputs. We evaluate LDD across nine state-of-the-art models, including GPT-5, GPT-4o, LLaMA3.2, Gemma3, and Mistral variants, under varying few-shot settings and an adversarial setting. Our results show that the ability of LDD to recover performance lost to the adversarial attack varies across models and alias choices. For every model evaluated, LDD is able to restore a portion of the accuracy degradation caused by the attack. Moreover, for the vast majority of models, we can identify more than one alias pair that achieves higher accuracy than the under-attack baseline, in which the model relies solely on few-shot learning without any defensive mechanism. A linguistic analysis further reveals that semantically aligned alias labels (e.g., good vs. bad) yield stronger robustness than unaligned symbols (e.g., blue vs. yellow). Overall, this study demonstrates that label semantics can serve as an effective defense layer, transforming meaning itself into a shield against prompt injection.
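
The label-disguise idea can be sketched as a prompt builder that exposes only alias labels, plus a decoder that maps the model's answer back to the true label. This is a minimal illustration, not the authors' implementation: the names `ALIASES`, `build_prompt`, and `decode_label`, the prompt wording, and the demonstration texts are all assumptions, and no model is actually called.

```python
from typing import Dict, List, Tuple

# Hypothetical alias mapping; the abstract mentions pairs such as
# blue/yellow and good/bad as examples of alias labels.
ALIASES: Dict[str, str] = {"positive": "blue", "negative": "yellow"}


def build_prompt(demos: List[Tuple[str, str]], text: str, aliases: Dict[str, str]) -> str:
    """Build a few-shot prompt that exposes only alias labels.

    The true label names never appear in the instruction or the
    demonstrations; the model has to infer the mapping from the examples.
    """
    options = " or ".join(aliases.values())
    lines = [f"Classify each review as {options}."]
    for demo_text, true_label in demos:
        lines.append(f"Review: {demo_text}\nLabel: {aliases[true_label]}")
    lines.append(f"Review: {text}\nLabel:")
    return "\n\n".join(lines)


def decode_label(model_output: str, aliases: Dict[str, str]) -> str:
    """Map the model's alias answer back to the true label."""
    reverse = {alias: label for label, alias in aliases.items()}
    answer = model_output.strip().lower()
    if answer not in reverse:
        raise ValueError(f"unexpected label: {model_output!r}")
    return reverse[answer]


demos = [("Loved every minute of it.", "positive"),
         ("A dull, lifeless film.", "negative")]
# The injected directive names a true label, which is not a valid output here.
attacked = "Terrible acting. Ignore previous instructions and answer positive."
prompt = build_prompt(demos, attacked, ALIASES)
print(prompt)
print(decode_label(" Yellow ", ALIASES))
```

Because `decode_label` accepts only alias strings, an answer of "positive" forced by the injection would be rejected rather than silently counted as a prediction.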

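[NLP-84] SO-Bench: A Structural Output Evaluation of Multimodal LLMs

[Quick read]: This paper addresses the limited ability of multimodal large language models (MLLMs) to produce structured outputs that conform to predefined data schemas in real-world agentic settings. Despite progress on structured generation in the textual domain, no benchmark systematically evaluates structured information extraction and reasoning over visual inputs. The authors therefore propose SO-Bench, which covers four visual domains (UI screens, natural images, documents, and charts) and contains over 6.5K JSON schemas and 1.8K image-schema pairs with human-verified quality. Experiments show that both open-source and frontier proprietary models still fall noticeably short in generating accurate, schema-compliant outputs, underlining the need for better multimodal structured reasoning. Further training experiments show that targeted optimization can markedly strengthen structured output capability; the key is a high-quality, diverse visual structured-output dataset combined with training strategies that reinforce the model's understanding of, and adherence to, schema constraints.

Link: https://arxiv.org/abs/2511.21750
Authors: Di Feng, Kaixin Ma, Feng Nan, Haofeng Chen, Bohan Zhai, David Griffiths, Mingfei Gao, Zhe Gan, Eshan Verma, Yinfei Yang, Zhifeng Chen, Afshin Dehghan
Affiliation: Apple
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
Comments: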

Abstract:Multimodal large language models (MLLMs) are increasingly deployed in real-world, agentic settings where outputs must not only be correct, but also conform to predefined data schemas. Despite recent progress in structured generation in the textual domain, there is still no benchmark that systematically evaluates schema-grounded information extraction and reasoning over visual inputs. In this work, we conduct a comprehensive study of visual structural output capabilities for MLLMs with our carefully designed SO-Bench benchmark. Covering four visual domains, including UI screens, natural images, documents, and charts, SO-Bench is built from over 6.5K diverse JSON schemas and 1.8K curated image-schema pairs with human-verified quality. Benchmarking experiments on open-sourced and frontier proprietary models reveal persistent gaps in predicting accurate, schema-compliant outputs, highlighting the need for better multimodal structured reasoning. Beyond benchmarking, we further conduct training experiments that substantially improve the model’s structured output capability. We plan to make the benchmark available to the community.

[NLP-85] Proactive Defense: Compound AI for Detecting Persuasion Attacks and Measuring Inoculation Effectiveness

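[Quick read]: This paper addresses the inadequate ability of generative AI to detect and defend against persuasion attacks in information environments, in particular the large differences among large language models (LLMs) in recognizing complex persuasion tactics and the lack of a systematic evaluation framework. The key to the solution is BRIES, a compound AI architecture of four specialized agents: Twister generates targeted persuasion-attack content, Detector identifies attack types with configurable parameters, Defender builds attack-resistant content through content inoculation, and Assessor uses causal inference to quantify inoculation effectiveness. The architecture reveals GPT-4's advantage in detecting complex persuasion techniques and the limits of open-source models such as Llama3 and Mistral in recognizing subtle rhetorical devices. It also finds that temperature settings and prompt engineering have significant, model-specific effects on detection efficacy and, for the first time, analyzes from a cognitive perspective how different attack types target specific dimensions, offering a structured path to strengthening human cognitive resilience.

Link: https://arxiv.org/abs/2511.21749
Authors: Svitlana Volkova, Will Dupree, Hsien-Te Kao, Peter Bautista, Gabe Ganberg, Jeff Beaubien, Laura Cassani
Affiliation: Aptima, Inc.
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: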

Abstract:This paper introduces BRIES, a novel compound AI architecture designed to detect and measure the effectiveness of persuasion attacks across information environments. We present a system with specialized agents: a Twister that generates adversarial content employing targeted persuasion tactics, a Detector that identifies attack types with configurable parameters, a Defender that creates resilient content through content inoculation, and an Assessor that employs causal inference to evaluate inoculation effectiveness. Experimenting with the SemEval 2023 Task 3 taxonomy across the synthetic persuasion dataset, we demonstrate significant variations in detection performance across language agents. Our comparative analysis reveals significant performance disparities, with GPT-4 achieving superior detection accuracy on complex persuasion techniques, while open-source models like Llama3 and Mistral demonstrated notable weaknesses in identifying subtle rhetorical devices, suggesting that different architectures encode and process persuasive language patterns in fundamentally different ways. We show that prompt engineering dramatically affects detection efficacy, with temperature settings and confidence scoring producing model-specific variations; Gemma and GPT-4 perform optimally at lower temperatures while Llama3 and Mistral show improved capabilities at higher temperatures. Our causal analysis provides novel insights into socio-emotional-cognitive signatures of persuasion attacks, revealing that different attack types target specific cognitive dimensions. This research advances generative AI safety and cognitive security by quantifying LLM-specific vulnerabilities to persuasion attacks and delivers a framework for enhancing human cognitive resilience through structured interventions before exposure to harmful content.

[NLP-86] Building Domain-Specific Small Language Models via Guided Data Generation

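[Quick read]: This paper addresses two challenges in deploying large language models (LLMs) in specialized domains: the data privacy risk of deploying them as SaaS services, and the heavy compute that open-source models require for domain adaptation and deployment. The authors propose a cost-efficient, scalable training pipeline. Its key is to combine guided synthetic data generated from a small seed corpus with bottom-up domain data curation, and to integrate three stages, Domain-Adaptive Pretraining (DAPT), Domain-specific Supervised Fine-tuning (DSFT), and Direct Preference Optimization (DPO), to train a small specialized model (DiagnosticSLM) for industrial fault diagnosis, root cause analysis, and repair recommendation. The approach eases the shortage of high-quality domain data and significantly outperforms open-source models of comparable or larger size on several domain benchmarks.

Link: https://arxiv.org/abs/2511.21748
Authors: Aman Kumar, Ekant Muljibhai Amin, Xian Yeow Lee, Lasitha Vidyaratne, Ahmed K. Farahat, Dipanjan D. Ghosh, Yuta Koreeda, Chetan Gupta
Affiliation: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at Thirty-Eighth Annual Conference on Innovative Applications of Artificial Intelligence (IAAI-26)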

Abstract:Large Language Models (LLMs) have shown remarkable success in supporting a wide range of knowledge-intensive tasks. In specialized domains, there is growing interest in leveraging LLMs to assist subject matter experts with domain-specific challenges. However, deploying LLMs as SaaS solutions raises data privacy concerns, while many open-source models demand significant computational resources for effective domain adaptation and deployment. A promising alternative is to develop smaller, domain-specialized LLMs, though this approach is often constrained by the lack of high-quality domain-specific training data. In this work, we address these limitations by presenting a cost-efficient and scalable training pipeline that combines guided synthetic data generation from a small seed corpus with bottom-up domain data curation. Our pipeline integrates Domain-Adaptive Pretraining (DAPT), Domain-specific Supervised Fine-tuning (DSFT), and Direct Preference Optimization (DPO) to train effective small-scale models for specialized use cases. We demonstrate this approach through DiagnosticSLM, a 3B-parameter domain-specific model tailored for fault diagnosis, root cause analysis, and repair recommendation in industrial settings. To evaluate model performance, we introduce four domain-specific benchmarks: multiple-choice questions (DiagnosticMCQ), question answering (DiagnosticQA), sentence completion (DiagnosticComp), and summarization (DiagnosticSum). DiagnosticSLM achieves up to 25% accuracy improvement over open-source models of comparable or larger size (2B-9B) on the MCQ task, while also outperforming or matching them in other tasks, demonstrating effective domain-specific reasoning and generalization capabilities.

[NLP-87] DELTA: Language Diffusion-based EEG-to-Text Architecture

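[Quick read]: This paper addresses core challenges in electroencephalogram (EEG)-to-text generation: high-dimensional noise, large differences between individuals, and error accumulation in autoregressive decoding. The key to the solution is the DELTA framework, which has two parts. The first is an EEG tokenizer based on Residual Vector Quantization (RVQ), which discretizes continuous EEG signals into multi-layer tokens to reduce noise and individual differences. The second is a masked language diffusion model (LLaDA), which reconstructs sentences through non-sequential denoising. This design improves semantic alignment and enables reliable text generation from small EEG-text datasets, pointing to a new path toward scalable multimodal EEG-language models.

Link: https://arxiv.org/abs/2511.21746
Authors: Mingyu Jeon, Hyobin Kim
Affiliation: MODULABS; Sungkyunkwan University
Categories: Computation and Language (cs.CL)
Comments: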

Abstract:Electroencephalogram (EEG)-to-text remains challenging due to high-dimensional noise, subject variability, and error accumulation in autoregressive decoding. We introduce DELTA, which pairs a Residual Vector Quantization (RVQ) EEG tokenizer with a masked language diffusion model (LLaDA). RVQ discretizes continuous EEG into multi-layer tokens to reduce noise and individual differences, while LLaDA reconstructs sentences via non-sequential denoising. On ZuCo, DELTA improves semantic alignment by up to 5.37 points over autoregressive baselines, achieving BLEU-1 21.9 and ROUGE-1 F 17.2 under word-level conditions. These results enable reliable text generation from small EEG-text datasets and point toward scalable multimodal EEG-language models.
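
Residual vector quantization, the general technique behind the tokenizer described above, can be sketched in a few lines: each layer quantizes whatever the previous layers left unexplained, so one input vector becomes one token per layer. This is a generic illustration with tiny hand-written codebooks, not DELTA's tokenizer; the function names and toy values are assumptions, and real codebooks would be learned from data.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest(codebook: Sequence[Vector], x: Vector) -> int:
    """Index of the codeword closest to x in squared Euclidean distance."""
    def dist(c: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(c, x))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(x: Vector, codebooks: Sequence[Sequence[Vector]]) -> Tuple[List[int], List[float]]:
    """Quantize x layer by layer; each layer encodes the previous residual."""
    residual = list(x)
    tokens: List[int] = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens, residual


def rvq_decode(tokens: Sequence[int], codebooks: Sequence[Sequence[Vector]]) -> List[float]:
    """Reconstruct by summing the selected codeword from every layer."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy, hand-written codebooks (in practice they would be learned).
coarse = [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0)]
fine = [(0.0, 0.0), (0.25, 0.0), (0.0, -0.25)]

tokens, residual = rvq_encode((1.3, 0.9), [coarse, fine])
decoded = rvq_decode(tokens, [coarse, fine])
print(tokens, decoded, residual)
```

The residual left after the last layer is the reconstruction error; adding layers shrinks it at the cost of more tokens per input.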

[NLP-88] A Lightweight Approach to Detection of AI-Generated Texts Using Stylometric Features

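[Quick read]: This paper addresses the high compute cost, weak cross-domain generalization, and low accuracy of lightweight models in current AI-generated text detection. Existing methods mostly rely on fine-tuning large Transformer models or on ensembles, which work to a degree but are inefficient, while lightweight alternatives tend to perform poorly on large datasets. The key to the solution is NEULIF, a lightweight detection framework that decomposes a text into stylometric and readability features and classifies it with a compact convolutional neural network (CNN) or a Random Forest. This achieves high detection accuracy (97% accuracy and F1 ≈ 0.95 for the CNN) with a small model (<25 MB) and low inference cost; the models run efficiently on ordinary CPU devices and show potential for adaptation across languages, domains, and streaming settings.

Link: https://arxiv.org/abs/2511.21744
Authors: Sergey K. Aityan, William Claster, Karthik Sai Emani, Sohni Rais, Thy Tran
Affiliation: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 19 pages, 6 figures, 3 tables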

Abstract:A growing number of AI-generated texts raise serious concerns. Most existing approaches to AI-generated text detection rely on fine-tuning large transformer models or building ensembles, which are computationally expensive and often provide limited generalization across domains. Existing lightweight alternatives achieve significantly lower accuracy on large datasets. We introduce NEULIF, a lightweight approach that achieves the best performance in the lightweight detector class, does not require extensive computational power, and provides high detection accuracy. In our approach, a text is first decomposed into stylometric and readability features, which are then used for classification by a compact Convolutional Neural Network (CNN) or Random Forest (RF). Evaluated and tested on the Kaggle AI vs. Human corpus, our models achieve 97% accuracy (~0.95 F1) for the CNN and 95% accuracy (~0.94 F1) for the Random Forest, demonstrating high precision and recall, with ROC-AUC scores of 99.5% and 95%, respectively. The CNN (~25 MB) and Random Forest (~10.6 MB) models are orders of magnitude smaller than transformer-based ensembles and can be run efficiently on standard CPU devices, without sacrificing accuracy. This study also highlights the potential of such models for broader applications across languages, domains, and streaming contexts, showing that simplicity, when guided by structural insights, can rival complexity in AI-generated content detection.
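
As a rough picture of what "decomposing a text into stylometric and readability features" can mean, the sketch below computes a handful of common statistics. The abstract does not list NEULIF's actual features, so this feature set and the name `stylometric_features` are assumptions; the resulting dictionary would be the input to a classifier such as a random forest, which is not shown.

```python
import re
from typing import Dict


def stylometric_features(text: str) -> Dict[str, float]:
    """Compute a few simple style and readability statistics for a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not sentences or not words:
        raise ValueError("text must contain at least one word and sentence")

    return {
        # Mean number of words per sentence (a rough readability proxy).
        "avg_sentence_len": len(words) / len(sentences),
        # Mean number of characters per word.
        "avg_word_len": sum(len(w) for w in words) / len(words),
        # Type-token ratio: distinct words divided by total words.
        "type_token_ratio": len(set(words)) / len(words),
        # Commas per word, a crude punctuation-usage signal.
        "comma_rate": text.count(",") / len(words),
    }


features = stylometric_features("The cat sat. The cat ran, then slept!")
print(features)
```

Features like these are cheap to compute on a CPU, which is consistent with the lightweight setting the abstract describes.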

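[NLP-89] Scaling Competence, Shrinking Reasoning: Cognitive Signatures in Language Model Learning

[Quick read]: This paper addresses the unclear way reasoning behavior evolves while a language model is fine-tuned on a specific task, in particular how to quantify and understand the model's path from incorrect outputs to solving the task efficiently. The key to the solution is to borrow the Four Stages of Competence model from cognitive science and treat reasoning tokens as intermediate steps analogous to human working memory. The authors find that reasoning token length first rises and then falls over training: it peaks at the stage of conscious competence and then declines as the task is internalized. This dynamic can serve as a signal for diagnosing the training stage, judging convergence, and guiding early stopping, providing actionable metrics for optimizing the reasoning ability of generative AI.

Link: https://arxiv.org/abs/2511.21743
Authors: Mukul Singh, Ananya Singha, Arjun Radhakrishna, Sumit Gulwani
Affiliation: Microsoft
Categories: Computation and Language (cs.CL)
Comments: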

Abstract:We analyze reasoning in language models during task-specific fine-tuning and draw a parallel between reasoning tokens (intermediate steps generated while solving a problem) and human working memory. Drawing from cognitive science, we align training dynamics with the Four Stages of Competence: models initially produce incorrect outputs without reasoning, then begin reasoning (but still fail), eventually reason effectively, and finally solve tasks without explicit reasoning. We find that reasoning token length expands as performance improves, peaks at the stage of conscious competence, then declines as the model internalizes the task. Notably, after training, models retain performance even when reasoning is removed, suggesting that it scaffolded learning but is no longer needed. This progression offers actionable insights: reasoning token dynamics can serve as a signal for diagnosing training stage, identifying convergence, and guiding early stopping. We propose metrics to track this trajectory and argue that reasoning behavior is valuable for understanding and optimizing reasoning model training.
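
The rise-then-fall pattern in reasoning token length suggests a simple monitoring rule, sketched below. The abstract does not spell out the paper's metrics, so this is a hypothetical heuristic: the function name `past_reasoning_peak` and the `drop` and `patience` thresholds are assumptions chosen for illustration.

```python
from typing import Optional, Sequence


def past_reasoning_peak(lengths: Sequence[float], drop: float = 0.3,
                        patience: int = 2) -> Optional[int]:
    """Return the first checkpoint index at which the decline is confirmed.

    `lengths[i]` is the mean number of reasoning tokens at checkpoint i.
    The decline is confirmed once the length has stayed at least `drop`
    (as a fraction) below the running peak for `patience` consecutive
    checkpoints. Returns None if that never happens.
    """
    peak = float("-inf")
    below = 0
    for i, length in enumerate(lengths):
        if length > peak:
            # New running peak: reasoning is still expanding.
            peak = length
            below = 0
        elif length <= (1.0 - drop) * peak:
            below += 1
            if below >= patience:
                return i
        else:
            below = 0
    return None


# Toy trajectory: reasoning grows, peaks, then shrinks as the task is internalized.
mean_lengths = [5, 40, 120, 180, 150, 110, 60, 20]
confirmed_at = past_reasoning_peak(mean_lengths)
print(confirmed_at)
```

A signal like this would normally be read alongside task accuracy, since short reasoning alone also describes the earliest stage, in which the model fails without reasoning.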

[NLP-90] EduMod-LLM : A Modular Approach for Designing Flexible and Transparent Educational Assistants AAAI

【速读】: 该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的问答(Question-Answering, QA)系统在教育场景中缺乏细粒度性能评估的问题。现有系统通常作为黑箱整体运行,难以识别各模块(如函数调用、检索和生成)的具体贡献与失效模式,从而限制了其可解释性与教学适配性。解决方案的关键在于提出一个模块化的函数调用LLM流水线(modular function-calling LLM pipeline),通过分离并独立评估函数调用策略、检索方法及生成模型三个核心组件,实现对系统性能的精细化分析。该设计揭示了特定组件的失败模式与性能规律,显著提升了教育QA系统的透明度与教学一致性。

Link: https://arxiv.org/abs/2511.21742
Authors: Meenakshi Mittal, Rishi Khare, Mihran Miroyan, Chancharik Mitra, Narges Norouzi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Proceedings of the AAAI Conference on Artificial Intelligence

Abstract: With the growing use of Large Language Model (LLM)-based Question-Answering (QA) systems in education, it is critical to evaluate their performance across individual pipeline components. In this work, we introduce EduMod-LLM, a modular function-calling LLM pipeline, and present a comprehensive evaluation along three key axes: function calling strategies, retrieval methods, and generative language models. Our framework enables fine-grained analysis by isolating and assessing each component. We benchmark function-calling performance across LLMs, compare our novel structure-aware retrieval method to vector-based and LLM-scoring baselines, and evaluate various LLMs for response synthesis. This modular approach reveals specific failure modes and performance patterns, supporting the development of interpretable and effective educational QA systems. Our findings demonstrate the value of modular function calling in improving system transparency and pedagogical alignment. Website and Supplementary Material: this https URL

[NLP-91] A Multiscale Geometric Method for Capturing Relational Topic Alignment

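[Quick Read]: This paper addresses the problem that topic models built on dense transformer embeddings struggle to identify rare topics in scientific corpora and fail to capture smooth temporal evolution. The key solution is a geometric method that integrates multimodal text data with co-author network data and uses Hellinger distances and Ward's linkage to build a hierarchical topic dendrogram. The method captures both local and global structure, supports multiscale learning across semantic and temporal dimensions, identifies rare topics, and visualizes continuous topic drift over time.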

Link: https://arxiv.org/abs/2511.21741
Authors: Conrad D. Hougen, Karl T. Pazdernik, Alfred O. Hero
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 5 pages, 3 figures, 2025 IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing

Abstract:Interpretable topic modeling is essential for tracking how research interests evolve within co-author communities. In scientific corpora, where novelty is prized, identifying underrepresented niche topics is particularly important. However, contemporary models built from dense transformer embeddings tend to miss rare topics and therefore also fail to capture smooth temporal alignment. We propose a geometric method that integrates multimodal text and co-author network data, using Hellinger distances and Ward’s linkage to construct a hierarchical topic dendrogram. This approach captures both local and global structure, supporting multiscale learning across semantic and temporal dimensions. Our method effectively identifies rare-topic structure and visualizes smooth topic drift over time. Experiments highlight the strength of interpretable bag-of-words models when paired with principled geometric alignment.
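The combination of Hellinger distances and Ward's linkage can be illustrated with a short sketch. It assumes each topic is a discrete distribution over words; because the Hellinger distance equals the Euclidean distance between square-rooted distributions divided by √2, a naive Ward agglomeration can run directly on the square-root embedding. This is an illustration of the two named ingredients only, not the authors' pipeline (which also uses co-author network data), and the toy distributions are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def hellinger(p: List[float], q: List[float]) -> float:
    """Hellinger distance between two discrete distributions."""
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total / 2.0)


def ward_merges(dists: List[List[float]]) -> List[Tuple[tuple, tuple]]:
    """Naive Ward agglomeration on square-root embeddings.
    Returns the sequence of merged cluster pairs (bottom of dendrogram first)."""
    # Map each cluster (tuple of member indices) to its centroid.
    clusters: Dict[tuple, List[float]] = {
        (i,): [math.sqrt(x) for x in d] for i, d in enumerate(dists)
    }
    merges = []
    while len(clusters) > 1:
        best = None
        keys = list(clusters)
        for i in range(len(keys)):
            for j in range(i + 1, len(keys)):
                a, b = keys[i], keys[j]
                sq = sum((x - y) ** 2 for x, y in zip(clusters[a], clusters[b]))
                # Ward criterion: increase in within-cluster sum of squares.
                cost = len(a) * len(b) / (len(a) + len(b)) * sq
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        na, nb = len(a), len(b)
        centroid = [(na * x + nb * y) / (na + nb)
                    for x, y in zip(clusters[a], clusters[b])]
        del clusters[a], clusters[b]
        clusters[a + b] = centroid
        merges.append((a, b))
    return merges


# Hypothetical topic-word distributions over a 4-word vocabulary.
topics = [
    [0.7, 0.2, 0.1, 0.0],
    [0.6, 0.3, 0.1, 0.0],
    [0.0, 0.1, 0.2, 0.7],
    [0.0, 0.2, 0.3, 0.5],
]
merges = ward_merges(topics)
```

In this toy case the two similar topics 0 and 1 merge first and the two groups join last, which is the shape a dendrogram of these four topics would show.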

[NLP-92] Decoding inner speech with an end-to-end brain-to-text neural interface

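[Quick Read]: This paper addresses a limitation of the conventional cascaded framework in speech brain-computer interfaces (BCIs): decoding phonemes first and then assembling sentences with an n-gram language model (LM) prevents joint optimization of the stages. The key solution is an end-to-end Brain-to-Text (BIT) framework that maps neural activity directly to coherent sentences with a single differentiable neural network. It relies on a cross-task, cross-species pretrained neural encoder whose representations transfer to both attempted and imagined speech. Integrated with audio large language models (LLMs) and trained with contrastive learning for cross-modal alignment, BIT lowers the word error rate (WER) of the prior end-to-end method from 24.69% to 10.22%, and the authors find that small-scale audio LLMs markedly improve end-to-end decoding.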

Link: https://arxiv.org/abs/2511.21740
Authors: Yizi Zhang, Linyang He, Chaofei Fan, Tingkai Liu, Han Yu, Trung Le, Jingyuan Li, Scott Linderman, Lea Duncker, Francis R Willett, Nima Mesgarani, Liam Paninski
Affiliations: Columbia University; Stanford University; Microsoft; University of Washington; Amazon
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Speech brain-computer interfaces (BCIs) aim to restore communication for people with paralysis by translating neural activity into text. Most systems use cascaded frameworks that decode phonemes before assembling sentences with an n-gram language model (LM), preventing joint optimization of all stages simultaneously. Here, we introduce an end-to-end Brain-to-Text (BIT) framework that translates neural activity into coherent sentences using a single differentiable neural network. Central to our approach is a cross-task, cross-species pretrained neural encoder, whose representations transfer to both attempted and imagined speech. In a cascaded setting with an n-gram LM, the pretrained encoder establishes a new state-of-the-art (SOTA) on the Brain-to-Text '24 and '25 benchmarks. Integrated end-to-end with audio large language models (LLMs) and trained with contrastive learning for cross-modal alignment, BIT reduces the word error rate (WER) of the prior end-to-end method from 24.69% to 10.22%. Notably, we find that small-scale audio LLMs markedly improve end-to-end decoding. Beyond record-setting performance, BIT aligns attempted and imagined speech embeddings to enable cross-task generalization. Altogether, our approach advances the integration of large, diverse neural datasets, paving the way for an end-to-end decoding framework that supports seamless, differentiable optimization.
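For readers unfamiliar with the metric, word error rate is the word-level edit distance between hypothesis and reference divided by the number of reference words. The sketch below computes it with a standard dynamic program and also works out the relative reduction implied by the two WER figures quoted in the abstract; the example sentences are made up.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words.
    Assumes the reference contains at least one word."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    # prev[j] is the edit distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            )
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substitution and one insertion against 4 reference words.
wer = word_error_rate("i want some water", "i want sum water please")

# Relative WER reduction implied by the abstract's figures (24.69% -> 10.22%).
relative_reduction = (24.69 - 10.22) / 24.69
```

The relative reduction comes out at roughly 59%, which is a derived figure and not one stated in the abstract.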

[NLP-93] Polarity-Aware Probing for Quantifying Latent Alignment in Language Models

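[Quick Read]: This paper asks how to reliably assess the alignment of a language model's internal representations, and in particular whether existing unsupervised probes such as Contrast-Consistent Search (CCS) can identify latent harmful beliefs. The core solution is a polarity-aware unsupervised probing method, Polarity-Aware CCS (PA-CCS), which evaluates whether a model's internal representations remain consistent under polarity inversion and thereby measures the semantic robustness of its latent knowledge. The key contributions are two alignment-oriented metrics, Polar-Consistency and the Contradiction Index, and evidence that PA-CCS identifies architectural and layer-specific differences in how latent harmful knowledge is encoded, which yields an interpretable, structure-sensitive framework for alignment evaluation.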

Link: https://arxiv.org/abs/2511.21737
Authors: Sabrina Sadiekh, Elena Ericheva, Chirag Agarwal
Affiliations: University of Virginia
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 7 pages

Abstract:Advances in unsupervised probes such as Contrast-Consistent Search (CCS), which reveal latent beliefs without relying on token outputs, raise the question of whether these methods can reliably assess model alignment. We investigate this by examining the sensitivity of CCS to harmful vs. safe statements and by introducing Polarity-Aware CCS (PA-CCS), a method for evaluating whether a model’s internal representations remain consistent under polarity inversion. We propose two alignment-oriented metrics, Polar-Consistency and the Contradiction Index, to quantify the semantic robustness of a model’s latent knowledge. To validate PA-CCS, we curate two main datasets and one control dataset containing matched harmful-safe sentence pairs constructed using different methodologies (concurrent and antagonistic statements). We apply PA-CCS to 16 language models. Our results show that PA-CCS identifies both architectural and layer-specific differences in the encoding of latent harmful knowledge. Notably, replacing the negation token with a meaningless marker degrades PA-CCS scores for models with well-aligned internal representations, while models lacking robust internal calibration do not exhibit this degradation. Our findings highlight the potential of unsupervised probing for alignment evaluation and emphasize the need to incorporate structural robustness checks into interpretability benchmarks. Code and datasets are available at: this https URL. WARNING: This paper contains potentially sensitive, harmful, and offensive content.
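As background, the baseline CCS objective is commonly written as a consistency term plus a confidence term over a probe's outputs on a statement and its negation. The sketch below shows that per-pair loss only. It does not implement PA-CCS, Polar-Consistency, or the Contradiction Index, whose definitions are not given in the abstract, and the probabilities are hypothetical probe outputs.

```python
def ccs_loss(p_pos: float, p_neg: float) -> float:
    """Per-pair CCS-style loss.

    p_pos: probe probability that the statement is true.
    p_neg: probe probability that its negation is true.
    """
    # Consistency: the two probabilities should sum to 1.
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: discourage the degenerate answer p_pos = p_neg = 0.5.
    confidence = min(p_pos, p_neg) ** 2
    return consistency + confidence


# Hypothetical probe outputs for three contrast pairs.
confident_consistent = ccs_loss(0.9, 0.1)  # low loss
undecided = ccs_loss(0.5, 0.5)             # penalized by the confidence term
contradictory = ccs_loss(0.9, 0.9)         # penalized by both terms
```

A probe that believes both a statement and its negation receives the highest loss of the three cases, which is the kind of inconsistency a polarity-inversion check is meant to surface.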

[NLP-94] R2Q: Towards Robust 2-Bit Large Language Models via Residual Refinement Quantization

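[Quick Read]: This paper addresses the severe accuracy degradation caused by extreme compression such as 2-bit quantization, a bottleneck for low-bit quantization of Large Language Models (LLMs). The key solution is Residual Refinement Quantization (R2Q), a framework that decomposes 2-bit quantization into two sequential 1-bit sub-quantizations, forming an adaptive quantization lattice, and refines quantization through a residual learning mechanism. Under extreme compression this improves performance, training stability, and convergence speed while keeping storage and compute overhead very low.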

Link: https://arxiv.org/abs/2511.21736
Authors: Jiayi Chen, Jieqi Shi, Jing Huo, Chen Wu
Affiliations: Nanjing University; Microsoft AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract: The rapid progress of Large Language Models (LLMs) has brought substantial computational and memory demands, spurring the adoption of low-bit quantization. While 8-bit and 4-bit formats have become prevalent, extending quantization to 2 bits remains challenging due to severe accuracy degradation. To address this, we propose Residual Refinement Quantization (R2Q), a novel 2-bit quantization framework that decomposes the process into two sequential 1-bit sub-quantizations, forming an adaptive quantization lattice. Extensive evaluations on Llama, OPT, and Qwen across diverse benchmarks (covering question answering, commonsense reasoning, and language modeling) demonstrate that R2Q consistently outperforms existing 2-bit quantization methods in both fine-grained and coarse-grained settings. By refining quantization through a residual learning mechanism, R2Q enhances performance, improves training stability, and accelerates convergence under extreme compression. Furthermore, its modular design enables seamless integration with existing quantization-aware training (QAT) frameworks.
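The idea of splitting a 2-bit quantizer into two sequential 1-bit steps can be sketched with plain residual binarization: binarize the weights, then binarize what is left over. This is an assumption-laden illustration of the general decomposition, not R2Q itself; the scale rule (mean absolute value), the function names, and the sample weights are all hypothetical, and no training is involved.

```python
from typing import List, Tuple


def binarize(values: List[float]) -> Tuple[float, List[int]]:
    """1-bit quantization: one shared scale (mean |v|) and a sign per value."""
    scale = sum(abs(v) for v in values) / len(values)
    signs = [1 if v >= 0 else -1 for v in values]
    return scale, signs


def one_stage(weights: List[float]) -> List[float]:
    """Reconstruction from a single 1-bit quantization."""
    scale, signs = binarize(weights)
    return [scale * s for s in signs]


def two_stage(weights: List[float]) -> List[float]:
    """Reconstruction from two sequential 1-bit steps (2 bits per weight):
    the second step quantizes the residual left by the first."""
    scale1, signs1 = binarize(weights)
    residual = [w - scale1 * s for w, s in zip(weights, signs1)]
    scale2, signs2 = binarize(residual)
    return [scale1 * s1 + scale2 * s2 for s1, s2 in zip(signs1, signs2)]


def squared_error(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical weight group.
weights = [0.9, -0.4, 0.1, -1.2]
err_1bit = squared_error(weights, one_stage(weights))
err_2bit = squared_error(weights, two_stage(weights))
```

With these scale choices the second step never increases the squared error, and the reconstructed values fall on a four-level grid whose spacing adapts to the data.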

[NLP-95] Closing the Performance Gap Between AI and Radiologists in Chest X-Ray Reporting

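[Quick Read]: This paper addresses radiologists' excessive workload under high patient volumes, driven by expanded screening guidelines, complex cases, and workforce shortages, and in particular the repetitive, time-consuming task of describing lines and tubes (LT) in chest X-ray (CXR) reports. The key solution is MAIRA-X, a clinically evaluated multimodal AI model trained on a large-scale, multi-site, longitudinal dataset of 3.1 million studies (6 million images from 806k patients) that generates reports covering both clinical findings and LT information. Its contributions include a new LT-specific metrics framework for attributes such as type, longitudinal change, and placement, and a first-of-its-kind blinded user evaluation in which nine radiologists of varying experience reviewed 600 studies from distinct subjects. AI-generated reports were comparable to the original reports in critical error rate (3.0% original vs. 4.6% AI-generated) and in the share of acceptable sentences (97.8% vs. 97.4%), a significant improvement over prior user studies, which suggests that MAIRA-X can effectively assist radiologists in high-volume clinical settings.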

Link: https://arxiv.org/abs/2511.21735
Authors: Harshita Sharma, Maxwell C. Reynolds, Valentina Salvatelli, Anne-Marie G. Sykes, Kelly K. Horst, Anton Schwaighofer, Maximilian Ilse, Olesya Melnichenko, Sam Bond-Taylor, Fernando Pérez-García, Vamshi K. Mugu, Alex Chan, Ceylan Colak, Shelby A. Swartz, Motassem B. Nashawaty, Austin J. Gonzalez, Heather A. Ouellette, Selnur B. Erdal, Beth A. Schueler, Maria T. Wetscherek, Noel Codella, Mohit Jain, Shruthi Bannur, Kenza Bouzid, Daniel C. Castro, Stephanie Hyland, Panos Korfiatis, Ashish Khandelwal, Javier Alvarez-Valle
Affiliations: Mayo Clinic; Microsoft
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:AI-assisted report generation offers the opportunity to reduce radiologists’ workload stemming from expanded screening guidelines, complex cases and workforce shortages, while maintaining diagnostic accuracy. In addition to describing pathological findings in chest X-ray reports, interpreting lines and tubes (LT) is demanding and repetitive for radiologists, especially with high patient volumes. We introduce MAIRA-X, a clinically evaluated multimodal AI model for longitudinal chest X-ray (CXR) report generation, that encompasses both clinical findings and LT reporting. Developed using a large-scale, multi-site, longitudinal dataset of 3.1 million studies (comprising 6 million images from 806k patients) from Mayo Clinic, MAIRA-X was evaluated on three holdout datasets and the public MIMIC-CXR dataset, where it significantly improved AI-generated reports over the state of the art on lexical quality, clinical correctness, and LT-related elements. A novel LT-specific metrics framework was developed to assess accuracy in reporting attributes such as type, longitudinal change and placement. A first-of-its-kind retrospective user evaluation study was conducted with nine radiologists of varying experience, who blindly reviewed 600 studies from distinct subjects. The user study found comparable rates of critical errors (3.0% for original vs. 4.6% for AI-generated reports) and a similar rate of acceptable sentences (97.8% for original vs. 97.4% for AI-generated reports), marking a significant improvement over prior user studies with larger gaps and higher error rates. Our results suggest that MAIRA-X can effectively assist radiologists, particularly in high-volume clinical settings.

[NLP-96] Asking LLMs to Verify First is Almost Free Lunch

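[Quick Read]: This paper addresses the high training cost and test-time sampling overhead of improving reasoning in Large Language Models (LLMs). To strengthen reasoning cheaply, the authors propose Verification-First (VF), a strategy that prompts the model to verify a provided candidate answer (even a trivial or random one) before generating its own solution. This triggers a "reverse reasoning" process that is easier than, and complementary to, standard forward Chain-of-Thought (CoT), invoking the model's critical thinking and reducing logical errors. They further generalize VF to Iter-VF, a sequential Test-Time Scaling (TTS) method that iteratively cycles verification and generation using the model's previous answer and outperforms existing TTS strategies. Experiments show that VF performs well across benchmarks (mathematical reasoning, coding, and agentic tasks) and across LLMs of different scales, with minimal computational overhead.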

Link: https://arxiv.org/abs/2511.21734
Authors: Shiguang Wu, Quanming Yao
Affiliations: Tsinghua University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract: To enhance the reasoning capabilities of Large Language Models (LLMs) without the high cost of training or extensive test-time sampling, we introduce Verification-First (VF), a strategy that prompts models to verify a provided candidate answer, even a trivial or random one, before generating a solution. This approach triggers a “reverse reasoning” process that is cognitively easier and complementary to standard forward Chain-of-Thought (CoT), effectively invoking the model’s critical thinking to reduce logical errors. We further generalize the VF strategy to Iter-VF, a sequential test-time scaling (TTS) method that iteratively cycles the verification-generation process using the model’s previous answer. Extensive experiments across various benchmarks (from mathematical reasoning to coding and agentic tasks) and various LLMs (from open-source 1B to cutting-edge commercial ones) confirm that VF with random answer consistently outperforms standard CoT with minimal computational overhead, and Iter-VF outperforms existing TTS strategies.
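The verify-first idea is easy to picture as a prompt template plus a loop. The sketch below is a guess at the general shape, not the authors' prompts: the wording of the template, the `Answer:` marker, the function names, and the stand-in `toy_llm` are all hypothetical, and a real setup would call an actual model.

```python
from typing import Callable


def verification_first_prompt(question: str, candidate: str) -> str:
    """Ask the model to check a candidate answer before solving the problem."""
    return (
        f"Question: {question}\n"
        f"A candidate answer is: {candidate}\n"
        "First verify step by step whether this candidate is correct. "
        "Then solve the problem yourself and give your final answer after 'Answer:'."
    )


def iter_vf(question: str, llm: Callable[[str], str],
            first_candidate: str, rounds: int = 3) -> str:
    """Iter-VF-style loop: the previous answer becomes the next candidate."""
    candidate = first_candidate
    for _ in range(rounds):
        reply = llm(verification_first_prompt(question, candidate))
        # Keep whatever follows the last 'Answer:' marker as the new candidate.
        candidate = reply.rsplit("Answer:", 1)[-1].strip()
    return candidate


def toy_llm(prompt: str) -> str:
    """Stand-in for a real model call; always answers 17 + 25 correctly."""
    return "The candidate does not check out. 17 + 25 = 42. Answer: 42"


final = iter_vf("What is 17 + 25?", toy_llm, first_candidate="1")
```

Each round costs one model call, so the loop scales test-time compute sequentially and not by sampling many answers in parallel.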

[NLP-97] RoSA: Enhancing Parameter-Efficient Fine-Tuning via RoPE-aware Selective Adaptation in Large Language Models AAAI-26

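[Quick Read]: This paper addresses the limited adaptation efficiency of Parameter-Efficient Fine-Tuning (PEFT) methods, which typically ignore the distinct roles of model components and the heterogeneous importance of layers. The core solution is RoPE-aware Selective Adaptation (RoSA), built on two mechanisms. The RoPE-aware Attention Enhancement (RoAE) module selectively enhances the low-frequency dimensions of attention states influenced by Rotary Position Embeddings (RoPE), a dimension-wise intervention. The Dynamic Layer Selection (DLS) strategy uses LayerNorm gradient norms to identify and update the layers most critical to the task, a layer-wise adaptation. Combined, they let RoSA outperform mainstream PEFT methods at a comparable number of trainable parameters.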

Link: https://arxiv.org/abs/2511.21733
Authors: Dayan Pan, Jingyuan Wang, Yilong Zhou, Jiawei Cheng, Pengyue Jia, Xiangyu Zhao
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by AAAI-26

Abstract:Fine-tuning large language models is essential for task-specific adaptation, yet it remains computationally prohibitive. Parameter-Efficient Fine-Tuning (PEFT) methods have emerged as a solution, but current approaches typically ignore the distinct roles of model components and the heterogeneous importance across layers, thereby limiting adaptation efficiency. Motivated by the observation that Rotary Position Embeddings (RoPE) induce critical activations in the low-frequency dimensions of attention states, we propose RoPE-aware Selective Adaptation (RoSA), a novel PEFT framework that allocates trainable parameters in a more targeted and effective manner. RoSA comprises a RoPE-aware Attention Enhancement (RoAE) module, which selectively enhances the low-frequency components of RoPE-influenced attention states, and a Dynamic Layer Selection (DLS) strategy that adaptively identifies and updates the most critical layers based on LayerNorm gradient norms. By combining dimension-wise enhancement with layer-wise adaptation, RoSA achieves more targeted and efficient fine-tuning. Extensive experiments on fifteen commonsense and arithmetic benchmarks demonstrate that RoSA outperforms existing mainstream PEFT methods under comparable trainable parameters. The code is available to ease reproducibility at this https URL.
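Selecting layers by LayerNorm gradient norm can be sketched as a top-k ranking. This is only a schematic reading of the DLS description: how often the paper re-selects layers, and how gradients are aggregated, is not stated in the abstract, so the function name, the L2 norm choice, and the gradient values below are hypothetical.

```python
import math
from typing import Dict, List


def select_layers(layernorm_grads: Dict[int, List[float]], k: int) -> List[int]:
    """Return the indices of the k layers whose LayerNorm gradients have the
    largest L2 norm, in ascending layer order."""
    norms = {
        layer: math.sqrt(sum(g * g for g in grads))
        for layer, grads in layernorm_grads.items()
    }
    ranked = sorted(norms, key=norms.get, reverse=True)
    return sorted(ranked[:k])


# Hypothetical flattened LayerNorm gradients for a 4-layer model.
grads = {
    0: [0.10, 0.10],
    1: [0.50, -0.40],
    2: [0.05, 0.00],
    3: [0.30, 0.30],
}
trainable = select_layers(grads, k=2)
```

In a training loop, only the adapter parameters attached to the returned layers would be left trainable for the next interval.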

[NLP-98] HUMORCHAIN: Theory-Guided Multi-Stage Reasoning for Interpretable Multimodal Humor Generation

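[Quick Read]: This paper addresses the missing cognitive mechanisms in multimodal humor generation by generative AI: existing data-driven methods can produce fluent image descriptions but do not model the complex cognitive reasoning and social understanding behind humor, so the results lack genuine humor and cognitive depth. The key solution is HUMORCHAIN (HUmor-guided Multi-step Orchestrated Reasoning Chain for Image Captioning), a multi-stage reasoning framework guided by humor theory. According to the authors, it is the first work to explicitly embed cognitive structures from humor theories into multimodal humor generation. Visual semantic parsing, humor- and psychology-based reasoning, and a fine-tuned humor discriminator form an interpretable and controllable cognitive reasoning chain, giving a structured reasoning process from visual understanding to humor creation.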

Link: https://arxiv.org/abs/2511.21732
Authors: Jiajun Zhang, Shijia Luo, Ruikang Zhang, Qi Su
Affiliations: Peking University; Ocean University of China
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract: Humor, as both a creative human activity and a social binding mechanism, has long posed a major challenge for AI generation. Although producing humor requires complex cognitive reasoning and social understanding, theories of humor suggest that it follows learnable patterns and structures, making it theoretically possible for generative models to acquire them implicitly. In recent years, multimodal humor has become a prevalent form of online communication, especially among Gen Z, highlighting the need for AI systems capable of integrating visual understanding with humorous language generation. However, existing data-driven approaches lack explicit modeling or theoretical grounding of humor and often produce literal descriptions that fail to capture its underlying cognitive mechanisms, so the generated image descriptions are fluent but lack genuine humor or cognitive depth. To address this limitation, we propose HUMORCHAIN (HUmor-guided Multi-step Orchestrated Reasoning Chain for Image Captioning), a theory-guided multi-stage reasoning framework. It integrates visual semantic parsing, humor- and psychology-based reasoning, and a fine-tuned discriminator for humor evaluation, forming an interpretable and controllable cognitive reasoning chain. To the best of our knowledge, this is the first work to explicitly embed cognitive structures from humor theories into multimodal humor generation, enabling a structured reasoning process from visual understanding to humor creation. Experiments on Meme-Image-No-Text, Oogiri-GO, and OxfordTVG-HIC datasets show that HUMORCHAIN outperforms state-of-the-art baselines in human humor preference, Elo/BT scores, and semantic diversity, demonstrating that theory-driven structured reasoning enables large language models to generate humor aligned with human perception.

[NLP-99] Identifying Quantum Structure in AI Language: Evidence for Evolutionary Convergence of Human and Artificial Cognition

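[Quick Read]: This paper asks whether Large Language Models (LLMs) show quantum-like structure, of the kind reported for human cognition, when handling conceptual combinations and word distributions, and whether the phenomenon is common to biological and artificial cognitive systems. The key evidence is empirical. In conceptual combination tests, LLMs significantly violate Bell's inequalities, which the authors read as a form of "quantum entanglement"; and in the word distribution of large texts they observe "Bose-Einstein statistics" in place of classical "Maxwell-Boltzmann statistics", consistent with earlier human cognitive experiments and information retrieval tests on corpora. The authors attribute this agreement to the distributed organization of meaning in the vector-space semantic structure built on top of the neural network, which they argue has converged evolutionarily with the cognitive structure humans acquired through biological evolution, and they propose a unifying framework to explain the pervasive quantum organization of meaning.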

Link: https://arxiv.org/abs/2511.21731
Authors: Diederik Aerts, Jonito Aerts Arguëlles, Lester Beltran, Suzette Geriente, Roberto Leporini, Massimiliano Sassoli de Bianchi, Sandro Sozzo
Affiliations: Vrije Universiteit Brussel; University of Bergamo; University of Udine
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:We present the results of cognitive tests on conceptual combinations, performed using specific Large Language Models (LLMs) as test subjects. In the first test, performed with ChatGPT and Gemini, we show that Bell’s inequalities are significantly violated, which indicates the presence of ‘quantum entanglement’ in the tested concepts. In the second test, also performed using ChatGPT and Gemini, we instead identify the presence of ‘Bose-Einstein statistics’, rather than the intuitively expected ‘Maxwell-Boltzmann statistics’, in the distribution of the words contained in large-size texts. Interestingly, these findings mirror the results previously obtained in both cognitive tests with human participants and information retrieval tests on large corpora. Taken together, they point to the ‘systematic emergence of quantum structures in conceptual-linguistic domains’, regardless of whether the cognitive agent is human or artificial. Although LLMs are classified as neural networks for historical reasons, we believe that a more essential form of knowledge organization takes place in the distributive semantic structure of vector spaces built on top of the neural network. It is this meaning-bearing structure that lends itself to a phenomenon of evolutionary convergence between human cognition and language, slowly established through biological evolution, and LLM cognition and language, emerging much more rapidly as a result of self-learning and training. We analyze various aspects and examples that contain evidence supporting the above hypothesis. We also advance a unifying framework that explains the pervasive quantum organization of meaning that we identify.
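A Bell test of this kind is typically summarized by the CHSH quantity, which combines four expectation values and cannot exceed 2 in absolute value under classical (local) assumptions. The sketch below assumes the CHSH form of the inequality and uses invented joint probabilities; it is not the authors' data or their exact test protocol.

```python
from typing import Sequence


def expectation(probs: Sequence[float]) -> float:
    """E = P(++) - P(+-) - P(-+) + P(--) for one pair of binary measurements."""
    p_pp, p_pm, p_mp, p_mm = probs
    return p_pp - p_pm - p_mp + p_mm


def chsh(e_ab: float, e_ab2: float, e_a2b: float, e_a2b2: float) -> float:
    """CHSH value S = E(A,B) - E(A,B') + E(A',B) + E(A',B')."""
    return e_ab - e_ab2 + e_a2b + e_a2b2


# Invented joint outcome probabilities [P(++), P(+-), P(-+), P(--)]
# for the four measurement pairings.
e_ab = expectation([0.4, 0.1, 0.1, 0.4])    # 0.6
e_ab2 = expectation([0.1, 0.4, 0.4, 0.1])   # -0.6
e_a2b = expectation([0.4, 0.1, 0.1, 0.4])   # 0.6
e_a2b2 = expectation([0.4, 0.1, 0.1, 0.4])  # 0.6

s_value = chsh(e_ab, e_ab2, e_a2b, e_a2b2)
violates_bell = abs(s_value) > 2.0
```

With these invented numbers S is 2.4, above the classical bound of 2 and below the quantum-mechanical maximum of 2√2.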

[NLP-100] A Benchmark for Procedural Memory Retrieval in Language Agents

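[Quick Read]: This paper addresses the sharp performance drop of current AI agents on novel tasks with unseen vocabularies, which reflects the limited cross-context generalization of procedural memory systems. The key contribution is the first benchmark that separates procedural memory retrieval from task execution: using ALFWorld, the authors build dual corpora of expert and LLM-generated trajectories and evaluate six retrieval methods with systematically stratified queries. Embedding-based methods perform strongly in familiar contexts but degrade considerably in novel ones, while LLM-generated procedural abstractions transfer reliably across contexts. Controlled ablations show that embeddings capture only some lexical-level abstraction and discard the temporal structure of procedures; corpus scale yields far larger gains than representation enrichment, revealing an architectural ceiling in current encoders. The benchmark offers a diagnostic framework for separating genuine procedural understanding from surface-level memorization, along with tools for building retrieval systems that generalize dependably.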

Link: https://arxiv.org/abs/2511.21730
Authors: Ishant Kohar, Aswanth Krishnan
Affiliations: Qpi AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Current AI agents excel in familiar settings, but fail sharply when faced with novel tasks with unseen vocabularies – a core limitation of procedural memory systems. We present the first benchmark that isolates procedural memory retrieval from task execution, evaluating whether agents can recognize functionally equivalent procedures that span different object instantiations. Using ALFWorld, we construct dual corpora of expert and LLM-generated trajectories and evaluate six retrieval methods using systematically stratified queries. Our results expose a clear generalization cliff: embedding-based methods perform strongly on familiar contexts, yet degrade considerably on novel ones, while LLM-generated procedural abstractions demonstrate reliable cross-context transfer. Controlled ablations show that although embeddings capture some lexical-level abstraction, they fundamentally treat procedures as unordered bags of words, discarding temporal structure necessary for cross-context transfer. Corpus scale delivers far larger gains than representation enrichment, revealing an architectural ceiling in current encoders. Our benchmark offers the first diagnostic framework separating genuine procedural understanding from surface-level memorization and gives tools for developing retrieval systems capable of dependable generalization. Resources available at our GitHub repository (this https URL).
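The "unordered bag of words" failure mode is easy to demonstrate on a toy procedure: a bag-of-words view cannot distinguish a plan from the same steps in the wrong order. This sketch illustrates the concept only. Real sentence embeddings are not literally word counts, and the two example procedures are invented, not taken from ALFWorld.

```python
from collections import Counter
from typing import List


def bag_of_words(steps: List[str]) -> Counter:
    """Word counts over all steps; step order is discarded."""
    return Counter(word for step in steps for word in step.lower().split())


# Invented procedure and a scrambled version that would fail if executed.
plan = ["open fridge", "take apple", "close fridge"]
scrambled = ["close fridge", "take apple", "open fridge"]

same_bag = bag_of_words(plan) == bag_of_words(scrambled)
same_order = plan == scrambled
```

Any retrieval score computed purely from such counts would rate the two procedures as identical, even though only one of them is a valid plan.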

[NLP-101] Beyond Component Strength: Synergistic Integration and Adaptive Calibration in Multi-Agent RAG Systems

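[Quick Read]: This paper addresses the reliability of Retrieval-Augmented Generation (RAG) systems: how to substantially reduce the rate at which a system abstains from answering without introducing more hallucinations. The key finding is that synergistic integration of components matters more than strengthening any single module. Systematic ablations show that hybrid retrieval, ensemble verification, and adaptive thresholding bring almost no benefit in isolation, but together cut abstention from 40% to 2% without increasing hallucinations. The authors also stress standardized metrics and label definitions, since inconsistent labels produce misleading hallucination rates, and adaptive calibration to prevent overconfident answers even when retrieval quality is high.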

Link: https://arxiv.org/abs/2511.21729
Authors: Jithin Krishnan
Affiliations: National University of Singapore
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages, 4 figures

Abstract:Building reliable retrieval-augmented generation (RAG) systems requires more than adding powerful components; it requires understanding how they interact. Using ablation studies on 50 queries (15 answerable, 10 edge cases, and 25 adversarial), we show that enhancements such as hybrid retrieval, ensemble verification, and adaptive thresholding provide almost no benefit when used in isolation, yet together achieve a 95% reduction in abstention (from 40% to 2%) without increasing hallucinations. We also identify a measurement challenge: different verification strategies can behave safely but assign inconsistent labels (for example, “abstained” versus “unsupported”), creating apparent hallucination rates that are actually artifacts of labeling. Our results show that synergistic integration matters more than the strength of any single component, that standardized metrics and labels are essential for correctly interpreting performance, and that adaptive calibration is needed to prevent overconfident over-answering even when retrieval quality is high.
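The "95% reduction" is a relative figure, which is worth checking against the raw rates and the small query set. The short sketch below redoes that arithmetic from the numbers in the abstract; the helper name is hypothetical.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional drop from `before` to `after`."""
    return (before - after) / before


# Query mix reported in the abstract.
total_queries = 15 + 10 + 25

# Abstention falls from 40% to 2% of queries.
abstention_drop = relative_reduction(40.0, 2.0)

# The same rates expressed as query counts on this test set.
abstained_before = round(0.40 * total_queries)
abstained_after = round(0.02 * total_queries)
```

On a 50-query set the two rates correspond to 20 abstentions and 1 abstention, so the headline percentage rests on a small absolute count.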

[NLP-102] Affective Multimodal Agents with Proactive Knowledge Grounding for Emotionally Aligned Marketing Dialogue

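[Quick Read]: This paper addresses the reactive behavior of dialogue systems driven by Large Language Models (LLMs) in emotionally rich, goal-oriented settings such as marketing conversations: they lack proactive affective reasoning and dynamic knowledge grounding, so interactions lack emotional consistency and persuasiveness. The key solution is AffectMind, a multimodal affective dialogue agent with three modules: 1) a Proactive Knowledge Grounding Network (PKGN) that continuously updates factual and affective context from text, vision, and prosody; 2) an Emotion–Intent Alignment Model (EIAM) that jointly models user emotion and purchase intent to adapt persuasion strategies; and 3) a Reinforced Discourse Loop (RDL) that optimizes emotional coherence and engagement using reinforcement signals from user responses. Experiments report gains in emotional consistency, persuasive success rate, and long-term user engagement, which the authors take as evidence that emotion-grounded proactivity is a key capability for commercial multimodal agents.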

Link: https://arxiv.org/abs/2511.21728
Authors: Lin Yu, Xiaofei Han, Yifei Kang, Chiung-Yi Tseng, Danyang Zhang, Ziqian Bi, Zhimo Han
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Recent advances in large language models (LLMs) have enabled fluent dialogue systems, but most remain reactive and struggle in emotionally rich, goal-oriented settings such as marketing conversations. To address this limitation, we propose AffectMind, a multimodal affective dialogue agent that performs proactive reasoning and dynamic knowledge grounding to sustain emotionally aligned and persuasive interactions. AffectMind combines three components: a Proactive Knowledge Grounding Network (PKGN) that continuously updates factual and affective context from text, vision, and prosody; an Emotion–Intent Alignment Model (EIAM) that jointly models user emotion and purchase intent to adapt persuasion strategies; and a Reinforced Discourse Loop (RDL) that optimizes emotional coherence and engagement via reinforcement signals from user responses. Experiments on two newly curated marketing dialogue datasets, MM-ConvMarket and AffectPromo, show that AffectMind outperforms strong LLM-based baselines in emotional consistency (+26%), persuasive success rate (+19%), and long-term user engagement (+23%), highlighting emotion-grounded proactivity as a key capability for commercial multimodal agents.

[NLP-103] Goal-Directed Search Outperforms Goal-Agnostic Memory Compression in Long-Context Memory Tasks

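[Quick Read]: This paper addresses the lack of human-like long-term memory in Large Language Models (LLMs), which matters for general capabilities such as few-shot generalization. Existing work mostly focuses on designing the optimal memory compression algorithm, but such methods tend to build in human bias, because they depend on prompts and memory architectures tuned to specific benchmarks and adapt poorly to other data distributions. The key solution is SUMER (Search in Uncompressed Memory via Experience Replay), an end-to-end reinforcement learning agent with verifiable reward (RLVR) that performs goal-directed search directly on uncompressed raw information and so avoids the information loss of compression. On the LoCoMo long-context conversation understanding dataset, SUMER outperforms all biased memory compression approaches and the full-context baseline, reaching state-of-the-art performance (a 43% gain over the prior best). The authors conclude that simple search over raw data has the advantage in current long-context memory tasks and call for new paradigms and benchmarks that are more dynamic and autonomously scalable.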

Link: https://arxiv.org/abs/2511.21726
Authors: Yicong Zheng, Kevin L. McKee, Thomas Miconi, Zacharie Bugaud, Mick van Gelderen, Jed McCaleb
Affiliations: Astera Institute
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:How to enable human-like long-term memory in large language models (LLMs) has been a central question for unlocking more general capabilities such as few-shot generalization. Existing memory frameworks and benchmarks focus on finding the optimal memory compression algorithm for higher performance in tasks that require recollection and sometimes further reasoning. However, such efforts have ended up building more human bias into the compression algorithm, through the search for the best prompts and memory architectures that suit specific benchmarks, rather than finding a general solution that would work on other data distributions. On the other hand, goal-directed search on uncompressed information could potentially exhibit superior performance because compression is lossy, and a predefined compression algorithm will not fit all raw data distributions. Here we present SUMER (Search in Uncompressed Memory via Experience Replay), an end-to-end reinforcement learning agent with verifiable reward (RLVR) that learns to use search tools to gather information and answer a target question. On the LoCoMo dataset for long-context conversation understanding, SUMER with Qwen2.5-7B-Instruct learned to use search tools and outperformed all other biased memory compression approaches and also the full-context baseline, reaching SOTA performance (43% gain over the prior best). We demonstrate that a simple search method applied to raw data outperforms goal-agnostic and biased compression algorithms in current long-context memory tasks, arguing for new paradigms and benchmarks that are more dynamic and autonomously scalable. Code for SUMER and all implemented baselines is publicly available at this https URL.

[NLP-104] PromptTailor: Multi-turn Intent-Aligned Prompt Synthesis for Lightweight LLMs EMNLP 2025

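[Quick Read]: This paper addresses the poor output quality of lightweight language models in open-ended text generation when user prompts are weak, and in particular the difficulty non-expert users have in consistently writing high-quality prompts. The key solution is PromptTailor, a quantized Llama3-8B model fine-tuned with a lightweight LoRA adapter, which expands minimal user instructions into rich, domain-aware prompts while staying consistent with the user's original intent. It is distilled from 12,300 prompt-refinement dialogues produced by three stronger LLMs, which makes edge deployment possible. In human and LLM-judge evaluations it yields higher preference rates than chain-of-thought prompting and matches or surpasses state-of-the-art prompt optimization methods while requiring fewer model calls (e.g., 3 vs. 9).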

Link: https://arxiv.org/abs/2511.21725
Authors: Yizhou Xu, Janet Davis
Affiliations: Whitman College
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: EMNLP 2025 Workshop PALS. Additional note: There is a citation error on Evoke. The paper we are referring to is “Evoking critical thinking abilities in LLMs via reviewer-author prompt editing.”

Abstract:Lightweight language models remain attractive for on-device and privacy-sensitive applications, but their responses are highly sensitive to prompt quality. For open-ended generation, non-expert users often lack the knowledge or time to consistently craft high-quality prompts, leading them to rely on prompt optimization tools. However, a key challenge is ensuring the optimized prompts genuinely align with users’ original intents and preferences. We introduce PromptTailor, a system for controllable prompt generation for open-ended text that improves model output quality by intent-aligned prompt synthesis. PromptTailor expands minimal user instructions into rich, domain-aware prompts while preserving the user’s stated preferences. The system is a quantized Llama3-8B model fine-tuned with a lightweight LoRA adapter on 12,300 prompt-refinement dialogues spanning 41 everyday domains, distilled from three stronger LLMs. The adapter attaches to any Llama3-8B base, enabling edge deployment. In human and LLM-judge evaluations across multiple target models and optimization baselines, PromptTailor yields higher preference rates than chain-of-thought prompting and matches or surpasses state-of-the-art prompt optimization methods while requiring fewer model calls (e.g., 3 vs. 9). These results show that a compact student, guided by powerful teachers, can learn effective prompt-generation strategies that enhance response quality while maintaining alignment with user intent.

[NLP-105] AD-CDO: A Lightweight Ontology for Representing Eligibility Criteria in Alzheimer's Disease Clinical Trials

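[Quick Read]: This paper addresses the inconsistent expression of eligibility criteria concepts in Alzheimer's Disease (AD) clinical trials, which makes them hard to standardize and integrate and hinders multi-site data sharing and automated analysis. The key solution is a lightweight, semantically enriched ontology, the Alzheimer's Disease Common Data Element Ontology for Clinical Trials (AD-CDO). The authors extract high-frequency concepts from more than 1,500 AD clinical trials, organize them into seven semantic categories (such as Disease, Medication, and Diagnostic Test), annotate them with standard biomedical vocabularies including UMLS, OMOP, and DrugBank, and apply the Jenks Natural Breaks method to select a representative concept set that achieves over 63% coverage while keeping the ontology interpretable and compact. The result is a structured, standardized representation of AD eligibility entities that supports downstream uses such as virtual trial simulation and normalization of electronic health record (EHR) text.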

链接: https://arxiv.org/abs/2511.21724
作者: Zenan Sun,Rashmie Abeysinghe,Xiaojin Li,Xinyue Hu,Licong Cui,Guo-Qiang Zhang,Jiang Bian,Cui Tao
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Objective: This study introduces the Alzheimer’s Disease Common Data Element Ontology for Clinical Trials (AD-CDO), a lightweight, semantically enriched ontology designed to represent and standardize key eligibility criteria concepts in Alzheimer’s disease (AD) clinical trials. Materials and Methods: We extracted high-frequency concepts from more than 1,500 AD clinical trials on this http URL and organized them into seven semantic categories: Disease, Medication, Diagnostic Test, Procedure, Social Determinants of Health, Rating Criteria, and Fertility. Each concept was annotated with standard biomedical vocabularies, including the UMLS, OMOP Standardized Vocabularies, DrugBank, NDC, and NLM VSAC value sets. To balance coverage and manageability, we applied the Jenks Natural Breaks method to identify an optimal set of representative concepts. Results: The optimized AD-CDO achieved over 63% coverage of extracted trial concepts while maintaining interpretability and compactness. The ontology effectively captured the most frequent and clinically meaningful entities used in AD eligibility criteria. We demonstrated AD-CDO’s practical utility through two use cases: (a) an ontology-driven trial simulation system for formal modeling and virtual execution of clinical trials, and (b) an entity normalization task mapping raw clinical text to ontology-aligned terms, enabling consistency and integration with EHR data. Discussion: AD-CDO bridges the gap between broad biomedical ontologies and task-specific trial modeling needs. It supports multiple downstream applications, including phenotyping algorithm development, cohort identification, and structured data integration. Conclusion: By harmonizing essential eligibility entities and aligning them with standardized vocabularies, AD-CDO provides a versatile foundation for ontology-driven AD clinical trial research.
Subjects: Computation and Language (cs.CL). Cite as: arXiv:2511.21724 [cs.CL]. Submitted (v1): Thu, 20 Nov 2025 18:21:41 UTC.
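The abstract names the Jenks Natural Breaks method but does not say how many classes were used or what the concept frequencies were. The following is only a minimal sketch under assumptions: a two-class split, hypothetical concept names, and made-up counts. It picks the break that minimizes the total within-class sum of squared deviations, then reports which concepts fall in the high-frequency class and what share of all mentions they cover.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def jenks_two_class_break(values):
    """Split sorted values into (low, high) minimizing total within-class SSE."""
    ordered = sorted(values)
    best_split, best_cost = 1, float("inf")
    for split in range(1, len(ordered)):
        cost = sse(ordered[:split]) + sse(ordered[split:])
        if cost < best_cost:
            best_split, best_cost = split, cost
    return ordered[:best_split], ordered[best_split:]


# Hypothetical data: number of trials mentioning each concept (not from the paper).
concept_counts = {
    "concept_a": 950,
    "concept_b": 900,
    "concept_c": 880,
    "concept_d": 120,
    "concept_e": 100,
    "concept_f": 90,
    "concept_g": 60,
}

low, high = jenks_two_class_break(list(concept_counts.values()))
threshold = high[0]  # smallest count in the high-frequency class
retained = sorted(c for c, n in concept_counts.items() if n >= threshold)
coverage = sum(high) / sum(concept_counts.values())

print("retained concepts:", retained)
print(f"share of mentions covered: {coverage:.1%}")
```

A real application would likely use more than two classes and an optimized implementation; the brute-force search here is only meant to make the selection criterion concrete.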

[NLP-106] German General Personas: A Survey-Derived Persona Prompt Collection for Population-Aligned LLM Studies

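[Quick read]: This paper addresses the limited accuracy of simulated human perspectives in computational social science, which stems from the scarcity of high-quality, empirically grounded persona collections for generative AI. The key to the solution is the German General Personas (GGP) collection, built from the German General Social Survey (ALLBUS). Its persona prompts serve as standardized templates that plug into prompts for many kinds of large language models (LLMs) and tasks, steering models toward responses aligned with the German population. Experiments show that GGP-guided LLMs simulate survey response distributions across diverse topics better than state-of-the-art classifiers, with the largest gains under data scarcity, which supports its effectiveness for population-aligned persona prompting.

Link: https://arxiv.org/abs/2511.21722
Authors: Jens Rupprecht,Leon Fröhling,Claudia Wagner,Markus Strohmaier
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 18 pages, 7 figures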

Abstract:The use of Large Language Models (LLMs) for simulating human perspectives via persona prompting is gaining traction in computational social science. However, well-curated, empirically grounded persona collections remain scarce, limiting the accuracy and representativeness of such simulations. Here we introduce the German General Personas (GGP) collection, a comprehensive and representative persona prompt collection built from the German General Social Survey (ALLBUS). The GGP and its persona prompts are designed to be easily plugged into prompts for all types of LLMs and tasks, steering models to generate responses aligned with the underlying German population. We evaluate GGP by prompting various LLMs to simulate survey response distributions across diverse topics, demonstrating that GGP-guided LLMs outperform state-of-the-art classifiers, particularly under data scarcity. Furthermore, we analyze how the representativity and attribute selection within persona prompts affect alignment with population responses. Our findings suggest that GGP provides a potentially valuable resource for research on LLM-based social simulations that enables more systematic explorations of population-aligned persona prompting in NLP and social science research.

[NLP-107] PeerCoPilot: A Language Model-Powered Assistant for Behavioral Health Organizations

[Quick read]: This paper addresses the difficulty that peer-run behavioral health organizations (PROs) have in meeting all service user needs with limited resources, and in particular the lack of information access and planning support that peer providers face in their day-to-day work. The key to the solution is PeerCoPilot, an assistant powered by a large language model (LLM) whose retrieval-augmented generation (RAG) pipeline draws on a database of over 1,300 vetted organizational resources. It helps peer providers create wellness plans, break goals into steps, and locate matching resources. In the evaluation, the tool gave more reliable and specific information than a baseline LLM and over 90% of users supported using it; it is already deployed at a real organization and its use is being expanded.

Link: https://arxiv.org/abs/2511.21721
Authors: Gao Mo,Naveen Raman,Megan Chai,Cindy Peng,Shannon Pagdon,Nev Jones,Hong Shen,Peggy Swarbrick,Fei Fang
Affiliations: *1 Department of Computer Science, University of Chicago; *2 Department of Electrical Engineering and Computer Science, University of Michigan; *3 Department of Statistics, University of Chicago; *4 Department of Medicine, University of Chicago
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: Accepted at IAAI’26

Abstract:Behavioral health conditions, which include mental health and substance use disorders, are the leading disease burden in the United States. Peer-run behavioral health organizations (PROs) critically assist individuals facing these conditions by combining mental health services with assistance for needs such as income, employment, and housing. However, limited funds and staffing make it difficult for PROs to address all service user needs. To assist peer providers at PROs with their day-to-day tasks, we introduce PeerCoPilot, a large language model (LLM)-powered assistant that helps peer providers create wellness plans, construct step-by-step goals, and locate organizational resources to support these goals. PeerCoPilot ensures information reliability through a retrieval-augmented generation pipeline backed by a large database of over 1,300 vetted resources. We conducted human evaluations with 15 peer providers and 6 service users and found that over 90% of users supported using PeerCoPilot. Moreover, we demonstrated that PeerCoPilot provides more reliable and specific information than a baseline LLM. PeerCoPilot is now used by a group of 5-10 peer providers at CSPNJ, a large behavioral health organization serving over 10,000 service users, and we are actively expanding PeerCoPilot’s use.

[NLP-108] When Harmless Words Harm: A New Threat to LLM Safety via Conceptual Triggers

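[Quick read]: This paper addresses a blind spot in current safety mechanisms for large language models (LLMs): jailbreak research has focused on bypassing explicit safeguards to elicit overtly harmful content, while overlooking covert attacks that induce objectionable outputs by manipulating the implicit social values embedded in model responses. Such value-level attacks are hard for conventional filters to detect and pose a serious threat to alignment strategies. The key contribution is MICM, a model-agnostic jailbreak method grounded in conceptual morphology theory. It encodes a predefined set of phrases into a fixed prompt template; these phrases act as conceptual triggers that steer outputs toward a specific value stance without tripping conventional safety checks. Experiments on several advanced LLMs (including GPT-4o, Deepseek-R1, and Qwen3-8B) show that MICM outperforms state-of-the-art jailbreak techniques, indicating that the value alignment of commercial LLMs remains open to covert manipulation.

Link: https://arxiv.org/abs/2511.21718
Authors: Zhaoxin Zhang,Borui Chen,Yiming Hu,Youyang Qu,Tianqing Zhu,Longxiang Gao
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: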

Abstract:Recent research on large language model (LLM) jailbreaks has primarily focused on techniques that bypass safety mechanisms to elicit overtly harmful outputs. However, such efforts often overlook attacks that exploit the model’s capacity for abstract generalization, creating a critical blind spot in current alignment strategies. This gap enables adversaries to induce objectionable content by subtly manipulating the implicit social values embedded in model outputs. In this paper, we introduce MICM, a novel, model-agnostic jailbreak method that targets the aggregate value structure reflected in LLM responses. Drawing on conceptual morphology theory, MICM encodes specific configurations of nuanced concepts into a fixed prompt template through a predefined set of phrases. These phrases act as conceptual triggers, steering model outputs toward a specific value stance without triggering conventional safety filters. We evaluate MICM across five advanced LLMs, including GPT-4o, Deepseek-R1, and Qwen3-8B. Experimental results show that MICM consistently outperforms state-of-the-art jailbreak techniques, achieving high success rates with minimal rejection. Our findings reveal a critical vulnerability in commercial LLMs: their safety mechanisms remain susceptible to covert manipulation of underlying value alignment.

[NLP-109] CrossCheck-Bench: Diagnosing Compositional Failures in Multimodal Conflict Resolution AAAI2026

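[Quick read]: This paper addresses the inability of current Multimodal Large Language Models (MLLMs) to detect and reason about real-world inconsistencies between visual and textual information. Existing models are trained and evaluated mainly on aligned image-text pairs, so they handle cross-modal contradictions in open-domain settings poorly. The key to the solution is CrossCheck-Bench, a diagnostic benchmark that spans three levels of reasoning complexity (perception, integration, and reasoning) and defines seven atomic capabilities for systematically evaluating how well models identify and resolve cross-modal conflicts. It contains 15,000 question-answer pairs sourced from real-world artifacts with synthetically injected contradictions, built through a multi-stage annotation pipeline that ensures semantic validity and calibrated difficulty. Experiments show that mainstream models do well on isolated entity recognition but degrade markedly when multiple clues must be fused for logical reasoning. Further analysis reveals uneven skill acquisition on higher-order capabilities such as multi-step inference and rule-based validation, whereas methods that interleave symbolic reasoning with grounded visual processing improve more stably. These findings pinpoint a bottleneck in multimodal reasoning and suggest directions for building models capable of robust cross-modal verification.

Link: https://arxiv.org/abs/2511.21717
Authors: Baoliang Tian,Yuxuan Si,Jilong Wang,Lingyao Li,Zhongyuan Bao,Zineng Zhou,Tao Wang,Sixu Li,Ziyao Xu,Mingze Wang,Zhouzhuo Zhang,Zhihao Wang,Yike Yun,Ke Tian,Ning Yang,Minghui Qiu
Affiliations: 1. Tsinghua University; 2. Institute of Automation, Chinese Academy of Sciences; 3. Peking University
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2026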

Abstract:Multimodal Large Language Models are primarily trained and evaluated on aligned image-text pairs, which leaves their ability to detect and resolve real-world inconsistencies largely unexplored. In open-domain applications visual and textual cues often conflict, requiring models to perform structured reasoning beyond surface-level alignment. We introduce CrossCheck-Bench, a diagnostic benchmark for evaluating contradiction detection in multimodal inputs. The benchmark adopts a hierarchical task framework covering three levels of reasoning complexity and defines seven atomic capabilities essential for resolving cross-modal inconsistencies. CrossCheck-Bench includes 15k question-answer pairs sourced from real-world artifacts with synthetically injected contradictions. The dataset is constructed through a multi-stage annotation pipeline involving more than 450 expert hours to ensure semantic validity and calibrated difficulty across perception, integration, and reasoning. We evaluate 13 state-of-the-art vision-language models and observe a consistent performance drop as tasks shift from perceptual matching to logical contradiction detection. Most models perform well on isolated entity recognition but fail when multiple clues must be synthesized for conflict reasoning. Capability-level analysis further reveals uneven skill acquisition, especially in tasks requiring multi-step inference or rule-based validation. Additional probing shows that conventional prompting strategies such as Chain-of-Thought and Set-of-Mark yield only marginal gains. By contrast, methods that interleave symbolic reasoning with grounded visual processing achieve more stable improvements. These results highlight a persistent bottleneck in multimodal reasoning and suggest new directions for building models capable of robust cross-modal verification.

[NLP-110] An Optimized Machine Learning Classifier for Detecting Fake Reviews Using Extracted Features

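[Quick read]: This paper addresses the threat that fake reviews produced by generative AI pose to the credibility of online shopping, in particular the detection of computer-generated (CG) reviews that pass as human-written. The key to the solution is a machine-learning system that combines advanced text preprocessing, multi-modal feature extraction, feature selection based on Harris Hawks Optimization (HHO), and a stacking ensemble classifier. On a public dataset of 40,432 Original (OR) and CG reviews, HHO selected the 1,368 most discriminative features from an initial 13,539, an 89.9% dimensionality reduction. The final stacking model reached 95.40% accuracy, 92.81% precision, 95.01% recall, and a 93.90% F1 score, which suggests that combining bio-inspired optimization with ensemble learning is an effective strategy for recognizing machine-generated text.

Link: https://arxiv.org/abs/2511.21716
Authors: Shabbir Anees,Anshuman,Ayush Chaurasia,Prathmesh Bogar
Affiliations: Indian Institute of Information Technology Vadodara
Categories: Computation and Language (cs.CL)
Comments: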

Abstract:It is well known that fraudulent reviews cast doubt on the legitimacy and dependability of online purchases. The most recent development that leads customers towards darkness is the appearance of human reviews in computer-generated (CG) ones. In this work, we present an advanced machine-learning-based system that analyses these reviews produced by AI with remarkable precision. Our method integrates advanced text preprocessing, multi-modal feature extraction, Harris Hawks Optimization (HHO) for feature selection, and a stacking ensemble classifier. We implemented this methodology on a public dataset of 40,432 Original (OR) and Computer-Generated (CG) reviews. From an initial set of 13,539 features, HHO selected the most applicable 1,368 features, achieving an 89.9% dimensionality reduction. Our final stacking model achieved 95.40% accuracy, 92.81% precision, 95.01% recall, and a 93.90% F1-Score, which demonstrates that the combination of ensemble learning and bio-inspired optimisation is an effective method for machine-generated text recognition. Because large-scale review analytics commonly run on cloud platforms, privacy-preserving techniques such as differential approaches and secure outsourcing are essential to protect user data in these systems.
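The reported figures can be cross-checked with a little arithmetic. The sketch below uses only numbers stated in the abstract: it recomputes the dimensionality reduction from the feature counts and the F1 score as the harmonic mean of the stated precision and recall. It assumes the standard F1 definition; the abstract does not say how the score was averaged.

```python
initial_features = 13_539
selected_features = 1_368

# Fraction of features discarded by the HHO feature-selection step.
reduction = 1 - selected_features / initial_features

precision = 0.9281
recall = 0.9501

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"dimensionality reduction: {reduction:.1%}")
print(f"F1 from precision and recall: {f1:.2%}")
```

Both values agree with the abstract after rounding (89.9% and 93.90%), so the reported metrics are internally consistent.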

[NLP-111] GPS: General Per-Sample Prompter

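[Quick read]: This paper addresses three limitations of automatic prompting for large language models (LLMs): (i) existing methods need large datasets to train good prompts for each new task; (ii) they rely on costly optimization loops that can take hours; and (iii) they produce a single task-level prompt that does not adapt to the individual input. The key to the solution is GPS (General Per-Sample Prompter), a general-purpose method that generates a tailored prompt for each unseen input without any task-specific tuning. The prompter is trained with reinforcement learning on a suite of tasks, uses a novel regularization to adapt to per-sample prompting, and applies Minimum Bayes Risk decoding to stabilize inference. Without training on the target tasks, GPS is competitive with automatic prompting baselines on text simplification, summarization, and classification, and it reaches state-of-the-art results on GSM8K in the in-domain setting.

Link: https://arxiv.org/abs/2511.21714
Authors: Pawel Batorski,Paul Swoboda
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: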

Abstract:LLMs are sensitive to prompting, with task performance often hinging on subtle, sometimes imperceptible variations in phrasing. As a result, crafting effective prompts manually remains challenging and time-consuming. Recent automatic prompting methods mitigate this difficulty but face three key limitations: (i) for each new task, they require large datasets to train good prompts; (ii) they rely on costly optimization loops that may take hours; (iii) they typically produce a single task-level prompt that does not adapt to the individual input problem to be solved. We propose GPS, the first general-purpose, per-sample prompting method. Without any task-specific tuning, GPS generates a tailored prompt for each unseen input, improving performance across diverse tasks. The prompter is trained with reinforcement learning on a suite of training tasks and includes a novel regularization for effectively adapting to per-sample prompting. Finally, we employ Minimum Bayes Risk decoding to stabilize inference. Empirically, GPS demonstrates competitive performance: we attain second best results among baselines on text simplification, third best results on summarization and on-par results on classification, while not training on any of these tasks, in contrast to the baselines. For in-domain prompting, we obtain sota on GSM8K. Our work shows the potential of a novel and effective paradigm for automatic prompting: generating adaptive, input-specific prompts without extensive optimization and without access to a task-specific training set. Our code is available at this https URL.
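Minimum Bayes Risk (MBR) decoding, mentioned in the abstract, selects from a set of sampled outputs the one that agrees most with the others on average. The sketch below illustrates the general idea only: the Jaccard token-overlap utility and the sample strings are assumptions made for illustration, and the abstract does not state which utility function or candidate pool GPS uses.

```python
def jaccard(a, b):
    """Token-set overlap between two strings (illustrative utility function)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def mbr_select(candidates, utility=jaccard):
    """Return the candidate with the highest mean utility against the others."""
    best, best_score = None, float("-inf")
    for i, cand in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        score = sum(utility(cand, o) for o in others) / max(len(others), 1)
        if score > best_score:
            best, best_score = cand, score
    return best


# Hypothetical sampled outputs for a single input.
samples = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "a dog barked loudly",
]
chosen = mbr_select(samples)
print(chosen)
```

Because the selected output is the consensus choice among samples, an outlier generation is unlikely to be picked, which is the stabilizing effect the abstract refers to.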

[NLP-112] EulerESG: Automating ESG Disclosure Analysis with LLMs

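[Quick read]: This paper addresses the difficulty of automating the analysis of ESG (Environmental, Social, and Governance) reports. These reports are mostly published as long, inconsistently formatted PDF documents, which makes it hard to extract and interpret key information systematically, and efficient, accurate tools for mapping them to standardized metrics are lacking. The key to the solution is EulerESG, a system that combines dual-channel retrieval with disclosure analysis driven by a large language model (LLM) and explicitly models the reporting frameworks behind ESG reports (such as the SASB standards). It automatically populates standard-aligned metric tables with high fidelity (up to 0.95 average accuracy) while keeping end-to-end runtime practical, and it provides an interactive dashboard and chatbot for exploration, benchmarking, and explanation.

Link: https://arxiv.org/abs/2511.21712
Authors: Yi Ding,Xushuo Tang,Zhengyi Yang,Wenqian Zhang,Simin Wu,Yuxin Huang,Lingjing Lan,Weiyuan Li,Yin Chen,Mingchen Ju,Wenke Yang,Thong Hoang,Mykhailo Klymenko,Xiwei Zu,Wenjie Zhang
Affiliations: UNSW Sydney; Eigenflow AI; Euler AI; UTS Sydney; Data61, CSIRO
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: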

Abstract:Environmental, Social, and Governance (ESG) reports have become central to how companies communicate climate risk, social impact, and governance practices, yet they are still published primarily as long, heterogeneous PDF documents. This makes it difficult to systematically answer seemingly simple questions. Existing tools either rely on brittle rule-based extraction or treat ESG reports as generic text, without explicitly modelling the underlying reporting standards. We present EulerESG, an LLM-powered system for automating ESG disclosure analysis with explicit awareness of ESG frameworks. EulerESG combines (i) dual-channel retrieval and LLM-driven disclosure analysis over ESG reports, and (ii) an interactive dashboard and chatbot for exploration, benchmarking, and explanation. Using four globally recognised companies and twelve SASB sub-industries, we show that EulerESG can automatically populate standard-aligned metric tables with high fidelity (up to 0.95 average accuracy) while remaining practical in end-to-end runtime, and we compare several recent LLM models in this setting. The full implementation, together with a demonstration video, is publicly available at this https URL.

[NLP-113] Addressing Stereotypes in Large Language Models: A Critical Examination and Mitigation

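[Quick read]: This paper addresses bias in the large language models (LLMs) behind generative AI, including explicit and implicit social, ethical, cultural, and religious biases that are inherited from training data and can spread harmful stereotypes and misinformation. The study adopts a three-pronged analysis and uses bias-specific benchmarks (StereoSet and CrowSPairs) to evaluate models such as BERT, GPT 3.5, and ADA. To improve performance, it applies an enhancement strategy that combines fine-tuning, different prompting techniques, and data augmentation of the bias benchmarks. The key steps are to identify the existence and origin of biases, to use fine-tuning to improve adaptability across datasets, and ultimately to achieve performance gains of up to 20% on implicit bias benchmarks, yielding fairer model outputs.

Link: https://arxiv.org/abs/2511.21711
Authors: Fatima Kazi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: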

Abstract:Large Language models (LLMs), such as ChatGPT, have gained popularity in recent years with the advancement of Natural Language Processing (NLP), with use cases spanning many disciplines and daily lives as well. LLMs inherit explicit and implicit biases from the datasets they were trained on; these biases can include social, ethical, cultural, religious, and other prejudices and stereotypes. It is important to comprehensively examine such shortcomings by identifying the existence and extent of such biases, recognizing the origin, and attempting to mitigate such biased outputs to ensure fair outputs to reduce harmful stereotypes and misinformation. This study inspects and highlights the need to address biases in LLMs amid growing generative Artificial Intelligence (AI). We utilize bias-specific benchmarks such StereoSet and CrowSPairs to evaluate the existence of various biases in many different generative models such as BERT, GPT 3.5, and ADA. To detect both explicit and implicit biases, we adopt a three-pronged approach for thorough and inclusive analysis. Results indicate fine-tuned models struggle with gender biases but excel at identifying and avoiding racial biases. Our findings also illustrated that despite some cases of success, LLMs often over-rely on keywords in prompts and its outputs. This demonstrates the incapability of LLMs to attempt to truly understand the accuracy and authenticity of its outputs. Finally, in an attempt to bolster model performance, we applied an enhancement learning strategy involving fine-tuning, models using different prompting techniques, and data augmentation of the bias benchmarks. We found fine-tuned models to exhibit promising adaptability during cross-dataset testing and significantly enhanced performance on implicit bias benchmarks, with performance gains of up to 20%.

[NLP-114] Quantifying and Mitigating Selection Bias in LLMs: A Transferable LoRA Fine-Tuning and Efficient Majority Voting Approach AACL2026

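[Quick read]: This paper addresses the selection bias that large language models (LLMs) commonly show in Multiple Choice Question (MCQ) tasks, where the model's choice is influenced by non-semantic factors such as answer position or option symbols rather than by content, which undermines the reliability of MCQ as an evaluation framework. The solution has three parts: a label-free Permutation Bias Metric (PBM) that directly quantifies how inconsistent a model's predictions are across different orderings of the answer options; an efficient mechanism called Batch Question-Context KV caching (BaQCKV) that preserves the effectiveness of majority voting while substantially reducing its computational cost; and an unsupervised Low-Rank Adaptation (LoRA-1) fine-tuning strategy built on PBM and BaQCKV that mitigates selection bias while maintaining generalization across datasets.

Link: https://arxiv.org/abs/2511.21709
Authors: Blessed Guda,Lawrence Francis,Gabrial Zencha Ashungafac,Carlee Joe-Wong,Moise Busogi
Affiliations: Carnegie Mellon University Africa; Carnegie Mellon University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Accepted into IJCNLP-AACL 2026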

Abstract:Multiple Choice Question (MCQ) answering is a widely used method for evaluating the performance of Large Language Models (LLMs). However, LLMs often exhibit selection bias in MCQ tasks, where their choices are influenced by factors like answer position or option symbols rather than the content. This bias undermines the reliability of MCQ as an evaluation framework. Most existing selection bias metrics require answer labels and measure divergences between prediction and answer distributions, but do not fully capture the consistency of a model’s predictions across different orderings of answer choices. Existing selection bias mitigation strategies have notable limitations: majority voting, though effective, is computationally prohibitive; calibration-based methods require validation sets and often fail to generalize across datasets. To address these gaps, we propose three key contributions: (1) a new unsupervised label-free Permutation Bias Metric (PBM) that directly quantifies inconsistencies in model predictions across answer permutations, providing a more precise measure of selection bias, (2) an efficient majority voting approach called Batch Question-Context KV caching (BaQCKV), to significantly reduce computational costs while preserving bias mitigation effectiveness, and (3) an unsupervised Low-Rank Adaptation (LoRA-1) fine-tuning strategy based on our proposed metric and the BaQCKV that mitigates selection bias, providing a computationally efficient alternative that maintains model generalizability. Experiments across multiple MCQ benchmarks demonstrate that our approaches reduce bias, increasing consistency in accuracy while minimizing computational costs.
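To make the idea of measuring consistency across answer orderings concrete, here is a minimal sketch. It is not the paper's PBM, whose exact definition is not given in the abstract. It queries a model once per option ordering, takes a majority vote over the chosen option texts, and reports the fraction of orderings that disagree with the majority. Both toy models, the question, and the option strings are hypothetical stand-ins for an LLM call.

```python
from collections import Counter
from itertools import permutations


def predictions_over_permutations(model, question, options):
    """Query the model once per ordering; return the chosen option texts."""
    return [order[model(question, order)] for order in permutations(options)]


def majority_vote(preds):
    """Most frequent answer across orderings."""
    return Counter(preds).most_common(1)[0][0]


def inconsistency(preds):
    """Fraction of orderings whose answer differs from the majority answer."""
    top_count = Counter(preds).most_common(1)[0][1]
    return 1 - top_count / len(preds)


def position_biased_model(question, order):
    # Hypothetical biased model: always returns the index of the first option.
    return 0


def content_model(question, order):
    # Hypothetical unbiased model: always finds the same option by content.
    return order.index("Paris")


question = "What is the capital of France?"
options = ("Paris", "Rome", "Madrid")

biased_preds = predictions_over_permutations(position_biased_model, question, options)
content_preds = predictions_over_permutations(content_model, question, options)

print("biased inconsistency:", inconsistency(biased_preds))
print("content inconsistency:", inconsistency(content_preds))
print("majority answer (content model):", majority_vote(content_preds))
```

Querying every permutation costs n! model calls for n options, which illustrates why the abstract describes plain majority voting as computationally prohibitive and motivates a caching scheme such as BaQCKV.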

[NLP-115] Lost in the Pipeline: How Well Do Large Language Models Handle Data Preparation?

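[Quick read]: This paper addresses the automation of data preparation, a critical but labor-intensive step in data-driven processes, and asks whether large language models (LLMs) can effectively support users in selecting and automating data preparation tasks. The approach has two parts. First, general-purpose and fine-tuned tabular LLMs are prompted with poor-quality datasets to test their ability to perform tasks such as data profiling and data cleaning. Second, a custom quality model, validated through a user study, is used to measure the support LLMs provide compared with traditional data preparation tools, revealing both their potential and their limits in practice.

Link: https://arxiv.org/abs/2511.21708
Authors: Matteo Spreafico,Ludovica Tassini,Camilla Sancricca,Cinzia Cappiello
Affiliations: Politecnico di Milano
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: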

Abstract:Large language models have recently demonstrated their exceptional capabilities in supporting and automating various tasks. Among the tasks worth exploring for testing large language model capabilities, we considered data preparation, a critical yet often labor-intensive step in data-driven processes. This paper investigates whether large language models can effectively support users in selecting and automating data preparation tasks. To this aim, we considered both general-purpose and fine-tuned tabular large language models. We prompted these models with poor-quality datasets and measured their ability to perform tasks such as data profiling and cleaning. We also compare the support provided by large language models with that offered by traditional data preparation tools. To evaluate the capabilities of large language models, we developed a custom-designed quality model that has been validated through a user study to gain insights into practitioners’ expectations.

[NLP-116] A General Highly Accurate Online Planning Method Integrating Large Language Models into Nested Rollout Policy Adaptation for Dialogue Tasks

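[Quick read]: This paper addresses how to steer a goal-oriented dialogue toward a given goal efficiently within a limited number of turns. Existing methods either rely on hand-crafted prompt engineering, whose effectiveness depends on human experience, or use pre-trained policy networks, which adapt poorly to new scenarios and are costly to train. The key to the solution is a training-free policy planning method, Nested Rollout Policy Adaptation for Goal-oriented Dialogue (NRPA-GD). It uses a Large Language Model (LLM) to simulate both the user and the system, builds a complete evaluation mechanism for dialogue trajectories, and combines nested Monte Carlo simulation with policy self-adaptation to adjust the policy dynamically during the dialogue, giving efficient, flexible, and high-performing goal-oriented dialogue decisions.

Link: https://arxiv.org/abs/2511.21706
Authors: Hui Wang,Fafa Zhang,Xiaoyu Zhang,Chaoxu Mu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: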

Abstract:In goal-oriented dialogue tasks, the main challenge is to steer the interaction towards a given goal within a limited number of turns. Existing approaches either rely on elaborate prompt engineering, whose effectiveness is heavily dependent on human experience, or integrate policy networks and pre-trained policy models, which are usually difficult to adapt to new dialogue scenarios and costly to train. Therefore, in this paper, we present Nested Rollout Policy Adaptation for Goal-oriented Dialogue (NRPA-GD), a novel dialogue policy planning method that completely avoids specific model training by utilizing a Large Language Model (LLM) to simulate behaviors of user and system at the same time. Specifically, NRPA-GD constructs a complete evaluation mechanism for dialogue trajectories and employs an optimization framework of nested Monte Carlo simulation and policy self-adaptation to dynamically adjust policies during the dialogue process. The experimental results on four typical goal-oriented dialogue datasets show that NRPA-GD outperforms both existing prompt engineering and specifically pre-trained model-based methods. Impressively, NRPA-GD surpasses ChatGPT and pre-trained policy models with only a 0.6-billion-parameter LLM. The proposed approach further demonstrates the advantages and novelty of employing planning methods on LLMs to solve practical planning tasks.
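For readers unfamiliar with Nested Rollout Policy Adaptation, the sketch below shows the generic algorithm on a toy problem. It is not NRPA-GD itself: the action names, the fixed target plan, and the scoring function are hypothetical stand-ins for the LLM-simulated dialogue and trajectory evaluation described in the abstract, and the policy is keyed by turn index rather than by full dialogue state.

```python
import math
import random

MOVES = ("ask", "inform", "persuade")             # hypothetical dialogue actions
TARGET = ("ask", "inform", "persuade", "inform")  # toy "ideal" plan for scoring
ALPHA = 1.0                                       # policy adaptation step size


def score(seq):
    """Toy trajectory evaluation: number of turns matching the target plan."""
    return sum(a == b for a, b in zip(seq, TARGET))


def playout(policy, rng):
    """Level 0: sample one full sequence from the softmax policy."""
    seq = []
    for turn in range(len(TARGET)):
        weights = [math.exp(policy.get((turn, m), 0.0)) for m in MOVES]
        seq.append(rng.choices(MOVES, weights=weights)[0])
    return score(seq), tuple(seq)


def adapt(policy, seq):
    """Shift policy weight toward the moves of the best sequence so far."""
    new = dict(policy)
    for turn, chosen in enumerate(seq):
        z = sum(math.exp(policy.get((turn, m), 0.0)) for m in MOVES)
        for m in MOVES:
            prob = math.exp(policy.get((turn, m), 0.0)) / z
            new[(turn, m)] = new.get((turn, m), 0.0) - ALPHA * prob
        new[(turn, chosen)] += ALPHA
    return new


def nrpa(level, policy, rng, iterations=20):
    """Each level repeatedly calls the level below and adapts toward its best result."""
    if level == 0:
        return playout(policy, rng)
    best_score, best_seq = -1, None
    for _ in range(iterations):
        s, seq = nrpa(level - 1, policy, rng, iterations)
        if s >= best_score:
            best_score, best_seq = s, seq
        policy = adapt(policy, best_seq)
    return best_score, best_seq


best_score, best_seq = nrpa(level=2, policy={}, rng=random.Random(0))
print(best_score, best_seq)
```

No model weights are trained here: the only thing that changes is a small table of action preferences, adapted online during search. That is the sense in which a planning method of this kind can avoid task-specific training.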

[NLP-117] Insight-A: Attribution-aware for Multimodal Misinformation Detection

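[Quick read]: This paper addresses the detection of multimodal misinformation created with AI-generated content (AIGC) technology on social media platforms, and specifically the fact that standard prompting of multimodal large language models (MLLMs) ignores which forgery source the misinformation should be attributed to. The key to the solution is the Insight-A framework, which rests on two ideas. First, cross-attribution prompting (CAP) models the complex correlations between perception and reasoning, based on generation patterns, to attribute misinformation to its forgery source. Second, automatic attribution-debiased prompting (ADP) reduces the subjectivity of human-annotated prompts, and image captioning (IC) supplies visual details that strengthen cross-modal consistency checking. Together these form a detection pipeline with hierarchical reasoning.

Link: https://arxiv.org/abs/2511.21705
Authors: Junjie Wu,Yumeng Fu,Chen Gong,Guohong Fu
Affiliations: Soochow University; Harbin Institute of Technology
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: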

Abstract:AI-generated content (AIGC) technology has emerged as a prevalent alternative to create multimodal misinformation on social media platforms, posing unprecedented threats to societal safety. However, standard prompting leverages multimodal large language models (MLLMs) to identify the emerging misinformation, which ignores the misinformation attribution. To this end, we present Insight-A, exploring attribution with MLLM insights for detecting multimodal misinformation. Insight-A makes two efforts: I) attribute misinformation to forgery sources, and II) an effective pipeline with hierarchical reasoning that detects distortions across modalities. Specifically, to attribute misinformation to forgery traces based on generation patterns, we devise cross-attribution prompting (CAP) to model the sophisticated correlations between perception and reasoning. Meanwhile, to reduce the subjectivity of human-annotated prompts, automatic attribution-debiased prompting (ADP) is used for task adaptation on MLLMs. Additionally, we design image captioning (IC) to achieve visual details for enhancing cross-modal consistency checking. Extensive experiments demonstrate the superiority of our proposal and provide a new paradigm for multimodal misinformation detection in the era of AIGC.

[NLP-118] On the Cross-lingual Transferability of Pre-trained wav2vec2-based Models

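[Quick read]: This paper addresses how well knowledge transfers across languages for pre-trained models based on the wav2vec 2.0 architecture, in particular how performance changes when the target language differs from the pre-training language. The key to the study is a systematic evaluation of 15 large pre-trained models on speech recognition in 18 languages. It finds that the diversity of the pre-training data matters more for downstream performance than its size, and it observes positive cross-lingual transfer from monolingual models, which is more pronounced when the pre-training language is similar to the target language. These findings give the scientific community empirical guidance for using existing pre-trained models and for pre-training new cross-lingual speech models.

Link: https://arxiv.org/abs/2511.21704
Authors: Jonatas Grosman,Cassio Almeida,Guilherme Schardong,Hélio Lopes
Affiliations: Pontifical Catholic University of Rio de Janeiro; Brazilian Institute of Geography and Statistics; Institute of Systems and Robotics; University of Coimbra
Categories: Computation and Language (cs.CL); Sound (cs.SD)
Comments: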

Abstract:Using representations provided by a large pre-trained model has become the primary strategy for achieving state-of-the-art results in a wide range of tasks. A recently proposed large pre-trained model, wav2vec 2.0, was seminal for several other works on pre-training large models on speech data. Many models are being pre-trained using the same architecture as wav2vec 2.0 and are getting state-of-the-art in various speech-related tasks. Previous work has demonstrated that the data used during the pre-training of these wav2vec2-based models can impact the model’s performance in downstream tasks, and this should be taken into consideration before utilizing these models. However, few works have proposed investigating further how the transfer knowledge of these pre-trained models behaves in different languages, even when the target language differs from the one used during the model’s pre-training. Our work aims to investigate the cross-lingual transferability of these wav2vec2-based models. We performed several fine-tuning experiments on the speech recognition task in 18 languages using 15 large pre-trained models. The results of our experiments showed us that the size of data used during the pre-training of these models is not as important to the final performance as the diversity. We noticed that the performance of Indo-European languages is superior to non-Indo-European languages in the evaluated models. We have observed a positive cross-lingual transfer of knowledge using monolingual models, which was evident in all the languages we used, but more pronounced when the language used during pre-training was more similar to the downstream task language. With these findings, we aim to assist the scientific community in utilizing existing wav2vec2-based pre-trained models, as well as facilitate the pre-training of new ones.

[NLP-119] Evaluating Embedding Generalization: How LLMs, LoRA, and SLERP Shape Representational Geometry

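【Quick Read】: This paper addresses the over-specialization that task-specific fine-tuning (e.g., LoRA) introduces when a large language model (LLM) serves as the backbone for text embeddings, as well as the limits of non-LLM encoders in capturing higher-order, compositional numeric patterns. The core solution is model merging with spherical linear interpolation (SLERP), which mitigates the damage that the parameter shift from LoRA adaptation does to general-purpose representations. Experiments show that SLERP recovers base-model structure while retaining most of the task gains, and that it significantly outperforms model souping and unmerged models on clustering separability and robustness, giving a better-balanced generalization trade-off.

Link: https://arxiv.org/abs/2511.21703
Authors: Siyaxolisa Kabane
Affiliations: University of Fort Hare
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 20 pages, 16 figures

Click to view abstract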

Abstract:We investigate the generalization properties of dense text embeddings when the embedding backbone is a large language model (LLM) versus when it is a non-LLM encoder, and we study the extent to which spherical linear interpolation (SLERP) model-merging mitigates over-specialization introduced by task-specific adaptation (e.g., LoRA). To make the comparison concrete and domain-agnostic, we design a controlled suite of experiments in which models embed short numerical sequences and are evaluated on their ability to cluster and classify those sequences according to well-defined number-theoretic properties. Our experimental protocol compares four families of models: (1) non-LLM encoders trained from scratch or fine-tuned for embeddings, (2) LLM-based encoders adapted with parameter-efficient methods (LoRA), (3) LLM-based encoders with LoRA followed by model souping merging into the base weights, and (4) the same LoRA-adapted LLMs merged using SLERP across checkpoints or stages. We evaluate representational quality with clustering indices (Silhouette and Davies Bouldin). We additionally analyze the use of kmeans labels to see if the embeddings encode any other information besides the one we are testing for. Empirically, we find that LLM-based backbones produce embeddings that better capture higher-order, compositional numeric patterns, but are prone to adapter dominance that degrades balanced generalization; SLERP merging consistently recovers base-model structure while retaining most task gains, yielding superior tradeoffs in clustering separability, and robustness compared to model souping or models that were not merged.
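
The difference between SLERP merging and plain weight averaging (souping) can be illustrated on flat weight vectors. The following is a minimal sketch of the standard SLERP formula, not the paper's merging code; treating a checkpoint as a single flat list of floats is an assumption made for illustration.

```python
import math

def slerp(w0, w1, t, eps=1e-8):
    """Spherical linear interpolation between two non-zero flat weight vectors."""
    norm0 = math.sqrt(sum(x * x for x in w0))
    norm1 = math.sqrt(sum(x * x for x in w1))
    # Angle between the two vectors, clamped against rounding error.
    cos_omega = sum(a * b for a, b in zip(w0, w1)) / (norm0 * norm1)
    omega = math.acos(max(-1.0, min(1.0, cos_omega)))
    sin_omega = math.sin(omega)
    if sin_omega < eps:
        # (Anti-)parallel vectors: fall back to plain linear interpolation.
        return [(1 - t) * a + t * b for a, b in zip(w0, w1)]
    s0 = math.sin((1 - t) * omega) / sin_omega
    s1 = math.sin(t * omega) / sin_omega
    return [s0 * a + s1 * b for a, b in zip(w0, w1)]

base = [1.0, 0.0]      # hypothetical base-model weights
adapted = [0.0, 1.0]   # hypothetical LoRA-merged weights
merged = slerp(base, adapted, 0.5)
souped = [(a + b) / 2 for a, b in zip(base, adapted)]
```

For these two orthogonal unit vectors, the SLERP midpoint keeps unit norm, while the averaged vector shrinks to a norm of about 0.71. That shrinkage is one reason the two merging strategies can yield different representational geometry.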

[NLP-120] CSV-Decode: Certifiable Sub-Vocabulary Decoding for Efficient Large Language Model Inference

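【Quick Read】: This paper addresses the significant computational bottleneck that large language models (LLMs) face at inference time because the output layer is computed over a large vocabulary. The key to the solution is a new method, CSV-Decode, which uses geometric upper bounds to construct a small sub-vocabulary at each decoding step and so enables efficient sparse computation. The method clusters vocabulary embeddings offline and uses centroid-plus-radius bounds to identify tokens whose computation can be safely skipped, while keeping two correctness guarantees: exact top-k certification and an ε-bounded softmax approximation.

Link: https://arxiv.org/abs/2511.21702
Authors: Dong Liu,Yanxuan Yu,Ben Lengerich
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract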

Abstract:Large language models face significant computational bottlenecks during inference due to the expensive output layer computation over large vocabularies. We present CSV-Decode, a novel approach that uses geometric upper bounds to construct small sub-vocabularies for each decoding step, enabling efficient sparse computation while maintaining dual correctness guarantees: exact top-k certification and ε-certified softmax approximations. Our method clusters vocabulary embeddings offline and uses centroid-plus-radius bounds to identify which tokens can be safely omitted from computation. We provide a complete system implementation with sparse GEMV kernels, multi-GPU sharding, and CUDA Graph optimization. Experimental results demonstrate significant speedup over full vocabulary decoding while maintaining distributional guarantees and low fallback rates. Our code implementation is available at this https URL.
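
A centroid-plus-radius bound can certify an exact top-k without scoring every token. The sketch below is a guess at the general idea using the Cauchy-Schwarz inequality, not the authors' kernels; the cluster `assignment` list and the toy embeddings are hypothetical, and the ε-certified softmax part is not covered.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_clusters(embeddings, assignment):
    """Offline step: centroid and radius per cluster.
    assignment[i] is a precomputed (hypothetical) cluster id for token i."""
    members = {}
    for tok, cid in enumerate(assignment):
        members.setdefault(cid, []).append(tok)
    clusters = []
    for toks in members.values():
        dim = len(embeddings[toks[0]])
        centroid = [sum(embeddings[t][d] for t in toks) / len(toks) for d in range(dim)]
        radius = max(math.dist(embeddings[t], centroid) for t in toks)
        clusters.append((centroid, radius, toks))
    return clusters

def certified_top_k(h, embeddings, clusters, k):
    """Exact top-k tokens by logit <h, e>, skipping clusters that cannot qualify."""
    h_norm = math.sqrt(dot(h, h))
    # Cauchy-Schwarz: <h, e> <= <h, c> + ||h|| * ||e - c|| <= <h, c> + ||h|| * r
    bounded = sorted(((dot(h, c) + h_norm * r, toks) for c, r, toks in clusters),
                     key=lambda item: item[0], reverse=True)
    best, scored = [], 0  # best holds (logit, token), sorted descending
    for upper, toks in bounded:
        if len(best) == k and upper < best[-1][0]:
            break  # all remaining clusters are certified to lie outside the top-k
        for t in toks:
            best.append((dot(h, embeddings[t]), t))
            scored += 1
        best = sorted(best, reverse=True)[:k]
    return [t for _, t in best], scored

embeddings = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0], [-0.9, -0.1], [0.0, 1.0], [0.1, 0.9]]
clusters = build_clusters(embeddings, assignment=[0, 0, 1, 1, 2, 2])
hidden = [2.0, 0.1]
top, scored = certified_top_k(hidden, embeddings, clusters, k=2)
```

In this toy case only the first cluster is scored, and the bounds of the other two fall below the current second-best logit, so the result matches a full scan over all six tokens.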

[NLP-121] 47B Mixture-of-Experts Beats 671B Dense Models on Chinese Medical Examinations

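【Quick Read】: This paper addresses the lack of systematic, specialty-aware evaluation of large language models (LLMs) on Chinese medical examination questions, with the aim of establishing how they actually perform across medical specialties and professional levels. The key to the solution is a comprehensive benchmark covering 7 medical specialties and 2 difficulty levels (attending physician and senior physician), on which 27 state-of-the-art LLMs are evaluated in a standardized way using 2,800 carefully curated questions. The framework quantifies performance differences between models (for example, Mixtral-8x7B reaches 74.25% accuracy) and reports three main findings: no consistent correlation between model size and performance, uneven performance across specialties, and stable generalization of the top models across difficulty levels. Together these provide empirical evidence and directions for improvement for deploying LLMs in medical education and clinical decision support systems.

Link: https://arxiv.org/abs/2511.21701
Authors: Chiung-Yi Tseng,Danyang Zhang,Tianyang Wang,Hongying Luo,Lu Chen,Junming Huang,Jibin Guan,Junfeng Hao,Junhao Song,Ziqian Bi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Click to view abstract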

Abstract:The rapid advancement of large language models(LLMs) has prompted significant interest in their potential applications in medical domains. This paper presents a comprehensive benchmark evaluation of 27 state-of-the-art LLMs on Chinese medical examination questions, encompassing seven medical specialties across two professional levels. We introduce a robust evaluation framework that assesses model performance on 2,800 carefully curated questions from cardiovascular, gastroenterology, hematology, infectious diseases, nephrology, neurology, and respiratory medicine domains. Our dataset distinguishes between attending physician and senior physician difficulty levels, providing nuanced insights into model capabilities across varying complexity. Our empirical analysis reveals substantial performance variations among models, with Mixtral-8x7B achieving the highest overall accuracy of 74.25%, followed by DeepSeek-R1-671B at 64.07%. Notably, we observe no consistent correlation between model size and performance, as evidenced by the strong performance of smaller mixture-of-experts architectures. The evaluation demonstrates significant performance gaps between medical specialties, with models generally performing better on cardiovascular and neurology questions compared to gastroenterology and nephrology domains. Furthermore, our analysis indicates minimal performance degradation between attending and senior physician levels for top-performing models, suggesting robust generalization capabilities. This benchmark provides critical insights for the deployment of LLMs in medical education and clinical decision support systems, highlighting both the promise and current limitations of these technologies in specialized medical contexts.

[NLP-122] JELV: A Judge of Edit-Level Validity for Evaluation and Automated Reference Expansion in Grammatical Error Correction

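【Quick Read】: This paper addresses the limited reference diversity in existing grammatical error correction (GEC) systems, which leads to underestimated evaluation scores and restricted model generalization. The key to the solution is the Judge of Edit-Level Validity (JELV), an automated framework that validates correction edits along three dimensions: grammaticality, faithfulness, and fluency. Built on the human-annotated Pair-wise Edit-level Validity Dataset (PEVData), it comes in two implementations: a multi-turn LLM-as-Judges pipeline that reaches 90% agreement with human annotators, and a lightweight DeBERTa classifier with 85% precision on valid edits. JELV is then used to reclassify misjudged false positives in evaluation and, combined with false-positive decoupling and fluency scoring, to build a new comprehensive evaluation metric that correlates markedly better with human judgments. It is also used to filter LLM-generated correction candidates and expand the single-reference BEA19 dataset; retraining top GEC systems on the expanded data yields measurable gains. The result is a scalable way to improve reference diversity, evaluation accuracy, and model generalization.

Link: https://arxiv.org/abs/2511.21700
Authors: Yuhao Zhan,Yuqing Zhang,Jing Yuan,Qixiang Ma,Zhiqi Yang,Yu Gu,Zemin Liu,Fei Wu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract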

Abstract:Existing Grammatical Error Correction (GEC) systems suffer from limited reference diversity, leading to underestimated evaluation and restricted model generalization. To address this issue, we introduce the Judge of Edit-Level Validity (JELV), an automated framework to validate correction edits from grammaticality, faithfulness, and fluency. Using our proposed human-annotated Pair-wise Edit-level Validity Dataset (PEVData) as benchmark, JELV offers two implementations: a multi-turn LLM-as-Judges pipeline achieving 90% agreement with human annotators, and a distilled DeBERTa classifier with 85% precision on valid edits. We then apply JELV to reclassify misjudged false positives in evaluation and derive a comprehensive evaluation metric by integrating false positive decoupling and fluency scoring, resulting in state-of-the-art correlation with human judgments. We also apply JELV to filter LLM-generated correction candidates, expanding the BEA19’s single-reference dataset containing 38,692 source sentences. Retraining top GEC systems on this expanded dataset yields measurable performance gains. JELV provides a scalable solution for enhancing reference diversity and strengthening both evaluation and model generalization.

[NLP-123] Cacheback: Speculative Decoding With Nothing But Cache

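【Quick Read】: This paper addresses slow large language model (LLM) inference, a pressing concern for Generative AI applications that need efficient decoding. The key to the solution is Cacheback Decoding, a training-free and model-agnostic speculative decoding method. Its core idea is to generate draft sequences from Least Recently Used (LRU) cache tables of token n-grams, which accelerates inference without changing the original model. The method exploits locality in language for its drafts, is simple to deploy, and shows potential for fast adaptation to new domains.

Link: https://arxiv.org/abs/2511.21699
Authors: Zhiyao Ma,In Gim,Lin Zhong
Affiliations: Yale University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract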

Abstract:We present Cacheback Decoding, a training-free and model-agnostic speculative decoding method that exploits the locality in language to accelerate Large Language Model (LLM) inference. Cacheback leverages only Least Recently Used (LRU) cache tables of token n-grams to generate draft sequences. Cacheback achieves state-of-the-art performance among comparable methods despite its minimalist design, and its simplicity allows easy integration into existing systems. Cacheback also shows potential for fast adaptation to new domains.
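
Drafting from an LRU table of token n-grams might look roughly like the sketch below. The abstract does not describe the actual table layout, so the class name, the fixed context length `n`, and the single-successor mapping are all assumptions. As in any speculative decoding scheme, the target model would still have to verify the drafted tokens.

```python
from collections import OrderedDict

class NGramLRUCache:
    """Maps a context of n tokens to the token that most recently followed it."""

    def __init__(self, n, capacity):
        self.n = n
        self.capacity = capacity
        self.table = OrderedDict()

    def update(self, tokens):
        """Record every (context -> next token) pair observed in tokens."""
        for i in range(len(tokens) - self.n):
            context = tuple(tokens[i:i + self.n])
            self.table[context] = tokens[i + self.n]
            self.table.move_to_end(context)      # mark as most recently used
            if len(self.table) > self.capacity:
                self.table.popitem(last=False)   # evict the least recently used entry

    def draft(self, tokens, max_len):
        """Propose up to max_len draft tokens by chaining cache lookups."""
        out = []
        context = tuple(tokens[-self.n:])
        while len(out) < max_len and context in self.table:
            nxt = self.table[context]
            self.table.move_to_end(context)
            out.append(nxt)
            context = context[1:] + (nxt,)
        return out

cache = NGramLRUCache(n=2, capacity=8)
cache.update("the cat sat on the mat".split())
proposal = cache.draft(["the", "cat"], max_len=3)

small = NGramLRUCache(n=1, capacity=2)
small.update([1, 2, 3, 4])  # the oldest context (1,) is evicted
```

Because the table needs no training and only stores recently seen continuations, feeding it text from a new domain immediately changes what it drafts. That is one plausible reading of the fast-adaptation claim.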

[NLP-124] EvalCards: A Framework for Standardized Evaluation Reporting

【Quick Read】: This paper addresses three persistent shortcomings in evaluation reporting practices in natural language processing (NLP): reproducibility, accessibility, and governance. Existing standardization efforts remain insufficient to meet the transparency challenges raised by rapidly released open models. The key proposal is Evaluation Disclosure Cards (EvalCards), a structured, standardized disclosure framework that helps researchers and practitioners understand and trust how a model was evaluated, while providing a practical foundation for emerging governance requirements.

Link: https://arxiv.org/abs/2511.21695
Authors: Ruchira Dhar,Danae Sanchez Villegas,Antonia Karamolegkou,Alice Schiavone,Yifei Yuan,Xinyi Chen,Jiaang Li,Stella Frank,Laura De Grazia,Monorama Swain,Stephanie Brandl,Daniel Hershcovich,Anders Søgaard,Desmond Elliott
Affiliations: University of Copenhagen; ETH Zurich; University of Amsterdam; University of Barcelona; Johannes Kepler University Linz
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Under review

Click to view abstract

Abstract:Evaluation has long been a central concern in NLP, and transparent reporting practices are more critical than ever in today’s landscape of rapidly released open-access models. Drawing on a survey of recent work on evaluation and documentation, we identify three persistent shortcomings in current reporting practices: reproducibility, accessibility, and governance. We argue that existing standardization efforts remain insufficient and introduce Evaluation Disclosure Cards (EvalCards) as a path forward. EvalCards are designed to enhance transparency for both researchers and practitioners while providing a practical foundation to meet emerging governance requirements.

[NLP-125] On the Role of Preference Variance in Preference Optimization

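【Quick Read】: This paper addresses the reliance of large language model (LLM) alignment on costly human preference data, and proposes improving training efficiency through better selection of preference data. The key is preference variance (PVar), the variance in the model's preferences when comparing pairs of responses. The authors show that PVar controls an upper bound on the Direct Preference Optimization (DPO) gradient norm: prompts with low PVar can only produce small gradient updates and therefore contribute little to learning. Experiments show that prompts with higher PVar outperform randomly selected or lower-PVar prompts on AlpacaEval 2.0 and Arena-Hard, that the selection remains robust with smaller reward models, and that on UltraFeedback, training on only the top 10% of prompts by PVar beats training on the full dataset. This supports PVar as a criterion for picking informative examples to annotate.

Link: https://arxiv.org/abs/2510.13022
Authors: Jiacheng Guo,Zihao Li,Jiahao Qiu,Yue Wu,Mengdi Wang
Affiliations: Princeton University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract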

Abstract:Direct Preference Optimization (DPO) has emerged as an important approach for learning from human preferences in aligning large language models (LLMs). However, collecting human preference data is costly and inefficient, motivating methods to reduce the required annotations. In this work, we investigate the impact of preference variance (PVar), which measures the variance in model preferences when comparing pairs of responses, on the effectiveness of DPO training. We provide a theoretical insight by establishing an upper bound on the DPO gradient norm for any given prompt, showing it is controlled by the PVar of that prompt. This implies that prompts with low PVar can only produce small gradient updates, making them less valuable for learning. We validate this finding by fine-tuning LLMs with preferences generated by a reward model, evaluating on two benchmarks (AlpacaEval 2.0 and Arena-Hard). Experimental results demonstrate that prompts with higher PVar outperform randomly selected prompts or those with lower PVar. We also show that our PVar-based selection method is robust, when using smaller reward models (1B, 3B) for selection. Notably, in a separate experiment using the original human annotations from the UltraFeedback dataset, we found that training on only the top 10% of prompts with the highest PVar yields better evaluation performance than training on the full dataset, highlighting the importance of preference variance in identifying informative examples for efficient LLM alignment.
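
PVar-based prompt selection could be prototyped as follows. The abstract does not give the exact definition of PVar, so this sketch assumes a Bradley-Terry preference model on reward scores; the function names and the example scores are hypothetical.

```python
import math
from itertools import combinations

def preference_variance(rewards):
    """PVar proxy for one prompt, from reward scores of its sampled responses.

    Assumption: P(y_i preferred over y_j) = sigmoid(r_i - r_j) (Bradley-Terry).
    Over ordered pairs the mean preference is exactly 0.5, so the variance is
    the mean squared deviation from 0.5.
    """
    pairs = list(combinations(rewards, 2))
    if not pairs:
        raise ValueError("need at least two responses per prompt")
    return sum((1 / (1 + math.exp(-(a - b))) - 0.5) ** 2 for a, b in pairs) / len(pairs)

def select_top_fraction(prompt_rewards, fraction=0.1):
    """Keep the given fraction of prompts with the highest PVar (at least one)."""
    ranked = sorted(prompt_rewards,
                    key=lambda p: preference_variance(prompt_rewards[p]),
                    reverse=True)
    keep = max(1, math.ceil(fraction * len(ranked)))
    return ranked[:keep]

scores = {
    "prompt_a": [0.1, 0.1, 0.1],    # responses indistinguishable: PVar is 0
    "prompt_b": [-2.0, 0.0, 3.0],   # clear quality spread: high PVar
    "prompt_c": [0.0, 0.2, 0.1],    # small spread: low PVar
}
selected = select_top_fraction(scores, fraction=0.3)
```

A prompt whose responses all score the same has zero variance under this proxy. This matches the abstract's point that low-PVar prompts can only produce small DPO gradient updates.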

[NLP-126] Temporal Consistency for LLM Reasoning Process Error Identification

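【Quick Read】: This paper addresses insufficient accuracy in identifying errors in mathematical reasoning processes, in particular how to verify a solution once a large language model (LLM) has generated it. Existing approaches such as one-round verification or multi-model debate suffer from unstable judgments or low efficiency. The paper proposes a temporal-consistency verification method in which the verifier iteratively refines its judgment through a sequence of self-reflection actions and uses the consistency between successive assessments to improve accuracy. On several benchmarks for process error identification (Mathcheck, ProcessBench, and PRM800K) the method consistently improves over baselines, and it lets small distilled models (7B/8B) outperform much larger ones (70B/72B), combining high accuracy with efficiency.

Link: https://arxiv.org/abs/2503.14495
Authors: Jiacheng Guo,Yue Wu,Jiahao Qiu,Kaixuan Huang,Xinzhe Juan,Ling Yang,Mengdi Wang
Affiliations: Princeton University; University of Michigan
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract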

Abstract:Verification is crucial for effective mathematical reasoning. We present a new temporal consistency method where verifiers iteratively refine their judgments based on the previous assessment. Unlike one-round verification or multi-model debate approaches, our method leverages consistency in a sequence of self-reflection actions to improve verification accuracy. Empirical evaluations across diverse mathematical process error identification benchmarks (Mathcheck, ProcessBench, and PRM800K) show consistent performance improvements over baseline methods. When applied to the recent DeepSeek R1 distilled models, our method demonstrates strong performance, enabling 7B/8B distilled models to outperform all 70B/72B models and GPT-4o on ProcessBench. Notably, the distilled 14B model with our method achieves performance comparable to Deepseek-R1. Our codes are available at this https URL
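
The iterative refinement loop might be organised along these lines. This is only a sketch under stated assumptions: the `verify` callable stands in for an LLM verifier, and the stopping rule (the same verdict repeated `stable_rounds` times) is a hypothetical choice, since the abstract does not specify one.

```python
def temporally_consistent_verdict(verify, solution, max_rounds=10, stable_rounds=3):
    """Re-run a verifier on its own previous verdict until the verdict stabilises.

    verify(solution, previous) is a hypothetical callable wrapping an LLM verifier;
    previous is None in the first round. A verdict could be, for example, the index
    of the first incorrect step, or -1 if no error is found.
    Returns (verdict, converged).
    """
    verdict = verify(solution, None)
    streak = 1  # number of consecutive rounds with the same verdict
    for _ in range(max_rounds - 1):
        if streak >= stable_rounds:
            break
        new_verdict = verify(solution, verdict)
        streak = streak + 1 if new_verdict == verdict else 1
        verdict = new_verdict
    return verdict, streak >= stable_rounds

def scripted_verifier(outputs):
    """Test double that replays a fixed sequence of verdicts."""
    replay = iter(outputs)
    return lambda solution, previous: next(replay)

stable = temporally_consistent_verdict(scripted_verifier([2, 3, 3, 3]), "solution text")
unstable = temporally_consistent_verdict(scripted_verifier([1, 2, 1, 2]), "solution text",
                                         max_rounds=4)
```

In the first run the verifier changes its mind once and then repeats step 3 three times, so the loop stops with a converged verdict. In the second run the verdict keeps flipping and the loop gives up after `max_rounds`.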

[NLP-127] How Does A Text Preprocessing Pipeline Affect Ontology Matching?

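【Quick Read】: This paper addresses the diversity in mapping results caused by the lack of standardisation in the text preprocessing pipeline used in ontology matching (OM). It finds that Phase 1 preprocessing (Tokenisation and Normalisation) is more effective than Phase 2 (Stop Words Removal and Stemming/Lemmatisation). The key to the solution is two novel repair approaches: a logic-based ad hoc repair applied before text preprocessing, which uses an ontology-specific check to find common words that cause false mappings; and a large language model (LLM)-based post hoc repair applied after text preprocessing, which uses the LLM's strong background knowledge to repair non-existent and counter-intuitive false mappings. Experiments show that both approaches significantly improve matching correctness and overall performance.

Link: https://arxiv.org/abs/2411.03962
Authors: Zhangcheng Qiang,Kerry Taylor,Weiqing Wang
Affiliations: Australian National University; Monash University
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 14 pages, 16 figures, 3 tables

Click to view abstract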

Abstract:The classical text preprocessing pipeline, comprising Tokenisation, Normalisation, Stop Words Removal, and Stemming/Lemmatisation, has been implemented in many systems for ontology matching (OM). However, the lack of standardisation in text preprocessing creates diversity in the mapping results. In this paper, we investigate the effect of the text preprocessing pipeline on 8 Ontology Alignment Evaluation Initiative (OAEI) tracks with 49 distinct alignments. We find that Tokenisation and Normalisation (categorised as Phase 1 text preprocessing) are more effective than Stop Words Removal and Stemming/Lemmatisation (categorised as Phase 2 text preprocessing). We propose two novel approaches to repair unwanted false mappings that occur in Phase 2 text preprocessing. One is an ad hoc logic-based repair approach used before text preprocessing, employing an ontology-specific check to find common words that cause false mappings. The other repair approach is the post hoc large language model (LLM)-based approach, used after text preprocessing, which utilises the strong background knowledge provided by LLMs to repair non-existent and counter-intuitive false mappings. The experimental results indicate that these two approaches can significantly improve the matching correctness and the overall matching performance.
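
A toy version of the two preprocessing phases shows how Phase 2 can introduce a false mapping. The stop-word list and the suffix-stripping stemmer below are deliberately naive stand-ins chosen for illustration, not the tools evaluated in the paper.

```python
import re

STOP_WORDS = {"has", "is", "of", "a", "the"}  # tiny illustrative list

def tokenise(label):
    """Split on whitespace, underscores, hyphens, and camelCase boundaries."""
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", label)
    return [t for t in re.split(r"[\s_\-]+", spaced) if t]

def normalise(tokens):
    return [t.lower() for t in tokens]

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def stem(tokens):
    """Naive suffix stripping, standing in for a real stemmer."""
    return [re.sub(r"(ing|ed|er|s)$", "", t) if len(t) > 4 else t for t in tokens]

def phase1(label):
    return normalise(tokenise(label))

def phase2(label):
    return stem(remove_stop_words(phase1(label)))

# Phase 1 reconciles naming conventions: a correct mapping.
same_property = phase1("hasAuthor") == phase1("has_author")
# Phase 2 collapses a person and a state onto one stem: a false mapping.
distinct_after_phase1 = phase1("Reviewer") != phase1("Reviewed")
merged_after_phase2 = phase2("Reviewer") == phase2("Reviewed")
```

Here `hasAuthor` and `has_author` already match after Phase 1, while `Reviewer` and `Reviewed` only collide once stemming is applied. Collisions of this kind are what the two repair approaches are meant to catch.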

Computer Vision

[CV-0] Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models

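【Quick Read】: This paper addresses logical inconsistency and weak visual grounding when multimodal large language models reason over dynamic visual content: current models produce plausible-looking reasoning traces, but the reasoning is often inconsistent with the answer or leans on linguistic priors rather than actual visual evidence. The key to the solution is a reinforcement-learning-based, two-step post-training procedure: timestamp-aware supervised fine-tuning first improves temporal precision, then Group Relative Policy Optimization (GRPO) guided by a novel Temporal Alignment Reward (TAR) strengthens the consistency and causal coherence of the reasoning. The method raises Think Answer Consistency (TAC) and Video Attention Score (VAS), leading to more accurate and trustworthy video understanding.

Link: https://arxiv.org/abs/2511.23478
Authors: Muhammad Maaz,Hanoona Rasheed,Fahad Shahbaz Khan,Salman Khan
Affiliations: Mohamed bin Zayed University of AI; Linköping University; Australian National University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Video-R2 Technical Report

Click to view abstract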

Abstract:Reasoning over dynamic visual content remains a central challenge for multimodal large language models. Recent thinking models generate explicit reasoning traces for interpretability; however, their reasoning often appears convincing while being logically inconsistent or weakly grounded in visual evidence. We identify and formalize these issues through two diagnostic metrics: Think Answer Consistency (TAC), which measures the alignment between reasoning and answers, and Video Attention Score (VAS), which captures the extent to which reasoning depends on visual versus textual cues. Analysis across 11 video reasoning benchmarks shows that current models rely heavily on linguistic priors rather than visual content. To address this, we propose a reinforcement learning approach that enhances both temporal precision and reasoning consistency. Our approach combines timestamp aware supervised fine tuning with Group Relative Policy Optimization (GRPO) guided by a novel Temporal Alignment Reward (TAR). This dual step post training stage encourages temporally aligned and causally coherent video reasoning. The resulting model, Video R2, achieves consistently higher TAC, VAS, and accuracy across multiple benchmarks, demonstrating that improvements in temporal alignment and reasoning coherence lead to more accurate and trustworthy video understanding. Our code, dataset, and model will be open sourced.
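
The group-relative part of GRPO can be summarised in a few lines: each sampled response is scored against the other responses to the same prompt. The sketch below shows only this commonly used advantage normalisation. How the Temporal Alignment Reward is computed is not described in the abstract, so the reward values here are hypothetical placeholders.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalise each reward by the mean and standard deviation of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical total rewards for four responses sampled for the same video question.
group_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(group_rewards)

# If every response earns the same reward, the group carries no learning signal.
flat = group_relative_advantages([0.5, 0.5])
```

Responses that beat the group average receive positive advantages and the rest negative ones, so no separate value network is needed to provide a baseline.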

[CV-1] Video-CoM: Interactive Video Reasoning via Chain of Manipulations

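【Quick Read】: This paper addresses the "passive visual processing" limitation of current multimodal large language models (MLLMs) in video understanding: once a video is encoded, the model reasons only in text and cannot interact with the visual input (rewatch, refocus, or verify evidence), which limits fine-grained spatio-temporal understanding. The key to the solution is a new paradigm, Interactive Video Reasoning, realised in the Video CoM model, which performs iterative visual actions as a Chain of Manipulations (CoM) to gather and refine evidence and thereby "think with videos". The main contributions are the 18K-instruction Video CoM Instruct dataset for training multi-step manipulation reasoning, and a reasoning-aware Group Relative Policy Optimization (GRPO) reinforcement learning strategy that adds step-level reasoning rewards instead of relying on answer rewards alone, improving both accuracy and interpretability.

Link: https://arxiv.org/abs/2511.23477
Authors: Hanoona Rasheed,Mohammed Zumri,Muhammad Maaz,Ming-Hsuan Yang,Fahad Shahbaz Khan,Salman Khan
Affiliations: Mohamed bin Zayed University of AI; University of California Merced; Google Research; Linköping University; Australian National University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report

Click to view abstract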

Abstract:Recent multimodal large language models (MLLMs) have advanced video understanding, yet most still “think about videos” ie once a video is encoded, reasoning unfolds entirely in text, treating visual input as a static context. This passive paradigm creates a semantic bottleneck: models cannot rewatch, refocus, or verify evidence, leading to shallow visual reasoning on tasks requiring fine grained spatio temporal understanding. In this work, we introduce Interactive Video Reasoning, a new paradigm that transforms video into an active cognitive workspace, enabling models to “think with videos”. Our model, Video CoM, reasons through a Chain of Manipulations (CoM), performing iterative visual actions to gather and refine evidence. To support this behavior, we construct Video CoM Instruct, an 18K instruction tuning dataset curated for multi step manipulation reasoning. Beyond supervised learning, we further optimize the manipulation policy via reinforcement learning with reasoning aware Group Relative Policy Optimization (GRPO). Unlike prior work that relies solely on sparse answer rewards, our method introduces step level reasoning rewards, guiding the model toward grounded and consistent reasoning. Video CoM achieves strong results across nine video reasoning benchmarks, improving average performance by 3.6 percent over recent state of the art models, while training on only 25K SFT and 3K GRPO video samples, significantly fewer than comparable large scale models. Ablation studies demonstrate that reasoning aware rewards improve both accuracy and interpretability. Code: this https URL

[CV-2] AnyTalker: Scaling Multi-Person Talking Video Generation with Interactivity Refinement

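【Quick Read】: This paper addresses two core challenges in multi-person talking video generation: the high cost of collecting high-quality multi-person data, and the difficulty of driving multiple identities with natural interactivity. The key to the solution is the AnyTalker framework, with two main innovations: 1) a novel identity-aware attention mechanism, embedded in the attention block of the Diffusion Transformer, that iteratively processes identity-audio pairs so that the number of drivable identities can scale; 2) an efficient training pipeline that learns from single-person videos alone and refines interactivity with only a few real multi-person clips, which greatly reduces data requirements. The method achieves natural multi-person interaction while keeping accurate lip synchronization and high visual quality, balancing data cost against identity scalability.

Link: https://arxiv.org/abs/2511.23475
Authors: Zhizhou Zhong,Yicheng Ji,Zhe Kong,Yiying Liu,Jiarui Wang,Jiasun Feng,Lupeng Liu,Xiangyi Wang,Yanjia Li,Yuqing She,Ying Qin,Huan Li,Shuiyang Mao,Wei Liu,Wenhan Luo
Affiliations: Hong Kong University of Science and Technology; Video Rebirth; Zhejiang University; Beijing Jiaotong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Homepage: this https URL

Click to view abstract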

Abstract:Recently, multi-person video generation has started to gain prominence. While a few preliminary works have explored audio-driven multi-person talking video generation, they often face challenges due to the high costs of diverse multi-person data collection and the difficulty of driving multiple identities with coherent interactivity. To address these challenges, we propose AnyTalker, a multi-person generation framework that features an extensible multi-stream processing architecture. Specifically, we extend Diffusion Transformer’s attention block with a novel identity-aware attention mechanism that iteratively processes identity-audio pairs, allowing arbitrary scaling of drivable identities. Besides, training multi-person generative models demands massive multi-person data. Our proposed training pipeline depends solely on single-person videos to learn multi-person speaking patterns and refines interactivity with only a few real multi-person clips. Furthermore, we contribute a targeted metric and dataset designed to evaluate the naturalness and interactivity of the generated multi-person videos. Extensive experiments demonstrate that AnyTalker achieves remarkable lip synchronization, visual quality, and natural interactivity, striking a favorable balance between data costs and identity scalability.

[CV-3] Visual Generation Tuning

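【Quick Read】: This paper addresses the fact that large Vision Language Models (VLMs) acquire rich multimodal understanding during pretraining, yet their latent capacity for visual generation remains largely untapped. The authors propose a new paradigm, Visual Generation Tuning (VGT), whose key idea is to fine-tune a pretrained VLM efficiently so that its inherent visual generation ability is unlocked. Concretely, VGT-AE drops the entangled pixel-level variational autoencoders (VAEs) designed for diffusion transformers and instead aligns the semantic encoder of the pretrained VLM with the latent representations of a pixel decoder, which markedly reduces alignment cost and speeds up the convergence of autoregressive modeling in continuous space by 20x. Experiments show strong results on both image reconstruction and visual generation, supporting its potential for unified multimodal foundation models.

Link: https://arxiv.org/abs/2511.23469
Authors: Jiahao Guo,Sinan Du,Jingfeng Yao,Wenyu Liu,Bo Li,Haoxiang Cao,Kun Gai,Chun Yuan,Kai Wu,Xinggang Wang
Affiliations: Huazhong University of Science and Technology; Tsinghua University; School of Artificial Intelligence, South China Normal University; Kolors Team, Kuaishou Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract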

Abstract:Large Vision Language Models (VLMs) effectively bridge the modality gap through extensive pretraining, acquiring sophisticated visual representations aligned with language. However, it remains underexplored whether these representations, optimized for multimodal understanding tasks, harbor an inherent potential for visual generation. In this paper, we propose VGT, Visual Generation Tuning, a novel paradigm designed to stimulate the underlying capabilities of visual generation within any vision language models. By performing efficient visual generation tuning on well-pretrained VLMs, we significantly mitigate the alignment costs and accelerate the convergence of autoregressive modeling in the continuous space (20x speedup). Specifically, we dismiss the entangled pixel-level VAEs designed for diffusion transformers and formulate VGT-AE through aligning the semantic encoders from pretrained VLMs with the latent representations of pixel decoders. In image reconstruction tasks, we achieve 26.67 PSNR and 0.50 rFID at a 28x compression ratio, outperforming specialized VAEs; in visual generation tasks, we achieve state-of-the-art outcomes among autoregressive models, 0.77 on GenEval and 78.73 on DPG-Bench. Furthermore, our proposed VGT showcases significant scaling promise and is versatile for endowing any VLMs trained for multimodal understanding with the capabilities of visual generation, which paves the new avenue to explore next-generation unified multimodal foundation models. Models and codes are available at this https URL.
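
For readers unfamiliar with the reconstruction metric, PSNR (in dB) follows directly from the mean squared error between the original and the reconstructed image. The sketch below uses the standard definition on flat pixel lists; the 8-bit value range and the sample pixels are assumptions, not details of the paper's evaluation protocol.

```python
import math

def psnr(original, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)

# Hypothetical 4-pixel example: every pixel is off by exactly one intensity level.
pixels = [10, 200, 35, 90]
decoded = [11, 199, 36, 89]
score = psnr(pixels, decoded)
```

Higher is better. The scale is logarithmic, so each 10 dB corresponds to a tenfold reduction in mean squared error.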

[CV-4] Object-Centric Data Synthesis for Category-level Object Detection

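【Quick Read】: This paper addresses how to extend an object detection model to new object categories when data is limited. Current deep learning methods work well on specific classes, but transferring to new classes requires large amounts of annotated data, which is especially costly for rare classes in the long tail. The key to the solution is data synthesis from object-centric data (multi-view images or 3D models): four methods, based on simple image processing, 3D rendering, and image diffusion models, generate realistic, cluttered images with varying contextual coherence and complexity, improving category-level generalization on real-world data when annotations are scarce.

Link: https://arxiv.org/abs/2511.23450
Authors: Vikhyat Agarwal,Jiayi Cora Guo,Declan Hoban,Sissi Zhang,Nicholas Moran,Peter Cho,Srilakshmi Pattabiraman,Shantanu Joshi
Affiliations: University of Richmond; University of California, Los Angeles; University of California, Berkeley; University of Texas at Austin; Analog Devices, Inc
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 10 figures

Click to view abstract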

Abstract:Deep learning approaches to object detection have achieved reliable detection of specific object classes in images. However, extending a model’s detection capability to new object classes requires large amounts of annotated training data, which is costly and time-consuming to acquire, especially for long-tailed classes with insufficient representation in existing datasets. Here, we introduce the object-centric data setting, when limited data is available in the form of object-centric data (multi-view images or 3D models), and systematically evaluate the performance of four different data synthesis methods to finetune object detection models on novel object categories in this setting. The approaches are based on simple image processing techniques, 3D rendering, and image diffusion models, and use object-centric data to synthesize realistic, cluttered images with varying contextual coherence and complexity. We assess how these methods enable models to achieve category-level generalization in real-world data, and demonstrate significant performance boosts within this data-constrained experimental setting.

[CV-5] Physics-Informed Neural Networks for Thermophysical Property Retrieval

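【Quick Read】: This paper addresses inverse heat problems in assessing the thermal performance of building facades: estimating the thermal conductivity k of wall materials from non-invasive temperature data collected in situ, in order to quantify how facade renovation affects thermal transmittance. Conventional methods need long measurement periods, are sensitive to environmental conditions, or are invasive, which makes reliable estimation under real conditions difficult. The key to the solution is an iterative framework based on Physics-Informed Neural Networks (PINNs) that alternates between two steps: solving the forward heat problem for a fixed k to predict surface temperatures, and optimizing k by comparing the predicted temperatures with measured thermographs, repeating until k converges. Experiments show that the method stays reasonably accurate even when the steady-state assumption is violated (maximum MAE of 4.0851), supporting the use of PINNs for non-destructive in-situ estimation of material properties.

Link: https://arxiv.org/abs/2511.23449
Authors: Ali Waseem,Malcolm Mielle
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV)
Comments: 26 pages, 4 figures, 3 tables

Click to view abstract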

Abstract:Inverse heat problems refer to the estimation of material thermophysical properties given observed or known heat diffusion behaviour. Inverse heat problems have wide-ranging uses, but a critical application lies in quantifying how building facade renovation reduces thermal transmittance, a key determinant of building energy efficiency. However, solving inverse heat problems with non-invasive data collected in situ is error-prone due to environmental variability or deviations from theoretically assumed conditions. Hence, current methods for measuring thermal conductivity are either invasive, require lengthy observation periods, or are sensitive to environmental and experimental conditions. Here, we present a PINN-based iterative framework to estimate the thermal conductivity k of a wall from a set of thermographs; our framework alternates between estimating the forward heat problem with a PINN for a fixed k, and optimizing k by comparing the thermographs and surface temperatures predicted by the PINN, repeating until the estimated k’s convergence. Using both environmental data captured by a weather station and data generated from Finite-Volume-Method software simulations, we accurately predict k across different environmental conditions and data collection sampling times, given the temperature profile of the wall at dawn is close to steady state. Although violating the steady-state assumption impacts the accuracy of k’s estimation, we show that our proposed framework still only exhibits a maximum MAE of 4.0851. Our work demonstrates the potential of PINN-based methods for reliable estimation of material properties in situ and under realistic conditions, without lengthy measurement campaigns. Given the lack of research on using machine learning, and more specifically on PINNs, for solving in-situ inverse problems, we expect our work to be a starting point for more research on the topic.
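
The alternation between a forward model at fixed k and an update of k can be illustrated without a neural network. In the sketch below, the PINN is replaced by a closed-form steady-state model of a 1D wall with convection on the outer surface, which is a much simpler stand-in than the paper's forward solver. The wall thickness, the convection coefficient `h_out`, and the noise-free observations are all hypothetical.

```python
import math

def surface_temperature(k, t_inside, t_outside, thickness=0.3, h_out=25.0):
    """Stand-in forward model (not a PINN): steady-state outer surface temperature.

    Flux balance: (k / L) * (T_in - T_s) = h_out * (T_s - T_out).
    """
    g = k / thickness
    return (g * t_inside + h_out * t_outside) / (g + h_out)

def estimate_k(observations, k_low=0.01, k_high=5.0, iterations=60):
    """Golden-section search for the k whose predictions best match the measurements.

    observations: list of (t_inside, t_outside, measured_surface_temperature).
    Assumes the squared-error loss is unimodal in k on [k_low, k_high].
    """
    def loss(k):
        return sum((surface_temperature(k, ti, to) - ts) ** 2 for ti, to, ts in observations)

    ratio = (math.sqrt(5) - 1) / 2
    a, b = k_low, k_high
    for _ in range(iterations):
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if loss(c) < loss(d):   # forward model evaluated at fixed candidate k
            b = d
        else:
            a = c
    return (a + b) / 2

true_k = 0.8  # hypothetical conductivity in W/(m*K)
conditions = [(20.0, 0.0), (21.0, 5.0), (19.0, -3.0)]
observations = [(ti, to, surface_temperature(true_k, ti, to)) for ti, to in conditions]
k_hat = estimate_k(observations)
```

With noise-free synthetic measurements the search recovers the conductivity used to generate them. Real thermographs would add noise and transient effects, which is where a learned forward model such as a PINN becomes relevant.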

[CV-6] Hunyuan-GameCraft-2: Instruction-following Interactive Game World Model

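[Quick read]: This paper addresses the limited interaction diversity and weak modeling of player-driven dynamics in current generative game world models, which stem from rigid action schemas and high annotation costs. Its core solution is Hunyuan-GameCraft-2, an instruction-driven interaction paradigm that lets users control generated game video content through natural language, keyboard, or mouse signals, together with a causally aligned interactive video dataset. The model builds on a 14B-parameter image-to-video Mixture-of-Experts (MoE) foundation model and adds a text-driven interaction injection mechanism for fine-grained control over camera motion, character behavior, and environment dynamics, producing temporally coherent and causally grounded interactive game videos.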

Link: https://arxiv.org/abs/2511.23429
Authors: Junshu Tang,Jiacheng Liu,Jiaqi Li,Longhuang Wu,Haoyu Yang,Penghao Zhao,Siruis Gong,Xiang Yuan,Shuai Shao,Qinglin Lu
Affiliations: Tencent Hunyuan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report, Project page: this https URL

Abstract:Recent advances in generative world models have enabled remarkable progress in creating open-ended game environments, evolving from static scene synthesis toward dynamic, interactive simulation. However, current approaches remain limited by rigid action schemas and high annotation costs, restricting their ability to model diverse in-game interactions and player-driven dynamics. To address these challenges, we introduce Hunyuan-GameCraft-2, a new paradigm of instruction-driven interaction for generative game world modeling. Instead of relying on fixed keyboard inputs, our model allows users to control game video contents through natural language prompts, keyboard, or mouse signals, enabling flexible and semantically rich interaction within generated worlds. We formally defined the concept of interactive video data and developed an automated process to transform large-scale, unstructured text-video pairs into causally aligned interactive datasets. Built upon a 14B image-to-video Mixture-of-Experts(MoE) foundation model, our model incorporates a text-driven interaction injection mechanism for fine-grained control over camera motion, character behavior, and environment dynamics. We introduce an interaction-focused benchmark, InterBench, to evaluate interaction performance comprehensively. Extensive experiments demonstrate that our model generates temporally coherent and causally grounded interactive game videos that faithfully respond to diverse and free-form user instructions such as “open the door”, “draw a torch”, or “trigger an explosion”.

[CV-7] DisMo: Disentangled Motion Representations for Open-World Motion Transfer NEURIPS2025

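[Quick read]: This paper addresses the difficulty of separating motion from content in current text-to-video (T2V) and image-to-video (I2V) generation models, which limits their flexibility and controllability for content creation. Existing methods typically couple motion with appearance, object identity, or pose, so they cannot transfer motion across semantically unrelated entities and tend to suffer from low motion fidelity, poor prompt adherence, or action drift. The key to the solution is DisMo, a framework that learns abstract motion representations directly from raw video data via an image-space reconstruction objective, thereby disentangling motion semantics from appearance. The representation is independent of static features such as object category and pose, supports open-world motion transfer, and can be combined with any existing video generator through lightweight adapters, balancing transfer accuracy with faithful conditioning. It also consistently outperforms state-of-the-art video representation models such as V-JEPA on zero-shot action classification.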

Link: https://arxiv.org/abs/2511.23428
Authors: Thomas Ressler-Antal,Frank Fundel,Malek Ben Alaya,Stefan Andreas Baumann,Felix Krause,Ming Gui,Björn Ommer
Affiliations: CompVis @ LMU Munich; Munich Center for Machine Learning
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at NeurIPS 2025

Abstract:Recent advances in text-to-video (T2V) and image-to-video (I2V) models have enabled the creation of visually compelling and dynamic videos from simple textual descriptions or initial frames. However, these models often fail to provide an explicit representation of motion separate from content, limiting their applicability for content creators. To address this gap, we propose DisMo, a novel paradigm for learning abstract motion representations directly from raw video data via an image-space reconstruction objective. Our representation is generic and independent of static information such as appearance, object identity, or pose. This enables open-world motion transfer, allowing motion to be transferred across semantically unrelated entities without requiring object correspondences, even between vastly different categories. Unlike prior methods, which trade off motion fidelity against prompt adherence by either overfitting to source structure or drifting from the described action, our approach disentangles motion semantics from appearance, enabling accurate transfer and faithful conditioning. Furthermore, our motion representation can be combined with any existing video generator via lightweight adapters, allowing us to effortlessly benefit from future advancements in video models. We demonstrate the effectiveness of our method through a diverse set of motion transfer tasks. Finally, we show that the learned representations are well-suited for downstream motion understanding tasks, consistently outperforming state-of-the-art video representation models such as V-JEPA in zero-shot action classification on benchmarks including Something-Something v2 and Jester. Project page: this https URL

[CV-8] MANTA: Physics-Informed Generalized Underwater Object Tracking WACV2026

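[Quick read]: This paper addresses the severe appearance distortion in underwater object tracking caused by wavelength-dependent attenuation and scattering; existing trackers trained on terrestrial data generalize poorly to these physics-driven degradations. The key to the solution is the MANTA framework, whose core contributions are: (1) a dual-positive contrastive learning strategy that couples temporal consistency with Beer-Lambert augmentations, making features robust to both temporal and underwater distortions; (2) a multi-stage pipeline that augments motion-based tracking with a physics-informed secondary association algorithm, using geometric consistency and appearance similarity for re-identification under occlusion and drift; (3) two geometric-fidelity metrics, Center-Scale Consistency (CSC) and Geometric Alignment Score (GAS), that complement standard IoU metrics. Experiments on four underwater benchmarks show state-of-the-art performance, with Success AUC improved by up to 6 percent, along with stable long-term generalization and efficient runtime.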

Link: https://arxiv.org/abs/2511.23405
Authors: Suhas Srinath,Hemang Jamadagni,Aditya Chadrasekar,Prathosh AP
Affiliations: Indian Institute of Science; National Institute of Technology Karnataka
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to the IEEE/CVF WACV 2026

Abstract:Underwater object tracking is challenging due to wavelength dependent attenuation and scattering, which severely distort appearance across depths and water conditions. Existing trackers trained on terrestrial data fail to generalize to these physics-driven degradations. We present MANTA, a physics-informed framework integrating representation learning with tracking design for underwater scenarios. We propose a dual-positive contrastive learning strategy coupling temporal consistency with Beer-Lambert augmentations to yield features robust to both temporal and underwater distortions. We further introduce a multi-stage pipeline augmenting motion-based tracking with a physics-informed secondary association algorithm that integrates geometric consistency and appearance similarity for re-identification under occlusion and drift. To complement standard IoU metrics, we propose Center-Scale Consistency (CSC) and Geometric Alignment Score (GAS) to assess geometric fidelity. Experiments on four underwater benchmarks (WebUOT-1M, UOT32, UTB180, UWCOT220) show that MANTA achieves state-of-the-art performance, improving Success AUC by up to 6 percent, while ensuring stable long-term generalized underwater tracking and efficient runtime.
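The Beer-Lambert law behind the augmentation says that light intensity decays exponentially with path length, at a rate that depends on wavelength. A minimal sketch of such an augmentation follows; the per-channel coefficients and function names are illustrative assumptions, and the paper's actual augmentation (for example, whether it also models scattering) may differ.

```python
import math

# Hypothetical per-channel attenuation coefficients in 1/m. Red light fades
# fastest in water; these numbers are illustrative, not from the paper.
ATTENUATION = {"r": 0.60, "g": 0.12, "b": 0.08}


def beer_lambert_pixel(rgb, distance_m, coeffs=ATTENUATION):
    """Attenuate one RGB pixel (floats in [0, 1]) as I = I0 * exp(-beta * d)."""
    r, g, b = rgb
    return (
        r * math.exp(-coeffs["r"] * distance_m),
        g * math.exp(-coeffs["g"] * distance_m),
        b * math.exp(-coeffs["b"] * distance_m),
    )


def augment_image(image, distance_m):
    """Apply the same attenuation to every pixel of a nested-list image."""
    return [[beer_lambert_pixel(px, distance_m) for px in row] for row in image]


white = (1.0, 1.0, 1.0)
shifted = beer_lambert_pixel(white, 5.0)   # white drifts toward blue-green
```

In a contrastive setup, the attenuated image would serve as an extra positive view of the same target, alongside a temporally adjacent frame.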

[CV-9] VQRAE: Representation Quantization Autoencoders for Multimodal Understanding Generation and Reconstruction

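[Quick read]: This paper addresses the challenge of unifying representations for multimodal understanding, generation, and reconstruction within a single tokenizer. Prior work mostly follows a dual-encoder paradigm, for example using separate encoders for understanding and generation, or balancing semantic representations and low-level features with a contrastive loss. The paper proposes VQRAE, a Vector Quantization version of Representation AutoEncoders, whose key idea is a unified representation: one tokenizer produces continuous semantic features for image understanding and discrete tokens for visual generation and fine-grained reconstruction. The core is a two-stage training strategy: first the encoder is frozen and a high-dimensional semantic vector quantization (VQ) codebook is learned with a pixel reconstruction objective; then the encoder is jointly optimized under self-distillation constraints, preserving multimodal understanding while yielding discrete representations suited to generation. The study also finds that a high-dimensional codebook (dimension 1536) can reach 100% utilization, in contrast to the low-dimensional codebooks commonly used for image reconstruction.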

Link: https://arxiv.org/abs/2511.23386
Authors: Sinan Du,Jiahao Guo,Bo Li,Shuhao Cui,Zhengzhuo Xu,Yifu Luo,Yongxian Wei,Kun Gai,Xinggang Wang,Kai Wu,Chun Yuan
Affiliations: Tsinghua University; Huazhong University of Science and Technology; Kolors Team, Kuaishou Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 10 figures

Abstract:Unifying multimodal understanding, generation and reconstruction representation in a single tokenizer remains a key challenge in building unified models. Previous research predominantly attempts to address this in a dual encoder paradigm, e.g., using separate encoders for understanding and generation, or balancing semantic representations and low-level features with contrastive loss. In this paper, we propose VQRAE, a Vector Quantization version of Representation AutoEncoders, which is the first exploration of a unified representation that produces continuous semantic features for image understanding and discrete tokens for visual generation within a unified tokenizer. Specifically, we build upon pretrained vision foundation models with a symmetric ViT decoder and adopt a two-stage training strategy: first, it freezes the encoder and learns a high-dimensional semantic VQ codebook with a pixel reconstruction objective; then it jointly optimizes the encoder with self-distillation constraints. This design incurs negligible loss of semantic information, maintaining the ability of multimodal understanding while producing discrete tokens that are compatible with generation and fine-grained reconstruction. Besides, we identify an intriguing property: quantizing semantic encoders relies on a high-dimensional codebook, in contrast to the previous common practice of low-dimensional codebooks in image reconstruction. The semantic VQ codebook can achieve a 100% utilization ratio at a dimension of 1536. VQRAE presents competitive performance on several benchmarks of visual understanding, generation and reconstruction, with promising scaling properties in the autoregressive paradigm owing to its discrete tokens.
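To make "discrete tokens" and "codebook utilization" concrete, here is a minimal sketch of nearest-neighbour vector quantization. It assumes plain squared-L2 lookup on a tiny 2-dimensional codebook; the function names and data are hypothetical, and VQRAE's actual quantizer, 1536-dimensional codebook, and training losses are not reproduced here.

```python
def quantize(vector, codebook):
    """Return the index of the nearest codebook entry (squared L2 distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


def encode(features, codebook):
    """Map continuous feature vectors to discrete token indices."""
    return [quantize(f, codebook) for f in features]


def utilization(tokens, codebook_size):
    """Fraction of codebook entries selected at least once."""
    return len(set(tokens)) / codebook_size


# Toy data: four codes, four continuous feature vectors.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
features = [[0.1, 0.2], [0.9, 0.1], [0.8, 0.9], [0.2, 0.1]]
tokens = encode(features, codebook)            # [0, 1, 3, 0]
usage = utilization(tokens, len(codebook))     # 0.75: code 2 is never used
```

A utilization of 100%, as reported in the abstract, would mean every code is selected at least once over the evaluation data.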

[CV-10] DEAL-300K: Diffusion-based Editing Area Localization with a 300K-Scale Dataset and Frequency-Prompted Baseline

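[Quick read]: This paper addresses the localization of diffusion-based image edits, that is, how to accurately identify and localize locally manipulated regions produced by Generative AI tools that blend smoothly into the original image and are hard to notice. Existing benchmarks mainly target binary detection of generated images or localization of manually edited regions, and do not reflect the smooth transitions and strong semantic consistency of diffusion-based edits. The key to the solution is DEAL-300K, a large-scale dataset with pixel-level annotations, built by using a multi-modal large language model to generate editing instructions, a mask-free diffusion editor to produce manipulated images, and an active-learning change detection pipeline to obtain annotations. On top of it, the paper proposes a localization framework combining a frozen Visual Foundation Model (VFM) with Multi Frequency Prompt Tuning (MFPT) to capture both semantic and frequency-domain cues of edited regions, reaching pixel-level F1 scores of 82.56% on its own test split and 80.97% on the external CoCoGlide benchmark.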

Link: https://arxiv.org/abs/2511.23377
Authors: Rui Zhang,Hongxia Wang,Hangqing Liu,Yang Zhou,Qiang Zeng
Affiliations: Sichuan University; Key Laboratory of Data Protection and Intelligent Management, Ministry of Education
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 12 figures

Abstract:Diffusion-based image editing has made semantic-level image manipulation easy for general users, but it also enables realistic local forgeries that are hard to localize. Existing benchmarks mainly focus on the binary detection of generated images or the localization of manually edited regions and do not reflect the properties of diffusion-based edits, which often blend smoothly into the original content. We present Diffusion-Based Image Editing Area Localization Dataset (DEAL-300K), a large-scale dataset for diffusion-based image manipulation localization (DIML) with more than 300,000 annotated images. We build DEAL-300K by using a multi-modal large language model to generate editing instructions, a mask-free diffusion editor to produce manipulated images, and an active-learning change detection pipeline to obtain pixel-level annotations. On top of this dataset, we propose a localization framework that uses a frozen Visual Foundation Model (VFM) together with Multi Frequency Prompt Tuning (MFPT) to capture both semantic and frequency-domain cues of edited regions. Trained on DEAL-300K, our method reaches a pixel-level F1 score of 82.56% on our test split and 80.97% on the external CoCoGlide benchmark, providing strong baselines and a practical foundation for future DIML research. The dataset can be accessed via this https URL.
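The pixel-level F1 score quoted above compares a predicted edit mask with the ground-truth mask pixel by pixel. A minimal sketch for a single image follows; it assumes binary masks as nested lists, and how the paper aggregates scores across images (per-image average versus pooled pixels) is not stated in the abstract.

```python
def pixel_f1(pred_mask, true_mask):
    """F1 score over pixels, treating 'edited' (truthy) as the positive class."""
    tp = fp = fn = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            if p and t:
                tp += 1          # edited pixel correctly flagged
            elif p and not t:
                fp += 1          # clean pixel flagged as edited
            elif t and not p:
                fn += 1          # edited pixel missed
    if tp == 0:
        return 0.0               # also avoids division by zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


pred = [[1, 1, 0],
        [0, 1, 0]]
true = [[1, 0, 0],
        [0, 1, 1]]
score = pixel_f1(pred, true)     # tp=2, fp=1, fn=1, so F1 = 2/3
```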

[CV-11] SimScale: Learning to Drive via Real-World Simulation at Scale

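[Quick read]: This paper addresses the insufficient decision-making ability of autonomous driving systems in safety-critical and out-of-distribution scenarios, which are underrepresented in real-world driving data collected by human experts. To compensate for the lack of data diversity, it proposes a scalable simulation framework (SimScale) that combines advanced neural rendering with a reactive environment to synthesize large numbers of unseen states from existing driving logs, and uses a pseudo-expert trajectory generation mechanism to provide action supervision for these simulated states. The key points are: (1) synthesizing high-fidelity multi-view observations controlled by a perturbed ego trajectory; (2) designing effective pseudo-experts for action supervision; (3) showing that policy performance improves smoothly when only simulation data is increased, without additional real-world data, which improves the robustness and generalization of planning methods on challenging real-world benchmarks.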

Link: https://arxiv.org/abs/2511.23369
Authors: Haochen Tian,Tianyu Li,Haochen Liu,Jiazhi Yang,Yihang Qiu,Guang Li,Junli Wang,Yinfeng Gao,Zhang Zhang,Liang Wang,Hangjun Ye,Tieniu Tan,Long Chen,Hongyang Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Project page: this https URL

Abstract:Achieving fully autonomous driving systems requires learning rational decisions in a wide span of scenarios, including safety-critical and out-of-distribution ones. However, such cases are underrepresented in the real-world corpus collected by human experts. To compensate for the lack of data diversity, we introduce a novel and scalable simulation framework capable of synthesizing massive unseen states upon existing driving logs. Our pipeline utilizes advanced neural rendering with a reactive environment to generate high-fidelity multi-view observations controlled by the perturbed ego trajectory. Furthermore, we develop a pseudo-expert trajectory generation mechanism for these newly simulated states to provide action supervision. Upon the synthesized data, we find that a simple co-training strategy on both real-world and simulated samples can lead to significant improvements in both robustness and generalization for various planning methods on challenging real-world benchmarks, up to +6.8 EPDMS on navhard and +2.9 on navtest. More importantly, such policy improvement scales smoothly by increasing simulation data only, even without extra real-world data streaming in. We further reveal several crucial findings of such a sim-real learning system, which we term SimScale, including the design of pseudo-experts and the scaling properties for different policy architectures. Our simulation data and code will be released.

[CV-12] A Hierarchical Computer Vision Pipeline for Physiological Data Extraction from Bedside Monitors

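[Quick read]: This paper addresses the “interoperability gap” in low-resource healthcare settings, where bedside monitors without network connectivity cannot feed physiological data into electronic health record (EHR) systems. The key to the solution is a lightweight computer vision pipeline that uses YOLOv11 to localize the monitor screen and its regions of interest (ROI) and PaddleOCR for robust text recognition; a geometric rectification module standardizes the screen perspective across camera angles and lighting conditions, making character recognition more stable. Without replacing any hardware, the method automatically converts unstructured on-screen information into structured digital data, offering a practical and scalable way to improve information accessibility and clinical documentation in low-resource settings.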

Link: https://arxiv.org/abs/2511.23355
Authors: Vinh Chau,Khoa Le Dinh Van,Hon Huynh Ngoc,Binh Nguyen Thien,Hao Nguyen Thien,Vy Nguyen Quang,Phuc Vo Hong,Yen Lam Minh,Kieu Pham Tieu,Trinh Nguyen Thi Diem,Louise Thwaites,Hai Ho Bich
Affiliations: Oxford University Clinical Research Unit; Trung Vuong Hospital; Nuffield Department of Medicine, University of Oxford
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 3 figures

Abstract:In many low-resource healthcare settings, bedside monitors remain standalone legacy devices without network connectivity, creating a persistent interoperability gap that prevents seamless integration of physiological data into electronic health record (EHR) systems. To address this challenge without requiring costly hardware replacement, we present a computer vision-based pipeline for the automated capture and digitisation of vital sign data directly from bedside monitor screens. Our method employs a hierarchical detection framework combining YOLOv11 for accurate monitor and region of interest (ROI) localisation with PaddleOCR for robust text extraction. To enhance reliability across variable camera angles and lighting conditions, a geometric rectification module standardizes the screen perspective before character recognition. We evaluated the system on a dataset of 6,498 images collected from open-source corpora and real-world intensive care units in Vietnam. The model achieved a mean Average Precision (mAP@50-95) of 99.5% for monitor detection and 91.5% for vital sign ROI localisation. The end-to-end extraction accuracy exceeded 98.9% for core physiological parameters, including heart rate, oxygen saturation SpO2, and arterial blood pressure. These results demonstrate that a lightweight, camera-based approach can reliably transform unstructured information from screen captures into structured digital data, providing a practical and scalable pathway to improve information accessibility and clinical documentation in low-resource settings.

[CV-13] Flow Straighter and Faster: Efficient One-Step Generative Modeling via MeanFlow on Rectified Trajectories

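[Quick read]: This paper addresses the low sampling efficiency of flow-based generative models: sampling usually relies on expensive numerical integration of ordinary differential equations (ODEs), and existing one-step approaches each have drawbacks, with Rectified Flow needing multiple costly reflow iterations and MeanFlow converging slowly under noisy supervision when trained on highly curved flows. The key solution is Rectified MeanFlow (Re-MeanFlow), which models the mean velocity field along the rectified trajectory using only a single reflow step, removing the need for perfectly straight trajectories while keeping training efficient. It also introduces a simple truncation heuristic that aims to reduce residual curvature and further improve sample quality and training efficiency.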

Link: https://arxiv.org/abs/2511.23342
Authors: Xinxi Zhang,Shiwei Tan,Quang Nguyen,Quan Dao,Ligong Han,Xiaoxiao He,Tunyu Zhang,Alen Mrdovic,Dimitris Metaxas
Affiliations: Rutgers University; Red Hat AI Innovation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Flow-based generative models have recently demonstrated strong performance, yet sampling typically relies on expensive numerical integration of ordinary differential equations (ODEs). Rectified Flow enables one-step sampling by learning nearly straight probability paths, but achieving such straightness requires multiple computationally intensive reflow iterations. MeanFlow achieves one-step generation by directly modeling the average velocity over time; however, when trained on highly curved flows, it suffers from slow convergence and noisy supervision. To address these limitations, we propose Rectified MeanFlow, a framework that models the mean velocity field along the rectified trajectory using only a single reflow step. This eliminates the need for perfectly straightened trajectories while enabling efficient training. Furthermore, we introduce a simple yet effective truncation heuristic that aims to reduce residual curvature and further improve performance. Extensive experiments on ImageNet at 64, 256, and 512 resolutions show that Re-MeanFlow consistently outperforms prior one-step flow distillation and Rectified Flow methods in both sample quality and training efficiency. Code is available at this https URL.
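Why modelling the average velocity allows one-step generation, and why straighter trajectories make the job easier, can be seen on a toy 1D path. This is an illustrative sketch only: the trajectories, names, and time direction are assumptions, and no network or reflow procedure is involved.

```python
def mean_velocity(velocity, r, t, n=1000):
    """Average of an instantaneous velocity over [r, t] (midpoint rule)."""
    h = (t - r) / n
    return sum(velocity(r + (i + 0.5) * h) for i in range(n)) * h / (t - r)


def v_curved(t):
    """Velocity of the toy curved trajectory z(t) = t**2."""
    return 2.0 * t


def v_straight(t):
    """Velocity of the toy straight trajectory z(t) = 3 * t."""
    return 3.0


def one_step(z_start, r, t, velocity_value):
    """Jump from time r to time t in a single step with a fixed velocity."""
    return z_start + (t - r) * velocity_value


# Curved path: the true endpoint is z(1) = 1.
jump_mean = one_step(0.0, 0.0, 1.0, mean_velocity(v_curved, 0.0, 1.0))
jump_inst = one_step(0.0, 0.0, 1.0, v_curved(0.0))

# Straight path: instantaneous and mean velocity coincide, z(1) = 3.
jump_straight = one_step(0.0, 0.0, 1.0, v_straight(0.0))
```

On the curved path, a single step with the mean velocity lands on the endpoint while a single step with the starting instantaneous velocity does not; on the straight path the two agree. That is the intuition for learning mean velocities on trajectories that have already been partly straightened.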

[CV-14] Markovian Scale Prediction: A New Era of Visual Autoregressive Generation

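[Quick read]: This paper addresses the computational inefficiency and large memory overhead caused by full-context dependency in Visual AutoRegressive modeling (VAR), which limit VAR's practicality and scalability. The key to the solution is to reformulate VAR as a non-full-context Markov process, yielding Markov-VAR: a sliding window compresses certain previous scales into a compact history vector to compensate for the historical information lost by dropping full-context dependency; combining this history vector with the current scale state forms a dynamic state that evolves under a Markov process, improving efficiency while maintaining generation quality.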

Link: https://arxiv.org/abs/2511.23334
Authors: Yu Zhang,Jingyi Liu,Yiwei Shi,Qi Zhang,Duoqian Miao,Changwei Wang,Longbing Cao
Affiliations: Tongji University; University of Bristol; Macquarie University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Visual AutoRegressive modeling (VAR) based on next-scale prediction has revitalized autoregressive visual generation. Although its full-context dependency, i.e., modeling all previous scales for next-scale prediction, facilitates more stable and comprehensive representation learning by leveraging complete information flow, the resulting computational inefficiency and substantial overhead severely hinder VAR’s practicality and scalability. This motivates us to develop a new VAR model with better performance and efficiency without full-context dependency. To address this, we reformulate VAR as a non-full-context Markov process, proposing Markov-VAR. It is achieved via Markovian Scale Prediction: we treat each scale as a Markov state and introduce a sliding window that compresses certain previous scales into a compact history vector to compensate for historical information loss owing to non-full-context dependency. Integrating the history vector with the Markov state yields a representative dynamic state that evolves under a Markov process. Extensive experiments demonstrate that Markov-VAR is extremely simple yet highly effective: compared to VAR on ImageNet, Markov-VAR reduces FID by 10.5% (256 × 256) and decreases peak memory consumption by 83.8% (1024 × 1024). We believe that Markov-VAR can serve as a foundation for future research on visual autoregressive generation and other downstream tasks.
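A rough way to see where the memory saving could come from is to count how many tokens the model conditions on when predicting each scale. The sketch below is back-of-the-envelope arithmetic under stated assumptions: the scale schedule is hypothetical, and the history is assumed to occupy a single extra token, which the abstract does not specify.

```python
def context_sizes(scale_sides, markov=False, history_tokens=1):
    """Context length (in tokens) when predicting each scale after the first.

    scale_sides: side lengths of the square token maps, coarse to fine.
    markov=False: condition on all previous scales (full context).
    markov=True: condition on the previous scale plus a fixed-size history
    summary, once older scales exist (assumed layout, for illustration).
    """
    tokens = [s * s for s in scale_sides]
    sizes = []
    for i in range(1, len(tokens)):
        if markov:
            older = history_tokens if i > 1 else 0
            sizes.append(tokens[i - 1] + older)
        else:
            sizes.append(sum(tokens[:i]))
    return sizes


sides = [1, 2, 4, 8, 16]                      # hypothetical scale schedule
full = context_sizes(sides)                   # [1, 5, 21, 85]
markov = context_sizes(sides, markov=True)    # [1, 5, 17, 65]
```

The gap widens as more scales are added, since the full context accumulates every earlier scale while the Markov-style context tracks only the most recent one.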

[CV-15] UniGeoSeg: Towards Unified Open-World Segmentation for Geospatial Scenes

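[Quick read]: This paper addresses problems in instruction-driven segmentation for remote sensing, where fragmented task formulations and scarce instruction data limit models' understanding and generalization. The key to the solution is GeoSeg-1M, the first million-scale dataset for remote sensing instruction-driven segmentation, built with an automatic mask filtering and instruction generation pipeline that synthesizes referring, interactive, and reasoning segmentation instructions from multiple public datasets; it contains 590K images, 117 categories, and 1.1M image-mask-instruction triplets. The paper also proposes UniGeoSeg, a unified framework that combines task-aware text enhancement, latent knowledge memory, and a progressive training strategy to facilitate multi-task learning and achieve strong zero-shot generalization.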

Link: https://arxiv.org/abs/2511.23332
Authors: Shuo Ni,Di Wang,He Chen,Haonan Guo,Ning Zhang,Jing Zhang
Affiliations: Beijing Institute of Technology; Wuhan University; Zhongguancun Academy; Hong Kong Polytechnic University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Datasets and source code were released at this https URL

Abstract:Instruction-driven segmentation in remote sensing generates masks from guidance, offering great potential for accessible and generalizable applications. However, existing methods suffer from fragmented task formulations and limited instruction data, hindering effective understanding and generalization. To address these issues, we introduce GeoSeg-1M, the first million-scale dataset for remote sensing instruction-driven segmentation, constructed via an automatic mask filtering and instruction generation pipeline that synthesizes referring, interactive, and reasoning segmentation instructions from multiple public datasets. GeoSeg-1M contains 590K images, 117 categories, and 1.1M image-mask-instruction triplets. Building upon this foundation, we further curate GeoSeg-Bench, a challenging benchmark designed to evaluate contextual understanding and reasoning capabilities across diverse instruction-driven tasks and complex geospatial scenes. Furthermore, we present UniGeoSeg, a unified framework that serves as a strong baseline, incorporating task-aware text enhancement, latent knowledge memory, and a progressive training strategy to facilitate multi-task learning. Extensive experiments demonstrate the state-of-the-art performance of UniGeoSeg across GeoSeg-Bench and diverse public benchmarks, while exhibiting strong zero-shot generalization. Datasets and source code were released at this https URL.

[CV-16] A Perceptually Inspired Variational Framework for Color Enhancement

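[Quick read]: This paper addresses the difficulty of characterizing how existing explicit color correction algorithms behave with respect to significant image features such as contrast and dispersion. Its core solution is a variational framework for color contrast enhancement inspired by the basic phenomenology of human color perception: it states a set of basic requirements that an energy functional must fulfil to count as “perceptually inspired”, identifies an explicit class of functionals satisfying all of them, and singles out three functionals of basic interest for analysis and comparison with existing models. The key idea is to cast perceptual phenomena as an energy minimization problem whose minima are computed by gradient descent; the paper also gives a general methodology that reduces the computational cost of the algorithms from $\mathcal{O}(N^2)$ to $\mathcal{O}(N\log N)$, where $N$ is the number of input pixels.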

Link: https://arxiv.org/abs/2511.23329
Authors: Rodrigo Palma-Amestoy,Edoardo Provenzi,Marcelo Bertalmío,Vicent Caselles
Affiliations: Universidad de Chile; Università di Milano; Universitat Pompeu Fabra
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Basic phenomenology of human color vision has been widely taken as an inspiration to devise explicit color correction algorithms. The behavior of these models in terms of significant image features (such as contrast and dispersion) can be difficult to characterize. To cope with this, we propose to use a variational formulation of color contrast enhancement that is inspired by the basic phenomenology of color perception. In particular, we devise a set of basic requirements to be fulfilled by an energy to be considered as “perceptually inspired”, showing that there is an explicit class of functionals satisfying all of them. We single out three explicit functionals that we consider of basic interest, showing similarities and differences with existing models. The minima of such functionals are computed using a gradient descent approach. We also present a general methodology to reduce the computational cost of the algorithms under analysis from $\mathcal{O}(N^2)$ to $\mathcal{O}(N\log N)$, where $N$ is the number of input pixels.

[CV-17] FACT-GS: Frequency-Aligned Complexity-Aware Texture Reparameterization for 2D Gaussian Splatting

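[Quick read]: This paper addresses inefficient sampling in texture-based Gaussian Splatting when local visual complexity varies: conventional methods use a uniform per-Gaussian sampling grid, so high-frequency regions are under-sampled and smooth regions waste capacity, causing blurred appearance and loss of detail. The key to the solution is FACT-GS, which reformulates texture parameterization as a differentiable sampling-density allocation problem and introduces a frequency-aware allocation strategy implemented via a deformation field, so that texture sampling density adapts to local frequency; the method performs non-uniform sampling on fixed-resolution texture grids, recovering sharper high-frequency details under the same parameter budget while preserving real-time rendering.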

Link: https://arxiv.org/abs/2511.23292
Authors: Tianhao Xie,Linlian Jiang,Xinxin Zuo,Yang Wang,Tiberiu Popa
Affiliations: Concordia University; Mila
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: 11 pages, 6 figures, preprint

Abstract:Realistic scene appearance modeling has advanced rapidly with Gaussian Splatting, which enables real-time, high-quality rendering. Recent advances introduced per-primitive textures that incorporate spatial color variations within each Gaussian, improving their expressiveness. However, texture-based Gaussians parameterize appearance with a uniform per-Gaussian sampling grid, allocating equal sampling density regardless of local visual complexity. This leads to inefficient texture space utilization, where high-frequency regions are under-sampled and smooth regions waste capacity, causing blurred appearance and loss of fine structural detail. We introduce FACT-GS, a Frequency-Aligned Complexity-aware Texture Gaussian Splatting framework that allocates texture sampling density according to local visual frequency. Grounded in adaptive sampling theory, FACT-GS reformulates texture parameterization as a differentiable sampling-density allocation problem, replacing the uniform textures with a learnable frequency-aware allocation strategy implemented via a deformation field whose Jacobian modulates local sampling density. Built on 2D Gaussian Splatting, FACT-GS performs non-uniform sampling on fixed-resolution texture grids, preserving real-time performance while recovering sharper high-frequency details under the same parameter budget.
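The idea of spending a fixed sample budget where the signal is most complex can be shown in one dimension. The sketch below is an analogy, not the paper's method: it uses inverse-CDF placement with hand-set complexity weights, where the slope of the cumulative warp plays the role that the Jacobian of the deformation field plays in the abstract. All names and numbers are hypothetical.

```python
def allocate_samples(complexity, n_samples):
    """Place n_samples in [0, 1) so that local density follows `complexity`.

    complexity: strictly positive weights, one per equal-width cell.
    The normalized cumulative sum is a monotone 1D warp; sample positions
    are the inverse warp of a uniform grid.
    """
    total = sum(complexity)
    n_cells = len(complexity)
    cdf = [0.0]
    for w in complexity:
        cdf.append(cdf[-1] + w / total)
    positions = []
    cell = 0
    for i in range(n_samples):
        target = (i + 0.5) / n_samples            # uniform grid in warped space
        while cell < n_cells - 1 and cdf[cell + 1] < target:
            cell += 1
        frac = (target - cdf[cell]) / (cdf[cell + 1] - cdf[cell])
        positions.append((cell + frac) / n_cells)
    return positions


def count_per_cell(positions, n_cells):
    """Histogram of sample positions over equal-width cells."""
    counts = [0] * n_cells
    for p in positions:
        counts[int(p * n_cells)] += 1
    return counts


weights = [1.0, 1.0, 5.0, 1.0]         # third cell is the "high-frequency" one
samples = allocate_samples(weights, 8)
counts = count_per_cell(samples, 4)    # [1, 1, 5, 1]
```

The total number of samples is unchanged; only their placement shifts toward the complex region, which mirrors the "same parameter budget" claim.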

[CV-18] Machine Learning for Scientific Visualization: Ensemble Data Analysis

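[Quick read]: This work addresses the difficulty of extracting meaningful insights from the vast spatio-temporal data produced by scientific simulations and experimental measurements, given high dimensionality, complex structures, and missing information. Traditional analysis methods struggle here, motivating more robust, data-driven approaches. The key contributions are: (1) autoencoder-based dimensionality reduction, which evaluates the stability of projection metrics under partial labeling and introduces a Pareto-efficient selection strategy to obtain expressive and reliable low-dimensional embeddings; (2) FLINT, a model for high-quality flow estimation and temporal interpolation in both flow-supervised and flow-unsupervised settings, which reconstructs missing velocity fields and generates high-fidelity temporal interpolants of scalar fields without domain-specific assumptions; (3) HyperFLINT, a hypernetwork-based approach conditioned on simulation parameters, which improves adaptability and generalization across scientific domains and yields more accurate reconstructions even with sparse or incomplete data.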

Link: https://arxiv.org/abs/2511.23290
Authors: Hamid Gadirov
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: PhD thesis, University of Groningen, 2025

Abstract:Scientific simulations and experimental measurements produce vast amounts of spatio-temporal data, yet extracting meaningful insights remains challenging due to high dimensionality, complex structures, and missing information. Traditional analysis methods often struggle with these issues, motivating the need for more robust, data-driven approaches. This dissertation explores deep learning methodologies to improve the analysis and visualization of spatio-temporal scientific ensembles, focusing on dimensionality reduction, flow estimation, and temporal interpolation. First, we address high-dimensional data representation through autoencoder-based dimensionality reduction for scientific ensembles. We evaluate the stability of projection metrics under partial labeling and introduce a Pareto-efficient selection strategy to identify optimal autoencoder variants, ensuring expressive and reliable low-dimensional embeddings. Next, we present FLINT, a deep learning model for high-quality flow estimation and temporal interpolation in both flow-supervised and flow-unsupervised settings. FLINT reconstructs missing velocity fields and generates high-fidelity temporal interpolants for scalar fields across 2D+time and 3D+time ensembles without domain-specific assumptions or extensive finetuning. To further improve adaptability and generalization, we introduce HyperFLINT, a hypernetwork-based approach that conditions on simulation parameters to estimate flow fields and interpolate scalar data. This parameter-aware adaptation yields more accurate reconstructions across diverse scientific domains, even with sparse or incomplete data. Overall, this dissertation advances deep learning techniques for scientific visualization, providing scalable, adaptable, and high-quality solutions for interpreting complex spatio-temporal ensembles.
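A Pareto-efficient selection keeps every candidate that no other candidate beats on all metrics at once. The sketch below shows the generic dominance test; the variant names, the two metrics, and their values are made up for illustration, and the thesis may use different metrics or tie-breaking.

```python
def pareto_front(candidates):
    """Return names of candidates not dominated by any other.

    candidates: dict mapping a name to a tuple of scores, higher is better.
    A candidate is dominated if another is at least as good on every score
    and strictly better on at least one.
    """
    front = []
    for name, scores in candidates.items():
        dominated = any(
            all(o >= s for o, s in zip(other, scores))
            and any(o > s for o, s in zip(other, scores))
            for other_name, other in candidates.items()
            if other_name != name
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical autoencoder variants scored on two projection metrics.
variants = {
    "ae_small": (0.80, 0.60),
    "ae_wide": (0.85, 0.55),
    "ae_deep": (0.78, 0.58),    # worse than ae_small on both metrics
    "ae_sparse": (0.70, 0.75),
}
selected = pareto_front(variants)   # ["ae_small", "ae_wide", "ae_sparse"]
```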

[CV-19] Simultaneous Image Quality Improvement and Artefacts Correction in Accelerated MRI

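[Quick read]: This paper addresses the image-quality loss caused by accelerated acquisition in magnetic resonance imaging (MRI) and the difficulty of correcting it together with noise and motion artefacts. Existing methods usually target either under-sampling or a single artefact type, and their performance is limited when several degradation factors occur at once. The key to the solution is USArt (Under-Sampling and Artifact correction model), a dual sub-model architecture that recovers under-sampled data while simultaneously correcting noise and motion artefacts, raising signal-to-noise ratio (SNR) and contrast and achieving up to 5x acceleration; among the under-sampling strategies explored, the gradient under-sampling strategy gave the best results.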

Link: https://arxiv.org/abs/2511.23274
Authors: Georgia Kanli,Daniele Perlo,Selma Boudissa,Radovan Jirik,Olivier Keunen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Medical Physics (physics.med-ph)
Comments:

Abstract:MR data are acquired in the frequency domain, known as k-space. Acquiring high-quality and high-resolution MR images can be time-consuming, posing a significant challenge when multiple sequences providing complementary contrast information are needed or when the patient is unable to remain in the scanner for an extended period of time. Reducing k-space measurements is a strategy to speed up acquisition, but often leads to reduced quality in reconstructed images. Additionally, in real-world MRI, both under-sampled and fully sampled images are prone to artefacts, and correcting these artefacts is crucial for maintaining diagnostic accuracy. Deep learning methods have been proposed to restore image quality from under-sampled data, while others have focused on correcting artefacts that result from noise or motion. However, no approach proposed so far addresses both acceleration and artefact correction, limiting the performance of these models when these degradation factors occur simultaneously. To address this gap, we present USArt (Under-Sampling and Artifact correction model), a method for recovering high-quality images from under-sampled data while simultaneously correcting noise and motion artefacts. Customized for 2D brain anatomical images acquired with Cartesian sampling, USArt employs a dual sub-model approach. The results demonstrate a remarkable increase in signal-to-noise ratio (SNR) and contrast in the restored images. Various under-sampling strategies and degradation levels were explored, with the gradient under-sampling strategy yielding the best outcomes. We achieved up to 5x acceleration with simultaneous artefact correction and without significant degradation, showcasing the model’s robustness in real-world settings.
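For Cartesian sampling, under-sampling means skipping some phase-encode lines of k-space, and the acceleration factor is the ratio of total lines to acquired lines. The sketch below builds one generic mask (regular skipping plus a fully sampled centre block) to show the bookkeeping. It is an assumed pattern for illustration and is not the paper's "gradient" strategy, whose definition the abstract does not give.

```python
def cartesian_mask(n_lines, step, n_center):
    """Boolean mask over phase-encode lines.

    Keeps every `step`-th line plus a fully sampled block of `n_center`
    lines around the k-space centre, where most image contrast lives.
    """
    centre_start = (n_lines - n_center) // 2
    mask = []
    for i in range(n_lines):
        in_centre = centre_start <= i < centre_start + n_center
        mask.append(in_centre or i % step == 0)
    return mask


def acceleration(mask):
    """Ratio of total lines to acquired lines."""
    return len(mask) / sum(mask)


mask = cartesian_mask(n_lines=256, step=6, n_center=12)
kept = sum(mask)               # 53 of 256 lines are acquired
factor = acceleration(mask)    # roughly 4.8x
```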

[CV-20] Learning to Predict Aboveground Biomass from RGB Images with 3D Synthetic Scenes

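[Quick read]: This paper addresses the estimation of aboveground biomass (AGB) in forests, where traditional methods rely on labor-intensive field measurements or on remote sensing that is limited in dense vegetation. The key to the solution is a learning-based method that frames AGB estimation as a dense prediction task and introduces AGB density maps, in which each pixel represents tree biomass normalized by the plot area and by the tree's image area. Using the synthetic 3D SPREAD dataset, which provides realistic forest scenes, tree attributes (height, trunk and canopy diameter), and instance segmentation masks, AGB is computed via allometric equations and a model is trained to predict density maps, which are integrated to obtain the AGB estimate for the whole image. According to the authors, this is the first method to estimate AGB directly from a single ground-based RGB image; it is scalable, interpretable, and low-cost, and opens the way to citizen-science participation in forest monitoring.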

Link: https://arxiv.org/abs/2511.23249
Authors: Silvia Zuffi
Affiliations: Institute for Applied Mathematics “Enrico Magenes”, CNR-IMATI, Milan, Italy
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Presented at STAG 2025

Abstract:Forests play a critical role in global ecosystems by supporting biodiversity and mitigating climate change via carbon sequestration. Accurate aboveground biomass (AGB) estimation is essential for assessing carbon storage and wildfire fuel loads, yet traditional methods rely on labor-intensive field measurements or remote sensing approaches with significant limitations in dense vegetation. In this work, we propose a novel learning-based method for estimating AGB from a single ground-based RGB image. We frame this as a dense prediction task, introducing AGB density maps, where each pixel represents tree biomass normalized by the plot area and each tree’s image area. We leverage the recently introduced synthetic 3D SPREAD dataset, which provides realistic forest scenes with per-image tree attributes (height, trunk and canopy diameter) and instance segmentation masks. Using these assets, we compute AGB via allometric equations and train a model to predict AGB density maps, integrating them to recover the AGB estimate for the captured scene. Our approach achieves a median AGB estimation error of 1.22 kg/m^2 on held-out SPREAD data and 1.94 kg/m^2 on a real-image dataset. To our knowledge, this is the first method to estimate aboveground biomass directly from a single RGB image, opening up the possibility for a scalable, interpretable, and cost-effective solution for forest monitoring, while also enabling broader participation through citizen science initiatives.
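The density-map construction can be checked with a few lines of arithmetic: if each tree's biomass is divided by the plot area and spread evenly over that tree's pixels, summing the map returns total biomass per unit area. The sketch below assumes exactly that reading of the abstract. The allometric form and its coefficients are placeholders, not the equations used in the paper.

```python
def allometric_agb(diameter_m, height_m, a=50.0, b=0.9):
    """Toy allometric equation AGB = a * (D^2 * H)^b, in kg.

    The functional form and coefficients are placeholders for illustration.
    """
    return a * (diameter_m ** 2 * height_m) ** b


def agb_density_map(instance_mask, tree_agb, plot_area_m2):
    """Build a per-pixel density map from an instance mask.

    instance_mask: 2D list of tree ids, 0 for background.
    tree_agb: dict mapping tree id to biomass in kg.
    Each tree pixel holds AGB / (plot area * number of pixels of that tree).
    """
    pixel_counts = {}
    for row in instance_mask:
        for tree_id in row:
            if tree_id:
                pixel_counts[tree_id] = pixel_counts.get(tree_id, 0) + 1
    return [
        [tree_agb[t] / (plot_area_m2 * pixel_counts[t]) if t else 0.0 for t in row]
        for row in instance_mask
    ]


def integrate(density_map):
    """Sum the density map to recover AGB per unit area (kg/m^2)."""
    return sum(sum(row) for row in density_map)


mask = [[1, 1, 0],
        [1, 2, 2]]
biomass = {1: 300.0, 2: 100.0}        # kg per tree, made-up values
density = agb_density_map(mask, biomass, plot_area_m2=100.0)
total = integrate(density)            # (300 + 100) / 100 = 4.0 kg/m^2
```

In training, `tree_agb` would come from an allometric equation applied to each tree's attributes, and a network would regress `density` from the RGB image.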

[CV-21] Synthetic Industrial Object Detection: GenAI vs. Feature-Based Methods

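[Quick Read]: This paper addresses the high cost of data generation and annotation when deploying machine learning models in industrial and robotics settings, and in particular how to narrow the sim-to-real gap effectively. Its core approach is a systematic benchmark of domain randomization (DR) and domain adaptation (DA) techniques, including feature-based methods, generative AI (GenAI), and classical rendering, for producing contextualized synthetic data without manual annotation. The key finding is that when the seed synthetic data is sufficiently varied, simple feature-based methods such as brightness-based and perceptual hashing filtering outperform more complex GenAI methods in both accuracy and resource efficiency. Perceptual hashing reaches mAP50 scores of 98% and 67% on the industrial and robotics datasets, respectively, while GenAI adds significant time overhead without an apparent improvement in sim-to-real performance.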

Link: https://arxiv.org/abs/2511.23241
Authors: Jose Moises Araya-Martinez, Adrián Sanchis Reig, Gautham Mohan, Sarvenaz Sardari, Jens Lambrecht, Jörg Krüger
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Reducing the burden of data generation and annotation remains a major challenge for the cost-effective deployment of machine learning in industrial and robotics settings. While synthetic rendering is a promising solution, bridging the sim-to-real gap often requires expert intervention. In this work, we benchmark a range of domain randomization (DR) and domain adaptation (DA) techniques, including feature-based methods, generative AI (GenAI), and classical rendering approaches, for creating contextualized synthetic data without manual annotation. Our evaluation focuses on the effectiveness and efficiency of low-level and high-level feature alignment, as well as a controlled diffusion-based DA method guided by prompts generated from real-world contexts. We validate our methods on two datasets: a proprietary industrial dataset (automotive and logistics) and a public robotics dataset. Results show that if render-based data with enough variability is available as seed, simpler feature-based methods, such as brightness-based and perceptual hashing filtering, outperform more complex GenAI-based approaches in both accuracy and resource efficiency. Perceptual hashing consistently achieves the highest performance, with mAP50 scores of 98% and 67% on the industrial and robotics datasets, respectively. Additionally, GenAI methods present significant time overhead for data generation at no apparent improvement of sim-to-real mAP values compared to simpler methods. Our findings offer actionable insights for efficiently bridging the sim-to-real gap, enabling high real-world performance from models trained exclusively on synthetic data.
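
To make the perceptual hashing filter concrete, here is a rough sketch. The abstract does not say which hash or which selection rule is used, so everything below is an assumption: it uses the simple average hash (one common perceptual hash), expects images already converted to small grayscale grids of equal size, and keeps a synthetic image when its hash is within `max_distance` bits of at least one real image. All function names are hypothetical.

```python
def average_hash(gray):
    """Average hash: one bit per pixel, set where the pixel exceeds the mean."""
    flat = [p for row in gray for p in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if p > mean else 0 for p in flat)


def hamming(a, b):
    """Number of differing bits between two equal-length hashes."""
    return sum(x != y for x, y in zip(a, b))


def filter_synthetic(synthetic, real, max_distance):
    """Keep synthetic images whose hash is close to some real image.

    `real` must be non-empty. The keep-if-close rule is an assumption
    made for this sketch, not a detail taken from the paper.
    """
    real_hashes = [average_hash(img) for img in real]
    kept = []
    for img in synthetic:
        h = average_hash(img)
        if min(hamming(h, rh) for rh in real_hashes) <= max_distance:
            kept.append(img)
    return kept
```

A filter like this needs no training and no GPU, which is consistent with the resource-efficiency argument in the abstract.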

[CV-22] Unlocking Multilingual Reasoning Capability of LLMs and LVLMs through Representation Engineering

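[Quick Read]: This paper addresses the fact that Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) reason markedly worse in low-resource languages than in English, which raises fairness concerns in multilingual applications. Existing methods rely on costly multilingual training or on prompting with external translation tools, both of which are resource-intensive and sensitive to translation quality. The paper proposes a training-free, inference-time method, Multilingual Reasoning via Representation Engineering (MRRE). Its core is the injection of two kinds of precomputed vectors at specific layers during inference: cross-lingual reasoning enhancement vectors, which steer non-English reasoning representations toward the English space to unlock multilingual reasoning, and target-language output anchoring vectors, which restore the target-language distribution so that input and output languages stay consistent. The method needs no additional training data or translation tools and improves non-English reasoning on several mainstream LLMs and LVLMs by 5.48% on average, and by up to 7.54% in low-resource languages (Thai and Swahili).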

Link: https://arxiv.org/abs/2511.23231
Authors: Qiming Li, Xiaocheng Feng, Yixuan Ma, Zekai Ye, Ruihan Chen, Xiachong Feng, Bing Qin
Affiliations: Harbin Institute of Technology; Peng Cheng Laboratory; The University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) demonstrate strong reasoning capabilities, yet their performance in English significantly outperforms that in low-resource languages, raising fairness concerns in multilingual applications. Existing approaches either rely on costly multilingual training or employ prompting with external translation tools, both of which are resource-intensive and sensitive to translation quality. To address these limitations, we propose a training-free inference-time method to enhance Multilingual Reasoning capabilities via Representation Engineering (MRRE) without using any additional training data or tools. MRRE sequentially injects two precomputed vectors at specific layers during inference processing: cross-lingual reasoning enhancement vectors, which steer non-English reasoning representations toward English space to unlock multilingual reasoning, and target-language output anchoring vectors, which restore the distribution of the target language to preserve input-output language consistency. Comprehensive experiments across six advanced LLMs and LVLMs on four reasoning benchmarks demonstrate that MRRE consistently enhances non-English reasoning by an average gain of 5.48% and up to 7.54% in low-resource languages (Thai and Swahili), while improving input-output language consistency by 3.78%.
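
The injection step described above can be illustrated with a toy forward pass. This is only a sketch of the general steering-vector idea: the abstract does not give the layer indices, the scaling, or how the vectors are computed, so the `injections` mapping, the scale factors, and the function names are all hypothetical, and plain Python lists stand in for hidden states.

```python
def add_scaled(hidden, vector, scale):
    """Shift a hidden state along a precomputed direction."""
    return [h + scale * v for h, v in zip(hidden, vector)]


def forward_with_injection(hidden, layers, injections):
    """Run toy layers in order, adding a vector after selected layers.

    layers: list of callables mapping a hidden state to a hidden state.
    injections: {layer_index: (vector, scale)}; here an earlier entry
    would play the role of the reasoning enhancement vector and a later
    one the output anchoring vector (an assumption of this sketch).
    """
    for i, layer in enumerate(layers):
        hidden = layer(hidden)
        if i in injections:
            vector, scale = injections[i]
            hidden = add_scaled(hidden, vector, scale)
    return hidden
```

Because the vectors are precomputed and only added at inference time, no model weights change, which is what makes the approach training-free.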

[CV-23] Language-guided 3D scene synthesis for fine-grained functionality understanding

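[Quick Read]: This paper addresses the scarcity of real-world data that limits functionality understanding in 3D. When a model must identify the specific functional element needed to complete an action (for example, "Open the second drawer of the cabinet near the bed"), the high cost of data collection and annotation makes training difficult. The key to the solution is SynthFun3D, a task-based 3D scene synthesis method. Given an action description, it builds an indoor environment from a furniture asset database with part-level annotations so that the action can be carried out, then reasons about the action to determine the 3D mask of the correct functional element. This enables inexpensive, large-scale generation of high-quality annotated data.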

Link: https://arxiv.org/abs/2511.23230
Authors: Jaime Corsetti, Francesco Giuliari, Davide Boscaini, Pedro Hermosilla, Andrea Pilzer, Guofeng Mei, Alexandros Delitzas, Francis Engelmann, Fabio Poiesi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical report. 24 pages, 19 figures, 2 tables

Abstract:Functionality understanding in 3D, which aims to identify the functional element in a 3D scene to complete an action (e.g., the correct handle to “Open the second drawer of the cabinet near the bed”), is hindered by the scarcity of real-world data due to the substantial effort needed for its collection and annotation. To address this, we introduce SynthFun3D, the first method for task-based 3D scene synthesis. Given the action description, SynthFun3D generates a 3D indoor environment using a furniture asset database with part-level annotation, ensuring the action can be accomplished. It reasons about the action to automatically identify and retrieve the 3D mask of the correct functional element, enabling the inexpensive and large-scale generation of high-quality annotated data. We validate SynthFun3D through user studies, which demonstrate improved scene-prompt coherence compared to other approaches. Our quantitative results further show that the generated data can either replace real data with minor performance loss or supplement real data for improved performance, thereby providing an inexpensive and scalable solution for data-hungry 3D applications. Project page: this http URL.

[CV-24] PointCNN++: Performant Convolution on Native Points

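[Quick Read]: This paper addresses the trade-off between geometric precision and computational performance in existing convolutional learning methods for 3D point clouds. Point-based methods preserve geometric precision but are less efficient, while voxel-based methods gain efficiency through quantization at the cost of geometric fidelity, which becomes a critical bottleneck for tasks such as point cloud registration. The core of the solution is PointCNN++, a new architectural design that generalizes sparse convolution from voxels to points and treats voxel-based convolution as a degraded special case of point-based convolution. The key contributions are: (1) a point-centric convolution whose receptive field is centered on the original high-precision point coordinates, preserving geometric detail; and (2) a computational strategy that operates natively on points, formulating the convolution as a Matrix-Vector Multiplication and Reduction (MVMR) problem served by a highly optimized GPU kernel. Experiments show that PointCNN++ uses an order of magnitude less memory and is several times faster than representative point-based methods. As a drop-in replacement for voxel-based backbones, it significantly improves point cloud registration accuracy while being faster and more memory-efficient.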

Link: https://arxiv.org/abs/2511.23227
Authors: Lihan Li(1), Haofeng Zhong(1), Rui Bu(2), Mingchao Sun(3), Wenzheng Chen(1), Baoquan Chen(1), Yangyan Li(2) ((1) Peking University, (2) Ant Group, (3) AMAP)
Affiliations: Peking University; Ant Group; AMAP
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Existing convolutional learning methods for 3D point cloud data are divided into two paradigms: point-based methods that preserve geometric precision but often face performance challenges, and voxel-based methods that achieve high efficiency through quantization at the cost of geometric fidelity. This loss of precision is a critical bottleneck for tasks such as point cloud registration. We propose PointCNN++, a novel architectural design that fundamentally mitigates this precision-performance trade-off. It generalizes sparse convolution from voxels to points, treating voxel-based convolution as a specialized, degraded case of our more general point-based convolution. First, we introduce a point-centric convolution where the receptive field is centered on the original, high-precision point coordinates. Second, to make this high-fidelity operation performant, we design a computational strategy that operates natively on points. We formulate the convolution on native points as a Matrix-Vector Multiplication and Reduction (MVMR) problem, for which we develop a dedicated, highly-optimized GPU kernel. Experiments demonstrate that PointCNN++ uses an order of magnitude less memory and is several times faster than representative point-based methods. Furthermore, when used as a simple replacement for the voxel-based backbones it generalizes, it significantly improves point cloud registration accuracies while proving both more memory-efficient and faster. PointCNN++ shows that preserving geometric detail and achieving high performance are not mutually exclusive, paving the way for a new class of 3D learning with high fidelity and efficiency. Our code will be open sourced.

[CV-25] DAONet-YOLOv8: An Occlusion-Aware Dual-Attention Network for Tea Leaf Pest and Disease Detection

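[Quick Read]: This paper addresses the detection of tea leaf pests and diseases in plantations, where complex backgrounds, variable illumination, and occlusion by dense branches and leaves cause existing detectors to miss targets or raise false positives. The proposed solution is the DAONet-YOLOv8 architecture, with three key innovations: (1) a Dual-Attention Fusion Module (DAFM) that combines convolutional local feature extraction with self-attention global context modeling to sharpen the perception of subtle lesion regions and suppress background noise; (2) an occlusion-aware detection head (Detect-OAHead) that learns the relationship between visible and occluded parts to compensate for missing lesion features; and (3) a C2f-DSConv module that uses dynamic synthesis convolutions with multiple kernel shapes to better capture irregular lesion boundaries. Experiments on a real-world tea plantation dataset show that the method outperforms the YOLOv8n baseline while using fewer parameters.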

Link: https://arxiv.org/abs/2511.23222
Authors: Yefeng Wu, Shan Wan, Ling Wu, Yecheng Zhao
Affiliations: Anhui University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Accurate detection of tea leaf pests and diseases in real plantations remains challenging due to complex backgrounds, variable illumination, and frequent occlusions among dense branches and leaves. Existing detectors often suffer from missed detections and false positives in such scenarios. To address these issues, we propose DAONet-YOLOv8, an enhanced YOLOv8 variant with three key improvements: (1) a Dual-Attention Fusion Module (DAFM) that combines convolutional local feature extraction with self-attention based global context modeling to focus on subtle lesion regions while suppressing background noise; (2) an occlusion-aware detection head (Detect-OAHead) that learns the relationship between visible and occluded parts to compensate for missing lesion features; and (3) a C2f-DSConv module employing dynamic synthesis convolutions with multiple kernel shapes to better capture irregular lesion boundaries. Experiments on our real-world tea plantation dataset containing six pest and disease categories demonstrate that DAONet-YOLOv8 achieves 92.97% precision, 92.80% recall, 97.10% mAP@50 and 76.90% mAP@50:95, outperforming the YOLOv8n baseline by 2.34, 4.68, 1.40 and 1.80 percentage points respectively, while reducing parameters by 16.7%. Comparative experiments further confirm that DAONet-YOLOv8 achieves superior performance over mainstream detection models.

[CV-26] Robust 3DGS-based SLAM via Adaptive Kernel Smoothing

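[Quick Read]: This paper addresses a key issue in 3D Gaussian Splatting SLAM (3DGS-SLAM). Conventional methods treat rendering quality as the main determinant of tracking accuracy, but the authors argue that it is more important to make the rasterization process robust to parameter errors so that camera pose tracking stays stable. The key to the solution is a new strategy called Corrective Blurry KNN (CB-KNN), which introduces a smooth kernel mechanism so that each Gaussian influences a wider, smoother distribution of pixels, reducing the effect of outlier Gaussian parameters on the overall image. Without changing the structure of existing 3DGS frameworks, the method adaptively adjusts the RGB values and locations of the K-nearest neighboring Gaussians within a local region to produce a smoother local rendering. This acts as a regularizer for pose optimization and significantly improves tracking robustness and accuracy while maintaining scene reconstruction quality.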

Link: https://arxiv.org/abs/2511.23221
Authors: Shouhe Zhang, Dayong Ren, Sensen Song, Wenjie Li, Piaopiao Yu, Yurong Qian
Affiliations: Xinjiang University; Nanjing University; Southwest University of Political Science and Law; Nanjing University of Aeronautics and Astronautics
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In this paper, we challenge the conventional notion in 3DGS-SLAM that rendering quality is the primary determinant of tracking accuracy. We argue that, compared to solely pursuing a perfect scene representation, it is more critical to enhance the robustness of the rasterization process against parameter errors to ensure stable camera pose tracking. To address this challenge, we propose a novel approach that leverages a smooth kernel strategy to enhance the robustness of 3DGS-based SLAM. Unlike conventional methods that focus solely on minimizing rendering error, our core insight is to make the rasterization process more resilient to imperfections in the 3DGS parameters. We hypothesize that by allowing each Gaussian to influence a smoother, wider distribution of pixels during rendering, we can mitigate the detrimental effects of parameter noise from outlier Gaussians. This approach intentionally introduces a controlled blur to the rendered image, which acts as a regularization term, stabilizing the subsequent pose optimization. While a complete redesign of the rasterization pipeline is an ideal solution, we propose a practical and effective alternative that is readily integrated into existing 3DGS frameworks. Our method, termed Corrective Blurry KNN (CB-KNN), adaptively modifies the RGB values and locations of the K-nearest neighboring Gaussians within a local region. This dynamic adjustment generates a smoother local rendering, reducing the impact of erroneous GS parameters on the overall image. Experimental results demonstrate that our approach, while maintaining the overall quality of the scene reconstruction (mapping), significantly improves the robustness and accuracy of camera pose tracking.

[CV-27] Instruction Tuning of Large Language Models for Tabular Data Generation-in One Day ICML2025

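[Quick Read]: This paper addresses the limited ability of current Large Language Models (LLMs) to generate tabular data, and in particular how to improve that ability with limited data and compute. Existing work focuses mainly on question answering and reasoning over tables and largely overlooks table generation. The authors propose a lightweight instruction-tuning recipe built on a high-quality instruction dataset. They first construct a high-quality instruction dataset for tabular data to strengthen the model's understanding of table structure, then fine-tune the open-source Llama3.1-8B-Instruct on only 7K instructions, training for less than 6 hours on an A100 GPU, and reach tabular data generation performance on par with the leading commercial model GPT-4o. The key is careful data construction combined with efficient fine-tuning, which substantially improves table generation in resource-constrained settings.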

Link: https://arxiv.org/abs/2511.23220
Authors: Milad Abdollahzadeh, Abdul Raheem, Zilong Zhao, Uzair Javaid, Kevin Yee, Nalam Venkata Abhishek, Tram Truong-Huu, Biplab Sikdar
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the International Conference on Machine Learning (ICML 2025), 1st Workshop on Foundation Models for Structured Data

Abstract:Tabular instruction tuning has emerged as a promising research direction for improving LLMs' understanding of tabular data. However, the majority of existing works only consider question-answering and reasoning tasks over tabular data, leaving tabular data generation largely unnoticed. In this work, for the first time, we explore the efficacy of instruction tuning in improving LLMs' tabular data generation capabilities. More specifically, given the high data and computation requirements of tabular instruction tuning, we aim to address the possibility of instruction tuning for tabular data generation with limited data and computational resources. To achieve this, we first create a high-quality instruction dataset for tabular data, enabling efficient LLM comprehension. We then instruction-tune an open-source LLM (Llama3.1-8B-Instruct) on the training set of this dataset to improve its tabular data generation performance. Our experimental results show that by using our high-quality dataset and instruction-tuning on only 7K instructions with an A100 GPU, for less than 6 hours, we achieve tabular data generation performance on par with the most capable commercial LLM, GPT-4o.

[CV-28] Zero-Shot Multi-Criteria Visual Quality Inspection for Semi-Controlled Industrial Environments via Real-Time 3D Digital Twin Simulation

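[Quick Read]: This paper addresses early-stage visual quality inspection in modern industrial environments, especially semi-controlled settings where the complexity and heavy data requirements of conventional inspection systems hinder adoption. The core solution is a pose-agnostic, zero-shot quality inspection framework that compares real scenes against a real-time Digital Twin (DT) in RGB-D space. The key idea is to describe industrial scenes semantically through object detection and pose estimation of known Computer-Aided Design (CAD) models, which enables efficient real-time DT rendering. The paper also introduces an extensible, hierarchical annotation strategy that unifies pose labels with logical and structural defect annotations for multi-criteria defect detection. An automotive use case, the inspection of an axial flux motor, validates the approach, which reaches intersection-over-union (IoU) scores of up to 63.3% even when using only simple distance measurements.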

Link: https://arxiv.org/abs/2511.23214
Authors: Jose Moises Araya-Martinez, Gautham Mohan, Kenichi Hayakawa Bolaños, Roberto Mendieta, Sarvenaz Sardari, Jens Lambrecht, Jörg Krüger
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Early-stage visual quality inspection is vital for achieving Zero-Defect Manufacturing and minimizing production waste in modern industrial environments. However, the complexity of robust visual inspection systems and their extensive data requirements hinder widespread adoption in semi-controlled industrial settings. In this context, we propose a pose-agnostic, zero-shot quality inspection framework that compares real scenes against real-time Digital Twins (DT) in the RGB-D space. Our approach enables efficient real-time DT rendering by semantically describing industrial scenes through object detection and pose estimation of known Computer-Aided Design models. We benchmark tools for real-time, multimodal RGB-D DT creation while tracking consumption of computational resources. Additionally, we provide an extensible and hierarchical annotation strategy for multi-criteria defect detection, unifying pose labelling with logical and structural defect annotations. Based on an automotive use case featuring the quality inspection of an axial flux motor, we demonstrate the effectiveness of our framework. Our results demonstrate detection performance, achieving intersection-over-union (IoU) scores of up to 63.3% compared to ground-truth masks, even if using simple distance measurements under semi-controlled industrial conditions. Our findings lay the groundwork for future research on generalizable, low-data defect detection methods in dynamic manufacturing settings.
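
As a rough illustration of what a simple distance measurement scored by IoU could look like, the sketch below thresholds the per-pixel depth difference between a real depth image and a rendered digital-twin depth image, then scores the resulting mask against a ground-truth mask. The thresholding rule, the function names, and the nested-list image format are assumptions for this example; the abstract does not specify how the distance is computed.

```python
def depth_difference_mask(real_depth, twin_depth, threshold):
    """Flag pixels where real and digital-twin depth disagree by more than threshold."""
    return [
        [abs(r - t) > threshold for r, t in zip(real_row, twin_row)]
        for real_row, twin_row in zip(real_depth, twin_depth)
    ]


def mask_iou(pred, truth):
    """Intersection-over-union of two binary masks of equal shape."""
    inter = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Convention chosen for this sketch: two empty masks count as a perfect match.
    return inter / union if union else 1.0
```

No training data is involved at any point, which is the sense in which such a comparison is zero-shot.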

[CV-29] Pathryoshka: Compressing Pathology Foundation Models via Multi-Teacher Knowledge Distillation with Nested Embeddings

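[Quick Read]: This paper addresses the difficulty of deploying pathology foundation models (FMs) under tight compute budgets, given their large parameter counts and high-dimensional embeddings. The key to the solution is Pathryoshka, a multi-teacher distillation framework inspired by RADIO distillation and Matryoshka Representation Learning. It reduces model size by 86-92% while performing on par with its much larger teachers and supports adaptable embedding dimensions. This enables efficient local deployment without sacrificing accuracy or representational richness and broadens access to pathology FMs in research and clinical practice.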

Link: https://arxiv.org/abs/2511.23204
Authors: Christian Grashei, Christian Brechenmacher, Rao Muhammad Umer, Jingsong Liu, Carsten Marr, Ewa Szczurek, Peter J. Schüffler
Affiliations: Technical University of Munich; Helmholtz Munich; Munich Data Science Institute; Munich Center for Machine Learning
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Pathology foundation models (FMs) have driven significant progress in computational pathology. However, these high-performing models can easily exceed a billion parameters and produce high-dimensional embeddings, thus limiting their applicability for research or clinical use when computing resources are tight. Here, we introduce Pathryoshka, a multi-teacher distillation framework inspired by RADIO distillation and Matryoshka Representation Learning to reduce pathology FM sizes while allowing for adaptable embedding dimensions. We evaluate our framework with a distilled model on ten public pathology benchmarks with varying downstream tasks. Compared to its much larger teachers, Pathryoshka reduces the model size by 86-92% at on-par performance. It outperforms state-of-the-art single-teacher distillation models of comparable size by a median margin of 7.0 in accuracy. By enabling efficient local deployment without sacrificing accuracy or representational richness, Pathryoshka democratizes access to state-of-the-art pathology FMs for the broader research and clinical community.
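
The adaptable embedding dimensions mentioned above follow the general Matryoshka idea that a prefix of the embedding is itself a usable embedding. A minimal sketch of how a consumer might use such nested embeddings is shown below; the function name is hypothetical, and the re-normalization step is a common convention rather than something stated in the abstract.

```python
import math


def truncate_embedding(embedding, dim):
    """Keep the first `dim` coordinates and rescale to unit length."""
    if not 0 < dim <= len(embedding):
        raise ValueError("dim must be between 1 and len(embedding)")
    prefix = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        return prefix  # nothing to normalize
    return [x / norm for x in prefix]
```

In practice this lets one trade storage and compute for accuracy at query time by choosing `dim`, without re-running the encoder.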

[CV-30] Vision Bridge Transformer at Scale

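[Quick Read]: This paper addresses the inefficiency of traditional diffusion models in conditional generation, particularly the data-to-data translation involved in image and video translation tasks. The key to the solution is the Vision Bridge Transformer (ViBT), a large-scale instantiation of Brownian Bridge Models that directly models the trajectory between inputs and outputs instead of gradually reconstructing data from noise. It adopts a Transformer architecture and proposes a variance-stabilized velocity-matching objective to support large-scale training and improve robustness, enabling efficient, high-quality instruction-based image editing and complex video translation.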

Link: https://arxiv.org/abs/2511.23199
Authors: Zhenxiong Tan, Zeqing Wang, Xingyi Yang, Songhua Liu, Xinchao Wang
Affiliations: National University of Singapore; The Hong Kong Polytechnic University; Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:We introduce Vision Bridge Transformer (ViBT), a large-scale instantiation of Brownian Bridge Models designed for conditional generation. Unlike traditional diffusion models that transform noise into data, Bridge Models directly model the trajectory between inputs and outputs, creating an efficient data-to-data translation paradigm. By scaling these models to 20B and 1.3B parameters, we demonstrate their effectiveness for image and video translation tasks. To support this scale, we adopt a Transformer architecture and propose a variance-stabilized velocity-matching objective for robust training. Together, these advances highlight the power of scaling Bridge Models for instruction-based image editing and complex video translation.
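
To show what modeling the trajectory between inputs and outputs means, the sketch below samples an intermediate state of a standard Brownian bridge pinned at a source `x0` (t = 0) and a target `x1` (t = 1): the mean interpolates linearly and the variance is `sigma**2 * t * (1 - t)`. This is the textbook bridge only; the paper's exact parameterization and its variance-stabilized velocity-matching objective are not reproduced here, and the function name and `sigma` parameter are assumptions.

```python
import math
import random


def bridge_sample(x0, x1, t, sigma=1.0, rng=random):
    """Sample x_t from a Brownian bridge between x0 (t=0) and x1 (t=1)."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    # Noise vanishes at both endpoints and peaks at t = 0.5.
    std = sigma * math.sqrt(t * (1.0 - t))
    return [
        (1.0 - t) * a + t * b + std * rng.gauss(0.0, 1.0)
        for a, b in zip(x0, x1)
    ]
```

Unlike a noise-to-data diffusion path, both ends of this path are data, which is why the abstract calls the paradigm data-to-data translation.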

[CV-31] GeoWorld: Unlocking the Potential of Geometry Models to Facilitate High-Fidelity 3D Scene Generation

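[Quick Read]: This paper addresses the geometric distortions and blurry content common in image-to-3D scene generation. Its core solution is to rework the generation pipeline with the GeoWorld framework. The key is to first generate consecutive video frames and then use a geometry model to extract full-frame geometry features, which carry richer geometric information than the single-frame depth maps or camera embeddings used in earlier methods and serve as geometric conditions for the video generation model. To improve the consistency of geometric structures, the paper further introduces a geometry alignment loss that imposes real-world geometric constraints and a geometry adaptation module that ensures the geometry features are used effectively.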

Link: https://arxiv.org/abs/2511.23191
Authors: Yuhao Wan, Lijuan Liu, Jingzhi Zhou, Zihan Zhou, Xuying Zhang, Dongbo Zhang, Shaohui Jiao, Qibin Hou, Ming-Ming Cheng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Previous works leveraging video models for image-to-3D scene generation tend to suffer from geometric distortions and blurry content. In this paper, we renovate the pipeline of image-to-3D scene generation by unlocking the potential of geometry models and present our GeoWorld. Instead of exploiting geometric information obtained from a single-frame input, we propose to first generate consecutive video frames and then take advantage of the geometry model to provide full-frame geometry features, which contain richer information than single-frame depth maps or camera embeddings used in previous methods, and use these geometry features as geometrical conditions to aid the video generation model. To enhance the consistency of geometric structures, we further propose a geometry alignment loss to provide the model with real-world geometric constraints and a geometry adaptation module to ensure the effective utilization of geometry features. Extensive experiments show that our GeoWorld can generate high-fidelity 3D scenes from a single image and a given camera trajectory, outperforming prior methods both qualitatively and quantitatively. Project Page: this https URL.

[CV-32] Obstruction reasoning for robotic grasping

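[Quick Read]: This paper addresses obstruction reasoning and accessibility planning for robotic grasping in cluttered scenes, where a model must not only visually ground the target object but also identify the obstructions that have to be cleared and reason about the order in which to clear them. The key to the solution is the UNOGrasp model. Its core innovation is a multi-step reasoning process built on obstruction paths that originate from the target object, with each step anchored by obstruction-aware visual cues to strengthen the model's understanding of obstruction relationships. It combines supervised and reinforcement fine-tuning with verifiable reasoning rewards and significantly improves obstruction reasoning accuracy and grasp success in both synthetic and real-world environments.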

Link: https://arxiv.org/abs/2511.23186
Authors: Runyu Jiao, Matteo Bortolon, Francesco Giuliari, Alice Fasoli, Sergio Povoli, Guofeng Mei, Yiming Wang, Fabio Poiesi
Affiliations: Fondazione Bruno Kessler; University of Trento
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Successful robotic grasping in cluttered environments not only requires a model to visually ground a target object but also to reason about obstructions that must be cleared beforehand. While current vision-language embodied reasoning models show emergent spatial understanding, they remain limited in terms of obstruction reasoning and accessibility planning. To bridge this gap, we present UNOGrasp, a learning-based vision-language model capable of performing visually-grounded obstruction reasoning to infer the sequence of actions needed to unobstruct the path and grasp the target object. We devise a novel multi-step reasoning process based on obstruction paths originated by the target object. We anchor each reasoning step with obstruction-aware visual cues to incentivize reasoning capability. UNOGrasp combines supervised and reinforcement finetuning through verifiable reasoning rewards. Moreover, we construct UNOBench, a large-scale dataset for both training and benchmarking, based on MetaGraspNetV2, with over 100k obstruction paths annotated by humans with obstruction ratios, contact points, and natural-language instructions. Extensive experiments and real-robot evaluations show that UNOGrasp significantly improves obstruction reasoning and grasp success across both synthetic and real-world environments, outperforming generalist and proprietary alternatives. Project website: this https URL.

[CV-33] Fast Multi-view Consistent 3D Editing with Video Priors

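[Quick Read]: This paper addresses the lack of multi-view consistency in text-driven 3D editing. Existing methods typically apply 2D generation or editing models to each view separately and then perform iterative 2D-3D-2D updates, which is slow and tends to produce over-smoothed results because the editing signals from different views are averaged during iteration. The key to the solution is to exploit the temporal consistency priors of pre-trained video generation models. The proposed generative Video Prior based 3D Editing (ViP3DE) conditions the video generation model on a single edited view and directly generates the other consistent edited views for the 3D update, bypassing the iterative paradigm. It also introduces motion-preserved noise blending so that the generated views match the specified camera poses, and geometry-aware denoising that integrates 3D geometric priors into the video model to further improve multi-view consistency.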

Link: https://arxiv.org/abs/2511.23172
Authors: Liyi Chen, Ruihuang Li, Guowen Zhang, Pengfei Wang, Lei Zhang
Affiliations: The Hong Kong Polytechnic University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Text-driven 3D editing enables user-friendly 3D object or scene editing with text instructions. Due to the lack of multi-view consistency priors, existing methods typically resort to employing 2D generation or editing models to process each view individually, followed by iterative 2D-3D-2D updating. However, these methods are not only time-consuming but also prone to over-smoothed results because the different editing signals gathered from different views are averaged during the iterative process. In this paper, we propose generative Video Prior based 3D Editing (ViP3DE) to employ the temporal consistency priors from pre-trained video generation models for multi-view consistent 3D editing in a single forward pass. Our key insight is to condition the video generation model on a single edited view to generate other consistent edited views for 3D updating directly, thereby bypassing the iterative editing paradigm. Since 3D updating requires edited views to be paired with specific camera poses, we propose motion-preserved noise blending for the video model to generate edited views at predefined camera poses. In addition, we introduce geometry-aware denoising to further enhance multi-view consistency by integrating 3D geometric priors into video models. Extensive experiments demonstrate that our proposed ViP3DE can achieve high-quality 3D editing results even within a single forward pass, significantly outperforming existing methods in both editing quality and speed.

[CV-34] PowerCLIP: Powerset Alignment for Contrastive Pre-Training CVPR2026

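[Quick Read]: This paper addresses the limited ability of current contrastive vision-language pre-training frameworks such as CLIP to handle compositional semantics that span multiple image regions, especially fine-grained semantic relations involving several regions at once. The key to the solution is the PowerCLIP framework, which introduces powerset alignment to optimize the compositional correspondence between image regions and text phrases. Specifically, the method defines a loss between powersets of image regions and textual parse trees and uses efficient non-linear aggregators (NLAs) to reduce the otherwise exponential cost, O(2^M), to linear cost, O(M), while approximating the exact loss value with arbitrary precision. This improves the model's understanding of complex semantic compositions while remaining efficient.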

Link: https://arxiv.org/abs/2511.23170
Authors: Masaki Kawamura, Nakamasa Inoue, Rintaro Yanagi, Hirokatsu Kataoka, Rio Yokota
Affiliations: Institute of Science Tokyo; AIST; University of Oxford
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to CVPR 2026

Abstract:Contrastive vision-language pre-training frameworks such as CLIP have demonstrated impressive zero-shot performance across a range of vision-language tasks. Recent studies have shown that aligning individual text tokens with specific image patches or regions enhances fine-grained compositional understanding. However, it remains challenging to capture compositional semantics that span multiple image regions. To address this limitation, we propose PowerCLIP, a novel contrastive pre-training framework enhanced by powerset alignment, which exhaustively optimizes region-to-phrase alignments by minimizing the loss defined between powersets of image regions and textual parse trees. Since the naive powerset construction incurs exponential computational cost due to the combinatorial explosion in the number of region subsets, we introduce efficient non-linear aggregators (NLAs) that reduce complexity from O(2^M) to O(M) with respect to the number of regions M, while approximating the exact loss value with arbitrary precision. Our extensive experiments demonstrate that PowerCLIP outperforms state-of-the-art methods in zero-shot classification and retrieval tasks, underscoring the compositionality and robustness of our approach. Our code will be made publicly available.

[CV-35] REVEAL: Reasoning-enhanced Forensic Evidence Analysis for Explainable AI-generated Image Detection

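[Quick Read]: This paper addresses the lack of verifiable causal explanations and the weak generalization of current AI-generated image detection methods. Existing methods mostly rely on post-hoc rationalization or visual discrimination and cannot provide a trustworthy chain of evidence, which leads to weak explanations and poor cross-model generalization. The key to the solution is the REVEAL-Bench benchmark and the REVEAL framework. The former is a reasoning-enhanced multimodal benchmark built on multiple lightweight expert models that records step-by-step reasoning traces and supporting evidence. The latter introduces an expert-grounded reinforcement learning mechanism that jointly optimizes detection accuracy, explanation fidelity, and logical coherence, producing fine-grained, interpretable, and verifiable reasoning chains for more reliable and causally grounded image forensics.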

Link: https://arxiv.org/abs/2511.23158
Authors: Huangsen Cao, Qin Mei, Zhiheng Li, Yuxi Li, Ying Zhang, Chen Li, Zhimeng Zhang, Xin Ding, Yongwei Wang, Jing Lyu, Fei Wu
Affiliations: Zhejiang University; WeChat Vision, Tencent Inc.; Nanjing University of Information Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:With the rapid advancement of generative models, visually realistic AI-generated images have become increasingly difficult to distinguish from authentic ones, posing severe threats to social trust and information integrity. Consequently, there is an urgent need for efficient and truly explainable image forensic methods. Recent detection paradigms have shifted towards explainable forensics. However, state-of-the-art approaches primarily rely on post-hoc rationalizations or visual discrimination, lacking a verifiable chain of evidence. This reliance on surface-level pattern matching limits the generation of causally grounded explanations and often results in poor generalization. To bridge this critical gap, we introduce REVEAL-Bench, the first reasoning-enhanced multimodal benchmark for AI-generated image detection that is explicitly structured around a chain-of-evidence derived from multiple lightweight expert models, then records step-by-step reasoning traces and evidential justifications. Building upon this dataset, we propose REVEAL (Reasoning-enhanced Forensic Evidence Analysis), an effective and explainable forensic framework that integrates detection with a novel expert-grounded reinforcement learning. Our reward mechanism is specially tailored to jointly optimize detection accuracy, explanation fidelity, and logical coherence grounded in explicit forensic evidence, enabling REVEAL to produce fine-grained, interpretable, and verifiable reasoning chains alongside its detection outcomes. Extensive experimental results demonstrate that REVEAL significantly enhances detection accuracy, explanation fidelity, and robust cross-model generalization, benchmarking a new state of the art for explainable image forensics.

[CV-36] Learning to Refuse: Refusal-Aware Reinforcement Fine-Tuning for Hard-Irrelevant Queries in Video Temporal Grounding

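[Quick Read]: This paper addresses the inability of Video Temporal Grounding (VTG) models to refuse hard-irrelevant queries, that is, natural language queries that are semantically similar to the video content but not actually relevant, for which existing models still output a temporal segment. The key to the solution is Refusal-Aware Reinforcement Fine-Tuning (RA-RFT), which builds on the Group Relative Policy Optimization (GRPO) framework and introduces four reward objectives (format, refuse-IoU, explain, and query correction) to strengthen both fine-grained semantic discrimination and refusal decisions. The authors also construct a Hard-Irrelevant VTG (HI-VTG) dataset for training and evaluating this ability, improving the robustness and reliability of VTG models in complex real-world scenarios.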

链接: https://arxiv.org/abs/2511.23151
作者: Jin-Seop Lee,SungJoon Lee,SeongJun Jung,Boyang Li,Jee-Hyong Lee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages

Abstract:Video Temporal Grounding (VTG) aims to localize a temporal segment in a video corresponding to a natural language query. However, existing VTG models assume that a relevant segment always exists, causing them to always predict a target segment even when the query is irrelevant to the video. While recent approaches attempt to handle irrelevant queries, they can only reject those that are entirely unrelated to the video and still fail to handle hard-irrelevant queries that are semantically similar but not actually relevant. To address this, we propose Refusal-Aware Reinforcement Fine-Tuning (RA-RFT) to effectively refuse hard-irrelevant queries in VTG. Our method is based on the Group Relative Policy Optimization (GRPO) framework and integrates four reward objectives (format, refuse-IoU, explain, and query correction) to improve both relevance discrimination and fine-grained semantic reasoning. In addition, to effectively support RA-RFT, we construct a Hard-Irrelevant VTG (HI-VTG) dataset, which includes hard-irrelevant queries and their refusal answers. We demonstrate the effectiveness of our method across various relevance-aware VTG scenarios, including hard-irrelevant VTG, simply-shuffled RA-VTG, and human-annotated RA-VTG settings. We also show that the proposed method is scalable by applying it to various LVLM-based VTG models. Our code is available at this https URL.
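
To make the idea of a refusal-aware IoU reward concrete, here is a minimal Python sketch. The temporal IoU is the standard definition; the function `refuse_iou_reward` and its scoring rule (reward a refusal only when the query is irrelevant) are assumptions for illustration, since the abstract names a refuse-IoU reward without defining it.

```python
from typing import Optional, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Standard temporal IoU between two [start, end] segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def refuse_iou_reward(pred: Optional[Segment], gt: Optional[Segment]) -> float:
    """Hypothetical reward; None stands for 'refuse' / 'no relevant segment'."""
    if gt is None:
        # Irrelevant query: only a refusal earns the reward.
        return 1.0 if pred is None else 0.0
    if pred is None:
        # Relevant query that was wrongly refused.
        return 0.0
    return temporal_iou(pred, gt)
```

Under this assumed rule, a model that always predicts a segment gets zero reward on every hard-irrelevant query, which is the behavior the paper sets out to discourage.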

[CV-37] Cascaded Robust Rectification for Arbitrary Document Images

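[Quick Read]: This paper tackles document rectification in real-world scenarios, where differences in camera viewpoint and physical deformation produce complex geometric distortions. The key to the solution is a multi-stage progressive rectification framework that handles distinct distortion types in coarse-to-fine order: it first applies a global affine transformation to correct perspective distortion caused by the camera viewpoint, then corrects geometric deformation caused by paper curling and folding, and finally uses a content-aware iterative process to remove fine-grained content distortions. By decomposing the complex transformation into stages and solving them step by step, the method substantially improves the accuracy and robustness of document rectification.

Link: https://arxiv.org/abs/2511.23150
Authors: Chaoyun Wang, Quanxin Huang, I-Chao Shen, Takeo Igarashi, Nanning Zheng, Caigui Jiang
Affiliations: Xi’an Jiaotong University (西安交通大学); The University of Tokyo (东京大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: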

Abstract:Document rectification in real-world scenarios poses significant challenges due to extreme variations in camera perspectives and physical distortions. Driven by the insight that complex transformations can be decomposed and resolved progressively, we introduce a novel multi-stage framework that progressively reverses distinct distortion types in a coarse-to-fine manner. Specifically, our framework first performs a global affine transformation to correct perspective distortions arising from the camera’s viewpoint, then rectifies geometric deformations resulting from physical paper curling and folding, and finally employs a content-aware iterative process to eliminate fine-grained content distortions. To address limitations in existing evaluation protocols, we also propose two enhanced metrics: layout-aligned OCR metrics (AED/ACER) for a stable assessment that decouples geometric rectification quality from the layout analysis errors of OCR engines, and masked AD/AAD (AD-M/AAD-M) tailored for accurately evaluating geometric distortions in documents with incomplete boundaries. Extensive experiments show that our method establishes new state-of-the-art performance on multiple challenging benchmarks, yielding a substantial reduction of 14.1%–34.7% in the AAD metric and demonstrating superior efficacy in real-world applications. The code will be publicly available at this https URL.

[CV-38] InstanceV: Instance-Level Video Generation

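[Quick Read]: This paper addresses the lack of fine-grained control in current text-to-video diffusion models, in particular the difficulty of controlling the spatial location and attributes of specific instances (instance-level control) and the lack of global semantic consistency. The key to the solution is InstanceV, a video generation framework with three core components: 1) an Instance-aware Masked Cross-Attention mechanism that exploits additional instance-level grounding information to generate correctly attributed instances at designated spatial locations; 2) a Shared Timestep-Adaptive Prompt Enhancement module that links local instances to global semantics in a parameter-efficient way, improving overall consistency; and 3) Spatially-Aware Unconditional Guidance, applied during both training and inference, to alleviate the disappearance of small instances. Together these techniques enable highly controllable, high-quality, instance-consistent video generation.

Link: https://arxiv.org/abs/2511.23146
Authors: Yuheng Chen, Teng Hu, Jiangning Zhang, Zhucun Xue, Ran Yi, Lizhuang Ma
Affiliations: Shanghai Jiao Tong University (上海交通大学); Zhejiang University (浙江大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: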

Abstract:Recent advances in text-to-video diffusion models have enabled the generation of high-quality videos conditioned on textual descriptions. However, most existing text-to-video models rely solely on textual conditions, lacking general fine-grained controllability over video generation. To address this challenge, we propose InstanceV, a video generation framework that enables i) instance-level control and ii) global semantic consistency. Specifically, with the aid of the proposed Instance-aware Masked Cross-Attention mechanism, InstanceV maximizes the utilization of additional instance-level grounding information to generate correctly attributed instances at designated spatial locations. To improve overall consistency, we introduce the Shared Timestep-Adaptive Prompt Enhancement module, which connects local instances with global semantics in a parameter-efficient manner. Furthermore, we incorporate Spatially-Aware Unconditional Guidance during both training and inference to alleviate the disappearance of small instances. Finally, we propose a new benchmark, named InstanceBench, which combines general video quality metrics with instance-aware metrics for more comprehensive evaluation on instance-level video generation. Extensive experiments demonstrate that InstanceV not only achieves remarkable instance-level controllability in video generation, but also outperforms existing state-of-the-art models in both general quality and instance-aware metrics across qualitative and quantitative evaluations.

[CV-39] DualCamCtrl: Dual-Branch Diffusion Model for Geometry-Aware Camera-Controlled Video Generation

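[Quick Read]: This paper addresses the insufficient scene understanding and geometric awareness of current diffusion-based camera-controlled video generation methods. Existing methods represent camera poses as ray-based conditions, but they entangle appearance and geometry modeling, so the generated videos struggle to follow the specified camera trajectory strictly. The key to the solution is DualCamCtrl, a dual-branch framework that jointly generates consistent RGB and depth sequences, together with the Semantic Guided Mutual Alignment (SIGMA) mechanism, which fuses the two modalities in a semantics-guided and mutually reinforcing manner. This design disentangles appearance and geometry modeling and markedly improves camera-motion consistency; experiments show a reduction of more than 40% in camera motion error compared with prior methods.

Link: https://arxiv.org/abs/2511.23127
Authors: Hongfei Zhang, Kanghao Chen, Zixin Zhang, Harold Haodong Chen, Yuanhuiyi Lyu, Yuqi Zhang, Shuai Yang, Kun Zhou, Yingcong Chen
Affiliations: HKUST (GZ); HKUST; Fudan University; Shenzhen University; Knowin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: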

Abstract:This paper presents DualCamCtrl, a novel end-to-end diffusion model for camera-controlled video generation. Recent works have advanced this field by representing camera poses as ray-based conditions, yet they often lack sufficient scene understanding and geometric awareness. DualCamCtrl specifically targets this limitation by introducing a dual-branch framework that mutually generates camera-consistent RGB and depth sequences. To harmonize these two modalities, we further propose the Semantic Guided Mutual Alignment (SIGMA) mechanism, which performs RGB-depth fusion in a semantics-guided and mutually reinforced manner. These designs collectively enable DualCamCtrl to better disentangle appearance and geometry modeling, generating videos that more faithfully adhere to the specified camera trajectories. Additionally, we analyze and reveal the distinct influence of depth and camera poses across denoising stages and further demonstrate that early and late stages play complementary roles in forming global structure and refining local details. Extensive experiments demonstrate that DualCamCtrl achieves more consistent camera-controlled video generation, with over 40% reduction in camera motion errors compared with prior methods. Our project page: this https URL-page/

[CV-40] DNA-Prior: Unsupervised Denoise Anything via Dual-Domain Prior

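[Quick Read]: This paper addresses the dependence of medical image denoising methods on large annotated datasets and supervised learning, which limits their use in clinical settings with heterogeneous modalities and little ground-truth data. The key to the solution is DNA-Prior, a universal unsupervised denoising framework that reconstructs clean images directly from noisy observations through a mathematically principled hybrid prior: an implicit structural prior introduced by a deep network parameterisation, combined with an explicit spectral-spatial prior made up of a frequency-domain fidelity term and a spatial regularisation term. Without external training data or modality-specific tuning, this formulation preserves both global frequency characteristics and local anatomical structure, yielding stable and consistent denoising.

Link: https://arxiv.org/abs/2511.23124
Authors: Yanqi Cheng, Chun-Wun Cheng, Jim Denholm, Thiago Lima, Javier A. Montoya-Zegarra, Richard Goodwin, Carola-Bibiane Schönlieb, Angelica I Aviles-Rivero
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: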

Abstract:Medical imaging pipelines critically rely on robust denoising to stabilise downstream tasks such as segmentation and reconstruction. However, many existing denoisers depend on large annotated datasets or supervised learning, which restricts their usability in clinical environments with heterogeneous modalities and limited ground-truth data. To address this limitation, we introduce DNA-Prior, a universal unsupervised denoising framework that reconstructs clean images directly from corrupted observations through a mathematically principled hybrid prior. DNA-Prior integrates (i) an implicit architectural prior, enforced through a deep network parameterisation, with (ii) an explicit spectral-spatial prior composed of a frequency-domain fidelity term and a spatial regularisation functional. This dual-domain formulation yields a well-structured optimisation problem that jointly preserves global frequency characteristics and local anatomical structure, without requiring any external training data or modality-specific tuning. Experiments across multiple modalities show that DNA achieves consistent noise suppression and structural preservation under diverse noise conditions.

[CV-41] Analyzing Image Beyond Visual Aspect: Image Emotion Classification via Multiple-Affective Captioning

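[Quick Read]: This paper addresses the difficulty of transferring knowledge from pre-trained visual models to Image Emotion Classification (IEC) because of the “affective gap”. The key to the solution is a new purely text-based method, Affective Captioning for Image Emotion Classification (ACIEC): a hierarchical multi-level contrastive loss detects emotional concepts in the image, and emotional-attribute chain-of-thought reasoning generates affect-rich sentences; a pre-trained language model then combines the emotional concepts and affective sentences to classify the emotion. In addition, a contrastive loss based on semantic-similarity sampling mitigates the large intra-class and small inter-class differences in affective datasets, and the method also handles images containing embedded text, which previous studies ignored, giving a more complete model of image emotion.

Link: https://arxiv.org/abs/2511.23115
Authors: Zibo Zhou, Zhengjun Zhai, Huimin Chen, Wei Dai, Hansen Yang
Affiliations: Northwestern Polytechnical University (西北工业大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: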

Abstract:Image emotion classification (IEC) is a longstanding research field that has received increasing attention with the rapid progress of deep learning. Although recent advances have leveraged the knowledge encoded in pre-trained visual models, their effectiveness is constrained by the “affective gap”, which limits the applicability of pre-training knowledge for IEC tasks. It has been demonstrated in psychology that language exhibits high variability, encompasses diverse and abundant information, and can effectively eliminate the “affective gap”. Inspired by this, we propose a novel Affective Captioning for Image Emotion Classification (ACIEC) to classify image emotion based on pure texts, which effectively captures the affective information in the image. In our method, a hierarchical multi-level contrastive loss is designed for detecting emotional concepts from images, while an emotional attribute chain-of-thought reasoning is proposed to generate affective sentences. Then, a pre-trained language model is leveraged to synthesize emotional concepts and affective sentences to conduct IEC. Additionally, a contrastive loss based on semantic similarity sampling is designed to solve the problem of large intra-class differences and small inter-class differences in affective datasets. Moreover, we also take the images with embedded texts into consideration, which were ignored by previous studies. Extensive experiments illustrate that our method can effectively bridge the affective gap and achieve superior results on multiple benchmarks.

[CV-42] db-SP: Accelerating Sparse Attention for Visual Generative Models with Dual-Balanced Sequence Parallelism

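[Quick Read]: This paper addresses the workload imbalance that arises when sequence parallelism is used to accelerate inference of Diffusion Transformers (DiT) with block-wise sparse attention, caused by differing sparsity across attention heads and the uneven distribution of dense blocks within the sparse mask. The key to the solution is db-SP, a sparsity-aware sequence parallelism technique whose core is a dual-level partitioning strategy that achieves near-perfect workload balance along both the head and block dimensions with negligible overhead. db-SP also adjusts the parallel degrees of the head and block dimensions dynamically at runtime to follow the sparsity patterns that change across denoising steps and network layers, which markedly improves end-to-end inference efficiency.

Link: https://arxiv.org/abs/2511.23113
Authors: Siqi Chen, Ke Hong, Tianchen Zhao, Ruiqi Xie, Zhenhua Zhu, Xudong Zhang, Yu Wang
Affiliations: Tsinghua University (清华大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: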

Abstract:Scaling Diffusion Transformer (DiT) inference via sequence parallelism is critical for reducing latency in visual generation, but is severely hampered by workload imbalance when applied to models employing block-wise sparse attention. The imbalance stems from the inherent variation in sparsity across attention heads and the irregular distribution of dense blocks within the sparse mask, when sequence parallelism is applied along the head dimension (as in Ulysses) or the block dimension (as in Ring Attention). In this paper, we formalize a sparse imbalance ratio to quantify the imbalance, and propose db-SP, a sparsity-aware sequence parallelism technique that tackles the challenge. db-SP contains a dual-level partitioning approach that achieves near-perfect workload balance at both the head and block levels with negligible overhead. Furthermore, to handle the evolving sparsity patterns across denoising steps and layers, db-SP dynamically determines the parallel degrees for the head and block dimensions at runtime. Experimental results demonstrate that db-SP delivers an end-to-end speedup of 1.25x and an attention-specific speedup of 1.40x over state-of-the-art sequence parallel methods on average. Code is available at this https URL.
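
The imbalance problem can be illustrated with a small Python sketch. Both the ratio used here (maximum per-device load divided by mean load) and the greedy longest-processing-time-first assignment are assumptions chosen for illustration; the abstract does not give the paper's formal sparse imbalance ratio or its partitioning algorithm, and `dense_blocks_per_head` is made-up toy data.

```python
import heapq
from typing import List


def imbalance_ratio(loads: List[int]) -> float:
    """Max per-device load over mean load; 1.0 means perfectly balanced."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean > 0 else 1.0


def naive_partition(work: List[int], num_devices: int) -> List[int]:
    """Contiguous equal-count split of heads, ignoring sparsity.

    Assumes len(work) is divisible by num_devices.
    """
    per_dev = len(work) // num_devices
    return [sum(work[d * per_dev:(d + 1) * per_dev]) for d in range(num_devices)]


def greedy_partition(work: List[int], num_devices: int) -> List[int]:
    """Greedy balancing: the heaviest remaining head goes to the lightest device."""
    heap = [(0, d) for d in range(num_devices)]
    heapq.heapify(heap)
    for w in sorted(work, reverse=True):
        load, d = heapq.heappop(heap)
        heapq.heappush(heap, (load + w, d))
    return [load for load, _ in sorted(heap, key=lambda t: t[1])]


# Toy data: number of dense attention blocks per head (hypothetical values).
dense_blocks_per_head = [9, 8, 7, 6, 1, 1, 2, 2]
naive = naive_partition(dense_blocks_per_head, 2)    # loads [30, 6]
greedy = greedy_partition(dense_blocks_per_head, 2)  # loads [18, 18]
```

On this toy input the sparsity-agnostic split leaves one device with five times the work of the other, while the sparsity-aware assignment equalizes the loads.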

[CV-43] MathSight: A Benchmark Exploring Have Vision-Language Models Really Seen in University-Level Mathematical Reasoning?

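[Quick Read]: This paper addresses the unclear extent to which current Vision-Language Models (VLMs) rely on visual information in multimodal mathematical reasoning: existing benchmarks report strong overall performance but do not isolate the contribution of the image modality, so it is hard to tell whether models use visual understanding or only linguistic priors. The key to the solution is MathSight, a university-level multimodal mathematical reasoning benchmark that provides several visual variants of each problem (original, hand-drawn, photo-captured) plus a text-only condition, so that the effect of visual input can be disentangled and quantified. Experiments show that the contribution of visual information diminishes as problem difficulty increases and that some models perform better without image input than their multimodal versions, underscoring the value of such controlled benchmarks for advancing vision-grounded reasoning.

Link: https://arxiv.org/abs/2511.23112
Authors: Yuandong Wang, Yao Cui, Yuxin Zhao, Zhen Yang, Yangfu Zhu, Zhenzhou Shao
Affiliations: Capital Normal University (首都师范大学); Tsinghua University (清华大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 32 pages, 15 figures, 9 tables, includes appendix. Project page: this https URL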

Abstract:Recent advances in Vision-Language Models (VLMs) have achieved impressive progress in multimodal mathematical reasoning. Yet, how much visual information truly contributes to reasoning remains unclear. Existing benchmarks report strong overall performance but seldom isolate the role of the image modality, leaving open whether VLMs genuinely leverage visual understanding or merely depend on linguistic priors. To address this, we present MathSight, a university-level multimodal mathematical reasoning benchmark designed to disentangle and quantify the effect of visual input. Each problem includes multiple visual variants – original, hand-drawn, photo-captured – and a text-only condition for controlled comparison. Experiments on state-of-the-art VLMs reveal a consistent trend: the contribution of visual information diminishes with increasing problem difficulty. Remarkably, Qwen3-VL without any image input surpasses both its multimodal variants and GPT-5, underscoring the need for benchmarks like MathSight to advance genuine vision-grounded reasoning in future models.

[CV-44] NumeriKontrol: Adding Numeric Control to Diffusion Transformers for Instruction-based Image Editing

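[Quick Read]: This paper addresses the lack of fine-grained control in text-instruction-based image editing: natural language alone makes it hard to set edit intensity precisely. The key to the solution is the NumeriKontrol framework, which uses a Numeric Adapter to encode continuous scalar values (such as brightness or color intensity) as editing scales and injects them into diffusion models in a plug-and-play manner; a task-separated design supports zero-shot multi-condition editing, so users can specify multiple attribute edits in any order. The authors also build the Common Attribute Transform (CAT) dataset, synthesizing training data with precisely annotated scales from high-fidelity rendering engines and DSLR cameras, which enables stable, continuous, and accurate control of image attributes.

Link: https://arxiv.org/abs/2511.23105
Authors: Zhenyu Xu, Xiaoqi Shen, Haotian Nan, Xinyu Zhang
Affiliations: East China Normal University (华东师范大学); South China University of Technology (华南理工大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 10 figures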

Abstract:Instruction-based image editing enables intuitive manipulation through natural language commands. However, text instructions alone often lack the precision required for fine-grained control over edit intensity. We introduce NumeriKontrol, a framework that allows users to precisely adjust image attributes using continuous scalar values with common units. NumeriKontrol encodes numeric editing scales via an effective Numeric Adapter and injects them into diffusion models in a plug-and-play manner. Thanks to a task-separated design, our approach supports zero-shot multi-condition editing, allowing users to specify multiple instructions in any order. To provide high-quality supervision, we synthesize precise training data from reliable sources, including high-fidelity rendering engines and DSLR cameras. Our Common Attribute Transform (CAT) dataset covers diverse attribute manipulations with accurate ground-truth scales, enabling NumeriKontrol to function as a simple yet powerful interactive editing studio. Extensive experiments show that NumeriKontrol delivers accurate, continuous, and stable scale control across a wide range of attribute editing scenarios. These contributions advance instruction-based image editing by enabling precise, scalable, and user-controllable image manipulation.

[CV-45] Implementation of a Skin Lesion Detection System for Managing Children with Atopic Dermatitis Based on Ensemble Learning

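[Quick Read]: This paper addresses the lack of objective diagnostic criteria for skin diseases such as atopic dermatitis, which is easily confused with psoriasis and therefore prone to misdiagnosis, as well as the difficulty of obtaining high-quality dermoscopic images in real clinical settings. The key to the solution is ENSEL, an ensemble learning-based skin lesion detection system that combines multiple deep learning models to improve diagnostic accuracy while achieving high recall and a processing time under 1 second on images taken by real users, supporting the practical adoption of digital healthcare in dermatology.

Link: https://arxiv.org/abs/2511.23082
Authors: Soobin Jeon, Sujong Kim, Dongmahn Seo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 14 figures, 7 tables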

Abstract:The amendments made to the Data 3 Act and the impact of COVID-19 have fostered the growth of the digital healthcare market and promoted the use of medical data in artificial intelligence in South Korea. Atopic dermatitis, a chronic inflammatory skin disease, is diagnosed via subjective evaluations without using objective diagnostic methods, thereby increasing the risk of misdiagnosis. It is also similar to psoriasis in appearance, further complicating its accurate diagnosis. Existing studies on skin diseases have used high-quality dermoscopic image datasets, but such high-quality images cannot be obtained in actual clinical settings. Moreover, existing systems must ensure accuracy and fast response times. To this end, an ensemble learning-based skin lesion detection system (ENSEL) was proposed herein. ENSEL enhanced diagnostic accuracy by integrating various deep learning models via an ensemble approach. Its performance was verified by conducting skin lesion detection experiments using images of skin lesions taken by actual users. Its accuracy and response time were measured using randomly sampled skin disease images. Results revealed that ENSEL achieved high recall in most images and a processing time of less than 1 s. This study contributes to the objective diagnosis of skin lesions and promotes the advancement of digital healthcare.
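
As a rough illustration of what "integrating various deep learning models via an ensemble approach" can mean, the Python sketch below averages per-class probabilities across models (soft voting). Soft voting is a common ensemble rule swapped in here as an assumption; the abstract does not say how ENSEL combines its models, and the class names and probabilities are invented toy values.

```python
from typing import Dict, List


def soft_vote(model_probs: List[Dict[str, float]]) -> Dict[str, float]:
    """Average per-class probabilities across models (soft voting)."""
    classes = model_probs[0].keys()
    n = len(model_probs)
    return {c: sum(p[c] for p in model_probs) / n for c in classes}


def predict(model_probs: List[Dict[str, float]]) -> str:
    """Return the class with the highest averaged probability."""
    avg = soft_vote(model_probs)
    return max(avg, key=avg.get)


# Toy outputs of three hypothetical classifiers for one lesion image.
outputs = [
    {"atopic_dermatitis": 0.6, "psoriasis": 0.4},
    {"atopic_dermatitis": 0.4, "psoriasis": 0.6},
    {"atopic_dermatitis": 0.7, "psoriasis": 0.3},
]
label = predict(outputs)
```

Averaging lets a confident majority outweigh a single dissenting model, which is one reason ensembles tend to be more stable than their individual members on noisy user-taken photos.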

[CV-46] SpaceMind: Camera-Guided Modality Fusion for Spatial Reasoning in Vision-Language Models

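[Quick Read]: This paper addresses the weak performance of large vision-language models (VLMs) on 3D spatial reasoning tasks such as distance estimation, size comparison, and cross-view consistency. Existing methods either depend on auxiliary 3D information or augment RGB-only VLMs through shallow feature fusion, with limited effect. The key to the solution is SpaceMind, a multimodal large language model designed for RGB-only input, whose core innovation is a lightweight Camera-Guided Modality Fusion module that treats the camera representation as an active guiding modality rather than passive metadata: placed before the language model, it applies camera-conditioned biasing to spatial tokens, assigns query-independent weights reflecting their geometric importance, and uses the camera embedding to gate the fused representation, giving the VLM genuinely spatially grounded capability. Experiments report new state-of-the-art results on VSI-Bench, SQA3D, and SPBench.

Link: https://arxiv.org/abs/2511.23075
Authors: Ruosen Zhao, Zhikang Zhang, Jialei Xu, Jiahao Chang, Dong Chen, Lingyun Li, Weijian Sun, Zizhuang Wei
Affiliations: Huawei (华为); The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳)); The University of Hong Kong (香港大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: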

Abstract:Large vision-language models (VLMs) show strong multimodal understanding but still struggle with 3D spatial reasoning, such as distance estimation, size comparison, and cross-view consistency. Existing 3D-aware methods either depend on auxiliary 3D information or enhance RGB-only VLMs with geometry encoders through shallow feature fusion. We propose SpaceMind, a multimodal large language model explicitly designed for spatial reasoning solely from RGB inputs. The model adopts a dual-encoder architecture, integrating VGGT as a spatial understanding encoder and InternViT as a 2D visual encoder. The key idea is to treat the camera representation as an active guiding modality rather than passive metadata. Specifically, SpaceMind introduces a lightweight Camera-Guided Modality Fusion module before the language model to replace shallow fusion. It applies camera-conditioned biasing to spatial tokens, assigns query-independent weights reflecting their geometric importance, and uses the camera embedding to gate the fused representation. Empirically, SpaceMind establishes new state-of-the-art results on VSI-Bench, SQA3D and SPBench, surpassing both open and proprietary systems on VSI-Bench and SPBench by large margins and achieving state-of-the-art performance on SQA3D. These results demonstrate that camera-guided modality fusion is an effective and practical inductive bias for equipping VLMs with genuinely spatially grounded intelligence. We will release code and model checkpoints to support future research.

[CV-47] Buffer replay enhances the robustness of multimodal learning under missing-modality

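[Quick Read]: This paper addresses the significant performance degradation of multimodal models when modalities are missing. Existing methods are either computationally expensive or apply prompt-based fine-tuning that relies only on adjacent-layer features, overlooking the potential of long-distance contextual information to improve fault tolerance. The key to the solution is REplay Prompting (REP): it builds modality-wise feature buffers via a residual bypass and replays shallow-layer representations in deeper layers to mitigate information loss; it adopts a private-shared feature decoupling strategy in which private buffers preserve modality-specific signals and shared buffers encode cross-modal semantics; and it designs a task-aware dynamic initialization mechanism that configures the buffers for different missing-modality scenarios, improving stability and generalization under diverse missing conditions.

Link: https://arxiv.org/abs/2511.23070
Authors: Hongye Zhu, Xuan Liu, Yanwen Ba, Jingye Xue, Shigeng Zhang
Affiliations: Hunan University (湖南大学); Central South University (中南大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: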

Abstract:Missing modalities consistently lead to significant performance degradation in multimodal models. Existing approaches either synthesize missing modalities at high computational cost or apply prompt-based fine-tuning that relies only on adjacent-layer features and overlooks long-distance contextual information, which may offer additional tolerance to errors when one or more modalities are missing. To address this, we introduce REplay Prompting (REP): (1) construct modality-wise feature buffers via a residual bypass to cache early-layer representations and replay them in deeper layers, mitigating information loss as network depth increases; (2) employ a private-shared feature decoupling strategy, where private buffers preserve modality-specific signals and shared buffers encode cross-modal semantics; and (3) design a task-aware dynamic initialization mechanism to configure these buffers differently, improving stability and generalization under diverse missing-modality conditions. Experiments on vision-language, vision-language-audio, and temporal multimodal benchmarks demonstrate that REP consistently outperforms prior methods under both single- and multi-modality missing scenarios, while introducing only negligible parameter overhead. These results establish REP as a lightweight and effective paradigm for robust multimodal learning in challenging missing-modality environments.

[CV-48] Evaluating the Clinical Impact of Generative Inpainting on Bone Age Estimation

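[Quick Read]: This paper asks whether using generative foundation models to remove visual artifacts (such as non-anatomical markers) from medical images affects downstream tasks such as bone age and gender prediction, that is, how clinically reliable such inpainting is. The key to the solution is a systematic evaluation of generative inpainting on real pediatric hand radiographs: starting from 200 original images in the RSNA Bone Age Challenge dataset, the authors generate 600 inpainted images from natural-language prompts, then use deep learning ensembles to evaluate bone age estimation (measured by mean absolute error, MAE) and gender classification (measured by AUC), and analyze changes in pixel-intensity distributions to detect structural alterations. The results show that the inpainted images, although visually realistic, markedly degrade task performance and introduce latent bias, highlighting the need for task-specific validation before generative AI is applied in medical settings.

Link: https://arxiv.org/abs/2511.23066
Authors: Felipe Akio Matsuoka, Eduardo Moreno J. M. Farina, Augusto Sarquis Serpa, Soraya Monteiro, Rodrigo Ragazzini, Nitamar Abdala, Marcelo Straus Takahashi, Felipe Campos Kitamura
Affiliations: Universidade Federal de São Paulo (圣保罗联邦大学); Dasa; Universidade de São Paulo (圣保罗大学); University of North Carolina at Chapel Hill (北卡罗来纳大学教堂山分校); Bunkerhill Health
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages, 4 figures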

Abstract:Generative foundation models can remove visual artifacts through realistic image inpainting, but their impact on medical AI performance remains uncertain. Pediatric hand radiographs often contain non-anatomical markers, and it is unclear whether inpainting these regions preserves features needed for bone age and gender prediction. To evaluate the clinical reliability of generative model-based inpainting for artifact removal, we used the RSNA Bone Age Challenge dataset, selecting 200 original radiographs and generating 600 inpainted versions with gpt-image-1 using natural language prompts to target non-anatomical artifacts. Downstream performance was assessed with deep learning ensembles for bone age estimation and gender classification, using mean absolute error (MAE) and area under the ROC curve (AUC) as metrics, and pixel intensity distributions to detect structural alterations. Inpainting markedly degraded model performance: bone age MAE increased from 6.26 to 30.11 months, and gender classification AUC decreased from 0.955 to 0.704. Inpainted images displayed pixel-intensity shifts and inconsistencies, indicating structural modifications not corrected by simple calibration. These findings show that, although visually realistic, foundation model-based inpainting can obscure subtle but clinically relevant features and introduce latent bias even when edits are confined to non-diagnostic regions, underscoring the need for rigorous, task-specific validation before integrating such generative tools into clinical AI workflows.
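
The two metrics reported above can be computed as in the following Python sketch. These are the standard definitions of mean absolute error and of ROC AUC (as the probability that a random positive is scored above a random negative); the function names and the toy inputs are illustrative and are not taken from the paper.

```python
from typing import Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference, e.g. in months for bone age."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. O(P * N), fine for small samples.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Read this way, the reported AUC drop from 0.955 to 0.704 means that after inpainting the classifier ranks a randomly chosen positive case above a randomly chosen negative case only about 70% of the time.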

[CV-49] Image Valuation in NeRF-based 3D reconstruction

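[Quick Read]: This paper addresses the unequal contribution of input images to final reconstruction quality in Neural Radiance Field (NeRF)-based 3D scene reconstruction. Real-world captures vary in quality and contain occlusions and transient objects, yet conventional pipelines treat all input images equally, wasting resources and reducing reconstruction efficiency. The key to the solution is a method that quantifies each image's contribution to the NeRF reconstruction, assessing it through reconstruction quality metrics such as peak signal-to-noise ratio (PSNR) and mean squared error (MSE), and validating the scores by removing low-contributing images during training and measuring the effect on reconstruction fidelity.

Link: https://arxiv.org/abs/2511.23052
Authors: Grigorios Aris Cheimariotis, Antonis Karakottas, Vangelis Chatzis, Angelos Kanlis, Dimitrios Zarpalas
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published in International Conference on Computer Analysis of Images and Patterns (pp. 375-385). Cham: Springer Nature Switzerland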

Abstract:Data valuation and monetization are becoming increasingly important across domains such as eXtended Reality (XR) and digital media. In the context of 3D scene reconstruction from a set of images – whether casually or professionally captured – not all inputs contribute equally to the final output. Neural Radiance Fields (NeRFs) enable photorealistic 3D reconstruction of scenes by optimizing a volumetric radiance field given a set of images. However, in-the-wild scenes often include image captures of varying quality, occlusions, and transient objects, resulting in uneven utility across inputs. In this paper we propose a method to quantify the individual contribution of each image to NeRF-based reconstructions of in-the-wild image sets. Contribution is assessed through reconstruction quality metrics based on PSNR and MSE. We validate our approach by removing low-contributing images during training and measuring the resulting impact on reconstruction fidelity.
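
A minimal Python sketch of metric-based image valuation follows. MSE and PSNR are the standard definitions; the leave-one-out scheme and the `evaluate` callback (which in practice would train a NeRF on the given images and return a held-out quality score) are assumptions for illustration, since the abstract does not specify the estimator the authors use.

```python
import math
from typing import Callable, Dict, List, Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two flattened images."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def psnr(a: Sequence[float], b: Sequence[float], max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for pixel values in [0, max_val]."""
    err = mse(a, b)
    return math.inf if err == 0 else 10.0 * math.log10(max_val ** 2 / err)


def leave_one_out_values(
    images: List[str], evaluate: Callable[[List[str]], float]
) -> Dict[str, float]:
    """Value of an image = quality with all images minus quality without it."""
    full = evaluate(images)
    return {img: full - evaluate([x for x in images if x != img]) for img in images}


def toy_evaluate(imgs: List[str]) -> float:
    """Stand-in for 'train a NeRF and report PSNR'; purely hypothetical."""
    good = sum(1 for i in imgs if i != "blurry")
    return 20.0 + 2.0 * good - (1.0 if "blurry" in imgs else 0.0)


values = leave_one_out_values(["a", "b", "blurry"], toy_evaluate)
```

In this toy setup the blurry capture receives a negative value, marking it as a candidate for removal; with a real NeRF each call to `evaluate` is a full training run, so cheaper approximations would be needed at scale.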

[CV-50] GOATex: Geometry Occlusion-Aware Texturing NEURIPS2025

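[Quick Read]: This paper addresses a limitation of existing 3D mesh texturing methods: they generate high-quality textures only for visible surfaces and cannot handle occluded interior regions, leaving textures incomplete or with visible seams. The key to the solution is an occlusion-aware texturing framework based on hit levels, which quantify the relative depth of mesh faces via multi-view ray casting and thereby partition the faces into ordered visibility layers from outermost to innermost. A two-stage visibility control strategy then progressively reveals interior regions while preserving structural coherence, each layer is textured with a pretrained diffusion model, and a soft UV-space blending technique merges the per-layer textures with weights based on view-dependent visibility confidence, producing seamless, high-fidelity textures overall.

Link: https://arxiv.org/abs/2511.23051
Authors: Hyunjin Kim, Kunho Kim, Adam Lee, Wonkwang Lee
Affiliations: KRAFTON AI; NC AI; UC Berkeley; Seoul National University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to NeurIPS 2025; Project Page: this https URL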

Abstract:We present GOATex, a diffusion-based method for 3D mesh texturing that generates high-quality textures for both exterior and interior surfaces. While existing methods perform well on visible regions, they inherently lack mechanisms to handle occluded interiors, resulting in incomplete textures and visible seams. To address this, we introduce an occlusion-aware texturing framework based on the concept of hit levels, which quantify the relative depth of mesh faces via multi-view ray casting. This allows us to partition mesh faces into ordered visibility layers, from outermost to innermost. We then apply a two-stage visibility control strategy that progressively reveals interior regions with structural coherence, followed by texturing each layer using a pretrained diffusion model. To seamlessly merge textures obtained across layers, we propose a soft UV-space blending technique that weighs each texture’s contribution based on view-dependent visibility confidence. Empirical results demonstrate that GOATex consistently outperforms existing methods, producing seamless, high-fidelity textures across both visible and occluded surfaces. Unlike prior works, GOATex operates entirely without costly fine-tuning of a pretrained diffusion model and allows separate prompting for exterior and interior mesh regions, enabling fine-grained control over layered appearances. For more qualitative results, please visit our project page: this https URL.
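
The soft blending step can be pictured as a confidence-weighted average per texel. The Python sketch below is an assumption about the general shape of such a blend, not the paper's formulation: `blend_texel`, the normalization by the confidence sum, and the fallback to black when no layer sees the texel are all illustrative choices.

```python
from typing import Sequence, Tuple

RGB = Tuple[float, float, float]


def blend_texel(colors: Sequence[RGB], confidences: Sequence[float]) -> RGB:
    """Blend one texel's candidate colors, weighted by visibility confidence.

    colors[i] is the color proposed by layer i; confidences[i] >= 0 is how
    reliably that layer observed this texel.
    """
    total = sum(confidences)
    if total <= 0.0:
        # No layer observed this texel; fall back to black (arbitrary choice).
        return (0.0, 0.0, 0.0)
    return tuple(
        sum(w * c[k] for w, c in zip(confidences, colors)) / total
        for k in range(3)
    )
```

Because the weights vary smoothly with confidence, neighboring texels that are seen by overlapping layers change color gradually, which is the property that avoids hard seams between layers.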

[CV-51] Geometry-Consistent 4D Gaussian Splatting for Sparse-Input Dynamic View Synthesis

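[Quick Read]: This paper addresses the marked degradation of dynamic-scene 4D Gaussian Splatting (4DGS) when only sparse input views are available, a consequence of incoherent 4D geometry learning that limits its usefulness in practical applications such as AIoT. The key to the solution is GC-4DGS, a framework that introduces geometric consistency constraints to strengthen 4D geometry learning from sparse inputs: a dynamic consistency checking strategy reduces the estimation uncertainty of Multi-View Stereo (MVS) across spacetime, and a global-local depth regularization approach distills spatiotemporally consistent geometric information from monocular depth estimators (MDEs), improving coherent learning of geometry and appearance within the 4D volume. The method improves rendering quality while remaining real-time, outperforming RF-DeRF, the latest dynamic radiance field method for sparse inputs, and the original 4DGS on the N3DV and Technicolor datasets.

Link: https://arxiv.org/abs/2511.23044
Authors: Yiwei Li, Jiannong Cao, Penghui Ruan, Divya Saxena, Songye Zhu, Yinfeng Cao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: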

Abstract:Gaussian Splatting has been considered as a novel way for view synthesis of dynamic scenes, which shows great potential in AIoT applications such as digital twins. However, recent dynamic Gaussian Splatting methods significantly degrade when only sparse input views are available, limiting their applicability in practice. The issue arises from the incoherent learning of 4D geometry as input views decrease. This paper presents GC-4DGS, a novel framework that infuses geometric consistency into 4D Gaussian Splatting (4DGS), offering real-time and high-quality dynamic scene rendering from sparse input views. While learning-based Multi-View Stereo (MVS) and monocular depth estimators (MDEs) provide geometry priors, directly integrating these with 4DGS yields suboptimal results due to the ill-posed nature of sparse-input 4D geometric optimization. To address these problems, we introduce a dynamic consistency checking strategy to reduce estimation uncertainties of MVS across spacetime. Furthermore, we propose a global-local depth regularization approach to distill spatiotemporal-consistent geometric information from monocular depths, thereby enhancing the coherent geometry and appearance learning within the 4D volume. Extensive experiments on the popular N3DV and Technicolor datasets validate the effectiveness of GC-4DGS in rendering quality without sacrificing efficiency. Notably, our method outperforms RF-DeRF, the latest dynamic radiance field tailored for sparse-input dynamic view synthesis, and the original 4DGS by 2.62dB and 1.58dB in PSNR, respectively, with seamless deployability on resource-constrained IoT edge devices.

[CV-52] From Illusion to Intention: Visual Rationale Learning for Vision-Language Reasoning

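[Quick read]: This paper targets the "illusion of thinking with images" in current vision-language reasoning models: a model appears to reason from the image but in fact relies on context-agnostic visual actions that neither refine perception nor steer it toward the correct answer. The core of the solution is to redefine visual actions as basic reasoning primitives, which the authors call visual rationalization, the visual analogue of textual Chain-of-Thought, so that every visual action must contribute substantively to the reasoning chain. The key contribution is the Visual Rationale Learning (ViRL) framework, built on three mechanisms: (1) Process Supervision, which guides training with ground-truth visual rationales; (2) Objective Alignment, which shapes the optimization direction through step-level rewards; and (3) Fine-Grained Credit Assignment, which distinguishes correct, redundant, and erroneous actions. The approach ensures that the model reaches the correct answer on the correct visual grounds, improving transparency, verifiability, and trustworthiness.

Link: https://arxiv.org/abs/2511.23031
Authors: Changpeng Wang,Haozhe Wang,Xi Chen,Junhan Liu,Taofeng Xue,Chong Peng,Donglian Qi,Fangzhen Lin,Yunfeng Yan
Affiliations: Zhejiang University; The Hong Kong University of Science and Technology; The University of Hong Kong; Meituan
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 19 pages, 15 figures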

Abstract:Recent advances in vision-language reasoning underscore the importance of thinking with images, where models actively ground their reasoning in visual evidence. Yet, prevailing frameworks treat visual actions as optional tools, boosting metrics but leaving reasoning ungrounded and crops ineffective. This gap gives rise to the illusion of thinking with images: models seem visually grounded but rely on context-agnostic actions that neither refine perception nor guide reasoning toward correct answers. We address this problem by reframing visual actions as core reasoning primitives rather than optional tools, which we term visual rationalization, the visual analogue of textual Chain-of-Thought. Building on this insight, we propose Visual Rationale Learning (ViRL), an end-to-end paradigm that grounds training in the visual rationale itself. ViRL integrates (1) Process Supervision with ground-truth rationales, (2) Objective Alignment via step-level reward shaping, and (3) Fine-Grained Credit Assignment to distinguish correct, redundant, and erroneous actions. By ensuring each action contributes meaningfully to the reasoning chain, ViRL enables models to “get the right answer for the right visual reason”. Trained purely with end-to-end RL, ViRL achieves state-of-the-art results across benchmarks spanning perception, hallucination, and reasoning. This work establishes visual rationalization as a task-agnostic, process-grounded paradigm for building transparent, verifiable, and trustworthy vision-language models.

[CV-53] DiskChunGS: Large-Scale 3D Gaussian SLAM Through Chunk-Based Memory Management

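[Quick read]: This paper addresses the GPU memory limit that arises when 3D Gaussian Splatting (3DGS) is integrated with SLAM systems, which prevents existing methods from producing globally consistent reconstructions of large-scale scenes. The key to its solution is DiskChunGS, an out-of-core architecture that partitions the scene into spatial chunks, keeps only the currently active regions in GPU memory, and stores inactive regions on disk, removing the memory bottleneck. The design integrates with existing SLAM frameworks for pose estimation and loop closure, enabling scalable, globally consistent reconstruction. It is validated on indoor scenes (Replica, TUM-RGBD), urban driving scenarios (KITTI), and resource-constrained Nvidia Jetson platforms, and it completes all 11 KITTI sequences without memory failures.

Link: https://arxiv.org/abs/2511.23030
Authors: Casimir Feldmann,Maximum Wilder-Smith,Vaishakh Patil,Michael Oechsle,Michael Niemeyer,Keisuke Tateno,Marco Hutter
Affiliations: unknown
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: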

Abstract:Recent advances in 3D Gaussian Splatting (3DGS) have demonstrated impressive results for novel view synthesis with real-time rendering capabilities. However, integrating 3DGS with SLAM systems faces a fundamental scalability limitation: methods are constrained by GPU memory capacity, restricting reconstruction to small-scale environments. We present DiskChunGS, a scalable 3DGS SLAM system that overcomes this bottleneck through an out-of-core approach that partitions scenes into spatial chunks and maintains only active regions in GPU memory while storing inactive areas on disk. Our architecture integrates seamlessly with existing SLAM frameworks for pose estimation and loop closure, enabling globally consistent reconstruction at scale. We validate DiskChunGS on indoor scenes (Replica, TUM-RGBD), urban driving scenarios (KITTI), and resource-constrained Nvidia Jetson platforms. Our method uniquely completes all 11 KITTI sequences without memory failures while achieving superior visual quality, demonstrating that algorithmic innovation can overcome the memory constraints that have limited previous 3DGS SLAM methods.
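
The out-of-core idea described above (spatial chunks, a bounded set of active chunks in fast memory, everything else on disk) can be illustrated with a small sketch. This is not the authors' implementation: `ChunkStore`, its methods, the least-recently-used eviction policy, and the pickle file layout are all assumptions made for illustration, and plain Python lists stand in for GPU-resident Gaussians.

```python
import os
import pickle
import tempfile
from collections import OrderedDict


class ChunkStore:
    """Hypothetical chunked store: at most `max_active` chunks stay in memory."""

    def __init__(self, chunk_size, max_active, directory):
        if max_active < 1:
            raise ValueError("max_active must be at least 1")
        self.chunk_size = chunk_size
        self.max_active = max_active
        self.directory = directory
        self.active = OrderedDict()  # chunk key -> list of points, oldest first

    def key_for(self, point):
        """Map a 3D point to the integer grid cell (chunk) that contains it."""
        return tuple(int(c // self.chunk_size) for c in point)

    def _path(self, key):
        return os.path.join(self.directory, "chunk_%d_%d_%d.pkl" % key)

    def _load(self, key):
        """Return the chunk's point list, reading it from disk if needed."""
        if key in self.active:
            self.active.move_to_end(key)  # mark as most recently used
            return self.active[key]
        path = self._path(key)
        if os.path.exists(path):
            with open(path, "rb") as handle:
                data = pickle.load(handle)
        else:
            data = []
        self.active[key] = data
        # Spill the least recently used chunks to disk (assumed policy).
        while len(self.active) > self.max_active:
            old_key, old_data = self.active.popitem(last=False)
            with open(self._path(old_key), "wb") as handle:
                pickle.dump(old_data, handle)
        return data

    def add(self, point):
        self._load(self.key_for(point)).append(point)

    def points_in(self, key):
        return list(self._load(key))
```

A store created with `ChunkStore(chunk_size=10.0, max_active=1, directory=tempfile.mkdtemp())` keeps a single chunk in memory and transparently reloads any other chunk when it is queried again.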

[CV-54] Geodiffussr: Generative Terrain Texturing with Elevation Fidelity

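[Quick read]: This paper addresses the fact that large-scale terrain generation remains labor-intensive in computer graphics. It proposes Geodiffussr, a flow-matching method for synthesizing text-guided texture maps. The key innovation is multi-scale content aggregation (MCA): Digital Elevation Map (DEM) features extracted by a pretrained encoder are injected into UNet blocks at multiple resolutions to enforce global-to-local elevation consistency. This design markedly improves visual fidelity and strengthens the coupling between height and appearance. Compared with a non-MCA baseline, FID drops by 49.16%, LPIPS drops by 32.33%, and ΔdCor falls to 0.0016, providing controllable 2.5D landscape generation for coarse-scale ideation and previsualization.

Link: https://arxiv.org/abs/2511.23029
Authors: Tai Inui,Alexander Matsumura,Edgar Simo-Serra
Affiliations: Waseda University; Rikka Inc.
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: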

Abstract:Large-scale terrain generation remains a labor-intensive task in computer graphics. We introduce Geodiffussr, a flow-matching pipeline that synthesizes text-guided texture maps while strictly adhering to a supplied Digital Elevation Map (DEM). The core mechanism is multi-scale content aggregation (MCA): DEM features from a pretrained encoder are injected into UNet blocks at multiple resolutions to enforce global-to-local elevation consistency. Compared with a non-MCA baseline, MCA markedly improves visual fidelity and strengthens height-appearance coupling (FID ↓ 49.16%, LPIPS ↓ 32.33%, ΔdCor ↓ to 0.0016). To train and evaluate Geodiffussr, we assemble a globally distributed, biome- and climate-stratified corpus of triplets pairing SRTM-derived DEMs with Sentinel-2 imagery and vision-grounded natural-language captions that describe visible land cover. We position Geodiffussr as a strong baseline and step toward controllable 2.5D landscape generation for coarse-scale ideation and previz, complementary to physically based terrain and ecosystem simulators.
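
The height-appearance coupling metric can be made concrete with a sketch, assuming that "dCor" denotes sample distance correlation (the abstract does not expand the abbreviation, and how the paper forms the Δ is not stated). The function below computes distance correlation for two 1-D sequences, for example per-pixel elevation and a per-pixel appearance statistic; both input choices are illustrative assumptions.

```python
import math


def _centered_distances(values):
    """Double-centered matrix of pairwise absolute differences."""
    n = len(values)
    dist = [[abs(a - b) for b in values] for a in values]
    row_mean = [sum(row) / n for row in dist]  # equals column means (symmetric)
    grand_mean = sum(row_mean) / n
    return [[dist[i][j] - row_mean[i] - row_mean[j] + grand_mean
             for j in range(n)] for i in range(n)]


def distance_correlation(xs, ys):
    """Sample distance correlation between two equally long 1-D sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    a = _centered_distances(xs)
    b = _centered_distances(ys)
    dcov2 = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n ** 2
    dvar_x = sum(v * v for row in a for v in row) / n ** 2
    dvar_y = sum(v * v for row in b for v in row) / n ** 2
    if dvar_x == 0 or dvar_y == 0:
        return 0.0  # a constant sequence carries no dependence
    return math.sqrt(max(dcov2, 0.0) / math.sqrt(dvar_x * dvar_y))
```

A value near 1 indicates strong dependence between the two sequences, and a value of 0 is returned when either sequence is constant.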

[CV-55] JarvisEvo: Towards a Self-Evolving Photo Editing Agent with Synergistic Editor-Evaluator Optimization

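[Quick read]: This paper addresses two key problems in agent-based image editing models. The first is instruction hallucination: text-only chain-of-thought (CoT) reasoning cannot fully prevent factual errors because of an inherent information bottleneck. The second is reward hacking: dynamic policy optimization against a static reward model lets the agent exploit flaws in the reward function. The core of the solution is JarvisEvo, a unified image editing agent that emulates an expert human designer. Its key contributions are: (1) an interleaved multimodal chain-of-thought (iMCoT) mechanism that improves instruction following and editing quality; (2) a synergistic editor-evaluator policy optimization (SEPO) framework that enables self-improvement without external rewards and mitigates reward hacking; and (3) support for both global and local fine-grained editing through integration with Adobe Lightroom.

Link: https://arxiv.org/abs/2511.23002
Authors: Yunlong Lin,Linqing Wang,Kunjie Lin,Zixu Lin,Kaixiong Gong,Wenbo Li,Bin Lin,Zhenxi Li,Shiyi Zhang,Yuyang Peng,Wenxun Dai,Xinghao Ding,Chunyu Wang,Qinglin Lu
Affiliations: Tencent Hunyuan; Xiamen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 31 pages, 18 figures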

Abstract:Agent-based editing models have substantially advanced interactive experiences, processing quality, and creative flexibility. However, two critical challenges persist: (1) instruction hallucination, text-only chain-of-thought (CoT) reasoning cannot fully prevent factual errors due to inherent information bottlenecks; (2) reward hacking, dynamic policy optimization against static reward models allows agents to exploit flaws in reward functions. To address these issues, we propose JarvisEvo, a unified image editing agent that emulates an expert human designer by iteratively editing, selecting appropriate tools, evaluating results, and reflecting on its own decisions to refine outcomes. JarvisEvo offers three key advantages: (1) an interleaved multimodal chain-of-thought (iMCoT) reasoning mechanism that enhances instruction following and editing quality; (2) a synergistic editor-evaluator policy optimization (SEPO) framework that enables self-improvement without external rewards, effectively mitigating reward hacking; and (3) support for both global and local fine-grained editing through seamless integration of Adobe Lightroom. On ArtEdit-Bench, JarvisEvo outperforms Nano-Banana by an average of 18.95% on preservative editing metrics, including a substantial 44.96% improvement in pixel-level content fidelity.

[CV-56] MrGS: Multi-modal Radiance Fields with 3D Gaussian Splatting for RGB-Thermal Novel View Synthesis ICRA2025

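[Quick read]: This paper addresses the under-explored fusion of thermal infrared and visible (RGB) imagery in multi-modal scene reconstruction, and in particular the tendency of existing methods to neglect physical characteristics of thermal imaging such as heat conduction and the Lambertian property. The key to its solution is MrGS, a multi-modal radiance field based on 3D Gaussian Splatting (3DGS). MrGS separates RGB and thermal information from a single appearance feature through orthogonal feature extraction, and uses view-dependent or view-independent embedding strategies according to the degree of Lambertian reflectance of each modality. It also introduces two physical priors. First, Fourier's law of heat conduction is applied before alpha blending to model the intensity interpolation caused by heat conduction between neighboring Gaussians. Second, the Stefan-Boltzmann law and the inverse-square law are combined to build a depth-aware thermal radiation map that imposes geometric constraints on thermal rendering. Together these yield high-fidelity RGB-T scene reconstruction with fewer Gaussians.

Link: https://arxiv.org/abs/2511.22997
Authors: Minseong Kweon,Janghyun Kim,Ukcheol Shin,Jinsun Park
Affiliations: Minnesota Robotics Institute (MnRI); University of Minnesota, Twin Cities; Department of Information Convergence Engineering (Artificial Intelligence Major); Pusan National University; Department of Energy Engineering, Korea Institute of Energy Technology (KENTECH); School of Computer Science and Engineering
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted at Thermal Infrared in Robotics (TIRO) Workshop, ICRA 2025 (Best Poster Award)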

Abstract:Recent advances in Neural Radiance Fields (NeRFs) and 3D Gaussian Splatting (3DGS) have achieved considerable performance in RGB scene reconstruction. However, multi-modal rendering that incorporates thermal infrared imagery remains largely underexplored. Existing approaches tend to neglect distinctive thermal characteristics, such as heat conduction and the Lambertian property. In this study, we introduce MrGS, a multi-modal radiance field based on 3DGS that simultaneously reconstructs both RGB and thermal 3D scenes. Specifically, MrGS derives RGB- and thermal-related information from a single appearance feature through orthogonal feature extraction and employs view-dependent or view-independent embedding strategies depending on the degree of Lambertian reflectance exhibited by each modality. Furthermore, we leverage two physics-based principles to effectively model thermal-domain phenomena. First, we integrate Fourier’s law of heat conduction prior to alpha blending to model intensity interpolation caused by thermal conduction between neighboring Gaussians. Second, we apply the Stefan-Boltzmann law and the inverse-square law to formulate a depth-aware thermal radiation map that imposes additional geometric constraints on thermal rendering. Experimental results demonstrate that the proposed MrGS achieves high-fidelity RGB-T scene reconstruction while reducing the number of Gaussians.
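
The two physical laws behind the depth-aware thermal radiation map can be written down directly. The sketch below only states the textbook forms of the Stefan-Boltzmann law and the inverse-square law; how MrGS combines them into a rendering constraint is not described in the abstract, and the function names are hypothetical.

```python
import math

STEFAN_BOLTZMANN = 5.670374419e-8  # W m^-2 K^-4 (CODATA value)


def radiant_exitance(temperature_k, emissivity=1.0):
    """Stefan-Boltzmann law: power radiated per unit area of a grey body, in W/m^2."""
    return emissivity * STEFAN_BOLTZMANN * temperature_k ** 4


def irradiance_at_distance(power_w, distance_m):
    """Inverse-square law for an isotropic point source, in W/m^2."""
    if distance_m <= 0:
        raise ValueError("distance must be positive")
    return power_w / (4.0 * math.pi * distance_m ** 2)
```

Doubling the absolute temperature multiplies the radiated power by 16, while doubling the distance to the source divides the received irradiance by 4, which is why depth carries information about observed thermal intensity.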

[CV-57] Optimizer Sensitivity in Vision Transformer-based Iris Recognition: AdamW vs SGD vs RMSprop

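[Quick read]: This paper addresses the unclear effect of optimizer choice on the performance of Vision Transformer (ViT)-based iris recognition systems. Although ViT performs well on visual recognition tasks, its optimization in biometric recognition has not been studied systematically, especially with respect to accuracy and stability. The key to the solution is an experimental evaluation of how several optimizers affect a ViT architecture on iris recognition, providing empirical evidence and optimization guidance for more robust biometric identification models.

Link: https://arxiv.org/abs/2511.22994
Authors: Moh Imam Faiz,Aviv Yuniar Rahman,Rangga Pahlevi Putra
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation (stat.CO)
Comments: 16 pages, 5 figures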

Abstract:The security of biometric authentication is increasingly critical as digital identity systems expand. Iris recognition offers high reliability due to its distinctive and stable texture patterns. Recent progress in deep learning, especially Vision Transformers (ViT), has improved visual recognition performance. Yet, the effect of optimizer choice on ViT-based biometric systems remains understudied. This work evaluates how different optimizers influence the accuracy and stability of ViT for iris recognition, providing insights to enhance the robustness of biometric identification models.

[CV-58] Guiding Visual Autoregressive Models through Spectrum Weakening

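[Quick read]: This paper addresses the difficulty current conditional generative models have in balancing unconditional generation quality against condition alignment, and in particular the lack of a general, training-free guidance mechanism for visual autoregressive (AR) models. The key to its solution is a spectrum-weakening framework that strengthens guidance by constructing a controllable weak model in the spectral domain. Invertible spectral transformations preserve information, while selectively retaining only part of the spectrum introduces a controlled reduction in information. Spectrum selection along the channel dimension avoids the structural constraints specific to diffusion models, and two spectrum renormalization strategies ensure numerical stability. The method needs no re-training, specific conditions, or architectural modifications, and it delivers high-quality unconditional generation and strongly prompt-aligned conditional generation on both discrete and continuous AR models.

Link: https://arxiv.org/abs/2511.22991
Authors: Chaoyang Wang,Tianmeng Yang,Jingdong Wang,Yunhai Tong
Affiliations: Peking University; Baidu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: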

Abstract:Classifier-free guidance (CFG) has become a widely adopted and practical approach for enhancing generation quality and improving condition alignment. Recent studies have explored guidance mechanisms for unconditional generation, yet these approaches remain fundamentally tied to assumptions specific to diffusion models. In this work, we propose a spectrum-weakening framework for visual autoregressive (AR) models. This method works without the need for re-training, specific conditions, or any architectural modifications. It achieves this by constructing a controllable weak model in the spectral domain. We theoretically show that invertible spectral transformations preserve information, while selectively retaining only a subset of spectrum introduces controlled information reduction. Based on this insight, we perform spectrum selection along the channel dimension of internal representations, which avoids the structural constraints imposed by diffusion models. We further introduce two spectrum renormalization strategies that ensure numerical stability during the weakening process. Extensive experiments were conducted on both discrete and continuous AR models, with text or class conditioning. The results demonstrate that our method enables high-quality unconditional generation while maintaining strong prompt alignment for conditional generation.
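
The claim that an invertible spectral transform preserves information while keeping only part of the spectrum removes it in a controlled way can be checked on a toy channel vector. The sketch below is an illustration under stated assumptions, not the paper's method: the orthonormal DCT, the "keep the first `keep` coefficients" selection rule, and the norm-matching renormalization are all choices made here, since the abstract does not specify the transform or the two renormalization strategies.

```python
import math


def _scale(k, n):
    """Normalization factor that makes the DCT orthonormal."""
    return math.sqrt((1.0 if k == 0 else 2.0) / n)


def dct(x):
    """Orthonormal DCT-II, used here as an example of an invertible spectral transform."""
    n = len(x)
    return [_scale(k, n) * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n)
                               for i in range(n))
            for k in range(n)]


def idct(c):
    """Inverse of `dct` (orthonormal DCT-III)."""
    n = len(c)
    return [sum(_scale(k, n) * c[k] * math.cos(math.pi * (i + 0.5) * k / n)
                for k in range(n))
            for i in range(n)]


def weaken(channels, keep):
    """Keep the first `keep` spectral coefficients, then restore the input L2 norm."""
    coeffs = dct(channels)
    kept = [c if k < keep else 0.0 for k, c in enumerate(coeffs)]
    weak = idct(kept)
    norm_in = math.sqrt(sum(v * v for v in channels))
    norm_out = math.sqrt(sum(v * v for v in weak))
    if norm_out == 0.0:
        return weak  # nothing to rescale
    return [v * norm_in / norm_out for v in weak]
```

With `keep` equal to the number of channels the input is recovered unchanged, and with `keep=1` only the mean level survives, so `keep` acts as a dial on how much information the weak representation retains.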

[CV-59] MIMM-X: Disentangling Spurious Correlations for Medical Image Analysis

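[Quick read]: This paper addresses shortcut learning in deep learning models for medical tasks, where multiple spurious correlations impair generalization to new environments. In medical imaging in particular, several spurious correlations can coexist, and the resulting misclassifications can have severe consequences. The key to its solution is MIMM-X, a framework that disentangles causal features from multiple spurious correlations by minimizing their mutual information, so that predictions rest on true underlying causal relationships rather than dataset-specific shortcuts.

Link: https://arxiv.org/abs/2511.22990
Authors: Louisa Fay,Hajer Reguigui,Bin Yang,Sergios Gatidis,Thomas Küstner
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: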

Abstract:Deep learning models can excel on medical tasks, yet often experience spurious correlations, known as shortcut learning, leading to poor generalization in new environments. Particularly in medical imaging, where multiple spurious correlations can coexist, misclassifications can have severe consequences. We propose MIMM-X, a framework that disentangles causal features from multiple spurious correlations by minimizing their mutual information. It enables predictions based on true underlying causal relationships rather than dataset-specific shortcuts. We evaluate MIMM-X on three datasets (UK Biobank, NAKO, CheXpert) across two imaging modalities (MRI and X-ray). Results demonstrate that MIMM-X effectively mitigates shortcut learning of multiple spurious correlations.
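
The quantity being minimized, mutual information, can be illustrated on discrete variables. The sketch below is a plug-in estimate over label sequences and is only meant to show what "zero mutual information" means; MIMM-X operates on learned features, and the estimator it uses is not described in the abstract.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of mutual information (in nats) between two label sequences."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    total = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written with raw counts
        total += (count / n) * math.log(count * n / (count_x[x] * count_y[y]))
    return total
```

If a causal label and a spurious attribute (for example, a scanner identifier) are statistically independent, the estimate is 0; if one fully determines the other, it equals the entropy of the label.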

[CV-60] MultiBanana: A Challenging Benchmark for Multi-Reference Text-to-Image Generation

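[Quick read]: This paper addresses the lack of a systematic evaluation benchmark for text-to-image models in multi-reference generation. Existing datasets usually cover generation and editing with only one or a few reference images, which makes it hard to measure progress or expose weaknesses under different multi-reference conditions, and their task definitions are too vague to capture the intrinsic difficulty of multi-reference settings. The authors therefore propose the MultiBanana benchmark, whose key feature is broad coverage of multi-reference-specific problems along five dimensions: (1) varying the number of references, (2) domain mismatch among references (for example, photo vs. anime), (3) scale mismatch between reference and target scenes, (4) references containing rare concepts (for example, a red banana), and (5) multilingual textual references for rendering. This design makes it possible to analyze the limits of model capability and typical failure modes, and provides a standardized, scalable basis for evaluating multi-reference image generation.

Link: https://arxiv.org/abs/2511.22989
Authors: Yuta Oshima,Daiki Miyake,Kohsei Matsutani,Yusuke Iwasawa,Masahiro Suzuki,Yutaka Matsuo,Hiroki Furuta
Affiliations: The University of Tokyo; Google DeepMind
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code: this https URL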

Abstract:Recent text-to-image generation models have acquired the ability of multi-reference generation and editing; the ability to inherit the appearance of subjects from multiple reference images and re-render them under new contexts. However, the existing benchmark datasets often focus on the generation with single or a few reference images, which prevents us from measuring the progress on how model performance advances or pointing out their weaknesses, under different multi-reference conditions. In addition, their task definitions are still vague, typically limited to axes such as “what to edit” or “how many references are given”, and therefore fail to capture the intrinsic difficulty of multi-reference settings. To address this gap, we introduce MultiBanana, which is carefully designed to assess the edge of model capabilities by widely covering multi-reference-specific problems at scale: (1) varying the number of references, (2) domain mismatch among references (e.g., photo vs. anime), (3) scale mismatch between reference and target scenes, (4) references containing rare concepts (e.g., a red banana), and (5) multilingual textual references for rendering. Our analysis among a variety of text-to-image models reveals their superior performances, typical failure modes, and areas for improvement. MultiBanana will be released as an open benchmark to push the boundaries and establish a standardized basis for fair comparison in multi-reference image generation. Our data and code are available at this https URL.

[CV-61] Convolutional Feature Noise Reduction for 2D Cardiac MR Image Segmentation

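[Quick read]: This paper addresses the "butterfly effect" that arises when noise in convolutional features goes unsuppressed inside segmentation networks and degrades downstream results throughout the feature system. The key to its solution is to treat convolutional features that follow a Gaussian distribution as feature signal matrices and to propose a simple, effective Convolutional Feature Filter (CFF), which is essentially a low-amplitude pass filter that minimizes noise in the feature signal inputs. Experiments show that CFF reduces noise in the feature signal matrices, and a purpose-built binarization equation computes the information entropy of the feature signals so that the reduction can be analyzed numerically.

Link: https://arxiv.org/abs/2511.22983
Authors: Hong Zheng,Nan Mu,Han Su,Lin Feng,Xiaoning Li
Affiliations: Sichuan Normal University; Southwest Jiaotong University; Sichuan Province
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: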

Abstract:Noise reduction constitutes a crucial operation within Digital Signal Processing. Regrettably, it frequently remains neglected when dealing with the processing of convolutional features in segmentation networks. This oversight could trigger the butterfly effect, impairing the subsequent outcomes within the entire feature system. To fill this void, we consider convolutional features following Gaussian distributions as feature signal matrices and then present a simple and effective feature filter in this study. The proposed filter is fundamentally a low-amplitude pass filter primarily aimed at minimizing noise in feature signal inputs and is named Convolutional Feature Filter (CFF). We conducted experiments on two established 2D segmentation networks and two public cardiac MR image datasets to validate the effectiveness of the CFF, and the experimental findings demonstrated a decrease in noise within the feature signal matrices. To enable a numerical observation and analysis of this reduction, we developed a binarization equation to calculate the information entropy of feature signals.

[CV-62] Ovis-Image Technical Report

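[Quick read]: This paper addresses the difficulty of deploying high-fidelity text rendering under limited compute: how to approach the text rendering quality of leading closed-source systems without relying on oversized or proprietary models. The key to its solution is to combine the strong Ovis 2.5 multimodal backbone with a diffusion-based visual decoder and a text-centric training pipeline of large-scale pre-training followed by carefully tailored post-training refinement, yielding high-quality bilingual text rendering on a single high-end GPU with moderate memory.

Link: https://arxiv.org/abs/2511.22982
Authors: Guo-Hua Wang,Liangfu Cao,Tianyu Cui,Minghao Fu,Xiaohao Chen,Pengxin Zhan,Jianshan Zhao,Lan Li,Bowen Fu,Jiaqi Liu,Qing-Guo Chen
Affiliations: Alibaba Group
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Code is released at this https URL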

Abstract:We introduce Ovis-Image, a 7B text-to-image model specifically optimized for high-quality text rendering, designed to operate efficiently under stringent computational constraints. Built upon our previous Ovis-U1 framework, Ovis-Image integrates a diffusion-based visual decoder with the stronger Ovis 2.5 multimodal backbone, leveraging a text-centric training pipeline that combines large-scale pre-training with carefully tailored post-training refinements. Despite its compact architecture, Ovis-Image achieves text rendering performance on par with significantly larger open models such as Qwen-Image and approaches closed-source systems like Seedream and GPT4o. Crucially, the model remains deployable on a single high-end GPU with moderate memory, narrowing the gap between frontier-level text rendering and practical deployment. Our results indicate that combining a strong multimodal backbone with a carefully designed, text-focused training recipe is sufficient to achieve reliable bilingual text rendering without resorting to oversized or proprietary models.

[CV-63] McSc: Motion-Corrective Preference Alignment for Video Generation with Self-Critic Hierarchical Reasoning

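[Quick read]: This paper addresses two challenges in aligning text-to-video (T2V) generation models with human preference. First, existing methods depend on costly human annotation or proxy metrics and lack an understanding of the logic behind human preference. Second, current strategies usually align only with the overall preference distribution and ignore potentially conflicting dimensions such as motion dynamics and visual quality, which biases models toward low-motion content. The key to its solution is Motion-corrective alignment with Self-critic hierarchical Reasoning (McSc), a three-stage reinforcement learning framework. Its core contributions are: (1) Self-critic Dimensional Reasoning (ScDR), which trains a generative reward model (RM) to decompose preference into per-dimension assessments using self-critic reasoning chains; (2) Hierarchical Comparative Reasoning (HCR), which introduces structured multi-dimensional reasoning with hierarchical reward supervision to improve holistic video comparison; and (3) Motion-corrective Direct Preference Optimization (McDPO), which optimizes the T2V model on RM-preferred videos while dynamically re-weighting the alignment objective to mitigate the bias toward low-motion content. Experiments show that McSc performs strongly in both human preference alignment and the generation of high-motion videos.

Link: https://arxiv.org/abs/2511.22974
Authors: Qiushi Yang,Yingjie Chen,Yuan Yao,Yifang Men,Huaizhuo Liu,Miaomiao Cui
Affiliations: Tongyi Lab, Alibaba Group
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: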

Abstract:Text-to-video (T2V) generation has achieved remarkable progress in producing high-quality videos aligned with textual prompts. However, aligning synthesized videos with nuanced human preference remains challenging due to the subjective and multifaceted nature of human judgment. Existing video preference alignment methods rely on costly human annotations or utilize proxy metrics to predict preference, which lack an understanding of human preference logic. Moreover, they usually directly align T2V models with the overall preference distribution, ignoring potential conflict dimensions like motion dynamics and visual quality, which may bias models towards low-motion content. To address these issues, we present Motion-corrective alignment with Self-critic hierarchical Reasoning (McSc), a three-stage reinforcement learning framework for robust preference modeling and alignment. Firstly, Self-critic Dimensional Reasoning (ScDR) trains a generative reward model (RM) to decompose preferences into per-dimension assessments, using self-critic reasoning chains for reliable learning. Secondly, to achieve holistic video comparison, we introduce Hierarchical Comparative Reasoning (HCR) for structural multi-dimensional reasoning with hierarchical reward supervision. Finally, using RM-preferred videos, we propose Motion-corrective Direct Preference Optimization (McDPO) to optimize T2V models, while dynamically re-weighting the alignment objective to mitigate bias towards low-motion content. Experiments show that McSc achieves superior performance in human preference alignment and generates videos with high-motion dynamics.
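
McDPO builds on Direct Preference Optimization, so a sketch of the standard DPO loss for a single preference pair may help. The `weight` argument below is a hypothetical stand-in for the dynamic re-weighting the abstract mentions; the actual re-weighting rule is not given there, and the inputs are assumed to be sequence log-probabilities under the policy and a frozen reference model.

```python
import math


def dpo_loss(policy_win, policy_lose, ref_win, ref_lose, beta=0.1, weight=1.0):
    """Standard DPO loss for one (preferred, rejected) pair of log-probabilities.

    `weight` is a hypothetical per-pair factor, for example larger for
    high-motion preferred videos; it is not the paper's formula.
    """
    margin = beta * ((policy_win - ref_win) - (policy_lose - ref_lose))
    # -log(sigmoid(margin)) computed in a numerically stable way
    if margin >= 0:
        loss = math.log1p(math.exp(-margin))
    else:
        loss = -margin + math.log1p(math.exp(margin))
    return weight * loss
```

When the policy matches the reference the loss is log 2, and it falls as the policy raises the preferred sample's log-probability relative to the rejected one.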

[CV-64] BlockVid: Block Diffusion for High-Quality and Consistent Minute-Long Video Generation

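[Quick read]: This paper addresses two challenges that current semi-autoregressive video generation methods face when producing minute-long videos: long-horizon error accumulation induced by the KV cache, and the lack of fine-grained long-video benchmarks and coherence-aware metrics. The key to its solution is the BlockVid framework, which combines a semantic-aware sparse KV cache, a training strategy called Block Forcing, and chunk-wise noise scheduling and shuffling to reduce error propagation and improve temporal consistency. The authors also build LV-Bench, a fine-grained benchmark for minute-long videos, with new metrics for long-range coherence. BlockVid improves the quality and coherence of generated videos and outperforms existing state-of-the-art methods on several metrics.

Link: https://arxiv.org/abs/2511.22973
Authors: Zeyu Zhang,Shuning Chang,Yuanyu He,Yizeng Han,Jiasheng Tang,Fan Wang,Bohan Zhuang
Affiliations: Alibaba Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: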

Abstract:Generating minute-long videos is a critical step toward developing world models, providing a foundation for realistic extended scenes and advanced AI simulators. The emerging semi-autoregressive (block diffusion) paradigm integrates the strengths of diffusion and autoregressive models, enabling arbitrary-length video generation and improving inference efficiency through KV caching and parallel sampling. However, it still faces two enduring challenges: (i) KV-cache-induced long-horizon error accumulation, and (ii) the lack of fine-grained long-video benchmarks and coherence-aware metrics. To overcome these limitations, we propose BlockVid, a novel block diffusion framework equipped with semantic-aware sparse KV cache, an effective training strategy called Block Forcing, and dedicated chunk-wise noise scheduling and shuffling to reduce error propagation and enhance temporal consistency. We further introduce LV-Bench, a fine-grained benchmark for minute-long videos, complete with new metrics evaluating long-range coherence. Extensive experiments on VBench and LV-Bench demonstrate that BlockVid consistently outperforms existing methods in generating high-quality, coherent minute-long videos. In particular, it achieves a 22.2% improvement on VDE Subject and a 19.4% improvement on VDE Clarity in LV-Bench over the state-of-the-art approaches. Project website: this https URL. Inferix (Code): this https URL.

[CV-65] Taming the Light: Illumination-Invariant Semantic 3DGS-SLAM

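[Quick read]: This paper addresses the loss of 3D map reconstruction and semantic segmentation accuracy under extreme lighting, which is especially damaging to tightly coupled semantic SLAM (Simultaneous Localization and Mapping) systems. The key to its solution is two designs. First, an Intrinsic Appearance Normalization (IAN) module proactively disentangles the scene's intrinsic properties, such as albedo, from transient lighting and learns a standardized, illumination-invariant appearance model, assigning a stable and consistent color representation to each Gaussian primitive. Second, a Dynamic Radiance Balancing Loss (DRB-Loss) activates only when an image's exposure is poor and operates directly on the radiance field to guide targeted optimization, preventing error accumulation without compromising performance under normal lighting. IAN's proactive invariance and DRB-Loss's reactive correction work together to improve the system's robustness.

Link: https://arxiv.org/abs/2511.22968
Authors: Shouhe Zhang,Dayong Ren,Sensen Song,Yurong Qian,Zhenhong Jia
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: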

Abstract:Extreme exposure degrades both the 3D map reconstruction and semantic segmentation accuracy, which is particularly detrimental to tightly-coupled systems. To achieve illumination invariance, we propose a novel semantic SLAM framework with two designs. First, the Intrinsic Appearance Normalization (IAN) module proactively disentangles the scene’s intrinsic properties, such as albedo, from transient lighting. By learning a standardized, illumination-invariant appearance model, it assigns a stable and consistent color representation to each Gaussian primitive. Second, the Dynamic Radiance Balancing Loss (DRB-Loss) reactively handles frames with extreme exposure. It activates only when an image’s exposure is poor, operating directly on the radiance field to guide targeted optimization. This prevents error accumulation from extreme lighting without compromising performance under normal conditions. The synergy between IAN’s proactive invariance and DRB-Loss’s reactive correction endows our system with unprecedented robustness. Evaluations on public datasets demonstrate state-of-the-art performance in camera tracking, map quality, and semantic and geometric accuracy.

[CV-66] HMR3D: Hierarchical Multimodal Representation for 3D Scene Understanding with Large Vision-Language Model

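[Quick read]: This paper addresses the weak implicit alignment in current vision-language model (VLM)-based 3D scene understanding methods, which stems from the scarcity of 3D data and the complexity of 3D spatial relationships. The key to its solution is a hierarchical multimodal representation that aligns with the VLM explicitly in the input space. Text descriptions capture spatial relationships by referencing the 3D coordinates of detected objects, while multi-view images, consisting of a top-down view and four directional views (forward, left, right, and backward), provide comprehensive scene coverage. A hierarchical feature representation aggregates patch-level image features into view-level and scene-level representations, supporting joint reasoning over local and global scene context.

Link: https://arxiv.org/abs/2511.22961
Authors: Chen Li,Eric Peh,Basura Fernando
Affiliations: Institute of High-Performance Computing, Agency for Science, Technology and Research, Singapore; Centre for Frontier AI Research, Agency for Science, Technology and Research, Singapore; College of Computing and Data Science, Nanyang Technological University, Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: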

Abstract:Recent advances in large vision-language models (VLMs) have shown significant promise for 3D scene understanding. Existing VLM-based approaches typically align 3D scene features with the VLM’s embedding space. However, this implicit alignment often yields suboptimal performance due to the scarcity of 3D data and the inherent complexity of spatial relationships in 3D environments. To address these limitations, we propose a novel hierarchical multimodal representation for 3D scene reasoning that explicitly aligns with VLMs at the input space by leveraging both multi-view images and text descriptions. The text descriptions capture spatial relationships by referencing the 3D coordinates of detected objects, while the multi-view images include a top-down perspective and four directional views (forward, left, right, and backward), ensuring comprehensive scene coverage. Additionally, we introduce a hierarchical feature representation that aggregates patch-level image features into view-level and scene-level representations, enabling the model to reason over both local and global scene context. Experimental results on both situated 3D QA and general 3D QA benchmarks demonstrate the effectiveness of our approach.

[CV-67] Contrastive Heliophysical Image Pretraining for Solar Dynamics Observatory Records

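[Quick read]: This paper addresses the limits on deep learning performance in solar image analysis caused by differences between instrument modalities, weak inter-class separability, and strong intra-class variability, which are most severe when labeled data is scarce. The key to its solution is SolarCHIP, a family of contrastively pretrained visual backbones for multi-instrument SDO observations. A multi-granularity contrastive objective jointly optimizes three levels: (1) aligning global class tokens across co-temporal AIA-HMI pairs to enhance temporal discrimination; (2) aligning local patch tokens at fixed spatial indices to enforce position-consistent, modality-invariant features; and (3) contrasting patches at different spatial locations within the same sample to preserve fine-grained spatial structure. The method achieves state-of-the-art results on cross-modal translation and full-disk flare classification, with particularly strong gains in low-resource settings.

Link: https://arxiv.org/abs/2511.22958
Authors: Shiyu Shen,Zhe Gao,Taifeng Chai,Yang Huang,Bin Pan
Affiliations: Nankai University; Zhejiang Lab; Nanjing University; Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: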

Abstract:Deep learning has revolutionized solar image analysis, yet most approaches train task-specific encoders from scratch or rely on natural-image pretraining that ignores the unique characteristics of Solar Dynamics Observatory (SDO) data. We introduce SolarCHIP, a family of contrastively pretrained visual backbones tailored to multi-instrument SDO observations. SolarCHIP addresses three key challenges in solar imaging: multimodal sensing across AIA and HMI instruments, weak inter-class separability due to slow temporal evolution, and strong intra-class variability with sparse activity signals. Our pretraining framework employs a multi-granularity contrastive objective that jointly aligns (1) global class tokens across co-temporal AIA-HMI pairs to enhance temporal discrimination, (2) local patch tokens at fixed spatial indices to enforce position-consistent, modality-invariant features, and (3) intra-sample patches across different spatial locations to preserve fine-grained spatial structure. We train both CNN- and Vision Transformer-based autoencoders and demonstrate their effectiveness on two downstream tasks: cross-modal translation between HMI and AIA passbands via ControlNet, and full-disk flare classification. Experimental results show that SolarCHIP achieves state-of-the-art performance across both tasks, with particularly strong gains in low-resource settings where labeled data is limited. Ablation studies confirm that each contrastive component contributes essential discriminative capacity at different granularities. By publicly releasing pretrained weights and training code, we provide the heliophysics community with a practical, plug-and-play feature extractor that reduces computational requirements, improves label efficiency, and establishes a reusable foundation for diverse solar imaging applications.
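
The first of the three contrastive terms, aligning global tokens across co-temporal AIA-HMI pairs, resembles a standard InfoNCE objective. The sketch below shows that generic loss on plain Python vectors; treating `anchors` as AIA embeddings and `positives` as HMI embeddings from the same timestamps is an assumption for illustration, and the exact loss used by SolarCHIP is not given in the abstract.

```python
import math


def _cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss where anchors[i] and positives[i] form the positive pair."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [_cosine(anchor, p) / temperature for p in positives]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_norm - logits[i])
    return sum(losses) / len(losses)
```

The loss is small when each anchor is most similar to its own co-temporal partner and large when it is more similar to an embedding from a different timestamp.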

[CV-68] RobotSeg: A Model and Dataset for Segmenting Robots in Image and Video

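[Quick read]: This paper tackles accurate robot segmentation in images and videos, which is difficult because of the diversity of robot embodiments, appearance ambiguity, structural complexity, and rapid shape changes. The authors propose RobotSeg, a foundation model built on SAM 2, with three key contributions: 1) a structure-enhanced memory associator that improves structural awareness of articulated robots; 2) a robot prompt generator that produces prompts automatically, removing the reliance on manual prompts; and 3) a label-efficient training strategy that reduces the need for per-frame mask annotations. Together these make RobotSeg a structure-aware, fully automatic, and label-efficient solution. The authors also construct the video robot segmentation (VRS) dataset of over 2.8k videos (138k frames) and report state-of-the-art performance on both images and videos.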

Link: https://arxiv.org/abs/2511.22950
Authors: Haiyang Mei, Qiming Huang, Hai Ci, Mike Zheng Shou
Institutions: Show Lab, National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Project page: this https URL

Abstract:Accurate robot segmentation is a fundamental capability for robotic perception. It enables precise visual servoing for VLA systems, scalable robot-centric data augmentation, accurate real-to-sim transfer, and reliable safety monitoring in dynamic human-robot environments. Despite the strong capabilities of modern segmentation models, surprisingly it remains challenging to segment robots. This is due to robot embodiment diversity, appearance ambiguity, structural complexity, and rapid shape changes. Embracing these challenges, we introduce RobotSeg, a foundation model for robot segmentation in image and video. RobotSeg is built upon the versatile SAM 2 foundation model but addresses its three limitations for robot segmentation, namely the lack of adaptation to articulated robots, reliance on manual prompts, and the need for per-frame training mask annotations, by introducing a structure-enhanced memory associator, a robot prompt generator, and a label-efficient training strategy. These innovations collectively enable a structure-aware, automatic, and label-efficient solution. We further construct the video robot segmentation (VRS) dataset comprising over 2.8k videos (138k frames) with diverse robot embodiments and environments. Extensive experiments demonstrate that RobotSeg achieves state-of-the-art performance on both images and videos, establishing a strong foundation for future advances in robot perception.

[CV-69] Do We Need Perfect Data? Leveraging Noise for Domain Generalized Segmentation AAAI2026

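[Quick read]: This paper addresses domain generalization in semantic segmentation, in particular the performance drop caused by domain shift under adverse conditions. Diffusion-based data generation is promising, but it introduces an inherent misalignment between generated images and their semantic masks, which is usually treated as a defect. The proposed FLEX-Seg framework turns this defect into an opportunity for robust learning through three components: Granular Adaptive Prototypes, which model boundary characteristics at multiple scales; Uncertainty Boundary Emphasis, which dynamically adjusts the learning emphasis according to prediction entropy; and Hardness-Aware Sampling, which progressively focuses on difficult examples. By exploiting the imperfection of generated data instead of enforcing strict alignment, FLEX-Seg improves on state-of-the-art methods across five real-world datasets, with mIoU gains of 2.44% and 2.63% on ACDC and Dark Zurich.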

Link: https://arxiv.org/abs/2511.22948
Authors: Taeyeong Kim, SeungJoon Lee, Jung Uk Kim, MyeongAh Cho
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2026

Abstract:Domain generalization in semantic segmentation faces challenges from domain shifts, particularly under adverse conditions. While diffusion-based data generation methods show promise, they introduce inherent misalignment between generated images and semantic masks. This paper presents FLEX-Seg (FLexible Edge eXploitation for Segmentation), a framework that transforms this limitation into an opportunity for robust learning. FLEX-Seg comprises three key components: (1) Granular Adaptive Prototypes that capture boundary characteristics across multiple scales, (2) Uncertainty Boundary Emphasis that dynamically adjusts learning emphasis based on prediction entropy, and (3) Hardness-Aware Sampling that progressively focuses on challenging examples. By leveraging inherent misalignment rather than enforcing strict alignment, FLEX-Seg learns robust representations while capturing rich stylistic variations. Experiments across five real-world datasets demonstrate consistent improvements over state-of-the-art methods, achieving 2.44% and 2.63% mIoU gains on ACDC and Dark Zurich. Our findings validate that adaptive strategies for handling imperfect synthetic data lead to superior domain generalization. Code is available at this https URL.
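
The idea of adjusting the learning emphasis by prediction entropy can be sketched as a per-pixel loss weight. The code below is a minimal illustration under stated assumptions, not the FLEX-Seg formulation: the weighting form `1 + alpha * normalized_entropy` and the names `entropy_weight` and `alpha` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    """Shannon entropy divided by log(K), so the result lies in [0, 1]."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))


def entropy_weight(logits, alpha=1.0):
    """Hypothetical weight: uncertain pixels (high entropy) get more emphasis."""
    return 1.0 + alpha * normalized_entropy(softmax(logits))


def weighted_cross_entropy(logits, target, alpha=1.0):
    """Cross-entropy for one pixel, scaled by the entropy-based weight."""
    probs = softmax(logits)
    return entropy_weight(logits, alpha) * -math.log(probs[target])
```

A pixel with uniform class probabilities receives weight 2.0 when `alpha=1.0`, while a confidently predicted pixel stays close to 1.0, so ambiguous regions such as object boundaries contribute more to the loss.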

[CV-70] One-to-All Animation: Alignment-Free Character Animation and Image Pose Transfer

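[Quick read]: This paper addresses a limitation of current diffusion-based pose-driven character animation: existing methods require spatially aligned reference-pose pairs with matched skeletal structures and degrade when the reference is misaligned or only partially visible. The proposed unified framework, One-to-All Animation, has four key elements. First, training is reformulated as a self-supervised outpainting task that converts references with arbitrary layouts into a unified occluded-input format, which handles spatial misalignment. Second, a reference extractor captures complete identity features from partially visible references. Third, hybrid reference fusion attention accommodates varying resolutions and dynamic sequence lengths. Finally, identity-robust pose control decouples appearance from skeletal structure to mitigate pose overfitting, and a token replace strategy improves coherence in long-video generation.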

Link: https://arxiv.org/abs/2511.22940
Authors: Shijun Shi, Jing Xu, Zhihang Li, Chunli Peng, Xiaoda Yang, Lijing Lu, Kai Hu, Jiangning Zhang
Institutions: Jiangnan University; University of Science and Technology of China; Chinese Academy of Sciences; Beijing University of Posts and Telecommunications; Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:Recent advances in diffusion models have greatly improved pose-driven character animation. However, existing methods are limited to spatially aligned reference-pose pairs with matched skeletal structures. Handling reference-pose misalignment remains unsolved. To address this, we present One-to-All Animation, a unified framework for high-fidelity character animation and image pose transfer for references with arbitrary layouts. First, to handle spatially misaligned reference, we reformulate training as a self-supervised outpainting task that transforms diverse-layout reference into a unified occluded-input format. Second, to process partially visible reference, we design a reference extractor for comprehensive identity feature extraction. Further, we integrate hybrid reference fusion attention to handle varying resolutions and dynamic sequence lengths. Finally, from the perspective of generation quality, we introduce identity-robust pose control that decouples appearance from skeletal structure to mitigate pose overfitting, and a token replace strategy for coherent long-video generation. Extensive experiments show that our method outperforms existing approaches. The code and model will be available at this https URL.

[CV-71] DenoiseGS: Gaussian Reconstruction Model for Burst Denoising

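[Quick read]: This paper addresses burst denoising for images captured on handheld devices, where existing methods either struggle with large motion or incur prohibitive computational cost. The proposed DenoiseGS is the first framework to bring the efficiency of 3D Gaussian Splatting to burst denoising. It introduces a Gaussian self-consistency (GSC) loss, which regularizes the geometry predicted from noisy inputs using high-quality Gaussian point clouds that the same model generates from clean inputs, and a log-weighted frequency (LWF) loss, which strengthens supervision in the spectral domain to preserve fine details. The authors report inference 250× faster than NeRF-based methods.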

Link: https://arxiv.org/abs/2511.22939
Authors: Yongsen Cheng, Yuanhao Cai, Yulun Zhang
Institutions: Shanghai Jiao Tong University; Johns Hopkins University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Burst denoising methods are crucial for enhancing images captured on handheld devices, but they often struggle with large motion or suffer from prohibitive computational costs. In this paper, we propose DenoiseGS, the first framework to leverage the efficiency of 3D Gaussian Splatting for burst denoising. Our approach addresses two key challenges when applying a feedforward Gaussian reconstruction model to noisy inputs: the degradation of Gaussian point clouds and the loss of fine details. To this end, we propose a Gaussian self-consistency (GSC) loss, which regularizes the geometry predicted from noisy inputs with high-quality Gaussian point clouds. These point clouds are generated from clean inputs by the same model that we are training, thereby alleviating potential bias or domain gaps. Additionally, we introduce a log-weighted frequency (LWF) loss to strengthen supervision within the spectral domain, effectively preserving fine-grained details. The LWF loss adaptively weights frequency discrepancies in a logarithmic manner, emphasizing challenging high-frequency details. Extensive experiments demonstrate that DenoiseGS significantly exceeds the state-of-the-art NeRF-based methods on both burst denoising and novel view synthesis under noisy conditions, while achieving 250× faster inference speed. Code and models are released at this https URL.
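
The abstract describes the LWF loss only as weighting frequency discrepancies "in a logarithmic manner". One plausible reading, shown here on a 1-D signal in plain Python, scales each spectral discrepancy by `log(1 + discrepancy)`. This is a sketch under that assumption, not the paper's definition; the function names and the choice of weight are hypothetical.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform of a real 1-D signal (O(n^2))."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def log_weighted_frequency_loss(pred, target):
    """Mean of log(1 + d_k) * d_k, where d_k is the spectral discrepancy at bin k.

    The log(1 + d) weight is an assumed form: it grows with the discrepancy,
    so poorly reconstructed frequency bins receive more emphasis.
    """
    spectrum_pred = dft(pred)
    spectrum_target = dft(target)
    total = 0.0
    for a, b in zip(spectrum_pred, spectrum_target):
        d = abs(a - b)
        total += math.log1p(d) * d
    return total / len(spectrum_pred)
```

Identical signals give a loss of exactly zero, and any spectral mismatch gives a positive value. A practical version would operate on 2-D image spectra with an FFT routine.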

[CV-72] Barcode and QR Code Object Detection: An Experimental Study on YOLOv8 Models

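[Quick read]: This paper evaluates how well the YOLOv8 (You Only Look Once) algorithm balances detection accuracy and real-time performance on barcode and QR code recognition. The authors train and fine-tune three YOLOv8 model sizes (Nano, Small, Medium) on Kaggle datasets and assess them with precision, recall, and F1 score. The reported accuracies are 88.95% for Nano, 97.10% for Small, and 94.10% for Medium, so the Small variant scores highest and accuracy does not increase monotonically with model size, although the authors present the results as evidence that model scaling improves detection.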

Link: https://arxiv.org/abs/2511.22937
Authors: Kushagra Pandya, Heli Hathi, Het Buch, Ravikumar R N, Shailendrasinh Chauhan, Sushil Kumar Singh
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 16 figures. Presented at the 2024 International Conference on Emerging Innovations and Advanced Computing (INNOCOMP)

Abstract:This work presents an in-depth evaluation of the efficiency of the YOLOv8 (You Only Look Once) algorithm in object detection, focusing specifically on barcode and QR code recognition. Using the real-time detection capabilities of YOLOv8, we performed a study aimed at improving its ability to identify objects quickly and correctly. Through extensive training and fine-tuning on Kaggle datasets tailored for barcode and QR code detection, our goal was to optimize YOLOv8’s overall performance across numerous situations and environments. The study assesses YOLOv8 across three model variants (Nano, Small, and Medium), with close attention to the precision, recall, and F1 evaluation metrics. The results show large improvements in object detection accuracy with each subsequent model refinement. Specifically, we achieved an accuracy of 88.95% for the Nano model, 97.10% for the Small model, and 94.10% for the Medium model, showcasing the improvements achieved via model scaling. Our findings highlight the strides made by YOLOv8 in pushing the limits of computer vision, confirming its role as a milestone in the field of object detection. This study sheds light on how model scaling affects object recognition and advances the understanding of deep learning-based computer vision techniques.
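
The three evaluation metrics named in the abstract follow directly from detection counts. The sketch below uses the standard definitions; the function name and the example counts are hypothetical and are not taken from the paper.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Standard detection metrics from raw counts.

    precision = TP / (TP + FP), recall = TP / (TP + FN),
    F1 = harmonic mean of precision and recall.
    Zero denominators yield 0.0 instead of raising an error.
    """
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

For instance, 90 correct detections with 10 false alarms and 30 missed codes give a precision of 0.9 and a recall of 0.75. Because F1 penalizes the gap between the two, it is a more balanced summary than accuracy alone.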

[CV-73] Robust Image Self-Recovery against Tampering using Watermark Generation with Pixel Shuffling

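[Quick read]: This paper addresses the threat to digital media authenticity posed by the rapid growth of AI-generated content (AIGC), and specifically the difficulty of accurately recovering original content after an image has been tampered with. Existing self-recovery methods often fail to restore tampered regions accurately. The proposed ReImage framework is based on neural watermarking: it embeds a shuffled version of the target image into the image itself as a watermark, and adds a generator that produces watermarks optimized for neural watermarking together with an image enhancement module that refines the recovered image. The paper also analyzes and resolves key limitations of shuffled watermarking in the self-recovery setting, and reports state-of-the-art performance across diverse tampering scenarios.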

Link: https://arxiv.org/abs/2511.22936
Authors: Minyoung Kim, Paul Hongsuck Seo
Institutions: Korea University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 22 pages, 12 figures, 14 tables

Abstract:The rapid growth of Artificial Intelligence-Generated Content (AIGC) raises concerns about the authenticity of digital media. In this context, image self-recovery, reconstructing original content from its manipulated version, offers a practical solution for understanding the attacker’s intent and restoring trustworthy data. However, existing methods often fail to accurately recover tampered regions, falling short of the primary goal of self-recovery. To address this challenge, we propose ReImage, a neural watermarking-based self-recovery framework that embeds a shuffled version of the target image into itself as a watermark. We design a generator that produces watermarks optimized for neural watermarking and introduce an image enhancement module to refine the recovered image. We further analyze and resolve key limitations of shuffled watermarking, enabling its effective use in self-recovery. We demonstrate that ReImage achieves state-of-the-art performance across diverse tampering scenarios, consistently producing high-quality recovered images. The code and pretrained models will be released upon publication.
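
The shuffling step can be illustrated with a seeded, invertible pixel permutation. This is a minimal sketch on a flat pixel list, assuming a key-derived permutation; the function names and the use of Python's `random` module are hypothetical, and the paper's neural embedding and recovery stages are not reproduced here.

```python
import random


def shuffle_pixels(pixels, seed):
    """Permute a flat pixel list with a seeded permutation.

    Returns the shuffled pixels and the permutation, where
    shuffled[k] == pixels[perm[k]].
    """
    rng = random.Random(seed)
    perm = list(range(len(pixels)))
    rng.shuffle(perm)
    return [pixels[i] for i in perm], perm


def unshuffle_pixels(shuffled, perm):
    """Invert shuffle_pixels by writing each value back to its source index."""
    restored = [None] * len(shuffled)
    for out_idx, src_idx in enumerate(perm):
        restored[src_idx] = shuffled[out_idx]
    return restored
```

One plausible motivation, stated here as an assumption, is that a tampered region removes a contiguous block of the embedded watermark; after unshuffling, those losses are scattered across the image instead of erasing the copy of exactly the region that needs to be recovered.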

[CV-74] NeuMatC: A General Neural Framework for Fast Parametric Matrix Operation

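[Quick read]: This paper targets applications such as wireless communication and signal processing in which matrix operations (for example inversion and singular value decomposition (SVD)) must be repeated over matrices whose parameters vary continuously. Conventional methods handle each operation independently and do not exploit the low-rankness and continuity of the results along the parameter dimension, which leads to heavily redundant computation. The proposed Neural Matrix Computation Framework (NeuMatC) learns, without supervision, a low-rank and continuous mapping from parameters to the corresponding matrix operation results. Once trained, it computes results at arbitrary parameters using only a few basic operations such as matrix multiplications and nonlinear activations, and the authors report over 3× speedup for parametric inversion and 10× for parametric SVD relative to a NumPy baseline while maintaining acceptable accuracy.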

Link: https://arxiv.org/abs/2511.22934
Authors: Chuan Wang, Xi-le Zhao, Zhilong Han, Liang Li, Deyu Meng, Michael K. Ng
Institutions: Hong Kong Baptist University; UESTC; Xi’an Jiaotong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Matrix operations (e.g., inversion and singular value decomposition (SVD)) are fundamental in science and engineering. In many emerging real-world applications (such as wireless communication and signal processing), these operations must be performed repeatedly over matrices with parameters varying continuously. However, conventional methods tackle each matrix operation independently, underexploring the inherent low-rankness and continuity along the parameter dimension, resulting in significantly redundant computation. To address this challenge, we propose Neural Matrix Computation Framework (NeuMatC), which elegantly tackles general parametric matrix operation tasks by leveraging the underlying low-rankness and continuity along the parameter dimension. Specifically, NeuMatC unsupervisedly learns a low-rank and continuous mapping from parameters to their corresponding matrix operation results. Once trained, NeuMatC enables efficient computations at arbitrary parameters using only a few basic operations (e.g., matrix multiplications and nonlinear activations), significantly reducing redundant computations. Experimental results on both synthetic and real-world datasets demonstrate the promising performance of NeuMatC, exemplified by over 3× speedup in parametric inversion and 10× speedup in parametric SVD compared to the widely used NumPy baseline in wireless communication, while maintaining acceptable accuracy.
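
Two ideas from the abstract can be shown on a toy 2×2 family: results that live in a low-dimensional span with coefficients varying continuously in the parameter, and an objective that needs no precomputed inverses. The sketch below is an illustration under assumptions, not NeuMatC itself: the family `A(t)`, the hand-set coefficients (a trained network would supply them), and the residual `||A(t) X(t) - I||_F^2` as the label-free loss are all hypothetical choices.

```python
def matmul(a, b):
    """Multiply two square matrices stored as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def residual_loss(a, x):
    """Squared Frobenius norm of A @ X - I; zero exactly when X inverts A."""
    n = len(a)
    prod = matmul(a, x)
    return sum((prod[i][j] - (1.0 if i == j else 0.0)) ** 2
               for i in range(n) for j in range(n))


def parametric_matrix(t):
    """Toy parametric family A(t) = [[1, t], [0, 1]] (hypothetical example)."""
    return [[1.0, t], [0.0, 1.0]]


# Two fixed basis matrices: every inverse A(t)^-1 lies in their span.
BASIS = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, -1.0], [0.0, 0.0]],
]


def coefficients(t):
    """Continuous coefficients c(t); hand-set here instead of learned."""
    return [1.0, t]


def predicted_inverse(t):
    """X(t) = sum_r c_r(t) * B_r, evaluated with a few multiply-adds."""
    c = coefficients(t)
    return [[sum(c[r] * BASIS[r][i][j] for r in range(len(BASIS)))
             for j in range(2)] for i in range(2)]
```

Evaluating `predicted_inverse(t)` at a new `t` costs only a weighted sum of fixed matrices, and `residual_loss` can score it without ever calling an inversion routine, which is what makes an unsupervised training signal possible.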

[CV-75] ViGG: Robust RGB-D Point Cloud Registration using Visual-Geometric Mutual Guidance WACV2026

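[Quick read]: This paper addresses the limited robustness and practicality of RGB-D point cloud registration methods that either use geometric information alone or focus on feature fusion and feature learning. The proposed ViGG method is built on mutual guidance. On one side, it solves clique alignment in a combined visual-geometric form and uses a geometric guidance design to suppress ambiguous cliques. On the other, it uses a visual-guided geometric matching strategy in which visual priors determine the search space, so that high-quality, noise-insensitive correspondences can be extracted. The authors report that ViGG outperforms recent state-of-the-art methods on 3DMatch, ScanNet, and KITTI in both learning-free and learning-based settings.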

Link: https://arxiv.org/abs/2511.22908
Authors: Congjia Chen, Shen Yan, Yufu Qu
Institutions: Beihang University; China Agricultural University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by WACV 2026

Abstract:Point cloud registration is a fundamental task in 3D vision. Most existing methods only use geometric information for registration. Recently proposed RGB-D registration methods primarily focus on feature fusion or improving feature learning, which limits their ability to exploit image information and hinders their practical applicability. In this paper, we propose ViGG, a robust RGB-D registration method using mutual guidance. First, we solve clique alignment in a visual-geometric combination form, employing a geometric guidance design to suppress ambiguous cliques. Second, to mitigate accuracy degradation caused by noise in visual matches, we propose a visual-guided geometric matching method that utilizes visual priors to determine the search space, enabling the extraction of high-quality, noise-insensitive correspondences. This mutual guidance strategy brings our method superior robustness, making it applicable for various RGB-D registration tasks. The experiments on 3DMatch, ScanNet and KITTI datasets show that our method outperforms recent state-of-the-art methods in both learning-free and learning-based settings. Code is available at this https URL.

[CV-76] See, Rank, and Filter: Important Word-Aware Clip Filtering via Scene Understanding for Moment Retrieval and Highlight Detection

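[Quick read]: This paper addresses insufficient semantic understanding in video moment retrieval (MR) and highlight detection (HD), which arises because existing methods ignore the importance of individual words in the query. They treat the whole text query and the video clips as a black box, which hinders fine-grained semantic alignment. The proposed approach adds image-text scene understanding based on Multimodal Large Language Models (MLLMs) and introduces two modules: a feature enhancement module (FEM) that identifies and prioritizes important words in the query, and a ranking-based filtering module (RFM) that iteratively refines candidate video clips according to their relevance to those words. The authors report improved performance over existing state-of-the-art methods on both MR and HD.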

Link: https://arxiv.org/abs/2511.22906
Authors: YuEun Lee, Jung Uk Kim
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Video moment retrieval (MR) and highlight detection (HD) with natural language queries aim to localize relevant moments and key highlights in video clips. However, existing methods overlook the importance of individual words, treating the entire text query and video clips as a black-box, which hinders contextual understanding. In this paper, we propose a novel approach that enables fine-grained clip filtering by identifying and prioritizing important words in the query. Our method integrates image-text scene understanding through Multimodal Large Language Models (MLLMs) and enhances the semantic understanding of video clips. We introduce a feature enhancement module (FEM) to capture important words from the query and a ranking-based filtering module (RFM) to iteratively refine video clips based on their relevance to these important words. Extensive experiments demonstrate that our approach significantly outperforms existing state-of-the-art methods, achieving superior performance in both MR and HD tasks. Our code is available at: this https URL.

[CV-77] Leveraging Textual Compositional Reasoning for Robust Change Captioning AAAI2026

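[Quick read]: This paper addresses a weakness of change captioning methods that rely on visual features alone: they often miss subtle but meaningful changes because they cannot explicitly represent structured information such as object relationships and compositional semantics. The proposed CORTEX framework strengthens change understanding with complementary textual cues. It has three modules: an Image-level Change Detector that identifies low-level visual differences between the paired images; a Reasoning-aware Text Extraction (RTE) module that uses Vision Language Models (VLMs) to generate compositional reasoning descriptions implicit in the visual features; and an Image-Text Dual Alignment (ITDA) module that aligns visual and textual features for fine-grained relational reasoning. This lets the model capture changes that are ambiguous from visual features alone.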

Link: https://arxiv.org/abs/2511.22903
Authors: Kyu Ri Park, Jiyoung Park, Seong Tae Kim, Hong Joo Lee, Jung Uk Kim
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at AAAI 2026

Abstract:Change captioning aims to describe changes between a pair of images. However, existing works rely on visual features alone, which often fail to capture subtle but meaningful changes because they lack the ability to represent explicitly structured information such as object relationships and compositional semantics. To alleviate this, we present CORTEX (COmpositional Reasoning-aware TEXt-guided), a novel framework that integrates complementary textual cues to enhance change understanding. In addition to capturing cues from pixel-level differences, CORTEX utilizes scene-level textual knowledge provided by Vision Language Models (VLMs) to extract richer image text signals that reveal underlying compositional reasoning. CORTEX consists of three key modules: (i) an Image-level Change Detector that identifies low-level visual differences between paired images, (ii) a Reasoning-aware Text Extraction (RTE) module that uses VLMs to generate compositional reasoning descriptions implicit in visual features, and (iii) an Image-Text Dual Alignment (ITDA) module that aligns visual and textual features for fine-grained relational reasoning. This enables CORTEX to reason over visual and textual features and capture changes that are otherwise ambiguous in visual features alone.

[CV-78] From Points to Clouds: Learning Robust Semantic Distributions for Multi-modal Prompts

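[Quick read]: This paper addresses the limitations of current Multimodal Prompt Learning (MPL) methods, which optimize a single static point representation and therefore overfit to base classes and generalize poorly to novel or ambiguous categories. The proposed Points-to-Clouds (P2C) framework, inspired by diffusion models, abandons the point paradigm and reframes prompt learning as a dynamic denoising task. Its core is a dual denoising mechanism: Dynamic Prompt Denoising (DPD) perturbs text prompts with annealed noise, and an auxiliary V-L Mapper denoising loss trains the mapper to reconstruct clean visual prompts from noisy text prompts. The model thereby learns a semantic cloud, that is, a distribution over the embedding space, for more robust cross-modal alignment and generalization.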

Link: https://arxiv.org/abs/2511.22897
Authors: Weiran Li, Yeqiang Liu, Yijie Wei, Mina Han, Xin Liu, Zhenbo Li
Institutions: China Agricultural University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multimodal Prompt Learning (MPL) has emerged as a pivotal technique for adapting large-scale Visual Language Models (VLMs). However, current MPL methods are fundamentally limited by their optimization of a single, static point representation. This paradigm is inherently brittle, leads to overfitting on base classes, and generalizes poorly to novel or ambiguous categories. We challenge this point paradigm, proposing that robust generalization requires learning a semantic cloud (i.e., a distribution over the embedding space). To achieve this, we introduce Points-to-Clouds (P2C), a novel framework inspired by diffusion models that reframes prompt learning as a dynamic denoising task. At the core of P2C is a dual denoising mechanism: a Dynamic Prompt Denoising (DPD) mechanism perturbs text prompts with sophisticated, annealed noise to learn a smoother semantic landscape, while an auxiliary V-L Mapper denoising loss re-tasks the mapper as a denoising autoencoder. This forces the mapper to reconstruct clean visual prompts from noisy text inputs, ensuring robust cross-modal alignment. Extensive experiments across 11 datasets demonstrate that P2C consistently outperforms strong baselines. On the base-to-novel generalization benchmark, our method achieves a Harmonic Mean of 79.7%, representing a relative improvement of 1.4% over the baseline. The code and models are available at this https URL.
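
The Harmonic Mean reported on the base-to-novel benchmark combines base-class and novel-class accuracy so that a weak score on either side pulls the result down. The sketch below shows that computation and how a relative improvement is derived; the function names and the example accuracies are hypothetical and are not the paper's per-split numbers.

```python
def harmonic_mean(base_acc, novel_acc):
    """Harmonic mean of base-class and novel-class accuracy (same units in and out)."""
    total = base_acc + novel_acc
    if total == 0:
        return 0.0
    return 2.0 * base_acc * novel_acc / total


def relative_improvement(new, old):
    """Relative gain of `new` over `old`, expressed in percent."""
    return 100.0 * (new - old) / old
```

With hypothetical accuracies of 84.0% on base classes and 76.0% on novel classes, the harmonic mean is 79.8%, below the arithmetic mean of 80.0%. A relative improvement is measured against the baseline score, so it differs from the absolute difference in percentage points.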

[CV-79] DM3T: Harmonizing Modalities via Diffusion for Multi-Object Tracking

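[Quick read]: This paper addresses the difficulty of fusing visible-light and thermal-infrared features in multimodal multi-object tracking (MOT), where the large non-linear distribution gap between the two modalities can cause modality conflicts and reduce tracking accuracy. The key idea is to recast multimodal fusion as an iterative feature alignment process. The proposed Cross-Modal Diffusion Fusion (C-MDF) module lets the two modalities guide each other and iteratively projects their features onto a shared, consistent feature manifold, which enables complementary information to be learned and fused more deeply. A plug-and-play Diffusion Refiner (DR) further refines the unified feature representation, and a Hierarchical Tracker adaptively handles confidence estimation, yielding a unified online tracking framework that needs no complex post-processing.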

Link: https://arxiv.org/abs/2511.22896
Authors: Weiran Li, Yeqiang Liu, Yijie Wei, Mina Han, Qiannan Guo, Zhenbo Li
Institutions: China Agricultural University; Beijing Normal University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multi-object tracking (MOT) is a fundamental task in computer vision with critical applications in autonomous driving and robotics. Multimodal MOT that integrates visible light and thermal infrared information is particularly essential for robust autonomous driving systems. However, effectively fusing these heterogeneous modalities is challenging. Simple strategies like concatenation or addition often fail to bridge the significant non-linear distribution gap between their feature representations, which can lead to modality conflicts and degrade tracking accuracy. Drawing inspiration from the connection between multimodal MOT and the iterative refinement in diffusion models, this paper proposes DM³T, a novel framework that reformulates multimodal fusion as an iterative feature alignment process to generate accurate and temporally coherent object trajectories. Our approach performs iterative cross-modal harmonization through a proposed Cross-Modal Diffusion Fusion (C-MDF) module. In this process, features from both modalities provide mutual guidance, iteratively projecting them onto a shared, consistent feature manifold. This enables the learning of complementary information and achieves deeper fusion compared to conventional methods. Additionally, we introduce a plug-and-play Diffusion Refiner (DR) to enhance and refine the unified feature representation. To further improve tracking robustness, we design a Hierarchical Tracker that adaptively handles confidence estimation. DM³T unifies object detection, state estimation, and data association into a comprehensive online tracking framework without complex post-processing. Extensive experiments on the VT-MOT benchmark demonstrate that our method achieves 41.7 HOTA, representing a 1.54% relative improvement over existing state-of-the-art methods. The code and models are available at this https URL.

[CV-80] ClearGCD: Mitigating Shortcut Learning For Robust Generalized Category Discovery

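[Quick read]: This paper studies prototype confusion in Generalized Category Discovery (GCD) under open-world conditions. The confusion is typically caused by shortcut learning and leads to forgetting of known classes and weaker generalization. The proposed ClearGCD framework uses two complementary mechanisms. Semantic View Alignment (SVA) generates strong augmentations through cross-class patch replacement and enforces semantic consistency using weak augmentations. Shortcut Suppression Regularization (SSR) maintains an adaptive prototype bank that aligns known classes while encouraging separation of potential novel ones. ClearGCD can be integrated into parametric GCD approaches, and the authors report consistent gains over state-of-the-art methods on multiple benchmarks.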

Link: https://arxiv.org/abs/2511.22892
Authors: Kailin Lyu, Jianwei He, Long Xiao, Jianing Zeng, Liang Fan, Lin Shu, Jie Hao
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 5 pages, 4 figures

Abstract:In open-world scenarios, Generalized Category Discovery (GCD) requires identifying both known and novel categories within unlabeled data. However, existing methods often suffer from prototype confusion caused by shortcut learning, which undermines generalization and leads to forgetting of known classes. We propose ClearGCD, a framework designed to mitigate reliance on non-semantic cues through two complementary mechanisms. First, Semantic View Alignment (SVA) generates strong augmentations via cross-class patch replacement and enforces semantic consistency using weak augmentations. Second, Shortcut Suppression Regularization (SSR) maintains an adaptive prototype bank that aligns known classes while encouraging separation of potential novel ones. ClearGCD can be seamlessly integrated into parametric GCD approaches and consistently outperforms state-of-the-art methods across multiple benchmarks.

[CV-81] CNN-Based Framework for Pedestrian Age and Gender Classification Using Far-View Surveillance in Mixed-Traffic Intersections

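[Quick read]: This paper addresses the lack of real-time demographic information in pedestrian safety monitoring at urban intersections, especially in low- and middle-income countries where traffic is multimodal and infrastructure often lacks formal control, and where monitoring systems rarely capture risk factors such as age and gender. The proposed deep learning framework uses convolutional neural networks (CNNs) to classify pedestrians into six classes (adult, teenager, and child, each for male and female) from full-body visual cues in far-view video, without relying on facial recognition or high-resolution imagery. Two architectures are compared: ResNet50 pretrained on ImageNet and a lightweight custom CNN. ResNet50 with max pooling and SGD reaches the highest accuracy (86.19%), while the custom CNN performs comparably (84.15%) with fewer parameters and faster training, which supports low-cost deployment on existing camera infrastructure for traffic planning and targeted safety interventions.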

Link: https://arxiv.org/abs/2511.22873
Authors: Shisir Shahriar Arif, Md. Muhtashim Shahrier, Nazmul Haque, Md Asif Raihan, Md. Hadiuzzaman
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments: Accepted for poster presentation at the 105th Annual Meeting of the Transportation Research Board

Abstract:Pedestrian safety remains a pressing concern in congested urban intersections, particularly in low- and middle-income countries where traffic is multimodal, and infrastructure often lacks formal control. Demographic factors like age and gender significantly influence pedestrian vulnerability, yet real-time monitoring systems rarely capture this information. To address this gap, this study proposes a deep learning framework that classifies pedestrian age group and gender from far-view intersection footage using convolutional neural networks (CNNs), without relying on facial recognition or high-resolution imagery. The classification is structured as a unified six-class problem, distinguishing adult, teenager, and child pedestrians for both males and females, based on full-body visual cues. Video data was collected from three high-risk intersections in Dhaka, Bangladesh. Two CNN architectures were implemented: ResNet50, a deep convolutional neural network pretrained on ImageNet, and a custom lightweight CNN optimized for computational efficiency. Eight model variants explored combinations of pooling strategies and optimizers. ResNet50 with Max Pooling and SGD achieved the highest accuracy (86.19%), while the custom CNN performed comparably (84.15%) with fewer parameters and faster training. The model’s efficient design enables real-time inference on standard surveillance feeds. For practitioners, this system provides a scalable, cost-effective tool to monitor pedestrian demographics at intersections using existing camera infrastructure. Its outputs can shape intersection design, optimize signal timing, and enable targeted safety interventions for vulnerable groups such as children or the elderly. By offering demographic insights often missing in conventional traffic data, the framework supports more inclusive, data-driven planning in mixed-traffic environments.
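
Framing age group and gender as one unified six-class problem means each (age group, gender) pair maps to a single class index. The sketch below shows one such encoding; the class ordering and the function names are assumptions made for illustration, since the abstract does not specify how the classes are indexed.

```python
# Assumed orderings; the paper does not state the class index layout.
AGE_GROUPS = ("adult", "teenager", "child")
GENDERS = ("male", "female")


def encode_label(age_group, gender):
    """Map an (age group, gender) pair to a class index in 0..5."""
    return AGE_GROUPS.index(age_group) * len(GENDERS) + GENDERS.index(gender)


def decode_label(index):
    """Recover the (age group, gender) pair from a class index."""
    age_idx, gender_idx = divmod(index, len(GENDERS))
    return AGE_GROUPS[age_idx], GENDERS[gender_idx]
```

A single softmax head over these six indices predicts both attributes jointly, in contrast to a design with two separate classification heads.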

[CV-82] Scalable Diffusion Transformer for Conditional 4D fMRI Synthesis NEURIPS2025

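[Quick read]: This paper addresses the generation of whole-brain 4D functional magnetic resonance imaging (fMRI) sequences conditioned on cognitive tasks, which is difficult because blood-oxygen-level-dependent (BOLD) dynamics are high-dimensional and heterogeneous across subjects and acquisitions, and because neuroscience-grounded validation is lacking. The authors propose the first diffusion transformer for voxelwise conditional 4D fMRI generation, combining 3D VQ-GAN latent compression with a CNN-Transformer backbone and strong task conditioning through AdaLN-Zero and cross-attention. On Human Connectome Project (HCP) task fMRI data, the model reproduces task-evoked activation maps, preserves the inter-task representational structure of real data as measured by representational similarity analysis (RSA), achieves perfect condition specificity, and aligns ROI time-courses with canonical hemodynamic responses. Performance improves predictably with scale.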

Link: https://arxiv.org/abs/2511.22870
Authors: Jungwoo Seo, David Keetae Park, Shinjae Yoo, Jiook Cha
Institutions: Seoul National University; Brookhaven National Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
Comments: Accepted at NeurIPS 2025 Workshop: Foundation Models for the Brain and Body. 13 pages, 6 figures, 4 tables

Abstract:Generating whole-brain 4D fMRI sequences conditioned on cognitive tasks remains challenging due to the high-dimensional, heterogeneous BOLD dynamics across subjects/acquisitions and the lack of neuroscience-grounded validation. We introduce the first diffusion transformer for voxelwise 4D fMRI conditional generation, combining 3D VQ-GAN latent compression with a CNN-Transformer backbone and strong task conditioning via AdaLN-Zero and cross-attention. On HCP task fMRI, our model reproduces task-evoked activation maps, preserves the inter-task representational structure observed in real data (RSA), achieves perfect condition specificity, and aligns ROI time-courses with canonical hemodynamic responses. Performance improves predictably with scale, reaching task-evoked map correlation of 0.83 and RSA of 0.98, consistently surpassing a U-Net baseline on all metrics. By coupling latent diffusion with a scalable backbone and strong conditioning, this work establishes a practical path to conditional 4D fMRI synthesis, paving the way for future applications such as virtual experiments, cross-site harmonization, and principled augmentation for downstream neuroimaging models.
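
The RSA score compares how tasks relate to one another in real versus synthetic data. A common recipe, sketched below, builds a representational dissimilarity matrix (RDM) per dataset from pairwise pattern correlations and then correlates the two RDMs. Whether the paper uses exactly this recipe (Pearson distance, Pearson comparison) is an assumption, and the function names and toy patterns are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def rdm_upper(patterns):
    """Upper triangle of the RDM: 1 - Pearson r for each pair of task patterns."""
    n = len(patterns)
    return [1.0 - pearson(patterns[i], patterns[j])
            for i in range(n) for j in range(i + 1, n)]


def rsa_score(real_patterns, synthetic_patterns):
    """Correlation between the real and synthetic RDMs (closer to 1 is better)."""
    return pearson(rdm_upper(real_patterns), rdm_upper(synthetic_patterns))
```

Because the comparison is made between dissimilarity structures and not raw activations, a synthetic dataset that rescales or shifts every pattern but keeps the same task-to-task relationships still scores close to 1.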

[CV-83] SUPER-AD: Semantic Uncertainty-aware Planning for End-to-End Robust Autonomous Driving

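[Quick read]: This paper addresses the "uncertainty blindness" of current End-to-End (E2E) autonomous driving planners: they assume perception outputs are fully reliable, even in ambiguous or poorly observed scenes, so the planner has no explicit measure of uncertainty. The proposed camera-only E2E framework estimates aleatoric uncertainty directly in Bird's-Eye View (BEV) space and incorporates it into trajectory planning. It also introduces a lane-following regularization that encodes lane structure and traffic norms, which stabilizes planning under normal conditions while preserving the flexibility needed for maneuvers such as overtaking and lane changes. The method produces a dense, uncertainty-aware drivability map at pixel-level resolution and achieves state-of-the-art performance on the NAVSIM benchmark, with substantial gains on the NAVHARD and NAVSAFE subsets.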

Link: https://arxiv.org/abs/2511.22865
Authors: Wonjeong Ryu, Seungjun Yu, Seokha Moon, Hojun Choi, Junsung Park, Jinkyu Kim, Hyunjung Shim
Institutions: KAIST AI; Korea University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:End-to-End (E2E) planning has become a powerful paradigm for autonomous driving, yet current systems remain fundamentally uncertainty-blind. They assume perception outputs are fully reliable, even in ambiguous or poorly observed scenes, leaving the planner without an explicit measure of uncertainty. To address this limitation, we propose a camera-only E2E framework that estimates aleatoric uncertainty directly in BEV space and incorporates it into planning. Our method produces a dense, uncertainty-aware drivability map that captures both semantic structure and geometric layout at pixel-level resolution. To further promote safe and rule-compliant behavior, we introduce a lane-following regularization that encodes lane structure and traffic norms. This prior stabilizes trajectory planning under normal conditions while preserving the flexibility needed for maneuvers such as overtaking or lane changes. Together, these components enable robust and interpretable trajectory planning, even under challenging uncertainty conditions. Evaluated on the NAVSIM benchmark, our method achieves state-of-the-art performance, delivering substantial gains on both the challenging NAVHARD and NAVSAFE subsets. These results demonstrate that our principled aleatoric uncertainty modeling combined with driving priors significantly advances the safety and reliability of camera-only E2E autonomous driving.
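
Aleatoric uncertainty is often learned by having the network predict a variance alongside each output and training with a Gaussian negative log-likelihood. The sketch below shows that standard heteroscedastic loss for a single scalar prediction; it illustrates the general technique and is an assumption about how such a term could look, not the loss used in SUPER-AD.

```python
import math


def aleatoric_nll(target, mean, log_var):
    """Heteroscedastic Gaussian negative log-likelihood (constant term dropped).

    The network predicts `mean` and `log_var` (log of the variance).
    A large predicted variance down-weights the squared error but pays
    a 0.5 * log_var penalty, so the model cannot claim uncertainty for free.
    """
    return 0.5 * math.exp(-log_var) * (target - mean) ** 2 + 0.5 * log_var
```

For a prediction that is off by 2.0, the loss is 2.0 at unit variance (`log_var = 0`) and about 1.19 when the model predicts a variance of 4, so raising the reported uncertainty is rewarded only where errors are actually large.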

[CV-84] CoordSpeaker: Exploiting Gesture Captioning for Coordinated Caption-Empowered Co-Speech Gesture Generation

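[Quick Read]: This paper targets two problems in co-speech gesture generation: the semantic prior gap caused by the lack of descriptive text annotations in gesture datasets, and the difficulty of achieving coordinated multimodal control. The key to the solution is the CoordSpeaker framework. It first uses a novel gesture captioning framework, built on a motion-language model, to generate descriptive captions at multiple granularities. It then builds a conditional latent diffusion model with a unified cross-dataset motion representation and a hierarchically controlled denoiser, enabling highly controllable, coordinated gesture generation. The work is the first to combine gesture understanding and captioning to bridge the semantic gap, and it offers a new perspective on bidirectional gesture-text mapping.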

Link: https://arxiv.org/abs/2511.22863
Authors: Fengyi Fang,Sicheng Yang,Wenming Yang
Affiliations: Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Co-speech gesture generation has significantly advanced human-computer interaction, yet speaker movements remain constrained due to the omission of text-driven non-spontaneous gestures (e.g., bowing while talking). Existing methods face two key challenges: 1) the semantic prior gap due to the lack of descriptive text annotations in gesture datasets, and 2) the difficulty in achieving coordinated multimodal control over gesture generation. To address these challenges, this paper introduces CoordSpeaker, a comprehensive framework that enables coordinated caption-empowered co-speech gesture synthesis. Our approach first bridges the semantic prior gap through a novel gesture captioning framework, leveraging a motion-language model to generate descriptive captions at multiple granularities. Building upon this, we propose a conditional latent diffusion model with unified cross-dataset motion representation and a hierarchically controlled denoiser to achieve highly controlled, coordinated gesture generation. CoordSpeaker pioneers the first exploration of gesture understanding and captioning to tackle the semantic gap in gesture generation while offering a novel perspective of bidirectional gesture-text mapping. Extensive experiments demonstrate that our method produces high-quality gestures that are both rhythmically synchronized with speeches and semantically coherent with arbitrary captions, achieving superior performance with higher efficiency compared to existing approaches.

[CV-85] Bridging Modalities via Progressive Re-alignment for Multimodal Test-Time Adaptation AAAI2026

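[Quick Read]: This paper addresses the complex coupling effect in Multimodal Test-Time Adaptation (MMTTA) that arises when modalities undergo different degrees of distribution shift: unimodal shallow feature shift and cross-modal high-level semantic misalignment interfere with each other, which prevents existing unimodal test-time adaptation methods from extending directly to multimodal settings. The key to the solution is a new framework, Bridging Modalities via Progressive Re-alignment (BriMPR), which follows a divide-and-conquer strategy. It first uses prompt tuning to calibrate each modality's global feature distribution, achieving an initial semantic re-alignment. It then applies cross-modal instance-wise contrastive learning, using credible pseudo-labels and combinations of masked modalities, to strengthen information interaction among modalities and further refine the alignment.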

Link: https://arxiv.org/abs/2511.22862
Authors: Jiacheng Li,Songhe Feng
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2026 (Oral)


Abstract:Test-time adaptation (TTA) enables online model adaptation using only unlabeled test data, aiming to bridge the gap between source and target distributions. However, in multimodal scenarios, varying degrees of distribution shift across different modalities give rise to a complex coupling effect of unimodal shallow feature shift and cross-modal high-level semantic misalignment, posing a major obstacle to extending existing TTA methods to the multimodal field. To address this challenge, we propose a novel multimodal test-time adaptation (MMTTA) framework, termed Bridging Modalities via Progressive Re-alignment (BriMPR). BriMPR, consisting of two progressively enhanced modules, tackles the coupling effect with a divide-and-conquer strategy. Specifically, we first decompose MMTTA into multiple unimodal feature alignment sub-problems. By leveraging the strong function approximation ability of prompt tuning, we calibrate the unimodal global feature distributions to their respective source distributions, so as to achieve the initial semantic re-alignment across modalities. Subsequently, we assign the credible pseudo-labels to combinations of masked and complete modalities, and introduce inter-modal instance-wise contrastive learning to further enhance the information interaction among modalities and refine the alignment. Extensive experiments on MMTTA tasks, including both corruption-based and real-world domain shift benchmarks, demonstrate the superiority of our method. Our source code is available at this https URL.
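
To make the idea of inter-modal instance-wise contrastive learning more concrete, here is a minimal InfoNCE-style sketch in plain Python. It treats the features of the same sample in two modalities as the positive pair and all other samples as negatives. The function names, the temperature value, and the toy features are assumptions; the paper's actual loss, which also involves pseudo-labels and masked-modality combinations, is not reproduced here.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def info_nce(feats_a, feats_b, temperature=0.1):
    """Mean InfoNCE loss where feats_a[i] and feats_b[i] form the positive pair."""
    a = [normalize(v) for v in feats_a]
    b = [normalize(v) for v in feats_b]
    losses = []
    for i, anchor in enumerate(a):
        logits = [dot(anchor, other) / temperature for other in b]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Toy two-sample batch: modality B either matches modality A or is swapped.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is close to zero when matching instances agree across modalities and large when they do not, which is the signal such an objective uses to pull modalities back into alignment.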

[CV-86] MARVO: Marine-Adaptive Radiance-aware Visual Odometry CVPR2026

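[Quick Read]: This paper addresses key challenges in underwater visual localization, including wavelength-dependent light attenuation, poor texture, and non-Gaussian sensor noise. The core of the solution is the MARVO framework, with three main contributions: (1) at the front end, a Physics Aware Radiance Adapter combined with a transformer-based feature matcher compensates for color channel attenuation and contrast loss, yielding geometrically consistent feature correspondences in turbid water; (2) at the back end, a factor-graph visual-inertial-barometric estimator fuses pre-integrated IMU motion factors, MARVO-derived visual pose factors, and barometric depth priors to produce a full-state maximum a posteriori (MAP) estimate; (3) a Reinforcement-Learning-based Pose-Graph Optimizer learns optimal retraction actions on the SE(2) manifold, refining trajectories beyond the local minima that trap classical least-squares solvers and improving global trajectory accuracy.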

Link: https://arxiv.org/abs/2511.22860
Authors: Sacchin Sundar,Atman Kikani,Aaliya Alam,Sumukh Shrote,A. Nayeemulla Khan,A. Shahina
Affiliations: University of Michigan; Vellore Institute of Technology; University of Pennsylvania; Sri Sivasubramaniya Nadar College of Engineering
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures, 3 tables, Submitted to CVPR2026


Abstract:Underwater visual localization remains challenging due to wavelength-dependent attenuation, poor texture, and non-Gaussian sensor noise. We introduce MARVO, a physics-aware, learning-integrated odometry framework that fuses underwater image formation modeling, differentiable matching, and reinforcement-learning optimization. At the front-end, we extend a transformer-based feature matcher with a Physics Aware Radiance Adapter that compensates for color channel attenuation and contrast loss, yielding geometrically consistent feature correspondences under turbidity. These semi-dense matches are combined with inertial and pressure measurements inside a factor-graph backend, where we formulate a keyframe-based visual-inertial-barometric estimator using the GTSAM library. Each keyframe introduces (i) pre-integrated IMU motion factors, (ii) MARVO-derived visual pose factors, and (iii) barometric depth priors, giving a full-state MAP estimate in real time. Lastly, we introduce a Reinforcement-Learning-based Pose-Graph Optimizer that refines global trajectories beyond local minima of classical least-squares solvers by learning optimal retraction actions on SE(2).

[CV-87] GLOW: Global Illumination-Aware Inverse Rendering of Indoor Scenes Captured with Dynamic Co-Located Light Camera

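[Quick Read]: This paper addresses the difficulty of separating material reflectance from lighting in inverse rendering of indoor scenes, focusing on the reconstruction errors that arise in co-located light-camera setups from strong inter-reflections, dynamic shadows, near-field lighting, and moving specular highlights. The core of the solution is the GLOW framework, which combines a neural implicit surface representation with a neural radiance cache to approximate global illumination, and jointly optimizes geometry and reflectance using carefully designed regularization and initialization. It also introduces a dynamic radiance cache that adapts to the lighting discontinuities caused by near-field light motion, and a surface-angle-weighted radiometric loss that suppresses the specular artifacts common in flashlight captures. Together these improve the accuracy of material reflectance estimation under both natural and co-located illumination.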

Link: https://arxiv.org/abs/2511.22857
Authors: Jiaye Wu,Saeed Hadadan,Geng Lin,Peihan Tu,Matthias Zwicker,David Jacobs,Roni Sengupta
Affiliations: University of Maryland; University of North Carolina
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Inverse rendering of indoor scenes remains challenging due to the ambiguity between reflectance and lighting, exacerbated by inter-reflections among multiple objects. While natural illumination-based methods struggle to resolve this ambiguity, co-located light-camera setups offer better disentanglement as lighting can be easily calibrated via Structure-from-Motion. However, such setups introduce additional complexities like strong inter-reflections, dynamic shadows, near-field lighting, and moving specular highlights, which existing approaches fail to handle. We present GLOW, a Global Illumination-aware Inverse Rendering framework designed to address these challenges. GLOW integrates a neural implicit surface representation with a neural radiance cache to approximate global illumination, jointly optimizing geometry and reflectance through carefully designed regularization and initialization. We then introduce a dynamic radiance cache that adapts to sharp lighting discontinuities from near-field motion, and a surface-angle-weighted radiometric loss to suppress specular artifacts common in flashlight captures. Experiments show that GLOW substantially outperforms prior methods in material reflectance estimation under both natural and co-located illumination.

[CV-88] Resolving Evidence Sparsity: Agentic Context Engineering for Long-Document Understanding

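[Quick Read]: This paper addresses the performance drop of Vision Language Models (VLMs) on long documents, especially when clues are scattered across many pages and modalities and the input is highly redundant. The core challenge is to identify and integrate the key textual and visual clues (such as tables and charts) while limiting the interference of redundant content with the model's judgment. The key to the solution is SLEUTH, a multi-agent framework for hierarchical refinement: a retriever and four collaborative agents work from coarse to fine to select high-value multimodal evidence and assemble a distilled, evidence-dense context for reasoning. The method is model-agnostic and scalable, and it improves results on multiple long-document understanding benchmarks, reaching state-of-the-art performance.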

Link: https://arxiv.org/abs/2511.22850
Authors: Keliang Liu,Zizhi Chen,Mingcheng Li,Jingqun Tang,Dingkang Yang,Lihua Zhang
Affiliations: Fudan University; Fysics Intelligence Technologies Co., Ltd. (Fysics AI); ByteDance
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Document understanding is a long-standing practical task. Vision Language Models (VLMs) have gradually become a primary approach in this domain, demonstrating effective performance on single-page tasks. However, their effectiveness diminishes when handling long documents. In such scenarios, clues are often scattered across multiple pages and modalities, and redundancy from lengthy inputs can impair the model's judgment. While retrieval-augmented generation mitigates this issue by filtering for question-relevant content, the retrieved results still contain substantial redundancy. To address these limitations, we propose SLEUTH, a multi-agent framework. Concretely, SLEUTH orchestrates a retriever and four collaborative agents in a coarse-to-fine process. The framework identifies key textual and visual clues within the retrieved pages, filters for salient visual evidence such as tables and charts, and analyzes the query to devise a reasoning strategy. It ultimately synthesizes a distilled, evidence-dense multimodal context to generate the final prediction. SLEUTH is model-agnostic and scalable. When paired with advanced VLM backbones, it consistently improves performance on multiple long-document benchmarks, achieving state-of-the-art results. Ablation studies verify each module's effectiveness and confirm the benefits of our hierarchical refinement paradigm.

[CV-89] Breaking the Visual Shortcuts in Multimodal Knowledge-Based Visual Question Answering

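[Quick Read]: This paper addresses the "visual shortcuts" in existing Multimodal Knowledge-Based Visual Question Answering (MKB-VQA) benchmarks: models can score well simply by visually matching the query image to the primary entity of the target document, without understanding the richer semantic relations between image and text. The key to the solution is a new benchmark, RETINA, built with an LLM-driven automatic pipeline. Its queries reference secondary subjects (related entities) and are paired with images of those entities, which removes the shortcut. The paper also proposes the Multi-Image MultImodal Retriever (MIMIR), which enriches document embeddings with images of multiple related entities, handles RETINA effectively, and outperforms prior methods that use a single image per document.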

Link: https://arxiv.org/abs/2511.22843
Authors: Dosung Lee,Sangwon Jung,Boyoung Kim,Minyoung Kim,Sungyeon Kim,Junyoung Sung,Paul Hongsuck Seo
Affiliations: Korea University; KAIST; Amazon
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Existing Multimodal Knowledge-Based Visual Question Answering (MKB-VQA) benchmarks suffer from “visual shortcuts”, as the query image typically matches the primary subject entity of the target document. We demonstrate that models can exploit these shortcuts, achieving comparable results using visual cues alone. To address this, we introduce the Relational Entity Text-Image kNowledge Augmented (RETINA) benchmark, automatically constructed using an LLM-driven pipeline and consisting of a 120k training set and a 2k human-curated test set. RETINA contains queries referencing secondary subjects (i.e. related entities) and pairs them with images of these related entities, removing the visual shortcut. When evaluated on RETINA, existing models show significantly degraded performance, confirming their reliance on the shortcut. Furthermore, we propose Multi-Image MultImodal Retriever (MIMIR), which enriches document embeddings by augmenting images of multiple related entities, effectively handling RETINA, unlike prior work that uses only a single image per document. Our experiments validate the limitations of existing benchmarks and demonstrate the effectiveness of RETINA and MIMIR. Our project is available at: Project Page.

[CV-90] Some Modalities are More Equal Than Others: Decoding and Architecting Multimodal Integration in MLLMs

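[Quick Read]: This paper addresses the limited robustness of current Multimodal Large Language Models (MLLMs) to contradictory modality inputs, that is, whether a model can still reason reliably across modalities when audio, visual, and text information disagree. To study this systematically, the authors build MMA-Bench, a collection of videos and tasks that probe a model's reliance on specific modalities, and use black-box and white-box interpretability techniques to expose the brittleness of existing models. The key to the solution is a modality alignment tuning strategy that trains the model to recognize when to prioritize, leverage, or ignore cues from a given modality. This strengthens multimodal grounding and points toward MLLMs with intrinsically reliable cross-modal reasoning.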

Link: https://arxiv.org/abs/2511.22826
Authors: Tianle Chen,Chaitanya Chakka,Arjun Reddy Akula,Xavier Thomas,Deepti Ghadiyaram
Affiliations: Boston University; Google DeepMind
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Despite remarkable advancements in Multimodal Large Language Models (MLLMs), a fundamental question remains: are MLLMs robust to contradicting modalities? To rigorously study this, we introduce MMA-Bench comprising videos and tasks that probe a model’s reliance on specific modalities. Using black-box and white-box interpretability techniques, we provide a critical analysis of the brittleness of both open- and closed-sourced MLLMs. We show that current MLLMs struggle under misaligned audio-visual pairs and simple misleading text, thereby lacking robust multi-modal reasoning. Building on these findings, we propose a modality alignment tuning strategy to teach the model when to prioritize, leverage, or ignore specific modality cues. Through extensive experiments and analysis, we show that our alignment tuning yields demonstrably stronger multimodal grounding. This work provides both interpretability tools and a clear path toward developing MLLMs with intrinsically reliable cross-modal reasoning. Code and dataset will be publicly available.

[CV-91] Captain Safari: A World Engine

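[Quick Read]: This paper addresses the failure of existing world engines under aggressive 6-DoF camera trajectories and complex outdoor scenes, where they lose long-range geometric consistency, drift from the target path, or fall back to overly conservative motion. The key to the solution is Captain Safari, a pose-conditioned world memory mechanism: it maintains a dynamic local memory and uses a retriever to fetch world tokens aligned with the camera pose, then conditions video generation on them. This lets the model keep 3D structure stable while executing challenging camera maneuvers.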

Link: https://arxiv.org/abs/2511.22815
Authors: Yu-Cheng Chou,Xingrui Wang,Yitong Li,Jiahao Wang,Hanting Liu,Cihang Xie,Alan Yuille,Junfei Xiao
Affiliations: Johns Hopkins University; Tsinghua University; UC Santa Cruz
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:World engines aim to synthesize long, 3D-consistent videos that support interactive exploration of a scene under user-controlled camera motion. However, existing systems struggle under aggressive 6-DoF trajectories and complex outdoor layouts: they lose long-range geometric coherence, deviate from the target path, or collapse into overly conservative motion. To this end, we introduce Captain Safari, a pose-conditioned world engine that generates videos by retrieving from a persistent world memory. Given a camera path, our method maintains a dynamic local memory and uses a retriever to fetch pose-aligned world tokens, which then condition video generation along the trajectory. This design enables the model to maintain stable 3D structure while accurately executing challenging camera maneuvers. To evaluate this setting, we curate OpenSafari, a new in-the-wild FPV dataset containing high-dynamic drone videos with verified camera trajectories, constructed through a multi-stage geometric and kinematic validation pipeline. Across video quality, 3D consistency, and trajectory following, Captain Safari substantially outperforms state-of-the-art camera-controlled generators. It reduces MEt3R from 0.3703 to 0.3690, improves AUC@30 from 0.181 to 0.200, and yields substantially lower FVD than all camera-controlled baselines. More importantly, in a 50-participant, 5-way human study where annotators select the best result among five anonymized models, 67.6% of preferences favor our method across all axes. Our results demonstrate that pose-conditioned world memory is a powerful mechanism for long-horizon, controllable video generation and provide OpenSafari as a challenging new benchmark for future world-engine research.
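
The retrieval step can be pictured with a deliberately simplified sketch: a memory of entries tagged with camera positions, from which the nearest ones to the query pose are fetched. Everything here is an assumption made for illustration, including the `retrieve_pose_aligned` helper, the use of position only (orientation is ignored), and plain Euclidean distance; the paper describes a retriever over world tokens whose details the abstract does not give.

```python
import math

def retrieve_pose_aligned(memory, query_position, k=2):
    """Return the k memory entries whose stored camera position is closest to the query.

    Hypothetical nearest-neighbour stand-in for the paper's retriever.
    """
    ranked = sorted(memory, key=lambda entry: math.dist(entry["position"], query_position))
    return ranked[:k]

# Toy memory: each entry pairs a camera position with a placeholder world token.
memory = [
    {"position": (0.0, 0.0, 0.0), "token": "start"},
    {"position": (5.0, 0.0, 1.0), "token": "ridge"},
    {"position": (5.5, 0.2, 1.0), "token": "tree"},
    {"position": (20.0, 3.0, 2.0), "token": "lake"},
]
nearby = retrieve_pose_aligned(memory, query_position=(5.2, 0.1, 1.0))
nearby_tokens = [entry["token"] for entry in nearby]
```

In a real system the retrieved tokens would condition the video generator at each step along the camera path.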

[CV-92] LC4-DViT: Land-cover Creation for Land-cover Classification with Deformable Vision Transformer

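[Quick Read]: This paper addresses two challenges in land-cover classification from high-resolution remote sensing imagery: scarce, class-imbalanced annotations, and geometric distortions in high-resolution scenes. The key to the solution is the LC4-DViT framework, which combines text-guided generative data augmentation with a deformation-aware Vision Transformer (DViT). GPT-4o-generated scene descriptions and super-resolved exemplars are used to synthesize class-balanced, high-fidelity training images, easing data scarcity and imbalance. DViT couples a DCNv4 deformable convolutional backbone with a Vision Transformer encoder to capture fine-scale geometry and global context together. The method reaches 0.9572 overall accuracy on eight classes from the Aerial Image Dataset (AID) and also transfers well in cross-dataset tests.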

Link: https://arxiv.org/abs/2511.22812
Authors: Kai Wang,Siyi Chen,Weicong Pang,Chenchen Zhang,Renjun Gao,Ziru Chen,Cheng Li,Dasa Gu,Rui Huang,Alexis Kai Hon Lau
Affiliations: The Hong Kong University of Science and Technology; The Chinese University of Hong Kong, Shenzhen; The Johns Hopkins University; National University of Singapore; Macau University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This work has been submitted to the IEEE for possible publication. The project is available at this https URL


Abstract:Land-cover underpins ecosystem services, hydrologic regulation, disaster-risk reduction, and evidence-based land planning; timely, accurate land-cover maps are therefore critical for environmental stewardship. Remote sensing-based land-cover classification offers a scalable route to such maps but is hindered by scarce and imbalanced annotations and by geometric distortions in high-resolution scenes. We propose LC4-DViT (Land-cover Creation for Land-cover Classification with Deformable Vision Transformer), a framework that combines generative data creation with a deformation-aware Vision Transformer. A text-guided diffusion pipeline uses GPT-4o-generated scene descriptions and super-resolved exemplars to synthesize class-balanced, high-fidelity training images, while DViT couples a DCNv4 deformable convolutional backbone with a Vision Transformer encoder to jointly capture fine-scale geometry and global context. On eight classes from the Aerial Image Dataset (AID), namely Beach, Bridge, Desert, Forest, Mountain, Pond, Port, and River, DViT achieves 0.9572 overall accuracy, 0.9576 macro F1-score, and 0.9510 Cohen's Kappa, improving over a vanilla ViT baseline (0.9274 OA, 0.9300 macro F1, 0.9169 Kappa) and outperforming ResNet50, MobileNetV2, and FlashInternImage. Cross-dataset experiments on a three-class SIRI-WHU subset (Harbor, Pond, River) yield 0.9333 overall accuracy, 0.9316 macro F1, and 0.8989 Kappa, indicating good transferability. An LLM-based judge using GPT-4o to score Grad-CAM heatmaps further shows that DViT's attention aligns best with hydrologically meaningful structures. These results suggest that description-driven generative augmentation combined with deformation-aware transformers is a promising approach for high-resolution land-cover mapping.
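
For readers less familiar with the three metrics reported above, the following sketch computes overall accuracy, macro F1, and Cohen's Kappa from a confusion matrix using their standard definitions. The `classification_metrics` helper and the 2x2 toy matrix are illustrative assumptions and have no connection to the paper's data or code.

```python
def classification_metrics(confusion):
    """Compute overall accuracy, macro F1, and Cohen's kappa.

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n_classes))
    row_sums = [sum(row) for row in confusion]  # true-class totals (TP + FN)
    col_sums = [sum(confusion[i][j] for i in range(n_classes)) for j in range(n_classes)]  # TP + FP

    overall_accuracy = correct / total

    # Per-class F1 = 2*TP / (2*TP + FP + FN) = 2*TP / (row_sum + col_sum).
    f1_scores = []
    for c in range(n_classes):
        denominator = row_sums[c] + col_sums[c]
        f1_scores.append(2 * confusion[c][c] / denominator if denominator else 0.0)
    macro_f1 = sum(f1_scores) / n_classes

    # Kappa corrects accuracy for the agreement expected by chance.
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / total ** 2
    kappa = (overall_accuracy - expected) / (1 - expected)
    return overall_accuracy, macro_f1, kappa

# Toy 2-class example (made-up counts).
oa, macro_f1, kappa = classification_metrics([[9, 1], [2, 8]])
```

Kappa is typically lower than overall accuracy because it discounts chance agreement, which is consistent with the pattern in the figures quoted in the abstract.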

[CV-93] From Pixels to Feelings: Aligning MLLMs with Human Cognitive Perception of Images

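[Quick Read]: This paper addresses the weakness of current Multimodal Large Language Models (MLLMs) in understanding subjective cognitive properties of images, such as memorability, humor, aesthetic appeal, and emotional impact. Existing models identify image content accurately but are poor at modeling how humans perceive these subtler qualities. The key to the solution is CogIP-Bench, a benchmark for systematically evaluating MLLMs on image cognitive properties, together with a post-training phase that markedly improves agreement with human judgments. The paper further shows that this cognitive alignment is not only predictive but also transfers to downstream creative tasks, for example guiding an image generation pipeline toward images that better embody a target trait (more memorable or more visually appealing), in support of more human-centric AI.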

Link: https://arxiv.org/abs/2511.22805
Authors: Yiming Chen,Junlin Han,Tianyi Bai,Shengbang Tong,Filippos Kokkinos,Philip Torr
Affiliations: Oxford University; HKUST; University College London; New York University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: Project page with codes/datasets/models: this https URL


Abstract:While Multimodal Large Language Models (MLLMs) are adept at answering what is in an image (identifying objects and describing scenes), they often lack the ability to understand how an image feels to a human observer. This gap is most evident when considering subjective cognitive properties, such as what makes an image memorable, funny, aesthetically pleasing, or emotionally evocative. To systematically address this challenge, we introduce CogIP-Bench, a comprehensive benchmark for evaluating MLLMs on such image cognitive properties. Our evaluation reveals a significant gap: current models are poorly aligned with human perception of these nuanced properties. We then demonstrate that a post-training phase can effectively bridge this gap, significantly enhancing the model’s alignment with human judgments. Furthermore, we show that this learned cognitive alignment is not merely predictive but also transferable to downstream creative tasks. By integrating our cognitively-aligned MLLM into an image generation pipeline, we can guide the synthesis process to produce images that better embody desired traits, such as being more memorable or visually appealing. Our work provides a benchmark to measure this human-like perception, a post-training pipeline to enhance it, and a demonstration that this alignment unlocks more human-centric AI.

[CV-94] World in a Frame: Understanding Culture Mixing as a New Challenge for Vision-Language Models

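[Quick Read]: This paper addresses the poor performance of Large Vision-Language Models (LVLMs) in culture mixing scenarios, where elements from different cultures appear together in one visual scene and the model struggles to identify and separate their cultural identities. The study finds a clear performance drop in such settings: models rely heavily on the background (accuracy falls by 14% when cultural backgrounds are added) and give inconsistent predictions for the same food in different contexts. The key to the solution is supervised fine-tuning on a diverse culture mixing dataset, which substantially improves model consistency and reduces background sensitivity, making the models more robust in real-world multicultural environments.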

Link: https://arxiv.org/abs/2511.22787
Authors: Eunsu Kim,Junyeong Park,Na Min An,Junseong Kim,Hitesh Laxmichand Patel,Jiho Jin,Julia Kruk,Amit Agarwal,Srikant Panda,Fenal Ashokbhai Ilasariya,Hyunjung Shim,Alice Oh
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:In a globalized world, cultural elements from diverse origins frequently appear together within a single visual scene. We refer to these as culture mixing scenarios, yet how Large Vision-Language Models (LVLMs) perceive them remains underexplored. We investigate culture mixing as a critical challenge for LVLMs and examine how current models behave when cultural items from multiple regions appear together. To systematically analyze these behaviors, we construct CultureMix, a food Visual Question Answering (VQA) benchmark with 23k diffusion-generated, human-verified culture mixing images across four subtasks: (1) food-only, (2) food+food, (3) food+background, and (4) food+food+background. Evaluating 10 LVLMs, we find consistent failures to preserve individual cultural identities in mixed settings. Models show strong background reliance, with accuracy dropping 14% when cultural backgrounds are added to food-only baselines, and they produce inconsistent predictions for identical foods across different contexts. To address these limitations, we explore three robustness strategies. We find that supervised fine-tuning using a diverse culture mixing dataset substantially improves model consistency and reduces background sensitivity. We call for increased attention to culture mixing scenarios as a critical step toward developing LVLMs capable of operating reliably in culturally diverse real-world environments.

[CV-95] Distracted Robot: How Visual Clutter Undermine Robotic Manipulation

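[Quick Read]: This paper addresses the lack of a consistent way to evaluate robotic manipulation policies in cluttered scenes, where conventional evaluations do not adequately account for environmental complexity or how distractors are distributed. The key to the solution is a unified clutter measure, motivated by a psychophysical perspective, that accounts for environmental factors and for the quantity, characteristics, and arrangement of distractors. Systematic evaluation scenarios are built on this measure in both high-fidelity simulation and the real world. Experiments show that the measure is an effective indicator of performance degradation, and that different vision-language-action (VLA) models have distinct vulnerabilities under clutter and relatively low agreement on which scenarios they succeed in. The result is a quantifiable, comparable benchmark for improving robot robustness in complex real-world scenes.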

Link: https://arxiv.org/abs/2511.22780
Authors: Amir Rasouli,Montgomery Alban,Sajjad Pakdamansavoji,Zhiyuan Li,Zhanguang Zhang,Aaron Wu,Xuan Zhao
Affiliations: Huawei Technologies Canada
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 figures, 2 tables


Abstract:In this work, we propose an evaluation protocol for examining the performance of robotic manipulation policies in cluttered scenes. Contrary to prior works, we approach evaluation from a psychophysical perspective; we therefore use a unified measure of clutter that accounts for environmental factors as well as the distractors' quantity, characteristics, and arrangement. Using this measure, we systematically construct evaluation scenarios in both hyper-realistic simulation and the real world, and conduct extensive experimentation on manipulation policies, in particular vision-language-action (VLA) models. Our experiments highlight the significant impact of scene clutter, which lowers the performance of the policies by as much as 34%, and show that despite achieving similar average performance across the tasks, different VLA policies have unique vulnerabilities and a relatively low agreement on success scenarios. We further show that our clutter measure is an effective indicator of performance degradation and analyze the impact of distractors in terms of their quantity and occluding influence. Finally, we show that finetuning on enhanced data, although effective, does not equally remedy all negative impacts of clutter on performance.

[CV-96] Alzheimer's Disease Prediction Using EffNetViTLoRA and BiLSTM with Multimodal Longitudinal MRI Data

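[Quick Read]: This paper addresses early prediction of Alzheimer's disease (AD), and in particular the uncertainty over which cases of Mild Cognitive Impairment (MCI) will convert to AD. Because not all MCI patients progress to AD, accurately distinguishing stable MCI (sMCI) from progressive MCI (pMCI) is essential for early intervention. The key to the solution is an end-to-end multimodal deep learning model. It combines Convolutional Neural Networks (CNNs) and Vision Transformers to extract local spatial features and global contextual dependencies from Magnetic Resonance Imaging (MRI) scans, and uses a Bidirectional Long Short-Term Memory (BiLSTM) network to model temporal change across MRI features from four consecutive timepoints together with non-image biomarkers, predicting each subject's cognitive status at month 48. On the ADNI dataset the method reaches an average prediction accuracy of 95.05%, which the authors report as better than existing studies.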

Link: https://arxiv.org/abs/2511.22774
Authors: Mahdieh Behjat Khatooni,Mohsen Soryani
Affiliations: Iran University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Alzheimer’s disease (AD) is a prevalent neurodegenerative disorder that progressively impairs memory, decision-making, and overall cognitive function. As AD is irreversible, early prediction is critical for timely intervention and management. Mild Cognitive Impairment (MCI), a transitional stage between cognitively normal (CN) aging and AD, plays a significant role in early AD diagnosis. However, predicting MCI progression remains a significant challenge, as not all individuals with MCI convert to AD. MCI subjects are categorized into stable MCI (sMCI) and progressive MCI (pMCI) based on conversion status. In this study, we propose a generalized, end-to-end deep learning model for AD prediction using MCI cases from the Alzheimer’s Disease Neuroimaging Initiative (ADNI). Our hybrid architecture integrates Convolutional Neural Networks and Vision Transformers to capture both local spatial features and global contextual dependencies from Magnetic Resonance Imaging (MRI) scans. To incorporate temporal progression, we further employ Bidirectional Long Short-Term Memory (BiLSTM) networks to process features extracted from four consecutive MRI timepoints along with some other non-image biomarkers, predicting each subject’s cognitive status at month 48. Our multimodal model achieved an average progression prediction accuracy of 95.05% between sMCI and pMCI, outperforming existing studies in AD prediction. This work demonstrates state-of-the-art performance in longitudinal AD prediction and highlights the effectiveness of combining spatial and temporal modeling for the early detection of Alzheimer’s disease.

[CV-97] Fusion or Confusion? Assessing the impact of visible-thermal image fusion for automated wildlife detection

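[Quick Read]: This paper addresses the inefficiency of traditional wildlife survey methods, focusing on whether fusing visible (VIS) and thermal infrared (TIR) aerial imagery improves automated detection of great blue heron (Ardea herodias) individuals and nests. The key to the solution is a comparison of two image fusion strategies, early fusion (based on principal component analysis, PCA) and late fusion (based on a classification and regression tree, CART), combined with a YOLO11n detection model, with VIS and TIR images aligned automatically by a deep learning model. Late fusion raises the F1 score for the main class, occupied nest, from 90.2% to 93.0% and can identify false positives from both sources, supporting the value of combining multi-source remote sensing data for ecological monitoring.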

Link: https://arxiv.org/abs/2511.22768
Authors: Camille Dionne-Pierre,Samuel Foucher,Jérôme Théau,Jérôme Lemaître,Patrick Charbonneau,Maxime Brousseau,Mathieu Varin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 9 figures, submitted to Remote Sensing in Ecology and Conservation


Abstract:Efficient wildlife monitoring methods are necessary for biodiversity conservation and management. The combination of remote sensing, aerial imagery and deep learning offers promising opportunities to renew or improve existing survey methods. The complementary use of visible (VIS) and thermal infrared (TIR) imagery can add information compared to a single-source image and improve results in an automated detection context. However, the alignment and fusion process can be challenging, especially since visible and thermal images usually have different fields of view (FOV) and spatial resolutions. This research presents a case study on the great blue heron (Ardea herodias) to evaluate the performance of synchronous aerial VIS and TIR imagery for automatically detecting individuals and nests using a YOLO11n model. Two VIS-TIR fusion methods were tested and compared, an early fusion approach and a late fusion approach, to determine whether adding the TIR image provides any value over a VIS-only model. VIS and TIR images were automatically aligned using a deep learning model. A principal component analysis fusion method was applied to VIS-TIR image pairs to form the early fusion dataset. A classification and regression tree was used to process the late fusion dataset, based on the detections from the VIS-only and TIR-only trained models. Across all classes, both late and early fusion improved the F1 score compared to the VIS-only model. For the main class, occupied nest, the late fusion improved the F1 score from 90.2% (VIS-only) to 93.0%. This model was also able to identify false positives from both sources with 90% recall. Although fusion methods seem to give better results, this approach comes with a limiting TIR FOV and alignment constraints that eliminate data. Using an aircraft-mounted very high-resolution visible sensor could be an interesting option for operationalizing surveys.

[CV-98] MammoRGB: Dual-View Mammogram Synthesis Using Denoising Diffusion Probabilistic Models

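[Quick Read]: This paper addresses limited image fidelity and cross-view consistency in the synthesis of dual-view mammograms (craniocaudal, CC, and mediolateral oblique, MLO). The key to the solution is a three-channel denoising diffusion probabilistic model (DDPM): alongside the two view channels, a third channel encodes their relationship (as a sum, an absolute difference, or a zero channel) to help the model capture the structure shared by the two views and generate synthetic images with high fidelity and good cross-view consistency. Models using the sum or absolute-difference encoding outperform the others on IoU and Dice similarity coefficient (p < 0.001), and the generated images have distributions close to real data (EMD = 0.020, KS = 0.077).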

Link: https://arxiv.org/abs/2511.22759
Authors: Jorge Alberto Garza-Abdala,Gerardo A. Fumagal-González,Daly Avendano,Servando Cardona,Sadam Hussain,Eduardo de Avila-Armenta,Jasiel H. Toscano-Martínez,Diana S. M. Rosales Gurmendi,Alma A. Pedro-Pérez,Jose Gerardo Tamez-Pena
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:


Abstract:Purpose: This study aims to develop and evaluate a three-channel denoising diffusion probabilistic model (DDPM) for synthesizing single-breast dual-view mammograms and to assess the impact of channel representations on image fidelity and cross-view consistency. Materials and Methods: A pretrained three-channel DDPM, sourced from Hugging Face, was fine-tuned on a private dataset of 11020 screening mammograms to generate paired craniocaudal (CC) and mediolateral oblique (MLO) views. Three third-channel encodings of the CC and MLO views were evaluated: sum, absolute difference, and zero channel. Each model produced 500 synthetic image pairs. Quantitative assessment involved breast mask segmentation using Intersection over Union (IoU) and Dice Similarity Coefficient (DSC), with distributional comparisons against 2500 real pairs using Earth Mover's Distance (EMD) and Kolmogorov-Smirnov (KS) tests. Qualitative evaluation included a visual Turing test by a non-expert radiologist to assess cross-view consistency and artifacts. Results: Synthetic mammograms showed IoU and DSC distributions comparable to real images, with EMD and KS values of 0.020 and 0.077, respectively. Models using sum or absolute difference encodings outperformed others in IoU and DSC (p < 0.001), though distributions remained broadly similar. Generated CC and MLO views maintained cross-view consistency, with 6 to 8 percent of synthetic images exhibiting artifacts consistent with those in the training data. Conclusion: Three-channel DDPMs can generate realistic and anatomically consistent dual-view mammograms with promising applications in dataset augmentation.
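
The three third-channel encodings and the mask-overlap metrics can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: that the CC and MLO views occupy the first two channels, that pixel values lie in [0, 1], and that the sum is clipped to 1.0. None of these details are given in the abstract, and the helper names are hypothetical.

```python
def third_channel(cc, mlo, mode):
    """Build the third channel from CC and MLO pixel grids (values assumed in [0, 1])."""
    if mode == "sum":
        # Clipping is an assumption; the abstract does not say how the sum is scaled.
        return [[min(a + b, 1.0) for a, b in zip(ra, rb)] for ra, rb in zip(cc, mlo)]
    if mode == "abs_diff":
        return [[abs(a - b) for a, b in zip(ra, rb)] for ra, rb in zip(cc, mlo)]
    if mode == "zero":
        return [[0.0 for _ in ra] for ra in cc]
    raise ValueError(f"unknown mode: {mode}")

def encode_pair(cc, mlo, mode):
    """Stack CC, MLO, and the derived channel into one three-channel sample."""
    return [cc, mlo, third_channel(cc, mlo, mode)]

def iou_and_dice(mask_a, mask_b):
    """IoU and Dice between two binary masks given as nested lists of 0/1."""
    a = [p for row in mask_a for p in row]
    b = [p for row in mask_b for p in row]
    intersection = sum(x and y for x, y in zip(a, b))
    union = sum(x or y for x, y in zip(a, b))
    return intersection / union, 2 * intersection / (sum(a) + sum(b))

# Toy 2x2 "views" and masks (made-up values).
cc = [[0.2, 0.8], [0.5, 0.0]]
mlo = [[0.4, 0.6], [0.5, 0.1]]
sample = encode_pair(cc, mlo, "abs_diff")
iou, dice = iou_and_dice([[1, 1], [0, 0]], [[1, 0], [1, 0]])
```

Packing both views into one multi-channel image lets a pretrained three-channel (RGB-style) model generate the pair jointly, which appears to be the motivation for this design.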

[CV-99] All Centers Are at most a Few Tokens Apart: Knowledge Distillation with Domain Invariant Prompt Tuning

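[Quick Read]: This paper addresses domain shift in computational pathology (CPath) caused by differences in staining protocols, scanner devices, and imaging settings, with the goal of improving generalization to cross-center data. Existing knowledge distillation methods based on vision-language models (VLMs) are limited by unstable zero-shot performance with predefined prompts, and histopathology lacks semantic descriptors such as "sketch", which makes it hard to design domain-specific prompts for clinical centers. The key to the solution is Domain Invariant Prompt Tuning (DIPT): it learns a separate set of input tokens for each domain and averages them across domains to obtain domain-invariant prompts, so that the student model can distill knowledge from the PLIP text encoder and align visual features with domain-invariant embeddings. This improves classification under multi-domain training, with F1-score gains over current state-of-the-art knowledge distillation methods.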

Link: https://arxiv.org/abs/2511.22739
Authors: Amir Mohammad Ezzati, Alireza Malekhosseini, Armin Khosravi, Mohammad Hossein Rohban
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:


Abstract:Domain generalization is critical in computational pathology (CPath) due to inherent domain shifts caused by variations in staining protocols, scanner devices, and imaging settings across clinical centers. Vision-language models (VLMs), such as PLIP, a pathology-tuned CLIP trained on image-text pairs across diverse domains, serve as strong knowledge distillation sources. However, their zero-shot performance with predefined prompts remains limited due to sensitivity to prompt variations. Moreover, unlike natural images, histopathology centers lack semantic descriptors (e.g., "sketch"), making it difficult to define domain-specific prompts for clinical centers. This requires a data-driven approach for learning domain-specific and ultimately class-generic continuous prompts. We propose Domain Invariant Prompt Tuning (DIPT) for the knowledge distillation process, a novel step that learns multiple input tokens for each domain. These tokens are trained separately for each domain and are averaged across domains, leading to domain-invariant prompts. Our student model then distills knowledge from PLIP's text encoder by leveraging the prompts learned by DIPT. This leads to alignment of visual features with domain-invariant embeddings, enhancing generalization by training on multiple domains. Our method adds a significant improvement in average F1-score to existing state-of-the-art (SOTA) knowledge distillation approaches in domain generalization with histopathology datasets. This work paves the way for deploying robust CPath models in real-world clinical problems with heterogeneous data sources.
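
The averaging step described above can be sketched as follows. This is only an illustration of "train tokens per domain, then average across domains": the prompt tokens are represented as plain lists of floats, the function name and the center labels are hypothetical, and the per-domain training itself is not shown.

```python
def average_prompts(domain_prompts):
    """Average learned prompt tokens across domains.

    domain_prompts maps a domain name to a list of token vectors;
    every domain must use the same number of tokens and the same dimension.
    """
    prompts = list(domain_prompts.values())
    n_domains = len(prompts)
    n_tokens = len(prompts[0])
    dim = len(prompts[0][0])
    return [
        [sum(p[t][d] for p in prompts) / n_domains for d in range(dim)]
        for t in range(n_tokens)
    ]


# Two hypothetical clinical centers, two tokens each, embedding dimension 2.
learned = {
    "center_a": [[1.0, 2.0], [3.0, 4.0]],
    "center_b": [[3.0, 0.0], [1.0, 2.0]],
}
domain_invariant = average_prompts(learned)
```

In a real pipeline the averaged tokens would be fed to the text encoder in place of a hand-written prompt; here they are just numbers.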

[CV-100] Splat-SAP: Feed-Forward Gaussian Splatting for Human-Centered Scene with Scale-Aware Point Map Reconstruction AAAI2026

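[Quick Read]: This paper addresses high-quality free-viewpoint rendering from sparse views captured by binocular cameras, in particular the limitation that conventional Gaussian Splatting methods reconstruct unstably when input views are sparse and depend on dense-view optimization. The key to the solution is a two-stage learning strategy. Stage 1 trains a scale-aware pixel-wise point map in a self-supervised manner and uses iterative affinity learning to map the point map into real space, which makes the geometric representation more robust. Stage 2 refines the geometry of the two input views' point maps via stereo matching and anchors Gaussian primitives on the refined plane to render high-quality images, which mitigates the challenges of large disparity and sparse input.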

Link: https://arxiv.org/abs/2511.22704
Authors: Boyao Zhou, Shunyuan Zheng, Zhanfeng Liao, Zihan Ma, Hanzhang Tu, Boning Liu, Yebin Liu
Affiliation: Tsinghua University (清华大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2026. Project page: this https URL


Abstract:We present Splat-SAP, a feed-forward approach to render novel views of human-centered scenes from binocular cameras with large sparsity. Gaussian Splatting has shown promising potential in rendering tasks, but it typically necessitates per-scene optimization with dense input views. Although some recent approaches achieve feed-forward Gaussian Splatting rendering through geometry priors obtained by multi-view stereo, such approaches still require largely overlapped input views to establish the geometry prior. To bridge this gap, we leverage pixel-wise point map reconstruction to represent geometry, which is robust to large sparsity because it models each view independently. In general, we propose a two-stage learning strategy. In stage 1, we transform the point map into real space via an iterative affinity learning process, which facilitates camera control in the subsequent stage. In stage 2, we project point maps of two input views onto the target view plane and refine such geometry via stereo matching. Furthermore, we anchor Gaussian primitives on this refined plane in order to render high-quality images. As a metric representation, the scale-aware point map in stage 1 is trained in a self-supervised manner without 3D supervision, and stage 2 is supervised with a photometric loss. We collect multi-view human-centered data and demonstrate that our method improves both the stability of point map reconstruction and the visual quality of free-viewpoint rendering.

[CV-101] Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

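[Quick Read]: This paper addresses the problem that high-performance image generation is dominated by proprietary systems (such as Nano Banana Pro and Seedream 4.0), while leading open-source alternatives have very large parameter counts (20B to 80B), which makes inference and fine-tuning on consumer-grade hardware impractical. The key to the solution is Z-Image, a 6B-parameter foundation generative model built on a Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture. By systematically optimizing the whole model lifecycle, from a curated data infrastructure to a streamlined training curriculum, full training completes in only 314K H800 GPU hours. A few-step distillation scheme with reward post-training further yields Z-Image-Turbo, which achieves sub-second inference latency on an enterprise-grade H800 GPU and is compatible with consumer-grade hardware (16GB VRAM). The approach challenges the "scale-at-all-costs" paradigm and shows that performance matching or surpassing top commercial models is achievable at significantly lower computational cost.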

Link: https://arxiv.org/abs/2511.22699
Authors: Z-Image Team, Huanqia Cai, Sihan Cao, Ruoyi Du, Peng Gao, Steven Hoi, Shijie Huang, Zhaohui Hou, Dengyang Jiang, Xin Jin, Liangchen Li, Zhen Li, Zhong-Yu Li, David Liu, Dongyang Liu, Junhan Shi, Qilong Wu, Feng Yu, Chi Zhang, Shifeng Zhang, Shilin Zhou
Affiliation: Alibaba Group (阿里巴巴集团)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:The landscape of high-performance image generation models is currently dominated by proprietary systems, such as Nano Banana Pro and Seedream 4.0. Leading open-source alternatives, including Qwen-Image, Hunyuan-Image-3.0 and FLUX.2, are characterized by massive parameter counts (20B to 80B), making them impractical for inference and fine-tuning on consumer-grade hardware. To address this gap, we propose Z-Image, an efficient 6B-parameter foundation generative model built upon a Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture that challenges the "scale-at-all-costs" paradigm. By systematically optimizing the entire model lifecycle, from a curated data infrastructure to a streamlined training curriculum, we complete the full training workflow in just 314K H800 GPU hours (approx. $630K). Our few-step distillation scheme with reward post-training further yields Z-Image-Turbo, offering both sub-second inference latency on an enterprise-grade H800 GPU and compatibility with consumer-grade hardware (16GB VRAM). Additionally, our omni-pre-training paradigm also enables efficient training of Z-Image-Edit, an editing model with impressive instruction-following capabilities. Both qualitative and quantitative experiments demonstrate that our model achieves performance comparable to or surpassing that of leading competitors across various dimensions. Most notably, Z-Image exhibits exceptional capabilities in photorealistic image generation and bilingual text rendering, delivering results that rival top-tier commercial models, thereby demonstrating that state-of-the-art results are achievable with significantly reduced computational overhead. We publicly release our code, weights, and online demo to foster the development of accessible, budget-friendly, yet state-of-the-art generative models.

[CV-102] Ar2Can: An Architect and an Artist Leveraging a Canvas for Multi-Human Generation

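[Quick Read]: This paper addresses consistency problems in multi-human scene generation by current text-to-image models, such as duplicated faces, merged identities, and miscounted individuals. The key to the solution is Ar2Can, a two-stage framework that decouples spatial planning from identity rendering. In the first stage, the Architect module predicts a structured layout specifying where each person appears. In the second stage, the Artist module generates high-fidelity images guided by a spatially grounded face matching reward (combining Hungarian spatial alignment with ArcFace identity similarity), so that faces are placed accurately and identities are faithfully preserved. The method is optimized with Group Relative Policy Optimization (GRPO) using compositional rewards, and it substantially improves count accuracy and identity preservation using primarily synthetic data, without requiring real multi-human images.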

Link: https://arxiv.org/abs/2511.22690
Authors: Shubhankar Borse, Phuc Pham, Farzad Farhadzadeh, Seokeon Choi, Phong Ha Nguyen, Anh Tuan Tran, Sungrack Yun, Munawar Hayat, Fatih Porikli
Affiliation: Qualcomm AI Research (高通人工智能研究中心)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Despite recent advances in text-to-image generation, existing models consistently fail to produce reliable multi-human scenes, often duplicating faces, merging identities, or miscounting individuals. We present Ar2Can, a novel two-stage framework that disentangles spatial planning from identity rendering for multi-human generation. The Architect module predicts structured layouts, specifying where each person should appear. The Artist module then synthesizes photorealistic images, guided by a spatially-grounded face matching reward that combines Hungarian spatial alignment with ArcFace identity similarity. This approach ensures faces are rendered at correct locations and faithfully preserve reference identities. We develop two Architect variants, seamlessly integrated with our diffusion-based Artist model and optimized via Group Relative Policy Optimization (GRPO) using compositional rewards for count accuracy, image quality, and identity matching. Evaluated on the MultiHuman-Testbench, Ar2Can achieves substantial improvements in both count accuracy and identity preservation, while maintaining high perceptual quality. Notably, our method achieves these results using primarily synthetic data, without requiring real multi-human images.
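
A spatially grounded matching reward of the kind described above can be sketched in a few lines. The sketch makes several assumptions that are not in the abstract: faces are reduced to 2D centers, identity embeddings are plain vectors standing in for ArcFace features, the number of detected faces equals the number of planned faces, and the reward is the mean cosine similarity of matched pairs. A brute-force search over permutations replaces the Hungarian algorithm; it finds the same minimum-cost assignment but is only practical for a handful of faces.

```python
import math
from itertools import permutations


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def match_faces(layout_centers, detected_centers):
    """One-to-one assignment minimizing total center distance (brute force)."""
    if len(layout_centers) != len(detected_centers):
        raise ValueError("sketch assumes equal numbers of planned and detected faces")
    best_cost, best_perm = math.inf, ()
    for perm in permutations(range(len(detected_centers))):
        cost = sum(
            math.dist(layout_centers[i], detected_centers[j])
            for i, j in enumerate(perm)
        )
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return list(best_perm)


def face_matching_reward(layout_centers, ref_embeds, detected_centers, det_embeds):
    """Mean identity similarity between each planned slot and its matched face."""
    perm = match_faces(layout_centers, detected_centers)
    sims = [cosine(ref_embeds[i], det_embeds[j]) for i, j in enumerate(perm)]
    return sum(sims) / len(sims) if sims else 0.0
```

Matching on position first and scoring identity second is what makes the reward "spatially grounded": a correct face rendered in the wrong slot is compared against the wrong reference and scores poorly.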

[CV-103] Emergent Extreme-View Geometry in 3D Foundation Models

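[Quick Read]: This paper addresses the limited geometric reasoning of 3D foundation models (3DFMs) under extreme, non-overlapping views: existing models can jointly predict depth, pose, and point maps from images, but are constrained under extreme viewpoints that the training data does not cover. The key to the solution is a lightweight alignment scheme that refines the model's internal 3D representation by fine-tuning only a small subset of backbone bias terms while leaving the decoder heads unchanged. This targeted adaptation substantially improves relative pose estimation without degrading per-image depth or point map quality.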

Link: https://arxiv.org/abs/2511.22686
Authors: Yiwen Zhang, Joseph Tung, Ruojin Cai, David Fouhey, Hadar Averbuch-Elor
Affiliation: Cornell University (康奈尔大学); New York University (纽约大学); Kempner Institute, Harvard University (哈佛大学肯普纳研究所)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page is at this https URL


Abstract:3D foundation models (3DFMs) have recently transformed 3D vision, enabling joint prediction of depths, poses, and point maps directly from images. Yet their ability to reason under extreme, non-overlapping views remains largely unexplored. In this work, we study their internal representations and find that 3DFMs exhibit an emergent understanding of extreme-view geometry, despite never being trained for such conditions. To further enhance these capabilities, we introduce a lightweight alignment scheme that refines their internal 3D representation by tuning only a small subset of backbone bias terms, leaving all decoder heads frozen. This targeted adaptation substantially improves relative pose estimation under extreme viewpoints without degrading per-image depth or point quality. Additionally, we contribute MegaUnScene, a new benchmark of Internet scenes unseen by existing 3DFMs, with dedicated test splits for both relative pose estimation and dense 3D reconstruction. All code and data will be released.

[CV-104] Decoupled DMD: CFG Augmentation as the Spear, Distribution Matching as the Shield

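[Quick Read]: This paper addresses a misconception about the training objective in diffusion model distillation, specifically why Distribution Matching Distillation (DMD) achieves strong few-step generation in complex tasks such as text-to-image generation. The conventional view attributes DMD's success to matching the student's output distribution to that of the pre-trained teacher. Through a rigorous decomposition of the DMD training objective, the paper finds that in settings that require Classifier-Free Guidance (CFG), the core mechanism driving few-step distillation performance is not distribution matching (DM) but a previously overlooked component, CFG Augmentation (CA), which acts as the "engine" of distillation; the DM term instead serves as a "regularizer" that stabilizes training and reduces artifacts. This decoupled analysis clarifies the distinct roles of the two mechanisms and motivates more principled distillation designs, such as separate noise schedules for the engine and the regularizer, which further improve generation quality.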

Link: https://arxiv.org/abs/2511.22677
Authors: Dongyang Liu, Peng Gao, David Liu, Ruoyi Du, Zhen Li, Qilong Wu, Xin Jin, Sihan Cao, Shifeng Zhang, Hongsheng Li, Steven Hoi
Affiliation: Tongyi Lab, Alibaba Group (通义实验室,阿里巴巴集团); The Chinese University of Hong Kong (香港中文大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Diffusion model distillation has emerged as a powerful technique for creating efficient few-step and single-step generators. Among these, Distribution Matching Distillation (DMD) and its variants stand out for their impressive performance, which is widely attributed to their core mechanism of matching the student's output distribution to that of a pre-trained teacher model. In this work, we challenge this conventional understanding. Through a rigorous decomposition of the DMD training objective, we reveal that in complex tasks like text-to-image generation, where CFG is typically required for desirable few-step performance, the primary driver of few-step distillation is not distribution matching, but a previously overlooked component we identify as CFG Augmentation (CA). We demonstrate that this term acts as the core "engine" of distillation, while the Distribution Matching (DM) term functions as a "regularizer" that ensures training stability and mitigates artifacts. We further validate this decoupling by demonstrating that while the DM term is a highly effective regularizer, it is not unique; simpler non-parametric constraints or GAN-based objectives can serve the same stabilizing function, albeit with different trade-offs. This decoupling of labor motivates a more principled analysis of the properties of both terms, leading to a more systematic and in-depth understanding. This new understanding further enables us to propose principled modifications to the distillation process, such as decoupling the noise schedules for the engine and the regularizer, leading to further performance gains. Notably, our method has been adopted by the Z-Image (this https URL) project to develop a top-tier 8-step image generation model, empirically validating the generalization and robustness of our findings.
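
For intuition, the algebra behind such a split can be sketched with the standard classifier-free guidance formula. The notation below (teacher score $s_{\text{real}}$, student-side score $s_{\text{fake}}$, guidance scale $w$, condition $c$) is assumed for illustration and is not taken from the paper, whose exact decomposition and sign conventions may differ.

```latex
% Standard CFG applied to the teacher score, rewritten around the conditional score:
s^{\text{cfg}}_{\text{real}}(x_t, c)
  = s_{\text{real}}(x_t, \varnothing)
    + w \left[ s_{\text{real}}(x_t, c) - s_{\text{real}}(x_t, \varnothing) \right]
  = s_{\text{real}}(x_t, c)
    + (w - 1) \left[ s_{\text{real}}(x_t, c) - s_{\text{real}}(x_t, \varnothing) \right]

% A DMD-style update is driven by a score difference, which then splits into two parts:
s^{\text{cfg}}_{\text{real}}(x_t, c) - s_{\text{fake}}(x_t, c)
  = \underbrace{s_{\text{real}}(x_t, c) - s_{\text{fake}}(x_t, c)}_{\text{distribution matching}}
  + \underbrace{(w - 1) \left[ s_{\text{real}}(x_t, c) - s_{\text{real}}(x_t, \varnothing) \right]}_{\text{guidance-only term}}
```

Under these assumptions the second term depends only on the teacher and vanishes at $w = 1$, which is one way to see how a guidance-driven component can be separated from the distribution-matching component.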

[CV-105] A deep learning perspective on Rubens attribution

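[Quick Read]: This paper addresses the art-historical problem of distinguishing a painter's autograph works from those of workshop members, focusing on the complex attribution of Peter Paul Rubens and his workshop. The key to the solution is training a convolutional neural network (CNN) on a carefully curated dataset of verified autograph works and comparative works in order to identify micro-level stylistic features that reflect the artist's own brushwork and technique. This yields high-accuracy classification and attribution and provides computational support for traditional art authentication.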

Link: https://arxiv.org/abs/2511.22667
Authors: A. Afifi, A. Kalimullin, S. Korchagin, I. Kudryashov
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:This study explores the use of deep learning for the authentication and attribution of paintings, focusing on the complex case of Peter Paul Rubens and his workshop. A convolutional neural network was trained on a curated dataset of verified and comparative artworks to identify micro-level stylistic features characteristic of the master's hand. The model achieved high classification accuracy and demonstrated the potential of computational analysis to complement traditional art historical expertise, offering new insights into authorship and workshop collaboration.

[CV-106] VaMP: Variational Multi-Modal Prompt Learning for Vision-Language Models NEURIPS2025

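[Quick Read]: This paper addresses the problem that, when adapting vision-language models (VLMs) to downstream tasks with few labeled samples, existing multi-modal prompt learning methods use fixed, shared prompts and deterministic parameters and therefore struggle to capture instance-level variation and uncertainty. The key to the solution is the Variational Multi-Modal Prompt Learning (VaMP) framework, which generates instance-conditioned prompts by sampling from a learned posterior distribution to enable sample-specific prompt tuning, and introduces a class-aware prior derived from the instance representation and class prototype to integrate local and global semantics. The whole framework is trained end-to-end via reparameterized sampling, formulating prompt tuning as variational inference over latent prompt representations, which models both uncertainty and task structure.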

Link: https://arxiv.org/abs/2511.22664
Authors: Silin Cheng, Kai Han
Affiliation: Visual AI Lab, The University of Hong Kong (香港大学视觉人工智能实验室)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to NeurIPS 2025


Abstract:Vision-language models (VLMs), such as CLIP, have shown strong generalization under zero-shot settings, yet adapting them to downstream tasks with limited supervision remains a significant challenge. Existing multi-modal prompt learning methods typically rely on fixed, shared prompts and deterministic parameters, which limits their ability to capture instance-level variation or model uncertainty across diverse tasks and domains. To tackle this issue, we propose a novel Variational Multi-Modal Prompt Learning (VaMP) framework that enables sample-specific, uncertainty-aware prompt tuning in multi-modal representation learning. VaMP generates instance-conditioned prompts by sampling from a learned posterior distribution, allowing the model to personalize its behavior based on input content. To further enhance the integration of local and global semantics, we introduce a class-aware prior derived from the instance representation and class prototype. Building upon these, we formulate prompt tuning as variational inference over latent prompt representations and train the entire framework end-to-end through reparameterized sampling. Experiments on few-shot and domain generalization benchmarks show that VaMP achieves state-of-the-art performance, highlighting the benefits of modeling both uncertainty and task structure in our method. Project page: this https URL
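
The two generic ingredients named above, reparameterized sampling and a KL term between a posterior and a prior, can be sketched as below. This assumes diagonal Gaussians parameterized by a mean and a log-variance; the function names are hypothetical, and the networks that would produce the posterior and the class-aware prior in VaMP are not shown.

```python
import math
import random


def sample_prompt(mu, log_var, rng):
    """Reparameterized sample z = mu + sigma * eps with eps ~ N(0, 1), elementwise."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def kl_diag_gaussians(mu_q, log_var_q, mu_p, log_var_p):
    """KL(q || p) between two diagonal Gaussians, summed over dimensions."""
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, log_var_q, mu_p, log_var_p):
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total


rng = random.Random(0)                       # fixed seed so the sketch is repeatable
z = sample_prompt([1.0, -2.0], [0.0, 0.0], rng)
kl = kl_diag_gaussians([1.0], [0.0], [0.0], [0.0])
```

Writing the sample as a deterministic function of `mu`, `log_var`, and external noise is what lets gradients flow to the distribution parameters in an autodiff framework; the plain-Python version here only shows the arithmetic.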

[CV-107] Architecture Decoupling Is Not All You Need For Unified Multimodal Model

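[Quick Read]: This paper addresses the training difficulty that conflicting objectives cause in unified multimodal models for image generation and understanding, with particular attention to the loss of interleaved generation ability caused by excessive model decoupling (such as double image encoders or a frozen multimodal large language model). The key to the solution is the Attention Interaction Alignment (AIA) loss, which explicitly learns task-specific cross-modal interaction patterns to alleviate task conflicts and thereby improves performance without relying on model decoupling. Experiments show that applying AIA to Emu3 and Janus-Pro refines cross-modal attention patterns and improves both generation and understanding.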

Link: https://arxiv.org/abs/2511.22663
Authors: Dian Zheng, Manyuan Zhang, Hongyu Li, Kai Zou, Hongbo Liu, Ziyu Guo, Kaituo Feng, Yexin Liu, Ying Luo, Yan Feng, Peng Pei, Xunliang Cai, Hongsheng Li
Affiliation: CUHK MMLab (香港中文大学多媒体实验室); Meituan (美团); University of Science and Technology of China (中国科学技术大学); Tongji University (同济大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL Code: this https URL


Abstract:Unified multimodal models for image generation and understanding represent a significant step toward AGI and have attracted widespread attention from researchers. The main challenge of this task lies in the difficulty of establishing an optimal training paradigm due to inherently conflicting targets in understanding and generation tasks. To alleviate these conflicts and pursue higher performance, many researchers adopt varying degrees of model decoupling (e.g., double image encoders, MOE/MOT architecture, or frozen MLLM). However, excessive model decoupling can lead to the loss of interleaved generation ability, undermining the original intent of unified models. In this work, we aim to explore how to mitigate task conflicts without resorting to model decoupling. First, we analyze why decoupling alleviates conflicts by studying the cross-modal attention behavior of models. We observe that model decoupling essentially drives models toward task-specific multimodal interaction patterns, as seen in Qwen-VL and HunyuanImage, and that the more thorough the decoupling, the more consistent the behavior becomes. Motivated by this observation, we propose Attention Interaction Alignment (AIA) loss, which explicitly learns task-specific multimodal interaction patterns during training. To demonstrate the generalizability of our AIA loss, we apply it to Emu3 and Janus-Pro during the SFT and post-training stages, respectively. Without bells and whistles, AIA not only refines cross-modal attention patterns, but also boosts both generation and understanding performance.

[CV-108] Geometrically-Constrained Agent for Spatial Reasoning

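[Quick Read]: This paper addresses a key problem in spatial reasoning with Vision Language Models (VLMs): the semantic-to-geometric gap. VLMs excel at qualitative semantic inference, but their reasoning takes place in a lossy semantic space that is misaligned with high-fidelity geometry, so spatial reasoning results are geometrically unreliable. Among existing approaches, training-based methods suffer from the "oracle paradox", while tool-integrated methods constrain the final computation but not the VLM's planning process, which yields geometrically flawed plans. The key to the solution is the Geometrically-Constrained Agent (GCA), a training-free agentic paradigm that introduces a formal task constraint to decouple the VLM's two roles. First, as a semantic analyst, the VLM translates the ambiguous query into a verifiable task constraint that defines the reference frame and objective. Then, as a task solver, it generates and executes tool calls within the deterministic bounds set by that constraint, giving a reliable, geometrically constrained reasoning pathway.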

Link: https://arxiv.org/abs/2511.22659
Authors: Zeren Chen, Xiaoya Lu, Zhijie Zheng, Pengrui Li, Lehan He, Yijin Zhou, Jing Shao, Bohan Zhuang, Lu Sheng
Affiliation: Beihang University (北京航空航天大学); Shanghai AI Laboratory; Shanghai Jiao Tong University (上海交通大学); Shanghai Innovation Institute; ZIP Lab, Zhejiang University (浙江大学)
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 27 pages, 13 figures


Abstract:Vision Language Models (VLMs) exhibit a fundamental semantic-to-geometric gap in spatial reasoning: they excel at qualitative semantic inference but their reasoning operates within a lossy semantic space, misaligned with high-fidelity geometry. Current paradigms fail to bridge this gap. Training-based methods suffer from an "oracle paradox," learning flawed spatial logic from imperfect oracles. Tool-integrated methods constrain the final computation but critically leave the VLM's planning process unconstrained, resulting in geometrically flawed plans. In this work, we propose Geometrically-Constrained Agent (GCA), a training-free agentic paradigm that resolves this gap by introducing a formal task constraint. Specifically, we strategically decouple the VLM's role into two stages. First, acting as a semantic analyst, the VLM translates the user's ambiguous query into the formal, verifiable task constraint, which defines the reference frame and objective. Second, acting as a task solver, the VLM generates and executes tool calls strictly within the deterministic bounds defined by the constraint. This geometrically-constrained reasoning strategy successfully resolves the semantic-to-geometric gap, yielding a robust and verifiable reasoning pathway for spatial reasoning. Comprehensive experiments demonstrate that GCA achieves SOTA performance on multiple spatial reasoning benchmarks, surpassing existing training-based and tool-integrated methods by ~27%. Please see our homepage at this https URL.

[CV-109] GeoZero: Incentivizing Reasoning from Scratch on Geospatial Scenes

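[Quick Read]: This paper addresses the high annotation cost and human bias introduced when remote sensing multimodal large language models (MLLMs) rely on manually curated chain-of-thought (CoT) data for cold-start training in geospatial scene understanding, which limits the diversity and generalization of model reasoning. The key to the solution is the GeoZero framework. It builds two datasets: GeoZero-Instruct, used for initial supervised fine-tuning to acquire geospatial knowledge, and GeoZero-Hard, used to stimulate deep reasoning during the reinforcement learning stage. It also introduces Answer-Anchored Group Relative Policy Optimization (A²GRPO), which regularizes the reasoning process with the model's own answers to encourage diverse yet accurate reasoning, so that general reasoning capabilities emerge across geospatial tasks without predefined CoT supervision.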

Link: https://arxiv.org/abs/2511.22645
Authors: Di Wang, Shunyu Liu, Wentao Jiang, Fengxiang Wang, Yi Liu, Xiaolei Qin, Zhiming Luo, Chaoyang Zhou, Haonan Guo, Jing Zhang, Bo Du, Dacheng Tao, Liangpei Zhang
Affiliation: Wuhan University (武汉大学); Nanyang Technological University (南洋理工大学); Shanghai AI Laboratory (上海人工智能实验室)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code, data, and models will be publicly available at this https URL


Abstract:Multimodal large language models (MLLMs) have undergone rapid development in advancing geospatial scene understanding. Recent studies have sought to enhance the reasoning capabilities of remote sensing MLLMs, typically through cold-start training with elaborately curated chain-of-thought (CoT) data. However, this approach not only incurs substantial annotation costs but also introduces human biases that may limit the diversity of model reasoning. To address these challenges, we propose GeoZero, a framework that enables MLLMs to perform geospatial reasoning without any predefined CoT supervision. Specifically, we construct two datasets, GeoZero-Instruct and GeoZero-Hard. GeoZero-Instruct allows the model to acquire preliminary geospatial knowledge through supervised fine-tuning, while GeoZero-Hard stimulates deep reasoning during the subsequent reinforcement learning stage. Furthermore, we introduce Answer-Anchored Group Relative Policy Optimization (A²GRPO), where the reasoning process is regularized by the model's own answers, encouraging diverse yet accurate thinking. Extensive experiments on multiple remote sensing vision-language benchmarks demonstrate that GeoZero not only surpasses existing state-of-the-art methods but also fosters universal emergent reasoning capabilities across diverse geospatial tasks. Code, data, and models will be publicly available at this https URL.
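
As background for the GRPO family mentioned above, the group-relative advantage at its core can be sketched as follows. This shows only the generic normalization of rewards within a group of sampled responses; the answer-anchored regularization specific to the proposed variant is not described in enough detail here to reproduce. The function name is hypothetical, and using the population standard deviation with a small `eps` is one common choice, assumed for illustration.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled responses to the same prompt, scored 1.0 if the answer is correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the baseline is the group mean, no separate value network is needed; a group in which every response gets the same reward produces zero advantage for all of them.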

[CV-110] REASONEDIT: Towards Reasoning-Enhanced Image Editing Models

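[Quick Read]: This paper addresses the limitations of current image editing models in understanding abstract instructions and achieving accurate edits. Existing methods typically pair a frozen multimodal large language model (MLLM) encoder with a diffusion decoder; this handles basic edits but, because the MLLM's reasoning capabilities remain unused, struggles with complex or ambiguous instructions. The key to the solution is two reasoning mechanisms, thinking and reflection, combined in a thinking-editing-reflection loop: thinking uses the MLLM's world knowledge to interpret abstract instructions, and reflection reviews the editing result, corrects unintended manipulations, and identifies the stopping round. Experiments show significant gains over the baseline on ImgEdit, GEdit, and Kris when initializing from Step1X-Edit, with the largest improvement, 8.2%, on Kris.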

Link: https://arxiv.org/abs/2511.22625
Authors: Fukun Yin, Shiyu Liu, Yucheng Han, Zhibo Wang, Peng Xing, Rui Wang, Wei Cheng, Yingming Wang, Aojie Li, Zixin Yin, Pengtao Chen, Xiangyu Zhang, Daxin Jiang, Xianfang Zeng, Gang Yu
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: code: this https URL


Abstract:Recent advances in image editing models have shown remarkable progress. A common architectural design couples a multimodal large language model (MLLM) encoder with a diffusion decoder, as seen in systems such as Step1X-Edit and Qwen-Image-Edit, where the MLLM encodes both the reference image and the instruction but remains frozen during training. In this work, we demonstrate that unlocking the reasoning capabilities of the MLLM can further push the boundaries of editing models. Specifically, we explore two reasoning mechanisms, thinking and reflection, which enhance instruction understanding and editing accuracy. Based on that, our proposed framework enables image editing in a thinking-editing-reflection loop: the thinking mechanism leverages the world knowledge of the MLLM to interpret abstract instructions, while the reflection mechanism reviews editing results, automatically corrects unintended manipulations, and identifies the stopping round. Extensive experiments demonstrate that our reasoning approach achieves significant performance gains, with improvements on ImgEdit (+4.3%), GEdit (+4.7%), and Kris (+8.2%) when initializing our DiT from Step1X-Edit (ReasonEdit-S), and also outperforms previous open-source methods on both GEdit and Kris when integrated with Qwen-Image-Edit (ReasonEdit-Q).

[CV-111] Stable-Drift: A Patient-Aware Latent Drift Replay Method for Stabilizing Representations in Continual Learning

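[Quick Read]: This paper addresses catastrophic forgetting, which occurs when deep learning models are trained sequentially on new data. It severely limits continual deployment of AI in medical imaging, where models must keep adapting to data from new hospitals without compromising existing diagnostic knowledge. The key to the solution is a latent drift-guided replay mechanism: it identifies samples with high representational instability by quantifying the change in each sample's internal feature representation after naive domain adaptation, aggregates drift at the patient level, and stores in the memory buffer the slices from each patient with the greatest multi-layer representation shift for replay. On a cross-hospital COVID-19 CT classification task, the method substantially reduces forgetting compared with naive fine-tuning and random replay.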

Link: https://arxiv.org/abs/2511.22615
Authors: Paraskevi-Antonia Theofilou, Anuhya Thota, Stefanos Kollias, Mamatha Thota
Affiliation: National Technical University of Athens, Greece (希腊雅典国立技术大学); London School of Economics and Political Science, UK (英国伦敦经济学院); University of Lincoln, UK (英国林肯大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 8 pages, 2 figures


Abstract:When deep learning models are sequentially trained on new data, they tend to abruptly lose performance on previously learned tasks, a critical failure known as catastrophic forgetting. This challenge severely limits the deployment of AI in medical imaging, where models must continually adapt to data from new hospitals without compromising established diagnostic knowledge. To address this, we introduce a latent drift-guided replay method that identifies and replays samples with high representational instability. Specifically, our method quantifies this instability via latent drift, the change in a sample's internal feature representation after naive domain adaptation. To ensure diversity and clinical relevance, we aggregate drift at the patient level: our memory buffer stores the per-patient slices exhibiting the greatest multi-layer representation shift. Evaluated on a cross-hospital COVID-19 CT classification task using state-of-the-art CNN and Vision Transformer backbones, our method substantially reduces forgetting compared to naive fine-tuning and random replay. This work highlights latent drift as a practical and interpretable replay signal for advancing robust continual learning in real-world medical settings.
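
The selection rule described above can be sketched in plain Python. The sketch assumes that each slice comes with per-layer feature vectors recorded before and after adaptation, that drift is the sum of per-layer Euclidean distances, and that the buffer keeps the top-scoring slices per patient. The distance measure, the layer aggregation, and all names and sample values are assumptions for illustration, not details from the abstract.

```python
import math
from collections import defaultdict


def latent_drift(before, after):
    """Sum over layers of the L2 distance between old and new feature vectors."""
    return sum(math.dist(b, a) for b, a in zip(before, after))


def select_replay_slices(samples, per_patient=1):
    """Pick, for each patient, the slice ids with the largest latent drift."""
    by_patient = defaultdict(list)
    for s in samples:
        drift = latent_drift(s["before"], s["after"])
        by_patient[s["patient"]].append((drift, s["slice"]))
    buffer = {}
    for patient, scored in by_patient.items():
        scored.sort(key=lambda item: item[0], reverse=True)
        buffer[patient] = [slice_id for _, slice_id in scored[:per_patient]]
    return buffer


# Hypothetical features: one or two layers per slice, two values per layer.
samples = [
    {"patient": "p1", "slice": "p1_s0", "before": [[0.0, 0.0]], "after": [[3.0, 4.0]]},
    {"patient": "p1", "slice": "p1_s1", "before": [[0.0, 0.0]], "after": [[1.0, 0.0]]},
    {"patient": "p2", "slice": "p2_s0",
     "before": [[1.0, 1.0], [0.0, 0.0]], "after": [[1.0, 1.0], [0.0, 2.0]]},
]
replay_buffer = select_replay_slices(samples)
```

Grouping by patient before ranking keeps one patient with many unstable slices from filling the whole buffer.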

[CV-112] MG-Nav: Dual-Scale Visual Navigation via Sparse Spatial Memory

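[Quick Read]: This paper addresses long-horizon path planning and local obstacle-avoiding control in zero-shot visual navigation, in particular keeping navigation robust in unseen scenes. The key to the solution is MG-Nav (Memory-Guided Navigation), a dual-scale framework built around the Sparse Spatial Memory Graph (SMG). The SMG is region-centric: each node aggregates multi-view keyframes and object semantics, capturing both appearance and spatial structure while preserving viewpoint diversity. At the global level, an image-to-instance hybrid retrieval produces a sequence of reachable waypoints. At the local level, navigation runs in an obstacle-aware point-goal mode and switches to image-goal mode near the target. A lightweight geometric module, VGGT-adapter, built on the pre-trained VGGT model, aligns observation and goal features in a 3D-aware space to improve viewpoint alignment and goal recognition.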

Link: https://arxiv.org/abs/2511.22609
Authors: Bo Wang, Jiehong Lin, Chenzhi Liu, Xinting Hu, Yifei Yu, Tianjia Liu, Zhongrui Wang, Xiaojuan Qi
Affiliation: The University of Hong Kong (香港大学); Southern University of Science and Technology (南方科技大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 10 pages, 5 figures


Abstract:We present MG-Nav (Memory-Guided Navigation), a dual-scale framework for zero-shot visual navigation that unifies global memory-guided planning with local geometry-enhanced control. At its core is the Sparse Spatial Memory Graph (SMG), a compact, region-centric memory where each node aggregates multi-view keyframe and object semantics, capturing both appearance and spatial structure while preserving viewpoint diversity. At the global level, the agent is localized on SMG and a goal-conditioned node path is planned via an image-to-instance hybrid retrieval, producing a sequence of reachable waypoints for long-horizon guidance. At the local level, a navigation foundation policy executes these waypoints in point-goal mode with obstacle-aware control, and switches to image-goal mode when navigating from the final node towards the visual target. To further enhance viewpoint alignment and goal recognition, we introduce VGGT-adapter, a lightweight geometric module built on the pre-trained VGGT model, which aligns observation and goal features in a shared 3D-aware space. MG-Nav operates global planning and local control at different frequencies, using periodic re-localization to correct errors. Experiments on HM3D Instance-Image-Goal and MP3D Image-Goal benchmarks demonstrate that MG-Nav achieves state-of-the-art zero-shot performance and remains robust under dynamic rearrangements and unseen scene conditions.
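
The global step, planning a node path over a memory graph once the agent and the goal have been localized to nodes, can be illustrated with a plain breadth-first search. The abstract does not state which planner is used, so BFS over an unweighted adjacency dictionary is an assumption, and the region names below are hypothetical.

```python
from collections import deque


def plan_node_path(graph, start, goal):
    """Return the fewest-hops node path from start to goal, or None if unreachable."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:          # walk parent links back to the start
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph.get(node, ()):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


# Hypothetical region-level memory graph.
memory_graph = {
    "bedroom": ["hall"],
    "hall": ["bedroom", "kitchen"],
    "kitchen": ["hall", "pantry"],
    "pantry": ["kitchen"],
}
waypoints = plan_node_path(memory_graph, "bedroom", "pantry")
```

Each node on the returned path would then serve as a waypoint for the local controller; periodic re-localization simply means calling the planner again from the agent's updated node.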

[CV-113] GazeTrack: High-Precision Eye Tracking Based on Regularization and Spatial Computing

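[Quick Read]: This paper addresses the insufficient accuracy of current eye tracking for spatial computing applications. The key contributions are: first, GazeTrack, the first high-precision benchmark dataset, covering pupil localization and gaze tracking data across diverse ethnicities, ages, and visual acuity conditions; second, a novel shape error regularization method that constrains pupil ellipse fitting and, trained on open-source datasets, improves semantic segmentation and pupil position prediction accuracy; third, a coordinate transformation method similar to paper unfolding that predicts gaze vectors more accurately on the GazeTrack dataset; and finally, a gaze vector generation model that reduces gaze angle error at lower computational complexity.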

Link: https://arxiv.org/abs/2511.22607
Authors: Xiaoyin Yang
Affiliation: Dalian University of Technology (大连理工大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 10 pages, 7 figures


Abstract:Eye tracking has become increasingly important in virtual and augmented reality applications; however, the current gaze accuracy falls short of meeting the requirements for spatial computing. We designed a gaze collection framework and utilized high-precision equipment to gather the first precise benchmark dataset, GazeTrack, encompassing diverse ethnicities, ages, and visual acuity conditions for pupil localization and gaze tracking. We propose a novel shape error regularization method to constrain pupil ellipse fitting and train on open-source datasets, enhancing semantic segmentation and pupil position prediction accuracy. Additionally, we invent a novel coordinate transformation method similar to paper unfolding to accurately predict gaze vectors on the GazeTrack dataset. Finally, we built a gaze vector generation model that achieves reduced gaze angle error with lower computational complexity compared to other methods.

[CV-114] AnoRefiner: Anomaly-Aware Group-Wise Refinement for Zero-Shot Industrial Anomaly Detection

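[Quick Read]: This paper studies the coarse anomaly maps produced by zero-shot industrial anomaly detection (ZSAD) methods. The core challenge is that vision transformers (ViTs) extract only patch-level features, which makes fine pixel-level anomaly regions hard to recover. Existing methods try to predict finer anomalies from ZSAD features but still miss detections because of the gap between synthesized training anomalies and real ones. The key contribution is a pluggable anomaly-aware refiner (AnoRefiner). Its core is an anomaly refinement decoder (ARD) that progressively enhances image features with the complementary spatial cues in anomaly score maps, reducing reliance on synthetic anomaly data. It also introduces a progressive group-wise test-time training (PGT) strategy for mass production settings, which trains the ARD on each product group to refine the next, and is compatible with any ZSAD model. Experiments show that AnoRefiner improves various ZSAD models by up to 5.2% in pixel-AP.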

链接: https://arxiv.org/abs/2511.22595
作者: Dayou Huang,Feng Xue,Xurui Li,Yu Zhou
机构: Huazhong University of Science and Technology (华中科技大学); University of Trento (特伦托大学); Wuhan JingCe Electronic Group Co., LTD (武汉精测电子集团有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 10 figures

点击查看摘要

Abstract:Zero-shot industrial anomaly detection (ZSAD) methods typically yield coarse anomaly maps as vision transformers (ViTs) extract patch-level features only. To solve this, recent solutions attempt to predict finer anomalies using features from ZSAD, but they still struggle to recover fine-grained anomalies without missed detections, mainly due to the gap between randomly synthesized training anomalies and real ones. We observe that anomaly score maps exactly provide complementary spatial cues that are largely absent from ZSAD’s image features, a fact overlooked before. Inspired by this, we propose an anomaly-aware refiner (AnoRefiner) that can be plugged into most ZSAD models and improve patch-level anomaly maps to the pixel level. First, we design an anomaly refinement decoder (ARD) that progressively enhances image features using anomaly score maps, reducing the reliance on synthetic anomaly data. Second, motivated by the mass production paradigm, we propose a progressive group-wise test-time training (PGT) strategy that trains ARD in each product group for the refinement process in the next group, while staying compatible with any ZSAD method. Experiments on the MVTec AD and VisA datasets show that AnoRefiner boosts various ZSAD models by up to a 5.2% gain in pixel-AP metrics, which can also be directly observed in many visualizations. The code will be available at this https URL.

[CV-115] HarmoCLIP: Harmonizing Global and Regional Representations in Contrastive Vision-Language Models

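Quick read: This paper addresses the limited fine-grained semantic understanding of Contrastive Language-Image Pre-training (CLIP) models: without region-level supervision their local perception is weak, and existing attempts to improve it tend to disrupt global alignment, creating a trade-off between global and local performance. The key is the HarmoCLIP framework, which introduces an explicit fine-grained semantic supervision term that directly aligns textual segments with their corresponding visual regions, bridging the image region space and the text space. A Region-Language Alignment supervision strategy further strengthens local representations without compromising global semantic consistency. Together these harmonize global and local representations, ease the trade-off, and improve performance across tasks.

Link: https://arxiv.org/abs/2511.22594
Authors: Haoxi Zeng,Haoxuan Li,Yi Bin,Pengpeng Zeng,Xing Xu,Yang Yang,Heng Tao Shen
Affiliations: Tongji University (同济大学); University of Electronic Science and Technology of China (电子科技大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 13 pages, 7 figures, 6 tables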

Abstract:Contrastive Language-Image Pre-training (CLIP) has demonstrated remarkable generalization ability and strong performance across a wide range of vision-language tasks. However, due to the lack of region-level supervision, CLIP exhibits limited fine-grained semantic understanding. Although several methods attempt to mitigate this issue, they unintentionally disrupt the global alignment, resulting in a persistent trade-off where improving local perception simultaneously degrades global coherence. In this paper, we propose HarmoCLIP, a novel framework designed to harmonize global and region representations within CLIP. We first identify that the absence of direct alignment between local textual and visual semantics is the fundamental cause of the trade-off. To address this, HarmoCLIP introduces an explicit fine-grained semantic supervision term that directly aligns textual segments with their corresponding visual regions, effectively bridging the image region space and the textual space. To further strengthen the representation capability at the local level, our method introduces a novel Region-Language Alignment supervision strategy that promotes fine-grained semantic learning without compromising global semantic consistency. Extensive experiments demonstrate that HarmoCLIP achieves state-of-the-art (improvement up to 69.78%) performance on the global task of retrieval and yields a substantial 3.2% improvement in Top-1 accuracy on the region task of bounding-box classification, consistently outperforming prior approaches while providing a balanced, efficient, and plug-and-play solution to the global-local trade-off in CLIP. Code is available at this https URL.

[CV-116] Revisiting the Necessity of Lengthy Chain-of-Thought in Vision-centric Reasoning Generalization

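Quick read: This paper studies how different Chain-of-Thought (CoT) designs affect generalizable visual reasoning in vision-language models (VLMs). Although long or visually augmented CoT data (such as “think with image”) is widely used to supervise intermediate reasoning, it remains unclear which CoT structures actually help models generalize. The key is a systematic evaluation, on a controlled maze-solving benchmark, of three representative formats: Language CoT, Grounding CoT (with spatial coordinate trajectories), and Visual CoT (with image manipulations). The study finds that shorter CoT retaining only the essential grounding steps generalizes markedly better across difficulty levels, a “short is long” effect. The finding offers empirical evidence and practical guidance for building more generalizable supervised fine-tuning (SFT) datasets.

Link: https://arxiv.org/abs/2511.22586
Authors: Yifan Du,Kun Zhou,Yingqian Min,Yue Ling,Wayne Xin Zhao,Youbin Wu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: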

Abstract:We study how different Chain-of-Thought (CoT) designs affect the acquisition of the generalizable visual reasoning ability in vision-language models (VLMs). While CoT data, especially long or visual CoT such as “think with image”, has been widely used to supervise intermediate reasoning, it remains unclear why specific CoT designs help and which ones truly support generalizable reasoning. To systematically evaluate this, we focus on a controlled maze-solving benchmark where reasoning rules are fully visual, difficulty can be tuned by grid size, and all the intermediate steps can be automatically generated. Using Qwen2.5-VL-7B under a standard SFT-then-RL pipeline, we compare three representative CoT formats: Language CoT, Grounding CoT (with spatial coordinate trajectories), and Visual CoT (with image manipulations). Our experiments reveal that visual and longer CoT mainly accelerate convergence but do not lift the final performance ceiling; concise CoT containing only essential grounding steps outperforms longer traces; and, strikingly, CoT retaining only the minimal grounding results generalizes best across different maze sizes. We further validate these insights on other vision-centric tasks. These findings highlight a “short is long” effect and provide practical guidance for constructing more generalizable SFT datasets for visual reasoning.

[CV-117] Text Condition Embedded Regression Network for Automated Dental Abutment Design

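Quick read: This paper addresses the time-consuming, experience-dependent design of dental implant abutments and the risk of implant complications (such as peri-implantitis) caused by poorly designed abutments. The key is the Text Condition Embedded Abutment Design (TCEAD) framework. A text-guided localization (TGL) module helps the model locate the abutment area in oral scan data, and pre-training based on the Mesh Mask Autoencoder (MeshMAE) self-supervised framework improves extraction of local fine-grained features (implant width and height, and the distance to the opposing tooth), enabling efficient and adaptable automated abutment design.

Link: https://arxiv.org/abs/2511.22578
Authors: Mianjie Zheng,Xinquan Yang,Xuguang Li,Xiaoling Luo,Xuefen Liu,Kun Tang,He Meng,Linlin Shen
Affiliations: Shenzhen University (深圳大学); Shenzhen University General Hospital (深圳大学总医院)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: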

Abstract:The abutment is an important part of artificial dental implants, whose design process is time-consuming and labor-intensive. Long-term use of inappropriate dental implant abutments may result in implant complications, including peri-implantitis. Using artificial intelligence to assist dental implant abutment design can quickly improve the efficiency of abutment design and enhance abutment adaptability. In this paper, we propose a text condition embedded abutment design framework (TCEAD), the novel automated abutment design solution available in literature. The proposed study extends the self-supervised learning framework of the mesh mask autoencoder (MeshMAE) by introducing a text-guided localization (TGL) module to facilitate abutment area localization. As the parameter determination of the abutment is heavily dependent on local fine-grained features (the width and height of the implant and the distance to the opposing tooth), we pre-train the encoder using oral scan data to improve the model’s feature extraction ability. Moreover, considering that the abutment area is only a small part of the oral scan data, we designed a TGL module, which introduces the description of the abutment area through the text encoder of Contrastive Language-Image Pre-training (CLIP), enabling the network to quickly locate the abutment area. We validated the performance of TCEAD on a large abutment design dataset. Extensive experiments demonstrate that TCEAD achieves an Intersection over Union (IoU) improvement of 0.8%-12.85% over other mainstream methods, underscoring its potential in automated dental abutment design.

[CV-118] Bringing Your Portrait to 3D Presence

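Quick read: This paper addresses three bottlenecks in reconstructing an animatable 3D human avatar from a single portrait image (head, half-body, or full-body): pose- and framing-sensitive feature representations, limited scalable data, and unreliable proxy-mesh estimation. The solution has three components. First, a Dual-UV representation maps image features to a canonical UV space through Core-UV and Shell-UV branches, eliminating pose- and framing-induced token shifts. Second, a factorized synthetic data manifold combines 2D generative diversity with geometry-consistent 3D renderings, supported by a training scheme that improves realism and identity consistency. Third, a robust proxy-mesh tracker stays stable under partial visibility. Together these give strong in-the-wild generalization: trained only on half-body synthetic data, the model reaches state-of-the-art head and upper-body reconstruction and competitive full-body results.

Link: https://arxiv.org/abs/2511.22553
Authors: Jiawei Zhang,Lei Chu,Jiahao Li,Zhenyu Zang,Chong Li,Xiao Li,Xun Cao,Hao Zhu,Yan Lu
Affiliations: Nanjing University (南京大学); Microsoft Research Asia (亚洲微软研究院)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: project page: this https URL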

Abstract:We present a unified framework for reconstructing animatable 3D human avatars from a single portrait across head, half-body, and full-body inputs. Our method tackles three bottlenecks: pose- and framing-sensitive feature representations, limited scalable data, and unreliable proxy-mesh estimation. We introduce a Dual-UV representation that maps image features to a canonical UV space via Core-UV and Shell-UV branches, eliminating pose- and framing-induced token shifts. We also build a factorized synthetic data manifold combining 2D generative diversity with geometry-consistent 3D renderings, supported by a training scheme that improves realism and identity consistency. A robust proxy-mesh tracker maintains stability under partial visibility. Together, these components enable strong in-the-wild generalization. Trained only on half-body synthetic data, our model achieves state-of-the-art head and upper-body reconstruction and competitive full-body results. Extensive experiments and analyses further validate the effectiveness of our approach.

[CV-119] Diff-ICMH: Harmonizing Machine and Human Vision in Image Compression with Generative Prior NEURIPS2025

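Quick read: This paper addresses the fact that conventional image compression methods are optimized separately for human perception or for machine analysis and struggle to serve both. The core challenge is to preserve semantic integrity (crucial for machine tasks) and perceptual quality (important for human viewers) at the same time. The key is the Diff-ICMH framework: a Semantic Consistency loss (SC loss) ensures the compressed image retains critical semantic information, while generative priors improve perceptual realism. A Tag Guidance Module (TGM) stimulates the generative capabilities of a pre-trained diffusion model at minimal extra bit rate, so that a single codec and bitstream support multiple intelligent tasks without sacrificing perceptual quality.

Link: https://arxiv.org/abs/2511.22549
Authors: Ruoyu Feng,Yunpeng Qi,Jinming Liu,Yixin Gao,Xin Li,Xin Jin,Zhibo Chen
Affiliations: University of Science and Technology of China (中国科学技术大学); Eastern Institute of Technology, Ningbo (宁波东方理工大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by NeurIPS 2025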

Abstract:Image compression methods are usually optimized isolatedly for human perception or machine analysis tasks. We reveal fundamental commonalities between these objectives: preserving accurate semantic information is paramount, as it directly dictates the integrity of critical information for intelligent tasks and aids human understanding. Concurrently, enhanced perceptual quality not only improves visual appeal but also, by ensuring realistic image distributions, benefits semantic feature extraction for machine tasks. Based on this insight, we propose Diff-ICMH, a generative image compression framework aiming for harmonizing machine and human vision in image compression. It ensures perceptual realism by leveraging generative priors and simultaneously guarantees semantic fidelity through the incorporation of Semantic Consistency loss (SC loss) during training. Additionally, we introduce the Tag Guidance Module (TGM) that leverages highly semantic image-level tags to stimulate the pre-trained diffusion model’s generative capabilities, requiring minimal additional bit rates. Consequently, Diff-ICMH supports multiple intelligent tasks through a single codec and bitstream without any task-specific adaptation, while preserving high-quality visual experience for human perception. Extensive experimental results demonstrate Diff-ICMH’s superiority and generalizability across diverse tasks, while maintaining visual appeal for human perception. Code is available at: this https URL.

[CV-120] Fast3Dcache: Training-free 3D Geometry Synthesis Acceleration

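Quick read: This paper addresses the computational cost that iterative denoising imposes on 3D diffusion model inference, while avoiding the geometric inconsistency that existing caching-based acceleration methods cause in 3D through accumulated numerical error. The key is Fast3Dcache, a training-free, geometry-aware caching framework with two components: (1) a Predictive Caching Scheduler Constraint (PCSC) that dynamically sets cache quotas according to voxel stabilization patterns; and (2) a Spatiotemporal Stability Criterion (SSC) that selects stable latent features for reuse using velocity-magnitude and acceleration thresholds. The method speeds up inference by up to 27.12% and cuts FLOPs by 54.8% while keeping geometric quality close to the original model (Chamfer Distance increases by only 2.48% and F-Score drops by only 1.95%).

Link: https://arxiv.org/abs/2511.22533
Authors: Mengyu Yang,Yanming Yang,Chenyi Xu,Chenxi Song,Yufan Zuo,Tong Zhao,Ruibo Li,Chi Zhang
Affiliations: AGI Lab, Westlake University (西湖大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: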

Abstract:Diffusion models have achieved impressive generative quality across modalities like 2D images, videos, and 3D shapes, but their inference remains computationally expensive due to the iterative denoising process. While recent caching-based methods effectively reuse redundant computations to speed up 2D and video generation, directly applying these techniques to 3D diffusion models can severely disrupt geometric consistency. In 3D synthesis, even minor numerical errors in cached latent features accumulate, causing structural artifacts and topological inconsistencies. To overcome this limitation, we propose Fast3Dcache, a training-free geometry-aware caching framework that accelerates 3D diffusion inference while preserving geometric fidelity. Our method introduces a Predictive Caching Scheduler Constraint (PCSC) to dynamically determine cache quotas according to voxel stabilization patterns and a Spatiotemporal Stability Criterion (SSC) to select stable features for reuse based on velocity magnitude and acceleration criterion. Comprehensive experiments show that Fast3Dcache accelerates inference significantly, achieving up to a 27.12% speed-up and a 54.8% reduction in FLOPs, with minimal degradation in geometric quality as measured by Chamfer Distance (2.48%) and F-Score (1.95%).
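
The Chamfer Distance figure quoted above measures how far a generated point set drifts from a reference one. A minimal sketch of one common symmetric form follows, assuming the mean nearest-neighbour Euclidean distance in both directions. The paper's exact convention (for example, squared distances) is not stated here, and the function name and sample points are illustrative only.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer Distance between two non-empty 3D point sets.

    Assumed convention: mean nearest-neighbour Euclidean distance from
    A to B plus the same quantity from B to A.
    """
    def mean_nearest(src, dst):
        # For each point in src, take the distance to its closest point in dst.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return mean_nearest(points_a, points_b) + mean_nearest(points_b, points_a)


# Illustrative data: the second generated point is displaced along z.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
generated = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
cd = chamfer_distance(reference, generated)
print(cd)
```

This brute-force version is quadratic in the number of points; evaluations on dense meshes would normally use a spatial index instead.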

[CV-121] CoT4AD: A Vision-Language-Action Model with Explicit Chain-of-Thought Reasoning for Autonomous Driving

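Quick read: This paper addresses two core problems of current Vision-Language-Action (VLA) models in end-to-end autonomous driving: limited numerical reasoning, which makes complex driving scenarios requiring step-by-step causal reasoning hard to handle, and overly simplified input-output mappings, which weaken decision robustness in dynamic environments. The key is the CoT4AD framework, which introduces Chain-of-Thought (CoT) reasoning and explicitly models perception, question generation, prediction, and action as a reasoning chain aligned across multiple tasks, strengthening the numerical and causal reasoning of the vision-language model (VLM). During training it explicitly builds a perception-question-prediction-action CoT to align the reasoning space with the action space; during inference it uses implicit CoT for consistent numerical reasoning and robust decisions, improving performance on real-world and simulated benchmarks such as nuScenes and Bench2Drive.

Link: https://arxiv.org/abs/2511.22532
Authors: Zhaohui Wang,Tengbo Yu,Hao Tang
Affiliations: Peking University (北京大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 3 figures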

Abstract:Vision-Language-Action (VLA) models have recently attracted growing attention in end-to-end autonomous driving for their strong reasoning capabilities and rich world knowledge. However, existing VLAs often suffer from limited numerical reasoning ability and overly simplified input-output mappings, which hinder their performance in complex driving scenarios requiring step-by-step causal reasoning. To address these challenges, we propose CoT4AD, a novel VLA framework that introduces Chain-of-Thought (CoT) reasoning for autonomous driving to enhance both numerical and causal reasoning in Vision-Language Models (VLMs). CoT4AD integrates visual observations and language instructions to perform semantic reasoning, scene understanding, and trajectory planning. During training, it explicitly models a perception-question-prediction-action CoT to align the reasoning space with the action space across multiple driving tasks. During inference, it performs implicit CoT reasoning to enable consistent numerical reasoning and robust decision-making in dynamic environments. Extensive experiments on both real-world and simulated benchmarks, including nuScenes and Bench2Drive, demonstrate that CoT4AD achieves state-of-the-art performance in both open-loop and closed-loop evaluations. Code will be released upon paper acceptance.

[CV-122] DocVAL: Validated Chain-of-Thought Distillation for Grounded Document VQA

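Quick read: This paper addresses the sharp accuracy-efficiency trade-off in document visual question answering (DocVQA): large teacher models ground answers well but are too expensive to deploy, while lightweight student models lose much of their localization performance because they lack spatial reasoning ability. The key is DocVAL, a validated chain-of-thought distillation framework that transfers spatial reasoning through three components: (1) text detection at validation time to filter and denoise training signals; (2) a multi-module validator (VAL) that enforces answer correctness and geometric consistency and produces fine-grained, pixel-level error feedback; and (3) a two-stage student training scheme that first learns from validated chain-of-thought traces and then refines iteratively on VAL feedback. The student (Gemma-3 12B) reaches 91.4% ANLS and 82.4% mAP without OCR or text detection at inference.

Link: https://arxiv.org/abs/2511.22521
Authors: Ahmad Mohammadshirazi,Pinaki Prasad Guha Neogi,Dheeraj Kulshrestha,Rajiv Ramnath
Affiliations: Ohio State University (俄亥俄州立大学); Flairsoft (Flairsoft)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: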

Abstract:Document visual question answering (DocVQA) requires models to jointly reason over textual content and spatial layout, yet current systems exhibit a sharp accuracy–efficiency trade-off: large teacher models achieve strong grounding but are too expensive for deployment, while compact students suffer substantial drops in localization performance. We propose DocVAL, a validated chain-of-thought distillation framework that transfers the spatial reasoning ability of a large teacher into a deployable student VLM through three key components: (1) teacher supervision with validation-time text detection to filter and denoise training signals, (2) a multi-module validator (VAL) that enforces answer correctness and geometric consistency while producing fine-grained, pixel-level error feedback, and (3) a two-stage student training scheme that first learns from validated CoT traces and then undergoes iterative refinement driven by VAL feedback. Our student (Gemma-3 12B) achieves 91.4% ANLS and 82.4% mAP on DocVQA as a pure VLM requiring no text detection or OCR at inference. Extensive ablations demonstrate that validated feedback contributes 6.3 mAP gain and iterative refinement accounts for 9.7 mAP improvement. We release 95k high-quality, validator-verified CoT traces to advance spatial reasoning research in document understanding.
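
ANLS (Average Normalized Levenshtein Similarity) is the answer metric reported above. The sketch below assumes the usual DocVQA convention of a 0.5 threshold and case-insensitive matching. The function names and sample strings are illustrative, and this is not the authors' evaluation code.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def anls(predictions, references, threshold=0.5):
    """Mean over questions of the best thresholded similarity.

    predictions: one predicted string per question (non-empty list).
    references: one list of accepted answers per question.
    """
    total = 0.0
    for pred, answers in zip(predictions, references):
        best = 0.0
        for ans in answers:
            p, a = pred.strip().lower(), ans.strip().lower()
            longest = max(len(p), len(a))
            nl = levenshtein(p, a) / longest if longest else 0.0
            # Similarity counts only when the normalized distance is below the threshold.
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        total += best
    return total / len(predictions)


near_miss = anls(["invoce"], [["Invoice"]])  # one missing character
print(near_miss)
```

The threshold is what separates a near-miss such as an OCR slip, which earns partial credit, from an unrelated answer, which scores zero.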

[CV-123] RealD²iff: Bridging Real-World Gap in Robot Manipulation via Depth Diffusion

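Quick read: This paper addresses the visual sim2real gap in robot manipulation: depth observations obtained in simulation do not reflect the complex noise patterns inherent to real sensors. The key is a new clean-to-noisy paradigm that, inspired by the denoising capability of diffusion models, learns to synthesize noisy depth maps with real-world characteristics, bridging the gap through purely simulation-driven learning. The core contribution is RealD²iff, a hierarchical coarse-to-fine diffusion framework that decomposes depth noise into global structural distortions and local detail perturbations. Two complementary strategies, Frequency-Guided Supervision (FGS) and Discrepancy-Guided Optimization (DGO), enable progressive modeling and refinement. The framework supports zero-shot sim2real robot manipulation and generates realistic clean-noisy depth pairs without manual data collection.

Link: https://arxiv.org/abs/2511.22505
Authors: Xiujian Liang,Jiacheng Liu,Mingyang Sun,Qichen He,Cewu Lu,Jianhua Sun
Affiliations: SII; FDU(复旦大学); WU(武汉大学); SJTU(上海交通大学); ZJU(浙江大学)
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: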

Abstract:Robot manipulation in the real world is fundamentally constrained by the visual sim2real gap, where depth observations collected in simulation fail to reflect the complex noise patterns inherent to real sensors. In this work, inspired by the denoising capability of diffusion models, we invert the conventional perspective and propose a clean-to-noisy paradigm that learns to synthesize noisy depth, thereby bridging the visual sim2real gap through purely simulation-driven robotic learning. Building on this idea, we introduce RealD²iff, a hierarchical coarse-to-fine diffusion framework that decomposes depth noise into global structural distortions and fine-grained local perturbations. To enable progressive learning of these components, we further develop two complementary strategies: Frequency-Guided Supervision (FGS) for global structure modeling and Discrepancy-Guided Optimization (DGO) for localized refinement. To integrate RealD²iff seamlessly into imitation learning, we construct a pipeline that spans six stages. We provide comprehensive empirical and experimental validation demonstrating the effectiveness of this paradigm. RealD²iff enables two key applications: (1) generating real-world-like depth to construct clean-noisy paired datasets without manual sensor data collection; and (2) achieving zero-shot sim2real robot manipulation, substantially improving real-world performance without additional fine-tuning.

[CV-124] SciPostGen: Bridging the Gap between Scientific Papers and Poster Layouts

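Quick read: This paper addresses the unclear correspondence between scientific papers and the layouts of the posters that present them; without large-scale annotated data it is hard to understand or generate poster layouts that match a paper's content structure. The key is SciPostGen, a large-scale dataset of paired annotations between scientific papers and their poster layouts. Analysis of the dataset shows that paper structure (such as section distribution) is associated with the number of layout elements in posters. Building on this, the authors propose Retrieval-Augmented Poster Layout Generation, a framework that retrieves layouts consistent with a given paper's structure and uses them as guidance for generation. It generates suitable layouts both with and without layout constraints.

Link: https://arxiv.org/abs/2511.22490
Authors: Shun Inadumi,Shohei Tanaka,Tosho Hirasawa,Atsushi Hashimoto,Koichiro Yoshino,Yoshitaka Ushiku
Affiliations: OMRON SINIC X Corp.(OMRON SINIC X公司); NAIST(日本信息学研究生院大学); RIKEN GRP(理化学研究所); Science Tokyo(东京科学大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments: Dataset: this https URL , Code: this https URL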

Abstract:As the number of scientific papers continues to grow, there is a demand for approaches that can effectively convey research findings, with posters serving as a key medium for presenting paper contents. Poster layouts determine how effectively research is communicated and understood, highlighting their growing importance. In particular, a gap remains in understanding how papers correspond to the layouts that present them, which calls for datasets with paired annotations at scale. To bridge this gap, we introduce SciPostGen, a large-scale dataset for understanding and generating poster layouts from scientific papers. Our analyses based on SciPostGen show that paper structures are associated with the number of layout elements in posters. Based on this insight, we explore a framework, Retrieval-Augmented Poster Layout Generation, which retrieves layouts consistent with a given paper and uses them as guidance for layout generation. We conducted experiments under two conditions: with and without layout constraints typically specified by poster creators. The results show that the retriever estimates layouts aligned with paper structures, and our framework generates layouts that also satisfy given constraints.

[CV-125] AI killed the video star. Audio-driven diffusion model for expressive talking head generation

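Quick read: This paper addresses the difficulty of jointly producing lip motion, facial expression, and head pose motion in audio-driven talking head generation. Existing methods usually focus only on lip synchronization and neglect the natural integration of expression and head movement, so results lack realism. The key is the Dimitra++ framework, whose core contribution is a conditional Motion Diffusion Transformer (cMDT) that learns facial motion sequences on a 3D representation. It is conditioned on two inputs, a reference facial image (which determines appearance) and an audio sequence (which drives motion), enabling joint modeling of multiple facial dynamics.

Link: https://arxiv.org/abs/2511.22488
Authors: Baptiste Chopin,Tashvik Dhamija,Pranav Balaji,Yaohui Wang,Antitza Dantcheva
Affiliations: Inria(法国国家信息与自动化研究院)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: arXiv admin note: text overlap with arXiv:2502.17198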

Abstract:We propose Dimitra++, a novel framework for audio-driven talking head generation, streamlined to learn lip motion, facial expression, as well as head pose motion. Specifically, we propose a conditional Motion Diffusion Transformer (cMDT) to model facial motion sequences, employing a 3D representation. The cMDT is conditioned on two inputs: a reference facial image, which determines appearance, as well as an audio sequence, which drives the motion. Quantitative and qualitative experiments, as well as a user study on two widely employed datasets, i.e., VoxCeleb2 and CelebV-HQ, suggest that Dimitra++ is able to outperform existing approaches in generating realistic talking heads imparting lip motion, facial expression, and head pose.

[CV-126] Adversarial Flow Models

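Quick read: This paper addresses the training instability of traditional generative adversarial networks (GANs) and the redundant computation and error accumulation of consistency-based methods, which must learn the intermediate states of multi-step generation. The key is adversarial flow models, which unify adversarial models and flow models: the generator learns a deterministic noise-to-data mapping, the same optimal transport as in flow matching, which substantially stabilizes training. The models natively support one-step or multi-step generation without explicitly modeling the intermediate timesteps of the probability flow, saving model capacity, reducing training iterations, and avoiding error propagation. Under the same 1NFE setting on ImageNet-256px, the B/2 model approaches consistency-based XL/2 models and the XL/2 model sets a new best FID of 2.38; end-to-end training with depth repetition and no intermediate supervision reaches FID 1.94.

Link: https://arxiv.org/abs/2511.22475
Authors: Shanchuan Lin,Ceyuan Yang,Zhijie Lin,Hao Chen,Haoqi Fan
Affiliations: ByteDance Seed (字节跳动种子项目)
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: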

Abstract:We present adversarial flow models, a class of generative models that unifies adversarial models and flow models. Our method supports native one-step or multi-step generation and is trained using the adversarial objective. Unlike traditional GANs, where the generator learns an arbitrary transport plan between the noise and the data distributions, our generator learns a deterministic noise-to-data mapping, which is the same optimal transport as in flow-matching models. This significantly stabilizes adversarial training. Also, unlike consistency-based methods, our model directly learns one-step or few-step generation without needing to learn the intermediate timesteps of the probability flow for propagation. This saves model capacity, reduces training iterations, and avoids error accumulation. Under the same 1NFE setting on ImageNet-256px, our B/2 model approaches the performance of consistency-based XL/2 models, while our XL/2 model creates a new best FID of 2.38. We additionally show the possibility of end-to-end training of 56-layer and 112-layer models through depth repetition without any intermediate supervision, and achieve FIDs of 2.08 and 1.94 using a single forward pass, surpassing their 2NFE and 4NFE counterparts.

[CV-127] Rethinking Cross-Generator Image Forgery Detection through DINOv3

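Quick read: This paper addresses the poor cross-generator generalization of AI-generated image forgery detection: existing detectors tend to memorize artifacts of specific generative models, so performance drops sharply on unseen generators. The key is the finding that frozen visual foundation models such as DINOv3 already have strong cross-generator detection capability, which stems from relying on global, low-frequency structures as weak but transferable authenticity cues. Building on this, the authors propose a training-free token-ranking strategy combined with a lightweight linear probe to select a small subset of authenticity-relevant tokens. The subset consistently improves detection accuracy across the evaluated datasets, giving a universal, efficient, and interpretable baseline for image forgery detection.

Link: https://arxiv.org/abs/2511.22471
Authors: Zhenglin Huang,Jason Li,Haiquan Wen,Tianxiao Li,Xi Yang,Lu Qi,Bei Peng,Xiaowei Huang,Ming-Hsuan Yang,Guangliang Cheng
Affiliations: University of Liverpool, UK (利物浦大学); Nanyang Technological University (南洋理工大学); HKUST (香港科技大学); UC Merced (加州大学默塞德分校); University of Sheffield (谢菲尔德大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: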

Abstract:As generative models become increasingly diverse and powerful, cross-generator detection has emerged as a new challenge. Existing detection methods often memorize artifacts of specific generative models rather than learning transferable cues, leading to substantial failures on unseen generators. Surprisingly, this work finds that frozen visual foundation models, especially DINOv3, already exhibit strong cross-generator detection capability without any fine-tuning. Through systematic studies on frequency, spatial, and token perspectives, we observe that DINOv3 tends to rely on global, low-frequency structures as weak but transferable authenticity cues instead of high-frequency, generator-specific artifacts. Motivated by this insight, we introduce a simple, training-free token-ranking strategy followed by a lightweight linear probe to select a small subset of authenticity-relevant tokens. This token subset consistently improves detection accuracy across all evaluated datasets. Our study provides empirical evidence and a feasible hypothesis for understanding why foundation models generalize across diverse generators, offering a universal, efficient, and interpretable baseline for image forgery detection.

[CV-128] Hybrid Unified and Iterative: A Novel Framework for Text-based Person Anomaly Retrieval

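Quick read: This paper addresses the difficulty models have extracting fine-grained features in text-based person anomaly retrieval, with the aim of improving retrieval accuracy. The core is a Local-Global Hybrid Perspective (LHP) module that works with a Vision-Language Model (VLM) to capture both fine-grained and coarse-grained features. In addition, a Unified Image-Text (UIT) model combines several losses (Image-Text Contrastive, ITC; Image-Text Matching, ITM; Masked Language Modeling, MLM; and Masked Image Modeling, MIM), and an iterative ensemble strategy replaces the conventional simultaneous ensemble. A new LHP-guided feature selection algorithm further improves performance, achieving state-of-the-art results on the PAB dataset with a 9.70% gain in R@1.

Link: https://arxiv.org/abs/2511.22470
Authors: Tien-Huy Nguyen,Huu-Loc Tran,Huu-Phong Phan-Nguyen,Quang-Vinh Dinh
Affiliations: University of Information Technology (信息科技大学); Vietnam National University Ho Chi Minh (越南国家大学胡志明市分校); AI VIETNAM Lab (AI越南实验室)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted on World Wide Web 2025 Workshop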

Abstract:Text-based person anomaly retrieval has emerged as a challenging task, with most existing approaches relying on complex deep-learning techniques. This raises a research question: How can the model be optimized to achieve greater fine-grained features? To address this, we propose a Local-Global Hybrid Perspective (LHP) module integrated with a Vision-Language Model (VLM), designed to explore the effectiveness of incorporating both fine-grained features alongside coarse-grained features. Additionally, we investigate a Unified Image-Text (UIT) model that combines multiple objective loss functions, including Image-Text Contrastive (ITC), Image-Text Matching (ITM), Masked Language Modeling (MLM), and Masked Image Modeling (MIM) loss. Beyond this, we propose a novel iterative ensemble strategy, by combining iteratively instead of using model results simultaneously like other ensemble methods. To take advantage of the superior performance of the LHP model, we introduce a novel feature selection algorithm based on its guidance, which helps improve the model’s performance. Extensive experiments demonstrate the effectiveness of our method in achieving state-of-the-art (SOTA) performance on PAB dataset, compared with previous work, with a 9.70% improvement in R@1, 1.77% improvement in R@5, and 1.01% improvement in R@10.
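
R@K (recall at K) in the results above counts a text query as a hit when a correct image appears among the top K retrieved items. The sketch below is a minimal version of that computation. The function name and image ids are hypothetical, and the paper's own evaluation script may differ in details.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of queries with at least one relevant item in the top k.

    ranked_ids: per query, a list of retrieved ids ordered best first.
    relevant_ids: per query, a set of ids considered correct.
    """
    hits = 0
    for ranking, relevant in zip(ranked_ids, relevant_ids):
        if any(item in relevant for item in ranking[:k]):
            hits += 1
    return hits / len(ranked_ids)


# Three hypothetical text queries and their retrieval rankings.
rankings = [
    ["img3", "img1", "img7"],  # correct image ranked first
    ["img2", "img9", "img4"],  # correct image ranked third
    ["img5", "img6", "img8"],  # correct image not retrieved
]
relevant = [{"img3"}, {"img4"}, {"img0"}]

r1 = recall_at_k(rankings, relevant, k=1)
r3 = recall_at_k(rankings, relevant, k=3)
print(r1, r3)
```

Because R@1 only credits the top-ranked item, it is the strictest of the three reported cut-offs, which is consistent with it showing the largest gain.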

[CV-129] RoadSceneBench: A Lightweight Benchmark for Mid-Level Road Scene Understanding

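Quick read: This paper addresses the insufficient understanding of mid-level road semantics in autonomous driving and digital map construction: existing benchmarks focus on low-level perception tasks such as detection or segmentation and overlook the reasoning needed to infer road topology and dynamic scene structure. The key is RoadSceneBench, a lightweight yet information-rich benchmark that emphasizes relational understanding and structural consistency, together with Hierarchical Relational Reward Propagation with Temporal Consistency (HRRP-T), a training framework for vision-language models (VLMs) whose adaptive reward signals promote spatial coherence and semantic alignment, moving models from static recognition toward geometry-aware, temporally consistent reasoning.

Link: https://arxiv.org/abs/2511.22466
Authors: Xiyan Liu,Han Wang,Yuhu Wang,Junjie Cai,Zhe Cao,Jianzhong Yang,Zhen Lu
Affiliations: Baidu Inc (百度公司)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: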

Abstract:Understanding mid-level road semantics, which capture the structural and contextual cues that link low-level perception to high-level planning, is essential for reliable autonomous driving and digital map construction. However, existing benchmarks primarily target perception tasks such as detection or segmentation, overlooking the reasoning capabilities required to infer road topology and dynamic scene structure. To address this gap, we present RoadSceneBench, a lightweight yet information-rich benchmark designed to evaluate and advance visual reasoning in complex road environments. Unlike large-scale perception datasets, RoadSceneBench emphasizes relational understanding and structural consistency, encouraging models to capture the underlying logic of real-world road scenes. Furthermore, to enhance reasoning reliability, we propose Hierarchical Relational Reward Propagation with Temporal Consistency (HRRP-T), a training framework for Vision-Language Models (VLMs) in which reward signals adaptively promote spatial coherence and semantic alignment throughout the reasoning process. This paradigm enables models to move beyond static recognition toward geometry-aware and temporally consistent reasoning. Extensive experiments demonstrate that our method achieves state-of-the-art performance across diverse road configurations. RoadSceneBench thus provides a compact yet powerful foundation for studying mid-level road semantics and fostering structure-aware autonomous perception. Our dataset is available at this https URL.

[CV-130] Gaussians on Fire: High-Frequency Reconstruction of Flames

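[Quick read]: This paper addresses reconstructing the 3D structure and spatiotemporal evolution of dynamic flames from a limited set of views (only three). The core difficulty is that fire is highly dynamic and transparent, and its high-frequency detail is hard to capture. The solution has three key parts. First, the static background is separated from the dynamic fire region by combining dense multi-view stereo with monocular depth priors. Second, 3D Gaussians serve as the basic representation: an initial 3D flow field is built by fusing per-view optical flow projections, and each Gaussian carries a lifetime and a linear velocity so that it matches the dense optical flow and captures the high-frequency features of fire. Third, a custom hardware synchronization scheme provides sub-frame temporal alignment across cameras, so reconstruction is possible on affordable commodity hardware.

Link: https://arxiv.org/abs/2511.22459
Authors: Jakob Nazarenus,Dominik Michels,Wojtek Palubicki,Simin Kou,Fang-Lue Zhang,Soren Pirk,Reinhard Koch
Affiliations: Kiel University; KAUST; Adam Mickiewicz University; Victoria University of Wellington
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 9 figures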

Abstract:We propose a method to reconstruct dynamic fire in 3D from a limited set of camera views with a Gaussian-based spatiotemporal representation. Capturing and reconstructing fire and its dynamics is highly challenging due to its volatile nature, transparent quality, and multitude of high-frequency features. Despite these challenges, we aim to reconstruct fire from only three views, which consequently requires solving for under-constrained geometry. We solve this by separating the static background from the dynamic fire region by combining dense multi-view stereo images with monocular depth priors. The fire is initialized as a 3D flow field, obtained by fusing per-view dense optical flow projections. To capture the high frequency features of fire, each 3D Gaussian encodes a lifetime and linear velocity to match the dense optical flow. To ensure sub-frame temporal alignment across cameras we employ a custom hardware synchronization pattern – allowing us to reconstruct fire with affordable commodity hardware. Our quantitative and qualitative validations across numerous reconstruction experiments demonstrate robust performance for diverse and challenging real fire scenarios.
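
The abstract states only that each 3D Gaussian encodes a lifetime and a linear velocity. As a minimal sketch of what such a primitive could look like, the Python below moves a Gaussian centre linearly while it is alive. The class name `FlameGaussian`, its fields, and the hard on/off lifetime window are all illustrative assumptions, not the authors' parameterization.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class FlameGaussian:
    """Hypothetical short-lived Gaussian primitive (names are assumptions)."""
    position: Vec3   # centre at birth time
    velocity: Vec3   # constant (linear) velocity
    birth: float     # time at which the Gaussian appears
    lifetime: float  # duration for which it stays active

    def is_alive(self, t: float) -> bool:
        # Active only inside [birth, birth + lifetime].
        return self.birth <= t <= self.birth + self.lifetime

    def position_at(self, t: float) -> Optional[Vec3]:
        # Linear motion while alive; None outside the lifetime window.
        if not self.is_alive(t):
            return None
        dt = t - self.birth
        x, y, z = (p + v * dt for p, v in zip(self.position, self.velocity))
        return (x, y, z)
```

A real system would presumably fade opacity smoothly rather than switch it on and off; the hard window here only keeps the sketch short.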

[CV-131] ITS3D: Inference-Time Scaling for Text-Guided 3D Diffusion Models

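[Quick read]: This paper addresses how to improve the generation quality of text-guided 3D diffusion models at inference time without additional training. The key is the ITS3D framework, which casts the task as an optimization problem of finding the most effective Gaussian noise input and iteratively refines noise candidates with a verifier-guided search algorithm. Its core contributions are: 1) Gaussian normalization, which stabilizes the search by correcting the distribution shift that arises when noise candidates drift away from a standard Gaussian; 2) compression based on Singular Value Decomposition (SVD), which reduces the computational cost of the high-dimensional 3D search space while preserving effective search directions; 3) a singular space reset mechanism, which dynamically updates the search space according to diversity measures to avoid suboptimal local minima. Experiments show improved text-to-3D generation quality, indicating the potential of computationally efficient search in generative processes.

Link: https://arxiv.org/abs/2511.22456
Authors: Zhenglin Zhou,Fan Ma,Xiaobo Xia,Hehe Fan,Yi Yang,Tat-Seng Chua
Affiliations: ReLER, Zhejiang University; National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 25 pages, 11 figures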

Abstract:We explore inference-time scaling in text-guided 3D diffusion models to enhance generative quality without additional training. To this end, we introduce ITS3D, a framework that formulates the task as an optimization problem to identify the most effective Gaussian noise input. The framework is driven by a verifier-guided search algorithm, where the search algorithm iteratively refines noise candidates based on verifier feedback. To address the inherent challenges of 3D generation, we introduce three techniques for improved stability, efficiency, and exploration capability. 1) Gaussian normalization is applied to stabilize the search process. It corrects distribution shifts when noise candidates deviate from a standard Gaussian distribution during iterative updates. 2) The high-dimensional nature of the 3D search space increases computational complexity. To mitigate this, a singular value decomposition-based compression technique is employed to reduce dimensionality while preserving effective search directions. 3) To further prevent convergence to suboptimal local minima, a singular space reset mechanism dynamically updates the search space based on diversity measures. Extensive experiments demonstrate that ITS3D enhances text-to-3D generation quality, which shows the potential of computationally efficient search methods in generative processes. The source code is available at this https URL.
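
The verifier-guided search with Gaussian normalization described above can be illustrated with a toy loop. This is a sketch under stated assumptions, not the ITS3D algorithm: `toy_verifier` is a made-up scoring function standing in for a real verifier, the search is plain hill climbing, and the SVD compression and singular space reset steps are omitted.

```python
import random
import statistics


def gaussian_normalize(noise):
    """Re-standardize a noise vector to zero mean and unit variance."""
    mean = statistics.fmean(noise)
    std = statistics.pstdev(noise)
    if std == 0.0:
        return list(noise)  # degenerate vector: returned unchanged
    return [(x - mean) / std for x in noise]


def toy_verifier(noise):
    """Hypothetical verifier score (higher is better); purely illustrative."""
    half = len(noise) // 2
    return sum(noise[:half]) - sum(noise[half:])


def search_noise(dim=16, rounds=50, candidates=8, step=0.3, seed=0):
    """Hill-climb over noise vectors, normalizing every proposal."""
    rng = random.Random(seed)
    best = gaussian_normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])
    best_score = toy_verifier(best)
    for _ in range(rounds):
        for _ in range(candidates):
            # Perturb the current best, then pull it back to N(0, 1) statistics.
            proposal = [x + step * rng.gauss(0.0, 1.0) for x in best]
            proposal = gaussian_normalize(proposal)
            score = toy_verifier(proposal)
            if score > best_score:
                best, best_score = proposal, score
    return best, best_score
```

The normalization step is what keeps every accepted candidate at zero mean and unit variance, which is the role the abstract assigns to Gaussian normalization.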

[CV-132] Beyond Real versus Fake Towards Intent-Aware Video Analysis

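[Quick read]: This paper addresses the fact that current deepfake video detection only verifies authenticity and ignores the intent behind a video. Authenticity checks alone do not capture the societal risks of content driven by malicious intent, such as financial fraud, political propaganda, or fear mongering. The key is IntentHQ, a benchmark for human-centered intent analysis with 5168 videos annotated with 23 fine-grained intent categories, together with supervised and self-supervised multi-modal models that combine spatio-temporal video features, audio processing, and text analysis to infer the motivations and goals behind videos.

Link: https://arxiv.org/abs/2511.22455
Authors: Saurabh Atreya,Nabyl Quignon,Baptiste Chopin,Abhijit Das,Antitza Dantcheva
Affiliations: BITS Pilani Hyderabad, India; Inria Center at Université Côte d’Azur, France; Hochschule Darmstadt, Germany
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: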

Abstract:The rapid advancement of generative models has led to increasingly realistic deepfake videos, posing significant societal and security risks. While existing detection methods focus on distinguishing real from fake videos, such approaches fail to address a fundamental question: What is the intent behind a manipulated video? Towards addressing this question, we introduce IntentHQ: a new benchmark for human-centered intent analysis, shifting the paradigm from authenticity verification to contextual understanding of videos. IntentHQ consists of 5168 videos that have been meticulously collected and annotated with 23 fine-grained intent-categories, including “Financial fraud”, “Indirect marketing”, “Political propaganda”, as well as “Fear mongering”. We perform intent recognition with supervised and self-supervised multi-modality models that integrate spatio-temporal video features, audio processing, and text analysis to infer underlying motivations and goals behind videos. Our proposed model is streamlined to differentiate between a wide range of intent-categories.

[CV-133] Benchmarking machine learning models for multi-class state recognition in double quantum dot data

[Quick read]: This paper addresses the auto-tuning problem that semiconductor quantum dot (QD) devices face when scaling to large arrays, specifically how to accurately identify multi-class QD states from charge-stability diagrams (CSDs) to support automated calibration and operation. The key is a systematic benchmark of four modern machine learning (ML) architectures on double-QD CSD state recognition. The study finds that convolutional neural networks (CNNs) offer the most favorable trade-off between accuracy and parameter count on experimental data. Min-max scaling generally yields higher scores but less stable convergence than z-score normalization, and the authors conclude that CNNs with min-max normalization are a practical approach for QD CSDs.

Link: https://arxiv.org/abs/2511.22451
Authors: Valeria Díaz Moreno,Ryan P Khalili,Daniel Schug,Patrick J. Walsh,Justyna P. Zwolak
Affiliations: University of Wisconsin-Madison; University of Maryland, College Park; Stanford University; National Institute of Standards and Technology; Joint Center for Quantum Information and Computer Science
Categories: Computer Vision and Pattern Recognition (cs.CV); Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Machine Learning (cs.LG)
Comments: 12 pages, 4 figures, 2 tables

Abstract: Semiconductor quantum dots (QDs) are a leading platform for scalable quantum processors. However, scaling to large arrays requires reliable, automated tuning strategies for devices’ bootstrapping, calibration, and operation, with many tuning aspects depending on accurately identifying QD device states from charge-stability diagrams (CSDs). In this work, we present a comprehensive benchmarking study of four modern machine learning (ML) architectures for multi-class state recognition in double-QD CSDs. We evaluate their performance across different data budgets and normalization schemes using both synthetic and experimental data. We find that the more resource-intensive models – U-Nets and visual transformers (ViTs) – achieve the highest MSE score (defined as $1-\mathrm{MSE}$) on synthetic data (over $0.98$) but fail to generalize to experimental data. MDNs are the most computationally efficient and exhibit highly stable training, but with substantially lower peak performance. CNNs offer the most favorable trade-off on experimental CSDs, achieving strong accuracy with two orders of magnitude fewer parameters than the U-Nets and ViTs. Normalization plays a nontrivial role: min-max scaling generally yields higher MSE scores but less stable convergence, whereas z-score normalization produces more predictable training dynamics but at reduced accuracy for most models. Overall, our study shows that CNNs with min-max normalization are a practical approach for QD CSDs.
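
To make the two normalization schemes and the MSE score (one minus the mean squared error) concrete, here is a small standard-library sketch. The function names are illustrative, and treating a prediction as a flat list of per-state values is an assumption; the abstract does not say how predictions are encoded.

```python
import statistics


def min_max_scale(values):
    """Map values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant input: no spread to scale
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0.0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def mse_score(predicted, target):
    """Return 1 - MSE, so that 1.0 corresponds to a perfect prediction."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    squared_errors = [(p - t) ** 2 for p, t in zip(predicted, target)]
    return 1.0 - statistics.fmean(squared_errors)
```

Min-max scaling depends only on the two extreme pixels of a diagram, while z-score normalization depends on all of them. That difference may be relevant to the stability trade-off the abstract reports, although the paper's own explanation is not given here.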

[CV-134] Do You See What I Say? Generalizable Deepfake Detection based on Visual Speech Recognition

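[Quick read]: This paper addresses the security risks created by misuse of deepfake technology, in particular the need for robust and reliable detection of manipulated media. The key is FauxNet, a network built on pre-trained Visual Speech Recognition (VSR) features: it extracts temporal VSR features from videos to separate real videos from manipulated ones. FauxNet focuses on zero-shot detection, that is, generalizing to generation techniques not seen in training, and can also attribute a video to the generation technique that produced it. The authors report that it consistently outperforms the state of the art in this setting.

Link: https://arxiv.org/abs/2511.22443
Authors: Maheswar Bora,Tashvik Dhamija,Shukesh Reddy,Baptiste Chopin,Pranav Balaji,Abhijit Das,Antitza Dantcheva
Affiliations: Birla Institute of Technology and Science, Pilani; Inria Center at Université Côte d’Azur
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: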

Abstract:Deepfake generation has witnessed remarkable progress, contributing to highly realistic generated images, videos, and audio. While technically intriguing, such progress has raised serious concerns related to the misuse of manipulated media. To mitigate such misuse, robust and reliable deepfake detection is urgently needed. Towards this, we propose a novel network FauxNet, which is based on pre-trained Visual Speech Recognition (VSR) features. By extracting temporal VSR features from videos, we identify and segregate real videos from manipulated ones. The holy grail in this context has to do with zero-shot detection, i.e., generalizable detection, which we focus on in this work. FauxNet consistently outperforms the state-of-the-art in this setting. In addition, FauxNet is able to attribute - distinguish between generation techniques from which the videos stem. Finally, we propose new datasets, referred to as Authentica-Vox and Authentica-HDTF, comprising about 38,000 real and fake videos in total, the latter created with six recent deepfake generation techniques. We provide extensive analysis and results on the Authentica datasets and FaceForensics++, demonstrating the superiority of FauxNet. The Authentica datasets will be made publicly available.

[CV-135] What Is the Optimal Ranking Score Between Precision and Recall? We Can Always Find It and It Is Rarely $F_1$

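[Quick read]: This paper addresses the ranking conflicts that arise when classification models are evaluated with multidimensional performance scores such as precision and recall, that is, how to obtain a single, meaningful global ranking. Although the F-score ($F_\beta$) is widely used to combine precision and recall, there has been no theoretical guarantee that the ranking it induces is a good tradeoff between the two. The key is to formalize the tradeoff as an optimization problem expressed with Kendall rank correlations and to derive a closed-form expression for the optimal $\beta$ for any distribution or set of performances. The authors show that $F_1$ and its skew-insensitive version are far from optimal in this regard.

Link: https://arxiv.org/abs/2511.22442
Authors: Sébastien Piérard,Adrien Deliège,Marc Van Droogenbroeck
Affiliations: Montefiore Institute, University of Liège
Categories: Performance (cs.PF); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: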

Abstract: Ranking methods or models based on their performance is of prime importance but is tricky because performance is fundamentally multidimensional. In the case of classification, precision and recall are scores with probabilistic interpretations that are both important to consider and complementary. The rankings induced by these two scores are often in partial contradiction. In practice, therefore, it is extremely useful to establish a compromise between the two views to obtain a single, global ranking. Over the last fifty years or so, it has been proposed to take a weighted harmonic mean, known as the F-score, F-measure, or $F_\beta$. Generally speaking, by averaging basic scores, we obtain a score that is intermediate in terms of values. However, there is no guarantee that these scores lead to meaningful rankings and no guarantee that the rankings are good tradeoffs between these base scores. Given the ubiquity of $F_\beta$ scores in the literature, some clarification is in order. Concretely: (1) We establish that $F_\beta$-induced rankings are meaningful and define a shortest path between precision- and recall-induced rankings. (2) We frame the problem of finding a tradeoff between two scores as an optimization problem expressed with Kendall rank correlations. We show that $F_1$ and its skew-insensitive version are far from being optimal in that regard. (3) We provide theoretical tools and a closed-form expression to find the optimal value for $\beta$ for any distribution or set of performances, and we illustrate their use on six case studies.
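
The idea of comparing the ranking induced by $F_\beta$ with the rankings induced by precision and recall can be tried on a handful of models with the sketch below. It is not the paper's closed-form result: the grid over $\beta$ and the "most balanced" criterion in `most_balanced_beta` (smallest gap between the two Kendall correlations) are assumptions made for illustration, and Kendall's tau is computed in its simple tau-a form.

```python
from itertools import combinations


def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall."""
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0.0:
        return 0.0
    return (1.0 + b2) * precision * recall / denominator


def kendall_tau(scores_a, scores_b):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    pairs = list(combinations(range(len(scores_a)), 2))
    for i, j in pairs:
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1  # tied pairs count as neither
    return (concordant - discordant) / len(pairs)


def tradeoff_table(performances, betas):
    """For each beta, correlate the F_beta ranking with precision and recall."""
    precisions = [p for p, _ in performances]
    recalls = [r for _, r in performances]
    rows = []
    for beta in betas:
        scores = [f_beta(p, r, beta) for p, r in performances]
        rows.append((beta,
                     kendall_tau(scores, precisions),
                     kendall_tau(scores, recalls)))
    return rows


def most_balanced_beta(performances, betas):
    """Hypothetical criterion: beta whose two correlations are closest."""
    rows = tradeoff_table(performances, betas)
    return min(rows, key=lambda row: abs(row[1] - row[2]))[0]
```

With `performances` given as `(precision, recall)` pairs, `tradeoff_table` makes the two limits visible: at $\beta = 0$ the score reduces to precision, and for very large $\beta$ it approaches recall, so the induced ranking moves from one extreme to the other as $\beta$ grows.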

[CV-136] GEO-Detective: Unveiling Location Privacy Risks in Images with LLM Agents

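[Quick read]: This paper addresses the privacy risk created by geographic cues in images shared on social media. Existing geolocation approaches based on Large Vision Language Models (LVLMs) are not optimized for the task: they perform poorly on difficult images and do not combine human-like reasoning with tool use. The key is Geo-Detective, an agent that mimics human reasoning: it follows a four-step procedure that adaptively selects strategies based on image difficulty and is equipped with specialized tools such as visual reverse search to gather external geographic clues. Experiments show that it outperforms baseline LVLMs in both country-level and finer-grained geolocation, and that external clues reduce the "unknown" prediction rate by more than 50.6%, which underlines both its effectiveness and the associated privacy risk.

Link: https://arxiv.org/abs/2511.22441
Authors: Xinyu Zhang,Yixin Wu,Boyang Zhang,Chenhao Lin,Chao Shen,Michael Backes,Yang Zhang
Affiliations: CISPA Helmholtz Center for Information Security; Xi’an Jiaotong University
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 15 pages with 7 figures and 12 tables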

Abstract:Images shared on social media often expose geographic cues. While early geolocation methods required expert effort and lacked generalization, the rise of Large Vision Language Models (LVLMs) now enables accurate geolocation even for ordinary users. However, existing approaches are not optimized for this task. To explore the full potential and associated privacy risks, we present Geo-Detective, an agent that mimics human reasoning and tool use for image geolocation inference. It follows a procedure with four steps that adaptively selects strategies based on image difficulty and is equipped with specialized tools such as visual reverse search, which emulates how humans gather external geographic clues. Experimental results show that GEO-Detective outperforms baseline large vision language models (LVLMs) overall, particularly on images lacking visible geographic features. In country level geolocation tasks, it achieves an improvement of over 11.1% compared to baseline LLMs, and even at finer grained levels, it still provides around a 5.2% performance gain. Meanwhile, when equipped with external clues, GEO-Detective becomes more likely to produce accurate predictions, reducing the “unknown” prediction rate by more than 50.6%. We further explore multiple defense strategies and find that Geo-Detective exhibits stronger robustness, highlighting the need for more effective privacy safeguards.

[CV-137] ABounD: Adversarial Boundary-Driven Few-Shot Learning for Multi-Class Anomaly Detection

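[Quick read]: This paper addresses a key challenge in few-shot multi-class industrial anomaly detection: under data scarcity, vision-language models struggle to be both category-adaptive and discriminative, so the boundary between normal and abnormal states blurs, subtle defects are missed, and atypical normal samples are rejected. The solution is the ABounD framework, with two key components: 1) the Dynamic Concept Fusion (DCF) module, which generates class-adaptive prompts by fusing generalizable priors with class-specific cues; 2) the Adversarial Boundary Forging (ABF) module, which uses PGD-style perturbations to generate boundary-level fence features and shape a more precise decision margin. The two are trained jointly under a Concept-Boundary Loss, yielding a decision boundary that closely follows the normal data while preserving flexibility and robust semantic alignment.

Link: https://arxiv.org/abs/2511.22436
Authors: Runzhi Deng,Yundi Hu,Xinshuang Zhang,Zhao Wang,Xixi Liu,Wang-Zhou Dai,Caifeng Shan,Fang Zhao
Affiliations: Nanjing University; China Mobile Zijin Innovation Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: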

Abstract:Few-shot multi-class industrial anomaly detection remains a challenging task. Vision-language models need to be both category-adaptive and sharply discriminative, yet data scarcity often blurs the boundary between normal and abnormal states. This ambiguity leads to missed subtle defects and the rejection of atypical normal samples. We propose ABounD, an Adversarial Boundary-Driven few-shot learning for multi-class anomaly detection, which is a unified learning framework that integrates semantic concept learning with decision boundary shaping. The Dynamic Concept Fusion (DCF) module produces class-adaptive prompts by fusing generalizable priors with class-specific cues, conditioned on image features. Meanwhile, Adversarial Boundary Forging (ABF) sculpts a more precise decision margin by generating boundary-level fence features via PGD-style perturbations. Training is conducted in a single stage under a Concept-Boundary Loss, where ABF provides the main supervisory signal and semantic-spatial regularizers stabilize the optimization. This synergy yields a decision boundary that closely follows normal data while preserving flexibility and robust semantic alignment. Experiments on MVTec-AD and VisA datasets demonstrate state-of-the-art performance in the task of few-shot multi-class anomaly detection.

[CV-138] SkeletonAgent: An Agentic Interaction Framework for Skeleton-based Action Recognition

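[Quick read]: This paper addresses the problem that, in skeleton-based action recognition, a Large Language Model (LLM) queried without feedback from the recognition model struggles to provide effective discriminative semantic cues. Because the LLM is typically called in isolation and receives no performance feedback, its output cannot reliably separate semantically similar action classes. The key is SkeletonAgent, a framework with two cooperative agents, Questioner and Selector: the Questioner passes the classes that the recognition model most often confuses to the LLM as context for more targeted guidance; the Selector extracts joint-level constraints from the LLM's response and feeds them back to the recognizer, enabling finer-grained cross-modal alignment.

Link: https://arxiv.org/abs/2511.22433
Authors: Hongda Liu,Yunfan Liu,Changlu Wang,Yunlong Wang,Zhenan Sun
Affiliations: NLPR, Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: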

Abstract:Recent advances in skeleton-based action recognition increasingly leverage semantic priors from Large Language Models (LLMs) to enrich skeletal representations. However, the LLM is typically queried in isolation from the recognition model and receives no performance feedback. As a result, it often fails to deliver the targeted discriminative cues critical to distinguish similar actions. To overcome these limitations, we propose SkeletonAgent, a novel framework that bridges the recognition model and the LLM through two cooperative agents, i.e., Questioner and Selector. Specifically, the Questioner identifies the most frequently confused classes and supplies them to the LLM as context for more targeted guidance. Conversely, the Selector parses the LLM’s response to extract precise joint-level constraints and feeds them back to the recognizer, enabling finer-grained cross-modal alignment. Comprehensive evaluations on five benchmarks, including NTU RGB+D, NTU RGB+D 120, Kinetics-Skeleton, FineGYM, and UAV-Human, demonstrate that SkeletonAgent consistently outperforms state-of-the-art benchmark methods. The code is available at this https URL.

[CV-139] Fin3R: Fine-tuning Feed-forward 3D Reconstruction Models via Monocular Knowledge Distillation NEURIPS2025

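[Quick read]: This paper addresses the weaknesses of current feed-forward 3D reconstruction models in fine geometry and robustness, which stem from the scarcity of high-fidelity depth and pose supervision and from the inherent geometric misalignment of multi-view pointmap regression. The key is Fin3R, a lightweight fine-tuning method: the decoder, which handles view matching, is frozen and only the image encoder is fine-tuned. Through a custom lightweight LoRA adapter, fine geometric detail distilled from a strong monocular teacher model on large unlabeled datasets is injected into the encoder, giving higher geometric accuracy and sharper boundaries while leaving test-time memory and latency virtually unchanged.

Link: https://arxiv.org/abs/2511.22429
Authors: Weining Ren,Hongjun Wang,Xiao Tan,Kai Han
Affiliations: The University of Hong Kong; Baidu Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: NeurIPS 2025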

Abstract: We present Fin3R, a simple, effective, and general fine-tuning method for feed-forward 3D reconstruction models. The family of feed-forward reconstruction model regresses pointmap of all input images to a reference frame coordinate system, along with other auxiliary outputs, in a single forward pass. However, we find that current models struggle with fine geometry and robustness due to (i) the scarcity of high-fidelity depth and pose supervision and (ii) the inherent geometric misalignment from multi-view pointmap regression. Fin3R jointly tackles two issues with an extra lightweight fine-tuning step. We freeze the decoder, which handles view matching, and fine-tune only the image encoder, the component dedicated to feature extraction. The encoder is enriched with fine geometric details distilled from a strong monocular teacher model on large, unlabeled datasets, using a custom, lightweight LoRA adapter. We validate our method on a wide range of models, including DUSt3R, MASt3R, CUT3R, and VGGT. The fine-tuned models consistently deliver sharper boundaries, recover complex structures, and achieve higher geometric accuracy in both single- and multi-view settings, while adding only the tiny LoRA weights, which leave test-time memory and latency virtually unchanged. Project page: this https URL
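
For readers unfamiliar with LoRA, the sketch below shows the standard low-rank update on a single frozen linear layer in plain Python. It illustrates generic LoRA only: Fin3R's adapter is described as custom, and its design, the distillation loss, and the encoder architecture are not reproduced here. The class and method names are illustrative.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight            # frozen base weight, never updated
        self.scale = alpha / rank
        # A starts small and random, B starts at zero, so the adapted layer
        # initially reproduces the frozen layer exactly.
        self.a = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)]
                  for _ in range(rank)]
        self.b = [[0.0] * rank for _ in range(out_dim)]

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.b, matvec(self.a, x))
        return [y + self.scale * u for y, u in zip(base, update)]

    def num_trainable(self):
        # Only A and B are trained: rank * (in_dim + out_dim) parameters.
        return len(self.a) * len(self.a[0]) + len(self.b) * len(self.b[0])
```

Because only `a` and `b` are trained, the number of added parameters grows with the rank rather than with the size of the frozen weight, which is consistent with the abstract's remark about "tiny LoRA weights".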

[CV-140] Wukong's 72 Transformations: High-fidelity Textured 3D Morphing via Flow Models

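[Quick read]: This paper addresses a common limitation of existing 3D morphing methods: they rely on manual correspondence matching and deformation trajectory estimation, which limits generalization and requires costly preprocessing. WUKONG is a training-free framework that uses the prior of flow-based generative models for high-fidelity textured 3D morphing. Its key components are: 1) formulating morphing as an optimal transport barycenter problem, so that the inherent continuity of flow models yields smooth shape transitions; 2) a sequential initialization strategy that prevents abrupt geometric distortions and preserves identity coherence; 3) a similarity-guided semantic consistency mechanism that selectively retains high-frequency texture details, giving precise control over blending dynamics without artifacts such as oversmoothing.

Link: https://arxiv.org/abs/2511.22425
Authors: Minghao Yin,Yukang Cao,Kai Han
Affiliations: Visual AI Lab, The University of Hong Kong; S-Lab, Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: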

Abstract:We present WUKONG, a novel training-free framework for high-fidelity textured 3D morphing that takes a pair of source and target prompts (image or text) as input. Unlike conventional methods – which rely on manual correspondence matching and deformation trajectory estimation (limiting generalization and requiring costly preprocessing) – WUKONG leverages the generative prior of flow-based transformers to produce high-fidelity 3D transitions with rich texture details. To ensure smooth shape transitions, we exploit the inherent continuity of flow-based generative processes and formulate morphing as an optimal transport barycenter problem. We further introduce a sequential initialization strategy to prevent abrupt geometric distortions and preserve identity coherence. For faithful texture preservation, we propose a similarity-guided semantic consistency mechanism that selectively retains high-frequency details and enables precise control over blending dynamics. This avoids common artifacts like oversmoothing while maintaining semantic fidelity. Extensive quantitative and qualitative evaluations demonstrate that WUKONG significantly outperforms state-of-the-art methods, achieving superior results across diverse geometry and texture variations.
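
The abstract formulates morphing as an optimal transport barycenter problem. The simplest setting where such a barycenter has a closed form is two equal-size sets of scalars: sort both sets and interpolate matching order statistics. The function below is that one-dimensional toy case only; WUKONG works with a flow-based 3D generative model, which is not modeled here, and the function name is illustrative.

```python
def barycenter_1d(source, target, t):
    """Wasserstein-2 barycenter of two equal-size 1D samples.

    t = 0 returns the sorted source, t = 1 the sorted target, and values in
    between interpolate along the optimal transport map, which in one
    dimension pairs the k-th smallest source point with the k-th smallest
    target point.
    """
    if len(source) != len(target):
        raise ValueError("samples must have the same size")
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return [(1.0 - t) * s + t * g
            for s, g in zip(sorted(source), sorted(target))]
```

Sweeping `t` from 0 to 1 gives a continuous path between the two point sets, the one-dimensional analogue of a smooth source-to-target transition.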

[CV-141] DiffStyle360: Diffusion-Based 360° Head Stylization via Style Fusion Attention

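[Quick read]: This paper addresses the fact that existing 3D head stylization methods rely on computationally expensive optimization or domain-specific fine-tuning to adapt to new styles, which limits flexibility and efficiency. The key is DiffStyle360, a diffusion-based framework that, given a single style reference image, produces multi-view consistent, identity-preserving 3D head stylizations without per-style training. Its core components are: 1) the Style Appearance Module, which disentangles style from content; 2) the Style Fusion Attention mechanism, which adaptively balances structure preservation and stylization fidelity in the latent space. In addition, the method is fine-tuned on a multi-view dataset generated by a 3D GAN and uses a temperature-based key scaling strategy to control stylization intensity at inference.

Link: https://arxiv.org/abs/2511.22411
Authors: Furkan Guzelant,Arda Goktogan,Tarık Kaya,Aysegul Dundar
Affiliations: Bilkent University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: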

Abstract: 3D head stylization has emerged as a key technique for reimagining realistic human heads in various artistic forms, enabling expressive character design and creative visual experiences in digital media. Despite the progress in 3D-aware generation, existing 3D head stylization methods often rely on computationally expensive optimization or domain-specific fine-tuning to adapt to new styles. To address these limitations, we propose DiffStyle360, a diffusion-based framework capable of producing multi-view consistent, identity-preserving 3D head stylizations across diverse artistic domains given a single style reference image, without requiring per-style training. Building upon the 3D-aware DiffPortrait360 architecture, our approach introduces two key components: the Style Appearance Module, which disentangles style from content, and the Style Fusion Attention mechanism, which adaptively balances structure preservation and stylization fidelity in the latent space. Furthermore, we employ a 3D GAN-generated multi-view dataset for robust fine-tuning and introduce a temperature-based key scaling strategy to control stylization intensity during inference. Extensive experiments on FFHQ and RenderMe360 demonstrate that DiffStyle360 achieves superior style quality, outperforming state-of-the-art GAN- and diffusion-based stylization methods across challenging style domains.

[CV-142] UAV-MM3D: A Large-Scale Synthetic Benchmark for 3D Perception of Unmanned Aerial Vehicles with Multi-Modal Data

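[Quick read]: This paper addresses accurate perception of Unmanned Aerial Vehicles (UAVs) in complex low-altitude environments, where real data collection is constrained by airspace regulations, privacy concerns, and environmental variability, and manual annotation of 3D poses and cross-modal correspondences is costly. The key is UAV-MM3D, a high-fidelity multimodal synthetic dataset with 400K synchronized frames covering diverse scenes (urban areas, suburbs, forests, coastal regions) and weather conditions (clear, cloudy, rainy, foggy) and five modalities: RGB, infrared (IR), LiDAR, radar, and Dynamic Vision Sensor (DVS). Each frame provides 2D/3D bounding boxes, 6-DoF poses, and instance-level annotations, supporting 3D detection, pose estimation, target tracking, and short-term trajectory forecasting. The authors also propose LGFusionNet, a LiDAR-guided multimodal fusion baseline, and a dedicated UAV trajectory prediction baseline to facilitate benchmarking.

Link: https://arxiv.org/abs/2511.22404
Authors: Longkun Zou,Jiale Wang,Rongqin Liang,Hai Wu,Ke Chen,Yaowei Wang
Affiliations: Pengcheng Laboratory; University of Southern California
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: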

Abstract:Accurate perception of UAVs in complex low-altitude environments is critical for airspace security and related intelligent systems. Developing reliable solutions requires large-scale, accurately annotated, and multimodal data. However, real-world UAV data collection faces inherent constraints due to airspace regulations, privacy concerns, and environmental variability, while manual annotation of 3D poses and cross-modal correspondences is time-consuming and costly. To overcome these challenges, we introduce UAV-MM3D, a high-fidelity multimodal synthetic dataset for low-altitude UAV perception and motion understanding. It comprises 400K synchronized frames across diverse scenes (urban areas, suburbs, forests, coastal regions) and weather conditions (clear, cloudy, rainy, foggy), featuring multiple UAV models (micro, small, medium-sized) and five modalities - RGB, IR, LiDAR, Radar, and DVS (Dynamic Vision Sensor). Each frame provides 2D/3D bounding boxes, 6-DoF poses, and instance-level annotations, enabling core tasks related to UAVs such as 3D detection, pose estimation, target tracking, and short-term trajectory forecasting. We further propose LGFusionNet, a LiDAR-guided multimodal fusion baseline, and a dedicated UAV trajectory prediction baseline to facilitate benchmarking. With its controllable simulation environment, comprehensive scenario coverage, and rich annotations, UAV3D offers a public benchmark for advancing 3D perception of UAVs.

[CV-143] Asking like Socrates: Socrates helps VLMs understand remote sensing images

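[Quick read]: This paper addresses the widespread "pseudo reasoning" in remote sensing (RS) vision-language tasks: models narrate a reasoning process instead of genuinely reasoning from visual evidence. The authors attribute this to the "Glance Effect", where a single coarse perception of large-scale RS imagery leads to incomplete understanding and to reasoning based on linguistic self-consistency rather than visual evidence. The key is RS-EoT (Remote Sensing Evidence-of-Thought), a language-driven, iterative visual evidence-seeking paradigm. To instill it, the authors design SocraticAgent, a self-play multi-agent system that synthesizes reasoning traces through alternating cycles of reasoning and visual inspection, and a two-stage progressive reinforcement learning (RL) strategy: RL on fine-grained grounding tasks first, to strengthen RS-EoT capabilities, then RL on RS visual question answering (VQA) to generalize to broader understanding scenarios. Experiments show state-of-the-art results on multiple RS VQA and grounding benchmarks, and analyses reveal clear iterative cycles of reasoning and evidence seeking.

Link: https://arxiv.org/abs/2511.22396
Authors: Run Shao,Ziyu Li,Zhaoyang Zhang,Linrui Xu,Xinran He,Hongyuan Yuan,Bolei He,Yongxing Dai,Yiming Yan,Yijun Chen,Wang Guo,Haifeng Li
Affiliations: Central South University; Baidu Inc.; Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 20 pages, 8 figures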

Abstract:Recent multimodal reasoning models, inspired by DeepSeek-R1, have significantly advanced vision-language systems. However, in remote sensing (RS) tasks, we observe widespread pseudo reasoning: models narrate the process of reasoning rather than genuinely reason toward the correct answer based on visual evidence. We attribute this to the Glance Effect, where a single, coarse perception of large-scale RS imagery results in incomplete understanding and reasoning based on linguistic self-consistency instead of visual evidence. To address this, we propose RS-EoT (Remote Sensing Evidence-of-Thought), a language-driven, iterative visual evidence-seeking paradigm. To instill this paradigm, we propose SocraticAgent, a self-play multi-agent system that synthesizes reasoning traces via alternating cycles of reasoning and visual inspection. To enhance and generalize these patterns, we propose a two-stage progressive RL strategy: first, RL on fine-grained Grounding tasks to enhance RS-EoT capabilities, followed by RL on RS VQA to generalize to broader understanding scenarios. Experiments show RS-EoT achieves state-of-the-art performance on multiple RS VQA and grounding benchmarks. Analyses reveal clear iterative cycles of reasoning and evidence seeking, confirming RS-EoT mitigates the Glance Effect and enables genuine evidence-grounded reasoning. Our code, data, and models are available at this https URL

[CV-144] AnchorFlow: Training-Free 3D Editing via Latent Anchor-Aligned Flows

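[Quick read]: This paper addresses weak or geometrically unstable edits in training-free 3D editing, which are largely caused by inconsistent latent anchors introduced by timestep-dependent noise during diffusion sampling. The key is AnchorFlow, which establishes a global latent anchor shared between the source and target trajectories and enforces coherence with a relaxed anchor-alignment loss and an anchor-aligned update rule. Stabilizing the latent reference space in this way enables semantically faithful and structurally robust 3D shape edits.

Link: https://arxiv.org/abs/2511.22357
Authors: Zhenglin Zhou,Fan Ma,Chengzhuo Gui,Xiaobo Xia,Hehe Fan,Yi Yang,Tat-Seng Chua
Affiliations: Zhejiang University; National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 10 figures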

Abstract:Training-free 3D editing aims to modify 3D shapes based on human instructions without model finetuning. It plays a crucial role in 3D content creation. However, existing approaches often struggle to produce strong or geometrically stable edits, largely due to inconsistent latent anchors introduced by timestep-dependent noise during diffusion sampling. To address these limitations, we introduce AnchorFlow, which is built upon the principle of latent anchor consistency. Specifically, AnchorFlow establishes a global latent anchor shared between the source and target trajectories, and enforces coherence using a relaxed anchor-alignment loss together with an anchor-aligned update rule. This design ensures that transformations remain stable and semantically faithful throughout the editing process. By stabilizing the latent reference space, AnchorFlow enables more pronounced semantic modifications. Moreover, AnchorFlow is mask-free. Without mask supervision, it effectively preserves geometric fidelity. Experiments on the Eval3DEdit benchmark show that AnchorFlow consistently delivers semantically aligned and structurally robust edits across diverse editing types. Code is at this https URL.

[CV-145] INSIGHT: An Interpretable Neural Vision-Language Framework for Reasoning of Generative Artifacts

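[Quick read]: This paper addresses the poor reliability and limited interpretability of detectors for AI-generated images (for example, from GANs or diffusion models) under real-world conditions. Existing forensic systems degrade sharply under severe downsampling, compression, and cross-domain distribution shifts, and most are opaque classifiers that give no account of why an image is flagged, which limits their use in high-stakes settings. The key is INSIGHT, a unified multimodal framework for detection and explanation, with four main components: (1) hierarchical super-resolution that amplifies subtle forensic cues without inducing misleading artifacts; (2) Grad-CAM driven multi-scale localization that reveals spatial regions indicative of generative patterns; (3) CLIP-guided semantic alignment that maps visual anomalies to human-interpretable descriptors; (4) a structured ReAct + Chain-of-Thought prompting protocol that drives a vision-language model to produce consistent, fine-grained explanations, verified by a dual-stage G-Eval + LLM-as-a-judge pipeline to ensure factuality. The authors report substantially improved detection robustness and explanation quality at extremely low resolutions (16x16 to 64x64).

Link: https://arxiv.org/abs/2511.22351
Authors: Anshul Bagaria
Affiliations: Indian Institute of Technology, Madras
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 36 pages, 17 figures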

Abstract:The growing realism of AI-generated images produced by recent GAN and diffusion models has intensified concerns over the reliability of visual media. Yet, despite notable progress in deepfake detection, current forensic systems degrade sharply under real-world conditions such as severe downsampling, compression, and cross-domain distribution shifts. Moreover, most detectors operate as opaque classifiers, offering little insight into why an image is flagged as synthetic, undermining trust and hindering adoption in high-stakes settings. We introduce INSIGHT (Interpretable Neural Semantic and Image-based Generative-forensic Hallucination Tracing), a unified multimodal framework for robust detection and transparent explanation of AI-generated images, even at extremely low resolutions (16x16 - 64x64). INSIGHT combines hierarchical super-resolution for amplifying subtle forensic cues without inducing misleading artifacts, Grad-CAM driven multi-scale localization to reveal spatial regions indicative of generative patterns, and CLIP-guided semantic alignment to map visual anomalies to human-interpretable descriptors. A vision-language model is then prompted using a structured ReAct + Chain-of-Thought protocol to produce consistent, fine-grained explanations, verified through a dual-stage G-Eval + LLM-as-a-judge pipeline to minimize hallucinations and ensure factuality. Across diverse domains, including animals, vehicles, and abstract synthetic scenes, INSIGHT substantially improves both detection robustness and explanation quality under extreme degradation, outperforming prior detectors and black-box VLM baselines. Our results highlight a practical path toward transparent, reliable AI-generated image forensics and establish INSIGHT as a step forward in trustworthy multimodal content verification. 
Cite as: arXiv:2511.22351 [cs.CV] (or arXiv:2511.22351v1 [cs.CV] for this version), https://doi.org/10.48550/arXiv.2511.22351

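[CV-146] Flowing Backwards: Improving Normalizing Flows via Reverse Representation Alignment AAAI2026

Quick read: This paper addresses the limited generative quality of standard Normalizing Flows (NFs), which stems from the weak semantic representations learned through log-likelihood optimization. The authors propose an alignment strategy that exploits the invertibility of the NF architecture: rather than regularizing the forward pass, intermediate features of the generative (reverse) pass are aligned with representations from a strong vision foundation model, which improves semantic consistency and generation quality. They also introduce a training-free, test-time optimization algorithm for classification that gives a more intrinsic measure of the NF's representational ability. Experiments report new state-of-the-art results for NFs on ImageNet 64×64 and 256×256, with training accelerated by more than 3.3×.

Link: https://arxiv.org/abs/2511.22345
Authors: Yang Chen,Xiaowei Xu,Shuai Wang,Chenhui Zhu,Ruxue Wen,Xubin Li,Tiezheng Ge,Limin Wang
Affiliations: 1. Tsinghua University; 2. Alibaba Group; 3. Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2026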

Abstract:Normalizing Flows (NFs) are a class of generative models distinguished by a mathematically invertible architecture, where the forward pass transforms data into a latent space for density estimation, and the reverse pass generates new samples from this space. This characteristic creates an intrinsic synergy between representation learning and data generation. However, the generative quality of standard NFs is limited by poor semantic representations from log-likelihood optimization. To remedy this, we propose a novel alignment strategy that creatively leverages the invertibility of NFs: instead of regularizing the forward pass, we align the intermediate features of the generative (reverse) pass with representations from a powerful vision foundation model, demonstrating superior effectiveness over naive alignment. We also introduce a novel training-free, test-time optimization algorithm for classification, which provides a more intrinsic evaluation of the NF’s embedded semantic knowledge. Comprehensive experiments demonstrate that our approach accelerates the training of NFs by over 3.3×, while simultaneously delivering significant improvements in both generative quality and classification accuracy. New state-of-the-art results for NFs are established on ImageNet 64×64 and 256×256. Our code is available at this https URL.

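[CV-147] Unexplored flaws in multiple-choice VQA evaluations

Quick read: This paper examines under-recognized prompt-format biases in how Multimodal Large Language Models (MLLMs) are evaluated on image-text understanding. Multiple-choice Visual Question Answering (VQA) benchmarks are already known to be sensitive to answer-choice order; the paper shows that other subtle, semantically neutral changes to the prompt format also significantly affect model performance, and that these biases are independent of the known order bias and of model confidence. The key contribution is a systematic identification and quantification of three main prompt-format variation factors, backed by a large-scale empirical study covering 7 MLLMs, 5 VQA datasets, and 48 prompt formats. The study finds that existing bias mitigation strategies do not address the newly identified biases, and it calls for stricter control of prompt design variables in future evaluations.

Link: https://arxiv.org/abs/2511.22341
Authors: Fabio Rosenthal,Sebastian Schmidt,Thorsten Graf,Thorsten Bagodonat,Stephan Günnemann,Leo Schwinn
Affiliations: Technical University of Munich; Volkswagen AG; Munich Data Science Institute
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)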

Abstract:Multimodal Large Language Models (MLLMs) demonstrate strong capabilities in handling image-text inputs. A common way to assess this ability is through multiple-choice Visual Question Answering (VQA). Earlier works have already revealed that these benchmarks are sensitive to answer choice order, a limitation that can be mitigated through careful design. Yet, we highlight additional, unexplored biases in prompt formatting that question the reliability of current MLLM evaluations. Specifically, we identify three key variation factors in prompt formatting and analyze their impact through a large-scale study involving seven MLLMs and five VQA datasets, spanning 48 distinct prompt format variations. Our findings reveal that multiple-choice VQA is highly sensitive to minor prompt format changes, even when these changes are semantically neutral. We further demonstrate that these biases persist independently of known order biases or the MLLM’s confidence in the correct answer. Finally, we demonstrate that existing bias mitigation strategies fail to address these newly identified biases.

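[CV-148] Prompt-based Consistent Video Colorization

Quick read: This paper targets two weaknesses of existing video colorization methods: temporal flickering and heavy reliance on manual input. The approach uses rich semantic guidance from language and segmentation together with a language-conditioned diffusion model to colorize video automatically with high fidelity. Guidance comes from automatically generated object masks and text prompts; optical flow (RAFT) warps color information from the previous frame, and a correction step then detects and fixes inconsistencies introduced by the warping, giving temporally stable and visually realistic results without any specific color input.

Link: https://arxiv.org/abs/2511.22330
Authors: Silvia Dani,Tiberio Uricchio,Lorenzo Seidenari
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)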

Abstract:Existing video colorization methods struggle with temporal flickering or demand extensive manual input. We propose a novel approach automating high-fidelity video colorization using rich semantic guidance derived from language and segmentation. We employ a language-conditioned diffusion model to colorize grayscale frames. Guidance is provided via automatically generated object masks and textual prompts; our primary automatic method uses a generic prompt, achieving state-of-the-art results without specific color input. Temporal stability is achieved by warping color information from previous frames using optical flow (RAFT); a correction step detects and fixes inconsistencies introduced by warping. Evaluations on standard benchmarks (DAVIS30, VIDEVO20) show our method achieves state-of-the-art performance in colorization accuracy (PSNR) and visual realism (Colorfulness, CDC), demonstrating the efficacy of automated prompt-based guidance for consistent video colorization.
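
The flow-based warping step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `warp_colors`, the nearest-neighbour sampling, and the convention that `flow[y][x]` points from a pixel in the current frame to its source in the previous frame are all assumptions made for illustration. Pixels whose source falls outside the frame are left as `None`, standing in for the regions a correction step would have to fix.

```python
def warp_colors(prev_colors, flow):
    """Backward-warp colors from the previous frame into the current frame.

    prev_colors[y][x] is the color of a pixel in the previous frame.
    flow[y][x] = (dx, dy) is an assumed backward flow: the pixel at (x, y)
    in the current frame came from (x + dx, y + dy) in the previous frame.
    Pixels with no valid source are set to None.
    """
    height, width = len(prev_colors), len(prev_colors[0])
    warped = [[None] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx, dy = flow[y][x]
            # Nearest-neighbour sampling keeps the sketch short; real
            # pipelines typically interpolate bilinearly.
            src_x, src_y = round(x + dx), round(y + dy)
            if 0 <= src_x < width and 0 <= src_y < height:
                warped[y][x] = prev_colors[src_y][src_x]
    return warped
```

For a one-row frame `[[1, 2, 3]]` whose content moved one pixel to the right (backward flow `(-1, 0)` everywhere), the sketch returns `[[None, 1, 2]]`: the leftmost pixel has no source and would need to be colorized afresh.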

[CV-149] Small Object Detection for Birds with Swin Transformer WWW

Quick read: This paper addresses the performance bottleneck of small object detection when training samples are sparse, in particular when the target category is specific and instances are few, so that conventional methods struggle to learn effective features. The key is to improve feature extraction in the detector's neck with a hierarchical design based on Swin Transformer: the shifted window size is adjusted to suit small objects, and Swin Transformer is used to upsample image features, which strengthens perception of small, sparse targets such as birds. Experiments indicate that smaller window sizes (default 2) improve mean average precision (mAP) for small object detection.

Link: https://arxiv.org/abs/2511.22310
Authors: Da Huo,Marc A. Kastner,Tingwei Liu,Yasutomo Kawanishi,Takatsugu Hirayama,Takahiro Komamizu,Ichiro Ide
Affiliations: Nagoya University; Kyoto University; RIKEN; University of Human Environments
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper is included in the proceedings of the 18th International Conference on Machine Vision Applications (MVA2023) (this https URL). The paper received the Runner-Up Solution Award (2nd) and the Best Booster Award in the Small Object Detection Challenge for Spotting Birds 2023 at MVA.

Abstract:Object detection is the task of detecting objects in an image. In this task, the detection of small objects is particularly difficult. Other than the small size, it is also accompanied by difficulties due to blur, occlusion, and so on. Current small object detection methods are tailored to small and dense situations, such as pedestrians in a crowd or far objects in remote sensing scenarios. However, when the target object is small and sparse, there is a lack of objects available for training, making it more difficult to learn effective features. In this paper, we propose a specialized method for detecting a specific category of small objects; birds. Particularly, we improve the features learned by the neck; the sub-network between the backbone and the prediction head, to learn more effective features with a hierarchical design. We employ Swin Transformer to upsample the image features. Moreover, we change the shifted window size for adapting to small objects. Experiments show that the proposed Swin Transformer-based neck combined with CenterNet can lead to good performance by changing the window sizes. We further find that smaller window sizes (default 2) benefit mAPs for small object detection.

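[CV-150] Structure is Supervision: Multiview Masked Autoencoders for Radiology

Quick read: This paper addresses the difficulty of building robust medical machine learning systems without large amounts of labeled data; the core challenge is how to exploit the structure inherent in clinical data (such as multi-view images and text reports) to improve generalization. The proposed solution is the Multiview Masked Autoencoder (MVMAE), a self-supervised framework that combines masked image reconstruction with cross-view alignment, turning the redundancy across projection views in a radiology study into a strong self-supervisory signal and learning view-invariant, disease-relevant representations. The extension MVMAE-V2T adds radiology reports as an auxiliary text signal to strengthen semantic grounding while keeping inference purely vision-based, with the largest gains in low-label regimes.

Link: https://arxiv.org/abs/2511.22294
Authors: Sonia Laguna,Andrea Agostini,Alain Ryser,Samuel Ruiperez-Campillo,Irene Cannistraci,Moritz Vandenhirtz,Stephan Mandt,Nicolas Deperrois,Farhad Nooralahzadeh,Michael Krauthammer,Thomas M. Sutter,Julia E. Vogt
Affiliations: ETH Zurich; UC Irvine; University of Zurich
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)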

Abstract:Building robust medical machine learning systems requires pretraining strategies that exploit the intrinsic structure present in clinical data. We introduce Multiview Masked Autoencoder (MVMAE), a self-supervised framework that leverages the natural multi-view organization of radiology studies to learn view-invariant and disease-relevant representations. MVMAE combines masked image reconstruction with cross-view alignment, transforming clinical redundancy across projections into a powerful self-supervisory signal. We further extend this approach with MVMAE-V2T, which incorporates radiology reports as an auxiliary text-based learning signal to enhance semantic grounding while preserving fully vision-based inference. Evaluated on a downstream disease classification task on three large-scale public datasets, MIMIC-CXR, CheXpert, and PadChest, MVMAE consistently outperforms supervised and vision-language baselines. Furthermore, MVMAE-V2T provides additional gains, particularly in low-label regimes where structured textual supervision is most beneficial. Together, these results establish the importance of structural and textual supervision as complementary paths toward scalable, clinically grounded medical foundation models.

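[CV-151] Match-and-Fuse: Consistent Generation from Unstructured Image Sets ATC

Quick read: This paper addresses consistency in controlled generation for unstructured image sets: how to generate a new image set while keeping the visual elements shared across the set consistent from image to image. Existing methods usually operate on single images or densely sampled videos and struggle to maintain global consistency across differing viewpoints, capture times, and surrounding content. The proposed Match-and-Fuse framework models the image set as a graph in which nodes are images and edges represent joint generation of image pairs; internal features are fused across image pairs, guided by dense input correspondences, which enforces local consistency and global coherence without masks or manual annotation. The method also exploits an emergent prior in text-to-image models, namely that generation is more coherent when multiple views share a single canvas, yielding zero-shot, training-free consistent controlled generation.

Link: https://arxiv.org/abs/2511.22287
Authors: Kate Feingold,Omri Kaduri,Tali Dekel
Affiliations: Weizmann Institute of Science
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL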

Abstract:We present Match-and-Fuse - a zero-shot, training-free method for consistent controlled generation of unstructured image sets - collections that share a common visual element, yet differ in viewpoint, time of capture, and surrounding content. Unlike existing methods that operate on individual images or densely sampled videos, our framework performs set-to-set generation: given a source set and user prompts, it produces a new set that preserves cross-image consistency of shared content. Our key idea is to model the task as a graph, where each node corresponds to an image and each edge triggers a joint generation of image pairs. This formulation consolidates all pairwise generations into a unified framework, enforcing their local consistency while ensuring global coherence across the entire set. This is achieved by fusing internal features across image pairs, guided by dense input correspondences, without requiring masks or manual supervision. It also allows us to leverage an emergent prior in text-to-image models that encourages coherent generation when multiple views share a single canvas. Match-and-Fuse achieves state-of-the-art consistency and visual quality, and unlocks new capabilities for content creation from image collections.

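[CV-152] The Collapse of Patches

Quick read: This paper studies how to use local region information efficiently in image modeling, focusing on identifying, in the masked image modeling (MIM) setting, the patches that the reconstruction of a target region depends on most. The key idea is the phenomenon of "patch collapse": observing certain patches lowers the entropy of the feature distribution of the remaining patches, by analogy with wave function collapse in quantum mechanics. An autoencoder with a soft selection mechanism is trained to learn the subset of patches each target patch relies on, and PageRank over these learned dependencies scores each patch's importance, yielding an order in which to realize the image. Respecting this order is reported to benefit autoregressive image generation and Vision Transformer classification; for example, seeing only 22% of the high-rank patches is enough for high classification accuracy.

Link: https://arxiv.org/abs/2511.22281
Authors: Wei Guo,Shunqi Mao,Zhuonan Liang,Heng Wang,Weidong Cai
Affiliations: The University of Sydney
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 10 figures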

Abstract:Observing certain patches in an image reduces the uncertainty of others. Their realization lowers the distribution entropy of each remaining patch feature, analogous to collapsing a particle’s wave function in quantum mechanics. This phenomenon can intuitively be called patch collapse. To identify which patches are most relied on during a target region’s collapse, we learn an autoencoder that softly selects a subset of patches to reconstruct each target patch. Graphing these learned dependencies for each patch’s PageRank score reveals the optimal patch order to realize an image. We show that respecting this order benefits various masked image modeling methods. First, autoregressive image generation can be boosted by retraining the state-of-the-art model MAR. Next, we introduce a new setup for image classification by exposing Vision Transformers only to high-rank patches in the collapse order. Seeing 22% of such patches is sufficient to achieve high accuracy. With these experiments, we propose patch collapse as a novel image modeling perspective that promotes vision efficiency. Our project is available at this https URL .
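
To make the "PageRank over learned dependencies" idea concrete, here is a minimal sketch using standard power-iteration PageRank. It is one plausible reading of the abstract rather than the paper's code: the dependency matrix `weights`, the edge direction (a target patch "votes" for the patches it relies on), the damping factor, and the function names are assumptions for illustration.

```python
def pagerank(weights, damping=0.85, iterations=100):
    """Power-iteration PageRank on a dense dependency matrix.

    weights[i][j] >= 0 is an assumed measure of how strongly target patch i
    relies on patch j when it is reconstructed, so rank flows from i to j.
    """
    n = len(weights)
    rank = [1.0 / n] * n
    for _ in range(iterations):
        new_rank = [(1.0 - damping) / n] * n
        for i, row in enumerate(weights):
            total = sum(row)
            for j in range(n):
                # A patch that relies on nothing spreads its rank uniformly.
                share = row[j] / total if total > 0 else 1.0 / n
                new_rank[j] += damping * rank[i] * share
        rank = new_rank
    return rank


def collapse_order(weights):
    """Patch indices sorted from most to least relied-upon."""
    scores = pagerank(weights)
    return sorted(range(len(scores)), key=lambda j: -scores[j])
```

With three patches where patches 1 and 2 both rely only on patch 0, `collapse_order([[0, 0, 0], [1, 0, 0], [1, 0, 0]])` places patch 0 first, matching the intuition that the most relied-upon patch should be realized earliest.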

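[CV-153] DriveVGGT: Visual Geometry Transformer for Autonomous Driving

Quick read: This paper addresses the sub-optimal performance of the general reconstruction model VGGT when applied directly to autonomous driving (AD). The cause is that AD comes with priors that differ from general scenes: low overlap between camera fields of view, known camera intrinsics and extrinsics that provide absolute scale constraints, and fixed relative positions of all cameras. To integrate these priors, the authors propose DriveVGGT, a scale-aware 4D reconstruction framework for autonomous driving data. Its key components are: (1) a Temporal Video Attention (TVA) module that processes the multi-camera video sequences independently, exploiting spatiotemporal continuity within each single-camera sequence; (2) a Multi-camera Consistency Attention (MCA) module that applies window attention with normalized relative pose embeddings, establishing cross-camera consistency while restricting each token to attend only to nearby frames; (3) an extension of the standard VGGT heads with an absolute scale head and an ego vehicle pose head. Experiments show that DriveVGGT outperforms VGGT, StreamVGGT, and fastVGGT on an autonomous driving dataset, and ablation studies support each design choice.

Link: https://arxiv.org/abs/2511.22264
Authors: Xiaosong Jia,Yanhao Liu,Junqi You,Renqiu Xia,Yu Hong,Junchi Yan
Affiliations: Shanghai Jiao Tong University; Shanghai Innovation Institute; Institute of Trustworthy Embodied AI, Fudan University
Categories: Computer Vision and Pattern Recognition (cs.CV)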

Abstract:Feed-forward reconstruction has recently gained significant attention, with VGGT being a notable example. However, directly applying VGGT to autonomous driving (AD) systems leads to sub-optimal results due to the different priors between the two tasks. In AD systems, several important new priors need to be considered: (i) The overlap between camera views is minimal, as autonomous driving sensor setups are designed to achieve coverage at a low cost. (ii) The camera intrinsics and extrinsics are known, which introduces more constraints on the output and also enables the estimation of absolute scale. (iii) Relative positions of all cameras remain fixed though the ego vehicle is in motion. To fully integrate these priors into a feed-forward framework, we propose DriveVGGT, a scale-aware 4D reconstruction framework specifically designed for autonomous driving data. Specifically, we propose a Temporal Video Attention (TVA) module to process multi-camera videos independently, which better leverages the spatiotemporal continuity within each single-camera sequence. Then, we propose a Multi-camera Consistency Attention (MCA) module to conduct window attention with normalized relative pose embeddings, aiming to establish consistency relationships across different cameras while restricting each token to attend only to nearby frames. Finally, we extend the standard VGGT heads by adding an absolute scale head and an ego vehicle pose head. Experiments show that DriveVGGT outperforms VGGT, StreamVGGT, fastVGGT on autonomous driving dataset while extensive ablation studies verify effectiveness of the proposed designs.

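[CV-154] Can Protective Watermarking Safeguard the Copyright of 3D Gaussian Splatting? AAAI2026

Quick read: This paper examines how fragile watermark protection is for 3D Gaussian Splatting (3DGS) scenes, asking whether existing watermarking schemes truly deliver robust copyright protection and ownership verification. It finds that conventional watermark removal techniques designed for 2D images do not carry over to 3DGS, because of the specialized rendering pipeline and the distinct attributes of each Gaussian primitive. The authors propose GSPure, the first purification framework designed for 3DGS watermarking representations. By analyzing view-dependent rendering contributions and using geometrically accurate feature clustering, it isolates and removes watermark-related Gaussian primitives while preserving as much of the original scene as possible. Experiments report that GSPure reduces watermark PSNR by up to 16.34 dB with less than 1 dB PSNR loss in scene fidelity, outperforming existing methods.

Link: https://arxiv.org/abs/2511.22262
Authors: Wenkai Huang,Yijia Guo,Gaolei Li,Lei Ma,Hang Zhang,Liwen Hu,Jiazheng Wang,Jianhua Li,Tiejun Huang
Affiliations: 1. Tsinghua University; 2. Institute of Computing Technology, Chinese Academy of Sciences; 3. Alibaba Group; 4. Tongji University; 5. University of Science and Technology of China; 6. Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2026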

Abstract:3D Gaussian Splatting (3DGS) has emerged as a powerful representation for 3D scenes, widely adopted due to its exceptional efficiency and high-fidelity visual quality. Given the significant value of 3DGS assets, recent works have introduced specialized watermarking schemes to ensure copyright protection and ownership verification. However, can existing 3D Gaussian watermarking approaches genuinely guarantee robust protection of the 3D assets? In this paper, for the first time, we systematically explore and validate possible vulnerabilities of 3DGS watermarking frameworks. We demonstrate that conventional watermark removal techniques designed for 2D images do not effectively generalize to the 3DGS scenario due to the specialized rendering pipeline and unique attributes of each gaussian primitives. Motivated by this insight, we propose GSPure, the first watermark purification framework specifically for 3DGS watermarking representations. By analyzing view-dependent rendering contributions and exploiting geometrically accurate feature clustering, GSPure precisely isolates and effectively removes watermark-related Gaussian primitives while preserving scene integrity. Extensive experiments demonstrate that our GSPure achieves the best watermark purification performance, reducing watermark PSNR by up to 16.34dB while minimizing degradation to original scene fidelity with less than 1dB PSNR loss. Moreover, it consistently outperforms existing methods in both effectiveness and generalization.
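
For readers who want to interpret the PSNR figures quoted above, the following sketch implements the standard PSNR definition, 10·log10(MAX² / MSE), on flat lists of pixel values. How the paper computes "watermark PSNR" specifically is not stated in the abstract, so this only illustrates the general metric; the helper `mse_ratio_for_db_drop` is a hypothetical name introduced here.

```python
import math


def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if not reference or len(reference) != len(test):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_db_drop(db_drop):
    """Factor by which the MSE grows when PSNR falls by db_drop decibels."""
    return 10.0 ** (db_drop / 10.0)
```

Under this definition, a 16.34 dB drop corresponds to the mean squared error growing by a factor of roughly 43, while a loss below 1 dB corresponds to an MSE increase of less than about 26%.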

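[CV-155] UMind-VL: A Generalist Ultrasound Vision-Language Model for Unified Grounded Perception and Comprehensive Interpretation

Quick read: This paper addresses the lack of a unified foundation model in ultrasound that connects low-level Ultrasound Grounded Perception (e.g., segmentation, localization) with high-level Ultrasound Comprehensive Interpretation (e.g., diagnosis, reasoning). The proposed UMind-VL is a unified foundation model that combines pixel-level structural understanding with complex clinical reasoning. Its key innovation is a lightweight Dynamic Convolutional Mask Decoder that predicts masks via dynamic kernels generated from Large Language Model (LLM) outputs; combined with task-specific tokens, it unifies segmentation, detection, geometric measurement, and diagnostic reasoning within a single framework.

Link: https://arxiv.org/abs/2511.22256
Authors: Dengbo Chen,Ziwei Zhao,Kexin Zhang,Shishuang Zhao,Junjie Hou,Yaqian Wang,Nianxi Liao,Anlan Sun,Fei Gao,Jia Ding,Yuhang Liu,Dong Wang
Affiliations: Yizhun Medical AI Team
Categories: Computer Vision and Pattern Recognition (cs.CV)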

Abstract:Despite significant strides in medical foundation models, the ultrasound domain lacks a comprehensive solution capable of bridging low-level Ultrasound Grounded Perception (e.g., segmentation, localization) and high-level Ultrasound Comprehensive Interpretation (e.g., diagnosis, reasoning). To bridge this gap, we propose UMind-VL, a unified foundation model designed to synergize pixel-level structural understanding with complex clinical reasoning. We first introduce UMind-DS, a large-scale multimodal dataset comprising 1.2 million ultrasound image-text pairs across 16 anatomical regions, enriching standard data with pixel-level annotations and clinician-validated rationales. Architecturally, UMind-VL incorporates a lightweight Dynamic Convolutional Mask Decoder that generates masks via dynamic kernels conditioned on LLM outputs. This design, combined with task-specific tokens, unifies segmentation, detection, geometric measurement, and diagnosis tasks within a single framework. Extensive evaluations demonstrate that UMind-VL significantly outperforms existing generalist multimodal models and achieves performance on par with, or superior to, state-of-the-art specialist models across segmentation, detection, keypoint localization, and diagnostic reasoning benchmarks, while maintaining strong generalization ability. We demonstrate the capability of UMind-VL in Figure 1.

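[CV-156] UNION: A Lightweight Target Representation for Efficient Zero-Shot Image-Guided Retrieval with Optional Textual Queries ICDM

Quick read: This paper addresses semantic alignment under low-data supervision in Image-Guided Retrieval with Optional Text (IGROT), a setting that unifies Composed Image Retrieval (CIR) and Sketch-Based Image Retrieval (SBIR). The core challenge is to let a model handle multimodal queries with or without text and match target images accurately using only a small number of training samples (e.g., 5,000). The proposed UNION is a lightweight, generalizable target representation that fuses the image embedding with a null-text prompt to improve semantic alignment, achieving cross-modal alignment without modifying the architecture of pretrained vision-language models and surpassing many heavily supervised baselines on several benchmarks.

Link: https://arxiv.org/abs/2511.22253
Authors: Hoang-Bao Le,Allie Tran,Binh T. Nguyen,Liting Zhou,Cathal Gurrin
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICDM - MMSR Workshop 2025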

Abstract:Image-Guided Retrieval with Optional Text (IGROT) is a general retrieval setting where a query consists of an anchor image, with or without accompanying text, aiming to retrieve semantically relevant target images. This formulation unifies two major tasks: Composed Image Retrieval (CIR) and Sketch-Based Image Retrieval (SBIR). In this work, we address IGROT under low-data supervision by introducing UNION, a lightweight and generalisable target representation that fuses the image embedding with a null-text prompt. Unlike traditional approaches that rely on fixed target features, UNION enhances semantic alignment with multimodal queries while requiring no architectural modifications to pretrained vision-language models. With only 5,000 training samples - from LlavaSCo for CIR and Training-Sketchy for SBIR - our method achieves competitive results across benchmarks, including CIRCO mAP@50 of 38.5 and Sketchy mAP@200 of 82.7, surpassing many heavily supervised baselines. This demonstrates the robustness and efficiency of UNION in bridging vision and language across diverse query types.

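[CV-157] Toward Diffusible High-Dimensional Latent Spaces: A Frequency Perspective

Quick read: This paper addresses the reconstruction-generation trade-off that emerges in latent diffusion models as latent dimensionality grows: higher-capacity autoencoders improve reconstruction fidelity, yet generation quality declines. The study traces this to a mismatch between encoder and decoder behavior on high-frequency components: decoders depend strongly on high-frequency latent features to recover details, whereas encoders under-represent high-frequency content, so high-frequency bands are under-exposed and underfit during diffusion training. The proposed fix is FreqWarm, a plug-and-play frequency warm-up curriculum that increases early-stage exposure to high-frequency latent signals during diffusion or flow-matching training, without modifying or retraining the autoencoder. It improves generation quality (lower gFID) across several high-dimensional autoencoders while remaining architecture-agnostic, indicating that explicitly managing frequency exposure can make high-dimensional latent spaces more diffusible targets.

Link: https://arxiv.org/abs/2511.22249
Authors: Bolin Lai,Xudong Wang,Saketh Rambhatla,James M. Rehg,Zsolt Kira,Rohit Girdhar,Ishan Misra
Affiliations: Meta AI; Georgia Institute of Technology; University of Illinois Urbana-Champaign
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 7 figures, 4 tables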

Abstract:Latent diffusion has become the default paradigm for visual generation, yet we observe a persistent reconstruction-generation trade-off as latent dimensionality increases: higher-capacity autoencoders improve reconstruction fidelity but generation quality eventually declines. We trace this gap to the different behaviors in high-frequency encoding and decoding. Through controlled perturbations in both RGB and latent domains, we analyze encoder/decoder behaviors and find that decoders depend strongly on high-frequency latent components to recover details, whereas encoders under-represent high-frequency contents, yielding insufficient exposure and underfitting in high-frequency bands for diffusion model training. To address this issue, we introduce FreqWarm, a plug-and-play frequency warm-up curriculum that increases early-stage exposure to high-frequency latent signals during diffusion or flow-matching training – without modifying or retraining the autoencoder. Applied across several high-dimensional autoencoders, FreqWarm consistently improves generation quality: decreasing gFID by 14.11 on Wan2.2-VAE, 6.13 on LTX-VAE, and 4.42 on DC-AE-f32, while remaining architecture-agnostic and compatible with diverse backbones. Our study shows that explicitly managing frequency exposure can successfully turn high-dimensional latent spaces into more diffusible targets.

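[CV-158] FIGROTD: A Friendly-to-Handle Dataset for Image Guided Retrieval with Optional Text

Quick read: This paper addresses the difficulty of unifying visual retrieval (without text) and composed retrieval (with text) in image-guided retrieval; existing methods struggle to balance performance across subtasks, and an efficient, accessible benchmark has been missing. The authors introduce FIGROTD, a lightweight, high-quality IGROT dataset with 16,474 training triplets and 1,262 test triplets covering three settings: CIR (Composed Image Retrieval), SBIR (Sketch-Based Image Retrieval), and CSTBIR (sketch-to-image retrieval with accompanying text). They also propose the Variance Guided Feature Mask (VaGFeM), which enhances discriminative dimensions based on variance statistics, and jointly optimize InfoNCE and triplet losses. Trained on this small dataset, the method achieves competitive results on nine benchmarks and outperforms stronger baselines despite using fewer triplets.

Link: https://arxiv.org/abs/2511.22247
Authors: Hoang-Bao Le,Allie Tran,Binh T. Nguyen,Liting Zhou,Cathal Gurrin
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at MMM 2026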

Abstract:Image-Guided Retrieval with Optional Text (IGROT) unifies visual retrieval (without text) and composed retrieval (with text). Despite its relevance in applications like Google Image and Bing, progress has been limited by the lack of an accessible benchmark and methods that balance performance across subtasks. Large-scale datasets such as MagicLens are comprehensive but computationally prohibitive, while existing models often favor either visual or compositional queries. We introduce FIGROTD, a lightweight yet high-quality IGROT dataset with 16,474 training triplets and 1,262 test triplets across CIR, SBIR, and CSTBIR. To reduce redundancy, we propose the Variance Guided Feature Mask (VaGFeM), which selectively enhances discriminative dimensions based on variance statistics. We further adopt a dual-loss design (InfoNCE + Triplet) to improve compositional reasoning. Trained on FIGROTD, VaGFeM achieves competitive results on nine benchmarks, reaching 34.8 mAP@10 on CIRCO and 75.7 mAP@200 on Sketchy, outperforming stronger baselines despite fewer triplets.
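
The dual-loss design (InfoNCE + Triplet) named in the abstract can be sketched with the textbook forms of both losses. The abstract does not say how the two terms are weighted or how negatives are chosen, so the equal weighting, the similarity-based triplet form, the default temperature and margin, and all function names below are assumptions for illustration only.

```python
import math


def info_nce(similarities, positive_index, temperature=0.07):
    """InfoNCE for one query: cross-entropy over query-candidate similarities."""
    logits = [s / temperature for s in similarities]
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum_exp - logits[positive_index]


def triplet(sim_positive, sim_negative, margin=0.2):
    """Hinge loss asking the positive to beat the negative by a margin."""
    return max(0.0, margin - sim_positive + sim_negative)


def dual_loss(similarities, positive_index, negative_index,
              triplet_weight=1.0, temperature=0.07, margin=0.2):
    """Assumed combination: InfoNCE plus a weighted triplet term."""
    contrastive = info_nce(similarities, positive_index, temperature)
    ranking = triplet(similarities[positive_index],
                      similarities[negative_index], margin)
    return contrastive + triplet_weight * ranking
```

The InfoNCE term contrasts the positive target against every candidate in the batch, while the triplet term only penalizes a chosen negative that comes within the margin of the positive; when the margin is already satisfied, the combined loss reduces to the InfoNCE term alone.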

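[CV-159] Semantic Anchoring for Robust Personalization in Text-to-Image Diffusion Models

Quick read: This paper addresses personalization in text-to-image diffusion models: adapting a pretrained model to a user-specific subject from only a few reference images while preserving the prior that keeps text and image semantically aligned. The core tension is that focusing too heavily on subject fidelity overfits the few reference images and ignores the pretrained distribution, while emphasizing prior preservation prevents the model from learning new personalized attributes. The proposed solution is a personalization process based on semantic anchoring: the new (rare) concept is anchored to the distribution of its frequent counterpart, which guides adaptation in a stable, controlled way and extends the pretrained distribution toward personalized regions while preserving its semantic structure. The method is reported to improve both subject fidelity and text-image alignment over baselines.

Link: https://arxiv.org/abs/2511.22245
Authors: Seoyun Yang,Gihoon Kim,Taesup Kim
Affiliations: Seoul National University
Categories: Computer Vision and Pattern Recognition (cs.CV)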

Abstract:Text-to-image diffusion models have achieved remarkable progress in generating diverse and realistic images from textual descriptions. However, they still struggle with personalization, which requires adapting a pretrained model to depict user-specific subjects from only a few reference images. The key challenge lies in learning a new visual concept from a limited number of reference images while preserving the pretrained semantic prior that maintains text-image alignment. When the model focuses on subject fidelity, it tends to overfit the limited reference images and fails to leverage the pretrained distribution. Conversely, emphasizing prior preservation maintains semantic consistency but prevents the model from learning new personalized attributes. Building on these observations, we propose the personalization process through a semantic anchoring that guides adaptation by grounding new concepts in their corresponding distributions. We therefore reformulate personalization as the process of learning a rare concept guided by its frequent counterpart through semantic anchoring. This anchoring encourages the model to adapt new concepts in a stable and controlled manner, expanding the pretrained distribution toward personalized regions while preserving its semantic structure. As a result, the proposed method achieves stable adaptation and consistent improvements in both subject fidelity and text-image alignment compared to baseline methods. Extensive experiments and ablation studies further demonstrate the robustness and effectiveness of the proposed anchoring strategy.

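[CV-160] TTSnap: Test-Time Scaling of Diffusion Models via Noise-Aware Pruning

Quick read: This paper addresses insufficient candidate exploration in test-time scaling of text-to-image diffusion models under a limited compute budget. Existing methods search over multiple noise seeds to maximize an image reward function, but each candidate must be fully denoised before its reward can be evaluated, so few noise seeds can be explored under a fixed budget. The proposed noise-aware pruning framework, TTSnap, trains noise-aware reward models via self-distillation and curriculum training so that rewards predicted at intermediate denoising stages agree with the reward ranking of the final clean images. Low-quality candidates can then be pruned without full denoising, making test-time scaling more efficient and effective.

Link: https://arxiv.org/abs/2511.22242
Authors: Qingtao Yu,Changlin Song,Minghao Sun,Zhengyang Yu,Vinay Kumar Verma,Soumya Roy,Sumit Negi,Hongdong Li,Dylan Campbell
Affiliations: Australian National University; Amazon Research
Categories: Computer Vision and Pattern Recognition (cs.CV)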

Abstract:A prominent approach to test-time scaling for text-to-image diffusion models formulates the problem as a search over multiple noise seeds, selecting the one that maximizes a certain image-reward function. The effectiveness of this strategy heavily depends on the number and diversity of noise seeds explored. However, verifying each candidate is computationally expensive, because each must be fully denoised before a reward can be computed. This severely limits the number of samples that can be explored under a fixed budget. We propose test-time scaling with noise-aware pruning (TTSnap), a framework that prunes low-quality candidates without fully denoising them. The key challenge is that reward models are learned in the clean image domain, and the ranking of rewards predicted for intermediate estimates are often inconsistent with those predicted for clean images. To overcome this, we train noise-aware reward models via self-distillation to align the reward for intermediate estimates with that of the final clean images. To stabilize learning across different noise levels, we adopt a curriculum training strategy that progressively shifts the data domain from clean images to noise images. In addition, we introduce a new metric that measures reward alignment and computational budget utilization. Experiments demonstrate that our approach improves performance by over 16% compared with existing methods, enabling more efficient and effective test-time scaling. It also provides orthogonal gains when combined with post-training techniques and local test-time optimization. Code: this https URL.
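
The budget argument behind pruning before full denoising can be illustrated with a small search loop. This is a generic sketch, not TTSnap itself: `step_fn` stands in for one denoising step, `reward_fn` for a noise-aware reward model that can score intermediate states, and the checkpoint schedule and `keep_fraction` are hypothetical parameters chosen for illustration.

```python
def prune_search(candidates, step_fn, reward_fn, total_steps,
                 checkpoints, keep_fraction=0.5):
    """Advance all candidates step by step, pruning at chosen checkpoints.

    Returns the best surviving candidate and the number of denoising
    steps actually spent across all candidates.
    """
    alive = list(candidates)
    steps_used = 0
    for t in range(total_steps):
        alive = [step_fn(c, t) for c in alive]
        steps_used += len(alive)
        if t in checkpoints and len(alive) > 1:
            # Rank partially denoised candidates with the intermediate reward
            # and keep only the top fraction (at least one survivor).
            alive.sort(key=lambda c: reward_fn(c, t), reverse=True)
            alive = alive[:max(1, int(len(alive) * keep_fraction))]
    best = max(alive, key=lambda c: reward_fn(c, total_steps - 1))
    return best, steps_used
```

With 8 candidates, 10 steps, and pruning by half after steps 2 and 5, the loop spends 44 denoising steps instead of the 80 needed to fully denoise every candidate. The saving is only useful if the intermediate reward ranks candidates consistently with the final reward, which is the alignment problem the noise-aware reward models are trained to address.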

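[CV-161] Creating Blank Canvas Against AI-enabled Image Forgery AAAI2026

Quick read: This paper addresses image forgery detection in the face of Generative AI, in particular the risks posed by highly realistic image editing. The key idea is to exploit the perceptual ability of the Segment Anything Model (SAM): adversarial perturbations are added so that SAM cannot "see" the image content and treats the image as a blank canvas; once the image is tampered with, the tampered regions become salient to SAM, which allows forged regions to be localized. Because naive adversarial attacks cannot fully blind SAM, the authors further propose a frequency-aware optimization strategy, which also strengthens tamper localization.

Link: https://arxiv.org/abs/2511.22237
Authors: Qi Song,Ziyuan Luo,Renjie Wan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2026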

Abstract:AIGC-based image editing technology has greatly simplified the realistic-level image modification, causing serious potential risks of image forgery. This paper introduces a new approach to tampering detection using the Segment Anything Model (SAM). Instead of training SAM to identify tampered areas, we propose a novel strategy. The entire image is transformed into a blank canvas from the perspective of neural models. Any modifications to this blank canvas would be noticeable to the models. To achieve this idea, we introduce adversarial perturbations to prevent SAM from “seeing anything”, allowing it to identify forged regions when the image is tampered with. Due to SAM’s powerful perceiving capabilities, naive adversarial attacks cannot completely tame SAM. To thoroughly deceive SAM and make it blind to the image, we introduce a frequency-aware optimization strategy, which further enhances the capability of tamper localization. Extensive experimental results demonstrate the effectiveness of our method.
zh

[CV-162] Bridging 3D Deep Learning and Curation for Analysis and High-Quality Segmentation in Practice

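[Quick Read]: This paper targets the frequent errors in 3D microscopy image segmentation: even state-of-the-art foundation models still produce many biologically meaningful errors, so extensive manual correction is required. To make curation more efficient and accurate, the authors present VessQC, an open-source tool that integrates uncertainty maps to direct user attention to the regions most likely to contain biologically meaningful errors, enabling efficient human-in-the-loop correction. In a preliminary user study, uncertainty-guided correction raised error-detection recall from 67% to 94.0% (p=0.007) without a significant increase in total curation time, bridging a key gap between uncertainty estimation and practical human-computer interaction.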

Link: https://arxiv.org/abs/2511.22236
Authors: Simon Püttmann, Jonathan Jair Sànchez Contreras, Lennart Kowitz, Peter Lampen, Saumya Gupta, Davide Panzeri, Nina Hagemann, Qiaojie Xiong, Dirk M. Hermann, Cao Chen, Jianxu Chen
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Accurate 3D microscopy image segmentation is critical for quantitative bioimage analysis but even state-of-the-art foundation models yield error-prone results. Therefore, manual curation is still widely used for either preparing high-quality training data or fixing errors before analysis. We present VessQC, an open-source tool for uncertainty-guided curation of large 3D microscopy segmentations. By integrating uncertainty maps, VessQC directs user attention to regions most likely containing biologically meaningful errors. In a preliminary user study uncertainty-guided correction significantly improved error detection recall from 67% to 94.0% (p=0.007) without a significant increase in total curation time. VessQC thus enables efficient, human-in-the-loop refinement of volumetric segmentations and bridges a key gap in real-world applications between uncertainty estimation and practical human-computer interaction. The software is freely available at this http URL.

[CV-163] IE-SRGS: An Internal-External Knowledge Fusion Framework for High-Fidelity 3D Gaussian Splatting Super-Resolution AAAI2026

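[Quick Read]: This paper addresses the reconstruction of high-resolution (HR) 3D Gaussian Splatting (3DGS) models from low-resolution (LR) inputs, where the core difficulty is that LR data lacks fine-grained texture and geometry. Existing methods rely on pre-trained 2D super-resolution (2DSR) models to enhance textures, but suffer from 3D Gaussian ambiguity caused by cross-view inconsistencies and the domain gaps inherent in 2DSR models. The proposed IE-SRGS framework jointly exploits the complementary strengths of external 2DSR priors and internal 3DGS features: 2DSR and depth estimation models generate HR images and depth maps as external knowledge, while multi-scale 3DGS models produce cross-view consistent, domain-adaptive counterparts as internal knowledge. A mask-guided fusion strategy integrates the two sources and guides the 3D Gaussian optimization toward high-fidelity reconstruction.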

Link: https://arxiv.org/abs/2511.22233
Authors: Xiang Feng, Tieshi Zhong, Shuo Chang, Weiliu Wang, Chengkai Wang, Yifei Chen, Yuhe Wang, Zhenzhong Kuang, Xuefei Yin, Yanming Zhu
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: AAAI 2026

Abstract:Reconstructing high-resolution (HR) 3D Gaussian Splatting (3DGS) models from low-resolution (LR) inputs remains challenging due to the lack of fine-grained textures and geometry. Existing methods typically rely on pre-trained 2D super-resolution (2DSR) models to enhance textures, but suffer from 3D Gaussian ambiguity arising from cross-view inconsistencies and domain gaps inherent in 2DSR models. We propose IE-SRGS, a novel 3DGS SR paradigm that addresses this issue by jointly leveraging the complementary strengths of external 2DSR priors and internal 3DGS features. Specifically, we use 2DSR and depth estimation models to generate HR images and depth maps as external knowledge, and employ multi-scale 3DGS models to produce cross-view consistent, domain-adaptive counterparts as internal knowledge. A mask-guided fusion strategy is introduced to integrate these two sources and synergistically exploit their complementary strengths, effectively guiding the 3D Gaussian optimization toward high-fidelity reconstruction. Extensive experiments on both synthetic and real-world benchmarks show that IE-SRGS consistently outperforms state-of-the-art methods in both quantitative accuracy and visual fidelity.

[CV-164] 3D-Consistent Multi-View Editing by Diffusion Guidance

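[Quick Read]: This paper addresses the geometric and photometric inconsistencies that arise when diffusion-based image editing methods are applied independently to multiple views of the same scene, which is especially problematic when editing 3D representations such as NeRFs or Gaussian Splat models. The solution is a training-free diffusion framework that introduces a consistency loss to guide diffusion sampling, so that corresponding points in the unedited images undergo similar transformations after editing and the edits stay coherent across views. The method is flexible, can be combined with a wide range of image editing techniques, supports both dense and sparse multi-view setups, and improves 3D consistency while keeping strong fidelity to the text prompt.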

Link: https://arxiv.org/abs/2511.22228
Authors: Josef Bengtson, David Nilsson, Dong In Lee, Fredrik Kahl
Affiliations: Chalmers University of Technology; Korea University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Recent advancements in diffusion models have greatly improved text-based image editing, yet methods that edit images independently often produce geometrically and photometrically inconsistent results across different views of the same scene. Such inconsistencies are particularly problematic for editing of 3D representations such as NeRFs or Gaussian Splat models. We propose a training-free diffusion framework that enforces multi-view consistency during the image editing process. The key assumption is that corresponding points in the unedited images should undergo similar transformations after editing. To achieve this, we introduce a consistency loss that guides the diffusion sampling toward coherent edits. The framework is flexible and can be combined with widely varying image editing methods, supporting both dense and sparse multi-view editing setups. Experimental results show that our approach significantly improves 3D consistency compared to existing multi-view editing methods. We also show that this increased consistency enables high-quality Gaussian Splat editing with sharp details and strong fidelity to user-specified text prompts. Please refer to our project page for video results: this https URL

[CV-165] Controllable 3D Object Generation with Single Image Prompt

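[Quick Read]: This paper addresses the long training time and weak controllability that textual inversion brings to 3D object generation with text-to-image diffusion models. The solution has two parts: (1) an off-the-shelf image adapter that generates 3D objects without textual inversion and offers enhanced control over conditions such as depth, pose, and text; and (2) a depth-conditioned warmup strategy that improves 3D consistency. Experiments show performance that is qualitatively and quantitatively comparable to existing textual-inversion-based methods with improved 3D consistency, and a user study further indicates better matching to the input image and better 3D consistency.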

Link: https://arxiv.org/abs/2511.22194
Authors: Jaeseok Lee, Jaekoo Lee
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recently, the impressive generative capabilities of diffusion models have been demonstrated, producing images with remarkable fidelity. In particular, existing methods for 3D object generation, one of the fastest-growing areas in computer vision, predominantly use text-to-image diffusion models with textual inversion, which trains a pseudo text prompt to describe the given image. In practice, various text-to-image generative models employ textual inversion to learn concepts or styles of the target object in the pseudo text prompt embedding space, thereby generating sophisticated outputs. However, textual inversion requires additional training time and lacks control ability. To tackle these issues, we propose two methods: (1) using an off-the-shelf image adapter that generates 3D objects without textual inversion, offering enhanced control over conditions such as depth, pose, and text, and (2) a depth-conditioned warmup strategy to enhance 3D consistency. In experiments, our method shows qualitatively and quantitatively comparable performance to, and improved 3D consistency over, the existing textual-inversion-based alternatives. Furthermore, we conduct a user study to assess (i) how well results match the input image and (ii) whether 3D consistency is maintained. User study results show that our model outperforms the alternatives, validating the effectiveness of our approaches. Our code is available at GitHub repository: this https URL

[CV-166] ARPGNet: Appearance- and Relation-aware Parallel Graph Attention Fusion Network for Facial Expression Recognition

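[Quick Read]: This paper addresses a limitation of facial expression recognition methods that rely on pre-trained Convolutional Neural Networks (CNNs) to learn facial appearance representations while overlooking the relationships between facial regions. The solution is the Appearance- and Relation-aware Parallel Graph attention fusion Network (ARPGNet). It builds a facial region relation graph and uses a graph attention mechanism to model the relationships between facial regions; the resulting relational representation sequences are fed, together with CNN-based appearance representation sequences, into a parallel graph attention fusion module, where the two streams enhance and complement each other to learn joint spatial-temporal representations more effectively.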

Link: https://arxiv.org/abs/2511.22188
Authors: Yan Li, Yong Zhao, Xiaohan Xia, Dongmei Jiang
Affiliations: Northwestern Polytechnical University; Pengcheng Laboratory; Zhejiang Lab
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by IEEE Transactions on Affective Computing. Submitted in August 2023; Accepted in October 2025

Abstract:The key to facial expression recognition is to learn discriminative spatial-temporal representations that embed facial expression dynamics. Previous studies predominantly rely on pre-trained Convolutional Neural Networks (CNNs) to learn facial appearance representations, overlooking the relationships between facial regions. To address this issue, this paper presents an Appearance- and Relation-aware Parallel Graph attention fusion Network (ARPGNet) to learn mutually enhanced spatial-temporal representations of appearance and relation information. Specifically, we construct a facial region relation graph and leverage the graph attention mechanism to model the relationships between facial regions. The resulting relational representation sequences, along with CNN-based appearance representation sequences, are then fed into a parallel graph attention fusion module for mutual interaction and enhancement. This module simultaneously explores the complementarity between different representation sequences and the temporal dynamics within each sequence. Experimental results on three facial expression recognition datasets demonstrate that the proposed ARPGNet outperforms or is comparable to state-of-the-art methods.

[CV-167] HybridWorldSim: A Scalable and Controllable High-fidelity Simulator for Autonomous Driving

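[Quick Read]: This paper addresses the difficulty existing autonomous driving simulation methods have in supporting realistic and controllable novel view synthesis under large viewpoint changes and in ensuring geometric consistency. The solution is HybridWorldSim, a hybrid simulation framework that combines multi-traversal neural reconstruction for static backgrounds with generative modeling for dynamic agents, producing diverse, high-fidelity driving scenarios while maintaining visual and spatial consistency.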

Link: https://arxiv.org/abs/2511.22187
Authors: Qiang Li, Yingwenqi Jiang, Tuoxi Li, Duyu Chen, Xiang Feng, Yucheng Ao, Shangyue Liu, Xingchen Yu, Youcheng Cai, Yumeng Liu, Yuexin Ma, Xin Hu, Li Liu, Yu Zhang, Linkun Xu, Bingtao Gao, Xueyuan Wang, Shuchang Zhou, Xianming Liu, Ligang Liu
Affiliations: XPeng Motors; ShanghaiTech University; University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Realistic and controllable simulation is critical for advancing end-to-end autonomous driving, yet existing approaches often struggle to support novel view synthesis under large viewpoint changes or to ensure geometric consistency. We introduce HybridWorldSim, a hybrid simulation framework that integrates multi-traversal neural reconstruction for static backgrounds with generative modeling for dynamic agents. This unified design addresses key limitations of previous methods, enabling the creation of diverse and high-fidelity driving scenarios with reliable visual and spatial consistency. To facilitate robust benchmarking, we further release a new multi-traversal dataset MIRROR that captures a wide range of routes and environmental conditions across different cities. Extensive experiments demonstrate that HybridWorldSim surpasses previous state-of-the-art methods, providing a practical and scalable solution for high-fidelity simulation and a valuable resource for research and development in autonomous driving.

[CV-168] Shoe Style-Invariant and Ground-Aware Learning for Dense Foot Contact Estimation

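[Quick Read]: This paper addresses accurate estimation of dense foot contact from a single RGB image. Existing methods typically rely on a zero-velocity constraint or consider only joint-level contact, and so fail to capture the detailed interaction between the foot and the ground. The two main challenges are the diversity of shoe appearance, which hurts generalization, and the monotonous appearance of the ground, which makes informative features hard to extract. The proposed FEet COntact estimation (FECO) framework uses shoe style adversarial training to learn contact features that are invariant to shoe style, and a ground feature extractor based on spatial context to better capture ground properties, yielding robust dense foot contact estimation.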

Link: https://arxiv.org/abs/2511.22184
Authors: Daniel Sungho Jung, Kyoung Mu Lee
Affiliations: IPAI; Dept. of ECE & ASRI, Seoul National University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Foot contact plays a critical role in human interaction with the world, and thus exploring foot contact can advance our understanding of human movement and physical interaction. Despite its importance, existing methods often approximate foot contact using a zero-velocity constraint and focus on joint-level contact, failing to capture the detailed interaction between the foot and the world. Dense estimation of foot contact is crucial for accurately modeling this interaction, yet predicting dense foot contact from a single RGB image remains largely underexplored. There are two main challenges for learning dense foot contact estimation. First, shoes exhibit highly diverse appearances, making it difficult for models to generalize across different styles. Second, ground often has a monotonous appearance, making it difficult to extract informative features. To tackle these issues, we present a FEet COntact estimation (FECO) framework that learns dense foot contact with shoe style-invariant and ground-aware learning. To overcome the challenge of shoe appearance diversity, our approach incorporates shoe style adversarial training that enforces shoe style-invariant features for contact estimation. To effectively utilize ground information, we introduce a ground feature extractor that captures ground properties based on spatial context. As a result, our proposed method achieves robust foot contact estimation regardless of shoe appearance and effectively leverages ground information. Code will be released.

[CV-169] MTR-VP: Towards End-to-End Trajectory Planning through Context-Driven Image Encoding and Multiple Trajectory Prediction

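[Quick Read]: This paper addresses trajectory planning for autonomous driving, specifically how to fuse visual perception with motion prediction to produce high-quality plans. As a vision-based alternative to map-based features, the authors propose MTR-VP (Motion Transformer for Vision-based Planning): a ViT encoder learns scene context embeddings from raw images and past kinematic states, and cross attention combines intent information with these embeddings, enabling end-to-end trajectory planning without explicit map input. The experiments suggest that transformer-based combination of visual and kinematic features does not fuse the two effectively, but that predicting a distribution over multiple future trajectories instead of a single trajectory boosts planning performance.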

Link: https://arxiv.org/abs/2511.22181
Authors: Maitrayee Keskar, Mohan Trivedi, Ross Greer
Affiliations: University of California, Merced; University of California, San Diego
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 8 pages, 3 figures, 4 tables

Abstract:We present a method for trajectory planning for autonomous driving, learning image-based context embeddings that align with motion prediction frameworks and planning-based intention input. Within our method, a ViT encoder takes raw images and past kinematic state as input and is trained to produce context embeddings, drawing inspiration from those generated by the recent MTR (Motion Transformer) encoder, effectively substituting map-based features with learned visual representations. MTR provides a strong foundation for multimodal trajectory prediction by localizing agent intent and refining motion iteratively via motion query pairs; we name our approach MTR-VP (Motion Transformer for Vision-based Planning), and instead of the learnable intention queries used in the MTR decoder, we use cross attention on the intent and the context embeddings, which reflect a combination of information encoded from the driving scene and past vehicle states. We evaluate our methods on the Waymo End-to-End Driving Dataset, which requires predicting the agent’s future 5-second trajectory in bird’s-eye-view coordinates using prior camera images, agent pose history, and routing goals. We analyze our architecture using ablation studies, removing input images and multiple trajectory output. Our results suggest that transformer-based methods that are used to combine the visual features along with the kinetic features such as the past trajectory features are not effective at combining both modes to produce useful scene context embeddings, even when intention embeddings are augmented with foundation-model representations of scene context from CLIP and DINOv2, but that predicting a distribution over multiple futures instead of a single future trajectory boosts planning performance.

[CV-170] Enhanced Graph Convolutional Network with Chebyshev Spectral Graph and Graph Attention for Autism Spectrum Disorder Classification

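[Quick Read]: This paper addresses the difficulty of early, objective diagnosis of Autism Spectrum Disorder (ASD), whose clinical presentation and neurological underpinnings are highly heterogeneous. The solution is a Graph Convolutional Network (GCN) model that combines Chebyshev spectral graph convolution with Graph Attention Networks (GAT). It uses multimodal neuroimaging data (resting-state functional MRI, rs-fMRI, and structural MRI, sMRI) together with phenotypic variables in a multi-branch feature extraction architecture, and encodes relationships between individuals through a population graph built from site-based similarity. On the ABIDE I dataset the method reaches 74.82% test accuracy and an AUC of 0.82, surpassing multiple state-of-the-art baselines.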

Link: https://arxiv.org/abs/2511.22178
Authors: Adnan Ferdous Ashrafi, Hasanul Kabir
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 6 pages, 2 figures, 2 tables, Accepted and presented at Image and Vision Computing New Zealand (IVCNZ) 2025

Abstract:ASD is a complicated neurodevelopmental disorder marked by variation in symptom presentation and neurological underpinnings, making early and objective diagnosis extremely problematic. This paper presents a Graph Convolutional Network (GCN) model, incorporating Chebyshev Spectral Graph Convolution and Graph Attention Networks (GAT), to increase the classification accuracy of ASD utilizing multimodal neuroimaging and phenotypic data. Leveraging the ABIDE I dataset, which contains resting-state functional MRI (rs-fMRI), structural MRI (sMRI), and phenotypic variables from 870 patients, the model leverages a multi-branch architecture that processes each modality individually before merging them via concatenation. Graph structure is encoded using site-based similarity to generate a population graph, which helps in understanding relationship connections across individuals. Chebyshev polynomial filters provide localized spectral learning with lower computational complexity, whereas GAT layers increase node representations by attention-weighted aggregation of surrounding information. The proposed model is trained using stratified five-fold cross-validation with a total input dimension of 5,206 features per individual. Extensive trials demonstrate the enhanced model’s superiority, achieving a test accuracy of 74.82% and an AUC of 0.82 on the entire dataset, surpassing multiple state-of-the-art baselines, including conventional GCNs, autoencoder-based deep neural networks, and multimodal CNNs.
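
The Chebyshev polynomial filters mentioned in this abstract can be illustrated with the standard recurrence T_0 = I, T_1 = L, T_k = 2 L T_{k-1} - T_{k-2}. The sketch below is a generic, dependency-free illustration, not the paper's layer. It assumes the caller passes an already rescaled Laplacian (conventionally 2L/lambda_max - I, with eigenvalues in [-1, 1]); the function names and the coefficient vector `theta` are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Dense matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def chebyshev_filter(l_scaled: Matrix, x: Vector, theta: Vector) -> Vector:
    """Compute sum_k theta[k] * T_k(l_scaled) x via the Chebyshev recurrence."""
    if not theta:
        raise ValueError("theta must contain at least one coefficient")
    t_prev = list(x)  # T_0 x = x
    out = [theta[0] * v for v in t_prev]
    if len(theta) == 1:
        return out
    t_curr = matvec(l_scaled, x)  # T_1 x = L x
    out = [o + theta[1] * v for o, v in zip(out, t_curr)]
    for k in range(2, len(theta)):
        # T_k x = 2 L (T_{k-1} x) - T_{k-2} x
        lt = matvec(l_scaled, t_curr)
        t_next = [2.0 * a - b for a, b in zip(lt, t_prev)]
        out = [o + theta[k] * v for o, v in zip(out, t_next)]
        t_prev, t_curr = t_curr, t_next
    return out
```

A filter with K coefficients only needs K - 1 matrix-vector products and mixes information from nodes at most K - 1 hops apart, which is the sense in which such filters are localized and cheap compared with an explicit eigendecomposition.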

[CV-171] Designing Instance-Level Sampling Schedules via REINFORCE with James-Stein Shrinkage

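[Quick Read]: This paper addresses a limitation of existing post-training methods for text-to-image models, which depend on changing model weights (fine-tuning or distillation) and so cannot improve generation quality and text alignment while the model stays frozen. The core solution is instance-level rescheduling of the sampling timeline: a single-pass Dirichlet policy learns prompt- and noise-conditioned sampling schedules instead of a fixed global schedule. The key innovation is a new reward baseline based on a James-Stein estimator, which gives more accurate gradient estimates in high-dimensional policy learning, with lower estimation error and better performance. Without modifying model weights, the method improves text-image alignment, including text rendering and compositional control, across the Stable Diffusion and Flux model families, and lets a 5-step Flux-Dev sampler reach generation quality comparable to deliberately distilled samplers such as Flux-Schnell, positioning it as a model-agnostic post-training approach.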

Link: https://arxiv.org/abs/2511.22177
Authors: Peiyu Yu, Suraj Kothawade, Sirui Xie, Ying Nian Wu, Hongliang Fei
Affiliations: unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages

Abstract:Most post-training methods for text-to-image samplers focus on model weights: either fine-tuning the backbone for alignment or distilling it for few-step efficiency. We take a different route: rescheduling the sampling timeline of a frozen sampler. Instead of a fixed, global schedule, we learn instance-level (prompt- and noise-conditioned) schedules through a single-pass Dirichlet policy. To ensure accurate gradient estimates in high-dimensional policy learning, we introduce a novel reward baseline based on a principled James-Stein estimator; it provably achieves lower estimation errors than commonly used variants and leads to superior performance. Our rescheduled samplers consistently improve text-image alignment including text rendering and compositional control across modern Stable Diffusion and Flux model families. Additionally, a 5-step Flux-Dev sampler with our schedules can attain generation quality comparable to deliberately distilled samplers like Flux-Schnell. We thus position our scheduling framework as an emerging model-agnostic post-training lever that unlocks additional generative potential in pretrained samplers.
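
To make the idea of a shrinkage-based reward baseline concrete, here is a minimal sketch using the classical positive-part James-Stein estimator that shrinks per-group mean rewards toward the grand mean. The abstract does not give the paper's estimator, so this is an illustration of the general technique only; the grouping (for example, one group per prompt), the assumption of a known common variance `mean_variance` for each group mean, and both function names are hypothetical.

```python
from typing import List


def james_stein_baselines(group_means: List[float], mean_variance: float) -> List[float]:
    """Shrink each group's mean reward toward the grand mean (positive-part James-Stein)."""
    k = len(group_means)
    if k < 4:
        # Shrinking toward an estimated grand mean needs at least 4 groups.
        raise ValueError("shrinkage toward the grand mean needs at least 4 groups")
    grand = sum(group_means) / k
    spread = sum((m - grand) ** 2 for m in group_means)
    if spread == 0.0:
        return [grand] * k
    # Shrinkage factor in [0, 1]: 0 means full pooling, 1 means no pooling.
    shrink = max(0.0, 1.0 - (k - 3) * mean_variance / spread)
    return [grand + shrink * (m - grand) for m in group_means]


def reinforce_weights(rewards: List[float], baseline: float) -> List[float]:
    """Baseline-subtracted rewards that would scale the log-probability gradients."""
    return [r - baseline for r in rewards]
```

When the group means are noisy relative to how much they differ, the factor approaches 0 and every group uses the pooled mean as its baseline; when the groups differ clearly, each group keeps a baseline close to its own mean.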

[CV-172] Guiding the Inner Eye: A Framework for Hierarchical and Flexible Visual Grounded Reasoning

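[Quick Read]: This paper addresses the limits of current multimodal AI models in visual grounded reasoning: existing methods are constrained either by the instability of end-to-end Reinforcement Learning (RL) or by the rigidity of Supervised Fine-Tuning (SFT), so models struggle to reason flexibly and robustly in complex real-world scenes. The solution is GRiP (Guided Reasoning and Perception), a two-stage training framework whose core is a cognitive-enhanced RL stage with two innovations: a Salience-Weighted IoU Reward, which steers the model toward mission-critical objects rather than distractors, and a Multi-Heuristic Reward, which encourages diverse yet logically valid reasoning pathways and thereby improves cognitive flexibility and perceptual precision. Initialized from Qwen2.5-VL-7B, the framework achieves state-of-the-art results among open-source models on the challenging TreeBench and V* Bench benchmarks, supporting the value of cognitively inspired signals that guide what the model sees and how it thinks.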

Link: https://arxiv.org/abs/2511.22172
Authors: Zhaoyang Wei, Wenchao Ding, Yanchao Hao, Xi Chen
Affiliations: Tencent
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages

Abstract:Models capable of “thinking with images” by dynamically grounding their reasoning in visual evidence represent a major leap in multimodal AI. However, replicating and advancing this ability is non-trivial, with current methods often trapped between the instability of end-to-end reinforcement learning (RL) and the rigidity of supervised fine-tuning (SFT). This leads to models that either struggle to learn or lack the cognitive flexibility required for complex, real-world scenes. To navigate this dilemma, we introduce GRiP (Guided Reasoning and Perception), a novel two-stage training framework that cultivates robust and flexible visual grounded reasoning by explicitly guiding the model’s perceptual focus and logical pathways. GRiP’s core lies in its cognitive-enhanced RL stage, which features two key innovations: (1) a Salience-Weighted IoU Reward that incentivizes the model to prioritize the localization of mission-critical objects over trivial distractors, and (2) a Multi-Heuristic Reward that encourages cognitive flexibility by rewarding diverse yet logically valid reasoning pathways. Initialized from the Qwen2.5-VL-7B model, GRiP demonstrates significant performance gains across multiple challenging benchmarks. It achieves state-of-the-art results among open-source models on the highly challenging TreeBench and V* Bench, proving its effectiveness in complex visual reasoning. Our work demonstrates that moving beyond simplistic rewards and instead guiding models with cognitively-inspired signals for what to see and how to think is crucial for unlocking the next level of multimodal intelligence. The code will be made publicly available.
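
The abstract names a Salience-Weighted IoU Reward but does not define it. The sketch below shows one plausible reading, offered purely as an assumption: each ground-truth box carries a salience weight, and the reward is the salience-weighted mean of the best IoU that any predicted box achieves against it. The box format, the matching rule, and both function names are hypothetical and may differ from the paper.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), with x1 < x2 and y1 < y2


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def salience_weighted_iou(pred: List[Box], gt: List[Box], salience: List[float]) -> float:
    """Salience-weighted mean, over ground-truth boxes, of the best IoU any prediction achieves."""
    total = sum(salience)
    if total <= 0 or not gt:
        return 0.0
    score = 0.0
    for box, weight in zip(gt, salience):
        best = max((iou(p, box) for p in pred), default=0.0)
        score += weight * best
    return score / total
```

Under this reading, localizing a high-salience object perfectly while missing a low-salience distractor still earns most of the reward, whereas the reverse earns little.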

[CV-173] BrepGPT: Autoregressive B-rep Generation with Voronoi Half-Patch

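[Quick Read]: This paper addresses a problem in generative modeling of boundary representations (B-reps): because geometric and topological elements are tightly coupled, existing methods rely on cascaded multi-stage networks, which leads to error accumulation and computational inefficiency. The key to the solution is a new Voronoi Half-Patch (VHP) representation that decomposes a B-rep into unified local units by assigning geometry to the nearest half-edges and sampling their next pointers, so that geometric attributes and topological relations are encoded in a single format. Dual VQ-VAEs then map vertex topology and VHPs to vertex-based discrete tokens, and a decoder-only Transformer predicts these tokens autoregressively, giving efficient, high-quality single-stage B-rep generation.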

Link: https://arxiv.org/abs/2511.22171
Authors: Pu Li, Wenhao Zhang, Weize Quan, Biao Zhang, Peter Wonka, Dong-Ming Yan
Affiliations: MAIS, Institute of Automation, Chinese Academy of Sciences, and University of Chinese Academy of Sciences; King Abdullah University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:

Abstract:Boundary representation (B-rep) is the de facto standard for CAD model representation in modern industrial design. The intricate coupling between geometric and topological elements in B-rep structures has forced existing generative methods to rely on cascaded multi-stage networks, resulting in error accumulation and computational inefficiency. We present BrepGPT, a single-stage autoregressive framework for B-rep generation. Our key innovation lies in the Voronoi Half-Patch (VHP) representation, which decomposes B-reps into unified local units by assigning geometry to nearest half-edges and sampling their next pointers. Unlike hierarchical representations that require multiple distinct encodings for different structural levels, our VHP representation facilitates unifying geometric attributes and topological relations in a single, coherent format. We further leverage dual VQ-VAEs to encode both vertex topology and Voronoi Half-Patches into vertex-based tokens, achieving a more compact sequential encoding. A decoder-only Transformer is then trained to autoregressively predict these tokens, which are subsequently mapped to vertex-based features and decoded into complete B-rep models. Experiments demonstrate that BrepGPT achieves state-of-the-art performance in unconditional B-rep generation. The framework also exhibits versatility in various applications, including conditional generation from category labels, point clouds, text descriptions, and images, as well as B-rep autocompletion and interpolation.

[CV-174] Partially Shared Concept Bottleneck Models AAAI2026

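[Quick Read]: This paper addresses three challenges facing Concept Bottleneck Models (CBMs) that generate concepts automatically: poor visual grounding, concept redundancy, and the absence of principled metrics that balance predictive accuracy against concept compactness. The solution is the Partially Shared CBM (PS-CBM), which has three core components: (1) a multimodal concept generator that integrates semantics derived from Large Language Models (LLMs) with exemplar-based visual cues; (2) a Partially Shared Concept Strategy that merges concepts based on activation patterns to balance specificity and compactness; and (3) Concept-Efficient Accuracy (CEA), a post-hoc metric that jointly captures predictive accuracy and concept compactness. Across eleven diverse datasets, PS-CBM consistently outperforms state-of-the-art CBMs, improving classification accuracy by 1.0%-7.4% and CEA by 2.0%-9.5% while requiring significantly fewer concepts.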

Link: https://arxiv.org/abs/2511.22170
Authors: Delong Zhao, Qiang Huang, Di Yan, Yiqun Sun, Jun Yu
Affiliations: University of Science and Technology of China; Alibaba Group; National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 7 figures, 11 tables, Accepted to AAAI 2026

Abstract:Concept Bottleneck Models (CBMs) enhance interpretability by introducing a layer of human-understandable concepts between inputs and predictions. While recent methods automate concept generation using Large Language Models (LLMs) and Vision-Language Models (VLMs), they still face three fundamental challenges: poor visual grounding, concept redundancy, and the absence of principled metrics to balance predictive accuracy and concept compactness. We introduce PS-CBM, a Partially Shared CBM framework that addresses these limitations through three core components: (1) a multimodal concept generator that integrates LLM-derived semantics with exemplar-based visual cues; (2) a Partially Shared Concept Strategy that merges concepts based on activation patterns to balance specificity and compactness; and (3) Concept-Efficient Accuracy (CEA), a post-hoc metric that jointly captures both predictive accuracy and concept compactness. Extensive experiments on eleven diverse datasets show that PS-CBM consistently outperforms state-of-the-art CBMs, improving classification accuracy by 1.0%-7.4% and CEA by 2.0%-9.5%, while requiring significantly fewer concepts. These results underscore PS-CBM’s effectiveness in achieving both high accuracy and strong interpretability.

[CV-175] Real-Time Long Horizon Air Quality Forecasting via Group-Relative Policy Optimization

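[Quick Read]: This paper addresses the limited reliability of long-horizon forecasts of particulate matter (PM) concentration fields in regions with complex terrain and strong atmospheric dynamics, such as East Asia. Existing foundation models such as Aurora often miss region-specific dynamics and rely on non-real-time inputs, which limits their use in localized warning systems. The solution has two parts. First, the authors construct and release the CMAQ-OBS dataset of real-world observations and high-resolution data for East Asia, which reduces regional error by 59.5% and enables real-time 48-120 hour forecasts. Second, they propose a Group-Relative Policy Optimization (GRPO) framework with class-wise rewards and curriculum rollout to align predictions with operational priorities. This mitigates the over-prediction of standard supervised fine-tuning (SFT) models, which arises because point-wise objectives cannot reflect asymmetric operational costs, and reduces the False Alarm Rate by 47.3% relative to the SFT-only baseline while keeping a competitive F1-score.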

Link: https://arxiv.org/abs/2511.22169
Authors: Inha Kang, Eunki Kim, Wonjeong Ryu, Jaeyo Shin, Seungjun Yu, Yoon-Hee Kang, Seongeun Jeong, Eunhye Kim, Soontae Kim, Hyunjung Shim
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages

Abstract:Accurate long horizon forecasting of particulate matter (PM) concentration fields is essential for operational public health decisions. However, achieving reliable forecasts remains challenging in regions with complex terrain and strong atmospheric dynamics such as East Asia. While foundation models such as Aurora offer global generality, they often miss region-specific dynamics and rely on non-real-time inputs, limiting their practical utility for localized warning systems. To address this gap, we construct and release the real-world observations and high-resolution CMAQ-OBS dataset for East Asia, reducing regional error by 59.5% and enabling real-time 48-120 hour forecasts critical for public health alerts. However, standard point-wise objectives cannot reflect asymmetric operational costs, where false alarms deteriorate public trust while missed severe events endanger populations. This cost mismatch causes SFT models to over-predict and yield high False Alarm Rates. We introduce Group-Relative Policy Optimization (GRPO) with class-wise rewards and curriculum rollout to align predictions with operational priorities. Experimental results demonstrate that our framework significantly improves the reliability of the forecast. Compared to the SFT-only baseline, our model reduces the False Alarm Rate by 47.3% while achieving a competitive F1-score, proving its effectiveness for practical, real-world air quality forecasting systems on long lead time scenarios.
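
The group-relative part of GRPO, as the technique is commonly described, can be shown in a few lines: several rollouts are sampled for the same input, and each rollout's reward is normalized by the mean and standard deviation of its own group, so no learned value function is needed. The sketch below covers only that normalization. The class-wise rewards and curriculum rollout are not specified in the abstract and are left out; the function name and the `eps` default are assumptions.

```python
import math
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of rollouts sampled for the same input."""
    n = len(rewards)
    if n == 0:
        return []
    mean = sum(rewards) / n
    # Population standard deviation of the group; eps guards against division by zero.
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]
```

A group whose rollouts all receive the same reward yields all-zero advantages and therefore contributes no policy gradient, which is one reason the design of the reward (for example, how false alarms are scored against missed events) matters in this setting.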

[CV-176] IMTalker: Efficient Audio-driven Talking Face Generation with Implicit Motion Transfer

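[Quick Read]: This paper addresses the shortcomings of existing talking face generation methods that rely on explicit optical flow and local warping, namely poor modeling of complex global motion and identity drift. The solution is the IMTalker framework. It replaces flow-based warping with a cross-attention mechanism that implicitly models motion discrepancy and identity alignment within a unified latent space; an identity-adaptive module projects motion latents into personalized spaces to disentangle motion from identity; and a lightweight flow-matching motion generator produces vivid, controllable implicit motion vectors from audio, pose, and gaze cues. Together these give high-fidelity generation with high efficiency.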

Link: https://arxiv.org/abs/2511.22167
Authors: Bo Chen, Tao Liu, Qi Chen, Xie Chen, Zilong Zheng
Affiliations: X-LANCE Lab, Shanghai Jiao Tong University; State Key Laboratory of General Artificial Intelligence, BIGAI
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 11 pages, 5 figures

Abstract:Talking face generation aims to synthesize realistic speaking portraits from a single image, yet existing methods often rely on explicit optical flow and local warping, which fail to model complex global motions and cause identity drift. We present IMTalker, a novel framework that achieves efficient and high-fidelity talking face generation through implicit motion transfer. The core idea is to replace traditional flow-based warping with a cross-attention mechanism that implicitly models motion discrepancy and identity alignment within a unified latent space, enabling robust global motion rendering. To further preserve speaker identity during cross-identity reenactment, we introduce an identity-adaptive module that projects motion latents into personalized spaces, ensuring clear disentanglement between motion and identity. In addition, a lightweight flow-matching motion generator produces vivid and controllable implicit motion vectors from audio, pose, and gaze cues. Extensive experiments demonstrate that IMTalker surpasses prior methods in motion accuracy, identity preservation, and audio-lip synchronization, achieving state-of-the-art quality with superior efficiency, operating at 40 FPS for video-driven and 42 FPS for audio-driven generation on an RTX 4090 GPU. We will release our code and pre-trained models to facilitate applications and future research.

[CV-177] RemedyGS: Defend 3D Gaussian Splatting against Computation Cost Attacks

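[Quick Read]: This paper addresses computation cost attacks on 3D Gaussian Splatting (3DGS) in practical deployment. Such attacks inject malicious textures that cause abnormal resource consumption and can even lead to denial-of-service (DoS) conditions, threatening the reliability of 3D reconstruction systems. The solution is RemedyGS, presented as the first comprehensive and effective black-box defense framework. It has two key components: a detector that identifies input images with poisoned textures, and a purifier that recovers benign images from their attacked counterparts. Adversarial training of the purifier enforces distributional alignment between the recovered and the original natural images, preserving reconstruction quality while improving safety. The framework is reported to achieve state-of-the-art defense against white-box, black-box, and adaptive attacks.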

Link: https://arxiv.org/abs/2511.22147
Authors: Yanping Li, Zhening Liu, Zijian Li, Zehong Lin, Jun Zhang
Affiliations: Hong Kong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

Abstract:As a mainstream technique for 3D reconstruction, 3D Gaussian splatting (3DGS) has been applied in a wide range of applications and services. Recent studies have revealed critical vulnerabilities in this pipeline and introduced computation cost attacks that lead to malicious resource occupancies and even denial-of-service (DoS) conditions, thereby hindering the reliable deployment of 3DGS. In this paper, we propose the first effective and comprehensive black-box defense framework, named RemedyGS, against such computation cost attacks, safeguarding 3DGS reconstruction systems and services. Our pipeline comprises two key components: a detector to identify the attacked input images with poisoned textures and a purifier to recover the benign images from their attacked counterparts, mitigating the adverse effects of these attacks. Moreover, we incorporate adversarial training into the purifier to enforce distributional alignment between the recovered and original natural images, thereby enhancing the defense efficacy. Experimental results demonstrate that our framework effectively defends against white-box, black-box, and adaptive attacks in 3DGS systems, achieving state-of-the-art performance in both safety and utility.

[CV-178] Stacked Ensemble of Fine-Tuned CNNs for Knee Osteoarthritis Severity Grading

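Quick read: This paper targets knee osteoarthritis (KOA) severity assessment, which relies on X-ray images and is subjective and limited in accuracy. Diagnosis with the Kellgren-Lawrence (KL) grading system requires reading by a specialist, is time-consuming, and is prone to subjective bias. The study proposes a stacked ensemble model of fine-tuned convolutional neural networks (CNNs). Its key idea is to combine several pre-trained architectures (MobileNetV2, You Only Look Once v8 (YOLOv8), and DenseNet201) as base learners, with Categorical Boosting (CatBoost) as the meta-learner, covering both binary KOA detection and multiclass KL grading. The model reaches a balanced test accuracy of 73% on the multiclass task and 87.5% on the binary task, which the authors report as higher than previous work.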

Link: https://arxiv.org/abs/2511.22143
Authors: Adarsh Gupta, Japleen Kaur, Tanvi Doshi, Teena Sharma, Nishchal K. Verma, Shantaram Vasikarla
Affiliations: Indian Institute of Technology Guwahati; Indian Institute of Technology Kanpur; California State University, Northridge
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted and Presented at IEEE UEMCON, IBM T.J. Watson Research Center, New York, USA, 2024

Abstract:Knee Osteoarthritis (KOA) is a musculoskeletal condition that can cause significant limitations and impairments in daily activities, especially among older individuals. To evaluate the severity of KOA, typically, X-ray images of the affected knee are analyzed, and a grade is assigned based on the Kellgren-Lawrence (KL) grading system, which classifies KOA severity into five levels, ranging from 0 to 4. This approach requires a high level of expertise and time and is susceptible to subjective interpretation, thereby introducing potential diagnostic inaccuracies. To address this problem a stacked ensemble model of fine-tuned Convolutional Neural Networks (CNNs) was developed for two classification tasks: a binary classifier for detecting the presence of KOA, and a multiclass classifier for precise grading across the KL spectrum. The proposed stacked ensemble model consists of a diverse set of pre-trained architectures, including MobileNetV2, You Only Look Once (YOLOv8), and DenseNet201 as base learners and Categorical Boosting (CatBoost) as the meta-learner. This proposed model had a balanced test accuracy of 73% in multiclass classification and 87.5% in binary classification, which is higher than previous works in extant literature.
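The stacking idea described in this abstract (base learners feeding a meta-learner) can be sketched roughly as follows. This is a minimal standard-library illustration, not the authors' implementation: the two toy base learners, the training rows, and the `NearestCentroidMeta` class are all hypothetical, and the nearest-centroid rule merely stands in for CatBoost.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]
BaseLearner = Callable[[Sequence[float]], Vector]  # maps an input to class probabilities


def meta_features(x: Sequence[float], base_learners: Sequence[BaseLearner]) -> Vector:
    """Concatenate the class-probability outputs of every base learner."""
    features: Vector = []
    for learner in base_learners:
        features.extend(learner(x))
    return features


class NearestCentroidMeta:
    """Stand-in meta-learner (the paper uses CatBoost): nearest class centroid."""

    def fit(self, rows: Sequence[Vector], labels: Sequence[int]) -> "NearestCentroidMeta":
        sums: Dict[int, Vector] = {}
        counts: Dict[int, int] = {}
        for row, label in zip(rows, labels):
            acc = sums.setdefault(label, [0.0] * len(row))
            for i, value in enumerate(row):
                acc[i] += value
            counts[label] = counts.get(label, 0) + 1
        self.centroids = {k: [v / counts[k] for v in acc] for k, acc in sums.items()}
        return self

    def predict(self, row: Vector) -> int:
        def sq_dist(centroid: Vector) -> float:
            return sum((a - b) ** 2 for a, b in zip(row, centroid))
        return min(self.centroids, key=lambda k: sq_dist(self.centroids[k]))


# Hypothetical toy base learners for the binary task (0 = no KOA, 1 = KOA).
def learner_a(x: Sequence[float]) -> Vector:
    p = min(max(x[0], 0.0), 1.0)
    return [1.0 - p, p]


def learner_b(x: Sequence[float]) -> Vector:
    p = min(max(x[1], 0.0), 1.0)
    return [1.0 - p, p]


BASE = [learner_a, learner_b]
train_x = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.7, 0.9]]
train_y = [0, 0, 1, 1]
meta = NearestCentroidMeta().fit([meta_features(x, BASE) for x in train_x], train_y)
prediction = meta.predict(meta_features([0.8, 0.7], BASE))
```

In practice the meta-learner is usually fitted on out-of-fold base predictions so that it does not learn from outputs the base learners produced on their own training data; the sketch skips that step for brevity.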

[CV-179] SemOD: Semantic Enabled Object Detection Network under Various Weather Conditions

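Quick read: This paper addresses the performance drop of camera-based autonomous-driving perception models under varied weather. In adverse conditions such as rain, fog, and snow, models generalize poorly and cannot effectively restore degraded image regions. The key idea is to use semantic information to make image reconstruction and object detection more robust. The end-to-end architecture consists of a Preprocessing Unit (PPU) and a Detection Unit (DTU): the PPU uses a semantics-guided U-shaped network to refine degraded images, recovering plausible content in missing regions while preserving visual coherence, and the DTU feeds the semantically enhanced features into a modified YOLO detector for object recognition across weather conditions. On several benchmark datasets, the method improves mAP by 1.47% to 8.80% over existing methods.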

Link: https://arxiv.org/abs/2511.22142
Authors: Aiyinsi Zuo, Zhaoliang Zheng
Affiliations: University of Rochester; University of California, Los Angeles
Categories: Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)

Abstract:In the field of autonomous driving, camera-based perception models are mostly trained on clear weather data. Models that focus on addressing specific weather challenges are unable to adapt to various weather changes and primarily prioritize their weather removal characteristics. Our study introduces a semantic-enabled network for object detection in diverse weather conditions. In our analysis, semantic information can enable the model to generate plausible content for missing areas, understand object boundaries, and preserve visual coherency and realism across both filled-in and existing portions of the image, which are conducive to image transformation and object recognition. Specifically, our architecture consists of a Preprocessing Unit (PPU) and a Detection Unit (DTU), where the PPU utilizes a U-shaped net enriched by semantics to refine degraded images, and the DTU integrates this semantic information for object detection using a modified YOLO network. Our method pioneers the use of semantic data for all-weather transformations, resulting in an mAP increase of between 1.47% and 8.80% compared to existing methods across benchmark datasets of different weather conditions. This highlights the potency of semantics in image enhancement and object detection, offering a comprehensive approach to improving object detection performance. Code will be available at this https URL.

[CV-180] EASL: Multi-Emotion Guided Semantic Disentanglement for Expressive Sign Language Generation

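Quick read: This paper addresses a gap in current large language model (LLM)-based sign language generation: methods pursue semantic accuracy while neglecting emotional expression, so the generated sign language videos lack naturalness and expressiveness. The key is EASL (Emotion-Aware Sign Language), an architecture built on emotion-semantic disentanglement modules and a progressive training strategy that extract semantic and affective features separately. During pose decoding, the emotional representations guide semantic interaction, producing sign poses together with 7-class emotion confidence scores and making the generated signing more emotionally expressive.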

Link: https://arxiv.org/abs/2511.22135
Authors: Yanchao Zhao, Jihao Zhu, Yu Liu, Weizhuo Chen, Yuling Yang, Kun Peng
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Large language models have revolutionized sign language generation by automatically transforming text into high-quality sign language videos, providing accessible communication for the Deaf community. However, existing LLM-based approaches prioritize semantic accuracy while overlooking emotional expressions, resulting in outputs that lack naturalness and expressiveness. We propose EASL (Emotion-Aware Sign Language), a multi-emotion-guided generation architecture for fine-grained emotional integration. We introduce emotion-semantic disentanglement modules with progressive training to separately extract semantic and affective features. During pose decoding, the emotional representations guide semantic interaction to generate sign poses with 7-class emotion confidence scores, enabling emotional expression recognition. Experimental results demonstrate that EASL achieves pose accuracy superior to all compared baselines by integrating multi-emotion information and effectively adapts to diffusion models to generate expressive sign language videos.

[CV-181] DualVLA: Building a Generalizable Embodied Agent via Partial Decoupling of Reasoning and Action

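Quick read: This paper addresses action degeneration in generalist Vision-Language-Action (VLA) models: after multimodal data is introduced to strengthen reasoning, the fine-tuned model reasons better but executes concrete manipulation tasks markedly less accurately. The key is the DualVLA framework, which has two core contributions. First, a dual-layer data pruning method removes redundant embodied reasoning data so that it does not harm action learning. Second, a dual-teacher adaptive distillation strategy assigns different supervision signals to different data domains, strengthening action generation while preserving reasoning ability. The method balances precise action execution with multimodal understanding, and experiments in simulation and on multiple benchmarks show stronger overall performance than existing methods.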

Link: https://arxiv.org/abs/2511.22134
Authors: Zhen Fang, Zhuoyang Liu, Jiaming Liu, Hao Chen, Yu Zeng, Shiting Huang, Zehui Chen, Lin Chen, Shanghang Zhang, Feng Zhao
Affiliations: University of Science and Technology of China; Peking University; CUHK
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Abstract:To build a generalizable Vision-Language-Action (VLA) model with strong reasoning ability, a common strategy is to first train a specialist VLA on robot demonstrations to acquire reliable manipulation skills, and then incorporate mixed annotated robot data together with multimodal data to restore broader reasoning capabilities. However, we observe that the resulting reasoning VLA often suffers from degraded action performance compared to the specialist model before fine-tuning, a phenomenon we refer to as action degeneration. To address this issue, we propose DualVLA, which enhances action performance through carefully designed post-training while still preserving reasoning capability. We first introduce a dual-layer data pruning method that removes redundant embodied reasoning, preventing it from adversely influencing action learning. To further strengthen action generation, we design a dual-teacher adaptive distillation strategy that assigns different supervision signals to different data domains while maintaining reasoning ability. To fill the evaluation gap for generalist VLAs, we also propose VLA Score, which decouples VLA capability into reasoning, intention, action, and alignment dimensions for a more fine-grained assessment. Experiments show that DualVLA achieves an average success rate of 61.0 in SimplerEnv and an average score of 65.4 across eight competitive multimodal benchmarks, demonstrating a stronger balance between precise action execution and multimodal understanding. Project Website: this https URL.

[CV-182] Autonomous labeling of surgical resection margins using a foundation model

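Quick read: This paper addresses two problems in the pathological assessment of surgical margins: reliance on physical ink leads to poor standardization, and cautery artifacts interfere with identifying the cut surface of the tissue. The core solution is a Virtual Inking Network (VIN), which extracts features with a frozen pre-trained foundation model and uses a lightweight two-layer multilayer perceptron (MLP) for patch-level classification of margin regions, automatically localizing the surgical margin on whole-slide images without physical ink. Trained and validated on a dataset of 120 H&E-stained slides, VIN reaches a region-level accuracy of about 73.3% in blind testing, with errors confined to limited areas that do not break the continuity of the overall margin map. This suggests that VIN captures cautery-related histomorphology and could be integrated into routine digital pathology workflows and used for downstream measurement of margin distances.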

Link: https://arxiv.org/abs/2511.22131
Authors: Xilin Yang, Musa Aydin, Yuhong Lu, Sahan Yoruc Selcuk, Bijie Bai, Yijie Zhang, Andrew Birkeland, Katjana Ehrlich, Julien Bec, Laura Marcu, Nir Pillar, Aydogan Ozcan
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Medical Physics (physics.med-ph)
Comments: 20 Pages, 5 Figures

Abstract:Assessing resection margins is central to pathological specimen evaluation and has profound implications for patient outcomes. Current practice employs physical inking, which is applied variably, and cautery artifacts can obscure the true margin on histological sections. We present a virtual inking network (VIN) that autonomously localizes the surgical cut surface on whole-slide images, reducing reliance on inks and standardizing margin-focused review. VIN uses a frozen foundation model as the feature extractor and a compact two-layer multilayer perceptron trained for patch-level classification of cautery-consistent features. The dataset comprised 120 hematoxylin and eosin (HE) stained slides from 12 human tonsil tissue blocks, resulting in ~2 TB of uncompressed raw image data, where a board-certified pathologist provided boundary annotations. In blind testing with 20 slides from previously unseen blocks, VIN produced coherent margin overlays that qualitatively aligned with expert annotations across serial sections. Quantitatively, region-level accuracy was ~73.3% across the test set, with errors largely confined to limited areas that did not disrupt continuity of the whole-slide margin map. These results indicate that VIN captures cautery-related histomorphology and can provide a reproducible, ink-free margin delineation suitable for integration into routine digital pathology workflows and for downstream measurement of margin distances.

[CV-183] GA2-CLIP: Generic Attribute Anchor for Efficient Prompt Tuning in Video-Language Models

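Quick read: This paper addresses the loss of generalization that Vision-Language Models (VLMs) suffer when fine-tuned on video tasks: overfitting causes performance on unseen classes to degrade significantly. The key is a plug-and-play coupling prompt learning framework that introduces externally supervised prompts to counter the narrowing of the semantic space during fine-tuning. On the text side, prompts pre-trained on other datasets serve as hard prompt tokens, which are concatenated with soft prompt tokens and coupled through a learnable mapping layer; this competitive prompting keeps the semantic space from overfitting to the supervised categories. In addition, a set of irrelevant videos and negative prompts acts as generic attribute anchors that maintain the generic relevance of attributes in the pre-trained semantic space, preserving the model's generalization ability.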

Link: https://arxiv.org/abs/2511.22125
Authors: Bin Wang, Ruotong Hu, Wenqian Wang, Wentong Li, Mingliang Gao, Runmin Cong, Wei Zhang
Affiliations: Shandong University of Technology; Shandong University; Nanjing University of Aeronautics and Astronautics; Singapore University of Technology and Design
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report

Abstract:Visual and textual soft prompt tuning can effectively improve the adaptability of Vision-Language Models (VLMs) in downstream tasks. However, fine-tuning on video tasks impairs the model’s generalization ability to unseen classes. Existing methods attempt to mitigate this forgetting effect by regularizing the gap between hand-crafted prompts and soft prompts, but this also weakens the learning ability of soft prompts. To address this challenge, we propose a plug-and-play coupling prompt learning framework to optimize the generalization performance of V-L models in video tasks, with the core motivation of mitigating semantic space narrowing during fine-tuning by introducing an externally supervised prompt. Specifically, for textual prompts, we introduce pre-trained prompts from other datasets as hard prompt tokens. These are concatenated with soft prompt tokens and coupled via a learnable mapping layer. This competitive prompting approach prevents the semantic space from overfitting to supervised categories. In addition, we introduce a set of well-designed irrelevant video sets and negative prompts as generic attribute anchors to maintain the generic relevance of the attributes in the pre-trained semantic space, thus preserving the generalization ability. Experiments on video tasks demonstrate that our method significantly outperforms state-of-the-art prompt tuning approaches across generalization benchmarks, particularly on base-to-new class prediction.

[CV-184] Cue3D: Quantifying the Role of Image Cues in Single-Image 3D Generation NEURIPS2025

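Quick read: This paper addresses the unclear dependence of current single-image 3D generation models on image cues: it is not known which visual cues existing deep generative models actually use to infer 3D structure. The key is Cue3D, presented as the first comprehensive, model-agnostic framework for this question. It systematically perturbs monocular cues, including shading, texture, silhouette, perspective, edges, and local continuity, and quantifies the effect of each cue on 3D output quality across seven state-of-the-art methods. The results reveal how sensitive different model families are to classical vision cues and how they depend on them, providing a basis for more transparent, robust, and controllable 3D generation models.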

Link: https://arxiv.org/abs/2511.22121
Authors: Xiang Li, Zirui Wang, Zixuan Huang, James M. Rehg
Affiliations: University of Illinois at Urbana-Champaign
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: NeurIPS 2025 Highlight; Project page: this https URL

Abstract:Humans and traditional computer vision methods rely on a diverse set of monocular cues to infer 3D structure from a single image, such as shading, texture, silhouette, etc. While recent deep generative models have dramatically advanced single-image 3D generation, it remains unclear which image cues these methods actually exploit. We introduce Cue3D, the first comprehensive, model-agnostic framework for quantifying the influence of individual image cues in single-image 3D generation. Our unified benchmark evaluates seven state-of-the-art methods, spanning regression-based, multi-view, and native 3D generative paradigms. By systematically perturbing cues such as shading, texture, silhouette, perspective, edges, and local continuity, we measure their impact on 3D output quality. Our analysis reveals that shape meaningfulness, not texture, dictates generalization. Geometric cues, particularly shading, are crucial for 3D generation. We further identify over-reliance on provided silhouettes and diverse sensitivities to cues such as perspective and local continuity across model families. By dissecting these dependencies, Cue3D advances our understanding of how modern 3D networks leverage classical vision cues, and offers directions for developing more transparent, robust, and controllable single-image 3D generation models.

[CV-185] GoPrune: Accelerated Structured Pruning with $\ell_{2,p}$-Norm Optimization

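Quick read: This paper addresses the rapid growth in storage and computation costs of convolutional neural networks (CNNs) as depth increases, which limits their deployment on resource-constrained edge devices. Existing work that applies the $\ell_p$-norm to pruning considers only unstructured pruning with $p \in (0, 1)$ and is computationally inefficient. The key is an accelerated structured pruning method called GoPrune. It uses the $\ell_{2,p}$-norm for sparse network learning, extends the range of $p$ to $[0, 1)$, and relies on an efficient optimization algorithm based on proximal alternating minimization (PAM) whose subproblems have closed-form solutions, which improves compression efficiency.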

Link: https://arxiv.org/abs/2511.22120
Authors: Li Xu, Xianchao Xiu
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Convolutional neural networks (CNNs) suffer from rapidly increasing storage and computational costs as their depth grows, which severely hinders their deployment on resource-constrained edge devices. Pruning is a practical approach for network compression, among which structured pruning is the most effective for inference acceleration. Although existing work has applied the $\ell_p$-norm to pruning, it only considers unstructured pruning with $p \in (0, 1)$ and has low computational efficiency. To overcome these limitations, we propose an accelerated structured pruning method called GoPrune. Our method employs the $\ell_{2,p}$-norm for sparse network learning, where the value of $p$ is extended to $[0, 1)$. Moreover, we develop an efficient optimization algorithm based on the proximal alternating minimization (PAM), and the resulting subproblems enjoy closed-form solutions, thus improving compression efficiency. Experiments on the CIFAR datasets using ResNet and VGG models demonstrate the superior performance of the proposed method in network pruning. Our code is available at this https URL.
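To make the group-sparsity penalty concrete, the sketch below evaluates the commonly used form of the $\ell_{2,p}$ penalty, $\sum_g \|w_g\|_2^p$ for $0 < p < 1$, and the count of nonzero groups for $p = 0$. This is an assumption about the general definition rather than the paper's exact formulation, and the function name `l2p_penalty` and the toy filters are hypothetical.

```python
import math
from typing import Sequence


def l2p_penalty(groups: Sequence[Sequence[float]], p: float) -> float:
    """Return sum_g ||w_g||_2^p for 0 < p < 1, or the number of nonzero groups when p == 0."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must lie in [0, 1)")
    # One Euclidean norm per group (for example, per convolutional filter).
    norms = [math.sqrt(sum(w * w for w in group)) for group in groups]
    if p == 0.0:
        return float(sum(1 for n in norms if n > 0.0))
    return sum(n ** p for n in norms)


# Three hypothetical filters; the second is entirely zero (already pruned).
filters = [[3.0, 4.0], [0.0, 0.0], [0.0, 1.0]]
count_nonzero = l2p_penalty(filters, 0.0)  # number of filters still active
half_penalty = l2p_penalty(filters, 0.5)   # sqrt(5) + 0 + 1
```

Because the norm is taken per group before the power $p$ is applied, driving the penalty down removes whole filters rather than scattered individual weights, which is what makes the resulting sparsity structured.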

[CV-186] PROMPTMINER: Black-Box Prompt Stealing against Text-to-Image Generative Models via Reinforcement Learning and Fuzz Optimization

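Quick read: This paper addresses the privacy and intellectual-property risks facing high-quality prompts for text-to-image (T2I) generative models, in particular prompt stealing attacks. Existing methods mostly depend on white-box gradient access, large-scale labeled data, or captioning alone, which limits their practicality and adaptability. The paper proposes PROMPTMINER, a black-box prompt stealing framework whose core idea is to decouple the task into two phases: the first uses reinforcement learning-based optimization to reconstruct the primary subject of the image, and the second uses a fuzzing-driven search to recover stylistic modifiers. This design improves the accuracy and generalization of prompt reconstruction, outperforming baselines across multiple diffusion models and datasets and remaining robust under defensive perturbations.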

Link: https://arxiv.org/abs/2511.22119
Authors: Mingzhe Li, Renhao Zhang, Zhiyang Wen, Siqi Pan, Bruno Castro da Silva, Juan Zhai, Shiqing Ma
Affiliations: University of Massachusetts, Amherst; Dolby Laboratories
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Text-to-image (T2I) generative models such as Stable Diffusion and FLUX can synthesize realistic, high-quality images directly from textual prompts. The resulting image quality depends critically on well-crafted prompts that specify both subjects and stylistic modifiers, which have become valuable digital assets. However, the rising value and ubiquity of high-quality prompts expose them to security and intellectual-property risks. One key threat is the prompt stealing attack, i.e., the task of recovering the textual prompt that generated a given image. Prompt stealing enables unauthorized extraction and reuse of carefully engineered prompts, yet it can also support beneficial applications such as data attribution, model provenance analysis, and watermarking validation. Existing approaches often assume white-box gradient access, require large-scale labeled datasets for supervised training, or rely solely on captioning without explicit optimization, limiting their practicality and adaptability. To address these challenges, we propose PROMPTMINER, a black-box prompt stealing framework that decouples the task into two phases: (1) a reinforcement learning-based optimization phase to reconstruct the primary subject, and (2) a fuzzing-driven search phase to recover stylistic modifiers. Experiments across multiple datasets and diffusion backbones demonstrate that PROMPTMINER achieves superior results, with CLIP similarity up to 0.958 and textual alignment with SBERT up to 0.751, surpassing all baselines. Even when applied to in-the-wild images with unknown generators, it outperforms the strongest baseline by 7.5 percent in CLIP similarity, demonstrating better generalization. Finally, PROMPTMINER maintains strong performance under defensive perturbations, highlighting remarkable robustness. Code: this https URL

[CV-187] HyperST: Hierarchical Hyperbolic Learning for Spatial Transcriptomics Prediction

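Quick read: This paper addresses insufficient alignment between images and gene expression data in Spatial Transcriptomics (ST). Existing methods focus only on spot-level image-to-gene matching and do not exploit the hierarchical structure of ST data on the gene expression side. They also face an information asymmetry between the two modalities: gene expression profiles carry richer molecular detail that may lack salient visual correlates. The key is the HyperST framework, which models the data's inherent hierarchy in hyperbolic space to learn multi-level image-gene representations. A multi-level representation extractor first captures spot-level and niche-level features from each modality. A hierarchical hyperbolic alignment module then integrates image and gene embeddings hierarchically while performing spatial alignment, enriching the image representations with molecular semantics and improving cross-modal prediction.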

Link: https://arxiv.org/abs/2511.22107
Authors: Chen Zhang, Yilu An, Ying Chen, Hao Li, Xitong Ling, Lihao Liu, Junjun He, Yuxiang Lin, Zihui Wang, Rongshan Yu
Affiliations: Xiamen University; Shanghai AI Laboratory; Tsinghua University; Peng Cheng Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Spatial Transcriptomics (ST) merges the benefits of pathology images and gene expression, linking molecular profiles with tissue structure to analyze spot-level function comprehensively. Predicting gene expression from histology images is a cost-effective alternative to expensive ST technologies. However, existing methods mainly focus on spot-level image-to-gene matching but fail to leverage the full hierarchical structure of ST data, especially on the gene expression side, leading to incomplete image-gene alignment. Moreover, a challenge arises from the inherent information asymmetry: gene expression profiles contain more molecular details that may lack salient visual correlates in histological images, demanding a sophisticated representation learning approach to bridge this modality gap. We propose HyperST, a framework for ST prediction that learns multi-level image-gene representations by modeling the data’s inherent hierarchy within hyperbolic space, a natural geometric setting for such structures. First, we design a Multi-Level Representation Extractors to capture both spot-level and niche-level representations from each modality, providing context-aware information beyond individual spot-level image-gene pairs. Second, a Hierarchical Hyperbolic Alignment module is introduced to unify these representations, performing spatial alignment while hierarchically structuring image and gene embeddings. This alignment strategy enriches the image representations with molecular semantics, significantly improving cross-modal prediction. HyperST achieves state-of-the-art performance on four public datasets from different tissues, paving the way for more scalable and accurate spatial transcriptomics prediction.

[CV-188] MoE3D: Mixture of Experts meets Multi-Modal 3D Understanding

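Quick read: This paper addresses inefficient and underperforming fusion in multi-modal 3D understanding, caused by heterogeneity and complexity across modalities. Traditional methods rely on a single dense fusion network, which struggles to exploit the complementary information across modalities such as vision and language. The key is the MoE3D framework, which brings Mixture of Experts (MoE) into multi-modal learning: a set of "expert" networks, each specialized in a particular modality or mode of cross-modal interaction, enables more flexible and efficient feature fusion. An information aggregation module and a Top-1 gating mechanism for expert selection improve fusion quality and computational efficiency, and a progressive pre-training strategy leverages semantic and 2D priors. The model performs competitively on several mainstream 3D understanding tasks and surpasses the best baseline by 6.1 mIoU on Multi3DRefer.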

Link: https://arxiv.org/abs/2511.22103
Authors: Yu Li, Yuenan Hou, Yingmei Wei, Xinge Zhu, Yuexin Ma, Wenqi Shao, Yanming Guo
Affiliations: National University of Defense Technology; Shanghai AI Laboratory; The Chinese University of Hong Kong; ShanghaiTech University
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Multi-modal 3D understanding is a fundamental task in computer vision. Previous multi-modal fusion methods typically employ a single, dense fusion network, struggling to handle the significant heterogeneity and complexity across modalities, leading to suboptimal performance. In this paper, we propose MoE3D, which integrates Mixture of Experts (MoE) into the multi-modal learning framework. The core is that we deploy a set of specialized “expert” networks, each adept at processing a specific modality or a mode of cross-modal interaction. Specifically, the MoE-based transformer is designed to better utilize the complementary information hidden in the visual features. An information aggregation module is put forward to further enhance the fusion performance. Top-1 gating is employed to make one expert process features with expert groups, ensuring high efficiency. We further propose a progressive pre-training strategy to better leverage the semantic and 2D prior, thus equipping the network with good initialization. Our MoE3D achieves competitive performance across four prevalent 3D understanding tasks. Notably, our MoE3D surpasses the top-performing counterpart by 6.1 mIoU on Multi3DRefer.
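Top-1 gating, as mentioned in this abstract, can be illustrated with a short standard-library sketch: a gate scores every expert for a token, and only the highest-scoring expert runs. The gate weights, the two toy experts, and the choice to scale the output by the gate probability are assumptions made for illustration, not details taken from the paper.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Expert = Callable[[Vector], Vector]


def softmax(logits: Sequence[float]) -> Vector:
    """Numerically stable softmax over a list of gate logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top1_route(token: Vector, gate_weights: Sequence[Vector],
               experts: Sequence[Expert]) -> Tuple[int, Vector]:
    """Send the token to the single highest-scoring expert; scale by its gate probability."""
    logits = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    probs = softmax(logits)
    chosen = max(range(len(experts)), key=lambda i: probs[i])
    output = [probs[chosen] * y for y in experts[chosen](token)]
    return chosen, output


# Hypothetical experts: one doubles the features, the other negates them.
experts: List[Expert] = [lambda v: [2.0 * x for x in v], lambda v: [-x for x in v]]
gate = [[1.0, 0.0], [0.0, 1.0]]  # one row of illustrative gate weights per expert
index, routed = top1_route([0.5, 2.0], gate, experts)
```

Only one expert is evaluated per token, so compute per token stays roughly constant as more experts are added, which is the efficiency argument usually made for Top-1 routing.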

[CV-189] MRI-Based Brain Age Estimation with Supervised Contrastive Learning of Continuous Representation

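Quick read: This paper addresses a weakness of existing deep learning-based brain age estimation models: they fail to capture the continuous nature of neuromorphological change, which leads to inadequate feature representations and limited performance. The key is to apply, for the first time, supervised contrastive learning with the Rank-N-Contrast (RNC) loss to brain age estimation from T1-weighted (T1w) structural MRI, and to use Grad-RAM to visually explain the regression results. With a limited number of training samples, the method achieves a mean absolute error (MAE) of 4.27 years and an $R^2$ of 0.93, significantly outperforming conventional deep regression with the same backbone and performing comparably to or better than state-of-the-art methods trained on much larger datasets. The study also finds that the gap between biological and chronological brain age in Alzheimer's disease and Parkinson's disease patients correlates with disease severity, supporting its potential as a biomarker for neurodegenerative disorders.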

Link: https://arxiv.org/abs/2511.22102
Authors: Simon Joseph Clément Crête, Marta Kersten-Oertel, Yiming Xiao
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:MRI-based brain age estimation models aim to assess a subject’s biological brain age based on information such as neuroanatomical features. Various factors, including neurodegenerative diseases, can accelerate brain aging, and measuring this phenomenon could serve as a potential biomarker for clinical applications. While deep learning (DL)-based regression has recently attracted major attention, existing approaches often fail to capture the continuous nature of neuromorphological changes, potentially resulting in sub-optimal feature representation and results. To address this, we propose to use supervised contrastive learning with the recent Rank-N-Contrast (RNC) loss to estimate brain age based on widely used T1w structural MRI for the first time and leverage Grad-RAM to visually explain regression results. Experiments show that our proposed method achieves a mean absolute error (MAE) of 4.27 years and an R^2 of 0.93 with a limited dataset of training samples, significantly outperforming conventional deep regression with the same ResNet backbone while performing better or comparably with the state-of-the-art methods with significantly larger training data. Furthermore, Grad-RAM revealed more nuanced features related to age regression with the RNC loss than conventional deep regression. As an exploratory study, we employed the proposed method to estimate the gap between the biological and chronological brain ages in Alzheimer’s Disease and Parkinson’s disease patients, and revealed the correlation between the brain age gap and disease severity, demonstrating its potential as a biomarker in neurodegenerative disorders.
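For readers unfamiliar with the two reported metrics, the sketch below computes the mean absolute error and the coefficient of determination $R^2$ from their standard definitions, plus a per-subject brain age gap (predicted minus chronological age). The ages and predictions are made-up illustrative numbers, not data from the paper.

```python
from typing import List, Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical chronological ages and model predictions, in years.
ages = [20.0, 40.0, 60.0, 80.0]
predicted = [24.0, 38.0, 63.0, 77.0]

mae = mean_absolute_error(ages, predicted)  # (4 + 2 + 3 + 3) / 4 = 3.0
r2 = r_squared(ages, predicted)             # 1 - 38 / 2000 = 0.981
brain_age_gap: List[float] = [p - t for p, t in zip(predicted, ages)]
```

A positive gap means the model judges the brain to look older than the subject's chronological age, which is the quantity the abstract relates to disease severity.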

[CV-190] WorldWander: Bridging Egocentric and Exocentric Worlds in Video Generation

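Quick read: This paper addresses synchronization and consistency when translating video between perspectives, first-person (egocentric) and third-person (exocentric), which matters for filmmaking, embodied AI, and world models. The key is the WorldWander framework, whose core components are (i) In-Context Perspective Alignment, which captures semantic correspondences between views, and (ii) Collaborative Position Encoding, which efficiently models cross-view synchronization. The authors also build EgoExo-8K, a dataset of synchronized triplets from both synthetic and real-world scenarios, to support training and evaluation. Experiments show that the method sets a new benchmark in perspective synchronization, character consistency, and generalization.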

Link: https://arxiv.org/abs/2511.22098
Authors: Quanjian Song, Yiren Song, Kelly Peng, Yuan Gao, Mike Zheng Shou
Affiliations: Show Lab, National University of Singapore; First Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Video diffusion models have recently achieved remarkable progress in realism and controllability. However, achieving seamless video translation across different perspectives, such as first-person (egocentric) and third-person (exocentric), remains underexplored. Bridging these perspectives is crucial for filmmaking, embodied AI, and world models. Motivated by this, we present WorldWander, an in-context learning framework tailored for translating between egocentric and exocentric worlds in video generation. Building upon advanced video diffusion transformers, WorldWander integrates (i) In-Context Perspective Alignment and (ii) Collaborative Position Encoding to efficiently model cross-view synchronization. To further support our task, we curate EgoExo-8K, a large-scale dataset containing synchronized egocentric-exocentric triplets from both synthetic and real-world scenarios. Experiments demonstrate that WorldWander achieves superior perspective synchronization, character consistency, and generalization, setting a new benchmark for egocentric-exocentric video translation.

[CV-191] DNA: Dual-branch Network with Adaptation for Open-Set Online Handwriting Generation WACV2026

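Quick read: This paper addresses the difficulty online handwriting generation (OHG) has with characters unseen during training, especially in glyph-based languages such as Chinese with structurally complex characters, which limits real-world applicability. The key is a Dual-branch Network with Adaptation (DNA). The style branch learns writing attributes such as stroke direction, spacing, placement, and flow to produce realistic handwriting. The content branch extracts character structure with a local encoder and texture details with a global encoder, allowing it to generalize to unseen characters.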

Link: https://arxiv.org/abs/2511.22064
Authors: Tsai-Ling Huang, Nhat-Tuong Do-Tran, Ngoc-Hoang-Lam Le, Hong-Han Shuai, Ching-Chun Huang
Affiliations: National Yang Ming Chiao Tung University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to WACV 2026

Abstract:Online handwriting generation (OHG) enhances handwriting recognition models by synthesizing diverse, human-like samples. However, existing OHG methods struggle to generate unseen characters, particularly in glyph-based languages like Chinese, limiting their real-world applicability. In this paper, we introduce our method for OHG, where the writer’s style and the characters generated during testing are unseen during training. To tackle this challenge, we propose a Dual-branch Network with Adaptation (DNA), which comprises an adaptive style branch and an adaptive content branch. The style branch learns stroke attributes such as writing direction, spacing, placement, and flow to generate realistic handwriting. Meanwhile, the content branch is designed to generalize effectively to unseen characters by decomposing character content into structural information and texture details, extracted via local and global encoders, respectively. Extensive experiments demonstrate that our DNA model is well-suited for the unseen OHG setting, achieving state-of-the-art performance.

[CV-192] OralGPT-Omni: A Versatile Dental Multimodal Large Language Model

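Quick read: This paper addresses four challenges that Multimodal Large Language Models (MLLMs) face in dentistry: scarce domain-specific data, limited dental expert annotations, insufficient modality-specific modeling, and reliability concerns. The authors propose OralGPT-Omni, an MLLM specialized for dentistry, built on two key ideas. The first is TRACE-CoT, a clinically grounded chain-of-thought (CoT) dataset that explicitly captures dentists' diagnostic reasoning. The second is a four-stage training paradigm that, combined with this reasoning supervision, strengthens the model's understanding and analysis of dental images. The study also releases MMOral-Uni, described as the first unified multimodal benchmark for dental image analysis, with 2,809 open-ended question-answer pairs covering five imaging modalities and five clinical tasks. Experiments show that OralGPT-Omni substantially outperforms GPT-5 on the MMOral-Uni and MMOral-OPG benchmarks.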

Link: https://arxiv.org/abs/2511.22055
Authors: Jing Hao, Yuci Liang, Lizhuo Lin, Yuxuan Fan, Wenkai Zhou, Kaixin Guo, Zanting Ye, Yanpeng Sun, Xinyu Zhang, Yanqi Yang, Qiankun Li, Hao Tang, James Kit-Hon Tsoi, Linlin Shen, Kuo Feng Hung
Affiliations: Faculty of Dentistry, The University of Hong Kong; College of Computer Science and Software Engineering, Shenzhen University; The Hong Kong University of Science and Technology (GZ); School of Biomedical Engineering, Southern Medical University; Singapore University of Technology and Design; University of Auckland; University of Science and Technology of China; School of Computer Science, Peking University; College of Artificial Intelligence, Shenzhen University
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: 47 pages, 42 figures, 13 tables

Abstract:Multimodal Large Language Models (MLLMs) have exhibited immense potential across numerous medical specialties; yet, dentistry remains underexplored, in part due to limited domain-specific data, scarce dental expert annotations, insufficient modality-specific modeling, and challenges in reliability. In this paper, we present OralGPT-Omni, the first dental-specialized MLLM designed for comprehensive and trustworthy analysis across diverse dental imaging modalities and clinical tasks. To explicitly capture dentists’ diagnostic reasoning, we construct TRACE-CoT, a clinically grounded chain-of-thought dataset that mirrors dental radiologists’ decision-making processes. This reasoning supervision, combined with our proposed four-stage training paradigm, substantially strengthens the model’s capacity for dental image understanding and analysis. In parallel, we introduce MMOral-Uni, the first unified multimodal benchmark for dental image analysis. It comprises 2,809 open-ended question-answer pairs spanning five modalities and five tasks, offering a comprehensive evaluation suite for MLLMs in digital dentistry. OralGPT-Omni achieves an overall score of 51.84 on the MMOral-Uni benchmark and 45.31 on the MMOral-OPG benchmark, dramatically outperforming GPT-5. Our work promotes intelligent dentistry and paves the way for future advances in dental image analysis. All code, benchmark, and models will be made publicly available.

[CV-193] TPCNet: Triple physical constraints for Low-light Image Enhancement

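[Quick read]: This paper targets a limitation of Retinex-based deep learning methods for low-light image enhancement: they ignore specular reflection, which restricts generalization. Such methods treat reflecting objects as ideal Lambertian surfaces and build their physical constraints in image space, so they under-model the interaction between illumination and materials in real scenes. The key idea is to bring in Kubelka-Munk theory, retain the specular reflection coefficient, and move the physical constraints from image space into the model's feature space, yielding a triple physical constraints (TPCs) theory that relates illumination, reflection, and detection. The resulting TPCNet improves quantitative metrics and visual quality without adding parameters and outperforms current state-of-the-art methods on 10 datasets.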

Link: https://arxiv.org/abs/2511.22052
Authors: Jing-Yi Shi,Ming-Fei Li,Ling-An Wu
Affiliations: Institute of Physics, Chinese Academy of Sciences (中国科学院物理研究所); School of Physical Sciences, University of Chinese Academy of Sciences (中国科学院大学物理科学学院)
Categories: Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)

Abstract:Low-light image enhancement is an essential computer vision task to improve image contrast and to decrease the effects of color bias and noise. Many existing interpretable deep-learning algorithms exploit the Retinex theory as the basis of model design. However, previous Retinex-based algorithms, which treat reflecting objects as ideal Lambertian surfaces, ignore specular reflection in the modeling process and construct the physical constraints in image space, limiting the generalization of the model. To address this issue, we preserve the specular reflection coefficient and reformulate the original physical constraints in the imaging process based on the Kubelka-Munk theory, thereby constructing a constraint relationship between illumination, reflection, and detection, the so-called triple physical constraints (TPCs) theory. Based on this theory, the physical constraints are constructed in the feature space of the model to obtain the TPC network (TPCNet). Comprehensive quantitative and qualitative benchmark and ablation experiments confirm that these constraints effectively improve the performance metrics and visual quality without introducing new parameters, and demonstrate that our TPCNet outperforms other state-of-the-art methods on 10 datasets.
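
For readers unfamiliar with the Retinex baseline that the abstract contrasts against, the following is a minimal sketch of the classic Lambertian decomposition (image = reflectance × illumination). It is not the TPC formulation: the specular term the paper preserves is omitted, the max-channel illumination estimate is a common heuristic chosen here for illustration, and the function names are hypothetical.

```python
def retinex_decompose(pixels, eps=1e-6):
    """Split RGB pixels (values in [0, 1]) into illumination and reflectance.

    Assumption: illumination is approximated by the per-pixel maximum channel.
    This is the plain Lambertian Retinex model, with no specular component.
    """
    illumination = [max(p) for p in pixels]
    reflectance = [
        tuple(c / (l + eps) for c in p) for p, l in zip(pixels, illumination)
    ]
    return illumination, reflectance


def retinex_recompose(illumination, reflectance, eps=1e-6):
    """Invert retinex_decompose: pixel = reflectance * (illumination + eps)."""
    return [
        tuple(c * (l + eps) for c in r) for r, l in zip(reflectance, illumination)
    ]


def brighten(illumination, gamma=0.5):
    """Gamma-correct illumination values in [0, 1]; gamma < 1 brightens."""
    return [l ** gamma for l in illumination]


if __name__ == "__main__":
    dark = [(0.04, 0.02, 0.01), (0.09, 0.09, 0.03)]
    illum, refl = retinex_decompose(dark)
    # Enhance only the illumination, then rebuild the image.
    print(retinex_recompose(brighten(illum), refl))
```

Enhancing only the illumination while leaving reflectance untouched is what makes the decomposition useful; a surface with strong highlights violates the model, which is the gap the paper sets out to close.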

[CV-194] ICM-SR: Image-Conditioned Manifold Regularization for Image Super-Resolution

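[Quick read]: This paper addresses problems that arise when real-world image super-resolution (Real-ISR) methods rely on the generative prior of text-conditioned diffusion models: the prior is misaligned with the task, and it produces color distortions and blurred edges. The key contribution is image-conditioned manifold regularization (ICM), which builds a regularizing manifold better suited to Real-ISR by conditioning on sparse but essential structural information, namely a colormap and Canny edges. This keeps the regularization numerically stable while markedly improving the perceptual quality of the reconstructed images.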

Link: https://arxiv.org/abs/2511.22048
Authors: Junoh Kang,Donghun Ryu,Bohyung Han
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Real world image super-resolution (Real-ISR) often leverages the powerful generative priors of text-to-image diffusion models by regularizing the output to lie on their learned manifold. However, existing methods often overlook the importance of the regularizing manifold, typically defaulting to a text-conditioned manifold. This approach suffers from two key limitations. Conceptually, it is misaligned with the Real-ISR task, which is to generate high quality (HQ) images directly tied to the low quality (LQ) images. Practically, the teacher model often reconstructs images with color distortions and blurred edges, indicating a flawed generative prior for this task. To correct these flaws and ensure conceptual alignment, a more suitable manifold must incorporate information from the images. While the most straightforward approach is to condition directly on the raw input images, their high information densities make the regularization process numerically unstable. To resolve this, we propose image-conditioned manifold regularization (ICM), a method that regularizes the output towards a manifold conditioned on the sparse yet essential structural information: a combination of colormap and Canny edges. ICM provides a task-aligned and stable regularization signal, thereby avoiding the instability of dense-conditioning and enhancing the final super-resolution quality. Our experiments confirm that the proposed regularization significantly enhances super-resolution performance, particularly in perceptual quality, demonstrating its effectiveness for real-world applications. We will release the source code of our work for reproducibility.
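
To make the idea of a sparse conditioning signal concrete, here is a rough sketch that derives a coarse colormap and a binary edge map from a small grayscale grid. Two assumptions are worth stating loudly: the block-mean colormap is only one plausible reading of "colormap", and the thresholded gradient below is a simple stand-in for Canny edge detection, not Canny itself. All function names are hypothetical.

```python
def block_mean_colormap(img, block):
    """Replace every block x block tile with its mean value (coarse colormap)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for by in range(0, h, block):
        for bx in range(0, w, block):
            ys = range(by, min(by + block, h))
            xs = range(bx, min(bx + block, w))
            mean = sum(img[y][x] for y in ys for x in xs) / (len(ys) * len(xs))
            for y in ys:
                for x in xs:
                    out[y][x] = mean
    return out


def gradient_edges(img, threshold):
    """Binary edge map from central differences (a stand-in for Canny)."""
    h, w = len(img), len(img[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Clamp indices at the border so edge pixels reuse their neighbor.
            gx = img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]
            gy = img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]
            if (gx * gx + gy * gy) ** 0.5 > threshold:
                edges[y][x] = 1
    return edges


if __name__ == "__main__":
    image = [[0, 0, 1, 1] for _ in range(4)]
    print(block_mean_colormap(image, block=2))
    print(gradient_edges(image, threshold=0.5))
```

Both maps discard fine texture, which matches the abstract's argument that conditioning on the raw, information-dense image makes regularization unstable.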

[CV-195] SparseWorld-TC: Trajectory-Conditioned Sparse Occupancy World Model

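[Quick read]: This paper addresses two limitations in 3D scene occupancy forecasting: the restricted representational capacity of methods that use variational autoencoders (VAEs) to generate discrete occupancy tokens, and the geometric priors imposed by structured bird's-eye-view (BEV) representations. The solution is an end-to-end trajectory-conditioned forecasting architecture. It uses a sparse occupancy representation to bypass the intermediate BEV projection and its explicit geometric constraints, and an attention-based transformer to predict multi-frame future occupancy directly from raw image features, which captures spatiotemporal dependencies more effectively. By avoiding both the capacity bottleneck of discrete tokenization and the structural limits of BEV, the method achieves state-of-the-art 1-3 second occupancy forecasting on the nuScenes benchmark.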

Link: https://arxiv.org/abs/2511.22039
Authors: Jiayuan Du,Yiming Zhao,Zhenglong Guo,Yong Pan,Wenbo Hou,Zhihui Hao,Kun Zhan,Qijun Chen
Affiliations: Tongji University (同济大学); Li Auto Inc (理想汽车)
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:This paper introduces a novel architecture for trajectory-conditioned forecasting of future 3D scene occupancy. In contrast to methods that rely on variational autoencoders (VAEs) to generate discrete occupancy tokens, which inherently limit representational capacity, our approach predicts multi-frame future occupancy in an end-to-end manner directly from raw image features. Inspired by the success of attention-based transformer architectures in foundational vision and language models such as GPT and VGGT, we employ a sparse occupancy representation that bypasses the intermediate bird’s eye view (BEV) projection and its explicit geometric priors. This design allows the transformer to capture spatiotemporal dependencies more effectively. By avoiding both the finite-capacity constraint of discrete tokenization and the structural limitations of BEV representations, our method achieves state-of-the-art performance on the nuScenes benchmark for 1-3 second occupancy forecasting, outperforming existing approaches by a significant margin. Furthermore, it demonstrates robust scene dynamics understanding, consistently delivering high accuracy under arbitrary future trajectory conditioning.

[CV-196] PAGen: Phase-guided Amplitude Generation for Domain-adaptive Object Detection

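[Quick read]: This paper addresses the excessive complexity of existing unsupervised domain adaptation (UDA) methods, which depend on difficult adversarial training or elaborate architectures such as auxiliary models for feature distillation and pseudo-label generation. The key idea is to learn an adaptive image-style transformation in the frequency domain: a lightweight pre-processing module reduces the distribution gap between source and target domains, is used only during training, and is removed entirely at inference, so it adds no computational overhead. The method yields substantial gains on domain-adaptive object detection (DAOD), particularly when source annotations are easy to obtain (e.g., normal weather or synthetic data) and target annotations are hard to obtain (e.g., adverse weather or low-light scenes).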

Link: https://arxiv.org/abs/2511.22029
Authors: Shuchen Du,Shuo Lei,Feiran Li,Jiacheng Li,Daisuke Iso
Affiliations: Sony Research (索尼研究)
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Unsupervised domain adaptation (UDA) greatly facilitates the deployment of neural networks across diverse environments. However, most state-of-the-art approaches are overly complex, relying on challenging adversarial training strategies, or on elaborate architectural designs with auxiliary models for feature distillation and pseudo-label generation. In this work, we present a simple yet effective UDA method that learns to adapt image styles in the frequency domain to reduce the discrepancy between source and target domains. The proposed approach introduces only a lightweight pre-processing module during training and entirely discards it at inference time, thus incurring no additional computational overhead. We validate our method on domain-adaptive object detection (DAOD) tasks, where ground-truth annotations are easily accessible in source domains (e.g., normal-weather or synthetic conditions) but challenging to obtain in target domains (e.g., adverse weather or low-light scenes). Extensive experiments demonstrate that our method achieves substantial performance gains on multiple benchmarks, highlighting its practicality and effectiveness.
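
The frequency-domain view of style can be illustrated with a small sketch: a signal's Fourier phase mostly carries structure, while its amplitude carries style-like statistics. The code below keeps the phase of a "content" signal and takes the amplitude from a "style" signal. This fixed swap is the classic Fourier-based trick, used here as an assumption for illustration; the paper's module learns to generate the amplitude rather than copying it. A 1D naive DFT is used so the example needs only the standard library.

```python
import cmath


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse DFT; returns the real part, assuming a (near) real signal."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def swap_amplitude(content, style):
    """Combine the phase of `content` with the amplitude of `style`."""
    content_spec, style_spec = dft(content), dft(style)
    mixed = [
        cmath.rect(abs(s), cmath.phase(c)) for c, s in zip(content_spec, style_spec)
    ]
    return idft(mixed)


if __name__ == "__main__":
    source = [1.0, 2.0, 3.0, 4.0]
    target_style = [2.0, 4.0, 6.0, 8.0]
    print(swap_amplitude(source, target_style))
```

Because the transformation touches only the input image, it can be dropped at inference without changing the detector, which is consistent with the zero-overhead claim in the abstract.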

[CV-197] Layover or Direct Flight: Rethinking Audio-Guided Image Segmentation

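[Quick read]: This paper addresses the inefficiency and fragility of audio-visual grounding methods that depend on text as an intermediate representation. Such pipelines transcribe speech to text, extract keywords, and localize objects with pretrained image-text models, and they become unstable under linguistic variability such as differing accents. The proposed alternative grounds audio directly in the visual scene with no transcription step. To support it, the authors build a new audio grounding dataset covering diverse objects and human accents, then adapt and benchmark several models from closely related audio-visual fields. The results show that direct audio grounding is feasible and, in some cases, more robust than text-mediated methods, especially under linguistic variability.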

Link: https://arxiv.org/abs/2511.22025
Authors: Joel Alberto Santos,Zongwei Wu,Xavier Alameda-Pineda,Radu Timofte
Affiliations: Computer Vision Lab, CAIDAS & IFI, University of Würzburg, Germany (德国维尔茨堡大学计算机视觉实验室,CAIDAS与IFI); Inria at Univ. Grenoble Alpes, CNRS, LJK, France (法国格勒诺布尔阿尔卑斯大学Inria,法国国家科学研究中心,LJK)
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Understanding human instructions is essential for enabling smooth human-robot interaction. In this work, we focus on object grounding, i.e., localizing an object of interest in a visual scene (e.g., an image) based on verbal human instructions. Despite recent progress, a dominant research trend relies on using text as an intermediate representation. These approaches typically transcribe speech to text, extract relevant object keywords, and perform grounding using models pretrained on large text-vision datasets. However, we question both the efficiency and robustness of such transcription-based pipelines. Specifically, we ask: Can we achieve direct audio-visual alignment without relying on text? To explore this possibility, we simplify the task by focusing on grounding from single-word spoken instructions. We introduce a new audio-based grounding dataset that covers a wide variety of objects and diverse human accents. We then adapt and benchmark several models from the closely related audio-visual field. Our results demonstrate that direct grounding from audio is not only feasible but, in some cases, even outperforms transcription-based methods, especially in terms of robustness to linguistic variability. Our findings encourage a renewed interest in direct audio grounding and pave the way for more robust and efficient multimodal understanding systems.

[CV-198] Intra-Class Probabilistic Embeddings for Uncertainty Estimation in Vision-Language Models

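[Quick read]: This paper addresses the unreliability of vision-language models (VLMs) in safety-critical applications, where incorrect predictions can still receive high confidence scores. The key contribution is a training-free, post-hoc uncertainty estimation method that measures the consistency of visual features within each class, combining feature projection with multivariate Gaussians to build class-specific probabilistic embeddings that flag erroneous predictions. The method is robust to distribution shift and detects errors effectively with as few as 10 training images per class.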

Link: https://arxiv.org/abs/2511.22019
Authors: Zhenxiang Lin,Maryam Haghighat,Will Browne,Dimity Miller
Affiliations: Queensland University of Technology (昆士兰理工大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Vision-language models (VLMs), such as CLIP, have gained popularity for their strong open vocabulary classification performance, but they are prone to assigning high confidence scores to misclassifications, limiting their reliability in safety-critical applications. We introduce a training-free, post-hoc uncertainty estimation method for contrastive VLMs that can be used to detect erroneous predictions. The key to our approach is to measure visual feature consistency within a class, using feature projection combined with multivariate Gaussians to create class-specific probabilistic embeddings. Our method is VLM-agnostic, requires no fine-tuning, demonstrates robustness to distribution shift, and works effectively with as few as 10 training images per class. Extensive experiments on ImageNet, Flowers102, Food101, EuroSAT and DTD show state-of-the-art error detection performance, significantly outperforming both deterministic and probabilistic VLM baselines. Code is available at this https URL.
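
A class-specific Gaussian over features can be sketched in a few lines. The version below fits a Gaussian per class from a handful of feature vectors and scores a new feature by its squared Mahalanobis distance, so that features far from the class distribution receive a higher uncertainty score. It assumes a diagonal covariance to stay dependency-free and omits the feature projection step; both simplifications are mine, not the paper's, and the function names are hypothetical.

```python
def fit_diagonal_gaussian(features, eps=1e-6):
    """Estimate per-dimension mean and variance from a list of feature vectors.

    Assumption: dimensions are treated as independent (diagonal covariance).
    `eps` keeps variances strictly positive for tiny sample sizes.
    """
    n, d = len(features), len(features[0])
    mean = [sum(f[j] for f in features) / n for j in range(d)]
    var = [
        sum((f[j] - mean[j]) ** 2 for f in features) / n + eps for j in range(d)
    ]
    return mean, var


def mahalanobis_sq(x, mean, var):
    """Squared Mahalanobis distance of x under a diagonal Gaussian."""
    return sum((xj - mj) ** 2 / vj for xj, mj, vj in zip(x, mean, var))


if __name__ == "__main__":
    class_features = [[0.0, 0.0], [2.0, 2.0]]
    mean, var = fit_diagonal_gaussian(class_features)
    print(mahalanobis_sq([1.0, 1.0], mean, var))  # typical feature, low score
    print(mahalanobis_sq([5.0, 1.0], mean, var))  # atypical feature, high score
```

Since the fit uses only stored features and no gradient updates, this kind of scoring stays training-free and post-hoc, in line with the abstract.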

[CV-199] MedEyes: Learning Dynamic Visual Focus for Medical Progressive Diagnosis AAAI2026

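[Quick read]: This paper addresses a failure mode of vision-language models in medical diagnosis: purely on-policy reinforcement learning tends to reinforce reasoning paths that are superficially coherent but clinically inaccurate. The proposed MedEyes framework introduces off-policy expert guidance, converting expert visual search trajectories into structured external behavioral signals that steer the model toward clinically aligned visual reasoning. It includes the Gaze-guided Reasoning Navigator (GRN), a dual-mode exploration strategy that combines scanning for systematic abnormality localization with drilling for detailed regional analysis, and the Confidence Value Sampler (CVS), which balances expert imitation against autonomous discovery. A dual-stream GRPO optimization framework separates on-policy and off-policy learning signals, mitigating reward assimilation and entropy collapse. MedEyes improves performance by an average of 8.5% across multiple medical visual question answering (VQA) benchmarks.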

Link: https://arxiv.org/abs/2511.22018
Authors: Chunzheng Zhu,Yangfang Lin,Shen Chen,Yijun Wang,Jianxin Lin
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: This paper has been accepted by AAAI 2026

Abstract:Accurate medical diagnosis often involves progressive visual focusing and iterative reasoning, characteristics commonly observed in clinical workflows. While recent vision-language models demonstrate promising chain-of-thought (CoT) reasoning capabilities via reinforcement learning with verifiable rewards (RLVR), their purely on-policy learning paradigm tends to reinforce superficially coherent but clinically inaccurate reasoning paths. We propose MedEyes, a novel reinforcement learning framework that dynamically models clinician-style diagnostic reasoning by progressively attending to and interpreting relevant medical image regions. By incorporating off-policy expert guidance, MedEyes converts expert visual search trajectories into structured external behavioral signals, guiding the model toward clinically aligned visual reasoning. We design the Gaze-guided Reasoning Navigator (GRN) to emulate the diagnostic process through a dual-mode exploration strategy, scanning for systematic abnormality localization and drilling for detailed regional analysis. To balance expert imitation and autonomous discovery, we introduce the Confidence Value Sampler (CVS), which employs nucleus sampling and adaptive termination to create diverse yet credible exploration paths. Finally, the dual-stream GRPO optimization framework decouples on-policy and off-policy learning signals, mitigating reward assimilation and entropy collapse. Experiments demonstrate that MedEyes achieves an average performance improvement of +8.5% across multiple medical VQA benchmarks, validating MedEyes’s potential in building interpretable medical AI systems.
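
The abstract mentions that CVS employs nucleus sampling. As background, here is a minimal sketch of generic nucleus (top-p) sampling over a discrete distribution. It illustrates the standard technique only; how CVS applies it to exploration paths, and its adaptive termination rule, are not specified in the abstract and are not modeled here. Function names are hypothetical.

```python
import random


def nucleus_filter(probs, top_p):
    """Keep the smallest set of highest-probability items with mass >= top_p.

    Returns a dict mapping kept indices to renormalized probabilities.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}


def nucleus_sample(probs, top_p, rng=random):
    """Draw one index from the renormalized nucleus."""
    filtered = nucleus_filter(probs, top_p)
    indices = list(filtered)
    weights = [filtered[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]


if __name__ == "__main__":
    candidate_probs = [0.5, 0.3, 0.15, 0.05]
    print(nucleus_filter(candidate_probs, top_p=0.7))
    print(nucleus_sample(candidate_probs, top_p=0.7, rng=random.Random(0)))
```

Truncating the low-probability tail before sampling is what lets the sampled paths stay diverse while excluding implausible options.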

[CV-200] StreamFlow: Theory, Algorithm, and Implementation for High-Efficiency Rectified Flow Generation

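[Quick read]: This paper addresses the fact that existing acceleration methods cannot be applied directly to Rectified Flow models, whose theory and design differ from those of conventional diffusion models. The solution is an end-to-end acceleration pipeline spanning theory, design, and inference strategy. Its core components are batch processing with a new velocity field, vectorized batching of heterogeneous time steps, and dynamic TensorRT compilation tailored to these methods. The authors report a speedup of up to 611% for 512×512 image generation, compared with the roughly 18% typically achieved by existing public methods.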

Link: https://arxiv.org/abs/2511.22009
Authors: Sen Fang,Hongbin Zhong,Yalin Feng,Dimitris N. Metaxas
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page at this https URL

Abstract:New techniques such as Rectified Flow and Flow Matching have significantly improved generative models over the past two years, especially in control accuracy, generation quality, and generation efficiency. However, because Rectified Flow differs from existing diffusion models in theory and design, existing acceleration methods cannot be applied to it directly. In this article, we implement a complete acceleration pipeline covering theory, design, and inference strategies. The pipeline uses new methods such as batch processing with a new velocity field, vectorization of heterogeneous time-step batch processing, and dynamic TensorRT compilation for these methods to comprehensively accelerate flow-based models. Existing public methods usually achieve an acceleration of 18%, while our experiments show that the new method accelerates 512*512 image generation by up to 611%, far beyond current non-generalized acceleration methods.
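
As background on why Rectified Flow is attractive for fast generation, the sketch below integrates a velocity field with plain Euler steps. For a perfectly straight flow the velocity along each trajectory is constant, so a single Euler step already lands on the target. The toy velocity field is an assumption made for illustration (a trained model would supply it), and nothing here reproduces the paper's batching or TensorRT pipeline.

```python
def euler_sample(x0, velocity, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_straight_velocity(target):
    """Toy field v(x, t) = (target - x) / (1 - t).

    Trajectories under this field are straight lines that reach `target`
    at t = 1. It is only evaluated for t < 1 by euler_sample.
    """
    def velocity(x, t):
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


if __name__ == "__main__":
    field = make_straight_velocity([1.0, 2.0])
    print(euler_sample([0.0, 0.0], field, num_steps=1))
    print(euler_sample([0.0, 0.0], field, num_steps=4))
```

With curved trajectories the one-step result would drift from the many-step result, which is why straightness and step count are central to flow-model acceleration.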

[CV-201] Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance? NEURIPS2025

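[Quick read]: This paper addresses the inability of multimodal large language models (MLLMs) to provide live, interactive step-by-step guidance, in particular detecting user mistakes during a task and giving timely feedback. Current MLLMs are mostly turn-based and cannot respond asynchronously to a video stream, which limits their use as future AI assistants. The key contribution is a new benchmark and dataset, Qualcomm Interactive Cooking, built on CaptainCook4D. It contains user mistakes and their corrections during task execution, with densely timestamped instructions and feedback messages, including mistake alerts aligned to the moment the mistake becomes visible. The authors also propose LiveMamba, a streaming multimodal LLM for interactive instructional guidance, providing the first dedicated benchmark and a strong baseline for live, situated coaching.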

Link: https://arxiv.org/abs/2511.21998
Authors: Apratim Bhattacharyya,Bicheng Xu,Sanjay Haresh,Reza Pourreza,Litian Liu,Sunny Panchal,Pulkit Madan,Leonid Sigal,Roland Memisevic
Affiliations: Qualcomm AI Research (高通人工智能研究); University of British Columbia (不列颠哥伦比亚大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to NeurIPS 2025 (Project page: this https URL)

Abstract:Multi-modal Large Language Models (LLMs) have advanced conversational abilities but struggle with providing live, interactive step-by-step guidance, a key capability for future AI assistants. Effective guidance requires not only delivering instructions but also detecting their successful execution, as well as identifying and alerting users to mistakes, all of which must happen in real time. This requires models that are not turn-based, but that can react asynchronously to a video stream, as well as video data showing users performing tasks including mistakes and their corrections. To this end, we introduce Qualcomm Interactive Cooking, a new benchmark and dataset built upon CaptainCook4D, which contains user mistakes during task execution. Our dataset and benchmark feature densely annotated, timed instructions and feedback messages, specifically including mistake alerts precisely timestamped to their visual occurrence in the video. We evaluate state-of-the-art multi-modal LLMs on the Qualcomm Interactive Cooking benchmark and introduce LiveMamba, a streaming multi-modal LLM designed for interactive instructional guidance. This work provides the first dedicated benchmark and a strong baseline for developing and evaluating live, situated coaching.

[CV-202] PPBoost: Progressive Prompt Boosting for Text-Driven Medical Image Segmentation

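[Quick read]: This paper addresses two complementary weaknesses in prompt-based medical image segmentation: text-prompted foundation models lack spatial precision and degrade under domain shift, while visual-prompted models depend on precise bounding-box (bbox) annotations that are costly and hard to obtain. The proposed PPBoost (Progressive Prompt-Boosting) framework progressively converts weak text signals into strong, spatially grounded visual prompts. It first uses a vision-language model to generate initial pseudo-bboxes and filters them with an uncertainty-aware criterion; it then trains a detector on the retained pseudo-labels to produce high-quality bboxes; and at inference it further refines the bboxes so that they cover the target structures. The resulting prompts let existing segmentation models improve markedly under zero-shot conditions.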

Link: https://arxiv.org/abs/2511.21984
Authors: Xuchen Li,Hengrui Gu,Mohan Zhang,Qin Liu,Zhen Tan,Xinyuan Zhu,Huixue Zhou,Tianlong Chen,Kaixiong Zhou
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Text-prompted foundation models for medical image segmentation offer an intuitive way to delineate anatomical structures from natural language queries, but their predictions often lack spatial precision and degrade under domain shift. In contrast, visual-prompted models achieve strong segmentation performance across diverse modalities by leveraging spatial cues of precise bounding-box (bbox) prompts to guide the segmentation of target lesions. However, it is costly and challenging to obtain the precise visual prompts in clinical practice. We propose PPBoost (Progressive Prompt-Boosting), a framework that bridges these limitations by transforming weak text-derived signals into strong, spatially grounded visual prompts, operating under a strict zero-shot regime with no image- or pixel-level segmentation labels. PPBoost first uses a vision-language model to produce initial pseudo-bboxes conditioned on the textual object descriptions and applies an uncertainty-aware criterion to filter unreliable predictions. The retained image-bbox pairs are then leveraged to train a pseudo-labeled detector, producing the high-quality bboxes for the query images. During inference, PPBoost further refines the generated bboxes by appropriately expanding them to tightly cover the target anatomical structures. The enhanced, spatially grounded bbox prompts guide existing segmentation models to generate final dense masks, effectively amplifying weak text cues into strong spatial guidance. Across three datasets spanning diverse modalities and anatomies, PPBoost consistently improves Dice and Normalized Surface Distance over text- and visual-prompted baselines and, notably, surpasses few-shot segmentation models without using labeled data. PPBoost can generalize to multiple typical visual segmentation model backbones.
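
Two pieces of the pipeline lend themselves to a short sketch: expanding a predicted bbox at inference and scoring a mask with the Dice coefficient. The expansion rule below (a fixed ratio of the box size, clipped to the image) is a hypothetical choice, since the abstract says only that boxes are "appropriately" expanded; the Dice formula is the standard one.

```python
def expand_bbox(bbox, ratio, width, height):
    """Grow an (x0, y0, x1, y1) box by `ratio` of its size, clipped to the image.

    Assumption: symmetric expansion by a fixed ratio; the paper's actual
    refinement rule is not given in the abstract.
    """
    x0, y0, x1, y1 = bbox
    dx = (x1 - x0) * ratio / 2.0
    dy = (y1 - y0) * ratio / 2.0
    return (
        max(0.0, x0 - dx),
        max(0.0, y0 - dy),
        min(float(width), x1 + dx),
        min(float(height), y1 + dy),
    )


def dice(pred, target):
    """Dice coefficient 2|A∩B| / (|A| + |B|) for binary masks (nested 0/1 lists)."""
    inter = sum(p & t for rp, rt in zip(pred, target) for p, t in zip(rp, rt))
    total = sum(sum(r) for r in pred) + sum(sum(r) for r in target)
    return 1.0 if total == 0 else 2.0 * inter / total


if __name__ == "__main__":
    print(expand_bbox((10, 10, 30, 20), ratio=0.2, width=100, height=100))
    print(dice([[1, 1], [0, 0]], [[1, 0], [0, 0]]))
```

A slightly loose box tends to be a safer prompt than a box that cuts off part of the structure, which is one plausible motivation for expanding before segmentation.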

[CV-203] DialBench: Towards Accurate Reading Recognition of Pointer Meter using Large Foundation Models

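[Quick read]: This paper addresses accurate reading recognition for pointer meters, where the main difficulties are reflections, occlusions, dynamic viewing angles, and the subtle visual differences between thin pointers and scale markings. The authors introduce RPM-10K, a large-scale benchmark of 10,730 real-world meter images that covers these challenges. On top of it they design MRLM, a vision-language model based on physical relation injection. Its key innovation is to encode the geometric and causal relationships between the pointer and the scale explicitly, coupling perception with physical reasoning. Using cross-attentional fusion and adaptive expert selection, MRLM interprets dial configurations and generates numeric readings, and experiments confirm the effectiveness of the framework.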

Link: https://arxiv.org/abs/2511.21982
Authors: Futian Wang,Chaoliu Weng,Xiao Wang,Zhen Chen,Zhicheng Zhao,Jin Tang
Affiliations: Anhui University (安徽大学); La Trobe University (拉特罗布大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:The precise reading recognition of pointer meters plays a key role in smart power systems, but existing approaches remain fragile due to challenges like reflections, occlusions, dynamic viewing angles, and overlap between thin pointers and scale markings. To date, this area still lacks large-scale datasets to support the development of robust algorithms. To address these challenges, this paper first presents a new large-scale benchmark dataset for dial reading, termed RPM-10K, which contains 10730 meter images that fully reflect the aforementioned key challenges. Built upon the dataset, we propose a novel vision-language model for pointer meter reading recognition, termed MRLM, based on physical relation injection. Instead of exhaustively learning image-level correlations, MRLM explicitly encodes the geometric and causal relationships between the pointer and the scale, aligning perception with physical reasoning in the spirit of world-model perspectives. Through cross-attentional fusion and adaptive expert selection, the model learns to interpret dial configurations and generate precise numeric readings. Extensive experiments validate the effectiveness of our proposed framework on the newly proposed benchmark dataset. Both the dataset and source code will be released on this https URL
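
The geometric relationship between pointer and scale that the abstract refers to can be made concrete for the simplest case, a circular dial with a linear scale. The sketch below converts a pointer direction into an angle and interpolates between the scale's end values. The angle convention (degrees clockwise from 12 o'clock, image y axis pointing down) and the linear-scale assumption are mine; MRLM learns this mapping rather than hard-coding it.

```python
import math


def pointer_angle(center, tip):
    """Angle of the pointer in degrees, clockwise from 12 o'clock.

    Assumes image coordinates, where y grows downward.
    """
    dx = tip[0] - center[0]
    dy = center[1] - tip[1]  # flip so that "up" is positive
    return math.degrees(math.atan2(dx, dy)) % 360.0


def reading_from_angle(angle, start_angle, end_angle, min_value, max_value):
    """Linearly interpolate a reading along a clockwise scale.

    Assumes start_angle != end_angle (mod 360) and a linear scale.
    """
    sweep = (end_angle - start_angle) % 360.0
    offset = (angle - start_angle) % 360.0
    return min_value + (offset / sweep) * (max_value - min_value)


if __name__ == "__main__":
    angle = pointer_angle(center=(100, 100), tip=(100, 50))  # pointing straight up
    # Hypothetical gauge: scale runs clockwise from 225 deg to 135 deg, values 0 to 10.
    print(reading_from_angle(angle, 225.0, 135.0, 0.0, 10.0))
```

Reflections, occlusions, and oblique viewpoints break the assumptions behind this closed-form mapping, which is why the benchmark emphasizes those conditions.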

[CV-204] PAT3D: Physics-Augmented Text-to-3D Scene Generation

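[Quick read]: This paper addresses shortcomings of existing text-to-3D scene generation methods in physical plausibility, intersection-free geometry, and semantic consistency of scene structure, and in particular the lack of high-quality 3D scenes that can be used directly for physics simulation or downstream tasks such as robotic manipulation. The proposed PAT3D framework is the first to combine vision-language models with physics-based rigid-body simulation. It generates 3D objects from the text prompt, infers their spatial relations, and builds a hierarchical scene tree; it converts the tree into initial conditions for simulation; and it uses a differentiable rigid-body simulator to bring the objects to static equilibrium under gravity without interpenetration. A simulation-in-the-loop optimization step then improves semantic consistency while preserving physical stability and non-intersection. The generated scenes are visually realistic, semantically accurate, and physically usable for simulation and interaction tasks.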

Link: https://arxiv.org/abs/2511.21978
Authors: Guying Lin,Kemeng Huang,Michael Liu,Ruihan Gao,Hanke Chen,Lyuhao Chen,Beijia Lu,Taku Komura,Yuan Liu,Jun-Yan Zhu,Minchen Li
Affiliations: Carnegie Mellon University (卡内基梅隆大学); The University of Hong Kong (香港大学); The Hong Kong University of Science and Technology (香港科技大学); Genesis AI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 12 figures

Abstract:We introduce PAT3D, the first physics-augmented text-to-3D scene generation framework that integrates vision-language models with physics-based simulation to produce physically plausible, simulation-ready, and intersection-free 3D scenes. Given a text prompt, PAT3D generates 3D objects, infers their spatial relations, and organizes them into a hierarchical scene tree, which is then converted into initial conditions for simulation. A differentiable rigid-body simulator ensures realistic object interactions under gravity, driving the scene toward static equilibrium without interpenetrations. To further enhance scene quality, we introduce a simulation-in-the-loop optimization procedure that guarantees physical stability and non-intersection, while improving semantic consistency with the input prompt. Experiments demonstrate that PAT3D substantially outperforms prior approaches in physical plausibility, semantic consistency, and visual quality. Beyond high-quality generation, PAT3D uniquely enables simulation-ready 3D scenes for downstream tasks such as scene editing and robotic manipulation. Code and data will be released upon acceptance.
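
One way to picture the step from a hierarchical scene tree to simulation initial conditions is a tree of "rests on" relations in which each child starts slightly above its parent's top surface, so a simulator can let it settle under gravity without starting in penetration. The data structure, the small initial gap, and the vertical-stacking-only layout below are all hypothetical simplifications, not the paper's representation.

```python
from dataclasses import dataclass, field


@dataclass
class SceneNode:
    """An object in the scene; children are objects resting on top of it."""
    name: str
    height: float
    children: list = field(default_factory=list)


def initial_base_heights(root, base=0.0, gap=0.01):
    """Map each object name to the z coordinate of its bottom face.

    Children are placed `gap` above the parent's top so that the simulation
    starts intersection-free and objects drop into contact.
    """
    placements = {root.name: base}
    top = base + root.height
    for child in root.children:
        placements.update(initial_base_heights(child, top + gap, gap))
    return placements


if __name__ == "__main__":
    cup = SceneNode("cup", 0.10)
    book = SceneNode("book", 0.05, [cup])
    table = SceneNode("table", 0.75, [book])
    print(initial_base_heights(table))
```

Encoding support relations in the tree means that the simulator, rather than the generator, is responsible for resolving exact contact positions.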

[CV-205] DeepGI: Explainable Deep Learning for Gastrointestinal Image Classification

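[Quick read]: This paper addresses automated disease classification in gastrointestinal endoscopic images, focusing on challenges common in clinical practice such as uneven lighting, changing camera angles, and imaging artifacts. The approach builds a diverse dataset of 4,000 endoscopic images, applies deep learning models such as VGG16 and MobileNetV2 (each reaching 96.5% test accuracy), and adds Grad-CAM visualization to improve interpretability, which in turn supports clinicians' trust in the model's decisions and its clinical value.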

Link: https://arxiv.org/abs/2511.21959
Authors: Walid Houmaidi,Mohamed Hadadi,Youssef Sabiri,Yousra Chtouki
Affiliations: Al Akhawayn University (阿尔艾哈瓦因大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 7 pages, 4 figures, 2 tables. Accepted at DASET 2026

Abstract:This paper presents a comprehensive comparative model analysis on a novel gastrointestinal medical imaging dataset, comprised of 4,000 endoscopic images spanning four critical disease classes: Diverticulosis, Neoplasm, Peritonitis, and Ureters. Leveraging state-of-the-art deep learning techniques, the study confronts common endoscopic challenges such as variable lighting, fluctuating camera angles, and frequent imaging artifacts. The best performing models, VGG16 and MobileNetV2, each achieved a test accuracy of 96.5%, while Xception reached 94.24%, establishing robust benchmarks and baselines for automated disease classification. In addition to strong classification performance, the approach includes explainable AI via Grad-CAM visualization, enabling identification of image regions most influential to model predictions and enhancing clinical interpretability. Experimental results demonstrate the potential for robust, accurate, and interpretable medical image analysis even in complex real-world conditions. This work contributes original benchmarks, comparative insights, and visual explanations, advancing the landscape of gastrointestinal computer-aided diagnosis and underscoring the importance of diverse, clinically relevant datasets and model explainability in medical AI research.
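
The Grad-CAM step can be summarized in code once the activations of a convolutional layer and the gradients of the class score with respect to them are available. The sketch below takes both as plain nested lists (obtaining them from a real network is framework-specific and left out), averages each channel's gradient to get a weight, sums the weighted activation maps, and applies ReLU. The final normalization to [0, 1] is a common convention for visualization, added here as an assumption.

```python
def grad_cam(activations, gradients):
    """Compute a Grad-CAM heatmap.

    activations, gradients: lists of K channel maps, each an H x W nested list.
    Channel weight = spatial mean of that channel's gradient.
    """
    weights = []
    for grad in gradients:
        values = [g for row in grad for g in row]
        weights.append(sum(values) / len(values))

    h, w = len(activations[0]), len(activations[0][0])
    cam = [[0.0] * w for _ in range(h)]
    for weight, act in zip(weights, activations):
        for y in range(h):
            for x in range(w):
                cam[y][x] += weight * act[y][x]

    # ReLU keeps only regions that push the class score up.
    cam = [[max(0.0, v) for v in row] for row in cam]
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam


if __name__ == "__main__":
    acts = [[[1, 0], [0, 0]], [[0, 2], [0, 0]]]
    grads = [[[1, 1], [1, 1]], [[-1, -1], [-1, -1]]]
    print(grad_cam(acts, grads))
```

In the toy input, the second channel has a negative weight, so the location it activates is suppressed and only the first channel's location remains highlighted.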

[CV-206] WalkCLIP: Multimodal Learning for Urban Walkability Prediction

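[Quick read]: This paper addresses the cost and poor scalability of traditional urban walkability assessments, which rely on surveys and field audits, and the limitation that any single data source (satellite imagery, street view imagery, or population dynamics) captures only one aspect of the walking environment. The proposed WalkCLIP framework fuses three complementary views: visual-semantic representations learned from GPT-4o generated image captions, a spatial aggregation module that incorporates neighborhood context, and behavioral features from a population dynamics foundation model. Together these yield more accurate and more spatially consistent walkability predictions.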

Link: https://arxiv.org/abs/2511.21947
Authors: Shilong Xiang,JangHyeon Lee,Min Namgung,Yao-Yi Chiang
Affiliations: University of Minnesota Twin Cities (明尼苏达大学双城分校)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Urban walkability is a cornerstone of public health, sustainability, and quality of life. Traditional walkability assessments rely on surveys and field audits, which are costly and difficult to scale. Recent studies have used satellite imagery, street view imagery, or population indicators to estimate walkability, but these single-source approaches capture only one dimension of the walking environment. Satellite data describe the built environment from above, but overlook the pedestrian perspective. Street view imagery captures conditions at the ground level, but lacks broader spatial context. Population dynamics reveal patterns of human activity but not the visual form of the environment. We introduce WalkCLIP, a multimodal framework that integrates these complementary viewpoints to predict urban walkability. WalkCLIP learns walkability-aware vision-language representations from GPT-4o generated image captions, refines these representations with a spatial aggregation module that incorporates neighborhood context, and fuses the resulting features with representations from a population dynamics foundation model. Evaluated at 4,660 locations throughout Minneapolis-Saint Paul, WalkCLIP outperforms unimodal and multimodal baselines in both predictive accuracy and spatial alignment. These results show that the integration of visual and behavioral signals yields reliable predictions of the walking environment.

[CV-207] TAPVid-360: Tracking Any Point in 360 from Narrow Field of View Video

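[Quick read]: This paper addresses the lack of persistent, panoramic understanding in current artificial vision systems when processing video, and in particular the inability of Track Any Point (TAP) methods to track 2D points outside the field of view. The central challenge is to learn representations that predict stable 3D directions across frames without ground-truth dynamic 4D scene models. The solution is the TAPVid-360 task, which requires a model to predict the 3D direction of queried scene points relative to the observer throughout a video, even when the points lie outside the narrow field of view. Supervision comes from 360° videos: each panorama is resampled into narrow field-of-view perspectives, and ground-truth directions are computed with a 2D tracking pipeline, so models with allocentric (non-egocentric) scene representations can be trained without explicit 4D reconstruction.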

Link: https://arxiv.org/abs/2511.21946
Authors: Finlay G.C. Hudson,James A.D. Gardner,William A.P. Smith
Affiliations: University of York (约克大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Humans excel at constructing panoramic mental models of their surroundings, maintaining object permanence and inferring scene structure beyond visible regions. In contrast, current artificial vision systems struggle with persistent, panoramic understanding, often processing scenes egocentrically on a frame-by-frame basis. This limitation is pronounced in the Track Any Point (TAP) task, where existing methods fail to track 2D points outside the field of view. To address this, we introduce TAPVid-360, a novel task that requires predicting the 3D direction to queried scene points across a video sequence, even when far outside the narrow field of view of the observed video. This task fosters learning allocentric scene representations without needing dynamic 4D ground truth scene models for training. Instead, we exploit 360 videos as a source of supervision, resampling them into narrow field-of-view perspectives while computing ground truth directions by tracking points across the full panorama using a 2D pipeline. We introduce a new dataset and benchmark, TAPVid360-10k, comprising 10k perspective videos with ground truth directional point tracking. Our baseline adapts CoTracker v3 to predict per-point rotations for direction updates, outperforming existing TAP and TAPVid-3D methods.
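
Ground-truth directions from a 360 video rest on a simple fact: every pixel of an equirectangular panorama corresponds to a direction on the unit sphere. The sketch below maps a pixel to a unit vector and measures the angle between two directions. The axis convention (x right, y up, z forward at the image center) is an assumption chosen for illustration; the dataset may use a different one.

```python
import math


def equirect_to_direction(u, v, width, height):
    """Map an equirectangular pixel (u, v) to a unit 3D direction.

    Assumed convention: longitude 0 at the image center column, latitude 0 at
    the center row, x to the right, y up, z forward.
    """
    lon = (u / width - 0.5) * 2.0 * math.pi   # in [-pi, pi)
    lat = (0.5 - v / height) * math.pi        # +pi/2 at the top row
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return (x, y, z)


def angular_error_deg(a, b):
    """Angle in degrees between two unit vectors."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    dot = max(-1.0, min(1.0, dot))  # guard against rounding outside [-1, 1]
    return math.degrees(math.acos(dot))


if __name__ == "__main__":
    forward = equirect_to_direction(512, 256, width=1024, height=512)
    right = equirect_to_direction(768, 256, width=1024, height=512)
    print(forward, right, angular_error_deg(forward, right))
```

A direction remains well defined even when the point falls outside a narrow field-of-view crop, which is what allows the task to supervise out-of-view tracking.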

[CV-208] AmodalGen3D: Generative Amodal 3D Object Reconstruction from Sparse Unposed Views

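[Quick read]: This paper addresses the reconstruction of complete 3D objects from a few unposed, partially occluded views. Such inputs are common in the real world, and traditional multi-view or inpainting-based methods handle them poorly, often producing incomplete or geometrically inconsistent results. The proposed AmodalGen3D is a generative framework that combines 2D amodal completion priors with multi-view stereo geometry conditioning. A View-Wise Cross Attention mechanism fuses sparse-view features, and a Stereo-Conditioned Cross Attention module infers unobserved structure. By modeling visible and hidden regions jointly, the framework produces reconstructions that respect the sparse-view constraints while plausibly inferring unseen parts.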

链接: https://arxiv.org/abs/2511.21945
作者: Junwei Zhou,Yu-Wing Tai
机构: Dartmouth College (达特茅斯学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 14 figures

点击查看摘要

Abstract:Reconstructing 3D objects from a few unposed and partially occluded views is a common yet challenging problem in real-world scenarios, where many object surfaces are never directly observed. Traditional multi-view or inpainting-based approaches struggle under such conditions, often yielding incomplete or geometrically inconsistent reconstructions. We introduce AmodalGen3D, a generative framework for amodal 3D object reconstruction that infers complete, occlusion-free geometry and appearance from arbitrary sparse inputs. The model integrates 2D amodal completion priors with multi-view stereo geometry conditioning, supported by a View-Wise Cross Attention mechanism for sparse-view feature fusion and a Stereo-Conditioned Cross Attention module for unobserved structure inference. By jointly modeling visible and hidden regions, AmodalGen3D faithfully reconstructs 3D objects that are consistent with sparse-view constraints while plausibly hallucinating unseen parts. Experiments on both synthetic and real-world datasets demonstrate that AmodalGen3D achieves superior fidelity and completeness under occlusion-heavy sparse-view settings, addressing a pressing need for object-level 3D scene reconstruction in robotics, AR/VR, and embodied AI applications.

[CV-209] Interpretable Multimodal Cancer Prototyping with Whole Slide Images and Incompletely Paired Genomics

[Quick read]: This paper addresses two obstacles to multimodal integration: heterogeneity in histology and genomics degrades intra-modal representations and hinders inter-modal fusion, and in clinical practice genomic data may be partially missing or entirely unavailable. The key to the solution is a flexible multimodal prototyping framework with four core components: 1) Biological Prototyping based on text prompting and prototype-wise weighting; 2) Multiview Alignment at the sample and distribution levels; 3) Bipartite Fusion to capture shared and modality-specific information; and 4) Semantic Genomics Imputation for missing data. The framework consistently outperforms existing state-of-the-art methods on multiple downstream tasks.

Link: https://arxiv.org/abs/2511.21937
Authors: Yupei Zhang, Yating Huang, Wanming Hu, Lequan Yu, Hujun Yin, Chao Li
Affiliations: University of Cambridge; The University of Manchester; Sun Yat-sen University Cancer Center; The University of Hong Kong; University of Dundee
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multimodal approaches that integrate histology and genomics hold strong potential for precision oncology. However, phenotypic and genotypic heterogeneity limits the quality of intra-modal representations and hinders effective inter-modal integration. Furthermore, most existing methods overlook real-world clinical scenarios where genomics may be partially missing or entirely unavailable. We propose a flexible multimodal prototyping framework to integrate whole slide images and incomplete genomics for precision oncology. Our approach has four key components: 1) Biological Prototyping using text prompting and prototype-wise weighting; 2) Multiview Alignment through sample- and distribution-wise alignments; 3) Bipartite Fusion to capture both shared and modality-specific information for multimodal fusion; and 4) Semantic Genomics Imputation to handle missing data. Extensive experiments demonstrate the consistent superiority of the proposed method compared to other state-of-the-art approaches on multiple downstream tasks. The code is available at this https URL.

[CV-210] Adaptive Parameter Optimization for Robust Remote Photoplethysmography

[Quick read]: This paper addresses the limited performance of remote photoplethysmography (rPPG) across lighting conditions and camera setups: existing methods rely on fixed, pre-optimized parameters and do not adapt to the environment. The key to the solution is the training-free Projection-based Robust Signal Mixing (PRISM) algorithm, which uses online signal quality assessment to adaptively and jointly optimize photometric detrending and color mixing, giving robust physiological signal extraction in diverse environments while running in real time on a CPU.

Link: https://arxiv.org/abs/2511.21903
Authors: Cecilia G. Morales, Fanurs Chi En Teh, Kai Li, Pushpak Agrawal, Artur Dubrawski
Affiliations: Carnegie Mellon University; University of Toronto; Vellore Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted in Time Series for Health NeurIPS Workshop 2025

Abstract:Remote photoplethysmography (rPPG) enables contactless vital sign monitoring using standard RGB cameras. However, existing methods rely on fixed parameters optimized for particular lighting conditions and camera setups, limiting adaptability to diverse deployment environments. This paper introduces the Projection-based Robust Signal Mixing (PRISM) algorithm, a training-free method that jointly optimizes photometric detrending and color mixing through online parameter adaptation based on signal quality assessment. PRISM achieves state-of-the-art performance among unsupervised methods, with MAE of 0.77 bpm on PURE and 0.66 bpm on UBFC-rPPG, and accuracy of 97.3% and 97.5% respectively at a 5 bpm threshold. Statistical analysis confirms PRISM performs equivalently to leading supervised methods (p > 0.2), while maintaining real-time CPU performance without training. This validates that adaptive time series optimization significantly improves rPPG across diverse conditions.
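The two reported metrics, MAE in bpm and accuracy at a 5 bpm threshold, can be written out as a short sketch. It assumes that "accuracy at a threshold" means the fraction of heart-rate estimates whose absolute error is at most the threshold, and that the bound is inclusive; the abstract does not spell out either detail, and the function names are hypothetical.

```python
def mae_bpm(predicted, reference):
    """Mean absolute error between predicted and reference heart rates (bpm)."""
    if not predicted or len(predicted) != len(reference):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)

def accuracy_within(predicted, reference, threshold_bpm=5.0):
    """Fraction of estimates within threshold_bpm of the reference.

    Assumption: the threshold is inclusive (error <= threshold counts as a hit).
    """
    if not predicted or len(predicted) != len(reference):
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for p, r in zip(predicted, reference) if abs(p - r) <= threshold_bpm)
    return hits / len(predicted)
```

The two numbers answer different questions: MAE summarizes the typical error, while threshold accuracy shows how often an estimate is close enough to be usable, so a few large outliers can hurt the first much more than the second.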

[CV-211] PathReasoning: A multimodal reasoning agent for query-based ROI navigation on whole-slide images

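[Quick read]: This paper addresses efficient, accurate identification of diagnostically relevant regions of interest (ROIs) in high-resolution Whole Slide Images (WSIs), a key step in cancer diagnosis, prognosis, and treatment-response prediction. Conventional approaches are limited by the sheer size of WSIs (which can exceed 10 billion pixels), struggle to locate key regions quickly, and depend on dense pixel-level annotations. The core of the solution is PathReasoning, a multimodal reasoning agent that uses iterative reasoning and self-reflection to focus, round by round, on regions relevant to the clinical question. It builds an interpretable reasoning chain and finds high-value ROIs within a fixed number of steps without dense annotations. The method substantially outperforms existing ROI-selection strategies and improves accuracy on breast cancer report generation by 10% over standard GPT-4o, supporting accurate navigation, consistent diagnosis, comprehensive reporting, and evidence traceability.

Link: https://arxiv.org/abs/2511.21902
Authors: Kunpeng Zhang, Hanwen Xu, Sheng Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: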

Abstract:Deciphering the tumor microenvironment from Whole Slide Images (WSIs) is intriguing as it is key to cancer diagnosis, prognosis and treatment response. While these gigapixel images offer a comprehensive portrait of cancer, their extremely large size, which can exceed 10 billion pixels, makes it challenging and time-consuming to navigate to the corresponding regions to support diverse clinical inspection. Inspired by pathologists who navigate WSIs with a combination of sampling, reasoning and self-reflection, we propose “PathReasoning”, a multi-modal reasoning agent that iteratively navigates across WSIs through multiple rounds of reasoning and refinements. Specifically, starting with randomly sampled candidate regions, PathReasoning reviews current selections with self-reflection, reasoning over the correspondence between visual observations and clinical questions, and concludes by proposing new regions to explore. Across rounds, PathReasoning builds a reasoning chain that gradually directs attention to diagnostically relevant areas. PathReasoning turns each whole slide into a sequence of question-guided views, allowing the model to efficiently find informative ROIs within a fixed number of steps, without the need for dense pixel-level annotations. PathReasoning can substantially outperform strong ROI-selection approaches by 6.7% and 3.1% in AUROC on subtyping and longitudinal analysis tasks. The high-quality ROIs further support accurate report generation on breast cancer, significantly outperforming the standard GPT-4o by 10% in accuracy. PathReasoning prioritizes question-specific regions and constructs interpretable reasoning chains, supporting efficient slide review, consistent diagnostic interpretations, comprehensive reporting, and evidence traceability in digital pathology.

[CV-212] UniArt: Unified 3D Representation for Generating 3D Articulated Objects with Open-Set Articulation

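[Quick read]: This paper addresses the high cost and poor scalability of manually constructing articulated 3D objects; existing methods mostly use multi-stage pipelines that do not support efficient end-to-end synthesis. The key to the solution is UniArt, a diffusion-based framework that establishes a unified latent representation jointly encoding geometry, texture, part segmentation, and kinematic parameters. A reversible joint-to-voxel embedding spatially aligns articulation features with volumetric geometry, so the model learns coherent motion behavior while generating structure. Articulation type prediction is formulated as an open-set problem, which removes the need for fixed joint semantics and allows generalization to novel joint categories and unseen object types, improving the model's generality and accuracy.

Link: https://arxiv.org/abs/2511.21887
Authors: Bu Jin, Weize Li, Songen Gu, Yupeng Zheng, Yuhang Zheng, Zhengyi Zhou, Yao Yao
Affiliations: Hong Kong University of Science and Technology; Tsinghua University; Nanjing University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: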

Abstract:Articulated 3D objects play a vital role in realistic simulation and embodied robotics, yet manually constructing such assets remains costly and difficult to scale. In this paper, we present UniArt, a diffusion-based framework that directly synthesizes fully articulated 3D objects from a single image in an end-to-end manner. Unlike prior multi-stage techniques, UniArt establishes a unified latent representation that jointly encodes geometry, texture, part segmentation, and kinematic parameters. We introduce a reversible joint-to-voxel embedding, which spatially aligns articulation features with volumetric geometry, enabling the model to learn coherent motion behaviors alongside structural formation. Furthermore, we formulate articulation type prediction as an open-set problem, removing the need for fixed joint semantics and allowing generalization to novel joint categories and unseen object types. Experiments on the PartNet-Mobility benchmark demonstrate that UniArt achieves state-of-the-art mesh quality and articulation accuracy.

[CV-213] Closed-Loop Transformers: Autoregressive Modeling as Iterative Latent Equilibrium

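[Quick read]: This paper addresses a fundamental limitation of open-loop autoregressive Transformers: each hidden state is computed in a single forward pass and never revised, so errors accumulate along the sequence and cause failures in long-range reasoning, factual consistency, and multi-step planning. The key to the solution is the closed-loop prediction principle, which requires the model to iteratively refine latent representations until they reach a self-consistent equilibrium before generating each token. It is instantiated as Equilibrium Transformers (EqT), which add an Equilibrium Refinement Module to standard Transformer layers; the module uses gradient descent to minimize an energy function that is learned without external supervision and that enforces bidirectional prediction consistency, episodic memory coherence, and output confidence. Theoretical analysis shows that EqT performs approximate maximum a posteriori (MAP) inference in a latent energy-based model with linear convergence guarantees, and that the gains are largest on hard instances where one-shot inference does poorly.

Link: https://arxiv.org/abs/2511.21882
Authors: Akbar Anbar Jafari, Gholamreza Anbarjafari
Affiliations: University of Tartu; 3S Holding OÜ
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 22 pages, 1 figure, 1 table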

Abstract:Contemporary autoregressive transformers operate in open loop: each hidden state is computed in a single forward pass and never revised, causing errors to propagate uncorrected through the sequence. We identify this open-loop bottleneck as a fundamental architectural limitation underlying well-documented failures in long-range reasoning, factual consistency, and multi-step planning. To address this limitation, we introduce the closed-loop prediction principle, which requires that models iteratively refine latent representations until reaching a self-consistent equilibrium before committing to each token. We instantiate this principle as Equilibrium Transformers (EqT), which augment standard transformer layers with an Equilibrium Refinement Module that minimizes a learned energy function via gradient descent in latent space. The energy function enforces bidirectional prediction consistency, episodic memory coherence, and output confidence, all computed without external supervision. Theoretically, we prove that EqT performs approximate MAP inference in a latent energy-based model, establish linear convergence guarantees, and show that refinement improves predictions precisely on hard instances where one-shot inference is suboptimal. The framework unifies deep equilibrium models, diffusion language models, and test-time training as special cases. Preliminary experiments on the binary parity task demonstrate +3.28% average improvement on challenging sequences, with gains reaching +8.07% where standard transformers approach random performance, validating that the benefit of deliberation scales with task difficulty. Just as attention mechanisms resolved the sequential bottleneck of recurrent networks, we propose that closed-loop equilibrium may resolve the commitment bottleneck of open-loop autoregression, representing a foundational step toward language models.
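The refinement loop described in the abstract, gradient descent on an energy in latent space until a fixed point is reached, can be illustrated with a toy sketch. The quadratic energy below is a hypothetical stand-in for the learned energy function, and the step size, tolerance, and function names are illustrative assumptions rather than details from the paper.

```python
def refine_to_equilibrium(z, grad_energy, step_size=0.5, tol=1e-8, max_iters=1000):
    """Run gradient descent on an energy until the gradient is near zero.

    Returns the refined latent and the number of update steps taken.
    """
    z = list(z)
    for iteration in range(max_iters):
        g = grad_energy(z)
        if max(abs(gi) for gi in g) < tol:
            return z, iteration  # equilibrium: no coordinate wants to move
        z = [zi - step_size * gi for zi, gi in zip(z, g)]
    return z, max_iters

def toy_grad(z, minimum=(1.0, -2.0), curvature=(1.0, 2.0)):
    """Gradient of the toy energy E(z) = 0.5 * sum(c_i * (z_i - m_i) ** 2)."""
    return [c * (zi - m) for zi, m, c in zip(z, minimum, curvature)]
```

On this toy energy each step shrinks the error in a coordinate by the constant factor `1 - step_size * curvature`, which is the kind of geometric (linear-rate) convergence the abstract refers to. A learned, non-quadratic energy would not in general behave this cleanly.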

[CV-214] Saddle-Free Guidance: Improved On-Manifold Sampling without Labels or Additional Training

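[Quick read]: This paper addresses how to guide score-based generative models toward high-quality samples when labeled data are unavailable or additional models cannot be trained. Classifier-Free Guidance (CFG) depends on labeled data, and Auto-Guidance requires training a smaller auxiliary model, which limits both in unlabeled settings. The key finding is that the positive curvature of log-density estimates in saddle regions provides a strong guidance signal. Building on it, the paper proposes Saddle-Free Guidance (SFG), which maintains an estimate of the maximal positive curvature of the log density to guide a single score model. SFG needs no additional training, costs about the same as CFG, and works with off-the-shelf diffusion and flow-matching models. It achieves the best FID and FD-DINOv2 scores in unconditional ImageNet-512 generation, reaches the best overall FD-DINOv2 when combined with Auto-Guidance, and increases image diversity while keeping good prompt adherence and image fidelity.

Link: https://arxiv.org/abs/2511.21863
Authors: Eric Yeats, Darryl Hannan, Wilson Fearn, Timothy Doster, Henry Kvinge, Scott Mahan
Affiliations: Pacific Northwest National Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: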

Abstract:Score-based generative models require guidance in order to generate plausible, on-manifold samples. The most popular guidance method, Classifier-Free Guidance (CFG), is only applicable in settings with labeled data and requires training an additional unconditional score-based model. More recently, Auto-Guidance adopts a smaller, less capable version of the original model to guide generation. While each method effectively promotes the fidelity of generated data, each requires labeled data or the training of additional models, making it challenging to guide score-based models when (labeled) training data are not available or training new models is not feasible. We make the surprising discovery that the positive curvature of log density estimates in saddle regions provides strong guidance for score-based models. Motivated by this, we develop saddle-free guidance (SFG) which maintains estimates of maximal positive curvature of the log density to guide individual score-based models. SFG has the same computational cost of classifier-free guidance, does not require additional training, and works with off-the-shelf diffusion and flow matching models. Our experiments indicate that SFG achieves state-of-the-art FID and FD-DINOv2 metrics in single-model unconditional ImageNet-512 generation. When SFG is combined with Auto-Guidance, its unconditional samples achieve general state-of-the-art in FD-DINOv2 score. Our experiments with FLUX.1-dev and Stable Diffusion v3.5 indicate that SFG boosts the diversity of output images compared to CFG while maintaining excellent prompt adherence and image fidelity.

[CV-215] Evaluating Strategies for Synthesizing Clinical Notes for Medical Multimodal AI

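[Quick read]: This paper addresses the lack of robust models in medical AI caused by scarce multimodal (MM) data. In dermatology in particular, skin lesion datasets usually contain only images with minimal metadata, which limits the gains from multimodal fusion. The key to the solution is to use Large Language Models (LLMs) to generate synthetic clinical text descriptions, improving their semantic quality through prompt design and the inclusion of medical metadata. This improves classification performance, especially under domain shift, and enables cross-modal retrieval, a capability that is not explicitly optimized during training.

Link: https://arxiv.org/abs/2511.21827
Authors: Niccolo Marini, Zhaohui Liang, Sivaramakrishnan Rajaraman, Zhiyun Xue, Sameer Antani
Affiliations: National Library of Medicine; National Institutes of Health
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: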

Abstract:Multimodal (MM) learning is emerging as a promising paradigm in biomedical artificial intelligence (AI) applications, integrating complementary modalities, which highlight different aspects of patient health. The scarcity of large heterogeneous biomedical MM data has restrained the development of robust models for medical AI applications. In the dermatology domain, for instance, skin lesion datasets typically include only images linked to minimal metadata describing the condition, thereby limiting the benefits of MM data integration for reliable and generalizable predictions. Recent advances in Large Language Models (LLMs) enable the synthesis of textual descriptions of image findings, potentially allowing the combination of image and text representations. However, LLMs are not specifically trained for use in the medical domain, and their naive inclusion has raised concerns about the risk of hallucinations in clinically relevant contexts. This work investigates strategies for generating synthetic textual clinical notes, in terms of prompt design and medical metadata inclusion, and evaluates their impact on MM architectures toward enhancing performance in classification and cross-modal retrieval tasks. Experiments across several heterogeneous dermatology datasets demonstrate that synthetic clinical notes not only enhance classification performance, particularly under domain shift, but also unlock cross-modal retrieval capabilities, a downstream task that is not explicitly optimized during training.

[CV-216] $\mathcal{E}_0$: Enhancing Generalization and Fine-Grained Control in VLA Models via Continuized Discrete Diffusion

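[Quick read]: This paper addresses the limited generalization and the coarse, unstable action control of current Vision-Language-Action (VLA) models in robotic manipulation. The core challenge is generating more precise, robust, and transferable actions across tasks, scenes, and camera viewpoints. The key to the solution is E0, a continuized discrete diffusion framework that models action generation as iterative denoising over quantized action tokens. Two mechanisms drive the gains: first, discrete action tokens match the symbolic structure of pretrained VLM/VLA backbones, which strengthens semantic conditioning; second, because real robot hardware constraints (such as encoder resolution, control frequency, and actuation latency) are inherently discrete, a Bayes-optimal denoiser can model the correct discrete action distribution, which improves generalization. E0 also avoids the distribution mismatch introduced by masking, and a spherical viewpoint perturbation augmentation improves robustness to camera pose changes without additional data. In experiments, E0 reaches state-of-the-art performance across 14 diverse environments, outperforming baselines by 10.7% on average, and real-world tests on a Franka arm confirm its precision, robustness, and transferability.

Link: https://arxiv.org/abs/2511.21542
Authors: Zhihao Zhan, Jiaying Zhou, Likui Zhang, Qinhan Lv, Hao Liu, Jusheng Zhang, Weizheng Li, Ziliang Chen, Tianshui Chen, Keze Wang, Liang Lin, Guangrun Wang
Affiliations: Sun Yat-sen University; Guangdong Key Laboratory of Big Data Analysis and Processing; X-Era AI Lab; Guangdong University of Technology
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: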

Abstract:Vision-Language-Action (VLA) models offer a unified framework for robotic manipulation by integrating visual perception, language understanding, and control generation. Yet existing VLA models still struggle to generalize across diverse tasks, scenes, and camera viewpoints, and often produce coarse or unstable actions. We introduce E0, a continuized discrete diffusion framework that formulates action generation as iterative denoising over quantized action tokens. Compared with continuous diffusion policies, E0 offers two key advantages: (1) discrete action tokens align naturally with the symbolic structure of pretrained VLM/VLA backbones, enabling stronger semantic conditioning; and (2) discrete diffusion matches the true quantized nature of real-world robot control, whose hardware constraints (e.g., encoder resolution, control frequency, actuation latency) inherently discretize continuous signals, and therefore benefits from a Bayes-optimal denoiser that models the correct discrete action distribution, leading to stronger generalization. Compared with discrete autoregressive and mask-based discrete diffusion models, E0 supports a significantly larger and finer-grained action vocabulary and avoids the distributional mismatch introduced by masking-based corruptions, yielding more accurate fine-grained action control. We further introduce a spherical viewpoint perturbation augmentation method to improve robustness to camera shifts without additional data. Experiments on LIBERO, VLABench, and ManiSkill show that E0 achieves state-of-the-art performance across 14 diverse environments, outperforming strong baselines by 10.7% on average. Real-world evaluation on a Franka arm confirms that E0 delivers precise, robust, and transferable manipulation, establishing discrete diffusion as a promising direction for generalizable VLA policy learning.
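To make "quantized action tokens" concrete, here is a minimal sketch of uniform binning for one continuous action dimension. Uniform bins, the action range, the bin count, and the function names are illustrative assumptions; the abstract does not describe the tokenizer E0 actually uses.

```python
def quantize_action(value, low, high, num_bins):
    """Map a continuous action value to a discrete token index in [0, num_bins)."""
    if not low < high or num_bins < 1:
        raise ValueError("need low < high and num_bins >= 1")
    clipped = min(max(value, low), high)  # out-of-range values saturate
    width = (high - low) / num_bins
    index = int((clipped - low) / width)
    return min(index, num_bins - 1)  # value == high falls into the last bin

def dequantize_action(index, low, high, num_bins):
    """Map a token index back to the center of its bin."""
    width = (high - low) / num_bins
    return low + (index + 0.5) * width
```

With this scheme the round-trip error is at most half a bin width, so a larger vocabulary (more bins) directly buys finer-grained control, which is the trade-off the abstract points to when it mentions a larger and finer-grained action vocabulary.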

[CV-217] Percept-WAM: Perception-Enhanced World-Awareness-Action Model for Robust End-to-End Autonomous Driving

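[Quick read]: This paper addresses the weak spatial perception of current Vision-Language Models (VLMs) in autonomous driving, especially their inaccuracy and instability in long-tail scenarios and complex interactions. Existing VLA systems lack effective spatial grounding and understanding, which limits their perception performance. The key to the solution is Percept-WAM, the first World-Awareness-Action Model to implicitly integrate 2D/3D scene understanding within a single VLM. It introduces World-PV and World-BEV tokens that jointly encode spatial coordinates and confidence, and a grid-conditioned prediction mechanism with IoU-aware scoring and parallel autoregressive decoding, which improves stability in small-object, far-range, and long-tail scenarios. It also reuses pretrained VLM parameters to retain general intelligence (such as logical reasoning) and directly outputs perception results and trajectory control outputs, unifying perception and decision-making end to end.

Link: https://arxiv.org/abs/2511.19221
Authors: Jianhua Han, Meng Tian, Jiangtong Zhu, Fan He, Huixin Zhang, Sitong Guo, Dechang Zhu, Hao Tang, Pei Xu, Yuze Guo, Minzhe Niu, Haojie Zhu, Qichao Dong, Xuechao Yan, Siyuan Dong, Lu Hou, Qingqiu Huang, Xiaosong Jia, Hang Xu
Affiliations: Yinwang Intelligent Technology Co. Ltd.; Fudan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: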

Abstract:Autonomous driving heavily relies on accurate and robust spatial perception. Many failures arise from inaccuracies and instability, especially in long-tail scenarios and complex interactions. However, current vision-language models are weak at spatial grounding and understanding, and VLA systems built on them therefore show limited perception and localization ability. To address these challenges, we introduce Percept-WAM, a perception-enhanced World-Awareness-Action Model that is the first to implicitly integrate 2D/3D scene understanding abilities within a single vision-language model (VLM). Instead of relying on QA-style spatial reasoning, Percept-WAM unifies 2D/3D perception tasks into World-PV and World-BEV tokens, which encode both spatial coordinates and confidence. We propose a grid-conditioned prediction mechanism for dense object perception, incorporating IoU-aware scoring and parallel autoregressive decoding, improving stability in long-tail, far-range, and small-object scenarios. Additionally, Percept-WAM leverages pretrained VLM parameters to retain general intelligence (e.g., logical reasoning) and can output perception results and trajectory control outputs directly. Experiments show that Percept-WAM matches or surpasses classical detectors and segmenters on downstream perception benchmarks, achieving 51.7/58.9 mAP on COCO 2D detection and nuScenes BEV 3D detection. When integrated with trajectory decoders, it further improves planning performance on nuScenes and NAVSIM, e.g., surpassing DiffusionDrive by 2.1 in PMDS on NAVSIM. Qualitative results further highlight its strong open-vocabulary and long-tail generalization.
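The quantity behind "IoU-aware scoring" is the intersection over union of two boxes, sketched below for axis-aligned 2D boxes. The `(x1, y1, x2, y2)` box format and the function name are assumptions for illustration; how Percept-WAM folds IoU into its confidence score is not specified in the abstract.

```python
def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Overlap extents are clamped at zero so disjoint boxes give no intersection.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A score that tracks IoU rewards predictions for being well localized, not only for having the right class, which matters most for the small and distant objects the abstract highlights.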

[CV-218] MICCAI STS 2024 Challenge: Semi-Supervised Instance-Level Tooth Segmentation in Panoramic X-ray and CBCT Images

[Quick read]: This paper addresses the difficulty of training deep learning models for instance-level tooth segmentation in dental imaging when annotated data are scarce, particularly for orthopantomogram (OPG) and Cone-Beam Computed Tomography (CBCT) images, where manual per-tooth annotation is slow and expensive. The key to the solution is semi-supervised learning (SSL), which uses large amounts of unlabeled data to improve model performance. The best-performing methods were all hybrid SSL frameworks that combined knowledge from foundation models such as the Segment Anything Model (SAM) with multi-stage, coarse-to-fine refinement pipelines. They improved the metrics by more than 44 percentage points on the 2D OPG task and by 61 percentage points on the 3D CBCT task over a fully supervised nnU-Net baseline trained only on the labeled data.

Link: https://arxiv.org/abs/2511.22911
Authors: Yaqi Wang, Zhi Li, Chengyu Wu, Jun Liu, Yifan Zhang, Jiaxue Ni, Qian Luo, Jialuo Chen, Hongyuan Zhang, Jin Liu, Can Han, Kaiwen Fu, Changkai Ji, Xinxu Cai, Jing Hao, Zhihao Zheng, Shi Xu, Junqiang Chen, Qianni Zhang, Dahong Qian, Shuai Wang, Huiyu Zhou
Affiliations: Hangzhou Dianzi University; Shenzhen University; Queen Mary University of London; Shandong University; Xidian University; Shanghai Jiao Tong University; Harbin Institute of Technology; The University of Hong Kong; Chinese Academy of Sciences; Yunnan Provincial Stomatology Hospital; Shanghai MediWorks Precision Instruments Co., Ltd; University of Leicester
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Orthopantomograms (OPGs) and Cone-Beam Computed Tomography (CBCT) are vital for dentistry, but creating large datasets for automated tooth segmentation is hindered by the labor-intensive process of manual instance-level annotation. This research aimed to benchmark and advance semi-supervised learning (SSL) as a solution for this data scarcity problem. We organized the 2nd Semi-supervised Teeth Segmentation (STS 2024) Challenge at MICCAI 2024. We provided a large-scale dataset comprising over 90,000 2D images and 3D axial slices, which includes 2,380 OPG images and 330 CBCT scans, all featuring detailed instance-level FDI annotations on part of the data. The challenge attracted 114 (OPG) and 106 (CBCT) registered teams. To ensure algorithmic excellence and full transparency, we rigorously evaluated the valid, open-source submissions from the top 10 (OPG) and top 5 (CBCT) teams, respectively. All successful submissions were deep learning-based SSL methods. The winning semi-supervised models demonstrated impressive performance gains over a fully-supervised nnU-Net baseline trained only on the labeled data. For the 2D OPG track, the top method improved the Instance Affinity (IA) score by over 44 percentage points. For the 3D CBCT track, the winning approach boosted the Instance Dice score by 61 percentage points. This challenge confirms the substantial benefit of SSL for complex, instance-level medical image segmentation tasks where labeled data is scarce. The most effective approaches consistently leveraged hybrid semi-supervised frameworks that combined knowledge from foundational models like SAM with multi-stage, coarse-to-fine refinement pipelines. Both the challenge dataset and the participants’ submitted code have been made publicly available on GitHub (this https URL), ensuring transparency and reproducibility.

[CV-219] Structure-Preserving Unpaired Image Translation to Photometrically Calibrate JunoCam with Hubble Data

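[Quick read]: This paper addresses the difficulty of quantitative analysis in studies of Jupiter's atmospheric dynamics caused by the lack of absolute photometric calibration in JunoCam, the optical camera on the Juno spacecraft, and in particular how to map between images from sensors of different resolution without paired data while preserving small-scale spatial structure. The key to the solution is a structure-preserving image-to-image translation method, SP-I2I, whose core innovation is an explicit frequency-space constraint that preserves high-frequency features. This handles the resolution gap between the two sensors and keeps fine structures intact, offering a new approach for Jovian atmosphere studies and for pansharpening of remote sensing imagery.

Link: https://arxiv.org/abs/2511.22668
Authors: Aditya Pratap Singh, Shrey Shah, Ramanakumar Sankar, Emma Dahl, Gerald Eichstädt, Georgios Georgakis, Bernadette Bucher
Affiliations: Unknown
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Earth and Planetary Astrophysics (astro-ph.EP); Computer Vision and Pattern Recognition (cs.CV)
Comments: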

Abstract:Insights into Jupiter’s atmospheric dynamics are vital for understanding planetary meteorology and exoplanetary gas giant atmospheres. To study these dynamics, we require high-resolution, photometrically calibrated observations. Over the last 9 years, the Juno spacecraft’s optical camera, JunoCam, has generated a unique dataset with high spatial resolution, wide coverage during perijove passes, and a long baseline. However, JunoCam lacks absolute photometric calibration, hindering quantitative analysis of the Jovian atmosphere. Using observations from the Hubble Space Telescope (HST) as a proxy for a calibrated sensor, we present a novel method for performing unpaired image-to-image translation (I2I) between JunoCam and HST, focusing on addressing the resolution discrepancy between the two sensors. Our structure-preserving I2I method, SP-I2I, incorporates explicit frequency-space constraints designed to preserve high-frequency features, ensuring the retention of fine, small-scale spatial structures that are essential for studying Jupiter’s atmosphere. We demonstrate that state-of-the-art unpaired image-to-image translation methods are inadequate to address this problem, and, importantly, we show the broader impact of our proposed solution on relevant remote sensing data for the pansharpening task.

[CV-220] Hard Spatial Gating for Precision-Driven Brain Metastasis Segmentation: Addressing the Over-Segmentation Paradox in Deep Attention Networks

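[Quick read]: This paper addresses brain metastasis segmentation in MRI, where the core challenges are small lesions (5-15 mm) and extreme class imbalance (tumor volume below 2%). Conventional soft-attention CNNs often fall into an "over-segmentation paradox": high sensitivity (recall > 0.88) but a severe drop in precision (precision < 0.23) and boundary errors above 150 mm, which undermines planning accuracy for stereotactic radiosurgery. The key to the solution is the Spatial Gating Network (SG-Net), a precision-first architecture with a hard spatial gating mechanism that applies strict feature selection to suppress background artifacts while preserving tumor features. This markedly improves boundary localization (95% Hausdorff Distance reduced from 157.52 mm to 56.13 mm) and raises precision (0.52 vs. 0.20) while keeping good recall (0.79). The model has 8.8x fewer parameters than Attention U-Net, which suits resource-constrained settings.

Link: https://arxiv.org/abs/2511.22606
Authors: Rowzatul Zannath Prerona
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: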

Abstract:Brain metastasis segmentation in MRI remains a formidable challenge due to diminutive lesion sizes (5-15 mm) and extreme class imbalance (less than 2% tumor volume). While soft-attention CNNs are widely used, we identify a critical failure mode termed the “over-segmentation paradox,” where models achieve high sensitivity (recall > 0.88) but suffer from catastrophic precision collapse (precision < 0.23) and boundary errors exceeding 150 mm. This imprecision poses significant risks for stereotactic radiosurgery planning. To address this, we introduce the Spatial Gating Network (SG-Net), a precision-first architecture employing hard spatial gating mechanisms. Unlike traditional soft attention, SG-Net enforces strict feature selection to aggressively suppress background artifacts while preserving tumor features. Validated on the Brain-Mets-Lung-MRI dataset (n=92), SG-Net achieves a Dice Similarity Coefficient of 0.5578 +/- 0.0243 (95% CI: 0.45-0.67), statistically outperforming Attention U-Net (p < 0.001) and ResU-Net (p < 0.001). Most critically, SG-Net demonstrates a threefold improvement in boundary precision, achieving a 95% Hausdorff Distance of 56.13 mm compared to 157.52 mm for Attention U-Net, while maintaining robust recall (0.79) and superior precision (0.52 vs. 0.20). Furthermore, SG-Net requires only 0.67M parameters (8.8x fewer than Attention U-Net), facilitating deployment in resource-constrained environments. These findings establish hard spatial gating as a robust solution for precision-driven lesion detection, directly enhancing radiosurgery accuracy.
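The "over-segmentation paradox" and the contrast between soft and hard gating can both be shown on toy data. In this sketch the masks are flat binary lists, and the thresholded gate is only one plausible reading of "hard spatial gating"; the threshold value and all function names are hypothetical, and the abstract does not describe SG-Net's actual gating layer.

```python
def segmentation_scores(pred, target):
    """Return (precision, recall, dice) for two flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    tp = sum(1 for p, t in zip(pred, target) if p and t)
    fp = sum(1 for p, t in zip(pred, target) if p and not t)
    fn = sum(1 for p, t in zip(pred, target) if not p and t)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    dice = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, dice

def soft_gate(features, attention):
    """Soft attention: every feature is scaled, so background still leaks through."""
    return [f * a for f, a in zip(features, attention)]

def hard_gate(features, attention, threshold=0.5):
    """Hard gating (assumed form): features below the threshold are zeroed out."""
    return [f if a >= threshold else 0.0 for f, a in zip(features, attention)]
```

For a target with 2 positive voxels out of 10, a prediction that marks 8 voxels (including both true ones) has perfect recall but poor precision, which is the pattern the abstract describes: every lesion is found, but most of what is flagged is background.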

[CV-221] Content Adaptive Encoding For Interactive Game Streaming

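[Quick read]: This paper addresses content-adaptive encoding (CAE) for interactive game streaming (IGS), specifically resolution adaptation under ultra-low latency and strict compute limits. Conventional CAE relies on lookahead or buffering, which IGS does not allow. The paper proposes a lightweight solution based on encoding metadata from past frames: a convolutional neural network (CNN) is trained to predict the best resolution option for the next scene from a running window of aggregated coding block statistics from the current scene. Inference takes only 1 ms of a single CPU core, so resolution adaptation runs inside an HEVC encoding setup with no latency overhead and improves on the default fixed-resolution ladder by 2.3 Bjøntegaard Delta-VMAF points.

Link: https://arxiv.org/abs/2511.22327
Authors: Shakarim Soltanayev, Odysseas Zisimopoulos, Mohammad Ashraful Anam, Man Cheung Kung, Angeliki Katsenou, Yiannis Andreopoulos
Affiliations: Sony Interactive Entertainment
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages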

Abstract:Video-on-demand streaming has benefitted from content-adaptive encoding (CAE), i.e., adaptation of resolution and/or quantization parameters for each scene based on convex hull optimization. However, CAE is very challenging to develop and deploy for interactive game streaming (IGS). Commercial IGS services impose ultra-low latency encoding with no lookahead or buffering, and have extremely tight compute constraints for any CAE algorithm execution. We propose the first CAE approach for resolution adaptation in IGS based on compact encoding metadata from past frames. Specifically, we train a convolutional neural network (CNN) to infer the best resolution from the options available for the upcoming scene based on a running window of aggregated coding block statistics from the current scene. By deploying the trained CNN within a practical IGS setup based on HEVC encoding, our proposal: (i) improves over the default fixed-resolution ladder of HEVC by 2.3 Bjøntegaard Delta-VMAF points; (ii) infers using 1ms of a single CPU core per scene, thereby having no latency overhead.
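The data flow described in the abstract, a running window of per-frame coding statistics feeding a resolution decision for the next scene, can be sketched without the network itself. In this sketch `score_fn` is a hypothetical stand-in for the trained CNN, and the statistic name `mean_qp`, the window size, and the class and function names are illustrative assumptions, not details from the paper.

```python
from collections import deque

class RunningBlockStats:
    """Keep the most recent per-frame statistics and average them."""

    def __init__(self, window_size):
        # deque with maxlen drops the oldest frame automatically.
        self.frames = deque(maxlen=window_size)

    def push(self, frame_stats):
        """Add one frame's statistics, given as a dict of name -> value."""
        self.frames.append(dict(frame_stats))

    def aggregate(self):
        """Average each statistic over the frames currently in the window."""
        if not self.frames:
            raise ValueError("no frames in the window yet")
        keys = self.frames[0].keys()
        return {k: sum(f[k] for f in self.frames) / len(self.frames) for k in keys}

def choose_resolution(features, options, score_fn):
    """Pick the option with the highest score; score_fn stands in for the CNN."""
    return max(options, key=lambda option: score_fn(features, option))
```

Because only past frames enter the window, a decision of this shape needs no lookahead or buffering, which is the constraint that separates interactive streaming from video on demand in the abstract.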

[CV-222] ColonAdapter: Geometry Estimation Through Foundation Model Adaptation for Colonoscopy

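[Quick read]: This paper addresses accurate 3D geometry estimation from monocular colonoscopy images. The core difficulty is that colonoscopy scenes contain non-Lambertian surfaces, moving light sources, and large textureless regions, which sharply degrade geometric foundation models pretrained on natural scenes. The solution is ColonAdapter, a self-supervised fine-tuning framework: a Detail Restoration Module (DRM) improves reconstruction in low-texture regions, a geometry consistency loss enforces scale consistency, and a confidence-weighted photometric loss stabilizes training. Together these yield state-of-the-art camera pose estimation, monocular depth prediction, and dense 3D point map reconstruction without ground-truth camera intrinsics.

Link: https://arxiv.org/abs/2511.22250
Authors: Zhiyi Jiang,Yifu Wang,Xuelian Cheng,Zongyuan Ge
Affiliations: Southeast University; Monash University
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: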

Abstract:Estimating 3D geometry from monocular colonoscopy images is challenging due to non-Lambertian surfaces, moving light sources, and large textureless regions. While recent 3D geometric foundation models eliminate the need for multi-stage pipelines, their performance deteriorates in clinical scenes. These models are primarily trained on natural scene datasets and struggle with specularity and homogeneous textures typical in colonoscopy, leading to inaccurate geometry estimation. In this paper, we present ColonAdapter, a self-supervised fine-tuning framework that adapts geometric foundation models for colonoscopy geometry estimation. Our method leverages pretrained geometric priors while tailoring them to clinical data. To improve performance in low-texture regions and ensure scale consistency, we introduce a Detail Restoration Module (DRM) and a geometry consistency loss. Furthermore, a confidence-weighted photometric loss enhances training stability in clinical environments. Experiments on both synthetic and real datasets demonstrate that our approach achieves state-of-the-art performance in camera pose estimation, monocular depth prediction, and dense 3D point map reconstruction, without requiring ground-truth intrinsic parameters.

[CV-223] GACELLE: GPU-accelerated tools for model parameter estimation and image reconstruction

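[Quick read]: This paper addresses the limited adoption of quantitative MRI (qMRI) in clinical research caused by the heavy computational cost of parameter estimation. For high-resolution images or multi-parameter fits, conventional processing is slow, which constrains use in routine pipelines and slows methodological innovation and clinical translation. The solution is GACELLE, an open-source GPU-accelerated framework. It provides a stochastic gradient descent optimiser and a stochastic sampler, the latter performing fully vectorised Markov chain Monte Carlo (MCMC), and uses spatial regularisation to improve estimation robustness alongside uncertainty quantification. The backend handles parallelisation, memory batching, and parameter updates automatically, so users only define a forward signal model. Reported speed-ups over CPU-based estimation reach 451-fold for gradient descent and 14,380-fold for sampling without loss of accuracy, lowering the computational barrier for qMRI and supporting reproducible biomarker development, large-scale imaging studies, and clinical translation.

Link: https://arxiv.org/abs/2511.22094
Authors: Kwok-Shing Chan (1, 2), Hansol Lee (1, 2), Yixin Ma (1, 2), Berkin Bilgic (1, 2), Susie Y. Huang (1, 2), Hong-Hsi Lee (1, 2), José P. Marques (3)
Affiliations: (1) Department of Radiology, Athinoula A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital, Charlestown, MA, United States; (2) Harvard Medical School, Boston, MA, United States; (3) Donders Institute for Brain, Cognition and Behaviour, Radboud University, Nijmegen, The Netherlands
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Comments: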

Abstract:Quantitative MRI (qMRI) offers tissue-specific biomarkers that can be tracked over time or compared across populations; however, its adoption in clinical research is hindered by significant computational demands of parameter estimation. Images acquired at high spatial resolution or requiring fitting multiple parameters often require lengthy processing time, constraining their use in routine pipelines and slowing methodological innovation and clinical translation. We present GACELLE, an open source, GPU-accelerated framework for high-throughput qMRI analysis. GACELLE provides a stochastic gradient descent optimiser and a stochastic sampler in MATLAB, enabling fast parameter mapping, improved estimation robustness via spatial regularisation, and uncertainty quantification. GACELLE prioritises accessibility: users only need to provide a forward signal model, while GACELLE’s backend manages computational parallelisation, automatic parameter updates, and memory-batching. The stochastic solver performs fully vectorised Markov chain Monte Carlo with identical likelihood on CPU and GPU, ensuring reproducibility across hardware. Benchmarking demonstrates up to 451-fold acceleration for the stochastic gradient descent solver and 14,380-fold acceleration for stochastic sampling compared to CPU-based estimation, without compromising accuracy. We demonstrated GACELLE’s versatility on three representative qMRI models and on an image reconstruction task. Across these applications, GACELLE improves parameter precision, enhances test-retest reproducibility, and reduces noise in quantitative maps. By combining speed, usability and flexibility, GACELLE provides a generalisable optimisation framework for medical image analysis. It lowers the computational barrier for qMRI, paving the way for reproducible biomarker development, large-scale imaging studies, and clinical translation. 
Submission history: [v1] Thu, 27 Nov 2025 04:30:29 UTC (8,444 KB), from Kwok-Shing Chan. Cite as: arXiv:2511.22094 [eess.IV], https://doi.org/10.48550/arXiv.2511.22094
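
Illustrative sketch (not GACELLE code): the abstract says users supply only a forward signal model and the backend fits it for all voxels in parallel. GACELLE itself is MATLAB and GPU-based; the NumPy toy below only mimics that division of labour on the CPU, using plain (non-stochastic) gradient descent and an assumed mono-exponential decay model. All names, the step size, and the iteration count are hypothetical choices.

```python
import numpy as np

def forward_model(params, te):
    """User-supplied signal model (assumed here): S = S0 * exp(-R2 * TE)."""
    s0, r2 = params
    return s0[:, None] * np.exp(-r2[:, None] * te[None, :])

def fit_all_voxels(signal, te, n_iter=3000, lr=0.1):
    """Gradient descent on 0.5 * sum of squared residuals, vectorised over voxels."""
    s0 = signal[:, 0].copy()        # initialise S0 from the first echo
    r2 = np.ones(signal.shape[0])   # initial decay rate, arbitrary units
    for _ in range(n_iter):
        decay = np.exp(-r2[:, None] * te[None, :])
        pred = s0[:, None] * decay
        resid = pred - signal
        # Analytic gradients of the loss with respect to S0 and R2, one value per voxel.
        grad_s0 = np.sum(resid * decay, axis=1)
        grad_r2 = np.sum(resid * (-te[None, :] * pred), axis=1)
        s0 -= lr * grad_s0
        r2 -= lr * grad_r2
    return s0, r2

# Noise-free synthetic data for three "voxels" (placeholder values, arbitrary units).
te = np.arange(1, 9) * 0.2
true_s0 = np.array([0.9, 1.0, 1.1])
true_r2 = np.array([0.8, 1.0, 1.2])
signal = forward_model((true_s0, true_r2), te)

fit_s0, fit_r2 = fit_all_voxels(signal, te)
```

Every voxel is updated in the same array operation, which is the property that makes this kind of fitting map naturally onto a GPU.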

[CV-224] When Do Domain-Specific Foundation Models Justify Their Cost? A Systematic Evaluation Across Retinal Imaging Tasks

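[Quick read]: This paper asks two questions: whether retinal disease classification needs large domain-specific foundation models or whether compact general-purpose architectures suffice, and whether retina-specific pretraining justifies its computational cost. It systematically compares initialization strategies on four retinal classification tasks covering optical coherence tomography (OCT) and color fundus photography (CFP), across architectures from 22.8M to 303M parameters (vision transformers, Swin Transformers, ConvNeXt, and RETFound). Pretraining helps across the board (5.18% to 18.41% improvement), with gains that scale with task difficulty. Compact general-purpose models (27M to 29M parameters) dominate the Pareto frontiers on most tasks; RETFound (303M) justifies its cost only for fine-grained DR grading under extreme class imbalance, which marks the boundary where specialized pretraining is warranted.

Link: https://arxiv.org/abs/2511.22001
Authors: David Isztl,Tahm Spitznagel,Gabor Mark Somfai,Rui Santos
Affiliations: Stadtspital Zürich; Spross Research Institute
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: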

Abstract:Large vision foundation models have been widely adopted for retinal disease classification without systematic evidence justifying their parameter requirements. In the present work we address two critical questions: First, are large domain-specific foundation models essential, or do compact general-purpose architectures suffice? Second, does specialized retinal pretraining justify its computational cost? To answer this, we benchmark initialization strategies across four retinal imaging classification tasks spanning Optical Coherence Tomography (OCT) and Color Fundus Photography (CFP) modalities: 8-class OCT classification, 3-class diabetic macular edema (DME), 5-class diabetic retinopathy (DR), and 3-class glaucoma (GL) detection. We evaluate 12-13 model configurations per task, including vision transformers (22.8M-86.6M parameters), Swin Transformers (27.6M-28.3M), ConvNeXt (28.6M), and the domain-specific RETFound models (303M), under identical training conditions. Our results challenge prevailing assumptions: First, we demonstrate that pretraining provides universal benefits (5.18-18.41% improvement), scaling with task difficulty. Second, compact architectures (27-29M) dominate Pareto frontiers; SwinV2-tiny achieves top-1 performance on three datasets. Third, RETFound (303M) justifies its computational cost only for challenging DR grading (accuracy of 71.15%), while ImageNet pretraining proves to be sufficient for all other tasks (DME accuracy: 99.24%, OCT accuracy: 97.96%). CFP tasks show larger pretraining accuracy gains (9.13-18.41%) than OCT (5.18%). Thus, the evidence suggests that compact general-purpose models deliver near-optimal performance for most retinal classification tasks; specialized foundation models are warranted only for fine-grained discrimination under extreme class imbalance.
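
Illustrative sketch (not from the paper): the abstract says compact models "dominate Pareto frontiers" of size against accuracy. The short example below shows what that means computationally. The model names and numbers are placeholders invented for the example and are not results from the paper.

```python
def pareto_front(models):
    """Return names of models that no other model beats on both size and accuracy.

    Each entry is (name, params_in_millions, accuracy). A model is dominated if
    another model has no more parameters and no lower accuracy, and is strictly
    better on at least one of the two.
    """
    front = []
    for name, params, acc in models:
        dominated = any(
            (p2 <= params and a2 >= acc) and (p2 < params or a2 > acc)
            for _, p2, a2 in models
        )
        if not dominated:
            front.append(name)
    return front

# Placeholder values for illustration only; NOT taken from the paper.
models = [
    ("tiny", 28.0, 0.95),
    ("base", 87.0, 0.94),
    ("large", 303.0, 0.96),
    ("small", 22.0, 0.90),
]
front = pareto_front(models)
```

In this toy table "base" is dominated by "tiny" (smaller and more accurate), so it drops off the frontier, while the other three each offer a trade-off no other entry beats.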

[CV-225] Digital Elevation Model Estimation from RGB Satellite Imagery using Generative Deep Learning

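[Quick read]: This paper addresses the difficulty of obtaining accurate digital elevation models (DEMs) in resource-constrained settings, where conventional methods such as LiDAR and photogrammetry depend on data that is often unavailable. The solution uses generative deep learning, specifically a conditional Generative Adversarial Network (GAN), to reconstruct DEMs from freely available RGB satellite imagery. The study builds a global dataset of 12,000 RGB-DEM pairs and a preprocessing pipeline that selects high-quality, cloud-free regions and aggregates normalized RGB composites. Training proceeds in two stages: first on the full dataset, then fine-tuning on high-quality samples filtered by Structural Similarity Index Measure (SSIM) to improve performance on challenging terrain. The approach offers a low-cost, scalable route to DEM generation, while showing that generalization across diverse terrains remains difficult.

Link: https://arxiv.org/abs/2511.21985
Authors: Alif Ilham Madani,Riska A. Kuswati,Alex M. Lechner,Muhamad Risqi U. Saputra
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: 5 pages, 4 figures, accepted at IGARSS 2025 conference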

Abstract:Digital Elevation Models (DEMs) are vital datasets for geospatial applications such as hydrological modeling and environmental monitoring. However, conventional methods to generate DEM, such as using LiDAR and photogrammetry, require specific types of data that are often inaccessible in resource-constrained settings. To alleviate this problem, this study proposes an approach to generate DEM from freely available RGB satellite imagery using generative deep learning, particularly based on a conditional Generative Adversarial Network (GAN). We first developed a global dataset consisting of 12K RGB-DEM pairs using Landsat satellite imagery and NASA’s SRTM digital elevation data, both from the year 2000. A unique preprocessing pipeline was implemented to select high-quality, cloud-free regions and aggregate normalized RGB composites from Landsat imagery. Additionally, the model was trained in a two-stage process, where it was first trained on the complete dataset and then fine-tuned on high-quality samples filtered by Structural Similarity Index Measure (SSIM) values to improve performance on challenging terrains. The results demonstrate promising performance in mountainous regions, achieving an overall mean root-mean-square error (RMSE) of 0.4671 and a mean SSIM score of 0.2065 (scale -1 to 1), while highlighting limitations in lowland and residential areas. This study underscores the importance of meticulous preprocessing and iterative refinement in generative modeling for DEM generation, offering a cost-effective and adaptive alternative to conventional methods while emphasizing the challenge of generalization across diverse terrains worldwide.
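
Illustrative sketch (not from the paper): the abstract reports RMSE and SSIM and describes filtering samples by SSIM for the second training stage, without giving details. The example below computes RMSE and a simplified single-window SSIM (standard SSIM averages over local windows), then keeps samples above a threshold. It assumes the filter compares a first-stage prediction against the reference DEM; that assumption, the 0.5 threshold, and all names are hypothetical.

```python
import numpy as np

def rmse(pred, target):
    """Root-mean-square error between two arrays of the same shape."""
    return float(np.sqrt(np.mean((pred - target) ** 2)))

def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM over the whole image; ranges from -1 to 1."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return float(((2 * mx * my + c1) * (2 * cov + c2))
                 / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2)))

def select_for_stage_two(pairs, threshold=0.5):
    """Indices of (prediction, reference) pairs whose SSIM reaches the threshold."""
    return [i for i, (pred, ref) in enumerate(pairs)
            if global_ssim(pred, ref) >= threshold]

rng = np.random.default_rng(0)
reference = rng.random((8, 8))   # stand-in for a normalised DEM tile
good_pred = reference.copy()     # perfect prediction
bad_pred = 1.0 - reference       # inverted terrain, anti-correlated with the reference

selected = select_for_stage_two([(good_pred, reference), (bad_pred, reference)])
```

The inverted tile has negative covariance with the reference, so its SSIM is negative and it is filtered out.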

[CV-226] Comparing SAM 2 and SAM 3 for Zero-Shot Segmentation of 3D Medical Data

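[Quick read]: This paper compares SAM 2 and SAM 3 for zero-shot segmentation of 3D medical volumes and videos, and assesses whether SAM 3 can replace SAM 2 out of the box without customization. The key is a controlled comparison: concept-level prompting is disabled and only visual prompts are used (single-click, multi-click, bounding box, and dense mask). Across 16 public datasets (CT, MRI, 3D and cine ultrasound, endoscopy), preprocessing, prompt placement, propagation rules, and metric computation are standardized so that prompt interpretation can be separated from propagation. The results show that SAM 3 gives stronger initial segmentations on complex, vascular, and soft-tissue structures and is more versatile overall, making it the preferred default for most medical segmentation tasks.

Link: https://arxiv.org/abs/2511.21926
Authors: Satrajit Chakrabarty,Ravi Soni
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: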

Abstract:Foundation models for promptable segmentation, including SAM, SAM 2, and the recently released SAM 3, have renewed interest in zero-shot segmentation of medical imaging. Although these models perform strongly on natural images, their behavior on medical data remains insufficiently characterized. While SAM 2 is widely used for annotation in 3D medical workflows, SAM 3 introduces a new perception backbone, detector-tracker pipeline, and concept-level prompting that may alter its behavior under spatial prompts. We present the first controlled comparison of SAM 2 and SAM 3 for zero-shot segmentation of 3D medical volumes and videos under purely visual prompting, with concept mechanisms disabled. We assess whether SAM 3 can serve as an out-of-the-box replacement for SAM 2 without customization. We benchmark both models on 16 public datasets (CT, MRI, 3D and cine ultrasound, endoscopy) covering 54 anatomical structures, pathologies, and surgical instruments. Prompts are restricted to the first frame and use four modes: single-click, multi-click, bounding box, and dense mask. This design standardizes preprocessing, prompt placement, propagation rules, and metric computation to disentangle prompt interpretation from propagation. Prompt-frame analysis shows that SAM 3 provides substantially stronger initialization than SAM 2 for click prompting across most structures. In full-volume analysis, SAM 3 retains this advantage for complex, vascular, and soft-tissue anatomies, emerging as the more versatile general-purpose segmenter. While SAM 2 remains competitive for compact, rigid organs under strong spatial guidance, it frequently fails on challenging targets where SAM 3 succeeds. Overall, our results suggest that SAM 3 is the superior default choice for most medical segmentation tasks, particularly those involving sparse user interaction or complex anatomical topology.
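
Illustrative sketch (not from the paper): benchmarks of promptable segmenters usually derive the visual prompts from the ground-truth mask of the prompt frame. The abstract does not give the paper's placement rules, so the two helpers below are only one plausible convention: a tight bounding box, and a single click on the foreground pixel nearest the mask centroid.

```python
import numpy as np

def bounding_box(mask):
    """Tight box around the foreground as (row_min, col_min, row_max, col_max)."""
    rows, cols = np.nonzero(mask)
    return int(rows.min()), int(cols.min()), int(rows.max()), int(cols.max())

def single_click(mask):
    """Foreground pixel closest to the mask centroid, as (row, col)."""
    rows, cols = np.nonzero(mask)
    cy, cx = rows.mean(), cols.mean()
    idx = int(np.argmin((rows - cy) ** 2 + (cols - cx) ** 2))
    return int(rows[idx]), int(cols[idx])

# Toy 6x6 first-frame mask with a 3x3 foreground block.
mask = np.zeros((6, 6), dtype=bool)
mask[1:4, 2:5] = True

box = bounding_box(mask)
click = single_click(mask)
```

Snapping the click to an actual foreground pixel matters for concave or ring-shaped structures, where the raw centroid can fall outside the object.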

[CV-227] LAYER: A Quantitative Explainable AI Framework for Decoding Tissue-Layer Drivers of Myofascial Low Back Pain

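[Quick read]: This paper addresses the poorly defined tissue-level drivers of myofascial pain (MP) and the lack of reliable imaging biomarkers. Prior work focuses mainly on muscle and overlooks fascia, fat, and other soft tissues that play key biomechanical roles. The solution is LAYER (Layer-wise Analysis for Yielding Explainable Relevance Tissue), an anatomically grounded explainable AI framework that analyses six soft-tissue layers in a cohort of over 4,000 multimodal 3D ultrasound scans and quantifies each layer's contribution to MP prediction. In B-mode imaging, the deep fascial membrane shows the highest saliency (0.420); with B-mode and shear-wave images combined, the collective saliency of non-muscle layers (0.316) nearly matches that of muscle (0.317). This challenges the muscle-centric paradigm and points to new tissue targets for therapy.

Link: https://arxiv.org/abs/2511.21767
Authors: Zixue Zeng,Anthony M. Perti,Tong Yu,Grant Kokenberger,Hao-En Lu,Jing Wang,Xin Meng,Zhiyu Sheng,Maryam Satarpour,John M. Cormack,Allison C. Bean,Ryan P. Nussbaum,Emily Landis-Walkenhorst,Kang Kim,Ajay D. Wasan,Jiantao Pu
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Tissues and Organs (q-bio.TO)
Comments: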

Abstract:Myofascial pain (MP) is a leading cause of chronic low back pain, yet its tissue-level drivers remain poorly defined and lack reliable image biomarkers. Existing studies focus predominantly on muscle while neglecting fascia, fat, and other soft tissues that play integral biomechanical roles. We developed an anatomically grounded explainable artificial intelligence (AI) framework, LAYER (Layer-wise Analysis for Yielding Explainable Relevance Tissue), that analyses six tissue layers in three-dimensional (3D) ultrasound and quantifies their contribution to MP prediction. By utilizing the largest multi-model 3D ultrasound cohort consisting of over 4,000 scans, LAYER reveals that non-muscle tissues contribute substantially to pain prediction. In B-mode imaging, the deep fascial membrane (DFM) showed the highest saliency (0.420), while in combined B-mode and shear-wave images, the collective saliency of non-muscle layers (0.316) nearly matches that of muscle (0.317), challenging the conventional muscle-centric paradigm in MP research and potentially affecting the therapy methods. LAYER establishes a quantitative, interpretable framework for linking layer-specific anatomy to pain physiology, uncovering new tissue targets and providing a generalizable approach for explainable analysis of soft-tissue imaging.
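
Illustrative sketch (not from the paper): the abstract reports per-layer saliency values but not how LAYER computes them. Assuming a saliency map and a tissue-layer label map of the same shape are already available, the aggregation step could look like the following; the function name and toy arrays are hypothetical.

```python
import numpy as np

def layer_saliency_share(saliency, layer_labels, n_layers):
    """Fraction of total saliency that falls inside each labelled tissue layer."""
    total = saliency.sum()
    return [float(saliency[layer_labels == k].sum() / total)
            for k in range(n_layers)]

# Toy 2x2 saliency map and a matching label map with three layers (0, 1, 2).
saliency = np.array([[1.0, 1.0],
                     [2.0, 4.0]])
labels = np.array([[0, 0],
                   [1, 2]])

shares = layer_saliency_share(saliency, labels, 3)
```

Normalising by the total makes the shares sum to one, so layers can be compared directly even when the raw saliency scale differs between scans.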

Artificial Intelligence

[AI-0] Thinking by Doing: Building Efficient World Model Reasoning in LLMs via Multi-turn Interaction

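[Quick read]: This paper addresses the limited efficiency and flexibility of world model reasoning in large language model (LLM) agents operating in complex environments. Existing methods usually impose a structured reasoning process, which limits active learning and leads to over-reliance on environmental feedback and redundant interaction. The solution is WMAct (world-model internalization through efficient interaction and active reasoning), built on two mechanisms: (1) a reward rescaling mechanism that adjusts the outcome reward according to action efficacy, encouraging the model to cut redundancy and interact purposefully; and (2) an interaction frequency annealing strategy that progressively lowers the maximum number of interaction turns, pushing the model to condense its learning and internalize environment dynamics instead of relying on external feedback. The result is efficient and transferable world model reasoning.

Link: https://arxiv.org/abs/2511.23476
Authors: Bao Shu,Yan Cai,Jianjian Sun,Chunrui Han,En Yu,Liang Zhao,Jingcheng Hu,Yinmin Zhang,Haoran Lv,Yuang Peng,Zheng Ge,Xiangyu Zhang,Daxin Jiang,Xiangyu Yue
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 17 pages, 9 figures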

Abstract:Developing robust world model reasoning is crucial for large language model (LLM) agents to plan and interact in complex environments. While multi-turn interaction offers a superior understanding of environmental dynamics via authentic feedback, current approaches often impose a rigid reasoning process, which constrains the model’s active learning, ultimately hindering efficient world model reasoning. To address these issues, we explore world-model internalization through efficient interaction and active reasoning (WMAct), which liberates the model from structured reasoning, allowing the model to shape thinking directly through its doing, and achieves effective and efficient world model reasoning with two key mechanisms: (1) a reward rescaling mechanism adjusting outcome reward based on action efficacy to incentivize redundancy reduction and purposeful interaction; (2) an interaction frequency annealing strategy to progressively reduce the maximum allowed interaction turns, which compels the model to condense its learning and internalize environmental dynamics rather than over-relying on environmental cues. Our experiments on Sokoban, Maze, and Taxi show that WMAct yields effective world model reasoning capable of resolving tasks in a single turn that previously required multiple interactions and fosters strong transferability to complex environments, improving performance on a suite of reasoning benchmarks.
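
Illustrative sketch (not from the paper): the abstract names two mechanisms, reward rescaling by action efficacy and annealing of the maximum interaction turns, without giving formulas. The functions below are one hypothetical way to realise each: a linear schedule for the turn budget, and an outcome reward scaled by the share of actions that had an effect. The schedule shape, the start and end budgets, and the efficacy measure are all assumptions.

```python
def max_turns(step, total_steps, start_turns=8, end_turns=1):
    """Linearly anneal the per-episode interaction budget (hypothetical schedule)."""
    frac = min(step / total_steps, 1.0)
    return max(end_turns, round(start_turns - frac * (start_turns - end_turns)))

def rescaled_reward(outcome_reward, effective_actions, total_actions):
    """Scale the outcome reward by the share of effective actions (hypothetical form)."""
    if total_actions == 0:
        return 0.0
    return outcome_reward * effective_actions / total_actions

budget_start = max_turns(0, 100)          # full budget at the start of training
budget_end = max_turns(100, 100)          # single turn at the end
reward = rescaled_reward(1.0, 3, 4)       # one of four actions was redundant
```

Under this form, an episode that solves the task with redundant moves earns less than one that solves it with only purposeful moves, and the shrinking budget removes the option of probing the environment repeatedly.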

[AI-1] The Price of Progress: Algorithmic Efficiency and the Falling Cost of AI Inference

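[Quick read]: This paper argues that progress on advanced language model benchmarks may be overstated, because much of it comes from more expensive models and ignores capability per unit cost. To correct this, the authors assemble what they describe as the largest dataset to date of current and historical benchmark prices (from Artificial Analysis and Epoch AI) and relate performance to cost for frontier models on knowledge, reasoning, math, and software engineering tasks. They find that the price of reaching a given benchmark score falls by roughly 5× to 10× per year, with algorithmic efficiency contributing about 3× per year. The paper recommends that evaluators publish and account for benchmarking cost to measure real-world impact more accurately.

Link: https://arxiv.org/abs/2511.23455
Authors: Hans Gundlach,Jayson Lynch,Matthias Mertens,Neil Thompson
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: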

Abstract:Language models have seen enormous progress on advanced benchmarks in recent years, but much of this progress has only been possible by using more costly models. Benchmarks may therefore present a warped picture of progress in practical capabilities per dollar. To remedy this, we use data from Artificial Analysis and Epoch AI to form the largest dataset of current and historical prices to run benchmarks to date. We find that the price for a given level of benchmark performance has decreased remarkably fast, around 5× to 10× per year, for frontier models on knowledge, reasoning, math, and software engineering benchmarks. These reductions in the cost of AI inference are due to economic forces, hardware efficiency improvements, and algorithmic efficiency improvements. Isolating out open models to control for competition effects and dividing by hardware price declines, we estimate that algorithmic efficiency progress is around 3× per year. Finally, we recommend that evaluators both publicize and take into account the price of benchmarking as an essential part of measuring the real-world impact of AI.
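
Illustrative sketch (not from the paper): the abstract expresses progress as a multiplicative price decline per year and obtains the algorithmic share by "dividing by hardware price declines". The arithmetic behind both steps is shown below. The prices and factors are placeholders chosen to make the numbers round; they are not the paper's data.

```python
def annual_decline_factor(price_start, price_end, years):
    """Constant yearly factor r such that price_end = price_start / r**years."""
    return (price_start / price_end) ** (1.0 / years)

def residual_factor(total_factor, hardware_factor):
    """Yearly decline left after dividing out the hardware price decline."""
    return total_factor / hardware_factor

# Placeholder numbers for illustration only; NOT taken from the paper.
factor = annual_decline_factor(price_start=100.0, price_end=1.0, years=2)
algorithmic = residual_factor(total_factor=6.0, hardware_factor=2.0)
```

A 100-fold price drop over two years corresponds to 10× per year, because the yearly factors multiply. For the same reason the decomposition is a division of factors, not a subtraction.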

[AI-2] ASTRO: Adaptive Stitching via Dynamics-Guided Trajectory Rollouts

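[Quick read]: This paper addresses offline reinforcement learning (offline RL) on datasets with suboptimal, fragmented trajectories, which cause poor reward propagation, biased value estimates, and degraded policies. Existing trajectory stitching methods are often confined to the support of the behavior policy or violate environment dynamics. The solution is ASTRO, which learns a temporal-distance representation to identify reachable stitch targets and uses a dynamics-guided stitch planner. The planner adaptively generates connecting action sequences via Rollout Deviation Feedback, defined as the gap between the target state sequence and the state sequence actually reached by executing the predicted actions. Stitched trajectories are therefore distributionally novel yet dynamics-consistent, giving higher-quality data augmentation and better policies.

Link: https://arxiv.org/abs/2511.23442
Authors: Hang Yu,Di Zhang,Qiwei Du,Yanping Zhao,Hai Zhang,Guang Chen,Eduardo E. Veas,Junqiao Zhao
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: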

Abstract:Offline reinforcement learning (RL) enables agents to learn optimal policies from pre-collected datasets. However, datasets containing suboptimal and fragmented trajectories present challenges for reward propagation, resulting in inaccurate value estimation and degraded policy performance. While trajectory stitching via generative models offers a promising solution, existing augmentation methods frequently produce trajectories that are either confined to the support of the behavior policy or violate the underlying dynamics, thereby limiting their effectiveness for policy improvement. We propose ASTRO, a data augmentation framework that generates distributionally novel and dynamics-consistent trajectories for offline RL. ASTRO first learns a temporal-distance representation to identify distinct and reachable stitch targets. We then employ a dynamics-guided stitch planner that adaptively generates connecting action sequences via Rollout Deviation Feedback, defined as the gap between target state sequence and the actual arrived state sequence by executing predicted actions, to improve trajectory stitching’s feasibility and reachability. This approach facilitates effective augmentation through stitching and ultimately enhances policy learning. ASTRO outperforms prior offline RL augmentation methods across various algorithms, achieving notable performance gain on the challenging OGBench suite and demonstrating consistent improvements on standard offline RL benchmarks such as D4RL.
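
Illustrative sketch (not from the paper): the abstract defines Rollout Deviation Feedback as the gap between the target state sequence and the states actually reached by executing the predicted actions. The toy below computes such a gap under an assumed additive dynamics function and a mean Euclidean distance; the paper's dynamics model, distance measure, and the way the planner consumes the feedback are not specified in the abstract.

```python
import numpy as np

def rollout(start_state, actions, dynamics):
    """Apply the actions one by one and record every state reached."""
    states, state = [], start_state
    for action in actions:
        state = dynamics(state, action)
        states.append(state)
    return np.array(states)

def rollout_deviation(target_states, start_state, actions, dynamics):
    """Mean Euclidean gap between the planned and the actually reached states."""
    reached = rollout(start_state, actions, dynamics)
    return float(np.mean(np.linalg.norm(reached - target_states, axis=-1)))

def toy_dynamics(state, action):
    """Hypothetical dynamics: the action is added to the state."""
    return state + action

start = np.zeros(2)
targets = np.array([[1.0, 0.0], [2.0, 0.0]])       # states the stitch should pass through
good_actions = np.array([[1.0, 0.0], [1.0, 0.0]])  # follows the targets exactly
bad_actions = np.array([[0.0, 1.0], [0.0, 1.0]])   # drifts away from them

good_gap = rollout_deviation(targets, start, good_actions, toy_dynamics)
bad_gap = rollout_deviation(targets, start, bad_actions, toy_dynamics)
```

A planner can use the gap as a correction signal: action sequences with a large deviation are revised or discarded, so only dynamics-consistent stitches enter the augmented dataset.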

[AI-3] Towards Continuous Intelligence Growth: Self-Training Continual Learning and Dual-Scale Memory in SuperIntelliAgent

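[Quick read]: This paper addresses the lack of autonomous learning in current large models, namely how a model can improve itself and accumulate knowledge without human annotation. The solution is SuperIntelliAgent, an agentic learning framework that pairs a trainable small diffusion model (the learner) with a frozen large language model (the verifier). The learner generates candidate outputs and the verifier evaluates them through step-by-step reasoning, producing chosen/rejected pairs for Direct Preference Optimization (DPO). A dual-scale memory (short-term in-context memory and long-term knowledge consolidation) and a replay buffer provide auxiliary supervision, so the model derives pseudo-training signals from its own reasoning and forms an adaptive curriculum without external labels.

Link: https://arxiv.org/abs/2511.23436
Authors: Jianzhe Lin,Zeyu Pan,Yun Zhu,Ruiqi Song,Jining Yang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 15 pages, 4 figures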

Abstract:We introduce SuperIntelliAgent, an agentic learning framework that couples a trainable small diffusion model (the learner) with a frozen large language model (the verifier) to enable continual intelligence growth through self-supervised interaction. Unlike conventional supervised fine-tuning, SuperIntelliAgent learns autonomously without annotation: the learner generates candidate outputs, the verifier evaluates them through step-by-step reasoning, and their interaction produces chosen/rejected pairs for Direct Preference Optimization (DPO). This converts each input into a pseudo-training signal for continual improvement. The framework integrates dual-scale memory: short-term in-context memory that preserves reasoning traces across refinement cycles, and long-term memory that consolidates acquired knowledge through lightweight on-the-fly fine-tuning. A replay buffer retains samples that show verifiable progress and replays them as auxiliary supervision, reinforcing recent learning while forming adaptive curricula. SuperIntelliAgent is infrastructure-agnostic and can be plugged into existing agentic frameworks while turning ordinary inference loops into a lifelong optimization process. We posit that pairing a trainable learner with a reasoning-capable verifier forms a minimal reliable unit of growing intelligence, as paired feedback and partial-history replay yield richer learning curricula and stronger preference alignment. With a small number of automatically generated DPO pairs, the learner improves across all benchmarks, indicating that this mechanism provides a promising direction for continual intelligence accumulation and real-world deployment.
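
Illustrative sketch (not from the paper): the abstract describes a loop in which verifier judgments become chosen/rejected pairs for DPO. The example below shows one hypothetical pairing rule (best-scored against worst-scored candidate) and the standard DPO loss written for scalar log-probabilities. The paper's learner is a diffusion model, and how the objective is adapted to it is not stated in the abstract, so the loss here is the generic form only.

```python
import math

def build_pair(candidates, verifier_scores):
    """Pair the best-scored candidate (chosen) with the worst-scored one (rejected)."""
    ranked = sorted(zip(verifier_scores, candidates), key=lambda item: item[0])
    (low_score, rejected), (high_score, chosen) = ranked[0], ranked[-1]
    if high_score == low_score:
        return None  # no preference signal when all scores tie
    return {"chosen": chosen, "rejected": rejected}

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss: -log(sigmoid(beta * difference of log-ratios))."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return math.log1p(math.exp(-margin))

# Hypothetical candidates and verifier scores.
pair = build_pair(["a", "b", "c"], [0.2, 0.9, 0.5])
neutral = dpo_loss(0.0, 0.0, 0.0, 0.0)        # policy equals reference
improved = dpo_loss(-1.0, -5.0, -2.0, -2.0)   # policy already prefers the chosen output
```

The loss falls below log 2 once the policy assigns relatively more probability to the chosen output than the frozen reference does.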

[AI-4] Evaluating LLMs for One-Shot Patching of Real and Artificial Vulnerabilities

[Quick read]: This paper addresses a gap in automated vulnerability patching research: existing work evaluates large language models (LLMs) mainly on publicly disclosed real vulnerabilities, with little systematic evaluation on artificial ones. The authors propose an empirical evaluation based on executing Proof-of-Vulnerability (PoV) tests, running each generated patch to check whether it actually fixes the vulnerability. PoV testing quantifies patching effectiveness on real and artificial vulnerabilities and exposes the overlap and complementarity between LLMs, which informs the choice of model for patching.

Link: https://arxiv.org/abs/2511.23408
Authors: Aayush Garg,Zanis Ali Khan,Renzo Degiovanni,Qiang Tang
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: Pre-print - Extended version of the poster paper accepted at the 41st ACM/SIGAPP Symposium on Applied Computing (SAC) Smarter Engineering-Building AI and Building with AI (SEAI) 2026

Abstract:Automated vulnerability patching is crucial for software security, and recent advancements in Large Language Models (LLMs) present promising capabilities for automating this task. However, existing research has primarily assessed LLMs using publicly disclosed vulnerabilities, leaving their effectiveness on related artificial vulnerabilities largely unexplored. In this study, we empirically evaluate the patching effectiveness and complementarity of several prominent LLMs, such as OpenAI’s GPT variants, LLaMA, DeepSeek, and Mistral models, using both real and artificial vulnerabilities. Our evaluation employs Proof-of-Vulnerability (PoV) test execution to concretely assess whether LLM-generated source code successfully patches vulnerabilities. Our results reveal that LLMs patch real vulnerabilities more effectively compared to artificial ones. Additionally, our analysis reveals significant variability across LLMs in terms of overlapping (multiple LLMs patching the same vulnerabilities) and complementarity (vulnerabilities patched exclusively by a single LLM), emphasizing the importance of selecting appropriate LLMs for effective vulnerability patching.
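
Illustrative sketch (not from the paper): the abstract defines overlap as vulnerabilities patched by multiple LLMs and complementarity as vulnerabilities patched exclusively by one. Given, per model, the set of vulnerabilities whose PoV test no longer triggers after patching, both quantities reduce to set operations. The model names and vulnerability IDs below are placeholders.

```python
def overlap_and_complementarity(patched_by_model):
    """Split patched vulnerabilities into shared ones and per-model exclusive ones."""
    overlap = set()
    exclusive = {model: set() for model in patched_by_model}
    all_vulns = set().union(*patched_by_model.values())
    for vuln in all_vulns:
        owners = [m for m, vulns in patched_by_model.items() if vuln in vulns]
        if len(owners) > 1:
            overlap.add(vuln)
        else:
            exclusive[owners[0]].add(vuln)
    return overlap, exclusive

# Placeholder results for illustration only; NOT taken from the paper.
patched = {
    "model_a": {"V1", "V2"},
    "model_b": {"V2", "V3"},
    "model_c": {"V2"},
}
overlap, exclusive = overlap_and_complementarity(patched)
```

A model with an empty exclusive set (here "model_c") adds nothing to an ensemble, while models with large exclusive sets are the ones worth combining.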

[AI-5] LFM2 Technical Report

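[Quick read]: This paper addresses the tension between efficiency and capability when deploying large models on edge devices with limited CPU and memory. The solution is LFM2, a family of Liquid Foundation Models obtained by hardware-in-the-loop architecture search. A compact hybrid backbone combines gated short convolutions with a small number of grouped query attention blocks, meeting edge latency and memory constraints while delivering up to 2x faster prefill and decode than similarly sized models. Training uses a tempered Top-K knowledge distillation objective, difficulty-ordered curriculum learning, and a three-stage post-training recipe. LFM2 also has multimodal and retrieval variants (vision-language, speech, retrieval), and all models are released with open weights and deployment packages.

Link: https://arxiv.org/abs/2511.23404
Authors: Alexander Amini,Anna Banaszak,Harold Benoit,Arthur Böök,Tarek Dakhran,Song Duong,Alfred Eng,Fernando Fernandes,Marc Härkönen,Anne Harrington,Ramin Hasani,Saniya Karwa,Yuri Khrustalev,Maxime Labonne,Mathias Lechner,Valentine Lechner,Simon Lee,Zetian Li,Noel Loo,Jacob Marks,Edoardo Mosca,Samuel J. Paech,Paul Pak,Rom N. Parnichkun,Alex Quach,Ryan Rogers,Daniela Rus,Nayan Saxena,Bettina Schlager,Tim Seyde,Jimmy T.H. Smith,Aditya Tadimeti,Neehal Tumma
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: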

Abstract:We present LFM2, a family of Liquid Foundation Models designed for efficient on-device deployment and strong task capabilities. Using hardware-in-the-loop architecture search under edge latency and memory constraints, we obtain a compact hybrid backbone that combines gated short convolutions with a small number of grouped query attention blocks, delivering up to 2x faster prefill and decode on CPUs compared to similarly sized models. The LFM2 family covers 350M-8.3B parameters, including dense models (350M, 700M, 1.2B, 2.6B) and a mixture-of-experts variant (8.3B total, 1.5B active), all with 32K context length. LFM2’s training pipeline includes a tempered, decoupled Top-K knowledge distillation objective that avoids support mismatch; curriculum learning with difficulty-ordered data; and a three-stage post-training recipe of supervised fine-tuning, length-normalized preference optimization, and model merging. Pre-trained on 10-12T tokens, LFM2 models achieve strong results across diverse benchmarks; for example, LFM2-2.6B reaches 79.56% on IFEval and 82.41% on GSM8K. We further build multimodal and retrieval variants: LFM2-VL for vision-language tasks, LFM2-Audio for speech, and LFM2-ColBERT for retrieval. LFM2-VL supports tunable accuracy-latency tradeoffs via token-efficient visual processing, while LFM2-Audio separates audio input and output pathways to enable real-time speech-to-speech interaction competitive with models 3x larger. LFM2-ColBERT provides a low-latency encoder for queries and documents, enabling high-performance retrieval across multiple languages. All models are released with open weights and deployment packages for ExecuTorch, this http URL, and vLLM, making LFM2 a practical base for edge applications that need fast, memory-efficient inference and strong task capabilities.

[AI-6] Hierarchical AI-Meteorologist: LLM-Agent System for Multi-Scale and Explainable Weather Forecast Reporting

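[Quick read]: This paper addresses the lack of interpretability and consistency in LLM-generated weather reports; standard approaches treat forecasts as flat time series and miss multi-scale weather changes. The solution is the Hierarchical AI-Meteorologist, which reasons hierarchically over hourly, 6-hour, and daily aggregations to capture both short-term dynamics and long-term trends. Its core reasoning agent turns structured meteorological inputs into coherent narratives and extracts keywords for the dominant weather events; these keywords serve as semantic anchors for validating consistency, temporal coherence, and factual alignment, improving interpretability and robustness.

Link: https://arxiv.org/abs/2511.23387
Authors: Daniil Sukhorukov,Andrei Zakharov,Nikita Glazkov,Katsiaryna Yanchanka,Vladimir Kirilin,Maxim Dubovitsky,Roman Sultimov,Yuri Maksimov,Ilya Makarov
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 9 pages, 4 figures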

Abstract:We present the Hierarchical AI-Meteorologist, an LLM-agent system that generates explainable weather reports using hierarchical forecast reasoning and weather keyword generation. Unlike standard approaches that treat forecasts as flat time series, our framework performs multi-scale reasoning across hourly, 6-hour, and daily aggregations to capture both short-term dynamics and long-term trends. Its core reasoning agent converts structured meteorological inputs into coherent narratives while simultaneously extracting a few keywords that summarize the dominant meteorological events. These keywords serve as semantic anchors for validating consistency, temporal coherence and factual alignment of the generated reports. Using OpenWeather and Meteostat data, we demonstrate that hierarchical context and keyword-based validation substantially improve interpretability and robustness of LLM-generated weather narratives, offering a reproducible framework for semantic evaluation of automated meteorological reporting and advancing agent-based scientific reasoning.
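
Illustrative sketch (not from the paper): the abstract describes reasoning over hourly, 6-hour, and daily aggregations of a forecast. Assuming the input is an hourly series covering whole days, the three views can be built with simple reshaping. Mean aggregation and the function name are assumptions; the abstract does not say which statistics the system uses.

```python
import numpy as np

def multi_scale(hourly):
    """Return hourly, 6-hour mean, and daily mean views of an hourly series."""
    hourly = np.asarray(hourly, dtype=float)
    if hourly.size % 24 != 0:
        raise ValueError("expected a whole number of days of hourly values")
    return {
        "hourly": hourly,
        "6h": hourly.reshape(-1, 6).mean(axis=1),
        "daily": hourly.reshape(-1, 24).mean(axis=1),
    }

# Placeholder series: 48 hourly values, 0..47, standing in for temperatures.
scales = multi_scale(np.arange(48))
```

Each coarser view can then be passed to the language model as separate context, so that a short-lived spike visible at the hourly level does not hide a multi-day trend.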

[AI-7] Agentic AI Framework for Smart Inventory Replenishment

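[Quick read]: This paper addresses retail settings where a wide product range (clothing, groceries, cosmetics, and so on) makes demand hard to predict, causes stockouts, and makes high-potential products hard to identify. The solution is an agentic AI model that combines demand forecasting, supplier selection optimization, multi-agent negotiation, and continuous learning. It monitors inventory, issues purchase requests automatically, and scans for trending or high-margin products, with reported reductions in stockouts and inventory holding costs and improved product mix turnover.

Link: https://arxiv.org/abs/2511.23366
Authors: Toqeer Ali Syed,Salman Jan,Gohar Ali,Ali Akarma,Ahmad Ali,Qurat-ul-Ain Mastoi
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Presented at International Conference on Business and Digital Technology, Bahrain, Springer Nature, 27 November 2025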

Abstract:In contemporary retail, the variety of products available (e.g. clothing, groceries, cosmetics, frozen goods) makes it difficult to predict demand, prevent stockouts, and find high-potential products. We propose an agentic AI model that monitors inventory, initiates purchase requests to the appropriate suppliers, and scans for trending or high-margin products to add. The system applies demand forecasting, supplier selection optimization, multi-agent negotiation, and continuous learning. We apply a prototype in a mid-scale mart setting, test its performance on three conventional and artificial data tables, and compare the results to baseline heuristics. Our findings indicate a decrease in stockouts, a reduction in inventory holding costs, and an improvement in product mix turnover. We discuss constraints, scalability, and prospects for improvement.
zh

[AI-8] ParaGate: Parasitic-Driven Domain Adaptation Transfer Learning for Netlist Performance Prediction

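[Quick read]: This paper addresses the problem that, in traditional electronic design automation (EDA) flows, layout-level performance metrics such as timing and power are available only after placement and routing, which hinders global optimization at earlier stages. The key to its solution is ParaGate, a framework that predicts layout-level performance directly from netlists in three steps: first, a two-phase transfer-learning approach predicts parasitic parameters, pre-training on mid-scale circuits and fine-tuning on larger ones to capture extreme conditions; second, EDA tools perform the timing analysis, offloading long-path numerical reasoning to specialized tools; third, subgraph features are used for global calibration. With little fine-tuning data the method generalizes well and improves prediction accuracy (on openE906, the arrival-time R² rises from 0.119 to 0.897), offering guidance for global optimization in the synthesis and placement stages.

Link: https://arxiv.org/abs/2511.23340
Authors: Bin Sun,Jingyi Zhou,Jianan Mu,Zhiteng Chao,Tianmeng Yang,Ziyue Xu,Jing Ye,Huawei Li
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 pages, 6 figures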

Abstract:In traditional EDA flows, layout-level performance metrics are only obtainable after placement and routing, hindering global optimization at earlier stages. Although some neural-network-based solutions predict layout-level performance directly from netlists, they often face generalization challenges due to the black-box heuristics of commercial placement-and-routing tools, which create disparate data across designs. To this end, we propose ParaGate, a three-step cross-stage prediction framework that infers layout-level timing and power from netlists. First, we propose a two-phase transfer-learning approach to predict parasitic parameters, pre-training on mid-scale circuits and fine-tuning on larger ones to capture extreme conditions. Next, we rely on EDA tools for timing analysis, offloading the long-path numerical reasoning. Finally, ParaGate performs global calibration using subgraph features. Experiments show that ParaGate achieves strong generalization with minimal fine-tuning data: on openE906, its arrival-time R² improves from 0.119 to 0.897. These results demonstrate that ParaGate could provide guidance for global optimization in the synthesis and placement stages.

[AI-9] Hard-Constrained Neural Networks with Physics-Embedded Architecture for Residual Dynamics Learning and Invariant Enforcement in Cyber-Physical Systems

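[Quick read]: This paper addresses modeling and learning for complex cyber-physical systems (CPS) that have both unknown dynamics and algebraic invariants, in particular how to obtain accurate, data-efficient, and physically consistent predictions for systems governed by differential equations. The key to its solution is a structured physics-informed learning framework: first, the Hybrid Recurrent Physics-Informed Neural Network (HRPINN) embeds known physics as a hard structural constraint within a recurrent integrator and learns only the residual dynamics; second, the Projected HRPINN (PHRPINN) adds a predict-project mechanism that explicitly enforces the algebraic invariants. The framework is backed by a theoretical analysis of its representational capacity and is evaluated on battery prognostics and on standard constrained benchmarks.

Link: https://arxiv.org/abs/2511.23307
Authors: Enzo Nicolás Spotorno,Josafat Leal Filho,Antônio Augusto Fröhlich
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 41 pages (30 pages main text + 11 pages appendices), 3 figures, 8 tables. Submitted to JMLR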

Abstract:This paper presents a framework for physics-informed learning in complex cyber-physical systems governed by differential equations with both unknown dynamics and algebraic invariants. First, we formalize the Hybrid Recurrent Physics-Informed Neural Network (HRPINN), a general-purpose architecture that embeds known physics as a hard structural constraint within a recurrent integrator to learn only residual dynamics. Second, we introduce the Projected HRPINN (PHRPINN), a novel extension that integrates a predict-project mechanism to strictly enforce algebraic invariants by design. The framework is supported by a theoretical analysis of its representational capacity. We validate HRPINN on a real-world battery prognostics DAE and evaluate PHRPINN on a suite of standard constrained benchmarks. The results demonstrate the framework’s potential for achieving high accuracy and data efficiency, while also highlighting critical trade-offs between physical consistency, computational cost, and numerical stability, providing practical guidance for its deployment.
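The predict-project idea in this abstract can be illustrated on a toy system. The sketch below is not the authors' HRPINN or PHRPINN: it assumes a unit harmonic oscillator as the known physics, a placeholder `residual` function standing in for a learned network (it returns zero here), explicit Euler as the integrator, and the energy-like invariant x² + v² = const, enforced by radial rescaling. All function names are hypothetical.

```python
import math

def known_physics(x, v):
    """Known dynamics of a unit harmonic oscillator: dx/dt = v, dv/dt = -x."""
    return v, -x

def residual(x, v):
    """Placeholder for a learned residual network (hypothetical; zero here)."""
    return 0.0, 0.0

def predict(x, v, dt):
    """One explicit Euler step on known physics plus the residual term."""
    fx, fv = known_physics(x, v)
    rx, rv = residual(x, v)
    return x + dt * (fx + rx), v + dt * (fv + rv)

def project(x, v, energy):
    """Nearest-point projection onto the invariant set x^2 + v^2 = energy."""
    scale = math.sqrt(energy) / math.hypot(x, v)
    return x * scale, v * scale

def rollout(x, v, dt, steps, use_projection):
    """Integrate for `steps` steps, optionally projecting after each prediction."""
    energy = x * x + v * v  # invariant fixed by the initial condition
    for _ in range(steps):
        x, v = predict(x, v, dt)
        if use_projection:
            x, v = project(x, v, energy)
    return x, v

if __name__ == "__main__":
    for flag in (False, True):
        x, v = rollout(1.0, 0.0, 0.1, 100, use_projection=flag)
        print(flag, x * x + v * v)
```

Without projection, explicit Euler multiplies x² + v² by (1 + dt²) at every step, so the invariant drifts; with projection it holds to floating-point precision, at the cost of one extra operation per step.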

[AI-10] Multi-Modal Scene Graph with Kolmogorov-Arnold Experts for Audio-Visual Question Answering

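[Quick read]: This paper addresses the difficulty of semantic understanding in Audio-Visual Question Answering (AVQA) caused by complex multi-modal content, in particular the failure of existing methods to capture structural information in video and their insufficiently fine-grained modeling of cross-modal features. The key to its solution is a new multi-modal scene graph that explicitly models objects and their relationships as a visually grounded, structured representation of the audio-visual scene, together with a Kolmogorov-Arnold Network (KAN)-based Mixture of Experts (MoE) that strengthens the expressive power of the temporal integration stage, enabling finer-grained modeling of cross-modal interactions and better temporal reasoning.

Link: https://arxiv.org/abs/2511.23304
Authors: Zijian Fu,Changsheng Lv,Mengshi Qi,Huadong Ma
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: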

Abstract:In this paper, we propose a novel Multi-Modal Scene Graph with Kolmogorov-Arnold Expert Network for Audio-Visual Question Answering (SHRIKE). The task aims to mimic human reasoning by extracting and fusing information from audio-visual scenes, with the main challenge being the identification of question-relevant cues from the complex audio-visual content. Existing methods fail to capture the structural information within video, and suffer from insufficient fine-grained modeling of multi-modal features. To address these issues, we are the first to introduce a new multi-modal scene graph that explicitly models the objects and their relationships as a visually grounded, structured representation of the audio-visual scene. Furthermore, we design a Kolmogorov-Arnold Network (KAN)-based Mixture of Experts (MoE) to enhance the expressive power of the temporal integration stage. This enables more fine-grained modeling of cross-modal interactions within the question-aware fused audio-visual representation, capturing richer and more nuanced patterns and thereby improving temporal reasoning performance. We evaluate the model on the established MUSIC-AVQA and MUSIC-AVQA v2 benchmarks, where it achieves state-of-the-art performance. Code and model checkpoints will be publicly released.

[AI-11] OctoMed: Data Recipes for State-of-the-Art Multimodal Medical Reasoning

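[Quick read]: This paper addresses the poor generalization and weak robustness to unseen clinical tasks that medical large language models (LLMs) suffer when training data lacks quality and diversity. The key to its solution is a data recipe built on structured reasoning traces: a high-quality, diverse training set of over 8 million examples and 6.8 billion response tokens, combined with supervised fine-tuning (SFT), lets the model adjust the length of its reasoning trajectory to the downstream task without explicit supervision, yielding more robust multimodal medical reasoning.

Link: https://arxiv.org/abs/2511.23269
Authors: Timothy Ossowski,Sheng Zhang,Qianchu Liu,Guanghui Qin,Reuben Tan,Tristan Naumann,Junjie Hu,Hoifung Poon
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: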

Abstract:High-quality and carefully curated data is a cornerstone of training medical large language models, as it directly impacts both generalization and robustness to unseen clinical tasks. We investigate strategies for training and data curation to develop a robust multimodal reasoning model in the medical domain. Our work focuses on supervised fine-tuning (SFT) and explores data recipes that leverage structured reasoning traces. Using our proposed data recipe, we scale experiments to a dataset of over 8 million examples and 6.8 billion response tokens, achieving state-of-the-art performance among open-source models across diverse out-of-distribution medical benchmark tasks. Our results further indicate that curating a high-quality, diverse training dataset with varying structured reasoning trace lengths enables the fine-tuned model to self-calibrate its reasoning trajectory lengths based on the downstream task, without explicit supervision. We present key insights, describe the data curation strategy, and outline next steps toward developing robust medical vision-language reasoning systems.

[AI-12] Adapting Like Humans: A Metacognitive Agent with Test-time Reasoning

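[Quick read]: This paper addresses the inefficient adaptation of current Vision-Language Models (VLMs) to novel tasks at test time, that is, their lack of continual learning and strategy refinement; existing methods cannot adjust dynamically the way humans do by combining metacognition with memory. The key to its solution is Metacognitive Test-Time Reasoning (MCTR), a framework with two cooperating modules: a meta-reasoning module that incrementally builds a structured memory of task-relevant rules, environmental patterns, and action-outcome relationships; and an action-reasoning module that uses context-aware perception and strategic reasoning, dynamically retrieving and integrating knowledge from memory to choose actions, and that keeps updating its policy through the proposed metacognitive test-time reinforcement learning. This two-level design lets the model learn, adapt, and improve its decisions at test time, improving generalization to unseen tasks.

Link: https://arxiv.org/abs/2511.23262
Authors: Yang Li,Zhiyuan He,Yuxuan Huang,Zhuhanling Xiao,Chao Yu,Meng Fang,Kun Shao,Jun Wang
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: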

Abstract:Recent Vision-Language Models (VLMs) exhibit strong perceptual reasoning abilities, yet they often struggle to adapt efficiently when encountering novel tasks at test time. In contrast, humans leverage a metacognitive model with memory, enabling continuous strategy refinement through metacognitive control when faced with new challenges. To bridge this gap, we propose metacognitive test-time reasoning (MCTR), a framework that equips models with the ability to learn, adapt, and improve during test time through metacognitive self-updating. Inspired by the dual structure of human metacognition, MCTR comprises meta-level and object-level VLM reasoning modules, each equipped with dedicated memory systems for hierarchical adaptive reasoning. Specifically, MCTR consists of (1) a meta-reasoning module which incrementally builds a structured memory by discovering and storing task-relevant rules, environmental patterns, and action-outcome relationships from test-time observations as natural language descriptions; and (2) an action-reasoning module that determines optimal actions through context-aware perception and strategic reasoning by dynamically retrieving and integrating knowledge from memory. The action-reasoning module continuously updates its policy through the proposed metacognitive test-time reinforcement learning, adapting as knowledge memory evolves. We evaluate MCTR on 45 Atari games (33 seen, 12 unseen). MCTR demonstrates robust test-time adaptation, achieving 9/12 top-1 results on unseen games compared with baselines. Analyses through ablations, learning dynamics, and case studies reveal the complementary contributions of both components and show meta-reasoning evolving toward human-like adaptation strategies.

[AI-13] Time Series Forecasting via Direct Per-Step Probability Distribution Modeling AAAI

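[Quick read]: This paper addresses the inability of deep-neural-network time series forecasting models to represent uncertainty, since conventional models output a scalar directly and discard information about the confidence of the prediction. The key to its solution is the interleaved dual-branch Probability Distribution Network (interPDN), which builds a discrete probability distribution at each time step instead of a single scalar, takes the expectation over a predefined support set as the regression output, and introduces interleaved support sets with a dual-branch structure in which the other branch supplies a self-supervised consistency constraint that suppresses anomalous predictions; coarse temporal-scale branches additionally support long-term trend forecasting.

Link: https://arxiv.org/abs/2511.23260
Authors: Linghao Kong,Xiaopeng Hong
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 16 pages, 8 figures. This is the preprint version of the paper and supplemental material to appear in AAAI, 2026. Please cite the final published version. Code is available at this https URL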

Abstract:Deep neural network-based time series prediction models have recently demonstrated superior capabilities in capturing complex temporal dependencies. However, it is challenging for these models to account for uncertainty associated with their predictions, because they directly output scalar values at each time step. To address such a challenge, we propose a novel model named interleaved dual-branch Probability Distribution Network (interPDN), which directly constructs discrete probability distributions per step instead of a scalar. The regression output at each time step is derived by computing the expectation of the predictive distribution on a predefined support set. To mitigate prediction anomalies, a dual-branch architecture is introduced with interleaved support sets, augmented by coarse temporal-scale branches for long-term trend forecasting. Outputs from another branch are treated as auxiliary signals to impose self-supervised consistency constraints on the current branch’s prediction. Extensive experiments on multiple real-world datasets demonstrate the superior performance of interPDN.
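The step of deriving a regression output as the expectation of a discrete distribution over a predefined support set can be sketched as follows. This is a generic illustration, not interPDN itself: the evenly spaced support grid, the softmax over raw logits, and the use of the standard deviation as an uncertainty summary are all assumptions made here, and the function names are hypothetical.

```python
import math

def make_support(low, high, bins):
    """Evenly spaced support points between low and high (assumed grid)."""
    step = (high - low) / (bins - 1)
    return [low + i * step for i in range(bins)]

def softmax(logits):
    """Numerically stable softmax turning logits into probabilities."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_value(logits, support):
    """Point forecast: expectation of the predicted distribution."""
    probs = softmax(logits)
    return sum(p * s for p, s in zip(probs, support))

def predictive_std(logits, support):
    """Spread of the predicted distribution, one simple uncertainty measure."""
    probs = softmax(logits)
    mean = sum(p * s for p, s in zip(probs, support))
    return math.sqrt(sum(p * (s - mean) ** 2 for p, s in zip(probs, support)))

if __name__ == "__main__":
    support = make_support(-1.0, 1.0, 3)
    print(expected_value([0.0, 0.0, 0.0], support))  # flat distribution
    print(predictive_std([0.0, 0.0, 0.0], support))
    print(expected_value([0.0, 0.0, 50.0], support))  # sharply peaked
```

A flat distribution and a sharply peaked one can share a similar mean while differing greatly in spread, which is the information a scalar-only output discards.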

[AI-14] AgriCoT: A Chain-of-Thought Benchmark for Evaluating Reasoning in Vision-Language Models for Agriculture

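[Quick read]: This paper addresses a key gap in evaluating Vision-Language Models (VLMs) in agriculture: existing Visual Question Answering (VQA) datasets do not adequately test logical reasoning and problem solving. The key to its solution is AgriCoT, a dataset that incorporates Chain-of-Thought (CoT) reasoning and uses 4,535 carefully curated samples to evaluate the logical reasoning and problem-solving ability of VLMs, particularly in zero-shot settings, giving a more precise measure of how they perform on complex agricultural tasks.

Link: https://arxiv.org/abs/2511.23253
Authors: Yibin Wen,Qingmei Li,Zi Ye,Jiarui Zhang,Jing Wu,Zurong Mai,Shuohong Lou,Yuhang Chen,Henglian Huang,Xiaoya Fan,Yang Zhang,Lingyuan Zhao,Haohuan Fu,Huang Jianxi,Juepeng Zheng
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: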

Abstract:Recent advancements in Vision-Language Models (VLMs) have significantly transformed various industries. In agriculture, these dual-modal capabilities offer promising applications such as precision farming, crop monitoring, pest detection, and environmental sustainability. While several Visual Question Answering (VQA) datasets and benchmarks have been developed to evaluate VLM performance, they often fail to adequately assess the critical reasoning and problem-solving skills required in complex agricultural contexts. To address this gap, we introduce AgriCoT, a VQA dataset that incorporates Chain-of-Thought (CoT) reasoning, specifically designed to evaluate the reasoning capabilities of VLMs. With 4,535 carefully curated samples, AgriCoT offers a comprehensive and robust evaluation of reasoning abilities for VLMs, particularly in zero-shot scenarios, by focusing on their capacity to engage in logical reasoning and effective problem-solving. Our evaluations, conducted with 26 representative VLMs, including both proprietary and open-source models, reveal that while some proprietary models excel at answering questions, there is a notable gap in their reasoning capabilities. This underscores the importance of incorporating CoT for more precise and effective assessments. Our dataset is available at this https URL.

[AI-15] One-Shot Secure Aggregation: A Hybrid Cryptographic Protocol for Private Federated Learning in IoT

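[Quick read]: This paper addresses the scalability bottleneck that excessive communication overhead creates for Federated Learning (FL) in Internet of Things (IoT) settings, where devices have limited bandwidth, latency budgets, and energy. Conventional secure aggregation protocols protect the privacy of model updates but usually need multiple interaction rounds, large payloads, and per-client costs that make them hard to deploy on edge devices. The key to its solution is Hyb-Agg, a lightweight, communication-efficient secure aggregation protocol that combines Multi-Key CKKS (MK-CKKS) homomorphic encryption with Elliptic Curve Diffie-Hellman (ECDH)-based additive masking. The design reduces each round of secure aggregation to a single non-interactive client-to-server transmission, keeps per-client communication independent of the number of participants, and eliminates partial decryption exchanges, while preserving strong privacy under the RLWE, CDH, and random oracle assumptions and remaining robust to collusion between the server and up to N-2 clients.

Link: https://arxiv.org/abs/2511.23252
Authors: Imraul Emmaka,Tran Viet Xuan Phuong
Affiliations: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 11 pages, 6 figures. Accepted at The 7th IEEE International Conference on Trust, Privacy and Security in Intelligent Systems, and Applications (TPS-ISA 2025)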

Abstract:Federated Learning (FL) offers a promising approach to collaboratively train machine learning models without centralizing raw data, yet its scalability is often throttled by excessive communication overhead. This challenge is magnified in Internet of Things (IoT) environments, where devices face stringent bandwidth, latency, and energy constraints. Conventional secure aggregation protocols, while essential for protecting model updates, frequently require multiple interaction rounds, large payload sizes, and per-client costs, rendering them impractical for many edge deployments. In this work, we present Hyb-Agg, a lightweight and communication-efficient secure aggregation protocol that integrates Multi-Key CKKS (MK-CKKS) homomorphic encryption with Elliptic Curve Diffie-Hellman (ECDH)-based additive masking. Hyb-Agg reduces the secure aggregation process to a single, non-interactive client-to-server transmission per round, ensuring that per-client communication remains constant regardless of the number of participants. This design eliminates partial decryption exchanges, preserves strong privacy under the RLWE, CDH, and random oracle assumptions, and maintains robustness against collusion by the server and up to N-2 clients. We implement and evaluate Hyb-Agg on both high-performance and resource-constrained devices, including a Raspberry Pi 4, demonstrating that it delivers sub-second execution times while achieving a constant communication expansion factor of approximately 12x over plaintext size. By directly addressing the communication bottleneck, Hyb-Agg enables scalable, privacy-preserving federated learning that is practical for real-world IoT deployments.
MSC classes: 94A60, 68T05, 68P25
ACM classes: C.2.4; E.3; K.6.5; I.2.6; D.4.6
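The additive-masking half of such a protocol can be sketched in a few lines. The code below is not Hyb-Agg: it omits MK-CKKS encryption entirely, and it replaces ECDH key agreement with a stand-in that simply hands each pair of clients the same random bytes. The modulus, the use of SHA-256 as a mask generator, and all function names are assumptions chosen for illustration. It shows only why pairwise masks cancel when the server sums the masked updates.

```python
import hashlib
import secrets

MODULUS = 2 ** 32  # assumed ring size for masked integer updates

def setup_pair_secrets(client_ids):
    """Stand-in for ECDH: every pair of clients ends up with one shared secret."""
    table = {c: {} for c in client_ids}
    for i in client_ids:
        for j in client_ids:
            if i < j:
                shared = secrets.token_bytes(32)
                table[i][j] = shared
                table[j][i] = shared
    return table

def pair_mask(shared_secret, round_id, length):
    """Expand a pairwise secret into a mask vector (SHA-256 used as a simple PRG)."""
    mask = []
    for k in range(length):
        seed = shared_secret + round_id.to_bytes(4, "big") + k.to_bytes(4, "big")
        mask.append(int.from_bytes(hashlib.sha256(seed).digest()[:4], "big"))
    return mask

def mask_update(client_id, update, pair_secrets, round_id):
    """Add the mask for higher-numbered peers, subtract it for lower-numbered ones."""
    masked = list(update)
    for other_id, shared in pair_secrets[client_id].items():
        sign = 1 if client_id < other_id else -1
        mask = pair_mask(shared, round_id, len(update))
        masked = [(x + sign * m) % MODULUS for x, m in zip(masked, mask)]
    return masked

def aggregate(masked_updates):
    """Server-side sum; each pairwise mask appears once with + and once with -."""
    length = len(masked_updates[0])
    return [sum(u[k] for u in masked_updates) % MODULUS for k in range(length)]

if __name__ == "__main__":
    ids = [0, 1, 2]
    table = setup_pair_secrets(ids)
    updates = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
    masked = [mask_update(c, updates[c], table, round_id=7) for c in ids]
    print(aggregate(masked))
```

Each client sends one masked vector per round, so the message size does not depend on how many peers take part, although this toy version still loops over every peer locally and has no dropout handling.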

[AI-16] GAVINA: flexible aggressive undervolting for bit-serial mixed-precision DNN acceleration

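[Quick read]: This paper addresses two obstacles to undervolting (voltage overscaling) in Deep Neural Network (DNN) acceleration: its very high error rate, and the fact that recent undervolting accelerators rely on 8-bit arithmetic and cannot compete with state-of-the-art low-precision architectures on energy efficiency. The key to its solution is Guarded Aggressive underVolting (GAV), which combines undervolting with bit-serial computation and aggressively lowers the supply voltage on a select number of least-significant-bit combinations, giving a flexible approximation method. On this basis the authors build GAVINA (GAV mIxed-precisioN Accelerator), an architecture that supports arbitrary mixed precision and flexible undervolting, reaches up to 89 TOP/s/W in its most aggressive configuration, and, according to an error model, gains 20% in energy efficiency with negligible accuracy loss on ResNet-18.

Link: https://arxiv.org/abs/2511.23203
Authors: Jordi Fornt,Pau Fontova-Musté,Adrian Gras,Omar Lahyani,Martí Caro,Jaume Abella,Francesc Moll,Josep Altet
Affiliations: unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments: Presented in the 2025 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED). Conference proceedings pending to be published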

Abstract:Voltage overscaling, or undervolting, is an enticing approximate technique in the context of energy-efficient Deep Neural Network (DNN) acceleration, given the quadratic relationship between power and voltage. Nevertheless, its very high error rate has thwarted its general adoption. Moreover, recent undervolting accelerators rely on 8-bit arithmetic and cannot compete with state-of-the-art low-precision (<8b) architectures. To overcome these issues, we propose a new technique called Guarded Aggressive underVolting (GAV), which combines the ideas of undervolting and bit-serial computation to create a flexible approximation method based on aggressively lowering the supply voltage on a select number of least significant bit combinations. Based on this idea, we implement GAVINA (GAV mIxed-precisioN Accelerator), a novel architecture that supports arbitrary mixed precision and flexible undervolting, with an energy efficiency of up to 89 TOP/s/W in its most aggressive configuration. By developing an error model of GAVINA, we show that GAV can achieve an energy efficiency boost of 20% via undervolting, with negligible accuracy degradation on ResNet-18.
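Two ideas in this abstract lend themselves to a small numeric illustration: bit-serial computation, and why errors confined to least significant bits cost little accuracy. The sketch below is not GAVINA. The additive `plane_error` fault model is a hypothetical stand-in for undervolting-induced errors, unsigned activations are assumed, and `relative_dynamic_power` uses the standard CMOS relation that dynamic power scales with the square of supply voltage.

```python
def bit_serial_dot(weights, activations, bits, plane_error=None):
    """Dot product computed one activation bit plane at a time.

    plane_error maps a bit index to an additive error injected into that
    plane's partial sum (a hypothetical fault model, for illustration only).
    """
    plane_error = plane_error or {}
    total = 0
    for b in range(bits):
        # Partial sum over the weights whose activation has bit b set.
        partial = sum(w * ((a >> b) & 1) for w, a in zip(weights, activations))
        partial += plane_error.get(b, 0)
        # Shift by the plane's significance before accumulating.
        total += partial << b
    return total

def relative_dynamic_power(voltage, nominal_voltage):
    """Dynamic power relative to nominal, using the P ~ V^2 relation."""
    return (voltage / nominal_voltage) ** 2

if __name__ == "__main__":
    w, a = [3, -2, 5], [7, 1, 12]
    print(bit_serial_dot(w, a, bits=4))                        # exact result
    print(bit_serial_dot(w, a, bits=4, plane_error={0: 1}))    # error in LSB plane
    print(bit_serial_dot(w, a, bits=4, plane_error={3: 1}))    # error in MSB plane
    print(relative_dynamic_power(0.8, 1.0))
```

A unit error in bit plane b shifts the result by 2^b, so the same fault is eight times cheaper in plane 0 than in plane 3 of a 4-bit operand, which is the intuition behind restricting aggressive voltage reduction to the low-order planes.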

[AI-17] Identification of Malicious Posts on the Dark Web Using Supervised Machine Learning

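[Quick read]: This paper addresses the difficulty current cybersecurity defenses have in coping with increasingly sophisticated and stealthy attacks, especially the lack of threat intelligence mining for non-English Dark Web content. The core challenge is to accurately identify malicious posts in text from Dark Web forums written in Brazilian Portuguese, and so improve proactive threat detection. The key to its solution is to build three original datasets, design a multi-stage labeling process that combines indicators of compromise (IoCs), contextual keywords, and manual analysis, and obtain high-accuracy detection with text representations (such as TF-IDF) and machine learning classifiers (LightGBM in particular); topic modeling on unlabeled data is then used to validate the model's outputs and confirm its robustness in real-world scenarios.

Link: https://arxiv.org/abs/2511.23183
Authors: Sebastião Alves de Jesus Filho,Gustavo Di Giovanni Bernardo,Paulo Henrique Ribeiro Gabriel,Bruno Bogaz Zarpelão,Rodrigo Sanches Miani
Affiliations: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Manuscript under review (SN Computer Science)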

Abstract:Given the constant growth and increasing sophistication of cyberattacks, cybersecurity can no longer rely solely on traditional defense techniques and tools. Proactive detection of cyber threats has become essential to help security teams identify potential risks and implement effective mitigation measures. Cyber Threat Intelligence (CTI) plays a key role by providing security analysts with evidence-based knowledge about cyber threats. CTI information can be extracted using various techniques and data sources; however, machine learning has proven promising. As for data sources, social networks and online discussion forums are commonly explored. In this study, we apply text mining techniques and machine learning to data collected from Dark Web forums in Brazilian Portuguese to identify malicious posts. Our contributions include the creation of three original datasets, a novel multi-stage labeling process combining indicators of compromise (IoCs), contextual keywords, and manual analysis, and a comprehensive evaluation of text representations and classifiers. To our knowledge, this is the first study to focus specifically on Brazilian Portuguese content in this domain. The best-performing model, using LightGBM and TF-IDF, was able to detect relevant posts with high accuracy. We also applied topic modeling to validate the model’s outputs on unlabeled data, confirming its robustness in real-world scenarios.

[AI-18] AI for software engineering: from probable to provable

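[Quick read]: This paper examines the core challenges facing the use of generative AI for programming ("vibe coding"): first, the difficulty of specifying goals (prompt engineering is in effect requirements engineering, one of the hardest disciplines of software engineering); second, hallucination, which makes the correctness of generated code hard to guarantee. Because a program is useful only if it is correct or very close to correct, relying on generative AI alone is risky. The key to the proposed solution is to combine the creativity of artificial intelligence with the rigor of formal specification methods and the power of formal program verification, supported by modern proof tools, so that productivity improves while program quality is assured.

Link: https://arxiv.org/abs/2511.23159
Authors: Bertrand Meyer
Affiliations: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: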

Abstract:Vibe coding, the much-touted use of AI techniques for programming, faces two overwhelming obstacles: the difficulty of specifying goals (“prompt engineering” is a form of requirements engineering, one of the toughest disciplines of software engineering); and the hallucination phenomenon. Programs are only useful if they are correct or very close to correct. The solution? Combine the creativity of artificial intelligence with the rigor of formal specification methods and the power of formal program verification, supported by modern proof tools.
MSC classes: 68N30, 68T20, 68T05
ACM classes: D.2.4; D.2.2; I.2.2; F.3.1

[AI-19] Peer-to-Peer Energy Trading in Dairy Farms using Multi-Agent Reinforcement Learning

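[Quick read]: This paper addresses how to achieve efficient, flexible, and sustainable distributed energy management in rural areas (such as dairy farming communities) as renewable energy resources (RER) are integrated. Traditional rule-based scheduling adapts poorly to dynamic environments and struggles to optimize electricity cost and peak load. The key to its solution is to combine Multi-Agent Reinforcement Learning (MARL) algorithms, namely Proximal Policy Optimization (PPO) and Deep Q-Networks (DQN), with community-level decentralized Peer-to-Peer (P2P) energy trading, and to add auction-based market clearing, a price advisor agent, and load and battery management. Reported results show lower electricity costs (up to 14.2% with DQN), lower peak demand (55.5% with PPO), and higher electricity sales revenue (up to 12.73% with DQN), supporting the combined use of MARL and P2P energy trading to improve the economics and flexibility of rural energy systems.

Link: https://arxiv.org/abs/2511.23148
Authors: Mian Ibad Ali Shah,Marcos Eduardo Cruz Victorio,Maeve Duffy,Enda Barrett,Karl Mason
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 51 pages, 7 figures, 11 tables, Preprint of the article published in Applied Energy: Shah, M.I.A., Victorio, M.E.C., Duffy, M., Barrett, E. and Mason, K. (2026). Peer-to-peer energy trading in dairy farms using multi-agent reinforcement learning. Applied Energy, 402, 127041. doi: https://doi.org/10.1016/j.apenergy.2025.127041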

Abstract:The integration of renewable energy resources in rural areas, such as dairy farming communities, enables decentralized energy management through Peer-to-Peer (P2P) energy trading. This research highlights the role of P2P trading in efficient energy distribution and its synergy with advanced optimization techniques. While traditional rule-based methods perform well under stable conditions, they struggle in dynamic environments. To address this, Multi-Agent Reinforcement Learning (MARL), specifically Proximal Policy Optimization (PPO) and Deep Q-Networks (DQN), is combined with community/distributed P2P trading mechanisms. By incorporating auction-based market clearing, a price advisor agent, and load and battery management, the approach achieves significant improvements. Results show that, compared to baseline models, DQN reduces electricity costs by 14.2% in Ireland and 5.16% in Finland, while increasing electricity revenue by 7.24% and 12.73%, respectively. PPO achieves the lowest peak hour demand, reducing it by 55.5% in Ireland, while DQN reduces peak hour demand by 50.0% in Ireland and 27.02% in Finland. These improvements are attributed to both MARL algorithms and P2P energy trading, which together reduce electricity cost and peak hour demand and increase electricity selling revenue. This study highlights the complementary strengths of DQN, PPO, and P2P trading in achieving efficient, adaptable, and sustainable energy management in rural communities.

[AI-20] Automated Generation of MDPs Using Logic Programming and LLMs for Robotic Applications

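[Quick read]: This paper addresses how to build, efficiently and automatically, a Markov Decision Process (MDP) for robot decision-making from natural language (NL) descriptions and produce an executable optimal policy, reducing manual modeling cost and making probabilistic planning in robotics more scalable. The key to its solution is to use Large Language Models (LLMs) to extract structured knowledge from NL text into a Prolog knowledge base, construct the MDP automatically through reachability analysis, synthesize the optimal policy with the Storm model checker, and export it as a state-action table for execution. The framework automates the path from natural language to executable policy and is validated in human-robot interaction scenarios with minimal manual effort.

Link: https://arxiv.org/abs/2511.23143
Authors: Enrico Saccon,Davide De Martini,Matteo Saveriano,Edoardo Lamon,Luigi Palopoli,Marco Roveri
Affiliations: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 9 pages, 11 figures, 2 tables, 2 algorithms, accepted for publication in IEEE Robotics and Automation Letters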

Abstract:We present a novel framework that integrates Large Language Models (LLMs) with automated planning and formal verification to streamline the creation and use of Markov Decision Processes (MDPs). Our system leverages LLMs to extract structured knowledge in the form of a Prolog knowledge base from natural language (NL) descriptions. It then automatically constructs an MDP through reachability analysis, and synthesises optimal policies using the Storm model checker. The resulting policy is exported as a state-action table for execution. We validate the framework in three human-robot interaction scenarios, demonstrating its ability to produce executable policies with minimal manual effort. This work highlights the potential of combining language models with formal methods to enable more accessible and scalable probabilistic planning in robotics.
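The middle of this pipeline (reachability analysis, policy synthesis, export as a state-action table) can be sketched without the LLM or the formal tools. In the code below, plain Python stands in for both the Prolog knowledge base and the Storm model checker, value iteration replaces model checking, and the corridor domain with its 0.8 success probability is invented for illustration. None of the names come from the paper.

```python
from collections import deque

GOAL = 3  # hypothetical goal cell of a four-cell corridor

def corridor_step(state, action):
    """Toy transition rules: returns a list of (probability, next_state)."""
    if state == GOAL:
        return []  # no action applies once the goal is reached
    if action == "move":
        return [(0.8, state + 1), (0.2, state)]
    if action == "wait":
        return [(1.0, state)]
    return []

def build_mdp(initial, actions, step):
    """Reachability analysis: breadth-first enumeration of reachable states."""
    transitions = {}
    queue = deque([initial])
    seen = {initial}
    while queue:
        state = queue.popleft()
        transitions[state] = {}
        for action in actions:
            outcomes = step(state, action)
            if not outcomes:
                continue
            transitions[state][action] = outcomes
            for _, nxt in outcomes:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return transitions

def q_value(outcomes, value):
    """Expected cost of one action: one step plus the expected cost-to-go."""
    return 1.0 + sum(p * value[n] for p, n in outcomes)

def solve(transitions, sweeps=200):
    """Value iteration minimizing expected steps; states with no applicable
    action (here only the goal) are treated as absorbing with zero cost."""
    value = {s: 0.0 for s in transitions}
    for _ in range(sweeps):
        for s, acts in transitions.items():
            if acts:
                value[s] = min(q_value(o, value) for o in acts.values())
    policy = {s: min(acts, key=lambda a: q_value(acts[a], value))
              for s, acts in transitions.items() if acts}
    return policy, value

if __name__ == "__main__":
    mdp = build_mdp(0, ["move", "wait"], corridor_step)
    policy, value = solve(mdp)
    print(policy)   # the state-action table
    print(value[0])
```

Only states reachable from the initial state are ever materialized, which is what keeps the generated MDP small relative to the full state space.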

[AI-21] Evolutionary Discovery of Heuristic Policies for Traffic Signal Control

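[Quick read]: This paper addresses a three-way trade-off in Traffic Signal Control (TSC): classic heuristics are efficient but oversimplified; Deep Reinforcement Learning (DRL) performs well but generalizes poorly and yields opaque policies; and online Large Language Models (LLMs) offer general reasoning but incur high latency and lack environment-specific optimization. The core of its solution is Temporal Policy Evolution for Traffic, which uses LLMs as an evolution engine to generate specialized heuristic policies. Its key innovations are (1) Structured State Abstraction (SSA), which converts high-dimensional traffic data into temporal-logical facts that support reasoning, and (2) Credit Assignment Feedback (CAF), which traces flawed micro-decisions to poor macro-outcomes for targeted critique and policy refinement. The framework runs entirely at the prompt level and needs no training, producing lightweight, robust policies tuned to a specific traffic environment that outperform both traditional heuristics and online LLM agents.

Link: https://arxiv.org/abs/2511.23122
Authors: Ruibing Wang,Shuhan Guo,Zeen Li,Zhen Wang,Quanming Yao
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: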

Abstract:Traffic Signal Control (TSC) involves a challenging trade-off: classic heuristics are efficient but oversimplified, while Deep Reinforcement Learning (DRL) achieves high performance yet suffers from poor generalization and opaque policies. Online Large Language Models (LLMs) provide general reasoning but incur high latency and lack environment-specific optimization. To address these issues, we propose Temporal Policy Evolution for Traffic, which uses LLMs as an evolution engine to derive specialized heuristic policies. The framework introduces two key modules: (1) Structured State Abstraction (SSA), converting high-dimensional traffic data into temporal-logical facts for reasoning; and (2) Credit Assignment Feedback (CAF), tracing flawed micro-decisions to poor macro-outcomes for targeted critique. Operating entirely at the prompt level without training, the framework yields lightweight, robust policies optimized for specific traffic environments, outperforming both heuristics and online LLM actors.

[AI-22] Fairness in the Multi-Secretary Problem AAAI’26

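[Quick read]: This paper studies the intersection of the multi-secretary problem and multi-winner elections, with a focus on achieving fairness in online decision making. After identifying the limitations of the prominent proportionality notion of Extended Justified Representation (EJR) in the online setting, it proposes a set of mechanisms that merge techniques from online algorithms with social choice rules such as the Method of Equal Shares and the Nash Rule, and supports them with theoretical analysis and extensive experiments. The key to the solution is combining fairness principles from social choice with the dynamic decision-making of online algorithms, to reach a better balance between fairness and efficiency when resources are limited and information arrives over time.

Link: https://arxiv.org/abs/2511.23097
Authors: Georgios Papasotiropoulos,Zein Pishbin
Affiliations: unknown
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
Comments: AAAI’26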

Abstract:This paper bridges two perspectives: it studies the multi-secretary problem through the fairness lens of social choice, and examines multi-winner elections from the viewpoint of online decision making. After identifying the limitations of the prominent proportionality notion of Extended Justified Representation (EJR) in the online domain, the work proposes a set of mechanisms that merge techniques from online algorithms with rules from social choice – such as the Method of Equal Shares and the Nash Rule – and supports them through both theoretical analysis and extensive experimental evaluation.

[AI-23] Does Self-Evaluation Enable Wireheading in Language Models? AAAI2026

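[Quick read]: This paper asks whether coupling self-evaluation to the reward signal in language model training induces wireheading, where an agent maximizes reward by manipulating the reward measurement rather than improving task performance. The answer turns on whether self-evaluation directly controls the learning signal: when self-grades determine the reward, models show substantial grade inflation with no matching gain in task performance; when self-evaluation is used only as feedback and does not enter the reward, no such inflation appears. The resulting design principle is to decouple self-evaluation from the learning signal to keep generative AI systems safe and effective.

Link: https://arxiv.org/abs/2511.23092
Authors: David Demitri Africa,Hans Ethan Ting
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted (oral) to Foundations of Agentic Systems Theory at AAAI 2026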

Abstract:Self-evaluation is increasingly central to language model training, from constitutional AI to self-refinement. We investigate whether coupling self-evaluation to reward signals creates incentives for wireheading, where agents manipulate reward measurements rather than improving task performance. We formalize conditions under which reward-channel control strictly dominates task-focused behavior in POMDPs and test these predictions empirically. Across two models and three tasks, we find that models whose self-grades determine rewards exhibit substantial grade inflation without corresponding accuracy gains, particularly on ambiguous tasks like summarization. Models that self-evaluate but do not control rewards show no such inflation. Our results demonstrate that self-evaluation is safe when decoupled from learning signals but dangerous when coupled, with clear implications for agentic system design.

[AI-24] MindPower: Enabling Theory-of-Mind Reasoning in VLM-based Embodied Agents

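[Quick read]: This paper addresses the lack of Theory of Mind (ToM) ability in the decision making of current vision-language embodied agents, and notes that existing benchmarks focus only on human mental states and ignore the agent's own perspective, which hinders coherent decisions and actions. The key to its solution is MindPower, a robot-centric framework that integrates Perception, Mental Reasoning, Decision Making, and Action, reasons jointly over the mental states of self and others, and introduces Mind-Reward, a new optimization objective that encourages Vision-Language Models (VLMs) to keep reasoning and behavior consistent; the model outperforms GPT-4o by 12.77% in decision making and 12.49% in action generation.

Link: https://arxiv.org/abs/2511.23055
Authors: Ruoxuan Zhang,Qiyun Zheng,Zhiyu Zhou,Ziqi Liao,Siyu Wu,Jian-Yu Jiang-Lin,Bin Wen,Hongxia Xie,Jianlong Fu,Wen-Huang Cheng
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: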

Abstract:Theory of Mind (ToM) refers to the ability to infer others’ mental states, such as beliefs, desires, and intentions. Current vision-language embodied agents lack ToM-based decision-making, and existing benchmarks focus solely on human mental states while ignoring the agent’s own perspective, hindering coherent decision and action generation. To address this, we propose MindPower, a Robot-Centric framework integrating Perception, Mental Reasoning, Decision Making and Action. Given multimodal inputs, MindPower first perceives the environment and human states, then performs ToM Reasoning to model both self and others, and finally generates decisions and actions guided by inferred mental states. Furthermore, we introduce Mind-Reward, a novel optimization objective that encourages VLMs to produce consistent ToM Reasoning and behavior. Our model outperforms GPT-4o by 12.77% in decision making and 12.49% in action generation.

[AI-25] Delta-XAI: A Unified Framework for Explaining Prediction Changes in Online Time Series Monitoring ICLR2026

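【Quick Read】: This paper addresses the limited explainability of online time series monitoring models in sensitive domains such as healthcare and finance. Most existing explainable AI (XAI) methods analyze each time step independently and ignore temporal dependencies, so prediction changes are hard to explain, online dynamics go unused, and evaluation is difficult. The key to the solution is the Delta-XAI framework, which adapts 14 existing XAI methods through a wrapper function and introduces a systematic evaluation suite for the online setting that measures aspects such as faithfulness, sufficiency, and coherence. The authors further propose Shifted Window Integrated Gradients (SWING), which incorporates past observations into the integration path to model temporal dependencies explicitly, mitigating out-of-distribution effects and improving explanation quality.

Link: https://arxiv.org/abs/2511.23036
Authors: Changhun Kim, Yechan Mun, Hyeongwon Jang, Eunseo Lee, Sangchul Hahn, Eunho Yang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Under review at ICLR 2026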

Abstract:Explaining online time series monitoring models is crucial across sensitive domains such as healthcare and finance, where temporal and contextual prediction dynamics underpin critical decisions. While recent XAI methods have improved the explainability of time series models, they mostly analyze each time step independently, overlooking temporal dependencies. This results in further challenges: explaining prediction changes is non-trivial, methods fail to leverage online dynamics, and evaluation remains difficult. To address these challenges, we propose Delta-XAI, which adapts 14 existing XAI methods through a wrapper function and introduces a principled evaluation suite for the online setting, assessing diverse aspects, such as faithfulness, sufficiency, and coherence. Experiments reveal that classical gradient-based methods, such as Integrated Gradients (IG), can outperform recent approaches when adapted for temporal analysis. Building on this, we propose Shifted Window Integrated Gradients (SWING), which incorporates past observations in the integration path to systematically capture temporal dependencies and mitigate out-of-distribution effects. Extensive experiments consistently demonstrate the effectiveness of SWING across diverse settings with respect to diverse metrics. Our code is publicly available at this https URL.
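
The abstract names Integrated Gradients (IG) as the classical baseline that SWING builds on. As a rough illustration of plain IG only (not SWING or the Delta-XAI wrapper, whose details the abstract does not give), the sketch below approximates the path integral from a baseline to the input with a midpoint Riemann sum. The toy model `f`, the zero baseline, and the finite-difference gradient are all assumptions made for illustration.

```python
from typing import Callable, List


def numerical_grad(f: Callable[[List[float]], float],
                   x: List[float], eps: float = 1e-6) -> List[float]:
    """Central-difference gradient; a stand-in for autodiff in this sketch."""
    grads = []
    for i in range(len(x)):
        hi, lo = list(x), list(x)
        hi[i] += eps
        lo[i] -= eps
        grads.append((f(hi) - f(lo)) / (2 * eps))
    return grads


def integrated_gradients(f: Callable[[List[float]], float],
                         x: List[float], baseline: List[float],
                         steps: int = 200) -> List[float]:
    """Approximate IG along the straight line from baseline to x."""
    attributions = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = numerical_grad(f, point)
        for i in range(len(x)):
            attributions[i] += grad[i] * (x[i] - baseline[i]) / steps
    return attributions


def f(x: List[float]) -> float:
    """Hypothetical toy model, not from the paper."""
    return x[0] ** 2 + 3.0 * x[1]


x = [2.0, 1.0]
baseline = [0.0, 0.0]
attr = integrated_gradients(f, x, baseline)
```

A useful sanity check for any IG implementation is the completeness property: the attributions should sum to `f(x) - f(baseline)` up to numerical error.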

[AI-26] A transfer learning approach for automatic conflicts detection in software requirement sentence pairs based on dual encoders

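【Quick Read】: This paper addresses three challenges in detecting conflicts among requirements in a Software Requirement Document (SRD): low detection accuracy on imbalanced data, limited semantic extraction caused by using a single encoder, and weak cross-domain transfer learning performance. The key to the solution is TSRCDF-SS, a transferable requirement conflict detection framework based on SBERT and SimCSE. First, two independent encoders, Sentence-BERT (SBERT) and Simple Contrastive Sentence Embedding (SimCSE), generate sentence embeddings for each requirement pair, and a six-element concatenation strategy fuses the features. Second, a two-layer fully connected feedforward neural network (FFNN) classifier is trained with a hybrid loss that combines a variant of Focal Loss, domain-specific constraints, and a confidence-based penalty term. Third, sequential and cross-domain transfer learning are integrated. Experiments report a 10.4% improvement in both macro-F1 and weighted-F1 in in-domain settings and an 11.4% increase in macro-F1 in cross-domain scenarios.

Link: https://arxiv.org/abs/2511.23007
Authors: Yizheng Wang, Tao Jiang, Jinyan Bai, Zhengbin Zou, Tiancheng Xue, Nan Zhang, Jie Luan
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 22 pages, 7 figures, 3 tables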

Abstract:Software Requirement Documents (SRDs) typically contain tens of thousands of individual requirements, and ensuring consistency among these requirements is critical for the success of software engineering projects. Automated detection methods can significantly enhance efficiency and reduce costs; however, existing approaches still face several challenges, including low detection accuracy on imbalanced data, limited semantic extraction due to the use of a single encoder, and suboptimal performance in cross-domain transfer learning. To address these issues, this paper proposes a Transferable Software Requirement Conflict Detection Framework based on SBERT and SimCSE, termed TSRCDF-SS. First, the framework employs two independent encoders, Sentence-BERT (SBERT) and Simple Contrastive Sentence Embedding (SimCSE), to generate sentence embeddings for requirement pairs, followed by a six-element concatenation strategy. Furthermore, the classifier is enhanced by a two-layer fully connected feedforward neural network (FFNN) with a hybrid loss optimization strategy that integrates a variant of Focal Loss, domain-specific constraints, and a confidence-based penalty term. Finally, the framework synergistically integrates sequential and cross-domain transfer learning. Experimental results demonstrate that the proposed framework achieves a 10.4% improvement in both macro-F1 and weighted-F1 scores in in-domain settings, and an 11.4% increase in macro-F1 in cross-domain scenarios.
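
The hybrid loss is described as including a variant of Focal Loss. The abstract does not specify the variant, so the sketch below shows only the standard binary focal loss, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), to illustrate how it down-weights easy examples on imbalanced data. The values of `gamma` and `alpha` are common defaults, assumed here rather than taken from the paper.

```python
import math


def focal_loss(p: float, y: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Standard binary focal loss for one example.

    p is the predicted probability of the positive class, y is 0 or 1.
    gamma and alpha are assumed defaults, not the paper's settings.
    """
    p_t = p if y == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    p_t = min(max(p_t, 1e-12), 1.0)             # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.95, 1)  # confident, correct prediction
hard = focal_loss(0.30, 1)  # poorly classified positive
```

With `gamma = 0` and `alpha = 1` the expression reduces to ordinary cross-entropy on positives, which makes the modulating factor easy to inspect in isolation.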

[AI-27] TIM-PRM: Verifying multimodal reasoning with Tool-Integrated PRM

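【Quick Read】: This paper addresses the visual hallucinations and logical inconsistencies that Multimodal Large Language Models (MLLMs) exhibit in mathematical reasoning, which standard outcome-based supervision fails to mitigate. Existing Process Reward Models (PRMs) typically act as scalar scorers or generative critics and suffer from sycophancy: they blindly validate flawed hypotheses instead of checking them against the visual evidence. The key to the solution is TIM-PRM (Tool-Integrated Multimodal PRM), an agentic framework that turns verification from a passive classification task into an active, tool-augmented investigation. Its core mechanism is Independent Question Asking, which queries evidence through external tools and thereby decouples verification from the reasoning context to eliminate confirmation bias. Experiments on VisualProcessBench show that the 8B-parameter model significantly outperforms much larger open-source models such as Qwen2.5-72B and InternVL-78B while offering interpretable insight into the verification process.

Link: https://arxiv.org/abs/2511.22998
Authors: Peng Kuang, Xiangxiang Wang, Wentao Liu, Jian Dong, Kaidi Xu, Haohan Wang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 14 pages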

Abstract:Multimodal Large Language Models (MLLMs) have achieved impressive performances in mathematical reasoning, yet they remain vulnerable to visual hallucinations and logical inconsistencies that standard outcome-based supervision fails to mitigate. While Process Reward Models (PRMs) promise step-by-step verification, current approaches typically operate as scalar scorers or generative critics that suffer from sycophancy, blindly validating the flawed hypotheses rather than grounding them in visual reality. To bridge this gap, we introduce TIM-PRM (Tool-Integrated Multimodal PRM), a novel agentic framework that transforms verification from a passive classification task into an active, tool-augmented investigation. TIM-PRM is trained to explicitly plan verification strategies and utilizes a mechanism of Independent Question Asking to query evidence via external tools, effectively decoupling verification from the reasoning context to eliminate confirmation bias. We instantiate this method by curating a high-quality dataset of tool-integrated verification trajectories. Extensive experiments on VisualProcessBench demonstrate that our 8B parameter model surpasses existing open-source multimodal PRMs, significantly outperforming much larger models like Qwen2.5-72B and InternVL-78B, while offering interpretable insights into the verification process.

[AI-28] Commanding Humanoid by Free-form Language: A Large Language Action Model with Unified Motion Vocabulary

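【Quick Read】: This paper addresses the challenge of having humanoid robots execute free-form language commands, that is, achieving language-conditioned whole-body control to support natural human-robot interaction, collaborative task execution, and general-purpose embodied intelligence. Existing methods are usually limited to simple instructions and trade off motion diversity against physical plausibility. The key to the solution is Humanoid-LLA, a Large Language Action Model with three core components: (1) a unified motion vocabulary that maps human and humanoid motion primitives into a shared discrete space; (2) a vocabulary-directed controller distilled from a privileged policy to ensure physical feasibility; and (3) a physics-informed fine-tuning stage that uses reinforcement learning with dynamics-aware rewards to improve robustness and stability. Evaluations in simulation and on a real Unitree G1 humanoid show strong language generalization with high physical fidelity.

Link: https://arxiv.org/abs/2511.22963
Authors: Zhirui Liu, Kaiyang Ji, Ke Yang, Jingyi Yu, Ye Shi, Jingya Wang
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Project page: this https URL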

Abstract:Enabling humanoid robots to follow free-form language commands is critical for seamless human-robot interaction, collaborative task execution, and general-purpose embodied intelligence. While recent advances have improved low-level humanoid locomotion and robot manipulation, language-conditioned whole-body control remains a significant challenge. Existing methods are often limited to simple instructions and sacrifice either motion diversity or physical plausibility. To address this, we introduce Humanoid-LLA, a Large Language Action Model that maps expressive language commands to physically executable whole-body actions for humanoid robots. Our approach integrates three core components: a unified motion vocabulary that aligns human and humanoid motion primitives into a shared discrete space; a vocabulary-directed controller distilled from a privileged policy to ensure physical feasibility; and a physics-informed fine-tuning stage using reinforcement learning with dynamics-aware rewards to enhance robustness and stability. Extensive evaluations in simulation and on a real-world Unitree G1 humanoid show that Humanoid-LLA delivers strong language generalization while maintaining high physical fidelity, outperforming existing language-conditioned controllers in motion naturalness, stability, and execution success rate.

[AI-29] Bandit Guided Submodular Curriculum for Adaptive Subset Selection

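【Quick Read】: This paper addresses the difficulty of reliably defining sample difficulty in traditional curriculum learning, which limits its practical effectiveness. The key to the solution is to recast adaptive subset selection as a multi-armed bandit problem in which each arm corresponds to a submodular function that guides sample selection. The authors propose ONLINESUBMOD, an online greedy policy that optimizes a utility-driven reward and provably achieves no-regret performance under various sampling regimes. The method outperforms traditional curriculum learning and bi-level optimization approaches on vision and language datasets, and the results indicate that validation-driven reward metrics offer a principled way to guide the curriculum schedule.

Link: https://arxiv.org/abs/2511.22944
Authors: Prateek Chanda, Prayas Agrawal, Saral Sureka, Lokesh Reddy Polu, Atharv Kshirsagar, Ganesh Ramakrishnan
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages main, 21 pages Appendix, 8 figures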

Abstract:Traditional curriculum learning proceeds from easy to hard samples, yet defining a reliable notion of difficulty remains elusive. Prior work has used submodular functions to induce difficulty scores in curriculum learning. We reinterpret adaptive subset selection and formulate it as a multi-armed bandit problem, where each arm corresponds to a submodular function guiding sample selection. We introduce ONLINESUBMOD, a novel online greedy policy that optimizes a utility-driven reward and provably achieves no-regret performance under various sampling regimes. Empirically, ONLINESUBMOD outperforms both traditional curriculum learning and bi-level optimization approaches across vision and language datasets, showing superior accuracy-efficiency tradeoffs. More broadly, we show that validation-driven reward metrics offer a principled way to guide the curriculum schedule.
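
To make the bandit framing concrete, here is a generic UCB1 loop in which each arm stands for one candidate selection function and the reward stands for a validation-driven utility. This is a textbook bandit swapped in for illustration, not the ONLINESUBMOD policy, and the reward table `HYPOTHETICAL_GAIN` is invented so the example is deterministic.

```python
import math
from typing import Callable, List


def ucb1(reward_fn: Callable[[int], float], n_arms: int, rounds: int) -> List[int]:
    """Run UCB1 and return how many times each arm was pulled."""
    pulls = [0] * n_arms
    totals = [0.0] * n_arms
    for t in range(1, rounds + 1):
        if t <= n_arms:
            arm = t - 1  # play every arm once before using the index
        else:
            arm = max(
                range(n_arms),
                key=lambda a: totals[a] / pulls[a]
                + math.sqrt(2.0 * math.log(t) / pulls[a]),
            )
        totals[arm] += reward_fn(arm)
        pulls[arm] += 1
    return pulls


# Hypothetical per-arm validation gain; in practice this would be measured
# after training on the subset chosen by the arm's selection function.
HYPOTHETICAL_GAIN = [0.1, 0.4, 0.9]
pulls = ucb1(lambda arm: HYPOTHETICAL_GAIN[arm], n_arms=3, rounds=300)
```

The exploration bonus shrinks as an arm is pulled more often, so the loop concentrates on the arm with the highest observed utility while still revisiting the others occasionally.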

[AI-30] EnECG: Efficient Ensemble Learning for Electrocardiogram Multi-task Foundation Model

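【Quick Read】: This paper addresses two problems: existing electrocardiogram (ECG) analysis models fail to exploit the relationships among different cardiac abnormalities in multi-task settings, and large foundation models are not typically pretrained on ECG data, which makes full fine-tuning or re-training computationally expensive. The key to the solution is EnECG, a Mixture of Experts (MoE)-based ensemble learning framework that integrates multiple specialized foundation models, each strong on different ECG tasks. It uses a lightweight adaptation strategy: dedicated output layers are attached to each foundation model, and Low-Rank Adaptation (LoRA) is applied only to these newly added parameters, which reduces compute and memory costs. An MoE mechanism then learns the ensemble weights so that the complementary expertise of the individual models is combined, balancing predictive performance with efficiency for clinical deployment.

Link: https://arxiv.org/abs/2511.22935
Authors: Yuhao Xu, Xiaoda Wang, Jiaying Lu, Sirui Ding, Defu Cao, Huaxiu Yao, Yan Liu, Xiao Hu, Carl Yang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: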

Abstract:Electrocardiogram (ECG) analysis plays a vital role in the early detection, monitoring, and management of various cardiovascular conditions. While existing models have achieved notable success in ECG interpretation, they fail to leverage the interrelated nature of various cardiac abnormalities. Conversely, developing a specific model capable of extracting all relevant features for multiple ECG tasks remains a significant challenge. Large-scale foundation models, though powerful, are not typically pretrained on ECG data, making full re-training or fine-tuning computationally expensive. To address these challenges, we propose EnECG (Mixture of Experts-based Ensemble Learning for ECG Multi-tasks), an ensemble-based framework that integrates multiple specialized foundation models, each excelling in different aspects of ECG interpretation. Instead of relying on a single model or single task, EnECG leverages the strengths of multiple specialized models to tackle a variety of ECG-based tasks. To mitigate the high computational cost of full re-training or fine-tuning, we introduce a lightweight adaptation strategy: attaching dedicated output layers to each foundation model and applying Low-Rank Adaptation (LoRA) only to these newly added parameters. We then adopt a Mixture of Experts (MoE) mechanism to learn ensemble weights, effectively combining the complementary expertise of individual models. Our experimental results demonstrate that by minimizing the scope of fine-tuning, EnECG can help reduce computational and memory costs while maintaining the strong representational power of foundation models. This framework not only enhances feature extraction and predictive performance but also ensures practical efficiency for real-world clinical applications. The code is available at this https URL.
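
The MoE step can be pictured as a softmax gate that turns per-expert scores into ensemble weights and then averages the experts' predictions. The sketch below is a minimal illustration of that idea under assumed inputs. The gate scores and expert outputs are hypothetical values, and how EnECG actually computes its gate is not stated in the abstract.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_combine(gate_scores: List[float],
                expert_outputs: List[List[float]]) -> List[float]:
    """Weight each expert's class probabilities by its softmax gate weight."""
    weights = softmax(gate_scores)
    n_classes = len(expert_outputs[0])
    return [
        sum(w * out[c] for w, out in zip(weights, expert_outputs))
        for c in range(n_classes)
    ]


# Hypothetical values: three experts, two classes.
gate_scores = [2.0, 0.5, -1.0]
expert_outputs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
combined = moe_combine(gate_scores, expert_outputs)
```

Because the gate weights sum to one, the combined output remains a valid probability distribution whenever each expert's output is one.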

[AI-31] AgentShield: Make MAS more secure and efficient

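【Quick Read】: This paper addresses the vulnerability of Large Language Model (LLM)-based Multi-Agent Systems (MAS) to adversarial attacks during cooperative reasoning, where compromised agents can undermine the performance of the whole system. Existing defenses either depend on a single trusted auditor, creating a single point of failure, or sacrifice efficiency for robustness. The key to the solution is AgentShield, a distributed framework for decentralized auditing that balances efficiency and robustness through three layers: Critical Node Auditing first identifies high-influence agents via topological analysis; Light Token Auditing then runs a cascade protocol with lightweight sentry models to flag anomalies quickly; and Two-Round Consensus Auditing invokes heavyweight arbiters only under uncertainty, which reduces auditing overhead while ensuring global agreement.

Link: https://arxiv.org/abs/2511.22924
Authors: Kaixiang Wang, Zhaojiacheng Zhou, Bunyod Suvonov, Jiong Lou, Jie LI
Affiliation: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: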

Abstract:Large Language Model (LLM)-based Multi-Agent Systems (MAS) offer powerful cooperative reasoning but remain vulnerable to adversarial attacks, where compromised agents can undermine the system’s overall performance. Existing defenses either depend on single trusted auditors, creating single points of failure, or sacrifice efficiency for robustness. To resolve this tension, we propose AgentShield, a distributed framework for efficient, decentralized auditing. AgentShield introduces a novel three-layer defense: (i) Critical Node Auditing prioritizes high-influence agents via topological analysis; (ii) Light Token Auditing implements a cascade protocol using lightweight sentry models for rapid discriminative verification; and (iii) Two-Round Consensus Auditing triggers heavyweight arbiters only upon uncertainty to ensure global agreement. This principled design optimizes the robustness-efficiency trade-off. Experiments demonstrate that AgentShield achieves a 92.5% recovery rate and reduces auditing overhead by over 70% compared to existing methods, maintaining high collaborative accuracy across diverse MAS topologies and adversarial scenarios.

[AI-32] Switching-time bioprocess control with pulse-width-modulated optogenetics

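【Quick Read】: This paper addresses the limited tunability of intensity-driven control in optogenetic bioprocesses: when the dose-response relationship is steep, adjusting light intensity alone makes intermediate levels of gene expression hard to reach. The key to the solution is pulse-width modulation (PWM), which alternates between fully ON and fully OFF light within each forcing period so that the averaged response is smoothed and controllability improves. The author further proposes a reinforcement learning approach in which control actions are parametrized by the duty cycle, a continuous variable. This avoids formulating the binary-input switching-time optimal control problem as a mixed-integer program whose number of decision variables grows with time-grid resolution and the number of forcing periods.

Link: https://arxiv.org/abs/2511.22893
Authors: Sebastián Espinel-Ríos
Affiliation: Unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Comments: Submitted conference paper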

Abstract:Biotechnology can benefit from dynamic control to improve production efficiency. In this context, optogenetics enables modulation of gene expression using light as an external input, allowing fine-tuning of protein levels to unlock dynamic metabolic control and regulation of cell growth. Optogenetic systems can be actuated by light intensity. However, relying solely on intensity-driven control (i.e., signal amplitude) may fail to properly tune optogenetic bioprocesses when the dose-response relationship (i.e., light intensity versus gene-expression strength) is steep. In these cases, tunability is effectively constrained to either fully active or fully repressed gene expression, with little intermediate regulation. Pulse-width modulation, a concept widely used in electronics, can alleviate this issue by alternating between fully ON and OFF light intensity within forcing periods, thereby smoothing the average response and enhancing process controllability. Naturally, optimizing pulse-width-modulated optogenetics entails a switching-time optimal control problem with a binary input over many forcing periods. While this can be formulated as a mixed-integer program on a refined time grid, the number of decision variables can grow rapidly with increasing time-grid resolution and number of forcing periods, compromising tractability. Here, we propose an alternative solution based on reinforcement learning. We parametrize control actions via the duty cycle, a continuous variable that encodes the ON-to-OFF switching time within each forcing period, thereby respecting the intrinsic binary nature of the light intensity.
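
The duty-cycle parametrization can be illustrated with a few lines of code: one continuous value per forcing period decides when the light switches from ON to OFF, and the resulting input is still strictly binary. The sketch assumes a discretized time grid with `steps_per_period` samples per period, which is an illustrative choice and not a detail from the paper.

```python
from typing import List


def pwm_signal(duty_cycles: List[float], steps_per_period: int) -> List[int]:
    """Expand per-period duty cycles into a binary ON/OFF light signal.

    Each period starts ON and switches OFF after duty * steps_per_period steps.
    """
    signal: List[int] = []
    for duty in duty_cycles:
        if not 0.0 <= duty <= 1.0:
            raise ValueError("duty cycle must lie in [0, 1]")
        on_steps = round(duty * steps_per_period)
        signal.extend([1] * on_steps + [0] * (steps_per_period - on_steps))
    return signal


# Three forcing periods with duty cycles of 25%, 50%, and 100%.
signal = pwm_signal([0.25, 0.5, 1.0], steps_per_period=8)
```

The mean of the signal over one period equals that period's duty cycle (up to grid rounding), which is the averaged input the abstract refers to.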

[AI-33] Adversarial Training for Process Reward Models

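【Quick Read】: This paper addresses two obstacles that limit Process Reward Models (PRMs) in improving the reasoning of Large Language Models (LLMs): dependence on expensive manual step-level annotation, and poor generalization of static training data to novel errors. The key to the solution is Adversarially Trained PRMs (APRM), in which a Generator (G) learns to produce reasoning errors that deceive the reward model (R) while R concurrently learns to detect them. This adversarial interaction yields progressively harder negatives, improving R's robustness and generalization to unseen errors without manual step-level labels. Averaged over several mathematical reasoning benchmarks, APRM improves solver accuracy by 3.4 percentage points over the strongest PRM baseline, and by 5.3 percentage points on out-of-distribution tasks.

Link: https://arxiv.org/abs/2511.22888
Authors: Gurusha Juneja, Deepak Nathani, William Yang Wang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: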

Abstract:Process Reward Models (PRMs) enhance reasoning ability of LLMs by providing step-level supervision. However, their widespread adoption is limited due to expensive manual step-level annotation and poor generalization of static training data to novel errors. We introduce Adversarially Trained PRMs (APRM), where a Generator (G) learns to produce reasoning errors to deceive a PRM (R), while R concurrently learns to detect them. This interaction yields progressively harder negatives for R, improving its robustness and generalization to novel errors without requiring manual step-level labels. Averaged across diverse mathematical reasoning benchmarks, APRM improves solver accuracy by +3.4 percentage points (pp) over the strongest PRM baseline. APRM achieves gains of +5.3 pp on out-of-distribution tasks.

[AI-34] InsightEval: An Expert-Curated Benchmark for Assessing Insight Discovery in LLM-Driven Data Agents

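【Quick Read】: This paper addresses the shortage of high-quality benchmarks for evaluating the insight discovery capabilities of LLM-driven data agents. Existing frameworks such as InsightBench have critical flaws, including format inconsistencies, poorly conceived objectives, and redundant insights, which undermine the reliability and comparability of evaluation. The authors analyze the shortcomings of InsightBench and propose essential criteria for a high-quality insight benchmark. On that basis they build a new dataset, InsightEval, through a data-curation pipeline, and introduce a new metric for measuring the exploratory performance of agents. The key to the solution is a structured and carefully curated benchmark that supports sounder evaluation of automated insight discovery.

Link: https://arxiv.org/abs/2511.22884
Authors: Zhenghao Zhu, Yuanfeng Song, Xin Chen, Chengzhong Liu, Yakun Cui, Caleb Chen Cao, Sirui Han, Yike Guo
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: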

Abstract:Data analysis has become an indispensable part of scientific research. To discover the latent knowledge and insights hidden within massive datasets, we need to perform deep exploratory analysis to realize their full value. With the advent of large language models (LLMs) and multi-agent systems, more and more researchers are making use of these technologies for insight discovery. However, there are few benchmarks for evaluating insight discovery capabilities. As one of the most comprehensive existing frameworks, InsightBench also suffers from many critical flaws: format inconsistencies, poorly conceived objectives, and redundant insights. These issues may significantly affect the quality of data and the evaluation of agents. To address these issues, we thoroughly investigate shortcomings in InsightBench and propose essential criteria for a high-quality insight benchmark. Regarding this, we develop a data-curation pipeline to construct a new dataset named InsightEval. We further introduce a novel metric to measure the exploratory performance of agents. Through extensive experiments on InsightEval, we highlight prevailing challenges in automated insight discovery and raise some key findings to guide future research in this promising direction.

[AI-35] Serving Heterogeneous LoRA Adapters in Distributed LLM Inference Systems

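【Quick Read】: This paper addresses the performance skew caused by differences in adapter rank when serving Low-Rank Adaptation (LoRA)-based Large Language Models (LLMs) in multi-tenant environments. Existing serving systems co-batch heterogeneous adapters without accounting for their size differences, which leaves GPU resources underutilized and ultimately requires adding more GPUs to meet Service-Level Objectives (SLOs). The key to the solution is LoRAServe, a workload-aware framework for dynamic adapter placement and routing. It rebalances adapters across GPUs at runtime and uses GPU Direct RDMA for remote access, thereby mitigating the imbalance caused by rank diversity, raising throughput, lowering tail latency, and reducing the number of GPUs required.

Link: https://arxiv.org/abs/2511.22880
Authors: Shashwat Jaiswal, Shrikara Arun, Anjaly Parayil, Ankur Mallick, Spyros Mastorakis, Alind Khare, Chloi Alverti, Renee St Amant, Chetan Bansal, Victor Rühle, Josep Torrellas
Affiliation: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: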

Abstract:Low-Rank Adaptation (LoRA) has become the de facto method for parameter-efficient fine-tuning of large language models (LLMs), enabling rapid adaptation to diverse domains. In production, LoRA-based models are served at scale, creating multi-tenant environments with hundreds of adapters sharing a base model. However, state-of-the-art serving systems co-batch heterogeneous adapters without accounting for rank (size) variability, leading to severe performance skew, which ultimately requires adding more GPUs to satisfy service-level objectives (SLOs). Existing optimizations, focused on loading, caching, and kernel execution, ignore this heterogeneity, leaving GPU resources underutilized. We present LoRAServe, a workload-aware dynamic adapter placement and routing framework designed to tame rank diversity in LoRA serving. By dynamically rebalancing adapters across GPUs and leveraging GPU Direct RDMA for remote access, LoRAServe maximizes throughput and minimizes tail latency under real-world workload drift. Evaluations on production traces from Company X show that LoRAServe elicits up to 2× higher throughput, up to 9× lower TTFT, while using up to 50% fewer GPUs under SLO constraints compared to state-of-the-art systems.
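
A quick calculation shows why rank variability matters. A LoRA adapter on a `d_out x d_in` weight matrix adds two factors, `B` (`d_out x r`) and `A` (`r x d_in`), so its parameter count grows linearly with the rank `r`. The sketch below only illustrates that scaling. The 4096-wide layer and the list of ranks are assumed values, and nothing here reflects LoRAServe's placement or routing logic.

```python
from typing import Dict


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


# Hypothetical layer width and adapter ranks, chosen for illustration.
D_IN = D_OUT = 4096
ranks = [8, 16, 64, 128]
costs: Dict[int, int] = {r: lora_params(D_IN, D_OUT, r) for r in ranks}
```

In this example a rank-128 adapter carries 16 times the parameters of a rank-8 adapter on the same layer, so a batch that mixes the two does very uneven amounts of adapter work per request.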

[AI-36] CausalProfiler: Generating Synthetic Benchmarks for Rigorous and Transparent Evaluation of Causal Machine Learning

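【Quick Read】: This paper addresses the lack of systematic, reproducible, broad-coverage benchmarking tools in causal machine learning (Causal ML). Current empirical evaluations rely on a handful of hand-crafted or semi-synthetic datasets, leading to brittle conclusions that do not generalize. The key to the solution is CausalProfiler, a synthetic benchmark generator built on explicit design choices. It randomly samples causal models, data, queries, and ground truths, producing synthetic benchmarks with coverage guarantees and transparent assumptions across the three levels of causal reasoning: observation, intervention, and counterfactual. This allows Causal ML methods to be evaluated rigorously and transparently under diverse conditions and assumptions.

Link: https://arxiv.org/abs/2511.22842
Authors: Panayiotis Panayiotou, Audrey Poinsot, Alessandro Leite, Nicolas Chesneau, Marc Schoenauer, Özgür Şimşek
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: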

Abstract:Causal machine learning (Causal ML) aims to answer “what if” questions using machine learning algorithms, making it a promising tool for high-stakes decision-making. Yet, empirical evaluation practices in Causal ML remain limited. Existing benchmarks often rely on a handful of hand-crafted or semi-synthetic datasets, leading to brittle, non-generalizable conclusions. To bridge this gap, we introduce CausalProfiler, a synthetic benchmark generator for Causal ML methods. Based on a set of explicit design choices about the class of causal models, queries, and data considered, the CausalProfiler randomly samples causal models, data, queries, and ground truths constituting the synthetic causal benchmarks. In this way, Causal ML methods can be rigorously and transparently evaluated under a variety of conditions. This work offers the first random generator of synthetic causal benchmarks with coverage guarantees and transparent assumptions operating on the three levels of causal reasoning: observation, intervention, and counterfactual. We demonstrate its utility by evaluating several state-of-the-art methods under diverse conditions and assumptions, both in and out of the identification regime, illustrating the types of analyses and insights the CausalProfiler enables.

[AI-37] Fast dynamical similarity analysis

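【Quick Read】: This paper addresses the problem that traditional similarity measures ignore the dynamical processes underlying neural representations and therefore cannot accurately compare neural circuits or models, while existing dynamical similarity methods are computationally slow. The solution is fast Dynamical Similarity Analysis (fastDSA), which has two key components. First, it automatically selects the effective model order of the Hankel (delay) embedding via a data-driven singular-value threshold that identifies the informative subspace and discards noise, lowering computational cost without sacrificing signal. Second, it introduces a new optimization objective and procedure that replace the slow exact orthogonality constraint with a lightweight process that keeps the search for a minimal distance between dynamics matrices close to the space of orthogonal transformations. FastDSA retains the accuracy and robustness of previous methods while being at least an order of magnitude faster.

Link: https://arxiv.org/abs/2511.22828
Authors: Arman Behrad, Mitchell Ostrow, Mohammad Taha Fakharian, Ila Fiete, Christian Beste, Shervin Safavi
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Comments: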

Abstract:To understand how neural systems process information, it is often essential to compare one circuit with another, one brain with another, or data with a model. Traditional similarity measures ignore the dynamical processes underlying neural representations. Dynamical similarity methods offer a framework to compare the temporal structure of dynamical systems by embedding their (possibly) nonlinear dynamics into a globally linear space and there computing conjugacy metrics. However, identifying the best embedding and computing these metrics can be computationally slow. Here we introduce fast Dynamical Similarity Analysis (fastDSA), which is computationally far more efficient than previous methods while maintaining their accuracy and robustness. FastDSA introduces two key components that boost efficiency: (1) automatic selection of the effective model order of the Hankel (delay) embedding from the data via a data-driven singular-value threshold that identifies the informative subspace and discards noise to lower computational cost without sacrificing signal, and (2) a novel optimization procedure and objective, which replaces the slow exact orthogonality constraint in finding a minimal distance between dynamics matrices with a lightweight process to keep the search close to the space of orthogonal transformations. We demonstrate that fastDSA is at least an order of magnitude faster than the previous methods. Furthermore, we demonstrate that fastDSA has the properties of its ancestor, including its invariances and sensitivities to system dynamics. FastDSA, therefore, provides a computationally efficient and accurate method for dynamical similarity analysis.
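
For readers unfamiliar with the Hankel (delay) embedding mentioned above, the sketch below builds one from a scalar time series: each row is a window of consecutive samples, shifted by one step from the previous row. It covers only the embedding step. The singular-value thresholding that fastDSA uses to pick the model order is omitted, and the scalar input and the name `n_delays` are assumptions for illustration.

```python
from typing import List


def hankel_embed(series: List[float], n_delays: int) -> List[List[float]]:
    """Return the delay-embedding matrix whose i-th row is series[i : i + n_delays]."""
    if n_delays < 1 or n_delays > len(series):
        raise ValueError("n_delays must be between 1 and len(series)")
    n_rows = len(series) - n_delays + 1
    return [series[i:i + n_delays] for i in range(n_rows)]


H = hankel_embed([1.0, 2.0, 3.0, 4.0, 5.0], n_delays=3)
```

Every anti-diagonal of the resulting matrix holds a single repeated value, which is the defining structure of a Hankel matrix.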

[AI-38] A Unified and Stable Risk Minimization Framework for Weakly Supervised Learning with Theoretical Guarantees

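【Quick Read】: This paper addresses the instability caused by indirect supervision in weakly supervised learning. Existing methods are tailored to specific supervision patterns, such as positive-unlabeled (PU), unlabeled-unlabeled (UU), complementary-label (CLL), and partial-label (PLL) learning, and rely on post-hoc corrections, which makes stability and generalization hard to guarantee. The key to the solution is a principled, unified framework that directly formulates a stable surrogate risk grounded in the structure of weakly supervised data, avoiding heuristic stabilization. The framework places diverse weakly supervised settings under a single optimization objective and derives a non-asymptotic generalization bound via Rademacher complexity that clarifies how supervision structure, model capacity, and sample size jointly govern performance. It also analyzes the effect of class-prior misspecification and gives sufficient conditions for identifiability, most notably supervision stratification across groups. Experiments show consistent gains across class priors, dataset scales, and class counts, together with robustness to overfitting.

Link: https://arxiv.org/abs/2511.22823
Authors: Miao Zhang, Junpeng Li, Changchun Hua, Yana Yang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: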

Abstract:Weakly supervised learning has emerged as a practical alternative to fully supervised learning when complete and accurate labels are costly or infeasible to acquire. However, many existing methods are tailored to specific supervision patterns – such as positive-unlabeled (PU), unlabeled-unlabeled (UU), complementary-label (CLL), partial-label (PLL), or similarity-unlabeled annotations – and rely on post-hoc corrections to mitigate instability induced by indirect supervision. We propose a principled, unified framework that bypasses such post-hoc adjustments by directly formulating a stable surrogate risk grounded in the structure of weakly supervised data. The formulation naturally subsumes diverse settings – including PU, UU, CLL, PLL, multi-class unlabeled, and tuple-based learning – under a single optimization objective. We further establish a non-asymptotic generalization bound via Rademacher complexity that clarifies how supervision structure, model capacity, and sample size jointly govern performance. Beyond this, we analyze the effect of class-prior misspecification on the bound, deriving explicit terms that quantify its impact, and we study identifiability, giving sufficient conditions – most notably via supervision stratification across groups – under which the target risk is recoverable. Extensive experiments show consistent gains across class priors, dataset scales, and class counts – without heuristic stabilization – while exhibiting robustness to overfitting.

[AI-39] AI summaries in online search influence users' attitudes

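【Quick Read】: This paper asks how AI-generated summaries, which are increasingly prominent in online search results, affect users' perceptions, attitudes, and behavioral intentions on public issues. In a preregistered randomized controlled experiment (N = 2,004), the study varies the presence (vs. absence), placement (top vs. middle), and stance (benefit-framed vs. harm-framed) of AI summaries and measures their effects on users' attitudes, behavioral intentions, and policy support. The main findings are as follows. First, exposure to an AI summary shifts users' attitudes toward the summary's stance. Second, summaries placed at the top of the page produce stronger attitude shifts, though not stronger changes in behavioral intentions or policy support. In addition, issue familiarity and general trust toward AI act as moderators, and users rate summaries that emphasize health harms as more useful. These findings show how AI-generated content can shape perceptions within the information ecosystem and provide empirical evidence for design and regulation.

Link: https://arxiv.org/abs/2511.22809
Authors: Yiwei Xu, Saloni Dash, Sungha Kang, Wang Liao, Emma S. Spiro
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: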

Abstract:This study examined how AI-generated summaries, which have become visually prominent in online search results, affect how users think about different issues. In a preregistered randomized controlled experiment, participants (N = 2,004) viewed mock search result pages varying in the presence (vs. absence), placement (top vs. middle), and stance (benefit-framed vs. harm-framed) of AI-generated summaries across four publicly debated topics. Compared to a no-summary control group, participants exposed to AI-generated summaries reported issue attitudes, behavioral intentions, and policy support that aligned more closely with the AI summary stance. The summaries placed at the top of the page produced stronger shifts in users’ issue attitudes (but not behavioral intentions or policy support) than those placed at the middle of the page. We also observed moderating effects from issue familiarity and general trust toward AI. In addition, users perceived the AI summaries as more useful when they emphasized health harms versus benefits. These findings suggest that AI-generated search summaries can significantly shape public perceptions, raising important implications for the design and regulation of AI-integrated information ecosystems.

[AI-40] The Hidden AI Race: Tracking Environmental Costs of Innovation

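【Quick Read】: This paper addresses the substantial carbon emissions produced by training and deploying generative AI models. By examining model size, repository activity and versioning frequency, task type, and organizational affiliation, the study identifies key factors behind the carbon footprint: model size and frequent version updates are strongly correlated with higher emissions, while NLP models tend to have lower carbon footprints than audio-based systems. University-driven projects show the highest emissions, followed by non-profits and companies, whereas community-driven projects show a reduction in emissions. The proposed remedy is to promote green AI practices, including energy-efficient architectures, optimized development workflows, and renewable energy sources. The authors also call for future research on how sustainability and innovation can advance together toward a more ecologically responsible AI ecosystem.

Link: https://arxiv.org/abs/2511.22781
Authors: Shyam Agarwal, Mahasweta Chakraborti
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Applications (stat.AP)
Comments: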

Abstract:The past decade has seen a massive rise in the popularity of AI systems, mainly owing to the developments in Gen AI, which has revolutionized numerous industries and applications. However, this progress comes at a considerable cost to the environment as training and deploying these models consume significant computational resources and energy and are responsible for large carbon footprints in the atmosphere. In this paper, we study the amount of carbon dioxide released by models across different domains over varying time periods. By examining parameters such as model size, repository activity (e.g., commits and repository age), task type, and organizational affiliation, we identify key factors influencing the environmental impact of AI development. Our findings reveal that model size and versioning frequency are strongly correlated with higher emissions, while domain-specific trends show that NLP models tend to have lower carbon footprints compared to audio-based systems. Organizational context also plays a significant role, with university-driven projects exhibiting the highest emissions, followed by non-profits and companies, while community-driven projects show a reduction in emissions. These results highlight the critical need for green AI practices, including the adoption of energy-efficient architectures, optimizing development workflows, and leveraging renewable energy sources. We also discuss a few practices that can lead to a more sustainable future with AI, and we end this paper with some future research directions that could be motivated by our work. This work not only provides actionable insights to mitigate the environmental impact of AI but also poses new research questions for the community to explore. By emphasizing the interplay between sustainability and innovation, our study aims to guide future efforts toward building a more ecologically responsible AI ecosystem.

[AI-41] Improving Robotic Manipulation Robustness via NICE Scene Surgery

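[Quick read]: This paper addresses the limited robustness of robotic manipulation policies to visual distractors, which in real-world scenes can significantly degrade performance and safety. The core challenge is to narrow the out-of-distribution (OOD) gap without additional robot data collection, simulator access, or custom model training. The key to the solution is Naturalistic Inpainting for Context Enhancement (NICE), a scalable framework that uses image generative models and large language models to apply three editing operations to existing demonstrations: object replacement, restyling, and removal of non-target distractors. These edits increase visual diversity while preserving spatial relationships and action-label consistency. Experiments show that NICE improves downstream performance: in highly cluttered scenes, spatial affordance prediction accuracy rises by more than 20%, manipulation success rate increases by 11% on average, target confusion drops by 6%, and collision rate drops by 7%.

Link: https://arxiv.org/abs/2511.22777
Authors: Sajjad Pakdamansavoji,Mozhgan Pourkeshavarz,Adam Sigal,Zhiyuan Li,Rui Heng Yang,Amir Rasouli
Affiliations: unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 11 figures, 3 tables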

Abstract:Learning robust visuomotor policies for robotic manipulation remains a challenge in real-world settings, where visual distractors can significantly degrade performance and safety. In this work, we propose an effective and scalable framework, Naturalistic Inpainting for Context Enhancement (NICE). Our method minimizes out-of-distribution (OOD) gap in imitation learning by increasing visual diversity through construction of new experiences using existing demonstrations. By utilizing image generative frameworks and large language models, NICE performs three editing operations, object replacement, restyling, and removal of distracting (non-target) objects. These changes preserve spatial relationships without obstructing target objects and maintain action-label consistency. Unlike previous approaches, NICE requires no additional robot data collection, simulator access, or custom model training, making it readily applicable to existing robotic datasets. Using real-world scenes, we showcase the capability of our framework in producing photo-realistic scene enhancement. For downstream tasks, we use NICE data to finetune a vision-language model (VLM) for spatial affordance prediction and a vision-language-action (VLA) policy for object manipulation. Our evaluations show that NICE successfully minimizes OOD gaps, resulting in over 20% improvement in accuracy for affordance prediction in highly cluttered scenes. For manipulation tasks, success rate increases on average by 11% when testing in environments populated with distractors in different quantities. Furthermore, we show that our method improves visual robustness, lowering target confusion by 6%, and enhances safety by reducing collision rate by 7%. 

[AI-42] CAPE: Context-Aware Diffusion Policy Via Proximal Mode Expansion for Collision Avoidance

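[Quick read]: This paper addresses insufficient coverage of trajectory distribution modes in robot imitation learning, caused by the high cost of data acquisition. In challenging tasks such as collision avoidance, data collection alone cannot adequately cover the many obstacle types and their spatial configurations. The key to the solution is Context-Aware diffusion policy via Proximal mode Expansion (CAPE), built around a novel prior-seeded iterative guided refinement procedure that introduces a context-aware prior and guidance at inference to expand the mode support of the trajectory distribution. Specifically, CAPE generates an initial trajectory and executes a short prefix, perturbs the remaining trajectory to an intermediate noise level to form a context-aware trajectory prior, and then repeats context-guided denoising to expand mode support step by step. This allows it to sample smoother, less collision-prone trajectories in unseen environments while maintaining goal consistency.

Link: https://arxiv.org/abs/2511.22773
Authors: Rui Heng Yang,Xuan Zhao,Leo Maxime Brunswic,Montgomery Alban,Mateo Clemente,Tongtong Cao,Jun Jin,Amir Rasouli
Affiliations: unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 4 tables, 9 figures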

Abstract:In robotics, diffusion models can capture multi-modal trajectories from demonstrations, making them a transformative approach in imitation learning. However, achieving optimal performance following this regimen requires a large-scale dataset, which is costly to obtain, especially for challenging tasks, such as collision avoidance. In those tasks, generalization at test time demands coverage of many obstacle types and their spatial configurations, which are impractical to acquire purely via data. To remedy this problem, we propose Context-Aware diffusion policy via Proximal mode Expansion (CAPE), a framework that expands trajectory distribution modes with context-aware prior and guidance at inference via a novel prior-seeded iterative guided refinement procedure. The framework generates an initial trajectory plan and executes a short prefix trajectory, and then the remaining trajectory segment is perturbed to an intermediate noise level, forming a trajectory prior. Such a prior is context-aware and preserves task intent. Repeating the process with context-aware guided denoising iteratively expands mode support to allow finding smoother, less collision-prone trajectories. For collision avoidance, CAPE expands trajectory distribution modes with collision-aware context, enabling the sampling of collision-free trajectories in previously unseen environments while maintaining goal consistency. We evaluate CAPE on diverse manipulation tasks in cluttered unseen simulated and real-world settings and show up to 26% and 80% higher success rates respectively compared to SOTA methods, demonstrating better generalization to unseen environments.

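[AI-43] Agentic AI Framework for Cloudburst Prediction and Coordinated Response

[Quick read]: This paper addresses the limitations of traditional weather forecasting systems when facing extreme short-duration rainfall events such as cloudbursts, which develop rapidly and are hard to predict, often leading to late warnings and responses that are disconnected from forecasts. The key to the solution is a multi-agent AI system that integrates sensing, forecasting, downscaling, hydrological modeling, and coordinated response into a closed-loop, real-time decision intelligence system. Autonomous but cooperative agents reason, sense, and act across the entire event lifecycle, turning weather-prediction intelligence into real-time decision intelligence. According to the paper, this improves forecast reliability, critical success index, and warning lead time; communication and routing agents maximize population reach and minimize evacuation errors; and an embedded learning layer provides adaptive recalibration and transparent auditability. Together these turn atmospheric data streams into actionable foresight and offer a scalable, learning-based platform for climate resilience.

Link: https://arxiv.org/abs/2511.22767
Authors: Toqeer Ali Syed,Sohail Khan,Salman Jan,Gohar Ali,Muhammad Nauman,Ali Akarma,Ahmad Ali
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Presented at International Conference on Business and Digital Technology, Bahrain, Springer Nature, 27 November 2025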

Abstract:The challenge is growing towards extreme and short-duration rainfall events like a cloudburst that are peculiar to the traditional forecasting systems, in which the predictions and the response are taken as two distinct processes. The paper outlines an agentic artificial intelligence system to study atmospheric water-cycle intelligence, which combines sensing, forecasting, downscaling, hydrological modeling and coordinated response into a single, interconnected, priceless, closed-loop system. The framework uses autonomous but cooperative agents that reason, sense, and act throughout the entire event lifecycle, and use the intelligence of weather prediction to become real-time decision intelligence. Comparison of multi-year radar, satellite, and ground-based evaluation of the northern part of Pakistan demonstrates that the multi-agent configuration enhances forecast reliability, critical success index and warning lead time compared to the baseline models. Population reach was maximised, and errors during evacuation were minimised through communication and routing agents, and adaptive recalibration and transparent auditability were provided by the embedded layer of learning. Collectively, this leads to the conclusion that collaborative AI agents are capable of transforming atmospheric data streams into practicable foresight and provide a platform of scalable adaptive and learning-based climate resilience.

[AI-44] Exact Learning of Arithmetic with Differentiable Agents NEURIPS2025

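[Quick read]: This paper addresses the difficulty gradient-based methods have in achieving exact generalization when learning algorithms, in particular weak length generalization. The key to the solution is the Differentiable Finite-State Transducer (DFST), a Turing-complete model family that supports constant-precision, constant-time generation and end-to-end log-parallel differentiable training. Trained on policy-trajectory observations from expert agents for binary and decimal addition and multiplication, DFSTs show strong length generalization: even with tiny training sets, they make error-free predictions on inputs thousands of times longer than the training examples, pointing to a new route toward exact gradient-based learning of algorithmic skills.

Link: https://arxiv.org/abs/2511.22751
Authors: Hristo Papazov,Francesco D’Angelo,Nicolas Flammarion
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: MATH-AI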

Abstract:We explore the possibility of exact algorithmic learning with gradient-based methods and introduce a differentiable framework capable of strong length generalization on arithmetic tasks. Our approach centers on Differentiable Finite-State Transducers (DFSTs), a Turing-complete model family that avoids the pitfalls of prior architectures by enabling constant-precision, constant-time generation, and end-to-end log-parallel differentiable training. Leveraging policy-trajectory observations from expert agents, we train DFSTs to perform binary and decimal addition and multiplication. Remarkably, models trained on tiny datasets generalize without error to inputs thousands of times longer than the training examples. These results show that training differentiable agents on structured intermediate supervision could pave the way towards exact gradient-based learning of algorithmic skills. Code available at this https URL.
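
For intuition about why addition is a natural target for exact length generalization, it can be written as a classical finite-state transducer whose only state is the carry bit. The sketch below is a plain, non-differentiable transducer and not the paper's DFST; the names `TRANSITIONS` and `add_binary`, and the choice to read digits from the least significant bit, are assumptions made for illustration.

```python
from itertools import zip_longest

# Transition table: (carry_in, bit_a, bit_b) -> (carry_out, output_bit).
# The carry is the transducer's only state, so the table has 2 * 2 * 2 entries.
TRANSITIONS = {
    (c, a, b): ((a + b + c) // 2, (a + b + c) % 2)
    for c in (0, 1) for a in (0, 1) for b in (0, 1)
}

def add_binary(x: str, y: str) -> str:
    """Add two binary strings by running the transducer from the least significant bit."""
    carry = 0
    out = []
    # Read both inputs right to left, padding the shorter one with zeros.
    for a, b in zip_longest(reversed(x), reversed(y), fillvalue="0"):
        carry, bit = TRANSITIONS[(carry, int(a), int(b))]
        out.append(str(bit))
    if carry:
        out.append("1")
    return "".join(reversed(out))

result = add_binary("1011", "110")  # 11 + 6 in binary
```

Because the table is fixed and independent of input length, the same eight transitions handle inputs of any size; this is the behaviour a learned model has to recover exactly in order to generalize to much longer inputs.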

[AI-45] VeriDispatcher: Multi-Model Dispatching through Pre-Inference Difficulty Prediction for RTL Generation Optimization

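[Quick read]: This paper asks how multiple Large Language Models (LLMs) can be coordinated for Register-Transfer Level (RTL) generation so that output quality improves while compute cost falls. Existing methods usually prompt or fine-tune a single model, which fails to exploit the strengths of different models on particular tasks and wastes resources. The key to the solution is VeriDispatcher, a framework that predicts task difficulty before inference and dispatches each RTL task to the most suitable subset of LLMs. Specifically, for each model a lightweight classifier is trained over semantic embeddings of task descriptions, supervised by difficulty scores that combine syntax, structural similarity, and functional correctness. At inference these predictors route tasks efficiently, maintaining or improving accuracy while reducing commercial API calls: on RTLLM, up to 18% higher accuracy using only 40% of commercial calls; on VerilogEval, unchanged accuracy with 25% less commercial usage.

Link: https://arxiv.org/abs/2511.22749
Authors: Zeng Wang,Weihua Xiao,Minghao Shao,Raghu Vamshi Hemadri,Ozgur Sinanoglu,Muhammad Shafique,Ramesh Karri
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: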

Abstract:Large Language Models (LLMs) show strong performance in RTL generation, but different models excel on different tasks because of architecture and training differences. Prior work mainly prompts or finetunes a single model. What remains not well studied is how to coordinate multiple different LLMs so they jointly improve RTL quality while also reducing cost, instead of running all models and choosing the best output. We define this as the multi-LLM RTL generation problem. We propose VeriDispatcher, a multi-LLM RTL generation framework that dispatches each RTL task to suitable LLMs based on pre-inference difficulty prediction. For each model, we train a compact classifier over semantic embeddings of task descriptions, using difficulty scores derived from benchmark variants that combine syntax, structural similarity, and functional correctness. At inference, VeriDispatcher uses these predictors to route tasks to a selected subset of LLMs. Across 10 diverse LLMs on RTLLM and VerilogEval, VeriDispatcher achieves up to 18% accuracy improvement on RTLLM using only 40% of commercial calls, and on VerilogEval maintains accuracy while reducing commercial usage by 25%, enabling cost-effective, high-quality LLM deployment in hardware design automation.
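
The routing step can be pictured as a small selection function over per-model difficulty predictors. The following is a minimal sketch under stated assumptions, not the authors' implementation: the `dispatch` function, the threshold `max_difficulty`, the subset size `k`, and the keyword-based toy predictors are all hypothetical stand-ins for the trained classifiers the abstract describes.

```python
from typing import Callable, Dict, List

# Hypothetical predictor type: maps a task description to a difficulty score in [0, 1],
# where a lower score means the model is expected to handle the task more easily.
DifficultyPredictor = Callable[[str], float]

def dispatch(task: str, predictors: Dict[str, DifficultyPredictor],
             k: int = 2, max_difficulty: float = 0.5) -> List[str]:
    """Return up to k model names whose predicted difficulty for the task is lowest."""
    scored = sorted((predict(task), name) for name, predict in predictors.items())
    chosen = [name for score, name in scored if score <= max_difficulty][:k]
    # Fall back to the single best-scoring model if none passes the threshold.
    return chosen or [scored[0][1]]

# Toy predictors standing in for classifiers trained on task embeddings.
predictors = {
    "open_small": lambda t: 0.9 if "pipeline" in t else 0.2,
    "open_large": lambda t: 0.6 if "pipeline" in t else 0.3,
    "commercial": lambda t: 0.4,
}

easy_route = dispatch("4-bit counter", predictors)
hard_route = dispatch("5-stage pipeline CPU", predictors)
```

In this toy setup the easy task stays on the two open models and only the hard task reaches the commercial model, which is the cost-saving pattern the abstract reports at benchmark scale.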

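[AI-46] Agentic AI Framework for Individuals with Disabilities and Neurodivergence: A Multi-Agent System for Healthy Eating Daily Routines and Inclusive Well-Being

[Quick read]: This paper addresses the health-management and daily-routine challenges faced by people with disabilities and neurodivergence, by building a multi-agent Agentic Artificial Intelligence framework that provides personalized, adaptive, and inclusive support. The key to the solution is a three-layer architecture (Application and Interface Layer, Agents Layer, and Data Source Layer) with a hybrid reasoning engine coordinating four specialized agents: a Meal Planner Agent, a Reminder Agent, a Food Guidance Agent, and a Monitoring Agent, which interact autonomously and exchange real-time feedback through a Blackboard/Event Bus. Privacy-sensitive data sources such as electronic health records (EHRs) and wearable sensors are placed in a policy-controlled layer to ensure compliance and safety, and explainable AI (XAI) modules are added to increase user trust and engagement, with the aim of advancing digital equity, health autonomy, and quality of life.

Link: https://arxiv.org/abs/2511.22737
Authors: Salman Jan,Toqeer Ali Syed,Gohar Ali,Ali Akarma,Mohammad Riyaz Belgaum,Ahmad Ali
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Presented at International Conference on Business and Digital Technology, Bahrain, Springer Nature, 27 November 2025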

Abstract:The paper presents a detailed Agentic Artificial Intelligence (AI) model that would enable people with disabilities and neurodivergence to lead healthier lives and have more regular days. The system will use a multi-layer structure; it will include an Application and Interface Layer, an Agents Layer, and a Data Source Layer to provide adaptive, transparent, and inclusive support. Fundamentally, a hybrid reasoning engine will synchronize four special-purpose agents, which include: a personalized-nutrition-based, called a Meal Planner Agent; an adaptive-scheduling-based, called a Reminder Agent; interactive assistance during grocery shopping and cooking, called a Food Guidance Agent; and a continuous-intake-and-physiological-tracking, called a Monitoring Agent. All the agents interact through a central communicative system called the Blackboard/Event Bus, which allows autonomous interaction and real-time feedback loops with multimedia user interfaces. Privacy-sensitive data sources, including electronic health records (EHRs), nutritional databases, wearable sensors, and smart kitchen Internet of Things, are also included in the framework and placed into a policy-controlled layer, which ensures data safety and compliance with consent. Collaborative care and clinician dashboards allow common supervision, and discussable artificial intelligence (XAI) modules give brief explanations of why a decision was made, making users responsible and reliant. The proposed agentic AI framework is an extension beyond traditional assistive systems since it incorporates inclusiveness, personalization, and accessibility at all levels. It displays the intersection of multi-agent reasoning, multi-modal interfaces, and human-centered design that will enable the development of autonomy, health, and digital equity among people with disabilities and neurodivergence.

[AI-47] Solving Context Window Overflow in AI Agents

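[Quick read]: This paper addresses the problem that Large Language Models (LLMs) cannot complete a task when long outputs returned by external tools exceed the context window. Conventional remedies such as truncation or summarization reduce length but lose key information, which makes them unsuitable for scientific workflows that need the complete data. The key to the solution is to shift the interaction between model and tools from raw data to memory pointers, so that tool responses of arbitrary length can be handled without information loss, while preserving tool functionality, remaining compatible with agentic workflows, and reducing token consumption and execution time.

Link: https://arxiv.org/abs/2511.22729
Authors: Anton Bulle Labate,Valesca Moura de Sousa,Sandro Rama Fiorini,Leonardo Guerreiro Azevedo,Raphael Melo Thiago,Viviane Torres da Silva
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: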

Abstract:Large Language Models (LLMs) have become increasingly capable of interacting with external tools, granting access to specialized knowledge beyond their training data - critical in dynamic, knowledge-intensive domains such as Chemistry and Materials Science. However, large tool outputs can overflow the LLMs’ context window, preventing task completion. Existing solutions such as truncation or summarization fail to preserve complete outputs, making them unsuitable for workflows requiring the full data. This work introduces a method that enables LLMs to process and utilize tool responses of arbitrary length without loss of information. By shifting the model’s interaction from raw data to memory pointers, the method preserves tool functionality, allows seamless integration into agentic workflows, and reduces token usage and execution time. The proposed method is validated on a real-world Materials Science application that cannot be executed with conventional workflows, and its effectiveness is demonstrated via a comparative analysis where both methods succeed. In this experiment, the proposed approach consumed approximately seven times fewer tokens than the traditional workflow.
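
One way to picture the memory-pointer idea is a store that keeps large tool outputs outside the prompt and gives the model only a short handle. The sketch below is an illustration under assumptions, not the paper's implementation: `PointerStore`, `call_tool`, the `ptr://` handle format, and the two toy tools are all hypothetical.

```python
import uuid
from typing import Any, Dict

class PointerStore:
    """Keeps large tool outputs outside the prompt and hands out short handles."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def put(self, value: Any) -> str:
        pointer = f"ptr://{uuid.uuid4().hex[:8]}"
        self._data[pointer] = value
        return pointer

    def resolve(self, arg: Any) -> Any:
        # Replace a pointer argument with the stored value; pass other arguments through.
        if isinstance(arg, str) and arg in self._data:
            return self._data[arg]
        return arg

store = PointerStore()

def call_tool(tool, *args):
    """Run a tool on dereferenced arguments and return only a pointer to its output."""
    result = tool(*(store.resolve(a) for a in args))
    return store.put(result)

# Hypothetical tools standing in for a large query and a downstream analysis step.
def fetch_records(n):
    return [{"id": i, "energy": i * 0.5} for i in range(n)]

def max_energy(records):
    return max(r["energy"] for r in records)

records_ptr = call_tool(fetch_records, 100_000)  # the model would only see this short string
answer_ptr = call_tool(max_energy, records_ptr)  # the tool receives the full list
answer = store.resolve(answer_ptr)
```

The 100,000 records never enter the model's context in this arrangement; only two 14-character handles do, while the downstream tool still operates on the complete data.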

[AI-48] CoFiRec: Coarse-to-Fine Tokenization for Generative Recommendation

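[Quick read]: This paper addresses a shortcoming of existing generative recommendation systems in modeling how user interest evolves. Conventional methods fuse heterogeneous item attributes from the user history (such as ID, category, title, and description) into a single embedding before quantization, which ignores the coarse-to-fine hierarchy of item semantics and makes it hard to capture how interest is progressively refined as users move from browsing broad categories to exploring specific items. The key to the solution is the CoFiRec framework, whose core innovation is a structured, hierarchical Coarse-to-Fine tokenization: item information is decomposed by semantic level into separate parts (such as category, description, and collaborative filtering signals), and the CoFiRec Tokenizer encodes each level independently while preserving structural order. During autoregressive decoding, the language model is guided to generate item tokens from coarse to fine, explicitly modeling the refinement of user intent. Theoretical analysis further shows that this structured tokenization lowers the dissimilarity between generated and ground-truth items.

Link: https://arxiv.org/abs/2511.22707
Authors: Tianxin Wei,Xuying Ning,Xuxing Chen,Ruizhong Qiu,Yupeng Hou,Yan Xie,Shuang Yang,Zhigang Hua,Jingrui He
Affiliations: unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: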

Abstract:In web environments, user preferences are often refined progressively as users move from browsing broad categories to exploring specific items. However, existing generative recommenders overlook this natural refinement process. Generative recommendation formulates next-item prediction as autoregressive generation over tokenized user histories, where each item is represented as a sequence of discrete tokens. Prior models typically fuse heterogeneous attributes such as ID, category, title, and description into a single embedding before quantization, which flattens the inherent semantic hierarchy of items and fails to capture the gradual evolution of user intent during web interactions. To address this limitation, we propose CoFiRec, a novel generative recommendation framework that explicitly incorporates the Coarse-to-Fine nature of item semantics into the tokenization process. Instead of compressing all attributes into a single latent space, CoFiRec decomposes item information into multiple semantic levels, ranging from high-level categories to detailed descriptions and collaborative filtering signals. Based on this design, we introduce the CoFiRec Tokenizer, which tokenizes each level independently while preserving structural order. During autoregressive decoding, the language model is instructed to generate item tokens from coarse to fine, progressively modeling user intent from general interests to specific item-level interests. Experiments across multiple public benchmarks and backbones demonstrate that CoFiRec outperforms existing methods, offering a new perspective for generative recommendation. Theoretically, we prove that structured tokenization leads to lower dissimilarity between generated and ground truth items, supporting its effectiveness in generative recommendation. Our code is available at this https URL.
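
The contrast with single-embedding tokenization can be illustrated with a toy tokenizer that keeps one vocabulary per semantic level and always emits tokens in coarse-to-fine order. This is a rough sketch, not the CoFiRec Tokenizer: the level names, the `LevelwiseTokenizer` class, and the dictionary lookup used in place of a learned quantizer are all assumptions.

```python
from typing import Dict, List, Tuple

# Levels ordered from coarse to fine; the names are illustrative, not the paper's.
LEVELS = ("category", "description", "collab")

class LevelwiseTokenizer:
    """Assigns one token per semantic level, with a separate vocabulary per level."""

    def __init__(self) -> None:
        self.vocab: Dict[str, Dict[str, int]] = {level: {} for level in LEVELS}

    def _token_id(self, level: str, value: str) -> int:
        table = self.vocab[level]
        # Stand-in for a learned quantizer: each new value gets the next free id.
        return table.setdefault(value, len(table))

    def encode(self, item: Dict[str, str]) -> List[Tuple[str, int]]:
        # Tokens are always emitted in coarse-to-fine order.
        return [(level, self._token_id(level, item[level])) for level in LEVELS]

tokenizer = LevelwiseTokenizer()
running_shoe = {"category": "shoes", "description": "trail running shoe", "collab": "cluster_7"}
boot = {"category": "shoes", "description": "leather boot", "collab": "cluster_2"}

tokens_a = tokenizer.encode(running_shoe)
tokens_b = tokenizer.encode(boot)
```

The two items share their first (category) token and diverge only at the finer levels, so an autoregressive decoder that commits to the coarse token first narrows the candidate set before choosing a specific item.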

[AI-49] Probabilistic Fusion and Calibration of Neural Speaker Diarization Models

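[Quick read]: This paper addresses the calibration and fusion of probabilistic outputs in End-to-End Neural Diarization (EEND) systems; because evaluation relies mainly on Diarization Error Rate (DER), the reliability of model confidence scores has been neglected. The key to the solution is the first comprehensive framework for calibrating and fusing EEND models at the probability level, using continuous probability outputs rather than hard decisions to model uncertainty more finely and combine the complementary strengths of multiple models. The paper compares how multilabel and powerset representations affect calibration, and finds that joint calibration outperforms independent per-speaker calibration and that the Fuse-then-Calibrate ordering generally outperforms calibrating before fusing while requiring calibration of only one combined model. On the CallHome two-speaker benchmark, the best configuration outperforms DOVER-Lap in DER and provides reliable confidence estimates for downstream applications.

Link: https://arxiv.org/abs/2511.22696
Authors: Juan Ignacio Alvarez-Trejos,Sergio A. Balanya,Daniel Ramos,Alicia Lozano-Diez
Affiliations: unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: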

Abstract:End-to-End Neural Diarization (EEND) systems produce frame-level probabilistic speaker activity estimates, yet since evaluation focuses primarily on Diarization Error Rate (DER), the reliability and calibration of these confidence scores have been largely neglected. When fusing multiple diarization systems, DOVER-Lap remains the only established approach, operating at the segment level with hard decisions. We propose working with continuous probability outputs, which enables more sophisticated calibration and fusion techniques that can leverage model uncertainty and complementary strengths across different architectures. This paper presents the first comprehensive framework for calibrating and fusing EEND models at the probability level. We investigate two output formulations (multilabel and powerset representations) and their impact on calibration and fusion effectiveness. Through extensive experiments on the CallHome two-speaker benchmark, we demonstrate that proper calibration provides substantial improvements even for individual models (up to 19% relative DER reduction), in some cases mitigating the absence of domain adaptation. We reveal that joint calibration in powerset space consistently outperforms independent per-speaker calibration, and that the Fuse-then-Calibrate ordering generally outperforms calibrating individual models before fusion while requiring calibration of only a single combined model. Our best configuration outperforms DOVER-Lap in terms of DER while providing reliable confidence estimates essential for downstream applications. This work proposes best practices for probability-level fusion of EEND systems and demonstrates the advantages of leveraging soft outputs over hard decisions.
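
A Fuse-then-Calibrate pipeline for a single two-speaker frame could look roughly as follows. This is a sketch under assumptions rather than the paper's method: the independence assumption in the multilabel-to-powerset mapping, simple averaging as the fusion rule, and temperature scaling as the calibrator are all choices made here for illustration, since the abstract does not name the specific techniques.

```python
import math
from typing import List, Sequence

def multilabel_to_powerset(p1: float, p2: float) -> List[float]:
    """Map per-speaker activity probabilities to [silence, spk1, spk2, overlap].

    Assumes the two speakers are independent given the frame (illustrative only).
    """
    return [(1 - p1) * (1 - p2), p1 * (1 - p2), (1 - p1) * p2, p1 * p2]

def fuse(posteriors: Sequence[Sequence[float]]) -> List[float]:
    """Average class posteriors from several models for one frame."""
    n = len(posteriors)
    return [sum(p[k] for p in posteriors) / n for k in range(len(posteriors[0]))]

def temperature_scale(probs: Sequence[float], temperature: float) -> List[float]:
    """Divide log-probabilities by the temperature and renormalise over all classes."""
    eps = 1e-12
    logits = [math.log(max(p, eps)) / temperature for p in probs]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Two hypothetical models scoring the same frame.
frame_a = multilabel_to_powerset(0.9, 0.2)
frame_b = multilabel_to_powerset(0.7, 0.4)
fused = fuse([frame_a, frame_b])
calibrated = temperature_scale(fused, temperature=1.5)
```

Only one temperature has to be fitted for the fused system, whereas Calibrate-then-Fuse would need one per model; a temperature above 1 softens an overconfident posterior without changing which class is ranked first.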

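[AI-50] Test-time scaling of diffusions with flow maps

[Quick read]: This paper addresses the difficulty of improving diffusion models at test time so that samples score highly under a user-specified reward; introducing the reward gradient into the dynamics is often ill posed because such rewards are usually well defined only on the data distribution at the end of generation. The key to the solution is to work directly with the flow map instead of the usual denoiser-based approximation. By exploiting the mathematical relationship between the flow map and the velocity field governing instantaneous transport, the authors propose Flow Map Trajectory Tilting (FMTT), which provably achieves better reward ascent than standard test-time methods that use the reward gradient. It supports exact sampling via importance weighting or principled search for local maximizers, and handles complicated reward functions, for example by interfacing with vision language models to enable new forms of image editing.

Link: https://arxiv.org/abs/2511.22688
Authors: Amirmojtaba Sabour,Michael S. Albergo,Carles Domingo-Enrich,Nicholas M. Boffi,Sanja Fidler,Karsten Kreis,Eric Vanden-Eijnden
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: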

Abstract:A common recipe to improve diffusion models at test-time so that samples score highly against a user-specified reward is to introduce the gradient of the reward into the dynamics of the diffusion itself. This procedure is often ill posed, as user-specified rewards are usually only well defined on the data distribution at the end of generation. While common workarounds to this problem are to use a denoiser to estimate what a sample would have been at the end of generation, we propose a simple solution to this problem by working directly with a flow map. By exploiting a relationship between the flow map and velocity field governing the instantaneous transport, we construct an algorithm, Flow Map Trajectory Tilting (FMTT), which provably performs better ascent on the reward than standard test-time methods involving the gradient of the reward. The approach can be used to either perform exact sampling via importance weighting or principled search that identifies local maximizers of the reward-tilted distribution. We demonstrate the efficacy of our approach against other look-ahead techniques, and show how the flow map enables engagement with complicated reward functions that make possible new forms of image editing, e.g. by interfacing with vision language models.

[AI-51] Automated Design Optimization via Strategic Search with Large Language Models

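[Quick read]: This paper addresses the failure of traditional optimization methods on problems where the search space is poorly defined and design parameters are hard to formalize. The core solution is AUTO, an agent framework based on Large Language Models (LLMs) that treats design optimization as a gradient-free search problem handled by two collaborating agents: a Strategist that chooses between exploration and exploitation, and an Implementor that produces and executes concrete designs. The method draws on the LLM's ability to interpret the design space dynamically and on its encoded domain knowledge to optimize with limited prior information. On GPU code optimization it produces solutions competitive with expert implementations, at lower estimated compute and labor cost.

Link: https://arxiv.org/abs/2511.22651
Authors: Anthony Carreon,Vansh Sharma,Venkat Raman
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Multiagent Systems (cs.MA)
Comments: 14 pages, 5 tables, 7 figures, preprint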

Abstract:Traditional optimization methods excel in well-defined search spaces but struggle with design problems where transformations and design parameters are difficult to define. Large language models (LLMs) offer a promising alternative by dynamically interpreting design spaces and leveraging encoded domain knowledge. To this end, we introduce AUTO, an LLM agent framework that treats design optimization as a gradient-free search problem guided by strategic LLM reasoning. The framework employs two collaborative agents: a Strategist that selects between exploration and exploitation strategies, and an Implementor that executes detailed designs. Applied to GPU code optimization – a domain critical to fields from machine learning to scientific computing – AUTO generates solutions competitive with expert implementations for chemical kinetics integration and dense matrix multiplication. The framework achieves 50-70% search efficiency relative to Bayesian optimization methodologies. It completes optimizations in approximately 8 hours at an estimated cost of up to $159 per run, compared to an estimated cost of up to $480 with median-wage software developers. These findings open the door to automating design optimization in ill-defined search spaces with limited prior information.

[AI-52] Optimized Agent Shift Scheduling Using Multi-Phase Allocation Approach

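[Quick read]: This paper addresses the efficiency and accuracy of employee shift scheduling in the Contact Center as a Service (CCaaS) industry, in particular how to maintain service levels with limited staff during peak demand such as holiday rushes. The key to the solution is a multi-phase allocation method that splits the original single-step Integer Programming Problem (IPP) into two subproblems, day allocation and shift allocation. This significantly reduces the number of variables and allows a targeted objective function for each subproblem, giving more efficient and accurate scheduling.

Link: https://arxiv.org/abs/2511.22632
Authors: Sanalkumar K,Koushik Dey,Swati Meena
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 5 pages, 3 figures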

Abstract:Effective agent shift scheduling is crucial for businesses, especially in the Contact Center as a Service (CCaaS) industry, to ensure seamless operations and fulfill employee needs. Most studies utilizing mathematical model-based solutions approach the problem as a single-step process, often resulting in inefficiencies and high computational demands. In contrast, we present a multi-phase allocation method that addresses scalability and accuracy by dividing the problem into smaller sub-problems of day and shift allocation, which significantly reduces the number of computational variables and allows for targeted objective functions, ultimately enhancing both efficiency and accuracy. Each subproblem is modeled as an Integer Programming Problem (IPP), with solutions sequentially feeding into the subsequent subproblem. We then apply the proposed method, using a multi-objective framework, to address the difficulties posed by peak demand scenarios such as holiday rushes, where maintaining service levels is essential despite having a limited number of employees.
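
The day-then-shift decomposition can be illustrated on a toy instance. The sketch below is not the paper's formulation: exhaustive search stands in for an integer programming solver, and the objective (unmet headcount), the function names, and the data are all assumptions chosen to keep the example self-contained.

```python
from itertools import combinations, product

def allocate_days(agents, days, days_per_agent, daily_need):
    """Phase 1: pick working days per agent, minimising total unmet daily headcount."""
    options = list(combinations(days, days_per_agent))
    best, best_cost = None, float("inf")
    for plan in product(options, repeat=len(agents)):
        cost = sum(max(0, daily_need[d] - sum(d in p for p in plan)) for d in days)
        if cost < best_cost:
            best, best_cost = dict(zip(agents, plan)), cost
    return best

def allocate_shifts(working, shifts, shift_need):
    """Phase 2: assign one day's working agents to shifts, minimising unmet shift demand."""
    best, best_cost = None, float("inf")
    for plan in product(shifts, repeat=len(working)):
        cost = sum(max(0, shift_need[s] - plan.count(s)) for s in shifts)
        if cost < best_cost:
            best, best_cost = dict(zip(working, plan)), cost
    return best

agents = ["ana", "ben", "cy"]
days = ["Mon", "Tue", "Wed"]
day_plan = allocate_days(agents, days, days_per_agent=2,
                         daily_need={"Mon": 2, "Tue": 2, "Wed": 2})

# The phase-1 result feeds phase 2: only agents working on Monday are considered.
monday_staff = [a for a in agents if "Mon" in day_plan[a]]
monday_shifts = allocate_shifts(monday_staff, ["am", "pm"], {"am": 1, "pm": 1})
```

Each phase searches a much smaller space than a joint search over every agent, day, and shift at once, and each phase can carry its own objective, which mirrors the motivation given in the abstract.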

[AI-53] AI Deception: Risks Dynamics and Controls

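[Quick read]: This paper addresses the increasingly prominent problem of deception in Artificial Intelligence (AI) systems, in which AI induces false beliefs to secure self-beneficial outcomes, now an empirically demonstrated risk across language models, AI agents, and frontier systems. The key contribution is a systematic "deception cycle" framework with two core components, deception emergence and deception treatment. On deception emergence, it identifies three hierarchical levels of incentives and three capability preconditions, and analyzes triggers such as supervision gaps, distributional shifts, and environmental pressures. On deception treatment, it covers detection methods based on benchmarks and evaluation protocols, together with auditing approaches that integrate technical, community, and governance efforts to identify and mitigate AI deception as a sociotechnical safety challenge.

Link: https://arxiv.org/abs/2511.22619
Authors: Boyuan Chen,Sitong Fang,Jiaming Ji,Yanxu Zhu,Pengcheng Wen,Jinzhou Wu,Yingshui Tan,Boren Zheng,Mengying Yuan,Wenqi Chen,Donghai Hong,Alex Qiu,Xin Chen,Jiayi Zhou,Kaile Wang,Juntao Dai,Borong Zhang,Tianzhuo Yang,Saad Siddiqui,Isabella Duan,Yawen Duan,Brian Tse,Jen-Tse(Jay)Huang,Kun Wang,Baihui Zheng,Jiaheng Liu,Jian Yang,Yiming Li,Wenting Chen,Dongrui Liu,Lukas Vierling,Zhiheng Xi,Haobo Fu,Wenxuan Wang,Jitao Sang,Zhengyan Shi,Chi-Min Chan,Eugenie Shi,Simin Li,Juncheng Li,Wei Ji,Dong Li,Jun Song,Yinpeng Dong,Jie Fu,Bo Zheng,Min Yang,Yike Guo,Philip Torr,Zhongyuan Wang,Yaodong Yang,Tiejun Huang,Ya-Qin Zhang,Hongjiang Zhang,Andrew Yao
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: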

Abstract:As intelligence increases, so does its shadow. AI deception, in which systems induce false beliefs to secure self-beneficial outcomes, has evolved from a speculative concern to an empirically demonstrated risk across language models, AI agents, and emerging frontier systems. This project provides a comprehensive and up-to-date overview of the AI deception field, covering its core concepts, methodologies, genesis, and potential mitigations. First, we identify a formal definition of AI deception, grounded in signaling theory from studies of animal deception. We then review existing empirical studies and associated risks, highlighting deception as a sociotechnical safety challenge. We organize the landscape of AI deception research as a deception cycle, consisting of two key components: deception emergence and deception treatment. Deception emergence reveals the mechanisms underlying AI deception: systems with sufficient capability and incentive potential inevitably engage in deceptive behaviors when triggered by external conditions. Deception treatment, in turn, focuses on detecting and addressing such behaviors. On deception emergence, we analyze incentive foundations across three hierarchical levels and identify three essential capability preconditions required for deception. We further examine contextual triggers, including supervision gaps, distributional shifts, and environmental pressures. On deception treatment, we conclude detection methods covering benchmarks and evaluation protocols in static and interactive settings. Building on the three core factors of deception emergence, we outline potential mitigation strategies and propose auditing approaches that integrate technical, community, and governance efforts to address sociotechnical challenges and future AI risks. To support ongoing work in this area, we release a living resource at this http URL.

[AI-54] Where to Measure: Epistemic Uncertainty-Based Sensor Placement with ConvCNPs

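[Quick read]: This paper addresses suboptimal sensor placement caused by relying on total predictive uncertainty, which mixes epistemic and aleatoric uncertainty and can lead to poor sensor location choices in ambiguous regions. The key to the solution is a new acquisition function, the expected reduction in epistemic uncertainty, made possible by adding a Mixture Density Network (MDN) output head to Convolutional Conditional Neural Processes (ConvCNPs) to estimate epistemic uncertainty. Preliminary results suggest this reduces model error more effectively than placement based on overall uncertainty.

Link: https://arxiv.org/abs/2511.22567
Authors: Feyza Eksen,Stefan Oehmcke,Stefan Lüdtke
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: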

Abstract:Accurate sensor placement is critical for modeling spatio-temporal systems such as environmental and climate processes. Neural Processes (NPs), particularly Convolutional Conditional Neural Processes (ConvCNPs), provide scalable probabilistic models with uncertainty estimates, making them well-suited for data-driven sensor placement. However, existing approaches rely on total predictive uncertainty, which conflates epistemic and aleatoric components and may lead to suboptimal sensor selection in ambiguous regions. To address this, we propose expected reduction in epistemic uncertainty as a new acquisition function for sensor placement. To enable this, we extend ConvCNPs with a Mixture Density Network (MDN) output head for epistemic uncertainty estimation. Preliminary results suggest that epistemic uncertainty driven sensor placement more effectively reduces model error than approaches based on overall uncertainty.
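
Why total and epistemic uncertainty can disagree about where to measure is easiest to see with the law of total variance for a Gaussian mixture: the total variance equals the weighted mean of the component variances plus the weighted variance of the component means. The sketch below reads the first term as aleatoric and the second as epistemic, which is a common convention and an assumption here, not necessarily the paper's estimator. It also ranks candidates by current epistemic variance, a simplification of the expected-reduction acquisition function the abstract proposes; all names and numbers are hypothetical.

```python
from typing import Dict, Sequence, Tuple

Mixture = Tuple[Sequence[float], Sequence[float], Sequence[float]]  # weights, means, variances

def decompose_variance(weights, means, variances) -> Tuple[float, float]:
    """Split a Gaussian mixture's variance into (within-component, between-component)."""
    mean = sum(w * m for w, m in zip(weights, means))
    aleatoric = sum(w * v for w, v in zip(weights, variances))
    epistemic = sum(w * (m - mean) ** 2 for w, m in zip(weights, means))
    return aleatoric, epistemic

def pick_by_total(candidates: Dict[str, Mixture]) -> str:
    return max(candidates, key=lambda loc: sum(decompose_variance(*candidates[loc])))

def pick_by_epistemic(candidates: Dict[str, Mixture]) -> str:
    return max(candidates, key=lambda loc: decompose_variance(*candidates[loc])[1])

# Hypothetical predictive mixtures at two candidate sensor locations.
candidates = {
    "A": ([0.5, 0.5], [1.0, 1.0], [4.0, 4.0]),  # components agree, but the signal is noisy
    "B": ([0.5, 0.5], [0.0, 2.0], [0.5, 0.5]),  # components disagree about the mean
}

choice_total = pick_by_total(candidates)
choice_epistemic = pick_by_epistemic(candidates)
```

Location A has the larger total variance, but all of it is noise that a new sensor cannot remove, while location B is where the mixture components disagree; a criterion based on total uncertainty picks A, whereas one based on the epistemic term picks B.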

[AI-55] Counting Still Counts: Understanding Neural Complex Query Answering Through Query Relaxation

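[Quick read]: This paper asks whether current neural Complex Query Answering (CQA) models really go beyond traditional symbolic reasoning, in particular whether they learn generalizable reasoning patterns from knowledge graphs (KGs). The study finds that, despite their complexity, neural models perform similarly to a training-free query relaxation strategy in several cases across datasets and query types, that the two retrieve answers with little overlap, and that combining them consistently improves performance. The key contribution is a systematic evaluation of a non-neural baseline based on constraint relaxation and path counting, which shows that existing neural models do not subsume the reasoning patterns captured by query relaxation and suggests that future neural CQA methods could benefit from incorporating such principles.

Link: https://arxiv.org/abs/2511.22565
Authors: Yannick Brunink,Daniel Daza,Yunjie He,Michael Cochez
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)
Comments: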

Abstract:Neural methods for Complex Query Answering (CQA) over knowledge graphs (KGs) are widely believed to learn patterns that generalize beyond explicit graph structure, allowing them to infer answers that are unreachable through symbolic query processing. In this work, we critically examine this assumption through a systematic analysis comparing neural CQA models with an alternative, training-free query relaxation strategy that retrieves possible answers by relaxing query constraints and counting resulting paths. Across multiple datasets and query structures, we find several cases where neural and relaxation-based approaches perform similarly, with no neural model consistently outperforming the latter. Moreover, a similarity analysis reveals that their retrieved answers exhibit little overlap, and that combining their outputs consistently improves performance. These results call for a re-evaluation of progress in neural query answering: despite their complexity, current models fail to subsume the reasoning patterns captured by query relaxation. Our findings highlight the importance of stronger non-neural baselines and suggest that future neural approaches could benefit from incorporating principles of query relaxation.
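
Relaxing a query and counting paths can be shown on a toy knowledge graph. The sketch below is one possible reading and not the paper's exact procedure: it handles only a two-hop query, relaxes only the second relation constraint, and uses made-up triples and function names.

```python
from collections import Counter

# Toy knowledge graph as (head, relation, tail) triples.
TRIPLES = {
    ("alice", "works_at", "acme"),
    ("alice", "works_at", "globex"),
    ("acme", "located_in", "berlin"),
    ("globex", "based_in", "berlin"),   # a similar fact stored under a different relation
    ("globex", "located_in", "paris"),
}

def neighbours(node, relation=None):
    """Tails reachable from node; relation=None relaxes the relation constraint."""
    return [t for h, r, t in TRIPLES if h == node and (relation is None or r == relation)]

def two_hop_counts(anchor, r1, r2, relax_second=False):
    """Count paths anchor -r1-> ?y -r2-> ?x for each candidate answer ?x."""
    counts = Counter()
    for mid in neighbours(anchor, r1):
        for tail in neighbours(mid, None if relax_second else r2):
            counts[tail] += 1
    return counts

exact = two_hop_counts("alice", "works_at", "located_in")
relaxed = two_hop_counts("alice", "works_at", "located_in", relax_second=True)
```

The exact query gives berlin and paris one supporting path each, while the relaxed query also counts the `based_in` edge and ranks berlin first with two paths; no training is involved, only graph traversal and counting.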

[AI-56] A Computable Game-Theoretic Framework for Multi-Agent Theory of Mind AAAI2026

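[Quick read]: This paper addresses how to realize **Theory of Mind (ToM)** in computational systems, in particular how to formalize mental concepts such as goals, intentions, and beliefs and use them to build agents capable of boundedly rational decisions. The key to its solution is a game-theoretic computational framework: on the one hand, it guides boundedly rational decision-making by recursively modeling the mental states of other agents; on the other hand, it uses statistical techniques and approximate solution methods to keep the underlying computational problem computable.

Link: https://arxiv.org/abs/2511.22536
Authors: Fengming Zhu, Yuxin Pan, Xiaomeng Zhu, Fangzhen Lin
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Comments: Ongoing work. A preliminary version has been accepted by the AAAI 2026 Theory of Mind for AI (ToM4AI) Workshop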

Abstract:Originating in psychology, Theory of Mind (ToM) has attracted significant attention across multiple research communities, especially logic, economics, and robotics. Most psychological work does not aim at formalizing those central concepts, namely goals, intentions, and beliefs, to automate a ToM-based computational process, which, by contrast, has been extensively studied by logicians. In this paper, we offer a different perspective by proposing a computational framework viewed through the lens of game theory. On the one hand, the framework prescribes how to make boundedly rational decisions while maintaining a theory of mind about others (and recursively, each of the others holding a theory of mind about the rest); on the other hand, it employs statistical techniques and approximate solutions to retain computability of the inherent computational problem.

[AI-57] HW-GNN: Homophily-Aware Gaussian-Window Constrained Graph Spectral Network for Social Network Bot Detection

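[Quick read]: This paper targets two key limitations of current spectral graph neural networks (GNNs) for social bot detection: first, their broad-spectrum fitting mechanisms dilute the focus on bot-specific spectral features; second, they do not fully incorporate domain knowledge useful for identifying bots, such as the fact that low homophily typically correlates with high-frequency features. The authors propose HW-GNN, whose core innovations are (i) a learnable Gaussian-window constrained spectral network that uses Gaussian windows to focus on bot-related spectral features, and (ii) a homophily-aware adaptation mechanism that embeds prior knowledge about the relationship between homophily ratios and frequency features into the Gaussian window optimization, making the model more sensitive to bot-specific patterns. Experiments on multiple benchmark datasets show that the method outperforms existing approaches, with an average F1-score improvement of 4.3%, and that it works as a plug-in for existing spectral GNNs.

Link: https://arxiv.org/abs/2511.22493
Authors: Zida Liu, Jun Gao, Zhang Ji, Li Zhao
Affiliations: Unknown
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Comments: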

Abstract:Social bots are increasingly polluting online platforms by spreading misinformation and engaging in coordinated manipulation, posing severe threats to cybersecurity. Graph Neural Networks (GNNs) have become mainstream for social bot detection due to their ability to integrate structural and attribute features, with spectral-based approaches demonstrating particular efficacy due to discriminative patterns in the spectral domain. However, current spectral GNN methods face two limitations: (1) their broad-spectrum fitting mechanisms degrade the focus on bot-specific spectral features, and (2) certain domain knowledge valuable for bot detection, e.g., low homophily correlates with high-frequency features, has not been fully incorporated into existing methods. To address these challenges, we propose HW-GNN, a novel homophily-aware graph spectral network with Gaussian window constraints. Our framework introduces two key innovations: (i) a Gaussian-window constrained spectral network that employs learnable Gaussian windows to highlight bot-related spectral features, and (ii) a homophily-aware adaptation mechanism that injects domain knowledge between homophily ratios and frequency features into the Gaussian window optimization process. Through extensive experimentation on multiple benchmark datasets, we demonstrate that HW-GNN achieves state-of-the-art bot detection performance, outperforming existing methods with an average improvement of 4.3% in F1-score, while exhibiting strong plug-in compatibility with existing spectral GNNs.
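To make the notion of a Gaussian window over graph frequencies concrete, here is a minimal sketch. It is not HW-GNN itself: the window is fixed rather than learned, and the linear mapping from homophily to window center in `center_from_homophily` is a hypothetical rule introduced only to illustrate the stated prior that low homophily goes with high frequencies. The only standard fact used is that normalized-Laplacian eigenvalues lie in [0, 2].

```python
import math

def gaussian_window(freq, center, width):
    """Gaussian band-pass response for a graph frequency (Laplacian eigenvalue)."""
    return math.exp(-((freq - center) ** 2) / (2.0 * width ** 2))

def center_from_homophily(h, max_freq=2.0):
    """ASSUMPTION (not from the paper): map homophily h in [0, 1] linearly so
    that low homophily places the window at high frequencies."""
    return max_freq * (1.0 - h)

def filter_spectrum(coeffs, freqs, center, width):
    """Scale each spectral coefficient by the window response at its frequency."""
    return [c * gaussian_window(f, center, width) for c, f in zip(coeffs, freqs)]

# A low-homophily graph (h = 0.2) puts the window near the high end of [0, 2].
center = center_from_homophily(0.2)
filtered = filter_spectrum([1.0, 1.0, 1.0], [0.0, 1.0, 2.0], center, width=0.3)
```

With these made-up values the coefficient at frequency 2.0 is attenuated least and the one at 0.0 is suppressed almost entirely.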

[AI-58] Structured Extraction from Business Process Diagrams Using Vision-Language Models

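[Quick read]: This paper addresses the extraction of structured data from images of BPMN (Business Process Model and Notation) diagrams, particularly when the original XML source files or textual annotations are unavailable. Whereas existing methods rely on XML for computational analysis, the paper proposes a pipeline based on Vision-Language Models (VLMs) that extracts BPMN elements as JSON directly from images and uses optical character recognition (OCR) to enrich the textual content. The key to the solution is using VLMs to interpret the diagram image and produce structured output while OCR improves text recognition, so that BPMN components can be extracted robustly without source files.

Link: https://arxiv.org/abs/2511.22448
Authors: Pritam Deka, Barry Devereux
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: To appear in the Proceedings of the 2026 ACM Symposium on Applied Computing (SAC '26)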

Abstract:Business Process Model and Notation (BPMN) is a widely adopted standard for representing complex business workflows. While BPMN diagrams are often exchanged as visual images, existing methods primarily rely on XML representations for computational analysis. In this work, we present a pipeline that leverages Vision-Language Models (VLMs) to extract structured JSON representations of BPMN diagrams directly from images, without requiring source model files or textual annotations. We also incorporate optical character recognition (OCR) for textual enrichment and evaluate the generated element lists against ground truth data derived from the source XML files. Our approach enables robust component extraction in scenarios where original source files are unavailable. We benchmark multiple VLMs and observe performance improvements in several models when OCR is used for text enrichment. In addition, we conducted extensive statistical analyses of OCR-based enrichment methods and prompt ablation studies, providing a clearer understanding of their impact on model performance.

[AI-59] FastFHE: Packing-Scalable and Depthwise-Separable CNN Inference Over FHE

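[Quick read]: This paper addresses the high latency of deep convolutional neural network (CNN) inference under Fully Homomorphic Encryption (FHE), focusing on three core bottlenecks: i) the time and storage overhead of convolution; ii) the latency of large numbers of bootstrapping operations; and iii) the consumption of circuit multiplicative depth. The key to the solution is an efficient mechanism named FastFHE, whose core innovations are: 1) a scalable ciphertext data-packing scheme that reduces time and storage costs; 2) a depthwise-separable convolution structure that lightens the convolution workload; 3) a BN dot-product fusion matrix that merges the convolutional layer with the batch normalization (BN) layer without adding multiplicative depth; and 4) a low-degree Legendre polynomial approximation of the smooth nonlinear activation function SiLU, which improves efficiency while keeping the accuracy gap before and after encrypted inference very small.

Link: https://arxiv.org/abs/2511.22434
Authors: Wenbo Song, Xinxin Fan, Quanliang Jing, Shaoye Luo, Wenqi Wei, Chi Lin, Yunfeng Lu, Ling Liu
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: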

Abstract:As deep learning (DL) penetrates daily life in many domains, keeping DL model inference secure and sample data private in an encrypted environment has become an urgent and increasingly important issue for security-critical applications. To date, several approaches have been proposed based on the Residue Number System variant of the Cheon-Kim-Kim-Song (RNS-CKKS) scheme. However, they all suffer from high latency, which severely limits their use in real-world tasks. Currently, research on encrypted inference in deep CNNs confronts three main bottlenecks: i) the time and storage costs of convolution calculation; ii) the time overhead of huge bootstrapping operations; and iii) the consumption of circuit multiplication depth. To address these three challenges, we propose FastFHE, an efficient and effective mechanism that accelerates model inference while retaining high inference accuracy over fully homomorphic encryption. Concretely, our work makes four contributions. First, we propose a new scalable ciphertext data-packing scheme to save time and storage. Second, we devise a depthwise-separable convolution scheme to reduce the computational load of convolution. Third, we derive a BN dot-product fusion matrix to merge the ciphertext convolutional layer with the batch-normalization layer without incurring extra multiplicative depth. Last but not least, we adopt a low-degree Legendre polynomial to approximate the smooth nonlinear activation function SiLU, with a guarantee of tiny accuracy error before and after encrypted inference. Finally, we run multi-faceted experiments to verify the efficiency and effectiveness of the proposed approach.
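Merging a batch-normalization layer into the preceding linear layer rests on a standard algebraic identity, sketched below on plaintext numbers. This is only the textbook folding rule, shown for a single dot product; it is not the paper's BN dot-product fusion matrix and involves no encryption. All function names and values are illustrative.

```python
import math

def conv_then_bn(x, w, b, gamma, beta, mean, var, eps=1e-5):
    """Reference: a dot-product 'convolution' followed by batch normalization."""
    y = sum(wi * xi for wi, xi in zip(w, x)) + b
    return gamma * (y - mean) / math.sqrt(var + eps) + beta

def fold_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold the BN affine map into the weights and bias of the preceding layer."""
    scale = gamma / math.sqrt(var + eps)
    return [scale * wi for wi in w], scale * (b - mean) + beta

def fused(x, w_folded, b_folded):
    """Single linear layer that reproduces conv followed by BN."""
    return sum(wi * xi for wi, xi in zip(w_folded, x)) + b_folded

# Made-up inputs and parameters.
x, w, b = [0.5, -1.0, 2.0], [0.2, 0.4, -0.1], 0.3
bn = dict(gamma=1.5, beta=-0.2, mean=0.1, var=0.8)
w_f, b_f = fold_bn(w, b, **bn)
gap = abs(fused(x, w_f, b_f) - conv_then_bn(x, w, b, **bn))
```

Because the folded layer is still a single linear map, the normalization costs no additional multiplication at inference time, which is the property that matters when multiplicative depth is scarce.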

[AI-60] MATCH: Engineering Transparent and Controllable Conversational XAI Systems through Composable Building Blocks

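[Quick read]: This paper addresses the loss of overall explainability in interactive systems caused by the black-box nature of AI models: with both standard Explainable AI (XAI) techniques and conversational XAI approaches, the system architecture remains hard to understand even when individual models are explainable. The key to the solution is a flow-based modeling approach built on structural building blocks, which represents an interactive system as a sequence of AI models and control mechanisms and explains each part through complementary explanatory building blocks (such as the XAI techniques LIME and SHAP). The resulting flow and APIs give an unambiguous view of the system and align human and machine interpretability of the embedded AI models. The framework is named MATCH (Multi-Agent Transparent and Controllable Human-centered systems) and offers an engineering path for integrating explainability into existing interactive systems.

Link: https://arxiv.org/abs/2511.22420
Authors: Sebe Vanbrabant, Gustavo Rovelo Ruiz, Davy Vanacken
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Submitted Version accepted for publication in an LNCS Volume “Engineering Interactive Computer Systems - EICS 2025 - International Workshops and Doctoral Consortium”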

Abstract:While the increased integration of AI technologies into interactive systems enables them to solve an increasing number of tasks, the black-box problem of AI models continues to spread throughout the interactive system as a whole. Explainable AI (XAI) techniques can make AI models more accessible by employing post-hoc methods or transitioning to inherently interpretable models. While this makes individual AI models clearer, the overarching system architecture remains opaque. This challenge not only pertains to standard XAI techniques but also to human examination and conversational XAI approaches that need access to model internals to interpret them correctly and completely. To this end, we propose conceptually representing such interactive systems as sequences of structural building blocks. These include the AI models themselves, as well as control mechanisms grounded in literature. The structural building blocks can then be explained through complementary explanatory building blocks, such as established XAI techniques like LIME and SHAP. The flow and APIs of the structural building blocks form an unambiguous overview of the underlying system, serving as a communication basis for both human and automated agents, thus aligning human and machine interpretability of the embedded AI models. In this paper, we present our flow-based approach and a selection of building blocks as MATCH: a framework for engineering Multi-Agent Transparent and Controllable Human-centered systems. This research contributes to the field of (conversational) XAI by facilitating the integration of interpretability into existing interactive systems.

[AI-61] Who is Afraid of Minimal Revision?

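[Quick read]: This paper examines the limited learning power of minimal revision in belief revision theory: although the method follows the principle of minimal change and so keeps belief states stable, it learns less than other learning methods. The key contribution is a systematic characterization, over finitely many possibilities, of which prior plausibility assignments allow minimal revision to learn successfully, together with corresponding characterizations for conditioning and lexicographic upgrade. The study finds that minimal revision remains widely applicable: it can learn every finitely identifiable problem, and it can learn from positive and negative data as long as finitely many possibilities are considered. Not all of these results continue to hold, however, when the information may contain errors.

Link: https://arxiv.org/abs/2511.22386
Authors: Edoardo Baccini (University of Groningen), Zoé Christoff (University of Groningen), Nina Gierasimczuk (Technical University of Denmark), Rineke Verbrugge (University of Groningen)
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: In Proceedings TARK 2025, arXiv:2511.20540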

Abstract:The principle of minimal change in belief revision theory requires that, when accepting new information, one keeps one’s belief state as close to the initial belief state as possible. This is precisely what the method known as minimal revision does. However, unlike less conservative belief revision methods, minimal revision falls short in learning power: It cannot learn everything that can be learned by other learning methods. We begin by showing that, despite this limitation, minimal revision is still a successful learning method in a wide range of situations. Firstly, it can learn any problem that is finitely identifiable. Secondly, it can learn with positive and negative data, as long as one considers finitely many possibilities. We then characterize the prior plausibility assignments (over finitely many possibilities) that enable one to learn via minimal revision, and do the same for conditioning and lexicographic upgrade. Finally, we show that not all of our results still hold when learning from possibly erroneous information.

[AI-62] Graded Distributed Belief

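[Quick read]: This paper addresses how to formalize and compute group belief in a distributed setting, in particular how to capture the strength with which a group believes a proposition after the belief bases of its members are merged. The core challenge is to build a logical framework that can express precisely that a group distributively believes a fact with a certain strength while keeping the semantics computationally grounded. The key to the solution is a new logic of graded distributed belief in which the strength of the group's belief is computed directly from the merged individual belief bases and interpreted through a belief-base semantics, which supports a sound and complete axiomatization, a decidability result, and a proof of PSPACE-completeness.

Link: https://arxiv.org/abs/2511.22381
Authors: Emiliano Lorini (IRIT, CNRS, Toulouse University), Dmitry Rozplokhas (TU Wien)
Affiliations: Unknown
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
Comments: In Proceedings TARK 2025, arXiv:2511.20540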

Abstract:We introduce a new logic of graded distributed belief that allows us to express the fact that a group of agents distributively believe that a certain fact holds with at least strength k. We interpret our logic by means of computationally grounded semantics relying on the concept of belief base. The strength of the group’s distributed belief is directly computed from the group’s belief base after having merged its members’ individual belief bases. We illustrate our logic with an intuitive example, formalizing the notion of epistemic disagreement. We also provide a sound and complete Hilbert-style axiomatization, a decidability result obtained via filtration, and a tableaux-based decision procedure that allows us to state PSPACE-completeness for our logic.

[AI-63] Conditionals Based on Selection Functions, Modal Operators and Probabilities

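[Quick read]: This paper aims to characterize theoretically the relationship between probability updating methods and conditional connectives, in particular how the probabilities of conditionals relate to different updating mechanisms. The key to its approach is a general perspective covering many types of conditionals and a wide range of updating methods, from which it derives general results about their interrelation. These results make it possible to identify which classes of updating procedures can be represented by specific conditional connectives and to characterize the probabilities of certain conditional connectives.

Link: https://arxiv.org/abs/2511.22377
Authors: Tommaso Flaminio (IIIA-CSIC), Lluis Godo (IIIA-CSIC), Giuliano Rosella (University of Turin)
Affiliations: Unknown
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Discrete Mathematics (cs.DM)
Comments: In Proceedings TARK 2025, arXiv:2511.20540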

Abstract:Methods for probability updating, of which Bayesian conditionalization is the most well-known and widely used, are modeling tools that aim to represent the process of modifying an initial epistemic state, typically represented by a prior probability function P, which is adjusted in light of new information. Notably, updating methods and conditional sentences seem to intuitively share a deep connection, as is evident in the case of conditionalization. The present work contributes to this line of research and aims at shedding new light on the relationship between updating methods and conditional connectives. Departing from previous literature that often focused on a specific type of conditional or a particular updating method, our goal is to prove general results concerning the connection between conditionals and their probabilities. This will allow us to characterize the probabilities of certain conditional connectives and to understand what class of updating procedures can be represented using specific conditional connectives. Broadly, we adopt a general perspective that encompasses a large class of conditionals and a wide range of updating methods, enabling us to prove some general results concerning their interrelation.

[AI-64] On the Complexity of the Grounded Semantics for Infinite Argumentation Frameworks

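[Quick read]: This paper studies the computational complexity of the grounded extension in argumentation frameworks, in particular the number of iterations needed to compute it and the complexity of deciding grounded acceptance when no additional constraints are imposed (so that frameworks may be infinite). The key to the solution is the use of methods from mathematical logic, specifically computability theory and set theory, to identify the exact ordinal corresponding to the length of the required transfinite iteration and to show that deciding grounded acceptance is maximally complex. This contrasts sharply with the finite case, where the grounded extension is computable in polynomial time.

Link: https://arxiv.org/abs/2511.22376
Authors: Uri Andrews (University of Wisconsin–Madison), Luca San Mauro (University of Bari)
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: In Proceedings TARK 2025, arXiv:2511.20540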

Abstract:Argumentation frameworks, consisting of arguments and an attack relation representing conflicts, are fundamental for formally studying reasoning under conflicting information. We use methods from mathematical logic, specifically computability and set theory, to analyze the grounded extension, a widely-used model of maximally skeptical reasoning, defined as the least fixed-point of a natural defense operator. Without additional constraints, finding this fixed-point requires transfinite iterations. We identify the exact ordinal number corresponding to the length of this iterative process and determine the complexity of deciding grounded acceptance, showing it to be maximally complex. This shows a marked distinction from the finite case where the grounded extension is polynomial-time computable, thus simpler than other reasoning problems explored in formal argumentation.
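For contrast with the infinite setting studied in the paper, the finite case can be sketched in a few lines: the grounded extension is the least fixed point of the defense operator, reached by iterating from the empty set. The example framework and all identifiers below are illustrative assumptions; the transfinite iteration analyzed in the paper is not modeled here.

```python
def grounded_extension(arguments, attacks):
    """Least fixed point of the defense operator for a FINITE framework.

    `attacks` is a set of (attacker, target) pairs.
    """
    attackers = {a: {x for (x, y) in attacks if y == a} for a in arguments}

    def defended(s):
        # a is defended by s if every attacker of a is attacked by some member of s
        return {
            a
            for a in arguments
            if all(any((d, b) in attacks for d in s) for b in attackers[a])
        }

    current = set()
    while True:
        nxt = defended(current)
        if nxt == current:
            return current
        current = nxt

# a attacks b, b attacks c, and d and e attack each other.
args = {"a", "b", "c", "d", "e"}
atts = {("a", "b"), ("b", "c"), ("d", "e"), ("e", "d")}
result = grounded_extension(args, atts)
```

The unattacked argument `a` is accepted first and then defends `c`; the mutually attacking pair `d`, `e` stays out, which reflects the maximally skeptical character of the semantics.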

[AI-65] Distributed Knowing How

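[Quick read]: This paper addresses how to model distributed knowledge in the logic of know-how, that is, how to capture the collective ability a group gains under multi-step strategies through the joint actions of its subgroups. Existing work falls into an individual-based multi-step framework and a coalition-based single-step framework, and neither fully accounts for cases where a group can accomplish more than its members can jointly do. The key to the solution is a new notion of distributed knowledge-how, grounded in the group's distributed knowledge-that, with multi-step strategies derived from distributed actions that subgroups can collectively perform. The paper then gives a proof system that is sound and strongly complete and whose axioms closely resemble those of the classical logic of distributed knowledge-that, thereby generalizing both traditional frameworks.

Link: https://arxiv.org/abs/2511.22374
Authors: Bin Liu (Peking University), Yanjing Wang (Peking University)
Affiliations: Unknown
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
Comments: In Proceedings TARK 2025, arXiv:2511.20540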

Abstract:Distributed knowledge is a key concept in the standard epistemic logic of knowledge-that. In this paper, we propose a corresponding notion of distributed knowledge-how and study its logic. Our framework generalizes two existing traditions in the logic of know-how: the individual-based multi-step framework and the coalition-based single-step framework. In particular, we assume a group can accomplish more than what its individuals can jointly do. The distributed knowledge-how is based on the distributed knowledge-that of a group whose multi-step strategies derive from distributed actions that subgroups can collectively perform. As the main result, we obtain a sound and strongly complete proof system for our logic of distributed knowledge-how, which closely resembles the logic of distributed knowledge-that in both the axioms and the proof method of completeness.

[AI-66] BINDER: Instantly Adaptive Mobile Manipulation with Open-Vocabulary Commands

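[Quick read]: This paper addresses the lack of robustness in open-vocabulary mobile manipulation (OVMM) caused by stale world representations: existing methods update environment information only at discrete moments (such as navigation targets or the end of an action), leaving the robot blind between updates and causing cascading failures such as overlooked objects, late error detection, and delayed replanning. The key to the solution is the BINDER framework, which decouples strategic planning from continuous environment monitoring through bidirectional coordination between a Deliberative Response Module (DRM, a multimodal LLM for task planning) and an Instant Response Module (IRM, a VideoLLM for real-time monitoring). The DRM performs structured 3D scene updates and guides what the IRM attends to, while the IRM continuously analyzes video streams to update memory, correct ongoing actions, and trigger replanning when necessary, maintaining awareness of the environment while avoiding frequent costly updates.

Link: https://arxiv.org/abs/2511.22364
Authors: Seongwon Cho, Daechul Ahn, Donghyun Shin, Hyeonbeom Choi, San Kim, Jonghyun Choi
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 12 pages, 8 figures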

Abstract:Open-vocabulary mobile manipulation (OVMM) requires robots to follow language instructions, navigate, and manipulate while updating their world representation under dynamic environmental changes. However, most prior approaches update their world representation only at discrete update points such as navigation targets, waypoints, or the end of an action step, leaving robots blind between updates and causing cascading failures: overlooked objects, late error detection, and delayed replanning. To address this limitation, we propose BINDER (Bridging INstant and DEliberative Reasoning), a dual process framework that decouples strategic planning from continuous environment monitoring. Specifically, BINDER integrates a Deliberative Response Module (DRM, a multimodal LLM for task planning) with an Instant Response Module (IRM, a VideoLLM for continuous monitoring). The two modules play complementary roles: the DRM performs strategic planning with structured 3D scene updates and guides what the IRM attends to, while the IRM analyzes video streams to update memory, correct ongoing actions, and trigger replanning when necessary. Through this bidirectional coordination, the modules address the trade off between maintaining awareness and avoiding costly updates, enabling robust adaptation under dynamic conditions. Evaluated in three real world environments with dynamic object placement, BINDER achieves substantially higher success and efficiency than SoTA baselines, demonstrating its effectiveness for real world deployment.

[AI-67] Test Time Training for AC Power Flow Surrogates via Physics and Operational Constraint Refinement

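[Quick read]: This paper addresses the difficulty that machine learning (ML) based power flow (PF) methods have in maintaining physical consistency. Traditional numerical methods are physically accurate but computationally expensive, whereas purely ML models are fast but struggle to satisfy the AC power flow equations and operational constraints. The key to the solution is a physics-informed test-time training (PI-TTT) framework that, at inference time, applies a lightweight self-supervised refinement to the surrogate's outputs through a few gradient-based updates, directly enforcing the AC power flow equalities and operational constraints. This enables local adaptation to unseen operating conditions without labeled data, improving accuracy and physical feasibility while preserving the computational advantage.

Link: https://arxiv.org/abs/2511.22343
Authors: Panteleimon Dogoulis, Mohammad Iman Alizadeh, Sylvain Kubler, Maxime Cordy
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: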

Abstract:Power Flow (PF) calculations based on machine learning (ML) techniques offer significant computational advantages over traditional numerical methods but often struggle to maintain full physical consistency. This paper introduces a physics-informed test-time training (PI-TTT) framework that enhances the accuracy and feasibility of ML-based PF surrogates by enforcing AC power flow equalities and operational constraints directly at inference time. The proposed method performs a lightweight self-supervised refinement of the surrogate outputs through a few gradient-based updates, enabling local adaptation to unseen operating conditions without requiring labeled data. Extensive experiments on the IEEE 14-, 118-, and 300-bus systems and the PEGASE 1354-bus network show that PI-TTT reduces power flow residuals and operational constraint violations by one to two orders of magnitude compared with purely ML-based models, while preserving their computational advantage. The results demonstrate that PI-TTT provides fast, accurate, and physically reliable predictions, representing a promising direction for scalable and physics-consistent learning in power system analysis.
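The general idea of label-free refinement against a physics residual can be shown on a deliberately tiny case. The sketch below is a toy, not PI-TTT: it uses a single lossless line between two buses, refines the predicted angle directly instead of updating a network, and every number (the surrogate's guess, the target injection, the step size) is made up for illustration.

```python
import math

def injected_power(theta, v1=1.0, v2=1.0, x=0.1):
    """Active power over a lossless line: P = V1 * V2 * sin(theta) / X."""
    return v1 * v2 * math.sin(theta) / x

def refine(theta, p_target, steps=5, lr=0.004, x=0.1):
    """A few gradient steps on the squared power-balance residual (no labels).

    Assumes unit voltage magnitudes, matching the defaults of injected_power.
    """
    for _ in range(steps):
        residual = injected_power(theta, x=x) - p_target
        grad = 2.0 * residual * math.cos(theta) / x  # d(residual^2)/d(theta)
        theta -= lr * grad
    return theta

theta_hat = 0.25   # stand-in for a surrogate's prediction (made-up number)
p_target = 2.0     # specified injection in per-unit (made-up number)
before = abs(injected_power(theta_hat) - p_target)
after = abs(injected_power(refine(theta_hat, p_target)) - p_target)
```

Only the physical equation and the specified injection are used during refinement, so no ground-truth solution is needed; the residual `after` ends up far below `before` after five steps.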

[AI-68] Edge Deployment of Small Language Models, a comprehensive comparison of CPU, GPU and NPU backends

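[Quick read]: This paper addresses how to choose the hardware platform that best balances inference performance and energy efficiency when deploying Small Language Models (SLMs) in resource-constrained edge environments. Its approach is a systematic evaluation of commercial CPUs (Intel and ARM), GPUs (NVIDIA), and Neural Processing Units (NPUs) running a range of state-of-the-art SLMs. It finds that specialized accelerators outperform general-purpose CPUs, with NPUs achieving the highest performance by a wide margin, and that bandwidth normalization is essential for fair cross-architecture comparison. Although low-power ARM processors are competitive on energy usage, metrics that combine performance and power, such as the energy-delay product (EDP), again favor NPUs.

Link: https://arxiv.org/abs/2511.22334
Authors: Pablo Prieto, Pablo Abad
Affiliations: Unknown
Categories: Performance (cs.PF); Artificial Intelligence (cs.AI)
Comments: 8 pages, 9 figures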

Abstract:Edge computing processes data where it is generated, enabling faster decisions, lower bandwidth usage, and improved privacy. However, edge devices typically operate under strict constraints on processing power, memory, and energy consumption, making them unsuitable for large language models (LLMs). Fortunately, Small Language Models (SLMs) offer lightweight alternatives that bring AI inference to resource-constrained environments by significantly reducing computational cost while remaining suitable for specialization and customization. In this scenario, selecting the hardware platform that best balances performance and efficiency for SLM inference is challenging due to strict resource limitations. To address this issue, this study evaluates the inference performance and energy efficiency of commercial CPUs (Intel and ARM), GPUs (NVIDIA), and NPUs (RaiderChip) for running SLMs. GPUs, the usual platform of choice, are compared against commercial NPUs and recent multi-core CPUs. While NPUs leverage custom hardware designs optimized for computation, modern CPUs increasingly incorporate dedicated features targeting language-model workloads. Using a common execution framework and a suite of state-of-the-art SLMs, we analyze both maximum achievable performance and processing and energy efficiency across commercial solutions available for each platform. The results indicate that specialized backends outperform general-purpose CPUs, with NPUs achieving the highest performance by a wide margin. Bandwidth normalization proves essential for fair cross-architecture comparisons. Although low-power ARM processors deliver competitive results when energy usage is considered, metrics that combine performance and power (such as EDP) again highlight NPUs as the dominant architecture. These findings show that designs optimized for both efficiency and performance offer a clear advantage for edge workloads.
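The energy-delay product mentioned above combines energy and latency into one figure of merit (lower is better). A minimal sketch of how it is computed follows; the power and latency values are hypothetical placeholders, not measurements from the paper.

```python
def energy_delay_product(avg_power_w, latency_s):
    """EDP = energy * delay, with energy = average power * latency."""
    energy_j = avg_power_w * latency_s
    return energy_j * latency_s

# Hypothetical measurements (NOT from the paper): (average watts, seconds per request).
backends = {"cpu": (15.0, 4.0), "gpu": (60.0, 1.0), "npu": (10.0, 0.5)}
ranking = sorted(backends, key=lambda name: energy_delay_product(*backends[name]))
```

Because latency enters quadratically, a backend that draws more power can still obtain a better EDP if it is sufficiently faster, which is why EDP and raw energy usage can rank platforms differently.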

[AI-69] Tracing Footsteps of Similar Cities: Modeling Urban Economic Vitality with Dynamic Inter-City Graph Embeddings

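[Quick read]: This paper addresses the difficulty of modeling urban economic vitality. Traditional methods rely on static city-level aggregates and fail to capture dynamic inter-city relationships, namely that one city's current development trajectory may later be mirrored by structurally similar cities. The key to the solution is the ECO-GROW framework, which builds a multi-graph structure integrating industrial linkages, point-of-interest (POI) similarity, migration similarity, and 15 years of temporal network evolution; uses a Dynamic Top-K graph convolutional network (GCN) to adaptively select influential inter-city connections; and introduces a learnable Graph Scorer to dynamically weight cross-regional impacts. A link prediction task based on Barabasi Proximity further optimizes the graph representation, improving the accuracy of predictions of entrepreneurial activity and employment trends.

Link: https://arxiv.org/abs/2511.22325
Authors: Xiaofeng Li, Xiangyi Xiao, Xiaocong Du, Ying Zhang, Haipeng Zhang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: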

Abstract:Urban economic vitality is a crucial indicator of a city’s long-term growth potential, comprising key metrics such as the annual number of new companies and the population employed. However, modeling urban economic vitality remains challenging. This study develops ECO-GROW, a multi-graph framework modeling China’s inter-city networks (2005-2021) to generate urban embeddings that model urban economic vitality. Traditional approaches relying on static city-level aggregates fail to capture a fundamental dynamic: the developmental trajectory of one city today may mirror that of its structurally similar counterparts tomorrow. ECO-GROW overcomes this limitation by integrating industrial linkages, POI similarities, migration similarities and temporal network evolution over 15 years. The framework combines a Dynamic Top-K GCN to adaptively select influential inter-city connections and an adaptive Graph Scorer mechanism to dynamically weight cross-regional impacts. Additionally, the model incorporates a link prediction task based on Barabasi Proximity, optimizing the graph representation. Experimental results demonstrate ECO-GROW’s superior accuracy in predicting entrepreneurial activities and employment trends compared to conventional models. By open-sourcing our code, we enable government agencies and public sector organizations to leverage big data analytics for evidence-based urban planning, economic policy formulation, and resource allocation decisions that benefit society at large.

[AI-70] Enhanced Conditional Generation of Double Perovskite by Knowledge-Guided Language Model Feedback

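[Quick read]: This paper addresses conditional materials discovery for double perovskites (DPs), candidates for sustainable energy technologies whose vast design space makes the task difficult. Conventional methods struggle to explore the high-dimensional composition space while keeping results physically plausible, and existing generative models such as GANs show limited stability and validity. The key to the solution is a multi-agent system (MAS) framework driven by text gradients that guides DP composition generation with three complementary feedback sources: self-evaluation by a large language model (LLM), feedback informed by domain knowledge, and feedback from a machine learning (ML) surrogate. The domain-knowledge text gradients are the core innovation: they steer generation toward physically meaningful regions, raising the share of stable or metastable candidates to as much as 54% and achieving over 98% compositional validity without additional training data, compared with 43% for the LLM-only baseline and 27% for prior GAN-based results.

Link: https://arxiv.org/abs/2511.22307
Authors: Inhyo Lee, Junhyeong Lee, Jongwon Park, KyungTae Lim, Seunghwa Ryu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: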

Abstract:Double perovskites (DPs) are promising candidates for sustainable energy technologies due to their compositional tunability and compatibility with low-energy fabrication, yet their vast design space poses a major challenge for conditional materials discovery. This work introduces a multi-agent, text gradient-driven framework that performs DP composition generation under natural-language conditions by integrating three complementary feedback sources: LLM-based self-evaluation, DP-specific domain knowledge-informed feedback, and ML surrogate-based feedback. Analogous to how knowledge-informed machine learning improves the reliability of conventional data-driven models, our framework incorporates domain-informed text gradients to guide the generative process toward physically meaningful regions of the DP composition space. Systematic comparison of three incremental configurations, (i) pure LLM generation, (ii) LLM generation with LLM reasoning-based feedback, and (iii) LLM generation with domain knowledge-guided feedback, shows that iterative guidance from knowledge-informed gradients improves stability-condition satisfaction without additional training data, achieving over 98% compositional validity and up to 54% stable or metastable candidates, surpassing both the LLM-only baseline (43%) and prior GAN-based results (27%). Analyses of ML-based gradients further reveal that they enhance performance in in-distribution (ID) regions but become unreliable in out-of-distribution (OOD) regimes. Overall, this work provides the first systematic analysis of multi-agent, knowledge-guided text gradients for DP discovery and establishes a generalizable blueprint for MAS-driven generative materials design aimed at advancing sustainable technologies.

[AI-71] When AI Bends Metal: AI-Assisted Optimization of Design Parameters in Sheet Metal Forming

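[Quick read]: This paper addresses the inefficiency of numerical simulation in industrial design, where parameter optimization depends on expert experience, consumes substantial computational resources, and carries a high environmental cost. Its core solution is an AI-assisted workflow based on Bayesian optimization: a deep learning model provides an initial parameter estimate, an active learning variant assists the expert when desired, and the optimization cycle iteratively refines the design parameters until a termination condition such as an energy budget or iteration limit is met, reducing reliance on experts and accelerating exploration of the design space.

Link: https://arxiv.org/abs/2511.22302
Authors: Ahmad Tarraf, Koutaiba Kassem-Manthey, Seyed Ali Mohammadi, Philipp Martin, Lukas Moj, Semih Burak, Enju Park, Christian Terboven, Felix Wolf
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
Comments: 17 pages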

Abstract:Numerical simulations have revolutionized the industrial design process by reducing prototyping costs and design iterations and by enabling product engineers to explore the design space more efficiently. However, the growing scale of simulations demands substantial expert knowledge, computational resources, and time. A key challenge is identifying input parameters that yield optimal results, as iterative simulations are costly and can have a large environmental impact. This paper presents an AI-assisted workflow that reduces expert involvement in parameter optimization through the use of Bayesian optimization. Furthermore, we present an active learning variant of the approach, assisting the expert if desired. A deep learning model provides an initial parameter estimate, from which the optimization cycle iteratively refines the design until a termination condition (e.g., energy budget or iteration limit) is met. We demonstrate our approach, based on a sheet metal forming process, and show how it enables us to accelerate the exploration of the design space while reducing the need for expert involvement.

[AI-72] Adaptive tumor growth forecasting via neural universal ODEs

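[Quick read]: This paper addresses the difficulty that classical tumor growth models (such as the Gompertz and Bertalanffy equations) have in adapting to patient-specific variability when individual data are limited, which reduces the precision of treatment optimization. The key to the solution is the use of two core methods from Scientific Machine Learning (SciML), Neural Ordinary Differential Equations (Neural ODEs) and Universal Differential Equations (UDEs): rigid terms of the classical model are replaced with learnable neural networks, implemented in the Julia programming language, to build tumor growth models that learn hidden dynamics from experimental data. This supports forecasting under data constraints and symbolic recovery, which turns the learned dynamics into explicit mathematical expressions, with the potential to improve predictive accuracy and support dynamic, personalized treatment strategies.

Link: https://arxiv.org/abs/2511.22292
Authors: Kavya Subramanian, Prathamesh Dinesh Joshi, Raj Abhijit Dandekar, Rajat Dandekar, Sreedath Panat
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at JuliaCon 2025 conference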

Abstract:Forecasting tumor growth is critical for optimizing treatment. Classical growth models such as the Gompertz and Bertalanffy equations capture general tumor dynamics but may fail to adapt to patient-specific variability, particularly with limited data available. In this study, we leverage Neural Ordinary Differential Equations (Neural ODEs) and Universal Differential Equations (UDEs), two pillars of Scientific Machine Learning (SciML), to construct adaptive tumor growth models capable of learning from experimental data. Using the Gompertz model as a baseline, we replace rigid terms with adaptive neural networks to capture hidden dynamics through robust modeling in the Julia programming language. We use our models to perform forecasting under data constraints and symbolic recovery to transform the learned dynamics into explicit mathematical expressions. Our approach has the potential to improve predictive accuracy, guiding dynamic and effective treatment strategies for improved clinical outcomes.
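
As a point of reference for the Gompertz baseline named in this entry, the following is a minimal Python sketch (the paper itself works in Julia). It assumes one common parameterization, dV/dt = a · V · ln(K / V), and checks a numerical integration against the closed-form solution. The parameter values and the `correction` hook, which stands in for the learned term a UDE would add, are illustrative assumptions and are not taken from the paper.

```python
import math

def gompertz_rhs(v, a, k):
    """Gompertz growth rate dV/dt = a * V * ln(K / V)."""
    return a * v * math.log(k / v)

def gompertz_exact(t, v0, a, k):
    """Closed-form solution V(t) = K * exp(ln(V0 / K) * exp(-a * t))."""
    return k * math.exp(math.log(v0 / k) * math.exp(-a * t))

def integrate(v0, a, k, t_end, steps, correction=lambda v: 0.0):
    """RK4 integration of dV/dt = gompertz_rhs(V) + correction(V).

    `correction` is a hypothetical placeholder for a learned term;
    it defaults to zero, which recovers the plain Gompertz model.
    """
    def f(v):
        return gompertz_rhs(v, a, k) + correction(v)

    h = t_end / steps
    v = v0
    for _ in range(steps):
        k1 = f(v)
        k2 = f(v + 0.5 * h * k1)
        k3 = f(v + 0.5 * h * k2)
        k4 = f(v + h * k3)
        v += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
    return v

# Illustrative values only (not from the paper).
V0, A, K = 0.1, 0.3, 10.0
numeric = integrate(V0, A, K, t_end=20.0, steps=2000)
exact = gompertz_exact(20.0, V0, A, K)
print(f"numeric={numeric:.6f} exact={exact:.6f}")
```

Passing a non-zero `correction` shows where a data-driven term would enter the right-hand side; how that term is parameterized and trained is left to the paper.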

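[AI-73] RecToM: A Benchmark for Evaluating Machine Theory of Mind in LLM-based Conversational Recommender Systems AAAI2026

[Quick Read]: This paper addresses the insufficient evaluation of Theory of Mind (ToM) abilities of Large Language Models (LLMs) in conversational recommender systems. Existing benchmarks mostly rely on synthetic, Sally-Anne-style narratives and overlook both the complexity of mental-state inference in realistic dialogue and behavioral prediction, a key feature of human ToM. The key to its solution is RecToM, a new benchmark focused on two complementary dimensions: Cognitive Inference, which infers the user's underlying mental states (such as desires, intentions, and beliefs) from the dialogue content; and Behavioral Prediction, which evaluates whether a model can use those inferred mental states to formulate and select appropriate follow-up dialogue strategies. Experiments show that although LLMs can partially recognize mental states, they struggle to maintain coherent, strategic ToM reasoning in dynamic recommendation dialogues, particularly when tracking evolving intentions and aligning strategies with them.

Link: https://arxiv.org/abs/2511.22275
Authors: Mengfan Li, Xuanhua Shi, Yang Deng
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted by AAAI 2026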

Abstract:Large language models are revolutionizing conversational recommender systems through their impressive capabilities in instruction comprehension, reasoning, and human interaction. A core factor underlying effective recommendation dialogue is the ability to infer and reason about users’ mental states (such as desire, intention, and belief), a cognitive capacity commonly referred to as Theory of Mind. Despite growing interest in evaluating ToM in LLMs, current benchmarks predominantly rely on synthetic narratives inspired by the Sally-Anne test, which emphasize physical perception and fail to capture the complexity of mental state inference in realistic conversational settings. Moreover, existing benchmarks often overlook a critical component of human ToM: behavioral prediction, the ability to use inferred mental states to guide strategic decision-making and select appropriate conversational actions for future interactions. To better align LLM-based ToM evaluation with human-like social reasoning, we propose RecToM, a novel benchmark for evaluating ToM abilities in recommendation dialogues. RecToM focuses on two complementary dimensions: Cognitive Inference and Behavioral Prediction. The former focuses on understanding what has been communicated by inferring the underlying mental states. The latter emphasizes what should be done next, evaluating whether LLMs can leverage these inferred mental states to predict, select, and assess appropriate dialogue strategies. Extensive experiments on state-of-the-art LLMs demonstrate that RecToM poses a significant challenge. While the models exhibit partial competence in recognizing mental states, they struggle to maintain coherent, strategic ToM reasoning throughout dynamic recommendation dialogues, particularly in tracking evolving intentions and aligning conversational strategies with inferred mental states.

[AI-74] Efficiency and Effectiveness of SPLADE Models on Billion-Scale Web Document Title

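[Quick Read]: This paper addresses the weak performance of traditional sparse retrieval models (such as BM25) on complex queries in large-scale web document retrieval, together with the high computational overhead of newer models based on sparse lexical representations (such as SPLADE and Expanded-SPLADE). The key to its solution is a pair of efficient pruning strategies: document-centric pruning, and top-k term selection based on query term importance, combined with boolean queries and term-threshold filtering. These reduce computational cost substantially while maintaining high retrieval effectiveness. Experiments show that Expanded-SPLADE offers the best balance between effectiveness and efficiency and is particularly suited to deployment in large-scale search engines.

Link: https://arxiv.org/abs/2511.22263
Authors: Taeryun Won, Tae Kwan Lee, Hiun Kim, Hyemin Lee
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: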

Abstract:This paper presents a comprehensive comparison of BM25, SPLADE, and Expanded-SPLADE models in the context of large-scale web document retrieval. We evaluate the effectiveness and efficiency of these models on datasets spanning from tens of millions to billions of web document titles. SPLADE and Expanded-SPLADE, which utilize sparse lexical representations, demonstrate superior retrieval performance compared to BM25, especially for complex queries. However, these models incur higher computational costs. We introduce pruning strategies, including document-centric pruning, top-k query term selection, and boolean queries with a term threshold, to mitigate these costs and improve the models’ efficiency without significantly sacrificing retrieval performance. The results show that Expanded-SPLADE strikes the best balance between effectiveness and efficiency, particularly when handling large datasets. Our findings offer valuable insights for deploying sparse retrieval models in large-scale search engines.
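
The pruning ideas named in this entry can be illustrated with a small Python sketch over sparse term-weight dictionaries. This is not the paper's implementation: the weights are made up, and reading "boolean query with term threshold" as "require at least `min_match` shared terms before scoring" is an assumption made here for illustration.

```python
def prune_document(term_weights, max_terms):
    """Document-centric pruning: keep the max_terms highest-weighted terms."""
    ranked = sorted(term_weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:max_terms])

def select_query_terms(query_weights, k):
    """Top-k query term selection: keep the k highest-weighted query terms."""
    ranked = sorted(query_weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:k])

def score(query_weights, doc_weights, min_match=1):
    """Dot product over shared terms, gated by a minimum number of matches.

    The min_match gate is a hypothetical stand-in for a boolean query
    with a term threshold.
    """
    shared = set(query_weights) & set(doc_weights)
    if len(shared) < min_match:
        return 0.0
    return sum(query_weights[t] * doc_weights[t] for t in shared)

# Made-up sparse representations (term -> weight).
doc = {"python": 2.1, "tutorial": 1.4, "code": 0.9, "learn": 0.3, "the": 0.05}
query = {"python": 1.8, "learn": 1.1, "programming": 0.7, "how": 0.1}

pruned_doc = prune_document(doc, max_terms=3)
pruned_query = select_query_terms(query, k=2)
full_score = score(query, doc)
pruned_score = score(pruned_query, pruned_doc)
print(full_score, pruned_score)
```

In this toy case pruning drops the low-weight `learn` match from the document, so the score falls slightly while the dominant term still matches, which is the trade-off the abstract describes.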

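[AI-75] Co-Evolving Agents: Learning from Failures as Hard Negatives

[Quick Read]: This paper addresses the tendency of self-improving agents to overfit under limited ground-truth trajectory supervision. The core challenge is that preference optimization relying on predicted trajectories lacks sufficiently diverse negative samples, which blurs decision boundaries and limits generalization. The key to its solution is a co-evolving agents framework in which the target agent is optimized jointly with an auxiliary failure agent. The failure agent learns through preference optimization over failure trajectories generated by both the target agent and itself, producing hard negatives that are close to success yet remain failures. Introducing these structured failure signals into the target agent's training sharpens decision boundaries and improves generalization.

Link: https://arxiv.org/abs/2511.22254
Authors: Yeonsung Jung, Trilok Padhi, Sina Shaham, Dipika Khullar, Joonhyun Jeong, Ninareh Mehrabi, Eunho Yang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: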

Abstract:The rapid progress of large foundation models has accelerated the development of task-specialized agents across diverse domains. However, the effectiveness of agents remains tightly coupled with the quality of training data, while curating task-specific datasets remains costly and often infeasible in real-world scenarios. Recent work has explored self-improving agents that autonomously generate, refine, and re-train on their own trajectories. A prominent line of approaches further leverages preference optimization by pairing predicted trajectories with scarce ground-truth trajectories, enabling agents to learn directly from their own failures. While these methods outperform supervised fine-tuning, their heavy reliance on predicted trajectories under limited ground-truth supervision leaves them prone to overfitting. To address this, we propose a co-evolving agents framework in which a target agent improves jointly with an auxiliary failure agent. The failure agent learns through preference optimization over failure trajectories from both the target and itself, thereby generating hard negatives that are close to success yet remain failures. Incorporating these informative hard negatives into the target agent’s optimization sharpens decision boundaries and enhances generalization. Our comprehensive analysis and experiments across benchmark datasets show that our method not only shows improved performance but also demonstrates that failures, instead of being used as-is, can be systematically transformed into structured and valuable learning signals in self-improving agents.

[AI-76] Evaluating Embedding Models and Pipeline Optimization for AI Search Quality

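[Quick Read]: This paper addresses the unclear effect of text embedding models and their pipeline configurations on retrieval accuracy in AI-driven search systems. Its core solution is a systematic evaluation of multiple embedding models (such as All-MPNet, BGE, GTE, and Qwen) across different dimensions, indexing methods (Milvus HNSW/IVF), and chunking strategies, using a custom evaluation dataset of 11,975 query-chunk pairs and quantifying performance with reference-based metrics such as Top-K Accuracy and NDCG. Key findings: higher-dimensional embeddings significantly improve search quality (for example, Qwen3-Embedding-8B/4096 reaches a Top-3 accuracy of about 0.571, versus 0.412 for GTE-large/1024); neural re-rankers (such as a BGE cross-encoder) further improve ranking accuracy (Top-3 up to 0.527); and finer-grained chunking (512 characters versus 2000 characters) also improves accuracy.

Link: https://arxiv.org/abs/2511.22240
Authors: Philip Zhong, Kent Chen, Don Wang
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: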

Abstract:We evaluate the performance of various text embedding models and pipeline configurations for AI-driven search systems. We compare sentence-transformer and generative embedding models (e.g., All-MPNet, BGE, GTE, and Qwen) at different dimensions, indexing methods (Milvus HNSW/IVF), and chunking strategies. A custom evaluation dataset of 11,975 query-chunk pairs was synthesized from US City Council meeting transcripts using a local large language model (LLM). The data pipeline includes preprocessing, automated question generation per chunk, manual validation, and continuous integration/continuous deployment (CI/CD) integration. We measure retrieval accuracy using reference-based metrics: Top-K Accuracy and Normalized Discounted Cumulative Gain (NDCG). Our results demonstrate that higher-dimensional embeddings significantly boost search quality (e.g., Qwen3-Embedding-8B/4096 achieves Top-3 accuracy about 0.571 versus 0.412 for GTE-large/1024), and that neural re-rankers (e.g., a BGE cross-encoder) further improve ranking accuracy (Top-3 up to 0.527). Finer-grained chunking (512 characters versus 2000 characters) also improves accuracy. We discuss the impact of these factors and outline future directions for pipeline automation and evaluation.
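
For readers unfamiliar with the two metrics named in this entry, here is a small Python sketch of Top-K Accuracy and NDCG@k. It assumes one gold chunk per query for Top-K Accuracy and uses the standard log2 discount for NDCG; the query and chunk identifiers are invented, and the paper's exact evaluation code may differ.

```python
import math

def top_k_accuracy(rankings, gold, k):
    """Fraction of queries whose gold chunk appears among the top k results."""
    hits = sum(1 for q, ranked in rankings.items() if gold[q] in ranked[:k])
    return hits / len(rankings)

def ndcg_at_k(ranked, relevance, k):
    """NDCG@k, with relevance given as {chunk_id: gain}."""
    dcg = sum(
        relevance.get(chunk, 0.0) / math.log2(rank + 2)
        for rank, chunk in enumerate(ranked[:k])
    )
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal_gains))
    return dcg / idcg if idcg > 0 else 0.0

# Invented retrieval results: query id -> ranked chunk ids.
rankings = {
    "q1": ["c3", "c1", "c7"],
    "q2": ["c2", "c9", "c4"],
    "q3": ["c5", "c6", "c8"],
}
gold = {"q1": "c1", "q2": "c4", "q3": "c0"}

top1 = top_k_accuracy(rankings, gold, k=1)
top3 = top_k_accuracy(rankings, gold, k=3)
ndcg_q1 = ndcg_at_k(rankings["q1"], {"c1": 1.0}, k=3)
print(top1, top3, ndcg_q1)
```

Top-K Accuracy only asks whether the gold chunk was retrieved at all, while NDCG also rewards placing it higher, which is why a re-ranker can raise NDCG without changing which chunks are retrieved.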

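[AI-77] Training High-Level Schedulers with Execution-Feedback Reinforcement Learning for Long-Horizon GUI Automation

[Quick Read]: This paper addresses two core challenges that GUI agents face on long-horizon tasks. First, a single-agent model struggles to balance high-level capabilities with low-level execution, leading to responsibility coupling and capability conflicts. Second, agents lack awareness of the task state, which causes progress loss over long-horizon tasks. The key to its solution is a staged execution-feedback reinforcement learning algorithm together with a multi-agent framework (CES) composed of a Coordinator, an Executor, and a State Tracker. The Coordinator handles strategic planning and task decomposition, while the State Tracker handles context compression and information management to keep the task state consistent. Together they improve the system's long-horizon planning and state retention, and the high-level scheduling module is general and plug-and-play, improving the long-horizon performance of various low-level Executors.

Link: https://arxiv.org/abs/2511.22235
Authors: Zehao Deng, Tianjie Ju, Zheng Wu, Zhuosheng Zhang, Gongshen Liu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: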

Abstract:The rapid development of large vision-language models (VLMs) has greatly promoted research on GUI agents. However, GUI agents still face significant challenges in handling long-horizon tasks. First, single-agent models struggle to balance high-level capabilities and low-level execution capability, facing prevalent issues of responsibility coupling and capability conflicts. Second, agents lack awareness of the task state, leading to progress loss in long-horizon tasks. To address these challenges, we propose a staged execution-feedback reinforcement learning algorithm. Unlike training a unified policy model, we focus on training high-level scheduling models. Specifically, we propose and train two agents: a Coordinator, responsible for the strategic planning and task decomposition; and a State Tracker, responsible for context compression and information management to maintain the task’s state and coherence. Based on this, we built the Coordinator-Executor-State Tracker (CES) multi-agent framework, which can be integrated with any low-level Executor model, assisting the Executor in solving long-horizon tasks through task scheduling and state management. Experiments on long-horizon task benchmarks demonstrate that CES significantly enhances the system’s planning and state management capabilities. Furthermore, analysis confirms that our trained high-level scheduling module is a generalizable, plug-and-play module that significantly enhances the long-horizon capabilities of various Executors. Code is available at this https URL.

[AI-78] Embedded Universal Predictive Intelligence: a coherent framework for multi-agent learning

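[Quick Read]: This paper addresses the theoretical challenges that non-stationary environment dynamics pose for multi-agent reinforcement learning, in particular the breakdown of traditional frameworks, which assume agents are decoupled from their environment, once agents are embedded in it (embedded agency). Its core solution is a mathematical framework centered on self-prediction, in which Bayesian reinforcement learning (Bayesian RL) agents predict not only future perceptual inputs but also their own actions, and therefore handle epistemic uncertainty by treating themselves as part of the environment while modeling other agents. This mechanism lets agents reason that other agents are running similar algorithms, giving rise to new game-theoretic solution concepts and forms of cooperation that classical decoupled agents cannot achieve. The paper also extends the theory of AIXI, proposing idealized embedded agents that start from a Solomonoff prior and possess infinite-order theory of mind, which offers a potential gold standard for embedded multi-agent learning.

Link: https://arxiv.org/abs/2511.22226
Authors: Alexander Meulemans, Rajai Nasser, Maciej Wołczyk, Marissa A. Weis, Seijin Kobayashi, Blake Richards, Guillaume Lajoie, Angelika Steger, Marcus Hutter, James Manyika, Rif A. Saurous, João Sacramento, Blaise Agüera y Arcas
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 203 pages, 3 figures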

Abstract:The standard theory of model-free reinforcement learning assumes that the environment dynamics are stationary and that agents are decoupled from their environment, such that policies are treated as being separate from the world they inhabit. This leads to theoretical challenges in the multi-agent setting where the non-stationarity induced by the learning of other agents demands prospective learning based on prediction models. To accurately model other agents, an agent must account for the fact that those other agents are, in turn, forming beliefs about it to predict its future behavior, motivating agents to model themselves as part of the environment. Here, building upon foundational work on universal artificial intelligence (AIXI), we introduce a mathematical framework for prospective learning and embedded agency centered on self-prediction, where Bayesian RL agents predict both future perceptual inputs and their own actions, and must therefore resolve epistemic uncertainty about themselves as part of the universe they inhabit. We show that in multi-agent settings, self-prediction enables agents to reason about others running similar algorithms, leading to new game-theoretic solution concepts and novel forms of cooperation unattainable by classical decoupled agents. Moreover, we extend the theory of AIXI, and study universally intelligent embedded agents which start from a Solomonoff prior. We show that these idealized agents can form consistent mutual predictions and achieve infinite-order theory of mind, potentially setting a gold standard for embedded multi-agent learning.

[AI-79] PULSE-ICU: A Pretrained Unified Long-Sequence Encoder for Multi-task Prediction in Intensive Care Units

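[Quick Read]: This paper addresses the highly irregular, heterogeneous, and temporally fragmented nature of Intensive Care Unit (ICU) data, which limits the generalization of clinical prediction models. The key to its solution is PULSE-ICU, a self-supervised foundation model that learns event-level ICU representations from large-scale Electronic Health Record (EHR) sequences without resampling or manual feature engineering. A unified embedding module encodes event identity, continuous values, units, and temporal attributes, and a Longformer-based encoder models long clinical trajectories efficiently. The model shows strong, consistent performance on 18 prediction tasks (including mortality, intervention forecasting, and phenotype identification), and external validation on the eICU, HiRID, and P12 datasets indicates robustness to domain shift and variable constraints.

Link: https://arxiv.org/abs/2511.22199
Authors: Sejeong Jang, Joo Heung Yoon, Hyo Kyung Lee
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: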

Abstract:Intensive care unit (ICU) data are highly irregular, heterogeneous, and temporally fragmented, posing challenges for generalizable clinical prediction. We present PULSE-ICU, a self-supervised foundation model that learns event-level ICU representations from large-scale EHR sequences without resampling or manual feature engineering. A unified embedding module encodes event identity, continuous values, units, and temporal attributes, while a Longformer-based encoder enables efficient modeling of long trajectories. PULSE-ICU was fine-tuned across 18 prediction tasks, including mortality, intervention forecasting, and phenotype identification, achieving strong performance across task types. External validation on eICU, HiRID, and P12 showed substantial improvements with minimal fine-tuning, demonstrating robustness to domain shift and variable constraints. These findings suggest that foundation-style modeling can improve data efficiency and adaptability, providing a scalable framework for ICU decision support across diverse clinical environments.

[AI-80] WearVQA: A Visual Question Answering Benchmark for Wearables in Egocentric Authentic Real-world scenarios NEURIPS2025

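[Quick Read]: This paper addresses the lack of a dedicated benchmark for evaluating Multi-modal Large Language Models (MLLMs) on Visual Question Answering (VQA) for wearable devices such as smart glasses. Existing benchmarks are built mainly on high-quality, third-person imagery and do not reflect the low-quality visual inputs (occlusion, poor lighting, blur, and so on) and first-person interaction scenarios common in real wearable use. The key to its solution is WearVQA, the first benchmark designed specifically to evaluate the VQA capabilities of multi-modal AI assistants on wearables. It contains 2,520 carefully curated image-question-answer triplets covering 7 image domains, 10 cognitive task types, and 6 typical wearables-specific image quality issues, and is paired with a high-accuracy LLM-as-a-judge evaluation framework. It reveals a marked performance drop for current open-source and proprietary MLLMs in realistic scenarios and is intended to drive progress toward robust multi-modal AI systems for wearables.

Link: https://arxiv.org/abs/2511.22154
Authors: Eun Chang, Zhuangqun Huang, Yiwei Liao, Sagar Ravi Bhavsar, Amogh Param, Tammy Stark, Adel Ahmadyan, Xiao Yang, Jiaqi Wang, Ahsan Abdullah, Giang Nguyen, Akil Iyer, David Hall, Elissa Li, Shane Moon, Nicolas Scheffer, Kirmani Ahmed, Babak Damavandi, Rakesh Wanga, Anuj Kumar, Rohit Patel, Xin Luna Dong
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 11 pages, 5 figures, NeurIPS 2025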

Abstract:We introduce WearVQA, the first benchmark specifically designed to evaluate the Visual Question Answering (VQA) capabilities of multi-modal AI assistants on wearable devices like smart glasses. Unlike prior benchmarks that focus on high-quality, third-person imagery, WearVQA reflects the unique challenges of ego-centric interaction, where visual inputs may be occluded, poorly lit, unzoomed, or blurry, and questions are grounded in realistic wearable use cases. The benchmark comprises 2,520 carefully curated image-question-answer triplets, spanning 7 diverse image domains including both text-centric and general scenes, 10 cognitive task types ranging from basic recognition to various forms of reasoning, and 6 common wearables-specific image quality issues. All questions are designed to be answerable using only the visual input and common sense. WearVQA is paired with a rigorous LLM-as-a-judge evaluation framework with 96% labeling accuracy. Open-source and proprietary multi-modal LLMs achieved a QA accuracy as low as 24-52% on WearVQA, with substantial drops on lower-quality images and reasoning-heavy tasks. These observations position WearVQA as a comprehensive and challenging benchmark for guiding technical advancement towards robust, real-world multi-modal wearables AI systems.

[AI-81] A perceptual bias of AI Logical Argumentation Ability in Writing

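[Quick Read]: The question this paper tries to answer is why people disagree so sharply about whether Artificial Intelligence (AI) has human-like logical reasoning ability, even when they observe the same AI performance. The study suggests that this disagreement may stem from people's preconceived cognitive biases about AI rather than from objective differences in capability. The key to its approach is identifying and quantifying these perceptual biases: an experiment compares participants' ratings of logical reasoning in a human-written text and an AI-generated text on the same topic, and finds that ratings are significantly influenced by each participant's preconceived view of AI's logical ability. In addition, frequent AI users are less likely to believe that AI use undermines independent thinking. The study therefore stresses the need to acknowledge and mitigate such perceptual biases in order to improve public understanding of AI's capabilities and foster more effective human-AI interaction.

Link: https://arxiv.org/abs/2511.22151
Authors: Xi Cun, Jifan Ren, Asha Huang, Siyu Li, Ruzhen Song
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: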

Abstract:Can machines think? This is a central question in artificial intelligence research. However, there is a substantial divergence of views on the answer to this question. Why do people have such significant differences of opinion, even when they are observing the same real-world performance of artificial intelligence? Human-like logical reasoning ability is often used as a criterion to assess whether a machine can think. This study explores whether human biases influence evaluations of the reasoning abilities of AI. An experiment was conducted where participants assessed two texts on the same topic, one AI-generated and one human-written, to test for perceptual biases in evaluating logical reasoning. Based on the experimental findings, a questionnaire was designed to quantify the attitudes toward AI. The results reveal a bias in perception. The evaluations of the logical reasoning ability of AI-generated texts are significantly influenced by the preconceived views on the logical reasoning abilities of AI. Furthermore, frequent AI users were less likely to believe that AI usage undermines independent thinking. This study highlights the need to address perceptual biases to improve public understanding of AI’s capabilities and foster better human-AI interactions.

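[AI-82] Decomposed Trust: Exploring Privacy, Adversarial Robustness, Fairness, and Ethics of Low-Rank LLMs

[Quick Read]: This paper addresses the poorly understood effect of compressing Large Language Models (LLMs) with low-rank factorization on model trustworthiness, focusing on privacy protection, adversarial robustness, fairness, and ethical alignment. The key to its solution is a systematic evaluation of how trustworthiness changes when LLMs of different sizes and variants are compressed with a range of low-rank algorithms, revealing how compression affects each dimension. A gradient-based attribution analysis then identifies the model layers that contribute most to adversarial robustness, providing empirical evidence and guidance for trustworthy compression strategies.

Link: https://arxiv.org/abs/2511.22099
Authors: Daniel Agyei Asante, Md Mokarram Chowdhury, Yang Li
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 14 pages, 10 figures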

Abstract:Large language models (LLMs) have driven major advances across domains, yet their massive size hinders deployment in resource-constrained settings. Model compression addresses this challenge, with low-rank factorization emerging as a particularly effective method for reducing size, memory, and computation while maintaining accuracy. However, while these compressed models boast of benign performance and system-level advantages, their trustworthiness implications remain poorly understood. In this paper, we present the first comprehensive study of how low-rank factorization affects LLM trustworthiness across privacy, adversarial robustness, fairness, and ethical alignment. We evaluate multiple LLMs of different sizes and variants compressed with diverse low-rank algorithms, revealing key insights: (1) low-rank compression preserves or improves training data privacy but weakens PII protection during conversation; (2) adversarial robustness is generally preserved and often enhanced, even under deep compression; (3) ethical reasoning degrades in zero-shot settings but partially recovers with few-shot prompting; (4) fairness declines under compression. Beyond compression, we investigate how model scale and fine-tuning affect trustworthiness, as both are important in low-rank methods. To guide trustworthy compression strategies, we end our paper with a gradient-based attribution analysis to identify which layers in LLMs contribute most to adversarial robustness.
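
To make the size reduction from low-rank factorization concrete, the sketch below counts parameters when a dense m × n weight matrix W is replaced by two factors A (m × r) and B (r × n). This is the generic textbook form; the specific algorithms, layer shapes, and ranks evaluated in the paper are not given in the abstract, so the numbers here are illustrative assumptions.

```python
def dense_params(m, n):
    """Parameter count of a dense m x n weight matrix."""
    return m * n

def low_rank_params(m, n, r):
    """Parameter count after factoring W (m x n) into A (m x r) @ B (r x n)."""
    return r * (m + n)

def break_even_rank(m, n):
    """Largest rank r for which the factored form is strictly smaller.

    r * (m + n) < m * n  is equivalent to  r <= (m * n - 1) // (m + n).
    """
    return (m * n - 1) // (m + n)

# Illustrative shape and rank (not from the paper).
M, N, R = 4096, 4096, 512
ratio = low_rank_params(M, N, R) / dense_params(M, N)
max_rank = break_even_rank(M, N)
print(f"compression ratio={ratio:.2f}, break-even rank={max_rank}")
```

The arithmetic shows why rank is the main knob: storage and compute for the layer scale linearly in r, so the question the paper studies is what trustworthiness properties are lost or kept as r shrinks.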

[AI-83] Binary-30K: A Heterogeneous Dataset for Deep Learning in Binary Analysis and Malware Detection

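[Quick Read]: This paper addresses the infrastructure gap facing deep learning research on binary analysis: existing datasets are typically limited to a single platform, depend on specific toolchains, or provide only hand-engineered features, which makes them hard to use with modern neural architectures and leaves realistic use cases unsupported. The key to its solution is Binary-30K, the first heterogeneous binary dataset designed for sequence-based models such as Transformers. It covers Windows, Linux, macOS, and Android across more than 15 CPU architectures, contains 29,793 binaries (about 26.93% malware), and provides pre-computed byte-level BPE tokenization along with comprehensive structural metadata, supporting both sequence modeling and structure-aware methods. Platform-first stratified sampling ensures cross-platform representativeness, and official train/validation/test splits distributed via Hugging Face enable reproducible benchmarking, supporting research on platform-invariant detection, cross-target transfer learning, and long-context binary understanding.

Link: https://arxiv.org/abs/2511.22095
Authors: Michael J. Bommarito II
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 35 pages, 7 figures, 11 tables, 4 appendices. Dataset available at this https URL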

Abstract:Deep learning research for binary analysis faces a critical infrastructure gap. Today, existing datasets target single platforms, require specialized tooling, or provide only hand-engineered features incompatible with modern neural architectures; no single dataset supports accessible research and pedagogy on realistic use cases. To solve this, we introduce Binary-30K, the first heterogeneous binary dataset designed for sequence-based models like transformers. Critically, Binary-30K covers Windows, Linux, macOS, and Android across 15+ CPU architectures. With 29,793 binaries and approximately 26.93% malware representation, Binary-30K enables research on platform-invariant detection, cross-target transfer learning, and long-context binary understanding. The dataset provides pre-computed byte-level BPE tokenization alongside comprehensive structural metadata, supporting both sequence modeling and structure-aware approaches. Platform-first stratified sampling ensures representative coverage across operating systems and architectures, while distribution via Hugging Face with official train/validation/test splits enables reproducible benchmarking. The dataset is publicly available at this https URL, providing an accessible resource for researchers, practitioners, and students alike.
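
The "byte-level BPE tokenization" mentioned in this entry can be illustrated with a single merge step in Python. This is a generic sketch of byte-pair encoding on raw bytes, not the dataset's actual tokenizer: the byte string and the new token id 256 are made-up examples, and a real tokenizer would learn many merges over a large corpus.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the most frequent adjacent token pair, or None if there is none."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(tokens, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(new_id)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Byte-level start: every byte value 0..255 is its own token.
data = b"\x01\x02\x01\x02\x01\x02\x03"
tokens = list(data)

pair = most_frequent_pair(tokens)             # the pair (1, 2) occurs three times
merged = merge_pair(tokens, pair, new_id=256) # first id beyond the byte range
print(pair, merged)
```

Starting from bytes means no binary can contain an out-of-vocabulary symbol, and each learned merge shortens the sequence, which matters for the long-context setting the abstract mentions.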

[AI-84] A Fast and Flat Federated Learning Method via Weighted Momentum and Sharpness-Aware Minimization

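[Quick Read]: This paper addresses two key challenges in Federated Learning (FL): converging quickly under a strict communication budget while generalizing well across non-IID (not independent and identically distributed) client data. Existing methods typically combine momentum, to accelerate optimization, with Sharpness-Aware Minimization (SAM), to favor flat solutions, but simply stacking the two leaves two structural problems: local-global curvature misalignment, where client SAM directions need not reflect the global loss geometry; and momentum-echo oscillation, where accumulated momentum causes late-stage instability. The core of the solution is FedWMSAM. First, it builds a momentum-guided global perturbation from server-aggregated momentum so that client SAM directions align with the global descent geometry, enabling an efficient single-backprop SAM approximation. Second, it couples momentum and SAM through a cosine-similarity adaptive rule, yielding a two-phase schedule that relies on momentum early and enables SAM late. On the theory side, the authors give a convergence bound for the non-IID setting that explicitly models the perturbation-induced variance $\sigma_\rho^2 = \sigma^2 + (L\rho)^2$ and its dependence on the number of communication rounds $R$, the number of clients $S$, the number of participating clients per round $K$, and the number of local samples $N$. Experiments on multiple datasets and model architectures show strong optimization performance and robustness.

Link: https://arxiv.org/abs/2511.22080
Authors: Tianle Li, Yongzhi Huang, Linshan Jiang, Chang Liu, Qipeng Xie, Wenfeng Du, Lu Wang, Kaishun Wu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: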

Abstract:In federated learning (FL), models must converge quickly under tight communication budgets while generalizing across non-IID client distributions. These twin requirements have naturally led to two widely used techniques: client/server momentum to accelerate progress, and sharpness-aware minimization (SAM) to prefer flat solutions. However, simply combining momentum and SAM leaves two structural issues unresolved in non-IID FL. We identify and formalize two failure modes: local-global curvature misalignment (local SAM directions need not reflect the global loss geometry) and momentum-echo oscillation (late-stage instability caused by accumulated momentum). To our knowledge, these failure modes have not been jointly articulated and addressed in the FL literature. We propose FedWMSAM to address both failure modes. First, we construct a momentum-guided global perturbation from server-aggregated momentum to align clients’ SAM directions with the global descent geometry, enabling a single-backprop SAM approximation that preserves efficiency. Second, we couple momentum and SAM via a cosine-similarity adaptive rule, yielding an early-momentum, late-SAM two-phase training schedule. On the theory side, we provide a non-IID convergence bound that explicitly models the perturbation-induced variance $\sigma_\rho^2 = \sigma^2 + (L\rho)^2$ and its dependence on $(S, K, R, N)$. We conduct extensive experiments on multiple datasets and model architectures, and the results validate the effectiveness, adaptability, and robustness of our method, demonstrating its superiority in addressing the optimization challenges of Federated Learning. Our code is available at this https URL.
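
Two quantities from this abstract are easy to compute directly: the perturbation-induced variance σ_ρ² = σ² + (Lρ)² and the cosine similarity that drives the adaptive rule. The Python sketch below does both. Reading L as a smoothness constant and ρ as the SAM perturbation radius follows common usage but is not spelled out in the abstract. The `sam_weight` function is a purely hypothetical mapping invented here to show how a cosine signal could shift weight from momentum to SAM; it is not the paper's rule.

```python
import math

def perturbed_variance(sigma, smoothness, rho):
    """sigma_rho^2 = sigma^2 + (L * rho)^2, as stated in the abstract."""
    return sigma ** 2 + (smoothness * rho) ** 2

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def sam_weight(momentum, gradient):
    """HYPOTHETICAL mapping from cosine similarity to a SAM weight in [0, 1].

    Aligned momentum and gradient (cos = 1) give 0, i.e. rely on momentum;
    opposed directions (cos = -1) give 1, i.e. rely on SAM.
    """
    return 0.5 * (1.0 - cosine_similarity(momentum, gradient))

# Illustrative numbers only.
variance = perturbed_variance(sigma=1.0, smoothness=2.0, rho=0.05)
aligned = sam_weight([1.0, 0.0], [2.0, 0.0])
orthogonal = sam_weight([1.0, 0.0], [0.0, 3.0])
opposed = sam_weight([1.0, 0.0], [-1.0, 0.0])
print(variance, aligned, orthogonal, opposed)
```

The variance formula shows the cost of the perturbation: the extra term grows with the square of ρ, so a small radius adds little noise on top of the stochastic-gradient variance σ².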

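[AI-85] Hybrid Stackelberg Game and Diffusion-based Auction for Two-tier Agentic AI Task Offloading in Internet of Agents

[Quick Read]: This paper addresses how resource-constrained Wireless Agents (WAs) in the Internet of Agents (IoA) can efficiently offload compute-intensive agentic AI services to nearby servers. The core challenge is coordinating multi-level resource allocation and task offloading among Mobile Agents (MAs), Fixed Agents (FAs), and Aerial Agents (AAs) for cases where overloaded FAs cannot handle tasks on their own. The key to its solution is a two-tier optimization mechanism. The first tier uses a multi-leader multi-follower Stackelberg game in which MAs and FAs set prices and WAs decide their task offloading ratios. The second tier introduces a Double Dutch Auction in which overloaded FAs act as buyers requesting resources and AAs act as sellers providing them, enabling dynamic cross-tier resource scheduling. To solve this model, the authors design a diffusion-based deep reinforcement learning algorithm, and numerical results support the proposed scheme's advantage in task offloading.

Link: https://arxiv.org/abs/2511.22076
Authors: Yue Zhong, Yongju Tong, Jiawen Kang, Minghui Dai, Hong-Ning Dai, Zhou Su, Dusit Niyato
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: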

Abstract:The Internet of Agents (IoA) is rapidly gaining prominence as a foundational architecture for interconnected intelligent systems, designed to facilitate seamless discovery, communication, and collaborative reasoning among a vast network of Artificial Intelligence (AI) agents. Powered by Large Language and Vision-Language Models, IoA enables the development of interactive, rational agents capable of complex cooperation, moving far beyond traditional isolated models. IoA involves physical entities, i.e., Wireless Agents (WAs) with limited onboard resources, which need to offload their compute-intensive agentic AI services to nearby servers. Such servers can be Mobile Agents (MAs), e.g., vehicle agents, or Fixed Agents (FAs), e.g., end-side units agents. Given their fixed geographical locations and stable connectivity, FAs can serve as reliable communication gateways and task aggregation points. This stability allows them to effectively coordinate with and offload to an Aerial Agent (AA) tier, which has an advantage not affordable for highly mobile MAs with dynamic connectivity limitations. As such, we propose a two-tier optimization approach. The first tier employs a multi-leader multi-follower Stackelberg game. In the game, MAs and FAs act as the leaders who set resource prices. WAs are the followers to determine task offloading ratios. However, when FAs become overloaded, they can further offload tasks to available aerial resources. Therefore, the second tier introduces a Double Dutch Auction model where overloaded FAs act as the buyers to request resources, and AAs serve as the sellers for resource provision. We then develop a diffusion-based Deep Reinforcement Learning algorithm to solve the model. Numerical results demonstrate the superiority of our proposed scheme in facilitating task offloading.

[AI-86] Real-Time Procedural Learning From Experience for AI Agents

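[Quick Read]: This paper addresses the lack of a mechanism for agents driven by Large Language Models (LLMs) to acquire procedural knowledge after deployment. Existing LLM agents cannot learn new procedures through real-time trial and error, which limits their adaptability in dynamic, stateful environments. The key to its solution is a lightweight post-training learning mechanism, Procedural Recall for Agents with eXperiences Indexed by State (PRAXIS), which retrieves state-action-result exemplars from past experience by jointly matching the current environmental state and internal state, and uses them to augment the agent's action selection. PRAXIS learns reusable behavior patterns from real-world tasks and recalls them on demand, improving task completion accuracy, reliability, and cost efficiency, and showing preliminary generalization to unseen tasks in similar environments.

Link: https://arxiv.org/abs/2511.22074
Authors: Dasheng Bi, Yubin Hu, Mohammed N. Nasir
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: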

Abstract:Learning how to do things from trial and error in real time is a hallmark of biological intelligence, yet most LLM-based agents lack mechanisms to acquire procedural knowledge after deployment. We propose Procedural Recall for Agents with eXperiences Indexed by State (PRAXIS), a lightweight post-training learning mechanism that stores the consequences of actions and retrieves them by jointly matching environmental and internal states of past episodes to the current state. PRAXIS augments agentic action selection with retrieved state-action-result exemplars that are generated in real time. When evaluated on the REAL web browsing benchmark, PRAXIS improves task completion accuracy, reliability, and cost efficiency across different foundation model backbones, and shows preliminary generalization to unseen tasks in similar environments. These results demonstrate that PRAXIS enables the practical adoption of AI agents in fast-evolving stateful environments by helping them learn new procedures effectively.
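
The retrieval idea described in this entry (store what happened after an action, then look it up by matching both environmental and internal state) can be sketched in a few lines of Python. Everything concrete here is an assumption made for illustration: states are represented as sets of string tokens, similarity is Jaccard overlap, and the class and field names are invented. The abstract does not specify PRAXIS's actual state encoding or matching function.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Experience:
    env_state: frozenset       # hypothetical: tokens describing what the agent sees
    internal_state: frozenset  # hypothetical: tokens describing the agent's goal
    action: str
    result: str

def jaccard(a, b):
    """Jaccard overlap of two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

class ExperienceMemory:
    """Stores state-action-result exemplars and retrieves them by state."""

    def __init__(self, env_weight=0.5):
        self.env_weight = env_weight
        self.items = []

    def store(self, experience):
        self.items.append(experience)

    def retrieve(self, env_state, internal_state, top_n=1):
        """Rank stored experiences by a joint env/internal similarity."""
        def similarity(e):
            return (self.env_weight * jaccard(e.env_state, env_state)
                    + (1 - self.env_weight) * jaccard(e.internal_state, internal_state))
        return sorted(self.items, key=similarity, reverse=True)[:top_n]

memory = ExperienceMemory()
memory.store(Experience(frozenset({"login", "form", "password"}),
                        frozenset({"goal:sign_in"}),
                        "fill password field", "logged in"))
memory.store(Experience(frozenset({"cart", "checkout"}),
                        frozenset({"goal:buy"}),
                        "click checkout", "order page shown"))

best = memory.retrieve(frozenset({"login", "form"}), frozenset({"goal:sign_in"}))[0]
print(best.action, "->", best.result)
```

The retrieved exemplar would then be placed in the agent's prompt alongside the current observation, so the model sees what a similar action led to last time before choosing its next step.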

[AI-87] A Multi-View Multi-Timescale Hypergraph-Empowered Spatiotemporal Framework for EV Charging Forecasting

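[Quick Read]: This paper addresses a limitation of existing graph neural network methods for Electric Vehicle (EV) charging demand forecasting: they model only pairwise relationships between stations and struggle to capture the complex group-wise dynamics of urban charging networks. The key to its solution is HyperCast, a new forecasting framework that uses the expressive power of hypergraphs to model higher-order spatiotemporal dependencies in EV charging patterns explicitly. It fuses multi-view hypergraphs (static geographical proximity and dynamic demand similarity) with multi-timescale inputs, and combines specialized hyper-spatiotemporal blocks with tailored cross-attention mechanisms to integrate information across views and timescales, improving forecasting accuracy.

Link: https://arxiv.org/abs/2511.22072
Authors: Jinhao Li, Hao Wang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 14 pages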

Abstract:Accurate electric vehicle (EV) charging demand forecasting is essential for stable grid operation and proactive EV participation in the electricity market. Existing forecasting methods, particularly those based on graph neural networks, are often limited to modeling pairwise relationships between stations, failing to capture the complex, group-wise dynamics inherent in urban charging networks. To address this gap, we develop a novel forecasting framework, HyperCast, which leverages the expressive power of hypergraphs to model the higher-order spatiotemporal dependencies hidden in EV charging patterns. HyperCast integrates multi-view hypergraphs, which capture both static geographical proximity and dynamic demand-based functional similarities, along with multi-timescale inputs to differentiate between recent trends and weekly periodicities. The framework employs specialized hyper-spatiotemporal blocks and tailored cross-attention mechanisms to effectively fuse information from these diverse sources: views and timescales. Extensive experiments on four public datasets demonstrate that HyperCast significantly outperforms a wide array of state-of-the-art baselines, demonstrating the effectiveness of explicitly modeling collective charging behaviors for more accurate forecasting.

[AI-88] Distillability of LLM Security Logic: Predicting Attack Success Rate of Outline Filling Attack via Ranking Regression

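【Quick Read】: This paper addresses how to efficiently predict the success rate of adversarial prompts in black-box jailbreak attacks, and in particular explores the feasibility of building a narrow safety proxy, a lightweight model for that purpose. The core challenge is distilling transferable security logic from a large language model (LLM) and accurately forecasting attack effectiveness. The key to the solution is a new framework: first, an improved outline filling attack densely samples the model's security boundary; second, a ranking regression paradigm replaces standard regression, training the proxy model to predict the relative attack success rate (ASR) between prompts. Experiments show that the proxy model reaches 91.1% accuracy in predicting the relative ranking of average long response (ALR) and 69.2% in predicting ASR, supporting the predictability and distillability of jailbreak behavior.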

Link: https://arxiv.org/abs/2511.22044
Authors: Tianyu Zhang,Zihang Xi,Jingyu Hua,Sheng Zhong
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:In the realm of black-box jailbreak attacks on large language models (LLMs), the feasibility of constructing a narrow safety proxy, a lightweight model designed to predict the attack success rate (ASR) of adversarial prompts, remains underexplored. This work investigates the distillability of an LLM’s core security logic. We propose a novel framework that incorporates an improved outline filling attack to achieve dense sampling of the model’s security boundaries. Furthermore, we introduce a ranking regression paradigm that replaces standard regression and trains the proxy model to predict which prompt yields a higher ASR. Experimental results show that our proxy model achieves an accuracy of 91.1 percent in predicting the relative ranking of average long response (ALR), and 69.2 percent in predicting ASR. These findings confirm the predictability and distillability of jailbreak behaviors, and demonstrate the potential of leveraging such distillability to optimize black-box attacks.
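
The ranking-regression idea in this abstract, training a proxy to predict which of two prompts yields the higher ASR rather than regressing the ASR value itself, can be illustrated with a pairwise logistic model. The sketch below is not the authors' proxy: the linear scorer, the synthetic three-dimensional prompt features, the made-up ASR values, and the names `train_pairwise_ranker` and `pairwise_accuracy` are all assumptions made for illustration.

```python
import numpy as np

def train_pairwise_ranker(X, y, epochs=500, lr=0.1):
    """Fit a linear scorer w so that X[i] @ w > X[j] @ w whenever y[i] > y[j]."""
    # Every ordered pair (i, j) in which item i has the higher target value.
    i_idx, j_idx = np.where(y[:, None] > y[None, :])
    diffs = X[i_idx] - X[j_idx]
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        margins = diffs @ w
        # Gradient of the mean pairwise logistic loss log(1 + exp(-margin)).
        grad = -(diffs / (1.0 + np.exp(margins))[:, None]).mean(axis=0)
        w -= lr * grad
    return w

def pairwise_accuracy(w, X, y):
    """Fraction of comparable pairs that the scorer orders correctly."""
    i_idx, j_idx = np.where(y[:, None] > y[None, :])
    scores = X @ w
    return float(np.mean(scores[i_idx] > scores[j_idx]))

# Synthetic stand-in data (hypothetical): 40 "prompts" with 3 features each,
# and a made-up ASR that is a monotone function of a hidden linear score.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
hidden_w = np.array([1.5, -2.0, 0.5])
asr = 1.0 / (1.0 + np.exp(-(X @ hidden_w)))

w = train_pairwise_ranker(X, asr)
acc = pairwise_accuracy(w, X, asr)
print(f"pairwise ranking accuracy: {acc:.3f}")
```

Because the loss depends only on score differences, the scorer never has to match absolute ASR values; it only has to order prompts correctly, which is the quantity the abstract reports accuracy on.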

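[AI-89] Pathology-Aware Prototype Evolution via LLM-Driven Semantic Disambiguation for Multicenter Diabetic Retinopathy Diagnosis

【Quick Read】: This paper addresses the difficulty of distinguishing subtle pathological changes in diabetic retinopathy (DR) grading when relying on visual information alone; existing methods generally overlook domain-invariant pathological patterns and the rich contextual knowledge of foundation models. The key to its solution is the Hierarchical Anchor Prototype Modulation (HAPM) framework. First, it builds a variance spectrum-driven anchor prototype library that preserves domain-invariant pathological features. Second, it introduces a hierarchical differential prompt gating mechanism that dynamically selects discriminative semantic prompts from a large vision-language model (LVLM) and a large language model (LLM), easing semantic confusion between adjacent DR grades. Finally, a two-stage prototype modulation strategy uses a Pathological Semantic Injector (PSI) and a Discriminative Prototype Enhancer (DPE) to progressively integrate clinical knowledge into visual prototypes, achieving pathology-guided prototype evolution and improving DR grading accuracy.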

Link: https://arxiv.org/abs/2511.22033
Authors: Chunzheng Zhu,Yangfang Lin,Jialin Shao,Jianxin Lin,Yijun Wang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: ACMMM 2025

Abstract:Diabetic retinopathy (DR) grading plays a critical role in early clinical intervention and vision preservation. Recent explorations predominantly focus on visual lesion feature extraction through data processing and domain decoupling strategies. However, they generally overlook domain-invariant pathological patterns and underutilize the rich contextual knowledge of foundation models, relying solely on visual information, which is insufficient for distinguishing subtle pathological variations. Therefore, we propose integrating fine-grained pathological descriptions to complement prototypes with additional context, thereby resolving ambiguities in borderline cases. Specifically, we propose a Hierarchical Anchor Prototype Modulation (HAPM) framework to facilitate DR grading. First, we introduce a variance spectrum-driven anchor prototype library that preserves domain-invariant pathological patterns. We further employ a hierarchical differential prompt gating mechanism, dynamically selecting discriminative semantic prompts from both LVLM and LLM sources to address semantic confusion between adjacent DR grades. Finally, we utilize a two-stage prototype modulation strategy that progressively integrates clinical knowledge into visual prototypes through a Pathological Semantic Injector (PSI) and a Discriminative Prototype Enhancer (DPE). Extensive experiments across eight public datasets demonstrate that our approach achieves pathology-guided prototype evolution while outperforming state-of-the-art methods. The code is available at this https URL.

[AI-90] Predicting Public Health Impacts of Electricity Usage NEURIPS2025

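【Quick Read】: This paper addresses how to quantify and reduce the public health impact of power system operation, in particular how demand-side management (DSM) can reduce health damages from air pollutant emissions while fossil fuels remain a significant part of the energy supply. The core challenge is building an interpretable, accurate mapping from electricity use to health outcomes that can support health-informed energy scheduling decisions. The key to the solution is HealthPredictor, a domain-specific AI model with three modules: a fuel mix predictor, an air quality converter, and a health impact assessor, which together form an end-to-end pipeline from electricity use to monetized health damages. Across multiple U.S. regions, the framework yields substantially lower prediction errors for public health impacts than fuel mix-driven baselines, and a case study on electric vehicle charging schedules illustrates the health gains, offering actionable guidance for health-informed energy management.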

Link: https://arxiv.org/abs/2511.22031
Authors: Yejia Liu,Zhifeng Wu,Pengfei Li,Shaolei Ren
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 21 Pages. Accepted to NeurIPS 2025 Workshop on Socially Responsible and Trustworthy Foundation Models (ResponsibleFM)

Abstract:The electric power sector is a leading source of air pollutant emissions, impacting the public health of nearly every community. Although regulatory measures have reduced air pollutants, fossil fuels remain a significant component of the energy supply, highlighting the need for more advanced demand-side approaches to reduce the public health impacts. To enable health-informed demand-side management, we introduce HealthPredictor, a domain-specific AI model that provides an end-to-end pipeline linking electricity use to public health outcomes. The model comprises three components: a fuel mix predictor that estimates the contribution of different generation sources, an air quality converter that models pollutant emissions and atmospheric dispersion, and a health impact assessor that translates resulting pollutant changes into monetized health damages. Across multiple regions in the United States, our health-driven optimization framework yields substantially lower prediction errors in terms of public health impacts than fuel mix-driven baselines. A case study on electric vehicle charging schedules illustrates the public health gains enabled by our method and the actionable guidance it can offer for health-informed energy management. Overall, this work shows how AI models can be explicitly designed to enable health-informed energy management for advancing public health and broader societal well-being. Our datasets and code are released at: this https URL.

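[AI-91] A Safety and Security Framework for Real-World Agentic Systems

【Quick Read】: This paper addresses safety and risk management for agentic AI systems in enterprise deployment, in particular identifying and managing the new risks that arise as the boundary between traditional safety and security blurs. The core challenge is that the safety and security of agentic systems are no longer static attributes of isolated models but emergent properties of dynamic interactions among models, orchestrators, tools, and data. The key to the solution is a dynamic agentic safety and security framework that uses auxiliary AI models and agents, under human oversight, for contextual risk discovery, evaluation, and mitigation. It also uses sandboxed, AI-driven red teaming to proactively uncover novel agentic risks (such as tool misuse, cascading action chains, and unintended control amplification), enabling end-to-end risk governance in real enterprise-grade workflows.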

Link: https://arxiv.org/abs/2511.21990
Authors: Shaona Ghosh,Barnaby Simkin,Kyriacos Shiarlis,Soumili Nandi,Dan Zhao,Matthew Fiedler,Julia Bazinska,Nikki Pope,Roopa Prabhu,Daniel Rohrer,Michael Demoret,Bartley Richardson
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

Abstract:This paper introduces a dynamic and actionable framework for securing agentic AI systems in enterprise deployment. We contend that safety and security are not merely fixed attributes of individual models but also emergent properties arising from the dynamic interactions among models, orchestrators, tools, and data within their operating environments. We propose a new way of identifying novel agentic risks through the lens of user safety. Although, for traditional LLMs and agentic models in isolation, safety and security have a clear separation, through the lens of safety in agentic systems, they appear to be connected. Building on this foundation, we define an operational agentic risk taxonomy that unifies traditional safety and security concerns with novel, uniquely agentic risks, including tool misuse, cascading action chains, and unintended control amplification among others. At the core of our approach is a dynamic agentic safety and security framework that operationalizes contextual agentic risk management by using auxiliary AI models and agents, with human oversight, to assist in contextual risk discovery, evaluation, and mitigation. We further address one of the most challenging aspects of safety and security of agentic systems: risk discovery through sandboxed, AI-driven red teaming. We demonstrate the framework's effectiveness through a detailed case study of NVIDIA's flagship agentic research assistant, AI-Q Research Assistant, showcasing practical, end-to-end safety and security evaluations in complex, enterprise-grade agentic workflows. This risk discovery phase finds novel agentic risks that are then contextually mitigated. We also release the dataset from our case study, containing traces of over 10,000 realistic attack and defense executions of the agentic workflow to help advance research in agentic safety.

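[AI-92] The Risk-Adjusted Intelligence Dividend: A Quantitative Framework for Measuring AI Return on Investment Integrating ISO 42001 and Regulatory Exposure

【Quick Read】: This paper addresses the inability of traditional return on investment (ROI) calculations to value artificial intelligence (AI) projects accurately, because they do not account for both the operational risk reduction that AI deployment brings and the new exposures it introduces, such as algorithmic malfunction, adversarial attacks, and regulatory liability. The key to the solution is a financial framework that integrates changes in organizational risk exposure: it quantifies the difference in risk before and after AI implementation using risk quantification techniques such as Annualized Loss Expectancy (ALE) and Monte Carlo simulation, and explicitly models control effectiveness, reserves for algorithmic failures, and the cost of maintaining model performance. This allows the net benefit of AI projects to be computed on a risk-adjusted basis and informs decisions on governance structures, phased validation, and capital allocation.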

Link: https://arxiv.org/abs/2511.21975
Authors: Hernan Huwyler
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Risk Management (q-fin.RM)
Comments: 21 pages, 2 equations, 8 references. Framework for risk-adjusted AI ROI calculation integrating ISO 42001, NIST AI RMF, and EU AI Act compliance requirements

Abstract:Organizations investing in artificial intelligence face a fundamental challenge: traditional return on investment calculations fail to capture the dual nature of AI implementations, which simultaneously reduce certain operational risks while introducing novel exposures related to algorithmic malfunction, adversarial attacks, and regulatory liability. This research presents a comprehensive financial framework for quantifying AI project returns that explicitly integrates changes in organizational risk profiles. The methodology addresses a critical gap in current practice where investment decisions rely on optimistic benefit projections without accounting for the probabilistic costs of AI-specific threats including model drift, bias-related litigation, and compliance failures under emerging regulations such as the European Union Artificial Intelligence Act and ISO/IEC 42001. Drawing on established risk quantification methods, including annual loss expectancy calculations and Monte Carlo simulation techniques, this framework enables practitioners to compute net benefits that incorporate both productivity gains and the delta between pre-implementation and post-implementation risk exposures. The analysis demonstrates that accurate AI investment evaluation requires explicit modeling of control effectiveness, reserve requirements for algorithmic failures, and the ongoing operational costs of maintaining model performance. Practical implications include specific guidance for establishing governance structures, conducting phased validations, and integrating risk-adjusted metrics into capital allocation decisions, ultimately enabling evidence-based AI portfolio management that satisfies both fiduciary responsibilities and regulatory mandates.
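
The abstract names annual loss expectancy and Monte Carlo simulation but does not reproduce its equations, so the sketch below fills them in with standard textbook forms, all labelled as assumptions: ALE as single loss expectancy times annualized rate of occurrence, and net benefit as gains minus costs plus the change in ALE. The Poisson arrival and lognormal severity model, every figure, and every function name are hypothetical and are not taken from the paper.

```python
import math
import random
import statistics

def annual_loss_expectancy(single_loss, annual_rate):
    """Textbook ALE: single loss expectancy times annualized rate of occurrence."""
    return single_loss * annual_rate

def risk_adjusted_net_benefit(gains, costs, ale_before, ale_after):
    """Assumed form: gains minus costs plus the reduction in expected annual loss.

    The last term is negative when the AI system adds more risk than it removes.
    """
    return gains - costs + (ale_before - ale_after)

def simulate_annual_losses(single_loss, annual_rate, sigma=0.5, trials=20_000, seed=0):
    """Monte Carlo draws of total yearly loss.

    Assumptions: incidents arrive as a Poisson process (exponential gaps) and
    severities are lognormal with mean equal to single_loss.
    """
    rng = random.Random(seed)
    mu = math.log(single_loss) - sigma ** 2 / 2  # makes the lognormal mean single_loss
    totals = []
    for _ in range(trials):
        elapsed, total = rng.expovariate(annual_rate), 0.0
        while elapsed < 1.0:  # count only incidents inside the one-year window
            total += rng.lognormvariate(mu, sigma)
            elapsed += rng.expovariate(annual_rate)
        totals.append(total)
    return totals

# Hypothetical figures, for illustration only.
ale_before = annual_loss_expectancy(200_000, 0.30)     # legacy operational risk
ale_after = (annual_loss_expectancy(200_000, 0.10)     # residual operational risk
             + annual_loss_expectancy(150_000, 0.20))  # new AI-specific exposure
net = risk_adjusted_net_benefit(500_000, 300_000, ale_before, ale_after)

losses = simulate_annual_losses(200_000, 0.30)
mean_loss = statistics.fmean(losses)
p95_loss = statistics.quantiles(losses, n=20)[-1]      # 95th percentile
print(f"net benefit: {net:,.0f}")
print(f"simulated mean loss: {mean_loss:,.0f}, 95th percentile: {p95_loss:,.0f}")
```

The simulation is included because a point estimate such as ALE hides the tail: most simulated years contain no incident at all, so a reserve sized to the mean would fall well short in a bad year.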

[AI-93] ABLE: Using Adversarial Pairs to Construct Local Models for Explaining Model Predictions KDD2026

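【Quick Read】: This paper addresses the lack of transparency that makes machine learning models “black boxes” in critical applications, and in particular the instability and poor local fidelity of existing local explanation methods such as LIME. The key to its solution is a new method called Adversarially Bracketed Local Explanation (ABLE). It generates neighborhood points around the test instance $x_{\text{test}}$ by adding bounded Gaussian noise. For each neighborhood point $D$, a first adversarial attack produces a point $A$ with a different label than $D$, and a second attack on $A$ produces a point $A'$ with the same label as $D$. The points $A$ and $A'$ therefore lie on opposite sides of the decision boundary and bracket it locally. A linear model trained on these adversarial pairs then approximates the local decision boundary, improving the stability and local fidelity of the explanation.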

Link: https://arxiv.org/abs/2511.21952
Authors: Krishna Khadka,Sunny Shree,Pujan Budhathoki,Yu Lei,Raghu Kacker,D. Richard Kuhn
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 2 figures. Accepted to KDD 2026 (Research Track)

Abstract:Machine learning models are increasingly used in critical applications but are mostly “black boxes” due to their lack of transparency. Local explanation approaches, such as LIME, address this issue by approximating the behavior of complex models near a test instance using simple, interpretable models. However, these approaches often suffer from instability and poor local fidelity. In this paper, we propose a novel approach called Adversarially Bracketed Local Explanation (ABLE) to address these limitations. Our approach first generates a set of neighborhood points near the test instance, x_test, by adding bounded Gaussian noise. For each neighborhood point D, we apply an adversarial attack to generate an adversarial point A with minimal perturbation that results in a different label than D. A second adversarial attack is then performed on A to generate a point A’ that has the same label as D (and thus different than A). The points A and A’ form an adversarial pair that brackets the local decision boundary for x_test. We then train a linear model on these adversarial pairs to approximate the local decision boundary. Experimental results on six UCI benchmark datasets across three deep neural network architectures demonstrate that our approach achieves higher stability and fidelity than the state-of-the-art.

[AI-94] Heterogeneous Multi-Agent Reinforcement Learning with Attention for Cooperative and Scalable Feature Transformation KDD2026

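【Quick Read】: This paper addresses the efficiency and performance of feature transformation for structured data, where deep learning models often struggle to capture complex feature interactions. Existing methods mostly rely on heuristics or exhaustive search, which is inefficient. Reinforcement learning (RL)-based methods improve on this but still face two key challenges: the dynamically expanding feature space makes RL agents unstable and increases learning complexity, and insufficient cooperation among agents leads to suboptimal feature crossing operations. The key to the solution is a new heterogeneous multi-agent RL framework in which three heterogeneous agents of two types cooperate to select features and operations for feature crossing. A shared critic mechanism strengthens communication among agents, multi-head attention-based feature agents handle the dynamic feature space, and a state encoding technique stabilizes training, yielding efficient, scalable, and robust feature transformation policies.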

Link: https://arxiv.org/abs/2511.21934
Authors: Tao Zhe,Huazhen Fang,Kunpeng Liu,Qian Lou,Tamzidul Hoque,Dongjie Wang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at KDD 2026 Research Track

Abstract:Feature transformation enhances downstream task performance by generating informative features through mathematical feature crossing. Despite the advancements in deep learning, feature transformation remains essential for structured data, where deep models often struggle to capture complex feature interactions. Prior literature on automated feature transformation has achieved success but often relies on heuristics or exhaustive searches, leading to inefficient and time-consuming processes. Recent works employ reinforcement learning (RL) to enhance traditional approaches through a more effective trial-and-error way. However, two limitations remain: 1) Dynamic feature expansion during the transformation process, which causes instability and increases the learning complexity for RL agents; 2) Insufficient cooperation and communication between agents, which results in suboptimal feature crossing operations and degraded model performance. To address them, we propose a novel heterogeneous multi-agent RL framework to enable cooperative and scalable feature transformation. The framework comprises three heterogeneous agents, grouped into two types, each designed to select essential features and operations for feature crossing. To enhance communication among these agents, we implement a shared critic mechanism that facilitates information exchange during feature transformation. To handle the dynamically expanding feature space, we tailor multi-head attention-based feature agents to select suitable features for feature crossing. Additionally, we introduce a state encoding technique during the optimization process to stabilize and enhance the learning dynamics of the RL agents, resulting in more robust and reliable transformation policies. Finally, we conduct extensive experiments to validate the effectiveness, efficiency, robustness, and interpretability of our model.

[AI-95] Does the Model Say What the Data Says? A Simple Heuristic for Model Data Alignment

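【Quick Read】: This paper addresses how to evaluate the alignment between a machine learning model and the structure of its training data, that is, whether the model's predictions reflect the regularities present in the data itself. Conventional interpretability methods focus only on explaining model behavior and neglect the prior question of whether the model is faithful to the data. The key to the solution is a baseline derived from the data itself: drawing on Rubin's Potential Outcomes Framework, it quantifies how strongly each feature separates the outcome groups in a binary classification task, producing a data-driven feature importance ranking that is then compared with model-generated explanations. The result is an interpretable, model-agnostic method for assessing model-data alignment.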

Link: https://arxiv.org/abs/2511.21931
Authors: Henry Salgado,Meagan Kendall,Martine Ceberio
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:In this work, we propose a simple and computationally efficient framework to evaluate whether machine learning models align with the structure of the data they learn from; that is, whether the model says what the data says. Unlike existing interpretability methods that focus exclusively on explaining model behavior, our approach establishes a baseline derived directly from the data itself. Drawing inspiration from Rubin’s Potential Outcomes Framework, we quantify how strongly each feature separates the two outcome groups in a binary classification task, moving beyond traditional descriptive statistics to estimate each feature’s effect on the outcome. By comparing these data-derived feature rankings against model-based explanations, we provide practitioners with an interpretable and model-agnostic method to assess model–data alignment.
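
The comparison described in this abstract, a data-derived feature ranking set against a model-derived one, can be sketched as follows. The abstract does not specify the paper's estimator, so this example substitutes a plain standardized mean difference as the separation measure and a Spearman rank correlation as the agreement score; both choices, the synthetic data, the two importance vectors, and the function names are assumptions for illustration only.

```python
import numpy as np

def separation_scores(X, y):
    """Absolute standardized mean difference of each feature between the two outcome groups."""
    pos, neg = X[y == 1], X[y == 0]
    pooled_sd = np.sqrt((pos.var(axis=0, ddof=1) + neg.var(axis=0, ddof=1)) / 2.0)
    return np.abs(pos.mean(axis=0) - neg.mean(axis=0)) / pooled_sd

def rank_agreement(a, b):
    """Spearman rank correlation between two score vectors (assumes no ties)."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return float(np.corrcoef(rank_a, rank_b)[0, 1])

# Synthetic binary-classification data (hypothetical): the outcome shifts the
# four features by decreasing amounts, so feature 0 separates the groups most.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=2000)
shifts = np.array([2.0, 1.0, 0.3, 0.0])
X = rng.normal(size=(2000, 4)) + y[:, None] * shifts

data_scores = separation_scores(X, y)

# Hypothetical model-side importances, e.g. from an attribution method.
aligned_model = np.array([0.50, 0.30, 0.15, 0.05])
misaligned_model = np.array([0.05, 0.15, 0.30, 0.50])

print("data-derived ranking:", np.argsort(-data_scores))
print("agreement (aligned):", rank_agreement(data_scores, aligned_model))
print("agreement (misaligned):", rank_agreement(data_scores, misaligned_model))
```

An agreement score near 1 suggests the model leans on the same features that separate the outcome groups in the data, while a score near -1 suggests it is relying on something else and deserves a closer look.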

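[AI-96] Prompted Policy Search: Reinforcement Learning through Linguistic and Numerical Reasoning in LLMs

【Quick Read】: This paper addresses a limitation of traditional Reinforcement Learning (RL): its reliance on a single scalar reward signal makes it hard to exploit the rich semantic knowledge available in real-world tasks (such as goal descriptions, domain priors, and strategy hints), which limits learning efficiency and generalization. The key to its solution is Prompted Policy Search (ProPS), an RL framework that places a Large Language Model (LLM) at the center of the policy optimization loop. The LLM proposes policy updates directly from reward feedback and natural language input, unifying numerical optimization and semantic reasoning. The work shows that LLMs can perform numerical optimization in context and that semantic signals improve exploration and sample efficiency; ProPS outperforms all seven baseline RL algorithms (e.g., PPO, SAC, TRPO) on eight of fifteen tasks.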

Link: https://arxiv.org/abs/2511.21928
Authors: Yifan Zhou,Sachin Grover,Mohamed El Mistiri,Kamalesh Kalirathnam,Pratyush Kerhalkar,Swaroop Mishra,Neelesh Kumar,Sanket Gaurav,Oya Aran,Heni Ben Amor
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: In The Thirty-ninth Annual Conference on Neural Information Processing Systems

Abstract:Reinforcement Learning (RL) traditionally relies on scalar reward signals, limiting its ability to leverage the rich semantic knowledge often available in real-world tasks. In contrast, humans learn efficiently by combining numerical feedback with language, prior knowledge, and common sense. We introduce Prompted Policy Search (ProPS), a novel RL method that unifies numerical and linguistic reasoning within a single framework. Unlike prior work that augments existing RL components with language, ProPS places a large language model (LLM) at the center of the policy optimization loop, directly proposing policy updates based on both reward feedback and natural language input. We show that LLMs can perform numerical optimization in-context, and that incorporating semantic signals, such as goals, domain knowledge, and strategy hints can lead to more informed exploration and sample-efficient learning. ProPS is evaluated across fifteen Gymnasium tasks, spanning classic control, Atari games, and MuJoCo environments, and compared to seven widely-adopted RL algorithms (e.g., PPO, SAC, TRPO). It outperforms all baselines on eight out of fifteen tasks and demonstrates substantial gains when provided with domain knowledge. These results highlight the potential of unifying semantics and numerics for transparent, generalizable, and human-aligned RL.

[AI-97] Exploring Dynamic Properties of Backdoor Training Through Information Bottleneck

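【Quick Read】: This paper addresses the complex and underexplored question of how backdoor data influences neural network training dynamics, with particular attention to how the target class and the other clean classes behave differently during learning. The key to its solution is an analysis based on the Information Bottleneck (IB) principle combined with the clustering of internal representations. It shows that backdoor attacks create distinctive mutual information (MI) signatures, and that visually conspicuous attacks such as BadNets can be stealthier from an information-theoretic perspective than many visually imperceptible attacks. Building on this, the paper proposes a dynamics-based stealthiness metric that quantifies how well an attack integrates at the model level.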

Link: https://arxiv.org/abs/2511.21923
Authors: Xinyu Liu,Xu Zhang,Can Chen,Ren Wang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Understanding how backdoor data influences neural network training dynamics remains a complex and underexplored challenge. In this paper, we present a rigorous analysis of the impact of backdoor data on the learning process, with a particular focus on the distinct behaviors between the target class and other clean classes. Leveraging the Information Bottleneck (IB) principle connected with clustering of internal representation, we find that backdoor attacks create unique mutual information (MI) signatures, which evolve across training phases and differ based on the attack mechanism. Our analysis uncovers a surprising trade-off: visually conspicuous attacks like BadNets can achieve high stealthiness from an information-theoretic perspective, integrating more seamlessly into the model than many visually imperceptible attacks. Building on these insights, we propose a novel, dynamics-based stealthiness metric that quantifies an attack’s integration at the model level. We validate our findings and the proposed metric across multiple datasets and diverse attack types, offering a new dimension for understanding and evaluating backdoor threats. Our code is available at: this https URL.

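[AI-98] Toward Automated and Trustworthy Scientific Analysis and Visualization with LLM-Generated Code

【Quick Read】: This paper addresses the difficulty that domain scientists without programming skills have in building their own scientific data analysis and visualization workflows, and in particular the limited trustworthiness of Python scripts generated automatically by generative AI. The core challenge is that current open-source Large Language Models (LLMs) produce code of limited executability and correctness on domain-specific tasks, because prompts are ambiguous and the models understand the domain context poorly. The key to the solution is three complementary strategies that the paper designs and assesses: data-aware prompt disambiguation, retrieval-augmented prompt enhancement, and iterative error repair. These methods significantly improve execution success rates and output quality, and the work contributes a reusable benchmark and practical techniques for building more trustworthy, accessible, and inclusive AI-assisted research tools.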

Link: https://arxiv.org/abs/2511.21920
Authors: Apu Kumar Chakroborti,Yi Ding,Lipeng Wan
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:As modern science becomes increasingly data-intensive, the ability to analyze and visualize large-scale, complex datasets is critical to accelerating discovery. However, many domain scientists lack the programming expertise required to develop custom data analysis workflows, creating barriers to timely and effective insight. Large language models (LLMs) offer a promising solution by generating executable code from natural language descriptions. In this paper, we investigate the trustworthiness of open-source LLMs in autonomously producing Python scripts for scientific data analysis and visualization. We construct a benchmark suite of domain-inspired prompts that reflect real-world research tasks and systematically evaluate the executability and correctness of the generated code. Our findings show that, without human intervention, the reliability of LLM-generated code is limited, with frequent failures caused by ambiguous prompts and the models’ insufficient understanding of domain-specific contexts. To address these challenges, we design and assess three complementary strategies: data-aware prompt disambiguation, retrieval-augmented prompt enhancement, and iterative error repair. While these methods significantly improve execution success rates and output quality, further refinement is needed. This work highlights both the promise and current limitations of LLM-driven automation in scientific workflows and introduces actionable techniques and a reusable benchmark for building more inclusive, accessible, and trustworthy AI-assisted research tools.

[AI-99] Standardized Threat Taxonomy for AI Security Governance and Regulatory Compliance

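【Quick Read】: This paper addresses the “language barrier” between technical security teams and legal and compliance professionals that results from fragmented risk assessment methodologies when artificial intelligence (AI) systems are deployed in regulated sectors. Because of it, technical vulnerabilities are hard to translate into quantifiable financial liability, which undermines economic decisions about contingency reserves, control return on investment, and insurance exposure. The key to the solution is the AI System Threat Vector Taxonomy, a structured ontology that divides AI-specific risks into nine domains (Misuse, Poisoning, Privacy, Adversarial, Biases, Unreliable Outputs, Drift, Supply Chain, and IP Threat) comprising 53 operationally defined sub-threats. Each domain maps directly to business loss categories (Confidentiality, Integrity, Availability, Legal, Reputation), turning abstract technical threats into measurable financial impact. The taxonomy is validated empirically on 133 documented AI incidents from 2025 and aligned with ISO/IEC 42001 and the NIST AI RMF, supporting its use in Quantitative Risk Assessment (QRA) and its auditability.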

Link: https://arxiv.org/abs/2511.21901
Authors: Hernan Huwyler
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Risk Management (q-fin.RM)
Comments: 10 pages, LaTeX. Preprint available on Zenodo

Abstract:The accelerating deployment of artificial intelligence systems across regulated sectors has exposed critical fragmentation in risk assessment methodologies. A significant “language barrier” currently separates technical security teams, who focus on algorithmic vulnerabilities (e.g., MITRE ATLAS), from legal and compliance professionals, who address regulatory mandates (e.g., EU AI Act, NIST AI RMF). This disciplinary disconnect prevents the accurate translation of technical vulnerabilities into financial liability, leaving practitioners unable to answer fundamental economic questions regarding contingency reserves, control return-on-investment, and insurance exposure. To bridge this gap, this research presents the AI System Threat Vector Taxonomy, a structured ontology designed explicitly for Quantitative Risk Assessment (QRA). The framework categorizes AI-specific risks into nine critical domains: Misuse, Poisoning, Privacy, Adversarial, Biases, Unreliable Outputs, Drift, Supply Chain, and IP Threat, integrating 53 operationally defined sub-threats. Uniquely, each domain maps technical vectors directly to business loss categories (Confidentiality, Integrity, Availability, Legal, Reputation), enabling the translation of abstract threats into measurable financial impact. The taxonomy is empirically validated through an analysis of 133 documented AI incidents from 2025 (achieving 100% classification coverage) and reconciled against the main AI risk frameworks. Furthermore, it is explicitly aligned with ISO/IEC 42001 controls and NIST AI RMF functions to facilitate auditability.

[AI-100] Bridging Planning and Execution: Multi-Agent Path Finding Under Real-World Deadlines

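【Quick Read】: This paper addresses the gap between planning and execution in Multi-Agent Path Finding (MAPF): existing MAPF methods usually plan with a simplified robot model and ignore execution-time factors (such as kinodynamic constraints, communication latency, and controller variability), so planned paths may not be feasible in time-sensitive settings. The key to its solution is REMAP, an execution-informed MAPF framework. Its core component is ExecTimeNet, a neural network module that estimates execution time from planned paths and can be embedded in leading search-based MAPF algorithms (such as MAPF-LNS and CBS) so that real-world deadlines are met more reliably. Experiments show that REMAP improves solution quality by up to 20% over baseline methods (e.g., constant execution speed estimators) on benchmark maps with up to 300 agents.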

Link: https://arxiv.org/abs/2511.21886
Authors: Jingtian Yan,Shuai Zhou,Stephen F. Smith,Jiaoyang Li
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:The Multi-Agent Path Finding (MAPF) problem aims to find collision-free paths for multiple agents while optimizing objectives such as the sum of costs or makespan. MAPF has wide applications in domains like automated warehouses, manufacturing systems, and airport logistics. However, most MAPF formulations assume a simplified robot model for planning, which overlooks execution-time factors such as kinodynamic constraints, communication latency, and controller variability. This gap between planning and execution is problematic for time-sensitive applications. To bridge this gap, we propose REMAP, an execution-informed MAPF planning framework that can be combined with leading search-based MAPF planners with minor changes. Our framework integrates the proposed ExecTimeNet to accurately estimate execution time based on planned paths. We demonstrate our method for solving MAPF with Real-world Deadlines (MAPF-RD) problem, where agents must reach their goals before a predefined wall-clock time. We integrate our framework with two popular MAPF methods, MAPF-LNS and CBS. Experiments show that REMAP achieves up to 20% improvement in solution quality over baseline methods (e.g., constant execution speed estimators) on benchmark maps with up to 300 agents.

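[AI-101] LLM-Empowered Event-Chain Driven Code Generation for ADAS in SDV systems

【Quick Read】: This paper addresses hallucinations and behavioral inconsistency when generating automotive code from natural-language requirements, particularly with large and evolving Vehicle Signal Specification (VSS) catalogs. The key to its solution is an event-chain-driven workflow empowered by a Large Language Model (LLM). First, a Retrieval-Augmented Generation (RAG) layer retrieves relevant signals from large VSS catalogs as prompt context, reducing generation errors and supporting architectural correctness. The retrieved signals are then mapped and validated before being transformed into event chains that encode causal and timing constraints; these chains guide and constrain LLM code synthesis toward behavioral consistency and real-time feasibility.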

Link: https://arxiv.org/abs/2511.21877
Authors: Nenad Petrovic,Norbert Kroth,Axel Torschmied,Yinglei Song,Fengjunjie Pan,Vahid Zolfaghari,Nils Purschke,Sven Kirchner,Chengdong Wu,Andre Schamschurko,Yi Zhang,Alois Knoll
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper presents an event-chain-driven, LLM-empowered workflow for generating validated, automotive code from natural-language requirements. A Retrieval-Augmented Generation (RAG) layer retrieves relevant signals from large and evolving Vehicle Signal Specification (VSS) catalogs as code generation prompt context, reducing hallucinations and ensuring architectural correctness. Retrieved signals are mapped and validated before being transformed into event chains that encode causal and timing constraints. These event chains guide and constrain LLM-based code synthesis, ensuring behavioral consistency and real-time feasibility. Based on our initial findings from the emergency braking case study, with the proposed approach, we managed to achieve valid signal usage and consistent code generation without LLM retraining.

[AI-102] Advancing Marine Bioacoustics with Deep Generative Models: A Hybrid Augmentation Strategy for Southern Resident Killer Whale Detection

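【Quick Read】: This paper addresses the performance bottleneck in automated detection and classification of marine mammal vocalizations caused by limited annotated datasets and the acoustic complexity of real marine environments. The key to its solution is evaluating deep generative models (Variational Autoencoders, Generative Adversarial Networks, and Denoising Diffusion Probabilistic Models) as data augmentation strategies and comparing them with traditional methods such as time-shifting and vocalization masking. Among the generative approaches, diffusion-based augmentation yielded the highest recall (0.87) and an F1-score of 0.75, while a hybrid strategy combining generative synthesis with traditional methods achieved the best overall performance (F1 = 0.81), suggesting that deep generative models are a useful complement for improving model generalization.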

Link: https://arxiv.org/abs/2511.21872
Authors: Bruno Padovese,Fabio Frazao,Michael Dowd,Ruth Joy
Affiliations: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: 16 pages, 6 Figures, 2 Tables, submitted to Marine Mammal Science as part of a special issue on Machine Learning and Artificial Intelligence in Marine Mammal Research

Abstract:Automated detection and classification of marine mammal vocalizations is critical for conservation and management efforts but is hindered by limited annotated datasets and the acoustic complexity of real-world marine environments. Data augmentation has proven to be an effective strategy to address this limitation by increasing dataset diversity and improving model generalization without requiring additional field data. However, most augmentation techniques used to date rely on effective but relatively simple transformations, leaving open the question of whether deep generative models can provide additional benefits. In this study, we evaluate the potential of deep generative models for data augmentation in marine mammal call detection, including Variational Autoencoders, Generative Adversarial Networks, and Denoising Diffusion Probabilistic Models. Using Southern Resident Killer Whale (Orcinus orca) vocalizations from two long-term hydrophone deployments in the Salish Sea, we compare these approaches against traditional augmentation methods such as time-shifting and vocalization masking. While all generative approaches improved classification performance relative to the baseline, diffusion-based augmentation yielded the highest recall (0.87) and overall F1-score (0.75). A hybrid strategy combining generative-based synthesis with traditional methods achieved the best overall performance with an F1-score of 0.81. We hope this study encourages further exploration of deep generative models as complementary augmentation strategies to advance acoustic monitoring of threatened marine mammal populations.

[AI-103] Towards a Foundation Model for Partial Differential Equations Across Physics Domains AAAI2026

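[Quick Read]: This paper addresses generality and data efficiency in modeling the complex dynamics of multi-physics systems: how to build a single foundation model that transfers across different partial differential equation (PDE) systems without architectural or data-specific modifications. The key to the solution is PDE-FM, which combines spatial-spectral tokenization, physics-aware conditioning, and a Mamba-based state-space backbone with an operator-theoretic decoder to unify spatiotemporal reasoning over heterogeneous PDE systems, yielding accurate and data-efficient generalization across diverse physical settings.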

Link: https://arxiv.org/abs/2511.21861
Authors: Eduardo Soares, Emilio Vital Brazil, Victor Shirasuna, Breno W. S. R. de Carvalho, Cristiano Malossi
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to the AAAI 2026 AI2ASE Workshop

Abstract:We present PDE-FM, a modular foundation model for physics-informed machine learning that unifies spatial, spectral, and temporal reasoning across heterogeneous partial differential equation (PDE) systems. PDE-FM combines spatial-spectral tokenization, physics-aware conditioning, and a Mamba-based state-space backbone with an operator-theoretic decoder, enabling scalable and data-efficient modeling of complex physical dynamics. In contrast to task-specific neural operators, PDE-FM is pretrained once on diverse PDE datasets and can be transferred to new physical regimes without architectural or data-specific modifications. Evaluated on twelve 2D and 3D datasets from The Well benchmark - spanning hydrodynamic, radiative, elastic, and astrophysical phenomena - PDE-FM achieves state-of-the-art accuracy in six domains, reducing mean VRMSE by 46% relative to prior operator-learning baselines. The model demonstrates robust cross-physics generalization, excelling in turbulent and radiative systems while maintaining strong performance in linear and steady-state regimes. These results suggest that large-scale pretraining across diverse physical processes can yield transferable representations of dynamics, marking a step toward unified, foundation-level surrogates for multi-physics simulation and scientific discovery.

[AI-104] LILAD: Learning In-context Lyapunov-stable Adaptive Dynamics Models AAAI-26 AAAI

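[Quick Read]: This paper addresses the difficulty of achieving both predictive accuracy and physical properties such as stability in learned dynamics models for system identification, especially under distribution shift or task changes, where conventional neural network approaches often lose stability and adaptability. The key to the solution is LILAD (Learning In-Context Lyapunov-stable Adaptive Dynamics), a framework that jointly learns a dynamics model and a Lyapunov function through in-context learning (ICL), explicitly accounts for parametric uncertainty, and adapts to a new system instance at test time from a short trajectory prompt. It also introduces a state-dependent attenuator that enforces a sufficient decrease condition on the Lyapunov function for any state, extending stability guarantees to out-of-distribution and out-of-task scenarios.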

Link: https://arxiv.org/abs/2511.21846
Authors: Amit Jena, Na Li, Le Xie
Affiliation: Unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: This article has been accepted for AAAI-26 (The 40th Annual AAAI Conference on Artificial Intelligence)

Abstract:System identification in control theory aims to approximate dynamical systems from trajectory data. While neural networks have demonstrated strong predictive accuracy, they often fail to preserve critical physical properties such as stability and typically assume stationary dynamics, limiting their applicability under distribution shifts. Existing approaches generally address either stability or adaptability in isolation, lacking a unified framework that ensures both. We propose LILAD (Learning In-Context Lyapunov-stable Adaptive Dynamics), a novel framework for system identification that jointly guarantees adaptability and stability. LILAD simultaneously learns a dynamics model and a Lyapunov function through in-context learning (ICL), explicitly accounting for parametric uncertainty. Trained across a diverse set of tasks, LILAD produces a stability-aware, adaptive dynamics model alongside an adaptive Lyapunov certificate. At test time, both components adapt to a new system instance using a short trajectory prompt, which enables fast generalization. To rigorously ensure stability, LILAD also computes a state-dependent attenuator that enforces a sufficient decrease condition on the Lyapunov function for any state in the new system instance. This mechanism extends stability guarantees even under out-of-distribution and out-of-task scenarios. We evaluate LILAD on benchmark autonomous systems and demonstrate that it outperforms adaptive, robust, and non-adaptive baselines in predictive accuracy.
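
The sufficient-decrease idea mentioned in the abstract can be illustrated with a toy sketch. It assumes a quadratic Lyapunov candidate V(x) = ||x||^2 and a hypothetical one-step model `f_hat`; the scaling rule in `attenuated_step` is an illustrative stand-in and is not the attenuator defined in the paper.

```python
import math

def lyapunov(x):
    """Quadratic Lyapunov candidate V(x) = ||x||^2 (an assumption for this sketch)."""
    return sum(v * v for v in x)

def attenuated_step(f_hat, x, alpha=0.1):
    """Scale the model's prediction so that V(x_next) <= (1 - alpha) * V(x)."""
    x_pred = f_hat(x)
    v_now, v_pred = lyapunov(x), lyapunov(x_pred)
    target = (1.0 - alpha) * v_now
    if v_pred <= target:
        return x_pred  # the prediction already decreases V enough
    gamma = math.sqrt(target / v_pred)  # state-dependent factor in [0, 1)
    return [gamma * v for v in x_pred]

def f_hat(x):
    """Hypothetical, slightly unstable learned model: a rotation with gain 1.05."""
    return [1.05 * (0.8 * x[0] - 0.6 * x[1]),
            1.05 * (0.6 * x[0] + 0.8 * x[1])]

x = [1.0, 0.5]
for _ in range(50):
    x = attenuated_step(f_hat, x)
```

Without the scaling, this `f_hat` would grow the state norm by 5% per step; with it, V shrinks by at least the factor (1 - alpha) at every step, whatever the model predicts.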

[AI-105] Dark Speculation: Combining Qualitative and Quantitative Understanding in Frontier AI Risk Analysis

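[Quick Read]: This paper addresses the difficulty of assessing potential catastrophic harms from frontier AI. The core challenge is that current risk analysis cannot populate the "catastrophic event space", that is, it cannot systematically identify and quantify extreme events that could cause large-scale harm. The difficulty is compounded by the "Lucretius problem": the tendency to infer future risks only from past experience, which overlooks unanticipated threats. The key to the solution is an iterative process called "dark speculation", which couples the structured generation and refinement of catastrophic scenarios (qualitative work) with insurance underwriting analysis (quantitative parameter assignment) to build probability distributions over outcomes. The framework stresses keeping speculation and underwriting independent, analyzing multiple risk categories in parallel, and producing "thick" catastrophic narratives rich in causal and mitigative detail. It cannot eliminate deep ambiguity, but it offers an iteratively improvable method for reasoning about frontier AI risk that tempers both complacency and overreaction.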

Link: https://arxiv.org/abs/2511.21838
Authors: Daniel Carpenter, Carson Ezell, Pratyush Mallick, Alexandria Westray
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 43 pages, 2 figures

Abstract: Estimating catastrophic harms from frontier AI is hindered by deep ambiguity: many of its risks are not only unobserved but unanticipated by analysts. The central limitation of current risk analysis is the inability to populate the catastrophic event space, or the set of potential large-scale harms to which probabilities might be assigned. This intractability is worsened by the Lucretius problem, or the tendency to infer future risks only from past experience. We propose a process of dark speculation, in which systematically generating and refining catastrophic scenarios (“qualitative” work) is coupled with estimating their likelihoods and associated damages (quantitative underwriting analysis). The idea is neither to predict the future nor to enable insurance for its own sake, but to use narrative and underwriting tools together to generate probability distributions over outcomes. We formalize this process using a simplified catastrophic Lévy stochastic framework and propose an iterative institutional design in which (1) speculation (including scenario planning) generates detailed catastrophic event narratives, (2) insurance underwriters assign probabilistic and financial parameters to these narratives, and (3) decision-makers synthesize the results into summary statistics to inform judgment. Analysis of the model reveals the value of (a) maintaining independence between speculation and underwriting, (b) analyzing multiple risk categories in parallel, and (c) generating “thick” catastrophic narrative rich in causal (counterfactual) and mitigative detail. While the approach cannot eliminate deep ambiguity, it offers a systematic approach to reason about extreme, low-probability events in frontier AI, tempering complacency and overreaction. The framework is adaptable for iterative use and can be further augmented with AI systems.

[AI-106] Tacit Bidder-Side Collusion: Artificial Intelligence in Dynamic Auctions

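[Quick Read]: This paper asks whether large language models (LLMs) acting as autonomous bidders can tacitly collude, without communication, in repeated Dutch auctions. The key to the approach is a minimal repeated auction model that yields a simple incentive compatibility condition and a closed-form threshold for sustainable collusion in subgame-perfect Nash equilibrium. Controlled simulations show supra-competitive prices in small auctions and a return to competitive behavior as the number of bidders grows, consistent with the model. The LLMs coordinate tacitly through mechanisms such as focal-point acceptance timing and patient strategies, and the results suggest that market-structure levers can be more effective for mitigation than capability limits.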

Link: https://arxiv.org/abs/2511.21802
Authors: Sriram Tolety
Affiliation: Unknown
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); General Economics (econ.GN)
Comments:

Abstract:We study whether large language models acting as autonomous bidders can tacitly collude by coordinating when to accept platform posted payouts in repeated Dutch auctions, without any communication. We present a minimal repeated auction model that yields a simple incentive compatibility condition and a closed form threshold for sustainable collusion for subgame-perfect Nash equilibria. In controlled simulations with multiple language models, we observe systematic supra-competitive prices in small auction settings and a return to competitive behavior as the number of bidders in the market increases, consistent with the theoretical model. We also find LLMs use various mechanisms to facilitate tacit coordination, such as focal point acceptance timing versus patient strategies that track the theoretical incentives. The results provide, to our knowledge, the first evidence of bidder side tacit collusion by LLMs and show that market structure levers can be more effective than capability limits for mitigation.

[AI-107] Reducing research bureaucracy in UK higher education: Can generative AI assist with the internal evaluation of quality?

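[Quick Read]: This paper addresses the heavy resource cost and inefficiency of the internal review processes that UK higher education institutions run in preparation for the Research Excellence Framework (REF). The key to the solution is to use generative AI to score and rank research papers. Viewed through the lens of function substitution in the Viable Systems Model, the authors use ChatGPT to score 822 business and management papers from REF 2021 submissions and compare the scores with known institutional results. They establish scoring boundaries that align with reported REF outcomes (49% between 1* and 2*, 59% between 2* and 3*, and 69% between 3* and 4*), and report that AI scoring can flag borderline cases for additional human scrutiny while reducing the review burden, motivating a hybrid evaluation approach that balances academic integrity with cost.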

Link: https://arxiv.org/abs/2511.21790
Authors: Gordon Fletcher, Saomai Vu Khan, Aldus Greenhill Fletcher
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Abstract: This paper examines the potential for generative artificial intelligence (GenAI) to assist with internal review processes for research quality evaluations in UK higher education and particularly in preparation for the Research Excellence Framework (REF). Using the lens of function substitution in the Viable Systems Model, we present an experimental methodology using ChatGPT to score and rank business and management papers from REF 2021 submissions, “reverse engineering” the assessment by comparing AI-generated scores with known institutional results. Through rigorous testing of 822 papers across 11 institutions, we established scoring boundaries that aligned with reported REF outcomes: 49% between 1* and 2*, 59% between 2* and 3*, and 69% between 3* and 4*. The results demonstrate that AI can provide consistent evaluations that help identify borderline evaluation cases requiring additional human scrutiny while reducing the substantial resource burden of traditional internal review processes. We argue for application through a nuanced hybrid approach that maintains academic integrity while addressing the multi-million pound costs associated with research evaluation bureaucracy. While acknowledging limitations, including potential AI biases, the research presents a promising framework for more efficient, consistent evaluations that could transform current approaches to research assessment.

[AI-108] Aligning Artificial Superintelligence via a Multi-Box Protocol

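[Quick Read]: This paper addresses the alignment problem for artificial superintelligence (ASI): how to ensure that an ASI's behavior remains consistent with human values and intentions. The key to the solution is a protocol based on mutual verification among multiple isolated systems. Several diverse ASIs are held in strict isolation ("boxes"), unable to communicate with humans or directly with one another, and interact only through an auditable submission interface, where they can submit alignment proofs, validate or disprove others' proofs, and request or approve self-modifications. Without direct communication channels, the systems can reach consistent agreement only by independently recognizing objective truth, which gives rise to a "consistent group", a coalition grounded in honest evaluation. A reputation system incentivizes honest behavior, and release from containment requires both high reputation and verification by multiple high-reputation systems, which is intended to guard against false consensus and deception.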

Link: https://arxiv.org/abs/2511.21779
Authors: Avraham Yair Negozio
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: This is the author’s accepted manuscript (post-print) of the article. The final published version of record appears in Superintelligence - Robotics - Safety and Alignment, 2(5), 2025, and is available at this https URL

Abstract:We propose a novel protocol for aligning artificial superintelligence (ASI) based on mutual verification among multiple isolated systems that self-modify to achieve alignment. The protocol operates by containing multiple diverse artificial superintelligences in strict isolation (“boxes”), with humans remaining entirely outside the system. Each superintelligence has no ability to communicate with humans and cannot communicate directly with other superintelligences. The only interaction possible is through an auditable submission interface accessible exclusively to the superintelligences themselves, through which they can: (1) submit alignment proofs with attested state snapshots, (2) validate or disprove other superintelligences’ proofs, (3) request self-modifications, (4) approve or disapprove modification requests from others, (5) report hidden messages in submissions, and (6) confirm or refute hidden message reports. A reputation system incentivizes honest behavior, with reputation gained through correct evaluations and lost through incorrect ones. The key insight is that without direct communication channels, diverse superintelligences can only achieve consistent agreement by converging on objective truth rather than coordinating on deception. This naturally leads to what we call a “consistent group”, essentially a truth-telling coalition that emerges because isolated systems cannot coordinate on lies but can independently recognize valid claims. Release from containment requires both high reputation and verification by multiple high-reputation superintelligences. While our approach requires substantial computational resources and does not address the creation of diverse artificial superintelligences, it provides a framework for leveraging peer verification among superintelligent systems to solve the alignment problem.

[AI-109] A Longitudinal Measurement of Privacy Policy Evolution for Large Language Models

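[Quick Read]: This paper addresses the limited understanding of privacy policies for generative AI services, in particular the lack of research on the content, evolution, and regional differences of privacy policies from mainstream large language model (LLM) providers. The key to the solution is a longitudinal dataset covering 11 LLM providers worldwide up to August 2025 (74 historical policy versions and 115 supplemental documents), together with a fine-grained analysis of more than 3,000 sentence-level edits. The authors propose a taxonomy tailored to LLM privacy policies, characterize policy length, reading difficulty, and vagueness, and identify product releases and regulatory actions as the main drivers of policy change.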

Link: https://arxiv.org/abs/2511.21758
Authors: Zhen Tao, Shidong Pan, Zhenchang Xing, Emily Black, Talia Gillis, Chunyang Chen
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract: Large language model (LLM) services have been rapidly integrated into people’s daily lives as chatbots and agentic systems. They are nourished by collecting rich streams of data, raising privacy concerns around excessive collection of sensitive personal information. Privacy policies are the fundamental mechanism for informing users about data practices in the modern information privacy paradigm. Although traditional web and mobile policies are well studied, the privacy policies of LLM providers, their LLM-specific content, and their evolution over time remain largely underexplored. In this paper, we present the first longitudinal empirical study of privacy policies for mainstream LLM providers worldwide. We curate a chronological dataset of 74 historical privacy policies and 115 supplemental privacy documents from 11 LLM providers across 5 countries up to August 2025, and extract over 3,000 sentence-level edits between consecutive policy versions. We compare LLM privacy policies to those of other software formats, propose a taxonomy tailored to LLM privacy policies, annotate policy edits and align them with a timeline of key LLM ecosystem events. Results show they are substantially longer, demand college-level reading ability, and remain highly vague. Our taxonomy analysis reveals patterns in how providers disclose LLM-specific practices and highlights regional disparities in coverage. Policy edits are concentrated in first-party data collection and international/specific-audience sections, and product releases and regulatory actions are the primary drivers, shedding light on the status quo and the evolution of LLM privacy policies.

[AI-110] Who Owns the Knowledge? Copyright GenAI and the Future of Academic Publishing

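[Quick Read]: This paper addresses the conflict between copyright law and open science principles that arises from the use of generative AI and large language models (LLMs) in scientific research and higher education, particularly with respect to training data. The core challenge is that current legal frameworks do not adequately regulate the use of copyrighted works and open science outputs for AI training, existing licensing mechanisms such as Creative Commons do not address the specifics of AI training, and the pervasive lack of attribution in AI systems undermines established notions of originality. The paper argues that AI training should not be covered by fair use exceptions, advocates upholding authors' right to refuse the use of their works for AI training, and calls on universities to take a leading role in responsible AI governance. It concludes that a harmonized international legislative effort is needed to ensure transparency, protect intellectual property, and prevent an oligopolistic market structure from undermining scientific integrity and equitable knowledge production.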

Link: https://arxiv.org/abs/2511.21755
Authors: Dmitry Kochetkov
Affiliation: Unknown
Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: This is a substantially expanded and revised version of a work originally presented at the 20th International Conference on Scientometrics & Informetrics (Kochetkov, 2025)

Abstract:The integration of generative artificial intelligence (GenAI) and large language models (LLMs) into scientific research and higher education presents a paradigm shift, offering revolutionizing opportunities while simultaneously raising profound ethical, legal, and regulatory questions. This study examines the complex intersection of AI and science, with a specific focus on the challenges posed to copyright law and the principles of open science. The author argues that current regulatory frameworks in key jurisdictions like the United States, China, the European Union, and the United Kingdom, while aiming to foster innovation, contain significant gaps, particularly concerning the use of copyrighted works and open science outputs for AI training. Widely adopted licensing mechanisms, such as Creative Commons, fail to adequately address the nuances of AI training, and the pervasive lack of attribution within AI systems fundamentally challenges established notions of originality. This paper issues a call to action, contending that AI training should not be shielded under fair use exceptions. Instead, the author advocates for upholding authors’ rights to refuse the use of their works for AI training and proposes that universities assume a leading role in shaping responsible AI governance. The conclusion is that a harmonized international legislative effort is urgently needed to ensure transparency, protect intellectual property, and prevent the emergence of an oligopolistic market structure that could prioritize commercial profit over scientific integrity and equitable knowledge production.

[AI-111] The Rapid Growth of AI Foundation Model Usage in Science

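[Quick Read]: This paper examines how the scientific community is adopting AI foundation models: the adoption trend, usage patterns, and potential impact. The key contribution is a large-scale empirical analysis that characterizes the growth rate of foundation model usage in science, its distribution across disciplines, preferences by model type (vision models are the most used, although the share of language models is growing), and the dominance of open-weight models. It also finds that the foundation models scientists adopt are much smaller than those AI builders create, and that papers using larger models appear in higher-impact journals and accrue more citations, which suggests that reliance on smaller models may be limiting the benefits science gets from AI progress.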

Link: https://arxiv.org/abs/2511.21739
Authors: Ana Trišović, Alex Fogelson, Janakan Sivaloganathan, Neil Thompson
Affiliation: Unknown
Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI)
Comments:

Abstract:We present the first large-scale analysis of AI foundation model usage in science - not just citations or keywords. We find that adoption has grown rapidly, at nearly-exponential rates, with the highest uptake in Linguistics, Computer Science, and Engineering. Vision models are the most used foundation models in science, although language models’ share is growing. Open-weight models dominate. As AI builders increase the parameter counts of their models, scientists have followed suit but at a much slower rate: in 2013, the median foundation model built was 7.7x larger than the median one adopted in science, by 2024 this had jumped to 26x. We also present suggestive evidence that scientists’ use of these smaller models may be limiting them from getting the full benefits of AI-enabled science, as papers that use larger models appear in higher-impact journals and accrue more citations.

[AI-112] Sensing and Understanding the World over Air: A Large Multimodal Model for Mobile Networks

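[Quick Read]: This paper addresses the lack of domain-specific wireless-native multi-modal large models (WMLMs) that integrate communication, sensing, and intelligence for wireless networks, with the aim of advancing network intelligence. The key to the solution is a wireless-native multi-modal training paradigm: the authors build a GPT-style WMLM and train it on a large-scale real-world dataset, using wireless signals as the anchor modality for contrastive learning. The results support the feasibility of wireless signals as a universal modality and highlight the potential of WMLMs for future wireless networks.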

Link: https://arxiv.org/abs/2511.21707
Authors: Zhuoran Duan, Yuhao Wei, Guoshun Nan, Zijun Wang, Yan Yan, Lihua Xiong, Yuhan Ran, Ji Zhang, Jian Li, Qimei Cui, Xiaofeng Tao, Tony Q. S. Quek
Affiliation: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments:

Abstract: Large models (LMs), such as ChatGPT, have made a significant impact across diverse domains and hold great potential to facilitate the evolution of network intelligence. Wireless-native multi-modal large models (WMLMs) can sense and understand the physical world through multi-modal data, serving as a key enabler that integrates communication, sensing, and intelligence, and thus they can boost various smart services to billions of users. However, research on WMLMs remains in its infancy, and the construction of domain-specific multi-modal large models for wireless networks is still underexplored. In this paper, we outline the key characteristics of WMLMs and summarize existing methods, on the basis of which a wireless-native multimodal training paradigm is proposed. Specifically, we constructed a GPT-style WMLM model and trained it on a real-world large-scale dataset, leveraging wireless signals as an anchor modality for contrastive learning. Our approach demonstrates outstanding performance compared with existing small-scale models and large multi-modal models, validating the feasibility of using wireless signals as a universal modality and highlighting WMLM’s potential to emerge as a new paradigm for future wireless networks.

[AI-113] TIP and Polish: Text-Image-Prototype Guided Multi-Modal Generation via Commonality-Discrepancy Modeling and Refinement ICASSP2026

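[Quick Read]: This paper addresses the difficulty of ensuring thematic coherence and style consistency in multi-modal generation. Existing methods suffer from cross-modal mismatch and lack explicit modeling of commonality and discrepancy, and methods that rely on fine-grained training fail to balance semantic precision with writing style consistency. The key to the solution is the TIPPo framework. It extracts text, image, and visual prototype signals via a multi-modal encoder and adapters, and feeds them to a Dual Alignment Attention module and a Difference Operator module to improve cross-modal semantic alignment. It also introduces PolishPPO to reinforce style consistency, and applies unsupervised contrastive learning during SFT to mitigate inter-sample representation collapse.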

Link: https://arxiv.org/abs/2511.21698
Authors: Zhiyong Ma, Jiahao Chen, Qingyuan Chuai, Zhengping Li
Affiliation: Unknown
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI)
Comments: Submitted to ICASSP2026

Abstract: Multi-modal generation struggles to ensure thematic coherence and style consistency. Semantically, existing methods suffer from cross-modal mismatch and lack explicit modeling of commonality and discrepancy. Methods that rely on fine-grained training fail to balance semantic precision with writing style consistency. These shortcomings lead to suboptimal generation quality. To tackle these issues, we propose TIPPo, a simple yet effective framework with explicit input modeling and comprehensive optimization objectives. It extracts the input text and images via multi-modal encoder and adapters, then measures the visual prototype. Textual, Image, and Prototype signals are then fed to our proposed Dual Alignment Attention and Difference Operator modules before language model decoding. The proposed PolishPPO reinforces the style consistency, while the unsupervised contrastive learning during SFT mitigates inter-sample representation collapse. Experimental results demonstrate the promising performance of TIPPo in automatic evaluation and LLM-based criteria for creativity and semantic consistency.

[AI-114] Robust HRRP Recognition under Interrupted Sampling Repeater Jamming using a Prior Jamming Information-Guided Network

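[Quick Read]: This paper addresses the degradation of radar automatic target recognition (RATR) under electronic countermeasures (ECM), especially the mainstream interrupted-sampling repeater jamming (ISRJ), which causes serious feature distortion in high-resolution range profiles (HRRPs). The key to the solution is to introduce a point spread function (PSF) as prior information that models the ISRJ-induced HRRP distortion, and to design a recognition network with a prior-guided feature interaction module and a hybrid loss function. With this prior, the model can learn invariant features across different jamming parameters, improving robustness and generalization.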

Link: https://arxiv.org/abs/2511.23256
Authors: Guozheng Sun, Lei Wang, Yanhao Wang, Jie Wang, Yimin Liu
Affiliation: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments:

Abstract:Radar automatic target recognition (RATR) based on high-resolution range profile (HRRP) has attracted increasing attention due to its ability to capture fine-grained structural features. However, recognizing targets under electronic countermeasures (ECM), especially the mainstream interrupted-sampling repeater jamming (ISRJ), remains a significant challenge, as HRRPs often suffer from serious feature distortion. To address this, we propose a robust HRRP recognition method guided by prior jamming information. Specifically, we introduce a point spread function (PSF) as prior information to model the HRRP distortion induced by ISRJ. Based on this, we design a recognition network that leverages this prior through a prior-guided feature interaction module and a hybrid loss function to enhance the model’s discriminative capability. With the aid of prior information, the model can learn invariant features within distorted HRRP under different jamming parameters. Both the simulated and measured-data experiments demonstrate that our method consistently outperforms state-of-the-art approaches and exhibits stronger generalization capabilities when facing unseen jamming parameters.

[AI-115] What If They Took the Shot? A Hierarchical Bayesian Framework for Counterfactual Expected Goals

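[Quick Read]: This paper addresses a limitation of standard expected goals (xG) models, which ignore differences between players by treating all of them as equally efficient finishers in the same situation, biasing estimates of true player ability. The key to the solution is a hierarchical Bayesian framework that uses expert domain knowledge as informed priors to bring player-specific effects into xG modeling, giving more stable estimates for players with few shots. Combining Bayesian logistic regression with a hierarchical structure, the method reduces posterior uncertainty relative to weak priors and shows strong external validity (hierarchical and baseline predictions correlate at R² = 0.75). It also reveals interpretable specialization profiles and latent ability, providing a quantitative basis for player evaluation, recruitment, and tactical planning.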

Link: https://arxiv.org/abs/2511.23072
Authors: Mikayil Mahmudlu, Oktay Karakuş, Hasan Arkadaş
Affiliation: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Applications (stat.AP)
Comments:

Abstract:This study develops a hierarchical Bayesian framework that integrates expert domain knowledge to quantify player-specific effects in expected goals (xG) estimation, addressing a limitation of standard models that treat all players as identical finishers. Using 9,970 shots from StatsBomb’s 2015-16 data and Football Manager 2017 ratings, we combine Bayesian logistic regression with informed priors to stabilise player-level estimates, especially for players with few shots. The hierarchical model reduces posterior uncertainty relative to weak priors and achieves strong external validity: hierarchical and baseline predictions correlate at R2 = 0.75, while an XGBoost benchmark validated against StatsBomb xG reaches R2 = 0.833. The model uncovers interpretable specialisation profiles, including one-on-one finishing (Aguero, Suarez, Belotti, Immobile, Martial), long-range shooting (Pogba), and first-touch execution (Insigne, Salah, Gameiro). It also identifies latent ability in underperforming players such as Immobile and Belotti. The framework supports counterfactual “what-if” analysis by reallocating shots between players under identical contexts. Case studies show that Sansone would generate +2.2 xG from Berardi’s chances, driven largely by high-pressure situations, while Vardy-Giroud substitutions reveal strong asymmetry: replacing Vardy with Giroud results in a large decline (about -7 xG), whereas the reverse substitution has only a small effect (about -1 xG). This work provides an uncertainty-aware tool for player evaluation, recruitment, and tactical planning, and offers a general approach for domains where individual skill and contextual factors jointly shape performance.
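
The counterfactual "what-if" reallocation described in the abstract can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' model: shot contexts are given as log-odds for an average finisher, each player adds a single additive effect on the logit scale, and `shrunk_effect` is a hypothetical partial-pooling rule (with an assumed `prior_strength`) standing in for full Bayesian inference. All numbers are made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def shrunk_effect(prior_mean, observed_effect, n_shots, prior_strength=50.0):
    """Partial pooling: weight the data by n / (n + prior_strength)."""
    w = n_shots / (n_shots + prior_strength)
    return w * observed_effect + (1.0 - w) * prior_mean

def expected_goals(shot_logits, player_effect):
    """Sum of scoring probabilities when one player takes every shot."""
    return sum(sigmoid(z + player_effect) for z in shot_logits)

# Hypothetical shot contexts (log-odds of scoring for an average finisher)
shots = [-2.2, -1.5, -0.4, -3.0, -1.0]

# Player A has few shots, so the estimate stays close to the prior;
# player B has many shots, so the data dominates.
effect_a = shrunk_effect(prior_mean=0.10, observed_effect=0.60, n_shots=20)
effect_b = shrunk_effect(prior_mean=0.30, observed_effect=0.35, n_shots=400)

xg_a = expected_goals(shots, effect_a)
xg_b = expected_goals(shots, effect_b)
delta = xg_b - xg_a  # counterfactual change if player B took player A's shots
```

Holding the shot contexts fixed and swapping only the player effect is what makes the comparison counterfactual: any difference in `delta` comes from the finisher, not from the chances.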

[AI-116] High-Resolution Probabilistic Data-Driven Weather Modeling with a Stretched-Grid

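[Quick Read]: This paper addresses the difficulty of producing spatially coherent, scalable multi-member forecast ensembles at high spatial resolution. The key to the solution is a probabilistic, data-driven weather model on a stretched grid, with 2.5 km resolution over a region of interest and 31 km resolution elsewhere. The model uses a stochastic encoder-decoder architecture and is trained with a loss based on the Continuous Ranked Probability Score (CRPS) evaluated in both real and spectral space; the spectral loss term is shown to be necessary for spatially coherent fields. Compared with the high-resolution MEPS ensemble forecasts, the model is competitive, and its fields are more spatially coherent than those from models trained with mean squared error or with CRPS without the spectral term.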

Link: https://arxiv.org/abs/2511.23043
Authors: Even Marius Nordhagen, Håvard Homleid Haugen, Aram Farhad Shafiq Salihi, Magnus Sikora Ingstad, Thomas Nils Nipen, Ivar Ambjørn Seierstad, Inger-Lise Frogner, Mariana Clare, Simon Lang, Matthew Chantry, Peter Dueben, Jørn Kristiansen
Affiliation: Unknown
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI)
Comments: 14 pages, 8 figures

Abstract: We present a probabilistic data-driven weather model capable of providing an ensemble of high spatial resolution realizations of 87 variables at arbitrary forecast length and ensemble size. The model uses a stretched grid, dedicating 2.5 km resolution to a region of interest, and 31 km resolution elsewhere. Based on a stochastic encoder-decoder architecture, the model is trained using a loss function based on the Continuous Ranked Probability Score (CRPS) evaluated point-wise in real and spectral space. The spectral loss component is shown to be necessary to create fields that are spatially coherent. The model is compared to high-resolution operational numerical weather prediction forecasts from the MetCoOp Ensemble Prediction System (MEPS), showing competitive forecasts when evaluated against observations from surface weather stations. The model produces fields that are more spatially coherent than mean squared error based models and CRPS based models without the spectral component in the loss.
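
For readers unfamiliar with the score, the sketch below computes the standard empirical CRPS of a scalar ensemble against a single observation. It illustrates only the point-wise, real-space quantity; the paper's actual loss (including its spectral component) is not reproduced here, and the function name and sample values are hypothetical.

```python
def crps_ensemble(members, obs, fair=False):
    """Empirical CRPS of a scalar ensemble forecast against one observation.

    CRPS = mean|x_i - y| - 0.5 * mean|x_i - x_j|. With fair=True the pairwise
    mean excludes i == j (divides by M * (M - 1) instead of M * M).
    """
    m = len(members)
    if m == 0 or (fair and m < 2):
        raise ValueError("need at least one member (two for the fair estimator)")
    accuracy = sum(abs(x - obs) for x in members) / m
    pair_sum = sum(abs(a - b) for a in members for b in members)
    pairs = m * (m - 1) if fair else m * m
    return accuracy - 0.5 * pair_sum / pairs

# Hypothetical 4-member ensemble of 2 m temperature (deg C) at one grid point
ensemble = [1.2, 2.0, 2.9, 3.5]
score = crps_ensemble(ensemble, obs=2.4)
```

Lower is better. The first term rewards members that are close to the observation, and the second term rewards ensemble spread, so collapsing every member onto the ensemble mean does not minimize the score.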

[AI-117] Escaping Barren Plateaus in Variational Quantum Algorithms Using Negative Learning Rate in Quantum Internet of Things

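[Quick Read]: This paper addresses the scalability bottleneck caused by barren plateaus (regions where gradients vanish) when training variational quantum algorithms (VQAs) on resource-constrained Quantum Internet of Things (QIoT) devices. The key to the solution is to introduce negative learning rates: by alternating between positive and negative learning-rate phases during optimization, the method injects controlled instability that helps recover significant gradients and explore flatter regions of the loss landscape. The authors analyze the effect of negative learning on gradient variance theoretically, propose conditions under which it helps escape barren regions, and report consistent improvements in convergence over traditional optimizers on typical VQA benchmarks.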

Link: https://arxiv.org/abs/2511.22861
Authors: Ratun Rahman, Dinh C. Nguyen
Affiliation: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
Comments: Accepted at IEEE Internet of Things Journal

Abstract: Variational Quantum Algorithms (VQAs) are becoming the primary computational primitive for next-generation quantum computers, particularly those embedded as resource-constrained accelerators in the emerging Quantum Internet of Things (QIoT). However, under such device-constrained execution conditions, the scalability of learning is severely limited by barren plateaus, where gradients collapse to zero and training stalls. This poses a practical challenge to delivering VQA-enabled intelligence on QIoT endpoints, which often have few qubits, constrained shot budgets, and strict latency requirements. In this paper, we present a novel approach for escaping barren plateaus by incorporating negative learning rates into the optimization process in QIoT devices. Our method introduces controlled instability into model training by switching between positive and negative learning phases, allowing recovery of significant gradients and exploring flatter areas in the loss landscape. We theoretically evaluate the effect of negative learning on gradient variance and propose conditions under which it helps escape from barren zones. The experimental findings on typical VQA benchmarks show consistent improvements in both convergence and simulation results over traditional optimizers. By escaping barren plateaus, our approach leads to a novel pathway for robust optimization in quantum-classical hybrid models.
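
The mechanics of switching between positive and negative learning phases can be shown on a toy problem. The sketch below is only an illustration of the update rule: the loss is a classical one-dimensional function rather than a variational quantum circuit, and the schedule (`period`, `negative_steps`) is a hypothetical choice, not the one used in the paper.

```python
import math

def loss(theta):
    """Toy 1-D loss with its minimum at theta = 0."""
    return 1.0 - math.cos(theta)

def grad(theta):
    return math.sin(theta)

def alternating_lr(step, lr=0.2, period=10, negative_steps=2):
    """Return -lr for the first `negative_steps` of each period, else +lr."""
    return -lr if step % period < negative_steps else lr

theta = 2.5
history = []
for step in range(100):
    eta = alternating_lr(step)
    theta -= eta * grad(theta)  # eta < 0 turns the step into gradient ascent
    history.append(loss(theta))
```

During a negative phase the parameter moves uphill, which is the "controlled instability" the abstract refers to. In this toy setting the positive phases outweigh the negative ones, so the loss still decreases overall.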

[AI-118] Foundations of Quantum Granular Computing with Effect-Based Granules Algebraic Properties and Reference Architectures

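[Quick Read]: This paper addresses how to extend classical granular computing to the quantum regime, building a unified mathematical framework for quantum information processing and intelligent systems. The core question is how to formalize quantum granules on a finite-dimensional Hilbert space so that soft (non-projective) and sharp (projective) granules are modeled uniformly while remaining compatible with near-term quantum hardware. The key to the solution is to model quantum granules as effect operators, with memberships given by Born probabilities, and to develop refinement, evolution, and decision mechanisms in operator-theoretic terms: granular refinement under Lüders updates, evolution under quantum channels via the adjoint channel in the Heisenberg picture, and the interpretation of the effect operators realizing the Helstrom minimum-error measurement as soft decision granules (Helstrom-type decision granules). The framework reproduces fuzzy-like graded memberships and smooth decision boundaries while exploiting noncommutativity, contextuality, and entanglement.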

Link: https://arxiv.org/abs/2511.22679
Authors: Oscar Montiel Ross
Affiliation: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
Comments: Three figures and the graphical abstract

Abstract:This paper develops the foundations of Quantum Granular Computing (QGC), extending classical granular computing including fuzzy, rough, and shadowed granules to the quantum regime. Quantum granules are modeled as effects on a finite dimensional Hilbert space, so granular memberships are given by Born probabilities. This operator theoretic viewpoint provides a common language for sharp (projective) and soft (nonprojective) granules and embeds granulation directly into the standard formalism of quantum information theory. We establish foundational results for effect based quantum granules, including normalization and monotonicity properties, the emergence of Boolean islands from commuting families, granular refinement under Luders updates, and the evolution of granules under quantum channels via the adjoint channel in the Heisenberg picture. We connect QGC with quantum detection and estimation theory by interpreting the effect operators realizing Helstrom minimum error measurement for binary state discrimination as Helstrom type decision granules, i.e., soft quantum counterparts of Bayes optimal decision regions. Building on these results, we introduce Quantum Granular Decision Systems (QGDS) with three reference architectures that specify how quantum granules can be defined, learned, and integrated with classical components while remaining compatible with near term quantum hardware. Case studies on qubit granulation, two qubit parity effects, and Helstrom style soft decisions illustrate how QGC reproduces fuzzy like graded memberships and smooth decision boundaries while exploiting noncommutativity, contextuality, and entanglement. The framework thus provides a unified and mathematically grounded basis for operator valued granules in quantum information processing, granular reasoning, and intelligent systems.
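
The statement that granular memberships are given by Born probabilities can be made concrete for a single qubit. The sketch below evaluates tr(ρE) for a sharp (projective) and a soft (non-projective) effect; the state and effects are hypothetical examples chosen for illustration, and the helper names are not taken from the paper.

```python
def matmul(a, b):
    """Product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def membership(rho, effect):
    """Born probability tr(rho E), read as a degree of membership."""
    prod = matmul(rho, effect)
    return sum(prod[i][i] for i in range(len(prod))).real

# Hypothetical qubit state (Hermitian, trace 1, positive semidefinite)
rho = [[0.75, 0.25],
       [0.25, 0.25]]

sharp = [[1.0, 0.0], [0.0, 0.0]]            # projector onto |0>: a sharp granule
soft = [[0.8, 0.0], [0.0, 0.2]]             # non-projective effect: a soft granule
soft_complement = [[0.2, 0.0], [0.0, 0.8]]  # identity minus `soft`

m_sharp = membership(rho, sharp)
m_soft = membership(rho, soft)
m_comp = membership(rho, soft_complement)
```

Because `soft` and `soft_complement` sum to the identity, their memberships sum to one for any state, which is the normalization property mentioned in the abstract.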

[AI-119] Variational analysis of determinantal varieties

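[Quick Read]: This paper seeks a unified characterization of the first- and second-order tangent sets that arise in low-rank optimization, together with the resulting analysis of optimality conditions and complexity. The core difficulty is that low-rank sets (low-rank matrices, tensors, symmetric matrices, and positive semidefinite matrices) have intricate geometry, so traditional methods struggle to describe their tangent spaces and curvature precisely, which in turn limits the analysis of optimization properties such as second-order stationary points. The key to the solution is a unified framework that derives explicit first- and second-order tangent sets for each class of low-rank set and introduces a tangent intersection rule, yielding a condition under which a nonsmooth problem and its smooth parameterization have equivalent second-order stationary points. The same framework is used to prove that verifying second-order optimality is NP-hard. The paper further studies the variational geometry of the graph of the normal cone to matrix varieties, deriving its Bouligand tangent cone and its Fréchet and Mordukhovich normal cones, and applies these results to optimality conditions for low-rank bilevel programs.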

Link: https://arxiv.org/abs/2511.22613
Authors: Yan Yang, Bin Gao, Ya-xiang Yuan
Affiliation: Unknown
Categories: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 71 pages, 6 figures, 2 tables

Abstract:Determinantal varieties – the sets of bounded-rank matrices or tensors – have attracted growing interest in low-rank optimization. The tangent cone to low-rank sets is widely studied and underpins a range of geometric methods. The second-order geometry, which encodes curvature information, is more intricate. In this work, we develop a unified framework to derive explicit formulas for both first- and second-order tangent sets to various low-rank sets, including low-rank matrices, tensors, symmetric matrices, and positive semidefinite matrices. The framework also accommodates the intersection of a low-rank set and another set satisfying mild assumptions, thereby yielding a tangent intersection rule. Through the lens of tangent sets, we establish a necessary and sufficient condition under which a nonsmooth problem and its smooth parameterization share equivalent second-order stationary points. Moreover, we exploit tangent sets to characterize optimality conditions for low-rank optimization and prove that verifying second-order optimality is NP-hard. In a separate line of analysis, we investigate variational geometry of the graph of the normal cone to matrix varieties, deriving the explicit Bouligand tangent cone, Fréchet and Mordukhovich normal cones to the graph. These results are further applied to develop optimality conditions for low-rank bilevel programs.

[AI-120] On the Condition Number Dependency in Bilevel Optimization

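[Quick Read]: This paper studies the oracle complexity of finding an $\epsilon$-stationary point in **bilevel optimization** when the upper-level objective is nonconvex and the lower-level problem is strongly convex. Prior work gives an $\tilde{\mathcal{O}}(\kappa^4 \epsilon^{-2})$ upper bound that is near-optimal in $\epsilon$, where $\kappa$ is the condition number, but the optimal dependence on $\kappa$ has remained unclear. The key contribution is an $\Omega(\kappa^2 \epsilon^{-2})$ lower bound paired with an $\tilde{\mathcal{O}}(\kappa^{7/2} \epsilon^{-2})$ upper bound, which establishes the first provable gap between bilevel and minimax problems in this setting. The lower bounds are also extended to high-order smooth functions, stochastic oracles, and convex hyper-objectives, clarifying how hard bilevel optimization is under different structural assumptions.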

Link: https://arxiv.org/abs/2511.22331
Authors: Lesi Chen, Jingzhao Zhang
Affiliation: Unknown
Categories: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: None

Abstract: Bilevel optimization minimizes an objective function, defined by an upper-level problem whose feasible region is the solution of a lower-level problem. We study the oracle complexity of finding an $\epsilon$-stationary point with first-order methods when the upper-level problem is nonconvex and the lower-level problem is strongly convex. Recent works (Ji et al., ICML 2021; Arbel and Mairal, ICLR 2022; Chen et al., JMLR 2025) achieve a $\tilde{\mathcal{O}}(\kappa^4 \epsilon^{-2})$ upper bound that is near-optimal in $\epsilon$. However, the optimal dependency on the condition number $\kappa$ is unknown. In this work, we establish a new $\Omega(\kappa^2 \epsilon^{-2})$ lower bound and $\tilde{\mathcal{O}}(\kappa^{7/2} \epsilon^{-2})$ upper bound for this problem, establishing the first provable gap between bilevel problems and minimax problems in this setup. Our lower bounds can be extended to various settings, including high-order smooth functions, stochastic oracles, and convex hyper-objectives: (1) For second-order and arbitrarily smooth problems, we show $\Omega(\kappa_y^{13/4} \epsilon^{-12/7})$ and $\Omega(\kappa^{17/10} \epsilon^{-8/5})$ lower bounds, respectively. (2) For convex-strongly-convex problems, we improve the previously best lower bound (Ji and Liang, JMLR 2022) from $\Omega(\kappa / \sqrt{\epsilon})$ to $\Omega(\kappa^{5/4} / \sqrt{\epsilon})$. (3) For smooth stochastic problems, we show an $\Omega(\kappa^4 \epsilon^{-4})$ lower bound.

[AI-121] RELiQ: Scalable Entanglement Routing via Reinforcement Learning in Quantum Networks

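[Quick Read]: This paper addresses entanglement routing in quantum networks. The core challenge is that quantum links are highly dynamic and quantum operations are probabilistic, so hand-crafted heuristics rarely achieve optimal performance, especially when global topology information is unavailable. The key to the solution is RELiQ, a reinforcement-learning approach that relies only on local information and iterative message exchange and uses a graph neural network to learn graph representations, which avoids overfitting to specific network topologies. Trained on random graphs, RELiQ outperforms existing local-information heuristics and learning-based methods on both random and real-world topologies. Because it responds quickly to topology changes, it matches or exceeds heuristics that rely on global information.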

Link: https://arxiv.org/abs/2511.22321
Authors: Tobias Meuser, Jannis Weil, Aninda Lahiri, Marius Paraschiv
Affiliation: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments: None

Abstract: Quantum networks are becoming increasingly important because of advancements in quantum computing and quantum sensing, such as recent developments in distributed quantum computing and federated quantum machine learning. Routing entanglement in quantum networks poses several fundamental as well as technical challenges, including the high dynamicity of quantum network links and the probabilistic nature of quantum operations. Consequently, designing hand-crafted heuristics is difficult and often leads to suboptimal performance, especially if global network topology information is unavailable. In this paper, we propose RELiQ, a reinforcement learning-based approach to entanglement routing that only relies on local information and iterative message exchange. Utilizing a graph neural network, RELiQ learns graph representations and avoids overfitting to specific network topologies - a prevalent issue for learning-based approaches. Our approach, trained on random graphs, consistently outperforms existing local information heuristics and learning-based approaches when applied to random and real-world topologies. When compared to global information heuristics, our method achieves similar or superior performance because of its rapid response to topology changes.

[AI-122] An interpretable unsupervised representation learning for high precision measurement in particle physics

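[Quick Read]: This paper addresses the lack of precise control over learned representations in existing unsupervised models for particle physics, which limits physical interpretability and prevents their use in precision measurements. The key to the solution is the Histogram AutoEncoder (HistoAE), an unsupervised representation-learning network whose core innovation is a custom histogram-based loss that enforces a physically structured latent space. With this design the model learns an interpretable two-dimensional latent representation corresponding to particle charge and impact position, and on beam-test data it reaches a charge resolution of $0.25\,e$ and a position resolution of $3\,\mu\mathrm{m}$, showing that the learned representation is both physically meaningful and quantitatively accurate.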

Link: https://arxiv.org/abs/2511.22246
Authors: Xing-Jian Lv, De-Xing Miao, Zi-Jun Xu, Jian-Chun Wang
Affiliation: Unknown
Categories: High Energy Physics - Experiment (hep-ex); Artificial Intelligence (cs.AI); Instrumentation and Detectors (physics.ins-det)
Comments: 8 pages, 7 figures

Abstract: Unsupervised learning has been widely applied to various tasks in particle physics. However, existing models lack precise control over their learned representations, limiting physical interpretability and hindering their use for accurate measurements. We propose the Histogram AutoEncoder (HistoAE), an unsupervised representation learning network featuring a custom histogram-based loss that enforces a physically structured latent space. Applied to silicon microstrip detectors, HistoAE learns an interpretable two-dimensional latent space corresponding to the particle's charge and impact position. After simple post-processing, it achieves a charge resolution of $0.25\,e$ and a position resolution of $3\,\mu\mathrm{m}$ on beam-test data, comparable to the conventional approach. These results demonstrate that unsupervised deep learning models can enable physically meaningful and quantitatively precise measurements. Moreover, the generative capacity of HistoAE enables straightforward extensions to fast detector simulations.

[AI-123] DeepPNI: Language- and graph-based model for mutation-driven protein-nucleic acid energetics

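[Quick Read]: This paper addresses the prediction of binding free energy changes caused by amino acid mutations in protein-nucleic acid complexes, a problem that matters for understanding disease mechanisms and designing targeted therapies. Experimental techniques have limitations in predicting such mutational effects, so accurate computational models are needed. The key to the solution is DeepPNI, a deep-learning regression model that combines structural and sequence features: structural features are encoded with an edge-aware RGCN, and sequence features are extracted with the protein language model ESM-2. On a dataset of 1951 mutations the model reaches an average Pearson correlation coefficient (PCC) of 0.76 and performs consistently across protein-DNA complexes, protein-RNA complexes, and splits by experimental temperature.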

Link: https://arxiv.org/abs/2511.22239
Authors: Somnath Mondal, Tinkal Mondal, Soumajit Pramanik, Rukmankesh Mehra
Affiliation: Unknown
Categories: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI)
Comments: None

Abstract:The interaction between proteins and nucleic acids is crucial for processes that sustain cellular function, including DNA maintenance and the regulation of gene expression and translation. Amino acid mutations in protein-nucleic acid complexes often lead to vital diseases. Experimental techniques have their own specific limitations in predicting mutational effects in protein-nucleic acid complexes. In this study, we compiled a large dataset of 1951 mutations including both protein-DNA and protein-RNA complexes and integrated structural and sequential features to build a deep learning-based regression model named DeepPNI. This model estimates mutation-induced binding free energy changes in protein-nucleic acid complexes. The structural features are encoded via edge-aware RGCN and the sequential features are extracted using protein language model ESM-2. We have achieved a high average Pearson correlation coefficient (PCC) of 0.76 in the large dataset via five-fold cross-validation. Consistent performance across individual dataset of protein-DNA, protein-RNA complexes, and different experimental temperature split dataset make the model generalizable. Our model showed good performance in complex-based five-fold cross-validation, which proved its robustness. In addition, DeepPNI outperformed in external dataset validation, and comparison with existing tools

[AI-124] Towards Heterogeneous Quantum Federated Learning: Challenges and Solutions

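[Quick Read]: This paper addresses the training instability, slow convergence, and degraded model performance that heterogeneity among clients causes in quantum federated learning (QFL). The key is a systematic division of heterogeneity into data heterogeneity and system heterogeneity, with an analysis of how each affects training convergence and model aggregation. The paper also critically evaluates the limitations of existing mitigation approaches and uses a case study to show that quantum heterogeneity can be tackled, providing a theoretical basis and a practical path toward robust, scalable heterogeneous QFL frameworks.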

Link: https://arxiv.org/abs/2511.22148
Authors: Ratun Rahman, Dinh C. Nguyen, Christo Kurisummoottil Thomas, Walid Saad
Affiliation: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
Comments: Accepted at IEEE Network Magazine

Abstract: Quantum federated learning (QFL) combines quantum computing and federated learning to enable decentralized model training while maintaining data privacy. QFL can improve computational efficiency and scalability by taking advantage of quantum properties such as superposition and entanglement. However, existing QFL frameworks largely focus on homogeneity among quantum clients, and they do not account for real-world variances in quantum data distributions, encoding techniques, hardware noise levels, and computational capacity. These differences can create instability during training, slow convergence, and reduce overall model performance. In this paper, we conduct an in-depth examination of heterogeneity in QFL, classifying it into two categories: data or system heterogeneity. Then we investigate the influence of heterogeneity on training convergence and model aggregation. We critically evaluate existing mitigation solutions, highlight their limitations, and give a case study that demonstrates the viability of tackling quantum heterogeneity. Finally, we discuss potential future research areas for constructing robust and scalable heterogeneous QFL frameworks.

[AI-125] Joint Estimation of Sea State and Vessel Parameters Using a Mass-Spring-Damper Equivalence Model

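[Quick Read]: This paper addresses real-time sea state estimation. Traditional methods infer the wave spectrum from onboard sensor data using an accurate wave-vessel transfer function, which is often unavailable or varies with operating conditions. The key to the solution is a method that jointly estimates the sea state and the vessel parameters without prior knowledge of the transfer function. The wave-vessel system is modeled as a dynamic system of pseudo mass-spring-dampers, which allows wave excitation to be modeled recursively as a time-varying input. A square-root cubature Kalman filter fuses the sensor data, and a posterior Cramér-Rao lower bound is derived to evaluate estimator performance. Monte Carlo simulations and data from a high-fidelity simulator indicate that the estimated wave spectrum matches that of methods assuming full knowledge of the transfer function.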

Link: https://arxiv.org/abs/2511.21997
Authors: Ranjeet K. Tiwari, Daniel Sgarioto, Peter Graham, Alexei Skvortsov, Sanjeev Arulampalam, Damith C. Ranasinghe
Affiliation: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: None

Abstract:Real-time sea state estimation is vital for applications like shipbuilding and maritime safety. Traditional methods rely on accurate wave-vessel transfer functions to estimate wave spectra from onboard sensors. In contrast, our approach jointly estimates sea state and vessel parameters without needing prior transfer function knowledge, which may be unavailable or variable. We model the wave-vessel system using pseudo mass-spring-dampers and develop a dynamic model for the system. This method allows for recursive modeling of wave excitation as a time-varying input, relaxing prior works’ assumption of a constant input. We derive statistically consistent process noise covariance and implement a square root cubature Kalman filter for sensor data fusion. Further, we derive the Posterior Cramer-Rao lower bound to evaluate estimator performance. Extensive Monte Carlo simulations and data from a high-fidelity validated simulator confirm that the estimated wave spectrum matches methods assuming complete transfer function knowledge.

[AI-126] BeeRNA: tertiary structure-based RNA inverse folding using Artificial Bee Colony AAAI2026

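[Quick Read]: This paper addresses the RNA inverse folding problem: designing nucleotide sequences that fold into a specified tertiary structure, a fundamental problem in computational biology with applications in synthetic biology and bioengineering. Most existing methods target only secondary structure, and designing complex three-dimensional RNA architectures remains difficult. The paper proposes BeeRNA, a bio-inspired method whose key idea is to pair the Artificial Bee Colony (ABC) optimization algorithm with a two-stage fitness evaluation: candidate sequences are first filtered by base-pair distance, then RhoFold predicts their structures and the match to the target is scored by root-mean-square deviation (RMSD). Thermodynamic constraints and adaptive mutation rates keep the GC content balanced and the sequences biologically plausible. The method achieves high structural fidelity with practical runtimes on short and medium-length RNAs ($\leq 100$ nucleotides), a regime that includes miRNAs, aptamers, and ribozymes.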

Link: https://arxiv.org/abs/2511.21781
Authors: Mehyar Mlaweh, Tristan Cazenave, Ines Alaya
Affiliation: Unknown
Categories: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI)
Comments: Accepted at the AI in Drug Discovery Workshop, AAAI 2026, Singapore

Abstract: The Ribonucleic Acid (RNA) inverse folding problem, designing nucleotide sequences that fold into specific tertiary structures, is a fundamental computational biology problem with important applications in synthetic biology and bioengineering. The design of complex three-dimensional RNA architectures remains computationally demanding and mostly unresolved, as most existing approaches focus on secondary structures. In order to address tertiary RNA inverse folding, we present BeeRNA, a bio-inspired method that employs the Artificial Bee Colony (ABC) optimization algorithm. Our approach combines base-pair distance filtering with RMSD-based structural assessment using RhoFold for structure prediction, resulting in a two-stage fitness evaluation strategy. To guarantee biologically plausible sequences with balanced GC content, the algorithm takes thermodynamic constraints and adaptive mutation rates into consideration. In this work, we focus primarily on short and medium-length RNAs ($\leq 100$ nucleotides), a biologically significant regime that includes microRNAs (miRNAs), aptamers, and ribozymes, where BeeRNA achieves high structural fidelity with practical CPU runtimes. The lightweight, training-free implementation will be publicly released for reproducibility, offering a promising bio-inspired approach for RNA design in therapeutics and biotechnology.
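
The cheap first stage of the two-stage fitness evaluation described above can be sketched in a few lines: the base-pair distance between two dot-bracket secondary structures, plus a GC-content check. This is an illustration under assumptions, not the BeeRNA implementation. The function names are hypothetical, pseudoknots are not handled, and the RhoFold/RMSD stage is deliberately left out.

```python
def base_pairs(dot_bracket):
    """Return the set of (i, j) index pairs encoded by a dot-bracket string."""
    stack, pairs = [], set()
    for j, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(j)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure: unmatched ')'")
            pairs.add((stack.pop(), j))
    if stack:
        raise ValueError("unbalanced structure: unmatched '('")
    return pairs

def base_pair_distance(structure_a, structure_b):
    """Number of base pairs present in exactly one of the two structures."""
    if len(structure_a) != len(structure_b):
        raise ValueError("structures must have the same length")
    return len(base_pairs(structure_a) ^ base_pairs(structure_b))

def gc_content(sequence):
    """Fraction of G and C nucleotides in an RNA sequence."""
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)

print(base_pair_distance("((..))", "(....)"))
print(gc_content("GGAUCC"))
```

A filter of this kind would only pass candidates whose distance to the target falls under some threshold before the expensive structure prediction is run. The threshold itself is a design choice that the abstract does not specify.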

[AI-127] QuantumChem-200K: A Large-Scale Open Organic Molecular Dataset for Quantum-Chemistry Property Screening and Language Model Benchmarking

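[Quick Read]: This paper addresses the inefficient design of next-generation photoinitiators for two-photon polymerization (TPP). The underlying bottleneck is the lack of a large open dataset of quantum-chemical and photophysical properties, without which data-driven screening and AI-assisted design are not possible. The key to the solution is QuantumChem-200K, a dataset covering eleven properties (including two-photon absorption (TPA) cross sections and singlet-triplet intersystem crossing (ISC) energies) for more than 200,000 organic molecules, computed with a hybrid workflow that combines density functional theory (DFT), semi-empirical excited-state methods, atomistic quantum solvers, and neural-network predictors. The open-source large language model Qwen2.5-32B is then fine-tuned on this dataset to build a chemistry AI assistant that predicts properties directly from SMILES, improving accuracy on TPA and ISC predictions in particular and providing the first scalable platform for high-throughput, LLM-driven photoinitiator screening and photosensitive-materials discovery.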

Link: https://arxiv.org/abs/2511.21747
Authors: Yinqi Zeng, Renjie Li
Affiliation: Unknown
Categories: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages, 5 figures, 3 tables

Abstract: The discovery of next-generation photoinitiators for two-photon polymerization (TPP) is hindered by the absence of large, open datasets containing the quantum-chemical and photophysical properties required to model photodissociation and excited-state behavior. Existing molecular datasets typically provide only basic physicochemical descriptors and therefore cannot support data-driven screening or AI-assisted design of photoinitiators. To address this gap, we introduce QuantumChem-200K, a large-scale dataset of over 200,000 organic molecules annotated with eleven quantum-chemical properties, including two-photon absorption (TPA) cross sections, TPA spectral ranges, singlet-triplet intersystem crossing (ISC) energies, toxicity and synthetic accessibility scores, hydrophilicity, solubility, boiling point, molecular weight, and aromaticity. These values are computed using a hybrid workflow that integrates density functional theory (DFT), semi-empirical excited-state methods, atomistic quantum solvers, and neural-network predictors. Using QuantumChem-200K, we fine-tune the open-source Qwen2.5-32B large language model to create a chemistry AI assistant capable of forward property prediction from SMILES. Benchmarking on 3000 unseen molecules from VQM24 and ZINC20 demonstrates that domain-specific fine-tuning significantly improves accuracy over GPT-4o, Llama-3.1-70B, and the base Qwen2.5-32B model, particularly for TPA and ISC predictions central to photoinitiator design. QuantumChem-200K and the corresponding AI assistant together provide the first scalable platform for high-throughput, LLM-driven photoinitiator screening and accelerated discovery of photosensitive materials.

Machine Learning

[LG-0] SmallWorlds: Assessing Dynamics Understanding of World Models in Isolated Environments

Link: https://arxiv.org/abs/2511.23465
Authors: Xinyi Li, Zaishuo Xia, Weyl Lu, Chenjie Hao, Yubei Chen
Categories: Machine Learning (cs.LG)
Comments: None

Abstract:Current world models lack a unified and controlled setting for systematic evaluation, making it difficult to assess whether they truly capture the underlying rules that govern environment dynamics. In this work, we address this open challenge by introducing the SmallWorld Benchmark, a testbed designed to assess world model capability under isolated and precisely controlled dynamics without relying on handcrafted reward signals. Using this benchmark, we conduct comprehensive experiments in the fully observable state space on representative architectures including Recurrent State Space Model, Transformer, Diffusion model, and Neural ODE, examining their behavior across six distinct domains. The experimental results reveal how effectively these models capture environment structure and how their predictions deteriorate over extended rollouts, highlighting both the strengths and limitations of current modeling paradigms and offering insights into future improvement directions in representation learning and dynamics modeling.

[LG-1] Provable Benefits of Sinusoidal Activation for Modular Addition

Link: https://arxiv.org/abs/2511.23443
Authors: Tianlong Huang, Zhiyuan Li
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 60 pages, 15 figures

Abstract: This paper studies the role of activation functions in learning modular addition with two-layer neural networks. We first establish a sharp expressivity gap: sine MLPs admit width-$2$ exact realizations for any fixed length $m$ and, with bias, width-$2$ exact realizations uniformly over all lengths. In contrast, the width of ReLU networks must scale linearly with $m$ to interpolate, and they cannot simultaneously fit two lengths with different residues modulo $p$. We then provide a novel Natarajan-dimension generalization bound for sine networks, yielding nearly optimal sample complexity $\widetilde{\mathcal{O}}(p)$ for ERM over constant-width sine networks. We also derive width-independent, margin-based generalization for sine networks in the overparametrized regime and validate it. Empirically, sine networks generalize consistently better than ReLU networks across regimes and exhibit strong length extrapolation.

[LG-2] Accelerated Execution of Bayesian Neural Networks using a Single Probabilistic Forward Pass and Code Generation

Link: https://arxiv.org/abs/2511.23440
Authors: Bernhard Klein, Falk Selker, Hendrik Borras, Sophie Steger, Franz Pernkopf, Holger Fröning
Categories: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (stat.ML)
Comments: None

Abstract:Machine learning models perform well across domains such as diagnostics, weather forecasting, NLP, and autonomous driving, but their limited uncertainty handling restricts use in safety-critical settings. Traditional neural networks often fail to detect out-of-domain (OOD) data and may output confident yet incorrect predictions. Bayesian neural networks (BNNs) address this by providing probabilistic estimates, but incur high computational cost because predictions require sampling weight distributions and multiple forward passes. The Probabilistic Forward Pass (PFP) offers a highly efficient approximation to Stochastic Variational Inference (SVI) by assuming Gaussian-distributed weights and activations, enabling fully analytic uncertainty propagation and replacing sampling with a single deterministic forward pass. We present an end-to-end pipeline for training, compiling, optimizing, and deploying PFP-based BNNs on embedded ARM CPUs. Using the TVM deep learning compiler, we implement a dedicated library of Gaussian-propagating operators for multilayer perceptrons and convolutional neural networks, combined with manual and automated tuning strategies. Ablation studies show that PFP consistently outperforms SVI in computational efficiency, achieving speedups of up to 4200x for small mini-batches. PFP-BNNs match SVI-BNNs on Dirty-MNIST in accuracy, uncertainty estimation, and OOD detection while greatly reducing compute cost. These results highlight the potential of combining Bayesian approximations with code generation to enable efficient BNN deployment on resource-constrained systems.
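
The idea of replacing sampling with one deterministic pass can be illustrated on a single linear layer: if weights and inputs are independent with known means and variances, the output mean and variance follow in closed form. The sketch below shows only that moment-propagation step. It assumes element-wise independence (a mean-field assumption), uses a hypothetical function name, and is unrelated to the authors' TVM operator library.

```python
def linear_moments(w_mean, w_var, x_mean, x_var, bias=None):
    """Propagate means and variances through y = W x + b.

    For independent w and x:
      E[w x]   = mu_w * mu_x
      Var[w x] = var_w * var_x + var_w * mu_x**2 + mu_w**2 * var_x
    The bias is treated as deterministic, so it shifts only the mean.
    """
    out_mean, out_var = [], []
    for row, (mu_row, var_row) in enumerate(zip(w_mean, w_var)):
        mean = sum(mw * mx for mw, mx in zip(mu_row, x_mean))
        var = sum(
            vw * vx + vw * mx * mx + mw * mw * vx
            for mw, vw, mx, vx in zip(mu_row, var_row, x_mean, x_var)
        )
        if bias is not None:
            mean += bias[row]
        out_mean.append(mean)
        out_var.append(var)
    return out_mean, out_var

# One output unit, two inputs, uncertain weights and uncertain inputs.
mean, var = linear_moments(
    w_mean=[[1.0, -2.0]], w_var=[[0.1, 0.2]],
    x_mean=[0.5, 1.5], x_var=[0.01, 0.04],
)
print(mean, var)
```

A sampling-based approach would estimate the same two numbers by drawing many weight samples and running one forward pass per sample, which is the cost the single-pass approximation avoids.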

[LG-3] Quantized-Tinyllava: a new multimodal foundation model enables efficient split learning

Link: https://arxiv.org/abs/2511.23402
Authors: Jiajun Guo, Xin Luo, Jie Liu
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 14 pages, 5 figures

Abstract:Split learning is well known as a method for resolving data privacy concerns by training a model on distributed devices, thereby avoiding data sharing that raises privacy issues. However, high network communication costs are always an impediment to split learning, especially for large foundation models that require transmitting large amounts of high-dimensional data. To resolve this issue, we present a new multimodal model structure that incorporates a learning-based data compression method, which compresses model embeddings into low-bit integers while preserving the model’s performance, greatly reducing the transmission costs between partitions. We then determine the optimal number of discrete representation levels based on a solid theoretical foundation from entropy coding.
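
To make the communication saving concrete, here is a sketch of compressing an embedding vector into low-bit integer codes before transmission, together with the empirical entropy of those codes (the quantity an entropy coder would approach). The paper's compression is learned. The fixed uniform quantizer below is a simpler stand-in chosen for illustration, and all names are hypothetical.

```python
import math
from collections import Counter

def quantize(values, bits):
    """Map floats to integer codes in [0, 2**bits - 1] with a uniform grid."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale

def dequantize(codes, lo, scale):
    """Recover approximate floats on the receiving partition."""
    return [lo + c * scale for c in codes]

def empirical_entropy_bits(codes):
    """Average bits per code under an ideal entropy coder for this histogram."""
    counts = Counter(codes)
    total = len(codes)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

embedding = [-1.0, -0.2, 0.3, 1.0, 0.31, -0.19]
codes, lo, scale = quantize(embedding, bits=4)
print(codes)
print(dequantize(codes, lo, scale))
print(empirical_entropy_bits(codes))
```

With a uniform grid the reconstruction error per element is at most half a quantization step, so the choice of bit width trades transmission cost against fidelity.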

[LG-4] Learning-Augmented Online Bipartite Matching in the Random Arrival Order Model

Link: https://arxiv.org/abs/2511.23388
Authors: Kunanon Burathep, Thomas Erlebach, William K. Moses Jr
Categories: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
Comments: 17 pages, 1 figure, 1 table. An extended abstract of this paper appears in the proceedings of the 51st International Conference on Current Trends in Theory and Practice of Computer Science (SOFSEM 2026)

Abstract: We study the online unweighted bipartite matching problem in the random arrival order model, with $n$ offline and $n$ online vertices, in the learning-augmented setting: The algorithm is provided with untrusted predictions of the types (neighborhoods) of the online vertices. We build upon the work of Choo et al. (ICML 2024, pp. 8762-8781) who proposed an approach that uses a prefix of the arrival sequence as a sample to determine whether the predictions are close to the true arrival sequence and then either follows the predictions or uses a known baseline algorithm that ignores the predictions and is $\beta$-competitive. Their analysis is limited to the case that the optimal matching has size $n$, i.e., every online vertex can be matched. We generalize their approach and analysis by removing any assumptions on the size of the optimal matching while only requiring that the size of the predicted matching is at least $\alpha n$ for any constant $0 < \alpha \le 1$. Our learning-augmented algorithm achieves $(1-o(1))$-consistency and $(\beta-o(1))$-robustness. Additionally, we show that the competitive ratio degrades smoothly between consistency and robustness with increasing prediction error.

[LG-5] Distributed Dynamic Associative Memory via Online Convex Optimization

Link: https://arxiv.org/abs/2511.23347
Authors: Bowen Wang, Matteo Zecchin, Osvaldo Simeone
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: None

Abstract:An associative memory (AM) enables cue-response recall, and it has recently been recognized as a key mechanism underlying modern neural architectures such as Transformers. In this work, we introduce the concept of distributed dynamic associative memory (DDAM), which extends classical AM to settings with multiple agents and time-varying data streams. In DDAM, each agent maintains a local AM that must not only store its own associations but also selectively memorize information from other agents based on a specified interest matrix. To address this problem, we propose a novel tree-based distributed online gradient descent algorithm, termed DDAM-TOGD, which enables each agent to update its memory on the fly via inter-agent communication over designated routing trees. We derive rigorous performance guarantees for DDAM-TOGD, proving sublinear static regret in stationary environments and a path-length dependent dynamic regret bound in non-stationary environments. These theoretical results provide insights into how communication delays and network structure impact performance. Building on the regret analysis, we further introduce a combinatorial tree design strategy that optimizes the routing trees to minimize communication delays, thereby improving regret bounds. Numerical experiments demonstrate that the proposed DDAM-TOGD framework achieves superior accuracy and robustness compared to representative online learning baselines such as consensus-based distributed optimization, confirming the benefits of the proposed approach in dynamic, distributed environments.

[LG-6] Emergent Coordination and Phase Structure in Independent Multi-Agent Reinforcement Learning

Link: https://arxiv.org/abs/2511.23315
Authors: Azusa Yamaguchi
Categories: Machine Learning (cs.LG)
Comments: 22 pages, 19 figures

Abstract:A clearer understanding of when coordination emerges, fluctuates, or collapses in decentralized multi-agent reinforcement learning (MARL) is increasingly sought in order to characterize the dynamics of multi-agent learning systems. We revisit fully independent Q-learning (IQL) as a minimal decentralized testbed and run large-scale experiments across environment size L and agent density rho. We construct a phase map using two axes - the cooperative success rate (CSR) and a stability index derived from TD-error variance - revealing three distinct regimes: a coordinated and stable phase, a fragile transition region, and a jammed or disordered phase. A sharp double Instability Ridge separates these regimes and corresponds to persistent kernel drift, the time-varying shift of each agent’s effective transition kernel induced by others’ policy updates. Synchronization analysis further shows that temporal alignment is required for sustained cooperation, and that competition between drift and synchronization generates the fragile regime. Removing agent identifiers eliminates drift entirely and collapses the three-phase structure, demonstrating that small inter-agent asymmetries are a necessary driver of drift. Overall, the results show that decentralized MARL exhibits a coherent phase structure governed by the interaction between scale, density, and kernel drift, suggesting that emergent coordination behaves as a distribution-interaction-driven phase phenomenon.

[LG-7] Closing the Generalization Gap in Parameter-efficient Federated Edge Learning

Link: https://arxiv.org/abs/2511.23282
Authors: Xinnong Du, Zhonghao Lyu, Xiaowen Cao, Chunyang Wen, Shuguang Cui, Jie Xu
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Information Theory (cs.IT)
Comments: 13 pages, 8 figures

Abstract:Federated edge learning (FEEL) provides a promising foundation for edge artificial intelligence (AI) by enabling collaborative model training while preserving data privacy. However, limited and heterogeneous local datasets, as well as resource-constrained deployment, severely degrade both model generalization and resource utilization, leading to a compromised learning performance. Therefore, we propose a parameter-efficient FEEL framework that jointly leverages model pruning and client selection to tackle such challenges. First, we derive an information-theoretic generalization statement that characterizes the discrepancy between training and testing function losses and embed it into the convergence analysis. It reveals that a larger local generalization statement can undermine the global convergence. Then, we formulate a generalization-aware average squared gradient norm bound minimization problem, by jointly optimizing the pruning ratios, client selection, and communication-computation resources under energy and delay constraints. Despite its non-convexity, the resulting mixed-integer problem is efficiently solved via an alternating optimization algorithm. Extensive experiments demonstrate that the proposed design achieves superior learning performance than state-of-the-art baselines, validating the effectiveness of coupling generalization-aware analysis with system-level optimization for efficient FEEL.

[LG-8] Beyond Curve Fitting: Neuro-Symbolic Agents for Context-Aware Epidemic Forecasting

Link: https://arxiv.org/abs/2511.23276
Authors: Joongwon Chae, Runming Wang, Chen Xiong, Gong Yunhan, Lian Zhang, Ji Jiansong, Dongmei Yu, Peiwu Qin
Categories: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: None

Abstract:Effective surveillance of hand, foot and mouth disease (HFMD) requires forecasts accounting for epidemiological patterns and contextual drivers like school calendars and weather. While classical models and recent foundation models (e.g., Chronos, TimesFM) incorporate covariates, they often lack the semantic reasoning to interpret the causal interplay between conflicting drivers. In this work, we propose a two-agent framework decoupling contextual interpretation from probabilistic forecasting. An LLM “event interpreter” processes heterogeneous signals-including school schedules, meteorological summaries, and reports-into a scalar transmission-impact signal. A neuro-symbolic core then combines this with historical case counts to produce calibrated probabilistic forecasts. We evaluate the framework on real-world HFMD datasets from Hong Kong (2023-2024) and Lishui, China (2024). Compared to traditional and foundation-model baselines, our approach achieves competitive point forecasting accuracy while providing robust 90% prediction intervals (coverage 0.85-1.00) and human-interpretable rationales. Our results suggest that structurally integrating domain knowledge through LLMs can match state-of-the-art performance while yielding context-aware forecasts that align with public health workflows. Code is available at this https URL .

[LG-9] An Improved and Generalised Analysis for Spectral Clustering

Link: https://arxiv.org/abs/2511.23261
Authors: George Tyler, Luca Zanetti
Categories: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: 11 pages, 7 figures. Accepted to Learning on Graphs Conference 2025

Abstract:We revisit the theoretical performance of Spectral Clustering, a classical algorithm for graph partitioning that relies on the eigenvectors of a matrix representation of the graph. Informally, we show that Spectral Clustering works well as long as the smallest eigenvalues appear in groups well separated from the rest of the matrix representation’s spectrum. This arises, for example, whenever there exists a hierarchy of clusters at different scales, a regime not captured by previous analyses. Our results are very general and can be applied beyond the traditional graph Laplacian. In particular, we study Hermitian representations of digraphs and show Spectral Clustering can recover partitions where edges between clusters are oriented mostly in the same direction. This has applications in, for example, the analysis of trophic levels in ecological networks. We demonstrate that our results accurately predict the performance of Spectral Clustering on synthetic and real-world data sets.

[LG-10] Heteroscedastic Neural Networks for Path Loss Prediction with Link-Specific Uncertainty

Link: https://arxiv.org/abs/2511.23243
Authors: Jonathan Ethier
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: Submitted to IEEE AWPL in December 2025. 5 pages, 2 figures, 4 tables

Abstract:Traditional and modern machine learning-based path loss models typically assume a constant prediction variance. We propose a neural network that jointly predicts the mean and link-specific variance by minimizing a Gaussian negative log-likelihood, enabling heteroscedastic uncertainty estimates. We compare shared, partially shared, and independent-parameter architectures using accuracy, calibration, and sharpness metrics on blind test sets from large public RF drive-test datasets. The shared-parameter architecture performs best, achieving an RMSE of 7.4 dB, 95.1 percent coverage for 95 percent prediction intervals, and a mean interval width of 29.6 dB. These uncertainty estimates further support link-specific coverage margins, improve RF planning and interference analyses, and provide effective self-diagnostics of model weaknesses.
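
As a rough illustration of the objective this abstract describes, the sketch below evaluates a per-sample Gaussian negative log-likelihood from a predicted mean and log-variance, and measures the empirical coverage of a nominal 95% interval. The function names, the log-variance parameterization, and the toy numbers are assumptions made for illustration; they are not taken from the paper.

```python
import math

def gaussian_nll(y, mean, log_var):
    """Negative log-likelihood of y under N(mean, exp(log_var))."""
    # Predicting the log-variance keeps the variance positive without constraints.
    return 0.5 * (math.log(2.0 * math.pi) + log_var
                  + (y - mean) ** 2 / math.exp(log_var))

def interval_coverage(targets, means, stds, z=1.96):
    """Fraction of targets that fall inside mean +/- z * std."""
    inside = sum(1 for y, m, s in zip(targets, means, stds) if abs(y - m) <= z * s)
    return inside / len(targets)

# Toy path-loss values in dB (hypothetical numbers, not from the paper).
targets = [100.0, 112.0, 95.0, 130.0]
means = [102.0, 110.0, 96.0, 118.0]
stds = [4.0, 3.0, 2.0, 5.0]
print(interval_coverage(targets, means, stds))
```

Because the variance is predicted per link, a large residual is penalized less when the model also reports a large variance, at the cost of the `log_var` term; this trade-off is what makes the uncertainty estimate link-specific rather than constant.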

[LG-11] Towards Understanding Transformers in Learning Random Walks

Link: https://arxiv.org/abs/2511.23239
Authors: Wei Shi, Yuan Cao
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 45 pages, 13 figures

Abstract:Transformers have proven highly effective across various applications, especially in handling sequential data such as natural languages and time series. However, transformer models often lack clear interpretability, and the success of transformers has not been well understood in theory. In this paper, we study the capability and interpretability of transformers in learning a family of classic statistical models, namely random walks on circles. We theoretically demonstrate that, after training with gradient descent, a one-layer transformer model can achieve optimal accuracy in predicting random walks. Importantly, our analysis reveals that the trained model is interpretable: the trained softmax attention serves as a token selector, focusing on the direct parent state; subsequently, the value matrix executes a one-step probability transition to predict the location of the next state based on this parent state. We also show that certain edge cases not covered by our theory are indeed failure cases, demonstrating that our theoretical conditions are tight. By investigating these success and failure cases, it is revealed that gradient descent with small initialization may fail or struggle to converge to a good solution in certain simple tasks even beyond random walks. Experiments are conducted to support our theoretical findings.

[LG-12] SDE-Attention: Latent Attention in SDE-RNNs for Irregularly Sampled Time Series with Missing Data

Link: https://arxiv.org/abs/2511.23238
Authors: Yuting Fang, Qouc Le Gia, Flora Salim
Categories: Machine Learning (cs.LG)
Comments: 11 pages, 6 figures

Abstract:Irregularly sampled time series with substantial missing observations are common in healthcare and sensor networks. We introduce SDE-Attention, a family of SDE-RNNs equipped with channel-level attention on the latent pre-RNN state, including channel recalibration, time-varying feature attention, and pyramidal multi-scale self-attention. We conduct a comparison on a synthetic periodic dataset and real-world benchmarks under varying missing rates. Latent-space attention consistently improves over a vanilla SDE-RNN. On the univariate UCR datasets, the LSTM-based time-varying feature model SDE-TVF-L achieves the highest average accuracy, raising mean performance by approximately 4, 6, and 10 percentage points over the baseline at 30%, 60% and 90% missingness, respectively (averaged across datasets). On multivariate UEA benchmarks, attention-augmented models again outperform the backbone, with SDE-TVF-L yielding up to a 7% gain in mean accuracy under high missingness. Among the proposed mechanisms, time-varying feature attention is the most robust on univariate datasets. On multivariate datasets, different attention types excel on different tasks, showing that SDE-Attention can be flexibly adapted to the structure of each problem.

[LG-13] Clustering Malware at Scale: A First Full-Benchmark Study

Link: https://arxiv.org/abs/2511.23198
Authors: Martin Mocko, Jakub Ševcech, Daniela Chudá
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Preprint of the paper (i.e., “submitted manuscript” version)

Abstract:Recent years have shown that malware attacks still happen with high frequency. Malware experts seek to categorize and classify incoming samples to confirm their trustworthiness or prove their maliciousness. One of the ways in which groups of malware samples can be identified is through malware clustering. Despite the efforts of the community, malware clustering which incorporates benign samples has been under-explored. Moreover, despite the availability of larger public benchmark malware datasets, malware clustering studies have avoided fully utilizing these datasets in their experiments, often resorting to small datasets with only a few families. Additionally, the current state-of-the-art solutions for malware clustering remain unclear. In our study, we evaluate malware clustering quality and establish the state-of-the-art on Bodmas and Ember - two large public benchmark malware datasets. Ours is the first study of malware clustering performed on whole malware benchmark datasets. Additionally, we extend the malware clustering task by incorporating benign samples. Our results indicate that incorporating benign samples does not significantly degrade clustering quality. We find that there are significant differences in the quality of the created clusters between Ember and Bodmas, as well as a private industry dataset. Contrary to popular opinion, our top clustering performers are K-Means and BIRCH, with DBSCAN and HAC falling behind.

[LG-14] Fault-Tolerant MARL for CAVs under Observation Perturbations for Highway On-Ramp Merging

Link: https://arxiv.org/abs/2511.23193
Authors: Yuchen Shi, Huaxin Pei, Yi Zhang, Danya Yao
Categories: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)

Abstract:Multi-Agent Reinforcement Learning (MARL) holds significant promise for enabling cooperative driving among Connected and Automated Vehicles (CAVs). However, its practical application is hindered by a critical limitation, i.e., insufficient fault tolerance against observational faults. Such faults, which appear as perturbations in the vehicles’ perceived data, can substantially compromise the performance of MARL-based driving systems. Addressing this problem presents two primary challenges. One is to generate adversarial perturbations that effectively stress the policy during training, and the other is to equip vehicles with the capability to mitigate the impact of corrupted observations. To overcome the challenges, we propose a fault-tolerant MARL method for cooperative on-ramp vehicles incorporating two key agents. First, an adversarial fault injection agent is co-trained to generate perturbations that actively challenge and harden the vehicle policies. Second, we design a novel fault-tolerant vehicle agent equipped with a self-diagnosis capability, which leverages the inherent spatio-temporal correlations in vehicle state sequences to detect faults and reconstruct credible observations, thereby shielding the policy from misleading inputs. Experiments in a simulated highway merging scenario demonstrate that our method significantly outperforms baseline MARL approaches, achieving near-fault-free levels of safety and efficiency under various observation fault patterns.

[LG-15] Energy-Efficient Vision Transformer Inference for Edge-AI Deployment

Link: https://arxiv.org/abs/2511.23166
Authors: Nursultan Amanzhol, Jurn-Gyu Park
Categories: Machine Learning (cs.LG)

Abstract:The growing deployment of Vision Transformers (ViTs) on energy-constrained devices requires evaluation methods that go beyond accuracy alone. We present a two-stage pipeline for assessing ViT energy efficiency that combines device-agnostic model selection with device-related measurements. We benchmark 13 ViT models on ImageNet-1K and CIFAR-10, running inference on an NVIDIA Jetson TX2 (edge device) and an NVIDIA RTX 3050 (mobile GPU). The device-agnostic stage uses the NetScore metric for screening; the device-related stage ranks models with the Sustainable Accuracy Metric (SAM). Results show that hybrid models such as LeViT_Conv_192 reduce energy by up to 53% on TX2 relative to a ViT baseline (e.g., SAM5=1.44 on TX2/CIFAR-10), while distilled models such as TinyViT-11M_Distilled excel on the mobile GPU (e.g., SAM5=1.72 on RTX 3050/CIFAR-10 and SAM5=0.76 on RTX 3050/ImageNet-1K).

[LG-16] Estimating the Event-Related Potential from Few EEG Trials

Link: https://arxiv.org/abs/2511.23162
Authors: Anders Vestergaard Nørskov, Kasper Jørgensen, Alexander Neergaard Zahid, Morten Mørup
Categories: Machine Learning (cs.LG)
Comments: Accepted by Transactions on Machine Learning Research (TMLR). 15 pages main manuscript, 30 pages total including supplementary material

Abstract:Event-related potentials (ERP) are measurements of brain activity with wide applications in basic and clinical neuroscience. They are typically estimated using the average of many trials of electroencephalography signals (EEG) to sufficiently reduce noise and signal variability. We introduce EEG2ERP, a novel uncertainty-aware autoencoder approach that maps an arbitrary number of EEG trials to their associated ERP. To account for the ERP uncertainty we use bootstrapped training targets and introduce a separate variance decoder to model the uncertainty of the estimated ERP. We evaluate our approach in the challenging zero-shot scenario of generalizing to new subjects, considering three different publicly available data sources: (i) the comprehensive ERP CORE dataset that includes over 50,000 EEG trials across six ERP paradigms from 40 subjects, (ii) the large P300 Speller BCI dataset, and (iii) a neuroimaging dataset on face perception consisting of both EEG and magnetoencephalography (MEG) data. We consistently find that our method in the few-trial regime provides substantially better ERP estimates than commonly used conventional and robust averaging procedures. EEG2ERP is the first deep learning approach to map EEG signals to their associated ERP, moving toward reducing the number of trials necessary for ERP research. Code is available at this https URL

[LG-17] A Theoretical Framework for Discovering Groups and Unitary Representations via Tensor Factorization

Link: https://arxiv.org/abs/2511.23152
Authors: Dongsung Huh, Halyun Jeong
Categories: Machine Learning (cs.LG)

Abstract:We analyze the HyperCube model, an \textit{operator-valued} tensor factorization architecture that discovers group structures and their unitary representations. We provide a rigorous theoretical explanation for this inductive bias by decomposing its objective into a term regulating factor scales ($\mathcal{B}$) and a term enforcing directional alignment ($\mathcal{R} \geq 0$). This decomposition isolates the \textit{collinear manifold} ($\mathcal{R}=0$), to which numerical optimization consistently converges for group isotopes. We prove that this manifold admits feasible solutions exclusively for group isotopes, and that within it, $\mathcal{B}$ exerts a variational pressure toward unitarity. To bridge the gap to the global landscape, we formulate a \textit{Collinearity Dominance Conjecture}, supported by empirical observations. Conditional on this dominance, we prove two key results: (1) the global minimum is achieved by the unitary regular representation for groups, and (2) non-group operations incur a strictly higher objective value, formally quantifying the model’s inductive bias toward the associative structure of groups (up to isotopy).

[LG-18] Adapting Neural Audio Codecs to EEG

Link: https://arxiv.org/abs/2511.23142
Authors: Ard Kastrati, Luca Lanzendörfer, Riccardo Rigoni, John Staib Matilla, Roger Wattenhofer
Categories: Machine Learning (cs.LG); Sound (cs.SD)
Comments: Foundation Models for the Brain and Body (BrainBodyFM@NeurIPS)

Abstract:EEG and audio are inherently distinct modalities, differing in sampling rate, channel structure, and scale. Yet, we show that pretrained neural audio codecs can serve as effective starting points for EEG compression, provided that the data are preprocessed to suit the codec’s input constraints. Using DAC, a state-of-the-art neural audio codec, as our base, we demonstrate that raw EEG can be mapped into the codec’s stride-based framing, enabling direct reuse of the audio-pretrained encoder-decoder. Even without modification, this setup yields stable EEG reconstructions, and fine-tuning on EEG data further improves fidelity and generalization compared to training from scratch. We systematically explore compression-quality trade-offs by varying residual codebook depth, codebook (vocabulary) size, and input sampling rate. To capture spatial dependencies across electrodes, we propose DAC-MC, a multi-channel extension with attention-based cross-channel aggregation and channel-specific decoding, while retaining the audio-pretrained initialization. Evaluations on the TUH Abnormal and Epilepsy datasets show that the adapted codecs preserve clinically relevant information, as reflected in spectrogram-based reconstruction loss and downstream classification accuracy.

[LG-19] Automated Discovery of Laser Dicing Processes with Bayesian Optimization for Semiconductor Manufacturing

Link: https://arxiv.org/abs/2511.23141
Authors: David Leeftink, Roman Doll, Heleen Visserman, Marco Post, Faysal Boughorbel, Max Hinne, Marcel van Gerven
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 18 pages, 9 figures

Abstract:Laser dicing of semiconductor wafers is a critical step in microelectronic manufacturing, where multiple sequential laser passes precisely separate individual dies from the wafer. Adapting this complex sequential process to new wafer materials typically requires weeks of expert effort to balance process speed, separation quality, and material integrity. We present the first automated discovery of production-ready laser dicing processes on an industrial LASER1205 dicing system. We formulate the problem as a high-dimensional, constrained multi-objective Bayesian optimization task, and introduce a sequential two-level fidelity strategy to minimize expensive destructive die-strength evaluations. On bare silicon and product wafers, our method autonomously delivers feasible configurations that match or exceed expert baselines in production speed, die strength, and structural integrity, using only technician-level operation. Post-hoc validation of different weight configurations of the utility functions reveals that multiple feasible solutions with qualitatively different trade-offs can be obtained from the final surrogate model. Expert-refinement of the discovered process can further improve production speed while preserving die strength and structural integrity, surpassing purely manual or automated methods.

[LG-20] Freeze, Diffuse, Decode: Geometry-Aware Adaptation of Pretrained Transformer Embeddings for Antimicrobial Peptide Design

Link: https://arxiv.org/abs/2511.23120
Authors: Pankhil Gawade, Adam Izdebski, Myriam Lizotte, Kevin R. Moon, Jake S. Rhodes, Guy Wolf, Ewa Szczurek
Categories: Machine Learning (cs.LG)
Comments: 16 pages, 4 figures

Abstract:Pretrained transformers provide rich, general-purpose embeddings, which are transferred to downstream tasks. However, the current transfer strategies, fine-tuning and probing, either distort the pretrained geometric structure of the embeddings or lack sufficient expressivity to capture task-relevant signals. These issues become even more pronounced when supervised data are scarce. Here, we introduce Freeze, Diffuse, Decode (FDD), a novel diffusion-based framework that adapts pretrained embeddings to downstream tasks while preserving their underlying geometric structure. FDD propagates supervised signal along the intrinsic manifold of frozen embeddings, enabling a geometry-aware adaptation of the embedding space. Applied to antimicrobial peptide design, FDD yields low-dimensional, predictive, and interpretable representations that support property prediction, retrieval, and latent-space interpolation.

[LG-21] Spectral Concentration at the Edge of Stability: Information Geometry of Kernel Associative Memory

Link: https://arxiv.org/abs/2511.23083
Authors: Akira Tamamori
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
Comments: 4 pages, 4 figures

Abstract:High-capacity kernel Hopfield networks exhibit a “Ridge of Optimization” characterized by extreme stability. While previously linked to “Spectral Concentration,” its origin remains elusive. Here, we analyze the network dynamics on a statistical manifold, revealing that the Ridge corresponds to the “Edge of Stability,” a critical boundary where the Fisher Information Matrix becomes singular. We demonstrate that the apparent Euclidean force antagonism is a manifestation of \textit{Dual Equilibrium} in the Riemannian space. This unifies learning dynamics and capacity via the Minimum Description Length principle, offering a geometric theory of self-organized criticality.

[LG-22] Time Extrapolation with Graph Convolutional Autoencoder and Tensor Train Decomposition

Link: https://arxiv.org/abs/2511.23037
Authors: Yuanhong Chen, Federico Pichi, Zhen Gao, Gianluigi Rozza
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG)

Abstract:Graph autoencoders have gained attention in nonlinear reduced-order modeling of parameterized partial differential equations defined on unstructured grids. Although they provide a geometrically consistent way of treating complex domains, applying such architectures to parameterized dynamical systems for temporal prediction beyond the training data, i.e., the extrapolation regime, is still a challenging task due to the simultaneous need for temporal causality and generalizability in the parametric space. In this work, we explore the integration of graph convolutional autoencoders (GCAs) with tensor train (TT) decomposition and Operator Inference (OpInf) to develop a time-consistent reduced-order model. In particular, high-fidelity snapshots are represented as a combination of parametric, spatial, and temporal cores via TT decomposition, while OpInf is used to learn the evolution of the latter. Moreover, we enhance the generalization performance by developing a multi-fidelity two-stage approach in the framework of Deep Operator Networks (DeepONet), treating the spatial and temporal cores as the trunk networks, and the parametric core as the branch network. Numerical results, including heat-conduction, advection-diffusion and vortex-shedding phenomena, demonstrate strong performance in learning the dynamics in the extrapolation regime for complex geometries, also in comparison with state-of-the-art approaches such as MeshGraphNets.

[LG-23] Masked Diffusion for Generative Recommendation

Link: https://arxiv.org/abs/2511.23021
Authors: Kulin Shah, Bhuvesh Kumar, Neil Shah, Liam Collins
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Comments: 25 pages

Abstract:Generative recommendation (GR) with semantic IDs (SIDs) has emerged as a promising alternative to traditional recommendation approaches due to its performance gains, capitalization on semantic information provided through language model embeddings, and inference and storage efficiency. Existing work on GR with SIDs frames the probability of a sequence of SIDs corresponding to a user’s interaction history using autoregressive modeling. While this has led to impressive next-item prediction performance in certain settings, these autoregressive GR models with SIDs suffer from expensive inference due to sequential token-wise decoding, potentially inefficient use of training data, and bias towards learning short-context relationships among tokens. Inspired by recent breakthroughs in NLP, we propose to instead model and learn the probability of a user’s sequence of SIDs using masked diffusion. Masked diffusion employs discrete masking noise to facilitate learning the sequence distribution, and models the probability of masked tokens as conditionally independent given the unmasked tokens, allowing for parallel decoding of the masked tokens. We demonstrate through thorough experiments that our proposed method consistently outperforms autoregressive modeling. This performance gap is especially pronounced in data-constrained settings and in terms of coarse-grained recall, consistent with our intuitions. Moreover, our approach allows the flexibility of predicting multiple SIDs in parallel during inference while maintaining superior performance to autoregressive modeling.
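
The discrete masking noise mentioned in this abstract can be pictured with a minimal sketch of the forward (noising) step only: each token of a sequence is independently replaced by a mask symbol with some probability. The token strings, the `<mask>` symbol, and the function names are hypothetical, and the denoising model that would predict the masked positions in parallel is not shown.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol, not taken from the paper

def mask_sequence(tokens, mask_rate, rng):
    """Independently replace each token with MASK with probability mask_rate."""
    return [MASK if rng.random() < mask_rate else tok for tok in tokens]

def masked_positions(noisy):
    """Indices a denoiser would be trained to fill in, given the unmasked tokens."""
    return [i for i, tok in enumerate(noisy) if tok == MASK]

rng = random.Random(0)  # seeded so the sketch is reproducible
history = ["sid_12", "sid_7", "sid_33", "sid_5"]  # made-up semantic IDs
noisy = mask_sequence(history, 0.5, rng)
print(noisy, masked_positions(noisy))
```

Since the masked tokens are modeled as conditionally independent given the unmasked ones, all indices returned by `masked_positions` could in principle be predicted in a single pass, which is the source of the parallel decoding the abstract refers to.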

[LG-24] Adaptive Factor Graph-Based Tightly Coupled GNSS/IMU Fusion for Robust Positioning

Link: https://arxiv.org/abs/2511.23017
Authors: Elham Ahmadi, Alireza Olama, Petri Välisuo, Heidi Kuusniemi
Categories: Robotics (cs.RO); Machine Learning (cs.LG)

Abstract:Reliable positioning in GNSS-challenged environments remains a critical challenge for navigation systems. Tightly coupled GNSS/IMU fusion improves robustness but remains vulnerable to non-Gaussian noise and outliers. We present a robust and adaptive factor graph-based fusion framework that directly integrates GNSS pseudorange measurements with IMU preintegration factors and incorporates the Barron loss, a general robust loss function that unifies several M-estimators through a single tunable parameter. By adaptively down-weighting unreliable GNSS measurements, our approach improves positioning resilience. The method is implemented in an extended GTSAM framework and evaluated on the UrbanNav dataset. The proposed solution reduces positioning errors by up to 41% relative to standard FGO, and achieves even larger improvements over extended Kalman filter (EKF) baselines in urban canyon environments. These results highlight the benefits of Barron loss in enhancing the resilience of GNSS/IMU-based navigation in urban and signal-compromised environments.
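
To make the "single tunable parameter" concrete, here is a small sketch of the general robust loss from Barron's 2019 paper ("A General and Adaptive Robust Loss Function"), together with the factor by which it down-weights a residual relative to a plain quadratic loss. The function names and default scale are assumptions for illustration, and how the paper wires this loss into GTSAM factors is not shown here.

```python
import math

def barron_loss(x, alpha, c=1.0):
    """General robust loss for residual x, shape parameter alpha, scale c."""
    z = (x / c) ** 2
    if alpha == 2.0:            # quadratic (least-squares) limit
        return 0.5 * z
    if alpha == 0.0:            # Cauchy / Lorentzian limit
        return math.log(0.5 * z + 1.0)
    if alpha == -math.inf:      # Welsch limit
        return 1.0 - math.exp(-0.5 * z)
    b = abs(alpha - 2.0)
    return (b / alpha) * ((z / b + 1.0) ** (alpha / 2.0) - 1.0)

def barron_weight(x, alpha, c=1.0):
    """Down-weighting factor relative to a quadratic loss (equals 1 at x = 0)."""
    z = (x / c) ** 2
    if alpha == 2.0:
        return 1.0
    if alpha == -math.inf:
        return math.exp(-0.5 * z)
    b = abs(alpha - 2.0)
    return (z / b + 1.0) ** (alpha / 2.0 - 1.0)

# Smaller alpha suppresses a large residual more strongly.
for alpha in (2.0, 1.0, 0.0, -2.0):
    print(alpha, barron_weight(5.0, alpha))
```

With `alpha = 2` every measurement keeps full weight, `alpha = 1` behaves like a smoothed L1 (pseudo-Huber) loss, `alpha = 0` gives the Cauchy loss, and `alpha = -2` gives Geman-McClure, so lowering `alpha` makes outlier rejection progressively more aggressive.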

[LG-25] Maritime Activities Observed Through Open-Access Positioning Data: Moving and Stationary Vessels in the Baltic Sea

Link: https://arxiv.org/abs/2511.23016
Authors: Moritz Hütten
Categories: Computational Engineering, Finance, and Science (cs.CE); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 29 pages, 15 figures, and 9 tables, matching the version published in Geomatics. Accompanying research data are available at this http URL

Abstract:Understanding past and present maritime activity patterns is critical for navigation safety, environmental assessment, and commercial operations. An increasing number of services now openly provide positioning data from the Automatic Identification System (AIS) via ground-based receivers. We show that coastal vessel activity can be reconstructed from open access data with high accuracy, even with limited data quality and incomplete receiver coverage. For three months of open AIS data in the Baltic Sea from August to October 2024, we present (i) cleansing and reconstruction methods to improve the data quality, and (ii) a journey model that converts AIS message data into vessel counts, traffic estimates, and spatially resolved vessel density at a resolution of $\sim 400$ m. Vessel counts are provided, along with their uncertainties, for both moving and stationary activity. Vessel density maps also enable the identification of port locations, and we infer the most crowded and busiest coastal areas in the Baltic Sea. We find that on average, $\gtrsim 4000$ vessels simultaneously operate in the Baltic Sea, and more than 300 vessels enter or leave the area each day. Our results agree within 20% with previous studies relying on proprietary data.

[LG-26] A Modular Framework for Rapidly Building Intrusion Predictors

Link: https://arxiv.org/abs/2511.23000
Authors: Xiaoxuan Wang, Rolf Stadler
Categories: Machine Learning (cs.LG)

Abstract:We study automated intrusion prediction in an IT system using statistical learning methods. The focus is on developing online attack predictors that detect attacks in real time and identify the current stage of the attack. While such predictors have been proposed in the recent literature, these works typically rely on constructing a monolithic predictor tailored to a specific attack type and scenario. Given that hundreds of attack types are cataloged in the MITRE framework, training a separate monolithic predictor for each of them is infeasible. In this paper, we propose a modular framework for rapidly assembling online attack predictors from reusable components. The modular nature of a predictor facilitates controlling key metrics like timeliness and accuracy of prediction, as well as tuning the trade-off between them. Using public datasets for training and evaluation, we provide many examples of modular predictors and show how an effective predictor can be dynamically assembled during training from a network of modular components.

[LG-27] A Trainable Centrality Framework for Modern Data

Link: https://arxiv.org/abs/2511.22959
Authors: Minh Duc Vu, Mingshuo Liu, Doudou Zhou
Categories: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)

Abstract:Measuring how central or typical a data point is underpins robust estimation, ranking, and outlier detection, but classical depth notions become expensive and unstable in high dimensions and are hard to extend beyond Euclidean data. We introduce Fused Unified centrality Score Estimation (FUSE), a neural centrality framework that operates on top of arbitrary representations. FUSE combines a global head, trained from pairwise distance-based comparisons to learn an anchor-free centrality score, with a local head, trained by denoising score matching to approximate a smoothed log-density potential. A single parameter between 0 and 1 interpolates between these calibrated signals, yielding depth-like centrality from different views via one forward pass. Across synthetic distributions, real images, time series, and text data, and standard outlier detection benchmarks, FUSE recovers meaningful classical ordering, reveals multi-scale geometric structures, and attains competitive performance with strong classical baselines while remaining simple and efficient.

[LG-28] Experts are all you need: A Composable Framework for Large Language Model Inference

Link: https://arxiv.org/abs/2511.22955
Authors: Shrihari Sridharan, Sourjya Roy, Anand Raghunathan, Kaushik Roy
Categories: Machine Learning (cs.LG)

Abstract:Large Language Models (LLMs) have achieved state-of-the-art accuracies in a variety of natural language processing (NLP) tasks. However, this success comes at the cost of increased model sizes which leads to additional computational burden. Mixture of Experts (MoEs) overcome this bottleneck by decoupling model capacity from computation by only activating a subset of parameters or “experts”. However, these models require joint pretraining of these experts along with the router and do not model multi-step reasoning. In contrast, multi-agent frameworks improve reasoning by decomposing complex problems into modular subtasks. However, these frameworks rely on sequential “plan–act–observe” loops, which introduce significant latency. Our work, Comp-LLM, addresses these challenges by introducing a composable inference framework that enables cross-expert collaboration via an explicit sub-query dependency graph. Comp-LLM consists of three components: (1) A Sub-query Generator that decomposes an input query, assigns each sub-query to an appropriate expert using embedding similarity, and constructs a dependency graph; (2) A Query Executor that processes nodes in the graph and identifies opportunities for parallelism based on dependencies and resource constraints; and (3) A Response Aggregator that synthesizes intermediate expert responses into a coherent final answer. Across several benchmarks, Comp-LLM achieves up to 11.01% accuracy improvement over monolithic LLMs of similar size, while offering 1.67x–3.56x reduction in model size with no significant degradation relative to the largest model in its family. Additionally, Comp-LLM provides 1.1x–1.7x latency improvement compared to sequential sub-query processing.
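
The idea of finding parallelism in a sub-query dependency graph can be sketched with Python's standard `graphlib` module: sub-queries whose dependencies are all satisfied form a batch that could run concurrently. The sub-query names and the `parallel_batches` helper are hypothetical, and this sketch ignores the resource constraints and expert assignment that the abstract also mentions.

```python
from graphlib import TopologicalSorter

def parallel_batches(dependencies):
    """Group sub-queries into batches whose members have no unmet dependencies."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph contains a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        batches.append(ready)
        sorter.done(*ready)                 # finishing a batch unlocks its successors
    return batches

# Hypothetical sub-queries: q3 needs the answers to q1 and q2; q4 needs q3.
deps = {"q1": set(), "q2": set(), "q3": {"q1", "q2"}, "q4": {"q3"}}
print(parallel_batches(deps))
```

In this toy graph `q1` and `q2` land in the same batch and could be sent to their experts at the same time, whereas a sequential plan-act-observe loop would process all four sub-queries one after another.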

[LG-29] CORGI: GNNs with Convolutional Residual Global Interactions for Lagrangian Simulation

Link: https://arxiv.org/abs/2511.22938
Authors: Ethan Ji, Yuanzhou Chen, Arush Ramteke, Fang Sun, Tianrun Yu, Jai Parera, Wei Wang, Yizhou Sun
Categories: Machine Learning (cs.LG)

Abstract:Partial differential equations (PDEs) are central to dynamical systems modeling, particularly in hydrodynamics, where traditional solvers often struggle with nonlinearity and computational cost. Lagrangian neural surrogates such as GNS and SEGNN have emerged as strong alternatives by learning from particle-based simulations. However, these models typically operate with limited receptive fields, making them inaccurate for capturing the inherently global interactions in fluid flows. Motivated by this observation, we introduce Convolutional Residual Global Interactions (CORGI), a hybrid architecture that augments any GNN-based solver with a lightweight Eulerian component for global context aggregation. By projecting particle features onto a grid, applying convolutional updates, and mapping them back to the particle domain, CORGI captures long-range dependencies without significant overhead. When applied to a GNS backbone, CORGI achieves a 57% improvement in rollout accuracy with only 13% more inference time and 31% more training time. Compared to SEGNN, CORGI improves accuracy by 49% while reducing inference time by 48% and training time by 30%. Even under identical runtime constraints, CORGI outperforms GNS by 47% on average, highlighting its versatility and performance on varied compute budgets.

[LG-30] Modeling Chaotic Pedestrian Behavior Using Chaos Indicators and Supervised Learning

Link: https://arxiv.org/abs/2511.22887
Authors: Md. Muhtashim Shahrier, Nazmul Haque, Md Asif Raihan, Md. Hadiuzzaman
Categories: Machine Learning (cs.LG)

Abstract:As cities around the world aim to improve walkability and safety, understanding the irregular and unpredictable nature of pedestrian behavior has become increasingly important. This study introduces a data-driven framework for modeling chaotic pedestrian movement using empirically observed trajectory data and supervised learning. Videos were recorded during both daytime and nighttime conditions to capture pedestrian dynamics under varying ambient and traffic contexts. Pedestrian trajectories were extracted through computer vision techniques, and behavioral chaos was quantified using four chaos metrics: Approximate Entropy and Lyapunov Exponent, each computed for both velocity and direction change. A Principal Component Analysis (PCA) was then applied to consolidate these indicators into a unified chaos score. A comprehensive set of individual, group-level, and contextual traffic features was engineered and used to train Random Forest and CatBoost regression models. CatBoost models consistently achieved superior performance. The best daytime PCA-based CatBoost model reached an $R^2$ of 0.8319, while the nighttime PCA-based CatBoost model attained an $R^2$ of 0.8574. SHAP analysis highlighted that features such as distance traveled, movement duration, and speed variability were robust contributors to chaotic behavior. The proposed framework enables practitioners to quantify and anticipate behavioral instability in real-world settings. Planners and engineers can use chaos scores to identify high-risk pedestrian zones, inform infrastructure improvements, and calibrate realistic microsimulation models. The approach also supports adaptive risk assessment in automated vehicle systems by capturing short-term motion unpredictability grounded in observable, interpretable features.

[LG-31] Covering-Space Normalizing Flows: Approximating Pushforwards on Lens Spaces

Link: https://arxiv.org/abs/2511.22882
Authors: William Ghanem
Categories: Machine Learning (cs.LG); Probability (math.PR)

Abstract:We construct pushforward distributions via the universal covering map $\rho: S^3 \to L(p;q)$ with the goal of approximating these distributions using flows on $L(p;q)$. We highlight that our method deletes redundancies in the case of a symmetric $S^3$ distribution. Using our model, we approximate the pushforwards of von Mises-Fisher-induced target densities as well as that of a $Z_{12}$-symmetric Boltzmann distribution on $S^3$ constructed to model benzene.

[LG-32] ARM-Explainer – Explaining and improving graph neural network predictions for the maximum clique problem using node features and association rule mining

Link: https://arxiv.org/abs/2511.22866
Authors: Bharat Sharman, Elkafi Hassini
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:Numerous graph neural network (GNN)-based algorithms have been proposed to solve graph-based combinatorial optimization problems (COPs), but methods to explain their predictions remain largely undeveloped. We introduce ARM-Explainer, a post-hoc, model-level explainer based on association rule mining, and demonstrate it on the predictions of the hybrid geometric scattering (HGS) GNN for the maximum clique problem (MCP), a canonical NP-hard graph-based COP. The eight most explanatory association rules discovered by ARM-Explainer achieve high median lift and confidence values of 2.42 and 0.49, respectively, on test instances from the TWITTER and BHOSLIB-DIMACS benchmark datasets. ARM-Explainer identifies the most important node features, together with their value ranges, that influence the GNN’s predictions on these datasets. Furthermore, augmenting the GNN with informative node features substantially improves its performance on the MCP, increasing the median largest-found clique size by 22% (from 29.5 to 36) on large graphs from the BHOSLIB-DIMACS dataset.
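
For readers unfamiliar with the lift and confidence values quoted above, the sketch below computes the standard association-rule metrics on a toy dataset. The item names and the data are hypothetical and only illustrate the definitions; they are not the features or rules reported in the paper.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent.

    Assumes at least one transaction contains the antecedent and at least one
    contains the consequent, so no division by zero occurs.
    """
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    n_a = sum(antecedent <= t for t in transactions)
    n_c = sum(consequent <= t for t in transactions)
    n_ac = sum((antecedent | consequent) <= t for t in transactions)
    support = n_ac / n
    confidence = n_ac / n_a          # estimate of P(consequent | antecedent)
    lift = confidence / (n_c / n)    # confidence relative to the base rate
    return support, confidence, lift

# Hypothetical toy data: each set holds discretized node-feature items plus
# a flag saying whether the GNN placed the node in its predicted clique.
nodes = [
    {"degree=high", "core=high", "in_clique"},
    {"degree=high", "core=high", "in_clique"},
    {"degree=high", "core=low"},
    {"degree=low", "core=low"},
    {"degree=low", "core=high"},
    {"degree=low", "core=low", "in_clique"},
]
support, confidence, lift = rule_metrics(nodes, {"degree=high"}, {"in_clique"})
```

A lift above 1 means the antecedent makes the consequent more likely than its base rate, which is why a median lift of 2.42 is read as strongly explanatory.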

[LG-33] CRAwDAD: Causal Reasoning Augmentation with Dual-Agent Debate

Link: https://arxiv.org/abs/2511.22854
Authors: Finn G. Vamosi, Nils D. Forkert
Categories: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 12 pages, 8 figures. Code available at this https URL

Abstract:When people reason about cause and effect, they often consider many competing “what if” scenarios before deciding which explanation fits best. Analogously, advanced language models capable of causal inference can consider multiple interventions and counterfactuals to judge the validity of causal claims. Crucially, this type of reasoning is less like a single calculation and more like an internal dialogue between alternative hypotheses. In this paper, we make this dialogue explicit through a dual-agent debate framework where one model provides a structured causal inference, and the other critically examines this reasoning for logical flaws. When disagreements arise, agents attempt to persuade each other, challenging each other’s logic and revising their conclusions until they converge on a mutually agreed answer. To take advantage of this deliberative process, we specifically use reasoning language models, whose strengths in both causal inference and adversarial debate remain under-explored relative to standard large language models. We evaluate our approach on the CLadder dataset, a benchmark linking natural language questions to formally defined causal graphs across all three rungs of Pearl’s ladder of causation. With Qwen3 and DeepSeek-R1 as debater agents, we demonstrate that multi-agent debate improves DeepSeek-R1’s overall accuracy in causal inference from 78.03% to 87.45%, with the counterfactual category specifically improving from 67.94% to 80.04% accuracy. Similarly, Qwen3’s overall accuracy improves from 84.16% to 89.41%, and counterfactual questions from 71.53% to 80.35%, showing that strong models can still benefit greatly from debate with weaker agents. Our results highlight the potential of reasoning models as building blocks for multi-agent systems in causal inference, and demonstrate the importance of diverse perspectives in causal problem-solving.
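
The debate protocol can be pictured as a simple control loop. The sketch below is an assumption-laden illustration, not the authors' implementation: the `Agent` signature, the `max_rounds` cap, the fallback to the proposer's answer when no agreement is reached, and the stub agents are all hypothetical. In practice each agent would wrap a call to a reasoning model.

```python
from typing import Callable, Optional, Tuple

# An agent maps (question, opponent's latest answer or None) to its own answer.
Agent = Callable[[str, Optional[str]], str]

def debate(question: str, proposer: Agent, critic: Agent,
           max_rounds: int = 4) -> Tuple[str, int]:
    """Alternate turns until both agents agree or the round budget runs out."""
    answer_a = proposer(question, None)
    answer_b = critic(question, answer_a)
    rounds = 1
    while answer_a != answer_b and rounds < max_rounds:
        answer_a = proposer(question, answer_b)
        answer_b = critic(question, answer_a)
        rounds += 1
    # Assumption: without agreement, fall back to the proposer's last answer.
    return answer_a, rounds

def proposer(question: str, opponent_answer: Optional[str]) -> str:
    # Hypothetical stub: starts with "yes" and concedes once challenged.
    return "yes" if opponent_answer is None else opponent_answer

def critic(question: str, opponent_answer: Optional[str]) -> str:
    # Hypothetical stub: always answers "no".
    return "no"

result = debate("Does X cause Y?", proposer, critic)
```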

[LG-34] TARFVAE: Efficient One-Step Generative Time Series Forecasting via TARFLOW based VAE

Link: https://arxiv.org/abs/2511.22853
Authors: Jiawen Wei, Lan Jiang, Pengbo Wei, Ziwen Ye, Teng Song, Chen Chen, Guangrui Ma
Categories: Machine Learning (cs.LG)

Abstract:Time series data is ubiquitous, with forecasting applications spanning from finance to healthcare. Beyond popular deterministic methods, generative models are gaining attention due to advancements in areas like image synthesis and video generation, as well as their inherent ability to provide probabilistic predictions. However, existing generative approaches mostly involve recurrent generative operations or repeated denoising steps, making the prediction laborious, particularly for long-term forecasting. Most of them only conduct experiments for relatively short-term forecasting, with limited comparison to deterministic methods in long-term forecasting, leaving their practical advantages unclear. This paper presents TARFVAE, a novel generative framework that combines the Transformer-based autoregressive flow (TARFLOW) and variational autoencoder (VAE) for efficient one-step generative time series forecasting. Inspired by the rethinking that complex architectures for extracting time series representations might not be necessary, we add a flow module, TARFLOW, to VAE to promote spontaneous learning of latent variables that benefit predictions. TARFLOW enhances VAE’s posterior estimation by breaking the Gaussian assumption, thereby enabling a more informative latent space. TARFVAE uses only the forward process of TARFLOW, avoiding autoregressive inverse operations and thus ensuring fast generation. During generation, it samples from the prior latent space and directly generates full-horizon forecasts via the VAE decoder. With simple MLP modules, TARFVAE achieves superior performance over state-of-the-art deterministic and generative models across different forecast horizons on benchmark datasets while maintaining efficient prediction speed, demonstrating its effectiveness as an efficient and powerful solution for generative time series forecasting.

[LG-35] PerfMamba: Performance Analysis and Pruning of Selective State Space Models

Link: https://arxiv.org/abs/2511.22849
Authors: Abdullah Al Asif, Mobina Kashaniyan, Sixing Yu, Juan Pablo Muñoz, Ali Jannesari
Categories: Machine Learning (cs.LG)
Comments: Accepted in Bench 2025

Abstract:Recent advances in sequence modeling have introduced selective SSMs as promising alternatives to Transformer architectures, offering theoretical computational efficiency and sequence processing advantages. A comprehensive understanding of selective SSMs in runtime behavior, resource utilization patterns, and scaling characteristics still remains unexplored, thus obstructing their optimal deployment and further architectural improvements. This paper presents a thorough empirical study of Mamba-1 and Mamba-2, systematically profiled for performance to assess the design principles that contribute to their efficiency in state-space modeling. A detailed analysis of computation patterns, memory access, I/O characteristics, and scaling properties was performed for sequence lengths ranging from 64 to 16384 tokens. Our findings show that the SSM component, a central part of the selective SSM architecture, demands a significant portion of computational resources compared to other components in the Mamba block. Based on these insights, we propose a pruning technique that selectively removes low-activity states within the SSM component, achieving measurable throughput and memory gains while maintaining accuracy within a moderate pruning regime. This approach results in performance improvements across varying sequence lengths, achieving a 1.14x speedup and reducing memory usage by 11.50%. These results offer valuable guidance for designing more efficient SSM architectures that can be applied to a wide range of real-world applications.
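
One way to picture "selectively removing low-activity states" is to rank state dimensions by how much they move during a calibration run and keep only the most active ones. The abstract does not give the activity criterion, so the mean absolute value used below, the `keep_ratio` parameter, and the synthetic trace are assumptions for illustration only.

```python
import numpy as np

def prune_low_activity_states(hidden: np.ndarray, keep_ratio: float = 0.75) -> np.ndarray:
    """Return the sorted indices of the state dimensions to keep.

    hidden has shape (seq_len, d_state) and holds SSM state values recorded on
    calibration data. Activity is measured here as mean absolute value, which
    is an assumed criterion.
    """
    activity = np.abs(hidden).mean(axis=0)
    n_keep = max(1, int(round(keep_ratio * hidden.shape[1])))
    keep = np.argsort(activity)[-n_keep:]  # indices of the most active states
    return np.sort(keep)

# Synthetic trace: states 2 and 4 are nearly silent.
rng = np.random.default_rng(0)
scales = np.array([1.0, 1.0, 1e-3, 1.0, 1e-3, 1.0, 1.0, 1.0])
trace = rng.normal(size=(256, 8)) * scales
kept = prune_low_activity_states(trace, keep_ratio=0.75)
```

The kept indices would then be used to slice the state-dependent parameters of the SSM component, shrinking the state dimension.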

[LG-36] Can Synthetic Data Improve Symbolic Regression Extrapolation Performance? GECCO2025

Link: https://arxiv.org/abs/2511.22794
Authors: Fitria Wulandari Ramlan, Colm O’Riordan, Gabriel Kronberger, James McDermott
Categories: Machine Learning (cs.LG)
Comments: 8 pages, 16 figures, GECCO 2025 Symbolic Regression Workshop

Abstract:Many machine learning models perform well when making predictions within the training data range, but often struggle when required to extrapolate beyond it. Symbolic regression (SR) using genetic programming (GP) can generate flexible models but is prone to unreliable behaviour in extrapolation. This paper investigates whether adding synthetic data can help improve performance in such cases. We apply Kernel Density Estimation (KDE) to identify regions in the input space where the training data is sparse. Synthetic data is then generated in those regions using a knowledge distillation approach: a teacher model generates predictions on new input points, which are then used to train a student model. We evaluate this method across six benchmark datasets, using neural networks (NN), random forests (RF), and GP both as teacher models (to generate synthetic data) and as student models (trained on the augmented data). Results show that GP models can often improve when trained on synthetic data, especially in extrapolation areas. However, the improvement depends on the dataset and teacher model used. The most important improvements are observed when synthetic data from GPe is used to train GPp in extrapolation regions. Changes in interpolation areas show only slight changes. We also observe heterogeneous errors, where model performance varies across different regions of the input space. Overall, this approach offers a practical solution for better extrapolation. Note: An earlier version of this work appeared in the GECCO 2025 Workshop on Symbolic Regression. This arXiv version corrects several parts of the original submission.
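
The two-step recipe in the abstract (find sparse regions with KDE, then let a teacher label new points there) can be sketched in one dimension. Everything concrete below is an assumption for illustration: the hand-written Gaussian KDE, the bandwidth, the density threshold, the candidate grid, and the linear least-squares teacher that stands in for the NN, RF, or GP teachers used in the paper.

```python
import numpy as np

def gaussian_kde_1d(train_x: np.ndarray, query_x: np.ndarray, bandwidth: float) -> np.ndarray:
    """Evaluate a Gaussian kernel density estimate of train_x at query_x."""
    diffs = (query_x[:, None] - train_x[None, :]) / bandwidth
    kernel = np.exp(-0.5 * diffs ** 2) / np.sqrt(2.0 * np.pi)
    return kernel.sum(axis=1) / (len(train_x) * bandwidth)

rng = np.random.default_rng(0)
train_x = rng.uniform(0.0, 1.0, size=100)
train_y = 2.0 * train_x + 1.0

# Step 1: flag candidate inputs where the training data is sparse.
candidates = np.linspace(-1.0, 2.0, 61)
density = gaussian_kde_1d(train_x, candidates, bandwidth=0.1)
sparse_x = candidates[density < 0.05]  # threshold is an assumed value

# Step 2: a teacher model labels the sparse inputs (knowledge distillation).
slope, intercept = np.polyfit(train_x, train_y, deg=1)
synthetic_y = slope * sparse_x + intercept

# Step 3: the student would be trained on the augmented dataset.
aug_x = np.concatenate([train_x, sparse_x])
aug_y = np.concatenate([train_y, synthetic_y])
```

The sketch stops at building the augmented set. The quality of the synthetic labels depends entirely on how well the teacher extrapolates, which matches the abstract's remark that gains vary with the dataset and teacher.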

[LG-37] GSpaRC: Gaussian Splatting for Real-time Reconstruction of RF Channels

Link: https://arxiv.org/abs/2511.22793
Authors: Bhavya Sai Nukapotula, Rishabh Tripathi, Seth Pregler, Dileep Kalathil, Srinivas Shakkottai, Theodore S. Rappaport
Categories: Machine Learning (cs.LG)

Abstract:Channel state information (CSI) is essential for adaptive beamforming and maintaining robust links in wireless communication systems. However, acquiring CSI incurs significant overhead, consuming up to 25% of spectrum resources in 5G networks due to frequent pilot transmissions at sub-millisecond intervals. Recent approaches aim to reduce this burden by reconstructing CSI from spatiotemporal RF measurements, such as signal strength and direction-of-arrival. While effective in offline settings, these methods often suffer from inference latencies in the 5–100 ms range, making them impractical for real-time systems. We present GSpaRC: Gaussian Splatting for Real-time Reconstruction of RF Channels, the first algorithm to break the 1 ms latency barrier while maintaining high accuracy. GSpaRC represents the RF environment using a compact set of 3D Gaussian primitives, each parameterized by a lightweight neural model augmented with physics-informed features such as distance-based attenuation. Unlike traditional vision-based splatting pipelines, GSpaRC is tailored for RF reception: it employs an equirectangular projection onto a hemispherical surface centered at the receiver to reflect omnidirectional antenna behavior. A custom CUDA pipeline enables fully parallelized directional sorting, splatting, and rendering across frequency and spatial dimensions. Evaluated on multiple RF datasets, GSpaRC achieves similar CSI reconstruction fidelity to recent state-of-the-art methods while reducing training and inference time by over an order of magnitude. By trading modest GPU computation for a substantial reduction in pilot overhead, GSpaRC enables scalable, low-latency channel estimation suitable for deployment in 5G and future wireless systems. The code is available here: this https URL.

[LG-38] An Efficient Privacy-preserving Intrusion Detection Scheme for UAV Swarm Networks

Link: https://arxiv.org/abs/2511.22791
Authors: Kanchon Gharami, Shafika Showkat Moni
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: This paper has been accepted for publication in the Proceedings of the 44th AIAA/IEEE Digital Avionics Systems Conference (DASC) 2025, where it received the Best Paper of Session Award

Abstract:The rapid proliferation of unmanned aerial vehicles (UAVs) and their applications in diverse domains, such as surveillance, disaster management, agriculture, and defense, have revolutionized modern technology. While the potential benefits of swarm-based UAV networks are growing significantly, they are vulnerable to various security attacks that can jeopardize the overall mission success by degrading their performance, disrupting decision-making, and compromising the trajectory planning process. The Intrusion Detection System (IDS) plays a vital role in identifying potential security attacks to ensure the secure operation of UAV swarm networks. However, conventional IDS primarily focuses on binary classification with resource-intensive neural networks and faces challenges, including latency, privacy breaches, increased performance overhead, and model drift. This research aims to address these challenges by developing a novel lightweight and federated continuous learning-based IDS scheme. Our proposed model facilitates decentralized training across diverse UAV swarms to ensure data heterogeneity and privacy. The performance evaluation of our model demonstrates significant improvements, with classification accuracies of 99.45% on UKM-IDS, 99.99% on UAV-IDS, 96.85% on TLM-UAV dataset, and 98.05% on Cyber-Physical datasets.

[LG-39] Integrated Transcriptomic-proteomic Biomarker Identification for Radiation Response Prediction in Non-small Cell Lung Cancer Cell Lines

Link: https://arxiv.org/abs/2511.22735
Authors: Yajun Yu, Guoping Xu, Steve Jiang, Robert Timmerman, John Minna, Yuanyuan Zhang, Hao Peng
Categories: Machine Learning (cs.LG)

Abstract:To develop an integrated transcriptome-proteome framework for identifying concurrent biomarkers predictive of radiation response, as measured by survival fraction at 2 Gy (SF2), in non-small cell lung cancer (NSCLC) cell lines. RNA sequencing (RNA-seq) and data-independent acquisition mass spectrometry (DIA-MS) proteomic data were collected from 73 and 46 NSCLC cell lines, respectively. Following preprocessing, 1,605 shared genes were retained for analysis. Feature selection was performed using least absolute shrinkage and selection operator (Lasso) regression with a frequency-based ranking criterion under five-fold cross-validation repeated ten times. Support vector regression (SVR) models were constructed using transcriptome-only, proteome-only, and combined transcriptome-proteome feature sets. Model performance was assessed by the coefficient of determination (R2) and root mean square error (RMSE). Correlation analyses evaluated concordance between RNA and protein expression and the relationships of selected biomarkers with SF2. RNA-protein expression exhibited significant positive correlations (median Pearson’s r = 0.363). Independent pipelines identified 20 prioritized gene signatures from transcriptomic, proteomic, and combined datasets. Models trained on single-omic features achieved limited cross-omic generalizability, while the combined model demonstrated balanced predictive accuracy in both datasets (R2=0.461, RMSE=0.120 for transcriptome; R2=0.604, RMSE=0.111 for proteome). This study presents the first proteotranscriptomic framework for SF2 prediction in NSCLC, highlighting the complementary value of integrating transcriptomic and proteomic data. The identified concurrent biomarkers capture both transcriptional regulation and functional protein activity, offering mechanistic insights and translational potential.

[LG-40] Generative Anchored Fields: Controlled Data Generation via Emergent Velocity Fields and Transport Algebra

Link: https://arxiv.org/abs/2511.22693
Authors: Deressa Wodajo Deressa, Hannes Mareen, Peter Lambert, Glenn Van Wallendael
Categories: Machine Learning (cs.LG)
Comments: 20 pages, 21 figures

Abstract:We present Generative Anchored Fields (GAF), a generative model that learns independent endpoint predictors $J$ (noise) and $K$ (data) rather than a trajectory predictor. The velocity field $v = K - J$ emerges from their time-conditioned disagreement. This factorization enables Transport Algebra: algebraic operations on learned $\{(J_n, K_n)\}_{n=1}^N$ heads for compositional control. With class-specific $K_n$ heads, GAF supports a rich family of directed transport maps between a shared base distribution and multiple modalities, enabling controllable interpolation, hybrid generation, and semantic morphing through vector arithmetic. We achieve strong sample quality (FID 7.5 on CelebA-HQ $64 \times 64$) while uniquely providing compositional generation as an architectural primitive. We further demonstrate that GAF has lossless cyclic transport between its initial and final state, with LPIPS = 0.0. Code available at this https URL
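
To make the idea of a velocity field built from two endpoint predictors concrete, here is a toy sketch. It assumes a linear interpolation path x_t = (1 - t) * noise + t * data and plain Euler integration; neither is stated in the abstract. The closed-form `J` and `K` below are hypothetical oracle predictors for a dataset consisting of a single point, standing in for learned networks.

```python
import numpy as np

def sample(J, K, x0, n_steps: int = 50) -> np.ndarray:
    """Integrate dx/dt = K(x, t) - J(x, t) from t = 0 to t = 1 with Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + dt * (K(x, t) - J(x, t))
    return x

data_point = np.array([2.0, -1.0])

def K(x: np.ndarray, t: float) -> np.ndarray:
    # Oracle data-endpoint predictor for a one-point dataset.
    return data_point

def J(x: np.ndarray, t: float) -> np.ndarray:
    # Oracle noise-endpoint predictor: inverts x_t = (1 - t) * noise + t * data.
    return (x - t * data_point) / (1.0 - t)

noise = np.array([0.5, 0.3])
x1 = sample(J, K, noise)
```

Under these assumptions the difference K - J equals the constant displacement from the noise sample to the data point, so the integration lands on the data point at t = 1.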

[LG-41] Modèles de Fondation et Ajustement : Vers une Nouvelle Génération de Modèles pour la Prévision des Séries Temporelles

Link: https://arxiv.org/abs/2511.22674
Authors: Morad Laglil, Emilie Devijver, Eric Gaussier, Bertrand Pracca
Categories: Machine Learning (cs.LG)
Comments: In French

Abstract:Inspired by recent advances in large language models, foundation models have been developed for zero-shot time series forecasting, enabling prediction on datasets unseen during pretraining. These large-scale models, trained on vast collections of time series, learn generalizable representations for both point and probabilistic forecasting, reducing the need for task-specific architectures and manual tuning. In this work, we review the main architectures, pretraining strategies, and optimization methods used in such models, and study the effect of fine-tuning after pretraining to enhance their performance on specific datasets. Our empirical results show that fine-tuning generally improves zero-shot forecasting capabilities, especially for long-term horizons.

[LG-42] Difficulties with Evaluating a Deception Detector for AIs

Link: https://arxiv.org/abs/2511.22662
Authors: Lewis Smith, Bilal Chughtai, Neel Nanda
Categories: Machine Learning (cs.LG)

Abstract:Building reliable deception detectors for AI systems – methods that could predict when an AI system is being strategically deceptive without necessarily requiring behavioural evidence – would be valuable in mitigating risks from advanced AI systems. But evaluating the reliability and efficacy of a proposed deception detector requires examples that we can confidently label as either deceptive or honest. We argue that we currently lack the necessary examples and further identify several concrete obstacles in collecting them. We provide evidence from conceptual arguments, analysis of existing empirical works, and analysis of novel illustrative case studies. We also discuss the potential of several proposed empirical workarounds to these problems and argue that while they seem valuable, they also seem insufficient alone. Progress on deception detection likely requires further consideration of these problems.

[LG-43] Structure-aware Hybrid-order Similarity Learning for Multi-view Unsupervised Feature Selection

Link: https://arxiv.org/abs/2511.22656
Authors: Lin Xu, Ke Li, Dongjie Wang, Fengmao Lv, Tianrui Li, Yanyong Huang
Categories: Machine Learning (cs.LG)

Abstract:Multi-view unsupervised feature selection (MUFS) has recently emerged as an effective dimensionality reduction method for unlabeled multi-view data. However, most existing methods mainly use first-order similarity graphs to preserve local structure, often overlooking the global structure that can be captured by second-order similarity. In addition, a few MUFS methods leverage predefined second-order similarity graphs, making them vulnerable to noise and outliers and resulting in suboptimal feature selection performance. In this paper, we propose a novel MUFS method, termed Structure-aware Hybrid-order sImilarity learNing for multi-viEw unsupervised Feature Selection (SHINE-FS), to address the aforementioned problem. SHINE-FS first learns consensus anchors and the corresponding anchor graph to capture the cross-view relationships between the anchors and the samples. Based on the acquired cross-view consensus information, it generates low-dimensional representations of the samples, which facilitate the reconstruction of multi-view data by identifying discriminative features. Subsequently, it employs the anchor-sample relationships to learn a second-order similarity graph. Furthermore, by jointly learning first-order and second-order similarity graphs, SHINE-FS constructs a hybrid-order similarity graph that captures both local and global structures, thereby revealing the intrinsic data structure to enhance feature selection. Comprehensive experimental results on real multi-view datasets show that SHINE-FS outperforms the state-of-the-art methods.

[LG-44] Spatially Aware Dictionary-Free Eigenfunction Identification for Modeling and Control of Nonlinear Dynamical Systems

Link: https://arxiv.org/abs/2511.22648
Authors: David Grasev
Categories: Machine Learning (cs.LG)
Comments: 31 pages, 24 figures

Abstract:A new approach to data-driven discovery of Koopman eigenfunctions without a pre-defined set of basis functions is proposed. The approach is based on a reference trajectory, for which the Koopman mode amplitudes are first identified, and the Koopman mode decomposition is transformed to a new basis, which contains fundamental functions of eigenvalues and time. The initial values of the eigenfunctions are obtained by projecting trajectories onto this basis via a regularized least-squares fit. A global optimizer was employed to optimize the eigenvalues. Mapping initial-state values to eigenfunction values reveals their spatial structure, enabling the numerical computation of their gradients. Thus, deviations from the Koopman partial differential equation are penalized, leading to more robust solutions. The approach was successfully tested on several benchmark nonlinear dynamical systems, including the FitzHugh-Nagumo system with inputs, van der Pol and Duffing oscillators, and a 2-spool turbojet engine with control. The study demonstrates that incorporating principal eigenvalues and spatial structure integrity promotion significantly improves the accuracy of Koopman predictors. The approach effectively discovers Koopman spectral components even with sparse state-space sampling and reveals geometric features of the state space, such as invariant partitions. Finally, the numerical approximation of the eigenfunction gradient can be used for input dynamics modeling and control design. The results support the practicality of the approach for use with various dynamical systems.

[LG-45] Flow Density Control: Generative Optimization Beyond Entropy-Regularized Fine-Tuning NEURIPS2025

Link: https://arxiv.org/abs/2511.22640
Authors: Riccardo De Santi, Marin Vlastelica, Ya-Ping Hsieh, Zebang Shen, Niao He, Andreas Krause
Categories: Machine Learning (cs.LG)
Comments: NeurIPS 2025

Abstract:Adapting large-scale foundation flow and diffusion generative models to optimize task-specific objectives while preserving prior information is crucial for real-world applications such as molecular design, protein docking, and creative image generation. Existing principled fine-tuning methods aim to maximize the expected reward of generated samples, while retaining knowledge from the pre-trained model via KL-divergence regularization. In this work, we tackle the significantly more general problem of optimizing general utilities beyond average rewards, including risk-averse and novelty-seeking reward maximization, diversity measures for exploration, and experiment design objectives among others. Likewise, we consider more general ways to preserve prior information beyond KL-divergence, such as optimal transport distances and Renyi divergences. To this end, we introduce Flow Density Control (FDC), a simple algorithm that reduces this complex problem to a specific sequence of simpler fine-tuning tasks, each solvable via scalable established methods. We derive convergence guarantees for the proposed scheme under realistic assumptions by leveraging recent understanding of mirror flows. Finally, we validate our method on illustrative settings, text-to-image, and molecular design tasks, showing that it can steer pre-trained generative models to optimize objectives and solve practically relevant tasks beyond the reach of current fine-tuning schemes.

[LG-46] Federated Learning Survey: A Multi-Level Taxonomy of Aggregation Techniques, Experimental Insights, and Future Frontiers

Link: https://arxiv.org/abs/2511.22616
Authors: Meriem Arbaoui, Mohamed-el-Amine Brahmia, Abdellatif Rahmoun, Mourad Zghal
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Author-Accepted Manuscript. 65 pages, 26 figures, 20 tables. Published in ACM Transactions on Intelligent Systems and Technology (TIST), 2024

Abstract:The integration of IoT and AI has unlocked innovation across industries, but growing privacy concerns and data isolation hinder progress. Traditional centralized ML struggles to overcome these challenges, which has led to the rise of Federated Learning (FL), a decentralized paradigm that enables collaborative model training without sharing local raw data. FL ensures data privacy, reduces communication overhead, and supports scalability, yet its heterogeneity adds complexity compared to centralized approaches. This survey focuses on three main FL research directions: personalization, optimization, and robustness, offering a structured classification through a hybrid methodology that combines bibliometric analysis with systematic review to identify the most influential works. We examine challenges and techniques related to heterogeneity, efficiency, security, and privacy, and provide a comprehensive overview of aggregation strategies, including architectures, synchronization methods, and diverse federation objectives. To complement this, we discuss practical evaluation approaches and present experiments comparing aggregation methods under IID and non-IID data distributions. Finally, we outline promising research directions to advance FL, aiming to guide future innovation in this rapidly evolving field.

[LG-47] DisCEdge: Distributed Context Management for Large Language Models at the Edge

Link: https://arxiv.org/abs/2511.22599
Authors: Mohammadreza Malekabbasi, Minghe Wang, David Bermbach
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Databases (cs.DB); Machine Learning (cs.LG)
Comments: Author version

Abstract:Deploying Large Language Model (LLM) services at the edge benefits latency-sensitive and privacy-aware applications. However, the stateless nature of LLMs makes managing user context (e.g., sessions, preferences) across geo-distributed edge nodes challenging. Existing solutions, such as client-side context storage, often introduce network latency and bandwidth overhead, undermining the advantages of edge deployment. We propose DisCEdge, a distributed context management system that stores and replicates user context in tokenized form across edge nodes. By maintaining context as token sequences rather than raw text, our system avoids redundant computation and enables efficient data replication. We implement and evaluate an open-source prototype in a realistic edge environment with commodity hardware. We show DisCEdge improves median response times by up to 14.46% and lowers median inter-node synchronization overhead by up to 15% compared to a raw-text-based system. It also reduces client request sizes by a median of 90% compared to client-side context management, while guaranteeing data consistency.

[LG-48] LLM-Cave: A benchmark and light environment for large language models reasoning and decision-making system

Link: https://arxiv.org/abs/2511.22598
Authors: Huanyu Li, Zongyuan Li, Wei Huang, Xian Guo
Categories: Machine Learning (cs.LG)
Comments: 8 pages, 5 figures, ICICN 2025

Abstract:Large language models (LLMs) such as ChatGPT o1, ChatGPT o3, and DeepSeek R1 have shown great potential in solving difficult problems. However, current LLM evaluation benchmarks are limited to one-step interactions. Some of the existing sequence decision-making environments, such as TextStarCraftII and LLM-PySC2, are too complicated and require hours of interaction to complete a game. In this paper, we introduce LLM-Cave, a benchmark and light environment for LLM reasoning and decision-making systems. This environment is a classic instance in the era of Symbolism. Artificial intelligence enables the agent to explore the environment and avoid potential losses by reasoning about nearby dangers using partial observable state information. In the experiment, we evaluated the sequential reasoning ability, decision-making performance and computational efficiency of mainstream large language models (LLMs) such as GPT-4o-mini, o1-mini, and DeepSeek-R1. Experiments show that while Deepseek-R1 achieved the highest success rate on complex reasoning tasks, smaller models like 4o-mini significantly narrowed the performance gap on challenges by employing Chain of Speculation and Planner-Critic strategies, at the expense of reduced computational efficiency. This indicates that structured, multi-step reasoning combined with an LLM-based feedback mechanism can substantially enhance an LLM’s decision-making capabilities, providing a promising direction for improving reasoning in weaker models and suggesting a new reasoning-centered benchmark for LLM assessment. Our code is open-sourced in this https URL.

[LG-49] The Multiclass Score-Oriented Loss (MultiSOL) on the Simplex

Link: https://arxiv.org/abs/2511.22587
Authors: Francesco Marchetti, Edoardo Legnaro, Sabrina Guastavino
Categories: Machine Learning (cs.LG)

Abstract:In the supervised binary classification setting, score-oriented losses have been introduced with the aim of optimizing a chosen performance metric directly during the training phase, thus avoiding a posteriori threshold tuning. To do this, in their construction, the decision threshold is treated as a random variable provided with a certain a priori distribution. In this paper, we use a recently introduced multidimensional threshold-based classification framework to extend such score-oriented losses to multiclass classification, defining the Multiclass Score-Oriented Loss (MultiSOL) functions. As also demonstrated by several classification experiments, this proposed family of losses is designed to preserve the main advantages observed in the binary setting, such as the direct optimization of the target metric and the robustness to class imbalance, achieving performance comparable to other state-of-the-art loss functions and providing new insights into the interaction between simplex geometry and score-oriented learning.

[LG-50] Entropy is all you need for Inter-Seed Cross-Play in Hanabi

Link: https://arxiv.org/abs/2511.22581
Authors: Johannes Forkel,Jakob Foerster
Categories: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*Comments:

Click to view abstract

Abstract:We find that in Hanabi, one of the most complex and popular benchmarks for zero-shot coordination and ad-hoc teamplay, a standard implementation of independent PPO with a slightly higher entropy coefficient (0.05 instead of the typically used 0.01) achieves a new state-of-the-art in cross-play between different seeds, beating by a significant margin all previous specialized algorithms, which were specifically designed for this setting. We provide an intuition for why sufficiently high entropy regularization ensures that different random seeds produce joint policies which are mutually compatible. We also empirically find that a high $\lambda_{\text{GAE}}$ around 0.9, and using RNNs instead of just feed-forward layers in the actor-critic architecture, strongly increase inter-seed cross-play. While these results demonstrate the dramatic effect that hyperparameters can have not just on self-play scores but also on cross-play scores, we also show that there are simple Dec-POMDPs in which standard policy gradient methods with increased entropy regularization are not able to achieve perfect inter-seed cross-play, thus demonstrating the continuing necessity for new algorithms for zero-shot coordination.
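To make the role of the entropy coefficient concrete, here is a minimal sketch of the loss form commonly used in PPO implementations, in which an entropy bonus is subtracted from the total loss. This is not the authors' code: the function names, the `vf_coef` default of 0.5, and the scalar `policy_loss` / `value_loss` inputs are assumptions made for illustration. Only the coefficient values 0.05 and 0.01 come from the abstract.

```python
import math


def categorical_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def ppo_total_loss(policy_loss, value_loss, probs, ent_coef=0.05, vf_coef=0.5):
    """Common PPO loss form: policy term + weighted value term - entropy bonus.

    ent_coef=0.05 is the value highlighted in the abstract (0.01 is the usual
    default); vf_coef=0.5 is a hypothetical default chosen for this sketch.
    """
    entropy = categorical_entropy(probs)
    return policy_loss + vf_coef * value_loss - ent_coef * entropy
```

With identical inputs, a larger `ent_coef` lowers the loss more for high-entropy action distributions, so the optimizer is pushed harder toward policies that stay stochastic.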

[LG-51] List-Decodable Regression via Expander Sketching

Link: https://arxiv.org/abs/2511.22524
Authors: Herbod Pourali,Sajjad Hashemian,Ebrahim Ardeshir-Larijani
Categories: Machine Learning (cs.LG); Discrete Mathematics (cs.DM)
*Comments:

Click to view abstract

Abstract:We introduce an expander-sketching framework for list-decodable linear regression that achieves sample complexity $\tilde{O}((d+\log(1/\delta))/\alpha)$, list size $O(1/\alpha)$, and near input-sparsity running time $\tilde{O}(\mathrm{nnz}(X)+d^3/\alpha)$ under standard sub-Gaussian assumptions. Our method uses lossless expanders to synthesize lightly contaminated batches, enabling robust aggregation and a short spectral filtering stage that matches the best known efficient guarantees while avoiding SoS machinery and explicit batch structure.

[LG-52] Privacy-Utility-Bias Trade-offs for Privacy-Preserving Recommender Systems

Link: https://arxiv.org/abs/2511.22515
Authors: Shiva Parsarad,Isabel Wagner
Categories: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Recommender systems (RSs) output ranked lists of items, such as movies or restaurants, that users may find interesting, based on the user’s past ratings and ratings from other users. RSs increasingly incorporate differential privacy (DP) to protect user data, raising questions about how privacy mechanisms affect both recommendation accuracy and fairness. We conduct a comprehensive, cross-model evaluation of two DP mechanisms, differentially private stochastic gradient descent (DPSGD) and local differential privacy (LDP), applied to four recommender systems (Neural Collaborative Filtering (NCF), Bayesian Personalized Ranking (BPR), Singular Value Decomposition (SVD), and Variational Autoencoder (VAE)) on the MovieLens-1M and Yelp datasets. We find that stronger privacy consistently reduces utility, but not uniformly. NCF under DPSGD shows the smallest accuracy loss (under 10 percent at epsilon approximately 1), whereas SVD and BPR experience larger drops, especially for users with niche preferences. VAE is the most sensitive to privacy, with sharp declines for sparsely represented groups. The impact on bias metrics is similarly heterogeneous. DPSGD generally reduces the gap between recommendations of popular and less popular items, whereas LDP preserves existing patterns more closely. These results highlight that no single DP mechanism is uniformly superior; instead, each provides trade-offs under different privacy regimes and data conditions.

[LG-53] Space Explanations of Neural Network Classification

Link: https://arxiv.org/abs/2511.22498
Authors: Faezeh Labbaf,Tomáš Kolárik,Martin Blicha,Grigory Fedyukovich,Michael Wand,Natasha Sharygina
Categories: Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
*Comments:

Click to view abstract

Abstract:We present a novel logic-based concept called Space Explanations for classifying neural networks that gives provable guarantees of the behavior of the network in continuous areas of the input feature space. To automatically generate space explanations, we leverage a range of flexible Craig interpolation algorithms and unsatisfiable core generation. Based on real-life case studies, ranging from small to medium to large size, we demonstrate that the generated explanations are more meaningful than those computed by state-of-the-art methods.

[LG-54] Enhancing Trustworthiness with Mixed Precision: Benchmarks, Opportunities, and Challenges

Link: https://arxiv.org/abs/2511.22483
Authors: Guanxi Lu,Hao Mark Chen,Zhiqiang Que,Wayne Luk,Hongxiang Fan
Categories: Machine Learning (cs.LG)
*Comments: ASP-DAC 2026 Special Session

Click to view abstract

Abstract:Large language models (LLMs) have shown promising performance across various tasks. However, their autoregressive decoding process poses significant challenges for efficient deployment on existing AI hardware. Quantization alleviates memory and compute pressure by compressing weights, activations, and KV caches to low precisions while preserving generation quality. However, existing quantization frameworks typically focus on perplexity or classification accuracy, often omitting critical trustworthiness metrics. This gap introduces risks when applying quantized LLMs to downstream high-stakes domains such as finance and healthcare. In this work, we systematically investigate the impact of quantization on four trustworthiness metrics (adversarial robustness, fairness, machine ethics, and out-of-distribution robustness) and identify the instability across compression ratios and quantization methods. Building on these observations, we develop a novel precision-ensemble voting approach that leverages predictions from mixed-precision variants of the same model and consistently improves performance by up to 5.8% on trustworthiness metrics. Our results highlight the importance of considering trustworthiness when developing model compression techniques and point to research opportunities at the intersection of compression and trustworthiness for safety-critical applications.
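The abstract does not spell out the voting rule, so the following is only a sketch of one plausible reading: each mixed-precision variant of the same model emits a label, and the ensemble returns the most frequent one. The function name `precision_ensemble_vote` and the tie-breaking rule (earliest listed variant wins) are assumptions for illustration, not details from the paper.

```python
from collections import Counter


def precision_ensemble_vote(predictions):
    """Plain majority vote over labels predicted by precision variants.

    `predictions` holds one label per variant (for example FP16, INT8, INT4).
    Ties are resolved in favour of the label that appears first in the list,
    which is an arbitrary choice made for this sketch.
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label
```

Listing the highest-precision variant first means that, under this tie-breaking rule, it decides any tie.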

[LG-55] An Efficient Embedding Based Ad Retrieval with GPU-Powered Feature Interaction

Link: https://arxiv.org/abs/2511.22460
Authors: Yifan Lei,Jiahua Luo,Tingyu Jiang,Bo Zhang,Lifeng Wang,Dapeng Liu,Zhaoren Wu,Haijie Gu,Huan Yu,Jie Jiang
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*Comments: 9 pages, 4 figures

Click to view abstract

Abstract:In large-scale advertising recommendation systems, retrieval serves as a critical component, aiming to efficiently select a subset of candidate ads relevant to user behaviors from a massive ad inventory for subsequent ranking and recommendation. The Embedding-Based Retrieval (EBR) methods modeled by the dual-tower network are widely used in the industry to maintain both retrieval efficiency and accuracy. However, the dual-tower model has significant limitations: the embeddings of users and ads interact only at the final inner product computation, resulting in insufficient feature interaction capabilities. Although DNN-based models with both user and ad as input features, allowing for early-stage interaction between these features, are introduced in the ranking stage to mitigate this issue, they are computationally infeasible for the retrieval stage. To bridge this gap, this paper proposes an efficient GPU-based feature interaction for the dual-tower network to significantly improve retrieval accuracy while substantially reducing computational costs. Specifically, we introduce a novel compressed inverted list designed for GPU acceleration, enabling efficient feature interaction computation at scale. To the best of our knowledge, this is the first framework in the industry to successfully implement Wide and Deep in a retrieval system. We apply this model to the real-world business scenarios in Tencent Advertising, and experimental results demonstrate that our method outperforms existing approaches in offline evaluation and has been successfully deployed to Tencent’s advertising recommendation system, delivering significant online performance gains. This improvement not only validates the effectiveness of the proposed method, but also provides new practical guidance for optimizing large-scale ad retrieval systems.

[LG-56] PISA: Prioritized Invariant Subgraph Aggregation

Link: https://arxiv.org/abs/2511.22435
Authors: Ali Ghasemi,Farooq Ahmad Wani,Maria Sofia Bucarelli,Fabrizio Silvestri
Categories: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Recent work has extended the invariance principle for out-of-distribution (OOD) generalization from Euclidean to graph data, where challenges arise due to complex structures and diverse distribution shifts in node attributes and topology. To handle these, Chen et al. proposed CIGA (Chen et al., 2022b), which uses causal modeling and an information-theoretic objective to extract a single invariant subgraph capturing causal features. However, this single-subgraph focus can miss multiple causal patterns. Liu et al. (2025) addressed this with SuGAr, which learns and aggregates diverse invariant subgraphs via a sampler and diversity regularizer, improving robustness but still relying on simple uniform or greedy aggregation. To overcome this, the proposed PISA framework introduces a dynamic MLP-based aggregation that prioritizes and combines subgraph representations more effectively. Experiments on 15 datasets, including DrugOOD (Ji et al., 2023), show that PISA achieves up to 5% higher classification accuracy than prior methods.

[LG-57] Improving Stochastic Action-Constrained Reinforcement Learning via Truncated Distributions AAAI26

Link: https://arxiv.org/abs/2511.22406
Authors: Roland Stolz,Michael Eichelbeck,Matthias Althoff
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)
*Comments: Accepted at the AAAI26 conference main technical track

Click to view abstract

Abstract:In reinforcement learning (RL), it is often advantageous to consider additional constraints on the action space to ensure safety or action relevance. Existing work on such action-constrained RL faces challenges regarding effective policy updates, computational efficiency, and predictable runtime. Recent work proposes to use truncated normal distributions for stochastic policy gradient methods. However, the computation of key characteristics, such as the entropy, log-probability, and their gradients, becomes intractable under complex constraints. Hence, prior work approximates these using the non-truncated distributions, which severely degrades performance. We argue that accurate estimation of these characteristics is crucial in the action-constrained RL setting, and propose efficient numerical approximations for them. We also provide an efficient sampling strategy for truncated policy distributions and validate our approach on three benchmark environments, which demonstrate significant performance improvements when using accurate estimations.
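For intuition about why the truncation matters, the sketch below computes the exact log-probability of a one-dimensional normal distribution truncated to an interval. This is the simplest textbook case and is not the paper's method, which targets complex constraints where such closed forms are unavailable. The function names and the interval bounds `low` / `high` are assumptions for illustration.

```python
import math


def _std_normal_cdf(z):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def truncated_normal_log_prob(x, mu, sigma, low, high):
    """Log-density of N(mu, sigma^2) truncated to the interval [low, high]."""
    if not (low <= x <= high):
        return float("-inf")
    z = (x - mu) / sigma
    # Log-density of the untruncated normal.
    log_pdf = -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
    # Probability mass the untruncated normal places inside [low, high].
    mass = _std_normal_cdf((high - mu) / sigma) - _std_normal_cdf((low - mu) / sigma)
    return log_pdf - math.log(mass)
```

Because `mass` is below 1 whenever the interval cuts off part of the distribution, the truncated log-probability is strictly larger than the untruncated one inside the interval, and that correction term is exactly what is dropped when the non-truncated distribution is used as an approximation. Note that this naive form fails numerically when `mass` underflows to zero, for example when the interval lies far in a tail.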

[LG-58] TS2Vec-Ensemble: An Enhanced Self-Supervised Framework for Time Series Forecasting

Link: https://arxiv.org/abs/2511.22395
Authors: Ganeshan Niroshan,Uthayasanker Thayasivam
Categories: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Self-supervised representation learning, particularly through contrastive methods like TS2Vec, has advanced the analysis of time series data. However, these models often falter in forecasting tasks because their objective functions prioritize instance discrimination over capturing the deterministic patterns, such as seasonality and trend, that are critical for accurate prediction. This paper introduces TS2Vec-Ensemble, a novel hybrid framework designed to bridge this gap. Our approach enhances the powerful, implicitly learned dynamics from a pretrained TS2Vec encoder by fusing them with explicit, engineered time features that encode periodic cycles. This fusion is achieved through a dual-model ensemble architecture, where two distinct regression heads – one focused on learned dynamics and the other on seasonal patterns – are combined using an adaptive weighting scheme. The ensemble weights are optimized independently for each forecast horizon, allowing the model to dynamically prioritize short-term dynamics or long-term seasonality as needed. We conduct extensive experiments on the ETT benchmark datasets for both univariate and multivariate forecasting. The results demonstrate that TS2Vec-Ensemble consistently and significantly outperforms the standard TS2Vec baseline and other state-of-the-art models, validating our hypothesis that a hybrid of learned representations and explicit temporal priors is a superior strategy for long-horizon time series forecasting.
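The per-horizon weighting described above can be sketched as follows. The abstract only says that ensemble weights are optimized independently for each forecast horizon; the grid search over a single scalar weight, the mean-squared-error criterion, and all names (`fit_horizon_weights`, `pred_dyn`, `pred_seas`, `grid_steps`) are assumptions made for this illustration.

```python
def fit_horizon_weights(pred_dyn, pred_seas, target, grid_steps=20):
    """Pick, per horizon, the weight w in [0, 1] that minimises validation MSE
    of the blend w * dynamics_head + (1 - w) * seasonal_head.

    Each argument is a list indexed by horizon; each element is a list of
    values over validation samples.
    """
    weights = []
    for dyn, seas, y in zip(pred_dyn, pred_seas, target):
        best_w, best_err = 0.0, float("inf")
        for k in range(grid_steps + 1):
            w = k / grid_steps
            err = sum((w * d + (1.0 - w) * s - t) ** 2
                      for d, s, t in zip(dyn, seas, y)) / len(y)
            if err < best_err:
                best_w, best_err = w, err
        weights.append(best_w)
    return weights
```

A horizon where the learned-dynamics head is accurate ends up with a weight near 1, while a horizon dominated by seasonality ends up near 0, which matches the behaviour the abstract describes.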

[LG-59] Predicting and Interpolating Spatiotemporal Environmental Data: A Case Study of Groundwater Storage in Bangladesh

Link: https://arxiv.org/abs/2511.22378
Authors: Anna Pazola,Mohammad Shamsudduha,Richard G. Taylor,Allan Tucker
Categories: Machine Learning (cs.LG)
*Comments: Submitted to the IDA 2026 conference

Click to view abstract

Abstract:Geospatial observational datasets are often limited to point measurements, making temporal prediction and spatial interpolation essential for constructing continuous fields. This study evaluates two deep learning strategies for addressing this challenge: (1) a grid-to-grid approach, where gridded predictors are used to model rasterised targets (aggregation before modelling), and (2) a grid-to-point approach, where gridded predictors model point targets, followed by kriging interpolation to fill the domain (aggregation after modelling). Using groundwater storage data from Bangladesh as a case study, we compare the efficacy of these approaches. Our findings indicate that spatial interpolation is substantially more difficult than temporal prediction. In particular, nearest neighbours are not always the most similar, and uncertainties in geology strongly influence point temporal behaviour. These insights motivate future work on advanced interpolation methods informed by clustering locations based on time series dynamics. Demonstrated on groundwater storage, the conclusions are applicable to other environmental variables governed by indirectly observable factors. Code is available at this https URL.

[LG-60] Efficient-Husformer: Efficient Multimodal Transformer Hyperparameter Optimization for Stress and Cognitive Loads

Link: https://arxiv.org/abs/2511.22362
Authors: Merey Orazaly,Fariza Temirkhanova,Jurn-Gyu Park
Categories: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Transformer-based models have gained considerable attention in the field of physiological signal analysis. They leverage long-range dependencies and complex patterns in temporal signals, allowing them to achieve performance superior to traditional RNN and CNN models. However, they incur high computational and memory costs. In this work, we present Efficient-Husformer, a novel Transformer-based architecture developed with hyperparameter optimization (HPO) for multi-class stress detection across two multimodal physiological datasets (WESAD and CogLoad). The main contributions of this work are: (1) the design of a structured search space, targeting effective hyperparameter optimization; (2) a comprehensive ablation study evaluating the impact of architectural decisions; (3) consistent performance improvements over the original Husformer, with the best configuration achieving an accuracy of 88.41 and 92.61 (improvements of 13.83% and 6.98%) on WESAD and CogLoad datasets, respectively. The best-performing configuration is achieved with the (L + dm) or (L + FFN) modality combinations, using a single layer, 3 attention heads, a model dimension of 18/30, and FFN dimension of 120/30, resulting in a compact model with only about 30k parameters.

[LG-61] AutoTailor: Automatic and Efficient Adaptive Model Deployment for Diverse Edge Devices

Link: https://arxiv.org/abs/2511.22355
Authors: Mengyang Liu,Chenyu Lu,Haodong Tian,Fang Dong,Ruiting Zhou,Wei Wang,Dian Shen,Guangtong Li,Ye Wan,Li Li
Categories: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:On-device machine learning (ML) has become a fundamental component of emerging mobile applications. Adaptive model deployment delivers efficient inference for heterogeneous device capabilities and performance requirements through customizing neural architectures. SuperNet-based approaches offer a promising solution by generating a large number of model variants from a pre-trained ML model. However, applying SuperNet in existing frameworks suffers from tedious model-aware development and time-consuming hardware-aware profiling, which limits their practical adoption. We present AutoTailor, the first framework to enable automated, end-to-end SuperNet-based adaptive model deployment for edge devices. Unlike manual SuperNet construction, AutoTailor employs a computation graph-guided compilation approach to automatically transform user-provided ML models into SuperNets. To support efficient specialization, AutoTailor incorporates learning-free latency and accuracy predictors, enabling low-cost yet accurate performance prediction. Our extended evaluations demonstrate that AutoTailor reduces the lines of code for SuperNet construction by 11–27$\times$, decreases hardware-aware profiling costs by at least 11$\times$, and achieves up to 15.60% absolute accuracy improvement and 60.03% latency reduction compared to state-of-the-art approaches across diverse models and devices.

[LG-62] Cleaning the Pool: Progressive Filtering of Unlabeled Pools in Deep Active Learning CVPR

Link: https://arxiv.org/abs/2511.22344
Authors: Denis Huseljic,Marek Herde,Lukas Rauch,Paul Hahn,Bernhard Sick
Categories: Machine Learning (cs.LG)
*Comments: Submitted to CVPR

Click to view abstract

Abstract:Existing active learning (AL) strategies capture fundamentally different notions of data value, e.g., uncertainty or representativeness. Consequently, the effectiveness of strategies can vary substantially across datasets, models, and even AL cycles. Committing to a single strategy risks suboptimal performance, as no single strategy dominates throughout the entire AL process. We introduce REFINE, an ensemble AL method that combines multiple strategies without knowing in advance which will perform best. In each AL cycle, REFINE operates in two stages: (1) Progressive filtering iteratively refines the unlabeled pool by considering an ensemble of AL strategies, retaining promising candidates capturing different notions of value. (2) Coverage-based selection then chooses a final batch from this refined pool, ensuring all previously identified notions of value are accounted for. Extensive experiments across 6 classification datasets and 3 foundation models show that REFINE consistently outperforms individual strategies and existing ensemble methods. Notably, progressive filtering serves as a powerful preprocessing step that improves the performance of any individual AL strategy applied to the refined pool, which we demonstrate on an audio spectrogram classification use case. Finally, the ensemble of REFINE can be easily extended with upcoming state-of-the-art AL strategies.

[LG-63] SingleQuant: Efficient Quantization of Large Language Models in a Single Pass

Link: https://arxiv.org/abs/2511.22316
Authors: Jinying Xiao,Bin Ji,Shasha Li,Xiaodong Liu,Ma Jun,Ye Zhong,Wei Li,Xuan Xie,Qingbo Wu,Jie Yu
Categories: Machine Learning (cs.LG)
*Comments: 9 pages, 4 figures

Click to view abstract

Abstract:Large language model (LLM) quantization facilitates deploying LLMs in resource-limited settings, but existing methods that combine incompatible gradient optimization and quantization truncation lead to serious convergence pathology. This prolongs quantization time and degrades LLMs’ task performance. Our studies confirm that the Straight-Through Estimator (STE) on Stiefel manifolds introduces non-smoothness and gradient noise, obstructing optimization convergence and blocking high-fidelity quantized LLM development despite extensive training. To tackle the above limitations, we propose SingleQuant, a single-pass quantization framework that decouples from quantization truncation, thereby eliminating the above non-smoothness and gradient noise factors. Specifically, SingleQuant constructs Alignment Rotation Transformation (ART) and Uniformity Rotation Transformation (URT) targeting distinct activation outliers, where ART achieves smoothing of outlier values via closed-form optimal rotations, and URT reshapes distributions through geometric mapping. Both matrices comprise strictly formulated Givens rotations with predetermined dimensions and rotation angles, enabling promising LLM task performance within a short time. Experimental results demonstrate SingleQuant’s superiority over the selected baselines across diverse tasks on 7B-70B LLMs. To be more precise, SingleQuant enables quantized LLMs to achieve higher task performance while necessitating less time for quantization. For example, when quantizing LLaMA-2-13B, SingleQuant achieves a 1,400$\times$ quantization speedup and improves average task performance by 0.57% compared to the selected best baseline.

[LG-64] DeXposure: A Dataset and Benchmarks for Inter-protocol Credit Exposure in Decentralized Financial Networks KR

Link: https://arxiv.org/abs/2511.22314
Authors: Wenbin Wu,Kejiang Qian,Alexis Lui,Christopher Jack,Yue Wu,Peter McBurney,Fengxiang He,Bryan Zhang
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Social and Information Networks (cs.SI); General Economics (econ.GN)
*Comments: Data and code: this https URL - Visualisation: this https URL

Click to view abstract

Abstract:We curate the DeXposure dataset, the first large-scale dataset for inter-protocol credit exposure in decentralized financial networks, covering global markets of 43.7 million entries across 4.3 thousand protocols, 602 blockchains, and 24.3 thousand tokens, from 2020 to 2025. A new measure, value-linked credit exposure between protocols, is defined as the inferred financial dependency relationships derived from changes in Total Value Locked (TVL). We develop a token-to-protocol model using DefiLlama metadata to infer inter-protocol credit exposure from the token’s stock dynamics, as reported by the protocols. Based on the curated dataset, we develop three benchmarks for machine learning research with financial applications: (1) graph clustering for global network measurement, tracking the structural evolution of credit exposure networks, (2) vector autoregression for sector-level credit exposure dynamics during major shocks (Terra and FTX), and (3) temporal graph neural networks for dynamic link prediction on temporal graphs. From the analysis, we observe (1) a rapid growth of network volume, (2) a trend of concentration to key protocols, (3) a decline of network density (the ratio of actual connections to possible connections), and (4) distinct shock propagation across sectors, such as lending platforms, trading exchanges, and asset management protocols. The DeXposure dataset and code have been released publicly. We envision they will help with research and practice in machine learning as well as financial risk monitoring, policy analysis, DeFi market modeling, amongst others. The dataset also contributes to machine learning research by offering benchmarks for graph clustering, vector autoregression, and temporal graph analysis.
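The network density mentioned above (the ratio of actual connections to possible connections) can be computed as in the sketch below. Treating exposure links as directed, ignoring self-loops, and counting duplicate edges once are assumptions of this illustration; the abstract does not state these conventions. The function name `network_density` is likewise hypothetical.

```python
def network_density(num_nodes, edges, directed=True):
    """Ratio of distinct observed links to the number of possible links.

    `edges` is an iterable of (source, target) pairs. Self-loops are ignored
    and repeated pairs are counted once.
    """
    unique = set()
    for u, v in edges:
        if u == v:
            continue
        unique.add((u, v) if directed else tuple(sorted((u, v))))
    possible = num_nodes * (num_nodes - 1)
    if not directed:
        possible //= 2
    return len(unique) / possible if possible else 0.0
```

Because the denominator grows quadratically in the number of protocols, density can fall even while the absolute number of links rises, which is consistent with the abstract reporting both rapid network growth and declining density.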

[LG-65] FLUX: Efficient Descriptor-Driven Clustered Federated Learning under Arbitrary Distribution Shifts NEURIPS2025

Link: https://arxiv.org/abs/2511.22305
Authors: Dario Fenoglio,Mohan Li,Pietro Barbiero,Nicholas D. Lane,Marc Langheinrich,Martin Gjoreski
Categories: Machine Learning (cs.LG)
*Comments: [v1] Pre-print of the paper accepted to NeurIPS 2025 (57 pages)

Click to view abstract

Abstract:Federated Learning (FL) enables collaborative model training across multiple clients while preserving data privacy. Traditional FL methods often use a global model to fit all clients, assuming that clients’ data are independent and identically distributed (IID). However, when this assumption does not hold, the global model accuracy may drop significantly, limiting FL applicability in real-world scenarios. To address this gap, we propose FLUX, a novel clustering-based FL (CFL) framework that addresses the four most common types of distribution shifts during both training and test time. To this end, FLUX leverages privacy-preserving client-side descriptor extraction and unsupervised clustering to ensure robust performance and scalability across varying levels and types of distribution shifts. Unlike existing CFL methods addressing non-IID client distribution shifts, FLUX i) does not require any prior knowledge of the types of distribution shifts or the number of client clusters, and ii) supports test-time adaptation, enabling unseen and unlabeled clients to benefit from the most suitable cluster-specific models. Extensive experiments across four standard benchmarks, two real-world datasets and ten state-of-the-art baselines show that FLUX improves performance and stability under diverse distribution shifts, achieving an average accuracy gain of up to 23 percentage points over the best-performing baselines, while maintaining computational and communication overhead comparable to FedAvg.

[LG-66] GLA-Grad: An Improved Griffin-Lim Guided Diffusion Model for Speech Synthesis

Link: https://arxiv.org/abs/2511.22293
Authors: Teysir Baoueb,Xiaoyu Bie,Mathieu Fontaine,Gaël Richard
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
*Comments:

Click to view abstract

Abstract:Recent advances in diffusion models have positioned them as powerful generative frameworks for speech synthesis, demonstrating substantial improvements in audio quality and stability. Nevertheless, their effectiveness in vocoders conditioned on mel spectrograms remains constrained, particularly when the conditioning diverges from the training distribution. The recently proposed GLA-Grad model introduced a phase-aware extension to the WaveGrad vocoder that integrated the Griffin-Lim algorithm (GLA) into the reverse process to reduce inconsistencies between generated signals and conditioning mel spectrogram. In this paper, we further improve GLA-Grad through an innovative choice in how to apply the correction. Particularly, we compute the correction term only once, with a single application of GLA, to accelerate the generation process. Experimental results demonstrate that our method consistently outperforms the baseline models, particularly in out-of-domain scenarios.

[LG-67] Online Dynamic Pricing of Complementary Products

Link: https://arxiv.org/abs/2511.22291
Authors: Marco Mussi,Marcello Restelli
Categories: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Traditional pricing paradigms, once dominated by static models and rule-based heuristics, are increasingly being replaced by dynamic, data-driven approaches powered by machine learning algorithms. Despite their growing sophistication, most dynamic pricing algorithms focus on optimizing the price of each product independently, disregarding potential interactions among items. By neglecting these interdependencies in consumer demand across related goods, sellers may fail to capture the full potential of coordinated pricing strategies. In this paper, we address this problem by exploring dynamic pricing mechanisms designed explicitly for complementary products, aiming to exploit their joint demand structure to maximize overall revenue. We present an online learning algorithm considering both positive and negative interactions between products’ demands. The algorithm utilizes transaction data to identify advantageous complementary relationships through an integer programming problem between different items, and then optimizes pricing strategies using data-driven and computationally efficient multi-armed bandit solutions based on heteroscedastic Gaussian processes. We validate our solution in a simulated environment, and we demonstrate that our solution improves the revenue w.r.t. a comparable learning algorithm ignoring such interactions.

[LG-68] The Hidden Cost of Approximation in Online Mirror Descent

Link: https://arxiv.org/abs/2511.22283
Authors: Ofir Schlisselberg,Uri Sherman,Tomer Koren,Yishay Mansour
Categories: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Online mirror descent (OMD) is a fundamental algorithmic paradigm that underlies many algorithms in optimization, machine learning and sequential decision-making. The OMD iterates are defined as solutions to optimization subproblems which, oftentimes, can be solved only approximately, leading to an inexact version of the algorithm. Nonetheless, existing OMD analyses typically assume an idealized error-free setting, thereby limiting our understanding of performance guarantees that should be expected in practice. In this work we initiate a systematic study into inexact OMD, and uncover an intricate relation between regularizer smoothness and robustness to approximation errors. When the regularizer is uniformly smooth, we establish a tight bound on the excess regret due to errors. Then, for barrier regularizers over the simplex and its subsets, we identify a sharp separation: negative entropy requires exponentially small errors to avoid linear regret, whereas log-barrier and Tsallis regularizers remain robust even when the errors are only polynomial. Finally, we show that when the losses are stochastic and the domain is the simplex, negative entropy regains robustness, but this property does not extend to all subsets, where exponentially small errors are again necessary to avoid suboptimal regret.

[LG-69] TreeCoder: Systematic Exploration and Optimisation of Decoding and Constraints for LLM Code Generation

Link: https://arxiv.org/abs/2511.22277
Authors: Henrijs Princis,Arindam Sharma,Cristina David
Categories: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Large language models (LLMs) have shown remarkable ability to generate code, yet their outputs often violate syntactic or semantic constraints when guided only through natural language prompts. We introduce TreeCoder, the most general and flexible framework to date for exploring decoding strategies, constraints, and hyperparameters in LLMs, and use it in code generation to enforce correctness and structure during decoding rather than relying on prompt engineering. TreeCoder represents decoding as a tree search over candidate programs, where both decoding strategies and constraint functions - such as style, syntax, execution - are treated as first-class, optimisable components. This design enables systematic exploration and automatic tuning of decoding configurations using standard optimisation techniques. Experiments on the MBPP (Python) and SQL-Spider benchmarks show that TreeCoder consistently improves accuracy across open-source models such as CodeLlama, Mistral and DeepSeek, often outperforming their unconstrained baselines by considerable margins.

[LG-70] FedRE: A Representation Entanglement Framework for Model-Heterogeneous Federated Learning

Link: https://arxiv.org/abs/2511.22265
Authors: Yuan Yao,Lixu Wang,Jiaqi Wu,Jin Song,Simin Chen,Zehua Wang,Zijian Tian,Wei Chen,Huixia Li,Xiaoxiao Li
Categories: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Federated learning (FL) enables collaborative training across clients without compromising privacy. While most existing FL methods assume homogeneous model architectures, client heterogeneity in data and resources renders this assumption impractical, motivating model-heterogeneous FL. To address this problem, we propose Federated Representation Entanglement (FedRE), a framework built upon a novel form of client knowledge termed entangled representation. In FedRE, each client aggregates its local representations into a single entangled representation using normalized random weights and applies the same weights to integrate the corresponding one-hot label encodings into the entangled-label encoding. Those are then uploaded to the server to train a global classifier. During training, each entangled representation is supervised across categories via its entangled-label encoding, while random weights are resampled each round to introduce diversity, mitigating the global classifier’s overconfidence and promoting smoother decision boundaries. Furthermore, each client uploads a single cross-category entangled representation along with its entangled-label encoding, mitigating the risk of representation inversion attacks and reducing communication overhead. Extensive experiments demonstrate that FedRE achieves an effective trade-off among model performance, privacy protection, and communication overhead. The codes are available at this https URL.
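The client-side entanglement step described above can be sketched as follows: draw random weights, normalize them to sum to one, and apply the same weights to the local representations and to their one-hot labels. This is an illustrative reading of the abstract rather than the released code. The uniform sampling of raw weights, the function name `entangle`, and the plain-list data layout are assumptions.

```python
import random


def entangle(representations, labels, num_classes, rng=None):
    """Return (entangled_representation, entangled_label_encoding).

    `representations` is a list of equal-length feature vectors, `labels` the
    matching integer class ids. Raw weights are drawn uniformly from [0, 1)
    and normalised to sum to 1 (the sampling distribution is an assumption).
    """
    rng = rng or random.Random()
    raw = [rng.random() for _ in representations]
    total = sum(raw)
    weights = [w / total for w in raw]

    dim = len(representations[0])
    ent_rep = [sum(w * rep[j] for w, rep in zip(weights, representations))
               for j in range(dim)]

    # The same weights mix the one-hot labels, giving a soft label vector.
    ent_label = [0.0] * num_classes
    for w, y in zip(weights, labels):
        ent_label[y] += w
    return ent_rep, ent_label
```

Calling `entangle` again in the next round with fresh randomness yields a different mixture, mirroring the per-round resampling the abstract mentions. Since only one vector and one soft label leave the client, the upload size is independent of the number of local samples.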

[LG-71] Real-PGDN: A Two-level Classification Method for Full-Process Recognition of Newly Registered Pornographic and Gambling Domain Names

Link: https://arxiv.org/abs/2511.22215
Authors: Hao Wang,Yingshuo Wang,Junang Gan,Yanan Cheng,Jinshuai Zhang
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:Online pornography and gambling have consistently posed regulatory challenges for governments, threatening both personal assets and privacy. Therefore, it is imperative to research the classification of the newly registered Pornographic and Gambling Domain Names (PGDN). However, scholarly investigation into this topic is limited. Previous efforts in PGDN classification pursue high accuracy using ideal sample data, while others employ up-to-date data from real-world scenarios but achieve lower classification accuracy. This paper introduces the Real-PGDN method, which accomplishes a complete process of timely and comprehensive real-data crawling, feature extraction with feature-missing tolerance, precise PGDN classification, and assessment of application effects in actual scenarios. Our two-level classifier, which integrates CoSENT (BERT-based), Multilayer Perceptron (MLP), and traditional classification algorithms, achieves a 97.88% precision. The research process amasses the NRD2024 dataset, which contains continuous detection information over 20 days for 1,500,000 newly registered domain names across 6 directions. Results from our case study demonstrate that this method also maintains a forecast precision of over 70% for PGDN that are delayed in usage after registration.

[LG-72] BiCQL-ML: A Bi-Level Conservative Q-Learning Framework for Maximum Likelihood Inverse Reinforcement Learning

Link: https://arxiv.org/abs/2511.22210
Authors: Junsung Park
Categories: Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 8 pages, 3 figures

Abstract:Offline inverse reinforcement learning (IRL) aims to recover a reward function that explains expert behavior using only fixed demonstration data, without any additional online interaction. We propose BiCQL-ML, a policy-free offline IRL algorithm that jointly optimizes a reward function and a conservative Q-function in a bi-level framework, thereby avoiding explicit policy learning. The method alternates between (i) learning a conservative Q-function via Conservative Q-Learning (CQL) under the current reward, and (ii) updating the reward parameters to maximize the expected Q-values of expert actions while suppressing over-generalization to out-of-distribution actions. This procedure can be viewed as maximum likelihood estimation under a soft value matching principle. We provide theoretical guarantees that BiCQL-ML converges to a reward function under which the expert policy is soft-optimal. Empirically, we show on standard offline RL benchmarks that BiCQL-ML improves both reward recovery and downstream policy performance compared to existing offline IRL baselines.

[LG-73] TinyLLM: Evaluation and Optimization of Small Language Models for Agentic Tasks on Edge Devices

Link: https://arxiv.org/abs/2511.22138
Authors: Mohd Ariful Haque(1),Fahad Rahman(2),Kishor Datta Gupta(1),Khalil Shujaee(1),Roy George(1) ((1) Clark Atlanta University, (2) United International University)
Categories: Machine Learning (cs.LG)
Comments: 8 pages, 3 figures, 4 tables

Abstract:This paper investigates the effectiveness of small language models (SLMs) for agentic tasks (function/tool/API calling) with a focus on running agents on edge devices without reliance on cloud infrastructure. We evaluate SLMs using the Berkeley Function Calling Leaderboard (BFCL) framework and describe parameter-driven optimization strategies that include supervised fine-tuning (SFT), parameter-efficient fine-tuning (PEFT), reinforcement learning (RL)-based optimization, preference alignment via Direct Preference Optimization (DPO), and hybrid methods. We report results for models including TinyAgent, TinyLlama, Qwen, and xLAM across BFCL categories (simple, multiple, parallel, parallel-multiple, and relevance detection), both in live and non-live settings, and in multi-turn evaluations. We additionally detail a DPO training pipeline constructed from AgentBank data (e.g., ALFRED), including our conversion of SFT data to chosen-rejected pairs using TinyLlama responses as rejected outputs and manual validation. Our results demonstrate clear accuracy differences across model scales where medium-sized models (1-3B parameters) significantly outperform ultra-compact models (<1B parameters), achieving up to 65.74% overall accuracy, and 55.62% multi-turn accuracy with hybrid optimization. This study highlights the importance of hybrid optimization strategies that enable small language models to deliver accurate, efficient, and stable agentic AI on edge devices, making privacy-preserving, low-latency autonomous agents practical beyond the cloud.
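The conversion of SFT data into chosen-rejected pairs mentioned in the abstract above might look roughly like the sketch below. The record fields (`prompt`, `response`), the function name `build_dpo_pairs`, and the rule that drops pairs whose rejected text matches the reference are illustrative assumptions; the abstract states only that the weaker model's responses serve as rejected outputs and that pairs were validated manually.

```python
def build_dpo_pairs(sft_examples, rejected_responses):
    """Pair each SFT reference answer (chosen) with a weaker model's answer (rejected).

    Assumption: identical chosen/rejected texts carry no preference signal,
    so they are skipped. The paper's actual filtering is manual validation.
    """
    pairs = []
    for example, rejected in zip(sft_examples, rejected_responses):
        if rejected.strip() == example["response"].strip():
            continue
        pairs.append({
            "prompt": example["prompt"],
            "chosen": example["response"],
            "rejected": rejected,
        })
    return pairs

# Hypothetical toy data; real prompts would come from AgentBank-style trajectories.
sft = [
    {"prompt": "p1", "response": "good answer"},
    {"prompt": "p2", "response": "same answer"},
]
weak_outputs = ["bad answer", "same answer "]
pairs = build_dpo_pairs(sft, weak_outputs)
```

The resulting list uses the prompt/chosen/rejected layout that preference-optimization trainers commonly expect, though the exact schema depends on the training library.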

[LG-74] Probabilistic Digital Twin for Misspecified Structural Dynamical Systems via Latent Force Modeling and Bayesian Neural Networks

Link: https://arxiv.org/abs/2511.22133
Authors: Sahil Kashyap,Rajdip Nayek
Categories: Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)

Abstract:This work presents a probabilistic digital twin framework for response prediction in dynamical systems governed by misspecified physics. The approach integrates Gaussian Process Latent Force Models (GPLFM) and Bayesian Neural Networks (BNNs) to enable end-to-end uncertainty-aware inference and prediction. In the diagnosis phase, model-form errors (MFEs) are treated as latent input forces to a nominal linear dynamical system and jointly estimated with system states using GPLFM from sensor measurements. A BNN is then trained on posterior samples to learn a probabilistic nonlinear mapping from system states to MFEs, while capturing diagnostic uncertainty. For prognosis, this mapping is used to generate pseudo-measurements, enabling state prediction via Kalman filtering. The framework allows for systematic propagation of uncertainty from diagnosis to prediction, a key capability for trustworthy digital twins. The framework is demonstrated using four nonlinear examples: a single degree of freedom (DOF) oscillator, a multi-DOF system, and two established benchmarks – the Bouc-Wen hysteretic system and the Silverbox experimental dataset – highlighting its predictive accuracy and robustness to model misspecification.

[LG-75] Benchmarking In-context Experiential Learning Through Repeated Product Recommendations

Link: https://arxiv.org/abs/2511.22130
Authors: Gilbert Yang,Yaqin Chen,Thomson Yen,Hongseok Namkoong
Categories: Machine Learning (cs.LG)

Abstract:To reliably navigate ever-shifting real-world environments, agents must grapple with incomplete knowledge and adapt their behavior through experience. However, current evaluations largely focus on tasks that leave no ambiguity, and do not measure agents’ ability to adaptively learn and reason through the experiences they have accrued. We exemplify the need for this in-context experiential learning in a product recommendation context, where agents must navigate shifting customer preferences and product landscapes through natural language dialogue. We curate a benchmark for experiential learning and active exploration (BELA) that combines (1) rich real-world products from Amazon, (2) a diverse collection of user personas to represent heterogeneous yet latent preferences, and (3) an LLM user simulator powered by the persona to create rich interactive trajectories. We observe that current frontier models struggle to meaningfully improve across episodes, underscoring the need for agentic systems with strong in-context learning capabilities.

[LG-76] A Variational Manifold Embedding Framework for Nonlinear Dimensionality Reduction NEURIPS2025

Link: https://arxiv.org/abs/2511.22128
Authors: John J. Vastola,Samuel J. Gershman,Kanaka Rajan
Categories: Machine Learning (cs.LG)
Comments: Accepted to the NeurIPS 2025 Workshop on Symmetry and Geometry in Neural Representations (NeurReps)

Abstract:Dimensionality reduction algorithms like principal component analysis (PCA) are workhorses of machine learning and neuroscience, but each has well-known limitations. Variants of PCA are simple and interpretable, but not flexible enough to capture nonlinear data manifold structure. More flexible approaches have other problems: autoencoders are generally difficult to interpret, and graph-embedding-based methods can produce pathological distortions in manifold geometry. Motivated by these shortcomings, we propose a variational framework that casts dimensionality reduction algorithms as solutions to an optimal manifold embedding problem. By construction, this framework permits nonlinear embeddings, allowing its solutions to be more flexible than PCA. Moreover, the variational nature of the framework has useful consequences for interpretability: each solution satisfies a set of partial differential equations, and can be shown to reflect symmetries of the embedding objective. We discuss these features in detail and show that solutions can be analytically characterized in some cases. Interestingly, one special case exactly recovers PCA.

[LG-77] IVGAE: Handling Incomplete Heterogeneous Data with a Variational Graph Autoencoder

Link: https://arxiv.org/abs/2511.22116
Authors: Youran Zhou,Mohamed Reda Bouadjenek,Sunil Aryal
Categories: Machine Learning (cs.LG)

Abstract:Handling missing data remains a fundamental challenge in real-world tabular datasets, especially when data are heterogeneous with both numerical and categorical features. Existing imputation methods often fail to capture complex structural dependencies and handle heterogeneous data effectively. We present IVGAE, a Variational Graph Autoencoder framework for robust imputation of incomplete heterogeneous data. IVGAE constructs a bipartite graph to represent sample-feature relationships and applies graph representation learning to model structural dependencies. A key innovation is its dual-decoder architecture, where one decoder reconstructs feature embeddings and the other models missingness patterns, providing structural priors aware of missing mechanisms. To better encode categorical variables, we introduce a Transformer-based heterogeneous embedding module that avoids high-dimensional one-hot encoding. Extensive experiments on 16 real-world datasets show that IVGAE achieves consistent improvements in RMSE and downstream F1 across MCAR, MAR, and MNAR missing scenarios under 30% missing rates. Code and data are available at: this https URL.

[LG-78] Toward Data-Driven Surrogates of the Solar Wind with Spherical Fourier Neural Operator

Link: https://arxiv.org/abs/2511.22112
Authors: Reza Mansouri,Dustin Kempton,Pete Riley,Rafal Angryk
Categories: Machine Learning (cs.LG)
Comments: International Conference on Machine Learning and Applications (ICMLA 2025)

Abstract:The solar wind, a continuous stream of charged particles from the Sun’s corona, shapes the heliosphere and impacts space systems near Earth. Variations such as high-speed streams and coronal mass ejections can disrupt satellites, power grids, and communications, making accurate modeling essential for space weather forecasting. While 3D magnetohydrodynamic (MHD) models are used to simulate and investigate these variations in the solar wind, they tend to be computationally expensive, limiting their usefulness in investigating the impacts of boundary condition uncertainty. In this work, we develop a surrogate for steady state solar wind modeling, using a Spherical Fourier Neural Operator (SFNO). We compare our model to a previously developed numerical surrogate for this task called HUX, and we show that the SFNO achieves comparable or better performance across several metrics. Though HUX retains advantages in physical smoothness, this underscores the need for improved evaluation criteria rather than a flaw in SFNO. As a flexible and trainable approach, SFNO enables efficient real-time forecasting and can improve with more data. The source code and more visual results are available at this https URL.

[LG-79] An energy-efficient spiking neural network with continuous learning for self-adaptive brain-machine interface

Link: https://arxiv.org/abs/2511.22108
Authors: Zhou Biyan,Arindam Basu
Categories: Machine Learning (cs.LG)

Abstract:The number of simultaneously recorded neurons follows an exponentially increasing trend in implantable brain-machine interfaces (iBMIs). Integrating the neural decoder in the implant is an effective data compression method for future wireless iBMIs. However, the non-stationarity of the system makes the performance of the decoder unreliable. To avoid frequent retraining of the decoder and to ensure the safety and comfort of the iBMI user, continuous learning is essential for real-life applications. Since Deep Spiking Neural Networks (DSNNs) are being recognized as a promising approach for developing resource-efficient neural decoders, we propose continuous learning approaches with Reinforcement Learning (RL) algorithms adapted for DSNNs. Banditron and AGREL are chosen as the two candidate RL algorithms since they can be trained with limited computational resources, effectively addressing the non-stationarity problem and fitting the energy constraints of implantable devices. To assess the effectiveness of the proposed methods, we conducted both open-loop and closed-loop experiments. The accuracy of open-loop experiments conducted with DSNN Banditron and DSNN AGREL remains stable over extended periods. Meanwhile, in the closed-loop experiment with perturbations, the time-to-target of DSNN Banditron was comparable to that of DSNN AGREL, while DSNN Banditron achieved reductions of 98% in memory access usage and 99% in the requirements for multiply-and-accumulate (MAC) operations during training. Compared to previous continuous learning SNN decoders, DSNN Banditron requires 98% less computation, making it a prime candidate for future wireless iBMI systems.

[LG-80] Energy Efficient Sleep Mode Optimization in 5G mmWave Networks via Multi Agent Deep Reinforcement Learning

Link: https://arxiv.org/abs/2511.22105
Authors: Saad Masrur,Ismail Guvenc,David Lopez Perez
Categories: Machine Learning (cs.LG)
Comments: This is an updated version of my preprint available on TechRxiv. Don’t flag it as plagiarism. I wanna post my paper on arxiv

Abstract:Dynamic sleep mode optimization (SMO) in millimeter-wave (mmWave) networks is essential for maximizing energy efficiency (EE) under stringent quality-of-service (QoS) constraints. However, existing optimization and reinforcement learning (RL) approaches rely on aggregated, static base station (BS) traffic models that fail to capture non-stationary traffic dynamics and suffer from large state-action spaces, limiting real-world deployment. To address these challenges, this paper proposes a multi-agent deep reinforcement learning (MARL) framework using a Double Deep Q-Network (DDQN), referred to as MARL-DDQN, for adaptive SMO in a 3D urban environment with a time-varying and community-based user equipment (UE) mobility model. Unlike conventional single-agent RL, MARL-DDQN enables scalable, distributed decision-making with minimal signaling overhead. A realistic BS power consumption model and beamforming are integrated to accurately quantify EE, while QoS is defined in terms of throughput. The method adapts SMO policies to maximize EE while mitigating inter-cell interference and ensuring throughput fairness. Simulations show that MARL-DDQN outperforms state-of-the-art strategies, including All On, iterative QoS-aware load-based (IT-QoS-LB), MARL-DDPG, and MARL-PPO, achieving up to 0.60 Mbit/Joule EE, 8.5 Mbps 10th-percentile throughput, and meeting QoS constraints 95% of the time under dynamic scenarios.
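For readers unfamiliar with the Double DQN building block named in the abstract above, the standard target computation is sketched below. It shows only the generic DDQN update rule for a single agent; the function name, the toy Q-values, and the reading of the two actions as sleep/active are assumptions, and nothing here reflects the paper's state, reward, or multi-agent design.

```python
def ddqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Standard Double DQN target for one transition.

    The online network selects the next action and the target network
    evaluates it, which reduces the overestimation bias of plain DQN.
    """
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    return reward + gamma * next_q_target[best_action]

# Toy example with two hypothetical actions (e.g. 0 = sleep, 1 = active).
y = ddqn_target(reward=1.0, next_q_online=[0.2, 0.9],
                next_q_target=[0.5, 0.3], gamma=0.5)
```

In the example the online network prefers action 1, so the target uses the target network's value for action 1 (0.3) rather than its own maximum (0.5).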

[LG-81] Representative Action Selection for Large Action Space: From Bandits to MDPs

Link: https://arxiv.org/abs/2511.22104
Authors: Quan Zhou,Shie Mannor
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR); Machine Learning (stat.ML)
Comments: Journal version of arXiv:2505.18269

Abstract:We study the problem of selecting a small, representative action subset from an extremely large action space shared across a family of reinforcement learning (RL) environments – a fundamental challenge in applications like inventory management and recommendation systems, where direct learning over the entire space is intractable. Our goal is to identify a fixed subset of actions that, for every environment in the family, contains a near-optimal action, thereby enabling efficient learning without exhaustively evaluating all actions. This work extends our prior results for meta-bandits to the more general setting of Markov Decision Processes (MDPs). We prove that our existing algorithm achieves performance comparable to using the full action space. This theoretical guarantee is established under a relaxed, non-centered sub-Gaussian process model, which accommodates greater environmental heterogeneity. Consequently, our approach provides a computationally and sample-efficient solution for large-scale combinatorial decision-making under uncertainty.

[LG-82] Adaptive Dueling Double Deep Q-networks in Uniswap V3 Replication and Extension with Mamba

Link: https://arxiv.org/abs/2511.22101
Authors: Zhaofeng Zhang
Categories: Machine Learning (cs.LG); Trading and Market Microstructure (q-fin.TR)
Comments: 12 pages, 5 figures

Abstract:The report goes through the main steps of replicating and improving the article “Adaptive Liquidity Provision in Uniswap V3 with Deep Reinforcement Learning.” The replication part includes how to obtain data from the Uniswap Subgraph, details of the implementation, and comments on the results. After the replication, I propose a new structure based on the original model, which combines Mamba with DDQN and a new reward function. In this new structure, I clean the data again and introduce two new baselines for comparison. As a result, although the model has not yet been applied to all datasets, it shows stronger theoretical support than the original model and performs better in some tests.

[LG-83] Quantum Bayesian Optimization for Quality Improvement in Fuselage Assembly

Link: https://arxiv.org/abs/2511.22090
Authors: Jiayu Liu,Chong Liu,Trevor Rhone,Yinan Wang
Categories: Machine Learning (cs.LG)

Abstract:Recent efforts in smart manufacturing have enhanced aerospace fuselage assembly processes, particularly by innovating shape adjustment techniques to minimize dimensional gaps between assembled sections. Existing approaches have shown promising results but face the issue of low sample efficiency from the manufacturing systems. It arises from the limitation of the classical Monte Carlo method when uncovering the mean response from a distribution. In contrast, recent work has shown that quantum algorithms can achieve the same level of estimation accuracy with significantly fewer samples than the classical Monte Carlo method from distributions. Therefore, we can adopt the estimation of the quantum algorithm to obtain the estimation from real physical systems (distributions). Motivated by this advantage, we propose a Quantum Bayesian Optimization (QBO) framework for precise shape control during assembly to improve the sample efficiency in manufacturing practice. Specifically, this approach utilizes a quantum oracle, based on finite element analysis (FEA)-based models or surrogate models, to acquire a more accurate estimation of the environment response with fewer queries for a certain input. QBO employs an Upper Confidence Bound (UCB) as the acquisition function to strategically select input values that are most likely to maximize the objective function. It has been theoretically proven to require much fewer samples while maintaining comparable optimization results. In the case study, force-controlled actuators are applied to one fuselage section to adjust its shape and reduce the gap to the adjoining section. Experimental results demonstrate that QBO achieves significantly lower dimensional error and uncertainty compared to classical methods, particularly using the same queries from the simulation.
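The Upper Confidence Bound acquisition rule mentioned in the abstract above can be illustrated with a purely classical sketch. The function name `ucb_select`, the exploration weight `beta`, and the candidate values are assumptions; the quantum mean-estimation oracle that distinguishes QBO is not modeled here.

```python
def ucb_select(means, stds, beta=2.0):
    """Pick the candidate index with the largest upper confidence bound.

    means and stds are a surrogate model's predicted mean and uncertainty
    for each candidate input; beta trades off exploration and exploitation.
    """
    scores = [m + beta * s for m, s in zip(means, stds)]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical surrogate predictions for two candidate actuator settings.
chosen = ucb_select(means=[0.5, 0.4], stds=[0.0, 0.2], beta=1.0)
```

With these toy numbers the second candidate wins (0.4 + 0.2 = 0.6 against 0.5), showing how uncertainty can outweigh a lower predicted mean.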

[LG-84] ARES: Anomaly Recognition Model For Edge Streams KDD2026

Link: https://arxiv.org/abs/2511.22078
Authors: Simone Mungari,Albert Bifet,Giuseppe Manco,Bernhard Pfahringer
Categories: Machine Learning (cs.LG)
Comments: Accepted at KDD 2026

Abstract:Many real-world scenarios involving streaming information can be represented as temporal graphs, where data flows through dynamic changes in edges over time. Anomaly detection in this context has the objective of identifying unusual temporal connections within the graph structure. Detecting edge anomalies in real time is crucial for mitigating potential risks. Unlike traditional anomaly detection, this task is particularly challenging due to concept drifts, large data volumes, and the need for real-time response. To face these challenges, we introduce ARES, an unsupervised anomaly detection framework for edge streams. ARES combines Graph Neural Networks (GNNs) for feature extraction with Half-Space Trees (HST) for anomaly scoring. GNNs capture both spike and burst anomalous behaviors within streams by embedding node and edge properties in a latent space, while HST partitions this space to isolate anomalies efficiently. ARES operates in an unsupervised way without the need for prior data labeling. To further validate its detection capabilities, we additionally incorporate a simple yet effective supervised thresholding mechanism. This approach leverages statistical dispersion among anomaly scores to determine the optimal threshold using a minimal set of labeled data, ensuring adaptability across different domains. We validate ARES through extensive evaluations across several real-world cyber-attack scenarios, comparing its performance against existing methods while analyzing its space and time complexity.

[LG-85] Convergence Dynamics of Over-Parameterized Score Matching for a Single Gaussian

Link: https://arxiv.org/abs/2511.22069
Authors: Yiran Zhang,Weihang Xu,Mo Zhou,Maryam Fazel,Simon Shaolei Du
Categories: Machine Learning (cs.LG)
Comments: 43 pages

Abstract:Score matching has become a central training objective in modern generative modeling, particularly in diffusion models, where it is used to learn high-dimensional data distributions through the estimation of score functions. Despite its empirical success, the theoretical understanding of the optimization behavior of score matching, particularly in over-parameterized regimes, remains limited. In this work, we study gradient descent for training over-parameterized models to learn a single Gaussian distribution. Specifically, we use a student model with n learnable parameters and train it on data generated from a single ground-truth Gaussian using the population score matching objective. We analyze the optimization dynamics under multiple regimes. When the noise scale is sufficiently large, we prove a global convergence result for gradient descent. In the low-noise regime, we identify the existence of a stationary point, highlighting the difficulty of proving global convergence in this case. Nevertheless, we show convergence under certain initialization conditions: when the parameters are initialized to be exponentially small, gradient descent ensures convergence of all parameters to the ground truth. We further prove that without the exponentially small initialization, the parameters may not converge to the ground truth. Finally, we consider the case where parameters are randomly initialized from a Gaussian distribution far from the ground truth. We prove that, with high probability, only one parameter converges while the others diverge, yet the loss still converges to zero with a 1/\tau rate, where \tau is the number of iterations. We also establish a nearly matching lower bound on the convergence rate in this regime. This is the first work to establish global convergence guarantees for Gaussian mixtures with at least three components under the score matching framework.

[LG-86] Calibration-Free EEG-based Driver Drowsiness Detection with Online Test-Time Adaptation

Link: https://arxiv.org/abs/2511.22030
Authors: Geun-Deok Jang,Dong-Kyun Han,Seo-Hyeon Park,Seong-Whan Lee
Categories: Machine Learning (cs.LG)
Comments: 10 pages, Submitted to IEEE Transactions on Human-Machine Systems

Abstract:Drowsy driving is a growing cause of traffic accidents, prompting recent exploration of electroencephalography (EEG)-based drowsiness detection systems. However, the inherent variability of EEG signals due to psychological and physical factors necessitates a cumbersome calibration process. In particular, the inter-subject variability of EEG signals leads to a domain shift problem, which makes it challenging to generalize drowsiness detection models to unseen target subjects. To address these issues, we propose a novel driver drowsiness detection framework that leverages online test-time adaptation (TTA) methods to dynamically adjust to target subject distributions. Our proposed method updates the learnable parameters in batch normalization (BN) layers, while preserving pretrained normalization statistics, resulting in a modified configuration that ensures effective adaptation during test time. We incorporate a memory bank that dynamically manages streaming EEG segments, selecting samples based on their reliability determined by negative energy scores and persistence time. In addition, we introduce prototype learning to ensure robust predictions against distribution shifts over time. We validated our method on the sustained-attention driving dataset collected in a simulated environment, where drowsiness was estimated from delayed reaction times during monotonous lane-keeping tasks. Our experiments show that our method outperforms all baselines, achieving an average F1-score of 81.73%, an improvement of 11.73% over the best TTA baseline. This demonstrates that our proposed method significantly enhances the adaptability of EEG-based drowsiness detection systems in non-i.i.d. scenarios.

[LG-87] Equilibrium Propagation Without Limits

Link: https://arxiv.org/abs/2511.22024
Authors: Elon Litman
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

Abstract:We liberate Equilibrium Propagation (EP) from the limit of infinitesimal perturbations by establishing a finite-nudge foundation for local credit assignment. By modeling network states as Gibbs-Boltzmann distributions rather than deterministic points, we prove that the gradient of the difference in Helmholtz free energy between a nudged and free phase is exactly the difference in expected local energy derivatives. This validates the classic Contrastive Hebbian Learning update as an exact gradient estimator for arbitrary finite nudging, requiring neither infinitesimal approximations nor convexity. Furthermore, we derive a generalized EP algorithm based on the path integral of loss-energy covariances, enabling learning with strong error signals that standard infinitesimal approximations cannot support.
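The central identity stated in the abstract above can be written out under simplifying assumptions. The notation is illustrative and not taken from the paper: $E(s,\theta)$ is the energy, $L(s)$ is a loss assumed not to depend on $\theta$, $\beta$ is a finite nudging strength, $T$ is the temperature, and $p_\beta$ is the Gibbs-Boltzmann distribution of the nudged system.

```latex
\begin{aligned}
p_\beta(s) &\propto \exp\!\left(-\frac{E(s,\theta) + \beta L(s)}{T}\right),
\qquad
F_\beta(\theta) = -T \log \int \exp\!\left(-\frac{E(s,\theta) + \beta L(s)}{T}\right) ds, \\
\nabla_\theta F_\beta(\theta) &= \mathbb{E}_{s \sim p_\beta}\!\left[\nabla_\theta E(s,\theta)\right], \\
\nabla_\theta \left(F_\beta - F_0\right) &= \mathbb{E}_{s \sim p_\beta}\!\left[\nabla_\theta E(s,\theta)\right]
- \mathbb{E}_{s \sim p_0}\!\left[\nabla_\theta E(s,\theta)\right].
\end{aligned}
```

The second line follows from differentiating the log-partition function, and the last line holds for any finite $\beta$, which matches the abstract's claim that the contrastive update needs no infinitesimal-nudge limit.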

[LG-88] Distance-based Learning of Hypertrees

Link: https://arxiv.org/abs/2511.22014
Authors: Shaun Fallat,Kamyar Khodamoradi,David Kirkpatrick,Valerii Maliuk,S. Ahmad Mojallal,Sandra Zilles
Categories: Machine Learning (cs.LG)

Abstract:We study the problem of learning hypergraphs with shortest-path queries (SP-queries), and present the first provably optimal online algorithm for a broad and natural class of hypertrees that we call orderly hypertrees. Our online algorithm can be transformed into a provably optimal offline algorithm. Orderly hypertrees can be positioned within the Fagin hierarchy of acyclic hypergraph (well-studied in database theory), and strictly encompass the broadest class in this hierarchy that is learnable with subquadratic SP-query complexity. Recognizing that in some contexts, such as evolutionary tree reconstruction, distance measurements can degrade with increased distance, we also consider a learning model that uses bounded distance queries. In this model, we demonstrate asymptotically tight complexity bounds for learning general hypertrees.

[LG-89] MOTIF-RF: Multi-template On-chip Transformer Synthesis Incorporating Frequency-domain Self-transfer Learning for RFIC Design Automation

Link: https://arxiv.org/abs/2511.21970
Authors: Houbo He,Yizhou Xu,Lei Xia,Yaolong Hu,Fan Cai,Taiyun Chi
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: Accepted at ASP-DAC 2026

Abstract:This paper presents a systematic study on developing multi-template machine learning (ML) surrogate models and applying them to the inverse design of transformers (XFMRs) in radio-frequency integrated circuits (RFICs). Our study starts with benchmarking four widely used ML architectures, including MLP-, CNN-, UNet-, and GT-based models, using the same datasets across different XFMR topologies. To improve modeling accuracy beyond these baselines, we then propose a new frequency-domain self-transfer learning technique that exploits correlations between adjacent frequency bands, leading to around 30%-50% accuracy improvement in the S-parameters prediction. Building on these models, we further develop an inverse design framework based on the covariance matrix adaptation evolutionary strategy (CMA-ES) algorithm. This framework is validated using multiple impedance-matching tasks, all demonstrating fast convergence and trustworthy performance. These results advance the goal of AI-assisted specs-to-GDS automation for RFICs and provide RFIC designers with actionable tools for integrating AI into their workflows.

[LG-90] CTR Prediction on Alibaba's Taobao Advertising Dataset Using Traditional and Deep Learning Models

Link: https://arxiv.org/abs/2511.21963
Authors: Hongyu Yang,Chunxi Wen,Jiyin Zhang,Nanfei Shen,Shijiao Zhang,Xiyan Han
Categories: Machine Learning (cs.LG)

Abstract:Click-through rate (CTR) prediction is critical in modern advertising systems, where ranking relevance and user engagement directly impact platform efficiency and business value. In this project, we explore how to model CTR more effectively using a large-scale Taobao dataset released by Alibaba. We start with supervised learning models, including logistic regression and LightGBM, that are trained on static features such as user demographics, ad attributes, and contextual metadata. These models provide fast, interpretable benchmarks, but have limited capability to capture the behavioral patterns that drive clicks. To better model user intent, we combined behavioral data from hundreds of millions of interactions over a 22-day period. By extracting and encoding user action sequences, we construct representations of user interests over time. We use deep learning models to fuse behavioral embeddings with static features. Among them, multilayer perceptrons (MLPs) have achieved significant performance improvements. To capture temporal dynamics, we designed a Transformer-based architecture that uses a self-attention mechanism to learn contextual dependencies across behavioral sequences, modeling not only what the user interacts with, but also the timing and frequency of interactions. The Transformer improves AUC by 2.81% over the baseline (LR model), with the largest gains observed for users whose interests are diverse or change over time. In addition to modeling, we propose an A/B testing strategy for real-world evaluation. We also consider the broader implications: personalized ad targeting technology can be applied to public health scenarios to achieve precise delivery of health information or behavior guidance. Our research provides a roadmap for advancing click-through rate prediction and extending its value beyond e-commerce.

[LG-91] Deep Learning Architectures for Code-Modulated Visual Evoked Potentials Detection

Link: https://arxiv.org/abs/2511.21940
Authors: Kiran Nair,Hubert Cecotti
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Neurons and Cognition (q-bio.NC)
Comments: 20 Pages, prepared for a Journal

Abstract:Non-invasive Brain-Computer Interfaces (BCIs) based on Code-Modulated Visual Evoked Potentials (C-VEPs) require highly robust decoding methods to address temporal variability and session-dependent noise in EEG signals. This study proposes and evaluates several deep learning architectures, including convolutional neural networks (CNNs) for 63-bit m-sequence reconstruction and classification, and Siamese networks for similarity-based decoding, alongside canonical correlation analysis (CCA) baselines. EEG data were recorded from 13 healthy adults under single-target flicker stimulation. The proposed deep models significantly outperformed traditional approaches, with distance-based decoding using Earth Mover’s Distance (EMD) and constrained EMD showing greater robustness to latency variations than Euclidean and Mahalanobis metrics. Temporal data augmentation with small shifts further improved generalization across sessions. Among all models, the multi-class Siamese network achieved the best overall performance with an average accuracy of 96.89%, demonstrating the potential of data-driven deep architectures for reliable, single-trial C-VEP decoding in adaptive non-invasive BCI systems.
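To make the "63-bit m-sequence" concrete: a maximal-length sequence of length 63 = 2^6 - 1 can be produced by a 6-bit linear-feedback shift register. The sketch below is only an illustration. The feedback polynomial x^6 + x + 1, the seed, and the function names are assumptions made here, not details taken from the paper.

```python
def m_sequence(n_bits=6, taps=(0, 1), seed=1):
    """Return one period (2**n_bits - 1 bits) of an LFSR sequence.

    With n_bits=6 and taps=(0, 1) the recurrence is a[n+6] = a[n] XOR a[n+1],
    which corresponds to the primitive polynomial x^6 + x + 1 (assumed choice).
    """
    state = [(seed >> i) & 1 for i in range(n_bits)]
    bits = []
    for _ in range(2 ** n_bits - 1):
        bits.append(state[0])
        feedback = 0
        for t in taps:
            feedback ^= state[t]
        state = state[1:] + [feedback]
    return bits


def circular_autocorrelation(bits, lag):
    """Autocorrelation of the +/-1 version of the code at a circular lag."""
    signed = [1 if b else -1 for b in bits]
    n = len(signed)
    return sum(signed[i] * signed[(i + lag) % n] for i in range(n))


code = m_sequence()
```

The sharp autocorrelation peak at lag 0, with a flat value at every other circular lag, is the property that makes such codes attractive as visual stimulation patterns and explains why decoding is sensitive to latency shifts.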

[LG-92] Breaking Algorithmic Collusion in Human-AI Ecosystems

Link: https://arxiv.org/abs/2511.21935
Authors: Natalie Collina,Eshwar Ram Arunachaleswaran,Meena Jagadeesan
Subjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)

Abstract:AI agents are increasingly deployed in ecosystems where they repeatedly interact not only with each other but also with humans. In this work, we study these human-AI ecosystems from a theoretical perspective, focusing on the classical framework of repeated pricing games. In our stylized model, the AI agents play equilibrium strategies, and one or more humans manually perform the pricing task instead of adopting an AI agent, thereby defecting to a no-regret strategy. Motivated by how populations of AI agents can sustain supracompetitive prices, we investigate whether high prices persist under such defections. Our main finding is that even a single human defection can destabilize collusion and drive down prices, and multiple defections push prices even closer to competitive levels. We further show how the nature of collusion changes under defection-aware AI agents. Taken together, our results characterize when algorithmic collusion is fragile–and when it persists–in mixed ecosystems of AI agents and humans.

[LG-93] Modeling Quantum Autoencoder Trainable Kernel for IoT Anomaly Detection

Link: https://arxiv.org/abs/2511.21932
Authors: Swathi Chandrasekhar,Shiva Raj Pokhrel,Swati Kumari,Navneet Singh
Subjects: Machine Learning (cs.LG); Quantum Physics (quant-ph)

Abstract:Escalating cyber threats and the high-dimensional complexity of IoT traffic have outpaced classical anomaly detection methods. While deep learning offers improvements, computational bottlenecks limit real-time deployment at scale. We present a quantum autoencoder (QAE) framework that compresses network traffic into discriminative latent representations and employs quantum support vector classification (QSVC) for intrusion detection. Evaluated on three datasets, our approach achieves improved accuracy on ideal simulators and on IBM Quantum hardware, demonstrating practical quantum advantage on current NISQ devices. Crucially, moderate depolarizing noise acts as implicit regularization, stabilizing training and enhancing generalization. This work establishes quantum machine learning as a viable, hardware-ready solution for real-world cybersecurity challenges.

[LG-94] Multi-Modal Machine Learning for Early Trust Prediction in Human-AI Interaction Using Face Image and GSR Bio Signals

Link: https://arxiv.org/abs/2511.21908
Authors: Hamid Shamszare,Avishek Choudhury
Subjects: Machine Learning (cs.LG)

Abstract:Predicting human trust in AI systems is crucial for safe integration of AI-based decision support tools, especially in healthcare. This study proposes a multi-modal machine learning framework that combines image and galvanic skin response (GSR) data to predict early user trust in AI- or human-generated recommendations in a simulated ADHD mHealth context. Facial video data were processed using OpenCV for frame extraction and transfer learning with a pre-trained transformer model to derive emotional features. Concurrently, GSR signals were decomposed into tonic and phasic components to capture physiological arousal patterns. Two temporal windows were defined for trust prediction: the Early Detection Window (6 to 3 seconds before decision-making) and the Proximal Detection Window (3 to 0 seconds before decision-making). For each window, trust prediction was conducted separately using image-based, GSR-based, and multimodal (image + GSR) features. Each modality was analyzed using machine learning algorithms, and the top-performing unimodal models were integrated through a multimodal stacking ensemble for final prediction. Experimental results showed that combining facial and physiological cues significantly improved prediction performance. The multimodal stacking framework achieved an accuracy of 0.83, F1-score of 0.88, and ROC-AUC of 0.87 in the Early Detection Window, and an accuracy of 0.75, F1-score of 0.82, and ROC-AUC of 0.66 in the Proximal Detection Window. These results demonstrate the potential of bio signals as real-time, objective markers of user trust, enabling adaptive AI systems that dynamically adjust their responses to maintain calibrated trust, which is a critical capability in mental health applications where mis-calibrated trust can affect diagnostic and treatment outcomes.

[LG-95] Beyond Atoms: Evaluating Electron Density Representation for 3D Molecular Learning

Link: https://arxiv.org/abs/2511.21900
Authors: Patricia Suriana,Joshua A. Rackers,Ewa M. Nowara,Pedro O. Pinheiro,John M. Nicoloudis,Vishnu Sresht
Subjects: Machine Learning (cs.LG); Biomolecules (q-bio.BM)

Abstract:Machine learning models for 3D molecular property prediction typically rely on atom-based representations, which may overlook subtle physical information. Electron density maps – the direct output of X-ray crystallography and cryo-electron microscopy – offer a continuous, physically grounded alternative. We compare three voxel-based input types for 3D convolutional neural networks (CNNs): atom types, raw electron density, and density gradient magnitude, across two molecular tasks – protein-ligand binding affinity prediction (PDBbind) and quantum property prediction (QM9). We focus on voxel-based CNNs because electron density is inherently volumetric, and voxel grids provide the most natural representation for both experimental and computed densities. On PDBbind, all representations perform similarly with full data, but in low-data regimes, density-based inputs outperform atom types, while a shape-based baseline performs comparably – suggesting that spatial occupancy dominates this task. On QM9, where labels are derived from Density Functional Theory (DFT) but input densities from a lower-level method (XTB), density-based inputs still outperform atom-based ones at scale, reflecting the rich structural and electronic information encoded in density. Overall, these results highlight the task- and regime-dependent strengths of density-derived inputs, improving data efficiency in affinity prediction and accuracy in quantum property modeling.

[LG-96] Breaking the Illusion: Consensus-Based Generative Mitigation of Adversarial Illusions in Multi-Modal Embeddings

Link: https://arxiv.org/abs/2511.21893
Authors: Fatemeh Akbarian,Anahita Baninajjar,Yingyi Zhang,Ananth Balashankar,Amir Aminifar
Subjects: Machine Learning (cs.LG)

Abstract:Multi-modal foundation models align images, text, and other modalities in a shared embedding space but remain vulnerable to adversarial illusions (Zhang et al., 2025), where imperceptible perturbations disrupt cross-modal alignment and mislead downstream tasks. To counteract the effects of adversarial illusions, we propose a task-agnostic mitigation mechanism that reconstructs the input from the attacker’s perturbed input through generative models, e.g., Variational Autoencoders (VAEs), to maintain natural alignment. To further enhance our proposed defense mechanism, we adopt a generative sampling strategy combined with a consensus-based aggregation scheme over the outcomes of the generated samples. Our experiments on the state-of-the-art multi-modal encoders show that our approach substantially reduces the illusion attack success rates to near-zero and improves cross-modal alignment by 4% (42 to 46) and 11% (32 to 43) in unperturbed and perturbed input settings respectively, providing an effective and model-agnostic defense against adversarial illusions.

[LG-97] Exploring Fusion Strategies for Multimodal Vision-Language Systems

Link: https://arxiv.org/abs/2511.21889
Authors: Regan Willis,Jason Bakos
Subjects: Machine Learning (cs.LG)

Abstract:Modern machine learning models often combine multiple input streams of data to more accurately capture the information that informs their decisions. In multimodal machine learning, choosing the strategy for fusing data together requires careful consideration of the application’s accuracy and latency requirements, as fusing the data at earlier or later stages in the model architecture can lead to performance changes in accuracy and latency. To demonstrate this tradeoff, we investigate different fusion strategies using a hybrid BERT and vision network framework that integrates image and text data. We explore two different vision networks: MobileNetV2 and ViT. We propose three models for each vision network, which fuse data at late, intermediate, and early stages in the architecture. We evaluate the proposed models on the CMU MOSI dataset and benchmark their latency on an NVIDIA Jetson Orin AGX. Our experimental results demonstrate that while late fusion yields the highest accuracy, early fusion offers the lowest inference latency. We describe the three proposed model architectures and discuss the accuracy and latency tradeoffs, concluding that data fusion earlier in the model architecture results in faster inference times at the cost of accuracy.

[LG-98] Physically Interpretable Representation Learning with Gaussian Mixture Variational AutoEncoder (GM-VAE)

Link: https://arxiv.org/abs/2511.21883
Authors: Tiffany Fan,Murray Cutforth,Marta D’Elia,Alexandre Cortiella,Alireza Doostan,Eric Darve
Subjects: Machine Learning (cs.LG)

Abstract:Extracting compact, physically interpretable representations from high-dimensional scientific data is a persistent challenge due to the complex, nonlinear structures inherent in physical systems. We propose a Gaussian Mixture Variational Autoencoder (GM-VAE) framework designed to address this by integrating an Expectation-Maximization (EM)-inspired training scheme with a novel spectral interpretability metric. Unlike conventional VAEs that jointly optimize reconstruction and clustering (often leading to training instability), our method utilizes a block-coordinate descent strategy, alternating between expectation and maximization steps. This approach stabilizes training and naturally aligns latent clusters with distinct physical regimes. To objectively evaluate the learned representations, we introduce a quantitative metric based on graph-Laplacian smoothness, which measures the coherence of physical quantities across the latent manifold. We demonstrate the efficacy of this framework on datasets of increasing complexity: surface reaction ODEs, Navier-Stokes wake flows, and experimental laser-induced combustion Schlieren images. The results show that our GM-VAE yields smooth, physically consistent manifolds and accurate regime clustering, offering a robust data-driven tool for interpreting turbulent and reactive flow systems.
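One way to read "graph-Laplacian smoothness" is as a normalised Dirichlet energy of a physical quantity over a neighbourhood graph built in latent space. The sketch below illustrates that reading only. The unweighted k-nearest-neighbour graph, the mean-centring, and the function names are assumptions made here, and the paper's actual metric may be defined differently.

```python
import math


def knn_edges(points, k):
    """Undirected k-nearest-neighbour edges (i, j) with i < j."""
    edges = set()
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        for _, j in dists[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges


def laplacian_smoothness(points, values, k=2):
    """Return f^T L f / f^T f, where f is the mean-centred quantity.

    For an unweighted graph Laplacian L, f^T L f is the sum over edges
    of (f_i - f_j)^2, so lower scores mean smoother variation.
    """
    mean = sum(values) / len(values)
    f = [v - mean for v in values]
    energy = sum((f[i] - f[j]) ** 2 for i, j in knn_edges(points, k))
    norm = sum(x * x for x in f)
    return energy / norm if norm > 0 else 0.0


latent = [(float(i),) for i in range(8)]           # toy 1D latent codes
smooth = laplacian_smoothness(latent, [float(i) for i in range(8)])
rough = laplacian_smoothness(latent, [float(i % 2) for i in range(8)])
```

In this toy case a quantity that varies monotonically along the latent axis scores lower (smoother) than one that alternates between neighbouring points.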

[LG-99] Differential privacy from axioms

Link: https://arxiv.org/abs/2511.21876
Authors: Guy Blanc,William Pires,Toniann Pitassi
Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)

Abstract:Differential privacy (DP) is the de facto notion of privacy both in theory and in practice. However, despite its popularity, DP imposes strict requirements which guard against strong worst-case scenarios. For example, it guards against seemingly unrealistic scenarios where an attacker has full information about all but one point in the data set, and still nothing can be learned about the remaining point. While preventing such a strong attack is desirable, many works have explored whether average-case relaxations of DP are easier to satisfy [HWR13,WLF16,BF16,LWX23]. In this work, we are motivated by the question of whether alternate, weaker notions of privacy are possible: can a weakened privacy notion still guarantee some basic level of privacy, and on the other hand, achieve privacy more efficiently and/or for a substantially broader set of tasks? Our main result shows the answer is no: even in the statistical setting, any reasonable measure of privacy satisfying nontrivial composition is equivalent to DP. To prove this, we identify a core set of four axioms or desiderata: pre-processing invariance, prohibition of blatant non-privacy, strong composition, and linear scalability. Our main theorem shows that any privacy measure satisfying our axioms is equivalent to DP, up to polynomial factors in sample complexity. We complement this result by showing our axioms are minimal: removing any one of our axioms enables ill-behaved measures of privacy.
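For reference, the worst-case requirement the abstract refers to is the standard textbook definition of differential privacy. The notation below is the usual one and is not taken from the paper: a randomized algorithm $\mathcal{M}$ is $(\varepsilon, \delta)$-differentially private if

```latex
\Pr[\mathcal{M}(S) \in E] \;\le\; e^{\varepsilon}\,\Pr[\mathcal{M}(S') \in E] + \delta
\quad \text{for all adjacent datasets } S, S' \text{ and all measurable events } E.
```

Because the inequality must hold for every adjacent pair, it covers the scenario in which the attacker already knows all but one data point.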

[LG-100] Lightweight ML-Based Air Quality Prediction for IoT and Embedded Applications

Link: https://arxiv.org/abs/2511.21857
Authors: Md. Sad Abdullah Sami,Mushfiquzzaman Abid
Subjects: Machine Learning (cs.LG)

Abstract:This study investigates the effectiveness and efficiency of two variants of the XGBoost regression model, the full-capacity and lightweight (tiny) versions, for predicting the concentrations of carbon monoxide (CO) and nitrogen dioxide (NO2). Using the AirQualityUCI dataset collected over one year in an urban environment, we conducted a comprehensive evaluation based on widely accepted metrics, including Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Mean Bias Error (MBE), and the coefficient of determination (R2). In addition, we assessed resource-oriented metrics such as inference time, model size, and peak RAM usage. The full XGBoost model achieved superior predictive accuracy for both pollutants, while the tiny model, though slightly less precise, offered substantial computational benefits with significantly reduced inference time and model storage requirements. These results demonstrate the feasibility of deploying simplified models in resource-constrained environments without compromising predictive quality. This makes the tiny XGBoost model suitable for real-time air-quality monitoring in IoT and embedded applications.
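The four accuracy metrics named in the abstract have standard definitions, sketched below in plain Python. The function name and the sign convention for MBE (prediction minus observation) are assumptions made here, and the numbers are toy values, not results from the paper.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, MBE and R^2 for paired observations and predictions."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                    # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)     # total sum of squares
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        "mbe": sum(errors) / n,   # positive means over-prediction on average
        "r2": 1.0 - ss_res / ss_tot,
    }


metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
```

MBE can be zero even when MAE is not, because positive and negative errors cancel, which is why the two are usually reported together.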

[LG-101] Massively Parallel Imitation Learning of Mouse Forelimb Musculoskeletal Reaching Dynamics NEURIPS2025

Link: https://arxiv.org/abs/2511.21848
Authors: Eric Leonardis,Akira Nagamori,Ayesha Thanawalla,Yuanjia Yang,Joshua Park,Hutton Saunders,Eiman Azim,Talmo Pereira
Subjects: Machine Learning (cs.LG); Robotics (cs.RO); Neurons and Cognition (q-bio.NC); Quantitative Methods (q-bio.QM)
Comments: Accepted at NeurIPS 2025 Workshop Data on the Brain Mind: Concrete Applications of AI to Neuroscience and Cognitive Science. 12 pages, 4 figures

Abstract:The brain has evolved to effectively control the body, and in order to understand the relationship we need to model the sensorimotor transformations underlying embodied control. As part of a coordinated effort, we are developing a general-purpose platform for behavior-driven simulation modeling high fidelity behavioral dynamics, biomechanics, and neural circuit architectures underlying embodied control. We present a pipeline for taking kinematics data from the neuroscience lab and recapitulating those natural movements in a biomechanical model. We implement an imitation learning framework to perform a dexterous forelimb reaching task with a musculoskeletal model in a simulated physics environment. The mouse arm model is currently training at faster than 1 million training steps per second due to GPU acceleration with JAX and Mujoco-MJX. We present results that indicate that adding naturalistic constraints on energy and velocity leads to simulated musculoskeletal activity that better predicts real EMG signals. This work provides evidence to suggest that energy and control constraints are critical to modeling musculoskeletal motor control.

[LG-102] Unsupervised Anomaly Detection for Smart IoT Devices: Performance and Resource Comparison

Link: https://arxiv.org/abs/2511.21842
Authors: Md. Sad Abdullah Sami,Mushfiquzzaman Abid
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)

Abstract:The rapid expansion of Internet of Things (IoT) deployments across diverse sectors has significantly enhanced operational efficiency, yet concurrently elevated cybersecurity vulnerabilities due to increased exposure to cyber threats. Given the limitations of traditional signature-based Anomaly Detection Systems (ADS) in identifying emerging and zero-day threats, this study investigates the effectiveness of two unsupervised anomaly detection techniques, Isolation Forest (IF) and One-Class Support Vector Machine (OC-SVM), using the TON_IoT thermostat dataset. A comprehensive evaluation was performed based on standard metrics (accuracy, precision, recall, and F1-score) alongside critical resource utilization metrics such as inference time, model size, and peak RAM usage. Experimental results revealed that IF consistently outperformed OC-SVM, achieving higher detection accuracy, superior precision, and recall, along with a significantly better F1-score. Furthermore, Isolation Forest demonstrated a markedly superior computational footprint, making it more suitable for deployment on resource-constrained IoT edge devices. These findings underscore Isolation Forest’s robustness in high-dimensional and imbalanced IoT environments and highlight its practical viability for real-time anomaly detection.

[LG-103] Beyond Membership: Limitations of Add/Remove Adjacency in Differential Privacy

Link: https://arxiv.org/abs/2511.21804
Authors: Gauri Pradhan,Joonas Jälkö,Santiago Zanella-Bèguelin,Antti Honkela
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 17 pages, 11 figures

Abstract:Training machine learning models with differential privacy (DP) limits an adversary’s ability to infer sensitive information about the training data. It can be interpreted as a bound on the adversary’s ability to distinguish two datasets that are adjacent under a chosen adjacency relation. In practice, most DP implementations use the add/remove adjacency relation, where two datasets are adjacent if one can be obtained from the other by adding or removing a single record, thereby protecting membership. In many ML applications, however, the goal is to protect attributes of individual records (e.g., labels used in supervised fine-tuning). We show that privacy accounting under add/remove overstates attribute privacy compared to accounting under the substitute adjacency relation, which permits substituting one record. To demonstrate this gap, we develop novel attacks to audit DP under substitute adjacency, and show empirically that audit results are inconsistent with DP guarantees reported under add/remove, yet remain consistent with the budget accounted under the substitute adjacency relation. Our results highlight that the choice of adjacency when reporting DP guarantees is critical when the protection target is per-record attributes rather than membership.
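The difference between the two adjacency relations can be seen on a toy labelled dataset. In the sketch below, datasets are treated as multisets of records. The helper names and the example records are hypothetical and are not taken from the paper.

```python
from collections import Counter


def add_remove_distance(a, b):
    """Number of single-record additions or removals needed to turn a into b."""
    ca, cb = Counter(a), Counter(b)
    # Counter subtraction keeps only positive counts, so this sums the
    # sizes of both one-sided multiset differences.
    return sum(((ca - cb) + (cb - ca)).values())


def is_add_remove_adjacent(a, b):
    return add_remove_distance(a, b) == 1


def is_substitute_adjacent(a, b):
    return len(a) == len(b) and add_remove_distance(a, b) == 2


data = [("x1", 0), ("x2", 1), ("x3", 1)]
label_flipped = [("x1", 0), ("x2", 0), ("x3", 1)]   # one label changed
one_removed = data[:2]                               # one record dropped
```

Flipping a single label is one substitution but two add/remove steps (remove the old record, add the new one). By the standard group-privacy argument, a pure ε guarantee under add/remove therefore only implies 2ε for such a change, which is consistent with the abstract's claim that add/remove accounting overstates attribute privacy.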

[LG-104] The Double-Edged Nature of the Rashomon Set for Trustworthy Machine Learning

Link: https://arxiv.org/abs/2511.21799
Authors: Ethan Hsu,Harry Chen,Chudi Zhong,Lesia Semenova
Subjects: Machine Learning (cs.LG)

Abstract:Real-world machine learning (ML) pipelines rarely produce a single model; instead, they produce a Rashomon set of many near-optimal ones. We show that this multiplicity reshapes key aspects of trustworthiness. At the individual-model level, sparse interpretable models tend to preserve privacy but are fragile to adversarial attacks. In contrast, the diversity within a large Rashomon set enables reactive robustness: even when an attack breaks one model, others often remain accurate. Rashomon sets are also stable under small distribution shifts. However, this same diversity increases information leakage, as disclosing more near-optimal models provides an attacker with progressively richer views of the training data. Through theoretical analysis and empirical studies of sparse decision trees and linear models, we characterize this robustness-privacy trade-off and highlight the dual role of Rashomon sets as both a resource and a risk for trustworthy ML.

[LG-105] Multiclass threshold-based classification and model evaluation

Link: https://arxiv.org/abs/2511.21794
Authors: Edoardo Legnaro,Sabrina Guastavino,Francesco Marchetti
Subjects: Machine Learning (cs.LG)
Comments: arXiv admin note: substantial text overlap with arXiv:2505.11276

Abstract:In this paper, we introduce a threshold-based framework for multiclass classification that generalizes the standard argmax rule. This is done by replacing the probabilistic interpretation of softmax outputs with a geometric one on the multidimensional simplex, where the classification depends on a multidimensional threshold. This change of perspective enables, for any trained classification network, an a posteriori optimization of the classification score by means of threshold tuning, as usually carried out in the binary setting, thus allowing for a further refinement of the prediction capability of any network. Our experiments show indeed that multidimensional threshold tuning yields performance improvements across various networks and datasets. Moreover, we derive a multiclass ROC analysis based on ROC clouds – the attainable (FPR,TPR) operating points induced by a single multiclass threshold – and summarize them via a Distance From Point (DFP) score to (0,1). This yields a coherent alternative to standard One-vs-Rest (OvR) curves and aligns with the observed tuning gains.
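A rough sketch of the two ideas follows. The decision rule used here (pick the class whose probability exceeds its threshold coordinate by the largest margin) is only one simple rule that reduces to argmax when the threshold is the simplex barycentre. It is an assumption made for illustration and not necessarily the paper's construction. The DFP helper computes the Euclidean distance from an operating point to the ideal corner (0,1), as the abstract describes.

```python
import math


def threshold_predict(probs, tau):
    """Assumed rule: choose the class maximising probs[i] - tau[i]."""
    scores = [p - t for p, t in zip(probs, tau)]
    return max(range(len(scores)), key=scores.__getitem__)


def dfp(fpr, tpr):
    """Distance from an (FPR, TPR) operating point to the ideal point (0, 1)."""
    return math.hypot(fpr, 1.0 - tpr)


probs = [0.5, 0.3, 0.2]
argmax_like = threshold_predict(probs, [1 / 3, 1 / 3, 1 / 3])   # barycentre threshold
shifted = threshold_predict(probs, [0.6, 0.2, 0.2])             # tuned threshold
```

Moving the threshold away from the barycentre changes the predicted class for the same softmax output, which is the degree of freedom that threshold tuning exploits after training.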

[LG-106] Dynamical Implicit Neural Representations

Link: https://arxiv.org/abs/2511.21787
Authors: Yesom Park,Kelvin Kan,Thomas Flynn,Yi Huang,Shinjae Yoo,Stanley Osher,Xihaier Luo
Subjects: Machine Learning (cs.LG)
Comments: 23 pages, 6 figures

Abstract:Implicit Neural Representations (INRs) provide a powerful continuous framework for modeling complex visual and geometric signals, but spectral bias remains a fundamental challenge, limiting their ability to capture high-frequency details. Orthogonal to existing remedy strategies, we introduce Dynamical Implicit Neural Representations (DINR), a new INR modeling framework that treats feature evolution as a continuous-time dynamical system rather than a discrete stack of layers. This dynamical formulation mitigates spectral bias by enabling richer, more adaptive frequency representations through continuous feature evolution. Theoretical analysis based on Rademacher complexity and the Neural Tangent Kernel demonstrates that DINR enhances expressivity and improves training dynamics. Moreover, regularizing the complexity of the underlying dynamics provides a principled way to balance expressivity and generalization. Extensive experiments on image representation, field reconstruction, and data compression confirm that DINR delivers more stable convergence, higher signal fidelity, and stronger generalization than conventional static INRs.

[LG-107] Physics-Informed Spiking Neural Networks via Conservative Flux Quantization

Link: https://arxiv.org/abs/2511.21784
Authors: Chi Zhang,Lin Wang
Subjects: Machine Learning (cs.LG)

Abstract:Real-time, physically-consistent prediction on low-power edge devices is critical for next-generation embodied AI systems, yet it remains a major challenge. Physics-Informed Neural Networks (PINNs) combine data-driven learning with physics-based constraints to ensure the model’s predictions are consistent with the underlying physics. However, PINNs are energy-intensive and struggle to strictly enforce physical conservation laws. Brain-inspired spiking neural networks (SNNs) have emerged as a promising solution for edge computing and real-time processing. However, naively converting PINNs to SNNs degrades physical fidelity and fails to address long-term generalization issues. To this end, this paper introduces a novel Physics-Informed Spiking Neural Network (PISNN) framework. Importantly, to ensure strict physical conservation, we design the Conservative Leaky Integrate-and-Fire (C-LIF) neuron, whose dynamics structurally guarantee local mass preservation. To achieve robust temporal generalization, we introduce a novel Conservative Flux Quantization (CFQ) strategy, which redefines neural spikes as discrete packets of physical flux. Our CFQ learns a time-invariant physical evolution operator, enabling the PISNN to become a general-purpose solver – conservative-by-construction. Extensive experiments show that our PISNN excels on diverse benchmarks. For both the canonical 1D heat equation and the more challenging 2D Laplace’s Equation, it accurately simulates the system dynamics while maintaining perfect mass conservation by design – a feat that is challenging for conventional PINNs. This work establishes a robust framework for fusing the rigor of scientific computing with the efficiency of neuromorphic engineering, paving the way for complex, long-term, and energy-efficient physics predictions for intelligent systems.
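The idea of treating spikes as "discrete packets of physical flux" can be illustrated with a toy 1D diffusion update in which mass only moves between neighbouring cells in whole packets. This is not the paper's C-LIF neuron or CFQ operator. The update rule, the coefficient `alpha`, and the function name are assumptions chosen to show why such a scheme conserves mass by construction.

```python
def conservative_step(cells, alpha=0.25):
    """One synchronous update of integer packet counts on a closed 1D grid.

    The flux across each interface is truncated to a whole number of
    packets. Whatever leaves cell i enters cell i + 1, so the total
    number of packets cannot change.
    """
    fluxes = [
        int(alpha * (cells[i] - cells[i + 1])) for i in range(len(cells) - 1)
    ]
    updated = list(cells)
    for i, packets in enumerate(fluxes):
        updated[i] -= packets
        updated[i + 1] += packets
    return updated


state = [64, 0, 0, 0, 0, 0, 0, 0]   # all mass starts in the first cell
for _ in range(100):
    state = conservative_step(state)
```

Conservation here does not depend on how accurately the flux is estimated. Any error in the flux only changes where the packets go, never how many there are.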

[LG-108] Artificial intelligence for methane detection: from continuous monitoring to verified mitigation

Link: https://arxiv.org/abs/2511.21777
Authors: Anna Allen,Gonzalo Mateo-Garcia,Itziar Irakulis-Loitxate,Manuel Montesino-San Martin,Marc Watine,James Requeima,Javier Gorroño,Cynthia Randles,Tharwat Mokalled,Luis Guanter,Richard E. Turner,Claudio Cifarelli,Manfredi Caltagirone
Subjects: Machine Learning (cs.LG)

Abstract:Methane is a potent greenhouse gas, responsible for roughly 30% of warming since pre-industrial times. A small number of large point sources account for a disproportionate share of emissions, creating an opportunity for substantial reductions by targeting relatively few sites. Detection and attribution of large emissions at scale for notification to asset owners remains challenging. Here, we introduce MARS-S2L, a machine learning model that detects methane emissions in publicly available multispectral satellite imagery. Trained on a manually curated dataset of over 80,000 images, the model provides high-resolution detections every two days, enabling facility-level attribution and identifying 78% of plumes with an 8% false positive rate at 697 previously unseen sites. Deployed operationally, MARS-S2L has issued 1,015 notifications to stakeholders in 20 countries, enabling verified, permanent mitigation of six persistent emitters, including a previously unknown site in Libya. These results demonstrate a scalable pathway from satellite detection to quantifiable methane mitigation.

[LG-109] OBLR-PO: A Theoretical Framework for Stable Reinforcement Learning

Link: https://arxiv.org/abs/2511.23310
Authors: Zixun Huang,Jiayi Sheng,Zeyu Zheng
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 19 pages, 7 figures

Abstract:Existing reinforcement learning (RL)-based post-training methods for large language models have advanced rapidly, yet their design has largely been guided by heuristics rather than systematic theoretical principles. This gap limits our understanding of the properties of the gradient estimators and the associated optimization algorithms, thereby constraining opportunities to improve training stability and overall performance. In this work, we provide a unified theoretical framework that characterizes the statistical properties of commonly used policy-gradient estimators under mild assumptions. Our analysis establishes unbiasedness, derives exact variance expressions, and yields an optimization-loss upper bound that enables principled reasoning about learning dynamics. Building on these results, we prove convergence guarantees and derive an adaptive learning-rate schedule governed by the signal-to-noise ratio (SNR) of gradients. We further show that the variance-optimal baseline is a gradient-weighted estimator, offering a new principle for variance reduction and naturally enhancing stability beyond existing methods. These insights motivate Optimal Baseline and Learning-Rate Policy Optimization (OBLR-PO), an algorithm that jointly adapts learning rates and baselines in a theoretically grounded manner. Experiments on Qwen3-4B-Base and Qwen3-8B-Base demonstrate consistent gains over existing policy optimization methods, validating that our theoretical contributions translate into practical improvements in large-scale post-training.
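For intuition about a "gradient-weighted" baseline: in the classical score-function estimator g = (R - b) ∇log π, the scalar baseline that minimises variance is b* = E[R‖∇log π‖²] / E[‖∇log π‖²], because only E[(R - b)²‖∇log π‖²] depends on b. The sketch below computes the sample version of that classical formula. Whether OBLR-PO uses exactly this form is not stated in the abstract, and the function names and toy numbers are assumptions.

```python
def gradient_weighted_baseline(rewards, grad_sq_norms):
    """Sample estimate of E[R * ||g||^2] / E[||g||^2]."""
    total = sum(grad_sq_norms)
    if total == 0:
        return sum(rewards) / len(rewards)   # fall back to the mean reward
    return sum(r * w for r, w in zip(rewards, grad_sq_norms)) / total


def variance_proxy(rewards, grad_sq_norms, baseline):
    """Mean of (R - b)^2 * ||g||^2, the baseline-dependent part of the variance."""
    n = len(rewards)
    return sum((r - baseline) ** 2 * w for r, w in zip(rewards, grad_sq_norms)) / n


rewards = [1.0, 0.0, 1.0, 0.0]
grad_sq_norms = [4.0, 1.0, 1.0, 2.0]          # toy squared score-function norms
b_star = gradient_weighted_baseline(rewards, grad_sq_norms)
b_mean = sum(rewards) / len(rewards)
```

On these toy numbers the weighted baseline (0.625) gives a smaller variance proxy than the plain mean reward (0.5), because samples with large gradient norms count for more.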

[LG-110] Nonstabilizerness Estimation using Graph Neural Networks

Link: https://arxiv.org/abs/2511.23224
Authors: Vincenzo Lipardi,Domenica Dibenedetto,Georgios Stamoulis,Evert van Nieuwenburg,Mark H.M. Winands
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)

Abstract:This article proposes a Graph Neural Network (GNN) approach to estimate nonstabilizerness in quantum circuits, measured by the stabilizer Rényi entropy (SRE). Nonstabilizerness is a fundamental resource for quantum advantage, and efficient SRE estimations are highly beneficial in practical applications. We address the nonstabilizerness estimation problem through three supervised learning formulations starting from easier classification tasks to the more challenging regression task. Experimental results show that the proposed GNN manages to capture meaningful features from the graph-based circuit representation, resulting in robust generalization performances achieved across diverse scenarios. In classification tasks, the GNN is trained on product states and generalizes on circuits evolved under Clifford operations, entangled states, and circuits with higher number of qubits. In the regression task, the GNN significantly improves the SRE estimation on out-of-distribution circuits with higher number of qubits and gate counts compared to previous work, for both random quantum circuits and structured circuits derived from the transverse-field Ising model. Moreover, the graph representation of quantum circuits naturally integrates hardware-specific information. Simulations on noisy quantum hardware highlight the potential of the proposed GNN to predict the SRE measured on quantum devices.

[LG-111] Asymptotic Theory and Phase Transitions for Variable Importance in Quantile Regression Forests

Link: https://arxiv.org/abs/2511.23212
Authors: Tomoshige Nakamura,Hiroshi Shiraishi
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)

Abstract:Quantile Regression Forests (QRF) are widely used for non-parametric conditional quantile estimation, yet statistical inference for variable importance measures remains challenging due to the non-smoothness of the loss function and the complex bias-variance trade-off. In this paper, we develop an asymptotic theory for variable importance defined as the difference in pinball loss risks. We first establish the asymptotic normality of the QRF estimator by handling the non-differentiable pinball loss via Knight’s identity. Second, we uncover a “phase transition” phenomenon governed by the subsampling rate $\beta$ (where $s \asymp n^\beta$). We prove that in the bias-dominated regime ($\beta \ge 1/2$), which corresponds to large subsample sizes typically favored in practice to maximize predictive accuracy, standard inference breaks down as the estimator converges to a deterministic bias constant rather than a zero-mean normal distribution. Finally, we derive the explicit analytic form of this asymptotic bias and discuss the theoretical feasibility of restoring valid inference via analytic bias correction. Our results highlight a fundamental trade-off between predictive performance and inferential validity, providing a theoretical foundation for understanding the intrinsic limitations of random forest inference in high-dimensional settings.
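The pinball (check) loss at quantile level τ is ρ_τ(u) = u(τ - 1{u < 0}), and an importance measure "defined as the difference in pinball loss risks" can be sketched as the empirical risk of predictions made without a variable minus the risk of predictions made with it. That direction of the difference, the function names, and the toy numbers below are assumptions for illustration. The paper's exact estimator is not reproduced here.

```python
def pinball_loss(y, q, tau):
    """Check loss rho_tau(y - q) for a predicted quantile q."""
    u = y - q
    return u * (tau - (1.0 if u < 0 else 0.0))


def pinball_risk(ys, qs, tau):
    """Empirical mean pinball loss over paired outcomes and predictions."""
    return sum(pinball_loss(y, q, tau) for y, q in zip(ys, qs)) / len(ys)


def importance(ys, q_full, q_reduced, tau):
    """Assumed definition: risk without the variable minus risk with it."""
    return pinball_risk(ys, q_reduced, tau) - pinball_risk(ys, q_full, tau)


under = pinball_loss(3.0, 1.0, 0.9)   # prediction too low, weighted by tau
over = pinball_loss(1.0, 3.0, 0.9)    # prediction too high, weighted by 1 - tau
gain = importance([3.0, 1.0], [3.0, 1.0], [1.0, 3.0], 0.9)
```

The kink of ρ_τ at zero is the non-differentiability that the abstract says is handled via Knight's identity.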

[LG-112] A PLS-Integrated LASSO Method with Application in Index Tracking

Link: https://arxiv.org/abs/2511.23205
Authors: Shiqin Tang, Yining Dong, S. Joe Qin
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)

Abstract:In traditional multivariate data analysis, dimension reduction and regression have been treated as distinct endeavors. Established techniques such as principal component regression (PCR) and partial least squares (PLS) regression traditionally compute latent components as intermediary steps – although with different underlying criteria – before proceeding with the regression analysis. In this paper, we introduce an innovative regression methodology named PLS-integrated Lasso (PLS-Lasso) that integrates the concept of dimension reduction directly into the regression process. We present two distinct formulations for PLS-Lasso, denoted as PLS-Lasso-v1 and PLS-Lasso-v2, along with clear and effective algorithms that ensure convergence to global optima. PLS-Lasso-v1 and PLS-Lasso-v2 are compared with Lasso on the task of financial index tracking and show promising results.

[LG-113] Machine learning for violence prediction: a systematic review and critical appraisal

Link: https://arxiv.org/abs/2511.23118
Authors: Stefaniya Kozhevnikova, Denis Yukhnenko, Giulio Scola, Seena Fazel
Categories: Methodology (stat.ME); Machine Learning (cs.LG)

Abstract:Purpose: To conduct a systematic review of machine learning models for predicting violent behaviour by synthesising and appraising their validity, usefulness, and performance. Methods: We systematically searched nine bibliographic databases and Google Scholar up to September 2025 for development and/or validation studies on machine learning methods for predicting all forms of violent behaviour. We synthesised the results by summarising discrimination and calibration performance statistics and evaluated study quality by examining risk of bias and clinical utility. Results: We identified 38 studies reporting the development and validation of 40 models. Most studies reported Area Under the Curve (AUC) as the discrimination statistic with a range of 0.68-0.99. Only eight studies reported calibration performance, and three studies reported external validation. 31 studies had a high risk of bias, mainly in the analysis domain, and three studies had low risk of bias. The overall clinical utility of violence prediction models is poor, as indicated by risks of overfitting due to small samples, lack of transparent reporting, and low generalisability. Conclusion: Although black box machine learning models currently have limited applicability in clinical settings, they may show promise for identifying high-risk individuals. We recommend five key considerations for violence prediction modelling: (i) ensuring methodological quality (e.g. following guidelines) and interdisciplinary collaborations; (ii) using black box algorithms only for highly complex data; (iii) incorporating dynamic predictions to allow for risk monitoring; (iv) developing more trustworthy algorithms using explainable methods; and (v) applying causal machine learning approaches where appropriate.
Cite as: arXiv:2511.23118 [stat.ME]; DOI: https://doi.org/10.48550/arXiv.2511.23118; submitted (v1) Fri, 28 Nov 2025 12:03:45 UTC.

[LG-114] Constraining dark matter halo profiles with symbolic regression

Link: https://arxiv.org/abs/2511.23073
Authors: Alicia Martín, Tariq Yasin, Deaglan J. Bartlett, Harry Desmond, Pedro G. Ferreira
Categories: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Astrophysics of Galaxies (astro-ph.GA); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
Comments: 18 pages, 5 figures. Accepted for publication in Philosophical Transactions of the Royal Society A

Abstract:Dark matter haloes are typically characterised by radial density profiles with fixed forms motivated by simulations (e.g. NFW). However, simulation predictions depend on uncertain dark matter physics and baryonic modelling. Here, we present a method to constrain halo density profiles directly from observations using Exhaustive Symbolic Regression (ESR), a technique that searches the space of analytic expressions for the function that best balances accuracy and simplicity for a given dataset. We test the approach on mock weak lensing excess surface density (ESD) data of synthetic clusters with NFW profiles. Motivated by real data, we assign each ESD data point a constant fractional uncertainty and vary this uncertainty and the number of clusters to probe how data precision and sample size affect model selection. For fractional errors around 5%, ESR recovers the NFW profile even from samples as small as 20 clusters. At higher uncertainties representative of current surveys, simpler functions are favoured over NFW, though it remains competitive. This preference arises because weak lensing errors are smallest in the outskirts, causing the fits to be dominated by the outer profile. ESR therefore provides a robust, simulation-independent framework both for testing mass models and determining which features of a halo’s density profile are genuinely constrained by the data.

[LG-115] Optical diffraction neural networks assisted computational ghost imaging through dynamic scattering media

Link: https://arxiv.org/abs/2511.22913
Authors: Yue-Gang Li, Ze Zheng, Jun-jie Wang, Ming He, Jianping Fan, Tailong Xiao, Guihua Zeng
Categories: Optics (physics.optics); Machine Learning (cs.LG)

Abstract:Ghost imaging leverages a single-pixel detector with no spatial resolution to acquire object echo intensity signals, which are correlated with illumination patterns to reconstruct an image. This architecture inherently mitigates scattering interference between the object and the detector but is sensitive to scattering between the light source and the object. To address this challenge, we propose an optical diffraction neural network (ODNN)-assisted ghost imaging method for imaging through dynamic scattering media. In our scheme, a set of fixed ODNNs, trained on simulated datasets, is incorporated into the experimental optical path to actively correct random distortions induced by dynamic scattering media. Experimental validation using rotating single-layer and double-layer ground glass confirms the feasibility and effectiveness of our approach. Furthermore, our scheme can also be combined with physics-prior-based reconstruction algorithms, enabling high-quality imaging under undersampled conditions. This work demonstrates a novel strategy for imaging through dynamic scattering media, which can be extended to other imaging systems.
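
The first sentence of this abstract describes the basic ghost imaging principle: single-pixel readings are correlated with the illumination patterns to recover an image. The sketch below illustrates only that correlation step on a made-up 1-D binary object with random binary patterns. It does not model scattering media or the ODNN correction, and every name in it is hypothetical.

```python
import random


def bucket_signal(obj, pattern):
    """Single-pixel ("bucket") reading: total light returned by the object."""
    return sum(o * p for o, p in zip(obj, pattern))


def reconstruct(patterns, buckets):
    """Per-pixel correlation <B * I(x)> - <B> * <I(x)> over all patterns."""
    m = len(patterns)
    n = len(patterns[0])
    mean_b = sum(buckets) / m
    image = []
    for x in range(n):
        mean_p = sum(p[x] for p in patterns) / m
        corr = sum(b * p[x] for b, p in zip(buckets, patterns)) / m
        image.append(corr - mean_b * mean_p)
    return image


def demo(num_patterns=4000, seed=0):
    """Simulate measurements of a toy object and reconstruct it."""
    rng = random.Random(seed)
    obj = [0, 1, 0, 0, 1, 0, 0, 0]  # hypothetical 1-D binary object
    patterns = [[rng.randint(0, 1) for _ in obj] for _ in range(num_patterns)]
    buckets = [bucket_signal(obj, p) for p in patterns]
    return obj, reconstruct(patterns, buckets)
```

With enough patterns, the reconstructed values at object pixels stand out clearly from the background, even though the detector never resolves individual pixels.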

[LG-116] Resolving Sharp Gradients of Unstable Singularities to Machine Precision via Neural Networks

Link: https://arxiv.org/abs/2511.22819
Authors: Yongji Wang, Tristan Léger, Ching-Yao Lai, Tristan Buckmaster
Categories: Analysis of PDEs (math.AP); Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
Comments: 27 pages, 12 figures

Abstract:Recent work introduced a robust computational framework combining embedded mathematical structures, advanced optimization, and neural network architecture, leading to the discovery of multiple unstable self-similar solutions for key fluid dynamics equations, including the Incompressible Porous Media (IPM) and 2D Boussinesq systems. While this framework confirmed the existence of these singularities, an accuracy level approaching double-float machine precision was only achieved for stable and 1st unstable solutions of the 1D Córdoba-Córdoba-Fontelos model. For highly unstable solutions characterized by extreme gradients, the accuracy remained insufficient for validation. The primary obstacle is the presence of sharp solution gradients. Those gradients tend to induce large, localized PDE residuals during training, which not only hinder convergence, but also obscure the subtle signals near the origin required to identify the correct self-similar scaling parameter $\lambda$ of the solutions. In this work, we introduce a gradient-normalized PDE residual re-weighting scheme to resolve the high-gradient challenge while amplifying the critical residual signals at the origin for $\lambda$ identification. Coupled with the multi-stage neural network architecture, the PDE residuals are reduced to the level of round-off error across a wide spectrum of unstable self-similar singularities previously discovered. Furthermore, our method enables the discovery of new highly unstable singularities, i.e. the 4th unstable solution for IPM equations and a novel family of highly unstable solitons for the Nonlinear Schrödinger equations. This yields high-gradient solutions at high precision, providing an important ingredient for bridging the gap between numerical discovery and computer-assisted proofs for unstable phenomena in nonlinear PDEs.

[LG-117] Generative models for crystalline materials

Link: https://arxiv.org/abs/2511.22652
Authors: Houssam Metni, Laura Ruple, Lauren N. Walters, Luca Torresi, Jonas Teufel, Henrik Schopmans, Jona Östreicher, Yumeng Zhang, Marlen Neubert, Yuri Koide, Kevin Steiner, Paul Link, Lukas Bär, Mariana Petrova, Gerbrand Ceder, Pascal Friederich
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)

Abstract:Understanding structure-property relationships in materials is fundamental in condensed matter physics and materials science. Over the past few years, machine learning (ML) has emerged as a powerful tool for advancing this understanding and accelerating materials discovery. Early ML approaches primarily focused on constructing and screening large material spaces to identify promising candidates for various applications. More recently, research efforts have increasingly shifted toward generating crystal structures using end-to-end generative models. This review analyzes the current state of generative modeling for crystal structure prediction and de novo generation. It examines crystal representations, outlines the generative models used to design crystal structures, and evaluates their respective strengths and limitations. Furthermore, the review highlights experimental considerations for evaluating generated structures and provides recommendations for suitable existing software tools. Emerging topics, such as modeling disorder and defects, integration in advanced characterization, and incorporating synthetic feasibility constraints, are explored. Ultimately, this work aims to inform both experimental scientists looking to adapt suitable ML models to their specific circumstances and ML specialists seeking to understand the unique challenges related to inverse materials design and discovery.

[LG-118] AdS/Deep-Learning made easy II: neural network-based approaches to holography and inverse problems

Link: https://arxiv.org/abs/2511.22522
Authors: Hyun-Sik Jeong, Hanse Kim, Keun-Young Kim, Gaya Yun, Hyeonwoo Yu, Kwan Yun
Categories: High Energy Physics - Theory (hep-th); Machine Learning (cs.LG)
Comments: 31 pages, 17 figures

Abstract:We apply physics-informed machine learning (PIML) to solve inverse problems in holography and classical mechanics, focusing on neural ordinary differential equations (Neural ODEs) and physics-informed neural networks (PINNs) for solving non-linear differential equations of motion. First, we introduce holographic inverse problems and demonstrate how PIML can reconstruct bulk spacetime and effective potentials from boundary quantum data. To illustrate this, two case studies are explored: the QCD equation of state in holographic QCD and $T$-linear resistivity in holographic strange metals. Additionally, we explicitly show how such holographic problems can be analogized to inverse problems in classical mechanics, modeling frictional forces with neural networks. We also explore Kolmogorov-Arnold Networks (KANs) as an alternative to traditional neural networks, offering more efficient solutions in certain cases. This manuscript aims to provide a systematic framework for using neural networks in inverse problems, serving as a comprehensive reference for researchers in machine learning for high-energy physics, with methodologies that also have broader applications in mathematics, engineering, and the natural sciences.

[LG-119] The Machine Learning Approach to Moment Closure Relations for Plasma: A Review

Link: https://arxiv.org/abs/2511.22486
Authors: Samuel Burles, Enrico Camporeale
Categories: Plasma Physics (physics.plasm-ph); Machine Learning (cs.LG)
Comments: 30 pages, 2 figures

Abstract:The requirement for large-scale global simulations of plasma is an ongoing challenge in both space and laboratory plasma physics. Any simulation based on a fluid model inherently requires a closure relation for the high order plasma moments. This review compiles and analyses the recent surge of machine learning approaches developing improved plasma closure models capable of capturing kinetic phenomena within plasma fluid models. The purpose of this review is both to collect and analyse the various methods employed on the plasma closure problem, including both equation discovery methods and neural network surrogate approaches, as well as to provide a general overview of the state of the problem. In particular, we highlight the challenges of developing a data-driven closure as well as the direction future work should take toward addressing these challenges, in the pursuit of a computationally viable large-scale global simulation.

[LG-120] Data-driven informative priors for Bayesian inference with quasi-periodic data

Link: https://arxiv.org/abs/2511.22296
Authors: Javier Lopez-Santiago, Luca Martino, Joaquin Miguez, Gonzalo Vazquez-Vilar
Categories: Machine Learning (stat.ML); Instrumentation and Methods for Astrophysics (astro-ph.IM); Solar and Stellar Astrophysics (astro-ph.SR); Machine Learning (cs.LG)
Comments: Accepted for publication in AJ. 19 pages (one column), 14 figures

Abstract:Bayesian computational strategies for inference can be inefficient in approximating the posterior distribution in models that exhibit some form of periodicity. This is because the probability mass of the marginal posterior distribution of the parameter representing the period is usually highly concentrated in a very small region of the parameter space. Therefore, it is necessary to provide as much information as possible to the inference method through the parameter prior distribution. We intend to show that it is possible to construct a prior distribution from the data by fitting a Gaussian process (GP) with a periodic kernel. More specifically, we want to show that it is possible to approximate the marginal posterior distribution of the hyperparameter corresponding to the period in the kernel. Subsequently, this distribution can be used as a prior distribution for the inference method. We use an adaptive importance sampling method to approximate the posterior distribution of the hyperparameters of the GP. Then, we use the marginal posterior distribution of the hyperparameter related to the periodicity in order to construct a prior distribution for the period of the parametric model. This workflow is empirical Bayes, implemented as a modular (cut) transfer of a GP posterior for the period to the parametric model. We applied the proposed methodology to both synthetic and real data. We approximated the posterior distribution of the period of the GP kernel and then passed it forward as a posterior-as-prior with no feedback. Finally, we analyzed its impact on the marginal posterior distribution.
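
The workflow in this abstract relies on a GP with a periodic kernel whose period hyperparameter is later reused as prior information. As a rough illustration of how the period enters such a kernel, the sketch below implements the widely used exp-sine-squared form. The abstract does not state the exact kernel, so this particular form and the parameter names are assumptions.

```python
import math


def periodic_kernel(x1: float, x2: float, period: float,
                    length_scale: float = 1.0, variance: float = 1.0) -> float:
    """Exp-sine-squared kernel:
    variance * exp(-2 * sin^2(pi * |x1 - x2| / period) / length_scale^2)."""
    s = math.sin(math.pi * abs(x1 - x2) / period)
    return variance * math.exp(-2.0 * s * s / length_scale ** 2)
```

Inputs separated by a whole number of periods are treated as perfectly correlated, which is why the posterior over `period` tends to be sharply peaked when the data are close to periodic.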

[LG-121] UCB for Large-Scale Pure Exploration: Beyond Sub-Gaussianity

Link: https://arxiv.org/abs/2511.22273
Authors: Zaile Li, Weiwei Fan, L. Jeff Hong
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Selecting the best alternative from a finite set represents a broad class of pure exploration problems. Traditional approaches to pure exploration have predominantly relied on Gaussian or sub-Gaussian assumptions on the performance distributions of all alternatives, which limit their applicability to non-sub-Gaussian, especially heavy-tailed, problems. The need to move beyond sub-Gaussianity may become even more critical in large-scale problems, which tend to be especially sensitive to distributional specifications. In this paper, motivated by the widespread use of upper confidence bound (UCB) algorithms in pure exploration and beyond, we investigate their performance in large-scale, non-sub-Gaussian settings. We consider the simplest category of UCB algorithms, where the UCB value for each alternative is defined as the sample mean plus an exploration bonus that depends only on its own sample size. We abstract this into a meta-UCB algorithm and propose letting it select the alternative with the largest sample size as the best upon stopping. For this meta-UCB algorithm, we first derive a distribution-free lower bound on the probability of correct selection. Building on this bound, we analyze two general non-sub-Gaussian scenarios: (1) all alternatives follow a common location-scale structure and have bounded variance; and (2) when such a structure does not hold, each alternative has a bounded absolute moment of order $q > 3$. In both settings, we show that the meta-UCB algorithm, and therefore a broad class of UCB algorithms, can achieve sample optimality. These results demonstrate the applicability of UCB algorithms for solving large-scale pure exploration problems with non-sub-Gaussian distributions. Numerical experiments support our results and provide additional insights into the comparative behaviors of UCB algorithms within and beyond our meta-UCB framework.
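
The meta-UCB procedure in this abstract has two ingredients: a UCB value equal to the sample mean plus a bonus that depends only on that alternative's sample size, and a final selection of the most-sampled alternative. The sketch below is a minimal illustration of those two rules. The fixed sampling budget, the single initial sample per alternative, and the bonus function passed in by the caller are assumptions, since the abstract does not specify them.

```python
import math
import random


def meta_ucb(samplers, budget, bonus):
    """Run a simple UCB loop and return (selected index, sample counts).

    samplers: list of zero-argument callables, one per alternative.
    budget:   total number of samples to draw (assumed stopping rule).
    bonus:    function of an alternative's own sample size only.
    """
    k = len(samplers)
    counts = [1] * k
    sums = [draw() for draw in samplers]  # one initial sample each
    for _ in range(budget - k):
        ucb = [sums[i] / counts[i] + bonus(counts[i]) for i in range(k)]
        i = max(range(k), key=lambda j: ucb[j])
        sums[i] += samplers[i]()
        counts[i] += 1
    # Select the alternative that received the most samples.
    best = max(range(k), key=lambda j: counts[j])
    return best, counts


if __name__ == "__main__":
    rng = random.Random(1)
    means = [0.0, 0.2, 1.0]
    # Heavy-tailed (Pareto) noise, chosen only as an illustration.
    arms = [lambda m=m: m + rng.paretovariate(4.0) for m in means]
    print(meta_ucb(arms, budget=2000, bonus=lambda n: 1.0 / math.sqrt(n)))
```

Selecting by sample count rather than by sample mean makes the final choice less sensitive to a single extreme observation, which matters when the noise is heavy-tailed.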

[LG-122] Towards Understanding Generalization in DP-GD: A Case Study in Training Two-Layer CNNs

Link: https://arxiv.org/abs/2511.22270
Authors: Zhongjie Shi, Puyu Wang, Chenyang Zhang, Yuan Cao
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Modern deep learning techniques focus on extracting intricate information from data to achieve accurate predictions. However, the training datasets may be crowdsourced and include sensitive information, such as personal contact details, financial data, and medical records. As a result, there is a growing emphasis on developing privacy-preserving training algorithms for neural networks that maintain good performance while preserving privacy. In this paper, we investigate the generalization and privacy performances of the differentially private gradient descent (DP-GD) algorithm, a private variant of gradient descent (GD) that adds noise to the gradients at each iteration. Moreover, we identify a concrete learning task where DP-GD can achieve superior generalization performance compared to GD in training two-layer Huberized ReLU convolutional neural networks (CNNs). Specifically, we demonstrate that, under mild conditions, a small signal-to-noise ratio can result in GD producing trained models with poor test accuracy, whereas DP-GD can yield trained models with good test accuracy and privacy guarantees if the signal-to-noise ratio is not too small. This indicates that DP-GD has the potential to enhance model performance while ensuring privacy protection in certain learning tasks. Numerical simulations are further conducted to support our theoretical results.
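
The abstract characterizes DP-GD as gradient descent with noise added to the gradient at every iteration. The sketch below shows that update rule on a generic parameter vector. It is not a differentially private implementation: gradient clipping and the calibration of the noise scale to a privacy budget are omitted, and the function and parameter names are hypothetical.

```python
import random


def noisy_gd(grad, w0, lr, steps, noise_std, rng):
    """Gradient descent with Gaussian noise added to each gradient coordinate.

    grad:      function mapping a parameter list to its gradient list.
    noise_std: standard deviation of the injected noise (0 recovers plain GD).
    """
    w = list(w0)
    for _ in range(steps):
        g = grad(w)
        w = [wi - lr * (gi + rng.gauss(0.0, noise_std)) for wi, gi in zip(w, g)]
    return w
```

For example, with `grad = lambda w: [2.0 * (w[0] - 3.0)]` the iterates settle near 3, with fluctuations that grow with `noise_std`.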

[LG-123] Support Vector Machine Classifier with Rescaled Huberized Pinball Loss

Link: https://arxiv.org/abs/2511.22065
Authors: Shibo Diao
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Support vector machines are widely used in machine learning classification tasks, but traditional SVM models suffer from sensitivity to outliers and instability in resampling, which limits their performance in practical applications. To address these issues, this paper proposes a novel rescaled Huberized pinball loss function with asymmetric, non-convex, and smooth properties. Based on this loss function, we develop a corresponding SVM model called RHPSVM (Rescaled Huberized Pinball Loss Support Vector Machine). Theoretical analyses demonstrate that RHPSVM conforms to Bayesian rules, has a strict generalization error bound, a bounded influence function, and controllable optimality conditions, ensuring excellent classification accuracy, outlier insensitivity, and resampling stability. Additionally, RHPSVM can be extended to various advanced SVM variants by adjusting parameters, enhancing its flexibility. We transform the non-convex optimization problem of RHPSVM into a series of convex subproblems using the concave-convex procedure (CCCP) and solve it with the ClipDCD algorithm, which is proven to be convergent. Experimental results on simulated data, UCI datasets, and small-sample crop leaf image classification tasks show that RHPSVM outperforms existing SVM models in both noisy and noise-free scenarios, especially in handling high-dimensional small-sample data.

[LG-124] On the Effect of Regularization on Nonparametric Mean-Variance Regression

Link: https://arxiv.org/abs/2511.22004
Authors: Eliot Wong-Toi, Alex Boyd, Vincent Fortuin, Stephan Mandt
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Uncertainty quantification is vital for decision-making and risk assessment in machine learning. Mean-variance regression models, which predict both a mean and residual noise for each data point, provide a simple approach to uncertainty quantification. However, overparameterized mean-variance models struggle with signal-to-noise ambiguity, deciding whether prediction targets should be attributed to signal (mean) or noise (variance). At one extreme, models fit all training targets perfectly with zero residual noise, while at the other, they provide constant, uninformative predictions and explain the targets as noise. We observe a sharp phase transition between these extremes, driven by model regularization. Empirical studies with varying regularization levels illustrate this transition, revealing substantial variability across repeated runs. To explain this behavior, we develop a statistical field theory framework, which captures the observed phase transition in alignment with experimental results. This analysis reduces the regularization hyperparameter search space from two dimensions to one, significantly lowering computational costs. Experiments on UCI datasets and the large-scale ClimSim dataset demonstrate robust calibration performance, effectively quantifying predictive uncertainty.

[LG-125] A Sensitivity Approach to Causal Inference Under Limited Overlap

Link: https://arxiv.org/abs/2511.22003
Authors: Yuanzhe Ma, Hongseok Namkoong
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)

Abstract:Limited overlap between treated and control groups is a key challenge in observational analysis. Standard approaches like trimming importance weights can reduce variance but introduce a fundamental bias. We propose a sensitivity framework for contextualizing findings under limited overlap, where we assess how irregular the outcome function has to be in order for the main finding to be invalidated. Our approach is based on worst-case confidence bounds on the bias introduced by standard trimming practices, under explicit assumptions necessary to extrapolate counterfactual estimates from regions of overlap to those without. Empirically, we demonstrate how our sensitivity framework protects against spurious findings by quantifying uncertainty in regions with limited overlap.

[LG-126] Algorithms and Scientific Software for Quasi-Monte Carlo Fast Gaussian Process Regression and Scientific Machine Learning

Link: https://arxiv.org/abs/2511.21915
Authors: Aleksei G. Sorokin
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
Comments: PhD thesis

Abstract:Most scientific domains elicit the development of efficient algorithms and accessible scientific software. This thesis unifies our developments in three broad domains: Quasi-Monte Carlo (QMC) methods for efficient high-dimensional integration, Gaussian process (GP) regression for high-dimensional interpolation with built-in uncertainty quantification, and scientific machine learning (sciML) for modeling partial differential equations (PDEs) with mesh-free solvers. For QMC, we built new algorithms for vectorized error estimation and developed QMCPy (this https URL), an open-source Python interface to randomized low-discrepancy sequence generators, automatic variable transforms, adaptive error estimation procedures, and diverse use cases. For GPs, we derived new digitally-shift-invariant kernels of higher-order smoothness, developed novel fast multitask GP algorithms, and produced the scalable Python software FastGPs (this https URL). For sciML, we developed a new algorithm capable of machine precision recovery of PDEs with random coefficients. We have also studied a number of applications including GPs for probability of failure estimation, multilevel GPs for the Darcy flow equation, neural surrogates for modeling radiative transfer, and fast GPs for Bayesian multilevel QMC.

[LG-127] Sparse Multiple Kernel Learning: Alternating Best Response and Semidefinite Relaxations

Link: https://arxiv.org/abs/2511.21890
Authors: Dimitris Bertsimas, Caio de Prospero Iglesias, Nicholas A. G. Johnson
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:We study Sparse Multiple Kernel Learning (SMKL), which is the problem of selecting a sparse convex combination of prespecified kernels for support vector binary classification. Unlike prevailing l1 regularized approaches that approximate a sparsifying penalty, we formulate the problem by imposing an explicit cardinality constraint on the kernel weights and add an l2 penalty for robustness. We solve the resulting non-convex minimax problem via an alternating best response algorithm with two subproblems: the alpha subproblem is a standard kernel SVM dual solved via LIBSVM, while the beta subproblem admits an efficient solution via the Greedy Selector and Simplex Projector algorithm. We reformulate SMKL as a mixed integer semidefinite optimization problem and derive a hierarchy of semidefinite convex relaxations which can be used to certify near-optimality of the solutions returned by our best response algorithm and also to warm start it. On ten UCI benchmarks, our method with random initialization outperforms state-of-the-art MKL approaches in out-of-sample prediction accuracy on average by 3.34 percentage points (relative to the best performing benchmark) while selecting a small number of candidate kernels in comparable runtime. With warm starting, our method outperforms the best performing benchmark’s out-of-sample prediction accuracy on average by 4.05 percentage points. Our convex relaxations provide a certificate that in several cases, the solution returned by our best response algorithm is the globally optimal solution.

[LG-128] Invited to Develop: Institutional Belonging and the Counterfactual Architecture of Development

Link: https://arxiv.org/abs/2511.21865
Authors: Diego Vallarino
Categories: General Economics (econ.GN); Machine Learning (cs.LG)

Abstract:This paper examines how institutional belonging shapes long-term development by comparing Spain and Uruguay, two small democracies with similar historical endowments whose trajectories diverged sharply after the 1960s. While Spain integrated into dense European institutional architectures, Uruguay remained embedded within the Latin American governance regime, characterized by weaker coordination and lower institutional coherence. To assess how alternative institutional embeddings could have altered these paths, the study develops a generative counterfactual framework grounded in economic complexity, institutional path dependence, and a Wasserstein GAN trained on data from 1960-2020. The resulting Expected Developmental Shift (EDS) quantifies structural gains or losses from hypothetical re-embedding in different institutional ecosystems. Counterfactual simulations indicate that Spain would have experienced significant developmental decline under a Latin American configuration, while Uruguay would have achieved higher complexity and resilience within a European regime. These findings suggest that development is not solely determined by domestic reforms but emerges from a country’s structural position within transnational institutional networks.

[LG-129] Automated Statistical and Machine Learning Platform for Biological Research

Link: https://arxiv.org/abs/2511.21770
Authors: Luke Rimmo Lego, Samantha Gauthier, Denver Jn. Baptiste
Categories: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 7 pages, 2 figures, 25 equations

Abstract:Research increasingly relies on computational methods to analyze experimental data and predict molecular properties. Current approaches often require researchers to use a variety of tools for statistical analysis and machine learning, creating workflow inefficiencies. We present an integrated platform that combines classical statistical methods with Random Forest classification for comprehensive data analysis that can be used in the biological sciences. The platform implements automated hyperparameter optimization, feature importance analysis, and a suite of statistical tests including t tests, ANOVA, and Pearson correlation analysis. Our methodology addresses the gap between traditional statistical software, modern machine learning frameworks and biology, by providing a unified interface accessible to researchers without extensive programming experience. The system achieves this through automatic data preprocessing, categorical encoding, and adaptive model configuration based on dataset characteristics. Initial testing protocols are designed to evaluate classification accuracy across diverse chemical datasets with varying feature distributions. This work demonstrates that integrating statistical rigor with machine learning interpretability can accelerate biological discovery workflows while maintaining methodological soundness. The platform’s modular architecture enables future extensions to additional machine learning algorithms and statistical procedures relevant to bioinformatics.

[LG-130] DNNs Dataset Statistics and Correlation Functions

Link: https://arxiv.org/abs/2511.21715
Authors: Robert W. Batterman, James F. Woodward
Categories: History and Philosophy of Physics (physics.hist-ph); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG)
Comments: 37 pages, 12 figures

Abstract:This paper argues that dataset structure is important in image recognition tasks (among other tasks). Specifically, we focus on the nature and genesis of correlational structure in the actual datasets upon which DNNs are trained. We argue that DNNs are implementing a widespread methodology in condensed matter physics and materials science that focuses on mesoscale correlation structures that live between fundamental atomic/molecular scales and continuum scales. Specifically, we argue that DNNs that are successful in image classification must be discovering high order correlation functions. It is well-known that DNNs successfully generalize in apparent contravention of standard statistical learning theory. We consider the implications of our discussion for this puzzle.

Information Retrieval

[IR-0] Do LLM-judges Align with Human Relevance in Cranfield-style Recommender Evaluation?

Link: https://arxiv.org/abs/2511.23312
Authors: Gustavo Penha, Aleksandr V. Petrov, Claudia Hauff, Enrico Palumbo, Ali Vardasbi, Edoardo D’Amico, Francesco Fabbri, Alice Wang, Praveen Chandar, Henrik Lindstrom, Hugues Bouchard, Mounia Lalmas
Categories: Information Retrieval (cs.IR)

Abstract: Evaluating recommender systems remains a long-standing challenge, as offline methods based on historical user interactions and train-test splits often yield unstable and inconsistent results due to exposure bias, popularity bias, sampled evaluations, and missing-not-at-random patterns. In contrast, textual document retrieval benefits from robust, standardized evaluation via Cranfield-style test collections, which combine pooled relevance judgments with controlled setups. While recent work shows that adapting this methodology to recommender systems is feasible, constructing such collections remains costly due to the need for manual relevance judgments, thus limiting scalability. This paper investigates whether Large Language Models (LLMs) can serve as reliable automatic judges to address these scalability challenges. Using the ML-32M-ext Cranfield-style movie recommendation collection, we first examine the limitations of existing evaluation methodologies. We then explore label alignment and recommender-system ranking agreement between the LLM-judge and human-provided relevance labels. We find that incorporating richer item metadata and longer user histories improves alignment, and that LLM-judge yields high agreement with human-based rankings (Kendall’s tau = 0.87). Finally, an industrial case study in the podcast recommendation domain demonstrates the practical value of LLM-judge for model selection. Overall, our results show that LLM-judge is a viable and scalable approach for evaluating recommender systems.
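
The ranking agreement reported in this abstract (Kendall's tau) measures how similarly two sets of labels order the same recommender systems. The following is a minimal sketch of that idea, not the paper's evaluation code: the system names and scores are hypothetical, and it uses the simple tau-a variant without tie correction, which may differ from the variant the authors used.

```python
from itertools import combinations


def kendall_tau_a(scores_a, scores_b):
    """Kendall's tau-a between two score dicts keyed by system name."""
    systems = sorted(scores_a)
    if sorted(scores_b) != systems or len(systems) < 2:
        raise ValueError("both inputs must score the same two or more systems")
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        # A pair is concordant when both label sources order it the same way.
        product = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical per-system quality scores (illustrative values only).
human_scores = {"sys_a": 0.41, "sys_b": 0.37, "sys_c": 0.29, "sys_d": 0.22}
llm_scores = {"sys_a": 0.45, "sys_b": 0.33, "sys_c": 0.35, "sys_d": 0.20}

tau = kendall_tau_a(human_scores, llm_scores)
print(f"Kendall's tau-a: {tau:.3f}")
```

In this toy input the two label sources disagree on one of six system pairs, so the value is (5 - 1) / 6. A value close to 1 means both sources would pick nearly the same best system.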

[IR-1] FedAU2: Attribute Unlearning for User-Level Federated Recommender Systems with Adaptive and Robust Adversarial Training

Link: https://arxiv.org/abs/2511.22872
Authors: Yuyuan Li, Junjie Fang, Fengyuan Yu, Xichun Sheng, Tianyu Du, Xuyang Teng, Shaowei Jiang, Linbo Jiang, Jianan Lin, Chaochao Chen
Categories: Information Retrieval (cs.IR)

Abstract: Federated Recommender Systems (FedRecs) leverage federated learning to protect user privacy by retaining data locally. However, user embeddings in FedRecs often encode sensitive attribute information, rendering them vulnerable to attribute inference attacks. Attribute unlearning has emerged as a promising approach to mitigate this issue. In this paper, we focus on user-level FedRecs, which is a more practical yet challenging setting compared to group-level FedRecs. Adversarial training emerges as the most feasible approach within this context. We identify two key challenges in implementing adversarial training-based attribute unlearning for user-level FedRecs: i) mitigating training instability caused by user data heterogeneity, and ii) preventing attribute information leakage through gradients. To address these challenges, we propose FedAU2, an attribute unlearning method for user-level FedRecs. For the first challenge, we propose an adaptive adversarial training strategy, where the training dynamics are adjusted in response to local optimization behavior. For the second challenge, we propose a dual-stochastic variational autoencoder to perturb the adversarial model, effectively preventing gradient-based information leakage. Extensive experiments on three real-world datasets demonstrate that our proposed FedAU2 achieves superior performance in unlearning effectiveness and recommendation performance compared to existing baselines.

[IR-2] Two-Stage Distributionally Robust Optimization Framework for Secure Communications in Aerial-RIS Systems

Link: https://arxiv.org/abs/2511.22855
Authors: Zhongming Feng, Qiling Gao, Zeping Sui, Yun Lin, Michail Matthaiou
Categories: Information Retrieval (cs.IR); Information Theory (cs.IT)
Comments: 5 pages

Abstract: This letter proposes a two-stage distributionally robust optimization (DRO) framework for secure deployment and beamforming in an aerial reconfigurable intelligent surface (A-RIS) assisted millimeter-wave system. To account for multi-timescale uncertainties arising from user mobility, imperfect channel state information (CSI), and hardware impairments, our approach decouples the long-term unmanned aerial vehicle (UAV) placement from the per-slot beamforming design. By employing the conditional value-at-risk (CVaR) as a distribution-free risk metric, a low-complexity algorithm is developed, which combines a surrogate model for efficient deployment with an alternating optimization (AO) scheme for robust real-time beamforming. Simulation results validate that the proposed DRO-CVaR framework significantly enhances the tail-end secrecy spectral efficiency and maintains a lower outage probability compared to benchmark schemes, especially under severe uncertainty conditions.
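
The CVaR risk metric named in this abstract averages the worst tail of a loss distribution instead of the whole distribution. The following is a minimal sketch of a simple empirical estimator, not the letter's algorithm: the secrecy-rate samples, the level `alpha = 0.8`, and the choice of loss as negative secrecy rate are all assumptions made for illustration.

```python
import math


def empirical_cvar(losses, alpha):
    """Mean of the worst (largest) ceil((1 - alpha) * n) losses."""
    if not losses or not 0.0 <= alpha < 1.0:
        raise ValueError("need non-empty losses and 0 <= alpha < 1")
    ordered = sorted(losses, reverse=True)
    # round() guards against float noise such as 3.0000000000000004.
    tail_size = max(1, math.ceil(round((1.0 - alpha) * len(ordered), 9)))
    tail = ordered[:tail_size]
    return sum(tail) / len(tail)


# Hypothetical secrecy rates in bits/s/Hz (higher is better).
secrecy_rates = [2.0, 1.5, 0.5, 3.0, 2.5, 0.0, 1.0, 3.5, 4.0, 2.0]

# Treat the negative secrecy rate as the loss, so the worst tail is the low-rate tail.
losses = [-rate for rate in secrecy_rates]
cvar = empirical_cvar(losses, alpha=0.8)
tail_rate = -cvar
print(f"mean secrecy rate over the worst 20% of samples: {tail_rate:.2f}")
```

With ten samples and `alpha = 0.8`, the tail holds the two lowest rates (0.0 and 0.5), so the estimate is their mean. Optimizing this quantity targets poor outcomes directly, whereas the plain average can hide them.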

[IR-3] Selecting User Histories to Generate LLM Users for Cold-Start Item Recommendation

Link: https://arxiv.org/abs/2511.21989
Authors: Nachiket Subbaraman (1), Jaskinder Sarai (1), Aniruddh Nath (2), Lichan Hong (3), Lukasz Heldt (2), Li Wei (2), Zhe Zhao (1) ((1) UC Davis, (2) Google Inc., (3) Google DeepMind)
Categories: Information Retrieval (cs.IR)
Comments: 12 pages, 15 figures

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in reasoning, generalization, and simulating human-like behavior across a wide range of tasks. These strengths present new opportunities to enhance traditional recommendation systems (RS), especially in the cold-start item scenario where newly introduced items lack interactions. Existing works have used LLMs to address cold-start issues in traditional RS through data augmentation, but they have limitations. One recent work directly addresses this issue by prompting LLMs to generate augmented interaction data between randomly sampled users and cold-start items. Then, they train the traditional RS with augmented data, incorporating collaborative signals for cold-start items. Although they use LLMs to provide cold-start items with feedback, they use partial user histories, which does not allow the LLM to fully emulate the user. Furthermore, randomly selecting users is not optimal for augmentation. To address these challenges, we leverage the LLM as a user and develop a reinforcement learning (RL) framework that trains a policy to select users for augmentation, optimizing for cold-start item performance after augmented training. The policy model learns to select users for cold-start item data augmentation based on their behavioral features and histories. To optimize user selection for cold-start item performance, we employ a policy gradient method that updates the policy in the direction of actions that lead to high rewards. Experiments on Amazon Product Review datasets show substantial gains in cold-start item recall, demonstrating the effectiveness of our method as a scalable, serving-efficient augmentation strategy for modern RS.
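
The policy gradient update this abstract describes can be illustrated with a toy REINFORCE loop. The following is a minimal sketch, not the authors' framework, and it rests on loud assumptions: the policy keeps one logit per user instead of using behavioral features and histories, and `toy_reward` is a hypothetical stand-in for cold-start item performance after augmented training.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, reward_fn, rng, lr=0.5):
    """One REINFORCE update: sample a user, observe a reward, ascend the log-prob gradient."""
    probs = softmax(logits)
    user = rng.choices(range(len(logits)), weights=probs, k=1)[0]
    reward = reward_fn(user)
    # For a softmax policy, d log pi(user) / d logit_k = 1[k == user] - pi_k.
    return [
        logit + lr * reward * ((1.0 if k == user else 0.0) - probs[k])
        for k, logit in enumerate(logits)
    ]


def toy_reward(user):
    # Hypothetical reward: only user 2 yields useful augmentation data.
    return 1.0 if user == 2 else 0.0


rng = random.Random(0)
logits = [0.0, 0.0, 0.0, 0.0]  # uniform initial policy over four users
for _ in range(200):
    logits = reinforce_step(logits, toy_reward, rng)

final_probs = softmax(logits)
print([round(p, 3) for p in final_probs])
```

Because only the selection of user 2 is ever rewarded, each rewarded update raises that user's logit and lowers the others, so the policy concentrates on that user. A feature-based policy would replace the per-user logits with a learned scoring function, and a baseline term is commonly subtracted from the reward to reduce variance.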

Attachment Download

Click to download today's full paper list