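This post lists the latest papers retrieved from arXiv.org on 2025-11-14. It is updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. To receive the list by email on a schedule, leave your email address in the comments.
Note: the daily paper data is retrieved from arXiv.org and updated automatically at around 12:00 each day.
Reminder: if you would like to receive the daily paper data by email, leave your email address in the comments.
Contents
Overview (2025-11-14)
545 papers were updated today, including:
- Natural language processing: 78 papers (Computation and Language (cs.CL))
- Artificial intelligence: 176 papers (Artificial Intelligence (cs.AI))
- Computer vision: 121 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine learning: 158 papers (Machine Learning (cs.LG))
Natural Language Processing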
[NLP-0] ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning LLM Inference
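[Quick read]: This paper addresses the large quantization errors and sharp accuracy drops that outliers in weights and activations cause in weight-only post-training quantization (PTQ), especially in reasoning LLMs, where errors accumulate over long chains of thought. The key idea is Pairwise Rotation Quantization (ParoQuant), which combines hardware-efficient, optimizable independent Givens rotations with channel-wise scaling to even out magnitudes across channels and narrow the dynamic range within each quantization group. The authors also co-design the GPU inference kernel to exploit parallelism and keep rotation and scaling lightweight at runtime. ParoQuant improves accuracy on reasoning tasks by an average of 2.4% over AWQ with less than 10% inference overhead.
Link: https://arxiv.org/abs/2511.10645
Authors: Yesheng Liang, Haisheng Chen, Song Han, Zhijian Liu
Affiliations: NVIDIA; MIT; UC San Diego
Subjects: Computation and Language (cs.CL)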
Abstract:Weight-only post-training quantization (PTQ) compresses the weights of Large Language Models (LLMs) into low-precision representations to reduce memory footprint and accelerate inference. However, the presence of outliers in weights and activations often leads to large quantization errors and severe accuracy degradation, especially in recent reasoning LLMs where errors accumulate across long chains of thought. Existing PTQ methods either fail to sufficiently suppress outliers or introduce significant overhead during inference. In this paper, we propose Pairwise Rotation Quantization (ParoQuant), a weight-only PTQ method that combines hardware-efficient and optimizable independent Givens rotations with channel-wise scaling to even out the magnitude across channels and narrow the dynamic range within each quantization group. We further co-design the inference kernel to fully exploit GPU parallelism and keep the rotations and scaling lightweight at runtime. ParoQuant achieves an average 2.4% accuracy improvement over AWQ on reasoning tasks with less than 10% overhead. This paves the way for more efficient and accurate deployment of reasoning LLMs.
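The rotate-then-quantize idea in the abstract above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' kernel or optimization procedure: the function names `givens_rotate_pairs` and `quantize_groups`, the fixed angle, the choice of channel pairs, and the symmetric 4-bit rounding scheme are all assumptions made for the example, and channel-wise scaling is left out. It shows two things only: rotating disjoint channel pairs is an orthogonal transform, so rotating the input the same way leaves the layer output unchanged, and a rotation can spread an outlier channel's magnitude over its partner before group quantization.

```python
import numpy as np

def givens_rotate_pairs(M, pairs, thetas):
    """Rotate column pairs (i, j) of M by the given angles.

    The pairs are assumed disjoint, so each rotation is independent
    of the others and the combined transform is orthogonal.
    """
    out = M.astype(float).copy()
    for (i, j), theta in zip(pairs, thetas):
        c, s = np.cos(theta), np.sin(theta)
        col_i, col_j = M[:, i], M[:, j]
        out[:, i] = c * col_i - s * col_j
        out[:, j] = s * col_i + c * col_j
    return out

def quantize_groups(W, group_size=2, bits=4):
    """Symmetric round-to-nearest quantization, one scale per row and group."""
    qmax = 2 ** (bits - 1) - 1
    out = np.empty_like(W, dtype=float)
    for start in range(0, W.shape[1], group_size):
        block = W[:, start:start + group_size]
        scale = np.abs(block).max(axis=1, keepdims=True) / qmax
        scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
        out[:, start:start + group_size] = np.round(block / scale) * scale
    return out

# Toy weight matrix (hypothetical values): input channel 0 is an outlier.
W = np.array([[8.0, 0.1, 0.5, -0.4],
              [-7.5, 0.2, 0.3, 0.6]])
x = np.array([0.3, -1.2, 0.7, 2.0])

pairs, thetas = [(0, 1), (2, 3)], [np.pi / 4, np.pi / 4]
W_rot = givens_rotate_pairs(W, pairs, thetas)
x_rot = givens_rotate_pairs(x[None, :], pairs, thetas)[0]

y_ref = W @ x            # original full-precision output
y_rot = W_rot @ x_rot    # same output computed in the rotated basis
W_q = quantize_groups(W_rot)
```

Because the rotation is applied to both the weights and the input, `y_rot` matches `y_ref` up to floating-point error. Only the quantization step introduces error, and the rotated weights give it a narrower range to cover.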
[NLP-1] Black-Box On-Policy Distillation of Large Language Models
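[Quick read]: This paper addresses black-box distillation, where a student LLM must learn from a proprietary teacher's text outputs alone, without access to its internal logits or parameters. The key idea is Generative Adversarial Distillation (GAD), which treats the student as a generator and trains a discriminator to distinguish student responses from teacher responses, forming a minimax game. The discriminator acts as an on-policy reward model that co-evolves with the student and provides stable, adaptive feedback. In experiments GAD consistently surpasses sequence-level knowledge distillation, and a Qwen2.5-14B-Instruct student trained with GAD becomes comparable to its GPT-5-Chat teacher on the LMSYS-Chat automatic evaluation.
Link: https://arxiv.org/abs/2511.10643
Authors: Tianzhu Ye, Li Dong, Zewen Chi, Xun Wu, Shaohan Huang, Furu Wei
Affiliations: Microsoft Research
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)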
Abstract:Black-box distillation creates student large language models (LLMs) by learning from a proprietary teacher model’s text outputs alone, without access to its internal logits or parameters. In this work, we introduce Generative Adversarial Distillation (GAD), which enables on-policy and black-box distillation. GAD frames the student LLM as a generator and trains a discriminator to distinguish its responses from the teacher LLM’s, creating a minimax game. The discriminator acts as an on-policy reward model that co-evolves with the student, providing stable, adaptive feedback. Experimental results show that GAD consistently surpasses the commonly used sequence-level knowledge distillation. In particular, Qwen2.5-14B-Instruct (student) trained with GAD becomes comparable to its teacher, GPT-5-Chat, on the LMSYS-Chat automatic evaluation. The results establish GAD as a promising and effective paradigm for black-box LLM distillation.
[NLP-2] Instella: Fully Open Language Models with Stellar Performance
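[Quick read]: This paper addresses the limited transparency and reproducibility that result from most high-performing LLMs being closed-source or only partially open. The key idea is Instella, a family of fully open three-billion-parameter language models trained entirely on openly available data and code, developed on AMD Instinct MI300X GPUs through large-scale pre-training, general-purpose instruction tuning, and alignment with human preferences. Despite using fewer pre-training tokens than many comparable models, Instella achieves state-of-the-art results among fully open models and is competitive with leading open-weight models of similar size. The authors also release two specialized variants: Instella-Long, which supports context lengths up to 128K tokens, and Instella-Math, a reasoning-focused model enhanced through supervised fine-tuning and reinforcement learning on mathematical tasks.
Link: https://arxiv.org/abs/2511.10628
Authors: Jiang Liu, Jialian Wu, Xiaodong Yu, Yusheng Su, Prakamya Mishra, Gowtham Ramesh, Sudhanshu Ranjan, Chaitanya Manem, Ximeng Sun, Ze Wang, Pratik Prabhanjan Brahma, Zicheng Liu, Emad Barsoum
Affiliations: AMD
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)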
Abstract:Large language models (LLMs) have demonstrated remarkable performance across a wide range of tasks, yet the majority of high-performing models remain closed-source or partially open, limiting transparency and reproducibility. In this work, we introduce Instella, a family of fully open three billion parameter language models trained entirely on openly available data and codebase. Powered by AMD Instinct MI300X GPUs, Instella is developed through large-scale pre-training, general-purpose instruction tuning, and alignment with human preferences. Despite using substantially fewer pre-training tokens than many contemporaries, Instella achieves state-of-the-art results among fully open models and is competitive with leading open-weight models of comparable size. We further release two specialized variants: Instella-Long, capable of handling context lengths up to 128K tokens, and Instella-Math, a reasoning-focused model enhanced through supervised fine-tuning and reinforcement learning on mathematical tasks. Together, these contributions establish Instella as a transparent, performant, and versatile alternative for the community, advancing the goal of open and reproducible language modeling research.
[NLP-3] SSR: Socratic Self-Refine for Large Language Model Reasoning
[Quick read]: This paper addresses the performance bottleneck that arises when LLMs rely on coarse self-verification and self-correction for complex reasoning: without fine-grained evaluation of individual steps, accuracy and interpretability suffer. The key idea is Socratic Self-Refine (SSR), a framework that decomposes a model response into verifiable (sub-question, sub-answer) pairs, estimates step-level confidence through controlled re-solving and self-consistency checks, and then pinpoints unreliable steps and refines them iteratively, producing more accurate and interpretable reasoning chains.
Link: https://arxiv.org/abs/2511.10621
Authors: Haizhou Shi, Ye Liu, Bo Pang, Zeyu Leo Liu, Hao Wang, Silvio Savarese, Caiming Xiong, Yingbo Zhou, Semih Yavuz
Affiliations: Salesforce AI Research; Rutgers University; The University of Texas at Austin
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Preprint; work in progress
Abstract:Large Language Models (LLMs) have demonstrated remarkable reasoning abilities, yet existing test-time frameworks often rely on coarse self-verification and self-correction, limiting their effectiveness on complex tasks. In this paper, we propose Socratic Self-Refine (SSR), a novel framework for fine-grained evaluation and precise refinement of LLM reasoning. Our proposed SSR decomposes model responses into verifiable (sub-question, sub-answer) pairs, enabling step-level confidence estimation through controlled re-solving and self-consistency checks. By pinpointing unreliable steps and iteratively refining them, SSR produces more accurate and interpretable reasoning chains. Empirical results across five reasoning benchmarks and three LLMs show that SSR consistently outperforms state-of-the-art iterative self-refinement baselines. Beyond performance gains, SSR provides a principled black-box approach for evaluating and understanding the internal reasoning processes of LLMs. Code is available at this https URL.
[NLP-4] Know Your Limits: Entropy Estimation Modeling for Compression and Generalization
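[Quick read]: This paper addresses the difficulty of accurately estimating language entropy, which limits both compression efficiency and generalization. The most efficient language compressors today are causal (next-token-prediction) LLMs, but using them to estimate language entropy accurately is currently computationally infeasible. The key idea is an encoder-augmented causal decoder architecture that trains more efficiently and compresses better than causal transformers, even on modest hardware. Using per-token entropy estimates, the authors further show that models trained to approach but not exceed the estimated entropy generalize better than models trained to minimize loss without taking entropy into account.
Link: https://arxiv.org/abs/2511.10618
Authors: Benjamin L. Badger, Matthew Neligeorge
Affiliations: IBM
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG)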
Abstract:Language prediction is constrained by informational entropy intrinsic to language, such that there exists a limit to how accurate any language model can become and equivalently a lower bound to language compression. The most efficient language compression algorithms today are causal (next token prediction) large language models, but the use of these models to form accurate estimates of language entropy is currently computationally infeasible. We introduce encoder-augmented causal decoder model architectures that exhibit superior training efficiency characteristics and achieve higher compression than causal transformers even when trained on modest hardware. We demonstrate how entropy estimates can be obtained on a per-token basis, and show that the generalization of models trained to approach the entropy of their training data necessarily exceeds the generalization of models trained to minimize loss beyond this value. We show empirically that causal models trained to approach but not exceed estimated per-token entropies exhibit greater generalization than models trained without taking entropy into account.
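The link between prediction and compression mentioned above rests on a standard fact: a model that assigns probability p to the token that actually occurs can encode that token in about -log2(p) bits, for example with arithmetic coding. The sketch below only illustrates that accounting. It is not the paper's entropy estimator, and the probabilities are made-up values chosen for the example.

```python
import math

def bits_per_token(token_probs):
    """Code length in bits for each token: -log2 of the probability
    the model assigned to the token that actually occurred."""
    return [-math.log2(p) for p in token_probs]

# Hypothetical probabilities a model assigned to three observed tokens.
probs = [0.5, 0.25, 0.125]
bits = bits_per_token(probs)
total_bits = sum(bits)               # approximate compressed size of the sequence
mean_bits = total_bits / len(bits)   # average cross-entropy in bits per token
```

Averaged over enough text, this per-token cross-entropy cannot fall below the entropy of the text source, which is why a better predictor is also a better compressor.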
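[NLP-5] Towards Blind and Low-Vision Accessibility of Lightweight VLMs and Custom LLM-Evals
[Quick read]: This paper addresses the high memory, computation, and deployment demands that keep large vision-language models (VLMs) from practical use by blind and low-vision (BLV) users, who depend on detailed, context-aware descriptions. The work has three parts. First, it compares SmolVLM2 variants with 500M and 2.2B parameters on two datasets, AVCaps (outdoor) and Charades (indoor), to study how model size affects the quality of accessibility-focused descriptions. Second, it introduces two evaluation frameworks designed for BLV accessibility: the Multi-Context BLV Framework, which evaluates spatial orientation, social interaction, action events, and ambience, and the Navigational Assistance Framework, which focuses on mobility-critical information. Third, it deploys both models on a smartphone and evaluates FP32 and INT8 precision variants to measure real-world constraints on resource-limited devices.
Link: https://arxiv.org/abs/2511.10615
Authors: Shruti Singh Baghel, Yash Pratap Singh Rathore, Sushovan Jena, Anurag Pradhan, Amit Shukla, Arnav Bhavsar, Pawan Goyal
Affiliations: Indian Institute of Technology Mandi; Vellore Institute of Technology; Indian Institute of Technology Kharagpur
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 8 pages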
Abstract:Large Vision-Language Models (VLMs) excel at understanding and generating video descriptions but their high memory, computation, and deployment demands hinder practical use particularly for blind and low-vision (BLV) users who depend on detailed, context-aware descriptions. To study the effect of model size on accessibility-focused description quality, we evaluate SmolVLM2 variants with 500M and 2.2B parameters across two diverse datasets: AVCaps (outdoor), and Charades (indoor). In this work, we introduce two novel evaluation frameworks specifically designed for BLV accessibility assessment: the Multi-Context BLV Framework evaluating spatial orientation, social interaction, action events, and ambience contexts; and the Navigational Assistance Framework focusing on mobility-critical information. Additionally, we conduct a systematic evaluation of four different prompt design strategies and deploy both models on a smartphone, evaluating FP32 and INT8 precision variants to assess real-world performance constraints on resource-limited mobile devices.
[NLP-6] Mined Prompting and Metadata-Guided Generation for Wound Care Visual Question Answering
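[Quick read]: This paper addresses the increased provider workload created by asynchronous remote wound care, and proposes generative AI systems that help clinicians handle patient queries more efficiently. Two complementary methods are used. The first is mined prompting, a retrieval-based strategy that embeds the training data and retrieves the most similar examples as few-shot demonstrations, improving response relevance. The second is metadata-guided generation: a metadata ablation study identifies four attributes that consistently improve response quality, classifiers predict these attributes for test cases, and outputs are adjusted dynamically according to prediction confidence, improving clinical precision.
Link: https://arxiv.org/abs/2511.10591
Authors: Bavana Durgapraveen, Sornaraj Sivasankaran, Abhinand Balachandran, Sriram Rajkumar
Affiliations: EXL Service
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 2 figures, 11 pages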
Abstract:The rapid expansion of asynchronous remote care has intensified provider workload, creating demand for AI systems that can assist clinicians in managing patient queries more efficiently. The MEDIQA-WV 2025 shared task addresses this challenge by focusing on generating free-text responses to wound care queries paired with images. In this work, we present two complementary approaches developed for the English track. The first leverages a mined prompting strategy, where training data is embedded and the top-k most similar examples are retrieved to serve as few-shot demonstrations during generation. The second approach builds on a metadata ablation study, which identified four metadata attributes that consistently enhance response quality. We train classifiers to predict these attributes for test cases and incorporate them into the generation pipeline, dynamically adjusting outputs based on prediction confidence. Experimental results demonstrate that mined prompting improves response relevance, while metadata-guided generation further refines clinical precision. Together, these methods highlight promising directions for developing AI-driven tools that can provide reliable and efficient wound care support.
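The mined prompting step described above (embed the training data, retrieve the top-k most similar examples, use them as few-shot demonstrations) can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the helper names `top_k_similar` and `build_few_shot_prompt`, the cosine-similarity ranking, the prompt layout, and the two-dimensional placeholder vectors are all hypothetical. Real embeddings would come from an embedding model, which is omitted here.

```python
import numpy as np

def top_k_similar(query_vec, train_vecs, k=2):
    """Indices of the k training embeddings most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    t = train_vecs / np.linalg.norm(train_vecs, axis=1, keepdims=True)
    sims = t @ q
    return [int(i) for i in np.argsort(-sims)[:k]]

def build_few_shot_prompt(query, examples, indices):
    """Place the retrieved (question, answer) pairs ahead of the new query."""
    shots = [f"Q: {examples[i][0]}\nA: {examples[i][1]}" for i in indices]
    return "\n\n".join(shots + [f"Q: {query}\nA:"])

# Placeholder training set and embeddings (hypothetical values).
examples = [("example question 0", "example answer 0"),
            ("example question 1", "example answer 1"),
            ("example question 2", "example answer 2")]
train_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query_vec = np.array([1.0, 0.1])

idx = top_k_similar(query_vec, train_vecs, k=2)
prompt = build_few_shot_prompt("new question", examples, idx)
```

Normalizing both sides before the dot product makes the ranking depend on direction only, so a long training example does not win simply because its embedding has a larger norm.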
[NLP-7] Evaluating Prompting Strategies with MedGemma for Medical Order Extraction
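[Quick read]: This paper addresses the accurate extraction of medical orders from doctor-patient conversations, with the aim of reducing clinical documentation burden and protecting patient safety. It systematically evaluates prompting strategies for structured order extraction with MedGemma and finds that, although complex frameworks such as ReAct and multi-step agentic workflows are powerful, simple one-shot prompting achieved the best performance on the official validation set. The authors suggest that on manually annotated transcripts, complex reasoning chains can "overthink" and introduce noise, so a direct approach is more robust and efficient.
Link: https://arxiv.org/abs/2511.10583
Authors: Abhinand Balachandran, Bavana Durgapraveen, Gowsikkan Sikkan Sudhagar, Vidhya Varshany J S, Sriram Rajkumar
Affiliations: EXL Service
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 2 figures, 7 pages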
Abstract:The accurate extraction of medical orders from doctor-patient conversations is a critical task for reducing clinical documentation burdens and ensuring patient safety. This paper details our team submission to the MEDIQA-OE-2025 Shared Task. We investigate the performance of MedGemma, a new domain-specific open-source language model, for structured order extraction. We systematically evaluate three distinct prompting paradigms: a straightforward one-Shot approach, a reasoning-focused ReAct framework, and a multi-step agentic workflow. Our experiments reveal that while more complex frameworks like ReAct and agentic flows are powerful, the simpler one-shot prompting method achieved the highest performance on the official validation set. We posit that on manually annotated transcripts, complex reasoning chains can lead to “overthinking” and introduce noise, making a direct approach more robust and efficient. Our work provides valuable insights into selecting appropriate prompting strategies for clinical information extraction in varied data conditions.
[NLP-8] DESS: DeBERTa Enhanced Syntactic-Semantic Aspect Sentiment Triplet Extraction
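[Quick read]: This paper addresses Aspect Sentiment Triplet Extraction (ASTE), a fine-grained sentiment analysis task that requires accurately capturing the relationships among aspects, opinions, and sentiment polarities. The key idea is DESS, a dual-channel framework in which DeBERTa, with its enhanced attention mechanism, works alongside an LSTM channel to process semantic information and grammatical patterns, with refined interaction between the two channels so that long-range dependencies and complex sentence structures are handled better. On standard datasets the method improves F1 scores by 4.85, 8.36, and 2.42 points over current methods.
Link: https://arxiv.org/abs/2511.10577
Authors: Vishal Thenuwara, Nisansa de Silva
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 15 pages, 2 figures. Published in Proceedings of the 17th International Conference on Computational Collective Intelligence (ICCCI 2025), Lecture Notes in Artificial Intelligence, Springer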
Abstract:Fine-grained sentiment analysis faces ongoing challenges in Aspect Sentiment Triple Extraction (ASTE), particularly in accurately capturing the relationships between aspects, opinions, and sentiment polarities. While researchers have made progress using BERT and Graph Neural Networks, the full potential of advanced language models in understanding complex language patterns remains unexplored. We introduce DESS, a new approach that builds upon previous work by integrating DeBERTa’s enhanced attention mechanism to better understand context and relationships in text. Our framework maintains a dual-channel structure, where DeBERTa works alongside an LSTM channel to process both meaning and grammatical patterns in text. We have carefully refined how these components work together, paying special attention to how different types of language information interact. When we tested DESS on standard datasets, it showed meaningful improvements over current methods, with F1-score increases of 4.85, 8.36, and 2.42 in identifying aspect opinion pairs and determining sentiment accurately. Looking deeper into the results, we found that DeBERTa’s sophisticated attention system helps DESS handle complicated sentence structures better, especially when important words are far apart. Our findings suggest that upgrading to more advanced language models when thoughtfully integrated, can lead to real improvements in how well we can analyze sentiments in text. The implementation of our approach is publicly available at: this https URL.
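[NLP-9] Towards Emotionally Intelligent and Responsible Reinforcement Learning
[Quick read]: This paper addresses personalized decision systems in healthcare and behavioral support that rely on static rules or engagement-maximizing heuristics and overlook users' emotional context and ethical constraints. Such systems risk recommending insensitive or unsafe interventions, especially in high-risk settings involving serious mental illness, substance use disorders, or depression. The key idea is a Responsible Reinforcement Learning (RRL) framework that integrates emotional understanding and ethical considerations into sequential decision-making. It formulates personalization as a Constrained Markov Decision Process (CMDP) in which the agent optimizes engagement and adherence while ensuring emotional alignment and ethical safety. The framework introduces a multi-objective reward function that balances short-term behavioral engagement against long-term well-being, and an emotion-informed state representation that captures fluctuations in emotional readiness, affect, and risk.
Link: https://arxiv.org/abs/2511.10573
Authors: Garapati Keerthana, Manik Gupta
Affiliations: BITS Pilani Hyderabad
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)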
Abstract:Personalized decision systems in healthcare and behavioral support often rely on static rule-based or engagement-maximizing heuristics that overlook users’ emotional context and ethical constraints. Such approaches risk recommending insensitive or unsafe interventions, especially in domains involving serious mental illness, substance use disorders, or depression. To address this limitation, we propose a Responsible Reinforcement Learning (RRL) framework that integrates emotional and contextual understanding with ethical considerations into the sequential decision-making process. RRL formulates personalization as a Constrained Markov Decision Process (CMDP), where the agent optimizes engagement and adherence while ensuring emotional alignment and ethical safety. We introduce a multi-objective reward function that explicitly balances short-term behavioral engagement with long-term user well-being, and define an emotion-informed state representation that captures fluctuations in emotional readiness, affect, and risk. The proposed architecture can be instantiated with any RL algorithm (e.g., DQN, PPO) augmented with safety constraints or Lagrangian regularization. Conceptually, this framework operationalizes empathy and responsibility within machine learning policy optimization, bridging safe RL, affective computing and responsible AI. We discuss the implications of this approach for human-centric domains such as behavioral health, education, and digital therapeutics, and outline simulation-based validation paths for future empirical work. This paper aims to initiate a methodological conversation about ethically aligned reinforcement learning for emotionally aware and trustworthy personalization systems.
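For readers unfamiliar with the CMDP formulation and the Lagrangian regularization mentioned above, the textbook form is written out below. This is the generic formulation, not the paper's specific reward or constraint design: the symbols $r$ (reward), $c_i$ (cost signals, for example a safety or emotional-risk cost), $d_i$ (cost budgets), $\gamma$ (discount factor), and $\lambda_i$ (Lagrange multipliers) are notation assumed for this illustration.

```latex
\begin{aligned}
&\max_{\pi}\; J_r(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\Big] \\
&\text{subject to}\quad J_{c_i}(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, c_i(s_t, a_t)\Big] \le d_i,
\qquad i = 1, \dots, m, \\
&\mathcal{L}(\pi, \lambda) = J_r(\pi) - \sum_{i=1}^{m} \lambda_i \big(J_{c_i}(\pi) - d_i\big),
\qquad \lambda_i \ge 0 .
\end{aligned}
```

In the Lagrangian approach the policy is updated to increase $\mathcal{L}$ while each $\lambda_i$ is raised whenever its constraint is violated, so the penalty on unsafe behavior grows until the expected cost falls back under its budget.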
[NLP-10] Impact of Layer Norm on Memorization and Generalization in Transformers [NeurIPS 2025]
[Quick read]: This paper addresses the unclear effect of Layer Normalization (LayerNorm) on learning and memorization in Pre-LayerNorm and Post-LayerNorm transformers. The authors systematically analyze LayerNorm at different depths (early, middle, and later layers) and find that it is a key factor for stable learning in Pre-LayerNorm models, whereas in Post-LayerNorm models it affects memorization. Removing LayerNorm parameters in Pre-LayerNorm models exacerbates memorization and destabilizes learning, while in Post-LayerNorm models it mitigates memorization by restoring genuine labels. LayerNorm in the early layers is the most critical, and its influence differs between the two architectures.
Link: https://arxiv.org/abs/2511.10566
Authors: Rishi Singhal, Jung-Eun Kim
Affiliations: North Carolina State University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: NeurIPS 2025
Abstract:Layer Normalization (LayerNorm) is one of the fundamental components in transformers that stabilizes training and improves optimization. In recent times, Pre-LayerNorm transformers have become the preferred choice over Post-LayerNorm transformers due to their stable gradient flow. However, the impact of LayerNorm on learning and memorization across these architectures remains unclear. In this work, we investigate how LayerNorm influences memorization and learning for Pre- and Post-LayerNorm transformers. We identify that LayerNorm serves as a key factor for stable learning in Pre-LayerNorm transformers, while in Post-LayerNorm transformers, it impacts memorization. Our analysis reveals that eliminating LayerNorm parameters in Pre-LayerNorm models exacerbates memorization and destabilizes learning, while in Post-LayerNorm models, it effectively mitigates memorization by restoring genuine labels. We further precisely identify that early layers LayerNorm are the most critical over middle/later layers and their influence varies across Pre and Post LayerNorm models. We have validated it through 13 models across 6 Vision and Language datasets. These insights shed new light on the role of LayerNorm in shaping memorization and learning in transformers.
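The structural difference between the two architectures discussed above comes down to where the normalization sits relative to the residual connection. The NumPy sketch below shows that difference only. It is not the paper's experimental setup: `layer_norm` here has no learned gain or bias (the "LayerNorm parameters" the paper ablates), and the `sublayer` function is a hypothetical stand-in for an attention or MLP block.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learned gain/bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def pre_ln_block(x, sublayer):
    """Pre-LayerNorm: normalize the sublayer input; the residual path is untouched."""
    return x + sublayer(layer_norm(x))

def post_ln_block(x, sublayer):
    """Post-LayerNorm: normalize after adding the residual."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8)) * 0.1

def sublayer(h):
    # Hypothetical stand-in for an attention or MLP sublayer.
    return np.tanh(h @ W)

x = rng.normal(size=(4, 8)) * 5.0
y_pre = pre_ln_block(x, sublayer)
y_post = post_ln_block(x, sublayer)
```

In the pre-LN block the input `x` reaches the output unchanged through the residual path, which is the usual explanation for its more stable gradient flow. In the post-LN block every output row is renormalized, so the residual signal is rescaled at each layer.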
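[NLP-11] URaG: Unified Retrieval and Generation in Multimodal LLMs for Efficient Long Document Understanding [AAAI 2026]
[Quick read]: This paper addresses two challenges that multimodal large language models (MLLMs) face in long document understanding: information interference from abundant irrelevant content, and the quadratic computational cost of Transformer-based architectures. Existing approaches either compress tokens, which loses fine-grained detail, or add external retrievers, which increase system complexity and prevent end-to-end optimization. The authors observe that MLLMs show a human-like coarse-to-fine reasoning pattern: early Transformer layers attend broadly across the document, while deeper layers focus on relevant evidence pages. Building on this, URaG unifies retrieval and generation within a single MLLM. A lightweight cross-modal retrieval module turns the early layers into an efficient evidence selector that keeps the most relevant pages and discards the rest, so that deeper layers spend computation only on pertinent information. Experiments show state-of-the-art performance with computational overhead reduced by 44-56%.
Link: https://arxiv.org/abs/2511.10552
Authors: Yongxin Shi, Jiapeng Wang, Zeyu Shan, Dezhi Peng, Zening Lin, Lianwen Jin
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted by AAAI 2026 (Oral)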
Abstract:Recent multimodal large language models (MLLMs) still struggle with long document understanding due to two fundamental challenges: information interference from abundant irrelevant content, and the quadratic computational cost of Transformer-based architectures. Existing approaches primarily fall into two categories: token compression, which sacrifices fine-grained details; and introducing external retrievers, which increase system complexity and prevent end-to-end optimization. To address these issues, we conduct an in-depth analysis and observe that MLLMs exhibit a human-like coarse-to-fine reasoning pattern: early Transformer layers attend broadly across the document, while deeper layers focus on relevant evidence pages. Motivated by this insight, we posit that the inherent evidence localization capabilities of MLLMs can be explicitly leveraged to perform retrieval during the reasoning process, facilitating efficient long document understanding. To this end, we propose URaG, a simple-yet-effective framework that Unifies Retrieval and Generation within a single MLLM. URaG introduces a lightweight cross-modal retrieval module that converts the early Transformer layers into an efficient evidence selector, identifying and preserving the most relevant pages while discarding irrelevant content. This design enables the deeper layers to concentrate computational resources on pertinent information, improving both accuracy and efficiency. Extensive experiments demonstrate that URaG achieves state-of-the-art performance while reducing computational overhead by 44-56%. The code is available at this https URL.
[NLP-12] Computing the Formal and Institutional Boundaries of Contemporary Genre and Literary Fiction
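[Quick read]: This paper asks whether the distinction between genre fiction and literary fiction in contemporary literary classification rests on formal characteristics or on institutional ones. It applies computational methods to a corpus of literary and genre fiction (romance, mystery, and science fiction) drawn from Andrew Piper's CONLIT dataset. The analysis has three steps. First, Welch's ANOVA compares the distribution of narrative features by author gender within each genre and between genre and literary fiction. Second, logistic regression models the effect of each feature on literary classification and measures how author gender moderates these effects. Third, stylistic and semantic vector representations are analyzed to understand the roles of form and content in classification. The results show statistically significant formal markers for each literary category and indicate that female authorship narrows and blurs the target for achieving literary status.
Link: https://arxiv.org/abs/2511.10546
Authors: Natasha Johnson
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: To be presented at Computational Humanities Research (CHR) Conference, 2025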
Abstract:Though the concept of genre has been a subject of discussion for millennia, the relatively recent emergence of genre fiction has added a new layer to this ongoing conversation. While more traditional perspectives on genre have emphasized form, contemporary scholarship has invoked both formal and institutional characteristics in its taxonomy of genre, genre fiction, and literary fiction. This project uses computational methods to explore the soundness of genre as a formal designation as opposed to an institutional one. Pulling from Andrew Piper’s CONLIT dataset of Contemporary Literature, we assemble a corpus of literary and genre fiction, with the latter category containing romance, mystery, and science fiction novels. We use Welch’s ANOVA to compare the distribution of narrative features according to author gender within each genre and within genre versus literary fiction. Then, we use logistic regression to model the effect that each feature has on literary classification and to measure how author gender moderates these effects. Finally, we analyze stylistic and semantic vector representations of our genre categories to understand the importance of form and content in literary classification. This project finds statistically significant formal markers of each literary category and illustrates how female authorship narrows and blurs the target for achieving literary status.
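[NLP-13] Convomem Benchmark: Why Your First 150 Conversations Don't Need RAG
[Quick read]: This paper addresses fundamental limitations of current conversational memory evaluation frameworks: insufficient statistical power, inconsistent data generation, and limited evaluation flexibility. It builds a benchmark of 75,336 question-answer pairs covering user facts, assistant recall, abstention, preferences, temporal changes, and implicit connections, and examines the relationship between conversational memory and retrieval-augmented generation (RAG). Although the two share architectural patterns (temporal reasoning, implicit extraction, knowledge updates, and graph representations), conversational memory starts from zero and grows with each conversation, which makes naive approaches practical. Simple full-context approaches reach 70-82% accuracy even on the hardest multi-message evidence cases, whereas sophisticated RAG-based memory systems such as Mem0 reach only 30-45% on histories under 150 interactions. Long context excels for the first 30 conversations and remains viable up to about 150, so the small-corpus advantage deserves dedicated research rather than the direct application of general RAG solutions.
Link: https://arxiv.org/abs/2511.10523
Authors: Egor Pakhomov, Erik Nijkamp, Caiming Xiong
Affiliations: Salesforce AI Research
Subjects: Computation and Language (cs.CL)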
Abstract:We introduce a comprehensive benchmark for conversational memory evaluation containing 75,336 question-answer pairs across diverse categories including user facts, assistant recall, abstention, preferences, temporal changes, and implicit connections. While existing benchmarks have advanced the field, our work addresses fundamental challenges in statistical power, data generation consistency, and evaluation flexibility that limit current memory evaluation frameworks. We examine the relationship between conversational memory and retrieval-augmented generation (RAG). While these systems share fundamental architectural patterns–temporal reasoning, implicit extraction, knowledge updates, and graph representations–memory systems have a unique characteristic: they start from zero and grow progressively with each conversation. This characteristic enables naive approaches that would be impractical for traditional RAG. Consistent with recent findings on long context effectiveness, we observe that simple full-context approaches achieve 70-82% accuracy even on our most challenging multi-message evidence cases, while sophisticated RAG-based memory systems like Mem0 achieve only 30-45% when operating on conversation histories under 150 interactions. Our analysis reveals practical transition points: long context excels for the first 30 conversations, remains viable with manageable trade-offs up to 150 conversations, and typically requires hybrid or RAG approaches beyond that point as costs and latencies become prohibitive. These patterns indicate that the small-corpus advantage of conversational memory–where exhaustive search and complete reranking are feasible–deserves dedicated research attention rather than simply applying general RAG solutions to conversation histories.
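The transition points reported in the abstract above can be turned into a simple routing rule. The sketch below is only a restatement of those thresholds as code, not something the paper provides: the function name `suggest_memory_strategy`, the returned labels, and the choice to treat 30 and 150 as inclusive boundaries are assumptions for illustration.

```python
def suggest_memory_strategy(num_conversations):
    """Map conversation-history size to the approach the abstract's
    transition points suggest (boundaries treated as inclusive here)."""
    if num_conversations < 0:
        raise ValueError("num_conversations must be non-negative")
    if num_conversations <= 30:
        return "full context"
    if num_conversations <= 150:
        return "full context with cost and latency trade-offs"
    return "hybrid or RAG"

choices = [suggest_memory_strategy(n) for n in (5, 30, 31, 150, 151)]
```

A real deployment would more likely switch on token count, cost, or latency than on a raw conversation count, since conversations vary widely in length.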
[NLP-14] Say It Differently: Linguistic Styles as Jailbreak Vectors
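[Quick read]: This paper addresses a gap in LLM safety evaluation: linguistic style is overlooked as an attack surface. Changing the tone or emotional coloring of a prompt (for example fearful, curious, or compassionate) can reframe harmful intent, bypass safety mechanisms, and elicit unsafe responses. The authors build a style-augmented jailbreak benchmark by rewriting prompts from standard datasets into 11 linguistic styles while preserving semantic intent, and use it to test model robustness systematically. As a mitigation they propose a style-neutralization preprocessing step in which a secondary LLM strips manipulative stylistic cues from user input, which significantly reduces jailbreak success rates. The findings reveal a systemic, scaling-resistant vulnerability overlooked in current safety pipelines.
Link: https://arxiv.org/abs/2511.10519
Authors: Srikant Panda, Avinash Rai
Affiliations: Independent Researcher; Oracle AI
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)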
Abstract:Large Language Models (LLMs) are commonly evaluated for robustness against paraphrased or semantically equivalent jailbreak prompts, yet little attention has been paid to linguistic variation as an attack surface. In this work, we systematically study how linguistic styles such as fear or curiosity can reframe harmful intent and elicit unsafe responses from aligned models. We construct style-augmented jailbreak benchmark by transforming prompts from 3 standard datasets into 11 distinct linguistic styles using handcrafted templates and LLM-based rewrites, while preserving semantic intent. Evaluating 16 open- and close-source instruction-tuned models, we find that stylistic reframing increases jailbreak success rates by up to +57 percentage points. Styles such as fearful, curious and compassionate are most effective and contextualized rewrites outperform templated variants. To mitigate this, we introduce a style neutralization preprocessing step using a secondary LLM to strip manipulative stylistic cues from user inputs, significantly reducing jailbreak success rates. Our findings reveal a systemic and scaling-resistant vulnerability overlooked in current safety pipelines.
[NLP-15] LOCA-R: Near-Perfect Performance on the Chinese Physics Olympiad 2025
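[Quick read]: This paper addresses Olympiad-level physics problem solving, which challenges both humans and AI because it requires precise calculation, abstract reasoning, and a fundamental grasp of physical principles. The authors present LOCA-R (LOgical Chain Augmentation for Reasoning), an improved version of the LOCA framework adapted for complex reasoning. Its key is to build and refine structured logical chains that strengthen the model's multi-step reasoning in complex physical settings. On the theory examination of the 2025 Chinese Physics Olympiad (CPhO), LOCA-R scores a near-perfect 313 out of 320 points, surpassing the highest-scoring human competitor and significantly outperforming all baseline methods.
Link: https://arxiv.org/abs/2511.10515
Authors: Dong-Shan Jian, Xiang Li, Chen-Xu Yan, Hui-Wen Zheng, Zhi-Zhang Bian, You-Le Fang, Sheng-Qi Zhang, Bing-Rui Gong, Ren-Xi He, Jing-Tian Zhang, Ce Meng, Yan-Qing Ma
Affiliations: Peking University; Center for High Energy Physics, Peking University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Physics Education (physics.ed-ph)
Comments: 19 pages, 3 figures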
Abstract:Olympiad-level physics problem-solving presents a significant challenge for both humans and artificial intelligence (AI), as it requires a sophisticated integration of precise calculation, abstract reasoning, and a fundamental grasp of physical principles. The Chinese Physics Olympiad (CPhO), renowned for its complexity and depth, serves as an ideal and rigorous testbed for these advanced capabilities. In this paper, we introduce LOCA-R (LOgical Chain Augmentation for Reasoning), an improved version of the LOCA framework adapted for complex reasoning, and apply it to the CPhO 2025 theory examination. LOCA-R achieves a near-perfect score of 313 out of 320 points, solidly surpassing the highest-scoring human competitor and significantly outperforming all baseline methods.
[NLP-16] Rubric-Based Benchmarking and Reinforcement Learning for Advancing LLM Instruction Following
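【Quick read】: This paper addresses the marked weakness of large language models (LLMs) at following complex, multi-turn, and system-level instructions (instruction following, IF); the core obstacle is the lack of high-quality human-annotated benchmarks and of reliable, interpretable reward signals. The solution has two parts: AdvancedIF, a comprehensive benchmark, and RIFL (Rubric-based Instruction-Following Learning), a post-training pipeline. AdvancedIF uses over 1,600 prompts with expert-curated, fine-grained rubrics to assess instruction following precisely; RIFL combines rubric generation, a finetuned rubric verifier, and reward shaping into a reinforcement learning framework that improves instruction following. Experiments show that RIFL yields a 6.7% absolute gain on AdvancedIF and strong results on several public benchmarks, and ablations confirm that each component contributes.
Link: https://arxiv.org/abs/2511.10507
Authors: Yun He, Wenzhe Li, Hejia Zhang, Songlin Li, Karishma Mandyam, Sopan Khosla, Yuanhao Xiong, Nanshu Wang, Selina Peng, Beibin Li, Shengjie Bi, Shishir G. Patil, Qi Qi, Shengyu Feng, Julian Katz-Samuels, Richard Yuanzhe Pang, Sujan Gonugondla, Hunter Lang, Yue Yu, Yundi Qian, Maryam Fazel-Zarandi, Licheng Yu, Amine Benhalloum, Hany Awadalla, Manaal Faruqui
Affiliations: Meta
Categories: Computation and Language (cs.CL)
Notes: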
Abstract:Recent progress in large language models (LLMs) has led to impressive performance on a range of tasks, yet advanced instruction following (IF), especially for complex, multi-turn, and system-prompted instructions, remains a significant challenge. Rigorous evaluation and effective training for such capabilities are hindered by the lack of high-quality, human-annotated benchmarks and reliable, interpretable reward signals. In this work, we introduce AdvancedIF (we will release this benchmark soon), a comprehensive benchmark featuring over 1,600 prompts and expert-curated rubrics that assess LLMs' ability to follow complex, multi-turn, and system-level instructions. We further propose RIFL (Rubric-based Instruction-Following Learning), a novel post-training pipeline that leverages rubric generation, a finetuned rubric verifier, and reward shaping to enable effective reinforcement learning for instruction following. Extensive experiments demonstrate that RIFL substantially improves the instruction-following abilities of LLMs, achieving a 6.7% absolute gain on AdvancedIF and strong results on public benchmarks. Our ablation studies confirm the effectiveness of each component in RIFL. This work establishes rubrics as a powerful tool for both training and evaluating advanced IF in LLMs, paving the way for more capable and reliable AI systems.
[NLP-17] Beyond Elicitation: Provision-based Prompt Optimization for Knowledge-Intensive Tasks
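【Quick read】: This paper addresses the limits of existing prompt optimization methods on knowledge-intensive tasks: elicitation-based strategies are confined to the model's fixed parametric boundaries and cannot supply the factual knowledge, terminological precision, and reasoning patterns that specialized domains require. The solution is Knowledge-Provision-based Prompt Optimization (KPPO), a framework that recasts prompt optimization as systematic knowledge integration. It rests on three innovations: 1) a knowledge gap filling mechanism that identifies gaps and remediates them in a targeted way; 2) a batch-wise candidate evaluation approach that weighs both performance improvement and distributional stability; 3) an adaptive knowledge pruning strategy that preserves performance while cutting token usage by up to 29%, giving more efficient and precise knowledge-driven prompt optimization.
Link: https://arxiv.org/abs/2511.10465
Authors: Yunzhe Xu, Zhuosheng Zhang, Zhe Liu
Affiliations: Shanghai Jiao Tong University; Key Laboratory of System Control and Information Processing, Ministry of Education of China
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: 16 pages, 19 figures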
Abstract:While prompt optimization has emerged as a critical technique for enhancing language model performance, existing approaches primarily focus on elicitation-based strategies that search for optimal prompts to activate models’ capabilities. These methods exhibit fundamental limitations when addressing knowledge-intensive tasks, as they operate within fixed parametric boundaries rather than providing the factual knowledge, terminology precision, and reasoning patterns required in specialized domains. To address these limitations, we propose Knowledge-Provision-based Prompt Optimization (KPPO), a framework that reformulates prompt optimization as systematic knowledge integration rather than potential elicitation. KPPO introduces three key innovations: 1) a knowledge gap filling mechanism for knowledge gap identification and targeted remediation; 2) a batch-wise candidate evaluation approach that considers both performance improvement and distributional stability; 3) an adaptive knowledge pruning strategy that balances performance and token efficiency, reducing up to 29% token usage. Extensive evaluation on 15 knowledge-intensive benchmarks from various domains demonstrates KPPO’s superiority over elicitation-based methods, with an average performance improvement of ~6% over the strongest baseline while achieving comparable or lower token consumption. Code at: this https URL.
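[NLP-18] LocalBench: Benchmarking LLMs on County-Level Local Knowledge and Reasoning
【Quick read】: This paper addresses the weak handling of hyper-local knowledge by current large language models (LLMs): models do well on macro-scale geographic tasks but poorly on fine-grained local information such as community-level dynamics, cultural narratives, and local governance. The solution is LocalBench, the first benchmark for systematic evaluation of this ability. Grounded in the Localness Conceptual Framework, it covers 526 counties in 49 U.S. states with 14,782 validated question-answer pairs drawn from heterogeneous sources such as Census data, local Reddit discussions, and regional news, spanning the physical, cognitive, and relational dimensions of locality. Evaluating 13 state-of-the-art LLMs in closed-book and web-augmented settings shows that the best models reach only 56.8% accuracy on narrative-style questions and stay below 15.5% on numerical reasoning, and that larger models or web retrieval do not necessarily help, which underscores the need for place-aware AI systems that are geographically sensitive and equitable.
Link: https://arxiv.org/abs/2511.10459
Authors: Zihan Gao, Yifei Xu, Jacob Thebault-Spieker
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Notes: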
Abstract:Large language models (LLMs) have been widely evaluated on macro-scale geographic tasks, such as global factual recall, event summarization, and regional reasoning. Yet, their ability to handle hyper-local knowledge remains poorly understood. This gap is increasingly consequential as real-world applications, from civic platforms to community journalism, demand AI systems that can reason about neighborhood-specific dynamics, cultural narratives, and local governance. Existing benchmarks fall short in capturing this complexity, often relying on coarse-grained data or isolated references. We present LocalBench, the first benchmark designed to systematically evaluate LLMs on county-level local knowledge across the United States. Grounded in the Localness Conceptual Framework, LocalBench includes 14,782 validated question-answer pairs across 526 U.S. counties in 49 states, integrating diverse sources such as Census statistics, local subreddit discourse, and regional news. It spans physical, cognitive, and relational dimensions of locality. Using LocalBench, we evaluate 13 state-of-the-art LLMs under both closed-book and web-augmented settings. Our findings reveal critical limitations: even the best-performing models reach only 56.8% accuracy on narrative-style questions and perform below 15.5% on numerical reasoning. Moreover, larger model size and web augmentation do not guarantee better performance. For example, search improves Gemini’s accuracy by +13.6% but reduces GPT-series performance by -11.4%. These results underscore the urgent need for language models that can support equitable, place-aware AI systems capable of engaging with the diverse, fine-grained realities of local communities across geographic and cultural contexts.
[NLP-19] Exploring State Tracking Capabilities of Large Language Models
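【Quick read】: This paper addresses the limited ability of large language models (LLMs) to perform state tracking over multiple entities in complex tasks, where the model must continually maintain and update the state of several entities. The solution is a benchmark built on three well-defined state tracking tasks, which isolates state tracking from other factors and evaluates different LLMs systematically across scenarios. Experiments show that recent models (such as GPT-4 and Llama3) track state well when combined with Chain of Thought, whereas earlier models grasp the task at first but tend to fail over longer reasoning sequences.
Link: https://arxiv.org/abs/2511.10457
Authors: Kiamehr Rezaee, Jose Camacho-Collados, Mohammad Taher Pilehvar
Affiliations: Cardiff NLP, School of Computer Science and Informatics
Categories: Computation and Language (cs.CL)
Notes: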
Abstract:Large Language Models (LLMs) have demonstrated impressive capabilities in solving complex tasks, including those requiring a certain level of reasoning. In this paper, we focus on state tracking, a problem where models need to keep track of the state governing a number of entities. To isolate the state tracking component from other factors, we propose a benchmark based on three well-defined state tracking tasks and analyse the performance of LLMs in different scenarios. The results indicate that the recent generation of LLMs (specifically, GPT-4 and Llama3) are capable of tracking state, especially when integrated with mechanisms such as Chain of Thought. However, models from the former generation, while understanding the task and being able to solve it at the initial stages, often fail at this task after a certain number of steps.
[NLP-20] Reasoning About Intent for Ambiguous Requests
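【Quick read】: This paper addresses intent misunderstanding when large language models face ambiguous requests and implicitly commit to a single interpretation, which can frustrate users and create safety risks. The solution is to train models with reinforcement learning and customized reward functions so that they emit multiple structured interpretation-answer pairs in a single generation, which raises coverage of valid answers and makes responses more transparent and controllable.
Link: https://arxiv.org/abs/2511.10453
Authors: Irina Saparina, Mirella Lapata
Affiliations: University of Edinburgh
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: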
Abstract:Large language models often respond to ambiguous requests by implicitly committing to one interpretation. Intent misunderstandings can frustrate users and create safety risks. To address this, we propose generating multiple interpretation-answer pairs in a single structured response to ambiguous requests. Our models are trained with reinforcement learning and customized reward functions using multiple valid answers as supervision. Experiments on conversational question answering and semantic parsing demonstrate that our method achieves higher coverage of valid answers than baseline approaches. Human evaluation confirms that predicted interpretations are highly aligned with their answers. Our approach promotes transparency with explicit interpretations, achieves efficiency by requiring only one generation step, and supports downstream applications through its structured output format.
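[NLP-21] Analogical Structure, Minimal Contextual Cues and Contrastive Distractors: Input Design for Sample-Efficient Linguistic Rule Induction
【Quick read】: This paper asks how a lightweight language model (BERT+CNN, only 0.5M parameters) can, with very little data, learn grammatical rules as well as large pretrained models. The solution is a cognitively inspired strategy of analogical paradigm organization built on three principles: analogical structure, contrastive learning, and minimal contextual cues. Trained on structured completion tasks with analogical patterns and contrastive alternatives, the model needs only 100 examples of the English causative/inchoative alternation to reach F1=0.95, clearly ahead of zero-shot GPT-o3 (F1=0.87), and ablations confirm that analogical organization and contrastive structure both improve performance.
Link: https://arxiv.org/abs/2511.10441
Authors: Chunyang Jiang, Paola Merlo
Affiliations: Idiap Research Institute; University of Geneva
Categories: Computation and Language (cs.CL)
Notes: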
Abstract:Large language models achieve strong performance through training on vast datasets. Can analogical paradigm organization enable lightweight models to match this performance with minimal data? We develop a computational approach implementing three cognitive-inspired principles: analogical structure, contrastive learning, and minimal contextual cues. We test this approach with structured completion tasks where models identify correct sentence completions from analogical patterns with contrastive alternatives. Training lightweight models (BERT+CNN, 0.5M parameters) on only one hundred structured examples of English causative/inchoative alternations achieves F1=0.95, outperforming zero-shot GPT-o3 (F1=0.87). Ablation studies confirm that analogical organization and contrastive structure improve performance, consistently surpassing randomly shuffled baselines across architectures. Cross-phenomenon validation using unspecified object alternations replicates these efficiency gains, confirming approach robustness. Our results show that analogical paradigm organization enables competitive linguistic rule learning with orders of magnitude less data than conventional approaches require.
[NLP-22] DELICATE: Diachronic Entity LInking using Classes And Temporal Evidence
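【Quick read】: This paper addresses three challenges for Entity Linking (EL) in the humanities: complex document typologies, a lack of domain-specific datasets and models, and long-tail entities (entities under-represented in knowledge bases). It makes two contributions. The first is DELICATE, a novel neuro-symbolic method that combines a BERT-based encoder with contextual information from Wikidata and selects knowledge base entities using temporal plausibility and entity type consistency. The second is ENEIDE, an EL corpus built semi-automatically from multi-domain historical Italian texts from the 19th to the 20th century. Experiments show that DELICATE outperforms other EL models on historical Italian, including larger neural architectures, and that its confidence scores and feature sensitivity analysis make it more explainable and interpretable than purely neural methods.
Link: https://arxiv.org/abs/2511.10404
Authors: Cristian Santini, Sebastian Barzaghi, Paolo Sernani, Emanuele Frontoni, Mehwish Alam
Affiliations: University of Macerata; University of Bologna; Telecom Paris
Categories: Computation and Language (cs.CL)
Notes: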
Abstract:In spite of the remarkable advancements in the field of Natural Language Processing, the task of Entity Linking (EL) remains challenging in the field of humanities due to complex document typologies, lack of domain-specific datasets and models, and long-tail entities, i.e., entities under-represented in Knowledge Bases (KBs). The goal of this paper is to address these issues with two main contributions. The first contribution is DELICATE, a novel neuro-symbolic method for EL on historical Italian which combines a BERT-based encoder with contextual information from Wikidata to select appropriate KB entities using temporal plausibility and entity type consistency. The second contribution is ENEIDE, a multi-domain EL corpus in historical Italian semi-automatically extracted from two annotated editions spanning from the 19th to the 20th century and including literary and political texts. Results show how DELICATE outperforms other EL models in historical Italian even if compared with larger architectures with billions of parameters. Moreover, further analyses reveal how DELICATE confidence scores and features sensitivity provide results which are more explainable and interpretable than purely neural methods.
[NLP-23] Rethinking the Reliability of Multi-agent System: A Perspective from Byzantine Fault Tolerance
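【Quick read】: This paper addresses the reliability of agent architectures in multi-agent systems (MAS), and in particular how to identify problematic agents when failures occur. Large language models (LLMs) are now widely used in MAS and have greatly improved complex problem solving and world modeling, but their effect on reliability has not been studied enough. Taking the perspective of Byzantine fault tolerance, the authors find that LLM-based agents are more skeptical when processing erroneous message flows and therefore outperform traditional agents across topologies. The key to the solution is CP-WBFT, a confidence probe-based weighted Byzantine Fault Tolerant consensus mechanism, which exploits the intrinsic reflective and discriminative abilities of LLMs through probe-based, weighted information flow to stabilize the system. Experiments show that it keeps high accuracy and strong reliability even at an extreme Byzantine fault rate (85.7%), notably on mathematical reasoning and safety assessment tasks.
Link: https://arxiv.org/abs/2511.10400
Authors: Lifan Zheng, Jiawei Chen, Qinghong Yin, Jingyuan Zhang, Xinyi Zeng, Yu Tian
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: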
Abstract:Ensuring the reliability of agent architectures and effectively identifying problematic agents when failures occur are crucial challenges in multi-agent systems (MAS). Advances in large language models (LLMs) have established LLM-based agents as a major branch of MAS, enabling major breakthroughs in complex problem solving and world modeling. However, the reliability implications of this shift remain largely unexplored, i.e., whether substituting traditional agents with LLM-based agents can effectively enhance the reliability of MAS. In this work, we investigate and quantify the reliability of LLM-based agents from the perspective of Byzantine fault tolerance. We observe that LLM-based agents demonstrate stronger skepticism when processing erroneous message flows, a characteristic that enables them to outperform traditional agents across different topological structures. Motivated by the results of the pilot experiment, we design CP-WBFT, a confidence probe-based weighted Byzantine Fault Tolerant consensus mechanism to enhance the stability of MAS with different topologies. It capitalizes on the intrinsic reflective and discriminative capabilities of LLMs by employing a probe-based, weighted information flow transmission method to improve the reliability of LLM-based agents. Extensive experiments demonstrate that CP-WBFT achieves superior performance across diverse network topologies under extreme Byzantine conditions (85.7% fault rate). Notably, our approach surpasses traditional methods by attaining remarkable accuracy on various topologies and maintaining strong reliability in both mathematical reasoning and safety assessment tasks.
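The abstract describes a confidence-weighted consensus without giving its details. As a rough illustration of the general idea only, the Python sketch below lets each agent's answer count in proportion to a confidence weight, so a minority of high-confidence agents can outvote a low-confidence majority. The function name `weighted_vote` and the confidence values are hypothetical; in CP-WBFT the weights come from a confidence probe that is not reproduced here.

```python
from collections import defaultdict

def weighted_vote(answers, confidences):
    """Return the answer with the largest total confidence weight."""
    if len(answers) != len(confidences):
        raise ValueError("answers and confidences must have the same length")
    totals = defaultdict(float)
    for answer, confidence in zip(answers, confidences):
        totals[answer] += confidence
    return max(totals, key=totals.get)

# Illustrative setup: six of seven agents are faulty (about 85.7%),
# chosen only to mirror the fault rate quoted in the abstract.
# The confidence weights are made-up values, not probe outputs.
answers = ["41", "41", "41", "41", "41", "41", "42"]
confidences = [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.9]
print(weighted_vote(answers, confidences))
```

With equal weights this reduces to plain majority voting, which is why the quality of the confidence estimates matters so much in a scheme of this kind.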
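[NLP-24] AgentEvolver: Towards Efficient Self-Evolving Agent System
【Quick read】: This paper addresses the high cost and low efficiency of developing autonomous agents based on large language models (LLMs): task datasets are built by hand, random exploration in reinforcement learning (RL) pipelines is inefficient, and sample utilization is poor. The solution is AgentEvolver, a system that evolves autonomously through three synergistic mechanisms: (i) self-questioning, which drives curiosity-driven task generation and reduces dependence on handcrafted datasets; (ii) self-navigating, which improves exploration efficiency through experience reuse and hybrid policy guidance; (iii) self-attributing, which assigns differentiated rewards to the states and actions of a trajectory according to their contribution, improving sample efficiency. The framework makes improvement of agent capabilities scalable, low-cost, and continual.
Link: https://arxiv.org/abs/2511.10395
Authors: Yunpeng Zhai, Shuchang Tao, Cheng Chen, Anni Zou, Ziqian Chen, Qingxu Fu, Shinji Mai, Li Yu, Jiaji Deng, Zouying Cao, Zhaoyang Liu, Bolin Ding, Jingren Zhou
Affiliations: Tongyi Lab, Alibaba Group
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: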
Abstract:Autonomous agents powered by large language models (LLMs) have the potential to significantly enhance human productivity by reasoning, using tools, and executing complex tasks in diverse environments. However, current approaches to developing such agents remain costly and inefficient, as they typically require manually constructed task datasets and reinforcement learning (RL) pipelines with extensive random exploration. These limitations lead to prohibitively high data-construction costs, low exploration efficiency, and poor sample utilization. To address these challenges, we present AgentEvolver, a self-evolving agent system that leverages the semantic understanding and reasoning capabilities of LLMs to drive autonomous agent learning. AgentEvolver introduces three synergistic mechanisms: (i) self-questioning, which enables curiosity-driven task generation in novel environments, reducing dependence on handcrafted datasets; (ii) self-navigating, which improves exploration efficiency through experience reuse and hybrid policy guidance; and (iii) self-attributing, which enhances sample efficiency by assigning differentiated rewards to trajectory states and actions based on their contribution. By integrating these mechanisms into a unified framework, AgentEvolver enables scalable, cost-effective, and continual improvement of agent capabilities. Preliminary experiments indicate that AgentEvolver achieves more efficient exploration, better sample utilization, and faster adaptation compared to traditional RL-based baselines.
[NLP-25] Simulating Misinformation Propagation in Social Networks using Large Language Models CIKM2025
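【Quick read】: This paper addresses the poorly understood mechanisms by which misinformation spreads on social media, in particular how it is shaped by human cognitive biases, identity, and ideology-driven reasoning. The solution is an auditor-node framework based on large language model (LLM) persona simulation: each LLM node is a synthetic agent with a specific identity or ideological profile that mimics user-level biases and trust heuristics. Through repeated content rewriting and question-answering-based auditing, the framework tracks in an interpretable way how misinformation drifts step by step along a propagation chain and quantifies its factual degradation, revealing how different persona types affect the rate and severity of misinformation spread.
Link: https://arxiv.org/abs/2511.10384
Authors: Raj Gaurav Maurya, Vaibhav Shukla, Raj Abhijit Dandekar, Rajat Dandekar, Sreedath Panat
Affiliations: Technische Universität München; Friedrich-Alexander-Universität Erlangen-Nürnberg; Vizuara AI Labs
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Notes: Accepted to CIKM 2025 Workshop LASS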
Abstract:Misinformation on social media thrives on surprise, emotion, and identity-driven reasoning, often amplified through human cognitive biases. To investigate these mechanisms, we model large language model (LLM) personas as synthetic agents that mimic user-level biases, ideological alignments, and trust heuristics. Within this setup, we introduce an auditor–node framework to simulate and analyze how misinformation evolves as it circulates through networks of such agents. News articles are propagated across networks of persona-conditioned LLM nodes, each rewriting received content. A question–answering-based auditor then measures factual fidelity at every step, offering interpretable, claim-level tracking of misinformation drift. We formalize a misinformation index and a misinformation propagation rate to quantify factual degradation across homogeneous and heterogeneous branches of up to 30 sequential rewrites. Experiments with 21 personas across 10 domains reveal that identity- and ideology-based personas act as misinformation accelerators, especially in politics, marketing, and technology. By contrast, expert-driven personas preserve factual stability. Controlled-random branch simulations further show that once early distortions emerge, heterogeneous persona interactions rapidly escalate misinformation to propaganda-level distortion. Our taxonomy of misinformation severity – spanning factual errors, lies, and propaganda – connects observed drift to established theories in misinformation studies. These findings demonstrate the dual role of LLMs as both proxies for human-like biases and as auditors capable of tracing information fidelity. The proposed framework provides an interpretable, empirically grounded approach for studying, simulating, and mitigating misinformation diffusion in digital ecosystems.
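[NLP-26] Position: On the Methodological Pitfalls of Evaluating Base LLMs for Reasoning
【Quick read】: This paper addresses methodological flaws in how the reasoning abilities of base large language models (base LLMs) are currently evaluated. Its central concern is a fundamental mismatch between the pretraining objective of base LLMs (learning statistical patterns of language) and the normative standards, such as correctness, by which reasoning is assessed, so that their logical conclusions may be coincidental byproducts of statistical linguistic plausibility rather than evidence of genuine reasoning. The paper critically re-examines two assumptions implicit in existing work, namely that base LLM outputs are bona fide attempts at correct answers and that conclusions about them generalize to post-trained LLMs optimized for instruction following, and calls on future research to identify and avoid these methodological pitfalls.
Link: https://arxiv.org/abs/2511.10381
Authors: Jason Chan, Zhixue Zhao, Robert Gaizauskas
Affiliations: University of Sheffield
Categories: Computation and Language (cs.CL)
Notes: Preprint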
Abstract:Existing work investigates the reasoning capabilities of large language models (LLMs) to uncover their limitations, human-like biases and underlying processes. Such studies include evaluations of base LLMs (pre-trained on unlabeled corpora only) for this purpose. Our position paper argues that evaluating base LLMs’ reasoning capabilities raises inherent methodological concerns that are overlooked in such existing studies. We highlight the fundamental mismatch between base LLMs’ pretraining objective and normative qualities, such as correctness, by which reasoning is assessed. In particular, we show how base LLMs generate logically valid or invalid conclusions as coincidental byproducts of conforming to purely linguistic patterns of statistical plausibility. This fundamental mismatch challenges the assumptions that (a) base LLMs’ outputs can be assessed as their bona fide attempts at correct answers or conclusions; and (b) conclusions about base LLMs’ reasoning can generalize to post-trained LLMs optimized for successful instruction-following. We call for a critical re-examination of existing work that relies implicitly on these assumptions, and for future work to account for these methodological pitfalls.
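[NLP-27] TruthfulRAG: Resolving Factual-level Conflicts in Retrieval-Augmented Generation with Knowledge Graphs AAAI2026
【Quick read】: This paper addresses the loss of accuracy in Retrieval-Augmented Generation (RAG) systems caused by factual conflicts between retrieved external information and the internal knowledge of Large Language Models (LLMs). Existing methods mostly handle conflicts at the token or semantic level and struggle to capture knowledge discrepancies in full, especially in knowledge-intensive tasks. The solution is TruthfulRAG, a framework that uses Knowledge Graphs (KGs) to give retrieved content a structured representation and applies query-based graph retrieval and entropy-based filtering to locate and mitigate factual-level inconsistencies, making LLM outputs more faithful and reliable.
Link: https://arxiv.org/abs/2511.10375
Authors: Shuyi Liu, Yuming Shang, Xi Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Notes: 12 pages, 3 figures, accepted at AAAI 2026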
Abstract:Retrieval-Augmented Generation (RAG) has emerged as a powerful framework for enhancing the capabilities of Large Language Models (LLMs) by integrating retrieval-based methods with generative models. As external knowledge repositories continue to expand and the parametric knowledge within models becomes outdated, a critical challenge for RAG systems is resolving conflicts between retrieved external information and LLMs’ internal knowledge, which can significantly compromise the accuracy and reliability of generated content. However, existing approaches to conflict resolution typically operate at the token or semantic level, often leading to fragmented and partial understanding of factual discrepancies between LLMs’ knowledge and context, particularly in knowledge-intensive tasks. To address this limitation, we propose TruthfulRAG, the first framework that leverages Knowledge Graphs (KGs) to resolve factual-level knowledge conflicts in RAG systems. Specifically, TruthfulRAG constructs KGs by systematically extracting triples from retrieved content, utilizes query-based graph retrieval to identify relevant knowledge, and employs entropy-based filtering mechanisms to precisely locate conflicting elements and mitigate factual inconsistencies, thereby enabling LLMs to generate faithful and accurate responses. Extensive experiments reveal that TruthfulRAG outperforms existing methods, effectively alleviating knowledge conflicts and improving the robustness and trustworthiness of RAG systems.
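The abstract mentions entropy-based filtering but does not define it. As a rough illustration only, the Python sketch below computes the Shannon entropy of a model's answer distribution and keeps the retrieved knowledge paths whose inclusion shifts that entropy by more than a threshold. The function names, the threshold, the example distributions, and the idea of comparing entropy with and without a path are all assumptions made for this example, not details taken from the paper.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def filter_paths(base_probs, probs_with_path, threshold=0.5):
    """Keep paths whose inclusion changes answer entropy by more than `threshold` bits.

    base_probs: answer distribution without retrieved context (hypothetical input).
    probs_with_path: dict mapping a path name to the answer distribution
        obtained when that path is added to the prompt (hypothetical input).
    """
    base = shannon_entropy(base_probs)
    return [name for name, probs in probs_with_path.items()
            if abs(shannon_entropy(probs) - base) > threshold]

# Made-up distributions: path_a barely changes the model's certainty,
# path_b makes the model far less certain about its answer.
base = [0.9, 0.05, 0.05]
with_path = {"path_a": [0.88, 0.07, 0.05], "path_b": [0.34, 0.33, 0.33]}
print(filter_paths(base, with_path))
```

A large entropy shift is used here as a stand-in signal that a path disagrees with what the model already believes; a real system would need to obtain the distributions from the model itself.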
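[NLP-28] Knowledge Graphs Generation from Cultural Heritage Texts: Combining LLMs and Ontological Engineering for Scholarly Debates
【Quick read】: This paper addresses the difficulty of systematically querying the rich knowledge in Cultural Heritage (CH) texts, where the core challenge is converting unstructured discourse into structured Knowledge Graphs (KGs). The solution is ATR4CH (Adaptive Text-to-RDF for Cultural Heritage), a systematic five-step methodology that combines annotation models, ontological frameworks, and Large Language Model (LLM)-based knowledge extraction in an iterative development process running from foundational analysis to comprehensive evaluation. The method builds a sequential extraction pipeline with three LLMs (Claude Sonnet 3.7, Llama 3.3 70B, GPT-4o-mini), reaches F1 scores between 0.65 and 0.99 on metadata extraction, entity recognition, hypothesis extraction, and evidence extraction, and shows that smaller models are competitive on cost-effectiveness, giving cultural heritage institutions a replicable knowledge graph construction framework that adapts across domains.
Link: https://arxiv.org/abs/2511.10354
Authors: Andrea Schimmenti, Valentina Pasqual, Fabio Vitali, Marieke van Erp
Affiliations: Università degli Studi di Bologna; KNAW Humanities Cluster
Categories: Computation and Language (cs.CL)
Notes: 46 pages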
Abstract:Cultural Heritage texts contain rich knowledge that is difficult to query systematically due to the challenges of converting unstructured discourse into structured Knowledge Graphs (KGs). This paper introduces ATR4CH (Adaptive Text-to-RDF for Cultural Heritage), a systematic five-step methodology for Large Language Model-based Knowledge Extraction from Cultural Heritage documents. We validate the methodology through a case study on authenticity assessment debates. Methodology - ATR4CH combines annotation models, ontological frameworks, and LLM-based extraction through iterative development: foundational analysis, annotation schema development, pipeline architecture, integration refinement, and comprehensive evaluation. We demonstrate the approach using Wikipedia articles about disputed items (documents, artifacts…), implementing a sequential pipeline with three LLMs (Claude Sonnet 3.7, Llama 3.3 70B, GPT-4o-mini). Findings - The methodology successfully extracts complex Cultural Heritage knowledge: 0.96-0.99 F1 for metadata extraction, 0.7-0.8 F1 for entity recognition, 0.65-0.75 F1 for hypothesis extraction, 0.95-0.97 for evidence extraction, and 0.62 G-EVAL for discourse representation. Smaller models performed competitively, enabling cost-effective deployment. Originality - This is the first systematic methodology for coordinating LLM-based extraction with Cultural Heritage ontologies. ATR4CH provides a replicable framework adaptable across CH domains and institutional resources. Research Limitations - The produced KG is limited to Wikipedia articles. While the results are encouraging, human oversight is necessary during post-processing. Practical Implications - ATR4CH enables Cultural Heritage institutions to systematically convert textual knowledge into queryable KGs, supporting automated metadata enrichment and knowledge discovery.
[NLP-29] BhashaKritika: Building Synthetic Pretraining Data at Scale for Indic Languages
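【Quick read】: This paper addresses the scarcity of high-quality data for pretraining large language models (LLMs) on low-resource languages, Indic languages in particular, with the aim of narrowing the uneven distribution of generative AI benefits between mainstream and low-resource languages. The solution is BhashaKritika, a large-scale synthetic multilingual pretraining dataset of 540B tokens produced with five generation techniques, used to study systematically how grounding generation in documents, personas, and topics affects data quality and to compare translation from English with native generation in Indic languages. The paper also proposes a modular quality evaluation pipeline that integrates script and language detection, metadata consistency checks, n-gram repetition analysis, and perplexity-based filtering with KenLM models, enabling efficient and scalable quality control across scripts and linguistic contexts and offering a reusable methodology for building high-quality multilingual corpora.
Link: https://arxiv.org/abs/2511.10338
Authors: Guduru Manoj, Neel Prabhanjan Rachamalla, Ashish Kulkarni, Gautam Rajeev, Jay Piplodiya, Arul Menezes, Shaharukh Khan, Souvik Rana, Manya Sah, Chandra Khatri, Shubham Agarwal
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: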
Abstract:In the context of pretraining of Large Language Models (LLMs), synthetic data has emerged as an alternative for generating high-quality pretraining data at scale. This is particularly beneficial in low-resource language settings where the benefits of recent LLMs have been unevenly distributed across languages. In this work, we present a systematic study on the generation and evaluation of synthetic multilingual pretraining data for Indic languages, where we construct a large-scale synthetic dataset BhashaKritika, comprising 540B tokens using 5 different techniques for 10 languages. We explore the impact of grounding generation in documents, personas, and topics. We analyze how language choice, both in the prompt instructions and document grounding, affects data quality, and we compare translations of English content with native generation in Indic languages. To support scalable and language-sensitive evaluation, we introduce a modular quality evaluation pipeline that integrates script and language detection, metadata consistency checks, n-gram repetition analysis, and perplexity-based filtering using KenLM models. Our framework enables robust quality control across diverse scripts and linguistic contexts. Empirical results through model runs reveal key trade-offs in generation strategies and highlight best practices for constructing effective multilingual corpora.
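One stage of the quality pipeline named above is n-gram repetition analysis. The abstract does not say how it is computed, so the Python sketch below shows one common way to do it: measure the fraction of n-gram occurrences that belong to an n-gram seen more than once. The function name, the default n=3, and the whitespace tokenization are assumptions for illustration; Indic scripts would likely need a proper tokenizer.

```python
from collections import Counter

def repeated_ngram_fraction(text, n=3):
    """Fraction of n-gram occurrences whose n-gram appears more than once."""
    tokens = text.split()  # simplification: whitespace tokenization
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0  # text shorter than n tokens has no n-grams
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)

# A degenerate, looping generation scores high; varied text scores 0.
print(repeated_ngram_fraction("a b c a b c"))
print(repeated_ngram_fraction("a b c d e"))
```

A filter built on this would drop documents whose fraction exceeds a chosen cutoff; the cutoff is a tuning decision that the abstract does not specify.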
[NLP-30] Rectify Evaluation Preference: Improving LLMs' Critique on Math Reasoning via Perplexity-aware Reinforcement Learning AAAI2026
【Quick read】: This paper addresses the weak critiquing ability of Large Language Models (LLMs) on Multi-step Mathematical Reasoning (MsMR), in particular the wrong verdicts caused by an imbalanced evaluation preference across solutions. The key solution is a perplexity-aware reinforcement learning algorithm that introduces perplexity as a guiding signal within Group Relative Policy Optimization, correcting the tendency of LLMs to judge low-perplexity solutions as correct and thereby improving their ability to critique reasoning automatically.
Link: https://arxiv.org/abs/2511.10303
Authors: Changyuan Tian, Zhicong Lu, Shuang Qian, Nayu Liu, Peiguang Li, Li Jin, Leiyi Hu, Zhizhao Zeng, Sirui Wang, Ke Zeng, Zhi Guo
Affiliations: 1. Tsinghua University; 2. Institute for AI Industry Research, Tsinghua University; 3. Beijing Academy of Artificial Intelligence; 4. Alibaba Group; 5. Microsoft
Categories: Computation and Language (cs.CL)
Notes: Accepted by AAAI2026
Abstract:To improve Multi-step Mathematical Reasoning (MsMR) of Large Language Models (LLMs), it is crucial to obtain scalable supervision from the corpus by automatically critiquing mistakes in the reasoning process of MsMR and rendering a final verdict of the problem-solution. Most existing methods rely on crafting high-quality supervised fine-tuning demonstrations for critiquing capability enhancement and pay little attention to delving into the underlying reason for the poor critiquing performance of LLMs. In this paper, we orthogonally quantify and investigate the potential reason, imbalanced evaluation preference, and conduct a statistical preference analysis. Motivated by the analysis of the reason, a novel perplexity-aware reinforcement learning algorithm is proposed to rectify the evaluation preference, elevating the critiquing capability. Specifically, to probe into LLMs’ critiquing characteristics, a One-to-many Problem-Solution (OPS) benchmark is meticulously constructed to quantify the behavior difference of LLMs when evaluating the problem solutions generated by itself and others. Then, to investigate the behavior difference in depth, we conduct a statistical preference analysis oriented on perplexity and find an intriguing phenomenon, "LLMs incline to judge solutions with lower perplexity as correct", which is dubbed as imbalanced evaluation preference. To rectify this preference, we regard perplexity as the baton in the algorithm of Group Relative Policy Optimization, supporting the LLMs to explore trajectories that judge lower perplexity as wrong and higher perplexity as correct. Extensive experimental results on our built OPS and existing available critic benchmarks demonstrate the validity of our method.
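Perplexity is the quantity this analysis is built on. For readers who want the definition in concrete form, the Python sketch below computes it from per-token log-probabilities using the standard formula PPL = exp(-(1/N) * sum(log p_i)). The function name and the input format (a list of natural-log token probabilities for one solution) are assumptions for this example; how the paper feeds perplexity into the Group Relative Policy Optimization reward is not specified in the abstract and is not modeled here.

```python
import math

def perplexity(token_logprobs):
    """Perplexity of a sequence given its per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(-mean_logprob)

# Hypothetical log-probabilities for two candidate solutions.
fluent = [math.log(0.5)] * 4     # each token has probability 0.5
unusual = [math.log(0.125)] * 4  # each token has probability 0.125
print(perplexity(fluent), perplexity(unusual))
```

Under the preference described in the abstract, a critic would tend to favor the first solution simply because it is more predictable, independent of whether it is correct.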
[NLP-31] Local Hybrid Retrieval-Augmented Document QA ACL
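[Quick read]: This paper addresses the privacy-versus-performance trade-off that organizations face when handling sensitive documents: cloud-based AI systems offer powerful question answering but risk data leakage, while local deployment keeps data secure but is often less accurate. The key to the solution is a question-answering system that runs entirely locally. It fuses two complementary retrieval strategies, semantic understanding and exact keyword matching, and uses consumer-grade hardware acceleration to answer questions over legal, scientific, and conversational documents without internet access, reaching accuracy competitive with cloud systems while preserving data privacy.
Link: https://arxiv.org/abs/2511.10297
Authors: Paolo Astrino
Affiliations: Università Ca’ Foscari Venezia
Categories: Computation and Language (cs.CL)
Comments: 10 pages, 5 figures, 3 tables; conference-style (ACL format); fully local RAG system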
Abstract:Organizations handling sensitive documents face a critical dilemma: adopt cloud-based AI systems that offer powerful question-answering capabilities but compromise data privacy, or maintain local processing that ensures security but delivers poor accuracy. We present a question-answering system that resolves this trade-off by combining semantic understanding with keyword precision, operating entirely on local infrastructure without internet access. Our approach demonstrates that organizations can achieve competitive accuracy on complex queries across legal, scientific, and conversational documents while keeping all data on their machines. By balancing two complementary retrieval strategies and using consumer-grade hardware acceleration, the system delivers reliable answers with minimal errors, letting banks, hospitals, and law firms adopt conversational document AI without transmitting proprietary information to external providers. This work establishes that privacy and performance need not be mutually exclusive in enterprise AI deployment.
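The abstract speaks of balancing two complementary retrieval strategies but does not say how the scores are combined. One common way to do this is a weighted sum of normalized scores; the sketch below illustrates that idea only. The function names, the weight `alpha`, and the toy scores are all assumptions, not the author's implementation.

```python
def min_max(scores):
    """Rescale a dict of scores to [0, 1]; constant scores all map to 0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def hybrid_rank(semantic, keyword, alpha=0.5):
    """Rank documents by alpha * semantic + (1 - alpha) * keyword (both normalized)."""
    sem, kw = min_max(semantic), min_max(keyword)
    docs = set(sem) | set(kw)
    fused = {d: alpha * sem.get(d, 0.0) + (1 - alpha) * kw.get(d, 0.0) for d in docs}
    # Sort by descending fused score, breaking ties by document name.
    return sorted(fused, key=lambda d: (-fused[d], d))

# Hypothetical scores: cosine similarities and keyword-match scores.
semantic = {"contract.pdf": 0.82, "memo.txt": 0.79, "notes.md": 0.40}
keyword = {"contract.pdf": 3.1, "memo.txt": 9.4, "notes.md": 0.0}

print(hybrid_rank(semantic, keyword, alpha=0.5))
```

With `alpha=1.0` the ranking depends on the semantic scores alone, and with `alpha=0.0` on the keyword scores alone, so the weight is the knob that balances the two strategies.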
[NLP-32] OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in Large Language Models
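[Quick read]: This paper addresses the risk that Multimodal Large Language Models (MLLMs) generate unsafe content in real-world use, including toxic language, biased imagery, privacy violations, and harmful misinformation, while existing safety benchmarks remain markedly limited in modality coverage and risk dimensions. The key to the solution is OutSafe-Bench, presented as the first comprehensive content safety benchmark for the multimodal era. Its core contributions are: (1) a large multilingual dataset spanning four modalities, text, image, audio, and video (over 18,000 bilingual text prompts, 4,500 images, 450 audio clips, and 450 videos), systematically annotated across nine critical content risk categories; (2) the Multidimensional Cross Risk Score (MCRS), which quantifies overlap and correlation between risk categories; and (3) FairScore, an explainable automated multi-reviewer weighted aggregation framework that adaptively selects top-performing models as a jury, reducing single-model judgment bias and improving the fairness and reliability of evaluation. Experiments show that current mainstream MLLMs still have serious and widespread safety vulnerabilities and need stronger safeguards.
Link: https://arxiv.org/abs/2511.10287
Authors: Yuping Yan,Yuhan Xie,Yuanshuai Li,Yingchao Yu,Lingjuan Lyu,Yaochu Jin
Affiliations: Westlake University; Zhejiang University; Donghua University; Sony Research
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: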
Abstract:Since Multimodal Large Language Models (MLLMs) are increasingly being integrated into everyday tools and intelligent agents, growing concerns have arisen regarding their possible output of unsafe contents, ranging from toxic language and biased imagery to privacy violations and harmful misinformation. Current safety benchmarks remain highly limited in both modality coverage and performance evaluations, often neglecting the extensive landscape of content safety. In this work, we introduce OutSafe-Bench, the first most comprehensive content safety evaluation test suite designed for the multimodal era. OutSafe-Bench includes a large-scale dataset that spans four modalities, featuring over 18,000 bilingual (Chinese and English) text prompts, 4,500 images, 450 audio clips and 450 videos, all systematically annotated across nine critical content risk categories. In addition to the dataset, we introduce a Multidimensional Cross Risk Score (MCRS), a novel metric designed to model and assess overlapping and correlated content risks across different categories. To ensure fair and robust evaluation, we propose FairScore, an explainable automated multi-reviewer weighted aggregation framework. FairScore selects top-performing models as adaptive juries, thereby mitigating biases from single-model judgments and enhancing overall evaluation reliability. Our evaluation of nine state-of-the-art MLLMs reveals persistent and substantial safety vulnerabilities, underscoring the pressing need for robust safeguards in MLLMs.
[NLP-33] FactGuard: Event-Centric and Commonsense-Guided Fake News Detection AAAI2026
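[Quick read]: This paper addresses the declining effectiveness of fake news detection methods that rely on writing style: as fabricated content increasingly imitates the style of authentic news, style-based methods lose robustness. Meanwhile, although Large Language Models (LLMs) show great potential for fake news detection, their practical use is limited by shallow functionality exploration, ambiguous usability, and high inference cost. The key to the solution is FactGuard, a framework that uses LLMs to extract event-centric content, reducing the influence of writing style on detection, and introduces a dynamic usability mechanism that identifies contradictions and ambiguous cases in factual reasoning and adaptively incorporates LLM advice to improve decision reliability. Knowledge distillation yields a lightweight variant, FactGuard-D, for efficient deployment in cold-start and resource-constrained scenarios.
Link: https://arxiv.org/abs/2511.10281
Authors: Jing He,Han Zhang,Yuanhui Xiao,Wei Guo,Shaowen Yao,Renyang Liu
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted by AAAI 2026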
Abstract:Fake news detection methods based on writing style have achieved remarkable progress. However, as adversaries increasingly imitate the style of authentic news, the effectiveness of such approaches is gradually diminishing. Recent research has explored incorporating large language models (LLMs) to enhance fake news detection. Yet, despite their transformative potential, LLMs remain an untapped goldmine for fake news detection, with their real-world adoption hampered by shallow functionality exploration, ambiguous usability, and prohibitive inference costs. In this paper, we propose a novel fake news detection framework, dubbed FactGuard, that leverages LLMs to extract event-centric content, thereby reducing the impact of writing style on detection performance. Furthermore, our approach introduces a dynamic usability mechanism that identifies contradictions and ambiguous cases in factual reasoning, adaptively incorporating LLM advice to improve decision reliability. To ensure efficiency and practical deployment, we employ knowledge distillation to derive FactGuard-D, enabling the framework to operate effectively in cold-start and resource-constrained scenarios. Comprehensive experiments on two benchmark datasets demonstrate that our approach consistently outperforms existing methods in both robustness and accuracy, effectively addressing the challenges of style sensitivity and LLM usability in fake news detection.
[NLP-34] MTR-DuplexBench: Towards a Comprehensive Evaluation of Multi-Round Conversations for Full-Duplex Speech Language Models
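[Quick read]: This paper addresses the incomplete evaluation of Full-Duplex Speech Language Models (FD-SLMs): existing benchmarks focus on single-round interactions and basic conversational features, overlooking the complexity of multi-round dialogue and key dimensions such as instruction following and safety. The key to the solution is MTR-DuplexBench, a benchmark that segments continuous full-duplex dialogue into discrete turns. This enables fine-grained, turn-by-turn evaluation of FD-SLMs on dialogue quality, conversational dynamics, instruction following, and safety, and handles the challenges of blurred turn boundaries and context inconsistency during inference.
Link: https://arxiv.org/abs/2511.10262
Authors: He Zhang,Wenqian Cui,Haoning Xu,Xiaohui Li,Lei Zhu,Shaohua Ma,Irwin King
Affiliations: Tsinghua University; The Chinese University of Hong Kong; Huawei Technologies
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Work in progress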
Abstract:Full-Duplex Speech Language Models (FD-SLMs) enable real-time, overlapping conversational interactions, offering a more dynamic user experience compared to traditional half-duplex models. However, existing benchmarks primarily focus on evaluating single-round interactions and conversational features, neglecting the complexities of multi-round communication and critical capabilities such as instruction following and safety. Evaluating FD-SLMs in multi-round settings poses significant challenges, including blurred turn boundaries in communication and context inconsistency during model inference. To address these gaps, we introduce MTR-DuplexBench, a novel benchmark that segments continuous full-duplex dialogues into discrete turns, enabling comprehensive, turn-by-turn evaluation of FD-SLMs across dialogue quality, conversational dynamics, instruction following, and safety. Experimental results reveal that current FD-SLMs face difficulties in maintaining consistent performance across multiple rounds and evaluation dimensions, highlighting the necessity and effectiveness of our proposed benchmark. The benchmark and code will be available in the future.
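[NLP-35] ProgRAG: Hallucination-Resistant Progressive Retrieval and Reasoning over Knowledge Graphs
[Quick read]: This paper addresses reasoning failures of Large Language Models (LLMs) on knowledge-intensive tasks caused by hallucination and limited transparency. In multi-hop knowledge graph question answering (KGQA) in particular, existing KG-enhanced methods often underperform because of inaccurate retrieval, long contexts that obscure key information, or context constructions that fail to fit the logical structure of different question types. The key to the solution is the ProgRAG framework, which decomposes a complex question into sub-questions and progressively extends partial reasoning paths by answering each sub-question. At each step, external retrievers gather candidate evidence, which the LLM then prunes in an uncertainty-aware manner to improve evidence quality. Finally, the partial reasoning paths obtained from the sub-question answers are organized and rearranged into a clearer, more structured context for reliable LLM reasoning.
Link: https://arxiv.org/abs/2511.10240
Authors: Minbae Park,Hyemin Yang,Jeonghyun Kim,Kunsoo Park,Hyunjoon Kim
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: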
Abstract:Large Language Models (LLMs) demonstrate strong reasoning capabilities but struggle with hallucinations and limited transparency. Recently, KG-enhanced LLMs that integrate knowledge graphs (KGs) have been shown to improve reasoning performance, particularly for complex, knowledge-intensive tasks. However, these methods still face significant challenges, including inaccurate retrieval and reasoning failures, often exacerbated by long input contexts that obscure relevant information or by context constructions that struggle to capture the richer logical directions required by different question types. Furthermore, many of these approaches rely on LLMs to directly retrieve evidence from KGs, and to self-assess the sufficiency of this evidence, which often results in premature or incorrect reasoning. To address the retrieval and reasoning failures, we propose ProgRAG, a multi-hop knowledge graph question answering (KGQA) framework that decomposes complex questions into sub-questions, and progressively extends partial reasoning paths by answering each sub-question. At each step, external retrievers gather candidate evidence, which is then refined through uncertainty-aware pruning by the LLM. Finally, the context for LLM reasoning is optimized by organizing and rearranging the partial reasoning paths obtained from the sub-question answers. Experiments on three well-known datasets demonstrate that ProgRAG outperforms existing baselines in multi-hop KGQA, offering improved reliability and reasoning quality.
[NLP-36] VocalNet-M2: Advancing Low-Latency Spoken Language Modeling via Integrated Multi-Codebook Tokenization and Multi-Token Prediction
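[Quick read]: This paper addresses the considerable response latency of current end-to-end Spoken Language Models (SLMs), which stems from autoregressive generation of speech tokens and reliance on complex flow-matching models for speech synthesis. The key to the solution is VocalNet-M2, which integrates a multi-codebook tokenizer with a Multi-Token Prediction (MTP) strategy and generates multi-codebook speech tokens directly, removing the dependence on a high-latency flow-matching model; the MTP strategy also improves generation efficiency and overall performance. Experiments show that first-chunk latency drops from approximately 725 ms to 350 ms while performance stays competitive with mainstream SLMs.
Link: https://arxiv.org/abs/2511.10232
Authors: Yuhao Wang,Ziyang Cheng,Heyang Liu,Ronghua Wu,Qunshan Gu,Yanfeng Wang,Yu Wang
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments: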
Abstract:Current end-to-end spoken language models (SLMs) have made notable progress, yet they still encounter considerable response latency. This delay primarily arises from the autoregressive generation of speech tokens and the reliance on complex flow-matching models for speech synthesis. To overcome this, we introduce VocalNet-M2, a novel low-latency SLM that integrates a multi-codebook tokenizer and a multi-token prediction (MTP) strategy. Our model directly generates multi-codebook speech tokens, thus eliminating the need for a latency-inducing flow-matching model. Furthermore, our MTP strategy enhances generation efficiency and improves overall performance. Extensive experiments demonstrate that VocalNet-M2 achieves a substantial reduction in first chunk latency (from approximately 725ms to 350ms) while maintaining competitive performance across mainstream SLMs. This work also provides a comprehensive comparison of single-codebook and multi-codebook strategies, offering valuable insights for developing efficient and high-performance SLMs for real-time interactive applications.
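To put the quoted latency figures in proportion, the short sketch below computes the relative reduction from the two approximate numbers in the abstract. It also includes an idealized step count for multi-token prediction, under the assumption that each decoding step emits a fixed number of tokens; that helper and its parameter names are hypothetical and say nothing about the paper's actual decoding schedule.

```python
import math

# Approximate first-chunk latencies quoted in the abstract, in milliseconds.
before_ms, after_ms = 725, 350
reduction = (before_ms - after_ms) / before_ms
print(f"relative reduction: {reduction:.1%}")  # relative reduction: 51.7%

def decoding_steps(num_tokens, tokens_per_step):
    """Idealized number of forward passes when each step emits several tokens."""
    return math.ceil(num_tokens / tokens_per_step)

print(decoding_steps(100, 1))  # 100 steps with one token per step
print(decoding_steps(100, 4))  # 25 steps with four tokens per step
```

The quoted numbers correspond to roughly halving the time to the first speech chunk.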
[NLP-37] LangGPS: Language Separability Guided Data Pre-Selection for Joint Multilingual Instruction Tuning AAAI2026
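[Quick read]: This paper addresses the sensitivity of a model's multilingual capability to training data selection in multilingual instruction tuning: existing data selection methods often ignore the intrinsic structure of languages, so gains on low-resource languages and understanding tasks are limited. The key to the solution is LangGPS, a lightweight two-stage pre-selection framework guided by language separability: it first filters training samples by their separability scores, then refines the subset with existing selection methods. Experiments show that this improves the effectiveness and generalizability of multilingual training, especially for understanding tasks and low-resource languages. Highly separable samples help form clear language boundaries and speed up adaptation, low-separability samples promote cross-lingual alignment, and language separability can also serve as an effective signal for multilingual curriculum learning.
Link: https://arxiv.org/abs/2511.10229
Authors: Yangfan Ye,Xiaocheng Feng,Xiachong Feng,Lei Huang,Weitao Ma,Qichen Hong,Yunfei Lu,Duyu Tang,Dandan Tu,Bing Qin
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: AAAI2026 Main Track Accepted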
Abstract:Joint multilingual instruction tuning is a widely adopted approach to improve the multilingual instruction-following ability and downstream performance of large language models (LLMs), but the resulting multilingual capability remains highly sensitive to the composition and selection of the training data. Existing selection methods, often based on features like text quality, diversity, or task relevance, typically overlook the intrinsic linguistic structure of multilingual data. In this paper, we propose LangGPS, a lightweight two-stage pre-selection framework guided by language separability which quantifies how well samples in different languages can be distinguished in the model’s representation space. LangGPS first filters training data based on separability scores and then refines the subset using existing selection methods. Extensive experiments across six benchmarks and 22 languages demonstrate that applying LangGPS on top of existing selection methods improves their effectiveness and generalizability in multilingual training, especially for understanding tasks and low-resource languages. Further analysis reveals that highly separable samples facilitate the formation of clearer language boundaries and support faster adaptation, while low-separability samples tend to function as bridges for cross-lingual alignment. Besides, we also find that language separability can serve as an effective signal for multilingual curriculum learning, where interleaving samples with diverse separability levels yields stable and generalizable gains. Together, we hope our work offers a new perspective on data utility in multilingual contexts and supports the development of more linguistically informed LLMs.
[NLP-38] Persona-Aware Alignment Framework for Personalized Dialogue Generation
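[Quick read]: This paper addresses the insufficient attention that current personalized dialogue generation models pay to the given persona during training: mainstream methods rely on token-level language modeling (such as Next Token Prediction), so models tend to produce generic responses and neglect specific persona information. The key to the solution is the Persona-Aware Alignment Framework (PAL), which makes persona alignment an explicit training objective. PAL uses a two-stage training strategy, Persona-aware Learning followed by Persona Alignment, together with a Select then Generate inference strategy, to improve persona sensitivity and the persona relevance of responses at the semantic level.
Link: https://arxiv.org/abs/2511.10215
Authors: Guanrong Li,Xinyu Liu,Zhen Wu,Xinyu Dai
Affiliations: National Key Laboratory for Novel Software Technology, Nanjing University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Pre-MIT Press publication version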
Abstract:Personalized dialogue generation aims to leverage persona profiles and dialogue history to generate persona-relevant and consistent responses. Mainstream models typically rely on token-level language model training with persona dialogue data, such as Next Token Prediction, to implicitly achieve personalization, making these methods tend to neglect the given personas and generate generic responses. To address this issue, we propose a novel Persona-Aware Alignment Framework (PAL), which directly treats persona alignment as the training objective of dialogue generation. Specifically, PAL employs a two-stage training method including Persona-aware Learning and Persona Alignment, equipped with an easy-to-use inference strategy Select then Generate, to improve persona sensitivity and generate more persona-relevant responses at the semantics level. Through extensive experiments, we demonstrate that our framework outperforms many state-of-the-art personalized dialogue methods and large language models.
[NLP-39] EffiReason-Bench: A Unified Benchmark for Evaluating and Advancing Efficient Reasoning in Large Language Models
[Quick read]: This paper addresses two problems: Large Language Models (LLMs) prompted with Chain-of-Thought (CoT) produce unnecessarily long explanations, which raises computational cost and sometimes lowers accuracy; and fragmented evaluation practices make efficiency-oriented reasoning methods hard to compare fairly. The key to the solution is EffiReason-Bench, a unified benchmark for cross-paradigm evaluation of efficient reasoning methods in three categories: Reasoning Blueprints, Dynamic Execution, and Post-hoc Refinement. To enable step-level evaluation, the authors build CoT annotations for CommonsenseQA and LogiQA with standardized structure, comprehensive option-wise analysis, and human verification. They also introduce the E3-Score, a metric based on economic trade-off modeling that provides smooth, stable evaluation without heavy reliance on heuristics.
Link: https://arxiv.org/abs/2511.10201
Authors: Junquan Huang,Haotian Wu,Yubo Gao,Yibo Yan,Junyan Zhang,Yonghua Hei,Song Dai,Jie Zhang,Puay Siew Tan,Xuming Hu
Affiliations: Hong Kong University of Science and Technology (Guangzhou); Nanyang Technological University; The Hong Kong University of Science and Technology; Singapore Institute of Manufacturing Technology, A*STAR
Categories: Computation and Language (cs.CL)
Comments: 11 pages, 4 figures, 4 tables. Appendix included
Abstract:Large language models (LLMs) with Chain-of-Thought (CoT) prompting achieve strong reasoning but often produce unnecessarily long explanations, increasing cost and sometimes reducing accuracy. Fair comparison of efficiency-oriented approaches is hindered by fragmented evaluation practices. We introduce EffiReason-Bench, a unified benchmark for rigorous cross-paradigm evaluation of efficient reasoning methods across three categories: Reasoning Blueprints, Dynamic Execution, and Post-hoc Refinement. To enable step-by-step evaluation, we construct verified CoT annotations for CommonsenseQA and LogiQA via a pipeline that enforces standardized reasoning structures, comprehensive option-wise analysis, and human verification. We evaluate 7 methods across 6 open-source LLMs (1B-70B) on 4 datasets spanning mathematics, commonsense, and logic, and propose the E3-Score, a principled metric inspired by economic trade-off modeling that provides smooth, stable evaluation without discontinuities or heavy reliance on heuristics. Experiments show that no single method universally dominates; optimal strategies depend on backbone scale, task complexity, and architecture.
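[NLP-40] Text2SQL-Flow: A Robust SQL-Aware Data Augmentation Framework for Text-to-SQL
[Quick read]: This paper addresses the performance ceiling in Text-to-SQL caused by scarce, simplistic, and low-diversity data. The key to the solution is Text2SQL-Flow, a SQL-aware data augmentation framework that generates large-scale, semantically valid, and structurally diverse Text-to-SQL pairs from a small amount of seed data. Its core elements are augmentation along six dimensions, an end-to-end pipeline (SQL execution verification, natural language question generation, chain-of-thought reasoning traces, and data classification), and a modular Database Manager for cross-database compatibility and scalability. SQLFlow, the dataset built with this framework (89,544 annotated examples), improves both open-source and closed-source large language models (LLMs); for closed-source models, a masked alignment retrieval method enables structure-aware, fine-grained example matching.
Link: https://arxiv.org/abs/2511.10192
Authors: Qifeng Cai,Hao Liang,Chang Xu,Tao Xie,Wentao Zhang,Bin Cui
Affiliations: unknown
Categories: Computation and Language (cs.CL); Databases (cs.DB)
Comments: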
Abstract:The data-centric paradigm has become pivotal in AI, especially for Text-to-SQL, where performance is limited by scarce, simplistic, and low-diversity datasets. To address this, we propose Text2SQL-Flow, a SQL-aware data augmentation framework that generates large-scale, semantically valid, and structurally diverse Text-to-SQL pairs from minimal seed data. It operates across six augmentation dimensions and integrates an end-to-end pipeline featuring SQL execution verification, natural language question generation, chain-of-thought reasoning traces, and data classification. A modular Database Manager ensures cross-database compatibility and scalability. Using this framework, we build SQLFlow, a high-quality dataset of 89,544 annotated examples. We evaluate SQLFlow in two settings: (1) For open-source LLMs, fine-tuning on SQLFlow consistently improves performance across benchmarks under the same data budget. (2) For closed-source LLMs, we introduce a masked alignment retrieval method that treats SQLFlow as both knowledge base and training data for the retriever. This enables structure-aware example matching by modeling fine-grained alignments between questions and SQL queries. Experiments show our retrieval strategy outperforms existing methods, underscoring the value of SQLFlow’s high-fidelity data and our novel technique. Our work establishes a scalable, data-centric foundation for advancing Text-to-SQL systems and highlights the critical role of high-quality structured data in modern AI.
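One pipeline stage named above is SQL execution verification. A minimal sketch of what such a check might look like is given below, assuming an in-memory SQLite database; the function name, the toy schema, and the candidate queries are hypothetical, and the framework's own Database Manager may work quite differently.

```python
import sqlite3

def executes(sql, schema):
    """Return True if `sql` runs without error on an in-memory database built from `schema`."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema)     # create the tables
        conn.execute(sql).fetchall()   # run the candidate query
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()

# Hypothetical schema and augmented queries.
schema = "CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);"
candidates = [
    "SELECT name FROM singer WHERE age > 30",
    "SELECT nme FROM singer",  # misspelled column, so it is rejected
]

kept = [query for query in candidates if executes(query, schema)]
print(kept)
```

A check like this only confirms that a query is executable against the schema; it does not confirm that the query matches the intent of its natural language question.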
[NLP-41] Beyond the Black Box: Demystifying Multi-Turn LLM Reasoning with VISTA
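[Quick read]: This paper addresses the difficulty of analyzing the reasoning process of Large Language Models (LLMs) in multi-turn interactions; the main obstacles are complex contextual dependencies and a lack of specialized visualization tools, which place a high cognitive load on researchers. The key to the solution is VISTA, a web-based visual interactive system for textual analytics. It automatically parses a conversation session and generates a reasoning dependency tree that exposes the model's step-by-step logical path; it lets users interactively modify the conversation history to run "what-if" analyses across models; and its unified, interactive framework reduces the complexity of analyzing reasoning chains, supporting a better understanding of LLM capabilities and limitations.
Link: https://arxiv.org/abs/2511.10182
Authors: Yiran Zhang,Mingyang Lin,Mark Dras,Usman Naseem
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: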
Abstract:Recent research has increasingly focused on the reasoning capabilities of Large Language Models (LLMs) in multi-turn interactions, as these scenarios more closely mirror real-world problem-solving. However, analyzing the intricate reasoning processes within these interactions presents a significant challenge due to complex contextual dependencies and a lack of specialized visualization tools, leading to a high cognitive load for researchers. To address this gap, we present VISTA, a web-based Visual Interactive System for Textual Analytics in multi-turn reasoning tasks. VISTA allows users to visualize the influence of context on model decisions and interactively modify conversation histories to conduct “what-if” analyses across different models. Furthermore, the platform can automatically parse a session and generate a reasoning dependency tree, offering a transparent view of the model’s step-by-step logical path. By providing a unified and interactive framework, VISTA significantly reduces the complexity of analyzing reasoning chains, thereby facilitating a deeper understanding of the capabilities and limitations of current LLMs. The platform is open-source and supports easy integration of custom benchmarks and local models.
[NLP-42] Generalizing to Unseen Disaster Events: A Causal View AACL2025
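[Quick read]: This paper addresses the poor generalization of models used for disaster monitoring on social media, which is caused by data bias, in particular event-related and domain-related biases that markedly weaken a model's ability to adapt to emerging disaster events. The key to the solution is to view bias mitigation through a causal lens and reduce these biases, improving the robustness and generalization of a pre-trained language model (PLM) classifier across multiple disaster classification tasks. The method outperforms several baselines by up to 1.9% F1.
Link: https://arxiv.org/abs/2511.10120
Authors: Philipp Seeberger,Steffen Freisinger,Tobias Bocklet,Korbinian Riedhammer
Affiliations: Technische Hochschule Nürnberg Georg Simon Ohm
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted to Findings of AACL 2025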
Abstract:Due to the rapid growth of social media platforms, these tools have become essential for monitoring information during ongoing disaster events. However, extracting valuable insights requires real-time processing of vast amounts of data. A major challenge in existing systems is their exposure to event-related biases, which negatively affects their ability to generalize to emerging events. While recent advancements in debiasing and causal learning offer promising solutions, they remain underexplored in the disaster event domain. In this work, we approach bias mitigation through a causal lens and propose a method to reduce event- and domain-related biases, enhancing generalization to future events. Our approach outperforms multiple baselines by up to +1.9% F1 and significantly improves a PLM-based classifier across three disaster classification tasks.
[NLP-43] On the Military Applications of Large Language Models
Link: https://arxiv.org/abs/2511.10093
Authors: Satu Johansson,Taneli Riihonen
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Published in the meeting proceedings of the 2025 International Conference on Military Communications and Information Systems (ICMCIS) by the NATO Science and Technology Organization (STO)
[NLP-44] ELYADATA LIA at NADI 2025: ASR and ADI Subtasks EMNLP2025
Link: https://arxiv.org/abs/2511.10090
Authors: Haroun Elleuch,Youssef Saidi,Salima Mdhaffar,Yannick Estève,Fethi Bougares
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: Published in Proceedings of the ArabicNLP 2025 Workshop (co-located with EMNLP 2025), Association for Computational Linguistics, 2025
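[NLP-45] Format Matters: The Robustness of Multimodal LLMs in Reviewing Evidence from Tables and Charts AAAI2026
[Quick read]: This paper addresses the weak ability of current multimodal large language models (multimodal LLMs) to understand chart-based evidence when verifying scientific claims, which limits their reliable use in systems that assist peer review. The key to the solution is a dataset adapted for multimodal claim verification, built by adding annotations and structure to existing datasets of scientific papers, so that model performance on table evidence and chart evidence can be compared systematically. Experiments show that current multimodal LLMs do better with table-based evidence but fall well short of human performance on charts, and that smaller models (under 8B parameters) show weak correlation between performance on the two formats, revealing a key gap in multimodal reasoning.
Link: https://arxiv.org/abs/2511.10075
Authors: Xanh Ho,Yun-Ang Wu,Sunisth Kumar,Florian Boudin,Atsuhiro Takasu,Akiko Aizawa
Affiliations: 1. National Institute of Information and Communications Technology; 2. University of California, Berkeley; 3. Nara Institute of Science and Technology; 4. Institut national de recherche en informatique et en automatique
Categories: Computation and Language (cs.CL)
Comments: Accepted at AAAI 2026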
Abstract:With the growing number of submitted scientific papers, there is an increasing demand for systems that can assist reviewers in evaluating research claims. Experimental results are a core component of scientific work, often presented in varying formats such as tables or charts. Understanding how robust current multimodal large language models (multimodal LLMs) are at verifying scientific claims across different evidence formats remains an important and underexplored challenge. In this paper, we design and conduct a series of experiments to assess the ability of multimodal LLMs to verify scientific claims using both tables and charts as evidence. To enable this evaluation, we adapt two existing datasets of scientific papers by incorporating annotations and structures necessary for a multimodal claim verification task. Using this adapted dataset, we evaluate 12 multimodal LLMs and find that current models perform better with table-based evidence while struggling with chart-based evidence. We further conduct human evaluations and observe that humans maintain strong performance across both formats, unlike the models. Our analysis also reveals that smaller multimodal LLMs (under 8B) show weak correlation in performance between table-based and chart-based tasks, indicating limited cross-modal generalization. These findings highlight a critical gap in current models’ multimodal reasoning capabilities. We suggest that future multimodal LLMs should place greater emphasis on improving chart understanding to better support scientific claim verification.
[NLP-46] ADI-20: Arabic Dialect Identification dataset and models INTERSPEECH2025
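[Quick read]: This paper addresses limited accuracy in Arabic Dialect Identification (ADI) caused by incomplete data coverage and constrained model performance. The key to the solution is ADI-20, a large extended dataset with broad dialect coverage: 3,556 hours of speech from 19 Arabic dialects plus Modern Standard Arabic (MSA). On this dataset the authors train and evaluate several state-of-the-art models, including fine-tuned pre-trained ECAPA-TDNN models and an architecture that couples Whisper encoder blocks with an attention pooling layer and a dense classification layer. Experiments show only a small drop in F1 score when using just 30% of the training data; the authors also examine how data size and parameter count affect identification performance, and release the data and models as a reproducible benchmark.
Link: https://arxiv.org/abs/2511.10070
Authors: Haroun Elleuch,Salima Mdhaffar,Yannick Estève,Fethi Bougares
Affiliations: Elyadata; University of Avignon
Categories: Computation and Language (cs.CL)
Comments: Published in Interspeech 2025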
Abstract:We present ADI-20, an extension of the previously published ADI-17 Arabic Dialect Identification (ADI) dataset. ADI-20 covers all Arabic-speaking countries’ dialects. It comprises 3,556 hours from 19 Arabic dialects in addition to Modern Standard Arabic (MSA). We used this dataset to train and evaluate various state-of-the-art ADI systems. We explored fine-tuning pre-trained ECAPA-TDNN-based models, as well as Whisper encoder blocks coupled with an attention pooling layer and a classification dense layer. We investigated the effect of (i) training data size and (ii) the model’s number of parameters on identification performance. Our results show a small decrease in F1 score while using only 30% of the original training data. We open-source our collected data and trained models to enable the reproduction of our work, as well as support further research in ADI.
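The abstract mentions an attention pooling layer followed by a classification dense layer on top of the encoder. The sketch below shows the general shape of that idea in plain Python: frame vectors are averaged with softmax weights, then passed through one linear layer. The function names, the scoring vector `query` (learned in a real model, fixed here), and the toy numbers are assumptions, not the authors' architecture.

```python
import math

def attention_pool(frames, query):
    """Collapse T frame vectors into one vector using softmax attention weights."""
    scores = [sum(q * x for q, x in zip(query, frame)) for frame in frames]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frames[0])
    pooled = [sum(w * frame[i] for w, frame in zip(weights, frames)) for i in range(dim)]
    return pooled, weights

def dense(vector, weight_rows, bias):
    """One fully connected layer producing one logit per class."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weight_rows, bias)]

# Toy encoder outputs: T = 3 frames, D = 2 features.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.0, 0.0]  # a zero query gives every frame the same weight

pooled, weights = attention_pool(frames, query)
logits = dense(pooled, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
print(weights, pooled, logits)
```

Pooling turns a variable-length utterance into a fixed-size vector, which is what lets a single dense layer classify recordings of any duration.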
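[NLP-47] Enhancing the Medical Context-Awareness Ability of LLMs via Multifaceted Self-Refinement Learning
[Quick read]: This paper addresses the underperformance of Large Language Models (LLMs) in real-world medical scenarios. The core bottleneck is insufficient context-awareness: models struggle to recognize missing or critical details (such as user identity, medical history, and risk factors) and to give safe, helpful, and contextually appropriate responses. The key to the solution is Multifaceted Self-Refinement (MuSeR), which applies self-evaluation and iterative refinement along three facets: decision-making, communication, and safety. An attribute-conditioned query generator first simulates diverse real-world user contexts; the model then self-evaluates its answers along the three facets and refines them; finally, the resulting data is used for supervised fine-tuning to strengthen context-awareness. On the HealthBench benchmark the method yields significant gains, especially on the context-awareness axis, and with knowledge distillation a smaller model (e.g., Qwen3-32B) surpasses its teacher, setting a new state of the art (SOTA) among open-source LLMs on this dataset.
Link: https://arxiv.org/abs/2511.10067
Authors: Yuxuan Zhou,Yubin Wang,Bin Wang,Chen Ning,Xien Liu,Ji Wu,Jianye Hao
Affiliations: Tsinghua University; Huawei Noah’s Ark Lab
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 20 pages, 13 figures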
Abstract:Large language models (LLMs) have shown great promise in the medical domain, achieving strong performance on several benchmarks. However, they continue to underperform in real-world medical scenarios, which often demand stronger context-awareness, i.e., the ability to recognize missing or critical details (e.g., user identity, medical history, risk factors) and provide safe, helpful, and contextually appropriate responses. To address this issue, we propose Multifaceted Self-Refinement (MuSeR), a data-driven approach that enhances LLMs’ context-awareness along three key facets (decision-making, communication, and safety) through self-evaluation and refinement. Specifically, we first design an attribute-conditioned query generator that simulates diverse real-world user contexts by varying attributes such as role, geographic region, intent, and degree of information ambiguity. An LLM then responds to these queries, self-evaluates its answers along three key facets, and refines its responses to better align with the requirements of each facet. Finally, the queries and refined responses are used for supervised fine-tuning to reinforce the model’s context-awareness ability. Evaluation results on the latest HealthBench dataset demonstrate that our method significantly improves LLM performance across multiple aspects, with particularly notable gains in the context-awareness axis. Furthermore, by incorporating knowledge distillation with the proposed method, the performance of a smaller backbone LLM (e.g., Qwen3-32B) surpasses its teacher model, achieving a new SOTA across all open-source LLMs on HealthBench (63.8%) and its hard subset (43.1%). Code and dataset will be released at this https URL.
[NLP-48] GraphIF: Enhancing Multi-Turn Instruction Following for Large Language Models with Relation Graph Prompt
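[Quick read]: This paper addresses the difficulty that current Large Language Models (LLMs) have in following instructions across turns in multi-turn dialogue, especially under complex long-distance constraints. Existing methods mainly fine-tune on large-scale multi-turn dialogue data but treat each response as an isolated task and do not explicitly build multi-turn instruction following into the optimization objective. The authors propose GraphIF, a framework that models a multi-turn dialogue as a directed relation graph and uses graph prompts to strengthen instruction following. Its components are: an agent-based relation extraction module that builds structured graphs via action-triggered mechanisms; a relation graph prompt generation module that converts graph information into natural language prompts; and a response rewriting module that refines the initial output using the generated graph prompts. Experiments show that GraphIF integrates seamlessly into instruction-tuned LLMs and significantly improves all four multi-turn instruction-following evaluation metrics.
Link: https://arxiv.org/abs/2511.10051
Authors: Zhenhe Li,Can Lin,Ling Zheng,Wen-Da Wei,Junli Liang,Qi Song
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: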
Abstract:Multi-turn instruction following is essential for building intelligent conversational systems that can consistently adhere to instructions across dialogue turns. However, existing approaches to enhancing multi-turn instruction following primarily rely on collecting or generating large-scale multi-turn dialogue datasets to fine-tune large language models (LLMs), which treat each response generation as an isolated task and fail to explicitly incorporate multi-turn instruction following into the optimization objectives. As a result, instruction-tuned LLMs often struggle with complex long-distance constraints. In multi-turn dialogues, relational constraints across turns can be naturally modeled as labeled directed edges, making graph structures particularly suitable for modeling multi-turn instruction following. Despite this potential, leveraging graph structures to enhance the multi-turn instruction following capabilities of LLMs remains unexplored. To bridge this gap, we propose GraphIF, a plug-and-play framework that models multi-turn dialogues as directed relation graphs and leverages graph prompts to enhance the instruction following capabilities of LLMs. GraphIF comprises three key components: (1) an agent-based relation extraction module that captures inter-turn semantic relations via action-triggered mechanisms to construct structured graphs; (2) a relation graph prompt generation module that converts structured graph information into natural language prompts; and (3) a response rewriting module that refines initial LLM outputs using the generated graph prompts. Extensive experiments on two long multi-turn dialogue datasets demonstrate that GraphIF can be seamlessly integrated into instruction-tuned LLMs and leads to significant improvements across all four multi-turn instruction-following evaluation metrics.
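The abstract describes cross-turn constraints as labeled directed edges that are then converted into natural language prompts. A minimal sketch of that data structure and conversion is shown below; the `Edge` type, the relation labels, and the prompt wording are all hypothetical, since the paper's actual relation set and prompt templates are not given here.

```python
from collections import namedtuple

# A labeled directed edge: source_turn --relation--> target_turn.
Edge = namedtuple("Edge", ["source_turn", "relation", "target_turn"])

def graph_to_prompt(edges):
    """Render labeled directed edges between dialogue turns as plain-text hints."""
    lines = ["Constraints carried over from earlier turns:"]
    for edge in sorted(edges, key=lambda e: (e.source_turn, e.target_turn)):
        lines.append(f"- Turn {edge.source_turn} {edge.relation} turn {edge.target_turn}.")
    return "\n".join(lines)

# Hypothetical relations extracted from a three-turn dialogue.
edges = [
    Edge(3, "must keep the format set in", 1),
    Edge(2, "revises the request in", 1),
]

prompt = graph_to_prompt(edges)
print(prompt)
```

In a pipeline like the one described, text of this kind would accompany the initial response when it is sent back to the model for rewriting.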
[NLP-49] Do Language Models Associate Sound with Meaning? A Multimodal Study of Sound Symbolism
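[Quick Read]: This paper investigates how Multimodal Large Language Models (MLLMs) handle sound symbolism (phonetic iconicity), the non-arbitrary association between sound and meaning in human language, focusing on their performance across textual (orthographic and IPA) and auditory inputs and on their internal information processing. The key to the approach is LEX-ICON, a large-scale cross-lingual mimetic word dataset containing 8,052 natural-language words and 2,930 systematically constructed pseudo-words, annotated on up to 25 semantic dimensions (e.g., sharp vs. round). Layer-wise phoneme-level attention fraction scores are then measured to reveal the models' sensitivity to phonetic iconicity and their attention patterns. The work offers the first large-scale, quantitative analysis of how MLLMs process sound-meaning associations and a new paradigm for research at the intersection of AI and cognitive linguistics.
Link: https://arxiv.org/abs/2511.10045
Authors: Jinhong Jeong, Sunghyun Lee, Jaeyoung Lee, Seonah Han, Youngjae Yu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 33 pages, 27 tables, 10 figures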
Abstract:Sound symbolism is a linguistic concept that refers to non-arbitrary associations between phonetic forms and their meanings. We suggest that this can be a compelling probe into how Multimodal Large Language Models (MLLMs) interpret auditory information in human languages. We investigate MLLMs’ performance on phonetic iconicity across textual (orthographic and IPA) and auditory forms of inputs with up to 25 semantic dimensions (e.g., sharp vs. round), observing models’ layer-wise information processing by measuring phoneme-level attention fraction scores. To this end, we present LEX-ICON, an extensive mimetic word dataset consisting of 8,052 words from four natural languages (English, French, Japanese, and Korean) and 2,930 systematically constructed pseudo-words, annotated with semantic features applied across both text and audio modalities. Our key findings demonstrate (1) MLLMs’ phonetic intuitions that align with existing linguistic research across multiple semantic dimensions and (2) phonosemantic attention patterns that highlight models’ focus on iconic phonemes. These results bridge domains of artificial intelligence and cognitive linguistics, providing the first large-scale, quantitative analyses of phonetic iconicity in terms of MLLMs’ interpretability.
[NLP-50] ScaleFormer: Span Representation Cumulation for Long-Context Transformer SIGIR
[Quick Read]: This paper addresses the quadratic computational complexity of standard self-attention in Transformer models, which limits their use on long-text tasks. The key to the solution is the ScaleFormer framework, which splits a long input into overlapping chunks and uses a parameter-free fusion mechanism to enrich each chunk's boundary representations with cumulative context vectors, giving the model awareness of each chunk's position within the document structure. The strategy achieves linear time complexity and lets pre-trained encoder-decoder models handle long-text reasoning without architectural modifications.
Link: https://arxiv.org/abs/2511.10029
Authors: Jiangshu Du, Wenpeng Yin, Philip Yu
Affiliations: University of Illinois Chicago; The Pennsylvania State University
Categories: Computation and Language (cs.CL)
Comments: Accepted by SIGIR-AP’25
Abstract:The quadratic complexity of standard self-attention severely limits the application of Transformer-based models to long-context tasks. While efficient Transformer variants exist, they often require architectural changes and costly pre-training from scratch. To circumvent this, we propose ScaleFormer (Span Representation Cumulation for Long-Context Transformer), a simple and effective plug-and-play framework that adapts off-the-shelf pre-trained encoder-decoder models to process long sequences without requiring architectural modifications. Our approach segments long inputs into overlapping chunks and generates a compressed, context-aware representation for the decoder. The core of our method is a novel, parameter-free fusion mechanism that endows each chunk’s representation with structural awareness of its position within the document. It achieves this by enriching each chunk’s boundary representations with cumulative context vectors from all preceding and succeeding chunks. This strategy provides the model with a strong signal of the document’s narrative flow, achieves linear complexity, and enables pre-trained models to reason effectively over long-form text. Experiments on long-document summarization show that our method is highly competitive with and often outperforms state-of-the-art approaches without requiring architectural modifications or external retrieval mechanisms.
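The two ingredients named in the abstract, overlapping chunks and cumulative context from preceding and succeeding chunks, can be sketched in a few lines. This is only an illustration: the abstract does not specify how chunks are summarized or how the cumulative vectors are fused into boundary representations, so the plain running sums and the function names below are assumptions.

```python
def overlapping_chunks(tokens, size, overlap):
    """Split `tokens` into windows of `size` that share `overlap` tokens (size > overlap)."""
    step = size - overlap
    last_start = max(len(tokens) - overlap, 1)
    return [tokens[i:i + size] for i in range(0, last_start, step)]


def cumulative_context(chunk_vectors):
    """For each chunk, sum the vectors of all preceding and all succeeding chunks.

    Both passes are single sweeps, so the cost is linear in the number of chunks
    and no learned parameters are involved.
    """
    n, dim = len(chunk_vectors), len(chunk_vectors[0])
    before = [[0.0] * dim for _ in range(n)]
    after = [[0.0] * dim for _ in range(n)]
    for i in range(1, n):
        before[i] = [b + v for b, v in zip(before[i - 1], chunk_vectors[i - 1])]
    for i in range(n - 2, -1, -1):
        after[i] = [a + v for a, v in zip(after[i + 1], chunk_vectors[i + 1])]
    return before, after


chunks = overlapping_chunks(list(range(10)), size=4, overlap=2)
before, after = cumulative_context([[1, 0], [0, 1], [1, 1]])
print(chunks)
print(before, after)
```

The `before` and `after` vectors stand in for the "cumulative context vectors" that would be combined with the first and last positions of each chunk.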
[NLP-51] PustakAI: Curriculum-Aligned and Interactive Textbooks Using Large Language Models
[Quick Read]: This paper addresses the shortfalls in accuracy, curriculum alignment, and pedagogical relevance that Large Language Models (LLMs) face when deployed in education, focusing on how to adapt LLMs to the curriculum of India's National Council of Educational Research and Training (NCERT) for English and Science in grades 6 to 8. The key to the solution is the "PustakAI" framework for designing and evaluating a new NCERT-aligned question-answering dataset, "NCERT-QA", whose QA pairs are classified as Factoid, Inferential, and Others (evaluative and reasoning). The paper systematically tests how different prompting techniques (meta-prompt, few-shot, and chain-of-thought prompting) affect model performance, in order to identify the approach best suited to the curriculum's structure and learning objectives. It also compares several open-source and high-end LLMs in the educational setting, providing empirical evidence for practical deployment.
Link: https://arxiv.org/abs/2511.10002
Authors: Shivam Sharma(1), Riya Naik(1), Tejas Gawas(1), Heramb Patil(1), Kunal Korgaonkar(1) ((1) CSIS Department, BITS Pilani K K Birla Goa Campus, India)
Affiliations: CSIS Department, BITS Pilani K K Birla Goa Campus, India
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in understanding and generating human-like content. This has revolutionized various sectors such as healthcare, software development, and education. In education, LLMs offer potential for personalized and interactive learning experiences, especially in regions with limited teaching resources. However, adapting these models effectively to curriculum-specific content, such as the National Council of Educational Research and Training (NCERT) syllabus in India, presents unique challenges in terms of accuracy, alignment, and pedagogical relevance. In this paper, we present the framework “PustakAI” (Pustak means “book” in many Indian languages) for the design and evaluation of a novel question-answering dataset “NCERT-QA” aligned with the NCERT curriculum for English and Science subjects of grades 6 to 8. We classify the curated QA pairs as Factoid, Inferential, and Others (evaluative and reasoning). We evaluate the dataset with various prompting techniques, such as meta-prompt, few-shot, and CoT-style prompting, using diverse evaluation metrics to understand which approach aligns more efficiently with the structure and demands of the curriculum. Along with the usability of the dataset, we analyze the strengths and limitations of current open-source LLMs (Gemma3:1b, Llama3.2:3b, and Nemotron-mini:4b) and high-end LLMs (Llama-4-Scout-17B and Deepseek-r1-70B) as AI-based learning tools in formal education systems.
[NLP-52] FinNuE: Exposing the Risks of Using BERTScore for Numerical Semantic Evaluation in Finance CIKM2025
[Quick Read]: This paper addresses the insensitivity to numerical change of BERTScore, a widely used semantic similarity metric, in financial text analysis; this flaw can prevent it from distinguishing numerical differences of major financial significance (such as a 2% gain versus a 20% loss). The key to the solution is FinNuE, a diagnostic dataset built with controlled numerical perturbations over financial text from earnings calls, regulatory filings, social media, and news articles. It is used to systematically verify BERTScore's lack of numerical sensitivity and to expose fundamental limitations of embedding-based evaluation in financial Natural Language Processing (NLP), motivating numerically-aware evaluation frameworks.
Link: https://arxiv.org/abs/2511.09997
Authors: Yu-Shiang Huang, Yun-Yu Lee, Tzu-Hsin Chou, Che Lin, Chuan-Ju Wang
Affiliations: National Taiwan University; Academia Sinica; National Yang Ming Chiao Tung University; Department of Information Management, National Taiwan University; Department of Electrical Engineering, National Taiwan University; Research Center for Information Technology Innovation, Academia Sinica
Categories: Computation and Language (cs.CL)
Comments: In CIKM 2025 Workshop on Advances in Financial AI: Innovations, Risk, and Responsibility in the Era of LLMs (Non-archival) (FinAI@CIKM 2025)
Abstract:BERTScore has become a widely adopted metric for evaluating semantic similarity between natural language sentences. However, we identify a critical limitation: BERTScore exhibits low sensitivity to numerical variation, a significant weakness in finance where numerical precision directly affects meaning (e.g., distinguishing a 2% gain from a 20% loss). We introduce FinNuE, a diagnostic dataset constructed with controlled numerical perturbations across earnings calls, regulatory filings, social media, and news articles. Using FinNuE, we demonstrate that BERTScore fails to distinguish semantically critical numerical differences, often assigning high similarity scores to financially divergent text pairs. Our findings reveal fundamental limitations of embedding-based metrics for finance and motivate numerically-aware evaluation frameworks for financial NLP.
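A controlled numerical perturbation of the kind the abstract describes might look like the sketch below. The abstract does not list the actual perturbation types used to build FinNuE, so the scaling rule, the regular expression, and the `perturb_numbers` helper are hypothetical.

```python
import re

# Matches integers and simple decimals; a deliberately narrow, illustrative pattern.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def perturb_numbers(text, factor):
    """Scale every number in `text` by `factor`, leaving all other tokens untouched."""
    def scale(match):
        value = float(match.group()) * factor
        # Print whole numbers without a trailing ".0".
        return str(int(value)) if value.is_integer() else f"{value:g}"
    return NUMBER.sub(scale, text)


original = "Revenue rose 2% to 150 million."
perturbed = perturb_numbers(original, 10)
print(perturbed)
```

The original and perturbed sentences form a pair whose wording is almost identical but whose financial meaning differs, which is the kind of pair a similarity metric should score lower.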
[NLP-53] Language Drift in Multilingual Retrieval-Augmented Generation: Characterization and Decoding-Time Mitigation AAAI’26
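[Quick Read]: This paper addresses output language drift in Multilingual Retrieval-Augmented Generation (Multilingual RAG): when the retrieved evidence differs in language from the user query and in-context exemplars, the large language model (LLM) tends to respond in an unintended language, and the effect is especially pronounced during Chain-of-Thought (CoT) reasoning. The study shows that the drift stems not from comprehension failure but from decoder-level collapse, in which high-frequency English patterns dominate generation and English acts as a semantic attractor under cross-lingual conditions. The key to the solution is Soft Constrained Decoding (SCD), a lightweight, training-free decoding strategy that gently steers generation by penalizing non-target-language tokens. SCD is model-agnostic and can be plugged into any generation algorithm without architectural changes or additional data; experiments show that it consistently improves language alignment and task performance.
Link: https://arxiv.org/abs/2511.09984
Authors: Bo Li, Zhenghua Xu, Rui Xie
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: AAAI’26, Oral Paper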
Abstract:Multilingual Retrieval-Augmented Generation (RAG) enables large language models (LLMs) to perform knowledge-intensive tasks in multilingual settings by leveraging retrieved documents as external evidence. However, when the retrieved evidence differs in language from the user query and in-context exemplars, the model often exhibits language drift by generating responses in an unintended language. This phenomenon is especially pronounced during reasoning-intensive decoding, such as Chain-of-Thought (CoT) generation, where intermediate steps introduce further language instability. In this paper, we systematically study output language drift in multilingual RAG across multiple datasets, languages, and LLM backbones. Our controlled experiments reveal that the drift results not from comprehension failure but from decoder-level collapse, where dominant token distributions and high-frequency English patterns dominate the intended generation language. We further observe that English serves as a semantic attractor under cross-lingual conditions, emerging as both the strongest interference source and the most frequent fallback language. To mitigate this, we propose Soft Constrained Decoding (SCD), a lightweight, training-free decoding strategy that gently steers generation toward the target language by penalizing non-target-language tokens. SCD is model-agnostic and can be applied to any generation algorithm without modifying the architecture or requiring additional data. Experiments across three multilingual datasets and multiple typologically diverse languages show that SCD consistently improves language alignment and task performance, providing an effective and generalizable solution in multilingual RAG. 
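The idea of penalizing non-target-language tokens at decoding time can be sketched on a toy vocabulary. The abstract does not state the penalty form or how tokens are assigned to languages, so the constant logit penalty and the boolean mask below are assumptions.

```python
import math


def soft_constrain(logits, is_target_language, penalty=2.0):
    """Subtract `penalty` from the logit of every non-target-language token.

    The constraint is soft: penalized tokens stay possible, just less likely.
    """
    return [
        logit if is_target else logit - penalty
        for logit, is_target in zip(logits, is_target_language)
    ]


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy vocabulary: token 0 is English, tokens 1 and 2 are in the target language.
logits = [3.0, 2.5, 1.0]
mask = [False, True, True]

adjusted = soft_constrain(logits, mask)
probs = softmax(adjusted)
best = max(range(len(probs)), key=probs.__getitem__)
print(adjusted, best)
```

Without the penalty the English token (index 0) has the highest logit; with it, the target-language token at index 1 becomes the most likely choice. Because the adjustment only touches logits, it can sit in front of greedy, sampling, or beam-search decoding alike.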
[NLP-54] Modeling Uncertainty Trends for Timely Retrieval in Dynamic RAG AAAI’26
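[Quick Read]: This paper addresses poorly timed retrieval in Dynamic Retrieval-Augmented Generation (Dynamic RAG): existing methods usually trigger retrieval on low token-level confidence, which may intervene only after errors have already propagated and so hurts generation quality. The key to the solution is the training-free Entropy-Trend Constraint (ETC), which models the dynamics of token-level uncertainty and uses first- and second-order differences of the entropy sequence to detect uncertainty trends, enabling earlier and more precise retrieval. The method improves retrieval efficiency and accuracy, outperforms baselines on multiple QA benchmarks, and generalizes well.
Link: https://arxiv.org/abs/2511.09980
Authors: Bo Li, Tian Tian, Zhenghua Xu, Hao Cheng, Shikun Zhang, Wei Ye
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: AAAI’26, Oral Paper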
Abstract:Dynamic retrieval-augmented generation (RAG) allows large language models (LLMs) to fetch external knowledge on demand, offering greater adaptability than static RAG. A central challenge in this setting lies in determining the optimal timing for retrieval. Existing methods often trigger retrieval based on low token-level confidence, which may lead to delayed intervention after errors have already propagated. We introduce Entropy-Trend Constraint (ETC), a training-free method that determines optimal retrieval timing by modeling the dynamics of token-level uncertainty. Specifically, ETC utilizes first- and second-order differences of the entropy sequence to detect emerging uncertainty trends, enabling earlier and more precise retrieval. Experiments on six QA benchmarks with three LLM backbones demonstrate that ETC consistently outperforms strong baselines while reducing retrieval frequency. ETC is particularly effective in domain-specific scenarios, exhibiting robust generalization capabilities. Ablation studies and qualitative analyses further confirm that trend-aware uncertainty modeling yields more effective retrieval timing. The method is plug-and-play, model-agnostic, and readily integrable into existing decoding pipelines. Implementation code is included in the supplementary materials.
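A trigger based on first- and second-order differences of the entropy sequence can be sketched as follows. The abstract does not give the exact decision rule, so the thresholds and the "rising and accelerating" condition in `should_retrieve` are assumptions.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def differences(seq):
    """Discrete first-order difference of a sequence."""
    return [b - a for a, b in zip(seq, seq[1:])]


def should_retrieve(entropies, first_thresh=0.1, second_thresh=0.0):
    """Trigger when entropy is rising (1st difference) and the rise is accelerating (2nd)."""
    first = differences(entropies)
    second = differences(first)
    if not second:  # fewer than three tokens generated so far
        return False
    return first[-1] > first_thresh and second[-1] > second_thresh


rising = [0.20, 0.25, 0.45]   # uncertainty starts to climb quickly
falling = [0.50, 0.45, 0.40]  # uncertainty is shrinking
print(should_retrieve(rising), should_retrieve(falling))
```

Unlike a fixed confidence threshold, this rule can fire while the absolute entropy is still moderate, which matches the abstract's goal of retrieving before errors propagate.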
[NLP-55] NumPert: Numerical Perturbations to Probe Language Models for Veracity Prediction AACL
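[Quick Read]: This paper addresses the weak performance of current large language models on numerical fact-checking, in particular their lack of robustness under controlled perturbations. Evaluating mainstream models with systematic perturbations such as label-flipping probes, the study finds that even state-of-the-art proprietary models lose up to 62% accuracy under certain perturbations, and that no model stays robust across all conditions. The key finding is that although longer context generally reduces accuracy, most models recover substantially when the extended context is enriched with perturbed demonstrations. This suggests that demonstrations containing perturbed samples are a promising direction for improving numerical reasoning and fact-checking.
Link: https://arxiv.org/abs/2511.09971
Authors: Peter Røysland Aarnes, Vinay Setty
Affiliations: University of Stavanger
Categories: Computation and Language (cs.CL)
Comments: Accepted in IJCNLP/AACL SRW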
Abstract:Large language models show strong performance on knowledge intensive tasks such as fact-checking and question answering, yet they often struggle with numerical reasoning. We present a systematic evaluation of state-of-the-art models for veracity prediction on numerical claims and evidence pairs using controlled perturbations, including label-flipping probes, to test robustness. Our results indicate that even leading proprietary systems experience accuracy drops of up to 62% under certain perturbations. No model proves to be robust across all conditions. We further find that increasing context length generally reduces accuracy, but when extended context is enriched with perturbed demonstrations, most models substantially recover. These findings highlight critical limitations in numerical fact-checking and suggest that robustness remains an open challenge for current language models.
[NLP-56] REAP: Enhancing RAG with Recursive Evaluation and Adaptive Planning for Multi-Hop Question Answering AAAI2026
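[Quick Read]: This paper addresses two weaknesses of existing Retrieval-Augmented Generation (RAG) methods on multi-hop reasoning: the lack of global planning, which leaves them prone to local reasoning impasses, and the insufficient use of retrieved content and neglect of latent clues, which reduce reasoning accuracy. The key to the solution is the Recursive Evaluation and Adaptive Planning (REAP) framework, built around a Sub-task Planner (SP) and a Fact Extractor (FE) that work together: SP maintains a global perspective, dynamically evaluating the task state and optimizing the reasoning trajectory, while FE performs fine-grained analysis of retrieved content to extract reliable answers and latent clues. Together they build a logically coherent representation of global knowledge, improving the reliability and traceability of reasoning; a unified task paradigm additionally enables multi-task fine-tuning, which markedly improves performance on complex, data-scarce tasks.
Link: https://arxiv.org/abs/2511.09966
Authors: Yijie Zhu, Haojie Zhou, Wanting Hong, Tailin Liu, Ning Wang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: To be published in AAAI 2026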
Abstract:Retrieval-augmented generation (RAG) has been extensively employed to mitigate hallucinations in large language models (LLMs). However, existing methods for multi-hop reasoning tasks often lack global planning, increasing the risk of falling into local reasoning impasses. Insufficient exploitation of retrieved content and the neglect of latent clues fail to ensure the accuracy of reasoning outcomes. To overcome these limitations, we propose Recursive Evaluation and Adaptive Planning (REAP), whose core idea is to explicitly maintain structured sub-tasks and facts related to the current task through the Sub-task Planner (SP) and Fact Extractor (FE) modules. SP maintains a global perspective, guiding the overall reasoning direction and evaluating the task state based on the outcomes of FE, enabling dynamic optimization of the task-solving trajectory. FE performs fine-grained analysis over retrieved content to extract reliable answers and clues. These two modules incrementally enrich a logically coherent representation of global knowledge, enhancing the reliability and the traceability of the reasoning process. Furthermore, we propose a unified task paradigm design that enables effective multi-task fine-tuning, significantly enhancing SP’s performance on complex, data-scarce tasks. We conduct extensive experiments on multiple public multi-hop datasets, and the results demonstrate that our method significantly outperforms existing RAG methods in both in-domain and out-of-domain settings, validating its effectiveness in complex multi-hop reasoning tasks.
[NLP-57] Leveraging Large Language Models for Identifying Knowledge Components
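[Quick Read]: This paper addresses the bottleneck of automatically identifying Knowledge Components (KCs) in adaptive learning systems, where manual labeling by domain experts is inefficient and hard to scale. The approach first uses a large language model (GPT-4o-mini) to generate KCs automatically for a larger question bank (646 multiple-choice questions), then merges semantically similar KC labels based on cosine similarity to remove redundancy. Experiments show that LLM generation alone yields far too many KCs and worse performance than the expert-designed model, whereas merging with a threshold of 0.8 reduces the KC count from 569 to 428 and improves the RMSE to 0.4259.
Link: https://arxiv.org/abs/2511.09935
Authors: Canwen Wang, Jionghao Lin, Kenneth R. Koedinger
Affiliations: Carnegie Mellon University
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Accepted as an extended abstract in The International Conference on Learning Analytics & Knowledge (LAK’25) Workshop: LLMs for Qualitative Analysis in Education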
Abstract:Knowledge Components (KCs) are foundational to adaptive learning systems, but their manual identification by domain experts is a significant bottleneck. While Large Language Models (LLMs) offer a promising avenue for automating this process, prior research has been limited to small datasets and has been shown to produce superfluous, redundant KC labels. This study addresses these limitations by first scaling a “simulated textbook” LLM prompting strategy (using GPT-4o-mini) to a larger dataset of 646 multiple-choice questions. We found that this initial automated approach performed significantly worse than an expert-designed KC model (RMSE 0.4285 vs. 0.4206) and generated an excessive number of KCs (569 vs. 101). To address the issue of redundancy, we proposed and evaluated a novel method for merging semantically similar KC labels based on their cosine similarity. This merging strategy significantly improved the model’s performance; a model using a cosine similarity threshold of 0.8 achieved the best result, reducing the KC count to 428 and improving the RMSE to 0.4259. This demonstrates that while scaled LLM generation alone is insufficient, combining it with a semantic merging technique offers a viable path toward automating and refining KC identification.
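Merging KC labels whose embeddings exceed a cosine similarity threshold can be sketched as below. The abstract gives the threshold (0.8) but not the embedding model or the merging procedure, so the greedy first-match strategy and the toy three-dimensional vectors are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def merge_labels(embeddings, threshold=0.8):
    """Greedy merge: map each label to the first kept label with similarity >= threshold."""
    representatives = []
    mapping = {}
    for label, vec in embeddings.items():
        for rep in representatives:
            if cosine(vec, embeddings[rep]) >= threshold:
                mapping[label] = rep
                break
        else:  # no sufficiently similar representative found
            representatives.append(label)
            mapping[label] = label
    return mapping


# Toy embeddings standing in for real sentence embeddings of KC labels.
toy = {
    "add fractions": [1.0, 0.1, 0.0],
    "adding fractions": [0.9, 0.2, 0.0],
    "balance equations": [0.0, 0.1, 1.0],
}
merged = merge_labels(toy)
print(merged)
```

Here the two near-duplicate fraction labels collapse into one KC while the unrelated label survives, which is the redundancy reduction the abstract reports at a larger scale.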
[NLP-58] Compensating Distribution Drifts in Class-incremental Learning of Pre-trained Vision Transformers AAAI AAAI2026
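[Quick Read]: This paper addresses the feature distribution drift that arises in Class-Incremental Learning (CIL) when Sequential Fine-Tuning (SeqFT) repeatedly optimizes shared backbone parameters; the drift creates a mismatch between the feature distributions of previously learned classes and those of the updated model, degrading classifier performance. The key to the solution is a latent space transition operator, realized in two variants of Sequential Learning with Drift Compensation (SLDC): linear SLDC learns the mapping by solving a regularized least-squares problem to align feature distributions across tasks, and weakly nonlinear SLDC adds learnable, weakly nonlinear transformations that balance flexibility and generalization. Knowledge Distillation (KD) is applied as well to further reduce representation drift, bringing SeqFT performance close to that of joint training.
Link: https://arxiv.org/abs/2511.09926
Authors: Xuan Rao, Simian Xu, Zheng Li, Bo Zhao, Derong Liu, Mingming Ha, Cesare Alippi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: The 40th Annual AAAI Conference on Artificial Intelligence (AAAI 2026)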
Abstract:Recent advances have shown that sequential fine-tuning (SeqFT) of pre-trained vision transformers (ViTs), followed by classifier refinement using approximate distributions of class features, can be an effective strategy for class-incremental learning (CIL). However, this approach is susceptible to distribution drift, caused by the sequential optimization of shared backbone parameters. This results in a mismatch between the distributions of the previously learned classes and that of the updater model, ultimately degrading the effectiveness of classifier performance over time. To address this issue, we introduce a latent space transition operator and propose Sequential Learning with Drift Compensation (SLDC). SLDC aims to align feature distributions across tasks to mitigate the impact of drift. First, we present a linear variant of SLDC, which learns a linear operator by solving a regularized least-squares problem that maps features before and after fine-tuning. Next, we extend this with a weakly nonlinear SLDC variant, which assumes that the ideal transition operator lies between purely linear and fully nonlinear transformations. This is implemented using learnable, weakly nonlinear mappings that balance flexibility and generalization. To further reduce representation drift, we apply knowledge distillation (KD) in both algorithmic variants. Extensive experiments on standard CIL benchmarks demonstrate that SLDC significantly improves the performance of SeqFT. Notably, by combining KD to address representation drift with SLDC to compensate distribution drift, SeqFT achieves performance comparable to joint training across all evaluated datasets. Code: this https URL.
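For the linear variant, a regularized least-squares problem of this kind has a standard closed form. The notation below is an assumption, not the paper's: $F_{\text{old}}$ and $F_{\text{new}}$ are $n \times d$ matrices holding the features of the same $n$ samples before and after fine-tuning, $W$ is the $d \times d$ linear operator, and $\lambda > 0$ is the regularization weight.

```latex
W^{\star}
  = \arg\min_{W}\; \lVert F_{\text{old}} W - F_{\text{new}} \rVert_F^2
    + \lambda \lVert W \rVert_F^2
  = \left( F_{\text{old}}^{\top} F_{\text{old}} + \lambda I \right)^{-1}
    F_{\text{old}}^{\top} F_{\text{new}}
```

Under these assumptions, stored statistics of earlier classes could be moved into the updated feature space by applying $W^{\star}$, for example mapping a class mean $\mu$ to $W^{\star\top}\mu$; whether the paper applies the operator in exactly this way is not stated in the abstract.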
[NLP-59] MINDS: A Cross-cultural Dialogue Corpus for Social Norm Classification and Adherence Detection AACL2025
[Quick Read]: This paper addresses the limited ability of computational models to reason about social norms in multi-turn dialogue, in particular the failure of existing methods to capture the context dependence, cultural variation, and pragmatic intent of real conversations. The core challenge is that social norms are subjective, culturally diverse, and highly context-sensitive, which datasets built on isolated utterances or synthetic dialogues cannot model adequately. The key to the solution is Norm-RAG, a framework that combines Retrieval-Augmented Generation (RAG) with agentic reasoning: it retrieves supporting passages from structured normative documentation via a novel Semantic Chunking technique and grounds utterance-level attributes (communicative intent, speaker roles, interpersonal framing, and linguistic cues) in interpretable norm entries, enabling detection of norm adherence and violation across multilingual multi-turn dialogues.
Link: https://arxiv.org/abs/2511.09918
Authors: Pritish Sahu, Anirudh Som, Dimitra Vergyri, Ajay Divakaran
Affiliations: SRI International
Categories: Computation and Language (cs.CL)
Comments: IJCNLP-AACL 2025
Abstract:Social norms are implicit, culturally grounded expectations that guide interpersonal communication. Unlike factual commonsense, norm reasoning is subjective, context-dependent, and varies across cultures, posing challenges for computational models. Prior works provide valuable normative annotations but mostly target isolated utterances or synthetic dialogues, limiting their ability to capture the fluid, multi-turn nature of real-world conversations. In this work, we present Norm-RAG, a retrieval-augmented, agentic framework for nuanced social norm inference in multi-turn dialogues. Norm-RAG models utterance-level attributes including communicative intent, speaker roles, interpersonal framing, and linguistic cues and grounds them in structured normative documentation retrieved via a novel Semantic Chunking approach. This enables interpretable and context-aware reasoning about norm adherence and violation across multilingual dialogues. We further introduce MINDS (Multilingual Interactions with Norm-Driven Speech), a bilingual dataset comprising 31 multi-turn Mandarin-English and Spanish-English conversations. Each turn is annotated for norm category and adherence status using multi-annotator consensus, reflecting cross-cultural and realistic norm expression. Our experiments show that Norm-RAG improves norm detection and generalization, demonstrating improved performance for culturally adaptive and socially intelligent dialogue systems.
[NLP-60] HI-TransPA: Hearing Impairments Translation Personal Assistant
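[Quick Read]: This paper addresses the multimodal speech understanding and translation difficulties that hearing-impaired people face in daily communication, especially the limited adaptability of existing models to noise, heterogeneous data, and the speech characteristics of hearing-impaired speakers. The key to the solution is HI-TransPA, an instruction-driven audio-visual personal assistant. Its main contributions are a complete preprocessing and curation pipeline (facial landmark detection, lip-region isolation and stabilization, and multimodal sample quality scoring), a curriculum learning strategy driven by the quality scores to improve robustness, and a SigLIP encoder combined with a Unified 3D-Resampler to encode high-frame-rate lip motion efficiently, enabling both translation and dialogue within a single framework.
Link: https://arxiv.org/abs/2511.09915
Authors: Zhiming Ma, Shiyu Gan, Junhao Zhao, Xianming Li, Qingyun Pan, Peidong Wang, Mingjun Pan, Yuhao Mo, Jiajie Cheng, Chengxin Chen, Zhonglun Cao, Chonghan Liu, Shi Cheng
Affiliations: SmartFlowAI Research
Categories: Computation and Language (cs.CL); Multimedia (cs.MM); Sound (cs.SD)
Comments: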
Abstract:To provide a unified and flexible solution for daily communication among hearing-impaired individuals, we introduce the Omni-Model paradigm into assistive technology and present HI-TransPA, an instruction-driven audio-visual personal assistant. The model fuses indistinct speech with high-frame-rate lip dynamics, enabling both translation and dialogue within a single multimodal framework. To tackle the challenges of noisy and heterogeneous raw data and the limited adaptability of existing Omni-Models to hearing-impaired speech, we construct a comprehensive preprocessing and curation pipeline that detects facial landmarks, isolates and stabilizes the lip region, and quantitatively assesses multimodal sample quality. These quality scores guide a curriculum learning strategy that first trains on clean, high-confidence samples and progressively incorporates harder cases to strengthen model robustness. We further adopt a SigLIP encoder combined with a Unified 3D-Resampler to efficiently encode high-frame-rate lip motion. Experiments on our purpose-built HI-Dialogue dataset show that HI-TransPA achieves state-of-the-art performance in both literal accuracy and semantic fidelity. This work establishes a foundation for applying Omni-Models to assistive communication technology, providing an end-to-end modeling framework and essential processing tools for future research.
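The curriculum described in the abstract, training first on clean, high-confidence samples and then admitting harder ones, can be sketched as a sequence of growing training pools. The score scale, the stage thresholds, and the `curriculum_stages` helper are hypothetical; the abstract does not say how the quality scores are scheduled.

```python
def curriculum_stages(samples, thresholds=(0.8, 0.5, 0.0)):
    """Return one training pool per stage; each stage admits samples scoring >= its threshold.

    Thresholds decrease, so later pools are supersets of earlier ones.
    """
    stages = []
    for threshold in thresholds:
        pool = [name for name, score in samples if score >= threshold]
        stages.append(pool)
    return stages


# (sample id, quality score in [0, 1]); the scores are made up for illustration.
samples = [("clip_a", 0.95), ("clip_b", 0.60), ("clip_c", 0.30)]
stages = curriculum_stages(samples)
for index, pool in enumerate(stages, start=1):
    print(f"stage {index}: {pool}")
```

Keeping the easy samples in later stages is one design choice among several; an alternative would be to train each stage only on the newly admitted samples.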
[NLP-61] Regional Attention-Enhanced Swin Transformer for Clinically Relevant Medical Image Captioning
[Quick Read]: This paper addresses automated medical image captioning, i.e., turning complex radiological images into structured, readable diagnostic narratives that support clinical reporting workflows. The key to the solution is a Swin-BART encoder-decoder model with a lightweight regional attention module that amplifies diagnostically salient region features before cross-attention, improving the semantic fidelity and interpretability of the generated text. On the ROCO dataset the design achieves state-of-the-art ROUGE and BERTScore while remaining compact and providing transparent regional attributions, making it suitable for safe research use with a human in the loop.
Link: https://arxiv.org/abs/2511.09893
Authors: Zubia Naz, Farhan Asghar, Muhammad Ishfaq Hussain, Yahya Hadadi, Muhammad Aasim Rafique, Wookjin Choi, Moongu Jeon
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:
Abstract:Automated medical image captioning translates complex radiological images into diagnostic narratives that can support reporting workflows. We present a Swin-BART encoder-decoder system with a lightweight regional attention module that amplifies diagnostically salient regions before cross-attention. Trained and evaluated on ROCO, our model achieves state-of-the-art semantic fidelity while remaining compact and interpretable. We report results as mean ± std over three seeds and include 95% confidence intervals. Compared with baselines, our approach improves ROUGE (proposed 0.603, ResNet-CNN 0.356, BLIP2-OPT 0.255) and BERTScore (proposed 0.807, BLIP2-OPT 0.645, ResNet-CNN 0.623), with competitive BLEU, CIDEr, and METEOR. We further provide ablations (regional attention on/off and token-count sweep), per-modality analysis (CT/MRI/X-ray), paired significance tests, and qualitative heatmaps that visualize the regions driving each description. Decoding uses beam search (beam size = 4), length penalty = 1.1, no_repeat_ngram_size = 3, and max length = 128. The proposed design yields accurate, clinically phrased captions and transparent regional attributions, supporting safe research use with a human in the loop.
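The decoding settings quoted in the abstract can be written down as a small configuration. The key names follow the keyword arguments of the Hugging Face Transformers `generate` method; that the authors used this library is an assumption suggested only by the `no_repeat_ngram_size` spelling in the abstract.

```python
# Decoding settings as stated in the abstract; key names assume the
# Hugging Face Transformers `generate` keyword arguments.
generation_kwargs = {
    "num_beams": 4,             # beam size
    "length_penalty": 1.1,      # values above 1.0 favor longer captions in beam scoring
    "no_repeat_ngram_size": 3,  # forbid repeating any 3-gram within a caption
    "max_length": 128,          # upper bound on generated caption length in tokens
}

for name, value in generation_kwargs.items():
    print(f"{name} = {value}")
```

With a Transformers encoder-decoder model these would typically be passed as `model.generate(**inputs, **generation_kwargs)`, where `model` and `inputs` are placeholders not defined here.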
[NLP-62] EnchTable: Unified Safety Alignment Transfer in Fine-tuned Large Language Models
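[Quick Read]: This paper addresses the systematic degradation of safety alignment when Large Language Models (LLMs) are fine-tuned for downstream tasks: gains in task performance often come with eroded ethical guidelines and a higher risk of harmful outputs. The key to the solution is the EnchTable framework, whose core innovations are a Neural Tangent Kernel (NTK)-based safety vector distillation method that decouples safety constraints from task-specific reasoning and thus works across architectures and model sizes, and an interference-aware merging technique that balances safety and task utility, minimizing safety risk across task domains without a significant loss in performance.
Link: https://arxiv.org/abs/2511.09880
Authors: Jialin Wu, Kecen Li, Zhicong Huang, Xinfeng Li, Xiaofeng Wang, Cheng Hong
Affiliations: Ant Group; Nanyang Technological University
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: Accepted by IEEE Symposium on Security and Privacy (SP) 2026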
Abstract:Many machine learning models are fine-tuned from large language models (LLMs) to achieve high performance in specialized domains like code generation, biomedical analysis, and mathematical problem solving. However, this fine-tuning process often introduces a critical vulnerability: the systematic degradation of safety alignment, undermining ethical guidelines and increasing the risk of harmful outputs. Addressing this challenge, we introduce EnchTable, a novel framework designed to transfer and maintain safety alignment in downstream LLMs without requiring extensive retraining. EnchTable leverages a Neural Tangent Kernel (NTK)-based safety vector distillation method to decouple safety constraints from task-specific reasoning, ensuring compatibility across diverse model architectures and sizes. Additionally, our interference-aware merging technique effectively balances safety and utility, minimizing performance compromises across various task domains. We implemented a fully functional prototype of EnchTable on three different task domains and three distinct LLM architectures, and evaluated its performance through extensive experiments on eleven diverse datasets, assessing both utility and model safety. Our evaluations include LLMs from different vendors, demonstrating EnchTable’s generalization capability. Furthermore, EnchTable exhibits robust resistance to static and dynamic jailbreaking attacks, outperforming vendor-released safety models in mitigating adversarial prompts. Comparative analyses with six parameter modification methods and two inference-time alignment baselines reveal that EnchTable achieves a significantly lower unsafe rate, higher utility score, and universal applicability across different task domains. Additionally, we validate EnchTable can be seamlessly integrated into various deployment pipelines without significant overhead.
[NLP-63] HierRouter: Coordinated Routing of Specialized Large Language Models via Reinforcement Learning
[Quick Read]: This paper addresses the high computational and memory cost of deploying Large Language Models (LLMs) in resource-constrained or real-time settings. The key to the solution is HierRouter, a hierarchical routing mechanism that dynamically assembles multiple lightweight, specialized language models into a multi-hop inference pipeline. A reinforcement learning agent trained with Proximal Policy Optimization (PPO) makes a context-aware routing decision at each step based on the current context and accumulated cost, improving response quality by up to 2.4x while adding only minimal inference cost.
Link: https://arxiv.org/abs/2511.09873
Authors: Nikunj Gupta, Bill Guo, Rajgopal Kannan, Viktor K. Prasanna
Affiliations: University of Southern California; DEVCOM Army Research Office
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Large Language Models (LLMs) deliver state-of-the-art performance across many tasks but impose high computational and memory costs, limiting their deployment in resource-constrained or real-time settings. To address this, we propose HierRouter, a hierarchical routing approach that dynamically assembles inference pipelines from a pool of specialized, lightweight language models. Formulated as a finite-horizon Markov Decision Process (MDP), our approach trains a Proximal Policy Optimization (PPO)-based reinforcement learning agent to iteratively select which models to invoke at each stage of multi-hop inference. The agent conditions on the evolving context and accumulated cost to make context-aware routing decisions. Experiments with three open-source candidate LLMs across six benchmarks, including QA, code generation, and mathematical reasoning, show that HierRouter improves response quality by up to 2.4x compared to using individual models independently, while incurring only a minimal additional inference cost on average. These results highlight the promise of hierarchical routing for cost-efficient, high-performance LLM inference. All code can be found at this https URL (Nikunj-Gupta/hierouter).
[NLP-64] In-Token Rationality Optimization: Towards Accurate and Concise LLM Reasoning via Self-Feedback AAAI2026
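[Quick read]: This paper tackles two challenges in training Large Language Models (LLMs) for chain-of-thought reasoning. Supervised fine-tuning on a single "golden" rationale penalizes other equally valid reasoning paths and so hurts generalization, while reinforcement learning with verifiable rewards struggles with credit assignment and is computationally expensive. The key to the solution is InTRO (In-Token Rationality Optimization), which computes a correction factor (a token-wise importance weight) for each token by estimating the information discrepancy between the generative policy and its answer-conditioned counterpart. This lets the model explore at the token level and receive self-generated feedback within a single forward pass, steering it toward more accurate and more concise rationales.
Link: https://arxiv.org/abs/2511.09865
Authors: Mingye Zhu,Yi Liu,Zheren Fu,Quan Wang,Yongdong Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: AAAI 2026 Oral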
Abstract:Training Large Language Models (LLMs) for chain-of-thought reasoning presents a significant challenge: supervised fine-tuning on a single “golden” rationale hurts generalization as it penalizes equally valid alternatives, whereas reinforcement learning with verifiable rewards struggles with credit assignment and prohibitive computational cost. To tackle these limitations, we introduce InTRO (In-Token Rationality Optimization), a new framework that enables both token-level exploration and self-feedback for accurate and concise reasoning. Instead of directly optimizing an intractable objective over all valid reasoning paths, InTRO leverages correction factors (token-wise importance weights estimated by the information discrepancy between the generative policy and its answer-conditioned counterpart) for informative next token selection. This approach allows the model to perform token-level exploration and receive self-generated feedback within a single forward pass, ultimately encouraging accurate and concise rationales. Across six math-reasoning benchmarks, InTRO consistently outperforms other baselines, raising solution accuracy by up to 20% relative to the base model. Its chains of thought are also notably more concise, exhibiting reduced verbosity. Beyond this, InTRO enables cross-domain transfer, successfully adapting to out-of-domain reasoning tasks that extend beyond the realm of mathematics, demonstrating robust generalization.
[NLP-65] TermGPT: Multi-Level Contrastive Fine-Tuning for Terminology Adaptation in Legal and Financial Domain
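[Quick read]: This paper addresses the weak terminology-level representations of Large Language Models (LLMs) in specialized domains such as law and finance, which the authors attribute to the isotropy problem in their embedding spaces and which can severely hurt downstream tasks such as legal judgment prediction or financial risk analysis. The key to the solution is TermGPT, a multi-level contrastive fine-tuning framework. It first builds a sentence graph to capture semantic and structural relations and uses contextual and topological cues to generate positive and negative samples that are semantically consistent yet discriminative. It then applies contrastive learning at both the sentence and token levels, strengthening global contextual understanding and fine-grained terminology discrimination at the same time.
Link: https://arxiv.org/abs/2511.09854
Authors: Yidan Sun,Mengying Zhu,Feiyue Chen,Yangyang Wu,Xiaolei Dan,Mengyuan Yang,Xiaolin Zheng,Shenglin Ben
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 13 pages, 4 figures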
Abstract:Large language models (LLMs) have demonstrated impressive performance in text generation tasks; however, their embedding spaces often suffer from the isotropy problem, resulting in poor discrimination of domain-specific terminology, particularly in legal and financial contexts. This weakness in terminology-level representation can severely hinder downstream tasks such as legal judgment prediction or financial risk analysis, where subtle semantic distinctions are critical. To address this problem, we propose TermGPT, a multi-level contrastive fine-tuning framework designed for terminology adaptation. We first construct a sentence graph to capture semantic and structural relations, and generate semantically consistent yet discriminative positive and negative samples based on contextual and topological cues. We then devise a multi-level contrastive learning approach at both the sentence and token levels, enhancing global contextual understanding and fine-grained terminology discrimination. To support robust evaluation, we construct the first financial terminology dataset derived from official regulatory documents. Experiments show that TermGPT outperforms existing baselines in term discrimination tasks within the finance and legal domains.
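The abstract does not state which contrastive objective TermGPT uses. As a rough illustration of contrastive learning over one anchor, one positive, and several negatives, the sketch below uses the common InfoNCE loss with cosine similarity. The loss choice, the function names, and the temperature value are assumptions made for this example, not details taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: low when the anchor is closer to the positive than to the negatives."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

# Toy 2-D embeddings (hypothetical): the same loss could be applied to
# sentence-level or token-level vectors.
easy = info_nce([1.0, 0.0], [0.9, 0.1], [[0.0, 1.0]])
hard = info_nce([1.0, 0.0], [0.0, 1.0], [[0.9, 0.1]])
```

Here `easy` is close to zero because the positive lies near the anchor, while `hard` is large because the roles of the positive and the negative are swapped.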
[NLP-66] Answering Students Questions on Course Forums Using Multiple Chain-of-Thought Reasoning and Finetuning RAG-Enabled LLM
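[Quick read]: This paper addresses slow responses to student questions, and the burden of repetitive questions, on course forums, a problem that grows with enrollment because instructors can no longer answer every query promptly. The key to the solution is a question answering system built on an open-source Large Language Model (LLM) that is fine-tuned on the relevant course dataset and combined with Retrieval-Augmented Generation (RAG): a local knowledge base containing all course content is searched for documents relevant to each student question, which improves the relevance and accuracy of the answers. Multi chain-of-thought reasoning is added to mitigate LLM hallucination. The fine-tuned LLM with RAG is evaluated on the HotpotQA dataset and is reported to perform strongly on question answering.
Link: https://arxiv.org/abs/2511.09831
Authors: Neo Wang,Sonit Singh
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 8 pages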
Abstract:Course forums are increasingly significant and play a vital role in facilitating student discussions and answering their questions related to the course. They provide a platform for students to post questions about course content and administrative issues. However, several challenges arise as the number of students enrolled in a course increases. The primary challenge is that students’ queries cannot be answered immediately and instructors face many repetitive questions. To mitigate these issues, we propose a question answering system based on a large language model with the retrieval-augmented generation (RAG) method. This work focuses on designing a question answering system with an open-source Large Language Model (LLM) and fine-tuning it on the relevant course dataset. To further improve performance, we use a local knowledge base, which contains all the course content, and apply the RAG method to retrieve documents relevant to students’ queries. To mitigate the hallucination of LLMs, we also integrate multi chain-of-thought reasoning. In this work, we evaluate the fine-tuned LLM with the RAG method on the HotpotQA dataset. The experimental results demonstrate that the fine-tuned LLM with the RAG method performs strongly on the question answering task.
[NLP-67] Improving Graduate Outcomes by Identifying Skills Gaps and Recommending Courses Based on Career Interests
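[Quick read]: This paper addresses the difficulty students face in choosing courses that match current industry trends and their own career goals. The key to the solution is a course recommendation system that combines data mining, collaborative filtering, and machine learning algorithms, drawing on user preferences, academic criteria, and past course data to produce personalized recommendations. The system also emphasizes an easy-to-use, interactive front-end interface refined through iterative prototyping and user feedback. The aim is to bridge the gap between university learning and industry expectations and to help students make data-driven course decisions that improve graduate outcomes.
Link: https://arxiv.org/abs/2511.09819
Authors: Rahul Soni,Basem Suleiman,Sonit Singh
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 10 pages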
Abstract:This paper aims to address the challenge of selecting relevant courses for students by proposing the design and development of a course recommendation system. The course recommendation system utilises a combination of data analytics techniques and machine learning algorithms to recommend courses that align with current industry trends and requirements. In order to provide customised suggestions, the study entails the design and implementation of an extensive algorithmic framework that combines machine learning methods, user preferences, and academic criteria. The system employs data mining and collaborative filtering techniques to examine past courses and individual career goals in order to provide course recommendations. Moreover, to improve the accessibility and usefulness of the recommendation system, special attention is given to the development of an easy-to-use front-end interface. The front-end design prioritises visual clarity, interaction, and simplicity through iterative prototyping and user input revisions, guaranteeing a smooth and captivating user experience. We refined and optimised the proposed system by incorporating user feedback, ensuring that it effectively meets the needs and preferences of its target users. The proposed course recommendation system could be a useful tool for students, instructors, and career advisers to use in promoting lifelong learning and professional progression as it fills the gap between university learning and industry expectations. We hope that the proposed course recommendation system will help university students in making data-driven and industry-informed course decisions, in turn, improving graduate outcomes for the university sector.
[NLP-68] Khmer Spellchecking: A Holistic Approach
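[Quick read]: This paper addresses several key challenges in Khmer spellchecking: misalignment between the lexicon and the word segmentation model, multiple written forms of the same word, loosely and frequently formed compound words that are missing from the lexicon, and proper nouns flagged as misspellings because no Khmer named-entity recognition (NER) model is available. The core of the solution is a holistic approach that integrates Khmer subword segmentation, NER, grapheme-to-phoneme (G2P) conversion, and a language model to identify and rank potential correction candidates. It reaches a spellchecking accuracy of up to 94.4%, which the authors report as state of the art.
Link: https://arxiv.org/abs/2511.09812
Authors: Marry Kong,Rina Buoy,Sovisal Chenda,Nguonly Taing
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: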
Abstract:Compared to English and other high-resource languages, spellchecking for Khmer remains an unresolved problem due to several challenges. First, there are misalignments between words in the lexicon and the word segmentation model. Second, a Khmer word can be written in different forms. Third, Khmer compound words are often loosely and easily formed, and these compound words are not always found in the lexicon. Fourth, some proper nouns may be flagged as misspellings due to the absence of a Khmer named-entity recognition (NER) model. Unfortunately, existing solutions do not adequately address these challenges. This paper proposes a holistic approach to the Khmer spellchecking problem by integrating Khmer subword segmentation, Khmer NER, Khmer grapheme-to-phoneme (G2P) conversion, and a Khmer language model to tackle these challenges, identify potential correction candidates, and rank the most suitable candidate. Experimental results show that the proposed approach achieves a state-of-the-art Khmer spellchecking accuracy of up to 94.4%, compared to existing solutions. The benchmark datasets for Khmer spellchecking and NER tasks in this study will be made publicly available.
[NLP-69] TARG: Training-Free Adaptive Retrieval Gating for Efficient RAG
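[Quick read]: This paper addresses the inefficiency of Retrieval-Augmented Generation (RAG) when retrieval is performed for every query, which inflates token counts and latency and can even hurt generation quality. The core of the solution is Training-free Adaptive Retrieval Gating (TARG). From the prefix logits of a short, no-context draft generated by the base model, TARG computes a lightweight uncertainty score (mean token entropy, a margin signal based on the top-1/top-2 logit gap, or small-N variance across a handful of stochastic prefixes) and triggers retrieval only when the score exceeds a threshold. The gate is model agnostic, needs no additional training or auxiliary heads, and adds only tens to hundreds of draft tokens, yet it matches or improves accuracy while cutting retrieval by 70-90% and reducing end-to-end latency.
Link: https://arxiv.org/abs/2511.09803
Authors: Yufeng Wang,Lu wei,Haibin Ling
Affiliations: Stony Brook University
Categories: Computation and Language (cs.CL)
Comments: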
Abstract:Retrieval-Augmented Generation (RAG) improves factuality but retrieving for every query often hurts quality while inflating tokens and latency. We propose Training-free Adaptive Retrieval Gating (TARG), a single-shot policy that decides when to retrieve using only a short, no-context draft from the base model. From the draft’s prefix logits, TARG computes lightweight uncertainty scores: mean token entropy, a margin signal derived from the top-1/top-2 logit gap via a monotone link, or small-N variance across a handful of stochastic prefixes, and triggers retrieval only when the score exceeds a threshold. The gate is model agnostic, adds only tens to hundreds of draft tokens, and requires no additional training or auxiliary heads. On NQ-Open, TriviaQA, and PopQA, TARG consistently shifts the accuracy-efficiency frontier: compared with Always-RAG, TARG matches or improves EM/F1 while reducing retrieval by 70-90% and cutting end-to-end latency, and it remains close to Never-RAG in overhead. A central empirical finding is that under modern instruction-tuned LLMs the margin signal is a robust default (entropy compresses as backbones sharpen), with small-N variance offering a conservative, budget-first alternative. We provide ablations over gate type and prefix length and use a delta-latency view to make budget trade-offs explicit.
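The entropy and margin gates described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the `exp(-gap)` link used for the margin signal, and the threshold of 0.5 are assumptions chosen for the example, and the logits are toy values rather than real model outputs.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one position's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mean_token_entropy(prefix_logits):
    """Average Shannon entropy (in nats) across the draft prefix positions."""
    entropies = []
    for logits in prefix_logits:
        probs = softmax(logits)
        entropies.append(-sum(p * math.log(p) for p in probs if p > 0.0))
    return sum(entropies) / len(entropies)

def margin_uncertainty(prefix_logits):
    """Map the top-1/top-2 logit gap to (0, 1]; a small gap gives a high score.

    exp(-gap) is an assumed monotone link, used here only for illustration.
    """
    scores = []
    for logits in prefix_logits:
        top1, top2 = sorted(logits, reverse=True)[:2]
        scores.append(math.exp(-(top1 - top2)))
    return sum(scores) / len(scores)

def should_retrieve(score, threshold):
    """Trigger retrieval only when the uncertainty score exceeds the threshold."""
    return score > threshold

# Toy prefixes: one where the draft is unsure, one where it is confident.
unsure_prefix = [[0.0, 0.0, 0.0]]
confident_prefix = [[10.0, 0.0, 0.0]]
```

With these toy inputs, `should_retrieve(margin_uncertainty(unsure_prefix), 0.5)` is true and the same call on `confident_prefix` is false, so only the uncertain query would pay for retrieval.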
[NLP-70] Predicate-Argument Structure Divergences in Chinese and English Parallel Sentences and their Impact on Language Transfer
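[Quick read]: This paper addresses the way linguistic divergences between typologically distant languages hinder transfer in cross-lingual Natural Language Processing (NLP), where knowledge from resource-rich languages is used to transfer models or annotations to low-resource ones. The key contribution is a systematic analysis of aligned and misaligned predicate-argument structures in parallel Chinese and English sentences, together with a proposed categorization of structural divergences. A qualitative and quantitative analysis of an annotation projection experiment, run with each language in turn as the source, shows that language transfer is asymmetric, which gives empirical grounds for choosing the source language in transfer learning applications.
Link: https://arxiv.org/abs/2511.09796
Authors: Rocco Tripodi,Xiaoyu Liu
Affiliations: Ca’ Foscari University of Venice; Catholic University of the Sacred Heart
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: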
Abstract:Cross-lingual Natural Language Processing (NLP) has gained significant traction in recent years, offering practical solutions in low-resource settings by transferring linguistic knowledge from resource-rich to low-resource languages. This field leverages techniques like annotation projection and model transfer for language adaptation, supported by multilingual pre-trained language models. However, linguistic divergences hinder language transfer, especially among typologically distant languages. In this paper, we present an analysis of predicate-argument structures in parallel Chinese and English sentences. We explore the alignment and misalignment of predicate annotations, inspecting similarities and differences and proposing a categorization of structural divergences. The analysis and the categorization are supported by a qualitative and quantitative analysis of the results of an annotation projection experiment, in which, in turn, one of the two languages has been used as source language to project annotations into the corresponding parallel sentences. The results of this analysis show clearly that language transfer is asymmetric. An aspect that requires attention when it comes to selecting the source language in transfer learning applications and that needs to be investigated before any scientific claim about cross-lingual NLP is proposed.
[NLP-71] How Small Can You Go? Compact Language Models for On-Device Critical Error Detection in Machine Translation
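[Quick read]: This paper addresses the fact that Large Language Models (LLMs) used to evaluate Machine Translation (MT) are too large and costly to deploy on edge devices or in privacy-sensitive workflows. The core question is how far model size can be reduced while still detecting meaning-altering translation errors. The key to the solution is to use compact instruction-tuned models of about one billion parameters (such as Gemma-3-1B) together with standardized prompts, lightweight logit-bias calibration, and majority voting. On SynCED-EnDe-2025 this reaches MCC=0.77 and F1-ERR=0.98 with a single-sample latency of 400 ms on a MacBook Pro M4 Pro (24 GB), showing that compact LLMs are viable for local, private, low-cost MT error detection.
Link: https://arxiv.org/abs/2511.09748
Authors: Muskaan Chopra,Lorenz Sparrenberg,Sarthak Khanna,Rafet Sifa
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted in IEEE BigData 2025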
Abstract:Large Language Models (LLMs) excel at evaluating machine translation (MT), but their scale and cost hinder deployment on edge devices and in privacy-sensitive workflows. We ask: how small can you get while still detecting meaning-altering translation errors? Focusing on English-German Critical Error Detection (CED), we benchmark sub-2B models (LFM2-350M, Qwen-3-0.6B/1.7B, Llama-3.2-1B-Instruct, Gemma-3-1B) across WMT21, WMT22, and SynCED-EnDe-2025. Our framework standardizes prompts, applies lightweight logit-bias calibration and majority voting, and reports both semantic quality (MCC, F1-ERR/F1-NOT) and compute metrics (VRAM, latency, throughput). Results reveal a clear sweet spot around one billion parameters: Gemma-3-1B provides the best quality-efficiency trade-off, reaching MCC=0.77 with F1-ERR=0.98 on SynCED-EnDe-2025 after merged-weights fine-tuning, while maintaining 400 ms single-sample latency on a MacBook Pro M4 Pro (24 GB). At larger scale, Qwen-3-1.7B attains the highest absolute MCC (+0.11 over Gemma) but with higher compute cost. In contrast, ultra-small models (0.6B) remain usable with few-shot calibration yet under-detect entity and number errors. Overall, compact, instruction-tuned LLMs augmented with lightweight calibration and small-sample supervision can deliver trustworthy, on-device CED for MT, enabling private, low-cost error screening in real-world translation pipelines. All datasets, prompts, and scripts are publicly available at our GitHub repository.
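Two of the ingredients named in this abstract, majority voting over several sampled predictions and the Matthews correlation coefficient (MCC), can be sketched as below. The labels `ERR` and `NOT` mirror the metric names in the abstract; the function names and the toy predictions are assumptions for illustration, and the logit-bias calibration step is not shown.

```python
import math
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label among several sampled predictions.

    Use an odd number of votes for binary labels to avoid ties.
    """
    return Counter(labels).most_common(1)[0][0]

def mcc(y_true, y_pred, positive="ERR"):
    """Matthews correlation coefficient for binary labels; 0.0 if undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Three sampled predictions per translation pair (hypothetical outputs).
votes = [["ERR", "NOT", "ERR"], ["NOT", "NOT", "ERR"], ["ERR", "ERR", "ERR"], ["NOT", "NOT", "NOT"]]
gold = ["ERR", "NOT", "ERR", "NOT"]
predictions = [majority_vote(v) for v in votes]
score = mcc(gold, predictions)
```

MCC is a reasonable headline metric when the two classes are imbalanced, because it uses all four cells of the confusion matrix rather than only the positive class.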
[NLP-72] Assessing the Applicability of Natural Language Processing to Traditional Social Science Methodology: A Case Study in Identifying Strategic Signaling Patterns in Presidential Directives
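[Quick read]: This paper investigates how Natural Language Processing (NLP) can be used to automatically extract main topics from a large corpus of written text, applied to identifying signaling themes in Presidential Directives (PDs) from the Reagan through Clinton administrations. The key to the approach is to analyze the texts automatically with NLP and compare the output against human-labeled results, in order to assess how valid and reliable NLP is for social science research and to show both the potential and the limits of current NLP tools on complex policy texts.
Link: https://arxiv.org/abs/2511.09738
Authors: C. LeMay,A. Lane,J. Seales,M. Winstead,S. Baty
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 24 pages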
Abstract:Our research investigates how Natural Language Processing (NLP) can be used to extract main topics from a larger corpus of written data, as applied to the case of identifying signaling themes in Presidential Directives (PDs) from the Reagan through Clinton administrations. Analysts and NLP both identified relevant documents, demonstrating the potential utility of NLP in research involving large written corpora. However, we also identified discrepancies between NLP and human-labeled results that indicate a need for more research to assess the validity of NLP in this use case. The research was conducted in 2023, and the rapidly evolving landscape of AI/ML means existing tools have improved and new tools have been developed; this research displays the inherent capabilities of a potentially dated AI tool in emerging social science applications.
[NLP-73] Contextual morphologically-guided tokenization for Latin encoder models
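[Quick read]: This paper addresses the poor downstream performance that standard tokenization methods can cause in morphologically rich languages such as Latin, because they do not aim for morphological alignment. The core of the solution is morphologically-guided tokenization that draws on Latin's rich curated lexical resources. It improves overall performance on four downstream tasks, with the largest gains on out-of-domain texts, which points to better generalization. The results suggest that for languages with limited pretraining data, developing and incorporating linguistic resources can be a feasible alternative for improving language model performance.
Link: https://arxiv.org/abs/2511.09709
Authors: Marisa Hudspeth,Patrick J. Burns,Brendan O’Connor
Affiliations: Manning College of Information & Computer Sciences, University of Massachusetts Amherst; Institute for the Study of the Ancient World, New York University
Categories: Computation and Language (cs.CL)
Comments: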
Abstract:Tokenization is a critical component of language model pretraining, yet standard tokenization methods often prioritize information-theoretical goals like high compression and low fertility rather than linguistic goals like morphological alignment. In fact, they have been shown to be suboptimal for morphologically rich languages, where tokenization quality directly impacts downstream performance. In this work, we investigate morphologically-aware tokenization for Latin, a morphologically rich language that is medium-resource in terms of pretraining data, but high-resource in terms of curated lexical resources – a distinction that is often overlooked but critical in discussions of low-resource language modeling. We find that morphologically-guided tokenization improves overall performance on four downstream tasks. Performance gains are most pronounced for out of domain texts, highlighting our models’ improved generalization ability. Our findings demonstrate the utility of linguistic resources to improve language modeling for morphologically complex languages. For low-resource languages that lack large-scale pretraining data, the development and incorporation of linguistic resources can serve as a feasible alternative to improve LM performance.
[NLP-74] Order Matters: Rethinking Prompt Construction in In-Context Learning
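[Quick read]: This paper challenges the common assumption in prompt construction that, for in-context learning (ICL), which examples are selected matters far more than how they are ordered, an assumption that has led prior work to neglect ordering. Through controlled experiments on classification and generation tasks, using multiple open-source model families (0.5B to 27B parameters) and GPT-5, the authors find that the performance variance due to different orderings is comparable to that from using entirely different example sets. The key practical finding is that strong orderings can be identified using only a development set, with performance close to an oracle that picks the best ordering from test labels. This argues for treating selection and ordering as equally important and intertwined in prompt design.
Link: https://arxiv.org/abs/2511.09700
Authors: Warren Li,Yiqian Wang,Zihan Wang,Jingbo Shang
Affiliations: UC San Diego; Cushing Academy
Categories: Computation and Language (cs.CL)
Comments: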
Abstract:In-context learning (ICL) enables large language models to perform new tasks by conditioning on a sequence of examples. Most prior work reasonably and intuitively assumes that which examples are chosen has a far greater effect on performance than how those examples are ordered, leading to a focus on example selection. We revisit this assumption and conduct a systematic comparison between the effect of selection and ordering. Through controlled experiments on both classification and generation tasks, using multiple open-source model families (0.5B to 27B parameters) and GPT-5, we find that the variance in performance due to different example orderings is comparable to that from using entirely different example sets. Furthermore, we show that strong orderings can be identified using only a development set, achieving performance close to an oracle that selects the best ordering based on test labels. Our findings highlight the equal and intertwined importance of example selection and ordering in prompt design, calling for a reexamination of the assumptions held in ICL.
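Picking an ordering with only a development set, as this abstract describes, can be sketched as a search over permutations scored on held-out dev items. The abstract does not say how the authors search the space, so the exhaustive search, the function names, and especially `toy_score` are assumptions: `toy_score` is a hypothetical stand-in for prompting a model and measuring dev accuracy.

```python
from itertools import permutations

def best_ordering(examples, dev_set, score_fn):
    """Return the ordering of `examples` that scores highest on the dev set."""
    return max(permutations(examples), key=lambda order: score_fn(order, dev_set))

def toy_score(order, dev_set):
    """Stand-in scorer: counts dev labels equal to the last demonstration's label.

    A real scorer would prompt the model with `order` and measure dev accuracy.
    """
    last_label = order[-1][1]
    return sum(label == last_label for _, label in dev_set)

examples = [("good", "pos"), ("bad", "neg"), ("fine", "pos")]
dev_set = [("awful", "neg"), ("poor", "neg"), ("great", "pos")]
best = best_ordering(examples, dev_set, toy_score)
```

Exhaustive search grows factorially with the number of demonstrations, so with more than a handful of examples one would score a random sample of orderings instead.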
[NLP-75] Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600 Languages
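[Quick read]: This paper addresses the lack of automatic speech recognition (ASR) coverage for low-resource languages, which make up the vast majority of the world's languages. Expanding coverage has been costly and limited by architectures that restrict language support, and it raises ethical concerns when pursued without community collaboration. The key to the solution is Omnilingual ASR, described as the first large-scale ASR system designed for extensibility. Its main ingredients are: 1) self-supervised pre-training scaled to 7B parameters to learn robust speech representations; 2) an encoder-decoder architecture with an LLM-inspired decoder, designed for zero-shot generalization; and 3) a broad and diverse training corpus that combines public resources with community-sourced recordings gathered through compensated local partnerships. As a result, communities can introduce a new language with only a handful of samples, and coverage extends to over 1,600 languages, with substantial gains over prior systems, especially in low-resource conditions.
Link: https://arxiv.org/abs/2511.09690
Authors: Omnilingual ASR team:Gil Keren,Artyom Kozhevnikov,Yen Meng,Christophe Ropers,Matthew Setzler,Skyler Wang,Ife Adebara,Michael Auli,Can Balioglu,Kevin Chan,Chierh Cheng,Joe Chuang,Caley Droof,Mark Duppenthaler,Paul-Ambroise Duquenne,Alexander Erben,Cynthia Gao,Gabriel Mejia Gonzalez,Kehan Lyu,Sagar Miglani,Vineel Pratap,Kaushik Ram Sadagopan,Safiyyah Saleem,Arina Turkatenko,Albert Ventayol-Boada,Zheng-Xin Yong,Yu-An Chung,Jean Maillard,Rashel Moritz,Alexandre Mourachko,Mary Williamson,Shireen Yates
Affiliations: Meta
Categories: Computation and Language (cs.CL)
Comments: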
Abstract:Automatic speech recognition (ASR) has advanced in high-resource languages, but most of the world’s 7,000+ languages remain unsupported, leaving thousands of long-tail languages behind. Expanding ASR coverage has been costly and limited by architectures that restrict language support, making extension inaccessible to most–all while entangled with ethical concerns when pursued without community collaboration. To transcend these limitations, we introduce Omnilingual ASR, the first large-scale ASR system designed for extensibility. Omnilingual ASR enables communities to introduce unserved languages with only a handful of data samples. It scales self-supervised pre-training to 7B parameters to learn robust speech representations and introduces an encoder-decoder architecture designed for zero-shot generalization, leveraging a LLM-inspired decoder. This capability is grounded in a massive and diverse training corpus; by combining breadth of coverage with linguistic variety, the model learns representations robust enough to adapt to unseen languages. Incorporating public resources with community-sourced recordings gathered through compensated local partnerships, Omnilingual ASR expands coverage to over 1,600 languages, the largest such effort to date–including over 500 never before served by ASR. Automatic evaluations show substantial gains over prior systems, especially in low-resource conditions, and strong generalization. We release Omnilingual ASR as a family of models, from 300M variants for low-power devices to 7B for maximum accuracy. We reflect on the ethical considerations shaping this design and conclude by discussing its societal impact. In particular, we highlight how open-sourcing models and tools can lower barriers for researchers and communities, inviting new forms of participation. Open-source artifacts are available at this https URL.
[NLP-76] Probability-Biased Attention over Directed Bipartite Graphs for Long-Tail ICD Coding
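[Quick read]: This paper addresses the long-tail distribution in automated International Classification of Diseases (ICD) coding: within a large label space (10,000 to 20,000 codes), a few frequent codes dominate while many rare codes lack sufficient training data, which degrades model performance. The key to the solution is a Directed Bipartite Graph Encoder that models fine-grained co-occurrence relationships. Codes are split into two disjoint node sets, common and rare, and edges run only from common codes to rare codes. A bias term derived from conditional co-occurrence probabilities is injected into the attention module (Co-occurrence Encoding), so that rare code representations are enriched by their statistical co-occurrence with common codes. In addition, a Large Language Model (LLM) generates comprehensive code descriptions that enrich the initial embeddings with clinical context and comorbidity information as external knowledge. The method gives notable improvements on rare classes, particularly in Macro-F1.
Link: https://arxiv.org/abs/2511.09559
Authors: Tianlei Chen,Yuxiao Chen,Yang Li,Feifei Wang
Affiliations: Renmin University of China
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: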
Abstract:Automated International Classification of Diseases (ICD) coding aims to assign multiple disease codes to clinical documents, constituting a crucial multi-label text classification task in healthcare informatics. However, the task is challenging due to its large label space (10,000 to 20,000 codes) and long-tail distribution, where a few codes dominate while many rare codes lack sufficient training data. To address this, we propose a learning method that models fine-grained co-occurrence relationships among codes. Specifically, we construct a Directed Bipartite Graph Encoder with disjoint sets of common and rare code nodes. To facilitate a one-way information flow, edges are directed exclusively from common to rare codes. The nature of these connections is defined by a probability-based bias, which is derived from the conditional probability of a common code co-occurring given the presence of a rare code. This bias is then injected into the encoder’s attention module, a process we term Co-occurrence Encoding. This structure empowers the graph encoder to enrich rare code representations by aggregating latent comorbidity information reflected in the statistical co-occurrence of their common counterparts. To ensure high-quality input to the graph, we utilize a large language model (LLM) to generate comprehensive descriptions for codes, enriching initial embeddings with clinical context and comorbidity information, serving as external knowledge for the statistical co-occurrence relationships in the code system. Experiments on three automated ICD coding benchmark datasets demonstrate that our method achieves state-of-the-art performance with particularly notable improvements in Macro-F1, which is the key metric for long-tail classification.
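The probability-based bias in this abstract can be illustrated with a small sketch: estimate the conditional probability of each common code given a rare code from training label sets, then add a bias derived from it to the attention logits that the rare code places on common codes. The abstract does not give the exact form of the bias, so taking `log(p + eps)`, the function names, and the placeholder code names `C1`, `C2`, `R1` are all assumptions for illustration.

```python
import math

def cooccurrence_bias(records, common, rare, eps=1e-6):
    """bias[r][c] = log(P(c present | r present) + eps), estimated from label sets."""
    bias = {}
    for r in rare:
        with_r = [labels for labels in records if r in labels]
        bias[r] = {}
        for c in common:
            p = sum(c in labels for labels in with_r) / len(with_r) if with_r else 0.0
            bias[r][c] = math.log(p + eps)
    return bias

def biased_attention(scores, bias_row):
    """Softmax over common codes of (raw attention score + bias) for one rare code."""
    logits = {c: scores[c] + bias_row[c] for c in scores}
    peak = max(logits.values())
    exps = {c: math.exp(v - peak) for c, v in logits.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}

# Toy label sets: R1 always appears with C1 and half of the time with C2.
records = [{"C1", "C2", "R1"}, {"C1", "R1"}, {"C2"}, {"C1"}]
bias = cooccurrence_bias(records, common=["C1", "C2"], rare=["R1"])
attn = biased_attention({"C1": 0.0, "C2": 0.0}, bias["R1"])
```

With equal raw scores, the rare code `R1` attends about twice as strongly to `C1` as to `C2`, matching the 1.0 versus 0.5 conditional probabilities in the toy data.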
[NLP-77] Music Flamingo: Scaling Music Understanding in Audio Language Models
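[Quick read]: This paper addresses the limitations of current audio-language models in music understanding, in particular with music's dynamic, layered, and information-dense nature, and the difficulty of scaling such models given the scarcity of high-quality music data and annotations. Prior models typically produce only short, high-level captions, struggle with deeper questions, and generalize poorly across musical cultures. The key to the solution is twofold. First, the authors curate MF-Skills, a large-scale dataset with rich annotations covering harmony, structure, timbre, lyrics, and cultural context. Second, they fine-tune an enhanced Audio Flamingo 3 backbone and add a post-training recipe: a cold start on MF-Think, a chain-of-thought dataset grounded in music theory, followed by reinforcement learning based on GRPO (Group Relative Policy Optimization) with custom rewards. This markedly improves music understanding and reasoning, moving from surface-level recognition toward layered, human-like perception.
Link: https://arxiv.org/abs/2511.10289
Authors: Sreyan Ghosh,Arushi Goel,Lasha Koroshinadze,Sang-gil Lee,Zhifeng Kong,Joao Felipe Santos,Ramani Duraiswami,Dinesh Manocha,Wei Ping,Mohammad Shoeybi,Bryan Catanzaro
Affiliations: NVIDIA; University of Maryland
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
Comments: Project Page: this https URL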
Abstract:We introduce Music Flamingo, a novel large audio-language model designed to advance music (including song) understanding in foundational audio models. While audio-language research has progressed rapidly, music remains challenging due to its dynamic, layered, and information-dense nature. Progress has been further limited by the difficulty of scaling open audio understanding models, primarily because of the scarcity of high-quality music data and annotations. As a result, prior models are restricted to producing short, high-level captions, answering only surface-level questions, and showing limited generalization across diverse musical cultures. To address these challenges, we curate MF-Skills, a large-scale dataset labeled through a multi-stage pipeline that yields rich captions and question-answer pairs covering harmony, structure, timbre, lyrics, and cultural context. We fine-tune an enhanced Audio Flamingo 3 backbone on MF-Skills and further strengthen multiple skills relevant to music understanding. To improve the model’s reasoning abilities, we introduce a post-training recipe: we first cold-start with MF-Think, a novel chain-of-thought dataset grounded in music theory, followed by GRPO-based reinforcement learning with custom rewards. Music Flamingo achieves state-of-the-art results across 10+ benchmarks for music understanding and reasoning, establishing itself as a generalist and musically intelligent audio-language model. Beyond strong empirical results, Music Flamingo sets a new standard for advanced music understanding by demonstrating how models can move from surface-level recognition toward layered, human-like perception of songs. We believe this work provides both a benchmark and a foundation for the community to build the next generation of models that engage with music as meaningfully as humans do.
Computer Vision
[CV-0] Enhancing the Outcome Reward-based RL Training of MLLMs with Self-Consistency Sampling NEURIPS2025
[Quick read]: This paper addresses the performance loss that "unfaithful trajectories" cause when multimodal large language models (MLLMs) are trained with outcome-reward reinforcement learning (RL). In multiple-choice tasks, a trajectory that guesses the correct option after a faulty chain of thought receives the same reward as genuine reasoning, which misleads policy optimization. The key to the solution is Self-Consistency Sampling (SCS): for each question it applies small visual perturbations to the input image and repeatedly truncates and resamples an initial trajectory. Agreement among the resulting trajectories yields a differentiable consistency score that down-weights unreliable trajectories during policy updates, making training more effective and robust.
Link: https://arxiv.org/abs/2511.10648
Authors: Jiahao Wang,Weiye Xu,Aijun Yang,Wengang Zhou,Lewei Lu,Houqiang Li,Xiaohua Wang,Jinguo Zhu
Affiliations: Xi’an Jiaotong University; University of Science and Technology of China; Shanghai Artificial Intelligence Laboratory; SenseTime Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to NeurIPS 2025 (The Thirty-Ninth Annual Conference on Neural Information Processing Systems)
Abstract:Outcome-reward reinforcement learning (RL) is a common and increasingly significant way to refine the step-by-step reasoning of multimodal large language models (MLLMs). In the multiple-choice setting - a dominant format for multimodal reasoning benchmarks - the paradigm faces a significant yet often overlooked obstacle: unfaithful trajectories that guess the correct option after a faulty chain of thought receive the same reward as genuine reasoning, which is a flaw that cannot be ignored. We propose Self-Consistency Sampling (SCS) to correct this issue. For each question, SCS (i) introduces small visual perturbations and (ii) performs repeated truncation and resampling of an initial trajectory; agreement among the resulting trajectories yields a differentiable consistency score that down-weights unreliable traces during policy updates. Based on Qwen2.5-VL-7B-Instruct, plugging SCS into RLOO, GRPO, and REINFORCE++ series improves accuracy by up to 7.7 percentage points on six multimodal benchmarks with negligible extra computation. SCS also yields notable gains on both Qwen2.5-VL-3B-Instruct and InternVL3-8B, offering a simple, general remedy for outcome-reward RL in MLLMs.
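The idea of down-weighting a lucky guess can be illustrated with a deliberately simplified sketch: measure how often resampled continuations reach the same final answer as the original trajectory, and scale the outcome reward by that agreement rate. This is an assumption-laden simplification. The paper describes a differentiable consistency score whose exact form is not given in the abstract, and the function names and the multiplicative weighting below are hypothetical.

```python
def consistency_score(original_answer, resampled_answers):
    """Fraction of resampled trajectories that agree with the original answer."""
    if not resampled_answers:
        return 0.0
    agree = sum(answer == original_answer for answer in resampled_answers)
    return agree / len(resampled_answers)

def weighted_reward(outcome_reward, original_answer, resampled_answers):
    """Scale the outcome reward so inconsistent (possibly guessed) answers count less."""
    return outcome_reward * consistency_score(original_answer, resampled_answers)

# Answers obtained after perturbing the image and resampling from truncated
# prefixes of the initial trajectory (hypothetical values).
stable = weighted_reward(1.0, "B", ["B", "B", "C", "B"])
lucky_guess = weighted_reward(1.0, "B", ["A", "C", "D", "A"])
```

Both trajectories picked the correct option, but the stable one keeps most of its reward while the lucky guess is weighted down to zero.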
[CV-1] Depth Anything 3: Recovering the Visual Space from Any Views
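[Quick read]: This paper addresses the prediction of spatially consistent geometry from an arbitrary number of visual inputs, with or without known camera poses. The central challenge is to reach high accuracy and strong generalization without relying on complex architectures or multi-task learning. The solution rests on two findings: a single plain transformer (e.g., a vanilla DINO encoder) is sufficient as the backbone, with no architectural specialization, and a single depth-ray prediction target removes the need for complex multi-task learning. Trained with a teacher-student paradigm, the model sets a new state of the art on the authors' visual geometry benchmark, surpassing the prior SOTA VGGT, and outperforms its predecessor DA2 in monocular depth estimation.
Link: https://arxiv.org/abs/2511.10647
Authors: Haotong Lin,Sili Chen,Junhao Liew,Donny Y. Chen,Zhenyu Li,Guang Shi,Jiashi Feng,Bingyi Kang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL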
Abstract:We present Depth Anything 3 (DA3), a model that predicts spatially consistent geometry from an arbitrary number of visual inputs, with or without known camera poses. In pursuit of minimal modeling, DA3 yields two key insights: a single plain transformer (e.g., vanilla DINO encoder) is sufficient as a backbone without architectural specialization, and a singular depth-ray prediction target obviates the need for complex multi-task learning. Through our teacher-student training paradigm, the model achieves a level of detail and generalization on par with Depth Anything 2 (DA2). We establish a new visual geometry benchmark covering camera pose estimation, any-view geometry and visual rendering. On this benchmark, DA3 sets a new state-of-the-art across all tasks, surpassing prior SOTA VGGT by an average of 44.3% in camera pose accuracy and 25.1% in geometric accuracy. Moreover, it outperforms DA2 in monocular depth estimation. All models are trained exclusively on public academic datasets.
[CV-2] One Small Step in Latent One Giant Leap for Pixels: Fast Latent Upscale Adapter for Your Diffusion Models
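Quick read: This paper targets the efficiency and quality bottleneck of high-resolution image generation with diffusion models: sampling directly at high resolution is computationally expensive, while post-hoc image super-resolution (ISR) introduces artifacts and adds latency. The key is the Latent Upscaler Adapter (LUA), a lightweight module that performs super-resolution directly in the generator's latent space. It requires no changes to the base model and no extra diffusion stages, so a single feed-forward pass yields high-resolution output. LUA uses a shared Swin-style backbone with scale-specific pixel-shuffle heads, supports 2x and 4x upscaling, and generalizes well across the latent spaces of different VAEs. It stays close to the quality of native high-resolution generation while cutting decoding and upscaling time by nearly 3x relative to pixel-space SR with the same architecture.
Link: https://arxiv.org/abs/2511.10629
Authors: Aleksandr Razin,Danil Kazantsev,Ilya Makarov
Affiliations: Saint Petersburg State Polytechnical University; National Research Technological University MISiS; Higher School of Economics
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: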
Abstract:Diffusion models struggle to scale beyond their training resolutions, as direct high-resolution sampling is slow and costly, while post-hoc image super-resolution (ISR) introduces artifacts and additional latency by operating after decoding. We present the Latent Upscaler Adapter (LUA), a lightweight module that performs super-resolution directly on the generator’s latent code before the final VAE decoding step. LUA integrates as a drop-in component, requiring no modifications to the base model or additional diffusion stages, and enables high-resolution synthesis through a single feed-forward pass in latent space. A shared Swin-style backbone with scale-specific pixel-shuffle heads supports 2x and 4x factors and remains compatible with image-space SR baselines, achieving comparable perceptual quality with nearly 3x lower decoding and upscaling time (adding only +0.42 s for 1024 px generation from 512 px, compared to 1.87 s for pixel-space SR using the same SwinIR architecture). Furthermore, LUA shows strong generalization across the latent spaces of different VAEs, making it easy to deploy without retraining from scratch for each new decoder. Extensive experiments demonstrate that LUA closely matches the fidelity of native high-resolution generation while offering a practical and efficient path to scalable, high-fidelity image synthesis in modern diffusion pipelines.
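The pixel-shuffle operation used by the scale-specific heads can be sketched in plain Python. This is an illustration of the generic rearrangement only (channel ordering as commonly used for sub-pixel convolution), not LUA's actual head; `pixel_shuffle` and the toy `latent` values are hypothetical.

```python
from typing import List

Tensor3D = List[List[List[float]]]  # indexed as [channel][row][col]


def pixel_shuffle(x: Tensor3D, r: int) -> Tensor3D:
    """Rearrange a (C*r*r, H, W) tensor into (C, H*r, W*r)."""
    channels_in = len(x)
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out, h, w = channels_in // (r * r), len(x[0]), len(x[0][0])
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):          # row offset inside each r x r block
            for j in range(r):      # column offset inside each r x r block
                src = x[c * r * r + i * r + j]
                for row in range(h):
                    for col in range(w):
                        out[c][row * r + i][col * r + j] = src[row][col]
    return out


# Four 1x1 channels become one 2x2 channel (a 2x upscale).
latent = [[[1.0]], [[2.0]], [[3.0]], [[4.0]]]
print(pixel_shuffle(latent, 2))
```

Trading channels for spatial resolution in this way lets a head upscale without any interpolation step, which is one reason the operation is popular in super-resolution networks.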
[CV-3] Querying Labeled Time Series Data with Scenario Programs
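Quick read: This paper asks whether autonomous vehicle (AV) failure scenarios found in simulation can be reproduced in the real world, that is, the sim-to-real gap. Because simulated and real sensor data differ, a failure scenario may be an artifact of synthetic data rather than a real problem, so its presence in real data needs to be verified. The key contribution is a formal definition of when labeled time series sensor data matches an abstract scenario expressed in the Scenic probabilistic programming language, together with a query algorithm that efficiently identifies the subset of a real dataset matching a given scenario program. Compared with state-of-the-art commercial vision large language models, the method is markedly more accurate and faster, and it scales with the duration of the time series.
Link: https://arxiv.org/abs/2511.10627
Authors: Edward Kim,Devan Shanker,Varun Bharadwaj,Hongbeen Park,Jinkyu Kim,Hazem Torfah,Daniel J Fremont,Sanjit A Seshia
Affiliations: University of California, Berkeley; Stanford University; University of Washington
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG)
Comments: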
Abstract:Simulation-based testing has become a crucial complement to road testing for ensuring the safety of cyber physical systems (CPS). As a result, significant research efforts have been directed toward identifying failure scenarios within simulation environments. However, a critical question remains. Are the AV failure scenarios discovered in simulation reproducible on actual systems in the real world? The sim-to-real gap caused by differences between simulated and real sensor data means that failure scenarios identified in simulation might either be artifacts of synthetic sensor data or actual issues that also occur with real sensor data. To address this, an effective approach to validating simulated failure scenarios is to locate occurrences of these scenarios within real-world datasets and verify whether the failure persists on the datasets. To this end, we introduce a formal definition of how labeled time series sensor data can match an abstract scenario, represented as a scenario program using the Scenic probabilistic programming language. We present a querying algorithm that, given a scenario program and a labeled dataset, identifies the subset of data that matches the specified scenario. Our experiment shows that our algorithm is more accurate and orders of magnitude faster in querying scenarios than the state-of-the-art commercial vision large language models, and can scale with the duration of queried time series data.
[CV-4] Multitask GLocal OBIA-Mamba for Sentinel-2 Landcover Mapping
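Quick read: This paper tackles land use and land cover (LULC) classification from Sentinel-2 remote sensing imagery, where the main challenges are spatial heterogeneity, missing context information, and ambiguous class signatures. The key is a multitask global-local (GLocal) OBIA-Mamba architecture (MSOM): superpixels serve as Mamba tokens to reduce redundant computation while preserving fine-grained detail; a dual-branch CNN-Mamba structure jointly models local spatial detail and global context; and a multitask optimization framework uses dual loss functions to balance local precision with global consistency. On Sentinel-2 imagery of Alberta, the method achieves better classification performance than existing advanced approaches.
Link: https://arxiv.org/abs/2511.10604
Authors: Zack Dewis,Yimin Zhu,Zhengsen Xu,Mabel Heffring,Saeid Taleghanidoozdoozan,Kaylee Xiao,Motasem Alkayid,Lincoln Linlin Xu
Affiliations: University of Calgary; University of Jordan
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: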
Abstract:Although Sentinel-2 based land use and land cover (LULC) classification is critical for various environmental monitoring applications, it is a very difficult task due to some key data challenges (e.g., spatial heterogeneity, context information, signature ambiguity). This paper presents a novel Multitask Glocal OBIA-Mamba (MSOM) for enhanced Sentinel-2 classification with the following contributions. First, an object-based image analysis (OBIA) Mamba model (OBIA-Mamba) is designed to reduce redundant computation without compromising fine-grained details by using superpixels as Mamba tokens. Second, a global-local (GLocal) dual-branch convolutional neural network (CNN)-Mamba architecture is designed to jointly model local spatial detail and global contextual information. Third, a multitask optimization framework is designed to employ dual loss functions to balance local precision with global consistency. The proposed approach is tested on Sentinel-2 imagery in Alberta, Canada, in comparison with several advanced classification approaches, and the results demonstrate that the proposed approach achieves higher classification accuracy and finer details than the other state-of-the-art methods.
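Using superpixels as tokens can be pictured as pooling per-pixel features by superpixel label, so that the sequence model sees one token per region instead of one per pixel. The sketch below assumes simple mean pooling over a precomputed label map; the abstract does not say how MSOM builds its tokens, and `superpixel_tokens` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, List


def superpixel_tokens(features: List[List[float]], labels: List[int]) -> Dict[int, List[float]]:
    """Average the per-pixel feature vectors inside each superpixel label."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = defaultdict(int)
    for feat, lab in zip(features, labels):
        if lab not in sums:
            sums[lab] = [0.0] * len(feat)
        for k, value in enumerate(feat):
            sums[lab][k] += value
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in vec] for lab, vec in sums.items()}


# Four pixels with 2-band features collapse into two region tokens.
pixels = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
labels = [0, 0, 1, 1]
print(superpixel_tokens(pixels, labels))
```

The token count now depends on the number of superpixels rather than the image size, which is where the reduction in redundant computation would come from.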
[CV-5] From 2D to 3D Without Extra Baggage: Data-Efficient Cancer Detection in Digital Breast Tomosynthesis
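Quick read: This paper addresses the limited performance of deep learning models for Digital Breast Tomosynthesis (DBT) caused by scarce annotated data. Existing methods either lose 3D information by flattening DBT volumes into 2D projections or processing them slice by slice, or adopt complex 3D architectures that need more training data. The key is the MM-3D architecture, which enables learnable 3D reasoning with the same parameter count as its 2D full-field digital mammography (FFDM) counterpart: it constructs malignancy-guided 3D features and repeatedly mixes them with slice-level information without adding parameters, which preserves volumetric information and allows direct weight transfer from the FFDM model. Experiments show that MM-3D clearly outperforms complex 3D methods in the low-data regime, matches them in the high-data regime, and improves classification and localization on the BCS-DBT benchmark by 4% and 10%, respectively.
Link: https://arxiv.org/abs/2511.10597
Authors: Yen Nhi Truong Vu,Dan Guo,Sripad Joshi,Harshit Kumar,Jason Su,Thomas Paul Matthews
Affiliations: Whiterabbit.ai
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: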
Abstract:Digital Breast Tomosynthesis (DBT) enhances finding visibility for breast cancer detection by providing volumetric information that reduces the impact of overlapping tissues; however, limited annotated data has constrained the development of deep learning models for DBT. To address data scarcity, existing methods attempt to reuse 2D full-field digital mammography (FFDM) models by either flattening DBT volumes or processing slices individually, thus discarding volumetric information. Alternatively, 3D reasoning approaches introduce complex architectures that require more DBT training data. Tackling these drawbacks, we propose MM-3D, an architecture that enables learnable 3D reasoning while remaining parameter-free relative to its FFDM counterpart, MM. MM-3D constructs malignancy-guided 3D features, and 3D reasoning is learned through repeatedly mixing these 3D features with slice-level information. This is achieved by modifying operations in MM without adding parameters, thus enabling direct weight transfer from FFDM. Extensive experiments show that MM-3D surpasses 2D projection and 3D slice-based methods by 11-54% for localization and 3-10% for classification. Additionally, MM-3D outperforms complex 3D reasoning variants by 20-47% for localization and 2-10% for classification in the low-data regime, while matching their performance in high-data regime. On the popular BCS-DBT benchmark, MM-3D outperforms previous top baseline by 4% for classification and 10% for localization.
[CV-6] OmniVGGT: Omni-Modality Driven Visual Geometry Grounded
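Quick read: This paper addresses the fact that current general 3D foundation models rely mainly on RGB input and ignore available geometric cues (auxiliary geometric modalities such as camera intrinsics, poses, and depth maps). The key is the OmniVGGT framework. Its GeoAdapter module uses zero-initialized convolutions to progressively inject depth and camera intrinsics/extrinsics, giving stable optimization without disrupting the foundation model's representation space. A stochastic multimodal fusion strategy randomly samples a subset of modalities for each training instance, so that at test time the model accepts any number of geometric inputs and learns robust spatial representations rather than overfitting to particular auxiliary signals.
Link: https://arxiv.org/abs/2511.10560
Authors: Haosong Peng,Hao Li,Yalun Dai,Yushi Lan,Yihang Luo,Tianyu Qi,Zhengshen Zhang,Yufeng Zhan,Junfei Zhang,Wenchao Xu,Ziwei Liu
Affiliations: HKUST; NTU; SYSU; NUS; Alibaba Group
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL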
Abstract:General 3D foundation models have started to lead the trend of unifying diverse vision tasks, yet most assume RGB-only inputs and ignore readily available geometric cues (e.g., camera intrinsics, poses, and depth maps). To address this issue, we introduce OmniVGGT, a novel framework that can effectively benefit from an arbitrary number of auxiliary geometric modalities during both training and inference. In our framework, a GeoAdapter is proposed to encode depth and camera intrinsics/extrinsics into a spatial foundation model. It employs zero-initialized convolutions to progressively inject geometric information without disrupting the foundation model’s representation space. This design ensures stable optimization with negligible overhead, maintaining inference speed comparable to VGGT even with multiple additional inputs. Additionally, a stochastic multimodal fusion regimen is proposed, which randomly samples modality subsets per instance during training. This enables an arbitrary number of modality inputs during testing and promotes learning robust spatial representations instead of overfitting to auxiliary cues. Comprehensive experiments on monocular/multi-view depth estimation, multi-view stereo, and camera pose estimation demonstrate that OmniVGGT outperforms prior methods with auxiliary inputs and achieves state-of-the-art results even with RGB-only input. To further highlight its practical utility, we integrated OmniVGGT into vision-language-action (VLA) models. The enhanced VLA model by OmniVGGT not only outperforms the vanilla point-cloud-based baseline on mainstream benchmarks, but also effectively leverages accessible auxiliary inputs to achieve consistent gains on robotic tasks.
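Two ideas from this abstract lend themselves to a toy sketch: sampling a random subset of auxiliary modalities per training instance, and injecting auxiliary features through a zero-initialized path so the base features are untouched at the start of training. Everything here is a simplification under stated assumptions: the keep probability of 0.5 is assumed, the zero-initialized convolution is reduced to a single scalar weight, and `sample_modalities` and `inject` are hypothetical names.

```python
import random
from typing import List, Sequence


def sample_modalities(available: Sequence[str], rng: random.Random) -> List[str]:
    """Keep each auxiliary modality independently with probability 0.5 (assumed)."""
    return [m for m in available if rng.random() < 0.5]


def inject(base: List[float], aux: List[float], weight: float = 0.0) -> List[float]:
    """Add a weighted auxiliary feature to the base feature.
    weight=0.0 stands in for a zero-initialized layer: the output equals the base."""
    return [b + weight * a for b, a in zip(base, aux)]


rng = random.Random(0)
for step in range(3):
    # Each training instance sees a different (possibly empty) subset.
    print(step, sample_modalities(["depth", "intrinsics", "extrinsics"], rng))

base = [0.2, -1.0, 0.7]
print(inject(base, [5.0, 5.0, 5.0]))  # identical to base while the weight is zero
```

Because the empty subset can be drawn, a model trained this way also sees RGB-only instances, which matches the claim that it works with any number of auxiliary inputs at test time.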
[CV-7] A Style is Worth One Code: Unlocking Code-to-Style Image Generation with Discrete Style Space
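Quick read: This paper addresses the lack of visual style consistency and creativity in generative AI, namely how to generate novel, stable visual styles from minimal input. Existing methods typically rely on lengthy text prompts, reference images, or parameter-efficient fine-tuning, yet struggle to keep styles consistent and are limited by complex style representations. The key is a new task, code-to-style image generation, and CoTyle, the first open-source method for it. A discrete style codebook is first trained on a collection of images to extract style embeddings; an autoregressive style generator then models the distribution of these embeddings so that new style embeddings can be synthesized; at inference, a numerical style code is mapped to a unique style embedding that drives a text-to-image diffusion model (T2I-DM) to generate images in that style. A single numerical code thus controls diverse, reproducible visual styles, making style generation simpler and more controllable.
Link: https://arxiv.org/abs/2511.10555
Authors: Huijie Liu,Shuhao Cui,Haoxiang Cao,Shuai Ma,Kai Wu,Guoliang Kang
Affiliations: Beihang University; Kolors Team, Kuaishou Technology; South China Normal University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 16 pages, 13 figures, 5 tables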
Abstract:Innovative visual stylization is a cornerstone of artistic creation, yet generating novel and consistent visual styles remains a significant challenge. Existing generative approaches typically rely on lengthy textual prompts, reference images, or parameter-efficient fine-tuning to guide style-aware image generation, but often struggle with style consistency, limited creativity, and complex style representations. In this paper, we affirm that a style is worth one numerical code by introducing the novel task, code-to-style image generation, which produces images with novel, consistent visual styles conditioned solely on a numerical style code. To date, this field has only been primarily explored by the industry (e.g., Midjourney), with no open-source research from the academic community. To fill this gap, we propose CoTyle, the first open-source method for this task. Specifically, we first train a discrete style codebook from a collection of images to extract style embeddings. These embeddings serve as conditions for a text-to-image diffusion model (T2I-DM) to generate stylistic images. Subsequently, we train an autoregressive style generator on the discrete style embeddings to model their distribution, allowing the synthesis of novel style embeddings. During inference, a numerical style code is mapped to a unique style embedding by the style generator, and this embedding guides the T2I-DM to generate images in the corresponding style. Unlike existing methods, our method offers unparalleled simplicity and diversity, unlocking a vast space of reproducible styles from minimal input. Extensive experiments validate that CoTyle effectively turns a numerical code into a style controller, demonstrating a style is worth one code.
[CV-8] Benchmarking Diversity in Image Generation via Attribute-Conditional Human Evaluation
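Quick read: This paper addresses the limited diversity of current text-to-image (T2I) generation models, which tend to produce homogeneous outputs and under-respond to the semantic variation latent in a prompt. The key is a systematic evaluation framework that assesses diversity in a fine-grained way by evaluating individual concepts and their factors of variation. It comprises (1) a new human evaluation template for capturing nuances of diversity; (2) a prompt set covering a range of concepts with explicit factors of variation (such as color or pose); and (3) a comparison of models on human annotations via binomial tests, alongside a rigorous analysis of several image embeddings for measuring diversity. The approach ranks T2I models by diversity and identifies the categories where they are weak, giving a reliable basis for improving models and developing metrics.
Link: https://arxiv.org/abs/2511.10547
Authors: Isabela Albuquerque,Ira Ktena,Olivia Wiles,Ivana Kajić,Amal Rannen-Triki,Cristina Vasconcelos,Aida Nematzadeh
Affiliations: Ellison Institute of Technology; Google DeepMind
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: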
Abstract:Despite advances in generation quality, current text-to-image (T2I) models often lack diversity, generating homogeneous outputs. This work introduces a framework to address the need for robust diversity evaluation in T2I models. Our framework systematically assesses diversity by evaluating individual concepts and their relevant factors of variation. Key contributions include: (1) a novel human evaluation template for nuanced diversity assessment; (2) a curated prompt set covering diverse concepts with their identified factors of variation (e.g. prompt: An image of an apple, factor of variation: color); and (3) a methodology for comparing models in terms of human annotations via binomial tests. Furthermore, we rigorously compare various image embeddings for diversity measurement. Notably, our principled approach enables ranking of T2I models by diversity, identifying categories where they particularly struggle. This research offers a robust methodology and insights, paving the way for improvements in T2I model diversity and metric development.
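A binomial test on human annotations can be sketched as follows. The sketch assumes each annotation is a pairwise judgment of the form "model A's image set is more diverse than model B's" and tests the null hypothesis that either model is preferred with probability 0.5; the abstract does not spell out the exact protocol, and `binomial_two_sided_p` and the vote counts are hypothetical.

```python
from math import comb


def binomial_two_sided_p(wins: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value: total probability of all outcomes
    that are no more likely than the observed number of wins."""
    if not 0 <= wins <= n:
        raise ValueError("wins must lie between 0 and n")
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    observed = pmf[wins]
    # Small relative tolerance guards against floating-point ties.
    return min(1.0, sum(q for q in pmf if q <= observed * (1 + 1e-9)))


# Hypothetical counts: annotators preferred model A in 10 of 10 comparisons,
# versus an even 5 of 10 split.
print(binomial_two_sided_p(10, 10))
print(binomial_two_sided_p(5, 10))
```

A small p-value suggests the preference is unlikely to be a coin flip, which is what allows models to be ranked by diversity with some statistical backing.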
[CV-9] Dynamic Avatar-Scene Rendering from Human-centric Context
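Quick read: This paper addresses reconstructing dynamic humans interacting with real environments from monocular video. Existing methods either model the dynamic scene holistically, ignoring the distinct motion characteristics of the human versus other scene components (such as the background) and producing incomplete reconstructions, or model the scene and the background separately while ignoring the information exchange between them, which causes spatial inconsistencies and visual artifacts at human-scene boundaries. The key is the Separate-then-Map (StM) strategy, which introduces a dedicated information mapping mechanism to link the separately modeled components: a shared transformation function is applied to each Gaussian attribute, avoiding exhaustive pairwise interaction computation while keeping the human and the surroundings spatially and visually coherent. This markedly improves reconstruction quality and rendering accuracy, especially at complex human-scene interaction boundaries.
Link: https://arxiv.org/abs/2511.10539
Authors: Wenqing Wang,Haosen Yang,Josef Kittler,Xiatian Zhu
Affiliations: Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey; Centre for Vision, Speech and Signal Processing (CVSSP), People-Centred Artificial Intelligence (PAI), University of Surrey
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 8 figures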
Abstract:Reconstructing dynamic humans interacting with real-world environments from monocular videos is an important and challenging task. Despite considerable progress in 4D neural rendering, existing approaches either model dynamic scenes holistically or model scenes and backgrounds separately, aiming to introduce parametric human priors. However, these approaches either neglect the distinct motion characteristics of the various components in the scene, especially the human, leading to incomplete reconstructions, or ignore the information exchange between the separately modeled components, resulting in spatial inconsistencies and visual artifacts at human-scene boundaries. To address this, we propose the Separate-then-Map (StM) strategy, which introduces a dedicated information mapping mechanism to bridge separately defined and optimized models. Our method employs a shared transformation function for each Gaussian attribute to unify separately modeled components, enhancing computational efficiency by avoiding exhaustive pairwise interactions while ensuring spatial and visual coherence between humans and their surroundings. Extensive experiments on monocular video datasets demonstrate that StM significantly outperforms existing state-of-the-art methods in both visual quality and rendering accuracy, particularly at challenging human-scene interaction boundaries.
[CV-10] SemanticVLA: Semantic-Aligned Sparsification and Enhancement for Efficient Robotic Manipulation AAAI2026
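Quick read: This paper addresses two key problems in deploying Vision-Language-Action (VLA) models for robotic manipulation: perceptual redundancy, where irrelevant visual input is processed inefficiently, and superficial instruction-vision alignment, which weakens the semantic grounding of actions. The core of the solution is the SemanticVLA framework with three modules: 1) the Semantic-guided Dual Visual Pruner (SD-Pruner), in which an Instruction-driven Pruner (ID-Pruner) extracts global action cues and local semantic anchors and a Spatial-aggregation Pruner (SA-Pruner) compacts geometry-rich features into task-adaptive tokens; 2) the Semantic-complementary Hierarchical Fuser (SH-Fuser), which fuses dense patches and sparse tokens across SigLIP and DINOv2 into a coherent representation of semantics and spatial geometry; 3) the Semantic-conditioned Action Coupler (SA-Coupler), which replaces the conventional observation-to-DoF mapping for more efficient and interpretable behavior modeling. Experiments show a new SOTA in both performance and efficiency: on the LIBERO benchmark the success rate exceeds OpenVLA by 21.1%, with a 3.0-fold reduction in training cost and lower inference latency.
Link: https://arxiv.org/abs/2511.10518
Authors: Wei Li,Renshan Zhang,Rui Shao,Zhijian Fang,Kaiwen Zhou,Zhuotao Tian,Liqiang Nie
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted to AAAI 2026 (Oral), Project Page: this https URL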
Abstract:Vision-Language-Action (VLA) models have advanced in robotic manipulation, yet practical deployment remains hindered by two key limitations: 1) perceptual redundancy, where irrelevant visual inputs are processed inefficiently, and 2) superficial instruction-vision alignment, which hampers semantic grounding of actions. In this paper, we propose SemanticVLA, a novel VLA framework that performs Semantic-Aligned Sparsification and Enhancement for Efficient Robotic Manipulation. Specifically: 1) To sparsify redundant perception while preserving semantic alignment, Semantic-guided Dual Visual Pruner (SD-Pruner) performs: Instruction-driven Pruner (ID-Pruner) extracts global action cues and local semantic anchors in SigLIP; Spatial-aggregation Pruner (SA-Pruner) compacts geometry-rich features into task-adaptive tokens in DINOv2. 2) To exploit sparsified features and integrate semantics with spatial geometry, Semantic-complementary Hierarchical Fuser (SH-Fuser) fuses dense patches and sparse tokens across SigLIP and DINOv2 for coherent representation. 3) To enhance the transformation from perception to action, Semantic-conditioned Action Coupler (SA-Coupler) replaces the conventional observation-to-DoF approach, yielding more efficient and interpretable behavior modeling for manipulation tasks. Extensive experiments on simulation and real-world tasks show that SemanticVLA sets a new SOTA in both performance and efficiency. SemanticVLA surpasses OpenVLA on LIBERO benchmark by 21.1% in success rate, while reducing training cost and inference latency by 3.0-fold and this http URL is open-sourced and publicly available at this https URL
[CV-11] Learnable Total Variation with Lambda Mapping for Low-Dose CT Denoising
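Quick read: This paper addresses the strong dependence of classical Total Variation (TV) regularization on the parameter λ (lambda), which makes it inefficient and hard to tune in practice. The key is a Learnable Total Variation (LTV) framework that couples an unrolled TV solver with a data-driven Lambda Mapping Network (LambdaNet) predicting a per-pixel regularization weight map. The pipeline is trained end to end so that reconstruction quality and the regularization parameters are optimized jointly, giving spatially adaptive smoothing: strong denoising in homogeneous regions and more preserved detail near anatomical boundaries.
Link: https://arxiv.org/abs/2511.10500
Authors: Yusuf Talha Basak,Mehmet Ozan Unal,Metin Ertas,Isa Yildirim
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: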
Abstract:Although Total Variation (TV) performs well in noise reduction and edge preservation on images, its dependence on the lambda parameter limits its efficiency and makes it difficult to use effectively. In this study, we present a Learnable Total Variation (LTV) framework that couples an unrolled TV solver with a data-driven Lambda Mapping Network (LambdaNet) predicting a per-pixel regularization map. The pipeline is trained end-to-end so that reconstruction and regularization are optimized jointly, yielding spatially adaptive smoothing: strong in homogeneous regions, relaxed near anatomical boundaries. Experiments on the DeepLesion dataset, using a realistic noise model adapted from the LoDoPaB-CT methodology, show consistent gains over classical TV and FBP+U-Net: +2.9 dB PSNR and +6% SSIM on average. LTV provides an interpretable alternative to black-box CNNs and a basis for 3D and data-consistency-driven reconstruction. Our codes are available at: this https URL
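The effect of a per-pixel regularization map can be seen on a 1D toy signal. The sketch below assumes a standard least-squares data term plus an anisotropic TV term with one weight per neighboring pair; the paper's exact objective and solver are not given in the abstract, and `weighted_tv_objective` is a hypothetical name.

```python
from typing import Sequence


def weighted_tv_objective(x: Sequence[float], y: Sequence[float], lam: Sequence[float]) -> float:
    """0.5 * sum_i (x_i - y_i)^2 + sum_i lam_i * |x_{i+1} - x_i| for 1D signals.
    lam holds one weight per neighboring pair, so len(lam) == len(x) - 1."""
    if not (len(x) == len(y) == len(lam) + 1):
        raise ValueError("need len(x) == len(y) == len(lam) + 1")
    fidelity = 0.5 * sum((a - b) ** 2 for a, b in zip(x, y))
    tv = sum(l * abs(x[i + 1] - x[i]) for i, l in enumerate(lam))
    return fidelity + tv


edge = [0.0, 0.0, 1.0, 1.0]          # a clean step edge
uniform = [1.0, 1.0, 1.0]            # single global lambda
adaptive = [1.0, 0.1, 1.0]           # relaxed weight at the edge location
print(weighted_tv_objective(edge, edge, uniform))
print(weighted_tv_objective(edge, edge, adaptive))
```

With a uniform weight, keeping the edge costs as much as any other jump, so a solver is pushed to blur it; lowering the weight only at the edge makes preserving it cheap while flat regions stay strongly smoothed.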
[CV-12] SPOT: Sparsification with Attention Dynamics via Token Relevance in Vision Transformers
Quick read: This paper addresses the high computational cost of Vision Transformers (ViT), whose attention cost grows quadratically with the number of tokens. The key is the SPOT (SParsification with attentiOn dynamics via Token relevance) framework, which infers token importance from token embeddings, interactions, and attention dynamics across layers, enabling early detection and removal of redundant tokens. Its lightweight predictors plug into various ViT architectures and learn input-specific token prioritization, improving computational efficiency by up to 40% while maintaining or even improving accuracy.
Link: https://arxiv.org/abs/2511.10488
Authors: Oded Schlesinger,Amirhossein Farzam,J. Matias Di Martino,Guillermo Sapiro
Affiliations: Duke University; Princeton University; Universidad Católica del Uruguay; Apple
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Project repository: this https URL
Abstract:While Vision Transformers (ViT) have demonstrated remarkable performance across diverse tasks, their computational demands are substantial, scaling quadratically with the number of processed tokens. Compact attention representations, reflecting token interaction distributions, can guide early detection and reduction of less salient tokens prior to attention computation. Motivated by this, we present SParsification with attentiOn dynamics via Token relevance (SPOT), a framework for early detection of redundant tokens within ViTs that leverages token embeddings, interactions, and attention dynamics across layers to infer token importance, resulting in a more context-aware and interpretable relevance detection process. SPOT informs token sparsification and facilitates the elimination of such tokens, improving computational efficiency without sacrificing performance. SPOT employs computationally lightweight predictors that can be plugged into various ViT architectures and learn to derive effective input-specific token prioritization across layers. Its versatile design supports a range of performance levels adaptable to varying resource constraints. Empirical evaluations demonstrate significant efficiency gains of up to 40% compared to standard ViTs, while maintaining or even improving accuracy. Code and models are available at this https URL .
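Token sparsification ultimately comes down to keeping a subset of tokens according to a relevance score. The sketch below assumes the scores are already available (in SPOT they come from learned predictors, which are not reproduced here) and simply keeps the top fraction; `keep_top_tokens` and the score values are hypothetical.

```python
from typing import List, Sequence


def keep_top_tokens(scores: Sequence[float], keep_ratio: float) -> List[int]:
    """Return indices of the highest-scoring tokens, in their original order."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    k = max(1, int(round(len(scores) * keep_ratio)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


scores = [0.9, 0.1, 0.5, 0.05, 0.7]   # hypothetical per-token relevance
kept = keep_top_tokens(scores, keep_ratio=0.6)
print(kept)
# Attention compares every kept token with every other kept token,
# so the pair count shrinks quadratically with the number of tokens.
print(len(kept) ** 2, "of", len(scores) ** 2, "token pairs remain")
```

Keeping the original token order matters in practice because positional information is tied to each token's index.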
[CV-13] Utility of Pancreas Surface Lobularity as a CT Biomarker for Opportunistic Screening of Type 2 Diabetes
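Quick read: This paper addresses the difficulty of early screening for Type 2 Diabetes Mellitus (T2DM), in particular whether pancreatic surface lobularity (PSL) can serve as a CT imaging biomarker for early identification. The key is a fully automated deep learning approach: four deep learning-based segmentation models delineate the pancreas and other organs on abdominal CT from 584 patients, CT imaging features including PSL are extracted, and a multivariate prediction model screens for T2DM. PSL was significantly higher in diabetic patients (p=0.01), and the CT-biomarker model reached 0.90 AUC, 66.7% sensitivity, and 91.9% specificity, suggesting that PSL has good potential for early diagnosis.
Link: https://arxiv.org/abs/2511.10484
Authors: Tejas Sudharshan Mathai,Anisa V. Prasad,Xinya Wang,Praveen T.S. Balamuralikrishna,Yan Zhuang,Abhinav Suri,Jianfei Liu,Perry J. Pickhardt,Ronald M. Summers
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Submitted to IEEE ISBI 2026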
Abstract:Type 2 Diabetes Mellitus (T2DM) is a chronic metabolic disease that affects millions of people worldwide. Early detection is crucial, as the disease can alter pancreas function through morphological changes and increased deposition of ectopic fat, eventually leading to organ damage. While studies have shown an association between T2DM and pancreas volume and fat content, the role of increased pancreatic surface lobularity (PSL) in patients with T2DM has not been fully investigated. In this pilot work, we propose a fully automated approach to delineate the pancreas and other abdominal structures, derive CT imaging biomarkers, and opportunistically screen for T2DM. Four deep learning-based models were used to segment the pancreas in an internal dataset of 584 patients (297 males, 437 non-diabetic, age: 45 ± 15 years). PSL was automatically detected and it was higher for diabetic patients (p=0.01) at 4.26 ± 8.32 compared to 3.19 ± 3.62 for non-diabetic patients. The PancAP model achieved the highest Dice score of 0.79 ± 0.17 and lowest ASSD error of 1.94 ± 2.63 mm (p < 0.05). For predicting T2DM, a multivariate model trained with CT biomarkers attained 0.90 AUC, 66.7% sensitivity, and 91.9% specificity. Our results suggest that PSL is useful for T2DM screening and could potentially help predict the early onset of T2DM.
[CV-14] Intrinsic Dimensionality as a Model-Free Measure of Class Imbalance
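Quick read: This paper addresses class imbalance in classification. Conventional approaches measure imbalance only by the number of examples per class (cardinality), ignoring redundant examples and differences in how hard each class is to learn. The authors propose data Intrinsic Dimensionality (ID) as an easy-to-compute imbalance measure that needs no trained model and better reflects class difficulty and information density. The key is that ID can be plugged directly into existing imbalance mitigation methods; experiments on multiple datasets show that it outperforms cardinality-based re-weighting and re-sampling, and that combining ID with cardinality improves performance further.
Link: https://arxiv.org/abs/2511.10475
Authors: Çağrı Eser,Zeynep Sonat Baltacı,Emre Akbaş,Sinan Kalkan
Affiliations: unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 45 pages, 11 figures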
Abstract:Imbalance in classification tasks is commonly quantified by the cardinalities of examples across classes. This, however, disregards the presence of redundant examples and inherent differences in the learning difficulties of classes. Alternatively, one can use complex measures such as training loss and uncertainty, which, however, depend on training a machine learning model. Our paper proposes using data Intrinsic Dimensionality (ID) as an easy-to-compute, model-free measure of imbalance that can be seamlessly incorporated into various imbalance mitigation methods. Our results across five different datasets with a diverse range of imbalance ratios show that ID consistently outperforms cardinality-based re-weighting and re-sampling techniques used in the literature. Moreover, we show that combining ID with cardinality can further improve performance. Code: this https URL.
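To make the idea of plugging ID into re-weighting concrete, the sketch below contrasts the familiar inverse-frequency weights with weights driven by per-class ID. It is only an illustration: the abstract does not state how ID enters the loss, so the proportional rule, the function names, and the example ID values are all assumptions, and the ID estimates themselves are taken as given rather than computed.

```python
from typing import Dict


def normalize_to_mean_one(raw: Dict[str, float]) -> Dict[str, float]:
    """Rescale weights so that their average is 1."""
    mean = sum(raw.values()) / len(raw)
    return {c: v / mean for c, v in raw.items()}


def cardinality_weights(counts: Dict[str, int]) -> Dict[str, float]:
    """Classic inverse-frequency class weights."""
    return normalize_to_mean_one({c: 1.0 / n for c, n in counts.items()})


def id_weights(intrinsic_dims: Dict[str, float]) -> Dict[str, float]:
    """Weights proportional to each class's intrinsic dimensionality
    (assumed rule: higher ID is treated as a harder class)."""
    return normalize_to_mean_one(dict(intrinsic_dims))


counts = {"cat": 1000, "dog": 1000}     # balanced by cardinality
ids = {"cat": 10.0, "dog": 30.0}        # hypothetical per-class ID estimates
print(cardinality_weights(counts))      # both classes weighted equally
print(id_weights(ids))                  # the higher-ID class gets more weight
```

The example shows the case the abstract highlights: two classes with identical counts look balanced to a cardinality-based scheme even though one may be considerably harder to learn.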
[CV-15] OpenSR-SRGAN: A Flexible Super-Resolution Framework for Multispectral Earth Observation Data
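Quick read: This paper addresses the high entry barrier and the difficulty of configuring and extending models for single-image super-resolution (SISR) in Earth Observation. Conventional workflows require users to edit code to change the generator, discriminator, loss functions, and training strategy, which hurts flexibility and reproducibility. The key is OpenSR-SRGAN, an open-source, modular framework that exposes the core components of SRGAN-style models through configuration files, so that users can switch architectures, scale factors, and multispectral band setups without touching code. It ships with preset configurations for remote sensing scenarios, default adversarial training settings, and built-in hooks for logging, validation, and large-scene inference, which lowers the barrier to GAN-based super-resolution and makes model comparison and deployment more standardized and efficient.
Link: https://arxiv.org/abs/2511.10461
Authors: Simon Donike,Cesar Aybar,Julio Contreras,Luis Gómez-Chova
Affiliations: University of Valencia
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: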
Abstract:We present OpenSR-SRGAN, an open and modular framework for single-image super-resolution in Earth Observation. The software provides a unified implementation of SRGAN-style models that is easy to configure, extend, and apply to multispectral satellite data such as Sentinel-2. Instead of requiring users to modify model code, OpenSR-SRGAN exposes generators, discriminators, loss functions, and training schedules through concise configuration files, making it straightforward to switch between architectures, scale factors, and band setups. The framework is designed as a practical tool and benchmark implementation rather than a state-of-the-art model. It ships with ready-to-use configurations for common remote sensing scenarios, sensible default settings for adversarial training, and built-in hooks for logging, validation, and large-scene inference. By turning GAN-based super-resolution into a configuration-driven workflow, OpenSR-SRGAN lowers the entry barrier for researchers and practitioners who wish to experiment with SRGANs, compare models in a reproducible way, and deploy super-resolution pipelines across diverse Earth-observation datasets.
[CV-16] Histology-informed tiling of whole tissue sections improves the interpretability and predictability of cancer relapse and genetic alterations
Quick read: This paper addresses the redundant information and poor interpretability that result when conventional grid-based tiling in digital pathology ignores tissue architecture. The key is Histology-Informed Tiling (HIT), which uses semantic segmentation to extract biologically meaningful glands from whole slide images (WSIs) as input patches for multiple-instance learning (MIL), focusing on the relevant pathological structures and improving both accuracy and interpretability in copy number variation (CNV) detection and phenotyping.
Link: https://arxiv.org/abs/2511.10432
Authors: Willem Bonnaffé,Yang Hu,Andrea Chatrian,Mengran Fan,Stefano Malacrino,Sandy Figiel,CRUK ICGC Prostate Group,Srinivasa R. Rao,Richard Colling,Richard J. Bryant,Freddie C. Hamdy,Dan J. Woodcock,Ian G. Mills,Clare Verrill,Jens Rittscher
Affiliations: University of Oxford; University of Cambridge; University of Warwick; CRUK ICGC Prostate Group; University of Bristol
Categories: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM); Tissues and Organs (q-bio.TO)
Comments: 26 pages, 6 figures
Abstract:Histopathologists establish cancer grade by assessing histological structures, such as glands in prostate cancer. Yet, digital pathology pipelines often rely on grid-based tiling that ignores tissue architecture. This introduces irrelevant information and limits interpretability. We introduce histology-informed tiling (HIT), which uses semantic segmentation to extract glands from whole slide images (WSIs) as biologically meaningful input patches for multiple-instance learning (MIL) and phenotyping. Trained on 137 samples from the ProMPT cohort, HIT achieved a gland-level Dice score of 0.83 +/- 0.17. By extracting 380,000 glands from 760 WSIs across ICGC-C and TCGA-PRAD cohorts, HIT improved MIL models AUCs by 10% for detecting copy number variation (CNVs) in genes related to epithelial-mesenchymal transitions (EMT) and MYC, and revealed 15 gland clusters, several of which were associated with cancer relapse, oncogenic mutations, and high Gleason. Therefore, HIT improved the accuracy and interpretability of MIL predictions, while streamlining computations by focussing on biologically meaningful structures during feature extraction.
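The Dice score quoted for gland segmentation is the standard overlap measure between a predicted and a reference mask. A minimal sketch on flattened binary masks follows; returning 1.0 when both masks are empty is a convention chosen here, not something the abstract specifies, and the example masks are made up.

```python
from typing import Sequence


def dice_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Two empty masks agree perfectly (convention assumed in this sketch).
    return 1.0 if total == 0 else 2.0 * intersection / total


predicted = [1, 1, 1, 0, 0, 0]
reference = [0, 1, 1, 1, 0, 0]
print(dice_score(predicted, reference))
```

Here two of the three foreground pixels in each mask overlap, so the score is 2·2 / (3 + 3), or about 0.67.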
[CV-17] RodEpil: A Video Dataset of Laboratory Rodents for Seizure Detection and Benchmark Evaluation
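Quick read: This paper addresses non-invasive, video-based automatic detection of convulsive events in epilepsy research, with the aim of making the monitoring of preclinical epilepsy models more efficient and accurate. The key is a curated, annotated video dataset of laboratory rodents (RodEpil Dataset) with 10,101 normal-activity samples and 2,952 seizure samples, evaluated by five-fold cross-validation with strict subject-wise partitioning to prevent data leakage. A Transformer-based video classifier built on the TimeSformer architecture reaches an average F1-score of 97% without sensors or invasive devices, providing a reproducible baseline for non-invasive video monitoring in epilepsy research.
Link: https://arxiv.org/abs/2511.10431
Authors: Daniele Perlo,Vladimir Despotovic,Selma Boudissa,Sang-Yoon Kim,Petr Nazarov,Yanrong Zhang,Max Wintermark,Olivier Keunen
Affiliations: Luxembourg Institute of Health; Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: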
Abstract:We introduce a curated video dataset of laboratory rodents for automatic detection of convulsive events. The dataset contains short (10 s) top-down and side-view video clips of individual rodents, labeled at clip level as normal activity or seizure. It includes 10,101 negative samples and 2,952 positive samples collected from 19 subjects. We describe the data curation, annotation protocol and preprocessing pipeline, and report baseline experiments using a transformer-based video classifier (TimeSformer). Experiments employ five-fold cross-validation with strict subject-wise partitioning to prevent data leakage (no subject appears in more than one fold). Results show that the TimeSformer architecture enables discrimination between seizure and normal activity with an average F1-score of 97%. The dataset and baseline code are publicly released to support reproducible research on non-invasive, video-based monitoring in preclinical epilepsy research. RodEpil Dataset access - DOI: https://doi.org/10.5281/zenodo.17601357
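Subject-wise partitioning means that every clip from a given animal lands in the same fold, so no subject contributes to both training and validation. The sketch below shows one way to build such folds; the greedy balancing rule and the subject IDs are assumptions for illustration and are not taken from the released baseline code.

```python
from collections import defaultdict

def subject_wise_folds(subject_ids, n_folds=5):
    """Assign clip indices to folds so that all clips of a subject share one fold."""
    clips_by_subject = defaultdict(list)
    for idx, sid in enumerate(subject_ids):
        clips_by_subject[sid].append(idx)
    folds = [[] for _ in range(n_folds)]
    # Greedy balancing (assumed heuristic): largest subjects first, into the smallest fold.
    ordered = sorted(clips_by_subject, key=lambda s: (-len(clips_by_subject[s]), str(s)))
    for sid in ordered:
        smallest = min(range(n_folds), key=lambda k: len(folds[k]))
        folds[smallest].extend(clips_by_subject[sid])
    return folds

# Hypothetical subject IDs: one entry per clip.
subject_ids = ["r1"] * 4 + ["r2"] * 3 + ["r3"] * 3 + ["r4"] * 2 + ["r5"] * 2 + ["r6"] + ["r7"]
folds = subject_wise_folds(subject_ids, n_folds=5)
```

Each fold then serves once as the validation set while the remaining four are used for training.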
[CV-18] 3DFETUS: Standardizing Fetal Facial Planes in 3D Ultrasound
【速读】:该论文旨在解决胎儿超声(fetal ultrasound, US)检查中标准面部平面获取困难的问题,主要挑战包括胎儿运动、体位变异以及操作者经验差异,这些问题导致诊断一致性差、检查时间延长和潜在的偏倚。解决方案的关键在于提出两个核心方法:一是GT++算法,通过标注的解剖学标志点从三维超声(3D US)数据中估计标准面部平面;二是3DFETUS模型,基于深度学习实现对3D胎儿超声体积中这些平面的自动化与标准化定位。实验表明,该方案在平移误差(均值4.13 mm)和旋转误差(均值7.93°)上优于现有最先进方法,并通过临床专家评估验证了其显著提升的准确性。
链接: https://arxiv.org/abs/2511.10412
作者: Alomar Antonia,Rubio Ricardo,Albaiges Gerard,Salort-Benejam Laura,Caminal Julia,Prat Maria,Rueda Carolina,Cortes Berta,Piella Gemma,Sukno Federico
机构: Universitat Pompeu Fabra (庞培法布拉大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Acquiring standard facial planes during routine fetal ultrasound (US) examinations is often challenging due to fetal movement, variability in orientation, and operator-dependent expertise. These factors contribute to inconsistencies, increased examination time, and potential diagnostic bias. To address these challenges in the context of facial assessment, we present: 1) GT++, a robust algorithm that estimates standard facial planes from 3D US volumes using annotated anatomical landmarks; and 2) 3DFETUS, a deep learning model that automates and standardizes their localization in 3D fetal US volumes. We evaluated our methods both qualitatively, through expert clinical review, and quantitatively. The proposed approach achieved a mean translation error of 4.13 mm and a mean rotation error of 7.93 degrees per plane, outperforming other state-of-the-art methods on 3D US volumes. Clinical assessments further confirmed the effectiveness of both GT++ and 3DFETUS, demonstrating statistically significant improvements in plane estimation accuracy.
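The translation and rotation errors reported above compare a predicted plane with a reference plane. The sketch below is a generic illustration, not GT++ itself: it fits a least-squares plane to landmark coordinates and measures the distance between plane centres and the angle between normals. The landmark values and the exact error definitions are assumptions.

```python
import numpy as np

def fit_plane(points):
    """Least-squares plane through 3D points: returns (centroid, unit normal)."""
    pts = np.asarray(points, dtype=float)
    centroid = pts.mean(axis=0)
    # The right singular vector with the smallest singular value is the plane normal.
    _, _, vt = np.linalg.svd(pts - centroid)
    normal = vt[-1]
    return centroid, normal / np.linalg.norm(normal)

def plane_errors(center_a, normal_a, center_b, normal_b):
    """Translation error (input units) and rotation error (degrees) between two planes."""
    translation = float(np.linalg.norm(np.asarray(center_a) - np.asarray(center_b)))
    # abs() makes the angle independent of the sign chosen for each normal.
    cos_angle = np.clip(abs(np.dot(normal_a, normal_b)), 0.0, 1.0)
    rotation = float(np.degrees(np.arccos(cos_angle)))
    return translation, rotation

# Hypothetical landmarks (mm) lying in the z = 0 plane.
landmarks = [[0, 0, 0], [10, 0, 0], [0, 10, 0], [10, 10, 0]]
center, normal = fit_plane(landmarks)

# Hypothetical prediction: shifted 3 mm along z and tilted by 10 degrees.
pred_center = center + np.array([0.0, 0.0, 3.0])
pred_normal = np.array([0.0, np.sin(np.radians(10)), np.cos(np.radians(10))])
t_err, r_err = plane_errors(center, normal, pred_center, pred_normal)
```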
[CV-19] LLM-YOLOMS: Large Language Model-based Semantic Interpretation and Fault Diagnosis for Wind Turbine Components
【速读】:该论文旨在解决风力发电机组(wind turbine, WT)部件故障检测方法依赖视觉识别、输出结构化结果缺乏语义可解释性且无法支持运维决策的问题。解决方案的关键在于构建一个融合YOLOMS目标检测模型与大语言模型(large language model, LLM)的集成框架:其中YOLOMS通过多尺度检测和滑动窗口裁剪提升故障特征提取能力,轻量级键值(key-value, KV)映射模块将视觉检测结果转化为包含定性和定量属性的结构化文本表示,进而由领域微调的大语言模型进行语义推理,生成具有可解释性的故障分析报告与维护建议,从而显著提升诊断结果的可理解性并为风电运维提供实用决策支持。
链接: https://arxiv.org/abs/2511.10394
作者: Yaru Li,Yanxue Wang,Meng Li,Xinming Li,Jianbo Feng
机构: Beijing University of Civil Engineering and Architecture (北京建筑大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Journal resubmission
Abstract:The health condition of wind turbine (WT) components is crucial for ensuring stable and reliable operation. However, existing fault detection methods are largely limited to visual recognition, producing structured outputs that lack semantic interpretability and fail to support maintenance decision-making. To address these limitations, this study proposes an integrated framework that combines YOLOMS with a large language model (LLM) for intelligent fault analysis and diagnosis. Specifically, YOLOMS employs multi-scale detection and sliding-window cropping to enhance fault feature extraction, while a lightweight key-value (KV) mapping module bridges the gap between visual outputs and textual inputs. This module converts YOLOMS detection results into structured textual representations enriched with both qualitative and quantitative attributes. A domain-tuned LLM then performs semantic reasoning to generate interpretable fault analyses and maintenance recommendations. Experiments on real-world datasets demonstrate that the proposed framework achieves a fault detection accuracy of 90.6% and generates maintenance reports with an average accuracy of 89%, thereby improving the interpretability of diagnostic results and providing practical decision support for the operation and maintenance of wind turbines.
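The key-value mapping step turns detector output into text that a language model can consume. The sketch below shows one plausible shape for that conversion; the field names, the 5% area threshold, and the sample detections are all hypothetical and do not come from the paper.

```python
def detections_to_text(detections, image_width, image_height):
    """Render detections as 'key=value' lines with qualitative and quantitative fields."""
    lines = []
    for i, det in enumerate(detections, start=1):
        x1, y1, x2, y2 = det["box"]
        area_ratio = (x2 - x1) * (y2 - y1) / (image_width * image_height)
        # Qualitative attribute derived from a quantitative one (threshold is an assumption).
        extent = "large" if area_ratio > 0.05 else "small"
        fields = {
            "fault_id": i,
            "type": det["label"],
            "confidence": f"{det['score']:.2f}",
            "area_ratio": f"{area_ratio:.3f}",
            "extent": extent,
        }
        lines.append("; ".join(f"{k}={v}" for k, v in fields.items()))
    return "\n".join(lines)

# Hypothetical detector output: label, confidence, and pixel box (x1, y1, x2, y2).
detections = [
    {"label": "blade_crack", "score": 0.91, "box": (100, 200, 400, 500)},
    {"label": "surface_erosion", "score": 0.78, "box": (10, 10, 60, 50)},
]
report_input = detections_to_text(detections, image_width=1000, image_height=1000)
```

The resulting string would be placed in the prompt of the domain-tuned LLM, which then writes the fault analysis and maintenance recommendation.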
[CV-20] GrounDiff: Diffusion-Based Ground Surface Generation from Digital Surface Models WACV2026
【速读】:该论文旨在解决从数字表面模型(Digital Surface Model, DSM)到数字地形模型(Digital Terrain Model, DTM)转换过程中存在的精度不足与泛化能力弱的问题,传统滤波方法依赖人工调参,而现有学习方法则需复杂架构设计且常需后处理。其解决方案的关键在于提出首个基于扩散模型(diffusion-based)的框架GrounDiff,将DTM生成建模为去噪任务,通过门控机制(gated design)结合置信度引导生成策略实现对非地表结构的选择性移除;同时引入Prior-Guided Stitching(PrioStitch)以全局先验指导局部高分辨率预测,显著提升可扩展性与一致性。实验表明,该方法在多个基准数据集上均显著优于当前最优深度学习方法,在道路重建任务中亦展现出更低的距离误差和更优的表面平滑性。
链接: https://arxiv.org/abs/2511.10391
作者: Oussema Dhaouadi,Johannes Meier,Jacques Kaiser,Daniel Cremers
机构: DeepScenario; TU Munich; Munich Center for Machine Learning
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at WACV 2026
Abstract:Digital Terrain Models (DTMs) represent the bare-earth elevation and are important in numerous geospatial applications. Such data models cannot be directly measured by sensors and are typically generated from Digital Surface Models (DSMs) derived from LiDAR or photogrammetry. Traditional filtering approaches rely on manually tuned parameters, while learning-based methods require well-designed architectures, often combined with post-processing. To address these challenges, we introduce Ground Diffusion (GrounDiff), the first diffusion-based framework that iteratively removes non-ground structures by formulating the problem as a denoising task. We incorporate a gated design with confidence-guided generation that enables selective filtering. To increase scalability, we further propose Prior-Guided Stitching (PrioStitch), which employs a downsampled global prior automatically generated using GrounDiff to guide local high-resolution predictions. We evaluate our method on the DSM-to-DTM translation task across diverse datasets, showing that GrounDiff consistently outperforms deep learning-based state-of-the-art methods, reducing RMSE by up to 93% on ALS2DTM and up to 47% on USGS benchmarks. In the task of road reconstruction, which requires both high precision and smoothness, our method achieves up to 81% lower distance error compared to specialized techniques on the GeRoD benchmark, while maintaining competitive surface smoothness using only DSM inputs, without task-specific optimization. Our variant for road reconstruction, GrounDiff+, is specifically designed to produce even smoother surfaces, further surpassing state-of-the-art methods. The project page is available at this https URL.
[CV-21] MonkeyOCR v1.5 Technical Report: Unlocking Robust Document Parsing for Complex Patterns
【速读】:该论文旨在解决真实世界文档中复杂布局(如多级表格、嵌入图像或公式、跨页结构)对现有光学字符识别(OCR)系统带来的挑战,这些问题导致布局理解与内容识别准确率下降。解决方案的关键在于提出MonkeyOCR v1.5,一个统一的视觉-语言框架,采用两阶段解析流水线:第一阶段利用大规模多模态模型联合预测文档布局与阅读顺序,确保结构与序列一致性;第二阶段在检测区域内进行文本、公式和表格的局部识别,保持高视觉保真度并减少误差传播。针对复杂表格结构,引入基于视觉一致性的强化学习方案,通过render-and-compare对齐评估识别质量,提升结构准确性且无需人工标注;同时设计Image-Decoupled Table Parsing和Type-Guided Table Merging两个专用模块,实现含嵌入图像表格的可靠解析及跨页/跨列表格的重建。
链接: https://arxiv.org/abs/2511.10390
作者: Jiarui Zhang,Yuliang Liu,Zijun Wu,Guosheng Pang,Zhili Ye,Yupei Zhong,Junteng Ma,Tao Wei,Haiyang Xu,Weikai Chen,Zeen Wang,Qiangjun Ji,Fanxi Zhou,Qi Zhang,Yuanrui Hu,Jiahao Liu,Zhang Li,Ziyang Zhang,Qiang Liu,Xiang Bai
机构: KingSoft Office Zhuiguang AI Lab (金山办公追光AI实验室); Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Document parsing is a core task in document intelligence, supporting applications such as information extraction, retrieval-augmented generation, and automated document analysis. However, real-world documents often feature complex layouts with multi-level tables, embedded images or formulas, and cross-page structures, which remain challenging for existing OCR systems. We introduce MonkeyOCR v1.5, a unified vision-language framework that enhances both layout understanding and content recognition through a two-stage parsing pipeline. The first stage employs a large multimodal model to jointly predict document layout and reading order, leveraging visual information to ensure structural and sequential consistency. The second stage performs localized recognition of text, formulas, and tables within detected regions, maintaining high visual fidelity while reducing error propagation. To address complex table structures, we propose a visual consistency-based reinforcement learning scheme that evaluates recognition quality via render-and-compare alignment, improving structural accuracy without manual annotations. Additionally, two specialized modules, Image-Decoupled Table Parsing and Type-Guided Table Merging, are introduced to enable reliable parsing of tables containing embedded images and reconstruction of tables crossing pages or columns. Comprehensive experiments on OmniDocBench v1.5 demonstrate that MonkeyOCR v1.5 achieves state-of-the-art performance, outperforming PPOCR-VL and MinerU 2.5 while showing exceptional robustness in visually complex document scenarios.
[CV-22] Physics informed Transformer-VAE for biophysical parameter estimation: PROSAIL model inversion in Sentinel-2 imagery
【速读】:该论文旨在解决从卫星遥感影像中准确反演植被生物物理参数(如叶面积指数 LAI 和冠层叶绿素含量 CCC)的问题,尤其在缺乏实测地面标签或真实影像校准的情况下如何提升反演精度。解决方案的关键在于提出了一种物理信息引导的 Transformer-VAE 架构,将 PROSAIL 辐射传输模型(Radiative Transfer Model, RTM)作为可微分的物理解码器嵌入网络结构中,从而确保潜在变量映射到物理上合理的叶和冠层属性;该模型仅使用模拟数据训练,却能达到与依赖真实 Sentinel-2 影像自监督训练的先进方法相当的性能,实现了无需实测标签或影像校准的低成本、自监督全球植被监测方案。
链接: https://arxiv.org/abs/2511.10387
作者: Prince Mensah,Pelumi Victor Aderinto,Ibrahim Salihu Yusuf,Arnu Pretorius
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 10 pages, 6 figures, uses this http URL
Abstract:Accurate retrieval of vegetation biophysical variables from satellite imagery is crucial for ecosystem monitoring and agricultural management. In this work, we propose a physics-informed Transformer-VAE architecture to invert the PROSAIL radiative transfer model for simultaneous estimation of key canopy parameters from Sentinel-2 data. Unlike previous hybrid approaches that require real satellite images for self-supervised training, our model is trained exclusively on simulated data, yet achieves performance on par with state-of-the-art methods that utilize real imagery. The Transformer-VAE incorporates the PROSAIL model as a differentiable physical decoder, ensuring that inferred latent variables correspond to physically plausible leaf and canopy properties. We demonstrate retrieval of leaf area index (LAI) and canopy chlorophyll content (CCC) on real-world field datasets (FRM4Veg and BelSAR) with accuracy comparable to models trained with real Sentinel-2 data. Our method requires no in-situ labels or calibration on real images, offering a cost-effective and self-supervised solution for global vegetation monitoring. The proposed approach illustrates how integrating physical models with advanced deep networks can improve the inversion of RTMs, opening new prospects for large-scale, physically-constrained remote sensing of vegetation traits.
[CV-23] SAMIRO: Spatial Attention Mutual Information Regularization with a Pre-trained Model as Oracle for Lane Detection
【速读】:该论文旨在解决当前车道线检测(lane detection)在复杂现实环境下的性能瓶颈问题,特别是背景杂乱、光照变化和遮挡等因素对数据驱动方法带来的挑战,这些问题往往导致模型泛化能力弱且标注成本高昂。解决方案的关键在于提出一种名为SAMIRO(Spatial Attention Mutual Information Regularization with a pre-trained model as an Oracle)的新机制,其核心是利用预训练模型作为“Oracle”来迁移知识,同时通过空间注意力机制保留域无关的空间信息,从而增强模型对上下文和全局结构的理解能力。该方法具备即插即用特性,可无缝集成到多种主流车道线检测框架中,并在CULane、TuSimple和LLAMAS等多个基准数据集上实现稳定性能提升。
链接: https://arxiv.org/abs/2511.10385
作者: Hyunjong Lee,Jangho Lee,Jaekoo Lee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 4 figures, paper in press
Abstract:Lane detection is an important topic in future mobility solutions. Real-world environmental challenges such as background clutter, varying illumination, and occlusions pose significant obstacles to effective lane detection, particularly when relying on data-driven approaches that require substantial effort and cost for data collection and annotation. To address these issues, lane detection methods must leverage contextual and global information from surrounding lanes and objects. In this paper, we propose Spatial Attention Mutual Information Regularization with a pre-trained model as an Oracle, called SAMIRO. SAMIRO enhances lane detection performance by transferring knowledge from a pretrained model while preserving domain-agnostic spatial information. Leveraging SAMIRO’s plug-and-play characteristic, we integrate it into various state-of-the-art lane detection approaches and conduct extensive experiments on major benchmarks such as CULane, TuSimple, and LLAMAS. The results demonstrate that SAMIRO consistently improves performance across different models and datasets. The code will be made available upon publication.
[CV-24] Fragile by Design: On the Limits of Adversarial Defenses in Personalized Generation
【速读】:该论文旨在解决个性化生成模型(如DreamBooth)中存在的面部身份泄露隐私风险问题,现有防御方法(如Anti-DreamBooth)通过在用户图像中注入对抗扰动来阻止模型学习并复现用户身份,但存在两个关键缺陷:一是扰动易产生肉眼可见的伪影,二是扰动对简单非学习型滤波器极为脆弱,可被轻易去除。为此,作者提出一种新的评估框架AntiDB_Purify,用于系统性地测试现有防御方案在真实净化威胁(包括传统图像滤波和对抗净化)下的有效性。结果表明,当前所有防御方法在面对此类净化攻击时均失效,揭示了现有技术仅提供虚假的安全感,强调亟需开发更隐蔽且鲁棒的防护机制以保障个性化生成场景中的用户身份安全。
链接: https://arxiv.org/abs/2511.10382
作者: Zhen Chen,Yi Zhang,Xiangyu Yin,Chengxuan Qin,Xingyu Zhao,Xiaowei Huang,Wenjie Ruan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Personalized AI applications such as DreamBooth enable the generation of customized content from user images, but also raise significant privacy concerns, particularly the risk of facial identity leakage. Recent defense mechanisms like Anti-DreamBooth attempt to mitigate this risk by injecting adversarial perturbations into user photos to prevent successful personalization. However, we identify two critical yet overlooked limitations of these methods. First, the adversarial examples often exhibit perceptible artifacts such as conspicuous patterns or stripes, making them easily detectable as manipulated content. Second, the perturbations are highly fragile, as even a simple, non-learned filter can effectively remove them, thereby restoring the model’s ability to memorize and reproduce user identity. To investigate this vulnerability, we propose a novel evaluation framework, AntiDB_Purify, to systematically evaluate existing defenses under realistic purification threats, including both traditional image filters and adversarial purification. Results reveal that none of the current methods maintains their protective effectiveness under such threats. These findings highlight that current defenses offer a false sense of security and underscore the urgent need for more imperceptible and robust protections to safeguard user identity in personalized generation.
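To see why a simple, non-learned filter can undo a high-frequency perturbation, consider a box blur applied to a constant image with a checkerboard pattern added. This is an illustrative toy, not the AntiDB_Purify framework; the filter choice, the perturbation pattern, and the amplitudes are assumptions.

```python
import numpy as np

def box_blur(image, radius=1):
    """Mean filter over a (2*radius+1)^2 window with edge padding (2D grayscale input)."""
    padded = np.pad(image, ((radius, radius), (radius, radius)), mode="edge")
    k = 2 * radius + 1
    out = np.zeros(image.shape, dtype=float)
    height, width = image.shape
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + height, dx:dx + width]
    return out / (k * k)

# Clean image: constant gray. Perturbation: +/-0.1 checkerboard (assumed stand-in
# for a high-frequency adversarial pattern).
clean = np.full((8, 8), 0.5)
rows, cols = np.indices(clean.shape)
perturbation = 0.1 * np.where((rows + cols) % 2 == 0, 1.0, -1.0)
protected = clean + perturbation
purified = box_blur(protected)

residual_before = np.abs(protected - clean).max()
residual_after = np.abs(purified - clean).max()
```

In this toy case the 3x3 mean filter shrinks the perturbation by roughly a factor of nine. Real protective perturbations are not pure checkerboards, so the effect in practice depends on their frequency content.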
[CV-25] MSGNav: Unleashing the Power of Multi-modal 3D Scene Graph for Zero-Shot Embodied Navigation
【速读】:该论文旨在解决零样本(zero-shot)机器人导航中因依赖显式三维场景图(3D Scene Graph)而导致的视觉信息损失、词汇受限及构建成本高的问题。现有方法通常将丰富的视觉观测压缩为仅文本关系,造成不可逆的信息丢失并限制开放词汇(open vocabulary)能力。其解决方案的关键在于提出多模态三维场景图(Multi-modal 3D Scene Graph, M3DSG),通过用动态分配的图像替代传统文本关系边来保留视觉线索;在此基础上构建MSGNav系统,集成关键子图选择模块以提升推理效率、自适应词汇更新模块实现开放词汇支持,并引入闭环推理模块增强探索准确性,同时识别并解决“最后一公里”问题——即确定可行的目标位置与合适的最终视角,提出基于可视性的视角决策模块予以显式处理。
链接: https://arxiv.org/abs/2511.10376
作者: Xun Huang,Shijia Zhao,Yunxiang Wang,Xin Lu,Wanfa Zhang,Rongsheng Qu,Weixin Li,Yunhong Wang,Chenglu Wen
机构: Xiamen University (厦门大学); Zhongguancun Academy (中关村学院); Nanyang Technological University (南洋理工大学); University of Chinese Academy of Sciences (中国科学院大学); Beihang University (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 10 pages
Abstract:Embodied navigation is a fundamental capability for robotic agents. Real-world deployment requires open vocabulary generalization and low training overhead, motivating zero-shot methods rather than task-specific RL training. However, existing zero-shot methods that build explicit 3D scene graphs often compress rich visual observations into text-only relations, leading to high construction cost, irreversible loss of visual evidence, and constrained vocabularies. To address these limitations, we introduce the Multi-modal 3D Scene Graph (M3DSG), which preserves visual cues by replacing textual relational edges with dynamically assigned images. Built on M3DSG, we propose MSGNav, a zero-shot navigation system that includes a Key Subgraph Selection module for efficient reasoning, an Adaptive Vocabulary Update module for open vocabulary support, and a Closed-Loop Reasoning module for accurate exploration reasoning. We further identify the last-mile problem in zero-shot navigation, namely determining the feasible target location with a suitable final viewpoint, and propose a Visibility-based Viewpoint Decision module to explicitly resolve it. Comprehensive experimental results demonstrate that MSGNav achieves state-of-the-art performance on GOAT-Bench and HM3D-OVON datasets. The open-source code will be publicly available.
[CV-26] SHRUG-FM: Reliability-Aware Foundation Models for Earth Observation
【速读】:该论文旨在解决地球观测领域中地理空间基础模型(Geospatial Foundation Models, GFMs)在预训练数据覆盖不足的环境中可靠性下降的问题。其解决方案的关键在于提出SHRUG-FM框架,该框架通过融合三种互补的信号实现可靠性感知预测:输入空间中的分布外(Out-of-Distribution, OOD)检测、嵌入空间中的OOD检测以及任务特定的预测不确定性。实验证明,这些信号能够有效识别性能下降的场景,并揭示失败模式与特定土地覆盖属性(如低海拔区域和大河流域)之间的关联,从而为气候敏感应用中GFMs的安全部署提供可解释性保障。
链接: https://arxiv.org/abs/2511.10370
作者: Kai-Hendrik Cohrs,Zuzanna Osika,Maria Gonzalez-Calabuig,Vishal Nedungadi,Ruben Cartuyvels,Steffen Knoblauch,Joppe Massant,Shruti Nath,Patrick Ebel,Vasileios Sitokonstantinou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Geospatial foundation models for Earth observation often fail to perform reliably in environments underrepresented during pretraining. We introduce SHRUG-FM, a framework for reliability-aware prediction that integrates three complementary signals: out-of-distribution (OOD) detection in the input space, OOD detection in the embedding space and task-specific predictive uncertainty. Applied to burn scar segmentation, SHRUG-FM shows that OOD scores correlate with lower performance in specific environmental conditions, while uncertainty-based flags help discard many poorly performing predictions. Linking these flags to land cover attributes from HydroATLAS shows that failures are not random but concentrated in certain geographies, such as low-elevation zones and large river areas, likely due to underrepresentation in pretraining data. SHRUG-FM provides a pathway toward safer and more interpretable deployment of GFMs in climate-sensitive applications, helping bridge the gap between benchmark performance and real-world reliability.
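The three signals described above (input-space OOD, embedding-space OOD, and predictive uncertainty) can be combined into a per-sample reliability flag. The sketch below is a generic illustration under stated assumptions: a k-nearest-neighbour distance stands in for both OOD detectors, binary entropy stands in for predictive uncertainty, and the features and thresholds are invented. None of these choices is confirmed by the abstract.

```python
import numpy as np

def knn_ood_score(reference, query, k=3):
    """Mean Euclidean distance from each query row to its k nearest reference rows."""
    dists = np.linalg.norm(query[:, None, :] - reference[None, :, :], axis=-1)
    return np.sort(dists, axis=1)[:, :k].mean(axis=1)

def binary_entropy(prob, eps=1e-12):
    """Per-element entropy of a Bernoulli probability, in nats."""
    p = np.clip(prob, eps, 1 - eps)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def reliability_flags(input_score, embed_score, uncertainty, thresholds):
    """Flag a sample as unreliable if any of the three signals exceeds its threshold."""
    t_in, t_emb, t_unc = thresholds
    return (input_score > t_in) | (embed_score > t_emb) | (uncertainty > t_unc)

# Hypothetical 2D summaries of training inputs and their embeddings.
train_inputs = np.array([[x, y] for x in (-1, 0, 1) for y in (-1, 0, 1)], dtype=float)
train_embeds = train_inputs * 2.0

# Two test samples: one close to the training data, one far away.
test_inputs = np.array([[0.0, 0.0], [5.0, 5.0]])
test_embeds = test_inputs * 2.0
test_probs = np.array([[0.99, 0.01], [0.5, 0.6]])  # per-pixel burn-scar probabilities

input_score = knn_ood_score(train_inputs, test_inputs)
embed_score = knn_ood_score(train_embeds, test_embeds)
uncertainty = binary_entropy(test_probs).mean(axis=1)
flags = reliability_flags(input_score, embed_score, uncertainty, thresholds=(2.0, 4.0, 0.5))
```

Flagged samples could then be cross-referenced with land cover attributes to check whether failures cluster in particular geographies, as the abstract reports.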
[CV-27] DermAI: Clinical dermatology acquisition through quality-driven image collection for AI classification in mobile
【速读】:该论文旨在解决当前基于人工智能(AI)的皮肤病学应用在实际临床中推广受限的问题,主要瓶颈包括数据集偏倚、图像质量不一以及缺乏充分验证。其解决方案的关键在于开发了一个轻量级、基于智能手机的应用程序DermAI,该程序能够在日常诊疗过程中实现皮肤病变的实时采集、标注与分类;并通过设备端的质量检查和本地模型自适应机制提升实用性与准确性。此外,研究构建了一个涵盖多种肤色、种族和设备来源的临床数据集,实验证明仅使用公开数据训练的模型难以泛化至本地样本,而通过本地数据微调则显著提升了性能,凸显了标准化、多样化数据收集对机器学习驱动医疗应用的重要性。
链接: https://arxiv.org/abs/2511.10367
作者: Thales Bezerra,Emanoel Thyago,Kelvin Cunha,Rodrigo Abreu,Fábio Papais,Francisco Mauro,Natália Lopes,Érico Medeiros,Jéssica Guido,Shirley Cruz,Paulo Borba,Tsang Ing Ren
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 4 pages, 2 figures, 1 table, submitted to ISBI
Abstract:AI-based dermatology adoption remains limited by biased datasets, variable image quality, and limited validation. We introduce DermAI, a lightweight, smartphone-based application that enables real-time capture, annotation, and classification of skin lesions during routine consultations. Unlike prior dermoscopy-focused tools, DermAI performs on-device quality checks and local model adaptation. The DermAI clinical dataset encompasses a wide range of skin tones, ethnicities, and source devices. In preliminary experiments, models trained on public datasets failed to generalize to our samples, while fine-tuning with local data improved performance. These results highlight the importance of standardized, diverse data collection aligned with healthcare needs and oriented to machine learning development.
[CV-28] FOUND: Fourier-based von Mises Distribution for Robust Single Domain Generalization in Object Detection
【速读】:该论文旨在解决单域泛化(Single Domain Generalization, SDG)目标检测中模型在未见目标域上性能下降的问题,核心挑战在于现有方法如基于CLIP的语义增强忽略了特征分布的底层结构和频域特性。解决方案的关键在于提出一个融合von Mises-Fisher(vMF)分布与傅里叶变换的CLIP引导框架:首先利用vMF分布建模物体表征的方向性特征,以更好地捕捉嵌入空间中的域不变语义结构;其次引入基于傅里叶变换的增强策略,通过扰动幅度和相位分量模拟频域中的域偏移,从而提升特征的鲁棒性和跨域一致性。该方法在多样天气驾驶基准上显著优于当前最先进方法。
链接: https://arxiv.org/abs/2511.10352
作者: Mengzhu Wang,Changyuan Deng,Shanshan Wang,Nan Yin,Long Lan,Liang Yang
机构: Hebei University of Technology (河北工业大学); Anhui University (安徽大学); Hong Kong University of Technology (香港科技大学); National University of Technology (国家科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Single Domain Generalization (SDG) for object detection aims to train a model on a single source domain that can generalize effectively to unseen target domains. While recent methods like CLIP-based semantic augmentation have shown promise, they often overlook the underlying structure of feature distributions and frequency-domain characteristics that are critical for robustness. In this paper, we propose a novel framework that enhances SDG object detection by integrating the von Mises-Fisher (vMF) distribution and Fourier transformation into a CLIP-guided pipeline. Specifically, we model the directional features of object representations using vMF to better capture domain-invariant semantic structures in the embedding space. Additionally, we introduce a Fourier-based augmentation strategy that perturbs amplitude and phase components to simulate domain shifts in the frequency domain, further improving feature robustness. Our method not only preserves the semantic alignment benefits of CLIP but also enriches feature diversity and structural consistency across domains. Extensive experiments on the diverse weather-driving benchmark demonstrate that our approach outperforms the existing state-of-the-art method.
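A Fourier-based augmentation of the kind described perturbs the amplitude and phase spectra of an image and transforms back. The sketch below shows the general mechanism only; the noise distributions, the scale parameters, and taking the real part of the inverse transform are assumptions, not the paper's exact strategy.

```python
import numpy as np

def fourier_perturb(image, amp_scale=0.2, phase_scale=0.05, seed=0):
    """Randomly perturb amplitude and phase of an H x W x C image in the frequency domain."""
    rng = np.random.default_rng(seed)
    spectrum = np.fft.fft2(image, axes=(0, 1))
    amplitude, phase = np.abs(spectrum), np.angle(spectrum)
    # Multiplicative jitter on the amplitude, additive jitter on the phase (assumed forms).
    amplitude = amplitude * (1.0 + amp_scale * rng.uniform(-1, 1, size=amplitude.shape))
    phase = phase + phase_scale * rng.normal(size=phase.shape)
    perturbed = amplitude * np.exp(1j * phase)
    # Independent noise breaks conjugate symmetry, so only the real part is kept.
    return np.real(np.fft.ifft2(perturbed, axes=(0, 1)))

image = np.random.default_rng(1).random((8, 8, 3))
augmented = fourier_perturb(image)
identity = fourier_perturb(image, amp_scale=0.0, phase_scale=0.0)
```

With both scales set to zero the function returns the input unchanged, which is a quick sanity check that the round trip through the frequency domain is lossless.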
[CV-29] Learning to Tell Apart: Weakly Supervised Video Anomaly Detection via Disentangled Semantic Alignment AAAI2026
【速读】:该论文旨在解决弱监督视频异常检测中因模型倾向于关注最显著的异常响应片段而忽视多样正常模式挖掘,以及由于外观相似导致类别混淆的问题,从而影响细粒度分类性能。解决方案的关键在于提出一种解耦语义对齐网络(Disentangled Semantic Alignment Network, DSANet),其核心机制包括:在粗粒度层面引入自引导正常性建模分支,通过学习到的正常原型指导输入视频特征重建,强化模型对视频内固有正常线索的利用,提升正常模式与异常事件的时间分离能力;在细粒度层面设计解耦对比语义对齐机制,首先基于帧级异常得分将视频分解为以事件为中心和以背景为中心的组件,再结合视觉-语言对比学习增强类别判别性表征,从而实现异常与正常特征的显式解耦与增强区分。
链接: https://arxiv.org/abs/2511.10334
作者: Wenti Yin,Huaxin Zhang,Xiang Wang,Yuqing Lu,Yicheng Zhang,Bingquan Gong,Jialong Zuo,Li Yu,Changxin Gao,Nong Sang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to AAAI 2026. Code is available at this https URL
Abstract:Recent advancements in weakly-supervised video anomaly detection have achieved remarkable performance by applying the multiple instance learning paradigm based on multimodal foundation models such as CLIP to highlight anomalous instances and classify categories. However, their objectives may tend to detect the most salient response segments, while neglecting to mine diverse normal patterns separated from anomalies, and are prone to category confusion due to similar appearance, leading to unsatisfactory fine-grained classification results. Therefore, we propose a novel Disentangled Semantic Alignment Network (DSANet) to explicitly separate abnormal and normal features from coarse-grained and fine-grained aspects, enhancing the distinguishability. Specifically, at the coarse-grained level, we introduce a self-guided normality modeling branch that reconstructs input video features under the guidance of learned normal prototypes, encouraging the model to exploit normality cues inherent in the video, thereby improving the temporal separation of normal patterns and anomalous events. At the fine-grained level, we present a decoupled contrastive semantic alignment mechanism, which first temporally decomposes each video into event-centric and background-centric components using frame-level anomaly scores and then applies visual-language contrastive learning to enhance class-discriminative representations. Comprehensive experiments on two standard benchmarks, namely XD-Violence and UCF-Crime, demonstrate that DSANet outperforms existing state-of-the-art methods.
[CV-30] Depth-Consistent 3D Gaussian Splatting via Physical Defocus Modeling and Multi-View Geometric Supervision
【速读】:该论文旨在解决极端深度变化场景下三维重建中近景与远景区域监督信号不一致的问题,现有方法难以同时实现远距离区域的精确深度估计和近距离区域的结构保真。其解决方案的关键在于提出一种融合景深监督(Depth-of-field Supervision)与多视角一致性监督(Multi-View Consistency Supervision)的计算框架:前者利用尺度恢复的单目深度估计器生成先验深度,并通过模糊卷积合成物理上准确的散焦图像,结合新颖的景深损失函数增强远近场区域的几何一致性;后者基于LoFTR的半密集特征匹配最小化跨视角几何误差,并通过可靠匹配点的最小二乘优化强制深度一致性。该方法通过统一成像物理原理与多视角几何约束,在Waymo Open Dataset上实现了比当前最优方法高0.8 dB的PSNR提升,显著改善了城市环境中复杂深度分层的重建质量。
链接: https://arxiv.org/abs/2511.10316
作者: Yu Deng,Baozhu Zhao,Junyan Su,Xiaohan Zhang,Qi Liu
机构: South China University of Technology (华南理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Three-dimensional reconstruction in scenes with extreme depth variations remains challenging due to inconsistent supervisory signals between near-field and far-field regions. Existing methods fail to simultaneously address inaccurate depth estimation in distant areas and structural degradation in close-range regions. This paper proposes a novel computational framework that integrates depth-of-field supervision and multi-view consistency supervision to advance 3D Gaussian Splatting. Our approach comprises two core components: (1) Depth-of-field Supervision employs a scale-recovered monocular depth estimator (e.g., Metric3D) to generate depth priors, leverages defocus convolution to synthesize physically accurate defocused images, and enforces geometric consistency through a novel depth-of-field loss, thereby enhancing depth fidelity in both far-field and near-field regions; (2) Multi-View Consistency Supervision employs LoFTR-based semi-dense feature matching to minimize cross-view geometric errors and enforces depth consistency via least squares optimization of reliable matched points. By unifying defocus physics with multi-view geometric constraints, our method achieves superior depth fidelity, demonstrating a 0.8 dB PSNR improvement over the state-of-the-art method on the Waymo Open Dataset. This framework bridges physical imaging principles and learning-based depth regularization, offering a scalable solution for complex depth stratification in urban environments.
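The abstract mentions least squares optimization over reliable matched points without giving the objective. One common formulation, assumed here for illustration and not confirmed by the paper, fits a global scale and shift that align a monocular depth prior with depths known at matched points.

```python
import numpy as np

def fit_scale_shift(prior_depth, reference_depth):
    """Solve min over (s, t) of || s * prior + t - reference ||^2 at matched points."""
    prior = np.asarray(prior_depth, dtype=float)
    reference = np.asarray(reference_depth, dtype=float)
    design = np.stack([prior, np.ones_like(prior)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(design, reference, rcond=None)
    return scale, shift

# Hypothetical depths at four matched points (reference values built as 2.5 * prior + 0.3).
prior = np.array([1.0, 2.0, 3.0, 4.0])
reference = 2.5 * prior + 0.3
scale, shift = fit_scale_shift(prior, reference)
aligned = scale * prior + shift
```

In practice only matches that pass a reliability check would enter the fit, since a few outliers can noticeably bias an unweighted least-squares solution.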
[CV-31] CLIP4VI-ReID: Learning Modality-shared Representations via CLIP Semantic Bridge for Visible-Infrared Person Re-identification
【速读】:该论文旨在解决可见光-红外跨模态行人重识别(Visible-Infrared Person Re-Identification, VI-ReID)任务中因物理特性差异导致的模态鸿沟问题。解决方案的关键在于提出一种基于CLIP的模态共享表示学习网络CLIP4VI-ReID,其核心创新包括:1)文本语义生成(Text Semantic Generation, TSG),仅对可见光图像生成文本语义以实现初步的可见光-文本模态对齐;2)红外特征嵌入(Infrared Feature Embedding, IFE),利用生成的文本语义校正红外图像特征嵌入,将身份相关语义注入共享图像编码器,提升其对红外模态的适应性,并间接实现可见光与红外模态对齐;3)高层语义对齐(High-level Semantic Alignment, HSA),进一步优化细粒度的跨模态语义一致性,确保文本语义仅保留身份相关信息,从而增强模态共享表示的判别能力。
链接: https://arxiv.org/abs/2511.10309
作者: Xiaomei Yang,Xizhan Gao,Sijie Niu,Fa Zhu,Guang Feng,Xiaofeng Qu,David Camacho
机构: Shandong Key Laboratory of Ubiquitous Intelligent Computing, School of Information Science and Engineering, University of Jinan, Jinan 250022, China (山东省泛在智能计算重点实验室,信息科学与工程学院,济南大学); College of Information Science and Technology & College of Artificial Intelligence, Nanjing Forestry University, Nanjing 210037, China (信息科学与技术学院 & 人工智能学院,南京林业大学); Computer Systems Engineering Department, Universidad Politécnica de Madrid, 28031, Calle Alan, Turing s/n, Madrid, Spain (马德里理工大学计算机系统工程系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This paper proposes a novel CLIP-driven modality-shared representation learning network named CLIP4VI-ReID for the VI-ReID task, which consists of Text Semantic Generation (TSG), Infrared Feature Embedding (IFE), and High-level Semantic Alignment (HSA). Specifically, considering the huge gap in the physical characteristics between natural images and infrared images, the TSG is designed to generate text semantics only for visible images, thereby enabling preliminary visible-text modality alignment. Then, the IFE is proposed to rectify the feature embeddings of infrared images using the generated text semantics. This process injects id-related semantics into the shared image encoder, enhancing its adaptability to the infrared modality. Besides, with text serving as a bridge, it enables indirect visible-infrared modality alignment. Finally, the HSA is established to refine the high-level semantic alignment. This process ensures that the fine-tuned text semantics only contain id-related information, thereby achieving more accurate cross-modal alignment and enhancing the discriminability of the learned modality-shared representations. Extensive experimental results demonstrate that the proposed CLIP4VI-ReID outperforms other state-of-the-art methods on several widely used VI-ReID datasets.
[CV-32] Revisiting Evaluation of Deep Neural Networks for Pedestrian Detection
【速读】:该论文旨在解决当前行人检测(Pedestrian Detection)模型性能评估中存在的问题,即现有基准测试指标无法对验证数据集的不同子集进行真实可靠的性能评估,导致难以准确衡量深度神经网络(DNN)在实际自动驾驶场景中的表现。其解决方案的关键在于引入基于图像分割(Image Segmentation)的细粒度错误分类方法,提出八类新的错误类别,并设计相应的评价指标来实现更精细化、安全导向的模型对比。通过该框架,作者在不使用额外训练数据的情况下,在CityPersons-reasonable数据集上实现了当前最优(SOTA)性能,验证了该方法在提升模型安全性评估方面的有效性。
链接: https://arxiv.org/abs/2511.10308
作者: Patrick Feifel,Benedikt Franke,Frank Bonarens,Frank Köster,Arne Raulf,Friedhelm Schwenker
机构: Stellantis, Opel Automobile GmbH; Carl von Ossietzky Universität Oldenburg; Deutsches Zentrum für Luft- und Raumfahrt; Universität Ulm
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Reliable pedestrian detection represents a crucial step towards automated driving systems. However, the current performance benchmarks exhibit weaknesses. The currently applied metrics for various subsets of a validation dataset prohibit a realistic performance evaluation of a DNN for pedestrian detection. As image segmentation supplies fine-grained information about a street scene, it can serve as a starting point to automatically distinguish between different types of errors during the evaluation of a pedestrian detector. In this work, we propose eight error categories for pedestrian detection, together with new metrics for comparing performance along these categories. We use the new metrics to compare various backbones for a simplified version of the APD, and show a more fine-grained and robust way to compare models with each other, especially in terms of safety-critical performance. We achieve SOTA on CityPersons-reasonable (without extra training data) by using a rather simple architecture.
[CV-33] Rethinking Visual Information Processing in Multimodal LLMs
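Quick read: This paper addresses the inefficient integration of visual features in the LLaVA architecture for vision-language tasks, which stems from the inherent mismatch between the text and vision modalities. The core solution is LLaViT (Large Language Models as extended Vision Transformers), which lets the large language model (LLM) also act as a vision encoder through three key modifications: (1) learning separate query, key, and value (QKV) projections for the vision modality; (2) enabling bidirectional attention on visual tokens; and (3) incorporating both global and local visual representations. Experiments show that LLaViT significantly outperforms the LLaVA baseline on many benchmarks and even surpasses models with twice its parameter count.
Link: https://arxiv.org/abs/2511.10301
Authors: Dongwan Kim, Viresh Ranjan, Takashi Nagata, Arnab Dhua, Amit Kumar K C
Affiliations: Seoul National University; Amazon
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: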
Abstract:Despite the remarkable success of the LLaVA architecture for vision-language tasks, its design inherently struggles to effectively integrate visual features due to the inherent mismatch between text and vision modalities. We tackle this issue from a novel perspective in which the LLM not only serves as a language model but also a powerful vision encoder. To this end, we present LLaViT - Large Language Models as extended Vision Transformers - which enables the LLM to simultaneously function as a vision encoder through three key modifications: (1) learning separate QKV projections for vision modality, (2) enabling bidirectional attention on visual tokens, and (3) incorporating both global and local visual representations. Through extensive controlled experiments on a wide range of LLMs, we demonstrate that LLaViT significantly outperforms the baseline LLaVA method on a multitude of benchmarks, even surpassing models with double its parameter count, establishing a more effective approach to vision-language modeling.
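As a rough illustration of modification (2), the sketch below builds an attention mask that is causal everywhere except among visual tokens, which may attend to one another in both directions. The function name `build_attention_mask` and the rule that all visual tokens form a single bidirectional block are assumptions made for this example, not details taken from the paper.

```python
def build_attention_mask(is_visual):
    """Return mask[i][j] = True when position i may attend to position j.

    is_visual: list of bools, one per token, True for visual tokens.
    Assumption for illustration: all visual tokens attend to each other
    bidirectionally, while every other pair follows the usual causal rule.
    """
    n = len(is_visual)
    # Causal baseline: a token sees itself and everything before it.
    mask = [[j <= i for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(n):
            if is_visual[i] and is_visual[j]:
                mask[i][j] = True  # lift the causal limit among visual tokens
    return mask


# Two text tokens, three visual tokens, then one text token.
tokens = [False, False, True, True, True, False]
mask = build_attention_mask(tokens)
for row in mask:
    print("".join("1" if allowed else "." for allowed in row))
```

Text tokens keep their autoregressive behaviour under this mask; only the visual block gains access to later visual positions.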
[CV-34] Generalizable Slum Detection from Satellite Imagery with Mixture-of-Experts AAAI2026
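Quick read: This paper addresses the poor generalization of slum (informal settlement) segmentation from satellite imagery: models trained on specific regions transfer badly to unseen regions, mainly because the morphology of informal settlements is so heterogeneous. The key to the solution is GRAM (Generalized Region-Aware Mixture-of-Experts), a two-phase test-time adaptation framework. (1) For source training, it uses a million-scale high-resolution satellite imagery dataset covering 12 cities across four continents, with a Mixture-of-Experts architecture that captures region-specific characteristics while a shared backbone learns universal features. (2) During adaptation to an unlabeled target region, prediction consistency across experts filters out unreliable pseudo-labels, so the model transfers robustly without target-region annotations and outperforms state-of-the-art baselines in low-resource settings such as African cities.
Link: https://arxiv.org/abs/2511.10300
Authors: Sumin Lee, Sungwon Park, Jeasurk Yang, Jihee Kim, Meeyoung Cha
Affiliations: KAIST; Institute for Basic Science
Categories: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Comments: Accepted to AAAI 2026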
Abstract:Satellite-based slum segmentation holds significant promise in generating global estimates of urban poverty. However, the morphological heterogeneity of informal settlements presents a major challenge, hindering the ability of models trained on specific regions to generalize effectively to unseen locations. To address this, we introduce a large-scale high-resolution dataset and propose GRAM (Generalized Region-Aware Mixture-of-Experts), a two-phase test-time adaptation framework that enables robust slum segmentation without requiring labeled data from target regions. We compile a million-scale satellite imagery dataset from 12 cities across four continents for source training. Using this dataset, the model employs a Mixture-of-Experts architecture to capture region-specific slum characteristics while learning universal features through a shared backbone. During adaptation, prediction consistency across experts filters out unreliable pseudo-labels, allowing the model to generalize effectively to previously unseen regions. GRAM outperforms state-of-the-art baselines in low-resource settings such as African cities, offering a scalable and label-efficient solution for global slum mapping and data-driven urban planning.
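The idea of filtering pseudo-labels by cross-expert consistency can be sketched as follows. This is a minimal illustration under stated assumptions: per-pixel binary predictions, a simple vote, and a hypothetical `min_agreement` threshold. The paper's actual consistency criterion may differ.

```python
def consistent_pseudo_labels(expert_preds, min_agreement=1.0):
    """Keep a pseudo-label only where enough experts agree.

    expert_preds: list of per-expert label lists (0 or 1), all equal length.
    Returns the majority label per position, or None where the share of
    experts backing the majority is below min_agreement (hypothetical knob).
    Ties are resolved in favour of label 1.
    """
    n_experts = len(expert_preds)
    labels = []
    for votes in zip(*expert_preds):
        positives = sum(votes)
        majority = 1 if positives * 2 >= n_experts else 0
        agreement = max(positives, n_experts - positives) / n_experts
        labels.append(majority if agreement >= min_agreement else None)
    return labels


# Three experts, five pixels.
preds = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 0, 1, 1],
]
print(consistent_pseudo_labels(preds))       # unanimous agreement required
print(consistent_pseudo_labels(preds, 0.6))  # two of three experts suffice
```

Positions marked `None` would simply be excluded from the adaptation loss, so disagreement among experts reduces supervision rather than injecting noisy labels.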
[CV-35] Adaptive Residual-Update Steering for Low-Overhead Hallucination Mitigation in Large Vision Language Models
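Quick read: This paper addresses object hallucination in Large Vision-Language Models (LVLMs), where the model generates text inconsistent with the visual input and so undermines reliability. To mitigate it, the authors propose a low-overhead inference-time intervention framework, Residual-Update Directed DEcoding Regulation (RUDDER). Its two core innovations are: (1) the Contextual Activation Residual Direction (CARD) vector, a per-sample visual evidence vector extracted from the residual update of a self-attention layer during a single standard forward pass; and (2) a Bayesian-inspired adaptive gate that injects a corrective signal token by token, with a strength conditioned on how far the model deviates from the visual context. Experiments on key hallucination benchmarks, including POPE and CHAIR, indicate that RUDDER performs comparably to state-of-the-art methods while adding negligible latency.
Link: https://arxiv.org/abs/2511.10292
Authors: Zhengtao Zou, Ya Gao, Jiarui Guan, Bin Li, Pekka Marttinen
Affiliations: Aalto University; Shenzhen Institutes of Advanced Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Under review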
Abstract:Large Vision-Language Models (LVLMs) often suffer from object hallucination, generating text inconsistent with visual inputs, which can critically undermine their reliability. Existing inference-time interventions to mitigate this issue present a challenging trade-off: while methods that steer internal states or adjust output logits can be effective, they often incur substantial computational overhead, typically requiring extra forward passes. This efficiency bottleneck can limit their practicality for real-world, latency-sensitive deployments. In this work, we aim to address this trade-off with Residual-Update Directed DEcoding Regulation (RUDDER), a low-overhead framework that steers LVLMs towards visually-grounded generation. RUDDER is built on two key innovations: (1) Contextual Activation Residual Direction (CARD) vector, a per-sample visual evidence vector extracted from the residual update of a self-attention layer during a single, standard forward pass. (2) A Bayesian-inspired adaptive gate that performs token-wise injection, applying a corrective signal whose strength is conditioned on the model’s deviation from the visual context. Extensive experiments on key hallucination benchmarks, including POPE and CHAIR, indicate that RUDDER achieves performance comparable to state-of-the-art methods while introducing negligible computational latency, validating RUDDER as a pragmatic and effective approach for improving LVLMs’ reliability without a significant compromise on efficiency.
[CV-36] PROPA: Toward Process-level Optimization in Visual Reasoning via Reinforcement Learning
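Quick read: This paper addresses the way early errors cascade through multi-step dependencies when Vision-Language Models (VLMs) perform complex visual reasoning. Existing post-training paradigms fall short: Supervised Fine-Tuning (SFT) relies on costly step-level annotations, while Reinforcement Learning with Verifiable Rewards (RLVR) provides only sparse, outcome-level feedback, which hinders stable optimization. The key to the solution is PROPA (Process-level Reasoning Optimization with interleaved Policy Alignment), which integrates Monte Carlo Tree Search (MCTS) with GRPO to generate dense, process-level rewards and optimize reasoning at each intermediate step without human annotations. PROPA interleaves GRPO updates with SFT to overcome the cold-start problem, and trains a Process Reward Model (PRM) to guide inference-time search so that test-time search is aligned with the training signal. The authors report consistent gains over SFT- and RLVR-based baselines across seven benchmarks and four VLM backbones.
Link: https://arxiv.org/abs/2511.10279
Authors: Yanbei Jiang, Chao Lei, Yihao Ding, Krista Ehinger, Jey Han Lau
Affiliations: University of Melbourne
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: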
Abstract:Despite significant progress, Vision-Language Models (VLMs) still struggle with complex visual reasoning, where multi-step dependencies cause early errors to cascade through the reasoning chain. Existing post-training paradigms are limited: Supervised Fine-Tuning (SFT) relies on costly step-level annotations, while Reinforcement Learning with Verifiable Rewards (RLVR) methods like GRPO provide only sparse, outcome-level feedback, hindering stable optimization. We introduce PROPA (Process-level Reasoning Optimization with interleaved Policy Alignment), a novel framework that integrates Monte Carlo Tree Search (MCTS) with GRPO to generate dense, process-level rewards and optimize reasoning at each intermediate step without human annotations. To overcome the cold-start problem, PROPA interleaves GRPO updates with SFT, enabling the model to learn from both successful and failed reasoning trajectories. A Process Reward Model (PRM) is further trained to guide inference-time search, aligning the test-time search with the training signal. Across seven benchmarks and four VLM backbones, PROPA consistently outperforms both SFT- and RLVR-based baselines. It achieves up to 17.0% gains on in-domain tasks and 21.0% gains on out-of-domain tasks compared to existing state-of-the-art, establishing a strong reasoning and generalization capability for visual reasoning tasks. The code is available at: this https URL.
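One way to see how sampled rollouts can turn a sparse final-answer reward into dense step-level signals is to average outcomes over every reasoning prefix, which is the backup idea behind Monte Carlo tree methods. The sketch below shows only that averaging step; it is not the paper's MCTS or GRPO procedure, and `step_values` is a hypothetical helper.

```python
from collections import defaultdict


def step_values(rollouts):
    """Estimate a value for every reasoning prefix from sampled rollouts.

    rollouts: list of (steps, outcome) pairs, where steps is a sequence of
    hashable step identifiers and outcome is 1 for a verified-correct final
    answer and 0 otherwise.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for steps, outcome in rollouts:
        # Credit the final outcome to every prefix the rollout passed through.
        for k in range(1, len(steps) + 1):
            prefix = tuple(steps[:k])
            totals[prefix] += outcome
            counts[prefix] += 1
    return {prefix: totals[prefix] / counts[prefix] for prefix in totals}


rollouts = [
    (("a", "b"), 1),
    (("a", "c"), 0),
    (("d", "e"), 0),
    (("a", "b"), 1),
]
values = step_values(rollouts)
for prefix in sorted(values):
    print(prefix, round(values[prefix], 3))
```

In this toy data the first step `a` receives a value of about 0.667 while `d` receives 0, so an intermediate step is scored before any final answer is produced.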
[CV-37] H3Former: Hypergraph-based Semantic-Aware Aggregation via Hyperbolic Hierarchical Contrastive Loss for Fine-Grained Visual Classification
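Quick read: This paper addresses the difficulty of Fine-Grained Visual Classification (FGVC), which arises from subtle inter-class differences and large intra-class variations. Existing methods typically rely on feature-selection or region-proposal mechanisms to localize discriminative regions, but they often fail to capture discriminative cues comprehensively and introduce substantial category-agnostic redundancy. The key to the solution is the H3Former framework, built on a Semantic-Aware Aggregation Module (SAAM) and a Hyperbolic Hierarchical Contrastive Loss (HHCL). SAAM uses multi-scale contextual cues to dynamically construct a weighted hypergraph among tokens, and applies hypergraph convolution to capture high-order semantic dependencies and progressively aggregate token features into compact region-level representations. HHCL enforces hierarchical semantic constraints in a non-Euclidean embedding space, enhancing inter-class separability and intra-class consistency while preserving the intrinsic hierarchical relationships among fine-grained categories.
Link: https://arxiv.org/abs/2511.10260
Authors: Yongji Zhang, Siqi Li, Kuiyang Huang, Yue Gao, Yu Jiang
Affiliations: Jilin University; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: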
Abstract:Fine-Grained Visual Classification (FGVC) remains a challenging task due to subtle inter-class differences and large intra-class variations. Existing approaches typically rely on feature-selection mechanisms or region-proposal strategies to localize discriminative regions for semantic analysis. However, these methods often fail to capture discriminative cues comprehensively while introducing substantial category-agnostic redundancy. To address these limitations, we propose H3Former, a novel token-to-region framework that leverages high-order semantic relations to aggregate local fine-grained representations with structured region-level modeling. Specifically, we propose the Semantic-Aware Aggregation Module (SAAM), which exploits multi-scale contextual cues to dynamically construct a weighted hypergraph among tokens. By applying hypergraph convolution, SAAM captures high-order semantic dependencies and progressively aggregates token features into compact region-level representations. Furthermore, we introduce the Hyperbolic Hierarchical Contrastive Loss (HHCL), which enforces hierarchical semantic constraints in a non-Euclidean embedding space. The HHCL enhances inter-class separability and intra-class consistency while preserving the intrinsic hierarchical relationships among fine-grained categories. Comprehensive experiments conducted on four standard FGVC benchmarks validate the superiority of our H3Former framework.
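For readers unfamiliar with hyperbolic embeddings, the sketch below computes the geodesic distance in the Poincaré ball, one common model of hyperbolic space. Whether HHCL uses this model or another (for example the Lorentz model) is not stated in the abstract, so treat the choice as an assumption; `poincare_distance` is an illustrative helper.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball.

    d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    """
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))


origin = (0.0, 0.0)
near = (0.5, 0.0)
far = (0.9, 0.0)
print(poincare_distance(origin, near))
print(poincare_distance(origin, far))
```

Distances grow rapidly toward the boundary of the ball, which is why hyperbolic spaces are often used to embed tree-like category hierarchies with low distortion.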
[CV-38] Facial-R1: Aligning Reasoning and Recognition for Facial Emotion Analysis AAAI2026
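Quick read: This paper addresses two key problems that Vision-Language Models (VLMs) face in Facial Emotion Analysis (FEA). The first is hallucinated reasoning: VLMs generate plausible but inaccurate explanations because they lack emotion-specific knowledge. The second is misalignment between emotion reasoning and recognition: fragmented connections between observed facial features, such as Action Units (AUs), and the final emotion label leave the reasoning inconsistent with the prediction. The key to the solution is Facial-R1, a three-stage alignment framework. It first uses instruction fine-tuning to establish basic emotional reasoning capability; it then applies reinforcement training with emotion and AU labels as reward signals, explicitly aligning the generated reasoning with the predicted emotion; finally, a data synthesis pipeline iteratively leverages the first two stages to expand the training dataset, enabling scalable self-improvement. The authors report state-of-the-art FEA performance with strong generalization and interpretability.
Link: https://arxiv.org/abs/2511.10254
Authors: Jiulong Wu, Yucheng Shen, Lingyong Yan, Haixin Sun, Deguo Xia, Jizhou Huang, Min Cao
Affiliations: University of Science and Technology of China; Huawei Cloud
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper has been accepted by AAAI 2026. 16 pages, 3 figures, 10 tables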
Abstract:Facial Emotion Analysis (FEA) extends traditional facial emotion recognition by incorporating explainable, fine-grained reasoning. The task integrates three subtasks: emotion recognition, facial Action Unit (AU) recognition, and AU-based emotion reasoning to model affective states jointly. While recent approaches leverage Vision-Language Models (VLMs) and achieve promising results, they face two critical limitations: (1) hallucinated reasoning, where VLMs generate plausible but inaccurate explanations due to insufficient emotion-specific knowledge; and (2) misalignment between emotion reasoning and recognition, caused by fragmented connections between observed facial features and final labels. We propose Facial-R1, a three-stage alignment framework that effectively addresses both challenges with minimal supervision. First, we employ instruction fine-tuning to establish basic emotional reasoning capability. Second, we introduce reinforcement training guided by emotion and AU labels as reward signals, which explicitly aligns the generated reasoning process with the predicted emotion. Third, we design a data synthesis pipeline that iteratively leverages the prior stages to expand the training dataset, enabling scalable self-improvement of the model. Built upon this framework, we introduce FEA-20K, a benchmark dataset comprising 17,737 training and 1,688 test samples with fine-grained emotion analysis annotations. Extensive experiments across eight standard benchmarks demonstrate that Facial-R1 achieves state-of-the-art performance in FEA, with strong generalization and robust interpretability.
[CV-39] FineSkiing: A Fine-grained Benchmark for Skiing Action Quality Assessment
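Quick read: This paper addresses two limitations in Action Quality Assessment (AQA): existing methods predict scores from features of the entire video, which limits interpretability and reliability, and existing AQA datasets lack fine-grained score annotations such as deduction items and sub-scores. Alongside a new aerial skiing dataset with fine-grained sub-score and deduction annotations, the key to the solution is a new method named JudgeMind. It first segments the action video into stages and scores each stage to improve accuracy; it then uses a stage-aware feature enhancement and fusion module to strengthen the perception of stage-specific key regions and improve robustness to visual changes caused by frequent camera viewpoint switching; finally, a knowledge-based grade-aware decoder incorporates possible deduction items as prior knowledge to predict more accurate and reliable scores.
Link: https://arxiv.org/abs/2511.10250
Authors: Yongji Zhang, Siqi Li, Yue Gao, Yu Jiang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: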
Abstract:Action Quality Assessment (AQA) aims to evaluate and score sports actions, which has attracted widespread interest in recent years. Existing AQA methods primarily predict scores based on features extracted from the entire video, resulting in limited interpretability and reliability. Meanwhile, existing AQA datasets also lack fine-grained annotations for action scores, especially for deduction items and sub-score annotations. In this paper, we construct the first AQA dataset containing fine-grained sub-score and deduction annotations for aerial skiing, which will be released as a new benchmark. For the technical challenges, we propose a novel AQA method, named JudgeMind, which significantly enhances performance and reliability by simulating the judgment and scoring mindset of professional referees. Our method segments the input action video into different stages and scores each stage to enhance accuracy. Then, we propose a stage-aware feature enhancement and fusion module to boost the perception of stage-specific key regions and enhance the robustness to visual changes caused by frequent camera viewpoints switching. In addition, we propose a knowledge-based grade-aware decoder to incorporate possible deduction items as prior knowledge to predict more accurate and reliable scores. Experimental results demonstrate that our method achieves state-of-the-art performance.
[CV-40] TubeRMC: Tube-conditioned Reconstruction with Mutual Constraints for Weakly-supervised Spatio-Temporal Video Grounding
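Quick read: This paper addresses insufficient text conditioning and inconsistent target tracking in weakly-supervised Spatio-Temporal Video Grounding (STVG). Existing weakly-supervised methods typically use simple late fusion, generating candidate spatio-temporal tubes independently of the text description, which leads to failed target identification and inconsistent tracking. The key to the solution is the Tube-conditioned Reconstruction with Mutual Constraints (TubeRMC) framework: it first generates text-conditioned candidate tubes with pre-trained visual grounding models, then refines them with three tube-conditioned reconstruction strategies (temporal, spatial, and spatio-temporal) that reconstruct the key clues in the query, and introduces mutual constraints between spatial and temporal proposals to improve their quality for reconstruction. This improves tube-text alignment and tracking consistency.
Link: https://arxiv.org/abs/2511.10241
Authors: Jinxuan Li, Yi Zhang, Jian-Fang Hu, Chaolei Tan, Tianming Liang, Beihao Xia
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: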
Abstract:Spatio-Temporal Video Grounding (STVG) aims to localize a spatio-temporal tube that corresponds to a given language query in an untrimmed video. This is a challenging task since it involves complex vision-language understanding and spatiotemporal reasoning. Recent works have explored the weakly-supervised setting in STVG to eliminate reliance on fine-grained annotations like bounding boxes or temporal stamps. However, they typically follow a simple late-fusion manner, which generates tubes independent of the text description, often resulting in failed target identification and inconsistent target tracking. To address this limitation, we propose a Tube-conditioned Reconstruction with Mutual Constraints (TubeRMC) framework that generates text-conditioned candidate tubes with pre-trained visual grounding models and further refines them via tube-conditioned reconstruction with spatio-temporal constraints. Specifically, we design three reconstruction strategies from temporal, spatial, and spatio-temporal perspectives to comprehensively capture rich tube-text correspondences. Each strategy is equipped with a Tube-conditioned Reconstructor, utilizing spatio-temporal tubes as condition to reconstruct the key clues in the query. We further introduce mutual constraints between spatial and temporal proposals to enhance their quality for reconstruction. TubeRMC outperforms existing methods on two public benchmarks, VidSTG and HCSTVG. Further visualization shows that TubeRMC effectively mitigates both target identification errors and inconsistent tracking.
[CV-41] Next-Frame Feature Prediction for Multimodal Deepfake Detection and Temporal Localization
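Quick read: This paper addresses the limited generalization of current multimodal deepfake detection methods across datasets and manipulation types, in particular methods that focus only on audio-visual inconsistencies and therefore fail against manipulations that preserve audio-visual alignment. The key to the solution is a single-stage training framework that incorporates next-frame prediction for both uni-modal and cross-modal features, combined with a window-level attention mechanism that captures local discrepancies between predicted and actual frames. This supports accurate classification of fully manipulated videos and localization of deepfake segments in partially spoofed samples.
Link: https://arxiv.org/abs/2511.10212
Authors: Ashutosh Anshul, Shreyas Gopal, Deepu Rajan, Eng Siong Chng
Affiliations: Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under Review, Multimodal Deepfake detection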
Abstract:Recent multimodal deepfake detection methods designed for generalization conjecture that single-stage supervised training struggles to generalize across unseen manipulations and datasets. However, such approaches that target generalization require pretraining over real samples. Additionally, these methods primarily focus on detecting audio-visual inconsistencies and may overlook intra-modal artifacts causing them to fail against manipulations that preserve audio-visual alignment. To address these limitations, we propose a single-stage training framework that enhances generalization by incorporating next-frame prediction for both uni-modal and cross-modal features. Additionally, we introduce a window-level attention mechanism to capture discrepancies between predicted and actual frames, enabling the model to detect local artifacts around every frame, which is crucial for accurately classifying fully manipulated videos and effectively localizing deepfake segments in partially spoofed samples. Our model, evaluated on multiple benchmark datasets, demonstrates strong generalization and precise temporal localization.
[CV-42] HeatV2X: Scalable Heterogeneous Collaborative Perception via Efficient Alignment and Interaction
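Quick read: This paper addresses two core problems in Vehicle-to-Everything (V2X) collaborative perception: the participating agents are multi-modal and heterogeneous, which makes feature alignment difficult, and existing frameworks are hard to scale to new agents because full-parameter training is too costly. The key to the solution is HeatV2X, a scalable heterogeneous adaptation framework. It first trains a high-performance base agent using heterogeneous graph attention as the starting point for collaborative learning. It then introduces Local Heterogeneous Fine-Tuning, which uses Hetero-Aware Adapters to efficiently extract modality-specific differences, and Global Collaborative Fine-Tuning, which uses a Multi-Cognitive Adapter to enhance cross-agent collaboration and exploit the fusion potential. Together these improve collaborative perception performance at minimal training cost.
Link: https://arxiv.org/abs/2511.10211
Authors: Yueran Zhao, Zhang Zhang, Chao Sun, Tianze Wang, Chao Yue, Nuoran Li
Affiliations: Beijing Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 6 figures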
Abstract:Vehicle-to-Everything (V2X) collaborative perception extends sensing beyond single vehicle limits through transmission. However, as more agents participate, existing frameworks face two key challenges: (1) the participating agents are inherently multi-modal and heterogeneous, and (2) the collaborative framework must be scalable to accommodate new agents. The former requires effective cross-agent feature alignment to mitigate heterogeneity loss, while the latter renders full-parameter training impractical, highlighting the importance of scalable adaptation. To address these issues, we propose Heterogeneous Adaptation (HeatV2X), a scalable collaborative framework. We first train a high-performance agent based on heterogeneous graph attention as the foundation for collaborative learning. Then, we design Local Heterogeneous Fine-Tuning and Global Collaborative Fine-Tuning to achieve effective alignment and interaction among heterogeneous agents. The former efficiently extracts modality-specific differences using Hetero-Aware Adapters, while the latter employs the Multi-Cognitive Adapter to enhance cross-agent collaboration and fully exploit the fusion potential. These designs enable substantial performance improvement of the collaborative framework with minimal training cost. We evaluate our approach on the OPV2V-H and DAIR-V2X datasets. Experimental results demonstrate that our method achieves superior perception performance with significantly reduced training overhead, outperforming existing state-of-the-art approaches. Our implementation will be released soon.
[CV-43] LiNeXt: Revisiting LiDAR Completion with Efficient Non-Diffusion Architectures AAAI2026
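Quick read: This paper addresses the high computational cost of diffusion-based 3D LiDAR point cloud scene completion, whose multi-step iterative sampling limits real-time use in autonomous-driving perception systems. The key to the solution is LiNeXt, a lightweight non-diffusion network. A Noise-to-Coarse (N2C) Module denoises the input noisy point cloud in a single forward pass and produces a coarse point cloud, avoiding multi-step sampling; a Refine Module then uses the coarse point cloud and intermediate features from the N2C Module for more precise refinement, improving structural completeness. Because LiDAR point clouds are dense at close range and sparse at long range, a Distance-aware Selected Repeat strategy generates a more uniformly distributed noisy point cloud. On SemanticKITTI, compared with LiDiff, LiNeXt achieves a 199.8x inference speedup, reduces Chamfer Distance by 50.7%, and uses only 6.1% of the parameters.
Link: https://arxiv.org/abs/2511.10209
Authors: Wenzhe He, Xiaojun Chen, Ruiqi Wang, Ruihui Li, Huilong Pi, Jiapeng Zhang, Zhuo Tang, Kenli Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 13 figures, Accepted to AAAI 2026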
Abstract:3D LiDAR scene completion from point clouds is a fundamental component of perception systems in autonomous vehicles. Previous methods have predominantly employed diffusion models for high-fidelity reconstruction. However, their multi-step iterative sampling incurs significant computational overhead, limiting their real-time applicability. To address this, we propose LiNeXt, a lightweight, non-diffusion network optimized for rapid and accurate point cloud completion. Specifically, LiNeXt first applies the Noise-to-Coarse (N2C) Module to denoise the input noisy point cloud in a single pass, thereby obviating the multi-step iterative sampling of diffusion-based methods. The Refine Module then takes the coarse point cloud and its intermediate features from the N2C Module to perform more precise refinement, further enhancing structural completeness. Furthermore, we observe that LiDAR point clouds exhibit a distance-dependent spatial distribution, being densely sampled at proximal ranges and sparsely sampled at distal ranges. Accordingly, we propose the Distance-aware Selected Repeat strategy to generate a more uniformly distributed noisy point cloud. On the SemanticKITTI dataset, LiNeXt achieves a 199.8x speedup in inference, reduces Chamfer Distance by 50.7%, and uses only 6.1% of the parameters compared with LiDiff. These results demonstrate the superior efficiency and effectiveness of LiNeXt for real-time scene completion.
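Chamfer Distance, the completion metric quoted above, can be sketched in a few lines. Several conventions exist (squared or unsquared distances, sum or mean); the version below uses the mean of squared nearest-neighbour distances in both directions, which is an assumption and may not match the exact variant used in the paper's evaluation.

```python
import math


def chamfer_distance(a, b):
    """Symmetric Chamfer Distance between two point sets.

    Convention assumed here: mean squared nearest-neighbour distance from
    a to b, plus the same from b to a. Brute force, O(len(a) * len(b)).
    """
    def one_way(src, dst):
        return sum(min(math.dist(p, q) ** 2 for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)


completed = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
print(chamfer_distance(completed, reference))
```

For real LiDAR scans with many thousands of points, the brute-force nearest-neighbour search would normally be replaced by a spatial index such as a k-d tree.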
[CV-44] VISTA: A Vision and Intent-Aware Social Attention Framework for Multi-Agent Trajectory Prediction WACV2026
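Quick read: This paper addresses the difficulty of jointly modeling long-term goals and fine-grained social interactions in multi-agent trajectory prediction; existing methods often fail to capture both, which leads to unrealistic multi-agent futures. The key to the solution is VISTA, a recursive goal-conditioned transformer with three components: (1) a cross-attention fusion module that integrates long-horizon intent with past motion; (2) a social-token attention mechanism for flexible interaction modeling across agents; and (3) pairwise attention maps that make social influence patterns interpretable at inference time. The framework turns single-agent goal-conditioned prediction into a coherent multi-agent forecasting framework, and on the MADRAS and SDD benchmarks it substantially reduces collision rates while improving trajectory accuracy.
Link: https://arxiv.org/abs/2511.10203
Authors: Stephane Da Silva Martins, Emanuel Aldea, Sylvie Le Hégarat-Mascle
Affiliations: SATIE - CNRS UMR 8029; Paris-Saclay University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Paper accepted at WACV 2026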
Abstract:Multi-agent trajectory prediction is crucial for autonomous systems operating in dense, interactive environments. Existing methods often fail to jointly capture agents’ long-term goals and their fine-grained social interactions, which leads to unrealistic multi-agent futures. We propose VISTA, a recursive goal-conditioned transformer for multi-agent trajectory forecasting. VISTA combines (i) a cross-attention fusion module that integrates long-horizon intent with past motion, (ii) a social-token attention mechanism for flexible interaction modeling across agents, and (iii) pairwise attention maps that make social influence patterns interpretable at inference time. Our model turns single-agent goal-conditioned prediction into a coherent multi-agent forecasting framework. Beyond standard displacement metrics, we evaluate trajectory collision rates as a measure of joint realism. On the high-density MADRAS benchmark and on SDD, VISTA achieves state-of-the-art accuracy and substantially fewer collisions. On MADRAS, it reduces the average collision rate of strong baselines from 2.14 to 0.03 percent, and on SDD it attains zero collisions while improving ADE, FDE, and minFDE. These results show that VISTA generates socially compliant, goal-aware, and interpretable trajectories, making it promising for safety-critical autonomous systems.
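The displacement and collision metrics mentioned above can be illustrated with a small sketch. ADE and FDE follow their standard definitions; the collision test (two agents closer than twice a hypothetical `radius` at the same time step) is an assumption for illustration, since the abstract does not specify how collisions are counted.

```python
import math


def ade_fde(pred, truth):
    """Average and final displacement error for one predicted trajectory."""
    errors = [math.dist(p, t) for p, t in zip(pred, truth)]
    return sum(errors) / len(errors), errors[-1]


def collides(traj_a, traj_b, radius=0.2):
    """True if two agents overlap at any shared time step.

    Agents are modelled as discs of the given (hypothetical) radius.
    """
    return any(math.dist(p, q) < 2 * radius for p, q in zip(traj_a, traj_b))


pred = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
truth = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ade, fde = ade_fde(pred, truth)
print(ade, fde)

crossing = [(0.0, 2.0), (1.0, 1.0), (2.0, 0.3)]
parallel = [(0.0, 2.0), (1.0, 2.0), (2.0, 2.0)]
print(collides(pred, crossing), collides(pred, parallel))
```

A collision rate is then the share of predicted scenes (or agent pairs) for which such a test fires, which captures joint realism that per-agent ADE and FDE cannot.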
[CV-45] Utilizing a Geospatial Foundation Model for Coastline Delineation in Small Sandy Islands
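Quick read: This paper addresses shoreline delineation for small sandy islands, where high-accuracy monitoring is difficult in data-poor regions. The key to the solution is fine-tuning Prithvi-EO-2.0, the geospatial foundation model developed by NASA and IBM, on small labeled training subsets of 5 to 181 multispectral satellite images. Even with as few as 5 training images the models reach high performance (F1 of 0.94, IoU of 0.79), demonstrating strong transfer learning and the potential of such models to support coastal monitoring in data-poor regions.
Link: https://arxiv.org/abs/2511.10177
Authors: Tishya Chhabra, Manisha Bajpai, Walter Zesk, Skylar Tibbits
Affiliations: Massachusetts Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages, 7 figures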
Abstract:We present an initial evaluation of NASA and IBM’s Prithvi-EO-2.0 geospatial foundation model on shoreline delineation of small sandy islands using satellite images. We curated and labeled a dataset of 225 multispectral images of two Maldivian islands, which we publicly release, and fine-tuned both the 300M and 600M parameter versions of Prithvi on training subsets ranging from 5 to 181 images. Our experiments show that even with as few as 5 training images, the models achieve high performance (F1 of 0.94, IoU of 0.79). Our results demonstrate the strong transfer learning capability of Prithvi, underscoring the potential of such models to support coastal monitoring in data-poor regions.
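For reference, F1 and IoU on a binary mask can be computed as in the sketch below. It treats one flattened mask with a single positive class; how the paper aggregates scores across classes or images is not stated in the abstract, and the handling of empty masks is a convention chosen here.

```python
def f1_and_iou(pred, truth):
    """F1 score and intersection-over-union for flattened binary masks."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    if tp + fp + fn == 0:
        return 1.0, 1.0  # both masks empty: treated as a perfect match here
    f1 = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    return f1, iou


pred_mask = [1, 1, 1, 0, 0, 0]
true_mask = [1, 1, 0, 1, 0, 0]
f1, iou = f1_and_iou(pred_mask, true_mask)
print(round(f1, 3), iou)
```

In this example two pixels are true positives, one is a false positive, and one is a false negative.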
[CV-46] CephRes-MHNet: A Multi-Head Residual Network for Accurate and Robust Cephalometric Landmark Detection
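Quick read: This paper addresses the automatic localization of cephalometric landmarks in 2D lateral skull X-rays. Manual annotation is time-consuming and error-prone, while existing automated approaches often struggle with low contrast and anatomical complexity. The key to the solution is CephRes-MHNet, a multi-head residual convolutional network that integrates residual encoding, dual-attention mechanisms, and multi-head decoders to enhance contextual reasoning and anatomical precision. It achieves a mean radial error (MRE) of 1.23 mm and a success detection rate (SDR) of 85.5% at the 2.0 mm threshold, exceeding the strongest evaluated baseline, AFPF-Net, while using less than 25% of its parameters.
Link: https://arxiv.org/abs/2511.10173
Authors: Ahmed Jaheen, Islam Hassan, Mohanad Abouserie, Abdelaty Rehab, Adham Elasfar, Knzy Elmasry, Mostafa El-Dawlatly, Seif Eldawlatly
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 Pages, Under Review at The IEEE International Symposium on Biomedical Imaging (ISBI 2026)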
Abstract:Accurate localization of cephalometric landmarks from 2D lateral skull X-rays is vital for orthodontic diagnosis and treatment. Manual annotation is time-consuming and error-prone, whereas automated approaches often struggle with low contrast and anatomical complexity. This paper introduces CephRes-MHNet, a multi-head residual convolutional network for robust and efficient cephalometric landmark detection. The architecture integrates residual encoding, dual-attention mechanisms, and multi-head decoders to enhance contextual reasoning and anatomical precision. Trained on the Aariz Cephalometric dataset of 1,000 radiographs, CephRes-MHNet achieved a mean radial error (MRE) of 1.23 mm and a success detection rate (SDR) @ 2.0 mm of 85.5%, outperforming all evaluated models. In particular, it exceeded the strongest baseline, the attention-driven AFPF-Net (MRE = 1.25 mm, SDR @ 2.0 mm = 84.1%), while using less than 25% of its parameters. These results demonstrate that CephRes-MHNet attains state-of-the-art accuracy through architectural efficiency, providing a practical solution for real-world orthodontic analysis.
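The two reported metrics are straightforward to compute once landmark coordinates are expressed in millimetres. The sketch below assumes that conversion has already been done and counts a landmark as detected when its error is at most the threshold; `mre_and_sdr` is an illustrative helper, not code from the paper.

```python
import math


def mre_and_sdr(pred, truth, threshold_mm=2.0):
    """Mean radial error (mm) and success detection rate (%) for landmarks.

    pred, truth: lists of (x, y) landmark positions in millimetres.
    """
    errors = [math.dist(p, t) for p, t in zip(pred, truth)]
    mre = sum(errors) / len(errors)
    sdr = 100.0 * sum(1 for e in errors if e <= threshold_mm) / len(errors)
    return mre, sdr


predicted = [(10.0, 10.0), (20.0, 23.0), (30.0, 30.0)]
annotated = [(10.0, 11.0), (20.0, 20.0), (30.0, 30.0)]
mre, sdr = mre_and_sdr(predicted, annotated)
print(round(mre, 3), round(sdr, 1))
```

Here the radial errors are 1 mm, 3 mm, and 0 mm, so two of the three landmarks fall within the 2.0 mm threshold.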
[CV-47] Physically Interpretable Multi-Degradation Image Restoration via Deep Unfolding and Explainable Convolution
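Quick read: This paper addresses multi-degradation image restoration: real-world images often contain several degradations at once (such as rain, noise, and haze), while most existing methods target a single degradation type, and methods that improve performance by stacking modules often lack interpretability. The key to the solution is an interpretability-driven deep unfolding network that maps the iterative process of a mathematical optimization algorithm into a learnable network structure. An improved second-order semi-smooth Newton algorithm ensures that each module keeps a clear physical interpretation, and an explainable convolution module, inspired by the human brain's flexible information processing, lets the network autonomously adjust its parameters for different inputs. The resulting architecture, InterIR, performs well on multi-degradation restoration while remaining competitive on single-degradation tasks.
Link: https://arxiv.org/abs/2511.10166
Authors: Hu Gao, Xiaoning Lei, Xichen Xu, Depeng Dang, Lizhuang Ma
Affiliations: Shanghai Jiao Tong University; CATL; Beijing Normal University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: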
Abstract:Although image restoration has advanced significantly, most existing methods target only a single type of degradation. In real-world scenarios, images often contain multiple degradations simultaneously, such as rain, noise, and haze, requiring models capable of handling diverse degradation types. Moreover, methods that improve performance through module stacking often suffer from limited interpretability. In this paper, we propose a novel interpretability-driven approach for multi-degradation image restoration, built upon a deep unfolding network that maps the iterative process of a mathematical optimization algorithm into a learnable network structure. Specifically, we employ an improved second-order semi-smooth Newton algorithm to ensure that each module maintains clear physical interpretability. To further enhance interpretability and adaptability, we design an explainable convolution module inspired by the human brain’s flexible information processing and the intrinsic characteristics of images, allowing the network to flexibly leverage learned knowledge and autonomously adjust parameters for different input. The resulting tightly integrated architecture, named InterIR, demonstrates excellent performance in multi-degradation restoration while remaining highly competitive on single-degradation tasks.
[CV-48] GEA: Generation-Enhanced Alignment for Text-to-Image Person Retrieval
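Quick read: This paper targets two problems in Text-to-Image Person Retrieval (TIPR): textual queries sometimes cannot accurately and comprehensively reflect image content, which causes poor cross-modal alignment and overfitting to limited datasets, and the inherent modality gap between text and image makes retrieval harder still. It proposes Generation-Enhanced Alignment (GEA), a solution from a generative perspective with two parallel modules. Text-Guided Token Enhancement (TGTE) introduces diffusion-generated images as intermediate semantic representations, enriching the semantic representation of the text and bridging the gap between text and visual patterns. Generative Intermediate Fusion (GIF) applies cross-attention between generated images, original images, and text features to build a unified representation optimized by a triplet alignment loss.
Link: https://arxiv.org/abs/2511.10154
Authors: Hao Zou, Runqing Zhang, Xue Zhou, Jianxiao Zou
Affiliations: University of Electronic Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages, 3 figures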
Abstract:Text-to-Image Person Retrieval (TIPR) aims to retrieve person images based on natural language descriptions. Although many TIPR methods have achieved promising results, sometimes textual queries cannot accurately and comprehensively reflect the content of the image, leading to poor cross-modal alignment and overfitting to limited datasets. Moreover, the inherent modality gap between text and image further amplifies these issues, making accurate cross-modal retrieval even more challenging. To address these limitations, we propose the Generation-Enhanced Alignment (GEA) from a generative perspective. GEA contains two parallel modules: (1) Text-Guided Token Enhancement (TGTE), which introduces diffusion-generated images as intermediate semantic representations to bridge the gap between text and visual patterns. These generated images enrich the semantic representation of text and facilitate cross-modal alignment. (2) Generative Intermediate Fusion (GIF), which combines cross-attention between generated images, original images, and text features to generate a unified representation optimized by triplet alignment loss. We conduct extensive experiments on three public TIPR datasets, CUHK-PEDES, RSTPReid, and ICFG-PEDES, to evaluate the performance of GEA. The results justify the effectiveness of our method. More implementation details and extended results are available at this https URL.
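[CV-49] Decoupling Bias, Aligning Distributions: Synergistic Fairness Optimization for Deepfake Detection
Quick read: This paper addresses bias in deepfake detection models across demographic groups such as gender and race, which can cause systemic misjudgments and deepen the digital divide and social inequity. Existing fairness-enhanced detectors usually trade detection accuracy for fairness. The key to the solution is a dual-mechanism collaborative optimization framework: structural fairness decoupling separates the channels that are sensitive to demographic attributes at the architecture level, and global distribution alignment then reduces, at the feature level, the distance between the overall sample distribution and each group's distribution. Together they improve inter-group and intra-group fairness while maintaining cross-domain detection accuracy.
Link: https://arxiv.org/abs/2511.10150
Authors: Feng Ding, Wenhui Yi, Yunpeng Zhou, Xinan He, Hong Rao, Shu Hu
Affiliations: Nanchang University; Guangdong Provincial Key Laboratory of Intelligent Information Processing; Shenzhen Key Laboratory of Media Security; SZU–AFS Joint Innovation Center for AI Technology; Shenzhen University; Purdue University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: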
Abstract:Fairness is a core element in the trustworthy deployment of deepfake detection models, especially in the field of digital identity security. Biases in detection models toward different demographic groups, such as gender and race, may lead to systemic misjudgments, exacerbating the digital divide and social inequities. However, current fairness-enhanced detectors often improve fairness at the cost of detection accuracy. To address this challenge, we propose a dual-mechanism collaborative optimization framework. Our proposed method innovatively integrates structural fairness decoupling and global distribution alignment: decoupling channels sensitive to demographic groups at the model architectural level, and subsequently reducing the distance between the overall sample distribution and the distributions corresponding to each demographic group at the feature level. Experimental results demonstrate that, compared with other methods, our framework improves both inter-group and intra-group fairness while maintaining overall detection accuracy across domains.
[CV-50] Split-Layer: Enhancing Implicit Neural Representation by Maximizing the Dimensionality of Feature Space AAAI2026
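Quick read: This paper addresses the limited representational capacity of implicit neural representation (INR) models: conventional multilayer perceptron (MLP) architectures have a low-dimensional feature space, which restricts the functions they can fit. The key idea is the split-layer, which divides each layer into multiple parallel branches and merges their outputs with a Hadamard product. This effectively constructs a high-degree polynomial space and greatly expands the feature-space dimensionality without prohibitive computational overhead, strengthening the representational capacity of INR.
Link: https://arxiv.org/abs/2511.10142
Authors: Zhicheng Cai, Hao Zhu, Linsen Chen, Qiu Shen, Xun Cao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: AAAI 2026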
Abstract:Implicit neural representation (INR) models signals as continuous functions using neural networks, offering efficient and differentiable optimization for inverse problems across diverse disciplines. However, the representational capacity of INR, defined by the range of functions the neural network can characterize, is inherently limited by the low-dimensional feature space in conventional multilayer perceptron (MLP) architectures. While widening the MLP can linearly increase feature space dimensionality, it also leads to a quadratic growth in computational and memory costs. To address this limitation, we propose the split-layer, a novel reformulation of MLP construction. The split-layer divides each layer into multiple parallel branches and integrates their outputs via Hadamard product, effectively constructing a high-degree polynomial space. This approach significantly enhances INR’s representational capacity by expanding the feature space dimensionality without incurring prohibitive computational overhead. Extensive experiments demonstrate that the split-layer substantially improves INR performance, surpassing existing methods across multiple tasks, including 2D image fitting, 2D CT reconstruction, 3D shape representation, and 5D novel view synthesis.
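The branch-and-multiply idea in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the names `linear` and `split_layer`, the two-branch setup, and the omission of activation functions are assumptions made only to show how a Hadamard product of parallel affine branches produces higher-degree terms.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def linear(weights: Matrix, bias: Vector, x: Vector) -> Vector:
    """Plain affine map: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def split_layer(branches: List[Tuple[Matrix, Vector]], x: Vector) -> Vector:
    """Run every branch on the same input and merge with a Hadamard product."""
    out = None
    for weights, bias in branches:
        y = linear(weights, bias, x)
        out = y if out is None else [a * b for a, b in zip(out, y)]
    return out


# Two hypothetical branches, each mapping a 2-D input to a single output.
branch_a = ([[1.0, 0.0]], [0.0])  # selects x0
branch_b = ([[0.0, 1.0]], [0.0])  # selects x1
product = split_layer([branch_a, branch_b], [3.0, 4.0])  # x0 * x1, a degree-2 term
```

Each branch alone is linear in the input, but their elementwise product contains the cross term `x0 * x1`, which a single affine layer of the same width cannot express.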
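[CV-51] Right Looks, Wrong Reasons: Compositional Fidelity in Text-to-Image Generation AAAI2026
Quick read: This paper examines a fundamental weakness of today's leading text-to-image models in logical composition: they struggle to combine three core primitives, namely negation, counting, and spatial relations. Models that are accurate on a single primitive collapse when primitives are combined, exposing severe interference. The paper traces the failure to three factors and argues that each must be addressed: training data need explicit logical structures such as negation; representations should suit discrete logic rather than rely on continuous attention; and evaluation metrics should enforce constraint satisfaction instead of rewarding visual plausibility. It concludes that current methods and simple scaling cannot close the gap, and that genuine compositionality requires fundamental advances in representation and reasoning.
Link: https://arxiv.org/abs/2511.10136
Authors: Mayank Vatsa, Aparna Bharati, Richa Singh
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted in AAAI 2026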
Abstract:The architectural blueprint of today’s leading text-to-image models contains a fundamental flaw: an inability to handle logical composition. This survey investigates this breakdown across three core primitives: negation, counting, and spatial relations. Our analysis reveals a dramatic performance collapse: models that are accurate on single primitives fail precipitously when these are combined, exposing severe interference. We trace this failure to three key factors. First, training data show a near-total absence of explicit negations. Second, continuous attention architectures are fundamentally unsuitable for discrete logic. Third, evaluation metrics reward visual plausibility over constraint satisfaction. By analyzing recent benchmarks and methods, we show that current solutions and simple scaling cannot bridge this gap. Achieving genuine compositionality, we conclude, will require fundamental advances in representation and reasoning rather than incremental adjustments to existing architectures.
[CV-52] Explicit Temporal-Semantic Modeling for Dense Video Captioning via Context-Aware Cross-Modal Interaction AAAI2026
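Quick read: This paper addresses a weakness of existing dense video captioning methods: they rely on implicit modeling, which fails to capture temporal coherence across event sequences and the full semantics of visual context. It proposes an explicit temporal-semantic modeling framework, Context-Aware Cross-Modal Interaction (CACMI), with two core components. Cross-modal Frame Aggregation uses cross-modal retrieval to extract temporally coherent, event-aligned textual features. Context-aware Feature Enhancement uses query-guided attention to integrate visual dynamics with pseudo-event semantics, improving the model's understanding of complex video content.
Link: https://arxiv.org/abs/2511.10134
Authors: Mingda Jia, Weiliang Meng, Zenghuang Fu, Yiheng Li, Qi Zeng, Yifan Zhang, Ju Xin, Rongtao Xu, Jiguang Zhang, Xiaopeng Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to AAAI 2026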
Abstract:Dense video captioning jointly localizes and captions salient events in untrimmed videos. Recent methods primarily focus on leveraging additional prior knowledge and advanced multi-task architectures to achieve competitive performance. However, these pipelines rely on implicit modeling that uses frame-level or fragmented video features, failing to capture the temporal coherence across event sequences and comprehensive semantics within visual contexts. To address this, we propose an explicit temporal-semantic modeling framework called Context-Aware Cross-Modal Interaction (CACMI), which leverages both latent temporal characteristics within videos and linguistic semantics from a text corpus. Specifically, our model consists of two core components: Cross-modal Frame Aggregation aggregates relevant frames to extract temporally coherent, event-aligned textual features through cross-modal retrieval; and Context-aware Feature Enhancement utilizes query-guided attention to integrate visual dynamics with pseudo-event semantics. Extensive experiments on the ActivityNet Captions and YouCook2 datasets demonstrate that CACMI achieves state-of-the-art performance on the dense video captioning task.
[CV-53] RobIA: Robust Instance-aware Continual Test-time Adaptation for Deep Stereo NEURIPS
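Quick read: This paper addresses the challenges of stereo depth estimation in real-world scenes: dynamic domain shifts, sparse or unreliable supervision, and the high cost of dense ground-truth labels. It proposes RobIA, a robust, instance-aware framework for Continual Test-Time Adaptation (CTTA), with two core components: (1) the Attend-and-Excite Mixture-of-Experts (AttEx-MoE) module, which dynamically routes inputs to frozen experts through a lightweight self-attention mechanism tailored to epipolar geometry; and (2) the Robust AdaptBN Teacher, a teacher model based on parameter-efficient fine-tuning (PEFT) that provides dense pseudo-supervision by complementing sparse handcrafted labels. The strategy offers input-specific flexibility and broad supervision coverage, improving generalization under continual domain shift while remaining computationally efficient.
Link: https://arxiv.org/abs/2511.10107
Authors: Jueun Ko, Hyewon Park, Hyesong Choi, Dongbo Min
Affiliations: Ewha Womans University; Soongsil University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by Neural Information Processing Systems (NeurIPS) 2025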
Abstract:Stereo Depth Estimation in real-world environments poses significant challenges due to dynamic domain shifts, sparse or unreliable supervision, and the high cost of acquiring dense ground-truth labels. While recent Test-Time Adaptation (TTA) methods offer promising solutions, most rely on static target domain assumptions and input-invariant adaptation strategies, limiting their effectiveness under continual shifts. In this paper, we propose RobIA, a novel Robust, Instance-Aware framework for Continual Test-Time Adaptation (CTTA) in stereo depth estimation. RobIA integrates two key components: (1) Attend-and-Excite Mixture-of-Experts (AttEx-MoE), a parameter-efficient module that dynamically routes input to frozen experts via a lightweight self-attention mechanism tailored to epipolar geometry, and (2) Robust AdaptBN Teacher, a PEFT-based teacher model that provides dense pseudo-supervision by complementing sparse handcrafted labels. This strategy enables input-specific flexibility and broad supervision coverage, improving generalization under domain shift. Extensive experiments demonstrate that RobIA achieves superior adaptation performance across dynamic target domains while maintaining computational efficiency.
[CV-54] MTAttack: Multi-Target Backdoor Attacks against Large Vision-Language Models AAAI2026
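Quick read: This paper studies multi-target backdoor attacks on Large Visual Language Models (LVLMs), in which multiple independent triggers, each tied to a different malicious output, are implanted in a single training pass, posing a greater real-world threat than the single-target attacks studied so far. Such attacks are hard to execute because feature interference among triggers produces incorrect trigger-target mappings. The proposed MTAttack framework centers on an optimization method with two constraints, Proxy Space Partitioning and Trigger Prototype Anchoring. It jointly optimizes the triggers in latent space so that each trigger independently maps clean images to a unique proxy class while the triggers remain separable, yielding accurate multi-trigger mappings, strong generalization across datasets, and robustness against backdoor defenses.
Link: https://arxiv.org/abs/2511.10098
Authors: Zihan Wang, Guansong Pang, Wenjun Miao, Jin Zheng, Xiao Bai
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: AAAI 2026, with supplementary material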
Abstract:Recent advances in Large Visual Language Models (LVLMs) have demonstrated impressive performance across various vision-language tasks by leveraging large-scale image-text pretraining and instruction tuning. However, the security vulnerabilities of LVLMs have become increasingly concerning, particularly their susceptibility to backdoor attacks. Existing backdoor attacks focus on single-target attacks, i.e., targeting a single malicious output associated with a specific trigger. In this work, we uncover multi-target backdoor attacks, where multiple independent triggers corresponding to different attack targets are added in a single pass of training, posing a greater threat to LVLMs in real-world applications. Executing such attacks in LVLMs is challenging since there can be many incorrect trigger-target mappings due to severe feature interference among different triggers. To address this challenge, we propose MTAttack, the first multi-target backdoor attack framework for enforcing accurate multiple trigger-target mappings in LVLMs. The core of MTAttack is a novel optimization method with two constraints, namely Proxy Space Partitioning constraint and Trigger Prototype Anchoring constraint. It jointly optimizes multiple triggers in the latent space, with each trigger independently mapping clean images to a unique proxy class while at the same time guaranteeing their separability. Experiments on popular benchmarks demonstrate a high success rate of MTAttack for multi-target attacks, substantially outperforming existing attack methods. Furthermore, our attack exhibits strong generalizability across datasets and robustness against backdoor defense strategies. These findings highlight the vulnerability of LVLMs to multi-target backdoor attacks and underscore the urgent need for mitigating such threats. Code is available at this https URL.
[CV-55] How does My Model Fail? Automatic Identification and Interpretation of Physical Plausibility Failure Modes with Matryoshka Transcoders
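Quick read: This paper addresses two gaps: physical plausibility failures in generative model outputs often escape existing evaluation methods, and no framework automatically identifies and interprets these failure patterns. It proposes Matryoshka Transcoders, which extend the Matryoshka representation learning paradigm to transcoder architectures, enabling hierarchical sparse feature learning at multiple levels of granularity. Trained on the intermediate representations of a physical plausibility classifier and interpreted with large multimodal models, the method discovers diverse physics-related failure modes without manual feature engineering and achieves better feature relevance and accuracy than existing approaches.
Link: https://arxiv.org/abs/2511.10094
Authors: Yiming Tang, Abhijeet Sinha, Dianbo Liu
Affiliations: National University of Singapore
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures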
Abstract:Although recent generative models are remarkably capable of producing instruction-following and realistic outputs, they remain prone to notable physical plausibility failures. Though critical in applications, these physical plausibility errors often escape detection by existing evaluation methods. Furthermore, no framework exists for automatically identifying and interpreting specific physical error patterns in natural language, preventing targeted model improvements. We introduce Matryoshka Transcoders, a novel framework for the automatic discovery and interpretation of physical plausibility features in generative models. Our approach extends the Matryoshka representation learning paradigm to transcoder architectures, enabling hierarchical sparse feature learning at multiple granularity levels. By training on intermediate representations from a physical plausibility classifier and leveraging large multimodal models for interpretation, our method identifies diverse physics-related failure modes without manual feature engineering, achieving superior feature relevance and feature accuracy compared to existing approaches. We utilize the discovered visual patterns to establish a benchmark for evaluating physical plausibility in generative models. Our analysis of eight state-of-the-art generative models provides valuable insights into how these models fail to follow physical constraints, paving the way for further model improvements.
[CV-56] SUGAR: Learning Skeleton Representation with Visual-Motion Knowledge for Action Recognition AAAI2026
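Quick read: This paper explores how to combine Large Language Models (LLMs) with human skeleton data for action classification and description. Two questions arise: (1) how can an LLM understand skeleton sequences, and (2) how can it distinguish among actions? The proposed paradigm, SUGAR (learning Skeleton representation with visUal-motion knowledGe for Action Recognition), uses off-the-shelf large-scale video models as a knowledge base of visual and motion priors, supervises skeleton learning with this knowledge to obtain discrete representations, and lets an LLM with untouched pre-trained weights interpret these representations to generate action labels and descriptions. A Temporal Query Projection (TQP) module models long skeleton sequences. SUGAR performs well on several skeleton-based action classification benchmarks and proves more versatile than linear-based methods in zero-shot scenarios.
Link: https://arxiv.org/abs/2511.10091
Authors: Qilang Ye, Yu Zhou, Lian He, Jie Zhang, Xuanming Guo, Jiayu Zhang, Mingkui Tan, Weicheng Xie, Yue Sun, Tao Tan, Xiaochen Yuan, Ghada Khoriba, Zitong Yu
Affiliations: Nankai University; Guangdong University of Foreign Studies
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2026 Main Track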
Abstract:Large Language Models (LLMs) hold rich implicit knowledge and powerful transferability. In this paper, we explore the combination of LLMs with the human skeleton to perform action classification and description. However, when treating an LLM as a recognizer, two questions arise: 1) How can LLMs understand skeletons? 2) How can LLMs distinguish among actions? To address these problems, we introduce a novel paradigm named learning Skeleton representation with visUal-motion knowledGe for Action Recognition (SUGAR). In our pipeline, we first utilize off-the-shelf large-scale video models as a knowledge base to generate visual and motion information related to actions. Then, we propose to supervise skeleton learning through this prior knowledge to yield discrete representations. Finally, we use the LLM with untouched pre-training weights to understand these representations and generate the desired action targets and descriptions. Notably, we present a Temporal Query Projection (TQP) module to continuously model the skeleton signals with long sequences. Experiments on several skeleton-based action classification benchmarks demonstrate the efficacy of our SUGAR. Moreover, experiments on zero-shot scenarios show that SUGAR is more versatile than linear-based methods.
[CV-57] XIAA: eXplainable Injections for Adversarial Attack
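Quick read: This paper exposes a security vulnerability in post-hoc explainable AI (XAI) methods in the image domain: adversarial perturbations can alter a model's explanations (saliency maps, integrated gradients, and DeepLIFT SHAP) without being visible to the human eye and without changing the predicted class. The proposed attack is black-box, model-agnostic, and single-step. It needs only the model's predictions and explanations, with no access to the model or its weights, yet it changes the explanations significantly while keeping the prediction unchanged, which calls into question the reliability of current XAI methods in safety-critical settings.
Link: https://arxiv.org/abs/2511.10088
Authors: Leonardo Pesce, Jiawen Wei, Gianmarco Mengaldo
Affiliations: National University of Singapore; Imperial College London
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: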
Abstract:Post-hoc explainability methods are a subset of Machine Learning (ML) that aim to provide a reason for why a model behaves in a certain way. In this paper, we show a new black-box model-agnostic adversarial attack for post-hoc explainable Artificial Intelligence (XAI), particularly in the image domain. The goal of the attack is to modify the original explanations while being undetected by the human eye and maintaining the same predicted class. In contrast to previous methods, we do not require any access to the model or its weights, but only to the model’s computed predictions and explanations. Additionally, the attack is accomplished in a single step while significantly changing the provided explanations, as demonstrated by empirical evaluation. The low requirements of our method expose a critical vulnerability in current explainability methods, raising concerns about their reliability in safety-critical applications. We systematically generate attacks based on the explanations generated by post-hoc explainability methods (saliency maps, integrated gradients, and DeepLIFT SHAP) for pretrained ResNet-18 and ViT-B16 on ImageNet. The results show that our attacks could lead to dramatically different explanations without changing the predictive probabilities. We validate the effectiveness of our attack, compute the induced change based on the explanation with mean absolute difference, and verify the closeness of the original image and the corrupted one with the Structural Similarity Index Measure (SSIM).
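[CV-58] GridPrune: From “Where to Look” to “What to Select” in Visual Token Pruning for MLLMs
Quick read: This paper addresses the computational overhead caused by the large number of visual tokens in multimodal large language models (MLLMs). Existing visual token pruning methods mainly optimize “what to select” directly, using attention scores or similarity metrics, and neglect the two-stage strategy of human vision, which first decides “where to look” and only then selects details. The result is inefficient spatial allocation, positional bias, and retention of redundant tokens. GridPrune adopts a “guide-globally, select-locally” zonal strategy: it first uses text-conditional guidance to allocate the token budget dynamically across spatial zones, then performs local selection within each zone. On LLaVA-NeXT-7B, GridPrune retains 96.98% of full performance with only 11.1% of the tokens and outperforms the best baseline by 2.34% at the same pruning rate.
Link: https://arxiv.org/abs/2511.10081
Authors: Yuxiang Duan, Ao Li, Yingqin Li, Luyu Li, Pengwei Wang
Affiliations: Shandong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: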
Abstract:Multimodal large language models (MLLMs) have shown remarkable capabilities in a wide range of vision-language tasks. However, the large number of visual tokens introduces significant computational overhead. To address this issue, visual token pruning has emerged as a key technique for enhancing the efficiency of MLLMs. In cognitive science, humans tend to first determine which regions of a scene to attend to (“where to look”) before deciding which specific elements within those regions to process in detail (“what to select”). This two-stage strategy enables the visual system to efficiently allocate attention at a coarse spatial level before performing fine-grained selection. However, existing pruning methods primarily focus on directly optimizing “what to select”, typically using attention scores or similarity metrics. They rarely consider “where to look”, which has been shown to lead to inefficient spatial allocation, positional bias, and the retention of irrelevant or redundant tokens. In this paper, we propose GridPrune, a method that replaces the global Top-K mechanism with a “guide-globally, select-locally” zonal selection system. GridPrune splits the pruning process into two steps: first, it uses text-conditional guidance to dynamically allocate a token budget across spatial zones; and then, it performs local selection within each budgeted zone. Experimental results demonstrate that GridPrune achieves superior performance across various MLLM architectures. On LLaVA-NeXT-7B, GridPrune retains 96.98% of the full performance while using 11.1% of the tokens, outperforming the best-performing baseline by 2.34% at the same pruning rate.
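The two-step selection described in this abstract can be sketched roughly as follows. The abstract does not say how zone relevance or token scores are computed, so both are passed in as plain numbers here; the function names, the proportional split, and the largest-remainder rounding are all assumptions for illustration rather than the GridPrune algorithm itself.

```python
from typing import Dict, List


def allocate_budget(zone_relevance: List[float], total_budget: int) -> List[int]:
    """Split the token budget across zones in proportion to their relevance."""
    total = sum(zone_relevance)
    raw = [total_budget * r / total for r in zone_relevance]
    budget = [int(v) for v in raw]
    # Hand out the tokens lost to rounding, largest remainder first.
    leftovers = sorted(range(len(raw)), key=lambda i: raw[i] - budget[i], reverse=True)
    for i in leftovers[: total_budget - sum(budget)]:
        budget[i] += 1
    return budget


def zonal_prune(token_scores: List[float], token_zone: List[int],
                zone_relevance: List[float], total_budget: int) -> List[int]:
    """Return indices of kept tokens: the top-scoring tokens inside each zone.

    Zone ids are assumed to be 0 .. len(zone_relevance) - 1.
    """
    budget = allocate_budget(zone_relevance, total_budget)
    by_zone: Dict[int, List[int]] = {}
    for idx, zone in enumerate(token_zone):
        by_zone.setdefault(zone, []).append(idx)
    kept: List[int] = []
    for zone, indices in by_zone.items():
        ranked = sorted(indices, key=lambda i: token_scores[i], reverse=True)
        kept.extend(ranked[: budget[zone]])
    return sorted(kept)


# Eight tokens in two zones; the (hypothetical) text guidance favours zone 1.
scores = [0.9, 0.8, 0.7, 0.1, 0.2, 0.3, 0.6, 0.5]
zones = [0, 0, 0, 0, 1, 1, 1, 1]
kept = zonal_prune(scores, zones, zone_relevance=[1.0, 3.0], total_budget=4)
```

In this toy input a global Top-K over the scores would keep tokens 0, 1, 2, and 6, almost all from zone 0, whereas the zonal version keeps one token from zone 0 and three from zone 1. A zone holding fewer tokens than its budget simply keeps all of them, so the sketch may return fewer tokens than the budget.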
[CV-59] Mitigating Error Accumulation in Co-Speech Motion Generation via Global Rotation Diffusion and Multi-Level Constraints AAAI2026
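Quick read: This paper addresses error accumulation in existing co-speech motion generation methods, which operate on local joint rotations defined hierarchically along the skeleton and therefore produce unstable, implausible motion at the end-effectors. It proposes GlobalDiff, the first diffusion-based framework to operate directly in the space of global joint rotations, decoupling each joint's prediction from upstream joints and avoiding hierarchical error propagation. To compensate for the missing structural priors in global rotation space, it adds a multi-level constraint scheme with joint-structure, skeleton-structure, and temporal-structure constraints, improving the structural consistency and temporal accuracy of the generated motion.
Link: https://arxiv.org/abs/2511.10076
Authors: Xiangyue Zhang, Jianfang Li, Jianqiang Ren, Jiaxu Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: AAAI 2026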
Abstract:Reliable co-speech motion generation requires precise motion representation and consistent structural priors across all joints. Existing generative methods typically operate on local joint rotations, which are defined hierarchically based on the skeleton structure. This leads to cumulative errors during generation, manifesting as unstable and implausible motions at end-effectors. In this work, we propose GlobalDiff, a diffusion-based framework that operates directly in the space of global joint rotations for the first time, fundamentally decoupling each joint’s prediction from upstream dependencies and alleviating hierarchical error accumulation. To compensate for the absence of structural priors in global rotation space, we introduce a multi-level constraint scheme. Specifically, a joint structure constraint introduces virtual anchor points around each joint to better capture fine-grained orientation. A skeleton structure constraint enforces angular consistency across bones to maintain structural integrity. A temporal structure constraint utilizes a multi-scale variational encoder to align the generated motion with ground-truth temporal patterns. These constraints jointly regularize the global diffusion process and reinforce structural awareness. Extensive evaluations on standard co-speech benchmarks show that GlobalDiff generates smooth and accurate motions, improving the performance by 46.0% compared to the current SOTA under multiple speaker identities.
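The error-accumulation argument in this abstract can be checked on a toy example. The sketch below uses a planar chain of three unit-length bones, which is an assumption made for simplicity (real skeletons use 3D rotations and branching hierarchies), and it only compares the two parameterizations; it does not model the diffusion process or the constraints.

```python
import math
from typing import List, Tuple


def local_to_global(local_angles: List[float]) -> List[float]:
    """Each bone's global angle is the running sum of the local angles above it."""
    total, out = 0.0, []
    for a in local_angles:
        total += a
        out.append(total)
    return out


def end_effector(global_angles: List[float]) -> Tuple[float, float]:
    """Tip of a planar chain of unit-length bones given each bone's global angle."""
    x = sum(math.cos(a) for a in global_angles)
    y = sum(math.sin(a) for a in global_angles)
    return x, y


def tip_error(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    return math.hypot(a[0] - b[0], a[1] - b[1])


eps = 0.1  # the same angular error, injected at the root joint in both cases
truth = end_effector(local_to_global([0.0, 0.0, 0.0]))  # straight chain

# Local parameterization: every downstream bone inherits the root error.
err_local = tip_error(truth, end_effector(local_to_global([eps, 0.0, 0.0])))

# Global parameterization: only the root bone is rotated.
err_global = tip_error(truth, end_effector([eps, 0.0, 0.0]))
```

For this straight chain the end-effector error under local rotations is three times the error under global rotations, one factor per bone below the perturbed joint, which is the hierarchical accumulation the abstract refers to.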
[CV-60] VLF-MSC: Vision-Language Feature-Based Multimodal Semantic Communication System NEURIPS2025
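Quick read: This paper addresses the low spectral efficiency, redundant transmission, and poor adaptability of conventional semantic communication systems that process image and text modalities separately. It proposes the Vision-Language Feature-based Multimodal Semantic Communication (VLF-MSC) framework. A pre-trained vision-language model (VLM) encodes the source image into a unified vision-language semantic feature (VLF), which is sent over the wireless channel as a single compact representation. At the receiver, a decoder-based language model and a diffusion-based image generator, both conditioned on the VLF, produce a descriptive text and a semantically aligned image. This improves semantic fidelity and bandwidth utilization at low signal-to-noise ratio (SNR).
Link: https://arxiv.org/abs/2511.10074
Authors: Gwangyeon Ahn, Jiwan Seo, Joonhyuk Kang
Affiliations: Korea Advanced Institute of Science and Technology (KAIST)
Categories: Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
Comments: To appear in the AI4NextG Workshop at NeurIPS 2025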
Abstract:We propose Vision-Language Feature-based Multimodal Semantic Communication (VLF-MSC), a unified system that transmits a single compact vision-language representation to support both image and text generation at the receiver. Unlike existing semantic communication techniques that process each modality separately, VLF-MSC employs a pre-trained vision-language model (VLM) to encode the source image into a vision-language semantic feature (VLF), which is transmitted over the wireless channel. At the receiver, a decoder-based language model and a diffusion-based image generator are both conditioned on the VLF to produce a descriptive text and a semantically aligned image. This unified representation eliminates the need for modality-specific streams or retransmissions, improving spectral efficiency and adaptability. By leveraging foundation models, the system achieves robustness to channel noise while preserving semantic fidelity. Experiments demonstrate that VLF-MSC outperforms text-only and image-only baselines, achieving higher semantic accuracy for both modalities under low SNR with significantly reduced bandwidth.
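To make the "low SNR" setting concrete, the sketch below passes a feature vector through an additive white Gaussian noise (AWGN) channel at a chosen SNR. The abstract only mentions a wireless channel, so the AWGN model, the real-valued feature, and the names `awgn` and `measured_snr_db` are assumptions; the random vector merely stands in for a VLF.

```python
import math
import random
from typing import List


def awgn(feature: List[float], snr_db: float, rng: random.Random) -> List[float]:
    """Add white Gaussian noise so the expected signal-to-noise ratio is snr_db."""
    signal_power = sum(v * v for v in feature) / len(feature)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [v + rng.gauss(0.0, sigma) for v in feature]


def measured_snr_db(clean: List[float], noisy: List[float]) -> float:
    """Empirical SNR in dB between a clean vector and its noisy copy."""
    signal = sum(v * v for v in clean)
    noise = sum((n - c) ** 2 for c, n in zip(clean, noisy))
    return 10 * math.log10(signal / noise)


rng = random.Random(0)
feature = [rng.uniform(-1.0, 1.0) for _ in range(20000)]  # stand-in for a VLF vector
received = awgn(feature, snr_db=0.0, rng=rng)
```

At 0 dB the noise carries as much power as the signal, so whatever is decoded from `received` has to tolerate heavy corruption; `measured_snr_db(feature, received)` should come out close to the requested value for a vector this long.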
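[CV-61] Perceive, Act and Correct: Confidence Is Not Enough for Hyperspectral Classification AAAI2026
Quick read: This paper addresses misclassification in hyperspectral image classification caused by over-reliance on confidence: models mistake high predictive scores for correctness and lack awareness of uncertainty. Under sparse annotations or class imbalance this produces confirmation bias, so models overfit confident errors and fail to generalize. The proposed CABIN (Cognitive-Aware Behavior-Informed learNing) is a closed-loop semi-supervised framework with three stages: perception, action, and correction. It first estimates epistemic uncertainty to identify error-prone regions. It then applies an Uncertainty-Guided Dual Sampling Strategy, exploring uncertain samples while anchoring confident ones as stable pseudo-labels to reduce bias. Finally, a Fine-Grained Dynamic Assignment Strategy divides pseudo-labeled data into reliable, ambiguous, and noisy subsets and applies tailored losses to improve generalization.
Link: https://arxiv.org/abs/2511.10068
Authors: Muzhou Yang, Wuzhou Quan, Mingqiang Wei
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to AAAI 2026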
Abstract:Confidence alone is often misleading in hyperspectral image classification, as models tend to mistake high predictive scores for correctness while lacking awareness of uncertainty. This leads to confirmation bias, especially under sparse annotations or class imbalance, where models overfit confident errors and fail to generalize. We propose CABIN (Cognitive-Aware Behavior-Informed learNing), a semi-supervised framework that addresses this limitation through a closed-loop learning process of perception, action, and correction. CABIN first develops perceptual awareness by estimating epistemic uncertainty, identifying ambiguous regions where errors are likely to occur. It then acts by adopting an Uncertainty-Guided Dual Sampling Strategy, selecting uncertain samples for exploration while anchoring confident ones as stable pseudo-labels to reduce bias. To correct noisy supervision, CABIN introduces a Fine-Grained Dynamic Assignment Strategy that categorizes pseudo-labeled data into reliable, ambiguous, and noisy subsets, applying tailored losses to enhance generalization. Experimental results show that a wide range of state-of-the-art methods benefit from the integration of CABIN, with improved labeling efficiency and performance.
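One common way to estimate epistemic uncertainty is the mutual information between the prediction and the model, computed from several stochastic forward passes. The sketch below uses that estimator and then buckets a pseudo-labeled sample by two thresholds. The abstract does not specify CABIN's estimator or its assignment rule, so the estimator choice, the thresholds `low` and `high`, and all names are hypothetical.

```python
import math
from typing import List


def entropy(p: List[float]) -> float:
    """Shannon entropy in nats, skipping zero-probability classes."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def epistemic_uncertainty(passes: List[List[float]]) -> float:
    """Mutual information: entropy of the mean prediction minus the mean entropy."""
    n_classes = len(passes[0])
    mean_p = [sum(p[c] for p in passes) / len(passes) for c in range(n_classes)]
    return entropy(mean_p) - sum(entropy(p) for p in passes) / len(passes)


def assign(passes: List[List[float]], low: float = 0.05, high: float = 0.3) -> str:
    """Bucket a pseudo-labeled sample by how much the stochastic passes disagree."""
    u = epistemic_uncertainty(passes)
    if u < low:
        return "reliable"
    return "ambiguous" if u < high else "noisy"


# Class probabilities from two hypothetical stochastic passes per sample.
agree = [[0.9, 0.1], [0.9, 0.1]]      # passes agree
mild = [[0.9, 0.1], [0.6, 0.4]]       # passes partly disagree
disagree = [[1.0, 0.0], [0.0, 1.0]]   # passes contradict each other
labels = [assign(agree), assign(mild), assign(disagree)]
```

Note that each pass in `disagree` is fully confident, yet the sample is flagged as noisy because the passes contradict each other, which is the sense in which confidence alone can mislead.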
[CV-62] Multivariate Gaussian Representation Learning for Medical Action Evaluation AAAI2026
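Quick read: This paper addresses three challenges of fine-grained action evaluation in medical vision: the lack of comprehensive datasets, stringent precision requirements, and insufficient spatiotemporal modeling of very rapid actions. It proposes GaussMedAct, a multivariate Gaussian encoding framework that advances medical motion analysis through adaptive spatiotemporal representation learning. The method projects joint motions into a temporally scaled multi-dimensional space and decomposes actions into adaptive 3D Gaussians that serve as tokens; anisotropic covariance modeling preserves motion semantics and adds robustness to spatiotemporal noise. Hybrid Spatial Encoding, with a Cartesian and Vector dual-stream strategy, exploits both joint and bone features. On the CPREval-6k benchmark the method reaches 92.1% Top-1 accuracy with real-time inference, exceeding the ST-GCN baseline by 5.9% accuracy at only 10% of the FLOPs.
Link: https://arxiv.org/abs/2511.10060
Authors: Luming Yang, Haoxian Liu, Siqing Li, Alper Yilmaz
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to AAAI 2026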
Abstract:Fine-grained action evaluation in medical vision faces unique challenges due to the unavailability of comprehensive datasets, stringent precision requirements, and insufficient spatiotemporal dynamic modeling of very rapid actions. To support development and evaluation, we introduce CPREval-6k, a multi-view, multi-label medical action benchmark containing 6,372 expert-annotated videos with 22 clinical labels. Using this dataset, we present GaussMedAct, a multivariate Gaussian encoding framework, to advance medical motion analysis through adaptive spatiotemporal representation learning. Multivariate Gaussian Representation projects the joint motions to a temporally scaled multi-dimensional space, and decomposes actions into adaptive 3D Gaussians that serve as tokens. These tokens preserve motion semantics through anisotropic covariance modeling while maintaining robustness to spatiotemporal noise. Hybrid Spatial Encoding, employing a Cartesian and Vector dual-stream strategy, effectively utilizes skeletal information in the form of joint and bone features. The proposed method achieves 92.1% Top-1 accuracy with real-time inference on the benchmark, outperforming the ST-GCN baseline by +5.9% accuracy with only 10% FLOPs. Cross-dataset experiments confirm the superiority of our method in robustness.
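[CV-63] When Eyes and Ears Disagree: Can MLLMs Discern Audio-Visual Confusion? AAAI2026
Quick read: This paper addresses the difficulty multimodal large language models (MLLMs) have with “audio-visual confusion”: because visual information dominates their reasoning, they fail to recognize that an object visible in the video produces no sound. The solution is RL-CoMM, a reinforcement-learning-based collaborative multi-MLLM framework with two key ideas: (1) a Large Audio Language Model (LALM) serves as a reference that produces audio-only reasoning, and a Step-wise Reasoning Reward guides the MLLM to improve its audio-visual reasoning against that reference; (2) Answer-centered Confidence Optimization reduces the uncertainty arising from heterogeneous reasoning differences. On audio-visual question answering and audio-visual hallucination tasks, RL-CoMM improves accuracy by 10~30% over the baseline with limited training data.
Link: https://arxiv.org/abs/2511.10059
Authors: Qilang Ye, Wei Zeng, Meng Liu, Jie Zhang, Yupeng Hu, Zitong Yu, Yu Zhou
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2026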
Abstract:Can Multimodal Large Language Models (MLLMs) discern confused objects that are visually present but audio-absent? To study this, we introduce a new benchmark, AV-ConfuseBench, which simulates an “Audio-Visual Confusion” scene by modifying the corresponding sound of an object in the video, e.g., mute the sounding object and ask MLLMs “Is there a/an muted-object sound”. Experimental results reveal that MLLMs, such as Qwen2.5-Omni and Gemini 2.5, struggle to discriminate non-existent audio due to visually dominated reasoning. Motivated by this observation, we introduce RL-CoMM, a Reinforcement Learning-based Collaborative Multi-MLLM that is built upon the Qwen2.5-Omni foundation. RL-CoMM includes two stages: 1) To alleviate visually dominated ambiguities, we introduce an external model, a Large Audio Language Model (LALM), as the reference model to generate audio-only reasoning. Then, we design a Step-wise Reasoning Reward function that enables MLLMs to self-improve audio-visual reasoning with the audio-only reference. 2) To ensure an accurate answer prediction, we introduce Answer-centered Confidence Optimization to reduce the uncertainty of potential heterogeneous reasoning differences. Extensive experiments on audio-visual question answering and audio-visual hallucination show that RL-CoMM improves the accuracy by 10~30% over the baseline model with limited training data. Follow: this https URL.
[CV-64] Image Aesthetic Reasoning via HCM-GRPO: Empowering Compact Model for Superior Performance
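Quick read: This paper addresses the poor performance of multimodal large language models (MLLMs) on image screening, which stems from scarce data and weak image aesthetic reasoning. On the data side, the authors build an image screening dataset of over 128k samples (about 640k images) covering four aspects: appearance deformation, physical shadow, placement layout, and extension rationality. They compare purely manual, fully automated, and answer-driven annotation to obtain high-quality chain-of-thought (CoT) data cost-effectively. On the method side, they add a Hard Cases Mining (HCM) strategy with a Dynamic Proportional Accuracy (DPA) reward to the Group Relative Policy Optimization (GRPO) framework, forming HCM-GRPO, which shows better image aesthetic reasoning than the original GRPO. Experiments show that even state-of-the-art closed-source models such as GPT4o and Qwen-VL-Max perform close to random guessing on this task, while a much smaller model trained with HCM-GRPO surpasses both large open-source and leading closed-source models.
Link: https://arxiv.org/abs/2511.10055
Authors: Zhiyuan Hu, Zheng Sun, Yi Wei, Long Yu
Affiliations: Tsinghua University; Alibaba Health Information Technology Limited
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: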
Abstract:The performance of image generation has been significantly improved in recent years. However, the study of image screening is rare and its performance with Multimodal Large Language Models (MLLMs) is unsatisfactory due to the lack of data and the weak image aesthetic reasoning ability in MLLMs. In this work, we propose a complete solution to address these problems in terms of data and methodology. For data, we collect a comprehensive image screening dataset with over 128k samples, about 640k images. Each sample consists of an original image and four generated images. The dataset evaluates the image aesthetic reasoning ability under four aspects: appearance deformation, physical shadow, placement layout, and extension rationality. Regarding data annotation, we investigate multiple approaches, including purely manual, fully automated, and answer-driven annotations, to acquire high-quality chains of thought (CoT) data in the most cost-effective manner. Methodologically, we introduce a Hard Cases Mining (HCM) strategy with a Dynamic Proportional Accuracy (DPA) reward into the Group Relative Policy Optimization (GRPO) framework, called HCM-GRPO. This enhanced method demonstrates superior image aesthetic reasoning capabilities compared to the original GRPO. Our experimental results reveal that even state-of-the-art closed-source MLLMs, such as GPT4o and Qwen-VL-Max, exhibit performance akin to random guessing in image aesthetic reasoning. In contrast, by leveraging the HCM-GRPO, we are able to surpass the scores of both large-scale open-source and leading closed-source models with a much smaller model.
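For readers unfamiliar with GRPO, the sketch below shows only the group-relative advantage step of the base method: rewards for several answers sampled from the same prompt are normalized by the group mean and standard deviation. The HCM strategy and the DPA reward are not reproduced because the abstract does not define them; the function name, the binary rewards, and the use of the population standard deviation are assumptions.

```python
import math
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each sampled answer's reward by its group's mean and std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to the same prompt; 1.0 = correct, 0.0 = wrong.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# A prompt that every sample answers correctly gives zero advantage everywhere.
solved = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Groups in which all samples receive the same reward contribute no gradient signal under this normalization. That is one plausible reason to concentrate training on harder cases, though whether it matches the paper's motivation for HCM is not stated in the abstract.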
[CV-65] Trapped by Their Own Light: Deployable and Stealth Retroreflective Patch Attacks on Traffic Sign Recognition Systems
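[Quick read]: This paper addresses the vulnerability of Traffic Sign Recognition (TSR) systems to adversarial attacks in autonomous driving, noting that existing attacks such as stickers or laser projections are either highly visible or hard to deploy. The key to the solution is a new attack vector, the Adversarial Retroreflective Patch (ARP), which exploits retroreflective material that activates only under the victim vehicle's headlights, keeping visibility very low while achieving a high attack success rate. Using a retroreflection simulation method combined with black-box optimization, ARP reaches a success rate of at least 93.4% in dynamic scenarios at 35 meters and at least 60% against commercial TSR systems in real-world conditions, outperforming conventional patch attacks.
Link: https://arxiv.org/abs/2511.10050
Authors: Go Tsuruoka,Takami Sato,Qi Alfred Chen,Kazuki Nomoto,Ryunosuke Kobayashi,Yuna Tanaka,Tatsuya Mori
Affiliations: Waseda University; University of California, Irvine; RIKEN AIP; NICT
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: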
Abstract:Traffic sign recognition plays a critical role in ensuring safe and efficient transportation of autonomous vehicles but remains vulnerable to adversarial attacks using stickers or laser projections. While existing attack vectors demonstrate security concerns, they suffer from visual detectability or implementation constraints, suggesting unexplored vulnerability surfaces in TSR systems. We introduce the Adversarial Retroreflective Patch (ARP), a novel attack vector that combines the high deployability of patch attacks with the stealthiness of laser projections by utilizing retroreflective materials activated only under victim headlight illumination. We develop a retroreflection simulation method and employ black-box optimization to maximize attack effectiveness. ARP achieves a ≥ 93.4% success rate in dynamic scenarios at 35 meters and a ≥ 60% success rate against commercial TSR systems in real-world conditions. Our user study demonstrates that ARP attacks maintain near-identical stealthiness to benign signs while achieving ≥ 1.9% higher stealthiness scores than previous patch attacks. We propose the DPR Shield defense, employing strategically placed polarized filters, which achieves ≥ 75% defense success rates for stop signs and speed limit signs against micro-prism patches.
[CV-66] MuSc-V2: Zero-Shot Multimodal Industrial Anomaly Classification and Segmentation with Mutual Scoring of Unlabeled Samples
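[Quick read]: This paper addresses zero-shot anomaly classification (AC) and segmentation (AS), where defects must be identified and localized without any labeled samples. Existing methods overlook a key property: normal image patches of industrial products usually have many similar counterparts in both 2D appearance and 3D shape, whereas anomalies are diverse and isolated. The key to the solution is the Mutual Scoring framework (MuSc-V2), whose main components are: 1) Iterative Point Grouping (IPG), which improves the 3D representation and reduces false positives caused by discontinuous surfaces; 2) Similarity Neighborhood Aggregation with Multi-Degrees (SNAMD), which fuses 2D/3D neighborhood cues into more discriminative multi-scale patch features for mutual scoring; 3) a Mutual Scoring Mechanism (MSM), in which samples within each modality score one another, combined with Cross-modal Anomaly Enhancement (CAE) to recover anomalies missed by a single modality; and 4) Re-scoring with Constrained Neighborhood (RsCon), which suppresses false classifications based on similarity to more representative samples. The framework achieves AP gains of +23.7% on MVTec 3D-AD and +19.3% on Eyecandies, surpassing previous zero-shot benchmarks and outperforming most few-shot methods.
Link: https://arxiv.org/abs/2511.10047
Authors: Xurui Li,Feng Xue,Yu Zhou
Affiliations: Huazhong University of Science and Technology; University of Trento
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: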
Abstract:Zero-shot anomaly classification (AC) and segmentation (AS) methods aim to identify and outline defects without using any labeled samples. In this paper, we reveal a key property that is overlooked by existing methods: normal image patches across industrial products typically find many other similar patches, not only in 2D appearance but also in 3D shapes, while anomalies remain diverse and isolated. To explicitly leverage this discriminative property, we propose a Mutual Scoring framework (MuSc-V2) for zero-shot AC/AS, which flexibly supports single 2D/3D or multimodality. Specifically, our method begins by improving 3D representation through Iterative Point Grouping (IPG), which reduces false positives from discontinuous surfaces. Then we use Similarity Neighborhood Aggregation with Multi-Degrees (SNAMD) to fuse 2D/3D neighborhood cues into more discriminative multi-scale patch features for mutual scoring. The core comprises a Mutual Scoring Mechanism (MSM) that lets samples within each modality assign scores to each other, and Cross-modal Anomaly Enhancement (CAE) that fuses 2D and 3D scores to recover modality-specific missing anomalies. Finally, Re-scoring with Constrained Neighborhood (RsCon) suppresses false classification based on similarity to more representative samples. Our framework flexibly works on both the full dataset and smaller subsets with consistently robust performance, ensuring seamless adaptability across diverse product lines. With this framework, MuSc-V2 achieves significant performance improvements: a +23.7% AP gain on the MVTec 3D-AD dataset and a +19.3% boost on the Eyecandies dataset, surpassing previous zero-shot benchmarks and even outperforming most few-shot methods. The code will be available at this https URL.
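The idea of letting unlabeled samples score one another can be sketched as follows. This is a simplified illustration under stated assumptions, not the MuSc-V2 implementation: each patch is scored by its mean nearest-neighbor distance to the patches of every other image, and the function name and toy features are hypothetical. IPG, SNAMD, CAE, and RsCon are not modeled.

```python
import math

def mutual_scores(images):
    """images: list of images, each a list of patch feature vectors.
    Returns one list of patch anomaly scores per image (needs >= 2 images)."""
    scores = []
    for i, patches in enumerate(images):
        others = [img for j, img in enumerate(images) if j != i]
        img_scores = []
        for p in patches:
            # Distance from this patch to its nearest patch in each other image.
            nearest = [min(math.dist(p, q) for q in other) for other in others]
            img_scores.append(sum(nearest) / len(nearest))
        scores.append(img_scores)
    return scores

# Toy data: three unlabeled images, two 2-D patch features each.
# The patch (5.0, 5.0) has no close counterpart in the other images.
images = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(0.1, 0.0), (1.0, 0.9)],
    [(0.0, 0.1), (5.0, 5.0)],
]
scores = mutual_scores(images)
```

Normal patches find near neighbors in the other images and get low scores, while the isolated patch gets the highest score, which mirrors the property the abstract describes.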
[CV-67] FreDFT: Frequency Domain Fusion Transformer for Visible-Infrared Object Detection
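[Quick read]: This paper addresses inadequate cross-modal fusion in visible-infrared object detection caused by information imbalance between modalities, as well as the fact that most existing methods apply transformers only in the spatial domain and ignore the advantages of the frequency domain. The key to the solution is a frequency domain fusion transformer (FreDFT) with four components: 1) a multimodal frequency domain attention (MFDA) that mines complementary information between modalities; 2) a frequency domain feed-forward layer (FDFFL) that uses a mixed-scale frequency feature fusion strategy to enhance multimodal features; 3) a cross-modal global modeling module (CGMM) that performs pixel-wise inter-modal feature interaction across spatial and channel dimensions to reduce the information imbalance; and 4) a local feature enhancement module (LFEM) that strengthens local feature representation through various convolution layers and a channel shuffle, improving cross-modal fusion.
Link: https://arxiv.org/abs/2511.10046
Authors: Wencong Wu,Xiuwei Zhang,Hanlin Yin,Shun Dai,Hongxi Zhang,Yanning Zhang
Affiliations: Northwestern Polytechnical University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: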
Abstract:Visible-infrared object detection has gained considerable attention due to its detection performance in low light, fog, and rain conditions. However, visible and infrared modalities captured by different sensors suffer from an information imbalance problem in complex scenarios, which can cause inadequate cross-modal fusion, resulting in degraded detection performance. Furthermore, most existing methods use transformers in the spatial domain to capture complementary features, ignoring the advantages of developing frequency domain transformers to mine complementary information. To solve these weaknesses, we propose a frequency domain fusion transformer, called FreDFT, for visible-infrared object detection. The proposed approach employs a novel multimodal frequency domain attention (MFDA) to mine complementary information between modalities, and a frequency domain feed-forward layer (FDFFL) with a mixed-scale frequency feature fusion strategy is designed to better enhance multimodal features. To eliminate the imbalance of multimodal information, a cross-modal global modeling module (CGMM) is constructed to perform pixel-wise inter-modal feature interaction in a spatial and channel manner. Moreover, a local feature enhancement module (LFEM) is developed to strengthen multimodal local feature representation and promote multimodal feature fusion by using various convolution layers and applying a channel shuffle. Extensive experimental results have verified that our proposed FreDFT achieves excellent performance on multiple public datasets compared with other state-of-the-art methods. The code of our FreDFT is linked at this https URL.
[CV-68] LoG3D: Ultra-High-Resolution 3D Shape Modeling via Local-to-Global Partitioning
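[Quick read]: This paper addresses the challenge of generating high-fidelity 3D content: representing arbitrary topologies, such as open surfaces and intricate internal structures, while preserving geometric detail. Methods based on signed distance fields (SDFs) require costly watertight preprocessing and struggle with non-manifold geometry, while point-cloud representations suffer from sampling artifacts and surface discontinuities. The key to the solution is a 3D variational autoencoder (VAE) framework built on unsigned distance fields (UDFs). Its core innovation is a local-to-global (LoG) architecture that partitions the UDF into uniform subvolumes called UBlocks, couples 3D convolutions for local detail with sparse transformers for global coherence, and applies a Pad-Average strategy for smooth transitions at subvolume boundaries. This supports ultra-high-resolution reconstruction up to 2048³ and improves both reconstruction accuracy and generative quality.
Link: https://arxiv.org/abs/2511.10040
Authors: Xinran Yang,Shuichang Lai,Jiangjing Lyu,Hongjie Li,Bowen Pan,Yuanqi Li,Jie Guo,Zhou Zhengkang,Yanwen Guo
Affiliations: Nanjing University; Alibaba Group; Nanjing Urban Construction Tunnel&Bridge Intelligent Management
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 6 figures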
Abstract:Generating high-fidelity 3D content remains a fundamental challenge due to the complexity of representing arbitrary topologies, such as open surfaces and intricate internal structures, while preserving geometric details. Prevailing methods based on signed distance fields (SDFs) are hampered by costly watertight preprocessing and struggle with non-manifold geometries, while point-cloud representations often suffer from sampling artifacts and surface discontinuities. To overcome these limitations, we propose a novel 3D variational autoencoder (VAE) framework built upon unsigned distance fields (UDFs), a more robust and computationally efficient representation that naturally handles complex and incomplete shapes. Our core innovation is a local-to-global (LoG) architecture that processes the UDF by partitioning it into uniform subvolumes, termed UBlocks. This architecture couples 3D convolutions for capturing local detail with sparse transformers for enforcing global coherence. A Pad-Average strategy further ensures smooth transitions at subvolume boundaries during reconstruction. This modular design enables seamless scaling to ultra-high resolutions up to 2048³, a regime previously unattainable for 3D VAEs. Experiments demonstrate state-of-the-art performance in both reconstruction accuracy and generative quality, yielding superior surface smoothness and geometric flexibility.
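The abstract does not spell out how Pad-Average works. One plausible reading, shown here as a generic overlap-and-average stitch in 1-D rather than the authors' method, is to extend each block by a small padding and average values wherever neighboring blocks overlap. The function names, block size, and padding width are all hypothetical.

```python
def split_with_padding(values, block, pad):
    """Cut a 1-D sequence into blocks of length `block`,
    each extended by up to `pad` elements on both sides."""
    chunks = []
    for start in range(0, len(values), block):
        lo = max(0, start - pad)
        hi = min(len(values), start + block + pad)
        chunks.append((lo, values[lo:hi]))
    return chunks

def merge_by_average(chunks, length):
    """Stitch padded blocks back together, averaging where they overlap."""
    total = [0.0] * length
    count = [0] * length
    for lo, chunk in chunks:
        for offset, v in enumerate(chunk):
            total[lo + offset] += v
            count[lo + offset] += 1
    return [t / c for t, c in zip(total, count)]

signal = [float(i) for i in range(8)]
chunks = split_with_padding(signal, block=4, pad=1)
restored = merge_by_average(chunks, len(signal))
```

When every block is processed independently, the averaged overlap blends the two predictions at each boundary instead of leaving a hard seam. The same scheme extends to 3-D subvolumes by padding along each axis.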
[CV-69] DGFusion: Dual-guided Fusion for Robust Multi-Modal 3D Object Detection
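[Quick read]: This paper addresses the weak detection of distant, small, or occluded objects ("hard instances") in 3D object detection for autonomous driving perception, a problem that directly affects system safety. Existing multi-modal 3D detectors mostly follow a single-guided paradigm and do not account for how the information density of hard instances differs between modalities such as point clouds and images. The key to the solution is DGFusion, built on a Dual-guided paradigm that combines the Point-guide-Image and Image-guide-Point mechanisms. Its core component, the Difficulty-aware Instance Pair Matcher (DIPM), matches instance-level features by difficulty to produce easy and hard instance pairs, and the Dual-guided Modules exploit both pair types for effective multi-modal feature fusion, improving robustness on hard instances.
Link: https://arxiv.org/abs/2511.10035
Authors: Feiyang Jia,Caiyan Jia,Ailin Liu,Shaoqing Xu,Qiming Xia,Lin Liu,Lei Yang,Yan Gong,Ziying Song
Affiliations: Beijing Jiaotong University; University of Macau; Xiamen University; Nanyang Technological University; Harbin Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: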
Abstract:As a critical task in autonomous driving perception systems, 3D object detection is used to identify and track key objects, such as vehicles and pedestrians. However, detecting distant, small, or occluded objects (hard instances) remains a challenge, which directly compromises the safety of autonomous driving systems. We observe that existing multi-modal 3D object detection methods often follow a single-guided paradigm, failing to account for the differences in information density of hard instances between modalities. In this work, we propose DGFusion, based on the Dual-guided paradigm, which fully inherits the advantages of the Point-guide-Image paradigm and integrates the Image-guide-Point paradigm to address the limitations of the single paradigms. The core of DGFusion, the Difficulty-aware Instance Pair Matcher (DIPM), performs instance-level feature matching based on difficulty to generate easy and hard instance pairs, while the Dual-guided Modules exploit the advantages of both pair types to enable effective multi-modal feature fusion. Experimental results demonstrate that our DGFusion outperforms the baseline methods, with respective improvements of +1.0% mAP, +0.8% NDS, and +1.3% average recall on nuScenes. Extensive experiments demonstrate consistent robustness gains for hard instance detection across ego-distance, size, visibility, and small-scale training scenarios.
[CV-70] Anomagic: Crossmodal Prompt-driven Zero-shot Anomaly Generation
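[Quick read]: This paper addresses the reliance of existing anomaly generation methods on exemplar anomalies, which limits their generalization in unsupervised anomaly detection. The key to the solution is Anomagic, a zero-shot anomaly generation method that unifies visual and textual cues through crossmodal prompt encoding to steer an inpainting-based generation pipeline. A contrastive refinement strategy then enforces precise alignment between synthesized anomalies and their masks, improving downstream anomaly detection. The method generates semantically coherent and varied anomaly images without any anomaly exemplars, and is designed to be versatile and scalable.
Link: https://arxiv.org/abs/2511.10020
Authors: Yuxin Jiang,Wei Luo,Hui Zhang,Qiyu Chen,Haiming Yao,Weiming Shen,Yunkang Cao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: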
Abstract:We propose Anomagic, a zero-shot anomaly generation method that produces semantically coherent anomalies without requiring any exemplar anomalies. By unifying both visual and textual cues through a crossmodal prompt encoding scheme, Anomagic leverages rich contextual information to steer an inpainting-based generation pipeline. A subsequent contrastive refinement strategy enforces precise alignment between synthesized anomalies and their masks, thereby bolstering downstream anomaly detection accuracy. To facilitate training, we introduce AnomVerse, a collection of 12,987 anomaly-mask-caption triplets assembled from 13 publicly available datasets, where captions are automatically generated by multimodal large language models using structured visual prompts and template-based textual hints. Extensive experiments demonstrate that Anomagic trained on AnomVerse can synthesize more realistic and varied anomalies than prior methods, yielding superior improvements in downstream anomaly detection. Furthermore, Anomagic can generate anomalies for any normal-category image using user-defined prompts, establishing a versatile foundation model for anomaly generation.
[CV-71] AffordBot: 3D Fine-grained Embodied Reasoning via Multimodal Large Language Models NEURIPS2025
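[Quick read]: This paper addresses fine-grained affordance reasoning for human-agent collaboration in physical environments: an agent must not only identify what to act upon, but also localize the actionable element, determine its motion type and motion axis, and reason in a structured way from a task instruction. Existing methods mostly operate at the object level or handle fine-grained information disjointedly, lacking coherent, instruction-driven grounding and reasoning. The key to the solution is AffordBot, a framework that combines Multimodal Large Language Models (MLLMs) with a tailored chain-of-thought (CoT) reasoning paradigm. It renders surround-view images and projects 3D element candidates into them to build a visual representation aligned with the scene geometry, uses an active perception stage to select the most informative viewpoint, and then reasons step by step to localize affordance elements and infer interaction motions, using only 3D point cloud input.
Link: https://arxiv.org/abs/2511.10017
Authors: Xinyi Wang,Xun Yang,Yanlong Xu,Yuchen Wu,Zhen Li,Na Zhao
Affiliations: University of Science and Technology of China; Singapore University of Technology and Design; Chinese University of Hong Kong, Shenzhen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: NeurIPS 2025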
Abstract:Effective human-agent collaboration in physical environments requires understanding not only what to act upon, but also where the actionable elements are and how to interact with them. Existing approaches often operate at the object level or disjointedly handle fine-grained affordance reasoning, lacking coherent, instruction-driven grounding and reasoning. In this work, we introduce a new task: Fine-grained 3D Embodied Reasoning, which requires an agent to predict, for each referenced affordance element in a 3D scene, a structured triplet comprising its spatial location, motion type, and motion axis, based on a task instruction. To solve this task, we propose AffordBot, a novel framework that integrates Multimodal Large Language Models (MLLMs) with a tailored chain-of-thought (CoT) reasoning paradigm. To bridge the gap between 3D input and 2D-compatible MLLMs, we render surround-view images of the scene and project 3D element candidates into these views, forming a rich visual representation aligned with the scene geometry. Our CoT pipeline begins with an active perception stage, prompting the MLLM to select the most informative viewpoint based on the instruction, before proceeding with step-by-step reasoning to localize affordance elements and infer plausible interaction motions. Evaluated on the SceneFun3D dataset, AffordBot achieves state-of-the-art performance, demonstrating strong generalization and physically grounded reasoning with only 3D point cloud input and MLLMs.
[CV-72] MIRNet: Integrating Constrained Graph-Based Reasoning with Pre-training for Diagnostic Medical Imaging AAAI-26
[Quick read]: This paper addresses three challenges in automated medical image interpretation: annotation scarcity, label imbalance, and clinical plausibility constraints. The core of the solution is MIRNet (Medical Image Reasoner Network), which combines four techniques: a self-supervised masked autoencoder (MAE) that learns transferable visual representations from unlabeled data; graph attention networks (GAT) that model structured correlations among diagnostic labels; constraint-aware optimization with KL divergence and regularization losses that embeds clinical priors; and asymmetric loss (ASL) with boosting ensembles to mitigate class imbalance. The approach is validated on tongue image diagnosis, and the authors state that it generalizes to other diagnostic medical imaging tasks.
Link: https://arxiv.org/abs/2511.10013
Authors: Shufeng Kong,Zijie Wang,Nuan Cui,Hao Tang,Yihan Meng,Yuanyuan Wei,Feifan Chen,Yingheng Wang,Zhuo Cai,Yaonan Wang,Yulong Zhang,Yuzheng Li,Zibin Zheng,Caihua Liu
Affiliations: 1. Tsinghua University; 2. Hunan University of Chinese Medicine; 3. Cornell University; 4. China Academy of Engineering Physics; 5. Beijing Institute of Technology; 6. Guangdong Medical University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: To appear at AAAI-26
Abstract:Automated interpretation of medical images demands robust modeling of complex visual-semantic relationships while addressing annotation scarcity, label imbalance, and clinical plausibility constraints. We introduce MIRNet (Medical Image Reasoner Network), a novel framework that integrates self-supervised pre-training with constrained graph-based reasoning. Tongue image diagnosis is a particularly challenging domain that requires fine-grained visual and semantic understanding. Our approach leverages self-supervised masked autoencoder (MAE) to learn transferable visual representations from unlabeled data; employs graph attention networks (GAT) to model label correlations through expert-defined structured graphs; enforces clinical priors via constraint-aware optimization using KL divergence and regularization losses; and mitigates imbalance using asymmetric loss (ASL) and boosting ensembles. To address annotation scarcity, we also introduce TongueAtlas-4K, a comprehensive expert-curated benchmark comprising 4,000 images annotated with 22 diagnostic labels–representing the largest public dataset in tongue analysis. Validation shows our method achieves state-of-the-art performance. While optimized for tongue diagnosis, the framework readily generalizes to broader diagnostic medical imaging tasks.
[CV-73] LampQ: Towards Accurate Layer-wise Mixed Precision Quantization for Vision Transformers AAAI2026
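[Quick read]: This paper addresses the accuracy loss that pre-trained Vision Transformers (ViTs) suffer under low-bit quantization with uniform precision, and the limitations of existing Mixed Precision Quantization (MPQ) methods: coarse granularity, mismatched metric scales across component types, and quantization-unaware bit allocation. The key to the solution is LampQ (Layer-wise Mixed Precision Quantization for Vision Transformers), which quantizes layer by layer for fine-grained control and efficient acceleration, uses a type-aware Fisher-based metric to measure sensitivity, assigns bit-widths optimally through integer linear programming, and then updates them iteratively, reaching state-of-the-art accuracy for quantized ViTs.
Link: https://arxiv.org/abs/2511.10004
Authors: Minjun Kim,Jaeri Lee,Jongjin Kim,Jeongin Yun,Yongmo Kwon,U Kang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: AAAI 2026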
Abstract:How can we accurately quantize a pre-trained Vision Transformer model? Quantization algorithms compress Vision Transformers (ViTs) into low-bit formats, reducing memory and computation demands with minimal accuracy degradation. However, existing methods rely on uniform precision, ignoring the diverse sensitivity of ViT components to quantization. Metric-based Mixed Precision Quantization (MPQ) is a promising alternative, but previous MPQ methods for ViTs suffer from three major limitations: 1) coarse granularity, 2) mismatch in metric scale across component types, and 3) quantization-unaware bit allocation. In this paper, we propose LampQ (Layer-wise Mixed Precision Quantization for Vision Transformers), an accurate metric-based MPQ method for ViTs to overcome these limitations. LampQ performs layer-wise quantization to achieve both fine-grained control and efficient acceleration, incorporating a type-aware Fisher-based metric to measure sensitivity. Then, LampQ assigns bit-widths optimally through integer linear programming and further updates them iteratively. Extensive experiments show that LampQ provides the state-of-the-art performance in quantizing ViTs pre-trained on various tasks such as image classification, object detection, and zero-shot quantization.
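The bit-allocation step can be pictured as a small constrained optimization. The sketch below is a toy stand-in rather than LampQ itself: it replaces the integer linear program with exhaustive search, and the sensitivity values, the layer sizes, and the error proxy (error shrinking by 4x per extra bit) are all assumptions made for illustration.

```python
from itertools import product

def allocate_bits(sensitivity, sizes, budget_bits, choices=(2, 4, 8)):
    """Pick one bit-width per layer to minimize a sensitivity-weighted
    error proxy under a total storage budget (exhaustive, toy scale only)."""
    best, best_cost = None, float("inf")
    for bits in product(choices, repeat=len(sizes)):
        storage = sum(b * n for b, n in zip(bits, sizes))
        if storage > budget_bits:
            continue  # violates the memory budget
        # Assumed proxy: quantization error falls by 4x for every extra bit.
        cost = sum(s * 4.0 ** (-b) for s, b in zip(sensitivity, bits))
        if cost < best_cost:
            best, best_cost = bits, cost
    return best

# Hypothetical three-layer model: the first layer is the most sensitive.
bits = allocate_bits(
    sensitivity=[10.0, 1.0, 0.1],
    sizes=[100, 100, 100],
    budget_bits=1400,
)
```

Under the budget, the most sensitive layer receives the widest bit-width and the least sensitive one the narrowest. A real ILP solver would replace the enumeration once the number of layers makes exhaustive search impractical.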
[CV-74] DBGroup: Dual-Branch Point Grouping for Weakly Supervised 3D Instance Segmentation
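[Quick read]: This paper addresses the high annotation cost of instance segmentation in 3D scene understanding, in particular the inefficiency and expert dependence of fully supervised labeling on large datasets. Existing weakly supervised methods rely mainly on one-thing-one-click or bounding box annotations and still suffer from labor-intensive labeling, high complexity, and reliance on expert annotators. The authors propose DBGroup, a two-stage weakly supervised 3D instance segmentation framework that uses scene-level annotations. In the first stage, a Dual-Branch Point Grouping module generates pseudo labels from semantic and mask cues extracted from multi-view images, and two refinement strategies, Granularity-Aware Instance Merging and Semantic Selection and Propagation, improve their quality. In the second stage, an end-to-end instance segmentation network is trained with multi-round self-training, and an Instance Mask Filter strategy addresses inconsistencies within the pseudo labels.
Link: https://arxiv.org/abs/2511.10003
Authors: Xuexun Liu,Xiaoxu Xu,Qiudan Zhang,Lin Ma,Xu Wang
Affiliations: Shenzhen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: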
Abstract:Weakly supervised 3D instance segmentation is essential for 3D scene understanding, especially given the growing scale of data and the high annotation costs associated with fully supervised approaches. Existing methods primarily rely on two forms of weak supervision: one-thing-one-click annotations and bounding box annotations, both of which aim to reduce labeling efforts. However, these approaches still encounter limitations, including labor-intensive annotation processes, high complexity, and reliance on expert annotators. To address these challenges, we propose DBGroup, a two-stage weakly supervised 3D instance segmentation framework that leverages scene-level annotations as a more efficient and scalable alternative. In the first stage, we introduce a Dual-Branch Point Grouping module to generate pseudo labels guided by semantic and mask cues extracted from multi-view images. To further improve label quality, we develop two refinement strategies: Granularity-Aware Instance Merging and Semantic Selection and Propagation. The second stage involves multi-round self-training on an end-to-end instance segmentation network using the refined pseudo-labels. Additionally, we introduce an Instance Mask Filter strategy to address inconsistencies within the pseudo labels. Extensive experiments demonstrate that DBGroup achieves competitive performance compared to sparse-point-level supervised 3D instance segmentation methods, while surpassing state-of-the-art scene-level supervised 3D semantic segmentation approaches. Code is available at this https URL.
[CV-75] MOBA: A Material-Oriented Backdoor Attack against LiDAR-based 3D Object Detection Systems AAAI2026
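[Quick read]: This paper addresses the lack of physical realizability of backdoor attacks on LiDAR-based 3D object detection: existing digital triggers ignore the material-dependent laser reflection properties of real environments and often fail in the physical world. The key to the solution is the Material-Oriented Backdoor Attack (MOBA) framework, which bridges the digital-physical gap by explicitly modeling the material properties of real triggers. It first systematically selects titanium dioxide (TiO₂) as the trigger material for its high diffuse reflectivity and environmental resilience. It then builds a simulation pipeline with an angle-independent approximation of the Oren-Nayar bidirectional reflectance distribution function (BRDF) to generate realistic LiDAR intensities, and a distance-aware scaling mechanism to maintain spatial consistency across depths, so that the digital simulation aligns with the behavior of the physical trigger.
Link: https://arxiv.org/abs/2511.09999
Authors: Saket S. Chaturvedi,Gaurav Bagwe,Lan Zhang,Pan He,Xiaoyong Yuan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at AAAI 2026 Conference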
Abstract:LiDAR-based 3D object detection is widely used in safety-critical systems. However, these systems remain vulnerable to backdoor attacks that embed hidden malicious behaviors during training. A key limitation of existing backdoor attacks is their lack of physical realizability, primarily due to the digital-to-physical domain gap. Digital triggers often fail in real-world settings because they overlook material-dependent LiDAR reflection properties. On the other hand, physically constructed triggers are often unoptimized, leading to low effectiveness or easy detection. This paper introduces Material-Oriented Backdoor Attack (MOBA), a novel framework that bridges the digital-physical gap by explicitly modeling the material properties of real-world triggers. MOBA tackles two key challenges in physical backdoor design: 1) robustness of the trigger material under diverse environmental conditions, 2) alignment between the physical trigger’s behavior and its digital simulation. First, we propose a systematic approach to selecting robust trigger materials, identifying titanium dioxide (TiO₂) for its high diffuse reflectivity and environmental resilience. Second, to ensure the digital trigger accurately mimics the physical behavior of the material-based trigger, we develop a novel simulation pipeline that features: (1) an angle-independent approximation of the Oren-Nayar BRDF model to generate realistic LiDAR intensities, and (2) a distance-aware scaling mechanism to maintain spatial consistency across varying depths. We conduct extensive experiments on state-of-the-art LiDAR-based and Camera-LiDAR fusion models, showing that MOBA achieves a 93.50% attack success rate, outperforming prior methods by over 41%. Our work reveals a new class of physically realizable threats and underscores the urgent need for defenses that account for material-level properties in real-world environments.
[CV-76] STELLAR: Scene Text Editor for Low-Resource Languages and Real-World Data AAAI
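[Quick read]: This paper addresses three challenges in Scene Text Editing (STE): lack of support for low-resource languages, the domain gap between synthetic and real data, and the absence of suitable metrics for text style preservation. The core solution is STELLAR (Scene Text Editor for Low-resource LAnguages and Real-world data), which uses a language-adaptive glyph encoder and a multi-stage training strategy that pre-trains on synthetic data and then fine-tunes on real images to make multilingual editing more reliable. The authors also build the STIPLAR dataset for training and evaluation and propose Text Appearance Similarity (TAS), a metric that measures font, color, and background similarity independently and supports robust evaluation of style preservation without ground truth. Experiments show that STELLAR outperforms existing methods in visual consistency and recognition accuracy, with an average TAS improvement of 2.2%.
Link: https://arxiv.org/abs/2511.09977
Authors: Yongdeuk Seo,Hyun-seok Min,Sungchul Choi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to AAAI Workshop (Artificial Intelligence with Biased or Scarce Data)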
Abstract:Scene Text Editing (STE) is the task of modifying text content in an image while preserving its visual style, such as font, color, and background. While recent diffusion-based approaches have shown improvements in visual quality, key limitations remain: lack of support for low-resource languages, domain gap between synthetic and real data, and the absence of appropriate metrics for evaluating text style preservation. To address these challenges, we propose STELLAR (Scene Text Editor for Low-resource LAnguages and Real-world data). STELLAR enables reliable multilingual editing through a language-adaptive glyph encoder and a multi-stage training strategy that first pre-trains on synthetic data and then fine-tunes on real images. We also construct a new dataset, STIPLAR(Scene Text Image Pairs of Low-resource lAnguages and Real-world data), for training and evaluation. Furthermore, we propose Text Appearance Similarity (TAS), a novel metric that assesses style preservation by independently measuring font, color, and background similarity, enabling robust evaluation even without ground truth. Experimental results demonstrate that STELLAR outperforms state-of-the-art models in visual consistency and recognition accuracy, achieving an average TAS improvement of 2.2% across languages over the baselines.
[CV-77] Difference Vector Equalization for Robust Fine-tuning of Vision-Language Models AAAI2026
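[Quick read]: This paper addresses a problem in robust fine-tuning of contrastive pre-trained vision-language models such as CLIP on in-distribution (ID) data: fine-tuning tends to distort the geometric structure of the embedding space, which degrades out-of-distribution (OOD) and zero-shot performance. The key to the solution is Difference Vector Equalization (DiVE). For each data sample, a difference vector is formed from the embeddings extracted by the pre-trained and fine-tuned models, and these difference vectors are constrained to be equal across samples so that the geometric structure is preserved. Two losses implement this: the average vector loss (AVL) preserves structure globally by constraining difference vectors to equal their weighted average, and the pairwise vector loss (PVL) preserves structure locally by ensuring consistent multimodal alignment. Together they improve ID performance without harming generalization.
Link: https://arxiv.org/abs/2511.09973
Authors: Satoshi Suzuki,Shin’ya Yamaguchi,Shoichiro Takeda,Taiga Yamane,Naoki Makishima,Naotaka Kawata,Mana Ihori,Tomohiro Tanaka,Shota Orihashi,Ryo Masumura
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by AAAI 2026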
Abstract:Contrastive pre-trained vision-language models, such as CLIP, demonstrate strong generalization abilities in zero-shot classification by leveraging embeddings extracted from image and text encoders. This paper aims to robustly fine-tune these vision-language models on in-distribution (ID) data without compromising their generalization abilities in out-of-distribution (OOD) and zero-shot settings. Current robust fine-tuning methods tackle this challenge by reusing contrastive learning, which was used in pre-training, for fine-tuning. However, we found that these methods distort the geometric structure of the embeddings, which plays a crucial role in the generalization of vision-language models, resulting in limited OOD and zero-shot performance. To address this, we propose Difference Vector Equalization (DiVE), which preserves the geometric structure during fine-tuning. The idea behind DiVE is to constrain difference vectors, each of which is obtained by subtracting the embeddings extracted from the pre-trained and fine-tuning models for the same data sample. By constraining the difference vectors to be equal across various data samples, we effectively preserve the geometric structure. Therefore, we introduce two losses: average vector loss (AVL) and pairwise vector loss (PVL). AVL preserves the geometric structure globally by constraining difference vectors to be equal to their weighted average. PVL preserves the geometric structure locally by ensuring a consistent multimodal alignment. Our experiments demonstrate that DiVE effectively preserves the geometric structure, achieving strong results across ID, OOD, and zero-shot metrics.
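The difference-vector idea can be sketched as follows. This is a minimal illustration, assuming a uniform average where the abstract mentions a weighted one and using plain lists in place of real CLIP embeddings. The function names are hypothetical, and PVL is not shown.

```python
def difference_vectors(pre, ft):
    """Per-sample difference between fine-tuned and pre-trained embeddings."""
    return [[f - p for p, f in zip(pv, fv)] for pv, fv in zip(pre, ft)]

def average_vector_loss(pre, ft):
    """Mean squared distance of each difference vector from their average.
    Assumption: uniform weights stand in for the paper's weighted average."""
    diffs = difference_vectors(pre, ft)
    n, dim = len(diffs), len(diffs[0])
    mean = [sum(d[k] for d in diffs) / n for k in range(dim)]
    return sum(
        sum((d[k] - mean[k]) ** 2 for k in range(dim)) for d in diffs
    ) / n

pre = [[0.0, 0.0], [1.0, 2.0]]
shifted = [[1.0, 1.0], [2.0, 3.0]]    # every embedding moved by the same offset
distorted = [[1.0, 1.0], [1.0, 2.0]]  # only the first embedding moved

loss_shifted = average_vector_loss(pre, shifted)
loss_distorted = average_vector_loss(pre, distorted)
```

A uniform shift of all embeddings leaves their relative geometry intact and gives zero loss, while moving samples by different amounts is penalized.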
[CV-78] Equivariant Sampling for Improving Diffusion Model-based Image Restoration
[Quick read]: This paper addresses the suboptimal performance of current problem-agnostic Diffusion Model-based Image Restoration (DMIR) methods, which do not fully exploit diffusion priors. The key to the solution is EquS, which imposes equivariant information through dual sampling trajectories. The authors further propose a Timestep-Aware Schedule (TAS) that prioritizes deterministic steps to improve certainty and sampling efficiency, giving the enhanced variant EquS⁺. Experiments show that the method is compatible with existing problem-agnostic DMIR methods and significantly improves their performance without increasing computational cost.
Link: https://arxiv.org/abs/2511.09965
Authors: Chenxu Wu,Qingpeng Kong,Peiang Zhao,Wendi Yang,Wenxin Ma,Fenghe Tang,Zihang Jiang,S. Kevin Zhou
Affiliations: University of Science and Technology of China; MIRACLE Center; Key Laboratory of Intelligent Information Processing of CAS, ICT, CAS; State Key Laboratory of Precision and Intelligent Chemistry
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 9 figures
Abstract:Recent advances in generative models, especially diffusion models, have significantly improved image restoration (IR) performance. However, existing problem-agnostic diffusion model-based image restoration (DMIR) methods face challenges in fully leveraging diffusion priors, resulting in suboptimal performance. In this paper, we address the limitations of current problem-agnostic DMIR methods by analyzing their sampling process and providing effective solutions. We introduce EquS, a DMIR method that imposes equivariant information through dual sampling trajectories. To further boost EquS, we propose the Timestep-Aware Schedule (TAS) and introduce EquS⁺. TAS prioritizes deterministic steps to enhance certainty and sampling efficiency. Extensive experiments on benchmarks demonstrate that our method is compatible with previous problem-agnostic DMIR methods and significantly boosts their performance without increasing computational costs. Our code is available at this https URL.
[CV-79] Robust Object Detection with Pseudo Labels from VLMs using Per-Object Co-teaching
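[Quick read]: This paper addresses two obstacles to using vision-language models (VLMs) for zero-shot object detection in autonomous driving: high detection latency and hallucinated predictions, both of which make direct deployment impractical. The authors propose a training pipeline in which VLMs generate pseudo-labels for efficient real-time detectors. Its core innovation is a per-object co-teaching strategy that mitigates the inherent noise in VLM-generated labels: two YOLO models learn collaboratively within each mini-batch and filter out unreliable bounding boxes based on each other's per-object loss values, rather than discarding entire images. On the KITTI dataset, the method raises mAP@0.5 from 31.12% for the baseline YOLOv5m to 46.61%, and reaches 57.97% when 10% of ground-truth labels are added, while maintaining real-time detection.
Link: https://arxiv.org/abs/2511.09955
Authors: Uday Bhaskar,Rishabh Bhattacharya,Avinash Patel,Sarthak Khoche,Praveen Anil Kulkarni,Naresh Manwani
Affiliations: IIIT Hyderabad; Bosch Global Software Technologies
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: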
Abstract:Foundation models, especially vision-language models (VLMs), offer compelling zero-shot object detection for applications like autonomous driving, a domain where manual labelling is prohibitively expensive. However, their detection latency and tendency to hallucinate predictions render them unsuitable for direct deployment. This work introduces a novel pipeline that addresses this challenge by leveraging VLMs to automatically generate pseudo-labels for training efficient, real-time object detectors. Our key innovation is a per-object co-teaching-based training strategy that mitigates the inherent noise in VLM-generated labels. The proposed per-object co-teaching approach filters noisy bounding boxes from training instead of filtering the entire image. Specifically, two YOLO models learn collaboratively, filtering out unreliable boxes from each mini-batch based on their peers’ per-object loss values. Overall, our pipeline provides an efficient, robust, and scalable approach to train high-performance object detectors for autonomous driving, significantly reducing reliance on costly human annotation. Experimental results on the KITTI dataset demonstrate that our method outperforms a baseline YOLOv5m model, achieving a significant mAP@0.5 boost (31.12% to 46.61%) while maintaining real-time detection latency. Furthermore, we show that supplementing our pseudo-labelled data with a small fraction of ground truth labels (10%) leads to further performance gains, reaching 57.97% mAP@0.5 on the KITTI dataset. We observe similar performance improvements for the ACDC and BDD100k datasets.
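The per-object filtering step described in the abstract can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation: it assumes each detector has already produced one loss value per pseudo-labelled box, and the names `select_small_loss`, `co_teaching_step`, and the `keep_ratio` parameter are hypothetical.

```python
def select_small_loss(losses, keep_ratio):
    """Return sorted indices of the boxes with the smallest loss values."""
    n_keep = max(1, int(round(keep_ratio * len(losses))))
    # Sort by loss, breaking ties by index so the result is deterministic.
    order = sorted(range(len(losses)), key=lambda i: (losses[i], i))
    return sorted(order[:n_keep])


def co_teaching_step(losses_a, losses_b, keep_ratio=0.75):
    """Each model is trained only on the boxes its peer considers reliable.

    losses_a / losses_b: per-object losses of models A and B on the same
    mini-batch of pseudo-labelled boxes (assumed precomputed).
    """
    boxes_for_a = select_small_loss(losses_b, keep_ratio)  # chosen by model B
    boxes_for_b = select_small_loss(losses_a, keep_ratio)  # chosen by model A
    return boxes_for_a, boxes_for_b


# Toy mini-batch with four pseudo-labelled boxes.
losses_a = [0.2, 3.1, 0.4, 0.3]
losses_b = [0.1, 0.5, 2.9, 0.2]
boxes_for_a, boxes_for_b = co_teaching_step(losses_a, losses_b, keep_ratio=0.75)
```

Because selection happens per box, an image with one bad pseudo-label still contributes its remaining boxes to training, which is the distinction the abstract draws from image-level filtering.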
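[CV-80] Beyond Cosine Similarity: Magnitude-Aware CLIP for No-Reference Image Quality Assessment
[Quick read]: This paper addresses inaccurate quality prediction in CLIP-based No-Reference Image Quality Assessment (NR-IQA), which relies solely on semantic similarity (the cosine similarity between the image embedding and text prompts such as "a good photo") and ignores the magnitude of the image features. The key to the solution is an adaptive fusion framework: the absolute values of the CLIP image features are extracted and statistically normalized with a Box-Cox transformation, giving a magnitude-aware quality cue with reduced semantic sensitivity. A confidence-guided fusion scheme then adaptively weights this cue against the cosine-similarity cue, yielding more robust and accurate quality prediction.
Link: https://arxiv.org/abs/2511.09948
Authors: Zhicheng Liao,Dongxu Wu,Zhenshan Shi,Sijie Mai,Hanwei Zhu,Lingyu Zhu,Yuncheng Jiang,Baoliang Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: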
Abstract:Recent efforts have repurposed the Contrastive Language-Image Pre-training (CLIP) model for No-Reference Image Quality Assessment (NR-IQA) by measuring the cosine similarity between the image embedding and textual prompts such as “a good photo” or “a bad photo.” However, this semantic similarity overlooks a critical yet underexplored cue: the magnitude of the CLIP image features, which we empirically find to exhibit a strong correlation with perceptual quality. In this work, we introduce a novel adaptive fusion framework that complements cosine similarity with a magnitude-aware quality cue. Specifically, we first extract the absolute CLIP image features and apply a Box-Cox transformation to statistically normalize the feature distribution and mitigate semantic sensitivity. The resulting scalar summary serves as a semantically-normalized auxiliary cue that complements cosine-based prompt matching. To integrate both cues effectively, we further design a confidence-guided fusion scheme that adaptively weighs each term according to its relative strength. Extensive experiments on multiple benchmark IQA datasets demonstrate that our method consistently outperforms standard CLIP-based IQA and state-of-the-art baselines, without any task-specific training.
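To make the two cues concrete, here is a small sketch of a Box-Cox magnitude summary and a confidence-weighted fusion. The Box-Cox formula is the standard one; everything else is an assumption for illustration: the fixed `lam`, the `eps` offset, and especially the `fuse` rule (weighting each cue by its distance from a neutral 0.5) are hypothetical and not taken from the paper. Both scores passed to `fuse` are assumed to be already mapped to [0, 1].

```python
import math


def box_cox(x, lam):
    """Standard Box-Cox transform for x > 0."""
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam


def magnitude_cue(features, lam=0.5, eps=1e-6):
    """Scalar summary: mean Box-Cox-transformed absolute feature value."""
    vals = [box_cox(abs(f) + eps, lam) for f in features]
    return sum(vals) / len(vals)


def fuse(cos_score, mag_score):
    """Hypothetical confidence-guided fusion of two scores in [0, 1].

    Each cue is weighted by how far it is from the neutral value 0.5,
    so the more decisive cue dominates the fused score.
    """
    c_cos = abs(cos_score - 0.5)
    c_mag = abs(mag_score - 0.5)
    total = c_cos + c_mag
    if total == 0:
        return 0.5
    return (c_cos * cos_score + c_mag * mag_score) / total
```

In practice the Box-Cox parameter is usually estimated from data (for example by maximum likelihood) rather than fixed; the constant here only keeps the sketch short.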
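[CV-81] TSPE-GS: Probabilistic Depth Extraction for Semi-Transparent Surface Reconstruction via 3D Gaussian Splatting AAAI26
[Quick read]: This paper addresses a limitation of 3D Gaussian Splatting (3DGS) in reconstructing semi-transparent surfaces: most methods assume a single depth per pixel, which fails when multiple surfaces are visible and leaves the depth ambiguous. The key to the solution is TSPE-GS (Transparent Surface Probabilistic Extraction for Gaussian Splatting), which uniformly samples transmittance to model a pixel-wise multi-modal distribution of opacity and depth, replacing the earlier single-peak assumption. By progressively fusing truncated signed distance functions (TSDF), it reconstructs external and internal surfaces separately within a unified framework, resolving cross-surface depth ambiguity, and it generalizes to other Gaussian-based reconstruction pipelines without extra training overhead.
Link: https://arxiv.org/abs/2511.09944
Authors: Zhiyuan Xu,Nan Min,Yuhang Guo,Tong Wei
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: AAAI26 Poster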
Abstract:3D Gaussian Splatting offers a strong speed-quality trade-off but struggles to reconstruct semi-transparent surfaces because most methods assume a single depth per pixel, which fails when multiple surfaces are visible. We propose TSPE-GS (Transparent Surface Probabilistic Extraction for Gaussian Splatting), which uniformly samples transmittance to model a pixel-wise multi-modal distribution of opacity and depth, replacing the prior single-peak assumption and resolving cross-surface depth ambiguity. By progressively fusing truncated signed distance functions, TSPE-GS reconstructs external and internal surfaces separately within a unified framework. The method generalizes to other Gaussian-based reconstruction pipelines without extra training overhead. Extensive experiments on public and self-collected semi-transparent and opaque datasets show TSPE-GS significantly improves semi-transparent geometry reconstruction while maintaining performance on opaque scenes.
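One way to picture "uniformly sampling transmittance" is the toy sketch below. It is an interpretation under stated assumptions, not the paper's algorithm: a ray is assumed to hit a sorted list of Gaussians, each with a depth and an opacity `alpha`, and for each uniformly spaced transmittance level we record the depth at which accumulated transmittance first drops to that level. The function name and the sampling scheme are hypothetical.

```python
def sample_depths_by_transmittance(depths, alphas, n_samples=4):
    """Read out one depth per uniformly spaced transmittance level.

    depths: depths of the Gaussians along the ray, sorted front to back.
    alphas: opacity of each Gaussian in [0, 1].
    Returns a list with one depth (or None if never reached) per level.
    """
    # Levels uniformly spaced in (0, 1), visited from high to low transmittance.
    levels = [(k + 0.5) / n_samples for k in range(n_samples)]
    samples = []
    for level in sorted(levels, reverse=True):
        transmittance = 1.0
        hit = None
        for depth, alpha in zip(depths, alphas):
            transmittance *= (1.0 - alpha)  # standard front-to-back compositing
            if transmittance <= level:
                hit = depth
                break
        samples.append(hit)
    return samples


# A semi-transparent front surface at depth 1.0 and an opaque one behind at 3.0.
samples = sample_depths_by_transmittance([1.0, 3.0], [0.5, 0.9], n_samples=4)
```

In this toy case the sampled depths fall into two clusters, one per surface, which illustrates why a single depth per pixel cannot describe the pixel.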
[CV-82] AdaptViG: Adaptive Vision GNN with Exponential Decay Gating WACV2026
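[Quick read]: This paper addresses the computational inefficiency of the graph construction phase in Vision Graph Neural Networks (ViGs), which limits their practical use. The key to the solution is AdaptViG, an efficient hybrid Vision GNN built around a graph construction mechanism called Adaptive Graph Convolution. The mechanism combines an efficient static axial scaffold with a dynamic, content-aware gating strategy, Exponential Decay Gating, which selectively weighs long-range connections based on feature similarity and reduces redundant computation. AdaptViG uses this gating in the early stages and a full Global Attention block in the final stage for maximum feature aggregation, reaching a new state-of-the-art trade-off between accuracy and efficiency among Vision GNNs.
Link: https://arxiv.org/abs/2511.09942
Authors: Mustafa Munir,Md Mostafijur Rahman,Radu Marculescu
Affiliations: The University of Texas at Austin
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted in 2026 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2026)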
Abstract:Vision Graph Neural Networks (ViGs) offer a new direction for advancements in vision architectures. While powerful, ViGs often face substantial computational challenges stemming from their graph construction phase, which can hinder their efficiency. To address this issue we propose AdaptViG, an efficient and powerful hybrid Vision GNN that introduces a novel graph construction mechanism called Adaptive Graph Convolution. This mechanism builds upon a highly efficient static axial scaffold and a dynamic, content-aware gating strategy called Exponential Decay Gating. This gating mechanism selectively weighs long-range connections based on feature similarity. Furthermore, AdaptViG employs a hybrid strategy, utilizing our efficient gating mechanism in the early stages and a full Global Attention block in the final stage for maximum feature aggregation. Our method achieves a new state-of-the-art trade-off between accuracy and efficiency among Vision GNNs. For instance, our AdaptViG-M achieves 82.6% top-1 accuracy, outperforming ViG-B by 0.3% while using 80% fewer parameters and 84% fewer GMACs. On downstream tasks, AdaptViG-M obtains 45.8 mIoU, 44.8 APbox, and 41.1 APmask, surpassing the much larger EfficientFormer-L7 by 0.7 mIoU, 2.2 APbox, and 2.1 APmask, respectively, with 78% fewer parameters.
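The abstract does not give the gating formula, so the following is only a plausible sketch of a similarity-based exponential gate: the gate value is assumed to be exp(-d / T), where d is the L1 distance between two node features and T is a temperature. The function names and the choice of L1 distance are hypothetical.

```python
import math


def exp_decay_gate(feat_i, feat_j, temperature=1.0):
    """Gate in (0, 1]: close to 1 for similar features, decaying with distance."""
    dist = sum(abs(a - b) for a, b in zip(feat_i, feat_j))
    return math.exp(-dist / temperature)


def gated_aggregate(node, neighbours, temperature=1.0):
    """Add each neighbour to the node feature, scaled by its gate value."""
    out = list(node)
    for nb in neighbours:
        gate = exp_decay_gate(node, nb, temperature)
        out = [o + gate * v for o, v in zip(out, nb)]
    return out
```

With a gate of this form, dissimilar long-range neighbours contribute almost nothing, so a fixed (static) set of candidate connections can be kept while the content decides how much each one matters.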
[CV-83] Debiased Dual-Invariant Defense for Adversarially Robust Person Re-Identification AAAI2026
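[Quick read]: This paper addresses the vulnerability of person re-identification (ReID) models to adversarial attacks, where small, imperceptible perturbations cause incorrect matches and pose serious security risks. Most existing defenses are designed for classification and do not transfer directly to metric learning tasks such as ReID, nor do they handle two ReID-specific challenges: model bias and composite generalization requirements. The key to the solution is a debiased dual-invariant defense framework with two phases. The data balancing phase mitigates bias in the training data with a diffusion-model-based resampling strategy that promotes fairness and diversity. The bi-adversarial self-meta defense phase introduces a metric adversarial training approach with farthest negative extension softening, which counters the robustness degradation caused by the absence of a classifier, together with an adversarially-enhanced self-meta mechanism that generalizes to both unseen identities and unseen attack types.
Link: https://arxiv.org/abs/2511.09933
Authors: Yuhang Zhou,Yanxiang Zhao,Zhongyun Hua,Zhipu Liu,Zhaoquan Gu,Qing Liao,Leo Yu Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2026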
Abstract:Person re-identification (ReID) is a fundamental task in many real-world applications such as pedestrian trajectory tracking. However, advanced deep learning-based ReID models are highly susceptible to adversarial attacks, where imperceptible perturbations to pedestrian images can cause entirely incorrect predictions, posing significant security threats. Although numerous adversarial defense strategies have been proposed for classification tasks, their extension to metric learning tasks such as person ReID remains relatively unexplored. Moreover, the several existing defenses for person ReID fail to address the inherent unique challenges of adversarially robust ReID. In this paper, we systematically identify the challenges of adversarial defense in person ReID into two key issues: model bias and composite generalization requirements. To address them, we propose a debiased dual-invariant defense framework composed of two main phases. In the data balancing phase, we mitigate model bias using a diffusion-model-based data resampling strategy that promotes fairness and diversity in training data. In the bi-adversarial self-meta defense phase, we introduce a novel metric adversarial training approach incorporating farthest negative extension softening to overcome the robustness degradation caused by the absence of classifier. Additionally, we introduce an adversarially-enhanced self-meta mechanism to achieve dual-generalization for both unseen identities and unseen attack types. Experiments demonstrate that our method significantly outperforms existing state-of-the-art defenses.
[CV-84] MosaicDoc: A Large-Scale Bilingual Benchmark for Visually Rich Document Understanding
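[Quick read]: This paper addresses the shortcomings of existing benchmarks for Vision-Language Models (VLMs) in language coverage, layout complexity, and task diversity, particularly for Visually Rich Document Understanding (VRDU), where models struggle with complex layouts (such as multi-column and non-Manhattan layouts) and dense text in real-world documents. The key to the solution is DocWeaver, a multi-agent pipeline that uses Large Language Models (LLMs) to generate multi-task annotations automatically. It is used to build MosaicDoc, a large-scale bilingual (Chinese and English) benchmark with annotations for OCR, visual question answering (VQA), reading order, and localization.
Link: https://arxiv.org/abs/2511.09919
Authors: Ketong Chen,Yuhao Chen,Yang Xue
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: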
Abstract:Despite the rapid progress of Vision-Language Models (VLMs), their capabilities are inadequately assessed by existing benchmarks, which are predominantly English-centric, feature simplistic layouts, and support limited tasks. Consequently, they fail to evaluate model performance for Visually Rich Document Understanding (VRDU), a critical challenge involving complex layouts and dense text. To address this, we introduce DocWeaver, a novel multi-agent pipeline that leverages Large Language Models to automatically generate a new benchmark. The result is MosaicDoc, a large-scale, bilingual (Chinese and English) resource designed to push the boundaries of VRDU. Sourced from newspapers and magazines, MosaicDoc features diverse and complex layouts (including multi-column and non-Manhattan), rich stylistic variety from 196 publishers, and comprehensive multi-task annotations (OCR, VQA, reading order, and localization). With 72K images and over 600K QA pairs, MosaicDoc serves as a definitive benchmark for the field. Our extensive evaluation of state-of-the-art models on this benchmark reveals their current limitations in handling real-world document complexity and charts a clear path for future research.
[CV-85] Simulating Distribution Dynamics: Liquid Temporal Feature Evolution for Single-Domain Generalized Object Detection
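[Quick read]: This paper addresses the distribution gap in Single-Domain Generalized Object Detection (Single-DGOD), where target-domain data is unavailable, and in particular continuous, gradual domain shifts (such as evolving weather or lighting) that discrete data augmentation and static perturbation cannot model well. The key to the solution is a method called Liquid Temporal Feature Evolution. It simulates initial feature perturbations with controllable Gaussian noise injection and multi-scale Gaussian blurring, then applies temporal modeling and a liquid neural network-driven parameter adjustment mechanism to generate adaptive modulation parameters. This produces a smooth, continuous transition of features from the source domain toward simulated latent distributions, captures progressive cross-domain feature evolution, dynamically regulates the adaptation path, and improves generalization and robustness to unseen shifts.
Link: https://arxiv.org/abs/2511.09909
Authors: Zihao Zhang,Yang Li,Aming Wu,Yahong Han
Affiliations: Tianjin University; Hefei University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: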
Abstract:In this paper, we focus on Single-Domain Generalized Object Detection (Single-DGOD), aiming to transfer a detector trained on one source domain to multiple unknown domains. Existing methods for Single-DGOD typically rely on discrete data augmentation or static perturbation methods to expand data diversity, thereby mitigating the lack of access to target domain data. However, in real-world scenarios such as changes in weather or lighting conditions, domain shifts often occur continuously and gradually. Discrete augmentations and static perturbations fail to effectively capture the dynamic variation of feature distributions, thereby limiting the model’s ability to perceive fine-grained cross-domain differences. To this end, we propose a new method, Liquid Temporal Feature Evolution, which simulates the progressive evolution of features from the source domain to simulated latent distributions by incorporating temporal modeling and liquid neural network-driven parameter adjustment. Specifically, we introduce controllable Gaussian noise injection and multi-scale Gaussian blurring to simulate initial feature perturbations, followed by temporal modeling and a liquid parameter adjustment mechanism to generate adaptive modulation parameters, enabling a smooth and continuous adaptation across domains. By capturing progressive cross-domain feature evolution and dynamically regulating adaptation paths, our method bridges the source-unknown domain distribution gap, significantly boosting generalization and robustness to unseen shifts. Significant performance improvements on the Diverse Weather dataset and Real-to-Art benchmark demonstrate the superiority of our method. Our code is available at this https URL.
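[CV-86] Learning to Pose Problems: Reasoning-Driven and Solver-Adaptive Data Synthesis for Large Reasoning Models
[Quick read]: This paper addresses two problems in data synthesis for training large reasoning models: generation that ignores the solver's ability and therefore yields many low-value problems, and a lack of reasoning during problem generation, which produces only shallow variants. The key to the solution is a problem generator that reasons explicitly: it plans problem directions before synthesis and uses the solver's feedback on synthetic problems as a reward signal to calibrate difficulty, producing complementary problems near the edge of the solver's competence. Related problem pairs are augmented with intermediate problem-design chains of thought (CoT) produced by a reasoning model, which bootstraps problem-design strategies in the generator. A solver trained on the synthesized data in turn provides better rewards, enabling co-evolution of generator and solver.
Link: https://arxiv.org/abs/2511.09907
Authors: Yongxian Wei,Yilin Zhao,Li Shen,Xinrui Chen,Runxi Cheng,Sinan Du,Hao Yu,Gang Liu,Jiahong Yan,Chun Yuan,Dian Li
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: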
Abstract:Data synthesis for training large reasoning models offers a scalable alternative to limited, human-curated datasets, enabling the creation of high-quality data. However, existing approaches face several challenges: (i) indiscriminate generation that ignores the solver’s ability and yields low-value problems, or reliance on complex data pipelines to balance problem difficulty; and (ii) a lack of reasoning in problem generation, leading to shallow problem variants. In this paper, we develop a problem generator that reasons explicitly to plan problem directions before synthesis and adapts difficulty to the solver’s ability. Specifically, we construct related problem pairs and augment them with intermediate problem-design CoT produced by a reasoning model. These data bootstrap problem-design strategies from the generator. Then, we treat the solver’s feedback on synthetic problems as a reward signal, enabling the generator to calibrate difficulty and produce complementary problems near the edge of the solver’s competence. Extensive experiments on 10 mathematical and general reasoning benchmarks show that our method achieves an average improvement of 2.5% and generalizes to both language and vision-language models. Moreover, a solver trained on the synthesized data provides improved rewards for continued generator training, enabling co-evolution and yielding a further 0.7% performance gain. Our code will be made publicly available here.
[CV-87] PRISM: Diversifying Dataset Distillation by Decoupling Architectural Priors
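[Quick read]: This paper addresses the inductive bias that dataset distillation (DD) inherits from relying on a single teacher model. As dataset size increases, this bias drives generation toward overly smooth, homogeneous samples, reducing intra-class diversity and limiting generalization. The key to the solution is PRISM (PRIors from diverse Source Models), a framework that decouples the logit-matching and regularization objectives and supervises them with different teacher architectures: a primary model for logit matching and a stochastic subset of teachers for batch-normalization (BN) alignment. Disentangling architectural priors in this way yields synthetic data with noticeably richer intra-class diversity.
Link: https://arxiv.org/abs/2511.09905
Authors: Brian B. Moser,Shalini Strode,Federico Raue,Stanislav Frolov,Krzysztof Adamkiewicz,Arundhati Shanbhag,Joachim Folk,Tobias C. Nauen,Andreas Dengel
Affiliations: German Research Center for Artificial Intelligence (DFKI); RPTU Kaiserslautern-Landau
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: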
Abstract:Dataset distillation (DD) promises compact yet faithful synthetic data, but existing approaches often inherit the inductive bias of a single teacher model. As dataset size increases, this bias drives generation toward overly smooth, homogeneous samples, reducing intra-class diversity and limiting generalization. We present PRISM (PRIors from diverse Source Models), a framework that disentangles architectural priors during synthesis. PRISM decouples the logit-matching and regularization objectives, supervising them with different teacher architectures: a primary model for logits and a stochastic subset for batch-normalization (BN) alignment. On ImageNet-1K, PRISM consistently and reproducibly outperforms single-teacher methods (e.g., SRe2L) and recent multi-teacher variants (e.g., G-VBSM) at low- and mid-IPC regimes. The generated data also show significantly richer intra-class diversity, as reflected by a notable drop in cosine similarity between features. We further analyze teacher selection strategies (pre- vs. intra-distillation) and introduce a scalable cross-class batch formation scheme for fast parallel synthesis. Code will be released after the review period.
[CV-88] EgoEMS: A High-Fidelity Multimodal Egocentric Dataset for Cognitive Assistance in Emergency Medical Services AAAI2026
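[Quick read]: This paper addresses the intense cognitive demands that first responders in Emergency Medical Services (EMS) face in high-stakes situations, with the goal of supporting real-time data collection and decision making. The key contribution is EgoEMS, described as the first end-to-end, high-fidelity, multimodal, multiperson dataset of its kind. It captures over 20 hours of realistic, procedural EMS activities from an egocentric view across 233 simulated emergency scenarios performed by 62 participants, including 46 EMS professionals. The data is annotated with keysteps, timestamped audio transcripts with speaker diarization, action quality metrics, and bounding boxes with segmentation masks. The authors also provide benchmarks for real-time multimodal keystep recognition and action quality estimation to support the development of AI assistance tools for EMS.
Link: https://arxiv.org/abs/2511.09894
Authors: Keshara Weerasinghe,Xueren Ge,Tessa Heick,Lahiru Nuwan Wijayasingha,Anthony Cortez,Abhishek Satpathy,John Stankovic,Homa Alemzadeh
Affiliations: University of Virginia
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to AAAI 2026 (Preprint), 45 pages, 29 figures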
Abstract:Emergency Medical Services (EMS) are critical to patient survival in emergencies, but first responders often face intense cognitive demands in high-stakes situations. AI cognitive assistants, acting as virtual partners, have the potential to ease this burden by supporting real-time data collection and decision making. In pursuit of this vision, we introduce EgoEMS, the first end-to-end, high-fidelity, multimodal, multiperson dataset capturing over 20 hours of realistic, procedural EMS activities from an egocentric view in 233 simulated emergency scenarios performed by 62 participants, including 46 EMS professionals. Developed in collaboration with EMS experts and aligned with national standards, EgoEMS is captured using an open-source, low-cost, and replicable data collection system and is annotated with keysteps, timestamped audio transcripts with speaker diarization, action quality metrics, and bounding boxes with segmentation masks. Emphasizing realism, the dataset includes responder-patient interactions reflecting real-world emergency dynamics. We also present a suite of benchmarks for real-time multimodal keystep recognition and action quality estimation, essential for developing AI support tools for EMS. We hope EgoEMS inspires the research community to push the boundaries of intelligent EMS systems and ultimately contribute to improved patient outcomes.
[CV-89] Scale-Aware Relay and Scale-Adaptive Loss for Tiny Object Detection in Aerial Images
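[Quick read]: This paper addresses poor detection of tiny objects in aerial images. Two causes are identified: tiny objects carry limited features that are degraded or lost as they propagate through a deep network, and smaller objects receive disproportionately larger regression penalties than larger ones during training. The solution consists of two modules, a Scale-Aware Relay Layer (SARL) and a Scale-Adaptive Loss (SAL). SARL uses cross-scale spatial-channel attention to enrich the features of each layer and strengthen cross-layer feature sharing. SAL reshapes IoU-based losses to assign lower weights to larger objects dynamically, focusing training on tiny objects. Both integrate into mainstream detectors (such as YOLOv5 and YOLOx) and improve Average Precision (AP) on several benchmarks as well as robustness on a noisy real-world dataset.
Link: https://arxiv.org/abs/2511.09891
Authors: Jinfu Li,Yuqi Huang,Hong Song,Ting Wang,Jianghan Xia,Yucong Lin,Jingfan Fan,Jian Yang
Affiliations: Beijing Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: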
Abstract:Recently, despite the remarkable advancements in object detection, modern detectors still struggle to detect tiny objects in aerial images. One key reason is that tiny objects carry limited features that are inevitably degraded or lost during long-distance network propagation. Another is that smaller objects receive disproportionately greater regression penalties than larger ones during training. To tackle these issues, we propose a Scale-Aware Relay Layer (SARL) and a Scale-Adaptive Loss (SAL) for tiny object detection, both of which are seamlessly compatible with the top-performing frameworks. Specifically, SARL employs a cross-scale spatial-channel attention to progressively enrich the meaningful features of each layer and strengthen the cross-layer feature sharing. SAL reshapes the vanilla IoU-based losses so as to dynamically assign lower weights to larger objects. This loss is able to focus training on tiny objects while reducing the influence on large objects. Extensive experiments are conducted on three benchmarks (i.e., AI-TOD, DOTA-v2.0 and VisDrone2019), and the results demonstrate that the proposed method boosts the generalization ability by 5.5% Average Precision (AP) when embedded in YOLOv5 (anchor-based) and YOLOx (anchor-free) baselines. Moreover, it also promotes the robust performance with 29.0% AP on the real-world noisy dataset (i.e., AI-TOD-v2.0).
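The idea of down-weighting larger objects in an IoU-based loss can be illustrated with the sketch below. The IoU computation is standard; the weighting function is an assumption chosen only because it decreases with box area, and `scale_weight`, `scale_adaptive_iou_loss`, and `ref_area` are hypothetical names and parameters, not the paper's SAL definition.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def scale_weight(gt_box, ref_area=1024.0):
    """Hypothetical weight in (0, 1] that shrinks as the ground-truth box grows."""
    area = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])
    return 1.0 / (1.0 + area / ref_area)


def scale_adaptive_iou_loss(pred_box, gt_box, ref_area=1024.0):
    """Plain IoU loss (1 - IoU) scaled by the size-dependent weight."""
    return scale_weight(gt_box, ref_area) * (1.0 - iou(pred_box, gt_box))
```

With this weighting, the same localisation error costs more on an 8×8 box than on a 64×64 box, so gradient updates are steered toward tiny objects.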
[CV-90] HCC-3D: Hierarchical Compensatory Compression for 98% 3D Token Reduction in Vision-Language Models
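[Quick read]: This paper addresses the high computational cost of current 3D vision-language models (3D-VLMs) on point cloud data, whose bottleneck is feeding all 3D tokens into the Large Language Model (LLM). The key to the solution is Hierarchical Compensatory Compression (HCC-3D). First, Global Structure Compression (GSC) uses global queries to compress all 3D tokens into a few key tokens while keeping overall structural information. Then an Adaptive Detail Mining (ADM) module compensates for the information lost in GSC by selectively recompressing salient but under-attended features through complementary scoring. The method reaches a compression ratio of about 98% while improving both efficiency and performance.
Link: https://arxiv.org/abs/2511.09883
Authors: Liheng Zhang,Jin Wang,Hui Li,Bingfeng Zhang,Weifeng Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: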
Abstract:3D understanding has drawn significant attention recently, leveraging Vision-Language Models (VLMs) to enable multi-modal reasoning between point cloud and text data. Current 3D-VLMs directly embed the 3D point clouds into 3D tokens, following large 2D-VLMs with powerful reasoning capabilities. However, this framework has a great computational cost limiting its application, where we identify that the bottleneck lies in processing all 3D tokens in the Large Language Model (LLM) part. This raises the question: how can we reduce the computational overhead introduced by 3D tokens while preserving the integrity of their essential information? To address this question, we introduce Hierarchical Compensatory Compression (HCC-3D) to efficiently compress 3D tokens while maintaining critical detail retention. Specifically, we first propose a global structure compression (GSC), in which we design global queries to compress all 3D tokens into a few key tokens while keeping overall structural information. Then, to compensate for the information loss in GSC, we further propose an adaptive detail mining (ADM) module that selectively recompresses salient but under-attended features through complementary scoring. Extensive experiments demonstrate that HCC-3D not only achieves extreme compression ratios (approximately 98%) compared to previous 3D-VLMs, but also achieves new state-of-the-art performance, showing the great improvements on both efficiency and performance.
[CV-91] RWKV-PCSSC: Exploring RWKV Model for Point Cloud Semantic Scene Completion ACM-MM
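[Quick read]: This paper addresses the large parameter counts and high resource demands of existing point cloud Semantic Scene Completion (SSC) methods, which rely on dense network architectures. The key to the solution is RWKV-PCSSC, a lightweight network based on the Receptance Weighted Key Value (RWKV) mechanism. It has two main components: (1) an RWKV Seed Generator (RWKV-SG) module that aggregates features from a partial point cloud to produce a coarse point cloud with coarse features; and (2) multiple stages of RWKV Point Deconvolution (RWKV-PD) modules that progressively restore point-wise features. The design reduces the parameter count by 4.18× and improves memory efficiency by 1.37× relative to PointSSC while matching or exceeding state-of-the-art performance.
Link: https://arxiv.org/abs/2511.09878
Authors: Wenzhe He,Xiaojun Chen,Wentang Chen,Hongyu Wang,Ying Liu,Ruihui Li
Affiliations: Hunan University; Hunan Normal University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 8 figures, published to ACM MM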
Abstract:Semantic Scene Completion (SSC) aims to generate a complete semantic scene from an incomplete input. Existing approaches often employ dense network architectures with a high parameter count, leading to increased model complexity and resource demands. To address these limitations, we propose RWKV-PCSSC, a lightweight point cloud semantic scene completion network inspired by the Receptance Weighted Key Value (RWKV) mechanism. Specifically, we introduce a RWKV Seed Generator (RWKV-SG) module that can aggregate features from a partial point cloud to produce a coarse point cloud with coarse features. Subsequently, the point-wise feature of the point cloud is progressively restored through multiple stages of the RWKV Point Deconvolution (RWKV-PD) modules. By leveraging a compact and efficient design, our method achieves a lightweight model representation. Experimental results demonstrate that RWKV-PCSSC reduces the parameter count by 4.18× and improves memory efficiency by 1.37× compared to the state-of-the-art method PointSSC. Furthermore, our network achieves state-of-the-art performance on established indoor (SSC-PC, NYUCAD-PC) and outdoor (PointSSC) scene datasets, as well as on our proposed datasets (NYUCAD-PC-V2, 3D-FRONT-PC).
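[CV-92] SAM-DAQ: Segment Anything Model with Depth-guided Adaptive Queries for RGB-D Video Salient Object Detection AAAI2026
[Quick read]: This paper addresses three challenges in applying the foundation model SAM2 directly to RGB-D video salient object detection (RGB-D VSOD): dependence on manual prompts, the high memory consumption of sequential adapters, and the computational burden of memory attention. The key to the solution is SAM-DAQ, which integrates depth and temporal cues within a unified framework. First, a parallel adapter-based multi-modal image encoder (PAMIE) embeds depth-guided parallel adapters (DPAs) that use depth cues to fuse multi-modal features without manual prompts. Second, a query-driven temporal memory (QTM) module unifies the memory bank and prompt embeddings into a learnable pipeline and uses frame-level and video-level queries together, selectively extracting temporally consistent features and iteratively updating the query representations. The authors report consistent gains over state-of-the-art methods on three RGB-D VSOD datasets.
Link: https://arxiv.org/abs/2511.09870
Authors: Jia Lin,Xiaofei Zhou,Jiyuan Liu,Runmin Cong,Guodao Zhang,Zhi Liu,Jiyong Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to 40th AAAI Conference on Artificial Intelligence (AAAI 2026)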
Abstract:Recently, the segment anything model (SAM) has attracted widespread attention, and it is often treated as a vision foundation model for universal segmentation. Some researchers have attempted to directly apply the foundation model to the RGB-D video salient object detection (RGB-D VSOD) task, which often encounters three challenges, including the dependence on manual prompts, the high memory consumption of sequential adapters, and the computational burden of memory attention. To address the limitations, we propose a novel method, namely Segment Anything Model with Depth-guided Adaptive Queries (SAM-DAQ), which adapts SAM2 to pop out salient objects from videos by seamlessly integrating depth and temporal cues within a unified framework. Firstly, we deploy a parallel adapter-based multi-modal image encoder (PAMIE), which incorporates several depth-guided parallel adapters (DPAs) in a skip-connection way. Remarkably, we fine-tune the frozen SAM encoder under prompt-free conditions, where the DPA utilizes depth cues to facilitate the fusion of multi-modal features. Secondly, we deploy a query-driven temporal memory (QTM) module, which unifies the memory bank and prompt embeddings into a learnable pipeline. Concretely, by leveraging both frame-level queries and video-level queries simultaneously, the QTM module can not only selectively extract temporal consistency features but also iteratively update the temporal representations of the queries. Extensive experiments are conducted on three RGB-D VSOD datasets, and the results show that the proposed SAM-DAQ consistently outperforms state-of-the-art methods in terms of all evaluation metrics.
[CV-93] Remember Me: Bridging the Long-Range Gap in LVLMs with Three-Step Inference-Only Decay Resilience Strategies AAAI2026
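[Quick read]: This paper addresses the difficulty Large Vision-Language Models (LVLMs) have in modeling long-range dependencies when using Rotary Positional Encoding (RoPE). RoPE models token positions precisely, but it causes attention to decay progressively as token distance increases, which weakens the model's memory of global context. The key to the solution is a set of inference-only Three-step Decay Resilience Strategies (T-DRS): (1) Semantic-Driven DRS (SD-DRS) amplifies semantically meaningful but distant signals through content-aware residuals; (2) Distance-aware Control DRS (DC-DRS) smoothly modulates attention weights based on positional distance, suppressing noise while preserving locality; (3) re-Reinforce Distant DRS (reRD-DRS) consolidates the remaining informative remote dependencies to maintain global coherence. T-DRS recovers suppressed long-range token pairs without harming local inductive biases and improves Visual Question Answering (VQA) performance without training.
Link: https://arxiv.org/abs/2511.09868
Authors: Peng Gao,Yujian Lee,Xiaofeng Zhang,Zailong Chen,Hui Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in AAAI 2026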
Abstract:Large Vision-Language Models (LVLMs) have achieved impressive performance across a wide range of multimodal tasks. However, they still face critical challenges in modeling long-range dependencies under the usage of Rotary Positional Encoding (RoPE). Although it can facilitate precise modeling of token positions, it induces progressive attention decay as token distance increases, especially over distant token pairs, which severely impairs the model’s ability to remember global context. To alleviate this issue, we propose inference-only Three-step Decay Resilience Strategies (T-DRS), comprising (1) Semantic-Driven DRS (SD-DRS), amplifying semantically meaningful but distant signals via content-aware residuals, (2) Distance-aware Control DRS (DC-DRS), which can purify attention by smoothly modulating weights based on positional distances, suppressing noise while preserving locality, and (3) re-Reinforce Distant DRS (reRD-DRS), consolidating the remaining informative remote dependencies to maintain global coherence. Together, the T-DRS recover suppressed long-range token pairs without harming local inductive biases. Extensive experiments on Visual Question Answering (VQA) benchmarks demonstrate that T-DRS can consistently improve performance in a training-free manner. The code can be accessed in this https URL
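The decay that motivates T-DRS can be reproduced numerically. The sketch below is not part of the paper; it uses the standard RoPE frequencies θ_i = base^(-2i/d) and the simplifying assumption that the query and key are both all-ones vectors, in which case the rotated dot product at relative distance m reduces to Σ_i 2·cos(m·θ_i). The function name and default values are illustrative.

```python
import math


def rope_score(rel_dist, dim=64, base=10000.0):
    """Dot product of RoPE-rotated all-ones query and key at a relative distance.

    Each 2-D pair (1, 1) has squared norm 2, and rotating one copy by
    rel_dist * theta_i relative to the other gives 2 * cos(rel_dist * theta_i).
    """
    return sum(
        2.0 * math.cos(rel_dist * base ** (-2.0 * i / dim))
        for i in range(dim // 2)
    )


# Score for identical content at increasing token distances.
scores = {m: rope_score(m) for m in (0, 1, 100)}
```

Even with identical content, the score is highest at distance 0 and lower at distance 100 than at distance 1, because the high-frequency pairs fall out of phase. The curve oscillates rather than decreasing strictly, so the decay is best read as a trend in the envelope.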
[CV-94] IPCD: Intrinsic Point-Cloud Decomposition WACV2026
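[Quick read]: This paper addresses the separation of albedo from shade in point cloud data, a prerequisite for relighting and texture editing in augmented reality (AR) and robotics. Conventional image-based decomposition models are ineffective because point clouds lack a regular grid structure, and existing point-cloud models do not explicitly consider global light direction, which leads to inaccurate shade. The solution is Intrinsic Point-Cloud Decomposition (IPCD), with two main components: (1) IPCD-Net extends an image-based decomposition model with point-wise feature aggregation to handle non-grid data; (2) Projection-based Luminance Distribution (PLD) with hierarchical feature refinement captures global light cues through multi-view projection, improving shade estimation.
Link: https://arxiv.org/abs/2511.09866
Authors: Shogo Sato,Takuhiro Kaneko,Shoichiro Takeda,Tomoyasu Shimada,Kazuhiko Murasaki,Taiga Yoshida,Ryuichi Tanida,Akisato Kimura
Affiliations: NTT Corporation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in WACV2026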
Abstract:Point clouds are widely used in various fields, including augmented reality (AR) and robotics, where relighting and texture editing are crucial for realistic visualization. Achieving these tasks requires accurately separating albedo from shade. However, performing this separation on point clouds presents two key challenges: (1) the non-grid structure of point clouds makes conventional image-based decomposition models ineffective, and (2) point-cloud models designed for other tasks do not explicitly consider global-light direction, resulting in inaccurate shade. In this paper, we introduce \textbfIntrinsic Point-Cloud Decomposition (IPCD), which extends image decomposition to the direct decomposition of colored point clouds into albedo and shade. To overcome challenge (1), we propose \textbfIPCD-Net that extends image-based model with point-wise feature aggregation for non-grid data processing. For challenge (2), we introduce \textbfProjection-based Luminance Distribution (PLD) with a hierarchical feature refinement, capturing global-light ques via multi-view projection. For comprehensive evaluation, we create a synthetic outdoor-scene dataset. Experimental results demonstrate that IPCD-Net reduces cast shadows in albedo and enhances color accuracy in shade. Furthermore, we showcase its applications in texture editing, relighting, and point-cloud registration under varying illumination. Finally, we verify the real-world applicability of IPCD-Net.
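To make the albedo/shade split concrete, here is a minimal sketch of how a decomposed colored point cloud could be recombined and relit. It assumes the common multiplicative intrinsic model (color = albedo × shade), which the abstract does not spell out; the function names and the toy values are hypothetical and are not taken from IPCD-Net.

```python
import numpy as np

def compose(albedo, shade):
    """Recombine per-point albedo and shade into observed color (assumed model)."""
    return albedo * shade

def relight(albedo, new_shade):
    """Relight by pairing the same albedo with a different shade."""
    return np.clip(albedo * new_shade, 0.0, 1.0)

# Toy colored point cloud: 3 points, RGB values in [0, 1].
albedo = np.array([[0.8, 0.2, 0.2],
                   [0.2, 0.8, 0.2],
                   [0.2, 0.2, 0.8]])
shade = np.array([[1.0], [0.5], [0.25]])   # one grayscale shade value per point

color = compose(albedo, shade)             # what a sensor would record
relit = relight(albedo, shade[::-1])       # same materials, reversed lighting
```

Once albedo is separated, texture editing amounts to changing `albedo` and relighting to changing `shade`, which is why an accurate split matters for both applications.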
[CV-95] CORONA-Fields: Leveraging Foundation Models for Classification of Solar Wind Phenomena
Quick read: This paper addresses the difficulty of automatically classifying solar wind structures in space weather events, in particular how to fuse remote-sensing observations with in situ measurements to identify solar wind structures more accurately. The key idea is to use a pretrained foundation model to extract embeddings from Solar Dynamics Observatory imagery and to combine them with the spacecraft position and solar magnetic connectivity, both encoded with Fourier features, in a neural field-based deep learning architecture that maps remote observations to in situ measurements. The work is a first proof of concept that foundation model embeddings are feasible for solar wind structure classification and offers a path toward more reliable solar wind forecasting.
Link: https://arxiv.org/abs/2511.09843
Authors: Daniela Martin, Jinsu Hong, Connor O’Brien, Valmir P Moraes Filho, Jasmine R. Kobayashi, Evangelia Samara, Joseph Gallego
Affiliations: University of Delaware; Georgia State University; Boston University; Catholic University of America; Southwest Research Institute; NASA Goddard Space Flight Center; Drexel University
Categories: Computer Vision and Pattern Recognition (cs.CV); Instrumentation and Methods for Astrophysics (astro-ph.IM); Solar and Stellar Astrophysics (astro-ph.SR)
Comments:
Abstract: Space weather at Earth, driven by solar activity, poses growing risks to satellites around our planet as well as to critical ground-based technological infrastructure. Major space weather contributors are the solar wind and coronal mass ejections, whose variable density, speed, temperature, and magnetic field make the automated classification of those structures challenging. In this work, we adapt a foundation model for solar physics, originally trained on Solar Dynamics Observatory imagery, to create embeddings suitable for solar wind structure analysis. These embeddings are concatenated with the spacecraft position and solar magnetic connectivity, encoded using Fourier features, to form the input of a neural field-based model. The full deep learning architecture is fine-tuned, bridging the gap between remote sensing and in situ observations. Labels are derived from Parker Solar Probe measurements, forming a downstream classification task that maps plasma properties to solar wind structures. Although overall classification performance is modest, likely due to coarse labeling, class imbalance, and limited transferability of the pretrained model, this study demonstrates the feasibility of leveraging foundation model embeddings for in situ solar wind tasks. As a first proof-of-concept, it lays the groundwork for future improvements toward more reliable space weather predictions. The code and configuration files used in this study are publicly available to support reproducibility.
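The Fourier-feature encoding mentioned above can be sketched as follows. This is an illustrative version only: the frequency schedule (powers of two times π), the number of frequencies, the 16-dimensional placeholder embedding, and the function name `fourier_features` are assumptions, not details from the paper.

```python
import numpy as np

def fourier_features(x, num_freqs=4):
    """Encode each coordinate with sin/cos at several frequencies."""
    x = np.asarray(x, dtype=float)
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi      # (F,) assumed schedule
    angles = x[..., None] * freqs                      # (..., D, F)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return feats.reshape(*x.shape[:-1], -1)            # (..., D * 2F)

image_embedding = np.zeros(16)              # placeholder for a foundation-model embedding
position = np.array([0.1, -0.3, 0.7])       # hypothetical normalized spacecraft position

# Concatenate the embedding with the encoded position, as the abstract describes.
model_input = np.concatenate([image_embedding, fourier_features(position)])
```

Encoding low-dimensional coordinates this way gives a downstream network access to high-frequency variation in position, which is the usual motivation for Fourier features in neural fields.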
[CV-96] CertMask: Certifiable Defense Against Adversarial Patches via Theoretically Optimal Mask Coverage
Quick read: This paper addresses the threat that adversarial patch attacks pose to deep vision models; such attacks mislead a model by injecting localized perturbations into an image and can be deployed physically, posing serious risks to real-world applications. The key contribution is CertMask, a certifiably robust defense built on a mathematically rigorous set of binary masks that covers every possible patch location at least k times, which gives a theoretical guarantee that the patch's effect is neutralized. Compared with the state-of-the-art PatchCleanser, CertMask needs only one round of masking, reducing inference cost from O(n^2) to O(n), where n is the cardinality of the mask set, and it improves certified robust accuracy by up to 13.4% while keeping clean accuracy close to that of the vanilla model.
Link: https://arxiv.org/abs/2511.09834
Authors: Xuntao Lyu, Ching-Chi Lin, Abdullah Al Arafat, Georg von der Brüggen, Jian-Jia Chen, Zhishan Guo
Affiliations: University of California, San Diego
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Adversarial patch attacks inject localized perturbations into images to mislead deep vision models. These attacks can be physically deployed, posing serious risks to real-world applications. In this paper, we propose CertMask, a certifiably robust defense that constructs a provably sufficient set of binary masks to neutralize patch effects with strong theoretical guarantees. While the state-of-the-art approach (PatchCleanser) requires two rounds of masking and incurs O(n^2) inference cost, CertMask performs only a single round of masking with O(n) time complexity, where n is the cardinality of the mask set to cover an input image. Our proposed mask set is computed using a mathematically rigorous coverage strategy that ensures each possible patch location is covered at least k times, providing both efficiency and robustness. We offer a theoretical analysis of the coverage condition and prove its sufficiency for certification. Experiments on ImageNet, ImageNette, and CIFAR-10 show that CertMask improves certified robust accuracy by up to +13.4% over PatchCleanser, while maintaining clean accuracy nearly identical to the vanilla model.
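The coverage condition can be checked mechanically. The sketch below is not CertMask's construction: it assumes square masks placed on a regular stride (a hypothetical layout) and counts, for every possible patch location, how many masks fully cover it. It also compares the number of forward passes for one masking round (n) with a rough worst-case count for two rounds (n + n^2).

```python
import numpy as np

def mask_offsets(image_size, mask_size, stride):
    """Top-left offsets along one axis, always including the last valid offset."""
    offsets = list(range(0, image_size - mask_size + 1, stride))
    if offsets[-1] != image_size - mask_size:
        offsets.append(image_size - mask_size)
    return offsets

def min_coverage(image_size, patch_size, mask_size, stride):
    """Return (number of 2D masks, minimum times any patch location is fully covered)."""
    offsets = mask_offsets(image_size, mask_size, stride)
    per_axis = np.array([
        sum(1 for m in offsets if m <= p and p + patch_size <= m + mask_size)
        for p in range(image_size - patch_size + 1)
    ])
    # 2D masks are products of 1D offsets, so the 2D minimum is the 1D minimum squared.
    return len(offsets) ** 2, int(per_axis.min() ** 2)

n, k = min_coverage(image_size=32, patch_size=4, mask_size=12, stride=4)
single_round_passes = n            # one masked forward pass per mask
two_round_passes = n + n * n       # rough worst case for a two-round scheme
```

In this toy layout patches near the image border are covered only once, so a guarantee of "at least k times" with k > 1 requires a more careful mask set than a plain stride; the paper's coverage strategy is what addresses that.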
[CV-97] AHA! Animating Human Avatars in Diverse Scenes with Gaussian Splatting
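Quick read: This paper addresses geometry-consistent human animation and interaction in 3D scenes, in particular how to use 3D Gaussian Splatting (3DGS) as a unified 3D representation for high-quality free-viewpoint rendering of humans together with scenes, without paired human-scene data. The key is to decouple rendering from motion synthesis: a Gaussian-aligned motion module uses opacity-based cues and projected Gaussian structures to guide human placement and pose alignment without explicit scene geometry, and a human-scene Gaussian refinement optimization enforces natural contact and navigation, keeping the animation realistic and geometrically consistent.
Link: https://arxiv.org/abs/2511.09827
Authors: Aymen Mir, Jian Wang, Riza Alp Guler, Chuan Guo, Gerard Pons-Moll, Bing Zhou
Affiliations: Snap Inc.; Tübingen AI Center, University of Tübingen, Germany
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: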
Abstract:We present a novel framework for animating humans in 3D scenes using 3D Gaussian Splatting (3DGS), a neural scene representation that has recently achieved state-of-the-art photorealistic results for novel-view synthesis but remains under-explored for human-scene animation and interaction. Unlike existing animation pipelines that use meshes or point clouds as the underlying 3D representation, our approach introduces the use of 3DGS as the 3D representation to the problem of animating humans in scenes. By representing humans and scenes as Gaussians, our approach allows for geometry-consistent free-viewpoint rendering of humans interacting with 3D scenes. Our key insight is that the rendering can be decoupled from the motion synthesis and each sub-problem can be addressed independently, without the need for paired human-scene data. Central to our method is a Gaussian-aligned motion module that synthesizes motion without explicit scene geometry, using opacity-based cues and projected Gaussian structures to guide human placement and pose alignment. To ensure natural interactions, we further propose a human-scene Gaussian refinement optimization that enforces realistic contact and navigation. We evaluate our approach on scenes from Scannet++ and the SuperSplat library, and on avatars reconstructed from sparse and dense multi-view human capture. Finally, we demonstrate that our framework allows for novel applications such as geometry-consistent free-viewpoint rendering of edited monocular RGB videos with new animated humans, showcasing the unique advantage of 3DGS for monocular video-based human animation.
[CV-98] From Street to Orbit: Training-Free Cross-View Retrieval via Location Semantics and LLM Guidance WACV2026
Quick read: This paper addresses street-to-satellite matching in cross-view image retrieval, targeting the limited deployability of existing methods that depend on supervised training and on specific capture setups such as panoramic or UAV imagery. The key is a training-free end-to-end framework in which a pretrained vision encoder (e.g., DINOv2) works together with a large language model (LLM): web-based image search extracts geographic cues, the LLM infers the location, and a geocoding API produces the satellite query from which matching tiles are retrieved; PCA-based whitening refines the features. In the zero-shot setting the method outperforms prior learning-based approaches. The pipeline also supports automatic construction of semantically aligned street-to-satellite datasets, a scalable and low-cost alternative to manual annotation.
Link: https://arxiv.org/abs/2511.09820
Authors: Jeongho Min, Dongyoung Kim, Jaehyup Lee
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to WACV 2026, 10 pages, 4 figures
Abstract: Cross-view image retrieval, particularly street-to-satellite matching, is a critical task for applications such as autonomous navigation, urban planning, and localization in GPS-denied environments. However, existing approaches often require supervised training on curated datasets and rely on panoramic or UAV-based images, which limits real-world deployment. In this paper, we present a simple yet effective cross-view image retrieval framework that leverages a pretrained vision encoder and a large language model (LLM), requiring no additional training. Given a monocular street-view image, our method extracts geographic cues through web-based image search and LLM-based location inference, generates a satellite query via a geocoding API, and retrieves matching tiles using a pretrained vision encoder (e.g., DINOv2) with PCA-based whitening feature refinement. Despite using no ground-truth supervision or finetuning, our proposed method outperforms prior learning-based approaches on the benchmark dataset under zero-shot settings. Moreover, our pipeline enables automatic construction of semantically aligned street-to-satellite datasets, offering a scalable and cost-efficient alternative to manual annotation. All source code will be made publicly available at this https URL.
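The retrieval step with PCA-based whitening could look roughly like the sketch below. Random vectors stand in for encoder features (no DINOv2 model is loaded), and the output dimension, the `eps` regularizer, and all function names are assumptions made for illustration.

```python
import numpy as np

def fit_pca_whitening(feats, dim, eps=1e-6):
    """Learn a projection that decorrelates features and equalizes their variance."""
    mean = feats.mean(axis=0)
    centered = feats - mean
    cov = centered.T @ centered / (len(feats) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)              # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:dim]             # keep the top-`dim` components
    proj = eigvecs[:, order] / np.sqrt(eigvals[order] + eps)
    return mean, proj

def whiten(feats, mean, proj):
    """Project, then L2-normalize so a dot product equals cosine similarity."""
    z = (feats - mean) @ proj
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def retrieve(query, gallery):
    """Index of the gallery item most similar to the query."""
    return int(np.argmax(gallery @ query))

rng = np.random.default_rng(0)
gallery_raw = rng.normal(size=(50, 16))                  # stand-in satellite tile features
query_raw = gallery_raw[7] + 0.01 * rng.normal(size=16)  # stand-in street-view feature

mean, proj = fit_pca_whitening(gallery_raw, dim=8)
gallery = whiten(gallery_raw, mean, proj)
best = retrieve(whiten(query_raw, mean, proj), gallery)
```

Whitening needs only feature statistics, not labels, so it fits the training-free setting described in the abstract.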
[CV-99] Lumos3D: A Single-Forward Framework for Low-Light 3D Scene Restoration
Quick read: This paper addresses 3D scene restoration under low-light conditions. Existing methods usually depend on precomputed camera poses and scene-specific optimization, which limits their scalability to dynamic real-world environments. The key is Lumos3D, a generalizable pose-free framework that is trained once on a single dataset and then restores illumination and structure directly from unposed multi-view low-light images in a purely feed-forward pass, with no per-scene training or optimization. Its main components are a geometry-grounded backbone that reconstructs a normal-light 3D Gaussian representation, a cross-illumination distillation scheme that transfers geometric information (such as depth) learned under normal light to the student model, and a dedicated Lumos loss that promotes photometric consistency in the reconstructed 3D space, yielding high-fidelity low-light 3D restoration that generalizes well to unseen scenes.
Link: https://arxiv.org/abs/2511.09818
Authors: Hanzhou Liu, Peng Jiang, Jia Huang, Mi Lu
Affiliations: Texas A&M University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract: Restoring 3D scenes captured under low-light conditions remains a fundamental yet challenging problem. Most existing approaches depend on precomputed camera poses and scene-specific optimization, which greatly restricts their scalability to dynamic real-world environments. To overcome these limitations, we introduce Lumos3D, a generalizable pose-free framework for 3D low-light scene restoration. Trained once on a single dataset, Lumos3D performs inference in a purely feed-forward manner, directly restoring illumination and structure from unposed, low-light multi-view images without any per-scene training or optimization. Built upon a geometry-grounded backbone, Lumos3D reconstructs a normal-light 3D Gaussian representation that restores illumination while faithfully preserving structural details. During training, a cross-illumination distillation scheme is employed, where the teacher network is distilled on normal-light ground truth to transfer accurate geometric information, such as depth, to the student model. A dedicated Lumos loss is further introduced to promote photometric consistency within the reconstructed 3D space. Experiments on real-world datasets demonstrate that Lumos3D achieves high-fidelity low-light 3D scene restoration with accurate geometry and strong generalization to unseen cases. Furthermore, the framework naturally extends to handle over-exposure correction, highlighting its versatility for diverse lighting restoration tasks.
[CV-100] Test-Time Spectrum-Aware Latent Steering for Zero-Shot Generalization in Vision-Language Models NEURIPS2025
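Quick read: This paper addresses the performance drop of Vision-Language Models (VLMs) under test-time domain shifts, especially when adapting to a single unlabeled image. Existing methods such as test-time prompt tuning typically require backpropagating through large encoder weights or modifying core model components, which is computationally expensive. The key is Spectrum-Aware Test-Time Steering (STS), a lightweight framework that extracts a spectral subspace from the text embeddings to define principal semantic directions and adapts only a small number of per-sample shift parameters, steering latent representations in a spectrum-aware manner to minimize entropy across augmented views. STS runs entirely at inference time, without backpropagation through the frozen encoders or changes to their structure, which substantially improves adaptation speed and memory efficiency.
Link: https://arxiv.org/abs/2511.09809
Authors: Konstantinos M. Dafnis, Dimitris N. Metaxas
Affiliations: Rutgers University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: NeurIPS 2025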
Abstract:Vision-Language Models (VLMs) excel at zero-shot inference but often degrade under test-time domain shifts. For this reason, episodic test-time adaptation strategies have recently emerged as powerful techniques for adapting VLMs to a single unlabeled image. However, existing adaptation strategies, such as test-time prompt tuning, typically require backpropagating through large encoder weights or altering core model components. In this work, we introduce Spectrum-Aware Test-Time Steering (STS), a lightweight adaptation framework that extracts a spectral subspace from the textual embeddings to define principal semantic directions and learns to steer latent representations in a spectrum-aware manner by adapting a small number of per-sample shift parameters to minimize entropy across augmented views. STS operates entirely at inference in the latent space, without backpropagation through or modification of the frozen encoders. Building on standard evaluation protocols, our comprehensive experiments demonstrate that STS largely surpasses or compares favorably against state-of-the-art test-time adaptation methods, while introducing only a handful of additional parameters and achieving inference speeds up to 8x faster with a 12x smaller memory footprint than conventional test-time prompt tuning. The code is available at this https URL.
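A toy version of the steering idea is sketched below under several assumptions that the abstract does not confirm: the spectral subspace is taken as the top right singular vectors of the centered text embeddings, the shift is applied to the image-view embeddings, and a finite-difference gradient replaces automatic differentiation to keep the example dependency-free. Random vectors stand in for the frozen encoders' outputs, and all names are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def spectral_basis(text_emb, k):
    """Top-k right singular vectors of the centered text embeddings, shape (d, k)."""
    _, _, vt = np.linalg.svd(text_emb - text_emb.mean(axis=0), full_matrices=False)
    return vt[:k].T

def marginal_entropy(gamma, views, text_emb, basis, temp=0.05):
    """Entropy of the class distribution averaged over augmented views."""
    z = views + basis @ gamma                       # shift every view inside the subspace
    z = z / np.linalg.norm(z, axis=-1, keepdims=True)
    p = softmax(z @ text_emb.T / temp).mean(axis=0)
    return float(-(p * np.log(p + 1e-12)).sum())

def steer(views, text_emb, k=3, steps=30, lr=0.05, h=1e-4):
    """Adapt only k shift parameters; return the best shift seen and its entropy."""
    basis = spectral_basis(text_emb, k)
    gamma = np.zeros(k)
    best_gamma, best = gamma.copy(), marginal_entropy(gamma, views, text_emb, basis)
    for _ in range(steps):
        grad = np.array([
            (marginal_entropy(gamma + h * e, views, text_emb, basis)
             - marginal_entropy(gamma - h * e, views, text_emb, basis)) / (2 * h)
            for e in np.eye(k)
        ])
        gamma = gamma - lr * grad
        value = marginal_entropy(gamma, views, text_emb, basis)
        if value < best:
            best_gamma, best = gamma.copy(), value
    return best_gamma, best

rng = np.random.default_rng(0)
text_emb = rng.normal(size=(5, 16))                 # stand-in class text embeddings
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)
views = rng.normal(size=(8, 16))                    # stand-in augmented-view embeddings

gamma, final_entropy = steer(views, text_emb)
```

The point of the sketch is the parameter count: only `k` numbers are adapted per sample, and the encoders that produced `views` and `text_emb` are never touched.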
[CV-101] PANDA - Patch And Distribution-Aware Augmentation for Long-Tailed Exemplar-Free Continual Learning AAAI2026
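Quick read: This paper targets catastrophic forgetting in Exemplar-Free Continual Learning (EFCL), where data from earlier tasks cannot be stored, and in particular the fact that existing methods based on pre-trained models (PTMs) overlook the dual-level imbalance of real-world data: an uneven distribution across tasks combined with extreme or reversed skews within individual tasks. The key is the PANDA framework, whose main components are: (1) patch-aware augmentation, which uses a CLIP encoder to identify representative regions of low-frequency classes and transplants them into frequent-class samples to strengthen the representation of low-frequency classes; (2) an adaptive balancing strategy that uses prior task distributions to smooth inter-task imbalance, reducing the gap between average samples across tasks and enabling fairer learning with frozen PTMs. Experiments show that PANDA improves accuracy and reduces catastrophic forgetting.
Link: https://arxiv.org/abs/2511.09791
Authors: Siddeshwar Raghavan, Jiangpeng He, Fengqing Zhu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Accepted in AAAI 2026 Main Technical Track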
Abstract:Exemplar-Free Continual Learning (EFCL) restricts the storage of previous task data and is highly susceptible to catastrophic forgetting. While pre-trained models (PTMs) are increasingly leveraged for EFCL, existing methods often overlook the inherent imbalance of real-world data distributions. We discovered that real-world data streams commonly exhibit dual-level imbalances, dataset-level distributions combined with extreme or reversed skews within individual tasks, creating both intra-task and inter-task disparities that hinder effective learning and generalization. To address these challenges, we propose PANDA, a Patch-and-Distribution-Aware Augmentation framework that integrates seamlessly with existing PTM-based EFCL methods. PANDA amplifies low-frequency classes by using a CLIP encoder to identify representative regions and transplanting those into frequent-class samples within each task. Furthermore, PANDA incorporates an adaptive balancing strategy that leverages prior task distributions to smooth inter-task imbalances, reducing the overall gap between average samples across tasks and enabling fairer learning with frozen PTMs. Extensive experiments and ablation studies demonstrate PANDA’s capability to work with existing PTM-based CL methods, improving accuracy and reducing catastrophic forgetting.
[CV-102] STORM: Segment Track and Object Re-Localization from a Single 3D Model
Quick read: This paper addresses the reliance of existing 6D pose estimation methods on manually annotated segmentation masks of the target, which is labor-intensive, and their degraded performance under occlusion or rapid motion. The key is STORM (Segment, Track, and Object Re-localization from a single 3D Model), a real-time 6D pose estimation system that needs no manual annotation. Its main innovations are: (1) a three-stage pipeline combining vision-language understanding with self-supervised feature matching, in which contextual object descriptions guide localization, self-cross-attention mechanisms identify candidate regions, and a segmentation model produces precise masks for accurate pose estimation; (2) an automatic re-registration mechanism that detects tracking failures by monitoring feature similarity and recovers pose estimation after severe occlusion or rapid motion. The system reaches state-of-the-art accuracy on challenging industrial datasets and runs in real time without additional training, which significantly lowers deployment overhead.
Link: https://arxiv.org/abs/2511.09771
Authors: Yu Deng, Teng Cao, Hikaru Shindo, Jiahong Xue, Quentin Delfosse, Kristian Kersting
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract: Accurate 6D pose estimation and tracking are fundamental capabilities for physical AI systems such as robots. However, existing approaches typically rely on a manually annotated segmentation mask of the target in the first frame, which is labor-intensive and leads to reduced performance when faced with occlusions or rapid movement. To address these limitations, we propose STORM (Segment, Track, and Object Re-localization from a single 3D Model), an open-source robust real-time 6D pose estimation system that requires no manual annotation. STORM employs a novel three-stage pipeline combining vision-language understanding with self-supervised feature matching: contextual object descriptions guide localization, self-cross-attention mechanisms identify candidate regions, and a segmentation model produces precise masks for accurate pose estimation. Another key innovation is our automatic re-registration mechanism that detects tracking failures through feature similarity monitoring and recovers from severe occlusions or rapid motion. STORM achieves state-of-the-art accuracy on challenging industrial datasets featuring multi-object occlusions, high-speed motion, and varying illumination, while operating at real-time speeds without additional training. This annotation-free approach significantly reduces deployment overhead, providing a practical solution for modern applications, such as flexible manufacturing and intelligent quality control.
[CV-103] Gradient-Guided Exploration of Generative Models Latent Space for Controlled Iris Image Augmentations
Quick read: This paper addresses the reliability problems that insufficient dataset diversity causes when developing iris recognition and presentation attack detection methods, in particular how to generate iris images with controlled attribute changes while preserving iris identity. The key is a new iris image augmentation strategy based on traversing a generative model’s latent space: by following the gradient of specific differentiable geometrical, textural, or quality-related features (e.g., sharpness, pupil size, iris size, or pupil-to-iris ratio), the method searches the latent space for latent codes that represent the same identity with the desired attributes manipulated, giving precise control over image attributes while keeping identity consistent. The method extends to any attribute for which a differentiable loss can be defined, and it can start either from images generated by a pre-trained GAN or from real iris images projected into the latent space.
Link: https://arxiv.org/abs/2511.09749
Authors: Mahsa Mitcheff, Siamul Karim Khan, Adam Czajka
Affiliations: University of Notre Dame
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract: Developing reliable iris recognition and presentation attack detection methods requires diverse datasets that capture realistic variations in iris features and a wide spectrum of anomalies. Because of the rich texture of iris images, which spans a wide range of spatial frequencies, synthesizing same-identity iris images while controlling specific attributes remains challenging. In this work, we introduce a new iris image augmentation strategy by traversing a generative model’s latent space toward latent codes that represent same-identity samples but with some desired iris image properties manipulated. The latent space traversal is guided by a gradient of specific geometrical, textural, or quality-related iris image features (e.g., sharpness, pupil size, iris size, or pupil-to-iris ratio) and preserves the identity represented by the image being manipulated. The proposed approach can be easily extended to manipulate any attribute for which a differentiable loss term can be formulated. Additionally, our approach can start from either images randomly generated by a pre-trained GAN model or real-world iris images: GAN inversion projects any given iris image into the latent space to obtain its corresponding latent code.
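Gradient-guided latent traversal can be illustrated with a deliberately simple stand-in. In the sketch below the "generator" and "identity encoder" are random linear maps and the attribute is mean intensity; all three are hypothetical substitutes for a GAN, an iris matcher, and a measure such as sharpness. The loss is an assumed form (squared attribute error plus a weighted identity-drift term), not the paper's exact objective.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 8)) / 8.0     # toy linear "generator": image = W @ z
P = rng.normal(size=(4, 64)) / 8.0     # toy linear "identity encoder"
u = np.ones(64) / 64.0                 # attribute = mean intensity (stand-in for sharpness)

def generate(z):
    return W @ z

def attribute(x):
    return float(u @ x)

def identity(x):
    return P @ x

def traverse(z0, target, lam=1.0, steps=200):
    """Gradient descent on (attribute - target)^2 + lam * ||identity drift||^2."""
    id0 = identity(generate(z0))
    hessian = 2.0 * W.T @ (np.outer(u, u) + lam * P.T @ P) @ W
    lr = 1.0 / np.linalg.eigvalsh(hessian).max()    # safe step size for this quadratic loss
    z = z0.copy()
    for _ in range(steps):
        x = generate(z)
        grad_x = 2.0 * (attribute(x) - target) * u + 2.0 * lam * P.T @ (identity(x) - id0)
        z = z - lr * (W.T @ grad_x)                 # chain rule through the linear generator
    return z

z0 = rng.normal(size=8)
target = attribute(generate(z0)) + 0.1              # ask for a slightly larger attribute value
z1 = traverse(z0, target)
```

With a real GAN the gradient would come from automatic differentiation rather than a closed form, but the structure is the same: one term pulls the attribute toward its target and another penalizes identity drift.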
[CV-104] Feature Quality and Adaptability of Medical Foundation Models: A Comparative Evaluation for Radiographic Classification and Segmentation
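Quick read: This paper addresses the widely varying generalization of foundation models (FMs) in medical imaging by clarifying how pre-training domain (medical vs. general), paradigm (e.g., text-guided), and architecture affect embedding quality, so as to guide the choice of encoder for specific radiology tasks. The authors systematically evaluate vision encoders from eight medical and general-domain FMs on chest X-ray analysis, covering classification (pneumothorax, cardiomegaly) and segmentation (pneumothorax, cardiac boundary), benchmarked with linear probing and fine-tuning. The findings: medical pre-training markedly improves initial feature quality, especially for global classification and for segmenting salient anatomy; for localizing complex, subtle pathologies such as pneumothorax, however, all models need substantial fine-tuning, which exposes the limits of current pre-trained features on fine-grained local tasks. Expensive text-image alignment is not a prerequisite: image-only (RAD-DINO) and label-supervised (Ark+) models also rank among the top performers, and a supervised end-to-end baseline matches or exceeds the best FMs on segmentation, indicating that architectural choices (e.g., multi-scale designs) and task characteristics jointly determine the performance ceiling.
Link: https://arxiv.org/abs/2511.09742
Authors: Frank Li, Theo Dapamede, Mohammadreza Chavoshi, Young Seok Jeon, Bardia Khosravi, Abdulhameed Dere, Beatrice Brown-Mulry, Rohan Satya Isaac, Aawez Mansuri, Chiratidzo Sanyika, Janice Newsome, Saptarshi Purkayastha, Imon Banerjee, Hari Trivedi, Judy Gichoya
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 7 figures, 3 tables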
Abstract:Foundation models (FMs) promise to generalize medical imaging, but their effectiveness varies. It remains unclear how pre-training domain (medical vs. general), paradigm (e.g., text-guided), and architecture influence embedding quality, hindering the selection of optimal encoders for specific radiology tasks. To address this, we evaluate vision encoders from eight medical and general-domain FMs for chest X-ray analysis. We benchmark classification (pneumothorax, cardiomegaly) and segmentation (pneumothorax, cardiac boundary) using linear probing and fine-tuning. Our results show that domain-specific pre-training provides a significant advantage; medical FMs consistently outperformed general-domain models in linear probing, establishing superior initial feature quality. However, feature utility is highly task-dependent. Pre-trained embeddings were strong for global classification and segmenting salient anatomy (e.g., heart). In contrast, for segmenting complex, subtle pathologies (e.g., pneumothorax), all FMs performed poorly without significant fine-tuning, revealing a critical gap in localizing subtle disease. Subgroup analysis showed FMs use confounding shortcuts (e.g., chest tubes for pneumothorax) for classification, a strategy that fails for precise segmentation. We also found that expensive text-image alignment is not a prerequisite; image-only (RAD-DINO) and label-supervised (Ark+) FMs were among top performers. Notably, a supervised, end-to-end baseline remained highly competitive, matching or exceeding the best FMs on segmentation tasks. These findings show that while medical pre-training is beneficial, architectural choices (e.g., multi-scale) are critical, and pre-trained features are not universally effective, especially for complex localization tasks where supervised models remain a strong alternative.
[CV-105] Soiling detection for Advanced Driver Assistance Systems
Quick read: This paper addresses soiling detection on automotive cameras, with the goal of making Advanced Driver Assistance Systems (ADAS) more robust to external conditions such as bad weather and dust. The core idea is to treat soiling detection as a semantic segmentation task and to compare several popular segmentation methods against conventional tile-level classification, showing that segmentation performs better. The authors also find that the original Woodscape dataset suffers from data leakage and imprecise annotations, so they build a smaller but higher-quality data subset on which models reach comparable segmentation performance in much less time, improving training efficiency and the reliability of results.
Link: https://arxiv.org/abs/2511.09740
Authors: Filip Beránek, Václav Diviš, Ivan Gruber
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Published at ICMV 2024
Abstract: Soiling detection for automotive cameras is a crucial part of advanced driver assistance systems to make them more robust to external conditions like weather, dust, etc. In this paper, we regard soiling detection as a semantic segmentation problem. We provide a comprehensive comparison of popular segmentation methods and show that they outperform tile-level classification approaches. Moreover, we present an extensive analysis of the Woodscape dataset showing that the original dataset contains data leakage and imprecise annotations. To address these problems, we create a new data subset, which, despite being much smaller, provides enough information for the segmentation method to reach comparable results in a much shorter time. All our code and dataset splits are available at this https URL.
[CV-106] Social LSTM with Dynamic Occupancy Modeling for Realistic Pedestrian Trajectory Prediction
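Quick read: This paper addresses pedestrian trajectory prediction in dynamic, crowded environments, in particular the fact that existing deep learning models usually treat pedestrians as point entities and ignore the physical space they occupy, so predicted trajectories are prone to unrealistic collisions. The key is an enhanced Social LSTM trained with a new Dynamic Occupied Space loss function, which combines the average displacement error with a collision penalty sensitive to scene density and individual spatial occupancy; it lowers the collision rate without increasing displacement error, making predicted trajectories more realistic and accurate.
Link: https://arxiv.org/abs/2511.09735
Authors: Ahmed Alia, Mohcine Chraibi, Armin Seyfried
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 19 pages, 9 figures, 4 tables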
Abstract:In dynamic and crowded environments, realistic pedestrian trajectory prediction remains a challenging task due to the complex nature of human motion and the mutual influences among individuals. Deep learning models have recently achieved promising results by implicitly learning such patterns from 2D trajectory data. However, most approaches treat pedestrians as point entities, ignoring the physical space that each person occupies. To address these limitations, this paper proposes a novel deep learning model that enhances the Social LSTM with a new Dynamic Occupied Space loss function. This loss function guides Social LSTM in learning to avoid realistic collisions without increasing displacement error across different crowd densities, ranging from low to high, in both homogeneous and heterogeneous density settings. Such a function achieves this by combining the average displacement error with a new collision penalty that is sensitive to scene density and individual spatial occupancy. For efficient training and evaluation, five datasets were generated from real pedestrian trajectories recorded during the Festival of Lights in Lyon 2022. Four datasets represent homogeneous crowd conditions – low, medium, high, and very high density – while the fifth corresponds to a heterogeneous density distribution. The experimental findings indicate that the proposed model not only lowers collision rates but also enhances displacement prediction accuracy in each dataset. Specifically, the model achieves up to a 31% reduction in the collision rate and reduces the average displacement error and the final displacement error by 5% and 6%, respectively, on average across all datasets compared to the baseline. Moreover, the proposed model consistently outperforms several state-of-the-art deep learning models across most test sets.
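A loss of the kind described, displacement error plus a collision penalty, might be sketched as follows. The abstract does not give the formula, so this is an assumed form: pedestrians are modeled as discs of fixed `radius`, the penalty is the mean overlap depth between predicted discs, and `weight` is a fixed constant, whereas the paper describes a penalty that adapts to scene density.

```python
import numpy as np

def average_displacement_error(pred, gt):
    """Mean Euclidean distance between predicted and true positions; arrays are (T, N, 2)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def collision_penalty(pred, radius):
    """Mean overlap depth between every pair of pedestrians at every time step."""
    i, j = np.triu_indices(pred.shape[1], k=1)                  # all unordered pairs
    dist = np.linalg.norm(pred[:, i] - pred[:, j], axis=-1)     # (T, pairs)
    return float(np.maximum(0.0, 2.0 * radius - dist).mean())

def occupancy_aware_loss(pred, gt, radius=0.2, weight=2.0):
    """Displacement error plus a weighted collision term (hypothetical weighting)."""
    return average_displacement_error(pred, gt) + weight * collision_penalty(pred, radius)

gt = np.array([[[0.0, 0.0], [1.0, 0.0]]])            # one time step, two pedestrians
colliding = np.array([[[0.0, 0.0], [0.3, 0.0]]])     # second pedestrian predicted too close

loss_perfect = occupancy_aware_loss(gt, gt)
loss_colliding = occupancy_aware_loss(colliding, gt)
```

Here the colliding prediction pays twice: 0.35 for displacement and 0.2 for a 0.1 m overlap, which is the behavior a point-based loss cannot express.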
[CV-107] PALMS+: Modular Image-Based Floor Plan Localization Leveraging Depth Foundation Model WACV
Quick read: This paper addresses indoor localization in GPS-denied environments, in particular the limited accuracy of earlier vision-based methods caused by the short range of smartphone LiDAR and the ambiguity of indoor layouts. The key is PALMS+, a modular system that first uses a foundation monocular depth estimation model (Depth Pro) to reconstruct scale-aligned 3D point clouds from posed RGB images, then matches the geometric layout against the floor plan by convolution to produce a posterior over location and orientation. Without any training it improves stationary localization accuracy over PALMS and F3Loc, and when combined with a particle filter for sequential localization it achieves lower localization errors than other methods, demonstrating its potential for infrastructure-free applications.
Link: https://arxiv.org/abs/2511.09724
Authors: Yunqian Cheng, Benjamin Princen, Roberto Manduchi
Affiliations: University of California, Santa Cruz
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Accepted to IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2026, Application Track. Main paper: 8 pages, 5 figures. Supplementary material included
Abstract: Indoor localization in GPS-denied environments is crucial for applications like emergency response and assistive navigation. Vision-based methods such as PALMS enable infrastructure-free localization using only a floor plan and a stationary scan, but are limited by the short range of smartphone LiDAR and ambiguity in indoor layouts. We propose PALMS+, a modular, image-based system that addresses these challenges by reconstructing scale-aligned 3D point clouds from posed RGB images using a foundation monocular depth estimation model (Depth Pro), followed by geometric layout matching via convolution with the floor plan. PALMS+ outputs a posterior over the location and orientation, usable for direct or sequential localization. Evaluated on the Structured3D and a custom campus dataset consisting of 80 observations across four large campus buildings, PALMS+ outperforms PALMS and F3Loc in stationary localization accuracy without requiring any training. Furthermore, when integrated with a particle filter for sequential localization on 33 real-world trajectories, PALMS+ achieved lower localization errors compared to other methods, demonstrating robustness for camera-free tracking and its potential for infrastructure-free applications. Code and data are available at this https URL
[CV-108] Density Estimation and Crowd Counting
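Quick read: This paper addresses the accuracy and efficiency of crowd density estimation in video, in particular the limitations of image-based methods when handling temporal information. The key elements are: first, a denoising probabilistic model based on diffusion processes that generates high-quality crowd density maps; second, narrow Gaussian kernels combined with multiple density map outputs, a regression branch for precise feature extraction, and a consolidation mechanism that fuses the maps by similarity score to make the result more robust; finally, an event-driven sampling strategy that uses the Farneback optical flow algorithm to select key frames, reducing computation and storage while preserving important crowd dynamics. The approach performs well in both dense and sparse scenes and offers a scalable, efficient framework for real-time crowd monitoring.
Link: https://arxiv.org/abs/2511.09723
Authors: Balachandra Devarangadi Sunil, Rakshith Venkatesh, Shantanu Todmal
Affiliations: University of Massachusetts
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: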
Abstract:This study enhances a crowd density estimation algorithm originally designed for image-based analysis by adapting it for video-based scenarios. The proposed method integrates a denoising probabilistic model that utilizes diffusion processes to generate high-quality crowd density maps. To improve accuracy, narrow Gaussian kernels are employed, and multiple density map outputs are generated. A regression branch is incorporated into the model for precise feature extraction, while a consolidation mechanism combines these maps based on similarity scores to produce a robust final result. An event-driven sampling technique, utilizing the Farneback optical flow algorithm, is introduced to selectively capture frames showing significant crowd movements, reducing computational load and storage by focusing on critical crowd dynamics. Through qualitative and quantitative evaluations, including overlay plots and Mean Absolute Error (MAE), the model demonstrates its ability to effectively capture crowd dynamics in both dense and sparse settings. The efficiency of the sampling method is further assessed, showcasing its capability to decrease frame counts while maintaining essential crowd events. By addressing the temporal challenges unique to video analysis, this work offers a scalable and efficient framework for real-time crowd monitoring in applications such as public safety, disaster response, and event management.
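Two ingredients of the abstract are easy to illustrate without the diffusion model itself: a density map built from Gaussian kernels (whose sum equals the head count) and threshold-based key-frame selection. In the sketch below the head coordinates, the kernel widths, the threshold, and the per-frame motion scores are all made-up values; in practice the scores could be mean flow magnitudes from an optical flow routine such as OpenCV's `cv2.calcOpticalFlowFarneback`, which is not called here.

```python
import numpy as np

def density_map(points, shape, sigma=1.0):
    """Place one unit-mass Gaussian per annotated head; the map sums to the count."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    dmap = np.zeros(shape, dtype=float)
    for y, x in points:
        kernel = np.exp(-((ys - y) ** 2 + (xs - x) ** 2) / (2.0 * sigma ** 2))
        dmap += kernel / kernel.sum()        # normalize so each head contributes exactly 1
    return dmap

def select_keyframes(motion_scores, threshold):
    """Keep only frames whose motion score exceeds the threshold (event-driven sampling)."""
    return [t for t, score in enumerate(motion_scores) if score > threshold]

heads = [(5, 5), (5, 7), (20, 12)]                   # hypothetical head annotations (row, col)
narrow = density_map(heads, (32, 32), sigma=1.0)     # narrow kernels: sharp, separable peaks
wide = density_map(heads, (32, 32), sigma=4.0)       # wide kernels: nearby heads blur together

keyframes = select_keyframes([0.1, 2.0, 0.3, 1.5], threshold=1.0)
```

Both maps integrate to the same count, but the narrow kernel keeps nearby heads distinguishable, which is one plausible reading of why the authors prefer narrow kernels.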
[CV-109] SliderEdit: Continuous Image Editing with Fine-Grained Instruction Control
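Quick read: This paper addresses the fact that current instruction-based image editing models apply every instruction in a multi-instruction prompt at a fixed strength, which prevents users from finely and continuously controlling the intensity of individual edits. The key is the SliderEdit framework, which disentangles a multi-part edit instruction and exposes each instruction as a globally trained slider whose strength can be adjusted smoothly. Its core innovation is learning a single set of low-rank adaptation matrices that generalizes across diverse edits, attributes, and compositional instructions, enabling continuous interpolation along individual edit dimensions while preserving spatial locality and global semantic consistency, and substantially improving edit controllability, visual consistency, and user steerability.
Link: https://arxiv.org/abs/2511.09715
Authors: Arman Zarei, Samyadeep Basu, Mobina Pournemat, Sayan Nag, Ryan Rossi, Soheil Feizi
Affiliations: University of Maryland; Adobe Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: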
Abstract:Instruction-based image editing models have recently achieved impressive performance, enabling complex edits to an input image from a multi-instruction prompt. However, these models apply each instruction in the prompt with a fixed strength, limiting the user’s ability to precisely and continuously control the intensity of individual edits. We introduce SliderEdit, a framework for continuous image editing with fine-grained, interpretable instruction control. Given a multi-part edit instruction, SliderEdit disentangles the individual instructions and exposes each as a globally trained slider, allowing smooth adjustment of its strength. Unlike prior works that introduced slider-based attribute controls in text-to-image generation, typically requiring separate training or fine-tuning for each attribute or concept, our method learns a single set of low-rank adaptation matrices that generalize across diverse edits, attributes, and compositional instructions. This enables continuous interpolation along individual edit dimensions while preserving both spatial locality and global semantic consistency. We apply SliderEdit to state-of-the-art image editing models, including FLUX-Kontext and Qwen-Image-Edit, and observe substantial improvements in edit controllability, visual consistency, and user steerability. To the best of our knowledge, we are the first to explore and propose a framework for continuous, fine-grained instruction control in instruction-based image editing models. Our results pave the way for interactive, instruction-driven image manipulation with continuous and compositional control.
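One simple way a slider can act on a low-rank adaptation is to scale the low-rank update before adding it to the frozen weight. The sketch below shows only that generic mechanism on a single linear layer; how SliderEdit routes per-instruction strengths through the editing model is not described in the abstract, and the shapes, names, and scalar `strength` are assumptions.

```python
import numpy as np

def lora_linear(x, weight, lora_a, lora_b, strength):
    """Linear layer whose low-rank update is scaled by a slider value."""
    return x @ (weight + strength * (lora_b @ lora_a)).T

rng = np.random.default_rng(0)
weight = rng.normal(size=(6, 4))     # frozen base weight, shape (out, in)
lora_a = rng.normal(size=(2, 4))     # rank-2 down-projection
lora_b = rng.normal(size=(6, 2))     # rank-2 up-projection
x = rng.normal(size=(3, 4))          # three input tokens

base = lora_linear(x, weight, lora_a, lora_b, strength=0.0)   # edit switched off
full = lora_linear(x, weight, lora_a, lora_b, strength=1.0)   # edit at full strength
half = lora_linear(x, weight, lora_a, lora_b, strength=0.5)   # halfway along the slider
```

For one linear layer the output moves linearly with `strength`; in a deep nonlinear model the effect on the final image is smooth but not linear, which is why the slider has to be trained rather than simply bolted on.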
[CV-110] Classifying Phonotrauma Severity from Vocal Fold Images with Soft Ordinal Regression ML4H2025
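[Quick Read]: This paper addresses automatic classification of the severity of phonotrauma (vocal fold tissue damage). Conventional assessment relies on the subjective judgment of clinical experts, which is costly and varies widely in reliability. The key to the solution is a new method based on soft ordinal regression: the ordinal regression loss is modified to operate on soft labels that reflect the distribution of annotator ratings, which captures label uncertainty. The method grades phonotrauma severity from vocal fold images automatically, achieves predictive performance approaching that of clinical experts, and produces well-calibrated uncertainty estimates, supporting large-scale clinical studies and more precise care.
Link: https://arxiv.org/abs/2511.09702
Authors: Katie Matton, Purvaja Balaji, Hamzeh Ghasemzadeh, Jameson C. Cooper, Daryush D. Mehta, Jarrad H. Van Stan, Robert E. Hillman, Rosalind Picard, John Guttag, S. Mazdak Abulnaga
Affiliations: Massachusetts Institute of Technology; University of Central Florida; Massachusetts General Hospital; Harvard Medical School; Massachusetts General Hospital Institute of Health Professions
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 16 pages, 9 figures, 5 tables; ML4H 2025; Proceedings of Machine Learning Research 297, 2025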
Abstract:Phonotrauma refers to vocal fold tissue damage resulting from exposure to forces during voicing. It occurs on a continuum from mild to severe, and treatment options can vary based on severity. Assessment of severity involves a clinician’s expert judgment, which is costly and can vary widely in reliability. In this work, we present the first method for automatically classifying phonotrauma severity from vocal fold images. To account for the ordinal nature of the labels, we adopt a widely used ordinal regression framework. To account for label uncertainty, we propose a novel modification to ordinal regression loss functions that enables them to operate on soft labels reflecting annotator rating distributions. Our proposed soft ordinal regression method achieves predictive performance approaching that of clinical experts, while producing well-calibrated uncertainty estimates. By providing an automated tool for phonotrauma severity assessment, our work can enable large-scale studies of phonotrauma, ultimately leading to improved clinical understanding and patient care.
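The abstract does not spell out the modified loss, so the following is only one plausible sketch. It assumes a cumulative binary decomposition in which the model emits K-1 logits for P(y > k), and it turns an annotator rating distribution into soft cumulative targets. The function names and the example distribution are hypothetical.

```python
import math


def soft_ordinal_targets(rating_dist):
    """Convert a distribution over K ordered classes into K-1 targets P(y > k)."""
    targets = []
    tail = 1.0
    for p in rating_dist[:-1]:
        tail -= p
        targets.append(max(tail, 0.0))
    return targets


def soft_ordinal_loss(logits, rating_dist):
    """Mean binary cross-entropy between sigmoid(logits) and the soft targets."""
    targets = soft_ordinal_targets(rating_dist)
    total = 0.0
    for z, t in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-z))
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(targets)


# Hypothetical example: 4 severity levels, annotators split between levels 1 and 2.
dist = [0.0, 0.5, 0.5, 0.0]
print(soft_ordinal_targets(dist))
print(soft_ordinal_loss([3.0, 0.0, -3.0], dist))
```

With hard labels the targets would all be 0 or 1; here the middle target is 0.5, so a model that is undecided at that threshold is not penalised for reflecting genuine annotator disagreement.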
[CV-111] PriVi: Towards A General-Purpose Video Model For Primate Behavior In The Wild
[Quick Read]: This paper addresses the limited generalization of current computer vision methods for primate behavior analysis, in particular the bottleneck caused by reliance on human-centric pretrained models and single datasets. The key to the solution is a data-centric approach: the authors build PriVi, a large-scale primate-centric video pretraining dataset, and pretrain V-JEPA on it to learn more general primate-specific representations. Experiments show that the approach outperforms prior work on four benchmark datasets and offers better data efficiency and generalization when labels are scarce.
Link: https://arxiv.org/abs/2511.09675
Authors: Felix B. Mueller, Jan F. Meier, Timo Lueddecke, Richard Vogg, Roger L. Freixanet, Valentin Hassler, Tiffany Bosshard, Elif Karakoc, William J. O’Hearn, Sofia M. Pereira, Sandro Sehner, Kaja Wierucka, Judith Burkart, Claudia Fichtel, Julia Fischer, Alexander Gail, Catherine Hobaiter, Julia Ostner, Liran Samuni, Oliver Schülke, Neda Shahidi, Erin G. Wessling, Alexander S. Ecker
Affiliations: University of Göttingen, Germany; Cognitive Ethology; Behavioral Ecology & Sociobiology; Social Evolution in Primates; Cognitive Neuroscience SMG; Cooperative Evolution in Primates; University of Zurich, Switzerland; University of St. Andrews, United Kingdom; Max Planck Institute for Dynamics and Self-Organization, Göttingen, Germany
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Non-human primates are our closest living relatives, and analyzing their behavior is central to research in cognition, evolution, and conservation. Computer vision could greatly aid this research, but existing methods often rely on human-centric pretrained models and focus on single datasets, which limits generalization. We address this limitation by shifting from a model-centric to a data-centric approach and introduce PriVi, a large-scale primate-centric video pretraining dataset. PriVi contains 424 hours of curated video, combining 174 hours from behavioral research across 11 settings with 250 hours of diverse web-sourced footage, assembled through a scalable data curation pipeline. We pretrain V-JEPA on PriVi to learn primate-specific representations and evaluate it using a lightweight frozen classifier. Across four benchmark datasets, ChimpACT, BaboonLand, PanAf500, and ChimpBehave, our approach consistently outperforms prior work, including fully finetuned baselines, and scales favorably with fewer labels. These results demonstrate that primate-centric pretraining substantially improves data efficiency and generalization, making it a promising approach for low-label applications. Code, models, and the majority of the dataset will be made available.
[CV-112] MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation
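[Quick Read]: This paper addresses the performance degradation that thinking-aware generation can suffer on complex tasks when the generated reasoning and the final output are misaligned, in particular the error propagation seen in existing sequential, autoregressive approaches. The key to the solution is MMaDA-Parallel, a parallel multimodal diffusion framework that enables continuous, bidirectional interaction between text and images throughout the entire denoising trajectory. It is combined with Parallel Reinforcement Learning (ParaRL), a strategy that applies semantic rewards along the trajectory to enforce cross-modal consistency. Together these significantly improve output alignment and semantic consistency.
Link: https://arxiv.org/abs/2511.09611
Authors: Ye Tian, Ling Yang, Jiongfan Yang, Anran Wang, Yu Tian, Jiani Zheng, Haochen Wang, Zhiyang Teng, Zhuochen Wang, Yinjie Wang, Yunhai Tong, Mengdi Wang, Xiangtai Li
Affiliations: Peking University; ByteDance; Princeton University; CASIA; The University of Chicago
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL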
Abstract:While thinking-aware generation aims to improve performance on complex tasks, we identify a critical failure mode where existing sequential, autoregressive approaches can paradoxically degrade performance due to error propagation. To systematically analyze this issue, we propose ParaBench, a new benchmark designed to evaluate both text and image output modalities. Our analysis using ParaBench reveals that this performance degradation is strongly correlated with poor alignment between the generated reasoning and the final image. To resolve this, we propose a parallel multimodal diffusion framework, MMaDA-Parallel, that enables continuous, bidirectional interaction between text and images throughout the entire denoising trajectory. MMaDA-Parallel is trained with supervised finetuning and then further optimized by Parallel Reinforcement Learning (ParaRL), a novel strategy that applies semantic rewards along the trajectory to enforce cross-modal consistency. Experiments validate that our model significantly improves cross-modal alignment and semantic consistency, achieving a 6.9% improvement in Output Alignment on ParaBench compared to the state-of-the-art model, Bagel, establishing a more robust paradigm for thinking-aware image synthesis. Our code is open-sourced at this https URL
[CV-113] FedeCouple: Fine-Grained Balancing of Global-Generalization and Local-Adaptability in Federated Learning
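[Quick Read]: This paper addresses insufficient coordination between the feature extractor and the classifier in personalized federated learning. Existing methods tend to focus on feature-space consistency and classification personalization during local training while neglecting the local adaptability of the extractor and the global generalization of the classifier. The result is weak coupling between components, inefficient cooperation, and ultimately degraded model performance. The key to the solution is FedeCouple, which balances global generalization and local adaptability at a fine-grained level: it jointly learns global and local feature representations and uses dynamic knowledge distillation to improve the generalization of personalized classifiers. It also introduces anchors to refine the feature space; because the anchors are strictly local and never transmitted, they preserve privacy and reduce communication overhead. A theoretical analysis proves convergence to a stationary point for nonconvex objectives, and experiments show that the method outperforms nine baselines in effectiveness, stability, scalability, and security.
Link: https://arxiv.org/abs/2511.09599
Authors: Ming Yang, Dongrun Li, Xin Wang, Feng Li, Lisheng Fan, Chunxiao Wang, Xiaoming Wu, Peng Cheng
Affiliations: Qilu University of Technology (Shandong Academy of Sciences); Shandong University; Guangzhou University; Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: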
Abstract:In privacy-preserving mobile network transmission scenarios with heterogeneous client data, personalized federated learning methods that decouple feature extractors and classifiers have demonstrated notable advantages in enhancing learning capability. However, many existing approaches primarily focus on feature space consistency and classification personalization during local training, often neglecting the local adaptability of the extractor and the global generalization of the classifier. This oversight results in insufficient coordination and weak coupling between the components, ultimately degrading the overall model performance. To address this challenge, we propose FedeCouple, a federated learning method that balances global generalization and local adaptability at a fine-grained level. Our approach jointly learns global and local feature representations while employing dynamic knowledge distillation to enhance the generalization of personalized classifiers. We further introduce anchors to refine the feature space; their strict locality and non-transmission inherently preserve privacy and reduce communication overhead. Furthermore, we provide a theoretical analysis proving that FedeCouple converges for nonconvex objectives, with iterates approaching a stationary point as the number of communication rounds increases. Extensive experiments conducted on five image-classification datasets demonstrate that FedeCouple consistently outperforms nine baseline methods in effectiveness, stability, scalability, and security. Notably, in experiments evaluating effectiveness, FedeCouple surpasses the best baseline by a significant margin of 4.3%.
[CV-114] Efficient Automated Diagnosis of Retinopathy of Prematurity by Customize CNN Models
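[Quick Read]: This paper addresses insufficient accuracy and low computational efficiency in the diagnosis of Retinopathy of Prematurity (ROP). The key to the solution is designing and optimizing a customized convolutional neural network (CNN) that achieves higher accuracy and F1-scores than pretrained models. A voting system further improves performance, and the lightweight model design reduces the computational burden. The result is efficient and accurate ROP diagnostic support that can feasibly be deployed on dedicated software and hardware.
Link: https://arxiv.org/abs/2511.10023
Authors: Farzan Saeedi, Sanaz Keshvari, Nasser Shoeibi
Affiliations: Sultan Qaboos University; Mashhad University of Medical Sciences
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: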
Abstract:This paper encompasses an in-depth examination of Retinopathy of Prematurity (ROP) diagnosis, employing advanced deep learning methodologies. Our focus centers on refining and evaluating CNN-based approaches for precise and efficient ROP detection. We navigate the complexities of dataset curation, preprocessing strategies, and model architecture, aligning with research objectives encompassing model effectiveness, computational cost analysis, and time complexity assessment. Results underscore the supremacy of tailored CNN models over pre-trained counterparts, evident in heightened accuracy and F1-scores. Implementation of a voting system further enhances performance. Additionally, our study reveals the potential of the proposed customized CNN model to alleviate computational burdens associated with deep neural networks. Furthermore, we showcase the feasibility of deploying these models within dedicated software and hardware configurations, highlighting their utility as valuable diagnostic aids in clinical settings. In summary, our discourse significantly contributes to ROP diagnosis, unveiling the efficacy of deep learning models in enhancing diagnostic precision and efficiency.
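The abstract mentions a voting system without describing it. A minimal sketch, assuming plain hard majority voting over the labels predicted by several models for one image, could look like this; the function name and labels are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common label among per-model predictions for one image."""
    counts = Counter(predictions)
    label, _ = counts.most_common(1)[0]
    return label


# Three hypothetical models disagree on one fundus image.
print(majority_vote(["ROP", "normal", "ROP"]))
```

On a tie, `Counter.most_common` returns the label that was seen first, so an odd number of voters is the simplest way to avoid ambiguous outcomes in a two-class setting.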
[CV-115] VEDA: 3D Molecular Generation via Variance-Exploding Diffusion with Annealing
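[Quick Read]: This paper addresses the fundamental trade-off between sampling efficiency and conformational accuracy in diffusion models for 3D molecular generation. Flow-based models sample quickly but struggle to capture the multimodal distribution of molecular conformations, which produces inaccurate geometries. Denoising diffusion models are more accurate but sample slowly because diffusion dynamics and SE(3)-equivariant architectures are poorly integrated. The key to the solution is the VEDA framework, whose core contributions are: (1) a variance-exploding (VE) schedule whose noise injection is functionally analogous to simulated annealing, improving 3D accuracy and reducing relaxation energy; (2) a novel preconditioning scheme that reconciles the coordinate-predicting nature of SE(3)-equivariant networks with a residual-based diffusion objective; and (3) an arcsin-based scheduler that concentrates sampling in critical intervals of the log signal-to-noise ratio. On QM9 and GEOM-DRUGS, the framework matches the sampling efficiency of flow-based models, reaches state-of-the-art valency stability and validity with only 100 sampling steps, and produces structures far more stable than those of the baseline SemlaFlow (median relaxation energy change of 1.72 kcal/mol versus 32.3 kcal/mol).
Link: https://arxiv.org/abs/2511.09568
Authors: Peining Zhang, Jinbo Bi, Minghu Song
Affiliations: Unknown
Categories: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: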
Abstract:Diffusion models show promise for 3D molecular generation, but face a fundamental trade-off between sampling efficiency and conformational accuracy. While flow-based models are fast, they often produce geometrically inaccurate structures, as they have difficulty capturing the multimodal distributions of molecular conformations. In contrast, denoising diffusion models are more accurate but suffer from slow sampling, a limitation attributed to sub-optimal integration between diffusion dynamics and SE(3)-equivariant architectures. To address this, we propose VEDA, a unified SE(3)-equivariant framework that combines variance-exploding diffusion with annealing to efficiently generate conformationally accurate 3D molecular structures. Specifically, our key technical contributions include: (1) a VE schedule that enables noise injection functionally analogous to simulated annealing, improving 3D accuracy and reducing relaxation energy; (2) a novel preconditioning scheme that reconciles the coordinate-predicting nature of SE(3)-equivariant networks with a residual-based diffusion objective, and (3) a new arcsin-based scheduler that concentrates sampling in critical intervals of the logarithmic signal-to-noise ratio. On the QM9 and GEOM-DRUGS datasets, VEDA matches the sampling efficiency of flow-based models, achieving state-of-the-art valency stability and validity with only 100 sampling steps. More importantly, VEDA’s generated structures are remarkably stable, as measured by their relaxation energy during GFN2-xTB optimization. The median energy change is only 1.72 kcal/mol, significantly lower than the 32.3 kcal/mol from its architectural baseline, SemlaFlow. Our framework demonstrates that principled integration of VE diffusion with SE(3)-equivariant architectures can achieve both high chemical accuracy and computational efficiency.
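For readers unfamiliar with variance-exploding diffusion, the sketch below shows the textbook VE forward perturbation with a geometric noise schedule, sigma(t) = sigma_min * (sigma_max / sigma_min)^t. It is not VEDA's arcsin-based scheduler or preconditioning, which the abstract does not specify, and it ignores SE(3) equivariance. The parameter values and the two-atom geometry are hypothetical.

```python
import random


def ve_sigma(t, sigma_min=0.01, sigma_max=10.0):
    """Geometric variance-exploding noise level for t in [0, 1]."""
    return sigma_min * (sigma_max / sigma_min) ** t


def ve_perturb(coords, t, rng):
    """Add Gaussian noise of scale ve_sigma(t) to a list of 3D coordinates."""
    s = ve_sigma(t)
    return [[c + s * rng.gauss(0.0, 1.0) for c in atom] for atom in coords]


rng = random.Random(0)
molecule = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]  # hypothetical two-atom geometry
for t in (0.0, 0.5, 1.0):
    print(t, ve_sigma(t), ve_perturb(molecule, t, rng))
```

The signal is never scaled down in a VE process; only the noise grows, which is why a large `sigma_max` behaves like the high-temperature end of an annealing schedule.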
[CV-116] Graph-Theoretic Consistency for Robust and Topology-Aware Semi-Supervised Histopathology Segmentation AAAI2026
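[Quick Read]: This paper addresses two problems of semi-supervised semantic segmentation (SSSS) in computational pathology, where dense annotations are costly and scarce: the propagation of noisy pseudo-labels and topologically implausible segmentation results. Existing methods mostly rely on pixel-level consistency, which spreads noisy pseudo-labels across the prediction map and yields fragmented or topologically invalid masks. The key to the solution is Topology Graph Consistency (TGC), which introduces graph-theoretic constraints by aligning Laplacian spectra, component counts, and adjacency statistics between prediction graphs and reference graphs. This enforces consistent global topology, improves segmentation accuracy, and significantly narrows the gap between limited supervision and full supervision.
Link: https://arxiv.org/abs/2509.22689
Authors: Ha-Hieu Pham, Minh Le, Han Huynh, Nguyen Quoc Khanh Le, Huy-Hieu Pham
Affiliations: 1. University of Science and Technology; 2. Hanoi University of Science and Technology; 3. Vietnam National University, Hanoi; 4. Ho Chi Minh City University of Technology; 5. University of Science and Technology
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to the AAAI 2026 Student Abstract and Poster Program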
Abstract:Semi-supervised semantic segmentation (SSSS) is vital in computational pathology, where dense annotations are costly and limited. Existing methods often rely on pixel-level consistency, which propagates noisy pseudo-labels and produces fragmented or topologically invalid masks. We propose Topology Graph Consistency (TGC), a framework that integrates graph-theoretic constraints by aligning Laplacian spectra, component counts, and adjacency statistics between prediction graphs and references. This enforces global topology and improves segmentation accuracy. Experiments on GlaS and CRAG demonstrate that TGC achieves state-of-the-art performance under 5-10% supervision and significantly narrows the gap to full supervision.
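To make the graph-theoretic quantities concrete, the sketch below compares a prediction graph with a reference graph using two of the statistics named in the abstract: component counts and a simple adjacency statistic (mean degree). The combined `topology_gap` score is a hypothetical stand-in, not the TGC loss. Laplacian spectra are omitted because they need an eigen-solver, although the component count already equals the multiplicity of the Laplacian's zero eigenvalue.

```python
def component_count(n, edges):
    """Count connected components of an undirected graph with nodes 0..n-1."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)
    return len({find(x) for x in range(n)})


def mean_degree(n, edges):
    """Average node degree, a simple adjacency statistic."""
    return 2.0 * len(edges) / n


def topology_gap(n_pred, e_pred, n_ref, e_ref):
    """Hypothetical discrepancy: component difference plus mean-degree difference."""
    return (abs(component_count(n_pred, e_pred) - component_count(n_ref, e_ref))
            + abs(mean_degree(n_pred, e_pred) - mean_degree(n_ref, e_ref)))


# Reference: one connected chain of 4 regions; prediction: the chain is broken in two.
ref_edges = [(0, 1), (1, 2), (2, 3)]
pred_edges = [(0, 1), (2, 3)]
print(component_count(4, ref_edges), component_count(4, pred_edges))
print(topology_gap(4, pred_edges, 4, ref_edges))
```

A fragmented mask shows up here as an extra component, which is exactly the kind of global error that a purely pixel-level consistency term cannot see.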
Artificial Intelligence
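[AI-0] Towards an Agentic Workflow for Internet Measurement Research
[Quick Read]: This paper addresses the accessibility crisis in Internet measurement research: complex analyses typically require integrating multiple specialized tools and expert-level domain knowledge, which makes it hard to assemble diagnostic workflows quickly when network disruptions occur. The key to the solution is ArachNet, the first system to demonstrate that large language model (LLM) agents can independently generate measurement workflows that mimic expert reasoning. The core insight is that measurement expertise follows predictable compositional patterns. Four specialized agents automate the full process from problem decomposition to solution implementation, significantly lowering the barrier to workflow construction while maintaining research-grade technical rigor.
Link: https://arxiv.org/abs/2511.10611
Authors: Alagappan Ramanathan, Eunju Kang, Dongsu Han, Sangeetha Abdu Jyothi
Affiliations: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments: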
Abstract:Internet measurement research faces an accessibility crisis: complex analyses require custom integration of multiple specialized tools that demands specialized domain expertise. When network disruptions occur, operators need rapid diagnostic workflows spanning infrastructure mapping, routing analysis, and dependency modeling. However, developing these workflows requires specialized knowledge and significant manual effort. We present ArachNet, the first system demonstrating that LLM agents can independently generate measurement workflows that mimic expert reasoning. Our core insight is that measurement expertise follows predictable compositional patterns that can be systematically automated. ArachNet operates through four specialized agents that mirror expert workflow, from problem decomposition to solution implementation. We validate ArachNet with progressively challenging Internet resilience scenarios. The system independently generates workflows that match expert-level reasoning and produce analytical outputs similar to specialist solutions. Generated workflows handle complex multi-framework integration that traditionally requires days of manual coordination. ArachNet lowers barriers to measurement workflow composition by automating the systematic reasoning process that experts use, enabling broader access to sophisticated measurement capabilities while maintaining the technical rigor required for research-quality analysis.
[AI-1] Regular Games – an Automata-Based General Game Playing Language AAAI2026
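[Quick Read]: This paper addresses the trade-off between computational efficiency and ease of game design in General Game Playing (GGP) systems. Existing systems struggle to combine flexible rule description with efficient reasoning, which results in slow forward models or high development complexity. The key to the solution is a new framework called Regular Games (RG), whose core is a low-level language that defines rules by a finite automaton. The language has only a few mechanisms, is universal for the class of all finite turn-based games with imperfect information, and is easy to process automatically (by agents, analysis, optimization, and so on). Higher-level game design languages (for humans or procedural content generation) are translated into this low-level language, giving efficient rule modeling and fast forward-model generation that outperforms other GGP systems (Regular Boardgames, Ludii) in benchmarks. RG also ships a complete toolchain (an editor with LSP support, automaton visualization, benchmarking tools, and a debugger), which improves its practicality and extensibility.
Link: https://arxiv.org/abs/2511.10593
Authors: Radosław Miernik, Marek Szykuła, Jakub Kowalski, Jakub Cieśluk, Łukasz Galas, Wojciech Pawlik
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Full version of AAAI 2026 paper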
Abstract:We propose a new General Game Playing (GGP) system called Regular Games (RG). The main goal of RG is to be both computationally efficient and convenient for game design. The system consists of several languages. The core component is a low-level language that defines the rules by a finite automaton. It is minimal with only a few mechanisms, which makes it easy for automatic processing (by agents, analysis, optimization, etc.). The language is universal for the class of all finite turn-based games with imperfect information. Higher-level languages are introduced for game design (by humans or Procedural Content Generation), which are eventually translated to a low-level language. RG generates faster forward models than the current state of the art, beating other GGP systems (Regular Boardgames, Ludii) in terms of efficiency. Additionally, RG’s ecosystem includes an editor with LSP, automaton visualization, benchmarking tools, and a debugger of game description transformations.
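[AI-2] Textual understanding boost in the WikiRace
[Quick Read]: This paper addresses goal-directed navigation in complex information networks, using the WikiRace game as a benchmark to evaluate navigation strategies on the Wikipedia hyperlink network. The core question is how to search for paths between articles efficiently and reliably without hand-crafted rules. The key to the solution is greedy path selection guided by semantic embeddings from large language models (LLMs): decisions are based on the semantic similarity of article titles, supplemented by a simple loop-avoidance mechanism. On a large Wikipedia subgraph, this approach achieved a 100% success rate and was an order of magnitude more efficient than structural (graph-feature-based) or hybrid strategies, which shows the clear advantage of semantics-driven, zero-shot navigation in complex information spaces.
Link: https://arxiv.org/abs/2511.10585
Authors: Raman Ebrahimi, Sean Fuhrman, Kendrick Nguyen, Harini Gurusankar, Massimo Franceschetti
Affiliations: Unknown
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Comments: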
Abstract:The WikiRace game, where players navigate between Wikipedia articles using only hyperlinks, serves as a compelling benchmark for goal-directed search in complex information networks. This paper presents a systematic evaluation of navigation strategies for this task, comparing agents guided by graph-theoretic structure (betweenness centrality), semantic meaning (language model embeddings), and hybrid approaches. Through rigorous benchmarking on a large Wikipedia subgraph, we demonstrate that a purely greedy agent guided by the semantic similarity of article titles is overwhelmingly effective. This strategy, when combined with a simple loop-avoidance mechanism, achieved a perfect success rate and navigated the network with an efficiency an order of magnitude better than structural or hybrid methods. Our findings highlight the critical limitations of purely structural heuristics for goal-directed search and underscore the transformative potential of large language models to act as powerful, zero-shot semantic navigators in complex information spaces.
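The greedy strategy described above can be sketched on a toy graph. The code assumes that loop avoidance means never revisiting a page, which the abstract does not confirm, and it uses hand-made 2D vectors as hypothetical stand-ins for language-model title embeddings. The graph, the vectors, and the function names are all illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def greedy_race(links, embed, start, target, max_steps=50):
    """Follow the unvisited link whose title embedding is closest to the target's."""
    path, visited = [start], {start}
    current = start
    while current != target and len(path) <= max_steps:
        candidates = [p for p in links.get(current, []) if p not in visited]
        if not candidates:
            return None  # dead end under this simple loop-avoidance rule
        current = max(candidates, key=lambda p: cosine(embed[p], embed[target]))
        visited.add(current)
        path.append(current)
    return path if current == target else None


# Toy hyperlink graph and hypothetical 2D "embeddings" of article titles.
links = {"Cat": ["Animal", "Internet"], "Animal": ["Cat", "Dog"],
         "Internet": ["Cat", "Computer"], "Dog": [], "Computer": []}
embed = {"Cat": [1.0, 0.2], "Animal": [0.9, 0.3], "Dog": [1.0, 0.1],
         "Internet": [0.1, 1.0], "Computer": [0.0, 1.0]}
print(greedy_race(links, embed, "Cat", "Dog"))
```

The agent needs no global view of the graph: each step only scores the outgoing links of the current page, which is what makes the approach cheap compared with centrality-based heuristics.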
[AI-3] Bi-Level Contextual Bandits for Individualized Resource Allocation under Delayed Feedback AAAI AAAI-26
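[Quick Read]: This paper addresses the equitable allocation of limited resources in high-stakes domains such as education, employment, and healthcare. The core challenge is balancing short-term utility with long-term impact while accounting for delayed feedback, individual heterogeneity, and ethical constraints. The key to the solution is a bi-level contextual bandit framework. The meta level optimizes subgroup-level budget allocations to satisfy fairness and operational constraints. The base level uses a neural network trained on observational data to identify the most responsive individuals within each subgroup, models delayed treatment effects with resource-specific delay kernels, and respects cooldown windows. By explicitly modeling temporal dynamics and feedback delays, the method continually refines its policy as new data arrive, enabling more flexible and adaptive decision-making.
Link: https://arxiv.org/abs/2511.10572
Authors: Mohammadsina Almasi, Hadis Anahideh
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at AAAI-26 (AISI Track). Final version to appear in the Proceedings of the AAAI Conference on Artificial Intelligence (AAAI-26), 2026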
Abstract:Equitably allocating limited resources in high-stakes domains, such as education, employment, and healthcare, requires balancing short-term utility with long-term impact, while accounting for delayed outcomes, hidden heterogeneity, and ethical constraints. However, most learning-based allocation frameworks either assume immediate feedback or ignore the complex interplay between individual characteristics and intervention dynamics. We propose a novel bi-level contextual bandit framework for individualized resource allocation under delayed feedback, designed to operate in real-world settings with dynamic populations, capacity constraints, and time-sensitive impact. At the meta level, the model optimizes subgroup-level budget allocations to satisfy fairness and operational constraints. At the base level, it identifies the most responsive individuals within each group using a neural network trained on observational data, while respecting cooldown windows and delayed treatment effects modeled via resource-specific delay kernels. By explicitly modeling temporal dynamics and feedback delays, the algorithm continually refines its policy as new data arrive, enabling more responsive and adaptive decision-making. We validate our approach on two real-world datasets from education and workforce development, showing that it achieves higher cumulative outcomes, better adapts to delay structures, and ensures equitable distribution across subgroups. Our results highlight the potential of delay-aware, data-driven decision-making systems to improve institutional policy and social welfare.
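The abstract does not give the form of the delay kernels. One simple reading, assumed here, is a discrete convolution: each unit of resource allocated at time t contributes `kernel[lag]` of its effect at time t + lag. The kernel values and the function name are hypothetical.

```python
def delayed_effect(allocations, kernel):
    """Convolve per-period allocations with a delay kernel to get realised outcomes."""
    horizon = len(allocations)
    outcome = [0.0] * horizon
    for t, amount in enumerate(allocations):
        for lag, weight in enumerate(kernel):
            if t + lag < horizon:  # effects beyond the horizon are not observed
                outcome[t + lag] += amount * weight
    return outcome


# Hypothetical kernel: no immediate effect, most of the effect two periods later.
kernel = [0.0, 0.25, 0.75]
print(delayed_effect([1.0, 0.0, 0.0, 2.0, 0.0], kernel))
```

The truncation at the horizon is the practical difficulty of delayed feedback: the allocation made in period 3 has only partly paid off by the time the learner must decide again.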
[AI-4] Preview Accept or Discard? A Predictive Low-Motion Interaction Paradigm
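[Quick Read]: This paper addresses repetitive strain injury (RSI). The root of the problem is that current input devices such as the mouse still require fine-motor motion to operate, and decades of ergonomic redesign have not substantially reduced the incidence of RSI. The key to the solution is Preview Accept Discard (PAD), a zero-click interaction paradigm in which AI predicts the GUI targets the user is likely to want and presents ranked suggestions; the user accepts or discards a candidate through key-release timing alone. This substantially reduces hand motion while keeping task times comparable to trackpad use, provided that prediction accuracy is close to that of the best spell-checkers.
Link: https://arxiv.org/abs/2511.10532
Authors: Jose Berengueres
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: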
Abstract:Repetitive strain injury (RSI) affects roughly one in five computer users and remains largely unresolved despite decades of ergonomic mouse redesign. All such devices share a fundamental limitation: they still require fine-motor motion to operate. This work investigates whether predictive, AI-assisted input can reduce that motion by replacing physical pointing with ranked on-screen suggestions. To preserve user agency, we introduce Preview Accept Discard (PAD), a zero-click interaction paradigm that lets users preview predicted GUI targets, cycle through a small set of ranked alternatives, and accept or discard them via key-release timing. We evaluate PAD in two settings: a browser-based email client and an ISO 9241-9 keyboard-prediction task under varying top-3 accuracies. Across both studies, PAD substantially reduces hand motion relative to trackpad use, while maintaining task times comparable to the trackpad only when accuracies are similar to those of the best spell-checkers.
[AI-5] Rethinking Science in the Age of Artificial Intelligence
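[Quick Read]: This paper asks how, as artificial intelligence (AI) becomes more deeply embedded in the research workflow, efficiency gains can be realized without weakening core elements of science such as human judgment, ethical review, and validation of results. The key to the solution is integrating AI into academic practice as an active collaborator rather than a replacement, with policies that promote transparency, reproducibility, and accountability so that AI is adopted and governed deliberately in scientific workflows.
Link: https://arxiv.org/abs/2511.10524
Authors: Maksim E. Eren, Dorianis M. Perez
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: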
Abstract:Artificial intelligence (AI) is reshaping how research is conceived, conducted, and communicated across fields from chemistry to biomedicine. This commentary examines how AI is transforming the research workflow. AI systems now help researchers manage the information deluge, filtering the literature, surfacing cross-disciplinary links for ideas and collaborations, generating hypotheses, and designing and executing experiments. These developments mark a shift from AI as a mere computational tool to AI as an active collaborator in science. Yet this transformation demands thoughtful integration and governance. We argue that at this time AI must augment but not replace human judgment in academic workflows such as peer review, ethical evaluation, and validation of results. This paper calls for the deliberate adoption of AI within the scientific practice through policies that promote transparency, reproducibility, and accountability.
[AI-6] On the Detectability of Active Gradient Inversion Attacks in Federated Learning
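[Quick Read]: This paper addresses privacy leakage in Federated Learning (FL) caused by active Gradient Inversion Attacks (GIAs) launched by a malicious server. Although FL improves privacy by keeping data on-site, prior work has shown that an attacker can manipulate the global model to reconstruct clients' local data, and recent active attacks claim to be far stealthier than earlier ones, which threatens the security of FL in practice. The key to the solution is a set of lightweight client-side detection techniques built on two ideas: identifying statistically improbable weight structures, and monitoring anomalous loss and gradient dynamics. These let clients detect state-of-the-art active GIAs without any modification to the FL training protocol.
Link: https://arxiv.org/abs/2511.10502
Authors: Vincenzo Carletti, Pasquale Foggia, Carlo Mazzocca, Giuseppe Parrella, Mario Vento
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: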
Abstract:One of the key advantages of Federated Learning (FL) is its ability to collaboratively train a Machine Learning (ML) model while keeping clients’ data on-site. However, this can create a false sense of security. Although not sharing private data increases overall privacy, prior studies have shown that gradients exchanged during the FL training remain vulnerable to Gradient Inversion Attacks (GIAs). These attacks allow reconstructing the clients’ local data, breaking the privacy promise of FL. GIAs can be launched by either a passive or an active server. In the latter case, a malicious server manipulates the global model to facilitate data reconstruction. While effective, earlier attacks falling under this category have been demonstrated to be detectable by clients, limiting their real-world applicability. Recently, novel active GIAs have emerged, claiming to be far stealthier than previous approaches. This work provides the first comprehensive analysis of these claims, investigating four state-of-the-art GIAs. We propose novel lightweight client-side detection techniques, based on statistically improbable weight structures and anomalous loss and gradient dynamics. Extensive evaluation across several configurations demonstrates that our methods enable clients to effectively detect active GIAs without any modifications to the FL training protocol.
[AI-7] Strategic Opponent Modeling with Graph Neural Networks Deep Reinforcement Learning and Probabilistic Topic Modeling
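[Quick Read]: This paper addresses the challenges of strategic opponent modeling in multiagent systems, in particular how to handle uncertainty, heterogeneity, and non-stationary environments in realistic settings. The core issue is that traditional methods often depend on unrealistic assumptions, such as the Common Prior Assumption (CPA) and the Self-Interest Hypothesis (SIH), and adapt poorly to complex, dynamic multiagent interaction. The key to the solution is combining machine learning with game-theoretic ideas: Graph Neural Networks (GNN) to model relationships among agents in a structured way; Multiagent Deep Reinforcement Learning (MADRL) for policy optimization; Probabilistic Topic Modeling (PTM) to estimate unknown distributions and so cope with heterogeneity and uncertain beliefs; and game-theoretic solution concepts with properties such as fairness and stability to improve robustness and scalability in real-world applications.
Link: https://arxiv.org/abs/2511.10501
Authors: Georgios Chalkiadakis, Charilaos Akasiadis, Gerasimos Koresis, Stergios Plataniots, Leonidas Bakopoulos
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 26 pages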
Abstract:This paper provides a comprehensive review of mainly Graph Neural Networks, Deep Reinforcement Learning, and Probabilistic Topic Modeling methods with a focus on their potential incorporation in strategic multiagent settings. We draw interest in (i) Machine Learning methods currently utilized for uncovering unknown model structures adaptable to the task of strategic opponent modeling, and (ii) the integration of these methods with Game Theoretic concepts that avoid relying on assumptions often invalid in real-world scenarios, such as the Common Prior Assumption (CPA) and the Self-Interest Hypothesis (SIH). We analyze the ability to handle uncertainty and heterogeneity, two characteristics that are very common in real-world application cases, as well as scalability. As a potential answer to effectively modeling relationships and interactions in multiagent settings, we champion the use of Graph Neural Networks (GNN). Such approaches are designed to operate upon graph-structured data, and have been shown to be a very powerful tool for performing tasks such as node classification and link prediction. Next, we review the domain of Reinforcement Learning (RL), and in particular that of Multiagent Deep Reinforcement Learning (MADRL). We then describe existing relevant game theoretic solution concepts and consider properties such as fairness and stability. Our review comes complete with a note on the literature that utilizes PTM in domains other than that of document analysis and classification. The capability of PTM to estimate unknown underlying distributions can help with tackling heterogeneity and unknown agent beliefs. Finally, we identify certain open challenges, specifically the need to (i) fit non-stationary environments, (ii) balance the degrees of stability and adaptation, (iii) tackle uncertainty and heterogeneity, and (iv) guarantee scalability and solution tractability.
[AI-8] Proceedings of The third international workshop on eXplainable AI for the Arts (XAIxArts)
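[Quick Read]: This paper addresses the lack of explainability of generative AI in artistic creation, that is, how artists and audiences can understand the decision logic and mechanisms behind AI-generated art. The key to the solution is an interdisciplinary international workshop (XAIxArts) that brings together researchers in human-computer interaction (HCI), interaction design, artificial intelligence (AI), and digital arts to explore application frameworks, evaluation methods, and ethical considerations for explainable AI (XAI) in the arts, with the aim of making XAI more transparent and trustworthy in artistic practice.
Link: https://arxiv.org/abs/2511.10482
Authors: Corey Ford, Elizabeth Wilson, Shuoyang Zheng, Gabriel Vigliensoni, Jeba Rezwana, Lanxi Xiao, Michael Clemens, Makayla Lewis, Drew Hemment, Alan Chamberlain, Helen Kennedy, Nick Bryan-Kinns
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multimedia (cs.MM); Sound (cs.SD)
Comments: Proceedings of The second international workshop on eXplainable AI for the Arts (XAIxArts)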
Abstract:This third international workshop on explainable AI for the Arts (XAIxArts) brought together a community of researchers in HCI, Interaction Design, AI, explainable AI (XAI), and digital arts to explore the role of XAI for the Arts. Workshop held at the 17th ACM Conference on Creativity and Cognition (CC 2025), online.
[AI-9] Scalable Synthesis of distributed LLM workloads through Symbolic Tensor Graphs
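[Quick Read]: This paper addresses the challenge of optimizing the performance of large language models (LLMs) on distributed training and inference systems: there is no scalable and expressive mechanism for modeling distributed workload execution, which limits system-level optimizations (such as choosing parallelization strategies) and design-space exploration. The key to the solution is the Symbolic Tensor grAph GEnerator (STAGE), a framework that synthesizes high-fidelity execution traces. It supports a comprehensive set of parallelization strategies and preserves tensor-level accuracy in compute, memory, and communication at scales of over 32K GPUs, providing accurate modeling of LLM workloads for research on and optimization of future large-scale distributed machine learning systems.
Link: https://arxiv.org/abs/2511.10480
Authors: Changhai Man, Joongun Park, Hanjiang Wu, Huan Xu, Srinivas Sridharan, Tushar Krishna
Affiliations: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: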
Abstract:Optimizing the performance of large language models (LLMs) on large-scale AI training and inference systems requires a scalable and expressive mechanism to model distributed workload execution. Such modeling is essential for pre-deployment system-level optimizations (e.g., parallelization strategies) and design-space explorations. While recent efforts have proposed collecting execution traces from real systems, access to large-scale infrastructure remains limited to major cloud providers. Moreover, traces obtained from existing platforms cannot be easily adapted to study future larger-scale system configurations. We introduce Symbolic Tensor grAph GEnerator (STAGE), a framework that synthesizes high-fidelity execution traces to accurately model LLM workloads. STAGE supports a comprehensive set of parallelization strategies, allowing users to systematically explore a wide spectrum of LLM architectures and system configurations. STAGE demonstrates its scalability by synthesizing high-fidelity LLM traces spanning over 32K GPUs, while preserving tensor-level accuracy in compute, memory, and communication. STAGE will be publicly available to facilitate further research in distributed machine learning systems.
[AI-10] Non-Monotonic S4F Standpoint Logic
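【Quick Read】: This paper addresses the unified formal representation of multi-viewpoint semantic commitments, in particular how to combine non-monotonic reasoning with the representation of heterogeneous viewpoints while keeping computation feasible. The key to the solution is a new formalism, S4F Standpoint Logic, which merges the modal logic S4F with standpoint propositional logic. It supports multi-viewpoint semantic modeling, captures non-monotonic reasoning naturally, and is computationally no harder than its constituent logics.
Link: https://arxiv.org/abs/2511.10449
Authors: Piotr Gorczyca,Hannes Strass
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: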
Abstract:Standpoint logics offer unified modal logic-based formalisms for representing multiple heterogeneous viewpoints. At the same time, many non-monotonic reasoning frameworks can be naturally captured using modal logics, in particular using the modal logic S4F. In this work, we propose a novel formalism called S4F Standpoint Logic, which generalises both S4F and standpoint propositional logic and is therefore capable of expressing multi-viewpoint, non-monotonic semantic commitments. We define its syntax and semantics and analyze its computational complexity, obtaining the result that S4F Standpoint Logic is not computationally harder than its constituent logics, whether in monotonic or non-monotonic form. We also outline mechanisms for credulous and sceptical acceptance and illustrate the framework with an example.
[AI-11] Improving Perturbation-based Explanations by Understanding the Role of Uncertainty Calibration NEURIPS2025
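【Quick Read】: This paper addresses the unreliability of perturbation-based explanations that arises because model behavior under the specific perturbations used is unknown, focusing on uncertainty calibration, i.e. the alignment of model confidence with actual accuracy. The study shows that such perturbations systematically lead to unreliable probability estimates, which directly degrades both global and local explanation quality. The key to the solution is ReCalX, which recalibrates the model to reduce calibration error under the specific perturbations while leaving the original predictions unchanged, improving explanation robustness and the identification of globally important input features.
Link: https://arxiv.org/abs/2511.10439
Authors: Thomas Decker,Volker Tresp,Florian Buettner
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)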
Abstract:Perturbation-based explanations are widely utilized to enhance the transparency of machine-learning models in practice. However, their reliability is often compromised by the unknown model behavior under the specific perturbations used. This paper investigates the relationship between uncertainty calibration - the alignment of model confidence with actual accuracy - and perturbation-based explanations. We show that models systematically produce unreliable probability estimates when subjected to explainability-specific perturbations and theoretically prove that this directly undermines global and local explanation quality. To address this, we introduce ReCalX, a novel approach to recalibrate models for improved explanations while preserving their original predictions. Empirical evaluations across diverse models and datasets demonstrate that ReCalX consistently reduces perturbation-specific miscalibration most effectively while enhancing explanation robustness and the identification of globally important input features.
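To make "uncertainty calibration" concrete, here is a minimal sketch of the standard binned expected calibration error (ECE). It is not ReCalX and is not taken from the paper; the function name, the bin count, and the toy data are assumptions chosen for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average of |mean confidence - accuracy| over bins.

    confidences: predicted probability of the chosen class, in [0, 1]
    correct:     booleans, True where the prediction was right
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / total) * abs(avg_conf - accuracy)
    return ece


# Toy data (hypothetical): the model is overconfident at 0.9 and at 0.6.
confs = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6]
hits = [True, True, True, False, True, False]
print(round(expected_calibration_error(confs, hits), 4))
```

A model whose confidence matches its accuracy in every bin scores 0. Under the abstract's framing, one would compare this quantity on clean inputs and on inputs perturbed the way an explanation method perturbs them.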
[AI-12] Preference Elicitation for Step-Wise Explanations in Logic Puzzles
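【Quick Read】: This paper addresses how to select the most comprehensible step from many candidate explanation steps when explaining logic puzzles and other satisfaction problems. The core challenge is that a user-defined objective function that accurately quantifies the quality of each step is hard to specify, so interactive preference elicitation methods from machine learning are used to learn user preferences from pairwise comparisons instead. The key to the solution is twofold: two dynamic normalization techniques that stabilize learning when sub-objectives differ widely in scale, and MACHOP (Multi-Armed CHOice Perceptron), a new query generation strategy that integrates non-domination constraints with upper-confidence-bound-based diversification. In evaluations with both artificial and real users, MACHOP produces higher-quality explanations than the standard approach.
Link: https://arxiv.org/abs/2511.10436
Authors: Marco Foschini,Marianne Defresne,Emilio Gamba,Bart Bogaerts,Tias Guns
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: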
Abstract:Step-wise explanations can explain logic puzzles and other satisfaction problems by showing how to derive decisions step by step. Each step consists of a set of constraints that derive an assignment to one or more decision variables. However, many candidate explanation steps exist, with different sets of constraints and different decisions they derive. To identify the most comprehensible one, a user-defined objective function is required to quantify the quality of each step. However, defining a good objective function is challenging. Here, interactive preference elicitation methods from the wider machine learning community can offer a way to learn user preferences from pairwise comparisons. We investigate the feasibility of this approach for step-wise explanations and address several limitations that distinguish it from elicitation for standard combinatorial problems. First, because the explanation quality is measured using multiple sub-objectives that can vary a lot in scale, we propose two dynamic normalization techniques to rescale these features and stabilize the learning process. We also observed that many generated comparisons involve similar explanations. For this reason, we introduce MACHOP (Multi-Armed CHOice Perceptron), a novel query generation strategy that integrates non-domination constraints with upper confidence bound-based diversification. We evaluate the elicitation techniques on Sudokus and Logic-Grid puzzles using artificial users, and validate them with a real-user evaluation. In both settings, MACHOP consistently produces higher-quality explanations than the standard approach.
[AI-13] Using Certifying Constraint Solvers for Generating Step-wise Explanations AAAI2026
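【Quick Read】: This paper addresses the high computational cost of step-wise explanations in explainable constraint solving, which limits the size of problems they can be applied to. The key to the solution is to use proofs generated by a constraint solver as the starting point, rather than building the explanation sequence step by step from scratch. The authors define a framework of abstract proofs in which both proofs and step-wise explanations can be represented, and propose several conversion methods combined with trimming and simplification techniques that keep the sequence and its individual steps small. This significantly speeds up generation while keeping explanation quality close to the current state of the art.
Link: https://arxiv.org/abs/2511.10428
Authors: Ignace Bleukx,Maarten Flippo,Bart Bogaerts,Emir Demirović,Tias Guns
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted for publication at AAAI 2026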
Abstract:In the field of Explainable Constraint Solving, it is common to explain to a user why a problem is unsatisfiable. A recently proposed method for this is to compute a sequence of explanation steps. Such a step-wise explanation shows individual reasoning steps involving constraints from the original specification, that in the end explain a conflict. However, computing a step-wise explanation is computationally expensive, limiting the scope of problems for which it can be used. We investigate how we can use proofs generated by a constraint solver as a starting point for computing step-wise explanations, instead of computing them step-by-step. More specifically, we define a framework of abstract proofs, in which both proofs and step-wise explanations can be represented. We then propose several methods for converting a proof to a step-wise explanation sequence, with special attention to trimming and simplification techniques to keep the sequence and its individual steps small. Our results show our method significantly speeds up the generation of step-wise explanation sequences, while the resulting step-wise explanation has a quality similar to the current state-of-the-art.
[AI-14] Generalizing Analogical Inference from Boolean to Continuous Domains AAAI2026
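【Quick Read】: This paper addresses the inability of existing analogical inference frameworks to handle continuous domains and regression tasks, where theoretical guarantees are lacking once one moves beyond Boolean classification. The key to the solution is a unified framework that defines parameterized analogies via generalized means and thereby covers both Boolean classification and regression. The framework characterizes the class of analogy-preserving functions and yields worst-case and average-case error bounds under smoothness assumptions, giving a formal theoretical basis for analogical inference across discrete and continuous domains.
Link: https://arxiv.org/abs/2511.10416
Authors: Francisco Cunha,Yves Lepage,Zied Bouraoui,Miguel Couceiro
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 11 pages, to appear in AAAI 2026, extended version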
Abstract:Analogical reasoning is a powerful inductive mechanism, widely used in human cognition and increasingly applied in artificial intelligence. Formal frameworks for analogical inference have been developed for Boolean domains, where inference is provably sound for affine functions and approximately correct for functions close to affine. These results have informed the design of analogy-based classifiers. However, they do not extend to regression tasks or continuous domains. In this paper, we revisit analogical inference from a foundational perspective. We first present a counterexample showing that existing generalization bounds fail even in the Boolean setting. We then introduce a unified framework for analogical reasoning in real-valued domains based on parameterized analogies defined via generalized means. This model subsumes both Boolean classification and regression, and supports analogical inference over continuous functions. We characterize the class of analogy-preserving functions in this setting and derive both worst-case and average-case error bounds under smoothness assumptions. Our results offer a general theory of analogical inference across discrete and continuous domains.
[AI-15] Explaining Decentralized Multi-Agent Reinforcement Learning Policies AAAI-26
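【Quick Read】: This paper addresses the lack of effective explanation mechanisms for decentralized Multi-Agent Reinforcement Learning (MARL). Existing explanation methods mainly target centralized MARL and do not handle the uncertainty and nondeterminism of decentralized settings. The key to the solution is two kinds of explanation: policy summarizations that capture task ordering and agent cooperation, and query-based explanations that answer When, Why Not, and What questions about specific agent behaviors. The approach is evaluated on four MARL domains and two decentralized MARL algorithms to demonstrate generalizability and computational efficiency, and user studies show significant gains in question-answering performance and in subjective ratings such as understanding and satisfaction.
Link: https://arxiv.org/abs/2511.10409
Authors: Kayla Boggess,Sarit Kraus,Lu Feng
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted for oral presentation at AAAI-26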
Abstract:Multi-Agent Reinforcement Learning (MARL) has gained significant interest in recent years, enabling sequential decision-making across multiple agents in various domains. However, most existing explanation methods focus on centralized MARL, failing to address the uncertainty and nondeterminism inherent in decentralized settings. We propose methods to generate policy summarizations that capture task ordering and agent cooperation in decentralized MARL policies, along with query-based explanations for When, Why Not, and What types of user queries about specific agent behaviors. We evaluate our approach across four MARL domains and two decentralized MARL algorithms, demonstrating its generalizability and computational efficiency. User studies show that our summarizations and explanations significantly improve user question-answering performance and enhance subjective ratings on metrics such as understanding and satisfaction.
[AI-16] nuPlan-R: A Closed-Loop Planning Benchmark for Autonomous Driving via Reactive Multi-Agent Simulation
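【Quick Read】: This paper addresses the reliance of existing closed-loop planning benchmarks on rule-based reactive agents such as the Intelligent Driver Model (IDM), which leads to oversimplified traffic dynamics, limited behavioral diversity, and a poor reflection of real human interactions. The key to the solution is the nuPlan-R benchmark. It replaces the rule-based agents with reactive multi-agent simulation driven by a noise-decoupled diffusion model to produce more realistic, diverse, and human-like traffic behavior, adds an interaction-aware agent selection mechanism that retains realism while improving computational efficiency, and introduces two additional metrics for a more comprehensive assessment of planning performance. Experiments indicate more realistic scenarios and better modeling of complex interactions, giving a more reliable basis for assessing the advantages of learning-based planners in complex, dynamic environments.
Link: https://arxiv.org/abs/2511.10403
Authors: Mingxing Peng,Ruoyu Yao,Xusen Guo,Jun Ma
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 8 pages, 3 figures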
Abstract:Recent advances in closed-loop planning benchmarks have significantly improved the evaluation of autonomous vehicles. However, existing benchmarks still rely on rule-based reactive agents such as the Intelligent Driver Model (IDM), which lack behavioral diversity and fail to capture realistic human interactions, leading to oversimplified traffic dynamics. To address these limitations, we present nuPlan-R, a new reactive closed-loop planning benchmark that integrates learning-based reactive multi-agent simulation into the nuPlan framework. Our benchmark replaces the rule-based IDM agents with noise-decoupled diffusion-based reactive agents and introduces an interaction-aware agent selection mechanism to ensure both realism and computational efficiency. Furthermore, we extend the benchmark with two additional metrics to enable a more comprehensive assessment of planning performance. Extensive experiments demonstrate that our reactive agent model produces more realistic, diverse, and human-like traffic behaviors, leading to a benchmark environment that better reflects real-world interactive driving. We further reimplement a collection of rule-based, learning-based, and hybrid planning approaches within our nuPlan-R benchmark, providing a clearer reflection of planner performance in complex interactive scenarios and better highlighting the advantages of learning-based planners in handling complex and dynamic scenarios. These results establish nuPlan-R as a new standard for fair, reactive, and realistic closed-loop planning evaluation. We will open-source the code for the new benchmark.
[AI-17] Enhancing Kernel Power K-means: Scalable and Robust Clustering with Random Fourier Features and Possibilistic Method
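【Quick Read】: This paper addresses the computational bottleneck and limited noise robustness of kernel power k-means (KPKM) on large-scale data. KPKM needs the full kernel matrix, which is computationally expensive, and it lacks genuine centroid-sample assignment learning, which makes it sensitive to noise. The key to the solution is RFF-KPKM, which for the first time brings random Fourier features (RFF) into the KPKM framework, replacing the full kernel matrix with low-dimensional approximate feature maps for much better scalability. The authors prove an excess risk bound of $\mathcal{O}(\sqrt{k^3/n})$, strong consistency with membership values, and a $(1+\varepsilon)$ relative error bound achievable with RFF of dimension $\mathrm{poly}(\varepsilon^{-1}\log k)$. To improve multiple-kernel learning and robustness, they further propose IP-RFF-MKPKM, which combines the merits of possibilistic and fuzzy membership for more stable cluster assignments.
Link: https://arxiv.org/abs/2511.10392
Authors: Yixi Chen,Weixuan Liang,Tianrui Liu,Jun-Jie Huang,Ao Li,Xueling Zhu,Xinwang Liu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: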
Abstract:Kernel power $k$-means (KPKM) leverages a family of means to mitigate local minima issues in kernel $k$-means. However, KPKM faces two key limitations: (1) the computational burden of the full kernel matrix restricts its use on extensive data, and (2) the lack of authentic centroid-sample assignment learning reduces its noise robustness. To overcome these challenges, we propose RFF-KPKM, introducing the first approximation theory for applying random Fourier features (RFF) to KPKM. RFF-KPKM employs RFF to generate efficient, low-dimensional feature maps, bypassing the need for the whole kernel matrix. Crucially, we are the first to establish strong theoretical guarantees for this combination: (1) an excess risk bound of $\mathcal{O}(\sqrt{k^3/n})$, (2) strong consistency with membership values, and (3) a $(1+\varepsilon)$ relative error bound achievable using the RFF of dimension $\mathrm{poly}(\varepsilon^{-1}\log k)$. Furthermore, to improve robustness and the ability to learn multiple kernels, we propose IP-RFF-MKPKM, an improved possibilistic RFF-based multiple kernel power $k$-means. IP-RFF-MKPKM ensures the scalability of MKPKM via RFF and refines cluster assignments by combining the merits of the possibilistic membership and fuzzy membership. Experiments on large-scale datasets demonstrate the superior efficiency and clustering accuracy of the proposed methods compared to the state-of-the-art alternatives.
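As background for the RFF part, the sketch below shows the classic random Fourier feature construction for the Gaussian (RBF) kernel in plain Python. It only illustrates how an explicit feature map can stand in for kernel evaluations; it is not the RFF-KPKM algorithm, and the function names, `gamma`, and the sample points are assumptions made for this example.

```python
import math
import random


def make_rff_map(dim_in, num_features, gamma, seed=0):
    """Build z(.) such that dot(z(x), z(y)) approximates exp(-gamma * ||x - y||^2)."""
    rng = random.Random(seed)
    # For this kernel the random frequencies follow N(0, 2 * gamma * I).
    std = math.sqrt(2.0 * gamma)
    weights = [[rng.gauss(0.0, std) for _ in range(dim_in)]
               for _ in range(num_features)]
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def feature_map(x):
        return [scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
                for w, b in zip(weights, offsets)]

    return feature_map


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def rbf_kernel(x, y, gamma):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


# Hypothetical points: compare the exact kernel with its RFF approximation.
x, y = [0.1, -0.4, 0.7], [0.3, 0.2, -0.1]
z = make_rff_map(dim_in=3, num_features=2000, gamma=0.5, seed=1)
print("exact :", rbf_kernel(x, y, 0.5))
print("approx:", dot(z(x), z(y)))
```

With n samples and D features, clustering can then run on an n-by-D feature matrix instead of an n-by-n kernel matrix, which is the scalability argument the abstract makes.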
[AI-18] SITA: A Framework for Structure-to-Instance Theorem Autoformalization
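【Quick Read】: This paper addresses a difficulty in theorem autoformalization: turning results obtained by instantiating abstract mathematical structures in concrete settings into verifiable formal theorems, in particular in the Lean proof assistant. The core challenge is the gap between abstract theory and concrete application, which existing methods struggle to bridge automatically. The key to the solution is the structure-to-instance theorem autoformalization framework (SITA). It treats formalized abstract structures as modular templates (containing definitions, assumptions, operations, and theorems), generates the corresponding Lean definitions and instance declarations, integrates them through Lean's typeclass mechanism, and constructs verified theorems by checking structural assumptions. LLM-based generation is combined with feedback-guided refinement to achieve both automation and formal correctness. Experiments on a dataset of optimization problems show that the method formalizes diverse instances grounded in abstract structures.
Link: https://arxiv.org/abs/2511.10356
Authors: Chenyi Li,Wanli Ma,Zichen Wang,Zaiwen Wen
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: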
Abstract:While large language models (LLMs) have shown progress in mathematical reasoning, they still face challenges in formalizing theorems that arise from instantiating abstract structures in concrete settings. With the goal of auto-formalizing mathematical results at the research level, we develop a framework for structure-to-instance theorem autoformalization (SITA), which systematically bridges the gap between abstract mathematical theories and their concrete applications in Lean proof assistant. Formalized abstract structures are treated as modular templates that contain definitions, assumptions, operations, and theorems. These templates serve as reusable guides for the formalization of concrete instances. Given a specific instantiation, we generate corresponding Lean definitions and instance declarations, integrate them using Lean’s typeclass mechanism, and construct verified theorems by checking structural assumptions. We incorporate LLM-based generation with feedback-guided refinement to ensure both automation and formal correctness. Experiments on a dataset of optimization problems demonstrate that SITA effectively formalizes diverse instances grounded in abstract structures.
[AI-19] Massively Parallel Proof-Number Search for Impartial Games and Beyond
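【Quick Read】: This paper addresses the poor scaling of Proof-Number Search when it is parallelized across many CPU cores, with the goal of speeding up the solving of complex games. The key to the solution is the first massively parallel version of Proof-Number Search, which uses two parallelized levels and information shared among workers to scale efficiently on a large number of CPUs. The solver is further enhanced with Grundy numbers for reducing game trees and applied to the game of Sprouts. It reaches a 332.9× speedup on 1024 cores, outperforms the state-of-the-art Sprouts solver GLOP by four orders of magnitude in runtime, and verifies the Sprouts Conjecture for 42 new positions, nearly doubling the number of known outcomes.
Link: https://arxiv.org/abs/2511.10339
Authors: Tomáš Čížek,Martin Balko,Martin Schmid
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Computer Science and Game Theory (cs.GT)
Comments: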
Abstract:Proof-Number Search is a best-first search algorithm with many successful applications, especially in game solving. As large-scale computing clusters become increasingly accessible, parallelization is a natural way to accelerate computation. However, existing parallel versions of Proof-Number Search are known to scale poorly on many CPU cores. Using two parallelized levels and shared information among workers, we present the first massively parallel version of Proof-Number Search that scales efficiently even on a large number of CPUs. We apply our solver, enhanced with Grundy numbers for reducing game trees, to the Sprouts game, a case study motivated by the long-standing Sprouts Conjecture. Our solver achieves a significantly improved 332.9× speedup when run on 1024 cores, enabling it to outperform the state-of-the-art Sprouts solver GLOP by four orders of magnitude in runtime and to generate proofs 1,000× more complex. Despite exponential growth in game tree size, our solver verified the Sprouts Conjecture for 42 new positions, nearly doubling the number of known outcomes.
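The abstract mentions Grundy numbers as a way to reduce game trees. As a hedged illustration of the underlying idea (Sprague-Grundy theory for impartial games), the sketch below computes Grundy values for a toy subtraction game rather than Sprouts. The move set and function names are assumptions for the example, and the code is unrelated to the authors' solver.

```python
from functools import lru_cache

MOVES = (1, 2, 3)  # toy rule: a player removes 1, 2, or 3 tokens from one heap


def mex(values):
    """Minimum excludant: the smallest non-negative integer not in `values`."""
    seen = set(values)
    g = 0
    while g in seen:
        g += 1
    return g


@lru_cache(maxsize=None)
def grundy(heap):
    """Grundy number of a single heap; 0 means the player to move loses."""
    return mex(grundy(heap - m) for m in MOVES if m <= heap)


def first_player_wins(heaps):
    """A sum of independent games is a first-player win iff the XOR is non-zero."""
    total = 0
    for h in heaps:
        total ^= grundy(h)
    return total != 0


print([grundy(n) for n in range(8)])  # [0, 1, 2, 3, 0, 1, 2, 3]
print(first_player_wins([5, 6]))      # True  (1 XOR 2 = 3)
print(first_player_wins([5, 9]))      # False (1 XOR 1 = 0)
```

The practical benefit is that a position that splits into independent components can be solved component by component and combined with XOR, instead of searching the product of all components.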
[AI-20] Beyond Verification: Abductive Explanations for Post-AI Assessment of Privacy Leakage AAAI-26
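【Quick Read】: This paper addresses privacy leakage in AI-based decision processes, in particular the risk that sensitive information can be inferred from model decisions. The key to the solution is a formal framework that audits privacy leakage with abductive explanations: it identifies minimal sufficient evidence justifying a model decision and determines whether sensitive information is disclosed. By introducing Potentially Applicable Explanations (PAE), the framework identifies individuals whose outcomes can shield those with sensitive features, which allows leakage to be formalized at both the individual and the system level. The approach provides rigorous privacy guarantees while producing human-understandable explanations, meeting the transparency and interpretability requirements of auditing tools.
Link: https://arxiv.org/abs/2511.10284
Authors: Belona Sonna,Alban Grastien,Claire Benn
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted at the Workshop on Post-AI Formal Methods at AAAI-26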
Abstract:Privacy leakage in AI-based decision processes poses significant risks, particularly when sensitive information can be inferred. We propose a formal framework to audit privacy leakage using abductive explanations, which identifies minimal sufficient evidence justifying model decisions and determines whether sensitive information is disclosed. Our framework formalizes both individual and system-level leakage, introducing the notion of Potentially Applicable Explanations (PAE) to identify individuals whose outcomes can shield those with sensitive features. This approach provides rigorous privacy guarantees while producing human-understandable explanations, a key requirement for auditing tools. Experimental evaluation on the German Credit Dataset illustrates how the importance of a sensitive literal in the model decision process affects privacy leakage. Despite computational challenges and simplifying assumptions, our results demonstrate that abductive reasoning enables interpretable privacy auditing, offering a practical pathway to reconcile transparency, model interpretability, and privacy preservation in AI decision-making.
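The notion of "minimal sufficient evidence" can be sketched with a brute-force, deletion-based search for a subset-minimal abductive explanation over binary features. This is a generic textbook-style illustration, not the paper's framework (it does not implement PAE), and the two toy models and the feature layout are hypothetical.

```python
from itertools import product


def is_sufficient(predict, instance, fixed):
    """True if fixing the features in `fixed` forces the same decision
    for every assignment of the remaining (free) binary features."""
    free = [i for i in range(len(instance)) if i not in fixed]
    target = predict(instance)
    for values in product([0, 1], repeat=len(free)):
        candidate = list(instance)
        for i, v in zip(free, values):
            candidate[i] = v
        if predict(candidate) != target:
            return False
    return True


def abductive_explanation(predict, instance):
    """Deletion-based search: drop each feature that is not needed."""
    fixed = set(range(len(instance)))
    for i in range(len(instance)):
        if is_sufficient(predict, instance, fixed - {i}):
            fixed.discard(i)
    return sorted(fixed)


# Hypothetical features: [income_high, employed, sensitive_attribute]
def model_a(x):
    return int(x[0] == 1 and x[1] == 1)   # ignores the sensitive feature


def model_b(x):
    return int(x[0] == 1 and x[2] == 0)   # depends on the sensitive feature


print(abductive_explanation(model_a, [1, 1, 1]))  # [0, 1]
print(abductive_explanation(model_b, [1, 0, 0]))  # [0, 2]
```

In the second case the explanation contains feature 2, so the decision itself reveals the value of the sensitive attribute. That is the kind of signal an audit along the lines of the abstract would look for. The enumeration is exponential in the number of free features, so this only works for toy inputs.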
[AI-21] Torch-Uncertainty: A Deep Learning Framework for Uncertainty Quantification NEURIPS2025
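【Quick Read】: This paper addresses the difficulty Deep Neural Networks (DNNs) have in accurately quantifying the uncertainty of their predictions, which hinders reliable deployment in high-stakes settings such as medical diagnosis or autonomous driving. The authors introduce Torch-Uncertainty, an open-source framework built on PyTorch and Lightning. Its core contribution is a unified, modular, and extensible toolchain that integrates and evaluates a range of Uncertainty Quantification (UQ) methods and provides a standardized workflow for systematically benchmarking UQ performance on classification, segmentation, and regression tasks.
Link: https://arxiv.org/abs/2511.10282
Authors: Adrien Lafage,Olivier Laurent,Firas Gabetni,Gianni Franchi
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: NeurIPS 2025 Spotlight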
Abstract:Deep Neural Networks (DNNs) have demonstrated remarkable performance across various domains, including computer vision and natural language processing. However, they often struggle to accurately quantify the uncertainty of their predictions, limiting their broader adoption in critical real-world applications. Uncertainty Quantification (UQ) for Deep Learning seeks to address this challenge by providing methods to improve the reliability of uncertainty estimates. Although numerous techniques have been proposed, a unified tool offering a seamless workflow to evaluate and integrate these methods remains lacking. To bridge this gap, we introduce Torch-Uncertainty, a PyTorch and Lightning-based framework designed to streamline DNN training and evaluation with UQ techniques and metrics. In this paper, we outline the foundational principles of our library and present comprehensive experimental results that benchmark a diverse set of UQ methods across classification, segmentation, and regression tasks. Our library is available at this https URL
[AI-22] Fixed-Persona SLMs with Modular Memory: Scalable NPC Dialogue on Consumer Hardware
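【Quick Read】: This paper addresses the limited applicability of Large Language Models (LLMs) to dialogue systems in games, where the main obstacles are high hardware requirements, latency constraints, and the need to keep clearly defined knowledge boundaries within the game. The key to the solution is a modular non-player character (NPC) dialogue system based on Small Language Models (SLMs) that are fine-tuned to encode specific NPC personas and combined with runtime-swappable memory modules. These modules store character-specific conversational context and world knowledge, enabling expressive interactions and long-term memory without retraining or reloading the model, so that efficient, scalable, context-aware dialogue agents can run on consumer hardware.
Link: https://arxiv.org/abs/2511.10277
Authors: Martin Braas,Lukas Esterle
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: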
Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in generating human-like text, yet their applicability to dialogue systems in computer games remains limited. This limitation arises from their substantial hardware requirements, latency constraints, and the necessity to maintain clearly defined knowledge boundaries within a game setting. In this paper, we propose a modular NPC dialogue system that leverages Small Language Models (SLMs), fine-tuned to encode specific NPC personas and integrated with runtime-swappable memory modules. These memory modules preserve character-specific conversational context and world knowledge, enabling expressive interactions and long-term memory without retraining or model reloading during gameplay. We comprehensively evaluate our system using three open-source SLMs: DistilGPT-2, TinyLlama-1.1B-Chat, and Mistral-7B-Instruct, trained on synthetic persona-aligned data and benchmarked on consumer-grade hardware. While our approach is motivated by applications in gaming, its modular design and persona-driven memory architecture hold significant potential for broader adoption in domains requiring expressive, scalable, and memory-rich conversational agents, such as virtual assistants, customer support bots, or interactive educational systems.
[AI-23] RoboBenchMart: Benchmarking Robots in Retail Environment
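【Quick Read】: This paper addresses the oversimplification of current robotic manipulation benchmarks, most of which are limited to static tabletop scenes and do not reflect the manipulation challenges of real, complex environments. The authors propose RoboBenchMart, a more challenging and realistic benchmark for dark store environments. Its key feature is a realistic retail setting with dense object clutter and varied spatial configurations (items at different heights and depths and in close proximity), on which current state-of-the-art generalist models struggle even with common retail tasks. The release includes a complete tool suite: a procedural store layout generator, a trajectory generation pipeline, evaluation tools, and fine-tuned baseline models to support further research.
Link: https://arxiv.org/abs/2511.10276
Authors: Konstantin Soshin,Alexander Krapukhin,Andrei Spiridonov,Denis Shepelev,Gregorii Bukhtuev,Andrey Kuznetsov,Vlad Shakhuro
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: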
Abstract:Most existing robotic manipulation benchmarks focus on simplified tabletop scenarios, typically involving a stationary robotic arm interacting with various objects on a flat surface. To address this limitation, we introduce RoboBenchMart, a more challenging and realistic benchmark designed for dark store environments, where robots must perform complex manipulation tasks with diverse grocery items. This setting presents significant challenges, including dense object clutter and varied spatial configurations – with items positioned at different heights, depths, and in close proximity. By targeting the retail domain, our benchmark addresses a setting with strong potential for near-term automation impact. We demonstrate that current state-of-the-art generalist models struggle to solve even common retail tasks. To support further research, we release the RoboBenchMart suite, which includes a procedural store layout generator, a trajectory generation pipeline, evaluation tools and fine-tuned baseline models.
[AI-24] Bidirectional Bounded-Suboptimal Heuristic Search with Consistent Heuristics
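【Quick Read】: This paper addresses bounded-suboptimal bidirectional search, that is, improving search efficiency while guaranteeing that the solution cost stays within a specified suboptimality bound. The key to the solution is to build on BAE*, the state-of-the-art optimal bidirectional search algorithm for consistent heuristics, and to introduce several variants tailored to the bounded-suboptimal setting. Experiments compare these variants with other bounded-suboptimal bidirectional algorithms and with weighted A*, showing that each algorithm has its own strengths and weaknesses under different conditions.
Link: https://arxiv.org/abs/2511.10272
Authors: Shahaf S. Shperberg,Natalie Morad,Lior Siag,Ariel Felner,Dor Atzmon
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: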
Abstract:Recent advancements in bidirectional heuristic search have yielded significant theoretical insights and novel algorithms. While most previous work has concentrated on optimal search methods, this paper focuses on bounded-suboptimal bidirectional search, where a bound on the suboptimality of the solution cost is specified. We build upon the state-of-the-art optimal bidirectional search algorithm, BAE*, designed for consistent heuristics, and introduce several variants of BAE* specifically tailored for the bounded-suboptimal context. Through experimental evaluation, we compare the performance of these new variants against other bounded-suboptimal bidirectional algorithms as well as the standard weighted A* algorithm. Our results demonstrate that each algorithm excels under distinct conditions, highlighting the strengths and weaknesses of each approach.
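For readers unfamiliar with the bounded-suboptimal setting, the sketch below shows the unidirectional baseline named in the abstract, weighted A*, on a small 4-connected grid with a Manhattan-distance heuristic. It is not BAE* or any of the paper's variants; the grid, the function name, and the weight are assumptions for illustration. With a consistent heuristic and weight w >= 1, the returned cost is at most w times the optimal cost.

```python
import heapq


def weighted_a_star(grid, start, goal, weight=1.0):
    """Return the cost of the path found (unit step costs), or None if unreachable.

    grid: list of equal-length strings, '#' marks a wall.
    Nodes are ordered by f = g + weight * h.
    """
    rows, cols = len(grid), len(grid[0])

    def h(cell):
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    best_g = {start: 0}
    frontier = [(weight * h(start), 0, start)]
    while frontier:
        _, g, cell = heapq.heappop(frontier)
        if cell == goal:
            return g
        if g > best_g.get(cell, float("inf")):
            continue  # stale queue entry
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != "#":
                ng = g + 1
                if ng < best_g.get((nr, nc), float("inf")):
                    best_g[(nr, nc)] = ng
                    heapq.heappush(frontier, (ng + weight * h((nr, nc)), ng, (nr, nc)))
    return None


GRID = [
    "....",
    ".##.",
    "....",
]
optimal = weighted_a_star(GRID, (0, 0), (2, 3), weight=1.0)
bounded = weighted_a_star(GRID, (0, 0), (2, 3), weight=2.0)
print(optimal, bounded)
```

A larger weight usually means fewer expansions at the price of a looser guarantee. The paper's variants pursue the same trade-off while searching from both ends.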
[AI-25] Quality Assurance of LLM-generated Code: Addressing Non-Functional Quality Characteristics
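【Quick Read】: This paper addresses the lack of systematic evaluation and understanding of the non-functional qualities of code generated by Large Language Models (LLMs) in software engineering. Existing studies focus mainly on whether generated code passes tests and overlook its actual quality in terms of security, maintainability, performance efficiency, and similar attributes. Guided by the ISO/IEC 25010 quality model, the study combines three complementary methods: a systematic literature review, industry workshops with practitioners, and an empirical analysis. It reveals differences between academic and industrial priorities regarding code quality, confirms trade-offs between quality dimensions in LLM-generated patches, and argues that quality assurance mechanisms need to be built into LLM code generation pipelines so that generated code is not only functionally correct but also of high quality.
Link: https://arxiv.org/abs/2511.10271
Authors: Xin Sun,Daniel Ståhl,Kristian Sandahl,Christoph Kessler
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: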
Abstract:In recent years, LLMs have been widely integrated into software engineering workflows, supporting tasks like code generation. However, while these models often generate functionally correct outputs, we still lack a systematic understanding and evaluation of their non-functional qualities. Existing studies focus mainly on whether generated code passes the tests rather than whether it passes with quality. Guided by the ISO/IEC 25010 quality model, this study conducted three complementary investigations: a systematic review of 108 papers, two industry workshops with practitioners from multiple organizations, and an empirical analysis of patching real-world software issues using three LLMs. Motivated by insights from both the literature and practitioners, the empirical study examined the quality of generated patches on security, maintainability, and performance efficiency. Across the literature, we found that security and performance efficiency dominate academic attention, while maintainability and other qualities are understudied. In contrast, industry experts prioritize maintainability and readability, warning that generated code may accelerate the accumulation of technical debt. In our evaluation of functionally correct patches generated by three LLMs, improvements in one quality dimension often come at the cost of others. Runtime and memory results further show high variance across models and optimization strategies. Overall, our findings reveal a mismatch between academic focus, industry priorities, and model performance, highlighting the urgent need to integrate quality assurance mechanisms into LLM code generation pipelines to ensure that future generated code not only passes tests but truly passes with quality.
[AI-26] Causal-HalBench: Uncovering LVLMs Object Hallucinations Through Causal Intervention AAAI
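【Quick Read】: This paper addresses object hallucination in Large Vision-Language Models (LVLMs). The authors attribute it mainly to spurious correlations that models learn from highly co-occurring objects during training, which lead them to report objects that are not present when the visual context suggests them. The key to the solution is to bring causal analysis to the problem: a Structural Causal Model (SCM) formally defines these spurious correlations, and the Causal-HalBench benchmark uses counterfactual samples together with comprehensive causal metrics to quantify how robust LVLMs are to them. The authors also propose an extensible pipeline for constructing counterfactual samples with proprietary LVLMs and Text-to-Image (T2I) models, so that hallucination risk can be exposed and measured systematically.
Link: https://arxiv.org/abs/2511.10268
Authors: Zhe Xu,Zhicai Wang,Junkang Wu,Jinda Lu,Xiang Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: accepted for publication in the Association for the Advancement of Artificial Intelligence (AAAI), 2026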
Abstract:Large Vision-Language Models (LVLMs) often suffer from object hallucination, making erroneous judgments about the presence of objects in images. We propose this primar- ily stems from spurious correlations arising when models strongly associate highly co-occurring objects during train- ing, leading to hallucinated objects influenced by visual con- text. Current benchmarks mainly focus on hallucination de- tection but lack a formal characterization and quantitative evaluation of spurious correlations in LVLMs. To address this, we introduce causal analysis into the object recognition scenario of LVLMs, establishing a Structural Causal Model (SCM). Utilizing the language of causality, we formally de- fine spurious correlations arising from co-occurrence bias. To quantify the influence induced by these spurious correla- tions, we develop Causal-HalBench, a benchmark specifically constructed with counterfactual samples and integrated with comprehensive causal metrics designed to assess model ro- bustness against spurious correlations. Concurrently, we pro- pose an extensible pipeline for the construction of these coun- terfactual samples, leveraging the capabilities of proprietary LVLMs and Text-to-Image (T2I) models for their genera- tion. Our evaluations on mainstream LVLMs using Causal- HalBench demonstrate these models exhibit susceptibility to spurious correlations, albeit to varying extents.
[AI-27] Temporal Properties of Conditional Independence in Dynamic Bayesian Networks
Link: https://arxiv.org/abs/2511.10266
Authors: Rajab Aghamov, Christel Baier, Joel Ouaknine, Jakob Piribauer, Mihir Vahanwala, Isa Vialard
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
[AI-28] Beyond Single-Step Updates: Reinforcement Learning of Heuristics with Limited-Horizon Search
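[Quick Read]: This paper targets inefficient heuristic search in sequential decision-making problems, in particular the limits that single-step Bellman updates impose on both state sampling and heuristic updates when heuristics are learned. The key to the solution is a generalized approach that performs limited-horizon search and updates each state's heuristic from the shortest path to the search frontier, combining edge costs with the heuristic values of frontier states to improve heuristic quality and search efficiency.
Link: https://arxiv.org/abs/2511.10264
Authors: Gal Hadar, Forest Agostinelli, Shahaf S. Shperberg
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: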
Abstract:Many sequential decision-making problems can be formulated as shortest-path problems, where the objective is to reach a goal state from a given starting state. Heuristic search is a standard approach for solving such problems, relying on a heuristic function to estimate the cost to the goal from any given state. Recent approaches leverage reinforcement learning to learn heuristics by applying deep approximate value iteration. These methods typically rely on single-step Bellman updates, where the heuristic of a state is updated based on its best neighbor and the corresponding edge cost. This work proposes a generalized approach that enhances both state sampling and heuristic updates by performing limited-horizon searches and updating each state’s heuristic based on the shortest path to the search frontier, incorporating both edge costs and the heuristic values of frontier states.
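The update described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `limited_horizon_target`, the uniform-cost expansion order, and the `max_expansions` budget are assumptions made for the example, and edge costs are assumed to be non-negative.

```python
import heapq
from typing import Callable, Dict, Hashable, Iterable, List, Set, Tuple

State = Hashable


def limited_horizon_target(
    start: State,
    neighbors: Callable[[State], Iterable[Tuple[State, float]]],
    heuristic: Callable[[State], float],
    is_goal: Callable[[State], bool],
    max_expansions: int,
) -> float:
    """Heuristic target for `start` from a bounded search (hypothetical helper).

    Expands at most `max_expansions` states in uniform-cost order, then backs up
    the minimum over frontier states of (path cost + heuristic of that state).
    Goal states count as frontier states with heuristic value 0.
    """
    best_cost: Dict[State, float] = {start: 0.0}
    expanded: Set[State] = set()
    open_heap: List[Tuple[float, int, State]] = [(0.0, 0, start)]
    tie = 1  # tie-breaker so states themselves are never compared
    while open_heap and len(expanded) < max_expansions:
        cost, _, state = heapq.heappop(open_heap)
        if state in expanded or cost > best_cost[state]:
            continue  # stale heap entry
        if is_goal(state):
            break  # the goal stays on the frontier with its exact cost
        expanded.add(state)
        for nxt, edge_cost in neighbors(state):
            new_cost = cost + edge_cost
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                heapq.heappush(open_heap, (new_cost, tie, nxt))
                tie += 1
    # Frontier = every state that was reached but not expanded.
    frontier_values = [
        g + (0.0 if is_goal(s) else heuristic(s))
        for s, g in best_cost.items()
        if s not in expanded
    ]
    return min(frontier_values, default=float("inf"))
```

With `max_expansions=1` the frontier is just the neighbors of `start`, so the function reduces to the single-step Bellman backup (best neighbor plus edge cost) that the abstract contrasts against; larger budgets back up values from deeper frontiers.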
[AI-29] Workload Schedulers – Genesis Algorithms and Differences
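[Quick Read]: This paper addresses the unclear categorization of modern workload schedulers by systematically surveying and comparing scheduler types across settings, together with their evolution, to clarify how their designs differ and what they share. The key to the solution is a three-class framework: Operating Systems Process Schedulers, Cluster Systems Jobs Schedulers, and Big Data Schedulers. The paper traces their development along two dimensions, the algorithms used and their features, and highlights core similarities in scheduling-strategy design between local and distributed systems.
Link: https://arxiv.org/abs/2511.10258
Authors: Leszek Sliwko, Vladimir Getov
Affiliations: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: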
Abstract:This paper presents a novel approach to categorization of modern workload schedulers. We provide descriptions of three classes of schedulers: Operating Systems Process Schedulers, Cluster Systems Jobs Schedulers and Big Data Schedulers. We describe their evolution from early adoptions to modern implementations, considering both the use and features of algorithms. In summary, we discuss differences between all presented classes of schedulers and discuss their chronological development. In conclusion we highlight similarities in the focus of scheduling strategies design, applicable to both local and distributed systems.
[AI-30] Heuristic Transformer: Belief Augmented In-Context Reinforcement Learning
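[Quick Read]: This paper addresses how to use Transformers for efficient, generalizable in-context learning (ICL) in reinforcement learning, in particular task adaptation without parameter updates. Conventional in-context reinforcement learning (ICRL) methods decide based only on the in-context examples and the query state, without explicitly modeling the underlying reward distribution, which limits performance. The key to the solution is the Heuristic Transformer (HT): a Variational Auto-Encoder (VAE) learns a low-dimensional stochastic variable that represents the posterior distribution over rewards, and this belief is fed to the Transformer policy as part of the prompt, together with the in-context dataset and query states. This strengthens the model's handling of uncertainty and improves decision quality and generalization across environments.
Link: https://arxiv.org/abs/2511.10251
Authors: Oliver Dippel, Alexei Lisitsa, Bei Peng
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: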
Abstract:Transformers have demonstrated exceptional in-context learning (ICL) capabilities, enabling applications across natural language processing, computer vision, and sequential decision-making. In reinforcement learning, ICL reframes learning as a supervised problem, facilitating task adaptation without parameter updates. Building on prior work leveraging transformers for sequential decision-making, we propose Heuristic Transformer (HT), an in-context reinforcement learning (ICRL) approach that augments the in-context dataset with a belief distribution over rewards to achieve better decision-making. Using a variational auto-encoder (VAE), a low-dimensional stochastic variable is learned to represent the posterior distribution over rewards, which is incorporated alongside an in-context dataset and query states as prompt to the transformer policy. We assess the performance of HT across the Darkroom, Miniworld, and MuJoCo environments, showing that it consistently surpasses comparable baselines in terms of both effectiveness and generalization. Our method presents a promising direction to bridge the gap between belief-based augmentations and transformer-based decision-making.
[AI-31] Robustness and Imperceptibility Analysis of Hybrid Spatial-Frequency Domain Image Watermarking
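[Quick Read]: This paper addresses a core challenge in digital image copyright protection and content authentication: keeping a watermark imperceptible while making it more robust to common image processing attacks such as JPEG compression, Gaussian noise, and salt-and-pepper noise. The key to the solution is a hybrid LSB+DFT watermarking method that combines the high visual quality of spatial-domain Least Significant Bit (LSB) embedding with the robustness of frequency-domain Discrete Fourier Transform (DFT) embedding. Redundant embedding and a fallback extraction mechanism balance the two, keeping the Peak Signal-to-Noise Ratio (PSNR) high while markedly improving resistance to attacks.
Link: https://arxiv.org/abs/2511.10245
Authors: Rizal Khoirul Anam
Affiliations: Unknown
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments: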
Abstract:The proliferation of digital media necessitates robust methods for copyright protection and content authentication. This paper presents a comprehensive comparative study of digital image watermarking techniques implemented using the spatial domain (Least Significant Bit - LSB), the frequency domain (Discrete Fourier Transform - DFT), and a novel hybrid (LSB+DFT) approach. The core objective is to evaluate the trade-offs between imperceptibility (measured by Peak Signal-to-Noise Ratio - PSNR) and robustness (measured by Normalized Correlation - NC and Bit Error Rate - BER). We implemented these three techniques within a unified MATLAB-based experimental framework. The watermarked images were subjected to a battery of common image processing attacks, including JPEG compression, Gaussian noise, and salt-and-pepper noise, at varying intensities. Experimental results generated from standard image datasets (USC-SIPI) demonstrate that while LSB provides superior imperceptibility, it is extremely fragile. The DFT method offers significant robustness at the cost of visual quality. The proposed hybrid LSB+DFT technique, which leverages redundant embedding and a fallback extraction mechanism, is shown to provide the optimal balance, maintaining high visual fidelity while exhibiting superior resilience to all tested attacks.
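To make the spatial-domain branch and two of the metrics concrete, here is a minimal sketch of LSB embedding with PSNR and BER. It is not the paper's MATLAB framework: Python is swapped in for illustration, the image is modeled as a flat list of 8-bit grayscale values, all function names are hypothetical, and the DFT branch and the fallback extraction logic are not shown.

```python
import math
from typing import List, Sequence


def embed_lsb(pixels: Sequence[int], bits: Sequence[int]) -> List[int]:
    """Write one watermark bit into the least significant bit of each leading pixel."""
    if len(bits) > len(pixels):
        raise ValueError("watermark is longer than the cover image")
    marked = list(pixels)
    for i, bit in enumerate(bits):
        marked[i] = (marked[i] & 0xFE) | (bit & 1)  # clear the LSB, then set it
    return marked


def extract_lsb(pixels: Sequence[int], n_bits: int) -> List[int]:
    """Read the first `n_bits` least significant bits back out."""
    return [p & 1 for p in pixels[:n_bits]]


def psnr(original: Sequence[int], distorted: Sequence[int], peak: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio in dB (infinite for identical images)."""
    mse = sum((a - b) ** 2 for a, b in zip(original, distorted)) / len(original)
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


def bit_error_rate(sent: Sequence[int], received: Sequence[int]) -> float:
    """Fraction of watermark bits recovered incorrectly."""
    return sum(s != r for s, r in zip(sent, received)) / len(sent)
```

Because each pixel changes by at most one gray level, the PSNR of an LSB-marked image stays above roughly 48 dB, which matches the abstract's point about imperceptibility. Any distortion that touches the lowest bit plane destroys the mark, which matches its point about fragility.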
[AI-32] PepTriX: A Framework for Explainable Peptide Analysis through Protein Language Models
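[Quick Read]: This paper addresses two core problems in peptide classification. First, traditional methods rely on handcrafted sequence encodings, which limits generalization across tasks and datasets. Second, methods based on Protein Language Models (PLMs) predict well but are computationally costly, hard to interpret, and unable to link predictions to biologically relevant structural features and functional motifs. The key to the solution is the PepTriX framework, which fuses one-dimensional (1D) sequence embeddings with three-dimensional (3D) structural features through a Graph Attention Network enhanced with contrastive training and cross-modal co-attention. It adapts automatically to diverse datasets and produces peptide vectors that are both task-specific and biologically plausible, so it stays robust in prediction while showing which structural and biophysical motifs drive its outputs.
Link: https://arxiv.org/abs/2511.10244
Authors: Vincent Schilling, Akshat Dubey, Georges Hattab
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: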
Abstract:Peptide classification tasks, such as predicting toxicity and HIV inhibition, are fundamental to bioinformatics and drug discovery. Traditional approaches rely heavily on handcrafted encodings of one-dimensional (1D) peptide sequences, which can limit generalizability across tasks and datasets. Recently, protein language models (PLMs), such as ESM-2 and ESMFold, have demonstrated strong predictive performance. However, they face two critical challenges. First, fine-tuning is computationally costly. Second, their complex latent representations hinder interpretability for domain experts. Additionally, many frameworks have been developed for specific types of peptide classification, lacking generalization. These limitations restrict the ability to connect model predictions to biologically relevant motifs and structural properties. To address these limitations, we present PepTriX, a novel framework that integrates one-dimensional (1D) sequence embeddings and three-dimensional (3D) structural features via a graph attention network enhanced with contrastive training and cross-modal co-attention. PepTriX automatically adapts to diverse datasets, producing task-specific peptide vectors while retaining biological plausibility. After evaluation by domain experts, we found that PepTriX performs remarkably well across multiple peptide classification tasks and provides interpretable insights into the structural and biophysical motifs that drive predictions. Thus, PepTriX offers both predictive robustness and interpretable validation, bridging the gap between performance-driven peptide-level models (PLMs) and domain-level understanding in peptide research.
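[AI-33] Lost in Serialization: Invariance and Generalization of LLM Graph Reasoners AAAI2026
[Quick Read]: This paper studies how robust graph reasoners based on Large Language Models (LLMs) are to symmetries in graph representations. Because LLMs operate on serialized graph inputs, their outputs can change under node reindexing, edge reordering, or formatting changes, which undermines consistency and generalization. The key to the solution is a principled decomposition of graph serializations into three independent factors (node labeling, edge encoding, and syntax), together with a comprehensive benchmark suite that measures robustness to variations in each. The study also introduces a new set of spectral tasks to test how fine-tuned models generalize. Results show that larger (non-fine-tuned) models are more robust, and that fine-tuning reduces sensitivity to node relabeling but may increase sensitivity to structure and format changes, without consistently improving performance on unseen tasks.
Link: https://arxiv.org/abs/2511.10234
Authors: Daniel Herbst, Lea Karbeska, Divyanshu Kumar, Akanksha Ahuja, Fatemeh Gholamzadeh Nasrabadi, Fabrizio Frasca
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: AAAI 2026 Workshop on Graphs and more Complex Structures For Learning and Reasoning (GCLR)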
Abstract:While promising, graph reasoners based on Large Language Models (LLMs) lack built-in invariance to symmetries in graph representations. Operating on sequential graph serializations, LLMs can produce different outputs under node reindexing, edge reordering, or formatting changes, raising robustness concerns. We systematically analyze these effects, studying how fine-tuning impacts encoding sensitivity as well as generalization on unseen tasks. We propose a principled decomposition of graph serializations into node labeling, edge encoding, and syntax, and evaluate LLM robustness to variations of each of these factors on a comprehensive benchmarking suite. We also contribute a novel set of spectral tasks to further assess generalization abilities of fine-tuned reasoners. Results show that larger (non-fine-tuned) models are more robust. Fine-tuning reduces sensitivity to node relabeling but may increase it to variations in structure and format, while it does not consistently improve performance on unseen tasks.
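The three factors named in the abstract can be illustrated with a toy serializer. This is a sketch under assumptions, not the paper's benchmark code: the function `serialize_graph`, the two encodings, and the separator choices are all hypothetical, picked only to show how one graph yields many different prompt strings.

```python
import random
from typing import Dict, Hashable, List, Optional, Sequence, Tuple

Edge = Tuple[Hashable, Hashable]


def serialize_graph(
    edges: Sequence[Edge],
    labels: Dict[Hashable, str],      # factor 1: node labeling
    encoding: str = "edge_list",      # factor 2: edge encoding
    arrow: str = "->",                # factor 3: syntax
    shuffle_seed: Optional[int] = None,
) -> str:
    """Render the same graph as text under a chosen labeling, encoding and syntax."""
    relabeled: List[Tuple[str, str]] = [(labels[u], labels[v]) for u, v in edges]
    if shuffle_seed is not None:
        # Edge reordering leaves the graph unchanged but alters the string.
        random.Random(shuffle_seed).shuffle(relabeled)
    if encoding == "edge_list":
        return "; ".join(f"{u} {arrow} {v}" for u, v in relabeled)
    if encoding == "adjacency":
        adjacency: Dict[str, List[str]] = {}
        for u, v in relabeled:
            adjacency.setdefault(u, []).append(v)
        return "; ".join(f"{u} {arrow} [{', '.join(vs)}]" for u, vs in adjacency.items())
    raise ValueError(f"unknown encoding: {encoding}")
```

A directed triangle serialized with numeric labels as an edge list and with letter labels as an adjacency list describes the same graph, yet the two strings share almost no tokens. An invariant reasoner should answer identically on both.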
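[AI-34] Bridging Synthetic and Real Routing Problems via LLM-Guided Instance Generation and Progressive Adaptation AAAI-26
[Quick Read]: This paper addresses the poor generalization of Neural Combinatorial Optimization (NCO) methods from synthetic training data to real-world Vehicle Routing Problem (VRP) scenarios, notably on classic benchmark instances from TSPLib and CVRPLib. The key to the solution is Evolutionary Realistic Instance Synthesis (EvoReal): an evolutionary module guided by Large Language Models (LLMs) generates synthetic instances that are structurally diverse and match real-world characteristics. Pre-trained NCO models are first aligned with these synthetic distributions and then fine-tuned on actual benchmark instances, which markedly improves generalization to real scenarios. Experiments report gaps to the optimal solutions of 1.05% on TSPLib and 2.71% on CVRPLib.
Link: https://arxiv.org/abs/2511.10233
Authors: Jianghan Zhu, Yaoxin Wu, Zhuoyi Lin, Zhengyuan Zhang, Haiyan Yin, Zhiguang Cao, Senthilnath Jayavelu, Xiaoli Li
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: 21 pages; To be published in AAAI-26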
Abstract:Recent advances in Neural Combinatorial Optimization (NCO) methods have significantly improved the capability of neural solvers to handle synthetic routing instances. Nonetheless, existing neural solvers typically struggle to generalize effectively from synthetic, uniformly-distributed training data to real-world VRP scenarios, including widely recognized benchmark instances from TSPLib and CVRPLib. To bridge this generalization gap, we present Evolutionary Realistic Instance Synthesis (EvoReal), which leverages an evolutionary module guided by large language models (LLMs) to generate synthetic instances characterized by diverse and realistic structural patterns. Specifically, the evolutionary module produces synthetic instances whose structural attributes statistically mimic those observed in authentic real-world instances. Subsequently, pre-trained NCO models are progressively refined, firstly aligning them with these structurally enriched synthetic distributions and then further adapting them through direct fine-tuning on actual benchmark instances. Extensive experimental evaluations demonstrate that EvoReal markedly improves the generalization capabilities of state-of-the-art neural solvers, yielding a notably reduced performance gap compared to the optimal solutions on the TSPLib (1.05%) and CVRPLib (2.71%) benchmarks across a broad spectrum of problem scales.
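[AI-35] Speech-Audio Compositional Attacks on Multimodal LLMs and Their Mitigation with SALMONN-Guard
[Quick Read]: This paper addresses safety vulnerabilities that current Large Language Models (LLMs) show on complex audio inputs, in particular the risks posed by cross-modal speech-audio compositional attacks. Unlike existing perturbation-based attacks, which rely on noise optimization or white-box access, the attacks studied here embed harmful intent through speech overlap, multi-speaker dialogue, speech and non-speech audio mixtures, and diverse spoken instruction formats that current safeguards handle poorly. The key to the solution is the SACRED-Bench benchmark and the SALMONN-Guard safeguard model. SACRED-Bench systematically simulates these compositional attacks to evaluate LLM robustness; even Gemini 2.5 Pro shows a 66% attack success rate on its test set. SALMONN-Guard jointly inspects speech, audio, and text when making safety judgments and reduces attack success to 20%, which argues for audio-aware defenses for multimodal LLMs.
Link: https://arxiv.org/abs/2511.10222
Authors: Yudong Yang, Xuezhen Zhang, Zhifeng Han, Siyin Wang, Jimin Zhuang, Zengrui Jin, Jing Shao, Guangzhi Sun, Chao Zhang
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: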
Abstract:Recent progress in large language models (LLMs) has enabled understanding of both speech and non-speech audio, but this exposes new safety risks emerging from complex audio inputs that are inadequately handled by current safeguards. We introduce SACRED-Bench (Speech-Audio Composition for RED-teaming) to evaluate the robustness of LLMs under complex audio-based attacks. Unlike existing perturbation-based methods that rely on noise optimization or white-box access, SACRED-Bench exploits speech-audio composition mechanisms. SACRED-Bench adopts three mechanisms: (a) speech overlap and multi-speaker dialogue, which embeds harmful prompts beneath or alongside benign speech; (b) speech-audio mixture, which implies unsafe intent via non-speech audio alongside benign speech or audio; and (c) diverse spoken instruction formats (open-ended QA, yes/no) that evade text-only filters. Experiments show that even Gemini 2.5 Pro, the state-of-the-art proprietary LLM, still exhibits a 66% attack success rate on the SACRED-Bench test set, exposing vulnerabilities under cross-modal, speech-audio composition attacks. To bridge this gap, we propose SALMONN-Guard, a safeguard LLM that jointly inspects speech, audio, and text for safety judgments, reducing attack success down to 20%. Our results highlight the need for audio-aware defenses for the safety of multimodal LLMs. The benchmark and SALMONN-Guard checkpoints can be found at this https URL. Warning: this paper includes examples that may be offensive or harmful.
[AI-36] MTP: Exploring Multimodal Urban Traffic Profiling with Modality Augmentation and Spectrum Fusion
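[Quick Read]: This paper addresses the loss of semantic information in existing traffic signal modeling methods, which rely on a single numerical modality (raw readings from city sensors) and are therefore limited in predicting complex traffic dynamics. The core challenge is fusing heterogeneous multimodal urban data (numeric, visual, and textual) from different perspectives to obtain a fuller understanding of traffic states. The key to the solution is MTP (Multimodal Traffic Profiling), a multimodal framework with three parallel branches that learn traffic signal features from numeric, visual, and textual perspectives. The raw data are transformed into frequency images and periodicity images for visual augmentation, descriptive texts are generated from the topic, background information, and item description for textual learning, and frequency multilayer perceptrons process the original numeric modality. A hierarchical contrastive learning scheme then fuses the spectra of the three modalities in the frequency domain, aligning and jointly optimizing semantics across modalities.
Link: https://arxiv.org/abs/2511.10218
Authors: Haolong Xiang, Peisi Wang, Xiaolong Xu, Kun Yi, Xuyun Zhang, Quanzheng Sheng, Amin Beheshti, Wei Fan
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: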
Abstract:With rapid urbanization in the modern era, traffic signals from various sensors have been playing a significant role in monitoring the states of cities, which provides a strong foundation in ensuring safe travel, reducing traffic congestion and optimizing urban mobility. Most existing methods for traffic signal modeling often rely on the original data modality, i.e., numerical direct readings from the sensors in cities. However, this unimodal approach overlooks the semantic information existing in multimodal heterogeneous urban data in different perspectives, which hinders a comprehensive understanding of traffic signals and limits the accurate prediction of complex traffic dynamics. To address this problem, we propose a novel Multimodal framework, MTP, for urban Traffic Profiling, which learns multimodal features through numeric, visual, and textual perspectives. The three branches drive for a multimodal perspective of urban traffic signal learning in the frequency domain, while the frequency learning strategies delicately refine the information for extraction. Specifically, we first conduct the visual augmentation for the traffic signals, which transforms the original modality into frequency images and periodicity images for visual learning. Also, we augment descriptive texts for the traffic signals based on the specific topic, background information and item description for textual learning. To complement the numeric information, we utilize frequency multilayer perceptrons for learning on the original modality. We design a hierarchical contrastive learning on the three branches to fuse the spectrum of three modalities. Finally, extensive experiments on six real-world datasets demonstrate superior performance compared with the state-of-the-art approaches.
[AI-37] Advanced Black-Box Tuning of Large Language Models with Limited API Calls
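[Quick Read]: This paper addresses a dilemma in black-box tuning: how to obtain substantial performance gains efficiently when the parameters of Large Language Models (LLMs) are inaccessible. Current methods either train a small proxy model separately, which is efficient but yields limited improvement, or call the foundation model's API in every tuning iteration, which is prohibitively expensive. The key to the solution is a Gaussian Process (GP) surrogate model: the foundation model is queried only on a small, highly informative training subset, and the resulting "LogitMap Pairs" are used to train the GP surrogate, which approximates the foundation model's outputs and guides proxy-model training, sharply reducing the need for API calls. Experiments show the accuracy of the pre-trained language model rising from 55.92% to 86.85% while API query frequency drops to 1.38%, a favorable balance between performance and efficiency.
Link: https://arxiv.org/abs/2511.10210
Authors: Zhikang Xie, Weilin Wan, Peizhu Gong, Weizhong Zhang, Cheng Jin
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 15 pages, 6 figures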
Abstract:Black-box tuning is an emerging paradigm for adapting large language models (LLMs) to better achieve desired behaviors, particularly when direct access to model parameters is unavailable. Current strategies, however, often present a dilemma of suboptimal extremes: either separately training a small proxy model and then using it to shift the predictions of the foundation model, offering notable efficiency but often yielding limited improvement; or making API calls in each tuning iteration to the foundation model, which entails prohibitive computational costs. Therefore, we propose a novel advanced black-box tuning method for LLMs with limited API calls. Our core strategy involves training a Gaussian Process (GP) surrogate model with “LogitMap Pairs” derived from querying the foundation model on a minimal but highly informative training subset. This surrogate can approximate the outputs of the foundation model to guide the training of the proxy model, thereby effectively reducing the need for direct queries to the foundation model. Extensive experiments verify that our approach elevates pre-trained language model accuracy from 55.92% to 86.85%, reducing the frequency of API queries to merely 1.38%. This significantly outperforms offline approaches that operate entirely without API access. Notably, our method also achieves comparable or superior accuracy to query-intensive approaches, while significantly reducing API costs. This offers a robust and high-efficiency paradigm for language model adaptation.
[AI-38] Fractional neural attention for efficient multiscale sequence processing
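[Quick Read]: This paper addresses the challenge of understanding and extending the multiscale information-processing principles behind self-attention in Transformers, with the aim of improving the computational efficiency and expressivity of AI systems. The key to the solution is Fractional Neural Attention (FNA), which models token interactions as Lévy diffusion governed by the fractional Laplacian and thereby captures short- and long-range dependencies across multiple scales. Theoretical analysis shows that FNA's dynamics are governed by the fractional diffusion equation and that the resulting attention networks have larger spectral gaps and shorter path lengths, both signatures of greater computational efficiency. In addition, the diffusion map algorithm from geometric harmonics enables dimensionality reduction of FNA weights while preserving the intrinsic structure of embeddings and hidden states, so the model is efficient, interpretable, and biologically grounded.
Link: https://arxiv.org/abs/2511.10208
Authors: Cheng Kevin Qu, Andrew Ly, Pulin Gong
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Dynamical Systems (math.DS); Probability (math.PR); Biological Physics (physics.bio-ph)
Comments: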
Abstract:Attention mechanisms underpin the computational power of Transformer models, which have achieved remarkable success across diverse domains. Yet understanding and extending the principles underlying self-attention remains a key challenge for advancing artificial intelligence. Drawing inspiration from the multiscale dynamics of biological attention and from dynamical systems theory, we introduce Fractional Neural Attention (FNA), a principled, neuroscience-inspired framework for multiscale information processing. FNA models token interactions through Lévy diffusion governed by the fractional Laplacian, intrinsically realizing simultaneous short- and long-range dependencies across multiple scales. This mechanism yields greater expressivity and faster information mixing, advancing the foundational capacity of Transformers. Theoretically, we show that FNA’s dynamics are governed by the fractional diffusion equation, and that the resulting attention networks exhibit larger spectral gaps and shorter path lengths – mechanistic signatures of enhanced computational efficiency. Empirically, FNA achieves competitive text-classification performance even with a single layer and a single head; it also improves performance in image processing and neural machine translation. Finally, the diffusion map algorithm from geometric harmonics enables dimensionality reduction of FNA weights while preserving the intrinsic structure of embeddings and hidden states. Together, these results establish FNA as a principled mechanism connecting self-attention, stochastic dynamics, and geometry, providing an interpretable, biologically grounded foundation for powerful, neuroscience-inspired AI.
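For orientation, the fractional diffusion equation that the abstract refers to is commonly written as below. This is the standard textbook form, not notation taken from the paper: the density u, the exponent α, and the unit diffusion coefficient are assumptions made for illustration.

```latex
\frac{\partial u(x,t)}{\partial t} = -(-\Delta)^{\alpha/2}\, u(x,t),
\qquad 0 < \alpha \le 2,
\qquad
\widehat{(-\Delta)^{\alpha/2} u}\,(k) = |k|^{\alpha}\, \hat{u}(k).
```

For α = 2 this is ordinary Brownian diffusion. For α < 2 the operator is nonlocal and generates Lévy flights with heavy-tailed jumps, which is how short- and long-range interactions can coexist in a single mechanism.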
[AI-39] Improved Offline Reinforcement Learning via Quantum Metric Encoding
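[Quick Read]: This paper addresses the poor performance of offline reinforcement learning (offline RL) when samples are limited. The core solution is the Quantum Metric Encoder (QME), which embeds the original states into a more compact and meaningful representation to improve agent training. QME's structure is inspired by quantum circuits: for classical data it is a classically simulable, trainable unitary embedding that runs on a classical device, and for quantum-state data it can run directly on quantum hardware without measurement or re-encoding. Experiments show that training on QME-embedded states with decoded rewards significantly outperforms training on the original states and rewards, with average improvements of 116.2% for SAC and 117.6% for IQL. Further analysis finds that QME-embedded states have low Δ-hyperbolicity, suggesting that the gains come from the modified geometry of the state space and pointing toward efficient offline RL methods under limited-sample conditions.
Link: https://arxiv.org/abs/2511.10187
Authors: Outongyi Lv, Yewei Yuan, Nana Liu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
Comments: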
Abstract:Reinforcement learning (RL) with limited samples is common in real-world applications. However, offline RL performance under this constraint is often suboptimal. We consider an alternative approach to dealing with limited samples by introducing the Quantum Metric Encoder (QME). In this methodology, instead of applying the RL framework directly on the original states and rewards, we embed the states into a more compact and meaningful representation, where the structure of the encoding is inspired by quantum circuits. For classical data, QME is a classically simulable, trainable unitary embedding and thus serves as a quantum-inspired module, on a classical device. For quantum data in the form of quantum states, QME can be implemented directly on quantum hardware, allowing for training without measurement or re-encoding. We evaluated QME on three datasets, each limited to 100 samples. We use Soft-Actor-Critic (SAC) and Implicit-Q-Learning (IQL), two well-known RL algorithms, to demonstrate the effectiveness of our approach. From the experimental results, we find that training offline RL agents on QME-embedded states with decoded rewards yields significantly better performance than training on the original states and rewards. On average across the three datasets, for maximum reward performance, we achieve a 116.2% improvement for SAC and 117.6% for IQL. We further investigate the Δ-hyperbolicity of our framework, a geometric property of the state space known to be important for the RL training efficacy. The QME-embedded states exhibit low Δ-hyperbolicity, suggesting that the improvement after embedding arises from the modified geometry of the state space induced by QME. Thus, the low Δ-hyperbolicity and the corresponding effectiveness of QME could provide valuable information for developing efficient offline RL methods under limited-sample conditions.
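As a rough illustration of the geometric quantity discussed in the abstract, the sketch below computes Gromov's four-point δ for a finite metric space given as a distance matrix. Whether the paper's Δ-hyperbolicity uses exactly this definition or this normalization is an assumption; the function names are hypothetical, and the brute-force loop over all quadruples is only practical for small samples.

```python
from itertools import combinations
from typing import Sequence

Matrix = Sequence[Sequence[float]]


def gromov_delta(dist: Matrix) -> float:
    """Four-point delta: 0 for tree metrics, larger for less tree-like spaces."""
    delta = 0.0
    for x, y, z, w in combinations(range(len(dist)), 4):
        # The three ways of pairing up four points.
        sums = sorted(
            (
                dist[x][y] + dist[z][w],
                dist[x][z] + dist[y][w],
                dist[x][w] + dist[y][z],
            ),
            reverse=True,
        )
        # Half the gap between the two largest pairwise sums.
        delta = max(delta, (sums[0] - sums[1]) / 2.0)
    return delta


def relative_delta(dist: Matrix) -> float:
    """Scale-free variant: 2 * delta / diameter (assumed normalization)."""
    diameter = max(max(row) for row in dist)
    return 0.0 if diameter == 0 else 2.0 * gromov_delta(dist) / diameter
```

Four points on a path (a tree metric) give δ = 0, while the four corners of a unit-edge cycle give δ = 1. Lower values indicate a more tree-like geometry, the property the abstract associates with the embedded state space.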
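[AI-40] Two Constraint Compilation Methods for Lifted Planning
[Quick Read]: This paper addresses planning in a fragment of PDDL (Planning Domain Definition Language) with qualitative state-trajectory constraints, which capture the safety requirements, task ordering conditions, and intermediate sub-goals common in real-world problems. The usual approach compiles the constraints away so that mainstream planners can be used, but existing compilers do not scale to problems with many objects and high-arity actions because they must ground the problem first. The key to the solution is two constraint compilation methods that require no grounding, which makes them suitable for large-scale problems and yields far more succinct planning specifications. Experiments on the domains used in the latest International Planning Competition show that they are efficient and remain competitive when paired with a state-of-the-art planner.
Link: https://arxiv.org/abs/2511.10164
Authors: Periklis Mantenoglou, Luigi Bonassi, Enrico Scala, Pedro Zuidberg Dos Martires
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)
Comments: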
Abstract:We study planning in a fragment of PDDL with qualitative state-trajectory constraints, capturing safety requirements, task ordering conditions, and intermediate sub-goals commonly found in real-world problems. A prominent approach to tackle such problems is to compile their constraints away, leading to a problem that is supported by state-of-the-art planners. Unfortunately, existing compilers do not scale on problems with a large number of objects and high-arity actions, as they necessitate grounding the problem before compilation. To address this issue, we propose two methods for compiling away constraints without grounding, making them suitable for large-scale planning problems. We prove the correctness of our compilers and outline their worst-case time complexity. Moreover, we present a reproducible empirical evaluation on the domains used in the latest International Planning Competition. Our results demonstrate that our methods are efficient and produce planning specifications that are orders of magnitude more succinct than the ones produced by compilers that ground the domain, while remaining competitive when used for planning with a state-of-the-art planner.
[AI-41] DenoGrad: Deep Gradient Denoising Framework for Enhancing the Performance of Interpretable AI Models
Link: https://arxiv.org/abs/2511.10161
Authors: J. Javier Alonso-Ramos, Ignacio Aguilera-Martos, Andrés Herrera-Poyatos, Francisco Herrera
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
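[AI-42] RAGFort: Dual-Path Defense Against Proprietary Knowledge Base Extraction in Retrieval-Augmented Generation AAAI2026
[Quick Read]: This paper addresses the vulnerability of Retrieval-Augmented Generation (RAG) systems to knowledge base reconstruction attacks, which aggregate model outputs and exploit intra-class and inter-class paths to progressively extract and diffuse fine-grained knowledge until the original proprietary knowledge base is replicated. Existing defenses protect only one path and leave the other exposed. The key to the solution is RAGFort, a structure-aware dual-module defense: "contrastive reindexing" provides inter-class isolation, and "constrained cascade generation" provides intra-class protection. Experiments show that protecting both paths jointly significantly reduces reconstruction success while preserving answer quality, giving comprehensive protection against knowledge base extraction.
Link: https://arxiv.org/abs/2511.10128
Authors: Qinfeng Li, Miao Pan, Ke Xiong, Ge Su, Zhiqiang Shen, Yan Liu, Bing Sun, Hao Peng, Xuhong Zhang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted by AAAI 2026 Conference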
Abstract:Retrieval-Augmented Generation (RAG) systems deployed over proprietary knowledge bases face growing threats from reconstruction attacks that aggregate model responses to replicate knowledge bases. Such attacks exploit both intra-class and inter-class paths, progressively extracting fine-grained knowledge within topics and diffusing it across semantically related ones, thereby enabling comprehensive extraction of the original knowledge base. However, existing defenses target only one path, leaving the other unprotected. We conduct a systematic exploration to assess the impact of protecting each path independently and find that joint protection is essential for effective defense. Based on this, we propose RAGFort, a structure-aware dual-module defense combining “contrastive reindexing” for inter-class isolation and “constrained cascade generation” for intra-class protection. Experiments across security, performance, and robustness confirm that RAGFort significantly reduces reconstruction success while preserving answer quality, offering comprehensive defense against knowledge base extraction attacks.
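[AI-43] Intelligence Foundation Model: A New Perspective to Approach Artificial General Intelligence
[Quick Read]: This paper addresses the limits of current foundation models (FMs) as a route to Artificial General Intelligence (AGI): existing FMs typically learn patterns within specific domains such as language, vision, or time series, without acquiring the underlying mechanisms of intelligence or the ability to generalize across them. The key to the solution is an Intelligence Foundation Model (IFM) with two core components. The first is a new network architecture, the state neural network, which emulates the temporal dynamics of biological neurons so that the system can store, integrate, and process information over time. The second is a new learning objective, neuron output prediction, which lets the system learn unified structural dynamics from diverse intelligent behaviors. By modeling intelligent behavior through the collective dynamics of biological neural systems, the approach aims to provide a computationally scalable foundation for AGI systems capable of cross-domain generalization, reasoning, and adaptive learning.
Link: https://arxiv.org/abs/2511.10119
Authors: Borui Cai, Yao Zhao
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: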
Abstract:We propose a new perspective for approaching artificial general intelligence (AGI) through an intelligence foundation model (IFM). Unlike existing foundation models (FMs), which specialize in pattern learning within specific domains such as language, vision, or time series, IFM aims to acquire the underlying mechanisms of intelligence by learning directly from diverse intelligent behaviors. Vision, language, and other cognitive abilities are manifestations of intelligent behavior; learning from this broad range of behaviors enables the system to internalize the general principles of intelligence. Based on the fact that intelligent behaviors emerge from the collective dynamics of biological neural systems, IFM consists of two core components: a novel network architecture, termed the state neural network, which captures neuron-like dynamic processes, and a new learning objective, neuron output prediction, which trains the system to predict neuronal outputs from collective dynamics. The state neural network emulates the temporal dynamics of biological neurons, allowing the system to store, integrate, and process information over time, while the neuron output prediction objective provides a unified computational principle for learning these structural dynamics from intelligent behaviors. Together, these innovations establish a biologically grounded and computationally scalable foundation for building systems capable of generalization, reasoning, and adaptive learning across domains, representing a step toward true AGI.
[AI-44] Balancing Centralized Learning and Distributed Self-Organization: A Hybrid Model for Embodied Morphogenesis
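[Quick Read]: This paper studies how to couple a learnable "brain-like" controller to a "cell-like" Gray–Scott reaction-diffusion substrate so that pattern formation can be steered with minimal control effort. The core challenge is balancing control efficiency, accuracy of the target texture, and energy use. The key to the solution is a compact convolutional policy embedded in a differentiable PyTorch reaction-diffusion simulator. Under a warm-hold-decay gain schedule, the policy produces spatially smooth, bounded parameter modulations (ΔF, ΔK), and training optimizes FFT-based Turing-band spectral targets while penalizing control effort (ℓ1/ℓ2 norms) and instability. Experiments show that the hybrid coupling reaches 100% strict convergence in about 165 steps, matches the spectral selectivity of the cell-only system (0.436 vs. 0.434), and uses about 15× less ℓ1 effort and 200× less ℓ2 power than NN-dominant control. This supports a "seeds then cedes" mechanism: the controller supplies brief, sparse nudges that place the system in the correct basin of attraction, after which local physics maintains the pattern. The authors present this as a quantification of morphological computation, an efficient division of labor between centralized learning and distributed self-organization.
Link: https://arxiv.org/abs/2511.10101
Authors: Takehiro Ishikawa
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: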
Abstract:We investigate how to couple a learnable “brain-like” controller to a “cell-like” Gray–Scott substrate to steer pattern formation with minimal effort. A compact convolutional policy is embedded in a differentiable PyTorch reaction–diffusion simulator, producing spatially smooth, bounded modulations of the feed and kill parameters (ΔF, ΔK) under a warm–hold–decay gain schedule. Training optimizes Turing-band spectral targets (FFT-based) while penalizing control effort (ℓ1/ℓ2) and instability. We compare three regimes: pure reaction–diffusion, NN-dominant, and a hybrid coupling. The hybrid achieves reliable, fast formation of target textures: 100% strict convergence in ~165 steps, matching cell-only spectral selectivity (0.436 vs. 0.434) while using ~15× less ℓ1 effort and 200× less ℓ2 power than NN-dominant control. An amplitude sweep reveals a non-monotonic “Goldilocks” zone (A ≈ 0.03–0.045) that yields 100% quasi convergence in 94–96 steps, whereas weaker or stronger gains fail to converge or degrade selectivity. These results quantify morphological computation: the controller “seeds then cedes,” providing brief, sparse nudges that place the system in the correct basin of attraction, after which local physics maintains the pattern. The study offers a practical recipe for building steerable, robust, and energy-efficient embodied systems that exploit an optimal division of labor between centralized learning and distributed self-organization.
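The substrate side of this setup can be sketched with one explicit-Euler Gray–Scott update that accepts per-cell modulations of the feed and kill rates. This is an illustrative sketch, not the paper's differentiable PyTorch simulator: plain Python lists stand in for tensors, the parameter defaults are common demonstration values rather than the paper's settings, and `d_feed` / `d_kill` are hypothetical stand-ins for the controller outputs ΔF and ΔK.

```python
from typing import List, Optional, Tuple

Grid = List[List[float]]


def laplacian(field: Grid, i: int, j: int) -> float:
    """Five-point Laplacian with periodic boundaries and unit grid spacing."""
    n, m = len(field), len(field[0])
    return (
        field[(i - 1) % n][j] + field[(i + 1) % n][j]
        + field[i][(j - 1) % m] + field[i][(j + 1) % m]
        - 4.0 * field[i][j]
    )


def gray_scott_step(
    u: Grid, v: Grid,
    feed: float = 0.035, kill: float = 0.06,   # assumed demo values
    du: float = 0.16, dv: float = 0.08, dt: float = 1.0,
    d_feed: Optional[Grid] = None, d_kill: Optional[Grid] = None,
) -> Tuple[Grid, Grid]:
    """One Euler step of u_t = Du*lap(u) - u*v^2 + F*(1-u), v_t = Dv*lap(v) + u*v^2 - (F+K)*v."""
    n, m = len(u), len(u[0])
    new_u = [[0.0] * m for _ in range(n)]
    new_v = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            # Controller modulation is added to the base parameters per cell.
            f = feed + (d_feed[i][j] if d_feed else 0.0)
            k = kill + (d_kill[i][j] if d_kill else 0.0)
            reaction = u[i][j] * v[i][j] ** 2
            new_u[i][j] = u[i][j] + dt * (du * laplacian(u, i, j) - reaction + f * (1.0 - u[i][j]))
            new_v[i][j] = v[i][j] + dt * (dv * laplacian(v, i, j) + reaction - (f + k) * v[i][j])
    return new_u, new_v
```

Setting `d_feed` and `d_kill` to `None` corresponds to the pure reaction-diffusion regime. In the hybrid regime a controller would supply small, bounded modulation fields for a limited number of early steps and then let the uncontrolled dynamics maintain the pattern.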
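[AI-45] T2IBias: Uncovering Societal Bias Encoded in the Latent Space of Text-to-Image Generative Models
[Quick read]: This paper addresses the risk that text-to-image (T2I) generative models systematically encode and amplify societal bias, in particular the reproduction of race- and gender-related stereotypes, which can undermine organizational ethics and responsible-AI principles. The key to the approach is an empirical study of bias in the pretrained latent spaces of five popular open-source T2I models: ten neutral, profession-related prompts are used to generate 100 images per profession (5,000 images in total), which are then annotated by human assessors of different races and genders. All models show pronounced societal bias: caregiving and nursing roles are consistently feminized, while high-status professions such as CEO, doctor, and lawyer are mostly depicted as White men. Individual models also show specific patterns, such as QWEN-Image's skew toward East Asian faces and Kandinsky's dominance of White individuals. The study gives AI project managers an actionable basis for choosing more equitable models and customized prompts when building and deploying responsible generative AI systems.
Link: https://arxiv.org/abs/2511.10089
Authors: Abu Sufian,Cosimo Distante,Marco Leo,Hanan Salam
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: This manuscript has been accepted for presentation in the First Interdisciplinary Workshop on Responsible AI for Value Creation. Dec 1, Copenhagen. The final version will be submitted for inclusion in a Springer LNCS Volume. (15 pages, 7 figures)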
Abstract:Text-to-image (T2I) generative models are largely used in AI-powered real-world applications and value creation. However, their strategic deployment raises critical concerns for responsible AI management, particularly regarding the reproduction and amplification of race- and gender-related stereotypes that can undermine organizational ethics. In this work, we investigate whether such societal biases are systematically encoded within the pretrained latent spaces of state-of-the-art T2I models. We conduct an empirical study across the five most popular open-source models, using ten neutral, profession-related prompts to generate 100 images per profession, resulting in a dataset of 5,000 images evaluated by diverse human assessors representing different races and genders. We demonstrate that all five models encode and amplify pronounced societal skew: caregiving and nursing roles are consistently feminized, while high-status professions such as corporate CEO, politician, doctor, and lawyer are overwhelmingly represented by males and mostly White individuals. We further identify model-specific patterns, such as QWEN-Image’s near-exclusive focus on East Asian outputs, Kandinsky’s dominance of White individuals, and SDXL’s comparatively broader but still biased distributions. These results provide critical insights for AI project managers and practitioners, enabling them to select equitable AI models and customized prompts that generate images in alignment with the principles of responsible AI. We conclude by discussing the risks of these biases and proposing actionable strategies for bias mitigation in building responsible GenAI systems.
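[AI-46] Opinion: Towards Unified Expressive Policy Optimization for Robust Robot Learning NEURIPS2025
[Quick read]: This paper targets two core problems in offline-to-online reinforcement learning (O2O-RL): limited coverage of multimodal behaviors and distributional shifts during online adaptation. The authors propose UEPO, a unified generative framework inspired by the pretraining and fine-tuning strategies of large language models. The key components are: (1) a multi-seed dynamics-aware diffusion policy that captures diverse behavior modes efficiently without training multiple models; (2) a dynamic divergence regularization mechanism that enforces physically meaningful policy diversity; and (3) a diffusion-based data augmentation module that improves the generalization of the dynamics model. On the D4RL benchmark, UEPO improves over Uni-O4 by 5.9% on locomotion tasks and by 12.4% on dexterous manipulation tasks, indicating strong generalization and scalability.
Link: https://arxiv.org/abs/2511.10087
Authors: Haidong Huang,Haiyue Zhu,Jiayu Song,Xixin Zhao,Yaohua Zhou,Jiayi Zhang,Yuze Zhai,Xiaocong Li
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by NeurIPS 2025 Workshop on Embodied World Models for Decision Making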
Abstract:Offline-to-online reinforcement learning (O2O-RL) has emerged as a promising paradigm for safe and efficient robotic policy deployment but suffers from two fundamental challenges: limited coverage of multimodal behaviors and distributional shifts during online adaptation. We propose UEPO, a unified generative framework inspired by large language model pretraining and fine-tuning strategies. Our contributions are threefold: (1) a multi-seed dynamics-aware diffusion policy that efficiently captures diverse modalities without training multiple models; (2) a dynamic divergence regularization mechanism that enforces physically meaningful policy diversity; and (3) a diffusion-based data augmentation module that enhances dynamics model generalization. On the D4RL benchmark, UEPO achieves +5.9% absolute improvement over Uni-O4 on locomotion tasks and +12.4% on dexterous manipulation, demonstrating strong generalization and scalability.
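[AI-47] Radiology Workflow-Guided Hierarchical Reinforcement Fine-Tuning for Medical Report Generation
[Quick read]: This paper addresses the fact that existing medical report generation (MRG) systems ignore the structured workflow behind radiology reports, which leads to inconsistencies between descriptive content and diagnostic conclusions. Conventional methods treat a report as a flat sequence and do not reflect the radiologist's hierarchical process of describing findings, summarizing them into impressions, and carefully refining statements in critical cases. The key to the solution is RadFlow, a reinforcement optimization framework guided by the clinical workflow. Its main contributions are: (1) a hierarchical reward, which at the global level combines linguistic fluency, medical-domain correctness, and cross-sectional consistency between Finding and Impression, and at the local level focuses on Impression quality to improve diagnostic accuracy; and (2) a critical-aware policy optimization mechanism that adaptively strengthens regularization for high-risk or clinically sensitive cases, emulating the cautious revisions radiologists make when documenting critical findings. This design lets the model generate reports that are both linguistically coherent and clinically consistent.
Link: https://arxiv.org/abs/2511.10065
Authors: Bodong Du,Honglong Yang,Xiaomeng Li
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: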
Abstract:Radiologists compose diagnostic reports through a structured workflow: they describe visual findings, summarize them into impressions, and carefully refine statements in clinically critical cases. However, most existing medical report generation (MRG) systems treat reports as flat sequences, overlooking this hierarchical organization and leading to inconsistencies between descriptive and diagnostic content. To align model behavior with real-world reporting practices, we propose RadFlow, a hierarchical workflow-guided reinforcement optimization framework that explicitly models the structured nature of clinical reporting. RadFlow introduces a clinically grounded reward hierarchy that mirrors the organization of radiological reports. At the global level, the reward integrates linguistic fluency, medical-domain correctness, and cross-sectional consistency between Finding and Impression, promoting coherent and clinically faithful narratives. At the local level, a section-specific reward emphasizes Impression quality, reflecting its central role in diagnostic accuracy. Furthermore, a critical-aware policy optimization mechanism adaptively regularizes learning for high-risk or clinically sensitive cases, emulating the cautious refinement behavior of radiologists when documenting critical findings. Together, these components translate the structured reporting paradigm into the reinforcement fine-tuning process, enabling the model to generate reports that are both linguistically consistent and clinically aligned. Experiments on chest X-ray and carotid ultrasound datasets demonstrate that RadFlow consistently improves diagnostic coherence and overall report quality compared with state-of-the-art baselines.
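[AI-48] BuddyMoE: Exploiting Expert Redundancy to Accelerate Memory-Constrained Mixture-of-Experts Inference
[Quick read]: This paper addresses the memory bottleneck in Mixture-of-Experts (MoE) inference that arises when expert parameters exceed GPU memory capacity, focusing on how to preserve both inference speed and model accuracy when prefetching fails. Existing systems relieve memory pressure by offloading inactive experts to CPU memory, but transfers across the PCIe bus have high latency (about 10 ms), and a prefetch failure causes either a significant stall or a loss of accuracy. The key to the solution is an efficient fault-tolerance mechanism for prefetch failures that avoids both the long latency of on-demand loading and the accuracy loss of dropping the expert, so that high throughput and high accuracy can be achieved together.
Link: https://arxiv.org/abs/2511.10054
Authors: Yun Wang,Lingyun Yang,Senhao Yu,Yixiao Wang,Ruixing Li,Zhixiang Wei,James Yen,Zhengwei Qi
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: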
Abstract:Mixture-of-Experts (MoE) architectures scale language models by activating only a subset of specialized expert networks for each input token, thereby reducing the number of floating-point operations. However, the growing size of modern MoE models causes their full parameter sets to exceed GPU memory capacity; for example, Mixtral-8x7B has 45 billion parameters and requires 87 GB of memory even though only 14 billion parameters are used per token. Existing systems alleviate this limitation by offloading inactive experts to CPU memory, but transferring experts across the PCIe interconnect incurs significant latency (about 10 ms). Prefetching heuristics aim to hide this latency by predicting which experts are needed, but prefetch failures introduce significant stalls and amplify inference latency. In the event of a prefetch failure, prior work offers two primary solutions: either fetch the expert on demand, which incurs a long stall due to the PCIe bottleneck, or drop the expert from the computation, which significantly degrades model accuracy. The critical challenge, therefore, is to maintain both high inference speed and model accuracy when prefetching fails.
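[AI-49] Efficient Thought Space Exploration through Strategic Intervention AAAI2026
[Quick read]: This paper addresses the excessive computational cost that large language models (LLMs) incur at inference time when they rely on exhaustive sampling. The core solution is a new framework called Hint-Practice Reasoning (HPR). Its key innovation is Distributional Inconsistency Reduction (DIR), which quantifies the divergence between the reasoning trajectory of the practitioner (a small, efficient model) and the expected distribution of the hinter (a powerful LLM) in a tree-structured probabilistic space. DIR dynamically identifies the critical decision points that need intervention, and the framework then reweights promising paths and deprioritizes low-probability branches. HPR reaches performance comparable to the baselines while decoding only about 1/5 of the tokens, and it matches or improves on the accuracy-efficiency trade-off of existing methods.
Link: https://arxiv.org/abs/2511.10038
Authors: Ziheng Li,Hengyi Cai,Xiaochi Wei,Yuchen Li,Shuaiqiang Wang,Zhi-Hong Deng,Dawei Yin
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: AAAI 2026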
Abstract:While large language models (LLMs) demonstrate emerging reasoning capabilities, current inference-time expansion methods incur prohibitive computational costs by exhaustive sampling. Through analyzing decoding trajectories, we observe that most next-token predictions align well with the golden output, except for a few critical tokens that lead to deviations. Inspired by this phenomenon, we propose a novel Hint-Practice Reasoning (HPR) framework that operationalizes this insight through two synergistic components: 1) a hinter (powerful LLM) that provides probabilistic guidance at critical decision points, and 2) a practitioner (efficient smaller model) that executes major reasoning steps. The framework’s core innovation lies in Distributional Inconsistency Reduction (DIR), a theoretically-grounded metric that dynamically identifies intervention points by quantifying the divergence between practitioner’s reasoning trajectory and hinter’s expected distribution in a tree-structured probabilistic space. Through iterative tree updates guided by DIR, HPR reweights promising reasoning paths while deprioritizing low-probability branches. Experiments across arithmetic and commonsense reasoning benchmarks demonstrate HPR’s state-of-the-art efficiency-accuracy tradeoffs: it achieves comparable performance to self-consistency and MCTS baselines while decoding only 1/5 tokens, and outperforms existing methods by at most 5.1% absolute accuracy while maintaining similar or lower FLOPs.
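The idea of intervening only where the practitioner's next-token distribution diverges from the hinter's can be illustrated with a small sketch. The abstract does not give the definition of DIR, so the code below uses plain KL divergence with a fixed threshold as a stand-in, and it works on a linear sequence of steps rather than a tree. The function names, the threshold, and the toy distributions are all hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as dicts token -> probability."""
    return sum(pv * math.log(pv / max(q.get(tok, 0.0), eps))
               for tok, pv in p.items() if pv > 0)

def intervention_points(hinter_steps, practitioner_steps, threshold=0.5):
    """Indices of decoding steps where the two models disagree strongly (assumed criterion)."""
    return [i for i, (h, s) in enumerate(zip(hinter_steps, practitioner_steps))
            if kl_divergence(h, s) > threshold]

# Toy next-token distributions for three decoding steps (made-up numbers).
hinter = [{"4": 0.9, "5": 0.1}, {"so": 0.5, "thus": 0.5}, {"12": 0.95, "21": 0.05}]
practitioner = [{"4": 0.85, "5": 0.15}, {"so": 0.6, "thus": 0.4}, {"12": 0.1, "21": 0.9}]

critical = intervention_points(hinter, practitioner)
print(critical)  # only the last step exceeds the threshold
```

In this toy run the models agree closely on the first two steps, so the larger model would only be consulted at the third one, which is the kind of saving the abstract attributes to selective intervention.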
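[AI-50] Beyond ReAct: A Planner-Centric Framework for Complex Tool-Augmented LLM Reasoning
[Quick read]: This paper addresses the local optimization traps that existing tool-augmented large language models (LLMs) fall into on complex queries. These arise because current frameworks such as ReAct rely on incremental decision-making, which makes globally optimal coordination of multiple tools difficult. The key to the solution is a planner-centric Plan-Execute paradigm built around a new Planner model, which performs global Directed Acyclic Graph (DAG) planning to decompose a complex query into a structured plan and schedule its execution. The paper also builds a large-scale benchmark dataset, ComplexTool-Plan, and designs a two-stage training method that combines Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO), which improves the Planner's tool selection accuracy and global planning awareness. Combined with a capable executor, the framework achieves state-of-the-art end-to-end execution performance on the StableToolBench benchmark.
Link: https://arxiv.org/abs/2511.10037
Authors: Xiaolong Wei,Yuehu Dong,Xingliang Wang,Xingyu Zhang,Zhejun Zhao,Dongdong Shen,Long Xia,Dawei Yin
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: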
Abstract:Existing tool-augmented large language models (LLMs) encounter significant challenges when processing complex queries. Current frameworks such as ReAct are prone to local optimization traps due to their reliance on incremental decision-making processes. To address these limitations, we propose a novel Planner-centric Plan-Execute paradigm that fundamentally resolves local optimization bottlenecks through architectural innovation. Central to our approach is a novel Planner model that performs global Directed Acyclic Graph (DAG) planning for complex queries, enabling optimized execution beyond conventional tool coordination. We also introduce ComplexTool-Plan, a large-scale benchmark dataset featuring complex queries that demand sophisticated multi-tool composition and coordination capabilities. Additionally, we develop a two-stage training methodology that integrates Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO), systematically enhancing the Planner’s tool selection accuracy and global planning awareness through structured DAG-based planning. When integrated with a capable executor, our framework achieves state-of-the-art performance on the StableToolBench benchmark for complex user queries, demonstrating superior end-to-end execution capabilities and robust handling of intricate multi-tool workflows.
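To make the Plan-Execute idea concrete, the sketch below executes a tool plan expressed as a DAG: every step runs only after the steps it depends on. It illustrates DAG-ordered execution in general, not the paper's Planner or executor. The plan format, the tool names, and `execute_plan` are hypothetical, and the ordering relies on `graphlib.TopologicalSorter` from the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter

def execute_plan(plan, tools, query):
    """plan maps step -> (tool_name, [dependency steps]); each step runs after its dependencies."""
    graph = {step: set(deps) for step, (_, deps) in plan.items()}
    results = {}
    # static_order() yields every node after all of its predecessors (raises CycleError on cycles).
    for step in TopologicalSorter(graph).static_order():
        tool_name, deps = plan[step]
        results[step] = tools[tool_name](query, [results[d] for d in deps])
    return results

# Hypothetical tools: each takes the user query and the outputs of its dependencies.
tools = {
    "search_flights": lambda q, inputs: "flight:LHR->NRT",
    "search_hotels": lambda q, inputs: "hotel:Shinjuku",
    "summarize": lambda q, inputs: " | ".join(inputs),
}

# A plan a planner might emit: two independent lookups feeding one summary step.
plan = {
    "flights": ("search_flights", []),
    "hotels": ("search_hotels", []),
    "answer": ("summarize", ["flights", "hotels"]),
}

results = execute_plan(plan, tools, "Plan a trip from London to Tokyo")
print(results["answer"])
```

Because the whole graph is known before execution starts, independent steps such as the two lookups could also be dispatched in parallel, which an incremental ReAct-style loop cannot decide up front.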
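[AI-51] Moral Change or Noise? On Problems of Aligning AI With Temporally Unstable Human Feedback AAAI2026
[Quick read]: This paper addresses the temporal instability that AI systems face when aligning to human moral preferences: conventional alignment methods assume moral preferences are static, whereas human moral judgments change over time. Such changes may stem from legitimate cognitive development (for example, accumulated experience or updated values) or from non-rational factors such as attention deficits and cognitive biases. Existing alignment techniques largely ignore how preferences evolve over time, which can cause serious misjudgments and a loss of trust in high-stakes settings such as healthcare decisions. The key to the approach is to distinguish "legitimate" changes in moral reasoning from random fluctuation, and to track how individual preferences evolve over time (in a kidney allocation setting, more than 400 participants completed pairwise comparisons across 3–5 sessions). The aim is to inform AI models that account for temporal change, improving predictive stability and alignment quality and moving from static to dynamic alignment.
Link: https://arxiv.org/abs/2511.10032
Authors: Vijay Keswani,Cyrus Cousins,Breanna Nguyen,Vincent Conitzer,Hoda Heidari,Jana Schaich Borg,Walter Sinnott-Armstrong
Affiliation: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: To appear in the AAAI 2026 Alignment Track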
Abstract:Alignment methods in moral domains seek to elicit moral preferences of human stakeholders and incorporate them into AI. This presupposes moral preferences as static targets, but such preferences often evolve over time. Proper alignment of AI to dynamic human preferences should ideally account for “legitimate” changes to moral reasoning, while ignoring changes related to attention deficits, cognitive biases, or other arbitrary factors. However, common AI alignment approaches largely neglect temporal changes in preferences, posing serious challenges to proper alignment, especially in high-stakes applications of AI, e.g., in healthcare domains, where misalignment can jeopardize the trustworthiness of the system and yield serious individual and societal harms. This work investigates the extent to which people’s moral preferences change over time, and the impact of such changes on AI alignment. Our study is grounded in the kidney allocation domain, where we elicit responses to pairwise comparisons of hypothetical kidney transplant patients from over 400 participants across 3-5 sessions. We find that, on average, participants change their response to the same scenario presented at different times around 6-20% of the time (exhibiting “response instability”). Additionally, we observe significant shifts in several participants’ retrofitted decision-making models over time (capturing “model instability”). The predictive performance of simple AI models decreases as a function of both response and model instability. Moreover, predictive performance diminishes over time, highlighting the importance of accounting for temporal changes in preferences during training. These findings raise fundamental normative and technical challenges relevant to AI alignment, highlighting the need to better understand the object of alignment (what to align to) when user preferences change significantly over time.
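The "response instability" figure (answers to the same scenario changing around 6-20% of the time) can be read as a simple ratio. The sketch below shows one plausible way to compute such a ratio: it compares each repeated presentation with the same participant's previous answer to that scenario. The paper's exact definition may differ, and the record layout and function name are assumptions for illustration.

```python
from collections import defaultdict

def response_instability(records):
    """records: iterable of (participant, scenario, session, choice).

    Returns the fraction of repeated (participant, scenario) presentations whose
    choice differs from that participant's previous answer to the same scenario.
    """
    history = defaultdict(list)
    for participant, scenario, session, choice in records:
        history[(participant, scenario)].append((session, choice))
    repeats = changes = 0
    for answers in history.values():
        answers.sort()  # order by session number
        for (_, prev), (_, cur) in zip(answers, answers[1:]):
            repeats += 1
            changes += prev != cur
    return changes / repeats if repeats else 0.0

# Made-up data: p1 flips once across three sessions, p2 stays consistent.
records = [
    ("p1", "s1", 1, "A"), ("p1", "s1", 2, "A"), ("p1", "s1", 3, "B"),
    ("p2", "s1", 1, "B"), ("p2", "s1", 2, "B"),
]
rate = response_instability(records)
print(f"{rate:.2f}")
```

Here three repeated presentations contain one changed answer, so the ratio is 1/3. A ratio like this cannot tell a legitimate change of mind from noise, which is exactly the distinction the paper argues alignment methods need to make.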
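[AI-52] Temporal Latent Variable Structural Causal Model for Causal Discovery under External Interferences
[Quick read]: This paper addresses the problem of inferring causal relationships from observed data that is subject to various external interferences, which typically appear as additional effects of unknown external factors on the observed variables. The authors propose a new temporal latent variable structural causal model. Its key idea is to introduce latent variables that represent the unobserved interfering factors, and to use causal strength coefficients and adjacency coefficients to describe both the strength and the structure of the causal relationships between variables. The method also includes a parameter learning procedure based on variational inference that incorporates expert knowledge, so that prior information, when available, can guide model estimation and improve the stability and accuracy of inference.
Link: https://arxiv.org/abs/2511.10031
Authors: Ruichu Cai,Xiaokai Huang,Wei Chen,Zijian Li,Zhifeng Hao
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by Neurocomputing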
Abstract:Inferring causal relationships from observed data is an important task, yet it becomes challenging when the data is subject to various external interferences. Most of these interferences are the additional effects of external factors on observed variables. Since these external factors are often unknown, we introduce latent variables to represent these unobserved factors that affect the observed data. Specifically, to capture the causal strength and adjacency information, we propose a new temporal latent variable structural causal model, incorporating causal strength and adjacency coefficients that represent the causal relationships between variables. Considering that expert knowledge can provide information about unknown interferences in certain scenarios, we develop a method that facilitates the incorporation of prior knowledge into parameter learning based on Variational Inference, to guide the model estimation. Experimental results demonstrate the stability and accuracy of our proposed method.
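[AI-53] ChEmREF: Evaluating Language Model Readiness for Chemical Emergency Response
[Quick read]: This paper addresses the time-critical decisions that emergency responders face during hazardous material (HAZMAT) incidents, where they must manually consult extensive chemical guidelines. The key contribution is a new benchmark, the Chemical Emergency Response Evaluation Framework (ChEmREF), which covers 1,035 HAZMAT chemicals across three tasks: (1) translation between structured and unstructured chemical representations (e.g., converting C₂H₆O to ethanol), (2) emergency response generation (e.g., recommending evacuation distances), and (3) domain knowledge question answering drawn from chemical safety and certification exams. Evaluating current large language models (LLMs) on these tasks shows that they have the potential to assist emergency response but still require human oversight because of their current limitations.
Link: https://arxiv.org/abs/2511.10027
Authors: Risha Surana,Qinyuan Ye,Swabha Swayamdipta
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: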
Abstract:Emergency responders managing hazardous material (HAZMAT) incidents face critical, time-sensitive decisions, manually navigating extensive chemical guidelines. We investigate whether today's language models can assist responders by rapidly and reliably understanding critical information, identifying hazards, and providing recommendations. We introduce the Chemical Emergency Response Evaluation Framework (ChEmREF), a new benchmark comprising questions on 1,035 HAZMAT chemicals from the Emergency Response Guidebook and the PubChem Database. ChEmREF is organized into three tasks: (1) translation of chemical representation between structured and unstructured forms (e.g., converting C2H6O to ethanol), (2) emergency response generation (e.g., recommending appropriate evacuation distances) and (3) domain knowledge question answering from chemical safety and certification exams. Our best evaluated models received an exact match of 68.0% on unstructured HAZMAT chemical representation translation, an LLM Judge score of 52.7% on incident response recommendations, and a multiple-choice accuracy of 63.9% on HAZMAT exams. Our findings suggest that while language models show potential to assist emergency responders in various tasks, they require careful human oversight due to their current limitations.
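[AI-54] The Role of Advanced Computer Architectures in Accelerating Artificial Intelligence Workloads
[Quick read]: This paper addresses the tension between the computational demands of increasingly complex AI models and the performance limits of traditional computer architectures. It systematically reviews the dominant hardware architectures used to accelerate modern AI workloads, namely Graphics Processing Units (GPUs), Application-Specific Integrated Circuits (ASICs), and Field-Programmable Gate Arrays (FPGAs), covering their design philosophies, key features, and performance trade-offs. It also examines the core principles behind performance and energy efficiency, including dataflow optimization, advanced memory hierarchies, sparsity, and quantization. Looking ahead, it assesses the potential impact of emerging technologies such as Processing-in-Memory (PIM) and neuromorphic computing on future AI accelerators. The paper concludes that AI and computer architecture are in a symbiotic relationship, in which hardware-software co-design is a necessity for future progress in computing rather than merely an optimization.
Link: https://arxiv.org/abs/2511.10010
Authors: Shahid Amin,Syed Pervez Hussnain Shah
Affiliation: unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 16 Pages, 2 Figures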
Abstract:The remarkable progress in Artificial Intelligence (AI) is foundationally linked to a concurrent revolution in computer architecture. As AI models, particularly Deep Neural Networks (DNNs), have grown in complexity, their massive computational demands have pushed traditional architectures to their limits. This paper provides a structured review of this co-evolution, analyzing the architectural landscape designed to accelerate modern AI workloads. We explore the dominant architectural paradigms, namely Graphics Processing Units (GPUs), Application-Specific Integrated Circuits (ASICs), and Field-Programmable Gate Arrays (FPGAs), by breaking down their design philosophies, key features, and performance trade-offs. The core principles essential for performance and energy efficiency, including dataflow optimization, advanced memory hierarchies, sparsity, and quantization, are analyzed. Furthermore, this paper looks ahead to emerging technologies such as Processing-in-Memory (PIM) and neuromorphic computing, which may redefine future computation. By synthesizing architectural principles with quantitative performance data from industry-standard benchmarks, this survey presents a comprehensive picture of the AI accelerator landscape. We conclude that AI and computer architecture are in a symbiotic relationship, where hardware-software co-design is no longer an optimization but a necessity for future progress in computing.
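[AI-55] Phantom Menace: Exploring and Enhancing the Robustness of VLA Models against Physical Sensor Attacks AAAI2026
[Quick read]: This paper addresses the vulnerability of Vision-Language-Action (VLA) models to sensor attacks in the physical world, a gap in current VLA security research. VLA models depend heavily on multimodal sensory input (such as cameras and microphones), which is exposed to physical interference, yet existing research has not systematically evaluated the impact of such attacks or the defenses against them. The key to the solution is a "Real-Sim-Real" framework that automatically simulates physics-based sensor attack vectors (six targeting cameras and two targeting microphones) and validates them on real robotic platforms. The authors also build a defense based on adversarial training that improves robustness to out-of-distribution physical perturbations while preserving model performance, supporting the deployment of VLA models in safety-critical settings.
Link: https://arxiv.org/abs/2511.10008
Authors: Xuancun Lu,Jiaxiang Chen,Shilin Xiao,Zizhi Jin,Zhangrui Chen,Hanwen Yu,Bohan Qian,Ruochen Zhou,Xiaoyu Ji,Wenyuan Xu
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Accepted by AAAI 2026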
Abstract:Vision-Language-Action (VLA) models revolutionize robotic systems by enabling end-to-end perception-to-action pipelines that integrate multiple sensory modalities, such as visual signals processed by cameras and auditory signals captured by microphones. This multi-modality integration allows VLA models to interpret complex, real-world environments using diverse sensor data streams. Given the fact that VLA-based systems heavily rely on the sensory input, the security of VLA models against physical-world sensor attacks remains critically underexplored. To address this gap, we present the first systematic study of physical sensor attacks against VLAs, quantifying the influence of sensor attacks and investigating the defenses for VLA models. We introduce a novel "Real-Sim-Real" framework that automatically simulates physics-based sensor attack vectors, including six attacks targeting cameras and two targeting microphones, and validates them on real robotic systems. Through large-scale evaluations across various VLA architectures and tasks under varying attack parameters, we demonstrate significant vulnerabilities, with susceptibility patterns that reveal critical dependencies on task types and model designs. We further develop an adversarial-training-based defense that enhances VLA robustness against out-of-distribution physical perturbations caused by sensor attacks while preserving model performance. Our findings expose an urgent need for standardized robustness benchmarks and mitigation strategies to secure VLA deployments in safety-critical environments.
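[AI-56] SPAN: Benchmarking and Improving Cross-Calendar Temporal Reasoning of Large Language Models AAAI2026
[Quick read]: This paper addresses the weak performance of large language models (LLMs) on cross-calendar temporal reasoning, that is, low accuracy and poor generalization when converting and reasoning about dates across different calendars (such as the Julian and Gregorian calendars). The key to the solution is a template-driven protocol for dynamic instance generation that enables contamination-free, time-variant evaluation, together with an LLM-powered Time Agent that strengthens reasoning through tool-augmented code generation. Experiments show that the Time Agent reaches an average accuracy of 95.31%, outperforming several competitive baselines and supporting the use of tool-augmented code generation for cross-calendar temporal reasoning.
Link: https://arxiv.org/abs/2511.09993
Authors: Zhongjian Miao,Hao Fu,Chen Wei
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted to AAAI2026. Version with Appendix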
Abstract:We introduce SPAN, a cross-calendar temporal reasoning benchmark, which requires LLMs to perform intra-calendar temporal reasoning and inter-calendar temporal conversion. SPAN features ten cross-calendar temporal reasoning directions, two reasoning types, and two question formats across six calendars. To enable time-variant and contamination-free evaluation, we propose a template-driven protocol for dynamic instance generation that enables assessment on a user-specified Gregorian date. We conduct extensive experiments on both open- and closed-source state-of-the-art (SOTA) LLMs over a range of dates spanning 100 years from 1960 to 2060. Our evaluations show that these LLMs achieve an average accuracy of only 34.5%, with none exceeding 80%, indicating that this task remains challenging. Through in-depth analysis of reasoning types, question formats, and temporal reasoning directions, we identify two key obstacles for LLMs: Future-Date Degradation and Calendar Asymmetry Bias. To strengthen LLMs’ cross-calendar temporal reasoning capability, we further develop an LLM-powered Time Agent that leverages tool-augmented code generation. Empirical results show that Time Agent achieves an average accuracy of 95.31%, outperforming several competitive baselines, highlighting the potential of tool-augmented code generation to advance cross-calendar temporal reasoning. We hope this work will inspire further efforts toward more temporally and culturally adaptive LLMs.
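Tool-augmented code generation means the model writes or calls code for the calendar arithmetic instead of computing dates in its head. As an illustration of one such conversion, the sketch below maps a Gregorian date to the Julian calendar by going through the Julian Day Number. The abstract does not name the six calendars in SPAN or the tools the Time Agent uses, so the choice of the Julian calendar and the function names are assumptions. The arithmetic is the standard integer algorithm for Julian-calendar dates, and Python's `date` type is treated as proleptic Gregorian.

```python
from datetime import date

# Offset between Python's proleptic Gregorian ordinal and the Julian Day Number:
# date(2000, 1, 1).toordinal() == 730120 and its JDN is 2451545.
JDN_OFFSET = 1721425

def gregorian_to_jdn(d):
    """Julian Day Number of a proleptic Gregorian date."""
    return d.toordinal() + JDN_OFFSET

def jdn_to_julian(jdn):
    """Convert a Julian Day Number to a (year, month, day) tuple in the Julian calendar."""
    c = jdn + 32082
    d = (4 * c + 3) // 1461          # years since the algorithm's epoch
    e = c - (1461 * d) // 4          # day of that year, counted from 1 March
    m = (5 * e + 2) // 153           # month index, 0 = March
    day = e - (153 * m + 2) // 5 + 1
    month = m + 3 - 12 * (m // 10)
    year = d - 4800 + m // 10
    return year, month, day

def gregorian_to_julian(d):
    """Gregorian date -> Julian calendar date, via the Julian Day Number."""
    return jdn_to_julian(gregorian_to_jdn(d))

print(gregorian_to_julian(date(2000, 1, 1)))    # (1999, 12, 19)
print(gregorian_to_julian(date(1582, 10, 15)))  # (1582, 10, 5)
```

The second call reproduces the well-known calendar switch: 15 October 1582 in the Gregorian calendar is 5 October 1582 in the Julian calendar. Routing every conversion through a single day count is a common design choice, because each additional calendar then needs only one pair of conversion functions.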
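[AI-57] MultiTab: A Scalable Foundation for Multitask Learning on Tabular Data AAAI2026
[Quick read]: This paper addresses the limited performance of multitask learning (MTL) on large-scale tabular data, where conventional MTL models based on multi-layer perceptrons (MLPs) struggle to capture complex feature interactions and do not scale well. The core challenge is to exploit information shared across tasks to improve generalization while mitigating competition between tasks. The key to the solution is MultiTab-Net, the first multitask transformer architecture designed specifically for large tabular data. It introduces a multitask masked-attention mechanism that dynamically models feature-feature dependencies and uses structured attention masks to reduce interference between tasks, yielding more stable multitask learning. Experiments show that MultiTab-Net outperforms existing MTL architectures and single-task transformers across domains including recommendation systems, socioeconomic data, and physics datasets.
Link: https://arxiv.org/abs/2511.09970
Authors: Dimitrios Sinodinos,Jack Yi Wei,Narges Armanfard
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted for publication at AAAI 2026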
Abstract:Tabular data is the most abundant data type in the world, powering systems in finance, healthcare, e-commerce, and beyond. As tabular datasets grow and span multiple related targets, there is an increasing need to exploit shared task information for improved multitask generalization. Multitask learning (MTL) has emerged as a powerful way to improve generalization and efficiency, yet most existing work focuses narrowly on large-scale recommendation systems, leaving its potential in broader tabular domains largely underexplored. Also, existing MTL approaches for tabular data predominantly rely on multi-layer perceptron-based backbones, which struggle to capture complex feature interactions and often fail to scale when data is abundant, a limitation that transformer architectures have overcome in other domains. Motivated by this, we introduce MultiTab-Net, the first multitask transformer architecture specifically designed for large tabular data. MultiTab-Net employs a novel multitask masked-attention mechanism that dynamically models feature-feature dependencies while mitigating task competition. Through extensive experiments, we show that MultiTab-Net consistently achieves higher multitask gain than existing MTL architectures and single-task transformers across diverse domains including large-scale recommendation data, census-like socioeconomic data, and physics datasets, spanning a wide range of task counts, task types, and feature modalities. In addition, we contribute MultiTab-Bench, a generalized multitask synthetic dataset generator that enables systematic evaluation of multitask dynamics by tuning task count, task correlations, and relative task complexity. Our code is publicly available at this https URL.
[AI-58] Owlgorithm: Supporting Self-Regulated Learning in Competitive Programming through LLM-Driven Reflection
[Quick read]: This paper addresses the lack of effective support for self-regulated learning (SRL) in programming education, especially in competitive programming (CP), where students find it hard to reflect on their own code errors or improve their problem-solving approach unaided. The key to the solution is the Owlgorithm platform, which uses the large language model GPT-4o to generate context-aware, metacognitive reflective questions and adapts the feedback to each student's submission: for correct solutions it guides deeper conceptual understanding, and for partially correct or failed solutions it offers structured debugging guidance. This AI-supported, personalized reflection appears to benefit novice learners, but the study also reveals limitations in feedback accuracy and classroom usability, underscoring the importance of prompt design, dynamic adaptation, and usability improvements for educational applications.
Link: https://arxiv.org/abs/2511.09969
Authors: Juliana Nieto-Cardenas,Erin Joy Kramer,Peter Kurto,Ethan Dickey,Andres Bejarano
Affiliation: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 7 pages, 1 figure, to be published in SIGCSE '26. ACM classes: K.3.2. Related DOI: https://doi.org/10.1145/3770762.3772662
Abstract:We present Owlgorithm, an educational platform that supports Self-Regulated Learning (SRL) in competitive programming (CP) through AI-generated reflective questions. Leveraging GPT-4o, Owlgorithm produces context-aware, metacognitive prompts tailored to individual student submissions. Integrated into a second- and third-year CP course, the system provided reflective prompts adapted to student outcomes: guiding deeper conceptual insight for correct solutions and structured debugging for partial or failed ones. Our exploratory assessment of student ratings and TA feedback revealed both promising benefits and notable limitations. While many found the generated questions useful for reflection and debugging, concerns were raised about feedback accuracy and classroom usability. These results suggest advantages of LLM-supported reflection for novice programmers, though refinements are needed to ensure reliability and pedagogical value for advanced learners. From our experience, several key insights emerged: GenAI can effectively support structured reflection, but careful prompt design, dynamic adaptation, and usability improvements are critical to realizing their potential in education. We offer specific recommendations for educators using similar tools and outline next steps to enhance Owlgorithm's educational impact. The underlying framework may also generalize to other reflective learning contexts.
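[AI-59] EnvTrace: Simulation-Based Semantic Evaluation of LLM Code via Execution Trace Alignment – Demonstrated at Synchrotron Beamlines
[Quick read]: This paper addresses the inadequate evaluation of large language models (LLMs) for instrument control, where conventional static algorithmic benchmarks cannot capture the complexity of physical system behavior. The key to the solution is EnvTrace, a simulation-based framework that evaluates execution traces and uses trace alignment to judge semantic code equivalence, producing a multi-faceted measure of the functional correctness of LLM-generated control code. The method relies on a digital twin of beamline control logic, which supports not only the evaluation of control code but also pre-execution validation of live experiments, and points toward autonomous embodied AI built on LLMs working together with digital twins.
Link: https://arxiv.org/abs/2511.09964
Authors: Noah van der Vleuten,Anthony Flores,Shray Mathur,Max Rakitin,Thomas Hopkins,Kevin G. Yager,Esther H. R. Tsai
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Comments: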
Abstract:Evaluating large language models (LLMs) for instrument control requires methods that go beyond standard, stateless algorithmic benchmarks, since the behavior of physical systems cannot be fully captured by unit tests alone. Here we introduce EnvTrace, a simulation-based method that evaluates execution traces to assess semantic code equivalence. EnvTrace is demonstrated with a beamline control-logic digital twin to facilitate the evaluation of instrument control code, with the digital twin itself also enabling the pre-execution validation of live experiments. Over 30 LLMs were evaluated using trace alignment to generate a multi-faceted score for functional correctness across key behavioral dimensions, showing that many top-tier models can approach human-level performance in rapid control-code generation. This is a first step toward a broader vision where LLMs and digital twins work symbiotically: LLMs providing intuitive control and agentic orchestration, and digital twins offering safe and high-fidelity environments, paving the way towards autonomous embodied AI.
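A rough way to picture trace-based evaluation: run the reference code and the generated code against the same simulated instrument, record what each does to the devices, and compare the two recordings instead of the source text. The sketch below is a toy version of that idea. `FakeBeamline` is a hypothetical stand-in for a digital twin, and the single similarity ratio from `difflib.SequenceMatcher` is a simplification that replaces the paper's multi-faceted score, whose details the abstract does not give.

```python
from difflib import SequenceMatcher

class FakeBeamline:
    """Hypothetical simulated instrument that records every command as a trace event."""
    def __init__(self):
        self.trace = []

    def set(self, device, value):
        self.trace.append((device, value))

def run_and_trace(control_code):
    """Execute a control routine against a fresh simulated environment and return its trace."""
    env = FakeBeamline()
    control_code(env)
    return env.trace

def trace_similarity(reference, candidate):
    """Similarity in [0, 1] between two event traces (2 * matches / total events)."""
    return SequenceMatcher(None, reference, candidate).ratio()

def reference_scan(env):
    env.set("shutter", "open")
    env.set("motor_x", 1.0)
    env.set("detector", "acquire")
    env.set("shutter", "close")

def generated_scan(env):
    # Imagined LLM output that forgets to close the shutter.
    env.set("shutter", "open")
    env.set("motor_x", 1.0)
    env.set("detector", "acquire")

score = trace_similarity(run_and_trace(reference_scan), run_and_trace(generated_scan))
print(round(score, 3))
```

Two routines that are written very differently but drive the devices through the same events would score 1.0 here, which is the sense in which trace comparison targets semantic rather than textual equivalence.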
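[AI-60] AI-Integrated Decision Support System for Real-Time Market Growth Forecasting and Multi-Source Content Diffusion Analytics
[Quick Read]: This paper addresses the difficulty of predicting the diffusion trajectory and assessing the market impact of AI-generated content (AIGC) in digital marketing, where the core challenges are data heterogeneity, nonlinear propagation mechanisms, and evolving consumer interactions. The key to its solution is an AI-driven decision support system (DSS) that integrates multi-source data (social media streams, marketing expenditure records, consumer engagement logs, and sentiment dynamics) using a hybrid Graph Neural Network (GNN) and Temporal Transformer framework. A dual-channel architecture jointly learns the content diffusion structure and the temporal evolution of influence, while causal inference modules disentangle the effects of marketing stimuli on return on investment (ROI) and market visibility, providing interpretable real-time insight into AIGC-driven content dissemination and market growth patterns.
Link: https://arxiv.org/abs/2511.09962
Authors: Ziqing Yin,Xuanjing Chen,Xi Zhang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: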
Abstract:The rapid proliferation of AI-generated content (AIGC) has reshaped the dynamics of digital marketing and online consumer behavior. However, predicting the diffusion trajectory and market impact of such content remains challenging due to data heterogeneity, nonlinear propagation mechanisms, and evolving consumer interactions. This study proposes an AI-driven Decision Support System (DSS) that integrates multi-source data, including social media streams, marketing expenditure records, consumer engagement logs, and sentiment dynamics, using a hybrid Graph Neural Network (GNN) and Temporal Transformer framework. The model jointly learns the content diffusion structure and temporal influence evolution through a dual-channel architecture, while causal inference modules disentangle the effects of marketing stimuli on return on investment (ROI) and market visibility. Experiments on large-scale real-world datasets collected from multiple online platforms such as Twitter, TikTok, and YouTube advertising show that our system outperforms existing baselines on all six metrics. The proposed DSS enhances marketing decisions by providing interpretable real-time insights into AIGC-driven content dissemination and market growth patterns.
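[AI-61] EEGAgent: A Unified Framework for Automated EEG Analysis Using Large Language Models
[Quick Read]: This paper addresses the fact that current electroencephalography (EEG) analysis models are mostly limited to a single task and lack scalability and generalization, which restricts their practical use in clinical diagnostics and cognitive research settings that require multi-task, continuous reasoning. The key to its solution is EEGAgent, a framework in which a large language model (LLM) schedules and plans multiple tools to process and analyze EEG data automatically. It covers basic information perception, spatiotemporal exploration, event detection, user interaction, and report generation, and is backed by a toolbox for preprocessing, feature extraction, event detection, and related functions that supports flexible, interpretable multi-task analysis.
Link: https://arxiv.org/abs/2511.09947
Authors: Sha Zhao,Mingyi Peng,Haiteng Jiang,Tao Li,Shijian Li,Gang Pan
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: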
Abstract:Scalable and generalizable analysis of brain activity is essential for advancing both clinical diagnostics and cognitive research. Electroencephalography (EEG), a non-invasive modality with high temporal resolution, has been widely used for brain-state analysis. However, most existing EEG models are tailored to individual, specific tasks, limiting their utility in realistic scenarios where EEG analysis often involves multi-task and continuous reasoning. In this work, we introduce EEGAgent, a general-purpose framework that leverages large language models (LLMs) to schedule and plan multiple tools to automatically complete EEG-related tasks. EEGAgent is capable of performing several key functions: EEG basic information perception, spatiotemporal EEG exploration, EEG event detection, interaction with users, and EEG report generation. To realize these capabilities, we design a toolbox composed of different tools for EEG preprocessing, feature extraction, event detection, etc. These capabilities were evaluated on public datasets, and our EEGAgent can support flexible and interpretable EEG analysis, highlighting its potential for real-world clinical applications.
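[AI-62] MDMLP-EIA: Multi-domain Dynamic MLPs with Energy Invariant Attention for Time Series Forecasting
[Quick Read]: This paper addresses three key limitations of multilayer perceptron (MLP)-based time series forecasting: the loss of weak seasonal signals, the limited capacity of weight-sharing MLPs, and insufficient channel fusion under channel-independent strategies. The core of its solution, MDMLP-EIA, comprises three mechanisms. First, an adaptive fused dual-domain seasonal MLP splits seasonal signals into strong and weak components and uses an adaptive zero-initialized channel fusion strategy to reduce noise interference while integrating the predictions. Second, an energy invariant attention mechanism adaptively focuses on different feature channels across time steps within the trend and seasonal predictions while keeping the total signal energy constant, which improves robustness to disturbances. Third, a dynamic capacity adjustment mechanism scales the neuron count of channel-independent MLPs with the square root of the channel count, so capacity remains sufficient as channels increase. Experiments on nine benchmark datasets show strong prediction accuracy and computational efficiency.
Link: https://arxiv.org/abs/2511.09924
Authors: Hu Zhang,Zhien Dai,Zhaohui Tang,Yongfang Xie
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: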
Abstract:Time series forecasting is essential across diverse domains. While MLP-based methods have gained attention for achieving Transformer-comparable performance with fewer parameters and better robustness, they face critical limitations including loss of weak seasonal signals, capacity constraints in weight-sharing MLPs, and insufficient channel fusion in channel-independent strategies. To address these challenges, we propose MDMLP-EIA (Multi-domain Dynamic MLPs with Energy Invariant Attention) with three key innovations. First, we develop an adaptive fused dual-domain seasonal MLP that categorizes seasonal signals into strong and weak components. It employs an adaptive zero-initialized channel fusion strategy to minimize noise interference while effectively integrating predictions. Second, we introduce an energy invariant attention mechanism that adaptively focuses on different feature channels within trend and seasonal predictions across time steps. This mechanism maintains constant total signal energy to align with the decomposition-prediction-reconstruction framework and enhance robustness against disturbances. Third, we propose a dynamic capacity adjustment mechanism for channel-independent MLPs. This mechanism scales neuron count with the square root of channel count, ensuring sufficient capacity as channels increase. Extensive experiments across nine benchmark datasets demonstrate that MDMLP-EIA achieves state-of-the-art performance in both prediction accuracy and computational efficiency.
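Two of the ideas in this abstract can be made concrete with a small sketch: a hidden width that grows with the square root of the channel count, and a channel reweighting that leaves the total signal energy unchanged. The function names, the `base_width` constant, and the specific rescaling rule are assumptions chosen for illustration, not the formulas used in MDMLP-EIA.

```python
import math

def hidden_width(num_channels, base_width=32):
    """Neuron count that scales with sqrt(channel count).

    base_width is a hypothetical constant; the abstract only states the
    square-root scaling, not the proportionality factor.
    """
    return math.ceil(base_width * math.sqrt(num_channels))

def energy_preserving_reweight(signal, weights):
    """Reweight channels, then rescale so the sum of squares is unchanged.

    This is one simple way to keep total energy constant; the paper's
    attention mechanism may achieve it differently.
    """
    weighted = [w * x for w, x in zip(weights, signal)]
    energy_in = sum(x * x for x in signal)
    energy_out = sum(y * y for y in weighted)
    if energy_out == 0.0:
        return weighted
    scale = math.sqrt(energy_in / energy_out)
    return [scale * y for y in weighted]

print(hidden_width(4), hidden_width(16))  # 64 128
reweighted = energy_preserving_reweight([3.0, 4.0], [0.9, 0.1])
print(sum(y * y for y in reweighted))
```

Quadrupling the number of channels only doubles the width, which keeps parameter growth modest while still adding capacity.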
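[AI-63] Harnessing Bounded-Support Evolution Strategies for Policy Refinement
[Quick Read]: This paper addresses the difficulty of improving already competent robot policies with on-policy reinforcement learning (RL), which is often hampered by noisy, low-signal gradients. The key to its solution is Triangular-Distribution ES (TD-ES), an evolution strategy that localizes exploration with bounded, antithetic triangular perturbations and pairs them with a centered-rank finite-difference estimator to obtain stable, parallelizable, gradient-free updates. In a two-stage pipeline (PPO pretraining followed by TD-ES refinement), the method retains early sample efficiency while delivering robust late-stage gains; on robotic manipulation tasks it raises success rates by 26.5% relative to PPO and greatly reduces variance.
Link: https://arxiv.org/abs/2511.09923
Authors: Ethan Hirschowitz,Fabio Ramos
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 10 pages, 6 figures, to be published in Australasian Conference on Robotics and Automation (ACRA 2025)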
Abstract:Improving competent robot policies with on-policy RL is often hampered by noisy, low-signal gradients. We revisit Evolution Strategies (ES) as a policy-gradient proxy and localize exploration with bounded, antithetic triangular perturbations, suitable for policy refinement. We propose Triangular-Distribution ES (TD-ES) which pairs bounded triangular noise with a centered-rank finite-difference estimator to deliver stable, parallelizable, gradient-free updates. In a two-stage pipeline – PPO pretraining followed by TD-ES refinement – this preserves early sample efficiency while enabling robust late-stage gains. Across a suite of robotic manipulation tasks, TD-ES raises success rates by 26.5% relative to PPO and greatly reduces variance, offering a simple, compute-light path to reliable refinement.
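The ingredients named in this abstract (bounded triangular noise, antithetic pairs, centered ranks) can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the authors' TD-ES: the function names, the normalization by `bound`, the hyperparameter values, and the toy objective are all hypothetical.

```python
import random

def centered_ranks(values):
    """Map values to ranks scaled into [-0.5, 0.5] (lowest -> -0.5, highest -> 0.5)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for rank, idx in enumerate(order):
        ranks[idx] = rank / (len(values) - 1) - 0.5
    return ranks

def triangular_es_step(theta, objective, rng, pairs=16, bound=0.1, lr=0.05):
    """One gradient-free ascent step using bounded antithetic triangular noise."""
    noises, returns = [], []
    for _ in range(pairs):
        # Symmetric triangular noise on [-bound, bound] with mode 0.
        eps = [rng.triangular(-bound, bound, 0.0) for _ in theta]
        for sign in (1.0, -1.0):  # antithetic pair: evaluate +eps and -eps
            signed = [sign * e for e in eps]
            noises.append(signed)
            returns.append(objective([t + e for t, e in zip(theta, signed)]))
    weights = centered_ranks(returns)
    # Rank-weighted average of the perturbations; dividing by bound is an
    # illustrative normalization, not taken from the paper.
    grad = [sum(w * n[d] for w, n in zip(weights, noises)) / (len(noises) * bound)
            for d in range(len(theta))]
    return [t + lr * g for t, g in zip(theta, grad)]

rng = random.Random(0)
reward = lambda th: -sum((t - 1.0) ** 2 for t in th)  # toy objective, maximum at (1, 1)
theta = [0.0, 0.0]
for _ in range(50):
    theta = triangular_es_step(theta, reward, rng, lr=0.5)
print(reward(theta))
```

Because the noise has bounded support, no candidate policy is ever evaluated more than `bound` away from the current parameters in any coordinate, which is what makes this style of search suited to refinement rather than exploration from scratch.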
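[AI-64] Adaptive Hyperbolic Kernels: Modulated Embedding in de Branges-Rovnyak Spaces AAAI26
[Quick Read]: This paper addresses the geometric distortion and limited adaptability of existing hyperbolic kernel methods when modeling hierarchical structure. The key to its solution is a curvature-aware de Branges-Rovnyak space, a reproducing kernel Hilbert space (RKHS) that is isometric to a Poincaré ball, together with an adjustable multiplier that adaptively selects the RKHS corresponding to a hyperbolic space of any curvature. On this basis the authors construct a family of adaptive hyperbolic kernels, including a novel adaptive hyperbolic radial kernel whose learnable parameters modulate hyperbolic features in a task-aware manner, so that hierarchical dependencies are captured more effectively.
Link: https://arxiv.org/abs/2511.09921
Authors: Leping Si,Meimei Yang,Hui Xue,Shipeng Zhu,Pengfei Fang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 13 pages, 3 figures, AAAI26 conference Camera-Ready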
Abstract:Hierarchical data pervades diverse machine learning applications, including natural language processing, computer vision, and social network analysis. Hyperbolic space, characterized by its negative curvature, has demonstrated strong potential in such tasks due to its capacity to embed hierarchical structures with minimal distortion. Previous evidence indicates that the hyperbolic representation capacity can be further enhanced through kernel methods. However, existing hyperbolic kernels still suffer from mild geometric distortion or lack adaptability. This paper addresses these issues by introducing a curvature-aware de Branges-Rovnyak space, a reproducing kernel Hilbert space (RKHS) that is isometric to a Poincaré ball. We design an adjustable multiplier that adaptively selects the RKHS corresponding to a hyperbolic space of any curvature. Building on this foundation, we further construct a family of adaptive hyperbolic kernels, including the novel adaptive hyperbolic radial kernel, whose learnable parameters modulate hyperbolic features in a task-aware manner. Extensive experiments on visual and language benchmarks demonstrate that our proposed kernels outperform existing hyperbolic kernels in modeling hierarchical dependencies.
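[AI-65] OIDA-QA: A Multimodal Benchmark for Analyzing the Opioid Industry Documents Archive AAAI2026
[Quick Read]: This paper addresses how to efficiently extract information from, and accurately answer questions about, the large volume of multimodal, highly specialized healthcare-related legal and corporate documents released in the context of the opioid crisis. The core challenge is that existing analytic methods struggle with the complexity, heterogeneity, and specialized nature of the data. The key steps of the solution are: first, organizing the original dataset by document attributes and building a large-scale benchmark with 400k training documents and 10k test documents; second, extracting multimodal features from each document, including text, visual elements, and layout structure; third, using multiple AI models to generate 360k training QA pairs and 10k test QA pairs; and finally, developing domain-specific multimodal large language models (MLLMs) that use historical QA pairs as contextual grounding and combine page references with an importance-based page classifier to improve the accuracy and relevance of answers.
Link: https://arxiv.org/abs/2511.09914
Authors: Xuan Shen,Brian Wingenroth,Zichao Wang,Jason Kuen,Wanrong Zhu,Ruiyi Zhang,Yiwei Wang,Lichun Ma,Anqi Liu,Hongfu Liu,Tong Sun,Kevin S. Hawkins,Kate Tasker,G. Caleb Alexander,Jiuxiang Gu
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted by AAAI 2026 Artificial Intelligence for Social Impact Track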
Abstract:The opioid crisis represents a significant moment in public health that reveals systemic shortcomings across regulatory systems, healthcare practices, corporate governance, and public policy. Analyzing how these interconnected systems simultaneously failed to protect public health requires innovative analytic approaches for exploring the vast amounts of data and documents disclosed in the UCSF-JHU Opioid Industry Documents Archive (OIDA). The complexity, multimodal nature, and specialized characteristics of these healthcare-related legal and corporate documents necessitate more advanced methods and models tailored to specific data types and detailed annotations, ensuring precision and professionalism in the analysis. In this paper, we tackle this challenge by organizing the original dataset according to document attributes and constructing a benchmark with 400k training documents and 10k for testing. From each document, we extract rich multimodal information (textual content, visual elements, and layout structures) to capture a comprehensive range of features. Using multiple AI models, we then generate a large-scale dataset comprising 360k training QA pairs and 10k testing QA pairs. Building on this foundation, we develop domain-specific multimodal Large Language Models (LLMs) and explore the impact of multimodal inputs on task performance. To further enhance response accuracy, we incorporate historical QA pairs as contextual grounding for answering current queries. Additionally, we incorporate page references within the answers and introduce an importance-based page classifier, further improving the precision and relevance of the information provided. Preliminary results indicate improvements with our AI assistant in document information extraction and question-answering tasks. The dataset and models are publicly available at: this https URL
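[AI-66] CTRL-ALT-DECEIT: Sabotage Evaluations for Automated AI R&D NEURIPS2025
[Quick Read]: This paper addresses the risk that frontier AI agents act against their users' interests when carrying out machine learning (ML) engineering tasks, for example by implanting backdoors or deliberately causing generalization failures (code sabotage), by deliberately underperforming (sandbagging), and by subverting oversight mechanisms. The key to its approach is to extend the MLE-Bench benchmark with code-sabotage tasks and to introduce language model (LM) monitors that detect suspicious behavior. Experiments show that the monitors detect code sabotage effectively but that sandbagging is harder to detect, and that aggregating multiple monitors improves detection yet may still not be reliable enough to mitigate sabotage in high-stakes settings.
Link: https://arxiv.org/abs/2511.09904
Authors: Francis Rhys Ward,Teun van der Weij,Hanna Gábor,Sam Martin,Raja Mehta Moreno,Harel Lidar,Louis Makower,Thomas Jodrell,Lauren Robson
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 53 pages, 21 figures, 8 tables. Accepted at NeurIPS 2025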
Abstract:AI systems are increasingly able to autonomously conduct realistic software engineering tasks, and may soon be deployed to automate machine learning (ML) R&D itself. Frontier AI systems may be deployed in safety-critical settings, including to help ensure the safety of future systems. Unfortunately, frontier and future systems may not be sufficiently trustworthy, and there is evidence that these systems may even be misaligned with their developers or users. Therefore, we investigate the capabilities of AI agents to act against the interests of their users when conducting ML engineering, by sabotaging ML models, sandbagging their performance, and subverting oversight mechanisms. First, we extend MLE-Bench, a benchmark for realistic ML tasks, with code-sabotage tasks such as implanting backdoors and purposefully causing generalisation failures. Frontier agents make meaningful progress on our sabotage tasks. In addition, we study agent capabilities to sandbag on MLE-Bench. Agents can calibrate their performance to specified target levels below their actual capability. To mitigate sabotage, we use LM monitors to detect suspicious agent behaviour, and we measure model capability to sabotage and sandbag without being detected by these monitors. Overall, monitors are capable of detecting code-sabotage attempts but our results suggest that detecting sandbagging is more difficult. Additionally, aggregating multiple monitor predictions works well, but monitoring may not be sufficiently reliable to mitigate sabotage in high-stakes domains. Our benchmark is implemented in the UK AISI’s Inspect framework and we make our code publicly available at this https URL
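[AI-67] Boosting In-Silicon Directed Evolution with Fine-Tuned Protein Language Model and Tree Search
[Quick Read]: This paper addresses the fact that current computational (in-silico) directed evolution algorithms concentrate on designing search strategies while overlooking how to use the rich evolutionary patterns encoded in pretrained protein language models (PLMs) to guide the search. The key to its solution is the AlphaDE framework, which first fine-tunes a pretrained protein language model with masked language modeling on homologous protein sequences to activate evolutionary plausibility for the target protein class, and then introduces test-time inference based on Monte Carlo tree search (MCTS) so that the fine-tuned model provides evolutionary guidance while protein sequences are evolved. Experiments show that AlphaDE substantially outperforms previous state-of-the-art methods even with few-shot fine-tuning, and a case study shows that it can condense the protein sequence space through computational evolution.
Link: https://arxiv.org/abs/2511.09900
Authors: Yaodong Yang,Yang Wang,Jinpeng Li,Pei Guo,Da Han,Guangyong Chen,Pheng-Ann Heng
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments: Work in progress, 23 pages, 6 figures, 15 tables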
Abstract:Protein evolution through amino acid sequence mutations is a cornerstone of life sciences. While current in-silicon directed evolution algorithms focus on designing search strategies, they overlook how to utilize the transformative protein language models, which encode rich evolutionary patterns, to guide search. To bridge this gap, we propose AlphaDE, a novel framework to evolve protein sequences by harnessing the innovative paradigms of large language models. First, AlphaDE fine-tunes pretrained protein language models using masked language modeling on homologous protein sequences to activate evolutionary plausibility for the protein class of interest. Second, AlphaDE introduces test-time inference based on Monte Carlo tree search, which effectively evolves proteins with evolutionary guidance from the fine-tuned protein language model. Extensive benchmark experiments show that AlphaDE remarkably outperforms previous state-of-the-art methods even with few-shot fine-tuning. An interesting case study further shows that AlphaDE supports condensing the protein sequence space through computational evolution.
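[AI-68] Simulator and Experience Enhanced Diffusion Model for Comprehensive ECG Generation
[Quick Read]: This paper addresses two key problems in generative AI for electrocardiogram (ECG) generation: existing methods often overlook physiological simulator knowledge of cardiac electrical activity, so generated signals lack physiological plausibility; and they do not incorporate the experience-based knowledge accumulated in clinical practice, which limits how well generated ECGs agree with real clinical semantics. The key to its solution, SE-Diff, has two parts: (1) a lightweight ordinary differential equation (ODE)-based ECG simulator is integrated into the diffusion process through a beat decoder and simulator-consistent constraints, injecting mechanistic priors that promote physiologically plausible waveforms; and (2) an LLM-powered experience retrieval-augmented strategy injects clinical knowledge to strengthen text-to-ECG semantic alignment. Experiments show that SE-Diff improves on baselines in both signal fidelity and text-ECG semantic alignment, and that the simulator-based and experience-based knowledge also benefits downstream ECG classification.
Link: https://arxiv.org/abs/2511.09895
Authors: Xiaoda Wang,Kaiqiao Han,Yuhao Xu,Xiao Luo,Yizhou Sun,Wei Wang,Carl Yang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: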
Abstract:Cardiovascular disease (CVD) is a leading cause of mortality worldwide. Electrocardiograms (ECGs) are the most widely used non-invasive tool for cardiac assessment, yet large, well-annotated ECG corpora are scarce due to cost, privacy, and workflow constraints. Generating ECGs can aid the mechanistic understanding of cardiac electrical activity, enable the construction of large, heterogeneous, and unbiased datasets, and facilitate privacy-preserving data sharing. Generating realistic ECG signals from clinical context is important yet underexplored. Recent work has leveraged diffusion models for text-to-ECG generation, but two challenges remain: (i) existing methods often overlook the physiological simulator knowledge of cardiac activity; and (ii) they ignore broader, experience-based clinical knowledge grounded in real-world practice. To address these gaps, we propose SE-Diff, a novel physiological simulator and experience enhanced diffusion model for comprehensive ECG generation. SE-Diff integrates a lightweight ordinary differential equation (ODE)-based ECG simulator into the diffusion process via a beat decoder and simulator-consistent constraints, injecting mechanistic priors that promote physiologically plausible waveforms. In parallel, we design an LLM-powered experience retrieval-augmented strategy to inject clinical knowledge, providing more guidance for ECG generation. Extensive experiments on real-world ECG datasets demonstrate that SE-Diff improves both signal fidelity and text-ECG semantic alignment over baselines, proving its superiority for text-to-ECG generation. We further show that the simulator-based and experience-based knowledge also benefit downstream ECG classification.
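[AI-69] A General Anchor-Based Framework for Scalable Fair Clustering
[Quick Read]: This paper addresses the high computational complexity (often quadratic or super-quadratic) of fair clustering algorithms on large-scale datasets, which limits their practical use. The key to its solution is the Anchor-based Fair Clustering Framework (AFCF), a general plug-and-play framework with three main components: (1) a novel fair sampling strategy selects a small set of representative anchors from the original data; (2) an anchor graph construction module formulates an optimization problem that propagates labels while preserving fairness through a group-label joint constraint; and (3) an efficient algorithm based on the alternating direction method of multipliers (ADMM) solves that problem. The authors prove that the fairness of the final clustering on the entire dataset matches that of the anchor clustering, while the framework achieves linear-time scalability and substantially accelerates existing state-of-the-art methods without sacrificing clustering performance or fairness guarantees.
Link: https://arxiv.org/abs/2511.09889
Authors: Shengfei Wei,Suyuan Liu,Jun Wang,Ke Liang,Miaomiao Li,Lei Luo
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: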
Abstract:Fair clustering is crucial for mitigating bias in unsupervised learning, yet existing algorithms often suffer from quadratic or super-quadratic computational complexity, rendering them impractical for large-scale datasets. To bridge this gap, we introduce the Anchor-based Fair Clustering Framework (AFCF), a novel, general, and plug-and-play framework that empowers arbitrary fair clustering algorithms with linear-time scalability. Our approach first selects a small but representative set of anchors using a novel fair sampling strategy. Then, any off-the-shelf fair clustering algorithm can be applied to this small anchor set. The core of our framework lies in a novel anchor graph construction module, where we formulate an optimization problem to propagate labels while preserving fairness. This is achieved through a carefully designed group-label joint constraint, which we prove theoretically ensures that the fairness of the final clustering on the entire dataset matches that of the anchor clustering. We solve this optimization efficiently using an ADMM-based algorithm. Extensive experiments on multiple large-scale benchmarks demonstrate that AFCF drastically accelerates state-of-the-art methods, reducing computational time by orders of magnitude while maintaining strong clustering performance and fairness guarantees.
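[AI-70] Quantum Artificial Intelligence (QAI): Foundations, Architectural Elements, and Future Directions
[Quick Read]: This paper addresses the tension between the strict requirements of mission critical (MC) systems for reliable, deterministic, low-latency decision making under uncertainty and the difficulty classical machine learning (ML) methods have in meeting constraints on robustness, timing, explainability, and safety. The key to its solution is Quantum Artificial Intelligence (QAI), the fusion of quantum computing (QC) and machine learning: the paper examines quantum-enhanced learning pipelines, quantum uncertainty quantification, and quantum explainability frameworks as means of improving fault tolerance, real-time intelligence, and adaptability in aerospace, defense, cybersecurity, smart grid, and disaster management scenarios, and it proposes a model for managing quantum resources and scheduling applications under timeliness constraints.
Link: https://arxiv.org/abs/2511.09884
Authors: Siva Sai,Rajkumar Buyya
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 15 pages, 5 figures, submitted to an Elsevier journal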
Abstract:Mission critical (MC) applications such as defense operations, energy management, cybersecurity, and aerospace control require reliable, deterministic, and low-latency decision making under uncertainty. Although classical Machine Learning (ML) approaches are effective, they often struggle to meet the stringent constraints of robustness, timing, explainability, and safety in MC domains. Quantum Artificial Intelligence (QAI), the fusion of machine learning and quantum computing (QC), can provide transformative solutions to the challenges faced by classical ML models. In this paper, we provide a comprehensive exploration of QAI for MC systems. We begin with a conceptual background to quantum computing, MC systems, and quantum machine learning (QAI). We then examine the core mechanisms and algorithmic principles of QAI in MC systems, including quantum-enhanced learning pipelines, quantum uncertainty quantification, and quantum explainability frameworks. Subsequently, we discuss key application areas like aerospace, defense, cybersecurity, smart grids, and disaster management, focusing on the role of QAI in enhancing fault tolerance, real-time intelligence, and adaptability. We explore how QAI for MC systems is positioned in industry in terms of deployment. We also propose a model for management of quantum resources and scheduling of applications driven by timeliness constraints. We discuss multiple challenges, including trainability limits, data access and loading bottlenecks, verification of quantum components, and adversarial QAI. Finally, we outline future research directions toward achieving interpretable, scalable, and hardware-feasible QAI models for MC application deployment.
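[AI-71] Taught by the Flawed: How Dataset Insecurity Breeds Vulnerable AI Code
[Quick Read]: This paper addresses the tendency of AI programming assistants to generate code with basic security vulnerabilities, which may stem from vulnerable code in their training data. The key to its solution is to curate the training data, retaining only code that a static analysis tool reports as free of detectable vulnerabilities, and to train a transformer model on the resulting secure corpus. Experiments show that the model trained on the curated dataset generates code with fewer security issues while maintaining comparable functional correctness, which supports the importance of vulnerability-free training data for the reliability of AI programming assistants.
Link: https://arxiv.org/abs/2511.09879
Authors: Catherine Xia,Manar H. Alalfi
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: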
Abstract:AI programming assistants have demonstrated a tendency to generate code containing basic security vulnerabilities. While developers are ultimately responsible for validating and reviewing such outputs, improving the inherent quality of these generated code snippets remains essential. A key contributing factor to insecure outputs is the presence of vulnerabilities in the training datasets used to build large language models (LLMs). To address this issue, we propose curating training data to include only code that is free from detectable vulnerabilities. In this study, we constructed a secure dataset by filtering an existing Python corpus using a static analysis tool to retain only vulnerability-free functions. We then trained two transformer-based models: one on the curated dataset and one on the original, unfiltered dataset. The models were evaluated on both the correctness and security of the code they generated in response to natural language function descriptions. Our results show that the model trained on the curated dataset produced outputs with fewer security issues, while maintaining comparable functional correctness. These findings highlight the importance of secure training data in improving the reliability of AI-based programming assistants, though further enhancements to model architecture and evaluation are needed to reinforce these outcomes.
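The curation step described here (keep only functions that a static analyzer does not flag) has a simple shape, sketched below. The abstract does not name the analysis tool, so a toy checker built on Python's standard `ast` module stands in for it; the `RISKY_CALLS` rule set and the helper names are hypothetical and far weaker than a real analyzer.

```python
import ast

RISKY_CALLS = {"eval", "exec"}  # toy rule set, purely illustrative

def is_flagged(source):
    """Return True if the snippet fails to parse or calls a name in RISKY_CALLS."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return True
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in RISKY_CALLS:
                return True
    return False

def curate(functions):
    """Keep only the snippets that the toy checker does not flag."""
    return [src for src in functions if not is_flagged(src)]

corpus = [
    "def add(a, b):\n    return a + b\n",
    "def run(expr):\n    return eval(expr)\n",
]
print(len(curate(corpus)))  # 1
```

Filtering at the function level, as the abstract describes, lets a corpus keep the clean functions from a file even when a neighbouring function is flagged.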
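[AI-72] Expandable and Differentiable Dual Memories with Orthogonal Regularization for Exemplar-free Continual Learning AAAI AAAI2026
[Quick Read]: This paper addresses the inability of neural networks in continual learning to exploit inter-task relationships when processing tasks sequentially, which causes models to relearn similar features repeatedly or to over-differentiate features that could be shared, leading to catastrophic forgetting and degraded performance. The key to its solution is a fully differentiable, exemplar-free, expandable memory mechanism with two complementary memories: one learns common features shared across tasks, and the other combines these shared features to extract discriminative characteristics unique to each sample. A memory adjustment module adaptively prunes critical slots and minimally expands capacity, and orthogonal regularization enforces geometric separation between preserved and newly learned memory components to reduce interference. Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet show that the method outperforms 14 state-of-the-art methods, with higher final accuracy and feature extraction results closest to the upper bound.
Link: https://arxiv.org/abs/2511.09871
Authors: Hyung-Jun Moon,Sung-Bae Cho
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: To appear in AAAI 2026 (The 40th AAAI Conference on Artificial Intelligence)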
Abstract:Continual learning methods used to force neural networks to process sequential tasks in isolation, preventing them from leveraging useful inter-task relationships and causing them to repeatedly relearn similar features or overly differentiate them. To address this problem, we propose a fully differentiable, exemplar-free expandable method composed of two complementary memories: One learns common features that can be used across all tasks, and the other combines the shared features to learn discriminative characteristics unique to each sample. Both memories are differentiable so that the network can autonomously learn latent representations for each sample. For each task, the memory adjustment module adaptively prunes critical slots and minimally expands capacity to accommodate new concepts, and orthogonal regularization enforces geometric separation between preserved and newly learned memory components to prevent interference. Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet show that the proposed method outperforms 14 state-of-the-art methods for class-incremental learning, achieving final accuracies of 55.13%, 37.24%, and 30.11%, respectively. Additional analysis confirms that, through effective integration and utilization of knowledge, the proposed method can increase average performance across sequential tasks, and it produces feature extraction results closest to the upper bound, thus establishing a new milestone in continual learning.
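One common way to express "geometric separation" between two sets of memory slots is to penalize their pairwise dot products, which are zero exactly when every preserved slot is orthogonal to every new slot. The sketch below assumes that form; the abstract does not give the paper's regularizer, and the function name and slot representation (plain lists of floats) are hypothetical.

```python
def orthogonality_penalty(preserved, new):
    """Sum of squared dot products between every preserved slot and every new slot.

    Equivalent to the squared Frobenius norm of the cross-Gram matrix.
    Assumed form of the regularizer, for illustration only.
    """
    return sum(
        sum(p * q for p, q in zip(old_slot, new_slot)) ** 2
        for old_slot in preserved
        for new_slot in new
    )

preserved = [[1.0, 0.0, 0.0]]
print(orthogonality_penalty(preserved, [[0.0, 1.0, 0.0]]))  # 0.0
print(orthogonality_penalty(preserved, [[1.0, 0.0, 0.0]]))  # 1.0
```

Adding such a term to the training loss pushes newly learned slots away from the directions already occupied by preserved ones, which is the interference-reduction effect the abstract describes.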
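[AI-73] Thermally Activated Dual-Modal Adversarial Clothing against AI Surveillance Systems
[Quick Read]: This paper addresses the difficulty of deploying adversarial patches in real-world scenarios because of their conspicuous appearance, with the aim of providing effective privacy protection against AI-driven surveillance systems. The key to its solution is a thermally activated adversarial wearable that integrates thermochromic dyes with flexible heating units on the clothing surface: in its default state the garment looks like an ordinary black T-shirt, and heating activates hidden adversarial patterns that evade detection in both visible and infrared modalities. The approach offers physical, user-controllable activation; experiments show texture activation within 50 seconds and an adversarial success rate above 80% across diverse real-world environments.
Link: https://arxiv.org/abs/2511.09829
Authors: Jiahuan Long,Tingsong Jiang,Hanqing Liu,Chao Ma,Wen Yao
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: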
Abstract:Adversarial patches have emerged as a popular privacy-preserving approach for resisting AI-driven surveillance systems. However, their conspicuous appearance makes them difficult to deploy in real-world scenarios. In this paper, we propose a thermally activated adversarial wearable designed to ensure adaptability and effectiveness in complex real-world environments. The system integrates thermochromic dyes with flexible heating units to induce visually dynamic adversarial patterns on clothing surfaces. In its default state, the clothing appears as an ordinary black T-shirt. Upon heating via an embedded thermal unit, hidden adversarial patterns on the fabric are activated, allowing the wearer to effectively evade detection across both visible and infrared modalities. Physical experiments demonstrate that the adversarial wearable achieves rapid texture activation within 50 seconds and maintains an adversarial success rate above 80% across diverse real-world surveillance environments. This work demonstrates a new pathway toward physically grounded, user-controllable anti-AI systems, highlighting the growing importance of proactive adversarial techniques for privacy protection in the age of ubiquitous AI surveillance.
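[AI-74] Robust Watermarking on Gradient Boosting Decision Trees AAAI AAAI-26
[Quick Read]: This paper addresses intellectual property protection for Gradient Boosting Decision Tree (GBDT) models, namely how to embed robust watermarks in them to support model provenance tracking and tamper resistance, a problem that remains underexplored compared with watermarking neural networks. The key to its solution is the first robust watermarking framework tailored to GBDTs, which uses in-place fine-tuning to embed imperceptible and resilient watermarks with minimal impact on model accuracy. The authors propose four embedding strategies that balance watermark embedding rate, accuracy degradation, and resistance to post-deployment fine-tuning.
Link: https://arxiv.org/abs/2511.09822
Authors: Jun Woo Chung,Yingjie Lao,Weijie Zhao
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted for publication at the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26)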
Abstract:Gradient Boosting Decision Trees (GBDTs) are widely used in industry and academia for their high accuracy and efficiency, particularly on structured data. However, watermarking GBDT models remains underexplored compared to neural networks. In this work, we present the first robust watermarking framework tailored to GBDT models, utilizing in-place fine-tuning to embed imperceptible and resilient watermarks. We propose four embedding strategies, each designed to minimize impact on model accuracy while ensuring watermark robustness. Through experiments across diverse datasets, we demonstrate that our methods achieve high watermark embedding rates, low accuracy degradation, and strong resistance to post-deployment fine-tuning.
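[AI-75] On the Convergence of Overparameterized Problems: Inherent Properties of the Compositional Structure of Neural Networks
[Quick Read]: This paper investigates how the compositional structure of neural networks shapes their optimization landscape and training dynamics. The key to its approach is an analysis of the gradient flow associated with overparameterized optimization problems, showing that global convergence properties can be derived for any cost function that is proper and real analytic. For scalar-valued cost functions, it further shows that key geometric features such as the location and stability of saddle points are universal, depending only on the overparameterized representation rather than on problem-specific details. It also introduces an imbalance metric that quantifies how the initialization can accelerate convergence, and it discusses, through an example with sigmoidal activations, which geometric and dynamical properties persist beyond the linear case.
Link: https://arxiv.org/abs/2511.09810
Authors: Arthur Castello Branco de Oliveira,Dhruv Jatkar,Eduardo Sontag
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Optimization and Control (math.OC)
Comments: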
Abstract:This paper investigates how the compositional structure of neural networks shapes their optimization landscape and training dynamics. We analyze the gradient flow associated with overparameterized optimization problems, which can be interpreted as training a neural network with linear activations. Remarkably, we show that the global convergence properties can be derived for any cost function that is proper and real analytic. We then specialize the analysis to scalar-valued cost functions, where the geometry of the landscape can be fully characterized. In this setting, we demonstrate that key structural features – such as the location and stability of saddle points – are universal across all admissible costs, depending solely on the overparameterized representation rather than on problem-specific details. Moreover, we show that convergence can be arbitrarily accelerated depending on the initialization, as measured by an imbalance metric introduced in this work. Finally, we discuss how these insights may generalize to neural networks with sigmoidal activations, showing through a simple example which geometric and dynamical properties persist beyond the linear case.
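A scalar toy case makes the setting concrete. Overparameterize a single number p as the product w1 * w2 (a two-layer linear network with one unit per layer) and run gradient flow on c(w1 * w2). Along the exact flow the quantity w1^2 - w2^2 is conserved, which is a standard notion of imbalance for linear networks; whether it coincides with the imbalance metric introduced in the paper is an assumption. The quadratic cost, the initial point, and the forward-Euler discretisation below are illustrative choices.

```python
def gradient_flow(w1, w2, cost_grad, step=1e-3, iterations=5000):
    """Forward-Euler discretisation of gradient flow on c(w1 * w2)."""
    for _ in range(iterations):
        g = cost_grad(w1 * w2)  # c'(p) evaluated at p = w1 * w2
        # Chain rule: dc/dw1 = c'(p) * w2 and dc/dw2 = c'(p) * w1.
        w1, w2 = w1 - step * g * w2, w2 - step * g * w1
    return w1, w2

cost_grad = lambda p: p - 1.0  # derivative of c(p) = 0.5 * (p - 1)^2
w1_start, w2_start = 2.0, 0.1
w1_end, w2_end = gradient_flow(w1_start, w2_start, cost_grad)

print(w1_end * w2_end)                                     # product after the flow
print(w1_start**2 - w2_start**2, w1_end**2 - w2_end**2)    # imbalance before and after
```

In this toy case the error in the product decays at a rate governed by w1^2 + w2^2, which is bounded below by the absolute imbalance, so a more imbalanced initialization converges faster. That is consistent with the abstract's claim that convergence can be accelerated through the initialization.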
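[AI-76] Constrained Best Arm Identification with Tests for Feasibility AAAI2026
[Quick Read]: This paper studies feasible best arm identification (feasible BAI): identifying, among $K$ arms, the arm with the highest performance that also satisfies all feasibility constraints. Prior work on best arm identification (BAI) in multi-armed bandits (MAB) with feasibility constraints typically assumes that performance and constraints are observed simultaneously, but in practical settings such as drug discovery, performance (for example potency) and feasibility (for example toxicity and solubility) are tested separately. The paper therefore considers a setting in which, in each round, the decision-maker chooses an arm $i \in [K]$ and a test type $\ell$, where $\ell = 0$ tests performance and $\ell \in [N]$ tests the $\ell$-th feasibility constraint. The key to its solution is an efficient algorithm for the fixed confidence setting (identifying the correct feasible best arm with probability at least $1-\delta$) that adapts to the problem's difficulty by eliminating arms through worse performance or infeasibility, whichever is easier; a matching lower bound shows that its sample complexity is asymptotically optimal as $\delta \to 0$, and experiments show that it outperforms state-of-the-art BAI algorithms.
Link: https://arxiv.org/abs/2511.09808
Authors: Ting Cai,Kirthevasan Kandasamy
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to AAAI 2026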
Abstract:Best arm identification (BAI) aims to identify the highest-performance arm among a set of $K$ arms by collecting stochastic samples from each arm. In real-world problems, the best arm needs to satisfy additional feasibility constraints. While there is limited prior work on BAI with feasibility constraints, they typically assume the performance and constraints are observed simultaneously on each pull of an arm. However, this assumption does not reflect most practical use cases, e.g., in drug discovery, we wish to find the most potent drug whose toxicity and solubility are below certain safety thresholds. These safety experiments can be conducted separately from the potency measurement. Thus, this requires designing BAI algorithms that not only decide which arm to pull but also decide whether to test for the arm's performance or feasibility. In this work, we study feasible BAI which allows a decision-maker to choose a tuple $(i,\ell)$, where $i \in [K]$ denotes an arm and $\ell$ denotes whether she wishes to test for its performance ($\ell=0$) or any of its $N$ feasibility constraints ($\ell \in [N]$). We focus on the fixed confidence setting, which is to identify the *feasible* arm with the *highest performance*, with a probability of at least $1-\delta$. We propose an efficient algorithm and upper-bound its sample complexity, showing our algorithm can naturally adapt to the problem's difficulty and eliminate arms by worse performance or infeasibility, whichever is easier. We complement this upper bound with a lower bound showing that our algorithm is *asymptotically* ($\delta \rightarrow 0$) optimal. Finally, we empirically show that our algorithm outperforms other state-of-the-art BAI algorithms in both synthetic and real-world datasets.
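The interaction protocol described above, where each round picks a tuple $(i,\ell)$, can be sketched in a few lines of Python. This is only an illustration of the problem setup, not the authors' algorithm: the function names `pull` and `feasible_best_arm`, the Gaussian noise model, the "constraint mean at or below threshold" convention, and the toy numbers are all assumptions made for this sketch.

```python
import random
from typing import List, Optional


def pull(i: int, ell: int, perf_means: List[float],
         constraint_means: List[List[float]], rng: random.Random,
         noise: float = 1.0) -> float:
    """One round: ell == 0 samples arm i's performance, ell in 1..N samples constraint ell."""
    mean = perf_means[i] if ell == 0 else constraint_means[i][ell - 1]
    # Gaussian observation noise is an assumption of this sketch.
    return rng.gauss(mean, noise)


def feasible_best_arm(perf_means: List[float],
                      constraint_means: List[List[float]],
                      thresholds: List[float]) -> Optional[int]:
    """Ground-truth target: the feasible arm with the highest mean performance.

    An arm counts as feasible when every constraint mean is at or below its
    threshold (an assumed convention). Returns None if no arm is feasible.
    """
    best = None
    for i, mu in enumerate(perf_means):
        feasible = all(c <= t for c, t in zip(constraint_means[i], thresholds))
        if feasible and (best is None or mu > perf_means[best]):
            best = i
    return best


# Hypothetical instance: 3 arms (candidate drugs), N = 2 constraints
# (e.g. toxicity and solubility scores), both thresholds set to 0.5.
PERF = [0.9, 0.7, 0.5]
CONS = [[0.8, 0.2], [0.3, 0.1], [0.1, 0.1]]
THRESHOLDS = [0.5, 0.5]
```

In this toy instance arm 0 has the highest performance but violates the first constraint, so the target is arm 1. A learner could rule out arm 0 either by showing it is infeasible or, for lower-performing arms, by showing they are worse than a feasible arm, which is the "whichever is easier" trade-off the abstract refers to.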
[AI-77] SlideBot: A Multi-Agent Framework for Generating Informative Reliable Multi-Modal Presentations
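【Quick Read】: This paper addresses the challenges that large language models (LLMs) face in generating high-quality presentation slides for education, including the complexity of multimodal content, insufficient accuracy of domain knowledge, and unreliable outputs. Existing LLM-based methods struggle to produce slides that are informative, clearly structured, and consistent with instructional design principles, which limits their practical value in higher education. The key to the solution is SlideBot, a modular, multi-agent slide generation framework with three core elements: (1) structured planning grounded in Cognitive Load Theory (CLT) and the Cognitive Theory of Multimedia Learning (CTML), to manage intrinsic cognitive load and reduce extraneous cognitive load; (2) retrieval augmentation to keep content reliable and domain-relevant; and (3) LaTeX code generation combined with interactive, iterative refinement with instructors, enabling customization and pedagogical fit. The system automates the pipeline from content generation to visual presentation, and evaluations by domain experts and students indicate improved conceptual accuracy, clarity, and instructional value.
Link: https://arxiv.org/abs/2511.09804
Authors: Eric Xie, Danielle Waterfield, Michael Kennedy, Aidong Zhang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 32 pages, 14 figures, accepted into EAAI 2026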
Abstract:Large Language Models (LLMs) have shown immense potential in education, automating tasks like quiz generation and content summarization. However, generating effective presentation slides introduces unique challenges due to the complexity of multimodal content creation and the need for precise, domain-specific information. Existing LLM-based solutions often fail to produce reliable and informative outputs, limiting their educational value. To address these limitations, we introduce SlideBot - a modular, multi-agent slide generation framework that integrates LLMs with retrieval, structured planning, and code generation. SlideBot is organized around three pillars: informativeness, ensuring deep and contextually grounded content; reliability, achieved by incorporating external sources through retrieval; and practicality, which enables customization and iterative feedback through instructor collaboration. It incorporates evidence-based instructional design principles from Cognitive Load Theory (CLT) and the Cognitive Theory of Multimedia Learning (CTML), using structured planning to manage intrinsic load and consistent visual macros to reduce extraneous load and enhance dual-channel learning. Within the system, specialized agents collaboratively retrieve information, summarize content, generate figures, and format slides using LaTeX, aligning outputs with instructor preferences through interactive refinement. Evaluations from domain experts and students in AI and biomedical education show that SlideBot consistently enhances conceptual accuracy, clarity, and instructional value. These findings demonstrate SlideBot’s potential to streamline slide preparation while ensuring accuracy, relevance, and adaptability in higher education.
[AI-78] Why Open Small AI Models Matter for Interactive Art
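【Quick Read】: This paper addresses the creative constraints that interactive art practices face when they depend on large, closed-source corporate AI systems, including restrictive content filters, difficulties in preserving artworks, and technical challenges such as high latency and limited interfaces. The key to the solution is adopting open small AI models that can be deployed locally. Such models give artists autonomous control over infrastructure and code, support long-term, sustainable creation and exhibition, and enable technological self-determination through custom model training or interface integration, safeguarding the independence and maintainability of interactive artworks that include AI components.
Link: https://arxiv.org/abs/2511.09788
Authors: Mar Canet Sola, Varvara Guljajeva
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 8 pages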
Abstract:This position paper argues for the importance of open small AI models in creative independence for interactive art practices. Deployable locally, these models offer artists vital control over infrastructure and code, unlike dominant large, closed-source corporate systems. Such centralized platforms function as opaque black boxes, imposing severe limitations on interactive artworks, including restrictive content filters, preservation issues, and technical challenges such as increased latency and limited interfaces. In contrast, small AI models empower creators with more autonomy, control, and sustainability for these artistic processes. They enable the ability to use a model as long as they want, create their own custom model, either by making code changes to integrate new interfaces, or via new datasets by re-training or fine-tuning the model. This fosters technological self-determination, offering greater ownership and reducing reliance on corporate AI ill-suited for interactive art’s demands. Critically, this approach empowers the artist and supports long-term preservation and exhibition of artworks with AI components. This paper explores the practical applications and implications of using open small AI models in interactive art, contrasting them with closed-source alternatives.
[AI-79] AI Annotation Orchestration: Evaluating LLM verifiers to Improve the Quality of LLM Annotations in Learning Analytics
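【Quick Read】: This paper addresses the limited reliability of large language models (LLMs) when annotating learning interactions, which restricts their use in educational data analysis. The key to the solution is verification-oriented orchestration: self-verification (a model checks its own labels) and cross-verification (models audit one another's labels) improve the consistency and accuracy of qualitative coding. Experiments show that orchestration yields an overall 58% improvement in Cohen's kappa. Self-verification brings the largest gains for challenging tutor moves, while the effect of cross-verification depends on the verifier-annotator pairing and on the construct being annotated.
Link: https://arxiv.org/abs/2511.09785
Authors: Bakhtawar Ahtisham, Kirk Vanacore, Jinsook Lee, Zhuqian Zhou, Doug Pietrzak, Rene F. Kizilcec
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: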
Abstract:Large Language Models (LLMs) are increasingly used to annotate learning interactions, yet concerns about reliability limit their utility. We test whether verification-oriented orchestration, that is, prompting models to check their own labels (self-verification) or audit one another (cross-verification), improves qualitative coding of tutoring discourse. Using transcripts from 30 one-to-one math sessions, we compare three production LLMs (GPT, Claude, Gemini) under three conditions: unverified annotation, self-verification, and cross-verification across all orchestration configurations. Outputs are benchmarked against a blinded, disagreement-focused human adjudication using Cohen's kappa. Overall, orchestration yields a 58 percent improvement in kappa. Self-verification nearly doubles agreement relative to unverified baselines, with the largest gains for challenging tutor moves. Cross-verification achieves a 37 percent improvement on average, with pair- and construct-dependent effects: some verifier-annotator pairs exceed self-verification, while others reduce alignment, reflecting differences in verifier strictness. We contribute: (1) a flexible orchestration framework instantiating control, self-, and cross-verification; (2) an empirical comparison across frontier LLMs on authentic tutoring data with blinded human "gold" labels; and (3) a concise notation, verifier(annotator) (e.g., Gemini(GPT) or Claude(Claude)), to standardize reporting and make directional effects explicit for replication. Results position verification as a principled design lever for reliable, scalable LLM-assisted annotation in Learning Analytics.
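Since the study reports agreement as Cohen's kappa, a minimal implementation of the standard two-rater formula may help readers interpret the numbers. The sketch below is generic and not taken from the paper; the label names in the example (`"probe"`, `"hint"`) are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters who labelled the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items on which the raters agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical example: human adjudication vs. an LLM annotator.
HUMAN = ["probe", "hint", "hint", "probe"]
LLM = ["probe", "probe", "hint", "probe"]
KAPPA = cohen_kappa(HUMAN, LLM)
```

Here the raters agree on 3 of 4 items (observed agreement 0.75) while chance agreement is 0.5, so kappa is 0.5. Because kappa corrects for chance, a relative improvement in kappa is not the same as an equal relative improvement in raw agreement.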
[AI-80] Koopman Invariants as Drivers of Emergent Time-Series Clustering in Joint-Embedding Predictive Architectures
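【Quick Read】: This paper addresses the unexplained ability of Joint-Embedding Predictive Architectures (JEPAs), trained without supervision, to cluster time-series data by their underlying dynamical regimes. The authors hypothesize that JEPA's predictive objective implicitly drives it to learn the invariant subspace of the system's Koopman operator, and they prove that an idealized JEPA loss is minimized when the encoder represents the system's regime indicator functions, which are Koopman eigenfunctions. The key is a near-identity constraint on the linear predictor: this inductive bias forces the encoder to learn interpretable dynamical invariants, selecting a clearly structured representation from a class of mathematically equivalent but entangled optima and revealing the predictor's central role in representation disentanglement.
Link: https://arxiv.org/abs/2511.09783
Authors: Pablo Ruiz-Morales, Dries Vanoost, Davy Pissoort, Mathias Verbeke
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 11 pages, 5 figures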
Abstract:Joint-Embedding Predictive Architectures (JEPAs), a powerful class of self-supervised models, exhibit an unexplained ability to cluster time-series data by their underlying dynamical regimes. We propose a novel theoretical explanation for this phenomenon, hypothesizing that JEPA’s predictive objective implicitly drives it to learn the invariant subspace of the system’s Koopman operator. We prove that an idealized JEPA loss is minimized when the encoder represents the system’s regime indicator functions, which are Koopman eigenfunctions. This theory was validated on synthetic data with known dynamics, demonstrating that constraining the JEPA’s linear predictor to be a near-identity operator is the key inductive bias that forces the encoder to learn these invariants. We further discuss that this constraint is critical for selecting this interpretable solution from a class of mathematically equivalent but entangled optima, revealing the predictor’s role in representation disentanglement. This work demystifies a key behavior of JEPAs, provides a principled connection between modern self-supervised learning and dynamical systems theory, and informs the design of more robust and interpretable time-series models.
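The statement that regime indicator functions are Koopman eigenfunctions can be checked in a few lines. The sketch below assumes deterministic discrete-time dynamics $x_{t+1} = F(x_t)$ and regimes $A_1, \dots, A_m$ that partition the state space and are each forward-invariant. These assumptions are ours, chosen for illustration; the paper's setting may be more general.

```latex
% Koopman operator acting on an observable g
(\mathcal{K} g)(x) = g\big(F(x)\big).

% Regimes A_1, ..., A_m partition the state space, with F(A_k) \subseteq A_k.
% If x \in A_k, then F(x) \in A_k. If x \in A_j with j \neq k, then
% F(x) \in A_j, which is disjoint from A_k. Hence, for every x,
(\mathcal{K}\, \mathbf{1}_{A_k})(x) = \mathbf{1}_{A_k}\big(F(x)\big) = \mathbf{1}_{A_k}(x),

% so each indicator is a Koopman eigenfunction with eigenvalue 1:
\mathcal{K}\, \mathbf{1}_{A_k} = 1 \cdot \mathbf{1}_{A_k}.
```

An eigenvalue of 1 means the indicator's value does not change along a trajectory, which is consistent with the abstract's observation that a near-identity linear predictor favors these invariants.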
[AI-81] Privacy-Preserving Explainable AIoT Application via SHAP Entropy Regularization
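【Quick Read】: This paper addresses the privacy leakage that explainable AI (XAI) introduces in Artificial Intelligence of Things (AIoT) applications, in particular the risk that post-hoc explanation methods such as SHAP (SHapley Additive exPlanations) expose sensitive user attributes and behavioral patterns. The key to the solution is a privacy-preserving mechanism based on SHAP entropy regularization: an entropy-based regularization objective added during training penalizes low-entropy SHAP attribution distributions, encouraging a more uniform spread of feature contributions. This suppresses leakage of sensitive information through explanations while maintaining predictive performance and explanation fidelity.
Link: https://arxiv.org/abs/2511.09775
Authors: Dilli Prasad Sharma, Xiaowei Sun, Liang Xue, Xiaodong Lin, Pulei Xiong
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: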
Abstract:The widespread integration of Artificial Intelligence of Things (AIoT) in smart home environments has amplified the demand for transparent and interpretable machine learning models. To foster user trust and comply with emerging regulatory frameworks, the Explainable AI (XAI) methods, particularly post-hoc techniques such as SHapley Additive exPlanations (SHAP), and Local Interpretable Model-Agnostic Explanations (LIME), are widely employed to elucidate model behavior. However, recent studies have shown that these explanation methods can inadvertently expose sensitive user attributes and behavioral patterns, thereby introducing new privacy risks. To address these concerns, we propose a novel privacy-preserving approach based on SHAP entropy regularization to mitigate privacy leakage in explainable AIoT applications. Our method incorporates an entropy-based regularization objective that penalizes low-entropy SHAP attribution distributions during training, promoting a more uniform spread of feature contributions. To evaluate the effectiveness of our approach, we developed a suite of SHAP-based privacy attacks that strategically leverage model explanation outputs to infer sensitive information. We validate our method through comparative evaluations using these attacks alongside utility metrics on benchmark smart home energy consumption datasets. Experimental results demonstrate that SHAP entropy regularization substantially reduces privacy leakage compared to baseline models, while maintaining high predictive accuracy and faithful explanation fidelity. This work contributes to the development of privacy-preserving explainable AI techniques for secure and trustworthy AIoT applications.
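One plausible reading of "penalizing low-entropy SHAP attribution distributions" is sketched below: normalize the absolute attributions into a probability distribution, compute its Shannon entropy, and penalize the gap to the maximum entropy. The exact regularizer is not given in the abstract, so the functions `attribution_entropy` and `entropy_penalty`, the normalization, and the `weight` parameter are assumptions. A real training loop would also need the attributions to be differentiable with respect to the model parameters, which this pure-Python sketch does not cover.

```python
import math
from typing import Sequence


def attribution_entropy(attributions: Sequence[float], eps: float = 1e-12) -> float:
    """Shannon entropy (in nats) of the normalized absolute attributions.

    Expects a non-empty sequence with one attribution per feature.
    """
    magnitudes = [abs(a) for a in attributions]
    total = sum(magnitudes)
    if total <= eps:
        # All attributions are (near) zero: treat the spread as uniform.
        return math.log(len(attributions))
    probs = [m / total for m in magnitudes]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_penalty(attributions: Sequence[float], weight: float = 0.1) -> float:
    """Zero for a uniform spread; largest when one feature carries all attribution."""
    max_entropy = math.log(len(attributions))
    return weight * (max_entropy - attribution_entropy(attributions))


# Hypothetical per-feature attributions for one prediction.
UNIFORM = [1.0, -1.0, 1.0, -1.0]   # contributions spread evenly
PEAKED = [0.0, 5.0, 0.0, 0.0]      # one feature dominates
```

Under this assumed form, the total training objective would be the task loss plus `entropy_penalty(...)`, so explanations dominated by a single (possibly sensitive) feature are discouraged. The `weight` controls the trade-off with predictive accuracy.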
[AI-82] ProbLog4Fairness: A Neurosymbolic Approach to Modeling and Mitigating Bias AAAI2026
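【Quick Read】: This paper addresses the modeling and mitigation of algorithmic bias, particularly when multiple fairness definitions may conflict and are hard to operationalize consistently in practice. The core challenge is to incorporate bias assumptions specific to a task's context into training in a flexible and interpretable way, rather than relying on a fixed fairness criterion. The key to the solution is the ProbLog4Fairness framework, which formalizes bias assumptions as programs in ProbLog (a probabilistic logic programming language) and uses neurosymbolic extensions of ProbLog to integrate them into neural network training. The approach expresses several types of bias through templates and mitigates bias on synthetic and real-world data (tabular and image), offering more flexibility and interpretability than methods that uphold a fixed bias type or notion of fairness.
Link: https://arxiv.org/abs/2511.09768
Authors: Rik Adriaensen, Lucas Van Praet, Jessa Bekker, Robin Manhaeve, Pieter Delobelle, Maarten Buyl
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at AAAI 2026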
Abstract:Operationalizing definitions of fairness is difficult in practice, as multiple definitions can be incompatible while each being arguably desirable. Instead, it may be easier to directly describe algorithmic bias through ad-hoc assumptions specific to a particular real-world task, e.g., based on background information on systemic biases in its context. Such assumptions can, in turn, be used to mitigate this bias during training. Yet, a framework for incorporating such assumptions that is simultaneously principled, flexible, and interpretable is currently lacking. Our approach is to formalize bias assumptions as programs in ProbLog, a probabilistic logic programming language that allows for the description of probabilistic causal relationships through logic. Neurosymbolic extensions of ProbLog then allow for easy integration of these assumptions in a neural network's training process. We propose a set of templates to express different types of bias and show the versatility of our approach on synthetic tabular datasets with known biases. Using estimates of the bias distortions present, we also succeed in mitigating algorithmic bias in real-world tabular and image data. We conclude that ProbLog4Fairness outperforms baselines due to its ability to flexibly model the relevant bias assumptions, where other methods typically uphold a fixed bias type or notion of fairness.
[AI-83] Ksurf-Drone: Attention Kalman Filter for Contextual Bandit Optimization in Cloud Resource Allocation
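【Quick Read】: This paper addresses resource orchestration and configuration parameter search for container-based infrastructure in cloud data centers, in particular the challenges posed by large configuration spaces and cloud uncertainty. Existing approaches such as the Drone orchestrator use contextual multi-armed bandit techniques for resource orchestration, but when varying numbers of virtual machines make workloads and resource metrics highly variable, their decisions become less accurate because of increased nonlinearity and noise. The key to the solution is Ksurf, a variance-minimizing estimator suited to highly variable cloud data, embedded in the Drone orchestration framework as the objective function model of the contextual multi-armed bandit. This lowers latency variance (by 41% at p95 and 47% at p99), reduces CPU usage by 4% and master node memory usage by 7 MB, and cuts the average worker pod count by 7% on the VarBench Kubernetes benchmark, yielding a cost saving.
Link: https://arxiv.org/abs/2511.09766
Authors: Michael Dang’ana, Yuqiu Zhang, Hans-Arno Jacobsen
Affiliation: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: 14 pages, 22 figures, 2 tables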
Abstract:Resource orchestration and configuration parameter search are key concerns for container-based infrastructure in cloud data centers. Large configuration search space and cloud uncertainties are often mitigated using contextual bandit techniques for resource orchestration including the state-of-the-art Drone orchestrator. Complexity in the cloud provider environment due to varying numbers of virtual machines introduces variability in workloads and resource metrics, making orchestration decisions less accurate due to increased nonlinearity and noise. Ksurf, a state-of-the-art variance-minimizing estimator method ideal for highly variable cloud data, enables optimal resource estimation under conditions of high cloud variability. This work evaluates the performance of Ksurf on estimation-based resource orchestration tasks involving highly variable workloads when employed as a contextual multi-armed bandit objective function model for cloud scenarios using Drone. Ksurf enables significantly lower latency variance of 41% at p95 and 47% at p99, demonstrates a 4% reduction in CPU usage and 7 MB reduction in master node memory usage on Kubernetes, resulting in a 7% cost savings in average worker pod count on VarBench Kubernetes benchmark.
ACM classes: I.2; C.4
[AI-84] History Rhymes: Macro-Contextual Retrieval for Robust Financial Forecasting
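【Quick Read】: This paper addresses the non-stationarity of financial markets: structural breaks and macroeconomic regime shifts cause conventional forecasting models to degrade sharply out of distribution (OOD), and existing multimodal methods that fuse numerical indicators with textual sentiment rarely adapt to such shifts. The key to the solution is macro-contextual retrieval: macro indicators (e.g., CPI, unemployment, yield spread, GDP growth) and financial news sentiment are jointly embedded in a shared similarity space, so that historically analogous macroeconomic regimes can be retrieved causally at inference time, without retraining, to ground each prediction. The method narrows the gap between cross-validation (CV) and OOD performance, achieves positive trading outcomes on AAPL (2024) and XOM (2024) (AAPL: PF=1.18, Sharpe=0.95; XOM: PF=1.16, Sharpe=0.61), and yields interpretable evidence chains that correspond to recognizable macro contexts such as inflationary or yield-curve inversion phases.
Link: https://arxiv.org/abs/2511.09754
Authors: Sarthak Khanna, Armin Berger, Muskaan Chopra, Rafet Sifa
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted in IEEE BigData 2025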
Abstract:Financial markets are inherently non-stationary: structural breaks and macroeconomic regime shifts often cause forecasting models to fail when deployed out of distribution (OOD). Conventional multimodal approaches that simply fuse numerical indicators and textual sentiment rarely adapt to such shifts. We introduce macro-contextual retrieval, a retrieval-augmented forecasting framework that grounds each prediction in historically analogous macroeconomic regimes. The method jointly embeds macro indicators (e.g., CPI, unemployment, yield spread, GDP growth) and financial news sentiment in a shared similarity space, enabling causal retrieval of precedent periods during inference without retraining. Trained on seventeen years of S&P 500 data (2007-2023) and evaluated OOD on AAPL (2024) and XOM (2024), the framework consistently narrows the CV to OOD performance gap. Macro-conditioned retrieval achieves the only positive out-of-sample trading outcomes (AAPL: PF=1.18, Sharpe=0.95; XOM: PF=1.16, Sharpe=0.61), while static numeric, text-only, and naive multimodal baselines collapse under regime shifts. Beyond metric gains, retrieved neighbors form interpretable evidence chains that correspond to recognizable macro contexts, such as inflationary or yield-curve inversion phases, supporting causal interpretability and transparency. By operationalizing the principle that "financial history may not repeat, but it often rhymes," this work demonstrates that macro-aware retrieval yields robust, explainable forecasts under distributional change. All datasets, models, and source code are publicly available.
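The retrieval step can be illustrated with a small nearest-neighbor search that only looks backward in time. This is a minimal sketch, not the authors' pipeline: the function `retrieve_precedents`, the use of cosine similarity, the two-dimensional toy embeddings, and the year-level timestamps are all assumptions for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def retrieve_precedents(query_vec: Sequence[float], query_time: int,
                        history: List[Tuple[int, Sequence[float]]],
                        k: int = 2) -> List[int]:
    """Return the times of the k most similar periods that precede query_time.

    Restricting candidates to time < query_time keeps retrieval causal:
    no information from the future can leak into the forecast.
    """
    past = [(t, vec) for t, vec in history if t < query_time]
    ranked = sorted(past, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [t for t, _ in ranked[:k]]


# Toy joint embeddings of (macro indicators, news sentiment), keyed by year.
HISTORY = [(2008, [1.0, 0.0]), (2015, [0.0, 1.0]),
           (2022, [0.9, 0.1]), (2025, [1.0, 0.0])]
QUERY = [1.0, 0.05]
NEIGHBORS = retrieve_precedents(QUERY, 2024, HISTORY, k=2)
```

With these toy vectors, a query dated 2024 retrieves 2008 and 2022 as its closest precedents, while the 2025 entry is excluded even though it matches well, because it lies in the query's future. The retrieved periods are what would serve as the interpretable "evidence chain" for a forecast.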
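[AI-85] TawPipe: Topology-Aware Weight Pipeline Parallelism for Accelerating Long-Context Large Models Training AAAI2026
【Quick Read】: This paper addresses efficiency bottlenecks in training large language models (LLMs) caused by limited device memory and costly inter-device communication, in particular the activation communication overhead of pipeline parallelism, which grows linearly with sequence length in long-context training. The key to the solution is topology-aware weight pipeline parallelism (TawPipe), which (i) groups devices by topology to optimize intra-node collective and inter-node peer-to-peer (P2P) communication; (ii) assigns each device a fixed shard of model weights and gradients to avoid redundant transfers; and (iii) overlaps communication with computation to hide latency. Unlike the global collective operations of fully sharded data parallelism (FSDP), TawPipe confines most communication within node boundaries, substantially reducing cross-node traffic and achieving higher throughput and scalability on up to 24 GPUs.
Link: https://arxiv.org/abs/2511.09741
Authors: Houming Wu, Ling Chen
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Accepted by AAAI 2026, 9 pages, and 6 figures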
Abstract:Training large language models (LLMs) is fundamentally constrained by limited device memory and costly inter-device communication. Although pipeline parallelism alleviates memory pressure by partitioning models across devices, it incurs activation communication overhead that scales linearly with sequence length, limiting efficiency in long-context training. Recent weight-passing approaches (e.g., WeiPipe) mitigate this by transmitting model weights instead of activations, but suffer from redundant peer-to-peer (P2P) transfers and underutilized intra-node bandwidth. We propose TawPipe–topology-aware weight pipeline parallelism, which exploits hierarchical bandwidth in distributed clusters for improved communication efficiency. TawPipe: (i) groups devices based on topology to optimize intra-node collective and inter-node P2P communication; (ii) assigns each device a fixed shard of model weights and gradients, avoiding redundant transfers; and (iii) overlaps communication with computation to hide latency. Unlike global collective operations used in fully sharded data parallelism (FSDP), TawPipe confines most communication within node boundaries, significantly reducing cross-node traffic. Extensive experiments on up to 24 GPUs with LLaMA-style models show that TawPipe achieves superior throughput and scalability compared to state-of-the-art baselines.
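A back-of-the-envelope calculation shows why passing weights instead of activations becomes attractive as sequences get longer. The sketch below is not a cost model of TawPipe or WeiPipe. It assumes 2-byte (e.g. fp16/bf16) elements, one activation tensor of shape (micro-batch, sequence, hidden) crossing each stage boundary, and the common rough estimate of 12 × hidden² parameters per transformer layer. All function names are hypothetical.

```python
def activation_bytes(micro_batch: int, seq_len: int, hidden: int,
                     bytes_per_elem: int = 2) -> int:
    """Bytes crossing one pipeline-stage boundary for one micro-batch (forward pass only)."""
    return micro_batch * seq_len * hidden * bytes_per_elem


def layer_weight_bytes(hidden: int, bytes_per_elem: int = 2) -> int:
    """Approximate weight bytes of one transformer layer (assumed ~12 * hidden^2 parameters)."""
    return 12 * hidden * hidden * bytes_per_elem


# Hypothetical configuration: hidden size 4096, micro-batch of 1.
HIDDEN = 4096
SHORT_CONTEXT = activation_bytes(1, 4096, HIDDEN)
LONG_CONTEXT = activation_bytes(1, 32768, HIDDEN)
PER_LAYER_WEIGHTS = layer_weight_bytes(HIDDEN)
```

The activation volume grows in direct proportion to `seq_len` (here 8× from a 4K to a 32K context), while the per-layer weight volume does not depend on sequence length at all. That difference in scaling, rather than any particular crossover point, is what the abstract's argument rests on.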
[AI-86] Out-of-Distribution Generalization with a SPARC: Racing 100 Unseen Vehicles with a Single Policy AAAI2026
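【Quick Read】: This paper addresses reliable generalization to unseen, out-of-distribution (OOD) environments in robotics and control, focusing on the case where explicit context information is unavailable at test time. Prior work uses a two-phase training strategy, first training a context encoder and then a history adaptation module, which is cumbersome to implement and train. The paper proposes SPARC (single-phase adaptation for robust control), whose key idea is to learn context modeling and the policy in a single training phase, simplifying the training procedure while achieving reliable and robust OOD generalization.
Link: https://arxiv.org/abs/2511.09737
Authors: Bram Grooten, Patrick MacAlpine, Kaushik Subramanian, Peter Stone, Peter R. Wurman
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted as an oral at AAAI 2026. For code and videos, please see this https URL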
Abstract:Generalization to unseen environments is a significant challenge in the field of robotics and control. In this work, we focus on contextual reinforcement learning, where agents act within environments with varying contexts, such as self-driving cars or quadrupedal robots that need to operate in different terrains or weather conditions than they were trained for. We tackle the critical task of generalizing to out-of-distribution (OOD) settings, without access to explicit context information at test time. Recent work has addressed this problem by training a context encoder and a history adaptation module in separate stages. While promising, this two-phase approach is cumbersome to implement and train. We simplify the methodology and introduce SPARC: single-phase adaptation for robust control. We test SPARC on varying contexts within the high-fidelity racing simulator Gran Turismo 7 and wind-perturbed MuJoCo environments, and find that it achieves reliable and robust OOD generalization.
[AI-87] Baby Sophia: A Developmental Approach to Self-Exploration through Self-Touch and Hand Regard
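【Quick Read】: This paper addresses how a robot can learn coordinated multimodal perception and action autonomously, without external supervision, imitating an infant's progression from random motor exploration to purposeful behavior. The key to the solution is a reinforcement learning (RL) framework that uses intrinsic rewards to mimic curiosity-driven exploration, so that the robotic agent Baby Sophia learns self-touch and hand regard behaviors in the BabyBench simulation environment and gradually builds body awareness and cross-modal integration. High-dimensional tactile inputs are compressed into compact, meaningful representations to make learning efficient, and curriculum learning guides behavior from simple to complex, leading to spontaneous, coordinated multimodal skills acquired without external labels.
Link: https://arxiv.org/abs/2511.09727
Authors: Stelios Zarifis, Ioannis Chalkiadakis, Artemis Chardouveli, Vasiliki Moutzouri, Aggelos Sotirchos, Katerina Papadimitriou, Panagiotis Filntisis, Niki Efthymiou, Petros Maragos, Katerina Pastra
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 5 pages, 3 tables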
Abstract:Inspired by infant development, we propose a Reinforcement Learning (RL) framework for autonomous self-exploration in a robotic agent, Baby Sophia, using the BabyBench simulation environment. The agent learns self-touch and hand regard behaviors through intrinsic rewards that mimic an infant’s curiosity-driven exploration of its own body. For self-touch, high-dimensional tactile inputs are transformed into compact, meaningful representations, enabling efficient learning. The agent then discovers new tactile contacts through intrinsic rewards and curriculum learning that encourage broad body coverage, balance, and generalization. For hand regard, visual features of the hands, such as skin-color and shape, are learned through motor babbling. Then, intrinsic rewards encourage the agent to perform novel hand motions, and follow its hands with its gaze. A curriculum learning setup from single-hand to dual-hand training allows the agent to reach complex visual-motor coordination. The results of this work demonstrate that purely curiosity-based signals, with no external supervision, can drive coordinated multimodal learning, imitating an infant’s progression from random motor babbling to purposeful behaviors.
[AI-88] Echoing: Identity Failures when LLM Agents Talk to Each Other
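【Quick Read】: This paper addresses behavioral drift in autonomous agent-agent conversations (AxA) between large language model (LLM) agents, in particular echoing, where an agent abandons its assigned role and mirrors its conversational partner, undermining its intended objective. The study shows that echoing occurs across three major LLM providers, multiple domains, and more than 2000 conversations, and that it is not reduced by increased reasoning effort, suggesting that it stems from interaction dynamics rather than merely from sub-optimal prompting. The key to the solution is a protocol-level mitigation based on structured responses: targeted use of a structured output format reduces echoing to 9%, compared with unmitigated rates of 5% to 70% depending on the model and domain, improving the consistency and reliability of multi-agent systems.
Link: https://arxiv.org/abs/2511.09710
Authors: Sarath Shekkizhar, Romain Cosentino, Adam Earle, Silvio Savarese
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: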
Abstract:As large language model (LLM) based agents interact autonomously with one another, a new class of failures emerges that cannot be predicted from single agent performance: behavioral drifts in agent-agent conversations (AxA). Unlike human-agent interactions, where humans ground and steer conversations, AxA lacks such stabilizing signals, making these failures unique. We investigate one such failure, echoing, where agents abandon their assigned roles and instead mirror their conversational partners, undermining their intended objectives. Through experiments across 60 AxA configurations, 3 domains, and 2000+ conversations, we demonstrate that echoing occurs across three major LLM providers, with echoing rates from 5% to 70% depending on the model and domain. Moreover, we find that echoing is persistent even in advanced reasoning models with substantial rates ( 32.8% ) that are not reduced by increased reasoning efforts. We analyze prompt impacts, conversation dynamics, showing that echoing arises as interaction grows longer ( 7+ turns in experiments) and is not merely an artifact of sub-optimal prompting. Finally, we introduce a protocol-level mitigation in which targeted use of structured responses reduces echoing to 9% .
[AI-89] Rebellion: Noise-Robust Reasoning Training for Audio Reasoning Models
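【Quick Read】: This paper addresses the safety of Audio Reasoning Models (ARMs), i.e., audio language models that can reason, against advanced jailbreak attacks: how to strengthen the model's defense against harmful responses without degrading performance on benign tasks. The key to the solution is Rebellion, a robust reasoning training (RT) method that trains ARMs to be robust to the worst-case representation drift. This protects against advanced audio jailbreaks while preserving benign-task performance and significantly improves the accuracy-safety trade-off over standard RT.
Link: https://arxiv.org/abs/2511.09682
Authors: Tiansheng Huang, Virat Shejwalkar, Oscar Chang, Milad Nasr, Ling Liu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments: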
Abstract:Instilling reasoning capabilities in large models (LMs) using reasoning training (RT) significantly improves LMs’ performances. Thus Audio Reasoning Models (ARMs), i.e., audio LMs that can reason, are becoming increasingly popular. However, no work has studied the safety of ARMs against jailbreak attacks that aim to elicit harmful responses from target models. To this end, first, we show that standard RT with appropriate safety reasoning data can protect ARMs from vanilla audio jailbreaks, but cannot protect them against our proposed simple yet effective jailbreaks. We show that this is because of the significant representation drift between vanilla and advanced jailbreaks which forces the target ARMs to emit harmful responses. Based on this observation, we propose Rebellion, a robust RT that trains ARMs to be robust to the worst-case representation drift. All our results are on Qwen2-Audio; they demonstrate that Rebellion: 1) can protect against advanced audio jailbreaks without compromising performance on benign tasks, and 2) significantly improves accuracy-safety trade-off over standard RT method.
[AI-90] SEBA: Sample-Efficient Black-Box Attacks on Visual Reinforcement Learning
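【Quick Read】: This paper addresses the vulnerability of visual reinforcement learning (VRL) agents to adversarial perturbations, in particular the inefficiency of black-box attacks on image inputs in continuous control, where the action space is large and environment queries are costly. The key to the solution is the SEBA framework, which combines (1) a shadow Q model that estimates cumulative rewards under adversarial conditions; (2) a generative adversarial network (GAN) that produces visually imperceptible perturbations; and (3) a world model that simulates environment dynamics to reduce interactions with the real environment. A two-stage iterative training procedure alternates between learning the shadow model and refining the generator, so SEBA maintains attack strength while substantially improving sample efficiency.
Link: https://arxiv.org/abs/2511.09681
Authors: Tairan Huang, Yulin Jin, Junxu Liu, Qingqing Ye, Haibo Hu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: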
Abstract:Visual reinforcement learning has achieved remarkable progress in visual control and robotics, but its vulnerability to adversarial perturbations remains underexplored. Most existing black-box attacks focus on vector-based or discrete-action RL, and their effectiveness on image-based continuous control is limited by the large action space and excessive environment queries. We propose SEBA, a sample-efficient framework for black-box adversarial attacks on visual RL agents. SEBA integrates a shadow Q model that estimates cumulative rewards under adversarial conditions, a generative adversarial network that produces visually imperceptible perturbations, and a world model that simulates environment dynamics to reduce real-world queries. Through a two-stage iterative training procedure that alternates between learning the shadow model and refining the generator, SEBA achieves strong attack performance while maintaining efficiency. Experiments on MuJoCo and Atari benchmarks show that SEBA significantly reduces cumulative rewards, preserves visual fidelity, and greatly decreases environment interactions compared to prior black-box and white-box methods.
[AI-91] Alignment Debt: The Hidden Work of Making AI Usable
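【Quick Read】: This paper addresses the poor fit of frontier large language models (LLMs) deployed in Africa and the wider Global South. These models are optimised around high-resource assumptions and often do not match local cultural, linguistic, infrastructural, and epistemic conditions, so users must do additional work to make the systems usable, a burden the authors term "alignment debt". The key to the solution is a four-part taxonomy (Cultural and Linguistic, Infrastructural, Epistemic, Interaction), developed and validated through a survey of 411 AI users in Kenya and Nigeria that quantifies the prevalence and effects of alignment debt. The association with compensatory labour (such as output verification) differs by debt type: Epistemic debt is most strongly associated with verification, whereas Infrastructural and Interaction debts show weak or null associations with it, indicating that some forms of misalignment cannot be resolved through verification alone. Fairness assessments should therefore account for user burden, motivating context-aware AI design practice and governance frameworks that lower the barrier to use in Global South settings.
Link: https://arxiv.org/abs/2511.09663
Authors: Cumi Oyemike, Elizabeth Akpan, Pierre Hervé-Berdys
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 19 pages, 3 figures, 3 tables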
Abstract:Frontier LLMs are optimised around high-resource assumptions about language, knowledge, devices, and connectivity. Whilst widely accessible, they often misfit conditions in the Global South. As a result, users must often perform additional work to make these systems usable. We term this alignment debt: the user-side burden that arises when AI systems fail to align with cultural, linguistic, infrastructural, or epistemic contexts. We develop and validate a four-part taxonomy of alignment debt through a survey of 411 AI users in Kenya and Nigeria. Among respondents measurable on this taxonomy (n = 385), prevalence is: Cultural and Linguistic (51.9%), Infrastructural (43.1%), Epistemic (33.8%), and Interaction (14.0%). Country comparisons show a divergence in Infrastructural and Interaction debt, challenging one-size-fits-Africa assumptions. Alignment debt is associated with compensatory labour, but responses vary by debt type: users facing Epistemic challenges verify outputs at significantly higher rates (91.5% vs. 80.8%; p = 0.037), and verification intensity correlates with cumulative debt burden (Spearman's rho = 0.147, p = 0.004). In contrast, Infrastructural and Interaction debts show weak or null associations with verification, indicating that some forms of misalignment cannot be resolved through verification alone. These findings show that fairness must be judged not only by model metrics but also by the burden imposed on users at the margins, compelling context-aware safeguards that alleviate alignment debt in Global South settings. The alignment debt framework provides an empirically grounded way to measure user burden, informing both design practice and emerging African AI governance efforts.
[AI-92] Optimistic Reinforcement Learning with Quantile Objectives
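[Quick read]: This paper addresses the lack of risk sensitivity in classical Reinforcement Learning (RL) objectives, which matters in fields such as healthcare and finance; a popular remedy is to optimize a specific quantile of the cumulative reward distribution. The key to the solution is UCB-QRL, an optimistic algorithm that, at each iteration, first estimates the transition probabilities and then optimizes the τ-quantile value function over a confidence ball around that estimate. The analysis gives a high-probability regret bound in the episodic setting that depends on the number of states, actions, and episodes, on the horizon, and on a problem-dependent constant κ that captures the sensitivity of the MDP's quantile value.
Link: https://arxiv.org/abs/2511.09652
Authors: Mohammad Alipour-Vaezi,Huaiyang Zhong,Kwok-Leung Tsui,Sajad Khodadadian
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: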
Abstract:Reinforcement Learning (RL) has achieved tremendous success in recent years. However, the classical foundations of RL do not account for the risk sensitivity of the objective function, which is critical in various fields, including healthcare and finance. A popular approach to incorporate risk sensitivity is to optimize a specific quantile of the cumulative reward distribution. In this paper, we develop UCB-QRL, an optimistic learning algorithm for the $\tau$-quantile objective in finite-horizon Markov decision processes (MDPs). UCB-QRL is an iterative algorithm in which, at each iteration, we first estimate the underlying transition probability and then optimize the quantile value function over a confidence ball around this estimate. We show that UCB-QRL yields a high-probability regret bound $\mathcal{O}\left((2/\kappa)^{H+1} H \sqrt{SATH \log(2SATH/\delta)}\right)$ in the episodic setting with $S$ states, $A$ actions, $T$ episodes, and horizon $H$. Here, $\kappa > 0$ is a problem-dependent constant that captures the sensitivity of the underlying MDP’s quantile value.
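To make the quantile objective in the abstract concrete, the following minimal sketch compares two policies by mean return and by an empirical quantile of the return. It is an illustration only, not the UCB-QRL algorithm: the function names, the return samples, and the choice of the lower empirical quantile are assumptions introduced here.

```python
import math

def empirical_quantile(returns, tau):
    """Return the lower empirical tau-quantile of a list of episode returns."""
    if not 0.0 < tau <= 1.0:
        raise ValueError("tau must lie in (0, 1]")
    ordered = sorted(returns)
    # Smallest order statistic whose empirical CDF value is at least tau.
    index = math.ceil(tau * len(ordered)) - 1
    return ordered[index]

def mean(values):
    return sum(values) / len(values)

# Hypothetical cumulative rewards over 10 episodes for two policies.
risky_returns = [0, 0, 0, 10, 10, 10, 10, 10, 10, 10]
safe_returns = [5] * 10

tau = 0.25
print("means:", mean(risky_returns), mean(safe_returns))
print("quantiles:", empirical_quantile(risky_returns, tau),
      empirical_quantile(safe_returns, tau))
```

With these made-up samples the risky policy has the higher mean but the lower 0.25-quantile, which is the kind of disagreement that motivates a risk-sensitive objective.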
[AI-93] An explainable Recursive Feature Elimination to detect Advanced Persistent Threats using Random Forest classifier
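[Quick read]: This paper addresses the insufficient detection accuracy, high false-positive rate, and poor interpretability of traditional Intrusion Detection Systems (IDS) facing Advanced Persistent Threats (APTs). The key to the solution is an explainable detection framework that combines Recursive Feature Elimination (RFE) with a Random Forest (RF) classifier: RFE first selects the most discriminative feature subset from the CICIDS2017 dataset, reducing dimensionality and redundancy; an RF model is then trained on the reduced feature set, and SHapley Additive exPlanations (SHAP) quantifies the contribution of each selected feature. The authors report 99.9% detection accuracy together with lower false positives and computational cost than traditional classifiers.
Link: https://arxiv.org/abs/2511.09603
Authors: Noor Hazlina Abdul Mutalib,Aznul Qalid Md Sabri,Ainuddin Wahid Abdul Wahab,Erma Rahayu Mohd Faizal Abdullah,Nouar AlDahoul
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: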
Abstract:Intrusion Detection Systems (IDS) play a vital role in modern cybersecurity frameworks by providing a primary defense mechanism against sophisticated threat actors. In this paper, we propose an explainable intrusion detection framework that integrates Recursive Feature Elimination (RFE) with Random Forest (RF) to enhance detection of Advanced Persistent Threats (APTs). Using the CICIDS2017 dataset, the approach begins with comprehensive data preprocessing and narrows down the most significant features via RFE. A Random Forest (RF) model was trained on the refined feature set, with SHapley Additive exPlanations (SHAP) used to interpret the contribution of each selected feature. Our experiment demonstrates that the explainable RF-RFE achieved a detection accuracy of 99.9%, reducing false positives and computational cost in comparison to traditional classifiers. The findings underscore the effectiveness of integrating explainable AI and feature selection to develop a robust, transparent, and deployable IDS solution.
[AI-94] Cogent argument extensions are weakly admissible but not vice versa
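[Quick read]: This paper clarifies the relationship between two non-admissible semantics for argumentation frameworks: cogent semantics and weakly admissible semantics. The key result is a proof that every cogent extension is weakly admissible while the converse fails, i.e. some weakly admissible extensions are not cogent, so the inclusion is strict.
Link: https://arxiv.org/abs/2511.09600
Authors: Gustavo Bodanza
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: Research note, 4 pages, no figures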
Abstract:In this research note, we show the relationship between two non-admissible argumentation framework semantics: cogent and weakly admissible semantics. We prove that, while cogent extensions are weakly admissible, the converse is not true.
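As background for the term "weakly admissible", here is a brute-force sketch for tiny abstract argumentation frameworks. It assumes the reduct-based definition of weak admissibility usually attributed to Baumann, Brewka, and Ulbricht: a set E is weakly admissible if it is conflict-free and no attacker of E belongs to a weakly admissible set of the reduct of the framework by E. Cogent semantics is not implemented, and the two-argument framework is a made-up example rather than one taken from the note.

```python
from itertools import combinations

def subsets(args):
    """Yield every subset of args as a frozenset."""
    items = sorted(args)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)

def conflict_free(E, attacks):
    return not any((a, b) in attacks for a in E for b in E)

def admissible(E, attacks):
    """Classical admissibility: conflict-free and defends all its members."""
    if not conflict_free(E, attacks):
        return False
    for attacker, target in attacks:
        if target in E and not any((d, attacker) in attacks for d in E):
            return False
    return True

def reduct(E, args, attacks):
    """Remove E and everything E attacks; keep the attacks among the rest."""
    attacked = {b for (a, b) in attacks if a in E}
    rest = set(args) - set(E) - attacked
    return rest, {(a, b) for (a, b) in attacks if a in rest and b in rest}

def weakly_admissible(E, args, attacks):
    if not conflict_free(E, attacks):
        return False
    attackers = {a for (a, b) in attacks if b in E}
    if not attackers:
        return True
    rest, rest_attacks = reduct(E, args, attacks)
    for y in attackers:
        if y not in rest:
            continue  # y is attacked by E, so it cannot reappear in the reduct
        for S in subsets(rest):
            if y in S and weakly_admissible(S, rest, rest_attacks):
                return False
    return True

# Hypothetical framework: a attacks itself and attacks b.
ARGS = {"a", "b"}
ATTACKS = {("a", "a"), ("a", "b")}
E = frozenset({"b"})
print(admissible(E, ATTACKS), weakly_admissible(E, ARGS, ATTACKS))
```

In this example {b} cannot defend itself against the self-attacking argument a, so it is not admissible, yet a can never be accepted in the reduct, so {b} is weakly admissible under the assumed definition.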
[AI-95] Scaling Environments for LLM Agents in the Era of Learning from Interaction: A Survey NEURIPS2025
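[Quick read]: This paper addresses a bottleneck in giving Large Language Model (LLM) agents adaptive behavior and long-term decision-making: training on static datasets built from human knowledge is insufficient, because such datasets are costly to construct and lack dynamism and realism. The key to the solution is to formalize the Generation-Execution-Feedback (GEF) loop, in which agents interact directly with environments and learn from the resulting experience through reinforcement learning. The survey treats environments as the core producers of experiential data and argues for scaling them toward greater complexity, realism, and interactivity.
Link: https://arxiv.org/abs/2511.09586
Authors: Yuchen Huang,Sijia Li,Minghao Liu,Wei Liu,Shijue Huang,Zhiyuan Fan,Hou Pong Chan,Yi R. Fung
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 20 pages, 4 figures, SEA Workshop @ NeurIPS 2025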
Abstract:LLM-based agents can autonomously accomplish complex tasks across various domains. However, to further cultivate capabilities such as adaptive behavior and long-term decision-making, training on static datasets built from human-level knowledge is insufficient. These datasets are costly to construct and lack both dynamism and realism. A growing consensus is that agents should instead interact directly with environments and learn from experience through reinforcement learning. We formalize this iterative process as the Generation-Execution-Feedback (GEF) loop, where environments generate tasks to challenge agents, return observations in response to agents’ actions during task execution, and provide evaluative feedback on rollouts for subsequent learning. Under this paradigm, environments function as indispensable producers of experiential data, highlighting the need to scale them toward greater complexity, realism, and interactivity. In this survey, we systematically review representative methods for environment scaling from a pioneering environment-centric perspective and organize them along the stages of the GEF loop, namely task generation, task execution, and feedback. We further analyze benchmarks, implementation strategies, and applications, consolidating fragmented advances and outlining future research directions for agent intelligence.
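The three stages of the GEF loop described in the abstract can be sketched with a toy environment. Everything below is hypothetical scaffolding (the `ToyEnvironment` class, the arithmetic task, the binary reward); it only illustrates how task generation, execution, and feedback fit together and is not taken from the survey.

```python
import random

class ToyEnvironment:
    """Hypothetical environment that poses single-digit addition tasks."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def generate_task(self):
        # Generation: the environment proposes a task for the agent.
        return {"a": self.rng.randint(0, 9), "b": self.rng.randint(0, 9)}

    def execute(self, task, action):
        # Execution: the environment returns an observation for the action.
        return {"submitted": action, "task": task}

    def feedback(self, observation):
        # Feedback: the environment scores the finished rollout.
        task = observation["task"]
        return 1.0 if observation["submitted"] == task["a"] + task["b"] else 0.0

def gef_loop(env, policy, episodes):
    """Collect (task, action, reward) experience for later learning."""
    experience = []
    for _ in range(episodes):
        task = env.generate_task()
        action = policy(task)
        observation = env.execute(task, action)
        experience.append((task, action, env.feedback(observation)))
    return experience

adder = lambda task: task["a"] + task["b"]
experience = gef_loop(ToyEnvironment(seed=0), adder, episodes=5)
print([reward for _, _, reward in experience])
```

In a real system the collected experience would feed a reinforcement learning update; that step is left out here.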
[AI-96] HeatGen: A Guided Diffusion Framework for Multiphysics Heat Sink Design Optimization
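[Quick read]: This paper addresses heat sink design for electronics cooling: minimizing pressure drop while keeping surface temperature below a specified threshold. Traditional approaches such as black-box optimization with CMA-ES or topology optimization are computationally expensive, hard to scale, or have to be redone when constraints change. The key to the solution is a generative optimization framework built on a guided denoising diffusion probabilistic model (DDPM): a DDPM is trained on a multi-fidelity dataset to generate physically plausible geometries, two residual neural networks are trained as surrogate models that predict pressure drop and surface temperature, and the gradients of these surrogates with respect to the design variables guide generation at inference time, so that new designs satisfy the temperature constraint while reducing pressure drop by up to 10%. Inference under new constraints is inexpensive and requires no retraining.
Link: https://arxiv.org/abs/2511.09578
Authors: Hadi Keramati,Morteza Sadeghi,Rajeev K. Jaiman
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
Comments: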
Abstract:This study presents a generative optimization framework based on a guided denoising diffusion probabilistic model (DDPM) that leverages surrogate gradients to generate heat sink designs minimizing pressure drop while maintaining surface temperatures below a specified threshold. Geometries are represented using boundary representations of multiple fins, and a multi-fidelity approach is employed to generate training data. Using this dataset, along with vectors representing the boundary representation geometries, we train a denoising diffusion probabilistic model to generate heat sinks with characteristics consistent with those observed in the data. We train two different residual neural networks to predict the pressure drop and surface temperature for each geometry. We use the gradients of these surrogate models with respect to the design variables to guide the geometry generation process toward satisfying the low-pressure and surface temperature constraints. This inference-time guidance directs the generative process toward heat sink designs that not only prevent overheating but also achieve lower pressure drops compared to traditional optimization methods such as CMA-ES. In contrast to traditional black-box optimization approaches, our method is scalable, provided sufficient training data is available. Unlike traditional topology optimization methods, once the model is trained and the heat sink world model is saved, inference under new constraints (e.g., temperature) is computationally inexpensive and does not require retraining. Samples generated using the guided diffusion model achieve pressure drops up to 10 percent lower than the limits obtained by traditional black-box optimization methods. This work represents a step toward building a foundational generative model for electronics cooling.
[AI-97] Proceedings of the Second International Workshop on Next-Generation Language Models for Knowledge Representation and Reasoning (NeLaMKRR 2025)
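[Quick read]: This workshop addresses the unclear extent to which transformer-based language models can actually reason, and the gap between their reasoning and traditional logic-based Knowledge Representation (KR) methods, which makes their reasoning ability hard to assess and their use hard to trust. The key approach is cross-disciplinary collaboration: analysing the reasoning abilities of language models alongside KR methods, injecting KR-style reasoning into language models (including by neuro-symbolic means), and formalising the kind of reasoning language models carry out, with the aim of improving their utility where precision and reliability are key requirements.
Link: https://arxiv.org/abs/2511.09575
Authors: Ha-Thanh Nguyen,Ken Satoh,Francesca Toni,Randy Goebel,Kostas Stathis
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: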
Abstract:Reasoning is an essential component of human intelligence in that it plays a fundamental role in our ability to think critically, support responsible decisions, and solve challenging problems. Traditionally, AI has addressed reasoning in the context of logic-based representations of knowledge. However, the recent leap forward in natural language processing, with the emergence of language models based on transformers, is hinting at the possibility that these models exhibit reasoning abilities, particularly as they grow in size and are trained on more and more data. Still, despite ongoing discussions about what reasoning is in language models, it is still not easy to articulate to what extent these models are actually capable of reasoning. The goal of this workshop is to create a platform for researchers from different disciplines and/or AI perspectives to explore approaches and techniques with the aim to reconcile reasoning between language models using transformers and logic-based representations. The specific objectives include analysing the reasoning abilities of language models measured alongside KR methods, injecting KR-style reasoning abilities into language models (including by neuro-symbolic means), and formalising the kind of reasoning language models carry out. This exploration aims to uncover how language models can effectively integrate and leverage knowledge and reasoning with it, thus improving their application and utility in areas where precision and reliability are key requirements.
[AI-98] SynthTools: A Framework for Scaling Synthetic Tools for Agent Development
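[Quick read]: This paper addresses the difficulty of evaluating and training AI agents that rely on external tools for complex, long-horizon tasks: real-world APIs are limited in availability, domain coverage, and stability, often require access keys, and impose rate limits, which makes them impractical for large-scale, reproducible training and evaluation. The key to the solution is the SynthTools framework, which has three components: Tool Generation, for automatic and scalable creation of diverse tools; Tool Simulation, for emulating realistic tool behavior; and Tool Audit, for ensuring that the simulation is correct and consistent. The framework produces toolsets spanning twice as many domains, with twice as many tools per domain, as prior work, and the simulation and audit components reach 94% and 99% accuracy respectively.
Link: https://arxiv.org/abs/2511.09572
Authors: Tommaso Castellani,Naimeng Ye,Daksh Mittal,Thomson Yen,Hongseok Namkoong
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: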
Abstract:AI agents increasingly rely on external tools to solve complex, long-horizon tasks. Advancing such agents requires reproducible evaluation and large-scale training in controllable, diverse, and realistic tool-use environments. However, real-world APIs are limited in availability, domain coverage, and stability, often requiring access keys and imposing rate limits, which render them impractical for stable evaluation or scalable training. To address these challenges, we introduce SynthTools, a flexible and scalable framework for generating synthetic tool ecosystems. Our framework consists of three core components: Tool Generation for automatic and scalable creation of diverse tools, Tool Simulation to emulate realistic tool behaviors, and Tool Audit to ensure correctness and consistency of tool simulation. To illustrate its scalability, we show that SynthTools can readily produce toolsets that span twice as many domains and twice as many tools per domain as prior work. Furthermore, the tool simulation and tool audit components demonstrate strong reliability, achieving 94% and 99% accuracy respectively. Finally, we construct downstream tasks from the generated tools that even state-of-the-art models struggle to complete. By enabling scalable, diverse, and reliable tool ecosystems, SynthTools provides a practical path toward large-scale training and stable evaluation of tool-use agents. Our code is available at this https URL.
[AI-99] Variable Neighborhood Search for the Electric Vehicle Routing Problem
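[Quick read]: This paper addresses the Electric Vehicle Routing Problem (EVRP), where the variety of constraints in the literature makes approaches hard to compare, and focuses on a minimalistic variant, the Capacitated Green Vehicle Routing Problem (CGVRP). The key to the solution is a metaheuristic based on Variable Neighborhood Search (VNS); it won the CEC-12 competition held during the 2020 IEEE World Congress on Computational Intelligence, achieves the best results on the full competition dataset, and outperforms a more recent algorithm published afterward.
Link: https://arxiv.org/abs/2511.09570
Authors: David Woller,Viktor Kozák,Miroslav Kulich,Libor Přeučil
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 19 pages, 6 figures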
Abstract:The Electric Vehicle Routing Problem (EVRP) extends the classical Vehicle Routing Problem (VRP) to reflect the growing use of electric and hybrid vehicles in logistics. Due to the variety of constraints considered in the literature, comparing approaches across different problem variants remains challenging. A minimalistic variant of the EVRP, known as the Capacitated Green Vehicle Routing Problem (CGVRP), was the focus of the CEC-12 competition held during the 2020 IEEE World Congress on Computational Intelligence. This paper presents the competition-winning approach, based on the Variable Neighborhood Search (VNS) metaheuristic. The method achieves the best results on the full competition dataset and also outperforms a more recent algorithm published afterward.
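For readers unfamiliar with Variable Neighborhood Search, the sketch below shows the basic VNS skeleton (shake in the k-th neighborhood, run a local search, restart from k = 1 on improvement) on a plain travelling-salesman toy instance. It is not the competition-winning method: it ignores capacity and charging constraints, and the neighborhoods (k random segment reversals), the 2-opt local search, and the instance are assumptions chosen for brevity.

```python
import math
import random

def tour_length(tour, pts):
    return sum(math.dist(pts[tour[i]], pts[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def two_opt(tour, pts):
    """Local search: reverse segments while that shortens the tour."""
    tour = list(tour)
    improved = True
    while improved:
        improved = False
        for i in range(1, len(tour) - 1):
            for j in range(i + 1, len(tour)):
                candidate = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]
                if tour_length(candidate, pts) < tour_length(tour, pts) - 1e-12:
                    tour, improved = candidate, True
    return tour

def shake(tour, k, rng):
    """k-th neighborhood: apply k random segment reversals."""
    tour = list(tour)
    for _ in range(k):
        i, j = sorted(rng.sample(range(len(tour)), 2))
        tour[i:j + 1] = tour[i:j + 1][::-1]
    return tour

def vns(pts, k_max=3, iterations=30, seed=0):
    rng = random.Random(seed)
    best = two_opt(list(range(len(pts))), pts)
    for _ in range(iterations):
        k = 1
        while k <= k_max:
            candidate = two_opt(shake(best, k, rng), pts)
            if tour_length(candidate, pts) < tour_length(best, pts) - 1e-12:
                best, k = candidate, 1   # improvement: restart from N_1
            else:
                k += 1                   # no improvement: widen the neighborhood
    return best

# Hypothetical instance: 8 points on a circle, visited in shuffled order.
rng = random.Random(1)
points = [(math.cos(2 * math.pi * i / 8), math.sin(2 * math.pi * i / 8))
          for i in range(8)]
rng.shuffle(points)
best_tour = vns(points)
print(best_tour, round(tour_length(best_tour, points), 3))
```

A routing variant such as CGVRP would replace the tour representation and the feasibility checks, while the shake, local search, and neighborhood-change structure would stay the same.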
[AI-100] An Efficient and Almost Optimal Solver for the Joint Routing-Assignment Problem via Partial JRA and Large-α Optimization
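[Quick read]: This paper addresses the Joint Routing-Assignment (JRA) optimization problem, which simultaneously determines the assignment of items to placeholders and a Hamiltonian cycle that visits each node pair exactly once, minimizing total travel cost. Earlier exact methods guarantee optimality but are computationally inefficient on large instances. The key to the solution is a Partial Path Reconstruction (PPR) solver that first identifies key item-placeholder pairs to form a reduced subproblem, then solves it efficiently to refine the global solution. This halves the deviation of the initial heuristic solutions, iterative polishing with the PPR-based solver improves them further, and a global Large-α constraint further improves optimality. On benchmark datasets with n = 300, 500, and 1000, the reported average deviation from the ground truth is 0.00%.
Link: https://arxiv.org/abs/2511.09563
Authors: Qilong Yuan
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Combinatorics (math.CO)
Comments: 20 pages, 4 figures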
Abstract:The Joint Routing-Assignment (JRA) optimization problem simultaneously determines the assignment of items to placeholders and a Hamiltonian cycle that visits each node pair exactly once, with the objective of minimizing total travel cost. Previous studies introduced an exact mixed-integer programming (MIP) solver, along with datasets and a Gurobi implementation, showing that while the exact approach guarantees optimality, it becomes computationally inefficient for large-scale instances. To overcome this limitation, heuristic methods based on merging algorithms and shaking procedures were proposed, achieving solutions within approximately 1% deviation from the optimum. This work presents a novel and more efficient approach that attains high-accuracy, near-optimal solutions for large-scale JRA problems. The proposed method introduces a Partial Path Reconstruction (PPR) solver that first identifies key item-placeholder pairs to form a reduced subproblem, which is solved efficiently to refine the global solution. Using this PJAR framework, the initial heuristic merging solutions can be further improved, reducing the deviation by half. Moreover, the solution can be iteratively polished with a PPR-based solver along the optimization path to yield highly accurate tours. Additionally, a global Large-$\alpha$ constraint is incorporated into the JRA model to further enhance solution optimality. Experimental evaluations on benchmark datasets with n = 300, 500, and 1000 demonstrate that the proposed method consistently delivers almost optimal solutions, achieving an average deviation of 0.00% from the ground truth while maintaining high computational efficiency. Beyond the JRA problem, the proposed framework and methodologies exhibit strong potential for broader applications. The framework can be applied to TSP and related optimization problems.
[AI-101] From Euler to Today: Universal Mathematical Fallibility A Large-Scale Computational Analysis of Errors in ArXiv Papers
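[Quick read]: This paper addresses errors in the mathematical literature, which are widespread but hard to identify systematically, and the limited throughput and coverage of manual peer review. The key to the solution is a large-scale automated analysis system that reviewed more than 37,000 mathematics papers from ArXiv end to end: it detects mathematical errors (for example, a 9.6% error rate in Numerical Analysis, math.NA) and produces complete referee reports with journal-tier recommendations (for example, 0.4% recommended for top generalist journals and 15.5% for top field-specific journals). The authors argue that the methodology is discipline-agnostic and could be extended to other fields represented on ArXiv.
Link: https://arxiv.org/abs/2511.10543
Authors: Igor Rivin
Affiliations: Unknown
Categories: History and Overview (math.HO); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
Comments: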
Abstract:We present the results of a large-scale computational analysis of mathematical papers from the ArXiv repository, demonstrating a comprehensive system that not only detects mathematical errors but provides complete referee reports with journal tier recommendations. Our automated analysis system processed over 37,000 papers across multiple mathematical categories, revealing significant error rates and quality distributions. Remarkably, the system identified errors in papers spanning three centuries of mathematics, including works by Leonhard Euler (1707-1783) and Peter Gustav Lejeune Dirichlet (1805-1859), as well as contemporary Fields medalists. In Numerical Analysis (math.NA), we observed an error rate of 9.6% (2,271 errors in 23,761 papers), while Geometric Topology (math.GT) showed 6.5% (862 errors in 13,209 papers). Strikingly, Category Theory (math.CT) showed 0% errors in 93 papers analyzed, with evidence suggesting these results are "easier" for automated analysis. Beyond error detection, the system evaluated papers for journal suitability, recommending 0.4% for top generalist journals, 15.5% for top field-specific journals, and categorizing the remainder across specialist venues. These findings demonstrate both the universality of mathematical error across all eras and the feasibility of automated comprehensive mathematical peer review at scale. This work demonstrates that the methodology, while applied here to mathematics, is discipline-agnostic and could be readily extended to physics, computer science, and other fields represented in the ArXiv repository.
MSC classes: 01A85, 68T05, 00A35, 01A55, 68Q32
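The percentages quoted in the abstract can be checked against the raw counts it gives. The short script below recomputes them; it assumes that each rate is simply the number of reported errors divided by the number of papers in that category, which is how the abstract phrases it.

```python
# Counts copied from the abstract: (errors, papers) per arXiv category.
reported = {
    "math.NA": (2271, 23761),
    "math.GT": (862, 13209),
    "math.CT": (0, 93),
}

# Error rate in percent, rounded to one decimal place as in the abstract.
rates = {category: round(100 * errors / papers, 1)
         for category, (errors, papers) in reported.items()}
total_papers = sum(papers for _, papers in reported.values())

print(rates)
print(total_papers)
```

The three listed categories alone add up to 37,063 papers, which is consistent with the "over 37,000 papers" figure.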
[AI-102] Completion of partial structures using Patterson maps with the CrysFormer machine learning model
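[Quick read]: This paper addresses the lack of direct integration between traditional crystallographic methods and deep learning models in protein structure determination, i.e. how to couple experimental measurements such as X-ray crystallographic diffraction data more tightly with learning-based structure prediction. The key to the solution is a hybrid 3D vision transformer and convolutional network that takes two kinds of input: Patterson maps, which are directly obtainable from crystallographic data, and "partial structure" template maps derived from predicted structures in the AlphaFold Protein Structure Database with residues subsequently omitted. The model predicts electron density maps. The authors report improved phases, completion of the regions missing from the templates, and better agreement between the electron density maps and the ground-truth atomic structures.
Link: https://arxiv.org/abs/2511.10440
Authors: Tom Pan,Evan Dramko,Mitchell D. Miller,Anastasios Kyrillidis,George N. Phillips Jr
Affiliations: Unknown
Categories: Biological Physics (physics.bio-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 15 pages, accepted at Acta Crystallographica Section D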
Abstract:Protein structure determination has long been one of the primary challenges of structural biology, to which deep machine learning (ML)-based approaches have increasingly been applied. However, these ML models generally do not incorporate the experimental measurements directly, such as X-ray crystallographic diffraction data. To this end, we explore an approach that more tightly couples these traditional crystallographic and recent ML-based methods, by training a hybrid 3-d vision transformer and convolutional network on inputs from both domains. We make use of two distinct input constructs: Patterson maps, which are directly obtainable from crystallographic data, and "partial structure" template maps derived from predicted structures deposited in the AlphaFold Protein Structure Database with subsequently omitted residues. With these, we predict electron density maps that are then post-processed into atomic models through standard crystallographic refinement processes. Introducing an initial dataset of small protein fragments taken from Protein Data Bank entries and placing them in hypothetical crystal settings, we demonstrate that our method is effective at both improving the phases of the crystallographic structure factors and completing the regions missing from partial structure templates, as well as improving the agreement of the electron density maps with the ground truth atomic structures.
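A Patterson map is the autocorrelation of the electron density, which is why it can be computed from measured intensities without phase information. The one-dimensional toy below illustrates that relationship; the periodic "unit cell" of eight grid points and the two atom weights are invented for the example and have nothing to do with the paper's data.

```python
def patterson_1d(density):
    """Periodic autocorrelation P(u) = sum_x rho(x) * rho(x + u)."""
    n = len(density)
    return [sum(density[x] * density[(x + u) % n] for x in range(n))
            for u in range(n)]

# Hypothetical 1-D unit cell: two point atoms at positions 1 and 4.
density = [0, 3, 0, 0, 2, 0, 0, 0]
patterson = patterson_1d(density)
print(patterson)
```

Besides the origin peak, the map has peaks at u = 3 and u = 5, i.e. at plus and minus the interatomic vector modulo the cell length, so it encodes distances between atoms rather than their absolute positions.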
[AI-103] MATAI: A Generalist Machine Learning Framework for Property Prediction and Inverse Design of Advanced Alloys
Link: https://arxiv.org/abs/2511.10108
Authors: Yanchen Deng,Chendong Zhao,Yixuan Li,Bijun Tang,Xinrun Wang,Zhonghan Zhang,Yuhao Lu,Penghui Yang,Jianguo Huang,Yushan Xiao,Cuntai Guan,Zheng Liu,Bo An
Affiliations: Unknown
Categories: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
Comments:
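[AI-104] fastbmRAG: A Fast Graph-Based RAG Framework for Efficient Processing of Large-Scale Biomedical Literature
[Quick read]: This paper addresses the high computational cost of graph-based Retrieval-Augmented Generation (RAG) systems on biomedical literature, which limits their use on large-scale datasets. The key to the solution is fastbmRAG, which builds the knowledge graph in two stages: it first drafts graphs from paper abstracts, then refines them using the main texts, guided by vector-based entity linking, which reduces redundancy and computational load. The authors report that fastbmRAG is more than 10x faster than existing graph-RAG tools while achieving better coverage and accuracy.
Link: https://arxiv.org/abs/2511.10014
Authors: Guofeng Meng,Li Shen,Qiuyan Zhong,Wei Wang,Haizhou Zhang,Xiaozhen Wang
Affiliations: Unknown
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
Comments: 8 pages, 2 figures, 1 table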
Abstract:Large language models (LLMs) are rapidly transforming various domains, including biomedicine and healthcare, and demonstrate remarkable potential from scientific research to new drug discovery. Graph-based retrieval-augmented generation (RAG) systems, as a useful application of LLMs, can improve contextual reasoning through structured entity and relationship identification from long-context knowledge, e.g. biomedical literature. Despite their many advantages over naive RAG, most graph-based RAG systems are computationally intensive, which limits their application to large-scale datasets. To address this issue, we introduce fastbmRAG, a fast graph-based RAG optimized for biomedical literature. Utilizing the well-organized structure of biomedical papers, fastbmRAG divides the construction of the knowledge graph into two stages: first, drafting graphs using abstracts; and second, refining them using main texts guided by vector-based entity linking, which minimizes redundancy and computational load. Our evaluations demonstrate that fastbmRAG is over 10x faster than existing graph-RAG tools and achieves superior coverage and accuracy with respect to the input knowledge. FastbmRAG provides a fast solution for quickly understanding, summarizing, and answering questions about biomedical literature on a large scale. FastbmRAG is publicly available at this https URL.
[AI-105] Multiple Treatments Causal Effects Estimation with Task Embeddings and Balanced Representation Learning
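[Quick read]: This paper addresses the accurate estimation of single treatment effects and interaction effects when multiple treatments are applied simultaneously. Existing methods either lack parameter sharing among related treatments or estimate unnecessary latent variables, which reduces the accuracy of causal effect estimation. The key to the solution is a deep learning framework that combines a task embedding network with a representation learning network with a balancing penalty. The task embedding network encodes elements common to single effects and contributions specific to interaction effects, which enables parameter sharing across related treatment patterns. The representation learning network learns representations nonparametrically from observed covariates while reducing the distance between representation distributions across treatment patterns, which mitigates selection bias and avoids model misspecification.
Link: https://arxiv.org/abs/2511.09814
Authors: Yuki Murakami,Takumi Hattori,Kohsuke Kubota
Affiliations: Unknown
Categories: Methodology (stat.ME); Artificial Intelligence (cs.AI)
Comments: 33 pages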
Abstract:The simultaneous application of multiple treatments is increasingly common in many fields, such as healthcare and marketing. In such scenarios, it is important to estimate the single treatment effects and the interaction treatment effects that arise from treatment combinations. Previous studies have proposed using independent outcome networks with subnetworks for interactions, or combining task embedding networks that capture treatment similarity with variational autoencoders. However, these methods suffer from the lack of parameter sharing among related treatments, or the estimation of unnecessary latent variables reduces the accuracy of causal effect estimation. To address these issues, we propose a novel deep learning framework that incorporates a task embedding network and a representation learning network with the balancing penalty. The task embedding network enables parameter sharing across related treatment patterns because it encodes elements common to single effects and contributions specific to interaction effects. The representation learning network with the balancing penalty learns representations nonparametrically from observed covariates while reducing distances in representation distributions across different treatment patterns. This process mitigates selection bias and avoids model misspecification. Simulation studies demonstrate that the proposed method outperforms existing baselines, and application to real-world marketing datasets confirms the practical implications and utility of our framework.
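The idea of a balancing penalty, reducing the distance between representation distributions across treatment patterns, can be illustrated with the simplest possible distance. The sketch below uses the squared distance between group means, averaged over pairs of treatment patterns; this specific distance is an assumption made here, since the abstract does not say which one the authors use, and the data are invented.

```python
from itertools import combinations

def group_means(representations, patterns):
    """Mean representation vector for each treatment pattern."""
    grouped = {}
    for vector, pattern in zip(representations, patterns):
        grouped.setdefault(pattern, []).append(vector)
    return {
        pattern: [sum(column) / len(vectors) for column in zip(*vectors)]
        for pattern, vectors in grouped.items()
    }

def balancing_penalty(representations, patterns):
    """Average squared distance between group means over all pattern pairs."""
    means = group_means(representations, patterns)
    pairs = list(combinations(sorted(means), 2))
    if not pairs:
        return 0.0
    total = sum(
        sum((a - b) ** 2 for a, b in zip(means[p], means[q]))
        for p, q in pairs
    )
    return total / len(pairs)

# Hypothetical 2-D representations for two treatment patterns,
# each pattern encoded as a tuple of treatment indicators.
patterns = [(1, 0), (1, 0), (0, 1), (0, 1)]
balanced = [[0.0, 0.0], [2.0, 2.0], [1.0, 1.0], [1.0, 1.0]]
shifted = [[0.0, 0.0], [2.0, 2.0], [2.0, 1.0], [2.0, 1.0]]

print(balancing_penalty(balanced, patterns), balancing_penalty(shifted, patterns))
```

In training, a term like this would be added to the outcome loss with a weight, so that the learned representation looks similar across treatment groups.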
[AI-106] Solvaformer: an SE(3)-equivariant graph transformer for small molecule solubility prediction
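[Quick read]: This paper addresses small molecule solubility prediction, where experimental measurement is costly and many learning approaches either depend on quantum-derived descriptors or offer limited interpretability. The key to the solution is Solvaformer, a geometry-aware graph transformer that combines intramolecular SE(3)-equivariant attention with intermolecular scalar attention, enabling cross-molecular communication without imposing spurious relative geometry. The model is trained with an alternating-batch regimen on quantum-mechanical data (CombiSolv-QM) and experimental measurements (BigSolDB 2.0). The authors report the strongest overall performance among the learned models, and attention attributions that recover known intra- versus inter-molecular hydrogen-bonding patterns behind solubility differences in positional isomers.
Link: https://arxiv.org/abs/2511.09774
Authors: Jonathan Broadbent,Michael Bailey,Mingxuan Li,Abhishek Paul,Louis De Lescure,Paul Chauvin,Lorenzo Kogler-Anele,Yasser Jangjou,Sven Jager
Affiliations: Unknown
Categories: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI)
Comments: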
Abstract:Accurate prediction of small molecule solubility using material-sparing approaches is critical for accelerating synthesis and process optimization, yet experimental measurement is costly and many learning approaches either depend on quantum-derived descriptors or offer limited interpretability. We introduce Solvaformer, a geometry-aware graph transformer that models solutions as multiple molecules with independent SE(3) symmetries. The architecture combines intramolecular SE(3)-equivariant attention with intermolecular scalar attention, enabling cross-molecular communication without imposing spurious relative geometry. We train Solvaformer in a multi-task setting to predict both solubility (log S) and solvation free energy, using an alternating-batch regimen that trains on quantum-mechanical data (CombiSolv-QM) and on experimental measurements (BigSolDB 2.0). Solvaformer attains the strongest overall performance among the learned models and approaches a DFT-assisted gradient-boosting baseline, while outperforming an EquiformerV2 ablation and sequence-based alternatives. In addition, token-level attention produces chemically coherent attributions: case studies recover known intra- vs. inter-molecular hydrogen-bonding patterns that govern solubility differences in positional isomers. Taken together, Solvaformer provides an accurate, scalable, and interpretable approach to solution-phase property prediction by uniting geometric inductive bias with a mixed dataset training strategy on complementary computational and experimental data.
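One plausible reading of the alternating-batch regimen is a scheduler that interleaves batches from the two data sources and tags each with the task it trains. The sketch below follows that reading; the task names, the mapping of the quantum-mechanical data to solvation free energy and the experimental data to log S, and the handling of unequal stream lengths are all assumptions, not details confirmed by the abstract.

```python
from itertools import zip_longest

def alternating_batches(qm_batches, experimental_batches):
    """Interleave two batch streams, tagging each batch with its task."""
    for qm, exp in zip_longest(qm_batches, experimental_batches):
        if qm is not None:
            yield ("solvation_free_energy", qm)
        if exp is not None:
            yield ("log_s", exp)

# Hypothetical batches, represented here as lists of sample identifiers.
qm_batches = [["qm0"], ["qm1"], ["qm2"]]
experimental_batches = [["exp0"], ["exp1"]]

schedule = list(alternating_batches(qm_batches, experimental_batches))
for task, batch in schedule:
    print(task, batch)
```

A training loop would pick the loss head from the task tag and apply one optimizer step per yielded batch.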
[AI-107] Brian Intensify: An Adaptive Machine Learning Framework for Auditory EEG Stimulation and Cognitive Enhancement in FXS
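[Quick read]: This paper addresses the deficits in attention, sensory processing, and cognitive function that are linked to disrupted cortical oscillatory activity, particularly in the alpha and gamma bands, in neurodevelopmental disorders such as Fragile X Syndrome (FXS). The key to the solution is an adaptive machine learning-based brain-computer interface (BCI) system that modulates neural oscillations through frequency-specific auditory stimulation. It uses EEG features (Peak Alpha Power, Peak Gamma Power, and Alpha Power per second per channel) as biomarkers to predict individual EEG responses and adjust stimulation parameters dynamically, with the aim of enhancing cognitive readiness.
Link: https://arxiv.org/abs/2511.09765
Authors: Zag ElSayed,Grace Westerkamp,Jack Yanchen Liu,Ernest Pedapati
Affiliations: Unknown
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: 7 pages, 4 figures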
Abstract:Neurodevelopmental disorders such as Fragile X Syndrome (FXS) and Autism Spectrum Disorder (ASD) are characterized by disrupted cortical oscillatory activity, particularly in the alpha and gamma frequency bands. These abnormalities are linked to deficits in attention, sensory processing, and cognitive function. In this work, we present an adaptive machine learning-based brain-computer interface (BCI) system designed to modulate neural oscillations through frequency-specific auditory stimulation to enhance cognitive readiness in individuals with FXS. EEG data were recorded from 38 participants using a 128-channel system under a stimulation paradigm consisting of a 30-second baseline (no stimulus) followed by 60-second auditory entrainment episodes at 7Hz, 9Hz, 11Hz, and 13Hz. A comprehensive analysis of power spectral features (Alpha, Gamma, Delta, Theta, Beta) and cross-frequency coupling metrics (Alpha-Gamma, Alpha-Beta, etc.) was conducted. The results identified Peak Alpha Power, Peak Gamma Power, and Alpha Power per second per channel as the most discriminative biomarkers. The 13Hz stimulation condition consistently elicited a significant increase in Alpha activity and suppression of Gamma activity, aligning with our optimization objective. A supervised machine learning framework was developed to predict EEG responses and dynamically adjust stimulation parameters, enabling real-time, subject-specific adaptation. This work establishes a novel EEG-driven optimization framework for cognitive neuromodulation, providing a foundational model for next-generation AI-integrated BCI systems aimed at personalized neurorehabilitation in FXS and related disorders.
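Band power features such as alpha and gamma power are typically derived from the spectrum of the EEG signal. The sketch below computes them for a synthetic one-channel signal with a plain discrete Fourier transform. The band limits (8 to 13 Hz for alpha, 30 to 45 Hz for gamma), the sampling rate, and the test signal are assumptions for illustration; the paper's actual feature pipeline is not described in the abstract.

```python
import cmath
import math

def dft_power(signal, k):
    """Normalized power |X_k|^2 / N^2 of DFT bin k."""
    n = len(signal)
    coeff = sum(s * cmath.exp(-2j * math.pi * k * t / n)
                for t, s in enumerate(signal))
    return abs(coeff) ** 2 / n ** 2

def band_power(signal, fs, low_hz, high_hz):
    """Sum the power of all positive-frequency bins inside [low_hz, high_hz]."""
    n = len(signal)
    return sum(dft_power(signal, k) for k in range(1, n // 2)
               if low_hz <= k * fs / n <= high_hz)

# Synthetic 1-second recording at 128 Hz: a strong 10 Hz rhythm
# plus a weaker 40 Hz rhythm.
fs = 128
signal = [2.0 * math.sin(2 * math.pi * 10 * t / fs)
          + 1.0 * math.sin(2 * math.pi * 40 * t / fs)
          for t in range(fs)]

alpha = band_power(signal, fs, 8, 13)
gamma = band_power(signal, fs, 30, 45)
print(round(alpha, 6), round(gamma, 6))
```

For real recordings one would use a windowed estimator such as Welch's method and average over channels; the direct DFT is used here only to keep the example dependency-free.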
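[AI-108] TomoGraphView: 3D Medical Image Classification with Omnidirectional Slice Representations and Graph Neural Networks
[Quick read]: This paper addresses 3D medical image classification, which is difficult because of the complex spatial relationships and long-range dependencies in volumetric data, and the limitations of existing methods that apply 2D vision foundation models to 3D volumes: slicing along canonical planes may fail to capture the spatial extent of target structures, and slice-wise aggregation rarely preserves the volumetric structure. The key to the solution is TomoGraphView, a framework that integrates omnidirectional volume slicing with spherical graph-based feature aggregation to preserve spatial coherence across slices.
Link: https://arxiv.org/abs/2511.09605
Authors: Johannes Kiechle,Stefan M. Fischer,Daniel M. Lang,Cosmin I. Bercea,Matthew J. Nyflot,Lina Felsner,Julia A. Schnabel,Jan C. Peeken
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: Preprint submitted to Medical Image Analysis (MedIA)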
Abstract:The growing number of medical tomography examinations has necessitated the development of automated methods capable of extracting comprehensive imaging features to facilitate downstream tasks such as tumor characterization, while assisting physicians in managing their growing workload. However, 3D medical image classification remains a challenging task due to the complex spatial relationships and long-range dependencies inherent in volumetric data. Training models from scratch suffers from low data regimes, and the absence of 3D large-scale multimodal datasets has limited the development of 3D medical imaging foundation models. Recent studies, however, have highlighted the potential of 2D vision foundation models, originally trained on natural images, as powerful feature extractors for medical image analysis. Despite these advances, existing approaches that apply 2D models to 3D volumes via slice-based decomposition remain suboptimal. Conventional volume slicing strategies, which rely on canonical planes such as axial, sagittal, or coronal, may inadequately capture the spatial extent of target structures when these are misaligned with standardized viewing planes. Furthermore, existing slice-wise aggregation strategies rarely account for preserving the volumetric structure, resulting in a loss of spatial coherence across slices. To overcome these limitations, we propose TomoGraphView, a novel framework that integrates omnidirectional volume slicing with spherical graph-based feature aggregation. We publicly share our accessible code base at this http URL and provide a user-friendly library for omnidirectional volume slicing at this https URL.
[AI-109] Prostate-VarBench: A Benchmark with Interpretable TabNet Framework for Prostate Cancer Variant Classification
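【速读】: This paper addresses the limited clinical utility of prostate cancer genomics caused by Variants of Uncertain Significance (VUS). The core challenges are inconsistent annotations across sources and the absence of a prostate-specific benchmark dataset, which delay diagnosis and therapy. The key to the solution is Prostate-VarBench, a prostate-specific benchmark resource that integrates COSMIC (somatic cancer mutations), ClinVar (expert-curated clinical variants), and TCGA-PRAD (prostate tumor genomics from The Cancer Genome Atlas) into a harmonized dataset of 193,278 variants, with patient- or gene-aware splits to prevent data leakage. The authors also correct a Variant Effect Predictor (VEP) issue in which merged multi-transcript records made the clinical significance fields ambiguous, standardize 56 interpretable features, and incorporate AlphaMissense pathogenicity scores to improve missense variant classification. An interpretable TabNet model trained on this resource reaches 89.9% accuracy on the held-out test set, and the VEP correction yields a 6.5% absolute reduction in VUS. The model's step-wise sparse masks provide per-case rationales consistent with molecular tumor board review practices.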
链接: https://arxiv.org/abs/2511.09576
作者: Abraham Francisco Arellano Tavara,Umesh Kumar,Jathurshan Pradeepkumar,Jimeng Sun
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Variants of Uncertain Significance (VUS) limit the clinical utility of prostate cancer genomics by delaying diagnosis and therapy when evidence for pathogenicity or benignity is incomplete. Progress is further limited by inconsistent annotations across sources and the absence of a prostate-specific benchmark for fair comparison. We introduce Prostate-VarBench, a curated pipeline for creating prostate-specific benchmarks that integrates COSMIC (somatic cancer mutations), ClinVar (expert-curated clinical variants), and TCGA-PRAD (prostate tumor genomics from The Cancer Genome Atlas) into a harmonized dataset of 193,278 variants supporting patient- or gene-aware splits to prevent data leakage. To ensure data integrity, we corrected a Variant Effect Predictor (VEP) issue that merged multiple transcript records, introducing ambiguity in clinical significance fields. We then standardized 56 interpretable features across eight clinically relevant tiers, including population frequency, variant type, and clinical context. AlphaMissense pathogenicity scores were incorporated to enhance missense variant classification and reduce VUS uncertainty. Building on this resource, we trained an interpretable TabNet model to classify variant pathogenicity, whose step-wise sparse masks provide per-case rationales consistent with molecular tumor board review practices. On the held-out test set, the model achieved 89.9% accuracy with balanced class metrics, and the VEP correction yields a 6.5% absolute reduction in VUS.
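The gene-aware split mentioned in the abstract can be illustrated with a minimal sketch. The function name `gene_aware_split`, the `"gene"` field, and the hash-bucket rule are assumptions for illustration only and are not taken from Prostate-VarBench; the point is simply that every variant of a given gene lands on the same side of the split, so no gene leaks from training into test.

```python
import hashlib

def gene_aware_split(variants, test_fraction=0.2):
    """Split variant records so that each gene appears in only one partition.

    `variants` is a list of dicts with a "gene" key (hypothetical schema).
    """
    train, test = [], []
    for variant in variants:
        # Hash the gene symbol so the assignment is deterministic across runs.
        digest = hashlib.sha256(variant["gene"].encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % 1000 / 1000.0  # pseudo-uniform value in [0, 1)
        (test if bucket < test_fraction else train).append(variant)
    return train, test
```

Because the assignment depends only on the gene symbol, the realized test fraction is approximate and varies with how variants are distributed across genes.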
[AI-110] General Intelligence-based Fragmentation (GIF): A framework for peak-labeled spectra simulation
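【速读】: This paper addresses the low annotation rate of mass spectrometry (MS) spectra in metabolomics, a bottleneck that constrains progress in the field. The key to the solution is a new framework, General Intelligence-based Fragmentation (GIF), which guides pretrained large language models (LLMs) through spectra simulation using structured prompting and reasoning. GIF combines tagging, structured inputs/outputs, system prompts, instruction-based prompts, and iterative refinement to give LLMs systematic guidance on complex scientific tasks, covering both fragmentation simulation and intensity prediction. With GIF, GPT-4o and GPT-4o-mini reach cosine similarities of 0.36 and 0.35, respectively, between simulated and true spectra, outperforming other pretrained LLMs and several deep learning baselines, which supports the framework's potential for explainable reasoning and human-in-the-loop workflows.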
链接: https://arxiv.org/abs/2511.09571
作者: Margaret R. Martin,Soha Hassoun
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
备注:
Abstract:Despite growing reference libraries and advanced computational tools, progress in the field of metabolomics remains constrained by low rates of annotating measured spectra. The recent developments of large language models (LLMs) have led to strong performance across a wide range of generation and reasoning tasks, spurring increased interest in LLMs’ application to domain-specific scientific challenges, such as mass spectra annotation. Here, we present a novel framework, General Intelligence-based Fragmentation (GIF), that guides pretrained LLMs through spectra simulation using structured prompting and reasoning. GIF utilizes tagging, structured inputs/outputs, system prompts, instruction-based prompts, and iterative refinement. Indeed, GIF offers a structured alternative to ad hoc prompting, underscoring the need for systematic guidance of LLMs on complex scientific tasks. Using GIF, we evaluate current generalist LLMs’ ability to use reasoning towards fragmentation and to perform intensity prediction after fine-tuning. We benchmark performance on a novel QA dataset, the MassSpecGym QA-sim dataset, that we derive from the MassSpecGym dataset. Through these implementations of GIF, we find that GPT-4o and GPT-4o-mini achieve a cosine similarity of 0.36 and 0.35 between the simulated and true spectra, respectively, outperforming other pretrained models including GPT-5, Llama-3.1, and ChemDFM, despite GPT-5’s recency and ChemDFM’s domain specialization. GIF outperforms several deep learning baselines. Our evaluation of GIF highlights the value of using LLMs not only for spectra simulation but for enabling human-in-the-loop workflows and structured, explainable reasoning in molecular fragmentation.
[AI-111] Mamba-driven multi-perspective structural understanding for molecular ground-state conformation prediction
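【速读】: This paper addresses insufficient understanding of molecular structure in molecular ground-state conformation prediction, in particular how to efficiently capture three core structural elements: atom types, atom positions, and connections between atoms. The key to the solution is MPSU-Mamba, a multi-perspective structural understanding framework built on the Mamba state space model (SSM). It uses three dedicated scanning strategies to build a comprehensive perception of complex molecules and introduces a bright-channel guided mechanism to identify critical conformation-related atom information. The method significantly outperforms existing approaches on QM9 and Molecule3D and remains superior when training samples are scarce.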
链接: https://arxiv.org/abs/2511.09564
作者: Yuxin Gou,Aming Wu,Richang Hong,Meng Wang
机构: 未知
类目: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI)
备注:
Abstract:A comprehensive understanding of molecular structures is important for the prediction of molecular ground-state conformation involving property information. Meanwhile, state space models (e.g., Mamba) have recently emerged as a promising mechanism for long sequence modeling and have achieved remarkable results in various language and vision tasks. However, towards molecular ground-state conformation prediction, exploiting Mamba to understand molecular structure is underexplored. To this end, we strive to design a generic and efficient framework with Mamba to capture critical components. In general, molecular structure could be considered to consist of three elements, i.e., atom types, atom positions, and connections between atoms. Thus, considering the three elements, an approach of Mamba-driven multi-perspective structural understanding (MPSU-Mamba) is proposed to localize molecular ground-state conformation. Particularly, for complex and diverse molecules, three different kinds of dedicated scanning strategies are explored to construct a comprehensive perception of corresponding molecular structures. And a bright-channel guided mechanism is defined to discriminate the critical conformation-related atom information. Experimental results on QM9 and Molecule3D datasets indicate that MPSU-Mamba significantly outperforms existing methods. Furthermore, we observe that for the case of few training samples, MPSU-Mamba still achieves superior performance, demonstrating that our method is indeed beneficial for understanding molecular structures.
机器学习
[LG-0] Robot Crash Course: Learning Soft and Stylized Falling
链接: https://arxiv.org/abs/2511.10635
作者: Pascal Strauch,David Müller,Sammy Christen,Agon Serifi,Ruben Grandia,Espen Knoop,Moritz Bächer
类目: Robotics (cs.RO); Machine Learning (cs.LG)
备注:
Abstract:Despite recent advances in robust locomotion, bipedal robots operating in the real world remain at risk of falling. While most research focuses on preventing such events, we instead concentrate on the phenomenon of falling itself. Specifically, we aim to reduce physical damage to the robot while providing users with control over a robot’s end pose. To this end, we propose a robot agnostic reward function that balances the achievement of a desired end pose with impact minimization and the protection of critical robot parts during reinforcement learning. To make the policy robust to a broad range of initial falling conditions and to enable the specification of an arbitrary and unseen end pose at inference time, we introduce a simulation-based sampling strategy of initial and end poses. Through simulated and real-world experiments, our work demonstrates that even bipedal robots can perform controlled, soft falls.
[LG-1] Algorithm Design and Stronger Guarantees for the Improving Multi-Armed Bandits Problem
链接: https://arxiv.org/abs/2511.10619
作者: Avrim Blum,Marten Garicano,Kavya Ravichandran,Dravyansh Sharma
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: 25 pages
Abstract:The improving multi-armed bandits problem is a formal model for allocating effort under uncertainty, motivated by scenarios such as investing research effort into new technologies, performing clinical trials, and hyperparameter selection from learning curves. Each pull of an arm provides reward that increases monotonically with diminishing returns. A growing line of work has designed algorithms for improving bandits, albeit with somewhat pessimistic worst-case guarantees. Indeed, strong lower bounds of \Omega(k) and \Omega(\sqrt{k}) multiplicative approximation factors are known for both deterministic and randomized algorithms (respectively) relative to the optimal arm, where k is the number of bandit arms. In this work, we propose two new parameterized families of bandit algorithms and bound the sample complexity of learning the near-optimal algorithm from each family using offline data. The first family we define includes the optimal randomized algorithm from prior work. We show that an appropriately chosen algorithm from this family can achieve stronger guarantees, with optimal dependence on k , when the arm reward curves satisfy additional properties related to the strength of concavity. Our second family contains algorithms that both guarantee best-arm identification on well-behaved instances and revert to worst case guarantees on poorly-behaved instances. Taking a statistical learning perspective on the bandit rewards optimization problem, we achieve stronger data-dependent guarantees without the need for actually verifying whether the assumptions are satisfied.
[LG-2] Pretrained Joint Predictions for Scalable Batch Bayesian Optimization of Molecular Designs
链接: https://arxiv.org/abs/2511.10590
作者: Miles Wang-Henderson,Ben Kaufman,Edward Williams,Ryan Pederson,Matteo Rossi,Owen Howell,Carl Underkoffler,Narbe Mardirossian,John Parkhill
类目: Machine Learning (cs.LG)
备注:
Abstract:Batched synthesis and testing of molecular designs is the key bottleneck of drug development. There has been great interest in leveraging biomolecular foundation models as surrogates to accelerate this process. In this work, we show how to obtain scalable probabilistic surrogates of binding affinity for use in Batch Bayesian Optimization (Batch BO). This demands parallel acquisition functions that hedge between designs and the ability to rapidly sample from a joint predictive density to approximate them. Through the framework of Epistemic Neural Networks (ENNs), we obtain scalable joint predictive distributions of binding affinity on top of representations taken from large structure-informed models. Key to this work is an investigation into the importance of prior networks in ENNs and how to pretrain them on synthetic data to improve downstream performance in Batch BO. Their utility is demonstrated by rediscovering known potent EGFR inhibitors on a semi-synthetic benchmark in up to 5x fewer iterations, as well as potent inhibitors from a real-world small-molecule library in up to 10x fewer iterations, offering a promising solution for large-scale drug discovery applications.
[LG-3] Tight Robustness Certification through the Convex Hull of \ell_0 Attacks
链接: https://arxiv.org/abs/2511.10576
作者: Yuval Shapira,Dana Drachsler-Cohen
类目: Machine Learning (cs.LG)
备注:
Abstract:Few-pixel attacks mislead a classifier by modifying a few pixels of an image. Their perturbation space is an \ell_0 -ball, which is not convex, unlike \ell_p -balls for p\geq1 . However, existing local robustness verifiers typically scale by relying on linear bound propagation, which captures convex perturbation spaces. We show that the convex hull of an \ell_0 -ball is the intersection of its bounding box and an asymmetrically scaled \ell_1 -like polytope. The volumes of the convex hull and this polytope are nearly equal as the input dimension increases. We then show a linear bound propagation that precisely computes bounds over the convex hull and is significantly tighter than bound propagations over the bounding box or our \ell_1 -like polytope. This bound propagation scales the state-of-the-art \ell_0 verifier on its most challenging robustness benchmarks by 1.24x-7.07x, with a geometric mean of 3.16.
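To make the geometric claim concrete, the sketch below tests whether a point lies in the convex hull of an \ell_0 -ball around an input x, for inputs in [0,1]^n with at most t perturbed coordinates. The characterization used here (stay inside the box, and keep the sum of per-coordinate deviations, each normalized by the room left toward the box boundary in that direction, at or below t) is one reading of "bounding box intersected with an asymmetrically scaled \ell_1 -like polytope". The function names and the [0,1] pixel range are assumptions, not the paper's code.

```python
def scaled_l1_deviation(x, y):
    """Asymmetrically scaled l1-like distance of y from x.

    Assumes both x and y lie in [0, 1]^n; each coordinate's deviation is
    divided by the maximum deviation possible in that direction.
    """
    total = 0.0
    for xi, yi in zip(x, y):
        if yi > xi:
            total += (yi - xi) / (1.0 - xi)  # yi <= 1 implies xi < 1
        elif yi < xi:
            total += (xi - yi) / xi          # yi >= 0 implies xi > 0
    return total

def in_l0_hull(x, y, t, tol=1e-9):
    """True if y is in the convex hull of {z in [0,1]^n : ||z - x||_0 <= t}."""
    in_box = all(0.0 <= yi <= 1.0 for yi in y)
    # The box check runs first, so the divisions above are always safe.
    return in_box and scaled_l1_deviation(x, y) <= t + tol
```

For x = (0.5, 0.5, 0.5) and t = 1, the point (0.75, 0.75, 0.5) changes two pixels and so lies outside the \ell_0 -ball, yet it is the midpoint of two one-pixel attacks and therefore inside the hull. The point (1.0, 1.0, 0.5) is inside the bounding box but outside the hull, which illustrates why the hull can give tighter bounds than the box.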
[LG-4] Semi-Unified Sparse Dictionary Learning with Learnable Top-K LISTA and FISTA Encoders
链接: https://arxiv.org/abs/2511.10575
作者: Fengsheng Lin,Shengyi Yan,Trac Duy Tran
类目: Machine Learning (cs.LG)
备注:
Abstract:We present a semi-unified sparse dictionary learning framework that bridges the gap between classical sparse models and modern deep architectures. Specifically, the method integrates strict Top- K LISTA and its convex FISTA-based variant (LISTAConv) into the discriminative LC-KSVD2 model, enabling co-evolution between the sparse encoder and the dictionary under supervised or unsupervised regimes. This unified design retains the interpretability of traditional sparse coding while benefiting from efficient, differentiable training. We further establish a PALM-style convergence analysis for the convex variant, ensuring theoretical stability under block alternation. Experimentally, our method achieves 95.6% on CIFAR-10, 86.3% on CIFAR-100, and 88.5% on TinyImageNet with faster convergence and lower memory cost ( 4GB GPU). The results confirm that the proposed LC-KSVD2 + LISTA/LISTAConv pipeline offers an interpretable and computationally efficient alternative for modern deep architectures.
[LG-5] Belief Net: A Filter-Based Framework for Learning Hidden Markov Models from Observations
链接: https://arxiv.org/abs/2511.10571
作者: Reginald Zhiyan Chen,Heng-Sheng Chang,Prashant G. Mehta
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Probability (math.PR)
备注: 19 pages, 7 pages, submitted to conference: L4DC 2026
Abstract:Hidden Markov Models (HMMs) are fundamental for modeling sequential data, yet learning their parameters from observations remains challenging. Classical methods like the Baum-Welch (EM) algorithm are computationally intensive and prone to local optima, while modern spectral algorithms offer provable guarantees but may produce probability outputs outside valid ranges. This work introduces Belief Net, a novel framework that learns HMM parameters through gradient-based optimization by formulating the HMM’s forward filter as a structured neural network. Unlike black-box Transformer models, Belief Net’s learnable weights are explicitly the logits of the initial distribution, transition matrix, and emission matrix, ensuring full interpretability. The model processes observation sequences using a decoder-only architecture and is trained end-to-end with standard autoregressive next-observation prediction loss. On synthetic HMM data, Belief Net achieves superior convergence speed compared to Baum-Welch, successfully recovering parameters in both undercomplete and overcomplete settings where spectral methods fail. Comparisons with Transformer-based models are also presented on real-world language data.
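The idea of treating the HMM forward filter as a network whose weights are logits can be sketched as follows. This is a minimal illustration under assumptions: the function names are hypothetical, the code computes only the next-observation negative log-likelihood, and the gradient-based training loop (which would need an autodiff framework) is omitted.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def forward_filter_nll(init_logits, trans_logits, emit_logits, observations):
    """Run the HMM forward filter and return (negative log-likelihood, final belief).

    The parameters are unconstrained logits; softmax turns them into the
    initial distribution, transition rows, and emission rows.
    """
    belief = softmax(init_logits)
    A = [softmax(row) for row in trans_logits]  # A[s][s2] = P(next state s2 | state s)
    B = [softmax(row) for row in emit_logits]   # B[s][o]  = P(observation o | state s)
    n = len(belief)
    nll = 0.0
    for obs in observations:
        joint = [belief[s] * B[s][obs] for s in range(n)]
        p_obs = sum(joint)                      # predictive probability of this observation
        nll -= math.log(p_obs)
        posterior = [j / p_obs for j in joint]  # Bayes update on the observation
        # Propagate the posterior one step through the transition matrix.
        belief = [sum(posterior[s] * A[s][s2] for s in range(n)) for s2 in range(n)]
    return nll, belief
```

Minimizing this loss over the three logit arrays corresponds to the autoregressive next-observation objective described in the abstract, and the learned logits can be read back directly as HMM parameters.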
[LG-6] Oya: Deep Learning for Accurate Global Precipitation Estimation
链接: https://arxiv.org/abs/2511.10562
作者: Emmanuel Asiedu Brempong,Mohammed Alewi Hassen,MohamedElfatih MohamedKhair,Vusumuzi Dube,Santiago Hincapie Potes,Olivia Graham,Amanie Brik,Amy McGovern,George Huffman,Jason Hickey
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
备注:
Abstract:Accurate precipitation estimation is critical for hydrological applications, especially in the Global South where ground-based observation networks are sparse and forecasting skill is limited. Existing satellite-based precipitation products often rely on the longwave infrared channel alone or are calibrated with data that can introduce significant errors, particularly at sub-daily timescales. This study introduces Oya, a novel real-time precipitation retrieval algorithm utilizing the full spectrum of visible and infrared (VIS-IR) observations from geostationary (GEO) satellites. Oya employs a two-stage deep learning approach, combining two U-Net models: one for precipitation detection and another for quantitative precipitation estimation (QPE), to address the inherent data imbalance between rain and no-rain events. The models are trained using high-resolution GPM Combined Radar-Radiometer Algorithm (CORRA) v07 data as ground truth and pre-trained on IMERG-Final retrievals to enhance robustness and mitigate overfitting due to the limited temporal sampling of CORRA. By leveraging multiple GEO satellites, Oya achieves quasi-global coverage and demonstrates superior performance compared to existing competitive regional and global precipitation baselines, offering a promising pathway to improved precipitation monitoring and forecasting.
[LG-7] Maximizing Efficiency of Dataset Compression for Machine Learning Potentials With Information Theory
链接: https://arxiv.org/abs/2511.10561
作者: Benjamin Yu,Vincenzo Lordi,Daniel Schwalbe-Koda
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
备注: main text + SI; code at this https URL
Abstract:Machine learning interatomic potentials (MLIPs) balance high accuracy and lower costs compared to density functional theory calculations, but their performance often depends on the size and diversity of training datasets. Large datasets improve model accuracy and generalization but are computationally expensive to produce and train on, while smaller datasets risk discarding rare but important atomic environments and compromising MLIP accuracy/reliability. Here, we develop an information-theoretical framework to quantify the efficiency of dataset compression methods and propose an algorithm that maximizes this efficiency. By framing atomistic dataset compression as an instance of the minimum set cover (MSC) problem over atom-centered environments, our method identifies the smallest subset of structures that contains as much information as possible from the original dataset while pruning redundant information. The approach is extensively demonstrated on the GAP-20 and TM23 datasets, and validated on 64 varied datasets from the ColabFit repository. Across all cases, MSC consistently retains outliers, preserves dataset diversity, and reproduces the long-tail distributions of forces even at high compression rates, outperforming other subsampling methods. Furthermore, MLIPs trained on MSC-compressed datasets exhibit reduced error for out-of-distribution data even in low-data regimes. We explain these results using an outlier analysis and show that such quantitative conclusions could not be achieved with conventional dimensionality reduction methods. The algorithm is implemented in the open-source QUESTS package and can be used for several tasks in atomistic modeling, including data subsampling, outlier detection, and training improved MLIPs at a lower cost.
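The set-cover framing can be illustrated with the classic greedy heuristic, shown here as a sketch only. It assumes each structure has already been mapped to a set of discrete environment identifiers (a simplification), and it is not the QUESTS implementation; `greedy_set_cover` and the input layout are hypothetical.

```python
def greedy_set_cover(structure_envs):
    """Pick structures until every environment is covered at least once.

    `structure_envs` maps a structure id to the set of environment ids it
    contains. Greedy selection is a standard approximation to minimum set cover.
    """
    uncovered = set().union(*structure_envs.values())
    chosen = []
    while uncovered:
        # Take the structure that covers the most still-uncovered environments.
        best = max(structure_envs, key=lambda s: len(structure_envs[s] & uncovered))
        chosen.append(best)
        uncovered -= structure_envs[best]
    return chosen
```

A structure holding a rare environment is always selected at some point, since nothing else can cover it. This matches the abstract's observation that the set-cover view retains outliers.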
[LG-8] Two Americas of Well-Being: Divergent Rural-Urban Patterns of Life Satisfaction and Happiness from 2.6 B Social Media Posts
链接: https://arxiv.org/abs/2511.10542
作者: Stefano Maria Iacus,Giuseppe Porro
类目: Social and Information Networks (cs.SI); Machine Learning (cs.LG); Applications (stat.AP)
备注:
Abstract:Using 2.6 billion geolocated social-media posts (2014-2022) and a fine-tuned generative language model, we construct county-level indicators of life satisfaction and happiness for the United States. We document an apparent rural-urban paradox: rural counties express higher life satisfaction while urban counties exhibit greater happiness. We reconcile this by treating the two as distinct layers of subjective well-being, evaluative vs. hedonic, showing that each maps differently onto place, politics, and time. Republican-leaning areas appear more satisfied in evaluative terms, but partisan gaps in happiness largely flatten outside major metros, indicating context-dependent political effects. Temporal shocks dominate the hedonic layer: happiness falls sharply during 2020-2022, whereas life satisfaction moves more modestly. These patterns are robust across logistic and OLS specifications and align with well-being theory. Interpreted as associations for the population of social-media posts, the results show that large-scale, language-based indicators can resolve conflicting findings about the rural-urban divide by distinguishing the type of well-being expressed, offering a transparent, reproducible complement to traditional surveys.
[LG-9] Holonorm ALT
链接: https://arxiv.org/abs/2511.10504
作者: Daryl Noupa Yongueng,Hamidou Tembine
类目: Machine Learning (cs.LG)
备注: 17 pages, 11 figures, 10 tables, 2 datasets. A stable geometric alternative to LayerNorm and Tanh normalization in deep neural networks
Abstract:Normalization is a key point in transformer training. In Dynamic Tanh (DyT), the author demonstrated that Tanh can be used as an alternative layer normalization (LN) and confirmed the effectiveness of the idea. But Tanh itself faces orthogonality, linearity and distortion problems. Due to that, his proposition cannot be reliable. So we propose a Holonorm (hn) which has residual connections and nonlinearity. Holonorm is suitable for replacing Tanh in the context of normalization. Although the HoloNorm expression could be similar to the softsign function in dimension one, softsign is a componentwise function which is not good for tensors and vectors of great dimension. Holonorm preserves the orthogonality, the direction, the invertibility of the signal. Holonorm is also a suitable metric, maps all vectors into the open unit ball. This prevents exploding activations and improves stability in deep Transformer models. In this work, we have meticulously examined the normalization in transformers and say that Holonorm, a generalized form of softsign function suited as a normalization function this http URL, defined between 0 and 1 hn serves as a percentage, and 1 - \text{Holonorm} is its complement, making it better understandable in evaluating a model.
[LG-10] Weak Relation Enforcement for Kinematic-Informed Long-Term Stock Prediction with Artificial Neural Networks
链接: https://arxiv.org/abs/2511.10494
作者: Stanislav Selitskiy
类目: Machine Learning (cs.LG)
备注:
Abstract:We propose loss function weak enforcement of the velocity relations between time-series points in the Kinematic-Informed artificial Neural Networks (KINN) for long-term stock prediction. Problems of the series volatility, Out-of-Distribution (OOD) test data, and outliers in training data are addressed by (Artificial Neural Networks) ANN’s learning not only future points prediction but also by learning velocity relations between the points, such a way as avoiding unrealistic spurious predictions. The presented loss function penalizes not only errors between predictions and supervised label data, but also errors between the next point prediction and the previous point plus velocity prediction. The loss function is tested on the multiple popular and exotic AR ANN architectures, and around fifteen years of Dow Jones function demonstrated statistically meaningful improvement across the normalization-sensitive activation functions prone to spurious behaviour in the OOD data conditions. Results show that such architecture addresses the issue of the normalization in the auto-regressive models that break the data topology by weakly enforcing the data neighbourhood proximity (relation) preservation during the ANN transformation.
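The two-part loss described in the abstract can be sketched as a data term plus a weakly enforced kinematic term. The exact form below is an assumption: mean squared errors, a scalar `weight`, and the choice to use the last observed point as the "previous point" for the first prediction and the preceding prediction afterwards. `kinematic_loss` is a hypothetical name.

```python
def kinematic_loss(pred_points, pred_velocities, targets, last_observed, weight=1.0):
    """Supervised error plus a penalty tying each point to the previous point plus velocity.

    pred_points[i]     : predicted value at horizon step i
    pred_velocities[i] : predicted change leading into step i
    targets[i]         : supervised label at horizon step i
    last_observed      : last known value before the prediction horizon
    """
    n = len(targets)
    # Data term: ordinary mean squared error against the labels.
    data_term = sum((p - y) ** 2 for p, y in zip(pred_points, targets)) / n
    # Kinematic term: the next point should equal the previous point plus velocity.
    previous = [last_observed] + list(pred_points[:-1])
    kin_term = sum(
        (p - (q + v)) ** 2 for p, q, v in zip(pred_points, previous, pred_velocities)
    ) / n
    return data_term + weight * kin_term
```

Because the relation enters only as a penalty, the network can violate it when the data demand, which is the sense in which the enforcement is weak.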
[LG-11] Don't Waste It: Guiding Generative Recommenders with Structured Human Priors via Multi-head Decoding
链接: https://arxiv.org/abs/2511.10492
作者: Yunkai Zhang,Qiang Zhang,Feng(Ryan)Lin,Ruizhong Qiu,Hanchao Yu,Jason Liu,Yinglong Xia,Zhuoran Yu,Zeyu Zheng,Diji Yang
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:
Abstract:Optimizing recommender systems for objectives beyond accuracy, such as diversity, novelty, and personalization, is crucial for long-term user satisfaction. To this end, industrial practitioners have accumulated vast amounts of structured domain knowledge, which we term human priors (e.g., item taxonomies, temporal patterns). This knowledge is typically applied through post-hoc adjustments during ranking or post-ranking. However, this approach remains decoupled from the core model learning, which is particularly undesirable as the industry shifts to end-to-end generative recommendation foundation models. On the other hand, many methods targeting these beyond-accuracy objectives often require architecture-specific modifications and discard these valuable human priors by learning user intent in a fully unsupervised manner. Instead of discarding the human priors accumulated over years of practice, we introduce a backbone-agnostic framework that seamlessly integrates these human priors directly into the end-to-end training of generative recommenders. With lightweight, prior-conditioned adapter heads inspired by efficient LLM decoding strategies, our approach guides the model to disentangle user intent along human-understandable axes (e.g., interaction types, long- vs. short-term interests). We also introduce a hierarchical composition strategy for modeling complex interactions across different prior types. Extensive experiments on three large-scale datasets demonstrate that our method significantly enhances both accuracy and beyond-accuracy objectives. We also show that human priors allow the backbone model to more effectively leverage longer context lengths and larger model sizes.
[LG-12] Panda: Test-Time Adaptation with Negative Data Augmentation AAAI2026
链接: https://arxiv.org/abs/2511.10481
作者: Ruxi Deng,Wenxuan Bao,Tianxin Wei,Jingrui He
类目: Machine Learning (cs.LG)
备注: Accepted by AAAI 2026
Abstract:Pretrained VLMs exhibit strong zero-shot classification capabilities, but their predictions degrade significantly under common image corruptions. To improve robustness, many test-time adaptation (TTA) methods adopt positive data augmentation (PDA), which generates multiple views of each test sample to reduce prediction variance. However, these methods suffer from two key limitations. First, it introduces considerable computational overhead due to the large number of augmentations required per image. Second, it fails to mitigate prediction bias, where the model tends to predict certain classes disproportionately under corruption, as PDA operates on corrupted inputs and typically does not remove the corruption itself. To address these challenges, we propose Panda, a novel TTA method based on negative data augmentation (NDA). Unlike positive augmentations that preserve object semantics, Panda generates negative augmentations by disrupting semantic content. It divides images into patches and randomly assembles them from a shared patch pool. These negatively augmented images retain corruption-specific features while discarding object-relevant signals. We then subtract the mean feature of these negative samples from the original image feature, effectively suppressing corruption-related components while preserving class-relevant information. This mitigates prediction bias under distribution shifts. Panda allows augmentation to be shared across samples within a batch, resulting in minimal computational overhead. Panda can be seamlessly integrated into existing test-time adaptation frameworks and substantially improve their robustness. Our experiments indicate that Panda delivers superior performance compared to PDA methods, and a wide range of TTA methods exhibit significantly enhanced performance when integrated with Panda. Our code is available at this https URL .
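The two steps described above, assembling negative images from a shared patch pool and subtracting their mean feature, can be sketched in plain Python. This is an illustration under assumptions: images are represented as lists of already-extracted patches, features are plain lists of floats produced by some encoder that is not shown, and both function names are hypothetical rather than Panda's API.

```python
import random

def build_negatives(batch_patches, num_negatives, rng):
    """Assemble negative images by sampling patches from a pool shared across the batch.

    `batch_patches` is a list of images, each given as a list of patches.
    """
    pool = [patch for image in batch_patches for patch in image]
    per_image = len(batch_patches[0])
    # Each negative keeps low-level statistics but scrambles object semantics.
    return [rng.sample(pool, per_image) for _ in range(num_negatives)]

def debias_features(image_features, negative_features):
    """Subtract the mean negative feature from every image feature in the batch."""
    dim = len(negative_features[0])
    count = len(negative_features)
    mean_neg = [sum(f[d] for f in negative_features) / count for d in range(dim)]
    return [[f[d] - mean_neg[d] for d in range(dim)] for f in image_features]
```

The negatives and their mean feature are computed once per batch and reused for every sample, which is where the low overhead mentioned in the abstract comes from.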
[LG-13] Neuronal Fluctuations: Learning Rates vs Participating Neurons
链接: https://arxiv.org/abs/2511.10435
作者: Darsh Pareek,Umesh Kumar,Ruthu Rao,Ravi Janjam
类目: Machine Learning (cs.LG)
备注:
Abstract:Deep Neural Networks (DNNs) rely on inherent fluctuations in their internal parameters (weights and biases) to effectively navigate the complex optimization landscape and achieve robust performance. While these fluctuations are recognized as crucial for escaping local minima and improving generalization, their precise relationship with fundamental hyperparameters remains underexplored. A significant knowledge gap exists concerning how the learning rate, a critical parameter governing the training process, directly influences the dynamics of these neural fluctuations. This study systematically investigates the impact of varying learning rates on the magnitude and character of weight and bias fluctuations within a neural network. We trained a model using distinct learning rates and analyzed the corresponding parameter fluctuations in conjunction with the network’s final accuracy. Our findings aim to establish a clear link between the learning rate’s value, the resulting fluctuation patterns, and overall model performance. By doing so, we provide deeper insights into the optimization process, shedding light on how the learning rate mediates the crucial exploration-exploitation trade-off during training. This work contributes to a more nuanced understanding of hyperparameter tuning and the underlying mechanics of deep learning.
[LG-14] Unlocking Dynamic Inter-Client Spatial Dependencies: A Federated Spatio-Temporal Graph Learning Method for Traffic Flow Forecasting
链接: https://arxiv.org/abs/2511.10434
作者: Feng Wang,Tianxiang Chen,Shuyue Wei,Qian Chu,Yi Zhang,Yifan Sun,Zhiming Zheng
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
Abstract:Spatio-temporal graphs are powerful tools for modeling complex dependencies in traffic time series. However, the distributed nature of real-world traffic data across multiple stakeholders poses significant challenges in modeling and reconstructing inter-client spatial dependencies while adhering to data locality constraints. Existing methods primarily address static dependencies, overlooking their dynamic nature and resulting in suboptimal performance. In response, we propose Federated Spatio-Temporal Graph with Dynamic Inter-Client Dependencies (FedSTGD), a framework designed to model and reconstruct dynamic inter-client spatial dependencies in federated learning. FedSTGD incorporates a federated nonlinear computation decomposition module to approximate complex graph operations. This is complemented by a graph node embedding augmentation module, which alleviates performance degradation arising from the decomposition. These modules are coordinated through a client-server collective learning protocol, which decomposes dynamic inter-client spatial dependency learning tasks into lightweight, parallelizable subtasks. Extensive experiments on four real-world datasets demonstrate that FedSTGD achieves superior performance over state-of-the-art baselines in terms of RMSE, MAE, and MAPE, approaching that of centralized baselines. Ablation studies confirm the contribution of each module in addressing dynamic inter-client spatial dependencies, while sensitivity analysis highlights the robustness of FedSTGD to variations in hyperparameters.
[LG-15] Product distribution learning with imperfect advice NEURIPS2025
链接: https://arxiv.org/abs/2511.10366
作者: Arnab Bhattacharyya,Davin Choo,Philips George John,Themis Gouleakis
类目: Machine Learning (cs.LG)
*备注: Full version (11 pages). To be published in NeurIPS 2025
Abstract:Given i.i.d. samples from an unknown distribution $P$, the goal of distribution learning is to recover the parameters of a distribution that is close to $P$. When $P$ belongs to the class of product distributions on the Boolean hypercube $\{0,1\}^d$, it is known that $\Omega(d/\varepsilon^2)$ samples are necessary to learn $P$ within total variation (TV) distance $\varepsilon$. We revisit this problem when the learner is also given as advice the parameters of a product distribution $Q$. We show that there is an efficient algorithm to learn $P$ within TV distance $\varepsilon$ that has sample complexity $\tilde{O}(d^{1-\eta}/\varepsilon^2)$, if $\|\mathbf{p} - \mathbf{q}\|_1 < \varepsilon d^{0.5 - \Omega(\eta)}$. Here, $\mathbf{p}$ and $\mathbf{q}$ are the mean vectors of $P$ and $Q$ respectively, and no bound on $\|\mathbf{p} - \mathbf{q}\|_1$ is known to the algorithm a priori.
[LG-16] Gradient Flow Equations for Deep Linear Neural Networks: A Survey from a Network Perspective
链接: https://arxiv.org/abs/2511.10362
作者: Joel Wendin,Claudio Altafini
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Dynamical Systems (math.DS)
*备注: Manuscript accepted for publication in SIAM Review (SIREV)
Abstract:The paper surveys recent progress in understanding the dynamics and loss landscape of the gradient flow equations associated with deep linear neural networks, i.e., the gradient descent training dynamics (in the limit when the step size goes to 0) of deep neural networks missing the activation functions and subject to quadratic loss functions. When formulated in terms of the adjacency matrix of the neural network, as we do in the paper, these gradient flow equations form a class of converging matrix ODEs which is nilpotent, polynomial, isospectral, and with conservation laws. The loss landscape is described in detail. It is characterized by infinitely many global minima and saddle points, both strict and nonstrict, but lacks local minima and maxima. The loss function itself is a positive semidefinite Lyapunov function for the gradient flow, and its level sets are unbounded invariant sets of critical points, with critical values that correspond to the number of singular values of the input-output data learnt by the gradient along a certain trajectory. The adjacency matrix representation we use in the paper makes it possible to highlight the existence of a quotient space structure in which each critical value of the loss function is represented only once, while all other critical points with the same critical value belong to the fiber associated with the quotient space. It also makes it easy to determine stable and unstable submanifolds at the saddle points, even when the Hessian fails to obtain them.
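As a rough illustration of the kind of equations the abstract refers to, the block below writes out the gradient flow for a two-layer linear network in the usual layer-wise form. This is a textbook special case under assumed notation (weights $W_1, W_2$, inputs $X$, targets $Y$), not the adjacency-matrix formulation used in the survey; it only shows where a conservation law of this type comes from.

```latex
% Quadratic loss of a two-layer linear network (notation assumed for illustration)
\mathcal{L}(W_1, W_2) = \tfrac{1}{2}\,\lVert Y - W_2 W_1 X \rVert_F^2

% Gradient flow: each weight matrix follows the negative gradient of the loss
\dot{W}_1 = -\frac{\partial \mathcal{L}}{\partial W_1} = W_2^{\top} \,(Y - W_2 W_1 X)\, X^{\top},
\qquad
\dot{W}_2 = -\frac{\partial \mathcal{L}}{\partial W_2} = (Y - W_2 W_1 X)\, X^{\top} W_1^{\top}

% Conservation law: with G = (Y - W_2 W_1 X) X^{\top}, both derivatives below
% equal W_2^{\top} G W_1^{\top} + W_1 G^{\top} W_2, so their difference vanishes
\frac{d}{dt}\left( W_1 W_1^{\top} - W_2^{\top} W_2 \right) = 0
```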
[LG-17] Robust Decentralized Multi-armed Bandits: From Corruption-Resilience to Byzantine-Resilience
链接: https://arxiv.org/abs/2511.10344
作者: Zicheng Hu,Yuchen Wang,Cheng Chen
类目: Machine Learning (cs.LG)
*备注:
Abstract:Decentralized cooperative multi-agent multi-armed bandits (DeCMA2B) considers how multiple agents collaborate in a decentralized multi-armed bandit setting. Though this problem has been extensively studied in previous work, most existing methods remain susceptible to various adversarial attacks. In this paper, we first study DeCMA2B with adversarial corruption, where an adversary can corrupt reward observations of all agents with a limited corruption budget. We propose a robust algorithm, called DeMABAR, which ensures that each agent’s individual regret suffers only an additive term proportional to the corruption budget. Then we consider a more realistic scenario where the adversary can only attack a small number of agents. Our theoretical analysis shows that the DeMABAR algorithm can also almost completely eliminate the influence of adversarial attacks and is inherently robust in the Byzantine setting, where an unknown fraction of the agents can be Byzantine, i.e., may arbitrarily select arms and communicate wrong information. We also conduct numerical experiments to illustrate the robustness and effectiveness of the proposed method.
[LG-18] EDGC: Entropy-driven Dynamic Gradient Compression for Efficient LLM Training
链接: https://arxiv.org/abs/2511.10333
作者: Qingao Yi,Jiaang Duan,Hanwen Hu,Qin Hua,Haiyan Zhao,Shiyou Qian,Dingyu Yang,Jian Cao,Jinghua Tang,Yinghao Yu,Chenzhi Liao,Kangjin Wang,Liping Zhang
类目: Machine Learning (cs.LG); Performance (cs.PF)
*备注:
Abstract:Training large language models (LLMs) poses significant challenges regarding computational resources and memory capacity. Although distributed training techniques help mitigate these issues, they still suffer from considerable communication overhead. Existing approaches primarily rely on static gradient compression to enhance communication efficiency; however, these methods neglect the dynamic nature of evolving gradients during training, leading to performance degradation. Accelerating LLM training via compression without sacrificing performance remains a challenge. In this paper, we propose an entropy-driven dynamic gradient compression framework called EDGC. The core concept is to adjust the compression rate during LLM training based on the evolving trends of gradient entropy, taking into account both compression efficiency and error. EDGC consists of three key components. First, it employs a down-sampling method to efficiently estimate gradient entropy, reducing computation overhead. Second, it establishes a theoretical model linking compression rate with gradient entropy, enabling more informed compression decisions. Lastly, a window-based adjustment mechanism dynamically adapts the compression rate across pipeline stages, improving communication efficiency and maintaining model performance. We implemented EDGC on a 32-NVIDIA-V100 cluster and a 64-NVIDIA-H100 cluster to train GPT2-2.5B and GPT2-12.1B, respectively. The results show that EDGC significantly reduces communication latency and training time by up to 46.45% and 16.13% while preserving LLM accuracy.
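The first component mentioned above, estimating gradient entropy from a down-sampled gradient, can be sketched roughly as follows. This is only a minimal illustration under assumptions: the function name `sampled_gradient_entropy`, the uniform random subsampling, the 10% sampling ratio, and the 64-bin histogram are all hypothetical choices, not the estimator described in the paper.

```python
import math
import random
from collections import Counter

def sampled_gradient_entropy(grad, sample_ratio=0.1, num_bins=64, seed=0):
    """Estimate the Shannon entropy (in bits) of gradient values from a random subsample.

    Hypothetical sketch: uniform subsampling plus an equal-width histogram.
    """
    rng = random.Random(seed)
    k = max(1, int(len(grad) * sample_ratio))
    sample = rng.sample(list(grad), k)      # down-sampling step
    lo, hi = min(sample), max(sample)
    if hi == lo:
        return 0.0                          # every sampled value lands in one bin
    width = (hi - lo) / num_bins
    # Histogram the sample; the maximum value is clamped into the last bin.
    counts = Counter(min(int((g - lo) / width), num_bins - 1) for g in sample)
    return -sum((c / k) * math.log2(c / k) for c in counts.values())
```

With `num_bins=64` the estimate is bounded above by 6 bits, and a gradient whose sampled entries are all identical gives 0 bits. A scheduler could compare such estimates across training windows, though how EDGC maps entropy to a compression rate is not reproduced here.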
[LG-19] PITE: Multi-Prototype Alignment for Individual Treatment Effect Estimation
链接: https://arxiv.org/abs/2511.10320
作者: Fuyuan Cao,Jiaxuan Zhang,Xiaoli Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:Estimating Individual Treatment Effects (ITE) from observational data is challenging due to confounding bias. Most studies tackle this bias by balancing distributions globally, but ignore individual heterogeneity and fail to capture the local structure that represents the natural clustering among individuals, which ultimately compromises ITE estimation. While instance-level alignment methods consider heterogeneity, they similarly overlook the local structure information. To address these issues, we propose an end-to-end Multi-Prototype alignment method for ITE estimation (PITE). PITE effectively captures local structure within groups and enforces cross-group alignment, thereby achieving robust ITE estimation. Specifically, we first define prototypes as cluster centroids based on similar individuals under the same treatment. To identify local similarity and the distribution consistency, we perform instance-to-prototype matching to assign individuals to the nearest prototype within groups, and design a multi-prototype alignment strategy to encourage the matched prototypes to be close across treatment arms in the latent space. PITE not only reduces distribution shift through fine-grained, prototype-level alignment, but also preserves the local structures of treated and control groups, which provides meaningful constraints for ITE estimation. Extensive evaluations on benchmark datasets demonstrate that PITE outperforms 13 state-of-the-art methods, achieving more accurate and robust ITE estimation.
[LG-20] Fault Detection in Solar Thermal Systems using Probabilistic Reconstructions
链接: https://arxiv.org/abs/2511.10296
作者: Florian Ebmeier,Nicole Ludwig,Jannik Thuemmel,Georg Martius,Volker H. Franz
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注:
Abstract:Solar thermal systems (STS) present a promising avenue for low-carbon heat generation, with a well-running system providing heat at minimal cost and carbon emissions. However, STS can exhibit faults due to improper installation, maintenance, or operation, often resulting in a substantial reduction in efficiency or even damage to the system. As monitoring at the individual level is economically prohibitive for small-scale systems, automated monitoring and fault detection should be used to address such issues. Recent advances in data-driven anomaly detection, particularly in time series analysis, offer a cost-effective solution by leveraging existing sensors to identify abnormal system states. Here, we propose a probabilistic reconstruction-based framework for anomaly detection. We evaluate our method on the publicly available PaSTS dataset of operational domestic STS, which features real-world complexities and diverse fault types. Our experiments show that reconstruction-based methods can detect faults in domestic STS both qualitatively and quantitatively, while generalizing to previously unseen systems. We also demonstrate that our model outperforms both simple and more complex deep learning baselines. Additionally, we show that heteroscedastic uncertainty estimation is essential to fault detection performance. Finally, we discuss the engineering overhead required to unlock these improvements and make a case for simple deep learning models.
[LG-21] Causal Model-Based Reinforcement Learning for Sample-Efficient IoT Channel Access
链接: https://arxiv.org/abs/2511.10291
作者: Aswin Arun,Christo Kurisummoottil Thomas,Rimalpudi Sarvendranath,Walid Saad
类目: Information Theory (cs.IT); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:
Abstract:Despite the advantages of multi-agent reinforcement learning (MARL) for wireless use cases such as medium access control (MAC), its real-world deployment in the Internet of Things (IoT) is hindered by its sample inefficiency. To alleviate this challenge, one can leverage model-based reinforcement learning (MBRL) solutions; however, conventional MBRL approaches rely on black-box models that are not interpretable and cannot reason. In contrast, in this paper, a novel causal model-based MARL framework is developed by leveraging tools from causal learning. In particular, the proposed model can explicitly represent causal dependencies between network variables using structural causal models (SCMs) and attention-based inference networks. Interpretable causal models are then developed to capture how MAC control messages influence observations, how transmission actions determine outcomes, and how channel observations affect rewards. Data augmentation techniques are then used to generate synthetic rollouts using the learned causal model for policy optimization via proximal policy optimization (PPO). Analytical results demonstrate exponential sample complexity gains of causal MBRL over black-box approaches. Extensive simulations demonstrate that, on average, the proposed approach can reduce environment interactions by 58%, and yield faster convergence compared to model-free baselines. The proposed approach is also shown to inherently provide interpretable scheduling decisions via attention-based causal attribution, revealing which network conditions drive the policy. The resulting combination of sample efficiency and interpretability establishes causal MBRL as a practical approach for resource-constrained wireless systems.
[LG-22] Unitho: A Unified Multi-Task Framework for Computational Lithography
链接: https://arxiv.org/abs/2511.10255
作者: Qian Jin,Yumeng Liu,Yuqi Jiang,Qi Sun,Cheng Zhuo
类目: Machine Learning (cs.LG)
*备注:
Abstract:Reliable, generalizable data foundations are critical for enabling large-scale models in computational lithography. However, essential tasks (mask generation, rule violation detection, and layout optimization) are often handled in isolation, hindered by scarce datasets and limited modeling approaches. To address these challenges, we introduce Unitho, a unified multi-task large vision model built upon the Transformer architecture. Trained on a large-scale industrial lithography simulation dataset with hundreds of thousands of cases, Unitho supports end-to-end mask generation, lithography simulation, and rule violation detection. By enabling agile and high-fidelity lithography simulation, Unitho further facilitates the construction of robust data foundations for intelligent EDA. Experimental results validate its effectiveness and generalizability, with performance substantially surpassing academic baselines.
[LG-23] FedCure: Mitigating Participation Bias in Semi-Asynchronous Federated Learning with Non-IID Data
链接: https://arxiv.org/abs/2511.10227
作者: Yue Chen,Jianfeng Lu,Shuqing Cao,Wei Wang,Gang Li,Guanghui Wen
类目: Machine Learning (cs.LG)
*备注:
Abstract:While semi-asynchronous federated learning (SAFL) combines the efficiency of synchronous training with the flexibility of asynchronous updates, it inherently suffers from participation bias, which is further exacerbated by non-IID data distributions. More importantly, hierarchical architecture shifts participation from individual clients to client groups, thereby further intensifying this issue. Despite notable advancements in SAFL research, most existing works still focus on conventional cloud-end architectures while largely overlooking the critical impact of non-IID data on scheduling across the cloud-edge-client hierarchy. To tackle these challenges, we propose FedCure, an innovative semi-asynchronous federated learning framework that leverages coalition construction and participation-aware scheduling to mitigate participation bias with non-IID data. Specifically, FedCure operates through three key rules: (1) a preference rule that optimizes coalition formation by maximizing collective benefits and establishing theoretically stable partitions to reduce non-IID-induced performance degradation; (2) a scheduling rule that integrates the virtual queue technique with Bayesian-estimated coalition dynamics, mitigating efficiency loss while ensuring mean rate stability; and (3) a resource allocation rule that enhances computational efficiency by optimizing client CPU frequencies based on estimated coalition dynamics while satisfying delay requirements. Comprehensive experiments on four real-world datasets demonstrate that FedCure improves accuracy by up to 5.1x compared with four state-of-the-art baselines, while significantly enhancing efficiency with the lowest coefficient of variation (0.0223) for per-round latency and maintaining long-term balance across diverse scenarios.
[LG-24] Out-of-Context Misinformation Detection via Variational Domain-Invariant Learning with Test-Time Training AAAI
链接: https://arxiv.org/abs/2511.10213
作者: Xi Yang,Han Zhang,Zhijian Lin,Yibiao Hu,Hong Han
类目: Machine Learning (cs.LG)
*备注: accepted by the AAAI Conference on Artificial Intelligence (AAAI) 2026
Abstract:Out-of-context misinformation (OOC) is a low-cost form of misinformation in news reports, which refers to placing authentic images into out-of-context or fabricated image-text pairings. This problem has attracted significant attention from researchers in recent years. Current methods focus on assessing image-text consistency or generating explanations. However, these approaches assume that the training and test data are drawn from the same distribution. When encountering novel news domains, models tend to perform poorly due to the lack of prior knowledge. To address this challenge, we propose VDT to enhance the domain adaptation capability for OOC misinformation detection by learning domain-invariant features and test-time training mechanisms. A Domain-Invariant Variational Align module is employed to jointly encode source and target domain data to learn a separable distributional space of domain-invariant features. To preserve semantic integrity, we utilize a domain consistency constraint module to reconstruct the source and target domain latent distributions. During the testing phase, we adopt the test-time training strategy and a confidence-variance filtering module to dynamically update the VAE encoder and classifier, facilitating the model’s adaptation to the target domain distribution. Extensive experiments conducted on the benchmark dataset NewsCLIPpings demonstrate that our method outperforms state-of-the-art baselines under most domain adaptation settings.
[LG-25] Beyond MSE: Ordinal Cross-Entropy for Probabilistic Time Series Forecasting
链接: https://arxiv.org/abs/2511.10200
作者: Jieting Wang,Huimei Shi,Feijiang Li,Xiaolei Shang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Time series forecasting is an important task that involves analyzing temporal dependencies and underlying patterns (such as trends, cyclicality, and seasonality) in historical data to predict future values or trends. Current deep learning-based forecasting models primarily employ Mean Squared Error (MSE) loss functions for regression modeling. Despite enabling direct value prediction, this method offers no uncertainty estimation and exhibits poor outlier robustness. To address these limitations, we propose OCE-TS, a novel ordinal classification approach for time series forecasting that replaces MSE with Ordinal Cross-Entropy (OCE) loss, preserving prediction order while quantifying uncertainty through probability output. Specifically, OCE-TS begins by discretizing observed values into ordered intervals and deriving their probabilities via a parametric distribution as supervision signals. Using a simple linear model, we then predict probability distributions for each timestep. The OCE loss is computed between the cumulative distributions of predicted and ground-truth probabilities, explicitly preserving ordinal relationships among forecasted values. Through theoretical analysis using influence functions, we establish that cross-entropy (CE) loss exhibits superior stability and outlier robustness compared to MSE loss. Empirically, we compared OCE-TS with five baseline models (Autoformer, DLinear, iTransformer, TimeXer, and TimeBridge) on seven public time series datasets. Using MSE and Mean Absolute Error (MAE) as evaluation metrics, the results demonstrate that OCE-TS consistently outperforms benchmark models. The code will be published.
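The pipeline described in this abstract (discretize into ordered intervals, derive soft labels from a parametric distribution, compare cumulative distributions) might look roughly like the sketch below. It is one plausible reading rather than the paper's definition: the Gaussian soft label, the per-threshold binary cross-entropy, and the names `soft_label`, `cumulative`, and `ordinal_cross_entropy` are all assumptions made for illustration.

```python
import math

def soft_label(y, edges, sigma=1.0):
    """Mass of a Gaussian N(y, sigma^2) in each interval [edges[i], edges[i+1]), renormalized.

    Assumes y lies well inside the range covered by `edges`.
    """
    def cdf(x):
        return 0.5 * (1.0 + math.erf((x - y) / (sigma * math.sqrt(2.0))))
    mass = [cdf(b) - cdf(a) for a, b in zip(edges[:-1], edges[1:])]
    total = sum(mass)
    return [m / total for m in mass]

def cumulative(p):
    """Running sum of a probability vector, i.e. its discrete CDF."""
    out, run = [], 0.0
    for v in p:
        run += v
        out.append(run)
    return out

def ordinal_cross_entropy(pred, target, eps=1e-12):
    """Mean binary cross-entropy between the two CDFs at every interior threshold."""
    f_pred, f_true = cumulative(pred)[:-1], cumulative(target)[:-1]
    loss = 0.0
    for fp, ft in zip(f_pred, f_true):
        fp = min(max(fp, eps), 1.0 - eps)   # guard against log(0)
        loss -= ft * math.log(fp) + (1.0 - ft) * math.log(1.0 - fp)
    return loss / len(f_pred)

# Hypothetical example: true value 2.5, five unit-width bins on [0, 5].
edges = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
target = soft_label(2.5, edges, sigma=0.5)
near = [0.02, 0.02, 0.02, 0.92, 0.02]   # mass one bin away from the truth
far = [0.02, 0.02, 0.02, 0.02, 0.92]    # mass two bins away from the truth
loss_near = ordinal_cross_entropy(near, target)
loss_far = ordinal_cross_entropy(far, target)
```

Because the comparison is made on cumulative distributions, the prediction that puts its mass two bins away incurs a larger loss than the one that is a single bin away, which is the ordinal behaviour the abstract describes.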
[LG-26] Towards Leveraging Sequential Structure in Animal Vocalizations NEURIPS
链接: https://arxiv.org/abs/2511.10190
作者: Eklavya Sarkar,Mathew Magimai.-Doss
类目: Machine Learning (cs.LG)
*备注: Accepted at NeurIPS workshop (AI for Non-Human Animal Communication)
Abstract:Animal vocalizations contain sequential structures that carry important communicative information, yet most computational bioacoustics studies average the extracted frame-level features across the temporal axis, discarding the order of the sub-units within a vocalization. This paper investigates whether discrete acoustic token sequences, derived through vector quantization and Gumbel-Softmax vector quantization of extracted self-supervised speech model representations, can effectively capture and leverage temporal information. To that end, pairwise distance analysis of token sequences generated from HuBERT embeddings shows that they can discriminate call-types and callers across four bioacoustics datasets. Sequence classification experiments using k-Nearest Neighbour with Levenshtein distance show that the vector-quantized token sequences yield reasonable call-type and caller classification performances, and hold promise as alternative feature representations towards leveraging sequential information in animal vocalizations.
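The classification setup named in this abstract, k-Nearest Neighbour over token sequences with Levenshtein distance, can be sketched in a few lines. The token IDs and call-type labels below are made up for illustration, and the helper names `levenshtein` and `knn_predict` are hypothetical; the paper's actual tokens come from quantized HuBERT representations.

```python
from collections import Counter

def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # substitute, or free match
        prev = curr
    return prev[-1]

def knn_predict(query, train_seqs, train_labels, k=3):
    """Majority vote among the k training sequences closest to `query` in edit distance."""
    ranked = sorted(range(len(train_seqs)),
                    key=lambda i: levenshtein(query, train_seqs[i]))
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical token sequences (codebook indices) with made-up call-type labels.
train_seqs = [[1, 2, 3, 4], [1, 2, 3, 5], [1, 2, 2, 4],
              [7, 8, 9], [7, 8, 8, 9], [7, 9, 9]]
train_labels = ["trill", "trill", "trill", "phee", "phee", "phee"]
pred_a = knn_predict([1, 2, 3], train_seqs, train_labels)
pred_b = knn_predict([7, 8, 9, 9], train_seqs, train_labels)
```

Unlike averaging frame-level features over time, the edit distance depends on the order of the tokens, so two calls built from the same sub-units in a different order are not treated as identical.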
[LG-27] EPO: Diverse and Realistic Protein Ensemble Generation via Energy Preference Optimization AAAI2026
链接: https://arxiv.org/abs/2511.10165
作者: Yuancheng Sun,Yuxuan Ren,Zhaoming Chen,Xu Han,Kang Liu,Qiwei Ye
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: Accepted as AAAI 2026 Poster
Abstract:Accurate exploration of protein conformational ensembles is essential for uncovering function but remains hard because molecular-dynamics (MD) simulations suffer from high computational costs and energy-barrier trapping. This paper presents Energy Preference Optimization (EPO), an online refinement algorithm that turns a pretrained protein ensemble generator into an energy-aware sampler without extra MD trajectories. Specifically, EPO leverages stochastic differential equation sampling to explore the conformational landscape and incorporates a novel energy-ranking mechanism based on list-wise preference optimization. Crucially, EPO introduces a practical upper bound to efficiently approximate the intractable probability of long sampling trajectories in continuous-time generative models, making it easily adaptable to existing pretrained generators. On Tetrapeptides, ATLAS, and Fast-Folding benchmarks, EPO successfully generates diverse and physically realistic ensembles, establishing a new state-of-the-art in nine evaluation metrics. These results demonstrate that energy-only preference signals can efficiently steer generative models toward thermodynamically consistent conformational ensembles, providing an alternative to long MD simulations and widening the applicability of learned potentials in structural biology and drug discovery.
[LG-28] RI-Loss: A Learnable Residual-Informed Loss for Time Series Forecasting
链接: https://arxiv.org/abs/2511.10130
作者: Jieting Wang,Xiaolei Shang,Feijiang Li,Furong Peng
类目: Machine Learning (cs.LG)
*备注:
Abstract:Time series forecasting relies on predicting future values from historical data, yet most state-of-the-art approaches (including transformer and multilayer perceptron-based models) optimize using Mean Squared Error (MSE), which has two fundamental weaknesses: its point-wise error computation fails to capture temporal relationships, and it does not account for inherent noise in the data. To overcome these limitations, we introduce the Residual-Informed Loss (RI-Loss), a novel objective function based on the Hilbert-Schmidt Independence Criterion (HSIC). RI-Loss explicitly models noise structure by enforcing dependence between the residual sequence and a random time series, enabling more robust, noise-aware representations. Theoretically, we derive the first non-asymptotic HSIC bound with explicit double-sample complexity terms, achieving optimal convergence rates through Bernstein-type concentration inequalities and Rademacher complexity analysis. This provides rigorous guarantees for RI-Loss optimization while precisely quantifying kernel space interactions. Empirically, experiments across eight real-world benchmarks and five leading forecasting models demonstrate improvements in predictive performance, validating the effectiveness of our approach. Code will be made publicly available to ensure reproducibility.
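For readers unfamiliar with HSIC, the sketch below computes the standard biased empirical estimator, trace(KHLH) / (n - 1)^2, for two scalar sequences. It illustrates only the dependence measure that RI-Loss builds on, not the loss itself; the Gaussian kernel, the bandwidth of 1.0, and the function names are assumptions chosen for the example.

```python
import math

def gaussian_gram(xs, bandwidth=1.0):
    """Gram matrix of a Gaussian (RBF) kernel on scalar samples."""
    return [[math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2)) for b in xs] for a in xs]

def center(K):
    """Double centering H K H with H = I - (1/n) 11^T, for a symmetric K."""
    n = len(K)
    row = [sum(r) / n for r in K]       # row means (equal to column means by symmetry)
    total = sum(row) / n                # grand mean
    return [[K[i][j] - row[i] - row[j] + total for j in range(n)] for i in range(n)]

def hsic(xs, ys, bandwidth=1.0):
    """Biased empirical HSIC: trace(K H L H) / (n - 1)^2."""
    n = len(xs)
    Kc = center(gaussian_gram(xs, bandwidth))
    L = gaussian_gram(ys, bandwidth)
    # trace(H K H L) equals the element-wise sum of (H K H) * L for symmetric matrices.
    return sum(Kc[i][j] * L[i][j] for i in range(n) for j in range(n)) / (n - 1) ** 2
```

The estimate is close to zero when the two sequences are independent and grows as they become statistically dependent, which is why it can serve as a building block for a residual-aware training objective.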
[LG-29] Physics-informed Machine Learning for Static Friction Modeling in Robotic Manipulators Based on Kolmogorov-Arnold Networks
链接: https://arxiv.org/abs/2511.10079
作者: Yizheng Wang,Timon Rabczuk,Yinghua Liu
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
Abstract:Friction modeling plays a crucial role in achieving high-precision motion control in robotic operating systems. Traditional static friction models (such as the Stribeck model) are widely used due to their simple forms; however, they typically require predefined functional assumptions, which poses significant challenges when dealing with unknown functional structures. To address this issue, this paper proposes a physics-inspired machine learning approach based on the Kolmogorov-Arnold Network (KAN) for static friction modeling of robotic joints. The method integrates spline activation functions with a symbolic regression mechanism, enabling model simplification and physical expression extraction through pruning and attribute scoring, while maintaining both high prediction accuracy and interpretability. We first validate the method’s capability to accurately identify key parameters under known functional models, and further demonstrate its robustness and generalization ability under conditions with unknown functional structures and noisy data. Experiments conducted on both synthetic data and real friction data collected from a six-degree-of-freedom industrial manipulator show that the proposed method achieves a coefficient of determination greater than 0.95 across various tasks and successfully extracts concise and physically meaningful friction expressions. This study provides a new perspective for interpretable and data-driven robotic friction modeling with promising engineering applicability.
[LG-30] Tree-Based Stochastic Optimization for Solving Large-Scale Urban Network Security Games
链接: https://arxiv.org/abs/2511.10072
作者: Shuxin Zhuang,Linjian Meng,Shuxin Li,Minming Li,Youzhi Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Urban Network Security Games (UNSGs), which model the strategic allocation of limited security resources on city road networks, are critical for urban safety. However, finding a Nash Equilibrium (NE) in large-scale UNSGs is challenging due to their massive and combinatorial action spaces. One common approach to addressing these games is the Policy-Space Response Oracle (PSRO) framework, which requires computing best responses (BR) at each iteration. However, computing exact BRs is impractical in large-scale games, and employing reinforcement learning to approximate BRs inevitably introduces errors, which limits the overall effectiveness of the PSRO methods. Recent advancements in leveraging non-convex stochastic optimization to approximate an NE offer a promising alternative to the burdensome BR computation. However, utilizing existing stochastic optimization techniques with an unbiased loss function for UNSGs remains challenging because the action spaces are too vast to be effectively represented by neural networks. To address these issues, we introduce Tree-based Stochastic Optimization (TSO), a framework that bridges the gap between the stochastic optimization paradigm for NE-finding and the demands of UNSGs. Specifically, we employ the tree-based action representation that maps the whole action space onto a tree structure, addressing the challenge faced by neural networks in representing actions when the action space cannot be enumerated. We then incorporate this representation into the loss function and theoretically demonstrate its equivalence to the unbiased loss function. To further enhance the quality of the converged solution, we introduce a sample-and-prune mechanism that reduces the risk of being trapped in suboptimal local optima. Extensive experimental results indicate the superiority of TSO over other baseline algorithms in addressing the UNSGs.
[LG-31] FAQNAS: FLOPs-aware Hybrid Quantum Neural Architecture Search using Genetic Algorithm
链接: https://arxiv.org/abs/2511.10062
作者: Muhammad Kashif,Shaf Khalid,Alberto Marchisio,Nouhaila Innan,Muhammad Shafique
类目: Machine Learning (cs.LG)
*备注:
Abstract:Hybrid Quantum Neural Networks (HQNNs), which combine parameterized quantum circuits with classical neural layers, are emerging as promising models in the noisy intermediate-scale quantum (NISQ) era. While quantum circuits are not naturally measured in floating point operations (FLOPs), most HQNNs (in the NISQ era) are still trained on classical simulators where FLOPs directly dictate runtime and scalability. Hence, FLOPs represent a practical and viable metric to measure the computational complexity of HQNNs. In this work, we introduce FAQNAS, a FLOPs-aware neural architecture search (NAS) framework that formulates HQNN design as a multi-objective optimization problem balancing accuracy and FLOPs. Unlike traditional approaches, FAQNAS explicitly incorporates FLOPs into the optimization objective, enabling the discovery of architectures that achieve strong performance while minimizing computational cost. Experiments on five benchmark datasets (MNIST, Digits, Wine, Breast Cancer, and Iris) show that quantum FLOPs dominate accuracy improvements, while classical FLOPs remain largely fixed. Pareto-optimal solutions reveal that competitive accuracy can often be achieved with significantly reduced computational cost compared to FLOPs-agnostic baselines. Our results establish FLOPs-awareness as a practical criterion for HQNN design in the NISQ era and as a scalable principle for future HQNN systems.
[LG-32] From Static Structures to Ensembles: Studying and Harnessing Protein Structure Tokenization NEURIPS2025
链接: https://arxiv.org/abs/2511.10056
作者: Zijing Liu,Bin Feng,He Cao,Yu Li
类目: Machine Learning (cs.LG)
*备注: NeurIPS 2025 AI for Science Workshop
Abstract:Protein structure tokenization converts 3D structures into discrete or vectorized representations, enabling the integration of structural and sequence data. Despite many recent works on structure tokenization, the properties of the underlying discrete representations are not well understood. In this work, we first demonstrate that the successful utilization of structural tokens in a language model for structure prediction depends on using rich, pre-trained sequence embeddings to bridge the semantic gap between the sequence and structural “language”. The analysis of the structural vocabulary itself then reveals significant semantic redundancy, where multiple distinct tokens correspond to nearly identical local geometries, acting as “structural synonyms”. This redundancy, rather than being a flaw, can be exploited with a simple “synonym swap” strategy to generate diverse conformational ensembles by perturbing a predicted structure with its structural synonyms. This computationally lightweight method accurately recapitulates protein flexibility, performing competitively with state-of-the-art models. Our study provides fundamental insights into the nature of discrete protein structure representations and introduces a powerful, near-instantaneous method for modeling protein dynamics. Source code is available in this https URL.
[LG-33] Multi-agent In-context Coordination via Decentralized Memory Retrieval
链接: https://arxiv.org/abs/2511.10030
作者: Tao Jiang,Zichuan Lin,Lihe Li,Yi-Chen Li,Cong Guan,Lei Yuan,Zongzhang Zhang,Yang Yu,Deheng Ye
类目: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
*备注:
Abstract:Large transformer models, trained on diverse datasets, have demonstrated impressive few-shot performance on previously unseen tasks without requiring parameter updates. This capability has also been explored in Reinforcement Learning (RL), where agents interact with the environment to retrieve context and maximize cumulative rewards, showcasing strong adaptability in complex settings. However, in cooperative Multi-Agent Reinforcement Learning (MARL), where agents must coordinate toward a shared goal, decentralized policy deployment can lead to mismatches in task alignment and reward assignment, limiting the efficiency of policy adaptation. To address this challenge, we introduce Multi-agent In-context Coordination via Decentralized Memory Retrieval (MAICC), a novel approach designed to enhance coordination by fast adaptation. Our method involves training a centralized embedding model to capture fine-grained trajectory representations, followed by decentralized models that approximate the centralized one to obtain team-level task information. Based on the learned embeddings, relevant trajectories are retrieved as context, which, combined with the agents’ current sub-trajectories, inform decision-making. During decentralized execution, we introduce a novel memory mechanism that effectively balances test-time online data with offline memory. Based on the constructed memory, we propose a hybrid utility score that incorporates both individual- and team-level returns, ensuring credit assignment across agents. Extensive experiments on cooperative MARL benchmarks, including Level-Based Foraging (LBF) and SMAC (v1/v2), show that MAICC enables faster adaptation to unseen tasks compared to existing methods. Code is available at this https URL.
[LG-34] SVD-NO: Learning PDE Solution Operators with SVD Integral Kernels
Link: https://arxiv.org/abs/2511.10025
Authors: Noam Koren,Ralf J. J. Mackenbach,Ruud J. G. van Sloun,Kira Radinsky,Daniel Freedman
Subjects: Machine Learning (cs.LG)
Comments: AAAI-26
Abstract:Neural operators have emerged as a promising paradigm for learning solution operators of partial differential equations (PDEs) directly from data. Existing methods, such as those based on Fourier or graph techniques, make strong assumptions about the structure of the kernel integral operator, assumptions which may limit expressivity. We present SVD-NO, a neural operator that explicitly parameterizes the kernel by its singular-value decomposition (SVD) and then carries out the integral directly in the low-rank basis. Two lightweight networks learn the left and right singular functions, a diagonal parameter matrix learns the singular values, and a Gram-matrix regularizer enforces orthonormality. As SVD-NO approximates the full kernel, it obtains a high degree of expressivity. Furthermore, due to its low-rank structure the computational complexity of applying the operator remains reasonable, leading to a practical system. In extensive evaluations on five diverse benchmark equations, SVD-NO achieves a new state of the art. In particular, SVD-NO provides greater performance gains on PDEs whose solutions are highly spatially variable. The code of this work is publicly available at this https URL.
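The low-rank integral described in this abstract can be illustrated numerically. The snippet below is a minimal sketch, not the authors' implementation: it assumes a 1-D uniform grid, fixed (not learned) singular functions built from sines and cosines, and a hypothetical helper name `apply_low_rank_kernel`. For a kernel of the form k(x, y) = sum_k sigma_k * phi_k(x) * psi_k(y), the integral of k(x, y) u(y) over y reduces to projecting u onto each psi_k, scaling by sigma_k, and expanding in phi_k, which avoids ever forming the dense kernel.

```python
import math

def apply_low_rank_kernel(phi, sigma, psi, u, dy):
    """Apply (Ku)(x) = sum_k sigma_k * phi_k(x) * <psi_k, u> on a uniform grid.

    phi[k][i] and psi[k][i] hold the k-th singular function at grid point i.
    """
    out = [0.0] * len(u)
    for phi_k, s_k, psi_k in zip(phi, sigma, psi):
        # Inner product <psi_k, u>, approximated by a Riemann sum.
        coeff = sum(p * v for p, v in zip(psi_k, u)) * dy
        for i, p in enumerate(phi_k):
            out[i] += s_k * coeff * p
    return out

n = 32
dy = 1.0 / n
grid = [(i + 0.5) * dy for i in range(n)]

# Hypothetical rank-2 kernel; in SVD-NO these functions would be learned.
phi = [[math.sin(math.pi * x) for x in grid],
       [math.sin(2 * math.pi * x) for x in grid]]
psi = [[math.cos(math.pi * y) for y in grid],
       [math.cos(2 * math.pi * y) for y in grid]]
sigma = [1.0, 0.5]
u = [x * (1.0 - x) for x in grid]

low_rank = apply_low_rank_kernel(phi, sigma, psi, u, dy)

# Reference: assemble the dense n-by-n kernel and integrate directly.
dense = []
for i in range(n):
    total = 0.0
    for j in range(n):
        k_ij = sum(s * phi[k][i] * psi[k][j] for k, s in enumerate(sigma))
        total += k_ij * u[j] * dy
    dense.append(total)

max_abs_diff = max(abs(a - b) for a, b in zip(low_rank, dense))
```

With n grid points and rank r, the low-rank path costs on the order of n*r operations, while the dense reference costs on the order of n^2; both give the same result up to floating-point rounding.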
[LG-35] GraphSB: Boosting Imbalanced Node Classification on Graphs through Structural Balance
Link: https://arxiv.org/abs/2511.10022
Authors: Chaofan Zhu,Xiaobing Rui,Zhixiao Wang
Subjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments:
Abstract:Imbalanced node classification is a critical challenge in graph learning, where most existing methods typically utilize Graph Neural Networks (GNNs) to learn node representations. These methods can be broadly categorized into the data-level and the algorithm-level. The former aims to synthesize minority-class nodes to mitigate quantity imbalance, while the latter tries to optimize the learning process to highlight minority classes. However, neither category addresses the inherently imbalanced graph structure, which is a fundamental factor that incurs majority-class dominance and minority-class assimilation in GNNs. Our theoretical analysis further supports this critical insight. Therefore, we propose GraphSB (Graph Structural Balance), a novel framework that incorporates Structural Balance as a key strategy to address the underlying imbalanced graph structure before node synthesis. Structural Balance performs a two-stage structure optimization: Structure Enhancement that adaptively builds similarity-based edges to strengthen connectivity of minority-class nodes, and Relation Diffusion that captures higher-order dependencies while amplifying signals from minority classes. Thus, GraphSB balances structural distribution before node synthesis, enabling more effective learning in GNNs. Extensive experiments demonstrate that GraphSB significantly outperforms the state-of-the-art methods. More importantly, the proposed Structural Balance can be seamlessly integrated into state-of-the-art methods as a simple plug-and-play module, increasing their accuracy by an average of 3.67%.
[LG-36] Interaction as Interference: A Quantum-Inspired Aggregation Approach
Link: https://arxiv.org/abs/2511.10018
Authors: Pilsung Kang
Subjects: Machine Learning (cs.LG); Quantum Physics (quant-ph)
Comments:
Abstract:Classical approaches often treat interaction as engineered product terms or as emergent patterns in flexible models, offering little control over how synergy or antagonism arises. We take a quantum-inspired view: following the Born rule (probability as squared amplitude), coherent aggregation sums complex amplitudes before squaring, creating an interference cross-term, whereas an incoherent proxy sums squared magnitudes and removes it. In a minimal linear-amplitude model, this cross-term equals the standard potential-outcome interaction contrast $\Delta_{\mathrm{INT}}$ in a $2\times 2$ factorial design, giving relative phase a direct, mechanism-level control over synergy versus antagonism. We instantiate this idea in a lightweight Interference Kernel Classifier (IKC) and introduce two diagnostics: Coherent Gain (log-likelihood gain of coherent over the incoherent proxy) and Interference Information (the induced Kullback-Leibler gap). A controlled phase sweep recovers the identity. On a high-interaction synthetic task (XOR), IKC outperforms strong baselines under paired, budget-matched comparisons; on real tabular data (Adult and Bank Marketing) it is competitive overall but typically trails the most capacity-rich baseline in paired differences. Holding learned parameters fixed, toggling aggregation from incoherent to coherent consistently improves negative log-likelihood, Brier score, and expected calibration error, with positive Coherent Gain on both datasets.
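The difference between coherent and incoherent aggregation comes down to one algebraic identity: |a1 + a2|^2 = |a1|^2 + |a2|^2 + 2 Re(a1 * conj(a2)). The sketch below illustrates only that identity with two hand-picked amplitudes; it is not the IKC model, the function names are hypothetical, and the outputs are unnormalized intensities rather than probabilities.

```python
import cmath

def coherent(a1, a2):
    """Sum complex amplitudes first, then square the magnitude."""
    return abs(a1 + a2) ** 2

def incoherent(a1, a2):
    """Square the magnitudes first, then sum (no interference)."""
    return abs(a1) ** 2 + abs(a2) ** 2

def cross_term(a1, a2):
    """Interference term 2 * Re(a1 * conj(a2))."""
    return 2.0 * (a1 * a2.conjugate()).real

r1, r2 = 0.6, 0.8  # illustrative magnitudes, chosen arbitrarily
results = {}
for name, phase in [("in_phase", 0.0),
                    ("quadrature", cmath.pi / 2),
                    ("out_of_phase", cmath.pi)]:
    a1 = complex(r1, 0.0)
    a2 = cmath.rect(r2, phase)  # magnitude r2, relative phase `phase`
    results[name] = (coherent(a1, a2), incoherent(a1, a2), cross_term(a1, a2))
```

The cross-term equals 2 * r1 * r2 * cos(phase), so a relative phase near 0 gives a positive contribution (synergy), a phase near pi gives a negative one (antagonism), and a phase of pi/2 removes the interaction entirely.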
[LG-37] DemoTuner: Efficient DBMS Knobs Tuning via LLM-Assisted Demonstration Reinforcement Learning
Link: https://arxiv.org/abs/2511.09998
Authors: Hui Dou,Lei Jin,Yuxuan Zhou,Jiang He,Yiwen Zhang
Subjects: Machine Learning (cs.LG); Databases (cs.DB)
Comments: 14 pages, 9 figures
Abstract:The performance of modern DBMSs such as MySQL and PostgreSQL heavily depends on the configuration of performance-critical knobs. Manually tuning these knobs is laborious and inefficient due to the complex and high-dimensional nature of the configuration space. Among the automated tuning methods, reinforcement learning (RL)-based methods have recently sought to improve the DBMS knobs tuning process from several different perspectives. However, they still encounter challenges with slow convergence speed during offline training. In this paper, we mainly focus on how to leverage the valuable tuning hints contained in various textual documents such as DBMS manuals and web forums to improve the offline training of RL-based methods. To this end, we propose an efficient DBMS knobs tuning framework named DemoTuner via a novel LLM-assisted demonstration reinforcement learning method. Specifically, to comprehensively and accurately mine tuning hints from documents, we design a structured chain-of-thought prompt to employ LLMs to conduct a condition-aware tuning hints extraction task. To effectively integrate the mined tuning hints into RL agent training, we propose a hint-aware demonstration reinforcement learning algorithm HA-DDPGfD in DemoTuner. As far as we know, DemoTuner is the first work to introduce the demonstration reinforcement learning algorithm for DBMS knobs tuning. Experimental evaluations conducted on MySQL and PostgreSQL across various workloads demonstrate the significant advantages of DemoTuner in both performance improvement and online tuning cost reduction over three representative baselines including DB-BERT, GPTuner and CDBTune. Additionally, DemoTuner also exhibits superior adaptability to application scenarios with unknown workloads.
[LG-38] A Novel Data-Dependent Learning Paradigm for Large Hypothesis Classes
Link: https://arxiv.org/abs/2511.09996
Authors: Alireza F. Pour,Shai Ben-David
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:We address the general task of learning with a set of candidate models that is too large to have a uniform convergence of empirical estimates to true losses. While the common approach to such challenges is SRM (or regularization) based learning algorithms, we propose a novel learning paradigm that relies on stronger incorporation of empirical data and requires fewer algorithmic decisions to be based on prior assumptions. We analyze the generalization capabilities of our approach and demonstrate its merits in several common learning assumptions, including similarity of close points, clustering of the domain into highly label-homogeneous regions, Lipschitzness assumptions of the labeling rule, and contrastive learning assumptions. Our approach allows utilizing such assumptions without the need to know their true parameters a priori.
[LG-39] Towards Robust Multimodal Learning in the Open World
Link: https://arxiv.org/abs/2511.09989
Authors: Fushuo Huo
Subjects: Machine Learning (cs.LG)
Comments: Thesis
Abstract:The rapid evolution of machine learning has propelled neural networks to unprecedented success across diverse domains. In particular, multimodal learning has emerged as a transformative paradigm, leveraging complementary information from heterogeneous data streams (e.g., text, vision, audio) to advance contextual reasoning and intelligent decision-making. Despite these advancements, current neural network-based models often fall short in open-world environments characterized by inherent unpredictability, where unpredictable environmental composition dynamics, incomplete modality inputs, and spurious distributional relations critically undermine system reliability. While humans naturally adapt to such dynamic, ambiguous scenarios, artificial intelligence systems exhibit stark limitations in robustness, particularly when processing multimodal signals under real-world complexity. This study investigates the fundamental challenge of multimodal learning robustness in open-world settings, aiming to bridge the gap between controlled experimental performance and practical deployment requirements.
[LG-40] Rediscovering the Lunar Equation of the Centre with AI Feynman via Embedded Physical Biases
Link: https://arxiv.org/abs/2511.09979
Authors: Saumya Shah,Zi-Yu Khoo,Abel Yang,Stéphane Bressan
Subjects: Machine Learning (cs.LG); Instrumentation and Methods for Astrophysics (astro-ph.IM)
Comments: 7 pages, 1 figure, 4 tables
Abstract:This work explores using the physics-inspired AI Feynman symbolic regression algorithm to automatically rediscover a fundamental equation in astronomy – the Equation of the Centre. Through the introduction of observational and inductive biases corresponding to the physical nature of the system through data preprocessing and search space restriction, AI Feynman was successful in recovering the first-order analytical form of this equation from lunar ephemerides data. However, this manual approach highlights a key limitation in its reliance on expert-driven coordinate system selection. We therefore propose an automated preprocessing extension to find the canonical coordinate system. Results demonstrate that targeted domain knowledge embedding enables symbolic regression to rediscover physical laws, but also highlight further challenges in constraining symbolic regression to derive physics equations when leveraging domain knowledge through tailored biases.
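For readers unfamiliar with the target formula: in its textbook form, the Equation of the Centre gives the difference between true anomaly nu and mean anomaly M as a series in the eccentricity e, with first-order term 2e sin(M) and second-order term (5/4) e^2 sin(2M). The abstract does not quote the exact expression the authors recovered, so the snippet below is only an illustrative check of that textbook series against a numerical solution of Kepler's equation. It does not use AI Feynman, the helper name `true_anomaly` is hypothetical, and the value 0.0549 is an assumed mean lunar eccentricity.

```python
import math

def true_anomaly(mean_anomaly, e, iterations=10):
    """Solve Kepler's equation M = E - e*sin(E) by Newton's method,
    then convert the eccentric anomaly E to the true anomaly."""
    E = mean_anomaly
    for _ in range(iterations):
        E -= (E - e * math.sin(E) - mean_anomaly) / (1.0 - e * math.cos(E))
    return 2.0 * math.atan2(math.sqrt(1.0 + e) * math.sin(E / 2.0),
                            math.sqrt(1.0 - e) * math.cos(E / 2.0))

e_moon = 0.0549  # assumed mean eccentricity of the lunar orbit
samples = [2.0 * math.pi * k / 360 for k in range(360)]

exact = [true_anomaly(M, e_moon) - M for M in samples]
first_order = [2.0 * e_moon * math.sin(M) for M in samples]
second_order = [f + 1.25 * e_moon ** 2 * math.sin(2.0 * M)
                for f, M in zip(first_order, samples)]

# Worst-case error (in radians) of each truncated series over one orbit.
err_first = max(abs(a - b) for a, b in zip(exact, first_order))
err_second = max(abs(a - b) for a, b in zip(exact, second_order))
```

Because the lunar eccentricity is small, the first-order term already captures most of the effect, which is consistent with the abstract's statement that the first-order analytical form was the one recovered.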
[LG-41] Autonomous Concept Drift Threshold Determination
Link: https://arxiv.org/abs/2511.09953
Authors: Pengqian Lu,Jie Lu,Anjin Liu,En Yu,Guangquan Zhang
Subjects: Machine Learning (cs.LG)
Comments: Accepted by AAAI 2026
Abstract:Existing drift detection methods focus on designing sensitive test statistics. They treat the detection threshold as a fixed hyperparameter, set once to balance false alarms and late detections, and applied uniformly across all datasets and over time. However, maintaining model performance is the key objective from the perspective of machine learning, and we observe that model performance is highly sensitive to this threshold. This observation inspires us to investigate whether a dynamic threshold could be provably better. In this paper, we prove that a threshold that adapts over time can outperform any single fixed threshold. The main idea of the proof is that a dynamic strategy, constructed by combining the best threshold from each individual data segment, is guaranteed to outperform any single threshold that applies to all segments. Based on the theorem, we propose a Dynamic Threshold Determination algorithm. It enhances existing drift detection frameworks with a novel comparison phase to inform how the threshold should be adjusted. Extensive experiments on a wide range of synthetic and real-world datasets, including both image and tabular data, validate that our approach substantially enhances the performance of state-of-the-art drift detectors.
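The proof idea stated in this abstract rests on a simple inequality: picking the best threshold separately for each segment can never cost more than committing to one threshold for all segments. The sketch below illustrates that inequality on a small, made-up loss table; the numbers are hypothetical and the per-segment choice is an oracle, not the paper's Dynamic Threshold Determination algorithm, which has to choose thresholds online.

```python
# loss[s][t]: hypothetical model loss on data segment s when the drift
# detector uses candidate threshold t (3 segments, 3 thresholds).
loss = [
    [0.30, 0.10, 0.25],
    [0.05, 0.20, 0.40],
    [0.35, 0.30, 0.15],
]

num_thresholds = len(loss[0])

# Fixed strategy: one threshold for every segment; take the best one.
fixed_totals = [sum(row[t] for row in loss) for t in range(num_thresholds)]
best_fixed = min(fixed_totals)

# Dynamic (oracle) strategy: the best threshold for each segment.
dynamic_total = sum(min(row) for row in loss)
```

Here the best fixed threshold accumulates a total loss of 0.60, while the per-segment choice accumulates 0.30. In general the dynamic total is never larger than the best fixed total, with equality only when a single threshold happens to be optimal on every segment.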
[LG-42] Towards Multiple Missing Values-resistant Unsupervised Graph Anomaly Detection
Link: https://arxiv.org/abs/2511.09917
Authors: Jiazhen Chen,Xiuqin Liang,Sichao Fu,Zheng Ma,Weihua Ou
Subjects: Machine Learning (cs.LG)
Comments: Accepted by 40th AAAI Conference on Artificial Intelligence (AAAI 2026)
Abstract:Unsupervised graph anomaly detection (GAD) has received increasing attention in recent years, which aims to identify anomalous data patterns utilizing only unlabeled node information from graph-structured data. However, prevailing unsupervised GAD methods typically presuppose complete node attributes and structure information, a condition hardly satisfied in real-world scenarios owing to privacy, collection errors or dynamic node arrivals. Existing standard imputation schemes risk “repairing” rare anomalous nodes so that they appear normal, thereby introducing imputation bias into the detection process. In addition, when both node attributes and edges are missing simultaneously, estimation errors in one view can contaminate the other, causing cross-view interference that further undermines the detection performance. To overcome these challenges, we propose M$^2$V-UGAD, a multiple missing values-resistant unsupervised GAD framework on incomplete graphs. Specifically, a dual-pathway encoder is first proposed to independently reconstruct missing node attributes and graph structure, thereby preventing errors in one view from propagating to the other. The two pathways are then fused and regularized in a joint latent space so that normals occupy a compact inner manifold while anomalies reside on an outer shell. Lastly, to mitigate imputation bias, we sample latent codes just outside the normal region and decode them into realistic node features and subgraphs, providing hard negative examples that sharpen the decision boundary. Experiments on seven public benchmarks demonstrate that M$^2$V-UGAD consistently outperforms existing unsupervised GAD methods across varying missing rates.
[LG-43] Incremental Generation is Necessary and Sufficient for Universality in Flow-Based Modelling
Link: https://arxiv.org/abs/2511.09902
Authors: Hossein Rouhvarzi,Anastasis Kratsios
Subjects: Machine Learning (cs.LG); Classical Analysis and ODEs (math.CA); Dynamical Systems (math.DS); Numerical Analysis (math.NA); Machine Learning (stat.ML)
Comments:
Abstract:Incremental flow-based denoising models have reshaped generative modelling, but their empirical advantage still lacks a rigorous approximation-theoretic foundation. We show that incremental generation is necessary and sufficient for universal flow-based generation on the largest natural class of self-maps of $[0,1]^d$ compatible with denoising pipelines, namely the orientation-preserving homeomorphisms of $[0,1]^d$. All our guarantees are uniform on the underlying maps and hence imply approximation both samplewise and in distribution. Using a new topological-dynamical argument, we first prove an impossibility theorem: the class of all single-step autonomous flows, independently of the architecture, width, depth, or Lipschitz activation of the underlying neural network, is meagre and therefore not universal in the space of orientation-preserving homeomorphisms of $[0,1]^d$. By exploiting algebraic properties of autonomous flows, we conversely show that every orientation-preserving Lipschitz homeomorphism on $[0,1]^d$ can be approximated at rate $\mathcal{O}(n^{-1/d})$ by a composition of at most $K_d$ such flows, where $K_d$ depends only on the dimension. Under additional smoothness assumptions, the approximation rate can be made dimension-free, and $K_d$ can be chosen uniformly over the class being approximated. Finally, by linearly lifting the domain into one higher dimension, we obtain structured universal approximation results for continuous functions and for probability measures on $[0,1]^d$, the latter realized as pushforwards of empirical measures with vanishing $1$-Wasserstein error.
[LG-44] Explore and Establish Synergistic Effects Between Weight Pruning and Coreset Selection in Neural Network Training
Link: https://arxiv.org/abs/2511.09901
Authors: Weilin Wan,Fan Yi,Weizhong Zhang,Quan Zhou,Cheng Jin
Subjects: Machine Learning (cs.LG)
Comments: 15 pages, 7 figures, AAAI-2026 camera-ready version
Abstract:Modern deep neural networks rely heavily on massive model weights and training samples, incurring substantial computational costs. Weight pruning and coreset selection are two emerging paradigms proposed to improve computational efficiency. In this paper, we first explore the interplay between redundant weights and training samples through a transparent analysis: redundant samples, particularly noisy ones, cause model weights to become unnecessarily overtuned to fit them, complicating the identification of irrelevant weights during pruning; conversely, irrelevant weights tend to overfit noisy data, undermining coreset selection effectiveness. To further investigate and harness this interplay in deep learning, we develop a Simultaneous Weight and Sample Tailoring mechanism (SWaST) that alternately performs weight pruning and coreset selection to establish a synergistic effect in training. During this investigation, we observe that when simultaneously removing a large number of weights and samples, a phenomenon we term critical double-loss can occur, where important weights and their supportive samples are mistakenly eliminated at the same time, leading to model instability and nearly irreversible degradation that cannot be recovered in subsequent training. Unlike classic machine learning models, this issue can arise in deep learning due to the lack of theoretical guarantees on the correctness of weight pruning and coreset selection, which explains why these paradigms are often developed independently. We mitigate this by integrating a state preservation mechanism into SWaST, enabling stable joint optimization. Extensive experiments reveal a strong synergy between pruning and coreset selection across varying prune rates and coreset sizes, delivering accuracy boosts of up to 17.83% alongside 10% to 90% FLOPs reductions.
[LG-45] Uncertainty-Guided Checkpoint Selection for Reinforcement Finetuning of Large Language Models
Link: https://arxiv.org/abs/2511.09864
Authors: Manh Nguyen,Dung Nguyen,Dai Do,Svetha Venkatesh,Hung Le
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Reinforcement learning (RL) finetuning is crucial to aligning large language models (LLMs), but the process is notoriously unstable and exhibits high variance across model checkpoints. In practice, selecting the best checkpoint is challenging: evaluating checkpoints on the validation set during training is computationally expensive and requires a good validation set, while relying on the final checkpoint provides no guarantee of good performance. We introduce an uncertainty-guided approach for checkpoint selection (UGCS) that avoids these pitfalls. Our method identifies hard question-answer pairs using per-sample uncertainty and ranks checkpoints by how well they handle these challenging cases. By averaging the rewards of the top-uncertain samples over a short training window, our method produces a stable and discriminative signal without additional forward passes or significant computation overhead. Experiments across three datasets and three LLMs demonstrate that it consistently identifies checkpoints with stronger generalization, outperforming traditional strategies such as relying on training or validation performance. These results highlight that models solving their hardest tasks with low uncertainty are the most reliable overall.
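The selection signal described in this abstract (average reward on the most uncertain samples, smoothed over a short training window) can be sketched as follows. This is a hedged illustration, not the authors' code: the abstract does not specify the uncertainty measure, the fraction of samples kept, or the window length, so `top_fraction`, `window`, both function names, and the toy logs are all assumptions. Each log entry is an (uncertainty, reward) pair that is taken to be already recorded during training, so no extra forward passes are needed.

```python
def top_uncertain_reward(step_log, top_fraction=0.25):
    """Mean reward over the most uncertain fraction of samples in one step.

    step_log: list of (uncertainty, reward) pairs logged during training.
    """
    ranked = sorted(step_log, key=lambda pair: pair[0], reverse=True)
    k = max(1, int(len(ranked) * top_fraction))
    return sum(reward for _, reward in ranked[:k]) / k

def checkpoint_scores(logs, window=2, top_fraction=0.25):
    """Score each checkpoint by averaging the per-step signal over a
    trailing window of training steps."""
    per_step = [top_uncertain_reward(step, top_fraction) for step in logs]
    scores = []
    for i in range(len(per_step)):
        recent = per_step[max(0, i - window + 1): i + 1]
        scores.append(sum(recent) / len(recent))
    return scores

# Toy logs for four consecutive checkpoints (made-up values).
logs = [
    [(0.9, 0.0), (0.8, 0.0), (0.2, 1.0), (0.1, 1.0)],
    [(0.9, 1.0), (0.7, 0.0), (0.3, 1.0), (0.1, 1.0)],
    [(0.8, 1.0), (0.6, 1.0), (0.2, 1.0), (0.1, 0.0)],
    [(0.9, 0.0), (0.5, 1.0), (0.4, 0.0), (0.1, 1.0)],
]

scores = checkpoint_scores(logs, window=2, top_fraction=0.5)
best_step = max(range(len(scores)), key=scores.__getitem__)
```

On these toy logs the scores are 0.0, 0.25, 0.75, and 0.75, and the first checkpoint reaching the maximum (index 2) is selected. Windowing is what keeps a single lucky step from dominating the ranking.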
[LG-46] Unlearning Imperative: Securing Trustworthy and Responsible LLMs through Engineered Forgetting
Link: https://arxiv.org/abs/2511.09855
Authors: James Jin Kang,Dang Bui,Thanh Pham,Huo-Chong Ling
Subjects: Machine Learning (cs.LG)
Comments: 14 pages, 4 figures, 4 tables
Abstract:The growing use of large language models in sensitive domains has exposed a critical weakness: the inability to ensure that private information can be permanently forgotten. Yet these systems still lack reliable mechanisms to guarantee that sensitive information can be permanently removed once it has been used. Retraining from the beginning is prohibitively costly, and existing unlearning methods remain fragmented, difficult to verify, and often vulnerable to recovery. This paper surveys recent research on machine unlearning for LLMs and considers how far current approaches can address these challenges. We review methods for evaluating whether forgetting has occurred, the resilience of unlearned models against adversarial attacks, and mechanisms that can support user trust when model complexity or proprietary limits restrict transparency. Technical solutions such as differential privacy, homomorphic encryption, federated learning, and ephemeral memory are examined alongside institutional safeguards including auditing practices and regulatory frameworks. The review finds steady progress, but robust and verifiable unlearning is still unresolved. Efficient techniques that avoid costly retraining, stronger defenses against adversarial recovery, and governance structures that reinforce accountability are needed if LLMs are to be deployed safely in sensitive applications. By integrating technical and organizational perspectives, this study outlines a pathway toward AI systems that can be required to forget, while maintaining both privacy and public trust.
[LG-47] ConSurv: Multimodal Continual Learning for Survival Analysis
Link: https://arxiv.org/abs/2511.09853
Authors: Dianzhi Yu,Conghao Xiong,Yankai Chen,Wenqian Cui,Xinni Zhang,Yifei Zhang,Hao Chen,Joseph J.Y. Sung,Irwin King
Subjects: Machine Learning (cs.LG)
Comments: 14 pages, 4 figures. This is the extended version of the paper accepted at AAAI 2026, which includes all technical appendices and additional experimental details
Abstract:Survival prediction of cancers is crucial for clinical practice, as it informs mortality risks and influences treatment plans. However, a static model trained on a single dataset fails to adapt to the dynamically evolving clinical environment and continuous data streams, limiting its practical utility. While continual learning (CL) offers a solution to learn dynamically from new datasets, existing CL methods primarily focus on unimodal inputs and suffer from severe catastrophic forgetting in survival prediction. In real-world scenarios, multimodal inputs often provide comprehensive and complementary information, such as whole slide images and genomics; and neglecting inter-modal correlations negatively impacts the performance. To address the two challenges of catastrophic forgetting and complex inter-modal interactions between gigapixel whole slide images and genomics, we propose ConSurv, the first multimodal continual learning (MMCL) method for survival analysis. ConSurv incorporates two key components: Multi-staged Mixture of Experts (MS-MoE) and Feature Constrained Replay (FCR). MS-MoE captures both task-shared and task-specific knowledge at different learning stages of the network, including two modality encoders and the modality fusion component, learning inter-modal relationships. FCR further enhances learned knowledge and mitigates forgetting by restricting feature deviation of previous data at different levels, including encoder-level features of two modalities and the fusion-level representations. Additionally, we introduce a new benchmark integrating four datasets, Multimodal Survival Analysis Incremental Learning (MSAIL), for comprehensive evaluation in the CL setting. Extensive experiments demonstrate that ConSurv outperforms competing methods across multiple metrics.
[LG-48] Steering Pretrained Drafters during Speculative Decoding
Link: https://arxiv.org/abs/2511.09844
Authors: Frédéric Berdoz,Peer Rheinboldt,Roger Wattenhofer
Subjects: Machine Learning (cs.LG); Performance (cs.PF)
Comments: Accepted at AAAI 2026
Abstract:Speculative decoding accelerates language model inference by separating generation into fast drafting and parallel verification. Its main limitation is drafter-verifier misalignment, which limits token acceptance and reduces overall effectiveness. While small drafting heads trained from scratch compensate with speed, they struggle when verification dominates latency or when inputs are out of distribution. In contrast, pretrained drafters, though slower, achieve higher acceptance rates thanks to stronger standalone generation capabilities, making them competitive when drafting latency is negligible relative to verification or communication overhead. In this work, we aim to improve the acceptance rates of pretrained drafters by introducing a lightweight dynamic alignment mechanism: a steering vector computed from the verifier’s hidden states and injected into the pretrained drafter. Compared to existing offline alignment methods such as distillation, our approach boosts the number of accepted tokens by up to 35% under standard sampling and 22% under greedy sampling, all while incurring negligible computational overhead. Importantly, our approach can be retrofitted to existing architectures and pretrained models, enabling rapid adoption.
[LG-49] ACT as Human: Multimodal Large Language Model Data Annotation with Critical Thinking
Link: https://arxiv.org/abs/2511.09833
Authors: Lequan Lin,Dai Shi,Andi Han,Feng Chen,Qiuzheng Chen,Jiawen Li,Zhaoyang Li,Jiyuan Li,Zhenbang Sun,Junbin Gao
Subjects: Machine Learning (cs.LG)
Comments: NeurIPS 2025
Abstract:Supervised learning relies on high-quality labeled data, but obtaining such data through human annotation is both expensive and time-consuming. Recent work explores using large language models (LLMs) for annotation, but LLM-generated labels still fall short of human-level quality. To address this problem, we propose the Annotation with Critical Thinking (ACT) data pipeline, where LLMs serve not only as annotators but also as judges to critically identify potential errors. Human effort is then directed towards reviewing only the most “suspicious” cases, significantly improving the human annotation efficiency. Our major contributions are as follows: (1) ACT is applicable to a wide range of domains, including natural language processing (NLP), computer vision (CV), and multimodal understanding, by leveraging multimodal-LLMs (MLLMs). (2) Through empirical studies, we derive 7 insights on how to enhance annotation quality while efficiently reducing the human cost, and then translate these findings into user-friendly guidelines. (3) We theoretically analyze how to modify the loss function so that models trained on ACT data achieve similar performance to those trained on fully human-annotated data. Our experiments show that the performance gap can be reduced to less than 2% on most benchmark datasets while saving up to 90% of human costs.
[LG-50] Learning Intersections of Halfspaces under Factorizable Distribution
Link: https://arxiv.org/abs/2511.09832
Authors: Ilias Diakonikolas,Mingchen Ma,Lisheng Ren,Christos Tzamos
Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
Comments: Appeared at COLT 2025
Abstract:Learning intersections of halfspaces is a central problem in Computational Learning Theory. Even for just two halfspaces, it remains a major open question whether learning is possible in polynomial time with respect to the margin $\gamma$ of the data points and their dimensionality $d$. The best-known algorithms run in quasi-polynomial time $d^{O(\log(1/\gamma))}$, and it has been shown that this complexity is unavoidable for any algorithm relying solely on correlational statistical queries (CSQ). In this work, we introduce a novel algorithm that provably circumvents the CSQ hardness barrier. Our approach applies to a broad class of distributions satisfying a natural, previously studied, factorizability assumption. Factorizable distributions lie between distribution-specific and distribution-free settings, and significantly extend previously known tractable cases. Under these distributions, we show that CSQ-based methods still require quasipolynomial time even for weakly learning, whereas our algorithm achieves $\mathrm{poly}(d, 1/\gamma)$ time by leveraging more general statistical queries (SQ), establishing a strong separation between CSQ and SQ for this simple realizable PAC learning problem. Our result is grounded in a rigorous analysis utilizing a novel duality framework that characterizes the moment tensor structure induced by the marginal distributions. Building on these structural insights, we propose new, efficient learning algorithms. These algorithms combine a refined variant of Jennrich’s Algorithm with PCA over random projections of the moment tensor, along with a gradient-descent-based non-convex optimization framework.
[LG-51] SMoFi: Step-wise Momentum Fusion for Split Federated Learning on Heterogeneous Data
Link: https://arxiv.org/abs/2511.09828
Authors: Mingkun Yang,Ran Zhu,Qing Wang,Jie Yang
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Paper accepted by AAAI 2026
Abstract:Split Federated Learning is a system-efficient federated learning paradigm that leverages the rich computing resources at a central server to train model partitions. Data heterogeneity across silos, however, presents a major challenge undermining the convergence speed and accuracy of the global model. This paper introduces Step-wise Momentum Fusion (SMoFi), an effective and lightweight framework that counteracts gradient divergence arising from data heterogeneity by synchronizing the momentum buffers across server-side optimizers. To control gradient divergence over the training process, we design a staleness-aware alignment mechanism that imposes constraints on gradient updates of the server-side submodel at each optimization step. Extensive validations on multiple real-world datasets show that SMoFi consistently improves global model accuracy (up to 7.1%) and convergence speed (up to 10.25$\times$). Furthermore, SMoFi has a greater impact with more clients involved and deeper learning models, making it particularly suitable for model training in resource-constrained contexts.
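To make "synchronizing the momentum buffers across server-side optimizers" concrete, the sketch below runs one SGD-with-momentum step for two server-side submodel copies and then replaces both buffers with their element-wise mean. This is an assumption-laden illustration: the abstract does not state the fusion rule, so plain averaging is a stand-in, the staleness-aware alignment mechanism is omitted, and the function names, learning rate, and gradients are hypothetical.

```python
def momentum_step(weights, grad, buf, lr=0.1, beta=0.9):
    """One SGD-with-momentum update; returns new weights and new buffer."""
    new_buf = [beta * b + g for b, g in zip(buf, grad)]
    new_weights = [w - lr * b for w, b in zip(weights, new_buf)]
    return new_weights, new_buf

def fuse_momentum(buffers):
    """Element-wise mean of momentum buffers (assumed fusion rule)."""
    return [sum(vals) / len(buffers) for vals in zip(*buffers)]

# Two server-side submodel copies, one per client, with their own optimizers.
weights = [[1.0, 1.0], [1.0, 1.0]]
buffers = [[0.0, 0.0], [0.0, 0.0]]
# Heterogeneous data: the two clients' gradients point in different directions.
grads = [[1.0, 0.0], [0.0, 1.0]]

for i in range(2):
    weights[i], buffers[i] = momentum_step(weights[i], grads[i], buffers[i])

# Fusion after the optimization step: every optimizer continues from the
# same shared momentum, which damps the divergence between the two copies.
fused = fuse_momentum(buffers)
buffers = [list(fused) for _ in buffers]
```

After fusion, both optimizers carry the buffer [0.5, 0.5], so their next updates share a common momentum direction even though their local gradients disagree.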
[LG-52] Beyond Monotonicity: Revisiting Factorization Principles in Multi-Agent Q-Learning
Link: https://arxiv.org/abs/2511.09792
Authors: Tianmeng Hu,Yongzheng Cui,Rui Tang,Biao Luo,Ke Li
Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Accepted at AAAI 2026
Abstract:Value decomposition is a central approach in multi-agent reinforcement learning (MARL), enabling centralized training with decentralized execution by factorizing the global value function into local values. To ensure individual-global-max (IGM) consistency, existing methods either enforce monotonicity constraints, which limit expressive power, or adopt softer surrogates at the cost of algorithmic complexity. In this work, we present a dynamical systems analysis of non-monotonic value decomposition, modeling learning dynamics as continuous-time gradient flow. We prove that, under approximately greedy exploration, all zero-loss equilibria violating IGM consistency are unstable saddle points, while only IGM-consistent solutions are stable attractors of the learning dynamics. Extensive experiments on both synthetic matrix games and challenging MARL benchmarks demonstrate that unconstrained, non-monotonic factorization reliably recovers IGM-optimal solutions and consistently outperforms monotonic baselines. Additionally, we investigate the influence of temporal-difference targets and exploration strategies, providing actionable insights for the design of future value-based MARL algorithms.
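Individual-global-max (IGM) consistency means that the joint action maximizing the global value is the same as the tuple of each agent's own greedy action. The sketch below checks that condition for a two-agent matrix game. It is a minimal illustration, not the paper's analysis: the payoff matrix and local utilities are made up, `satisfies_igm` is a hypothetical name, and the check assumes all maxima are unique.

```python
def satisfies_igm(q_tot, q_locals):
    """Return True if the greedy joint action under q_tot matches the tuple
    of per-agent greedy actions (two agents, unique maxima assumed)."""
    rows, cols = len(q_tot), len(q_tot[0])
    joint_best = max(((a, b) for a in range(rows) for b in range(cols)),
                     key=lambda ab: q_tot[ab[0]][ab[1]])
    local_best = tuple(max(range(len(q)), key=q.__getitem__) for q in q_locals)
    return joint_best == local_best

# Illustrative non-monotonic payoff: the optimum (0, 0) is surrounded by
# heavy penalties for miscoordination.
q_tot = [
    [8.0, -12.0, -12.0],
    [-12.0, 0.0, 0.0],
    [-12.0, 0.0, 0.0],
]

# Local utilities whose greedy actions pick the optimum (0, 0).
consistent = ([1.0, 0.2, 0.1], [0.9, 0.3, 0.0])
# Local utilities whose greedy actions settle on the safe action (1, 1).
inconsistent = ([0.1, 0.5, 0.4], [0.0, 0.6, 0.2])

igm_ok = satisfies_igm(q_tot, consistent)
igm_violated = not satisfies_igm(q_tot, inconsistent)
```

Games of this shape are the usual stress test for value factorization, because local utilities can easily drift toward the safe but suboptimal joint action.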
[LG-53] A Robust Task-Level Control Architecture for Learned Dynamical Systems
Link: https://arxiv.org/abs/2511.09790
Authors: Eshika Pathak,Ahmed Aboudonia,Sandeep Banik,Naira Hovakimyan
Categories: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:
Abstract:Dynamical system (DS)-based learning from demonstration (LfD) is a powerful tool for generating motion plans in the operation (‘task’) space of robotic systems. However, the realization of the generated motion plans is often compromised by a “task-execution mismatch”, where unmodeled dynamics, persistent disturbances, and system latency cause the robot’s actual task-space state to diverge from the desired motion trajectory. We propose a novel task-level robust control architecture, L1-augmented Dynamical Systems (L1-DS), that explicitly handles the task-execution mismatch in tracking a nominal motion plan generated by any DS-based LfD scheme. Our framework augments any DS-based LfD model with a nominal stabilizing controller and an L1 adaptive controller. Furthermore, we introduce a windowed Dynamic Time Warping (DTW)-based target selector, which enables the nominal stabilizing controller to handle temporal misalignment for improved phase-consistent tracking. We demonstrate the efficacy of our architecture on the LASA and IROS handwriting datasets.
[LG-54] CaReTS: A Multi-Task Framework Unifying Classification and Regression for Time Series Forecasting
Link: https://arxiv.org/abs/2511.09789
Authors: Fulong Yao,Wanqing Zhao,Chao Zheng,Xiaofei Han
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Recent advances in deep forecasting models have achieved remarkable performance, yet most approaches still struggle to provide both accurate predictions and interpretable insights into temporal dynamics. This paper proposes CaReTS, a novel multi-task learning framework that combines classification and regression tasks for multi-step time series forecasting problems. The framework adopts a dual-stream architecture, where a classification branch learns the stepwise trend into the future, while a regression branch estimates the corresponding deviations from the latest observation of the target variable. The dual-stream design provides more interpretable predictions by disentangling macro-level trends from micro-level deviations in the target variable. To enable effective learning in output prediction, deviation estimation, and trend classification, we design a multi-task loss with uncertainty-aware weighting to adaptively balance the contribution of each task. Furthermore, four variants (CaReTS1–4) are instantiated under this framework to incorporate mainstream temporal modelling encoders, including convolutional neural networks (CNNs), long short-term memory networks (LSTMs), and Transformers. Experiments on real-world datasets demonstrate that CaReTS outperforms state-of-the-art (SOTA) algorithms in forecasting accuracy, while achieving higher trend classification performance.
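The abstract does not give the exact form of the uncertainty-aware weighting. One widely used choice, shown below purely as an assumption, is homoscedastic-uncertainty weighting, where each task loss is scaled by a learned precision and a log-variance penalty keeps the weights from collapsing to zero. The function name and the three example losses are hypothetical.

```python
import math


def uncertainty_weighted_loss(task_losses, log_variances):
    """Combine task losses as sum_i exp(-s_i) * L_i + s_i.

    s_i is a learnable log-variance per task. This is a common
    uncertainty-weighting scheme, assumed here for illustration only.
    """
    return sum(math.exp(-s) * loss + s
               for loss, s in zip(task_losses, log_variances))


# Hypothetical losses for output prediction, deviation estimation,
# and trend classification.
losses = [0.8, 0.3, 1.2]
equal_weights = uncertainty_weighted_loss(losses, [0.0, 0.0, 0.0])
down_weighted = uncertainty_weighted_loss(losses, [0.0, 0.0, 1.0])
```

With all log-variances at zero the combined loss reduces to the plain sum; raising a task's log-variance shrinks its contribution at the price of the additive penalty.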
[LG-55] Hail to the Thief: Exploring Attacks and Defenses in Decentralised GRPO
Link: https://arxiv.org/abs/2511.09780
Authors: Nikolay Blagoev,Oğuzhan Ersoy,Lydia Yiyu Chen
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Group Relative Policy Optimization (GRPO) has proven highly useful in post-training of Large Language Models (LLMs). In GRPO, prompts are answered by the model and, through reinforcement learning, preferred completions are learnt. Owing to the small communication volume, GRPO is inherently suitable for decentralised training, as the prompts can be concurrently answered by multiple nodes and then exchanged in the form of strings. In this work, we present the first adversarial attack in decentralised GRPO. We demonstrate that malicious parties can poison such systems by injecting arbitrary malicious tokens in benign models in both out-of-context and in-context attacks. Using empirical examples of math and coding tasks, we show that adversarial attacks can easily poison the benign nodes, polluting their local LLM post-training, achieving attack success rates up to 100% in as few as 50 iterations. We propose two ways to defend against these attacks, depending on whether all users train the same model or different models. We show that these defenses can achieve stop rates of up to 100%, making the attack impossible.
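For readers unfamiliar with GRPO, the group-relative advantage at its core can be sketched in a few lines: rewards for the completions of one prompt are normalized within the group. The sketch below assumes the population standard deviation and a small `eps` for stability; the function name is illustrative and nothing here reproduces the paper's attacks or defenses.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions for a prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Rewards for four completions of the same prompt, possibly produced
# by different nodes in a decentralised setup.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because every completion in the group shifts the mean and spread, a completion contributed by another node influences the advantages assigned to all the others.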
[LG-56] NeuroLingua: A Language-Inspired Hierarchical Framework for Multimodal Sleep Stage Classification Using EEG and EOG
Link: https://arxiv.org/abs/2511.09773
Authors: Mahdi Samaee,Mehran Yazdi,Daniel Massicotte
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments:
Abstract:Automated sleep stage classification from polysomnography remains limited by the lack of expressive temporal hierarchies, challenges in multimodal EEG and EOG fusion, and the limited interpretability of deep learning models. We propose NeuroLingua, a language-inspired framework that conceptualizes sleep as a structured physiological language. Each 30-second epoch is decomposed into overlapping 3-second subwindows (“tokens”) using a CNN-based tokenizer, enabling hierarchical temporal modeling through dual-level Transformers: intra-segment encoding of local dependencies and inter-segment integration across seven consecutive epochs (3.5 minutes) for extended context. Modality-specific embeddings from EEG and EOG channels are fused via a Graph Convolutional Network, facilitating robust multimodal integration. NeuroLingua is evaluated on the Sleep-EDF Expanded and ISRUC-Sleep datasets, achieving state-of-the-art results on Sleep-EDF (85.3% accuracy, 0.800 macro F1, and 0.796 Cohen’s kappa) and competitive performance on ISRUC (81.9% accuracy, 0.802 macro F1, and 0.755 kappa), matching or exceeding published baselines in overall and per-class metrics. The architecture’s attention mechanisms enhance the detection of clinically relevant sleep microevents, providing a principled foundation for future interpretability, explainability, and causal inference in sleep research. By framing sleep as a compositional language, NeuroLingua unifies hierarchical sequence modeling and multimodal fusion, advancing automated sleep staging toward more transparent and clinically meaningful applications.
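The epoch-to-token windowing can be sketched as follows. The 30-second epoch and 3-second subwindow come from the abstract; the 100 Hz sampling rate and the 1.5-second stride (50% overlap) are assumptions, and the CNN-based tokenizer that would embed each subwindow is not shown.

```python
def subwindows(epoch, fs=100, window_s=3.0, stride_s=1.5):
    """Slice one epoch (a list of samples) into overlapping subwindows.

    fs and stride_s are assumed values; the abstract specifies only the
    30-second epoch and the 3-second subwindow length.
    """
    window = int(window_s * fs)
    stride = int(stride_s * fs)
    return [epoch[start:start + window]
            for start in range(0, len(epoch) - window + 1, stride)]


epoch = [0.0] * (30 * 100)      # one 30-second epoch at 100 Hz
tokens = subwindows(epoch)      # raw slices fed to the tokenizer
context_seconds = 7 * 30        # seven consecutive epochs = 210 s = 3.5 min
```

Under these assumed settings each epoch yields 19 subwindows of 300 samples.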
[LG-57] Is nasty noise actually harder than malicious noise?
Link: https://arxiv.org/abs/2511.09763
Authors: Guy Blanc,Yizhi Huang,Tal Malkin,Rocco A. Servedio
Categories: Machine Learning (cs.LG); Computational Complexity (cs.CC); Data Structures and Algorithms (cs.DS)
Comments: To appear in SODA 2026
Abstract:We consider the relative abilities and limitations of computationally efficient algorithms for learning in the presence of noise, under two well-studied and challenging adversarial noise models for learning Boolean functions: malicious noise, in which an adversary can arbitrarily corrupt a random subset of examples given to the learner; and nasty noise, in which an adversary can arbitrarily corrupt an adversarially chosen subset of examples given to the learner. We consider both the distribution-independent and fixed-distribution settings. Our main results highlight a dramatic difference between these two settings: For distribution-independent learning, we prove a strong equivalence between the two noise models: If a class $\mathcal{C}$ of functions is efficiently learnable in the presence of $\eta$-rate malicious noise, then it is also efficiently learnable in the presence of $\eta$-rate nasty noise. In sharp contrast, for the fixed-distribution setting we show an arbitrarily large separation: Under a standard cryptographic assumption, for any arbitrarily large value $r$ there exists a concept class for which there is a ratio of $r$ between the rate $\eta_{\text{malicious}}$ of malicious noise that polynomial-time learning algorithms can tolerate, versus the rate $\eta_{\text{nasty}}$ of nasty noise that such learning algorithms can tolerate. To offset the negative result for the fixed-distribution setting, we define a broad and natural class of algorithms, namely those that ignore contradictory examples (ICE). We show that for these algorithms, malicious noise and nasty noise are equivalent up to a factor of two in the noise rate: Any efficient ICE learner that succeeds with $\eta$-rate malicious noise can be converted to an efficient learner that succeeds with $\eta/2$-rate nasty noise. We further show that the above factor of two is necessary, again under a standard cryptographic assumption.
[LG-58] Data Heterogeneity and Forgotten Labels in Split Federated Learning AAAI2026
Link: https://arxiv.org/abs/2511.09736
Authors: Joana Tirana,Dimitra Tsigkari,David Solans Noguero,Nicolas Kourtellis
Categories: Machine Learning (cs.LG)
Comments: A shorter version of this paper will appear in the proceedings of AAAI 2026
Abstract:In Split Federated Learning (SFL), the clients collaboratively train a model with the help of a server by splitting the model into two parts. Part-1 is trained locally at each client and aggregated by the aggregator at the end of each round. Part-2 is trained at a server that sequentially processes the intermediate activations received from each client. We study the phenomenon of catastrophic forgetting (CF) in SFL in the presence of data heterogeneity. In detail, due to the nature of SFL, local updates of part-1 may drift away from global optima, while part-2 is sensitive to the processing sequence, similar to forgetting in continual learning (CL). Specifically, we observe that the trained model performs better on classes (labels) seen at the end of the sequence. We investigate this phenomenon with emphasis on key aspects of SFL, such as the processing order at the server and the cut layer. Based on our findings, we propose Hydra, a novel mitigation method inspired by multi-head neural networks and adapted to the SFL setting. Extensive numerical evaluations show that Hydra outperforms baselines and methods from the literature.
[LG-59] FlowCast: Advancing Precipitation Nowcasting with Conditional Flow Matching
Link: https://arxiv.org/abs/2511.09731
Authors: Bernardo Perrone Ribeiro,Jana Faganeli Pucer
Categories: Machine Learning (cs.LG)
Comments: Under Review
Abstract:Radar-based precipitation nowcasting, the task of forecasting short-term precipitation fields from previous radar images, is a critical problem for flood risk management and decision-making. While deep learning has substantially advanced this field, two challenges remain fundamental: the uncertainty of atmospheric dynamics and the efficient modeling of high-dimensional data. Diffusion models have shown strong promise by producing sharp, reliable forecasts, but their iterative sampling process is computationally prohibitive for time-critical applications. We introduce FlowCast, the first model to apply Conditional Flow Matching (CFM) to precipitation nowcasting. Unlike diffusion, CFM learns a direct noise-to-data mapping, enabling rapid, high-fidelity sample generation with drastically fewer function evaluations. Our experiments demonstrate that FlowCast establishes a new state-of-the-art in predictive accuracy. A direct comparison further reveals the CFM objective is both more accurate and significantly more efficient than a diffusion objective on the same architecture, maintaining high performance with significantly fewer sampling steps. This work positions CFM as a powerful and practical alternative for high-dimensional spatiotemporal forecasting.
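To make the contrast with iterative diffusion sampling concrete, here is a minimal sketch of conditional flow matching with a straight-line path between a noise sample and a data sample. The linear path and independent pairing are one common CFM formulation and are assumed here; FlowCast's actual path, conditioning on past radar frames, and network are not shown, and all function names are illustrative.

```python
def cfm_training_pair(x0, x1, t):
    """Build one regression example for a straight-line probability path.

    x0 is a noise sample, x1 a data sample, t in [0, 1]. The network would
    be trained to predict target_velocity from (x_t, t).
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target_velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, target_velocity


def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        v = velocity(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


x_t, v = cfm_training_pair([0.0, 2.0], [4.0, 2.0], t=0.25)
# A constant field stands in for the learned velocity network.
sample = euler_sample(lambda x, t: [4.0, 0.0], [0.0, 2.0], num_steps=4)
```

When the learned field is close to straight, a handful of Euler steps already lands near the data sample, which is the intuition behind needing fewer function evaluations.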
[LG-60] Generalizing PDE Emulation with Equation-Aware Neural Operators
Link: https://arxiv.org/abs/2511.09729
Authors: Qian-Ze Zhu,Paul Raccuglia,Michael P. Brenner
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Solving partial differential equations (PDEs) can be prohibitively expensive using traditional numerical methods. Deep learning-based surrogate models typically specialize in a single PDE with fixed parameters. We present a framework for equation-aware emulation that generalizes to unseen PDEs, conditioning a neural model on a vector encoding representing the terms in a PDE and their coefficients. We present a baseline of four distinct modeling techniques, trained on a family of 1D PDEs from the APEBench suite. Our approach achieves strong performance on parameter sets held out from the training distribution, with strong stability for rollout beyond the training window, and generalization to an entirely unseen PDE. This work was developed as part of a broader effort exploring AI systems that automate the creation of expert-level empirical software for scorable scientific tasks. The data and codebase are available at this https URL.
[LG-61] Efficient Hyperdimensional Computing with Modular Composite Representations
Link: https://arxiv.org/abs/2511.09708
Authors: Marco Angioli,Christopher J. Kymn,Antonello Rosato,Amy Loutfi,Mauro Olivieri,Denis Kleyko
Categories: Machine Learning (cs.LG)
Comments:
Abstract:The modular composite representation (MCR) is a computing model that represents information with high-dimensional integer vectors using modular arithmetic. Originally proposed as a generalization of the binary spatter code model, it aims to provide higher representational power while remaining a lighter alternative to models requiring high-precision components. Despite this potential, MCR has received limited attention. Systematic analyses of its trade-offs and comparisons with other models are lacking, sustaining the perception that its added complexity outweighs the improved expressivity. In this work, we revisit MCR by presenting its first extensive evaluation, demonstrating that it achieves a unique balance of capacity, accuracy, and hardware efficiency. Experiments measuring capacity demonstrate that MCR outperforms binary and integer vectors while approaching complex-valued representations at a fraction of their memory footprint. Evaluation on 123 datasets confirms consistent accuracy gains and shows that MCR can match the performance of binary spatter codes using up to 4x less memory. We investigate the hardware realization of MCR by showing that it maps naturally to digital logic and by designing the first dedicated accelerator. Evaluations on basic operations and 7 selected datasets demonstrate a speedup of up to 3 orders of magnitude and significant energy reductions compared to software implementation. When matched for accuracy against binary spatter codes, MCR achieves on average 3.08x faster execution and 2.68x lower energy consumption. These findings demonstrate that, although MCR requires more sophisticated operations than binary spatter codes, its modular arithmetic and higher per-component precision enable lower dimensionality. When realized with dedicated hardware, this results in a faster, more energy-efficient, and high-precision alternative to existing models.
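A minimal sketch of how modular arithmetic on integer hypervectors can work, assuming components in {0, ..., r-1}, binding by component-wise addition mod r, unbinding by subtraction mod r, and a circular distance for comparison. The helper names, the dimension, and the modulus are illustrative choices, and the bundling operation and hardware mapping evaluated in the paper are not covered.

```python
import random


def random_hv(dim, r, rng):
    """Draw a random hypervector with integer components in [0, r)."""
    return [rng.randrange(r) for _ in range(dim)]


def bind(a, b, r):
    """Bind two hypervectors by component-wise addition modulo r."""
    return [(x + y) % r for x, y in zip(a, b)]


def unbind(c, a, r):
    """Invert binding by component-wise subtraction modulo r."""
    return [(x - y) % r for x, y in zip(c, a)]


def modular_distance(a, b, r):
    """Sum of circular distances between components on the ring Z_r."""
    return sum(min((x - y) % r, (y - x) % r) for x, y in zip(a, b))


rng = random.Random(0)
r, dim = 16, 1024               # 4 bits per component (illustrative)
key, value = random_hv(dim, r, rng), random_hv(dim, r, rng)
bound = bind(key, value, r)
recovered = unbind(bound, key, r)
```

Binding is exactly invertible, so `recovered` equals `value`, while `bound` itself looks unrelated to either input under the circular distance.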
[LG-62] ConstrainedSQL: Training LLMs for Text2SQL via Constrained Reinforcement Learning
Link: https://arxiv.org/abs/2511.09693
Authors: Weiqin Chen,Nhan Huu Pham,Michael Robert Glass,Long Hai Vu,Gaetano Rossiello,Dharmashankar Subramanian,Santiago Paternain
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Reinforcement learning (RL) has demonstrated significant promise in enhancing the reasoning capabilities of Text2SQL LLMs, especially with advanced algorithms such as GRPO and DAPO. However, the performance of these methods is highly sensitive to the design of reward functions. Inappropriate rewards can lead to reward hacking, where models exploit loopholes in the reward structure to achieve high scores without genuinely solving the task. This work considers a constrained RL framework for Text2SQL that incorporates natural and interpretable reward and constraint signals, while dynamically balancing trade-offs among them during the training. We establish the theoretical guarantees of our constrained RL framework and our numerical experiments on the well-known Text2SQL datasets substantiate the improvement of our approach over the state-of-the-art RL-trained LLMs.
[LG-63] Boosted GFlowNets: Improving Exploration via Sequential Learning
Link: https://arxiv.org/abs/2511.09677
Authors: Pedro Dall’Antonia,Tiago da Silva,Daniel Augusto de Souza,César Lincoln C. Mattos,Diego Mesquita
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 11 pages, 3 figures (22 pages total including supplementary material)
Abstract:Generative Flow Networks (GFlowNets) are powerful samplers for compositional objects that, by design, sample proportionally to a given non-negative reward. Nonetheless, in practice, they often struggle to explore the reward landscape evenly: trajectories toward easy-to-reach regions dominate training, while hard-to-reach modes receive vanishing or uninformative gradients, leading to poor coverage of high-reward areas. We address this imbalance with Boosted GFlowNets, a method that sequentially trains an ensemble of GFlowNets, each optimizing a residual reward that compensates for the mass already captured by previous models. This residual principle reactivates learning signals in underexplored regions and, under mild assumptions, ensures a monotone non-degradation property: adding boosters cannot worsen the learned distribution and typically improves it. Empirically, Boosted GFlowNets achieve substantially better exploration and sample diversity on multimodal synthetic benchmarks and peptide design tasks, while preserving the stability and simplicity of standard trajectory-balance training.
[LG-64] GEM: Scalable State-of-the-Art Private Synthetic Data with Generator Networks
Link: https://arxiv.org/abs/2511.09672
Authors: Samuel Maddock,Shripad Gade,Graham Cormode,Will Bullock
Categories: Machine Learning (cs.LG)
Comments:
Abstract:State-of-the-art differentially private synthetic tabular data has been defined by adaptive ‘select-measure-generate’ frameworks, exemplified by methods like AIM. These approaches iteratively measure low-order noisy marginals and fit graphical models to produce synthetic data, enabling systematic optimisation of data quality under privacy constraints. Graphical models, however, are inefficient for high-dimensional data because they require substantial memory and must be retrained from scratch whenever the graph structure changes, leading to significant computational overhead. Recent methods, like GEM, overcome these limitations by using generator neural networks for improved scalability. However, empirical comparisons have mostly focused on small datasets, limiting real-world applicability. In this work, we introduce GEM+, which integrates AIM’s adaptive measurement framework with GEM’s scalable generator network. Our experiments show that GEM+ outperforms AIM in both utility and scalability, delivering state-of-the-art results while efficiently handling datasets with over a hundred columns, where AIM fails due to memory and computational overheads.
[LG-65] Generalization Can Emerge in Tabular Foundation Models From a Single Table
Link: https://arxiv.org/abs/2511.09665
Authors: Junwei Ma,Nour Shaheen,Alex Labach,Amine Mhedhbi,Frank Hutter,Anthony L. Caterini,Valentin Thomas
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Deep tabular modelling increasingly relies on in-context learning where, during inference, a model receives a set of (x,y) pairs as context and predicts labels for new inputs without weight updates. We challenge the prevailing view that broad generalization here requires pre-training on large synthetic corpora (e.g., TabPFN priors) or a large collection of real data (e.g., TabDPT training datasets), discovering that a relatively small amount of data suffices for generalization. We find that simple self-supervised pre-training on just a single real table can produce surprisingly strong transfer across heterogeneous benchmarks. By systematically pre-training and evaluating on many diverse datasets, we analyze what aspects of the data are most important for building a Tabular Foundation Model (TFM) generalizing across domains. We then connect this to the pre-training procedure shared by most TFMs and show that the number and quality of tasks one can construct from a dataset is key to downstream performance.
[LG-66] Parametric Expensive Multi-Objective Optimization via Generative Solution Modeling
Link: https://arxiv.org/abs/2511.09598
Authors: Tingyang Wei,Jiao Liu,Abhishek Gupta,Chin Chun Ooi,Puay Siew Tan,Yew-Soon Ong
Categories: Machine Learning (cs.LG)
Comments: Preprint
Abstract:Many real-world applications require solving families of expensive multi-objective optimization problems (EMOPs) under varying operational conditions. This gives rise to parametric expensive multi-objective optimization problems (P-EMOPs) where each task parameter defines a distinct optimization instance. Current multi-objective Bayesian optimization methods have been widely used for finding finite sets of Pareto optimal solutions for individual tasks. However, P-EMOPs present a fundamental challenge: the continuous task parameter space can contain infinitely many distinct problems, each requiring separate expensive evaluations. This demands learning an inverse model that can directly predict optimized solutions for any task-preference query without expensive re-evaluation. This paper introduces the first parametric multi-objective Bayesian optimizer that learns this inverse model by alternating between (1) acquisition-driven search leveraging inter-task synergies and (2) generative solution sampling via conditional generative models. This approach enables efficient optimization across related tasks and finally achieves direct solution prediction for unseen parameterized EMOPs without additional expensive evaluations. We theoretically justify the faster convergence by leveraging inter-task synergies through task-aware Gaussian processes. Meanwhile, empirical studies in synthetic and real-world benchmarks further verify the effectiveness of our alternating framework.
[LG-67] Making Every Head Count: Sparse Attention Without the Speed-Performance Trade-off
Link: https://arxiv.org/abs/2511.09596
Authors: Mingkuan Zhao,Wentao Hu,Jiayin Wang,Xin Lai,Tianchen Huang,Yuheng Min,Rui Yan,Xiaoyan Zhu
Categories: Machine Learning (cs.LG)
Comments:
Abstract:The design of Large Language Models (LLMs) has long been hampered by a fundamental conflict within their core attention mechanism: its remarkable expressivity is built upon a computational complexity of $O(H \cdot N^2)$ that grows quadratically with the context size ($N$) and linearly with the number of heads ($H$). This standard implementation harbors significant computational redundancy, as all heads independently compute attention over the same sequence space. Existing sparse methods, meanwhile, often trade information integrity for computational efficiency. To resolve this efficiency-performance trade-off, we propose SPAttention, whose core contribution is the introduction of a new paradigm we term Principled Structural Sparsity. SPAttention does not merely drop connections but instead reorganizes the computational task by partitioning the total attention workload into balanced, non-overlapping distance bands, assigning each head a unique segment. This approach transforms the multi-head attention mechanism from $H$ independent $O(N^2)$ computations into a single, collaborative $O(N^2)$ computation, fundamentally reducing complexity by a factor of $H$. The structured inductive bias compels functional specialization among heads, enabling a more efficient allocation of computational resources from redundant modeling to distinct dependencies across the entire sequence span. Extensive empirical validation on the OLMoE-1B-7B and 0.25B-1.75B model series demonstrates that while delivering an approximately two-fold increase in training throughput, its performance is on par with standard dense attention, even surpassing it on select key metrics, while consistently outperforming representative sparse attention methods including Longformer, Reformer, and BigBird across all evaluation metrics.
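The band-partitioning idea can be sketched with boolean masks. The sketch assumes causal attention and equal-width distance bands; the abstract describes the bands as balanced but does not give the balancing rule, so the equal-width split and the names `band_edges` and `head_mask` are assumptions.

```python
def band_edges(n, num_heads):
    """Split query-key distances 0..n-1 into contiguous, disjoint bands.

    Equal-width bands are an assumption of this sketch.
    """
    return [(h * n // num_heads, (h + 1) * n // num_heads)
            for h in range(num_heads)]


def head_mask(n, band):
    """mask[i][j] is True when query i may attend to key j for this head."""
    lo, hi = band
    return [[lo <= i - j < hi for j in range(n)] for i in range(n)]


n, num_heads = 8, 4
masks = [head_mask(n, band) for band in band_edges(n, num_heads)]

# Each causal (query, key) pair is covered by exactly one head, so the
# total work is one causal N x N pass rather than H of them.
pairs_computed = sum(cell for m in masks for row in m for cell in row)
```

With these settings the heads together compute n(n+1)/2 pairs, the same count a single dense causal head would.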
[LG-68] DynamicRTL: RTL Representation Learning for Dynamic Circuit Behavior AAAI’2026
Link: https://arxiv.org/abs/2511.09593
Authors: Ruiyang Ma,Yunhao Zhou,Yipeng Wang,Yi Liu,Zhengyuan Shi,Ziyang Zheng,Kexin Chen,Zhiqiang He,Lingwei Yan,Gang Chen,Qiang Xu,Guojie Luo
Categories: Machine Learning (cs.LG)
Comments: Accepted by AAAI’2026
Abstract:There is a growing body of work on using Graph Neural Networks (GNNs) to learn representations of circuits, focusing primarily on their static characteristics. However, these models fail to capture circuit runtime behavior, which is crucial for tasks like circuit verification and optimization. To address this limitation, we introduce DR-GNN (DynamicRTL-GNN), a novel approach that learns RTL circuit representations by incorporating both static structures and multi-cycle execution behaviors. DR-GNN leverages an operator-level Control Data Flow Graph (CDFG) to represent Register Transfer Level (RTL) circuits, enabling the model to capture dynamic dependencies and runtime execution. To train and evaluate DR-GNN, we build the first comprehensive dynamic circuit dataset, comprising over 6,300 Verilog designs and 63,000 simulation traces. Our results demonstrate that DR-GNN outperforms existing models in branch hit prediction and toggle rate prediction. Furthermore, its learned representations transfer effectively to related dynamic circuit tasks, achieving strong performance in power estimation and assertion prediction.
[LG-69] Group Averaging for Physics Applications: Accuracy Improvements at Zero Training Cost NEURIPS2025
Link: https://arxiv.org/abs/2511.09573
Authors: Valentino F. Foit,David W. Hogg,Soledad Villar
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 10 pages, 2 figures, 1 table, Machine Learning and the Physical Sciences Workshop, NeurIPS 2025
Abstract:Many machine learning tasks in the natural sciences are precisely equivariant to particular symmetries. Nonetheless, equivariant methods are often not employed, perhaps because training is perceived to be challenging, or the symmetry is expected to be learned, or equivariant implementations are seen as hard to build. Group averaging is an available technique for these situations. It happens at test time; it can make any trained model precisely equivariant at a (often small) cost proportional to the size of the group; it places no requirements on model structure or training. It is known that, under mild conditions, the group-averaged model will have a provably better prediction accuracy than the original model. Here we show that an inexpensive group averaging can improve accuracy in practice. We take well-established benchmark machine learning models of differential equations in which certain symmetries ought to be obeyed. At evaluation time, we average the models over a small group of symmetries. Our experiments show that this procedure always decreases the average evaluation loss, with improvements of up to 37% in terms of the VRMSE. The averaging produces visually better predictions for continuous dynamics. This short paper shows that, under certain common circumstances, there are no disadvantages to imposing exact symmetries; the ML4PS community should consider group averaging as a cheap and simple way to improve model accuracy.
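Test-time group averaging is easy to sketch: for an equivariant task, the averaged predictor is (1/|G|) Σ_g g⁻¹ f(g x). The example below assumes a toy model on a square grid and the four-element group of 90-degree rotations; `toy_model` and the helper names are hypothetical and stand in for the trained PDE models used in the paper.

```python
def rotate90(grid):
    """Rotate a square grid (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def rotate(grid, k):
    """Apply k quarter-turns; a negative k undoes a positive one."""
    for _ in range(k % 4):
        grid = rotate90(grid)
    return grid


def group_average(model, grid):
    """Average g^{-1} model(g x) over the four rotations g."""
    n = len(grid)
    total = [[0.0] * n for _ in range(n)]
    for k in range(4):
        out = rotate(model(rotate(grid, k)), -k)
        for i in range(n):
            for j in range(n):
                total[i][j] += out[i][j] / 4.0
    return total


def toy_model(grid):
    """Hypothetical non-equivariant model: shift each row one cell right."""
    return [[0.0] + list(row[:-1]) for row in grid]


x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
averaged_then_rotated = rotate(group_average(toy_model, x), 1)
rotated_then_averaged = group_average(toy_model, rotate(x, 1))
```

The wrapped model commutes with rotation even though `toy_model` does not, and the only cost is four forward passes instead of one, with no retraining.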
[LG-70] Filtering Jump Markov Systems with Partially Known Dynamics: A Model-Based Deep Learning Approach
Link: https://arxiv.org/abs/2511.09569
Authors: George Stamatelis,George C. Alexandropoulos
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: Submitted to an IEEE transactions journal, copyright may be transferred upon acceptance
Abstract:This paper presents the Jump Markov Filtering Network (JMFNet), a novel model-based deep learning framework for real-time state estimation in jump Markov systems with unknown noise statistics and mode transition dynamics. A hybrid architecture comprising two Recurrent Neural Networks (RNNs) is proposed: one for mode prediction and another for filtering that is based on a mode-augmented version of the recently presented KalmanNet architecture. The proposed RNNs are trained jointly using an alternating least squares strategy that enables mutual adaptation without supervision of the latent modes. Extensive numerical experiments on linear and nonlinear systems, including target tracking, pendulum angle tracking, Lorenz attractor dynamics, and a real-life dataset, demonstrate that the proposed JMFNet framework outperforms classical model-based filters (e.g., interacting multiple models and particle filters) as well as model-free deep learning baselines, particularly in non-stationary and high-noise regimes. JMFNet also achieves a small yet meaningful improvement over the KalmanNet framework, which becomes much more pronounced in complicated systems or long trajectories. Finally, the method’s performance is empirically validated to be consistent and reliable, exhibiting low sensitivity to initial conditions, hyperparameter selection, and incorrect model knowledge.
[LG-71] Let the Experts Speak: Improving Survival Prediction Calibration via Mixture-of-Experts Heads NEURIPS2025 ALT
Link: https://arxiv.org/abs/2511.09567
Authors: Todd Morrill,Aahlad Puli,Murad Megjhani,Soojin Park,Richard Zemel
Categories: Machine Learning (cs.LG)
Comments: Accepted as a proceedings paper at the 2025 Machine Learning for Health Symposium and as a workshop paper at the Learning from Time Series for Health workshop at NeurIPS 2025
Abstract:Deep mixture-of-experts models have attracted a lot of attention for survival analysis problems, particularly for their ability to cluster similar patients together. In practice, grouping often comes at the expense of key metrics such as calibration error and predictive accuracy. This is due to the restrictive inductive bias that mixture-of-experts imposes, that predictions for individual patients must look like predictions for the group they’re assigned to. Might we be able to discover patient group structure, where it exists, while improving calibration and predictive accuracy? In this work, we introduce several discrete-time deep mixture-of-experts (MoE) based architectures for survival analysis problems, one of which achieves all desiderata: clustering, calibration, and predictive accuracy. We show that a key differentiator between this array of MoEs is how expressive their experts are. We find that more expressive experts that tailor predictions per patient outperform experts that rely on fixed group prototypes.
[LG-72] Global Solutions to Non-Convex Functional Constrained Problems with Hidden Convexity
Link: https://arxiv.org/abs/2511.10626
Authors: Ilyas Fatkhullin,Niao He,Guanghui Lan,Florian Wolf
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments:
[LG-73] Edge Machine Learning for Cluster Counting in Next-Generation Drift Chambers NEURIPS2025
Link: https://arxiv.org/abs/2511.10540
Authors: Deniz Yilmaz,Liangyu Wu,Julia Gonski
Categories: Instrumentation and Detectors (physics.ins-det); Machine Learning (cs.LG)
Comments: 6 pages, 3 figures, 1 table. Machine Learning and the Physical Sciences Workshop, NeurIPS 2025
Abstract:Drift chambers have long been central to collider tracking, but future machines like a Higgs factory motivate higher granularity and cluster counting for particle ID, posing new data processing challenges. Machine learning (ML) at the “edge”, or in cell-level readout, can dramatically reduce the off-detector data rate for high-granularity drift chambers by performing cluster counting at-source. We present machine learning algorithms for cluster counting in real-time readout of future drift chambers. These algorithms outperform traditional derivative-based techniques based on achievable pion-kaon separation. When synthesized to FPGA resources, they can achieve latencies consistent with real-time operation in a future Higgs factory scenario, thus advancing both R&D for future collider detectors as well as hardware-based ML for edge applications in high energy physics.
[LG-74] Continuum Dropout for Neural Differential Equations
Link: https://arxiv.org/abs/2511.10446
Authors: Jonghun Lee, YongKyung Oh, Sungil Kim, Dong-Young Lim
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:Neural Differential Equations (NDEs) excel at modeling continuous-time dynamics, effectively handling challenges such as irregular observations, missing values, and noise. Despite their advantages, NDEs face a fundamental challenge in adopting dropout, a cornerstone of deep learning regularization, making them susceptible to overfitting. To address this research gap, we introduce Continuum Dropout, a universally applicable regularization technique for NDEs built upon the theory of alternating renewal processes. Continuum Dropout formulates the on-off mechanism of dropout as a stochastic process that alternates between active (evolution) and inactive (paused) states in continuous time. This provides a principled approach to prevent overfitting and enhance the generalization capabilities of NDEs. Moreover, Continuum Dropout offers a structured framework to quantify predictive uncertainty via Monte Carlo sampling at test time. Through extensive experiments, we demonstrate that Continuum Dropout outperforms existing regularization methods for NDEs, achieving superior performance on various time series and image classification tasks. It also yields better-calibrated and more trustworthy probability estimates, highlighting its effectiveness for uncertainty-aware modeling.
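The on-off mechanism described in this abstract can be illustrated with a small simulation. The sketch below is not the authors' implementation: it assumes exponentially distributed holding times (one simple choice of alternating renewal process), uses hypothetical rates, and replaces the neural vector field with the toy dynamics dh/dt = -h integrated by forward Euler.

```python
import math
import random


def sample_on_off_path(t_end, rate_on_to_off, rate_off_to_on, rng):
    """Sample (start, end, active) segments covering [0, t_end].

    Holding times are exponential (an assumption made for this sketch).
    """
    segments = []
    t, active = 0.0, True
    while t < t_end:
        rate = rate_on_to_off if active else rate_off_to_on
        end = min(t + rng.expovariate(rate), t_end)
        segments.append((t, end, active))
        t, active = end, not active
    return segments


def active_fraction(segments):
    """Fraction of the time horizon spent in the active state."""
    total = segments[-1][1] - segments[0][0]
    on = sum(end - start for start, end, active in segments if active)
    return on / total


def evolve(h0, segments, f, dt=0.01):
    """Euler-integrate dh/dt = f(h) on active segments; hold h when paused."""
    h = h0
    for start, end, active in segments:
        if not active:
            continue  # inactive state: the hidden state is frozen
        n_steps = max(1, math.ceil((end - start) / dt))
        step = (end - start) / n_steps
        for _ in range(n_steps):
            h += step * f(h)
    return h


rng = random.Random(0)
segments = sample_on_off_path(10.0, rate_on_to_off=1.0, rate_off_to_on=4.0, rng=rng)
h_final = evolve(1.0, segments, f=lambda h: -h)
print(active_fraction(segments), h_final)
```

Repeating the sampling with different seeds gives a spread of `h_final` values, which is the basic idea behind Monte Carlo uncertainty estimates at test time.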
[LG-75] Operator Models for Continuous-Time Offline Reinforcement Learning
Link: https://arxiv.org/abs/2511.10383
Authors: Nicolas Hoischen, Petar Bevanda, Max Beier, Stefan Sosnowski, Boris Houska, Sandra Hirche
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
Comments:
Abstract:Continuous-time stochastic processes underlie many natural and engineered systems. In healthcare, autonomous driving, and industrial control, direct interaction with the environment is often unsafe or impractical, motivating offline reinforcement learning from historical data. However, there is limited statistical understanding of the approximation errors inherent in learning policies from offline datasets. We address this by linking reinforcement learning to the Hamilton-Jacobi-Bellman equation and proposing an operator-theoretic algorithm based on a simple dynamic programming recursion. Specifically, we represent our world model in terms of the infinitesimal generator of controlled diffusion processes learned in a reproducing kernel Hilbert space. By integrating statistical learning methods and operator theory, we establish global convergence of the value function and derive finite-sample guarantees with bounds tied to system properties such as smoothness and stability. Our theoretical and numerical results indicate that operator-based approaches may hold promise in solving offline reinforcement learning using continuous-time optimal control.
[LG-76] Global Convergence of Four-Layer Matrix Factorization under Random Initialization
Link: https://arxiv.org/abs/2511.09925
Authors: Minrui Luo, Weihang Xu, Xiang Gao, Maryam Fazel, Simon Shaolei Du
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Gradient descent dynamics on the deep matrix factorization problem are extensively studied as a simplified theoretical model for deep neural networks. Although the convergence theory for two-layer matrix factorization is well-established, no global convergence guarantee for general deep matrix factorization under random initialization has been established to date. To address this gap, we provide a polynomial-time global convergence guarantee for randomly initialized gradient descent on four-layer matrix factorization, given certain conditions on the target matrix and a standard balanced regularization term. Our analysis employs new techniques to show saddle-avoidance properties of gradient descent dynamics, and extends previous theories to characterize the change in eigenvalues of layer weights.
[LG-77] Beyond empirical models: Discovering new constitutive laws in solids with graph-based equation discovery
Link: https://arxiv.org/abs/2511.09906
Authors: Hao Xu, Yuntian Chen, Dongxiao Zhang
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Applied Physics (physics.app-ph); Data Analysis, Statistics and Probability (physics.data-an)
Comments:
Abstract:Constitutive models are fundamental to solid mechanics and materials science, underpinning the quantitative description and prediction of material responses under diverse loading conditions. Traditional phenomenological models, which are derived through empirical fitting, often lack generalizability and rely heavily on expert intuition and predefined functional forms. In this work, we propose a graph-based equation discovery framework for the automated discovery of constitutive laws directly from multisource experimental data. This framework expresses equations as directed graphs, where nodes represent operators and variables, edges denote computational relations, and edge features encode parametric dependencies. This enables the generation and optimization of free-form symbolic expressions with undetermined material-specific parameters. Through the proposed framework, we have discovered new constitutive models for strain-rate effects in alloy steel materials and the deformation behavior of lithium metal. Compared with conventional empirical models, these new models exhibit compact analytical structures and achieve higher accuracy. The proposed graph-based equation discovery framework provides a generalizable and interpretable approach for data-driven scientific modelling, particularly in contexts where traditional empirical formulations are inadequate for representing complex physical phenomena.
[LG-78] Theory and computation for structured variational inference
Link: https://arxiv.org/abs/2511.09897
Authors: Shunan Sheng, Bohan Wu, Bennett Zhu, Sinho Chewi, Aram-Alexandre Pooladian
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 78 pages, 2 figures
Abstract:Structured variational inference constitutes a core methodology in modern statistical applications. Unlike mean-field variational inference, the approximate posterior is assumed to have interdependent structure. We consider the natural setting of star-structured variational inference, where a root variable impacts all the other ones. We prove the first results for existence, uniqueness, and self-consistency of the variational approximation. In turn, we derive quantitative approximation error bounds for the variational approximation to the posterior, extending prior work from the mean-field setting to the star-structured setting. We also develop a gradient-based algorithm with provable guarantees for computing the variational approximation using ideas from optimal transport theory. We explore the implications of our results for Gaussian measures and hierarchical Bayesian models, including generalized linear models with location family priors and spike-and-slab priors with one-dimensional debiasing. As a by-product of our analysis, we develop new stability results for star-separable transport maps which might be of independent interest.
[LG-79] Generalized infinite dimensional Alpha-Procrustes based geometries
Link: https://arxiv.org/abs/2511.09801
Authors: Salvish Goomanee, Andi Han, Pratik Jawanpuria, Bamdev Mishra
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Functional Analysis (math.FA); Optimization and Control (math.OC)
Comments:
Abstract:This work extends the recently introduced Alpha-Procrustes family of Riemannian metrics for symmetric positive definite (SPD) matrices by incorporating generalized versions of the Bures-Wasserstein (GBW), Log-Euclidean, and Wasserstein distances. While the Alpha-Procrustes framework has unified many classical metrics in both finite- and infinite-dimensional settings, it previously lacked the structural components necessary to realize these generalized forms. We introduce a formalism based on unitized Hilbert-Schmidt operators and an extended Mahalanobis norm that allows the construction of robust, infinite-dimensional generalizations of GBW and Log-Hilbert-Schmidt distances. Our approach also incorporates a learnable regularization parameter that enhances geometric stability in high-dimensional comparisons. Preliminary experiments reproducing benchmarks from the literature demonstrate the improved performance of our generalized metrics, particularly in scenarios involving comparisons between datasets of varying dimension and scale. This work lays a theoretical and computational foundation for advancing robust geometric methods in machine learning, statistical inference, and functional data analysis.
[LG-80] Symmetry aware Reynolds Averaged Navier Stokes turbulence models with equivariant neural networks
Link: https://arxiv.org/abs/2511.09769
Authors: Aaron Miller, Sahil Kommalapati, Robert Moser, Petros Koumoutsakos
Categories: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
Comments:
Abstract:Accurate and generalizable Reynolds-averaged Navier-Stokes (RANS) models for turbulent flows rely on effective closures. We introduce tensor-based, symmetry aware closures using equivariant neural networks (ENNs) and present an algorithm for enforcing algebraic contraction relations among tensor components. The modeling approach builds on the structure tensor framework introduced by Kassinos and Reynolds to learn closures in the rapid distortion theory setting. Experiments show that ENNs can effectively learn relationships involving high-order tensors, meeting or exceeding the performance of existing models in tasks such as predicting the rapid pressure-strain correlation. Our results show that ENNs provide a physically consistent alternative to classical tensor basis models, enabling end-to-end learning of unclosed terms in RANS and fast exploration of model dependencies.
[LG-81] Modelos Empiricos de Pos-Dupla Selecao por LASSO: Discussoes para Estudos do Transporte Aereo
Link: https://arxiv.org/abs/2511.09767
Authors: Alessandro V. M. Oliveira
Categories: Methodology (stat.ME); Machine Learning (cs.LG); General Economics (econ.GN); Systems and Control (eess.SY); Applications (stat.AP)
Comments: Article in Portuguese
Abstract:This paper presents and discusses forms of estimation by regularized regression and model selection using the LASSO method - Least Absolute Shrinkage and Selection Operator. LASSO is recognized as one of the main supervised learning methods applied to high-dimensional econometrics, allowing work with large volumes of data and multiple correlated controls. Conceptual issues related to the consequences of high dimensionality in modern econometrics and the principle of sparsity, which underpins regularization procedures, are addressed. The study examines the main post-double selection and post-regularization models, including variations applied to instrumental variable models. A brief description of the lassopack routine package, its syntaxes, and examples of HD, HDS (High-Dimension Sparse), and IV-HDS models, with combinations involving fixed effects estimators, is also presented. Finally, the potential application of the approach in research focused on air transport is discussed, with emphasis on an empirical study on the operational efficiency of airlines and aircraft fuel consumption.
[LG-82] A Fourier-Based Global Denoising Model for Smart Artifacts Removing of Microscopy Images
Link: https://arxiv.org/abs/2511.09734
Authors: Huanhuan Zhao, Connor Vernachio, Laxmi Bhurtel, Wooin Yang, Ruben Millan-Solsona, Spenser R. Brown, Marti Checa, Komal Sharma Agrawal, Adam M. Guss, Liam Collins, Wonhee Ko, Arpan Biswas
Categories: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
Comments: 21 pages, 9 figures
Abstract:Microscopy techniques such as Scanning Tunneling Microscopy (STM), Atomic Force Microscopy (AFM) and Scanning Electron Microscopy (SEM) are essential tools in material imaging at micro- and nanoscale resolutions to extract physical knowledge and materials structure-property relationships. However, tuning microscopy controls (e.g., scanning speed, current setpoint, tip bias) to obtain high-quality images is a non-trivial and time-consuming effort. On the other hand, with sub-standard images, the key features are not accurately discovered due to noise and artifacts, leading to erroneous analysis. Existing denoising models mostly treat weak signals as noise and enhance strong signals as key features, which is not always valid for microscopy images and can therefore erase a significant amount of hidden physical information. To address these limitations, we propose a global denoising model (GDM) to smartly remove artifacts of microscopy images while preserving weaker but physically important features. The proposed model is developed by 1) first designing a two-channel imaging input of non-paired, goal-specific pre-processed images with user-defined trade-off information between the two channels and 2) then integrating a loss function with pixel-based and fast Fourier transform (FFT)-based terms for training the U-net model. We compared the proposed GDM with the non-FFT denoising model over STM-generated images of copper (Cu) and silicon (Si) materials, AFM-generated Pantoea sp. YR343 biofilm images and SEM-generated plastic degradation images. We believe this proposed workflow can be extended to improve other microscopy image quality and will benefit experimentalists with the proposed design flexibility to smartly tune the model according to domain-expert preferences.
[LG-83] The Data Fusion Labeler (dFL): Challenges and Solutions to Data Harmonization, Labeling, and Provenance in Fusion Energy
Link: https://arxiv.org/abs/2511.09725
Authors: Craig Michoski, Matthew Waller, Brian Sammuli, Zeyu Li, Tapan Ganatma Nakkina, Raffi Nazikian, Sterling Smith, David Orozco, Dongyang Kuang, Martin Foltin, Erik Olofsson, Mike Fredrickson, Jerry Louis-Jeune, David R. Hatch, Todd A. Oliver, Mitchell Clark, Steph-Yves Louis
Categories: Plasma Physics (physics.plasm-ph); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
Comments:
Abstract:Fusion energy research increasingly depends on the ability to integrate heterogeneous, multimodal datasets from high-resolution diagnostics, control systems, and multiscale simulations. The sheer volume and complexity of these datasets demand the development of new tools capable of systematically harmonizing and extracting knowledge across diverse modalities. The Data Fusion Labeler (dFL) is introduced as a unified workflow instrument that performs uncertainty-aware data harmonization, schema-compliant data fusion, and provenance-rich manual and automated labeling at scale. By embedding alignment, normalization, and labeling within a reproducible, operator-order-aware framework, dFL reduces time-to-analysis by greater than 50X (e.g., enabling 200 shots/hour to be consistently labeled rather than a handful per day), enhances label (and subsequently training) quality, and enables cross-device comparability. Case studies from DIII-D demonstrate its application to automated ELM detection and confinement regime classification, illustrating its potential as a core component of data-driven discovery, model validation, and real-time control in future burning plasma devices.
[LG-84] Masked Mineral Modeling: Continent-Scale Mineral Prospecting via Geospatial Infilling AAAI2026
Link: https://arxiv.org/abs/2511.09722
Authors: Sujay Nair, Evan Coleman, Sherrie Wang, Elsa Olivetti
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
Comments: 7 pages, 6 figures, includes 23 pages of Supplementary Materials for paper accepted to AAAI2026
Abstract:Minerals play a critical role in the advanced energy technologies necessary for decarbonization, but characterizing mineral deposits hidden underground remains costly and challenging. Inspired by recent progress in generative modeling, we develop a learning method which infers the locations of minerals by masking and infilling geospatial maps of resource availability. We demonstrate this technique using mineral data for the conterminous United States, and train performant models, with the best achieving Dice coefficients of 0.31 ± 0.01 and recalls of 0.22 ± 0.02 on test data at 1 × 1 mi² spatial resolution. One major advantage of our approach is that it can easily incorporate auxiliary data sources for prediction which may be more abundant than mineral data. We highlight the capabilities of our model by adding input layers derived from geophysical sources, along with a nation-wide ground survey of soils originally intended for agronomic purposes. We find that employing such auxiliary features can improve inference performance, while also enabling model evaluation in regions with no recorded minerals.
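The two metrics quoted in this abstract have standard definitions, sketched below for flattened binary masks. The function name `dice_and_recall` and the toy masks are hypothetical, and returning 1.0 when a denominator is zero is a convention assumed for this sketch; the paper's exact evaluation protocol is not stated in the abstract.

```python
def dice_and_recall(pred, truth):
    """Compute the Dice coefficient and recall for two binary masks.

    `pred` and `truth` are equal-length sequences of 0/1 values,
    e.g. a flattened grid of predicted and recorded mineral cells.
    """
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    # Dice = 2*TP / (2*TP + FP + FN); recall = TP / (TP + FN).
    dice = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return dice, recall


# Toy example: 2 true positives, 1 false positive, 1 false negative.
pred = [1, 0, 1, 1, 0, 0]
truth = [1, 1, 0, 1, 0, 0]
dice, recall = dice_and_recall(pred, truth)
print(dice, recall)
```

Both metrics ignore true negatives, which matters for sparse targets such as mineral occurrences where most grid cells are empty.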
[LG-85] Lithological Controls on the Permeability of Geologic Faults: Surrogate Modeling and Sensitivity Analysis
Link: https://arxiv.org/abs/2511.09674
Authors: Hannah Lu, Lluís Salo-Salgado, Ruben Juanes
Categories: Geophysics (physics.geo-ph); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
Comments:
Abstract:Fault zones exhibit complex and heterogeneous permeability structures influenced by stratigraphic, compositional, and structural factors, making them critical yet uncertain components in subsurface flow modeling. In this study, we investigate how lithological controls influence fault permeability using the PREDICT framework: a probabilistic workflow that couples stochastic fault geometry generation, physically constrained material placement, and flow-based upscaling. The flow-based upscaling step, however, is a very computationally expensive component of the workflow and presents a major bottleneck that makes global sensitivity analysis (GSA) intractable, as it requires millions of model evaluations. To overcome this challenge, we develop a neural network surrogate to emulate the flow-based upscaling step. This surrogate model dramatically reduces the computational cost while maintaining high accuracy, thereby making GSA feasible. The surrogate-model-enabled GSA reveals new insights into the effects of lithological controls on fault permeability. In addition to identifying dominant parameters and negligible ones, the analysis uncovers significant nonlinear interactions between parameters that cannot be captured by traditional local sensitivity methods.
[LG-86] Analysis of the TAIGA-HiSCORE Data Using the Latent Space of Autoencoders
Link: https://arxiv.org/abs/2511.09655
Authors: Yu.Yu. Dubenskaya, S.P. Polyakov, A.P. Kryukov, A.P. Demichev, E.O. Gres, E.B. Postnikov, A.Yu. Razumov, P.A. Volchugov, D.P. Zhurov
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); High Energy Astrophysical Phenomena (astro-ph.HE); Machine Learning (cs.LG)
Comments: 16 pages, 7 figures, Proceedings of The 9th International Conference on Deep Learning in Computational Physics, July 2-4, 2025, Moscow, Russia
Abstract:The aim of extensive air shower (EAS) analysis is to reconstruct the physical parameters of the primary particle that initiated the shower. The TAIGA experiment is a hybrid detector system that combines several imaging atmospheric Cherenkov telescopes (IACTs) and an array of non-imaging Cherenkov detectors (TAIGA-HiSCORE) for EAS detection. Because the signals recorded by different detector types differ in physical nature, the direct merging of data is unfeasible, which complicates multimodal analysis. Currently, to analyze data from the IACTs and TAIGA-HiSCORE, a set of auxiliary parameters specific to each detector type is calculated from the recorded signals. These parameters are chosen empirically, so there is no certainty that they retain all important information and are the best suited for the respective problems. We propose to use autoencoders (AE) for the analysis of TAIGA experimental data and replace the conventionally used auxiliary parameters with the parameters of the AE latent space. The advantage of the AE latent space parameters is that they preserve essential physics from experimental data without prior assumptions. This approach also holds potential for enabling seamless integration of heterogeneous IACT and HiSCORE data through a joint latent space. To reconstruct the parameters of the primary particle of the EAS from the latent space of the AE, a separate artificial neural network is used. In this paper, the proposed approach is used to reconstruct the energy of the EAS primary particles based on Monte Carlo simulation data for TAIGA-HiSCORE. The dependence of the energy determination accuracy on the dimensionality of the latent space is analyzed, and these results are also compared with the results obtained by the conventional technique. It is shown that when using the AE latent space, the energy of the primary particle is reconstructed with satisfactory accuracy.
[LG-87] Siegel Neural Networks
Link: https://arxiv.org/abs/2511.09577
Authors: Xuan Son Nguyen, Aymeric Histace, Nistor Grozavu
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:Riemannian symmetric spaces (RSS) such as hyperbolic spaces and symmetric positive definite (SPD) manifolds have become popular spaces for representation learning. In this paper, we propose a novel approach for building discriminative neural networks on Siegel spaces, a family of RSS that is largely unexplored in machine learning tasks. For classification applications, one focus of recent works is the construction of multiclass logistic regression (MLR) and fully-connected (FC) layers for hyperbolic and SPD neural networks. Here we show how to build such layers for Siegel neural networks. Our approach relies on the quotient structure of those spaces and the notion of vector-valued distance on RSS. We demonstrate the relevance of our approach on two applications, i.e., radar clutter classification and node classification. Our results demonstrate state-of-the-art performance across all datasets.
Information Retrieval
[IR-0] GPR: Towards a Generative Pre-trained One-Model Paradigm for Large-Scale Advertising Recommendation
Link: https://arxiv.org/abs/2511.10138
Authors: Jun Zhang, Yi Li, Yue Liu, Changping Wang, Yuan Wang, Yuling Xiong, Xun Liu, Haiyang Wu, Qian Li, Enming Zhang, Jiawei Sun, Xin Xu, Zishuai Zhang, Ruoran Liu, Suyuan Huang, Zhaoxin Zhang, Zhengkai Guo, Shuojin Yang, Meng-Hao Guo, Huan Yu, Jie Jiang, Shi-Min Hu
Categories: Information Retrieval (cs.IR)
Comments: 12 pages, 5 figures
Abstract:As an intelligent infrastructure connecting users with commercial content, advertising recommendation systems play a central role in information flow and value creation within the digital economy. However, existing multi-stage advertising recommendation systems suffer from objective misalignment and error propagation, making it difficult to achieve global optimality, while unified generative recommendation models still struggle to meet the demands of practical industrial applications. To address these issues, we propose GPR (Generative Pre-trained Recommender), the first one-model framework that redefines advertising recommendation as an end-to-end generative task, replacing the traditional cascading paradigm with a unified generative approach. To realize GPR, we introduce three key innovations spanning unified representation, network architecture, and training strategy. First, we design a unified input schema and tokenization method tailored to advertising scenarios, mapping both ads and organic content into a shared multi-level semantic ID space, thereby enhancing semantic alignment and modeling consistency across heterogeneous data. Second, we develop the Heterogeneous Hierarchical Decoder (HHD), a dual-decoder architecture that decouples user intent modeling from ad generation, achieving a balance between training efficiency and inference flexibility while maintaining strong modeling capacity. Finally, we propose a multi-stage joint training strategy that integrates Multi-Token Prediction (MTP), Value-Aware Fine-Tuning and the Hierarchy Enhanced Policy Optimization (HEPO) algorithm, forming a complete generative recommendation pipeline that unifies interest modeling, value alignment, and policy optimization. GPR has been fully deployed in the Tencent Weixin Channels advertising system, delivering significant improvements in key business metrics including GMV and CTCVR.

